
moonshotai/Kimi-K2-Thinking

Kimi K2 Thinking is the first open-weights model to achieve SOTA performance against leading closed-source models (GPT-5, Claude Sonnet 4.5) across major benchmarks, including HLE (44.9%), BrowseComp (60.2%), and SWE-Bench Verified (71.3%). Built on a 1T-parameter MoE architecture with 32B active parameters per token and native INT4 quantization via QAT, it maintains stable tool use across 200–300 sequential calls within a 256K context window.

Moonshot AI · Chat · 256K tokens
Deposit $5 to get started: unlock API access and start running inference right away. See how many million tokens $5 gets you.

api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer $QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "moonshotai/Kimi-K2-Thinking",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 1,
  "max_tokens": 16384,
  "stream": true,
  "top_p": 0.95
}'

Technical Specifications

Model Architecture & Performance

Variant Thinking
Model Size 1T params (32B active)
Context Length 256K Tokens
Quantization INT4 (QAT)
Tokens/sec 50
Architecture Sparse MoE Transformer: 1T total / 32B active, 61 layers (1 dense), 384 experts (8 selected per token), MLA attention, SwiGLU
Precision INT4 (QAT)
License Modified MIT License
Release Date November 2025
Developers Moonshot AI
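The "384 experts, 8 selected per token" figure in the table above describes a standard top-k MoE routing step: a router scores every expert, keeps the best 8, and renormalizes their weights. A simplified illustration in plain Python (our sketch, not Moonshot's implementation):

```python
import math
import random

NUM_EXPERTS, TOP_K = 384, 8  # from the spec table above

def route(logits, k=TOP_K):
    """Pick the top-k experts by softmax score and renormalize their weights."""
    m = max(logits)  # subtract max for numerical stability
    probs = [math.exp(x - m) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    w = sum(probs[i] for i in top)
    return [(i, probs[i] / w) for i in top]  # (expert_id, gate weight)

random.seed(0)
selected = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
print(len(selected))  # 8 experts chosen; their gate weights sum to 1
```

Only the selected experts' feed-forward blocks run for that token, which is why a 1T-parameter model activates just 32B parameters per token.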

Pricing

Pay-per-use, no commitments

Input Tokens $0.60/1M Tokens
Output Tokens $2.50/1M Tokens
Cached Input Tokens $0.30/1M Tokens
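As a rough illustration of the pay-per-use rates above (the function and calculation are ours, not part of the Qubrid API):

```python
# Per-million-token rates from the pricing table above.
INPUT_RATE = 0.60    # $ per 1M input tokens
OUTPUT_RATE = 2.50   # $ per 1M output tokens
CACHED_RATE = 0.30   # $ per 1M cached input tokens

def estimate_cost(input_tokens, output_tokens, cached_tokens=0):
    """Return the estimated dollar cost for one request."""
    return (input_tokens * INPUT_RATE
            + output_tokens * OUTPUT_RATE
            + cached_tokens * CACHED_RATE) / 1_000_000

# A $5 deposit buys roughly 8.33M input tokens or 2M output tokens.
print(round(5 / INPUT_RATE, 2), "M input tokens per $5")    # 8.33
print(round(5 / OUTPUT_RATE, 2), "M output tokens per $5")  # 2.0

# Example: 100K input / 20K output tokens.
print(f"${estimate_cost(100_000, 20_000):.4f}")  # $0.1100
```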

API Reference

Complete parameter documentation

Parameter Type Default Description
stream boolean true Enable streaming responses for real-time output.
temperature number 1 Recommended temperature is 1.0 for Kimi-K2-Thinking.
max_tokens number 16384 Maximum number of tokens to generate.
top_p number 0.95 Controls nucleus sampling.

Explore the full request and response schema in our external API documentation
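With `stream: true`, responses typically arrive as server-sent events. Assuming the endpoint follows the common OpenAI-style format (`data: {...}` chunk lines terminated by `data: [DONE]` — verify against the external API documentation linked above), the deltas can be reassembled like this:

```python
import json

def collect_stream(lines):
    """Concatenate delta content from OpenAI-style SSE chunk lines."""
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        text.append(delta.get("content", ""))
    return "".join(text)

sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print(collect_stream(sample))  # Hello
```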

Performance

Strengths & considerations

Strengths
First open-weights model to beat closed frontier models on HLE, BrowseComp, and SWE-bench Verified
1T MoE with only 32B active parameters per token
Native INT4 via QAT (roughly 2x the speed of FP8)
Interleaved chain-of-thought with dynamic tool calling
Stable across 200–300 sequential tool calls
256K context window

Considerations
Requires 512GB+ RAM for full deployment
~600GB model size (large infrastructure needed)
Thinking mode means higher latency than non-reasoning models
Temperature should be set to 1.0 for recommended performance
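The ~600GB footprint and 512GB+ RAM figures follow roughly from the weight count: 1T parameters at 4 bits each is about 500GB before embeddings, quantization scales, and runtime overhead. A back-of-envelope check (our arithmetic):

```python
params = 1.0e12        # 1T total parameters
bits_per_param = 4     # INT4 (QAT)

weight_bytes = params * bits_per_param / 8
print(f"{weight_bytes / 1e9:.0f} GB raw INT4 weights")  # 500 GB

# Per-token compute touches only the active experts:
active = 32e9  # 32B active parameters
active_bytes = active * bits_per_param / 8
print(f"{active_bytes / 1e9:.0f} GB of active weights per token")  # 16 GB
```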

Use cases

Recommended applications for this model

Complex agentic research workflows
Long-horizon coding and debugging
Advanced mathematical reasoning
Multi-step tool orchestration
Autonomous writing and analysis
Scientific reasoning tasks
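Multi-step tool orchestration generally follows one loop: send the messages, execute whatever tool call the model returns, append the result, and repeat until the model answers directly. A minimal sketch with a stubbed model (the real call would go through the chat completions endpoint above; the tool name and stub are hypothetical):

```python
import json

def get_weather(city):  # hypothetical tool
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def stub_model(messages):
    """Stand-in for the API call: requests a tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "Beijing"}}}
    result = json.loads(messages[-1]["content"])
    return {"content": f"It is {result['temp_c']}°C in {result['city']}."}

def run_agent(user_prompt, model=stub_model, max_steps=10):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # final answer, loop ends
        out = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(out)})
    raise RuntimeError("step limit reached")

print(run_agent("What's the weather in Beijing?"))
```

In a real agentic workflow the same loop simply runs longer — the model's stability over 200–300 sequential tool calls is what makes extended versions of this pattern practical.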

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade Kubernetes manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to Partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.

"Qubrid scaled our personalized outreach from hundreds to tens of thousands of prospects. AI-driven research and content generation doubled our campaign velocity without sacrificing quality."

Demand Generation Team

Marketing & Sales Operations