NVIDIA Nemotron 3 Nano 30B-A3B

Nemotron 3 Nano 30B-A3B is NVIDIA’s flagship open reasoning model, built on a hybrid Mamba-2 + Transformer Mixture-of-Experts (MoE) architecture. Of its 31.6B total parameters, only 3.2B are active per forward pass, delivering significantly higher throughput while maintaining state-of-the-art reasoning accuracy.

Context length: 262,144 tokens
Free trial credit: $1.00 (no credit card required)

api_example.sh

curl -X POST "https://platform.qubrid.com/chat/completions" \
  -H "Authorization: Bearer $QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 500
}'
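The same request can be issued from Python using only the standard library; a minimal sketch assuming the endpoint, model ID, and an OpenAI-compatible response shape as shown in the curl example above, with `QUBRID_API_KEY` read from the environment:

```python
import json
import os
import urllib.request

API_URL = "https://platform.qubrid.com/chat/completions"
MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

def build_payload(prompt: str, temperature: float = 0.7, max_tokens: int = 500) -> dict:
    """Assemble the chat-completions request body used in the curl example."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['QUBRID_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    # An OpenAI-compatible response schema is assumed here.
    return body["choices"][0]["message"]["content"]
```

In production code you would add retry logic and error handling around the network call; the sketch only shows the request shape.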

Technical Specifications

Model Architecture & Performance

Model Size: 31.6B total / 3.2B active
Context Length: 262,144 tokens
Quantization: FP8
Throughput: 220 tokens/second
License: NVIDIA Open Model License
Release Date: December 15, 2025
Developers: NVIDIA

Pricing

Pay-per-use, no commitments

Input Tokens: $0.00004 / 1K tokens
Output Tokens: $0.00022 / 1K tokens
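At these rates the cost of a request is linear in its token counts; a small helper (rates copied from the table above, token counts illustrative) shows the arithmetic:

```python
INPUT_RATE = 0.00004   # USD per 1K input tokens
OUTPUT_RATE = 0.00022  # USD per 1K output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request at pay-per-use rates."""
    return (input_tokens / 1000) * INPUT_RATE + (output_tokens / 1000) * OUTPUT_RATE

# Example: a 2,000-token prompt with a 500-token completion
# 2 * $0.00004 + 0.5 * $0.00022 = $0.00019
cost = request_cost(2000, 500)
```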

API Reference

Complete parameter documentation

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| stream | boolean | true | Enable streaming responses for real-time output. |
| temperature | number | 0.3 | Controls randomness. Higher values produce more creative but less predictable output. |
| max_tokens | number | 8192 | Maximum number of tokens the model can generate. |
| top_p | number | 1 | Nucleus sampling threshold for token selection. |
| enable_thinking | boolean | true | Enable chain-of-thought reasoning traces. |
| thinking_budget | number | 16384 | Maximum tokens allocated for reasoning traces. |
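The defaults above translate directly into request-body fields; a sketch of building a body that overrides a few of them (field names and defaults from the table, helper name and values illustrative):

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "stream": True,
    "temperature": 0.3,
    "max_tokens": 8192,
    "top_p": 1,
    "enable_thinking": True,
    "thinking_budget": 16384,
}

def make_request_body(prompt: str, **overrides) -> dict:
    """Merge caller overrides onto the documented defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body = {**DEFAULTS, **overrides}
    body["messages"] = [{"role": "user", "content": prompt}]
    return body

# Disable streaming and cap the reasoning trace for a short-form answer.
body = make_request_body("Summarize Mamba-2 in one paragraph",
                         stream=False, thinking_budget=4096)
```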

Explore the full request and response schema in our external API documentation.

Performance

Strengths & considerations

Strengths
- Hybrid Mamba-2 + Transformer MoE architecture
- Only 3.2B active parameters per inference
- Up to 3.3× higher throughput than comparable 30B models
- Supports extremely long context (up to 1M tokens)
- Configurable reasoning depth with thinking budget
- Native tool calling and function execution
- FP8 optimized for memory efficiency and speed
- Strong performance on SWE-Bench, GPQA Diamond, and AIME benchmarks

Considerations
- Requires 32GB+ VRAM for FP8 inference
- BF16 requires 60GB+ VRAM
- Hybrid architecture has less community tooling than pure transformers
- FlashInfer backend requires CUDA toolkit support

Enterprise
Platform Integration

Docker Support

Official Docker images for containerized deployments

Kubernetes Ready

Production-grade Kubernetes manifests and Helm charts

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.

"Qubrid helped us turn a collection of AI scripts into structured production workflows. We now have better reliability, visibility, and control over every run."

AI Infrastructure Team

Automation & Orchestration