openai/gpt-oss-20b

Welcome to the gpt-oss series, OpenAI's open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. gpt-oss-20b is a 20.9B-parameter Mixture-of-Experts (MoE) model with 3.6B active parameters during inference. It is optimized for lower latency and for local or specialized use cases, and it supports configurable reasoning depth for agentic applications.

OpenAI Chat 131.1k Tokens

api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "openai/gpt-oss-20b",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 4096,
  "stream": true,
  "top_p": 1
}'

from openai import OpenAI

# Initialize the OpenAI client with Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",
)

stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=1,
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print("\n")
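
When the full reply is needed as a single string rather than printed piece by piece, the same loop can collect the deltas instead. The sketch below is a minimal illustration, not part of the official samples: `collect_stream` and `fake_chunk` are hypothetical helper names, and the stand-in chunks only mimic the `choices[0].delta.content` shape used in the Python sample above so the code runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta, so skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (offline demo only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Quantum "), fake_chunk(None), fake_chunk("bits")]
print(collect_stream(demo))
```

With a real client, passing the `stream` object from the sample above to `collect_stream` should behave the same way, assuming the endpoint returns chunks in the usual OpenAI-compatible shape.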

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://platform.qubrid.com/v1',
  apiKey: 'QUBRID_API_KEY',
});

const stream = await client.chat.completions.create({
  model: 'openai/gpt-oss-20b',
  messages: [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  max_tokens: 4096,
  temperature: 0.7,
  top_p: 1,
  stream: true
});

for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
console.log('\n');

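package main

import (
  "bufio"
  "bytes"
  "encoding/json"
  "fmt"
  "log"
  "net/http"
)

func main() {
  url := "https://platform.qubrid.com/v1/chat/completions"

  // Request body, mirroring the curl example
  data := map[string]interface{}{
    "model": "openai/gpt-oss-20b",
    "messages": []map[string]string{
      {
        "role":    "user",
        "content": "Explain quantum computing in simple terms",
      },
    },
    "temperature": 0.7,
    "max_tokens":  4096,
    "stream":      true,
    "top_p":       1,
  }
  jsonData, err := json.Marshal(data)
  if err != nil {
    log.Fatal(err)
  }

  req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
  if err != nil {
    log.Fatal(err)
  }
  req.Header.Set("Authorization", "Bearer QUBRID_API_KEY")
  req.Header.Set("Content-Type", "application/json")

  client := &http.Client{}
  res, err := client.Do(req)
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()

  // Print the raw response body line by line as it arrives
  scanner := bufio.NewScanner(res.Body)
  for scanner.Scan() {
    fmt.Println(scanner.Text())
  }
  if err := scanner.Err(); err != nil {
    log.Fatal(err)
  }
}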

Technical Specifications

Model Architecture & Performance

Model Size 20.9B Params

Context Length 131.1k Tokens

Quantization fp16

Tokens/Second 386

Architecture Compact Mixture-of-Experts (MoE) with SwiGLU activations, Token-choice MoE, Alternating attention mechanism

Precision FP4 quantization (MXFP4), 8-bit precision support

License Apache 2.0

Release Date August 2025

Developers OpenAI
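
The context length and throughput figures above lend themselves to two quick sanity checks before sending a request. The sketch below assumes that "131.1k Tokens" means 131,072 tokens, that prompt and generated tokens share the same context window, and that the quoted 386 tokens/second holds steadily; none of these is confirmed on this page, and real throughput will vary with load. The function names are illustrative.

```python
CONTEXT_LENGTH = 131_072   # assumption: the exact value behind "131.1k Tokens"
TOKENS_PER_SECOND = 386    # throughput figure quoted in the specifications


def fits_context(prompt_tokens, max_tokens, context_length=CONTEXT_LENGTH):
    """Return True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_tokens <= context_length


def estimated_generation_seconds(output_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Rough generation time, ignoring network latency and queueing."""
    return output_tokens / tokens_per_second


print(fits_context(120_000, 4096))                     # 124,096 tokens requested
print(fits_context(128_000, 4096))                     # 132,096 tokens requested
print(f"{estimated_generation_seconds(4096):.1f} s")   # full default max_tokens
```

At the quoted rate, generating the default 4096 output tokens would take roughly 10.6 seconds.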

Pricing

Pay-per-use, no commitments

Input Tokens $0.05/1M Tokens

Output Tokens $0.28/1M Tokens
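
As a worked example of the rates above, the following sketch estimates the cost of a single request. It assumes billing is strictly linear in token counts with no minimum charge or rounding, which this page does not state; `estimate_cost_usd` is an illustrative name.

```python
INPUT_USD_PER_MILLION = 0.05    # input rate listed above
OUTPUT_USD_PER_MILLION = 0.28   # output rate listed above


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost in USD, assuming purely linear per-token billing."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# A 2,000-token prompt with a full 4,096-token reply
print(f"${estimate_cost_usd(2_000, 4_096):.6f}")
```

Under these assumptions, one million input tokens plus one million output tokens would come to $0.33.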

API Reference

Complete parameter documentation

Parameter	Type	Default	Description
stream	boolean	true	Enable streaming responses for real-time output.
temperature	number	0.7	Controls randomness. Higher values mean more creative but less predictable output.
max_tokens	number	4096	Maximum number of tokens to generate in the response.
top_p	number	1	Nucleus sampling: restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p.

Explore the full request and response schema in our external API documentation
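
The defaults in the table can be captured in a small helper that builds a request body and rejects unknown or out-of-range values before anything is sent. This is a sketch only: `build_payload` is a hypothetical helper, and the range checks (non-negative temperature, `top_p` in (0, 1], positive `max_tokens`) are common conventions assumed here, not limits documented on this page.

```python
import json

# Defaults taken from the parameter table above.
DEFAULTS = {"stream": True, "temperature": 0.7, "max_tokens": 4096, "top_p": 1}


def build_payload(prompt, **overrides):
    """Build a chat completion request body from the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    # Assumed ranges; the service may enforce different limits.
    if params["temperature"] < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < params["top_p"] <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if params["max_tokens"] < 1:
        raise ValueError("max_tokens must be at least 1")
    return {
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        **params,
    }


payload = build_payload("Explain quantum computing in simple terms",
                        stream=False, max_tokens=512)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized and used as the body of the curl request shown earlier.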

Performance

Strengths & considerations

Strengths

Compact Mixture-of-Experts (MoE) design with SwiGLU activations

Token-choice MoE optimized for single-GPU efficiency

Native FP4 quantization for optimal inference speed

Single B200 GPU deployment capability

131K context window with efficient memory usage

Adjustable reasoning effort levels for task-specific optimization

Supports function calling with defined schemas

Apache 2.0 license for commercial use

Considerations

Smaller than largest frontier models

May require fine-tuning for specialized domains

MoE architecture complexity for some use cases
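
The adjustable reasoning effort listed among the strengths is, for the open-weight gpt-oss models, typically selected with a line such as "Reasoning: high" in the system prompt. The sketch below prepends such a line to a message list. Whether this hosted endpoint honors the system-prompt convention, or exposes a separate request parameter instead, is an assumption not confirmed on this page; `with_reasoning_level` is an illustrative name.

```python
REASONING_LEVELS = ("low", "medium", "high")


def with_reasoning_level(messages, level):
    """Return a new message list with a system prompt selecting the reasoning level."""
    if level not in REASONING_LEVELS:
        raise ValueError(f"level must be one of {REASONING_LEVELS}")
    system = {"role": "system", "content": f"Reasoning: {level}"}
    # The caller's list is left unmodified.
    return [system, *messages]


messages = [{"role": "user", "content": "Explain quantum computing in simple terms"}]
for message in with_reasoning_level(messages, "high"):
    print(message)
```

Lower levels generally trade reasoning depth for latency, so a quick lookup might use "low" while a multi-step agentic task uses "high".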

Use cases

Recommended applications for this model

Function calling with schemas

Web browsing and browser automation

Agentic tasks

Chain-of-thought reasoning

Local and low-latency deployments

Rapid prototyping and development support

Code generation and optimization

Customer support automation

Content generation and editing

Process automation and workflow optimization
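
For the function-calling use case, a request usually advertises the available tools as JSON schemas and the application runs whichever function the model asks for. The sketch below uses the OpenAI Chat Completions `tools` format and assumes this endpoint accepts it, which the page does not confirm. `get_weather`, `LOCAL_FUNCTIONS`, and `dispatch_tool_call` are hypothetical names for illustration, and the demo call uses a hand-written tool call rather than a real model response.

```python
import json

# Tool definition in the OpenAI Chat Completions "tools" format.
# get_weather is a hypothetical function used only for illustration.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city):
    """Hypothetical local implementation; a real one would call a weather service."""
    return {"city": city, "forecast": "unknown"}


LOCAL_FUNCTIONS = {"get_weather": get_weather}


def dispatch_tool_call(name, arguments_json):
    """Run the local function named by a tool call, decoding its JSON arguments."""
    if name not in LOCAL_FUNCTIONS:
        raise KeyError(f"model requested an unknown tool: {name}")
    return LOCAL_FUNCTIONS[name](**json.loads(arguments_json))


# Request body that advertises the tool (assumes the endpoint accepts "tools").
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": TOOLS,
}

# Offline demo with a hand-written tool call in place of a model response.
print(dispatch_tool_call("get_weather", '{"city": "Paris"}'))
```

Keeping the schema and the dispatch table side by side makes it harder for the advertised tools and the functions actually available to drift apart.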

Enterprise
Platform Integration

Docker Support

Official Docker images for containerized deployments

Kubernetes Ready

Production-grade K8s manifests and Helm charts

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java
