Available Models

DOS AI serves high-quality open-source LLMs via an OpenAI-compatible API. Self-hosted models run on dedicated RTX Pro 6000 GPUs with 96 GB VRAM in Asia-Southeast 1. Cloud models are served via partner providers for maximum coverage.

Smart Routing

Use dos-auto as the model ID to let DOS AI automatically select the best model for each request. Smart routing uses a 15-dimension classifier to analyze your prompt and route to the optimal model based on task complexity, cost, and latency.

response = client.chat.completions.create(
    model="dos-auto",  # Smart routing picks the best model
    messages=[{"role": "user", "content": "..."}],
)

Model Catalog

Self-Hosted (Lowest Latency)

Model
Provider
Context
Input
Output
Model ID

Qwen3.5-35B-A3B

Alibaba

128K

$0.15 / 1M

$0.15 / 1M

dos-ai

Cloud Models

Model
Provider
Context
Input
Output
Model ID

Llama 4 Maverick 17B-128E

Meta / DeepInfra

1M

$0.17 / 1M

$0.66 / 1M

llama-4-maverick

Llama 4 Scout 17B-16E

Meta / DeepInfra

640K

$0.11 / 1M

$0.38 / 1M

llama-4-scout

DeepSeek V3

DeepSeek

128K

$0.25 / 1M

$0.25 / 1M

deepseek-v3

Llama 3.3 70B

Meta

128K

$0.20 / 1M

$0.20 / 1M

llama-3.3-70b

Llama 3.1 8B

Meta

128K

$0.05 / 1M

$0.05 / 1M

llama-3.1-8b

All prices are in USD. The catalog is DB-driven -- new models are added regularly. Check GET /v1/catalog or the dashboard for the latest list. See Pricing for billing details.

Embedding Models

Model
Provider
Dimensions
Model ID

Qwen3-Embedding-4B AWQ

Alibaba / Self-hosted

2560

qwen3-embedding-4b

Model Details

Qwen3.5-35B-A3B (default)

Alibaba's Mixture-of-Experts model with 35 billion total parameters and 3 billion active parameters per forward pass. This architecture delivers excellent quality at remarkably low cost and latency, making it our recommended default model for most use cases.

  • Best for: General-purpose chat, code generation, reasoning, multilingual tasks

  • Strengths: Outstanding cost-efficiency, fast response times, strong multilingual support (especially CJK languages)

  • Model ID: dos-ai

Llama 4 Maverick 17B-128E

Meta's latest Mixture-of-Experts model with 17 billion active parameters and 128 experts. Strong reasoning and multilingual capabilities with an industry-leading 1 million token context window.

  • Best for: Complex reasoning, long-context analysis, multilingual tasks

  • Strengths: Massive context window, strong benchmark scores, efficient MoE architecture

  • Model ID: llama-4-maverick

Llama 4 Scout 17B-16E

Meta's efficient MoE model with 17 billion active parameters and 16 experts. Fast and cost-effective for everyday tasks with a 640K context window.

  • Best for: Everyday tasks, fast responses, cost-sensitive workloads

  • Strengths: Good balance of speed and quality, large context window

  • Model ID: llama-4-scout

DeepSeek V3

DeepSeek's latest Mixture-of-Experts model, known for strong performance across coding, math, and reasoning benchmarks.

  • Best for: Code generation, mathematical reasoning, structured output

  • Strengths: Competitive benchmark scores, good at structured/JSON output, strong code capabilities

  • Model ID: deepseek-v3

Llama 3.3 70B

Meta's 70-billion-parameter dense model. Offers top-tier reasoning and instruction-following capabilities.

  • Best for: Complex reasoning, long-form content, detailed analysis

  • Strengths: Strong English performance, excellent instruction following, robust safety tuning

  • Model ID: llama-3.3-70b

Llama 3.1 8B

Meta's efficient 8-billion-parameter model. An excellent choice when you need fast, affordable responses and the task does not require the full capability of a larger model.

  • Best for: Simple tasks, high-throughput workloads, prototyping, cost-sensitive applications

  • Strengths: Very low latency, lowest cost, suitable for classification and extraction tasks

  • Model ID: llama-3.1-8b

Choosing the Right Model

Use Case
Recommended Model
Why

Let DOS AI decide

dos-auto

Smart routing picks the best model per request

General assistant / chatbot

Qwen3.5-35B-A3B

Best balance of quality, speed, and cost

Long-context analysis (100K+ tokens)

Llama 4 Maverick

1M context window, strong reasoning

Complex reasoning / analysis

Llama 3.3 70B

Dense model, top reasoning capability

Code generation / math

DeepSeek V3

Top coding and math benchmark scores

High-volume / low-cost tasks

Llama 3.1 8B

Fastest and cheapest option

Multilingual (CJK languages)

Qwen3.5-35B-A3B

Superior CJK language performance

Listing Models via API

You can retrieve the current list of available models programmatically:

For the full retail catalog with pricing and metadata:

See the Models API reference for the full response schema.

Last updated