Model Cost Profile
Developer: nvidia · Tokenizer: Llama3
Pricing updated Mar 22, 2026 · Last synced Mar 22, 2026 · MMLU score via public benchmark data
Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) optimized for advanced reasoning, human-interactive chat, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta’s Llama-3.1-405B-Instruct, it has been significantly customized using Neural Architecture Search (NAS), resulting in enhanced efficiency, reduced memory usage, and improved inference latency. The model supports a context length of up to 128K tokens and can operate efficiently on an 8x NVIDIA H100 node. Note: you must include `detailed thinking on` in the system prompt to enable reasoning. Please see [Usage Recommendations](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1#quick-start-and-usage-recommendations) for more.
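Since the reasoning toggle lives in the system prompt, here is a minimal sketch of the request payload an OpenAI-compatible client would send. The model ID and the sampling settings are assumptions drawn from the Hugging Face card linked above; nothing here is executed against a live endpoint.

```python
# Sketch: build an OpenAI-compatible chat payload that toggles
# Nemotron's reasoning mode via the system prompt.
# Model ID below is an assumption based on the Hugging Face card.

def build_request(user_message: str, reasoning: bool = True) -> dict:
    """Return a chat-completions payload with reasoning toggled on or off."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return {
        "model": "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 1024,
        "temperature": 0.6,  # sampling for reasoning mode (assumed from the card)
    }

payload = build_request("Why is the sky blue?")
```

Sending the same payload with `reasoning=False` produces a `detailed thinking off` system prompt, which the card describes as the non-reasoning mode.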
| Metric | Value | Notes |
|---|---|---|
| Context Window | 131,072 | Input tokens; full-context input ≈ $0.08 |
| Max Output | — | Not specified |
| Input Price / 1M | $0.60 | Prompt tokens |
| Output Price / 1M | $1.80 | Completion tokens |
| Top Benchmark | 82.5 | MMLU score (highest of MMLU, GPQA, MATH & HumanEval) |
Evaluation scores for NVIDIA: Llama 3.1 Nemotron Ultra 253B v1. The “Top Benchmark” shown above is the highest score across MMLU, GPQA, MATH & HumanEval.
Price History
Not enough data yet; price tracking started recently.
Performance History
Not enough data yet; performance tracking started recently.
| Usage Type | Price / 1M Tokens |
|---|---|
| Input (Prompt) | $0.60 |
| Output (Completion) | $1.80 |
Estimate monthly spend for NVIDIA: Llama 3.1 Nemotron Ultra 253B v1 based on your workload.
Estimated Monthly Cost
$37
25M input + 12M output tokens
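The estimate above is straightforward arithmetic from the per-1M prices in the table. A minimal sketch (prices from this page; the 25M/12M volumes are just the example workload shown above):

```python
# Sketch: reproduce the monthly-spend estimate from the listed per-1M prices.
INPUT_PRICE_PER_M = 0.60    # $ per 1M prompt tokens
OUTPUT_PRICE_PER_M = 1.80   # $ per 1M completion tokens

def monthly_cost(input_millions: float, output_millions: float) -> float:
    """Dollar cost for a month's usage, volumes given in millions of tokens."""
    return input_millions * INPUT_PRICE_PER_M + output_millions * OUTPUT_PRICE_PER_M

# 25M input + 12M output tokens, as in the estimate above:
cost = monthly_cost(25, 12)  # 25 * 0.60 + 12 * 1.80 = 36.60, displayed as ~$37

# The "full-context input ≈ $0.08" figure is the same arithmetic
# applied to one maximal 131,072-token prompt:
full_context = 131_072 / 1_000_000 * INPUT_PRICE_PER_M  # ≈ $0.079
```

Swapping in your own token volumes gives a quick sanity check against the estimator widget.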
Quick links to support cost-reduction decisions before a production rollout.