Inference Cost Calculator
Per-million-token cost of self-hosted inference on H100, H200, B200, and MI300X GPUs.
The engineer question
What does it cost to self-host a 70B model at 100k QPS, with 500 output tokens per request?
Result
- $ / 1M output tokens (decode only): $0.43
- Tokens / sec / GPU (est., scaled from 1,800 at 70B): 1,800 tok/s
- Aggregate token rate (100.0k req/s × 500 tok): 50.00M tok/s
- GPU-hours / day: 666.7k
- GPUs needed (steady state, no redundancy / headroom): 27,778
- Cost / day (compute only): $1.87M
- Recommended cluster shape (27,778 GPUs total): 3,473 × 8-GPU nodes
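The sizing figures in this result follow from a few lines of arithmetic. The sketch below is a minimal reconstruction, assuming the default inputs shown on this page (100k req/s, 500 output tokens per request, 1,800 tok/s per GPU, $2.80 per GPU-hour, 8 GPUs per node). The names `ClusterEstimate` and `size_cluster` are illustrative and are not taken from the calculator's own code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEstimate:
    aggregate_tok_s: float   # total output tokens per second across all requests
    gpus: int                # steady-state GPU count, no headroom
    nodes: int               # whole nodes needed to host those GPUs
    gpu_hours_per_day: int   # GPUs x 24
    cost_per_day_usd: float  # compute only, at the given hourly rate


def size_cluster(req_per_s, out_tok_per_req, tok_s_per_gpu,
                 usd_per_gpu_hr, gpus_per_node=8):
    """First-order cluster sizing for decode traffic (hypothetical helper)."""
    aggregate_tok_s = req_per_s * out_tok_per_req
    # Round up: a fractional GPU still has to be provisioned.
    gpus = math.ceil(aggregate_tok_s / tok_s_per_gpu)
    nodes = math.ceil(gpus / gpus_per_node)
    gpu_hours_per_day = gpus * 24
    cost_per_day_usd = gpu_hours_per_day * usd_per_gpu_hr
    return ClusterEstimate(aggregate_tok_s, gpus, nodes,
                           gpu_hours_per_day, cost_per_day_usd)


if __name__ == "__main__":
    est = size_cluster(100_000, 500, 1_800, 2.80)
    print(f"{est.gpus:,} GPUs in {est.nodes:,} nodes, "
          f"{est.gpu_hours_per_day:,} GPU-hours/day, "
          f"${est.cost_per_day_usd / 1e6:.2f}M/day")
```

Because the node count is rounded up, 3,473 eight-GPU nodes actually hold 27,784 GPUs, six more than the steady-state requirement.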
Recommendation
~27,778 GPUs (3,473 nodes) is a serious dedicated cluster. Owned hardware or a multi-year reserved commit will beat on-demand $/hr by roughly 2–4×, so treat the $/1M-token figure as an on-demand upper bound. Add ~20% GPU headroom for traffic spikes and node failures.
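To see what the headroom and pricing advice means in numbers, the following sketch applies the ~20% headroom and the 2–4× reserved discount to the default result. It is a rough illustration only; `with_headroom` and `reserved_range` are hypothetical helper names, and the discount factors are simply the range quoted above.

```python
import math


def with_headroom(gpus, headroom=0.20, gpus_per_node=8):
    """Pad a steady-state GPU count and return (padded_gpus, nodes)."""
    padded = math.ceil(gpus * (1 + headroom))
    return padded, math.ceil(padded / gpus_per_node)


def reserved_range(on_demand_usd, min_discount=2.0, max_discount=4.0):
    """Return (low, high) cost after a 2-4x reserved/owned discount."""
    return on_demand_usd / max_discount, on_demand_usd / min_discount


if __name__ == "__main__":
    gpus, nodes = with_headroom(27_778)
    low, high = reserved_range(0.43)
    print(f"With 20% headroom: {gpus:,} GPUs in {nodes:,} nodes")
    print(f"Reserved/owned estimate: ${low:.2f} to ${high:.2f} per 1M output tokens")
```

Note that the two adjustments pull in opposite directions: headroom raises the GPU count while reserved pricing lowers the hourly rate, so both need to be applied before comparing against the on-demand figure.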
Assumptions
- FIRST-ORDER ESTIMATE, not a benchmark. Real throughput depends on serving stack (vLLM/TRT-LLM/SGLang), quantization, sequence length, KV-cache pressure, and batch composition. Treat every number as accurate only to within roughly a factor of 2.
- Throughput model: tok/s/GPU = ref_tok_s × (70B ÷ model_B), where ref_tok_s is the decode anchor at 70B and model_B is the model's active parameter count in billions. Inverse-linear scaling with active params is a rough memory-bandwidth heuristic; very small (<7B) and very large (>200B, multi-GPU) models deviate.
- $/1M tokens = $/hr ÷ (tok/s/GPU × 3600) × 1e6. GPUs needed = ceil(aggregate_tok_s ÷ tok/s/GPU); GPU-hours/day = GPUs needed × 24.
- NVIDIA H100 80GB (SXM): ~$2.80/GPU-hr on-demand cloud-equivalent (mid-2026 list pricing, approximate); decode anchor ~1,800 tok/s for a 70B dense model under continuous batching (~60% utilization folded in).
- Reserved / committed-use / owned-hardware TCO is typically 2–4× cheaper per GPU-hr than the on-demand rates used here, so this calculator returns an on-demand upper bound.
- Source basis: vendor spec sheets (NVIDIA H100/H200/B200, AMD MI300X) for memory bandwidth, plus trade-press and public cloud GPU price surveys for $/hr. Numbers are typical, not vendor-audited, and drift quarter to quarter.
- EXCLUDED: prefill / input-token cost (only output/decode tokens are priced), networking (InfiniBand/RoCE), CPU head nodes, storage, load-balancer/router overhead, redundancy & autoscaling headroom, power & cooling, software licensing, and engineering time.
- MoE models: enter active params per token, not total params, or throughput will be badly underestimated.
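The throughput model and the $/1M-token formula stated in these assumptions can be written out directly. This is a minimal sketch of those two formulas, not the calculator's actual implementation; the function names are illustrative, and the MoE model in the usage example (140B total, 35B active parameters) is a made-up configuration chosen only to show why the active-parameter input matters.

```python
def tok_s_per_gpu(active_params_b, ref_tok_s=1800.0, ref_params_b=70.0):
    """Decode throughput per GPU, scaled inversely with active params (billions)."""
    if active_params_b <= 0:
        raise ValueError("active_params_b must be positive")
    return ref_tok_s * (ref_params_b / active_params_b)


def usd_per_million_tokens(usd_per_gpu_hr, tok_s):
    """$/1M output tokens = $/hr / (tok/s/GPU * 3600) * 1e6."""
    return usd_per_gpu_hr / (tok_s * 3600) * 1e6


if __name__ == "__main__":
    # Default case: 70B dense model on an H100 at $2.80/GPU-hr.
    dense = tok_s_per_gpu(70)
    print(f"70B dense: {dense:,.0f} tok/s, "
          f"${usd_per_million_tokens(2.80, dense):.2f} per 1M tokens")

    # Hypothetical MoE: 140B total params, 35B active per token.
    right = tok_s_per_gpu(35)    # active params: the intended input
    wrong = tok_s_per_gpu(140)   # total params: underestimates throughput
    print(f"MoE with active params: {right:,.0f} tok/s; "
          f"with total params: {wrong:,.0f} tok/s")
```

In the hypothetical MoE case, entering total instead of active parameters cuts the estimated throughput by 4×, which would inflate the $/1M-token figure by the same factor.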