Inference Cost Calculator

Cost per million output tokens for self-hosted inference across H100 / H200 / B200 / MI300X.

The engineer question
What does it cost to self-host a 70B model at 100k QPS?

Inputs

Total (dense-equivalent) parameter count. For MoE, use active params per token.

Generated tokens per request. Prefill (input) cost is excluded — see assumptions.

Sustained requests per second the cluster must serve.

Result

$ / 1M output tokens (decode only): $0.43
Tokens / sec / GPU (est., scaled from 1,800 at 70B): 1,800 tok/s
Aggregate token rate (100.0k req/s × 500 tok): 50.00M tok/s
GPU-hours / day: 666.7k
GPUs needed (steady state, no redundancy / headroom): 27,778
Cost / day (compute only): $1.87M
Recommended cluster shape (27,778 GPUs total): 3,473 × 8-GPU nodes

Recommendation

~27,778 GPUs (3,473 nodes) is a serious dedicated cluster. Owned hardware or a multi-year reserved commit will beat on-demand $/hr by roughly 2–4×, so treat the $/1M-token figure as an on-demand upper bound. Add ~20% GPU headroom for traffic spikes and node failures.
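The headroom and discount guidance in the recommendation can be turned into rough numbers. The sketch below is illustrative only: the function names are hypothetical, the 20% headroom is applied as a simple multiplier on the steady-state GPU count, and "2–4× cheaper" is read as dividing the on-demand $/1M-token figure by 2 to 4.

```python
import math

# Figures copied from the result above (default inputs).
STEADY_STATE_GPUS = 27_778
ON_DEMAND_USD_PER_M = 0.43   # $ per 1M output tokens, on-demand
GPUS_PER_NODE = 8


def with_headroom(gpus, headroom=0.20, gpus_per_node=GPUS_PER_NODE):
    """Return (total GPUs, whole nodes) after adding fractional headroom.

    Assumption: headroom is a plain multiplier on the steady-state count.
    """
    total = math.ceil(gpus * (1 + headroom))
    nodes = math.ceil(total / gpus_per_node)
    return total, nodes


def reserved_range(on_demand_usd_per_m, low_factor=2.0, high_factor=4.0):
    """Return (best case, worst case) $/1M tokens under a 2-4x discount.

    Assumption: the discount applies uniformly to the $/GPU-hr rate,
    so it carries through unchanged to the per-token figure.
    """
    return on_demand_usd_per_m / high_factor, on_demand_usd_per_m / low_factor


total_gpus, total_nodes = with_headroom(STEADY_STATE_GPUS)
cheap, expensive = reserved_range(ON_DEMAND_USD_PER_M)
print(f"With 20% headroom: {total_gpus:,} GPUs in {total_nodes:,} nodes")
print(f"Reserved/owned estimate: ${cheap:.2f} to ${expensive:.2f} per 1M tokens")
```

Under these assumptions the cluster grows to about 33,334 GPUs (4,167 eight-GPU nodes), while the per-token cost falls to somewhere between roughly a quarter and a half of the on-demand figure.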

Assumptions

  • FIRST-ORDER ESTIMATE, not a benchmark. Real throughput depends on the serving stack (vLLM/TRT-LLM/SGLang), quantization, sequence length, KV-cache pressure, and batch composition. Treat every number as accurate only to within roughly a factor of 2.
  • Throughput model: tok/s/GPU = ref_tok_s × (70B ÷ model_B). Inverse-linear scaling with active params is a rough memory-bandwidth heuristic; very small (<7B) and very large (>200B, multi-GPU) models deviate from it.
  • $/1M tokens = $/hr ÷ (tok/s/GPU × 3600) × 1e6. GPUs needed = ceil(aggregate_tok_s ÷ tok/s/GPU); GPU-hours/day = GPUs needed × 24.
  • NVIDIA H100 80GB (SXM): ~$2.80/GPU-hr on-demand cloud-equivalent (mid-2026 list pricing, approximate); decode anchor ~1,800 tok/s for a 70B dense model under continuous batching (~60% utilization folded in).
  • Reserved / committed-use / owned-hardware TCO is typically 2–4× cheaper per GPU-hr than the on-demand rates used here, so this calculator returns an on-demand upper bound.
  • Source basis: vendor spec sheets (NVIDIA H100/H200/B200, AMD MI300X) for memory bandwidth, plus trade-press and public cloud GPU price surveys for $/hr. Numbers are typical, not vendor-audited, and drift quarter to quarter.
  • EXCLUDED: prefill / input-token cost (only output/decode tokens are priced), networking (InfiniBand/RoCE), CPU head nodes, storage, load-balancer/router overhead, redundancy & autoscaling headroom, power & cooling, software licensing, and engineering time.
  • MoE models: enter active params per token, not total params, or throughput will be badly underestimated.
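The formulas in the assumptions can be checked by hand with a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: the function and constant names are hypothetical, and only the H100 anchor values stated above (1,800 tok/s at 70B, $2.80/GPU-hr, 8-GPU nodes) are used.

```python
import math

# Anchor values taken from the H100 assumption above.
REF_TOK_S = 1800.0        # decode tok/s/GPU for a 70B dense model
REF_MODEL_B = 70.0        # reference model size, billions of params
PRICE_PER_GPU_HR = 2.80   # on-demand $/GPU-hr (approximate)
GPUS_PER_NODE = 8


def estimate(model_b, output_tokens, req_per_s):
    """First-order decode-only cost estimate (prefill excluded).

    model_b: active params in billions (for MoE, active per token).
    output_tokens: generated tokens per request.
    req_per_s: sustained requests per second.
    """
    # Throughput model: inverse-linear scaling with active params.
    tok_s_per_gpu = REF_TOK_S * (REF_MODEL_B / model_b)
    # $/1M tokens = $/hr / (tok/s/GPU * 3600) * 1e6
    usd_per_million = PRICE_PER_GPU_HR / (tok_s_per_gpu * 3600) * 1e6
    aggregate_tok_s = req_per_s * output_tokens
    gpus = math.ceil(aggregate_tok_s / tok_s_per_gpu)
    gpu_hours_per_day = gpus * 24
    return {
        "tok_s_per_gpu": tok_s_per_gpu,
        "usd_per_million": usd_per_million,
        "aggregate_tok_s": aggregate_tok_s,
        "gpus": gpus,
        "nodes": math.ceil(gpus / GPUS_PER_NODE),
        "gpu_hours_per_day": gpu_hours_per_day,
        "usd_per_day": gpu_hours_per_day * PRICE_PER_GPU_HR,
    }


# Default inputs: 70B model, 500 output tokens per request, 100k req/s.
result = estimate(model_b=70, output_tokens=500, req_per_s=100_000)
print(f"${result['usd_per_million']:.2f} per 1M output tokens")
print(f"{result['gpus']:,} GPUs in {result['nodes']:,} nodes")
print(f"${result['usd_per_day'] / 1e6:.2f}M per day")
```

With the default inputs this reproduces the figures in the result: $0.43 per 1M output tokens, 27,778 GPUs in 3,473 nodes, and about $1.87M per day. Passing a smaller `model_b` (for example, the active parameter count of an MoE model) raises the estimated per-GPU throughput and lowers every cost figure in proportion.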
