Memory Bandwidth Bottleneck Detector

Given a model + accelerator, decide whether you are bandwidth-bound or compute-bound.

The engineer question
Is my 70B inference bandwidth-bound on H100?

Inputs

Model size: total parameter count in billions (e.g. 70 for Llama-3-70B).

Batch size: tokens generated in parallel per step. At 2 bytes per weight, decode arithmetic intensity ≈ batch size.

Result

  • Arithmetic intensity (decode), ≈ 2 × batch ÷ bytes-per-weight: 1.0 FLOP/byte
  • Accelerator machine balance (ridge point), peak FLOPS ÷ HBM bandwidth: 295 FLOP/byte
  • Bottleneck: memory-bandwidth-bound (intensity is ~0.3% of ridge)
  • Est. HBM-bandwidth utilisation: 100%
  • Est. compute (FLOP) utilisation: ~0.3% (tensor cores mostly idle)
  • Roofline token throughput (upper bound; excludes attention/KV/overhead): ~23.9 tok/s
  • Weight footprint (vs 80 GB HBM on one device): 140.0 GB
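
The figures above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not the tool's actual code: the `Accelerator` class and the `decode_roofline` function are hypothetical names introduced here, and the H100 numbers are the ones listed under Assumptions. It counts weight matmuls only, as the tool does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical container for the spec-sheet figures the roofline needs."""
    name: str
    hbm_gb: float       # HBM capacity in GB
    hbm_tbps: float     # peak HBM bandwidth in TB/s
    peak_tflops: float  # dense peak for the chosen dtype in TFLOP/s


def decode_roofline(params_b, batch, bytes_per_weight, acc):
    """Weight-matmul-only roofline for one decode step (assumed model)."""
    # Each weight is read once per step and reused across the batch.
    intensity = 2.0 * batch / bytes_per_weight      # FLOP/byte
    ridge = acc.peak_tflops / acc.hbm_tbps          # FLOP/byte
    weight_bytes = params_b * 1e9 * bytes_per_weight
    t_mem = weight_bytes / (acc.hbm_tbps * 1e12)    # seconds per step
    t_compute = 2.0 * params_b * 1e9 * batch / (acc.peak_tflops * 1e12)
    step_time = max(t_mem, t_compute)               # the binding ceiling
    return {
        "intensity": intensity,
        "ridge": ridge,
        "bound": "memory" if intensity < ridge else "compute",
        "tok_per_s": batch / step_time,
        "weight_gb": weight_bytes / 1e9,
        "fits_one_device": weight_bytes / 1e9 <= acc.hbm_gb,
    }


h100 = Accelerator("NVIDIA H100 SXM (80 GB)", hbm_gb=80.0,
                   hbm_tbps=3.35, peak_tflops=989.0)
result = decode_roofline(params_b=70, batch=1, bytes_per_weight=2.0, acc=h100)
for key, value in result.items():
    print(f"{key}: {value}")
```

Comparing `intensity` with `ridge` is equivalent to comparing the compute time with the memory time per step, which is why a single `max()` gives the throughput ceiling on either side of the ridge.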

Recommendation

Decode is bandwidth-bound: arithmetic intensity (1.0 FLOP/byte) sits far below the NVIDIA H100 SXM (80 GB) ridge point (295 FLOP/byte), so tensor cores idle for nearly all of decode time while HBM is saturated. Fixes, in order of impact:

  (1) Raise batch size: the crossover to compute-bound is at about batch 296 at this dtype.
  (2) Quantize weights further: fewer bytes/param ⇒ less HBM traffic ⇒ higher tokens/s and higher intensity.
  (3) Move to a higher-HBM-bandwidth part (H200/B200/MI300X): for memory-bound decode, throughput scales ~linearly with HBM bandwidth, not FLOPS.
  (4) Use speculative decoding (e.g. Medusa) to verify multiple tokens per weight read.

Note: at 140 GB the weights do not fit on one 80 GB device. Tensor/pipeline parallelism adds interconnect traffic that this model ignores.
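
The first three fixes follow directly from the roofline formula. As a rough sketch under the same weights-only assumptions, the Python below derives the crossover batch and shows how the memory-bound ceiling moves with bytes per weight and with HBM bandwidth. The function names are hypothetical, and the 1.5× bandwidth part is an illustrative placeholder, not a real SKU.

```python
import math


def crossover_batch(peak_tflops, hbm_tbps, bytes_per_weight):
    """Smallest batch whose decode intensity reaches the ridge point."""
    ridge = peak_tflops / hbm_tbps  # FLOP/byte
    # Solve 2 * batch / bytes_per_weight >= ridge for batch.
    return math.ceil(ridge * bytes_per_weight / 2.0)


def memory_bound_tok_s(params_b, bytes_per_weight, hbm_tbps, batch=1):
    """Throughput ceiling while HBM is the binding limit (weights only)."""
    weight_bytes = params_b * 1e9 * bytes_per_weight
    return batch * hbm_tbps * 1e12 / weight_bytes


# H100 SXM figures from the Assumptions list: 989 TFLOP/s FP16, 3.35 TB/s.
print("crossover batch (FP16):", crossover_batch(989.0, 3.35, 2.0))

fp16 = memory_bound_tok_s(70, 2.0, 3.35)
fp8 = memory_bound_tok_s(70, 1.0, 3.35)               # half the bytes per weight
faster_hbm = memory_bound_tok_s(70, 2.0, 3.35 * 1.5)  # hypothetical 1.5x HBM part
print(f"FP16: {fp16:.1f} tok/s, FP8: {fp8:.1f} tok/s, "
      f"1.5x HBM: {faster_hbm:.1f} tok/s")
```

Halving bytes per weight doubles the memory-bound ceiling, and a bandwidth increase carries through one-for-one. With the FP8 figures from the Assumptions (1979 TFLOP/s at 1 byte per weight), the crossover batch stays at about 296, because the peak and the bytes per weight change by nearly the same factor.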

Assumptions

  • Model: standard roofline for autoregressive decode only (1 token per sequence per step). Weight-matmul work ≈ 2 FLOPs/param/token; each step reads every weight from HBM exactly once and reuses it across the batch ⇒ arithmetic intensity ≈ 2 × batch ÷ bytes-per-weight. Source: the classic Williams et al. roofline model plus standard LLM-inference analyses.
  • NVIDIA H100 SXM (80 GB): HBM 80 GB @ 3.35 TB/s; dense peak 989 TFLOP/s FP16, 1979 FP8, 3958 INT4. Figures are from public vendor spec sheets (NVIDIA/AMD datasheets, mid-2026); approximate, ±5–10% across SKU/clock bins and cloud instances.
  • Quantization: FP16 / BF16 (2 bytes). Bytes/param drives HBM traffic. The peak-FLOPS figure used is the matching dense tensor-core path for the chosen dtype.
  • Utilisation is a roofline upper bound: the binding ceiling is modelled at 100% and the other scales by the time ratio. Real kernels realise ~50–75% of peak FLOPS and ~70–90% of peak HBM bandwidth, so treat the reported % as best-case, not measured.
  • Token throughput is a ceiling: it counts weight matmuls only.
  • Excluded: attention/self-attention FLOPs, KV-cache reads and writes (which dominate at long context and large batch and can flip the verdict), prefill, MoE routing/sparsity, activation memory, kernel-launch and Python overhead, tensor/pipeline-parallel interconnect (NVLink/IB) traffic when weights span multiple devices, and sustained-vs-peak derating.
  • Not authoritative or audited: this is an order-of-magnitude planning aid. Benchmark your actual serving stack (vLLM/TRT-LLM/SGLang) for committed numbers.
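
To illustrate the utilisation assumption, the sketch below computes the two utilisation figures from the per-step time ratio, then applies the quoted real-kernel efficiency ranges to turn the ceiling into a rough band. The function names and the `bw_eff` / `flop_eff` parameters are hypothetical. Applying the efficiencies as simple multipliers on peak is a simplification assumed here, not something the tool is known to do.

```python
def step_times(params_b, batch, bytes_per_weight, hbm_tbps, peak_tflops,
               bw_eff=1.0, flop_eff=1.0):
    """Per-step memory and compute times, with optional efficiency derating."""
    weight_bytes = params_b * 1e9 * bytes_per_weight
    flops = 2.0 * params_b * 1e9 * batch
    t_mem = weight_bytes / (hbm_tbps * 1e12 * bw_eff)
    t_compute = flops / (peak_tflops * 1e12 * flop_eff)
    return t_mem, t_compute


def utilisation(t_mem, t_compute):
    """Binding ceiling is modelled at 100%; the other scales by the time ratio."""
    step = max(t_mem, t_compute)
    return t_mem / step, t_compute / step


# Best case at peak figures: 70B params, batch 1, FP16 on H100 SXM.
t_mem, t_compute = step_times(70, 1, 2.0, 3.35, 989.0)
hbm_util, flop_util = utilisation(t_mem, t_compute)
print(f"HBM utilisation: {hbm_util:.1%}, FLOP utilisation: {flop_util:.2%}")

# Derated band using the low and high ends of the quoted efficiency ranges.
for label, bw_eff, flop_eff in [("low", 0.70, 0.50), ("high", 0.90, 0.75)]:
    tm, tc = step_times(70, 1, 2.0, 3.35, 989.0, bw_eff, flop_eff)
    print(f"{label}: {1 / max(tm, tc):.1f} tok/s")
```

Under these assumptions the ~23.9 tok/s ceiling drops to roughly 17 to 22 tok/s before any of the excluded costs (attention, KV cache, interconnect) are counted.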
