Chips & Compute layer
Memory Bandwidth Bottleneck Detector
Given a model + accelerator, decide whether you are bandwidth-bound or compute-bound.
The engineer question
Is my 70B inference bandwidth-bound on H100?
Result
- Arithmetic intensity (decode), ≈ 2 × batch ÷ bytes-per-weight: 1.0 FLOP/byte
- Accelerator machine balance (ridge point), peak FLOPS ÷ HBM bandwidth: 295 FLOP/byte
- Bottleneck (intensity is 0% of ridge): Memory-bandwidth-bound
- Est. HBM-bandwidth utilisation: 100%
- Est. compute (FLOP) utilisation (tensor cores mostly idle): 0%
- Roofline token throughput (upper bound; excludes attention/KV/overhead): ~23.9 tok/s
- Weight footprint (vs 80 GB HBM on one device): 140.0 GB
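The figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the tool's actual implementation: the function and field names (`decode_roofline`, `RooflineResult`) are hypothetical, and the inputs are the 70B FP16 model and the H100 SXM figures stated on this page.

```python
from dataclasses import dataclass


@dataclass
class RooflineResult:
    intensity: float      # FLOP per byte of weight traffic
    ridge: float          # machine balance, FLOP per byte
    bound: str            # which ceiling binds
    hbm_util: float       # fraction of step time HBM is busy (ideal)
    flop_util: float      # fraction of step time tensor cores are busy (ideal)
    tokens_per_s: float   # roofline ceiling across the whole batch
    weight_gb: float      # weight footprint in GB


def decode_roofline(params, bytes_per_weight, batch, peak_flops, hbm_bw):
    """Weight-matmul-only roofline for one decode step (1 token per sequence)."""
    intensity = 2.0 * batch / bytes_per_weight
    ridge = peak_flops / hbm_bw
    weight_bytes = params * bytes_per_weight
    t_mem = weight_bytes / hbm_bw                  # read every weight once
    t_compute = 2.0 * params * batch / peak_flops  # 2 FLOPs/param/token
    step = max(t_mem, t_compute)                   # the slower ceiling binds
    bound = "memory-bandwidth-bound" if t_mem >= t_compute else "compute-bound"
    return RooflineResult(intensity, ridge, bound, t_mem / step,
                          t_compute / step, batch / step, weight_bytes / 1e9)


# Default inputs: 70B params, FP16 (2 bytes), batch 1, H100 SXM.
r = decode_roofline(70e9, 2.0, 1, peak_flops=989e12, hbm_bw=3.35e12)
print(f"{r.intensity:.1f} FLOP/byte vs ridge {r.ridge:.0f} -> {r.bound}")
print(f"HBM {r.hbm_util:.0%}, FLOP {r.flop_util:.1%}, "
      f"~{r.tokens_per_s:.1f} tok/s, {r.weight_gb:.1f} GB")
```

The compute utilisation comes out at about 0.3%, which the result card rounds to 0%.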
Recommendation
Decode is bandwidth-bound: arithmetic intensity (1.0 FLOP/byte) sits far below the NVIDIA H100 SXM (80 GB) ridge point (295 FLOP/byte), so tensor cores idle ~100% of decode time while HBM is saturated. Fixes, in order of impact: (1) raise batch size; the crossover to compute-bound is ~batch 296 at this dtype; (2) quantize weights further (fewer bytes/param ⇒ less HBM traffic ⇒ higher tokens/s and higher intensity); (3) move to a higher-HBM-bandwidth part (H200/B200/MI300X); for memory-bound decode, throughput scales ~linearly with HBM BW, not FLOPS; (4) use speculative decoding / Medusa to verify multiple tokens per weight read. Note: the 140 GB of weights do not fit on one 80 GB device, so serving requires tensor or pipeline parallelism, which adds interconnect traffic this model ignores.
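The first two fixes can be checked numerically. The sketch below is illustrative only: `crossover_batch` and `batch1_ceiling` are hypothetical helper names, the model is the same weight-matmul-only roofline, and the FP16 and FP8 peak figures are the H100 SXM values quoted in the assumptions.

```python
import math

HBM_BW = 3.35e12   # bytes/s, H100 SXM figure from this page
PARAMS = 70e9      # 70B-parameter model

# dtype -> (bytes per weight, dense peak FLOP/s on the matching path)
DTYPES = {"FP16": (2.0, 989e12), "FP8": (1.0, 1979e12)}


def crossover_batch(bytes_per_weight, peak_flops, hbm_bw=HBM_BW):
    """Smallest batch whose intensity (2*batch/bytes) reaches the ridge point."""
    ridge = peak_flops / hbm_bw
    return math.ceil(ridge * bytes_per_weight / 2.0)


def batch1_ceiling(bytes_per_weight, params=PARAMS, hbm_bw=HBM_BW):
    """Bandwidth-bound tokens/s at batch 1: one full weight read per token."""
    return hbm_bw / (params * bytes_per_weight)


for name, (bpw, flops) in DTYPES.items():
    print(f"{name}: crossover at batch {crossover_batch(bpw, flops)}, "
          f"batch-1 ceiling ~{batch1_ceiling(bpw):.1f} tok/s")
```

Under these assumptions, halving bytes per weight doubles the batch-1 ceiling, while the crossover batch stays near 296 because the FP8 peak also doubles. The ceiling is linear in `hbm_bw`, which is the basis for fix (3).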
Assumptions
- Model: standard roofline for autoregressive DECODE only (1 token/step). Weight-matmul work ≈ 2 FLOPs/param/token; per step every weight is read from HBM exactly once and reused across the batch ⇒ arithmetic intensity ≈ 2 × batch ÷ bytes-per-weight. Source: classic Williams et al. roofline + standard LLM-inference analyses.
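- NVIDIA H100 SXM (80 GB): HBM 80 GB @ 3.35 TB/s; dense peak 989 TFLOP/s FP16, 1979 TFLOP/s FP8, 3958 TOP/s INT4 as assumed by this tool (NVIDIA's H100 datasheet lists no INT4 tensor-core rate; 3958 matches its sparse FP8/INT8 figure, so treat the INT4 path with caution). Figures are public vendor spec sheets (NVIDIA/AMD datasheets, mid-2026); approximate, ±5–10% across SKU/clock bins and cloud instances.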
- Quantization: FP16 / BF16 (2 bytes). bytes/param drives HBM traffic. FLOPS column = matching dense tensor-core path.
- Utilisation is a roofline UPPER BOUND: the binding ceiling is modelled at 100% and the other scales by the time ratio. Real kernels realise ~50–75% of peak FLOPS and ~70–90% of peak HBM BW; treat the reported % as best-case, not measured.
- Token throughput is a ceiling: it counts weight matmuls only.
- EXCLUDED: attention/self-attention FLOPs, KV-cache reads+writes (which dominate at long context and large batch and can flip the verdict), prefill, MoE routing/sparsity, activation memory, kernel-launch + Python overhead, tensor/pipeline-parallel interconnect (NVLink/IB) traffic when weights span multiple devices, and sustained-vs-peak derating.
- Not authoritative or audited: an order-of-magnitude planning aid. Benchmark your actual serving stack (vLLM/TRT-LLM/SGLang) for committed numbers.
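To see how far the ideal ceiling may sit from a realistic figure, the efficiency ranges quoted above can be applied to both ceilings. This is a rough sketch under stated assumptions: `derated_tokens_per_s` is a hypothetical helper, the efficiency factors are the ~70–90% HBM and ~50–75% FLOPS ranges from the list, and everything in the EXCLUDED item is still ignored.

```python
def derated_tokens_per_s(params, bytes_per_weight, batch,
                         peak_flops, hbm_bw, bw_eff, flop_eff):
    """Roofline tokens/s with each peak scaled by an achieved-efficiency factor."""
    t_mem = params * bytes_per_weight / (hbm_bw * bw_eff)
    t_compute = 2.0 * params * batch / (peak_flops * flop_eff)
    return batch / max(t_mem, t_compute)


# H100 SXM figures from this page; 70B params, FP16, batch 1.
H100 = dict(peak_flops=989e12, hbm_bw=3.35e12)

ideal = derated_tokens_per_s(70e9, 2.0, 1, bw_eff=1.0, flop_eff=1.0, **H100)
low = derated_tokens_per_s(70e9, 2.0, 1, bw_eff=0.70, flop_eff=0.50, **H100)
high = derated_tokens_per_s(70e9, 2.0, 1, bw_eff=0.90, flop_eff=0.75, **H100)

print(f"ideal ceiling ~{ideal:.2f} tok/s")
print(f"derated range ~{low:.2f} to ~{high:.2f} tok/s")
```

For the default inputs this puts the derated estimate at roughly 17 to 22 tok/s against the ~23.9 tok/s ceiling, before any of the excluded costs are counted.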