What does the Memory Bandwidth Bottleneck Detector tool output?

Bandwidth utilization; Compute utilization; Bottleneck verdict + fix list

What inputs does the Memory Bandwidth Bottleneck Detector need?

Model size + batch size; Accelerator; Quantization scheme

Memory Bandwidth Bottleneck Detector

Given a model + accelerator, decide whether you are bandwidth-bound or compute-bound.

The engineer question
Is my 70B inference bandwidth-bound on H100?

Result

Arithmetic intensity (decode), ≈ 2 × batch ÷ bytes-per-weight: 1.0 FLOP/byte
Accelerator machine balance (ridge point), peak FLOPS ÷ HBM bandwidth: 295 FLOP/byte
Bottleneck (intensity is 0% of ridge, rounded; 1.0 ÷ 295 ≈ 0.3%): Memory-bandwidth-bound
Est. HBM-bandwidth utilisation: 100%
Est. compute (FLOP) utilisation (tensor cores mostly idle): 0% (rounded; ≈ 0.3%)
Roofline token throughput (upper bound; excludes attention/KV/overhead): ~23.9 tok/s
Weight footprint (vs 80 GB HBM on one device): 140.0 GB
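
The figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the tool's actual implementation: the names `Accelerator` and `decode_roofline` are hypothetical, batch size 1 is inferred from the reported intensity of 1.0 FLOP/byte at 2 bytes per weight, and token throughput is assumed to be aggregate across the batch.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    hbm_gb: float       # HBM capacity, GB
    hbm_tbps: float     # HBM bandwidth, TB/s
    peak_tflops: float  # dense peak for the chosen dtype, TFLOP/s


def decode_roofline(params_b, batch, bytes_per_weight, acc):
    """Weights-only decode roofline (assumption: no attention, KV cache, or overhead)."""
    intensity = 2.0 * batch / bytes_per_weight   # FLOP per byte of weight traffic
    ridge = acc.peak_tflops / acc.hbm_tbps       # TFLOP/s / TB/s = FLOP/byte
    weight_bytes = params_b * 1e9 * bytes_per_weight
    t_mem = weight_bytes / (acc.hbm_tbps * 1e12)                       # s per step
    t_compute = 2.0 * params_b * 1e9 * batch / (acc.peak_tflops * 1e12)
    step = max(t_mem, t_compute)                 # the binding ceiling sets step time
    return {
        "intensity": intensity,
        "ridge": ridge,
        "verdict": "memory-bandwidth-bound" if intensity < ridge else "compute-bound",
        "hbm_util": t_mem / step,
        "flop_util": t_compute / step,
        "tokens_per_s": batch / step,            # aggregate over the batch
        "weight_gb": weight_bytes / 1e9,
        "fits_one_device": weight_bytes / 1e9 <= acc.hbm_gb,
    }


# Spec values taken from the Assumptions section of this page.
H100 = Accelerator("NVIDIA H100 SXM (80 GB)", hbm_gb=80, hbm_tbps=3.35, peak_tflops=989)
result = decode_roofline(params_b=70, batch=1, bytes_per_weight=2, acc=H100)
for key, value in result.items():
    print(f"{key}: {value}")
```

With these inputs the sketch gives an intensity of 1.0 FLOP/byte, a ridge point of about 295 FLOP/byte, roughly 23.9 tok/s, and a 140 GB weight footprint that does not fit on one device.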

Recommendation

Decode is bandwidth-bound: arithmetic intensity (1.0 FLOP/byte) sits far below the NVIDIA H100 SXM (80 GB) ridge point (295 FLOP/byte), so tensor cores idle ~100% of decode time while HBM is saturated. Fixes, in order of impact:

(1) Raise batch size. The crossover to compute-bound is ~batch 296 at this dtype.
(2) Quantize weights further. Fewer bytes/param ⇒ less HBM traffic ⇒ higher tokens/s and higher intensity.
(3) Move to a higher-HBM-bandwidth part (H200/B200/MI300X). For memory-bound decode, throughput scales ~linearly with HBM BW, not FLOPS.
(4) Use speculative decoding or Medusa to verify multiple tokens per weight read.

Note: the 140 GB of weights do not fit on one 80 GB device, and tensor/pipeline parallelism adds interconnect traffic that this model ignores.
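
The crossover batch and the effect of the first three fixes follow directly from the intensity formula. Below is a rough sketch under the same weights-only assumptions; `crossover_batch` and `memory_bound_tokens_per_s` are hypothetical helper names, and the doubled-bandwidth case is an illustrative number, not the spec of any particular part.

```python
import math


def crossover_batch(peak_tflops, hbm_tbps, bytes_per_weight):
    """Smallest integer batch whose intensity (2 * batch / bytes) exceeds the ridge."""
    ridge = peak_tflops / hbm_tbps  # FLOP/byte
    return math.floor(ridge * bytes_per_weight / 2.0) + 1


def memory_bound_tokens_per_s(params_b, batch, bytes_per_weight, hbm_tbps):
    """Aggregate tokens/s while memory-bound: one full weight read per decode step."""
    weight_bytes = params_b * 1e9 * bytes_per_weight
    return batch * hbm_tbps * 1e12 / weight_bytes


# H100 SXM figures from the Assumptions section: 989 TFLOP/s FP16, 3.35 TB/s.
fp16_crossover = crossover_batch(989, 3.35, bytes_per_weight=2)

base = memory_bound_tokens_per_s(70, batch=1, bytes_per_weight=2, hbm_tbps=3.35)
bigger_batch = memory_bound_tokens_per_s(70, batch=8, bytes_per_weight=2, hbm_tbps=3.35)
fp8_weights = memory_bound_tokens_per_s(70, batch=1, bytes_per_weight=1, hbm_tbps=3.35)
# Hypothetical part with twice the HBM bandwidth (illustrative only).
double_bw = memory_bound_tokens_per_s(70, batch=1, bytes_per_weight=2, hbm_tbps=6.70)

print(f"crossover batch (FP16): {fp16_crossover}")
print(f"baseline: {base:.1f} tok/s, batch 8: {bigger_batch:.1f} tok/s")
print(f"1-byte weights: {fp8_weights:.1f} tok/s, 2x HBM BW: {double_bw:.1f} tok/s")
```

Below the crossover, each of these levers scales the weights-only ceiling linearly; the linear scaling with batch stops once the batch reaches the crossover and the compute ceiling becomes binding.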

Assumptions

· Model: standard roofline for autoregressive DECODE only (1 token/step). Weight-matmul work ≈ 2 FLOPs/param/token; per step every weight is read from HBM exactly once and reused across the batch ⇒ arithmetic intensity ≈ 2 × batch ÷ bytes-per-weight. Source: classic Williams et al. roofline + standard LLM-inference analyses.
· NVIDIA H100 SXM (80 GB): HBM 80 GB @ 3.35 TB/s; dense peak 989 TFLOP/s (FP16), 1979 (FP8), 3958 (INT4). Figures are taken from public vendor spec sheets (NVIDIA/AMD datasheets, mid-2026); approximate, ±5–10% across SKU/clock bins and cloud instances.
· Quantization: FP16 / BF16 (2 bytes). bytes/param drives HBM traffic. FLOPS column = matching dense tensor-core path.
· Utilisation is a roofline UPPER BOUND: the binding ceiling is modelled at 100% and the other scales by the time ratio. Real kernels realise ~50–75% of peak FLOPS and ~70–90% of peak HBM BW — treat the reported % as best-case, not measured.
· Token throughput is a ceiling: it counts weight matmuls only.
· EXCLUDED: attention/self-attention FLOPs, KV-cache reads+writes (which dominate at long context and large batch and can flip the verdict), prefill, MoE routing/sparsity, activation memory, kernel-launch + Python overhead, tensor/pipeline-parallel interconnect (NVLink/IB) traffic when weights span multiple devices, and sustained-vs-peak derating.
· Not authoritative/audited — an order-of-magnitude planning aid. Benchmark your actual serving stack (vLLM/TRT-LLM/SGLang) for committed numbers.
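
To see how far the best-case ceiling may sit above a realistic figure, the efficiency ranges quoted above can be applied as simple multipliers on peak bandwidth and peak FLOPS. This is a hedged sketch only: `derated_tokens_per_s` is a hypothetical name, treating the efficiencies as flat multipliers is an assumption, and everything in the EXCLUDED list is still ignored.

```python
def derated_tokens_per_s(params_b, batch, bytes_per_weight,
                         hbm_tbps, peak_tflops, bw_eff, flop_eff):
    """Weights-only decode throughput with sustained-vs-peak efficiency factors."""
    weight_bytes = params_b * 1e9 * bytes_per_weight
    t_mem = weight_bytes / (hbm_tbps * 1e12 * bw_eff)                     # s per step
    t_compute = 2.0 * params_b * 1e9 * batch / (peak_tflops * 1e12 * flop_eff)
    return batch / max(t_mem, t_compute)


# Default inputs: 70B parameters, batch 1, FP16 weights, H100 SXM spec values.
ideal = derated_tokens_per_s(70, 1, 2, 3.35, 989, bw_eff=1.0, flop_eff=1.0)
low = derated_tokens_per_s(70, 1, 2, 3.35, 989, bw_eff=0.70, flop_eff=0.50)
high = derated_tokens_per_s(70, 1, 2, 3.35, 989, bw_eff=0.90, flop_eff=0.75)

print(f"roofline ceiling: {ideal:.1f} tok/s")
print(f"derated band:     {low:.1f} to {high:.1f} tok/s")
```

For the default inputs the band works out to roughly 16.8 to 21.5 tok/s against the 23.9 tok/s ceiling, which is still an optimistic estimate because attention, KV-cache traffic, and interconnect costs are not modelled.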
