Model configuration
Model preset
Layers
i
KV heads
i
Head dim
i
Params (B)
i
Weight precision
i
KV cache precision
i
Concurrent users
i
5
Context window (tokens)
i
128K
Avg context utilization
i
60%
Framework overhead (GB)
i
Memory breakdown
Avg VRAM required
—
at avg utilization
Worst-case VRAM
—
all users at full context
Model weights—
KV cache (avg utilization)—
KV cache (worst case)—
Serving framework overhead—
Total (avg)—
KV cache math — step by step
Adjust inputs above to see the calculation.
GPU recommendation matrix
Click any GPU row to show headroom in the breakdown panel above. Fit rating is based on
average VRAM required.
| GPU | VRAM | Mem BW | MIG | NVLink | 1× | 2× | 4× | 6× | 8× | 16× |
|---|