Ollama Hardware Cooling Guide: Keep Your GPU Cool for Faster Inference
Ollama Does Not Care About Your GPU's Temperature — But Your GPU Does
Ollama makes running local LLMs simple. Install it, pull a model, and start chatting. It abstracts away the complexity of model loading, quantization, and GPU memory management so that anyone with an NVIDIA GPU can run local AI.
What Ollama does not manage is your hardware's thermal environment. And that thermal environment directly affects your inference speed.
This is the guide that connects the software side (what Ollama does to your GPU) to the hardware side (how to keep that GPU cool enough to perform at its best). Ollama's 52 million monthly downloads in early 2026 suggest there are a lot of people running local LLMs who have not thought about cooling. If your tokens per second are lower than expected, or if your GPU fans sound like they are trying to achieve liftoff, this article explains why and what to do about it.
How Ollama Uses Your GPU
Ollama loads model weights into VRAM and uses the GPU's tensor cores for inference. The workload profile is different from gaming in ways that matter for cooling.
| Characteristic | Gaming | Ollama Inference |
|---|---|---|
| GPU Utilization Pattern | Fluctuates (30-100% per frame) | Constant (80-100% during generation) |
| VRAM Usage | Partial (2-12GB typical) | Near-full (fills available VRAM with model weights) |
| Duration | 1-4 hour sessions | Hours to days (servers run 24/7) |
| Power Draw Pattern | Peaks and valleys | Sustained near-maximum |
| Thermal Profile | Core-intensive, VRAM moderate | Both core and VRAM heavily utilized |
The sustained nature of Ollama workloads means your GPU reaches thermal equilibrium and stays there. Gaming GPUs cool down between demanding scenes. Ollama GPUs do not get that break.
Why Thermals Affect Inference Speed
NVIDIA GPUs use dynamic frequency scaling (GPU Boost). The chip boosts to higher clock speeds when thermal and power conditions allow, and reduces clock speed when limits are reached. For Ollama users, this creates a direct relationship between temperature and tokens per second.
Clock Throttling
When the GPU core temperature hits 83C (the default thermal limit on most RTX cards), the GPU begins reducing clock speed to prevent overheating. Each step down in clock speed reduces computational throughput proportionally.
Example for an RTX 4090 running Llama 3 70B (4-bit quantized):
| GPU Core Temp | Sustained Boost Clock | Approximate Tokens/sec | Impact |
|---|---|---|---|
| 50-60C (water cooled) | 2580-2670 MHz | ~42 t/s | Maximum performance |
| 70-78C (good air cooling) | 2520-2580 MHz | ~40 t/s | Minor (-5%) |
| 83-87C (stock air, sustained) | 2400-2520 MHz | ~37-38 t/s | Moderate (-7 to -10%) |
| 90C+ (poor airflow) | 2200-2400 MHz | ~33-35 t/s | Significant (-15 to -20%) |
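To build the same kind of table for your own card, you can convert measured tokens per second into a percentage change against a cool baseline. The snippet below is a minimal sketch; `percent_change` is a hypothetical helper name, and the two figures are taken from the table rows above.

```python
def percent_change(measured_tps: float, baseline_tps: float) -> float:
    """Return the change of measured_tps relative to baseline_tps, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (measured_tps - baseline_tps) / baseline_tps * 100.0


# Figures from the table: ~42 t/s when cool, ~40 t/s on good air cooling.
change = percent_change(40.0, 42.0)
print(f"{change:.1f}%")  # prints -4.8%
```

Measure the baseline with the GPU cold (first minute of generation) and the comparison value after the card has reached thermal equilibrium.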
VRAM Throttling
This is the less visible but more impactful problem. GDDR6X memory on the RTX 3090 and 4090 has a separate thermal limit: 92-95C junction temperature. (The RTX 5090 uses GDDR7 rather than GDDR6X.) When VRAM overheats, the memory controller reduces bandwidth to prevent damage. Since Ollama loads entire model weights into VRAM and continuously reads them during inference, VRAM bandwidth is critical to performance.
VRAM throttling causes:
- Token generation speed drops by 10-30%
- Context window operations slow down (prompt processing becomes the bottleneck)
- In extreme cases, CUDA out-of-memory errors at temperatures where VRAM should have capacity — because the controller restricts access to overheating memory modules
Air coolers primarily cool the GPU die. They make indirect contact with VRAM modules through a shared heatsink or small thermal pads, but this is not their design priority. Under sustained Ollama workloads, air-cooled VRAM temperatures routinely hit 90-100C. For more on this specific issue, read our VRAM overheating guide.
Monitoring Your GPU During Ollama Workloads
nvidia-smi (Linux and Windows)
The most basic monitoring tool. Run in a terminal while Ollama is generating:
nvidia-smi --query-gpu=temperature.gpu,temperature.memory,clocks.current.graphics,power.draw,memory.used --format=csv -l 5
This outputs GPU core temperature, memory temperature (on supported GPUs), current clock speed, power draw, and VRAM usage every 5 seconds.
Limitation: On many GPU models, nvidia-smi does not report VRAM junction temperature — it only shows the GPU core sensor. You need HWiNFO64 (Windows) or nvtop (Linux) for the full picture.
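If you want to log these readings rather than watch them scroll by, the CSV output is easy to parse. The following is a minimal sketch that assumes the query above is run with `--format=csv,noheader,nounits` so each line contains bare values in the same field order. The names `FIELDS`, `parse_field`, and `parse_sample` are hypothetical, and the sample line is made up for illustration. Fields the driver cannot report (such as memory temperature on many cards) are mapped to `None`.

```python
from typing import Optional

# Same order as the --query-gpu list in the command above.
FIELDS = ("core_temp_c", "memory_temp_c", "clock_mhz", "power_w", "vram_used_mib")


def parse_field(raw: str) -> Optional[float]:
    """Convert one CSV field to a float, or None when no value is reported."""
    text = raw.strip().strip("[]")
    if text.upper() in ("N/A", "NOT SUPPORTED", ""):
        return None
    return float(text)


def parse_sample(line: str) -> dict:
    """Parse one CSV line into a dict keyed by FIELDS."""
    parts = line.split(",")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {name: parse_field(raw) for name, raw in zip(FIELDS, parts)}


# Hypothetical sample line; real values depend on your card and driver.
sample = parse_sample("72, N/A, 2520, 410.50, 22140")
print(sample)
```

A `None` in `memory_temp_c` is the signal that you need one of the tools below to see VRAM junction temperature.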
nvtop (Linux)
Install via your package manager (apt install nvtop or pacman -S nvtop). Provides a real-time dashboard showing GPU utilization, temperatures, fan speed, and per-process VRAM usage. Better than nvidia-smi for continuous monitoring.
HWiNFO64 (Windows)
The most comprehensive hardware monitor on Windows. Shows GPU core temperature, VRAM junction temperature (labeled "GPU Memory Junction Temperature"), VRM temperature, fan speed, clock speed, and power draw. Run it alongside Ollama and check the "Maximum" column after a sustained generation to see your peak temperatures.
What Numbers to Worry About
| Metric | Safe | Concerning | Take Action |
|---|---|---|---|
| GPU Core Temperature | Under 75C | 75-85C | Above 85C sustained |
| VRAM Junction Temperature | Under 80C | 80-92C | Above 92C (active throttling) |
| Fan Speed (RPM) | Under 1500 RPM | 1500-2500 RPM (loud) | Above 2500 RPM (very loud, near max) |
| GPU Clock (sustained) | Within 100 MHz of rated boost | 100-200 MHz below rated | 200+ MHz below rated (thermal throttling) |
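The thresholds in this table can be turned into a small lookup so a monitoring script can label each reading automatically. This is a sketch only: the metric keys and the `classify` function are hypothetical names, and values that land exactly on a boundary between "Concerning" and "Take Action" are treated as "Concerning" here, which is an assumption the table itself does not settle.

```python
# metric name -> (upper bound of "Safe", upper bound of "Concerning"),
# copied from the table above.
THRESHOLDS = {
    "core_temp_c": (75.0, 85.0),
    "vram_junction_temp_c": (80.0, 92.0),
    "fan_rpm": (1500.0, 2500.0),
    "clock_deficit_mhz": (100.0, 200.0),  # MHz below the rated boost clock
}


def classify(metric: str, value: float) -> str:
    """Return 'Safe', 'Concerning', or 'Take Action' for one reading."""
    safe_max, concerning_max = THRESHOLDS[metric]
    if value < safe_max:
        return "Safe"
    if value <= concerning_max:
        return "Concerning"
    return "Take Action"


print(classify("core_temp_c", 70))           # Safe
print(classify("vram_junction_temp_c", 95))  # Take Action
```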
Air Cooling Limits for 24/7 Ollama
Air cooling works for intermittent Ollama use (a few conversations per day, batch jobs that finish in minutes). It hits its limits when Ollama runs continuously — serving as a local API endpoint, processing long documents, or running as a persistent chat server.
Why Air Cooling Struggles with Sustained Load
- Case temperature rise: After 2-3 hours of sustained GPU load, the air inside your case warms by 5-10C above ambient. The air cooler is now trying to cool the GPU using pre-warmed air. Efficiency drops.
- VRAM is secondary: Stock air coolers are designed to cool the GPU die. VRAM modules get indirect cooling through the heatsink baseplate and small thermal pads. Under sustained load, these are not sufficient for GDDR6X memory.
- Noise feedback loop: As temperatures rise, fan curves increase RPM. Higher RPM creates turbulence that reduces cooling efficiency per RPM. You hit a ceiling where fans are at 100% but temperatures are still at the limit.
What You Can Do with Air Cooling
Before upgrading to water cooling, try these free or low-cost improvements:
- Improve case airflow: Add or reposition case fans. Ensure front intake and rear/top exhaust are working. Remove any panels blocking airflow.
- Undervolt the GPU: Reduce power consumption by 5-15% with zero performance loss. See our undervolting guide (the air cooling section applies even without water cooling).
- Set a power limit: nvidia-smi -pl 350 on a 4090 reduces heat output by 20% with only a 5-8% inference speed loss. For many Ollama use cases, this is an acceptable trade.
- Keep the side panel off: Not elegant, but immediately drops temperatures by 5-8C by eliminating the case as a heat trap. For a workstation under a desk, nobody sees it anyway.
If these do not get you below 85C core and 92C VRAM during sustained generation, water cooling is the next step.
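The power-limit step is easy to script. The sketch below builds the `nvidia-smi` command as an argument list and estimates the reduction in power budget. The helper names are hypothetical, and the 450W default for the RTX 4090 is an assumption you should replace with your own card's default limit. Nothing is executed on import; `apply_power_limit` runs the command only when you call it, and changing the limit typically requires administrator or root privileges.

```python
import subprocess
from typing import List


def power_limit_command(watts: int, gpu_index: int = 0) -> List[str]:
    """Build the nvidia-smi argument list that sets a power limit on one GPU."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def power_reduction_percent(limit_w: float, default_w: float) -> float:
    """Percentage by which limit_w lowers the power budget relative to default_w."""
    return (default_w - limit_w) / default_w * 100.0


def apply_power_limit(watts: int, gpu_index: int = 0) -> None:
    """Run the command; raises CalledProcessError if nvidia-smi reports failure."""
    subprocess.run(power_limit_command(watts, gpu_index), check=True)


# Assumed default limit of 450W for an RTX 4090.
print(" ".join(power_limit_command(350)))
print(f"{power_reduction_percent(350, 450):.0f}% lower power budget")
```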
The Water Cooling Solution
A full-cover GPU waterblock makes direct contact with both the GPU die and VRAM modules, using thermal paste and thermal pads respectively. This provides simultaneous cooling to both heat sources — something no air cooler can match.
Typical temperature improvement for a GPU running sustained Ollama inference:
| Component | Air Cooled (Sustained) | Water Cooled (Sustained) | Improvement |
|---|---|---|---|
| GPU Core | 82-88C | 45-55C | -30 to -38C |
| VRAM Junction | 90-100C | 58-72C | -25 to -35C |
| Fan/Pump Noise | 50-65 dBA | 25-32 dBA | Dramatically quieter |
| Sustained Boost Clock | 2400-2520 MHz | 2580-2670 MHz | +60-170 MHz sustained |
The performance gain from water cooling in Ollama is not a single dramatic number. It is the elimination of thermal throttling — your GPU maintains maximum clock speed indefinitely instead of degrading over the first 15-20 minutes of sustained use. For a tokens-per-second comparison, that is typically 5-12% faster sustained inference on a water-cooled 4090 vs. the same card on air.
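To make that tokens-per-second comparison on your own hardware, you can read the timing fields Ollama returns with each non-streaming generation. The sketch below assumes Ollama is listening on its default local port and relies on the `eval_count` and `eval_duration` (nanoseconds) fields of the `/api/generate` response. The helper names and the example dictionary are hypothetical, and the example numbers are not a real measurement.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / eval_duration_ns * 1e9


def generate_once(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the parsed reply."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Hypothetical response fragment: 420 tokens generated in 10 seconds.
example = {"eval_count": 420, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(example):.1f} t/s")  # prints 42.0 t/s
```

Calling `tokens_per_second(generate_once(model, prompt))` once when the card is cold and again after 20 minutes of sustained generation gives you the two numbers to compare, using whichever model name you have already pulled.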
Recommended Hardware by Ollama Model Size
Different model sizes have different GPU requirements and different thermal profiles. Here is what to pair with what.
Small Models (7B-13B Parameters)
Examples: Llama 3.1 8B, Mistral 7B, Gemma 2 9B, Phi-3 mini
- GPU: RTX 4060 Ti (16GB), RTX 3060 (12GB), or RTX 4070
- VRAM Usage: 4-8GB (quantized)
- Thermal Load: Moderate (150-200W)
- Cooling: Air cooling is sufficient for most use. Water cooling only for noise-sensitive environments.
Medium Models (32B-40B Parameters)
Examples: Llama 3.3 70B (4-bit quant fits 24GB), Mixtral 8x7B, Qwen 2.5 72B (4-bit)
- GPU: RTX 4090 (24GB) or RTX 3090 (24GB)
- VRAM Usage: 16-24GB (fills the card)
- Thermal Load: High (350-450W sustained)
- Cooling: Water cooling strongly recommended for 24/7 use
- Waterblock options: See our RTX 4090 guide or RTX 3090 revival guide
Large Models (65B-70B+ Parameters at Higher Precision)
Examples: Llama 3.1 70B (8-bit needs 48GB), DeepSeek V3 (multi-GPU)
- GPU: Dual RTX 3090 NVLink (48GB combined), RTX 5090 (32GB), or dual RTX 4090
- VRAM Usage: 32-48GB+
- Thermal Load: Very high (700-1000W sustained for dual GPU)
- Cooling: Water cooling is required, not optional. Air cooling cannot handle 700W+ in a standard case.
- Build guide: See our dual 3090 NVLink cooling guide
The GPU Decision for Ollama Users
If you are choosing a GPU specifically for Ollama, the decision matrix looks different from gaming:
| GPU | VRAM | Max Model Size (4-bit) | Price (2026) | Value for Ollama |
|---|---|---|---|---|
| RTX 3090 (used) | 24GB | ~70B (tight) | $500-700 | Best value per VRAM dollar |
| RTX 4090 | 24GB | ~70B (comfortable) | $1,400-1,800 | Best single-GPU performance |
| RTX 5090 | 32GB | ~100B | $2,000-2,500 | Maximum single-GPU capability |
| 2x RTX 3090 NVLink | 48GB | ~120B | $1,000-1,400 | Best value for 48GB VRAM pool |
For a deeper comparison, read our Which GPU for Local AI guide.
Putting It Together: The Quiet Ollama Workstation
The ideal Ollama workstation is one you forget is running. It sits on or under your desk, serves local AI models around the clock, and produces no more noise than a refrigerator in the next room.
For an RTX 4090 Ollama build:
- GPU waterblock: Bykski block for your specific 4090 model
- Radiator: 480mm + 240mm minimum (Bykski 480mm + Barrow 240mm)
- Pump: Barrow D5 pump-reservoir combo at speed 2-3
- Fans at 600-800 RPM: effectively inaudible
- Undervolt -100mV: saves 50-80W of heat output with no performance loss
Result: GPU core at 45-50C, VRAM at 62-68C, noise at 25-28 dBA, sustained boost clock at maximum, tokens per second at the card's full potential, 24/7.
Common Ollama Thermal Problems and Solutions
"My tokens/sec drops after 10 minutes"
This is thermal throttling. The GPU starts cool, boosts to maximum clock, and delivers peak performance. As temperature rises over 10-15 minutes of sustained generation, the GPU reduces clock speed to stay within thermal limits. Monitor GPU temperature during generation — if it crosses 83C, you are throttling.
Quick fix: Set a power limit with nvidia-smi -pl 350 (for RTX 4090). You lose 5-8% peak performance but gain consistent sustained performance because the card runs cooler and does not throttle.
Permanent fix: Water cooling eliminates the temperature rise entirely. The GPU stays at 45-55C indefinitely — no throttling, no performance degradation over time.
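If you have logged temperature and clock readings during a long generation, two quick checks confirm this diagnosis: when the core first reached the limit, and how far the clock fell between the start and the end of the run. The sketch below assumes samples of (seconds since start, core temperature, clock speed). The function names and the log values are hypothetical, and the 83C default is the limit quoted above.

```python
from typing import List, Optional, Tuple

# (seconds since start, core temperature in C, graphics clock in MHz)
Sample = Tuple[float, float, float]


def first_over_limit(samples: List[Sample], limit_c: float = 83.0) -> Optional[float]:
    """Return the time of the first sample at or above limit_c, or None."""
    for seconds, temp_c, _clock_mhz in samples:
        if temp_c >= limit_c:
            return seconds
    return None


def clock_drop_mhz(samples: List[Sample], window: int = 3) -> float:
    """Mean clock of the first `window` samples minus mean of the last `window`."""
    if len(samples) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    early = sum(s[2] for s in samples[:window]) / window
    late = sum(s[2] for s in samples[-window:]) / window
    return early - late


# Hypothetical log for illustration only.
log = [
    (0, 55, 2670), (60, 65, 2670), (120, 72, 2640),
    (600, 80, 2580), (660, 83, 2520), (720, 85, 2460),
]
print(first_over_limit(log))  # prints 660
print(clock_drop_mhz(log))    # prints 140.0
```

A drop that appears only after the temperature crosses the limit points to thermal throttling rather than a software or model issue.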
"Ollama gives OOM errors after running for hours but works fine at startup"
This is likely VRAM thermal throttling. As GDDR6X memory heats up past 92C, the memory controller restricts bandwidth and effective capacity. The model that fit in VRAM at boot (when VRAM was cool) becomes too large when VRAM is thermally restricted. Read our VRAM overheating guide for the full diagnosis and fix.
"My GPU fans are so loud I cannot work in the same room"
An RTX 4090 at full load with stock cooling produces 50-65 dBA — louder than a normal conversation. This is the single most common complaint from Ollama users in home offices. Water cooling drops noise to 25-32 dBA (quiet library). If you are on air and cannot water-cool yet, undervolting reduces power draw and thus fan speed — a free partial fix. See our undervolting guide.
"I want to run Ollama overnight but the noise wakes me up"
This is the use case where water cooling matters most. A water-cooled AI rig at 25-28 dBA is quieter than most room air conditioning units. You can run sustained inference in the same room you sleep in. On air cooling, that is not realistic with any high-end GPU under sustained load.
That is what a well-cooled Ollama rig looks like. Browse the complete AI Workstation Cooling collection to get started, or read the cost breakdown if you need to justify the investment.