
Ollama Hardware Cooling Guide: Keep Your GPU Cool for Faster Inference

Ollama Does Not Care About Your GPU's Temperature — But Your GPU Does

Ollama makes running local LLMs simple. Install it, pull a model, and start chatting. It abstracts away the complexity of model loading, quantization, and GPU memory management so that anyone with an NVIDIA GPU can run local AI.

What Ollama does not manage is your hardware's thermal environment. And that thermal environment directly affects your inference speed.

This is the guide that connects the software side (what Ollama does to your GPU) to the hardware side (how to keep that GPU cool enough to perform at its best). Ollama's 52 million monthly downloads in early 2026 suggest there are a lot of people running local LLMs who have not thought about cooling. If your tokens per second are lower than expected, or if your GPU fans sound like they are trying to achieve liftoff, this article explains why and what to do about it.

How Ollama Uses Your GPU

Ollama loads model weights into VRAM and uses the GPU's tensor cores for inference. The workload profile is different from gaming in ways that matter for cooling.

| Characteristic | Gaming | Ollama Inference |
| --- | --- | --- |
| GPU Utilization Pattern | Fluctuates (30-100% per frame) | Constant (80-100% during generation) |
| VRAM Usage | Partial (2-12GB typical) | Near-full (fills available VRAM with model weights) |
| Duration | 1-4 hour sessions | Hours to days (servers run 24/7) |
| Power Draw Pattern | Peaks and valleys | Sustained near-maximum |
| Thermal Profile | Core-intensive, VRAM moderate | Both core and VRAM heavily utilized |

The sustained nature of Ollama workloads means your GPU reaches thermal equilibrium and stays there. Gaming GPUs cool down between demanding scenes. Ollama GPUs do not get that break.

Why Thermals Affect Inference Speed

NVIDIA GPUs use dynamic frequency scaling (GPU Boost). The chip boosts to higher clock speeds when thermal and power conditions allow, and reduces clock speed when limits are reached. For Ollama users, this creates a direct relationship between temperature and tokens per second.

Clock Throttling

When the GPU core temperature hits 83C (the default thermal limit on most RTX cards), the GPU begins reducing clock speed to prevent overheating. Each step down in clock speed lowers compute throughput, and tokens per second falls with it.

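Illustrative example for an RTX 4090. Read the tokens-per-second column as relative: absolute speed depends on the model and quantization, and a 70B model at 4-bit (about 35GB of weights) does not fit entirely in the card's 24GB of VRAM.

| GPU Core Temp | Sustained Boost Clock | Approximate Tokens/sec | Impact |
| --- | --- | --- | --- |
| 50-60C (water cooled) | 2580-2670 MHz | ~42 t/s | Maximum performance |
| 70-78C (good air cooling) | 2520-2580 MHz | ~40 t/s | Minor (-5%) |
| 83-87C (stock air, sustained) | 2400-2520 MHz | ~37-38 t/s | Moderate (-10 to -12%) |
| 90C+ (poor airflow) | 2200-2400 MHz | ~33-35 t/s | Significant (-15 to -20%) |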
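The impact column is simply the relative change against the coolest row. A minimal sketch of that arithmetic, using the approximate figures from the table (the function name is illustrative, not part of any tool):

```python
def percent_change(baseline_tps: float, observed_tps: float) -> float:
    """Relative change versus the baseline, in percent (negative means slower)."""
    return (observed_tps - baseline_tps) / baseline_tps * 100.0


# Approximate tokens/sec figures taken from the table above.
BASELINE_TPS = 42.0
observed = {
    "good air cooling": 40.0,
    "stock air, sustained (low end)": 37.0,
    "poor airflow (low end)": 33.0,
}

if __name__ == "__main__":
    for label, tps in observed.items():
        print(f"{label}: {percent_change(BASELINE_TPS, tps):+.1f}%")
```

Running the same calculation on your own before-and-after measurements tells you whether a cooling change actually moved the needle.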

VRAM Throttling

This is the less visible but more impactful problem. GDDR6X memory on the RTX 3090 and 4090 has a separate thermal limit: 92-95C junction temperature. (The RTX 5090 uses GDDR7, but its memory needs cooling in the same way.) When VRAM overheats, the memory controller reduces bandwidth to prevent damage. Since Ollama loads entire model weights into VRAM and continuously reads them during inference, VRAM bandwidth is critical to performance.

VRAM throttling causes:

  • Token generation speed drops by 10-30%
  • Context window operations slow down (prompt processing becomes the bottleneck)
  • In extreme cases, instability or errors during long runs. Throttling lowers bandwidth rather than capacity, so if you see CUDA out-of-memory errors, also check whether memory use has grown over the session

Air coolers primarily cool the GPU die. They make indirect contact with VRAM modules through a shared heatsink or small thermal pads, but this is not their design priority. Under sustained Ollama workloads, air-cooled VRAM temperatures routinely hit 90-100C. For more on this specific issue, read our VRAM overheating guide.

Monitoring Your GPU During Ollama Workloads

nvidia-smi (Linux and Windows)

The most basic monitoring tool. Run in a terminal while Ollama is generating:

nvidia-smi --query-gpu=temperature.gpu,temperature.memory,clocks.current.graphics,power.draw,memory.used --format=csv -l 5

This outputs GPU core temperature, memory temperature (on supported GPUs), current clock speed, power draw, and VRAM usage every 5 seconds.

Limitation: On many GPU models, nvidia-smi does not report VRAM junction temperature — it only shows the GPU core sensor. You need HWiNFO64 (Windows) or nvtop (Linux) for the full picture.
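If you would rather log these readings than watch them scroll by, the same query can be parsed in a few lines. The sketch below assumes nvidia-smi is on your PATH and uses its `csv,noheader,nounits` output format; the function names are our own, and unsupported fields (such as memory temperature on many cards) are stored as None:

```python
import csv
import io
import subprocess

# Same fields as the command above.
FIELDS = [
    "temperature.gpu",
    "temperature.memory",
    "clocks.current.graphics",
    "power.draw",
    "memory.used",
]


def parse_sample(csv_line: str) -> dict:
    """Parse one line of csv,noheader,nounits output into a dict of floats."""
    values = next(csv.reader(io.StringIO(csv_line), skipinitialspace=True))
    sample = {}
    for name, raw in zip(FIELDS, values):
        try:
            sample[name] = float(raw.strip())
        except ValueError:
            sample[name] = None  # e.g. "N/A" when the GPU lacks that sensor
    return sample


def read_sample() -> dict:
    """Query the first GPU once via nvidia-smi (requires the NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


if __name__ == "__main__":
    print(read_sample())
```

Call read_sample() in a loop with a sleep between calls to build a temperature log you can compare against your tokens per second.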

nvtop (Linux)

Install via your package manager (apt install nvtop or pacman -S nvtop). Provides a real-time dashboard showing GPU utilization, temperatures, fan speed, and per-process VRAM usage. Better than nvidia-smi for continuous monitoring.

HWiNFO64 (Windows)

The most comprehensive hardware monitor on Windows. Shows GPU core temperature, VRAM junction temperature (labeled "GPU Memory Junction Temperature"), VRM temperature, fan speed, clock speed, and power draw. Run it alongside Ollama and check the "Maximum" column after a sustained generation to see your peak temperatures.

What Numbers to Worry About

| Metric | Safe | Concerning | Take Action |
| --- | --- | --- | --- |
| GPU Core Temperature | Under 75C | 75-85C | Above 85C sustained |
| VRAM Junction Temperature | Under 80C | 80-92C | Above 92C (active throttling) |
| Fan Speed (RPM) | Under 1500 RPM | 1500-2500 RPM (loud) | Above 2500 RPM (very loud, near max) |
| GPU Clock (sustained) | Within 100 MHz of rated boost | 100-200 MHz below rated | 200+ MHz below rated (thermal throttling) |
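These thresholds are easy to turn into an automatic check. The sketch below encodes the temperature and clock rows of the table; the function names are illustrative, and how the exact boundary values (75C, 85C and so on) are bucketed is our assumption, since the table ranges overlap at the edges:

```python
def classify(value: float, safe_below: float, action_above: float) -> str:
    """Below safe_below is safe, above action_above needs action, else concerning."""
    if value < safe_below:
        return "safe"
    if value > action_above:
        return "take action"
    return "concerning"


def classify_core_temp(temp_c: float) -> str:
    # Table row: under 75C safe, 75-85C concerning, above 85C take action.
    return classify(temp_c, 75, 85)


def classify_vram_temp(temp_c: float) -> str:
    # Table row: under 80C safe, 80-92C concerning, above 92C take action.
    return classify(temp_c, 80, 92)


def classify_clock_deficit(rated_mhz: float, sustained_mhz: float) -> str:
    # Table row: within 100 MHz safe, 100-200 MHz below concerning, 200+ take action.
    deficit = rated_mhz - sustained_mhz
    if deficit < 100:
        return "safe"
    if deficit >= 200:
        return "take action"
    return "concerning"


if __name__ == "__main__":
    print(classify_core_temp(84), classify_vram_temp(95))
```

Feed it the peak values from a sustained generation run, not idle readings.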

Air Cooling Limits for 24/7 Ollama

Air cooling works for intermittent Ollama use (a few conversations per day, batch jobs that finish in minutes). It hits its limits when Ollama runs continuously — serving as a local API endpoint, processing long documents, or running as a persistent chat server.

Why Air Cooling Struggles with Sustained Load

  • Case temperature rise: After 2-3 hours of sustained GPU load, the air inside your case warms by 5-10C above ambient. The air cooler is now trying to cool the GPU using pre-warmed air. Efficiency drops.
  • VRAM is secondary: Stock air coolers are designed to cool the GPU die. VRAM modules get indirect cooling through the heatsink baseplate and small thermal pads. Under sustained load, these are not sufficient for GDDR6X memory.
  • Noise feedback loop: As temperatures rise, fan curves increase RPM. Higher RPM creates turbulence that reduces cooling efficiency per RPM. You hit a ceiling where fans are at 100% but temperatures are still at the limit.

What You Can Do with Air Cooling

Before upgrading to water cooling, try these free or low-cost improvements:

  1. Improve case airflow: Add or reposition case fans. Ensure front intake and rear/top exhaust are working. Remove any panels blocking airflow.
  2. Undervolt the GPU: Reduce power consumption by 5-15% with zero performance loss. See our undervolting guide (the air cooling section applies even without water cooling).
  3. Set a power limit: nvidia-smi -pl 350 on a 4090 (run it as administrator, or with sudo on Linux) reduces heat output by roughly 20% with only a 5-8% inference speed loss. For many Ollama use cases, this is an acceptable trade.
  4. Keep the side panel off: Not elegant, but immediately drops temperatures by 5-8C by eliminating the case as a heat trap. For a workstation under a desk, nobody sees it anyway.

If these do not get you below 85C core and 92C VRAM during sustained generation, water cooling is the next step.
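To see why the power limit is usually a good trade, it helps to run the numbers. This sketch assumes a 450W stock power limit for the RTX 4090 (the reference figure; partner cards vary) and that the card actually draws up to its limit under load. The speed-loss percentages are the ones quoted in the list above:

```python
def power_limit_tradeoff(stock_watts: float, limit_watts: float, speed_loss_pct: float):
    """Return (heat reduction %, tokens-per-watt gain %) for a given power limit."""
    heat_reduction_pct = (stock_watts - limit_watts) / stock_watts * 100.0
    relative_speed = 1.0 - speed_loss_pct / 100.0
    relative_power = limit_watts / stock_watts
    efficiency_gain_pct = (relative_speed / relative_power - 1.0) * 100.0
    return heat_reduction_pct, efficiency_gain_pct


if __name__ == "__main__":
    for loss in (5.0, 8.0):
        heat, gain = power_limit_tradeoff(450.0, 350.0, loss)
        print(f"speed loss {loss:.0f}%: heat -{heat:.1f}%, tokens per watt +{gain:.1f}%")
```

Under these assumptions, every watt you still spend produces noticeably more tokens, which is why a limited card often sustains better throughput than an unlimited one that throttles.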

The Water Cooling Solution

A full-cover GPU waterblock makes direct contact with both the GPU die and VRAM modules, using thermal paste and thermal pads respectively. This provides simultaneous cooling to both heat sources — something no air cooler can match.

Typical temperature improvement for a GPU running sustained Ollama inference:

| Component | Air Cooled (Sustained) | Water Cooled (Sustained) | Improvement |
| --- | --- | --- | --- |
| GPU Core | 82-88C | 45-55C | -30 to -38C |
| VRAM Junction | 90-100C | 58-72C | -25 to -35C |
| Fan/Pump Noise | 50-65 dBA | 25-32 dBA | Dramatically quieter |
| Sustained Boost Clock | 2400-2520 MHz | 2580-2670 MHz | +60-170 MHz sustained |

The performance gain from water cooling in Ollama is not a single dramatic number. It is the elimination of thermal throttling — your GPU maintains maximum clock speed indefinitely instead of degrading over the first 15-20 minutes of sustained use. For a tokens-per-second comparison, that is typically 5-12% faster sustained inference on a water-cooled 4090 vs. the same card on air.

Recommended Hardware by Ollama Model Size

Different model sizes have different GPU requirements and different thermal profiles. Here is what to pair with what.

Small Models (7B-13B Parameters)

Examples: Llama 3.1 8B, Mistral 7B, Gemma 2 9B, Phi-3 mini

  • GPU: RTX 4060 Ti (16GB), RTX 3060 (12GB), or RTX 4070
  • VRAM Usage: 4-8GB (quantized)
  • Thermal Load: Moderate (150-200W)
  • Cooling: Air cooling is sufficient for most use. Water cooling only for noise-sensitive environments.

Medium Models (32B-40B Parameters)

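Examples: Mixtral 8x7B, plus Llama 3.3 70B and Qwen 2.5 72B with caveats (at 4-bit their weights alone are about 35GB, so on a 24GB card they need lower-bit quantization or partial CPU offload)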

  • GPU: RTX 4090 (24GB) or RTX 3090 (24GB)
  • VRAM Usage: 16-24GB (fills the card)
  • Thermal Load: High (350-450W sustained)
  • Cooling: Water cooling strongly recommended for 24/7 use
  • Waterblock options: See our RTX 4090 guide or RTX 3090 revival guide

Large Models (65B-70B+ Parameters at Higher Precision)

Examples: Llama 3.1 70B (4-bit needs about 35GB for weights, 8-bit about 70GB), DeepSeek V3 (multi-GPU)

  • GPU: Dual RTX 3090 NVLink (48GB combined), RTX 5090 (32GB), or dual RTX 4090
  • VRAM Usage: 32-48GB+
  • Thermal Load: Very high (700-1000W sustained for dual GPU)
  • Cooling: Water cooling is required, not optional. Air cooling cannot handle 700W+ in a standard case.
  • Build guide: See our dual 3090 NVLink cooling guide

The GPU Decision for Ollama Users

If you are choosing a GPU specifically for Ollama, the decision matrix looks different from gaming:

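| GPU | VRAM | Max Model Size (4-bit) | Price (2026) | Value for Ollama |
| --- | --- | --- | --- | --- |
| RTX 3090 (used) | 24GB | ~70B (tight) | $500-700 | Best value per VRAM dollar |
| RTX 4090 | 24GB | ~70B (comfortable) | $1,400-1,800 | Best single-GPU performance |
| RTX 5090 | 32GB | ~100B | $2,000-2,500 | Maximum single-GPU capability |
| 2x RTX 3090 NVLink | 48GB | ~120B | $1,000-1,400 | Best value for 48GB VRAM pool |

Treat the model sizes as optimistic upper bounds: at a true 4-bit quantization, the larger entries only run with lower-bit quants or with some layers offloaded to system RAM.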
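A quick way to sanity-check whether a model fits a given card is to estimate weight memory as parameters times bits per weight, divided by eight. The sketch below does exactly that; the 2GB allowance for KV cache and runtime buffers is a placeholder assumption (it grows with context length), and real quantization formats often use slightly more than their nominal bit width:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: 1e9 params * (bits / 8) bytes each."""
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead allowance fit in the VRAM budget."""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for params in (8, 32, 70):
        size = weight_memory_gb(params, 4)
        print(f"{params}B at 4-bit: ~{size:.0f}GB, fits 24GB: {fits_in_vram(params, 4, 24)}")
```

By this estimate an 8B model needs about 4GB at 4 bits, a 32B model about 16GB, and a 70B model about 35GB, so a 70B model fits a 48GB pool but not a single 24GB card without more aggressive quantization or offloading.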

For a deeper comparison, read our Which GPU for Local AI guide.

Putting It Together: The Quiet Ollama Workstation

The ideal Ollama workstation is one you forget is running. It sits on or under your desk, serves local AI models around the clock, and produces no more noise than a refrigerator in the next room.

For an RTX 4090 Ollama build:

  1. GPU waterblock: Bykski block for your specific 4090 model
  2. Radiator: 480mm + 240mm minimum (Bykski 480mm + Barrow 240mm)
  3. Pump: Barrow D5 pump-reservoir combo at speed 2-3
  4. Fans at 600-800 RPM: effectively inaudible
  5. Undervolt -100mV: saves 50-80W of heat output with no performance loss

Result: GPU core at 45-50C, VRAM at 62-68C, noise at 25-28 dBA, sustained boost clock at maximum, tokens per second at the card's full potential, 24/7.

Common Ollama Thermal Problems and Solutions

"My tokens/sec drops after 10 minutes"

This is thermal throttling. The GPU starts cool, boosts to maximum clock, and delivers peak performance. As temperature rises over 10-15 minutes of sustained generation, the GPU reduces clock speed to stay within thermal limits. Monitor GPU temperature during generation — if it crosses 83C, you are throttling.

Quick fix: Set a power limit with nvidia-smi -pl 350 (for RTX 4090). You lose 5-8% peak performance but gain consistent sustained performance because the card runs cooler and does not throttle.

Permanent fix: Water cooling eliminates the temperature rise entirely. The GPU stays at 45-55C indefinitely — no throttling, no performance degradation over time.
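To confirm the drop with numbers rather than by feel, you can measure tokens per second directly from Ollama's HTTP API. The sketch below assumes a default local install listening on port 11434 and relies on the eval_count and eval_duration (nanoseconds) fields that the /api/generate endpoint returns for non-streaming requests; the model name and prompt are placeholders:

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate_once(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        host + "/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


if __name__ == "__main__":
    # Placeholder model and prompt; repeat the request to watch for a downward trend.
    for run in range(5):
        result = generate_once("llama3", "Write a short paragraph about heat sinks.")
        print(f"run {run + 1}: {tokens_per_second(result):.1f} t/s")
```

Run it for 15-20 minutes alongside your temperature monitor. If tokens per second falls as the core temperature approaches 83C, you have your answer.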

"Ollama gives OOM errors after running for hours but works fine at startup"

VRAM overheating is one suspect, but check it carefully. Past 92C, GDDR6X throttling reduces memory bandwidth rather than capacity, so a true out-of-memory error usually also means memory use has grown during the session (a longer context or an additional loaded model, for example). Log VRAM junction temperature and memory.used together to see which one changed. Read our VRAM overheating guide for the full diagnosis and fix.

"My GPU fans are so loud I cannot work in the same room"

An RTX 4090 at full load with stock cooling produces 50-65 dBA — louder than a normal conversation. This is the single most common complaint from Ollama users in home offices. Water cooling drops noise to 25-32 dBA (quiet library). If you are on air and cannot water-cool yet, undervolting reduces power draw and thus fan speed — a free partial fix. See our undervolting guide.

"I want to run Ollama overnight but the noise wakes me up"

This is the most water-cooling-specific use case. A water-cooled AI rig at 25-28 dBA is quieter than most room air conditioning units. You can run sustained inference in the same room you sleep in. On air cooling, this is genuinely impossible with any high-end GPU under sustained load.

That is what a well-cooled Ollama rig looks like. Browse the complete AI Workstation Cooling collection to get started, or read the cost breakdown if you need to justify the investment.
