What GPU waterblocks does FormulaMod carry?

FormulaMod stocks full-cover GPU waterblocks for over 500 graphics card models, covering RTX 5090, 4090, 3090, RX 9070 XT, and H100/H200 GPUs. We stock over 1,600 water cooling products with same-week worldwide shipping from Guangzhou.

How do I choose a waterblock for my GPU?

Match the waterblock to your exact GPU model and PCB variant. Reference-design cards (Founders Edition, reference Sapphire/PowerColor) use universal reference blocks. Non-reference cards from ASUS ROG Strix, MSI Gaming X Trio, Gigabyte AORUS, and EVGA FTW3 require model-specific full-cover blocks that match their unique PCB layouts. Always verify your GPU's exact model number before ordering.

Will a GPU waterblock reduce noise during AI workloads like Ollama or Stable Diffusion?

Yes. Stock GPU air coolers run at 45-65 dBA under sustained AI inference loads. A full-cover waterblock mounted to a 360mm radiator with a quiet fan curve typically reduces noise to under 30 dBA. VRAM temperatures also drop from 90°C+ to 60-70°C, which prevents throttling during long generation runs.

What components do I need for a complete custom water cooling loop?

A complete custom loop requires: (1) a full-cover GPU waterblock, (2) a radiator — 360mm recommended for single-GPU systems, 480mm for dual GPU or CPU+GPU loops, (3) a D5 or DDC pump with reservoir combo, (4) G1/4 threaded fittings — compression fittings for soft tubing or hard fittings for PETG/acrylic hard tubing, (5) tubing, and (6) premixed coolant. FormulaMod sells individual components and complete kits starting at $249 USD.

Does FormulaMod ship internationally?

Yes. FormulaMod ships worldwide from Guangzhou, China. Standard shipping to the US, EU, UK, Canada, and Australia typically takes 7-14 business days. Express DHL/FedEx options are available at checkout for 3-5 business day delivery. All orders include a tracking number.

How long has FormulaMod been operating?

FormulaMod has operated as an independent water-cooling specialist since 2013. Continuous online presence is publicly verifiable via the Internet Archive Wayback Machine. FormulaMod is a U.S. registered trademark (USPTO Reg. No. 6073949) and ships worldwide from our Guangzhou warehouse with full manufacturer warranty.


Ollama Hardware Cooling Guide: Keep Your GPU Cool for Faster Inference

By Liang Huang, FormulaMod Technical Team · Published Apr 18, 2026 · Updated Apr 22, 2026


Ollama Does Not Care About Your GPU's Temperature — But Your GPU Does

Ollama makes running local LLMs simple. Install it, pull a model, and start chatting. It abstracts away the complexity of model loading, quantization, and GPU memory management so that anyone with an NVIDIA GPU can run local AI.

What Ollama does not manage is your hardware's thermal environment. And that thermal environment directly affects your inference speed.

This is the guide that connects the software side (what Ollama does to your GPU) to the hardware side (how to keep that GPU cool enough to perform at its best). Ollama's 52 million monthly downloads in early 2026 suggest there are a lot of people running local LLMs who have not thought about cooling. If your tokens per second are lower than expected, or if your GPU fans sound like they are trying to achieve liftoff, this article explains why and what to do about it.

How Ollama Uses Your GPU

Ollama loads model weights into VRAM and uses the GPU's tensor cores for inference. The workload profile is different from gaming in ways that matter for cooling.

Characteristic	Gaming	Ollama Inference
GPU Utilization Pattern	Fluctuates (30-100% per frame)	Constant (80-100% during generation)
VRAM Usage	Partial (2-12GB typical)	Near-full (fills available VRAM with model weights)
Duration	1-4 hour sessions	Hours to days (servers run 24/7)
Power Draw Pattern	Peaks and valleys	Sustained near-maximum
Thermal Profile	Core-intensive, VRAM moderate	Both core and VRAM heavily utilized

The sustained nature of Ollama workloads means your GPU reaches thermal equilibrium and stays there. Gaming GPUs cool down between demanding scenes. Ollama GPUs do not get that break.

Why Thermals Affect Inference Speed

NVIDIA GPUs use dynamic frequency scaling (GPU Boost). The chip boosts to higher clock speeds when thermal and power conditions allow, and reduces clock speed when limits are reached. For Ollama users, this creates a direct relationship between temperature and tokens per second.

Clock Throttling

When the GPU core temperature hits 83C (the default thermal limit on most RTX cards), the GPU begins reducing clock speed to prevent overheating. Each step down in clock speed reduces compute throughput roughly in proportion.

Example for an RTX 4090 running Llama 3 70B (4-bit quantized):

GPU Core Temp	Sustained Boost Clock	Approximate Tokens/sec	Impact
50-60C (water cooled)	2580-2670 MHz	~42 t/s	Maximum performance
70-78C (good air cooling)	2520-2580 MHz	~40 t/s	Minor (-5%)
83-87C (stock air, sustained)	2400-2520 MHz	~37-38 t/s	Moderate (-10 to -12%)
90C+ (poor airflow)	2200-2400 MHz	~33-35 t/s	Significant (-17 to -21%)
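To put the tokens/sec column in relative terms, the short sketch below recomputes the percentage change against the water-cooled row. It uses the approximate figures from the table above (midpoints where a range is given); the function name and row labels are illustrative, not part of any tool.

```python
def percent_change(baseline_tps: float, observed_tps: float) -> float:
    """Return the percentage change in throughput relative to the baseline."""
    return (observed_tps - baseline_tps) / baseline_tps * 100.0


BASELINE_TPS = 42.0  # water-cooled row from the table above

# Midpoints of the approximate tokens/sec figures in the table.
rows = {
    "good air cooling (70-78C)": 40.0,
    "stock air, sustained (83-87C)": 37.5,
    "poor airflow (90C+)": 34.0,
}

for label, tps in rows.items():
    print(f"{label}: {percent_change(BASELINE_TPS, tps):+.1f}%")
```

The same function works with your own measurements: record tokens/sec on a cold card, record it again after 20 minutes of sustained generation, and compare.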

VRAM Throttling

This is the less visible but more impactful problem. GDDR6X memory on the RTX 3090 and 4090 has a separate thermal limit: 92-95C junction temperature. (The RTX 5090 uses GDDR7 rather than GDDR6X, but its memory is temperature-limited in the same way.) When VRAM overheats, the memory controller reduces bandwidth to prevent damage. Since Ollama loads entire model weights into VRAM and continuously reads them during inference, VRAM bandwidth is critical to performance.

VRAM throttling causes:

Token generation speed drops by 10-30%
Context window operations slow down (prompt processing becomes the bottleneck)
In extreme cases, instability or memory errors on long runs. Throttling cuts bandwidth rather than capacity, so if you see CUDA out-of-memory errors, check actual VRAM usage as well as temperature

Air coolers primarily cool the GPU die. They make indirect contact with VRAM modules through a shared heatsink or small thermal pads, but this is not their design priority. Under sustained Ollama workloads, air-cooled VRAM temperatures routinely hit 90-100C. For more on this specific issue, read our VRAM overheating guide.

Monitoring Your GPU During Ollama Workloads

nvidia-smi (Linux and Windows)

The most basic monitoring tool. Run in a terminal while Ollama is generating:

nvidia-smi --query-gpu=temperature.gpu,temperature.memory,clocks.current.graphics,power.draw,memory.used --format=csv -l 5

This outputs GPU core temperature, memory temperature (on supported GPUs), current clock speed, power draw, and VRAM usage every 5 seconds.

Limitation: On many GPU models, nvidia-smi does not report VRAM junction temperature — it only shows the GPU core sensor. You need HWiNFO64 (Windows) or nvtop (Linux) for the full picture.
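If you want to log these readings rather than watch them scroll past, a small Python wrapper is enough. The sketch below runs the same query with nvidia-smi's csv,noheader,nounits format and turns each line into a dictionary. The function names are hypothetical, and the handling of unsupported fields assumes they show up as non-numeric text such as N/A, which is then stored as None.

```python
import subprocess

# Same fields as the command above, in the same order.
QUERY_FIELDS = [
    "temperature.gpu",
    "temperature.memory",
    "clocks.current.graphics",
    "power.draw",
    "memory.used",
]


def parse_sample(csv_line: str) -> dict:
    """Parse one line of csv,noheader,nounits output into a dict.

    Fields the driver cannot report (non-numeric text such as "N/A")
    become None instead of raising.
    """
    values = [v.strip() for v in csv_line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    sample = {}
    for name, raw in zip(QUERY_FIELDS, values):
        try:
            sample[name] = float(raw)
        except ValueError:
            sample[name] = None
    return sample


def read_gpu_samples() -> list:
    """Query nvidia-smi once and return one parsed dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [parse_sample(line) for line in output.splitlines() if line.strip()]


if __name__ == "__main__":
    # Example line in the shape nvidia-smi prints; the values are made up.
    print(parse_sample("71, N/A, 2520, 412.35, 22510"))
```

Call read_gpu_samples() in a loop with a sleep between iterations to build a temperature log for a long generation run. A None in temperature.memory is the limitation described above showing up in practice.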

nvtop (Linux)

Install via your package manager (apt install nvtop or pacman -S nvtop). Provides a real-time dashboard showing GPU utilization, temperatures, fan speed, and per-process VRAM usage. Better than nvidia-smi for continuous monitoring.

HWiNFO64 (Windows)

The most comprehensive hardware monitor on Windows. Shows GPU core temperature, VRAM junction temperature (labeled "GPU Memory Junction Temperature"), VRM temperature, fan speed, clock speed, and power draw. Run it alongside Ollama and check the "Maximum" column after a sustained generation to see your peak temperatures.

What Numbers to Worry About

Metric	Safe	Concerning	Take Action
GPU Core Temperature	Under 75C	75-85C	Above 85C sustained
VRAM Junction Temperature	Under 80C	80-92C	Above 92C (active throttling)
Fan Speed (RPM)	Under 1500 RPM	1500-2500 RPM (loud)	Above 2500 RPM (very loud, near max)
GPU Clock (sustained)	Within 100 MHz of rated boost	100-200 MHz below rated	200+ MHz below rated (thermal throttling)
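The table translates directly into a lookup you can run against logged readings. This is a minimal sketch using the thresholds above; the metric keys and the classify function are made-up names, and the clock row is expressed as a deficit (rated boost minus sustained clock) so that all four metrics read "higher is worse".

```python
# metric: (concerning_from, action_from, action_is_inclusive)
THRESHOLDS = {
    "core_temp_c": (75, 85, False),        # take action only above 85C
    "vram_temp_c": (80, 92, False),        # take action only above 92C
    "fan_rpm": (1500, 2500, False),        # take action only above 2500 RPM
    "clock_deficit_mhz": (100, 200, True), # "200+ MHz below rated"
}


def classify(metric: str, value: float) -> str:
    """Map a reading to 'safe', 'concerning', or 'take action'."""
    concerning_from, action_from, inclusive = THRESHOLDS[metric]
    if value > action_from or (inclusive and value == action_from):
        return "take action"
    if value >= concerning_from:
        return "concerning"
    return "safe"


if __name__ == "__main__":
    # Made-up readings for illustration.
    readings = {
        "core_temp_c": 84,
        "vram_temp_c": 96,
        "fan_rpm": 2100,
        "clock_deficit_mhz": 150,
    }
    for metric, value in readings.items():
        print(f"{metric} = {value}: {classify(metric, value)}")
```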

Air Cooling Limits for 24/7 Ollama

Air cooling works for intermittent Ollama use (a few conversations per day, batch jobs that finish in minutes). It hits its limits when Ollama runs continuously — serving as a local API endpoint, processing long documents, or running as a persistent chat server.

Why Air Cooling Struggles with Sustained Load

Case temperature rise: After 2-3 hours of sustained GPU load, the air inside your case warms by 5-10C above ambient. The air cooler is now trying to cool the GPU using pre-warmed air. Efficiency drops.
VRAM is secondary: Stock air coolers are designed to cool the GPU die. VRAM modules get indirect cooling through the heatsink baseplate and small thermal pads. Under sustained load, these are not sufficient for GDDR6X memory.
Noise feedback loop: As temperatures rise, fan curves increase RPM. Higher RPM creates turbulence that reduces cooling efficiency per RPM. You hit a ceiling where fans are at 100% but temperatures are still at the limit.

What You Can Do with Air Cooling

Before upgrading to water cooling, try these free or low-cost improvements:

Improve case airflow: Add or reposition case fans. Ensure front intake and rear/top exhaust are working. Remove any panels blocking airflow.
Undervolt the GPU: Reduce power consumption by 5-15% with zero performance loss. See our undervolting guide (the air cooling section applies even without water cooling).
Set a power limit: nvidia-smi -pl 350 on a 4090 (450W default; the command needs administrator or root privileges) reduces heat output by roughly 20% with only a 5-8% inference speed loss. For many Ollama use cases, this is an acceptable trade.
Keep the side panel off: Not elegant, but immediately drops temperatures by 5-8C by eliminating the case as a heat trap. For a workstation under a desk, nobody sees it anyway.

If these do not get you below 85C core and 92C VRAM during sustained generation, water cooling is the next step.
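The power-limit trade-off is simple arithmetic, and it is worth running with your own numbers. The sketch below assumes a 450W default power limit for the RTX 4090 (partner cards vary, so check yours with nvidia-smi -q -d POWER), assumes the card actually draws its full limit under sustained load, and uses 6.5% as the midpoint of the 5-8% speed loss quoted above. The function name is illustrative.

```python
def power_limit_tradeoff(default_w: float, limit_w: float, speed_loss_pct: float):
    """Return (heat reduction %, tokens-per-watt gain %) for a power limit.

    Assumes the GPU draws its full power limit under sustained load,
    so heat output scales with the limit.
    """
    heat_reduction_pct = (default_w - limit_w) / default_w * 100.0
    relative_speed = 1.0 - speed_loss_pct / 100.0
    relative_power = limit_w / default_w
    efficiency_gain_pct = (relative_speed / relative_power - 1.0) * 100.0
    return heat_reduction_pct, efficiency_gain_pct


if __name__ == "__main__":
    heat, efficiency = power_limit_tradeoff(450, 350, 6.5)
    print(f"heat output: -{heat:.1f}%")
    print(f"tokens per watt: +{efficiency:.1f}%")
```

Under these assumptions, the card produces noticeably less heat per token generated, which is why a power limit is often the first thing to try on air.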

The Water Cooling Solution

A full-cover GPU waterblock makes direct contact with both the GPU die and VRAM modules, using thermal paste and thermal pads respectively. This provides simultaneous cooling to both heat sources — something no air cooler can match.

Typical temperature improvement for a GPU running sustained Ollama inference:

Component	Air Cooled (Sustained)	Water Cooled (Sustained)	Improvement
GPU Core	82-88C	45-55C	-30 to -38C
VRAM Junction	90-100C	58-72C	-25 to -35C
Fan/Pump Noise	50-65 dBA	25-32 dBA	Dramatically quieter
Sustained Boost Clock	2400-2520 MHz	2580-2670 MHz	+60-170 MHz sustained

The performance gain from water cooling in Ollama is not a single dramatic number. It is the elimination of thermal throttling — your GPU maintains maximum clock speed indefinitely instead of degrading over the first 15-20 minutes of sustained use. For a tokens-per-second comparison, that is typically 5-12% faster sustained inference on a water-cooled 4090 vs. the same card on air.

Recommended Hardware by Ollama Model Size

Different model sizes have different GPU requirements and different thermal profiles. Here is what to pair with what.

Small Models (7B-13B Parameters)

Examples: Llama 3.1 8B, Mistral 7B, Gemma 2 9B, Phi-3 mini

GPU: RTX 4060 Ti (16GB), RTX 3060 (12GB), or RTX 4070
VRAM Usage: 4-8GB (quantized)
Thermal Load: Moderate (150-200W)
Cooling: Air cooling is sufficient for most use. Water cooling only for noise-sensitive environments.

Medium Models (32B-40B Parameters)

Examples: Qwen 2.5 32B, Yi 34B, Command R 35B (4-bit quants of this class fit in 24GB; a 70B model at 4-bit needs about 35GB for weights alone and does not)

GPU: RTX 4090 (24GB) or RTX 3090 (24GB)
VRAM Usage: 16-24GB (fills the card)
Thermal Load: High (350-450W sustained)
Cooling: Water cooling strongly recommended for 24/7 use
Waterblock options: See our RTX 4090 guide or RTX 3090 revival guide

Large Models (65B-70B+ Parameters at Higher Precision)

Examples: Llama 3.1 70B (4-bit needs roughly 40GB; 8-bit needs 70GB or more), DeepSeek V3 (multi-GPU)

GPU: Dual RTX 3090 NVLink (48GB combined), RTX 5090 (32GB), or dual RTX 4090
VRAM Usage: 32-48GB+
Thermal Load: Very high (700-1000W sustained for dual GPU)
Cooling: Water cooling is required, not optional. Air cooling cannot handle 700W+ in a standard case.
Build guide: See our dual 3090 NVLink cooling guide

The GPU Decision for Ollama Users

If you are choosing a GPU specifically for Ollama, the decision matrix looks different from gaming:

GPU	VRAM	Max Model Size (4-bit)	Price (2026)	Value for Ollama
RTX 3090 (used)	24GB	~32B	$500-700	Best value per VRAM dollar
RTX 4090	24GB	~32B	$1,400-1,800	Best single-GPU performance
RTX 5090	32GB	~45B	$2,000-2,500	Maximum single-GPU capability
2x RTX 3090 NVLink	48GB	~70B	$1,000-1,400	Best value for 48GB VRAM pool

For a deeper comparison, read our Which GPU for Local AI guide.
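A quick way to check whether a model will fit on a given card is to estimate the size of its weights: parameters times bits per weight, divided by 8. The sketch below does that and adds a flat allowance for the KV cache and runtime buffers. The 2GB overhead is an assumption for illustration, not a measured value, and real 4-bit formats store extra metadata, so actual files run somewhat larger than the 4.0 bits used here.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM.

    overhead_gb is a rough, hypothetical allowance for the KV cache and
    runtime buffers; long contexts need considerably more.
    """
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for params, bits, vram in [(8, 4, 12), (32, 4, 24), (70, 4, 24), (70, 4, 48), (70, 8, 48)]:
        size = weights_gb(params, bits)
        verdict = "fits" if fits_in_vram(params, bits, vram) else "does not fit"
        print(f"{params}B at {bits}-bit: ~{size:.0f}GB weights, {verdict} in {vram}GB")
```

When a model does not fit, Ollama can still run it by keeping some layers in system RAM, but generation slows sharply, so treat "fits" as the threshold for full-speed inference.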

Putting It Together: The Quiet Ollama Workstation

The ideal Ollama workstation is one you forget is running. It sits on or under your desk, serves local AI models around the clock, and produces no more noise than a refrigerator in the next room.

For an RTX 4090 Ollama build:

GPU waterblock: Bykski block for your specific 4090 model
Radiator: 480mm + 240mm minimum (Bykski 480mm + Barrow 240mm)
Pump: Barrow D5 pump-reservoir combo at speed 2-3
Fans at 600-800 RPM: effectively inaudible
Undervolt -100mV: saves 50-80W of heat output with no performance loss

Result: GPU core at 45-50C, VRAM at 62-68C, noise at 25-28 dBA, sustained boost clock at maximum, tokens per second at the card's full potential, 24/7.

Common Ollama Thermal Problems and Solutions

"My tokens/sec drops after 10 minutes"

This is thermal throttling. The GPU starts cool, boosts to maximum clock, and delivers peak performance. As temperature rises over 10-15 minutes of sustained generation, the GPU reduces clock speed to stay within thermal limits. Monitor GPU temperature during generation — if it crosses 83C, you are throttling.

Quick fix: Set a power limit with nvidia-smi -pl 350 (for RTX 4090). You lose 5-8% peak performance but gain consistent sustained performance because the card runs cooler and does not throttle.

Permanent fix: Water cooling eliminates the temperature rise entirely. The GPU stays at 45-55C indefinitely — no throttling, no performance degradation over time.
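To confirm the drop with numbers rather than by feel, you can compute tokens per second from the timing fields Ollama returns with each completed generation: eval_count (tokens generated) and eval_duration (nanoseconds). The sketch below works on response dictionaries you have already collected; the sample values are made up, and the function names are illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's timing fields.

    eval_count is the number of generated tokens and eval_duration is
    the generation time in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def throughput_drop_pct(early_tps: float, late_tps: float) -> float:
    """Percentage drop from an early (cool) run to a later (warm) run."""
    return (early_tps - late_tps) / early_tps * 100.0


if __name__ == "__main__":
    # Made-up responses: one from a cold start, one after sustained load.
    cold = {"eval_count": 420, "eval_duration": 10_000_000_000}
    warm = {"eval_count": 378, "eval_duration": 10_000_000_000}

    early = tokens_per_second(cold)
    late = tokens_per_second(warm)
    print(f"cold: {early:.1f} t/s, warm: {late:.1f} t/s, "
          f"drop: {throughput_drop_pct(early, late):.1f}%")
```

Use the same prompt and model for both runs, and log GPU temperature alongside. A drop that tracks the temperature rise points to throttling; a drop with flat temperatures points elsewhere.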

"Ollama gives OOM errors after running for hours but works fine at startup"

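Check VRAM temperature first. As GDDR6X memory heats up past 92C, the memory controller restricts bandwidth, and sustained overheating can cause instability on long runs. Thermal throttling does not shrink VRAM capacity, though, so also compare memory.used at startup and after several hours: usage that has grown (longer contexts, additional loaded models) points to a genuine out-of-memory condition rather than a thermal one. Read our VRAM overheating guide for the full diagnosis and fix.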

"My GPU fans are so loud I cannot work in the same room"

An RTX 4090 at full load with stock cooling produces 50-65 dBA — louder than a normal conversation. This is the single most common complaint from Ollama users in home offices. Water cooling drops noise to 25-32 dBA (quiet library). If you are on air and cannot water-cool yet, undervolting reduces power draw and thus fan speed — a free partial fix. See our undervolting guide.

"I want to run Ollama overnight but the noise wakes me up"

This is the most water-cooling-specific use case. A water-cooled AI rig at 25-28 dBA is quieter than most room air conditioning units. You can run sustained inference in the same room you sleep in. On air cooling, this is genuinely impossible with any high-end GPU under sustained load.

That is what a well-cooled Ollama rig looks like. Browse the complete AI Workstation Cooling collection to get started, or read the cost breakdown if you need to justify the investment.
