VRAM Overheating: How to Monitor and Fix GPU Memory Temperatures

The Temperature You Are Not Watching

When people check their GPU temperature, they look at one number: the GPU core temperature. MSI Afterburner shows it. nvidia-smi shows it. The number stays at 75-85C under load, and they think everything is fine.

It is not fine.

Modern high-end GPUs have a second thermal hotspot that almost no one monitors: the VRAM. On the RTX 3090, 4090, and 5090, the memory modules (GDDR6X on the 3090 and 4090, GDDR7 on the 5090) can run significantly hotter than the GPU die, and they have their own thermal throttle point that directly affects your AI workload performance.

The XDA Developers article "Your VRAM is Overheating While You Watch the Wrong Temperature" brought this issue to mainstream attention. This guide goes deeper: what VRAM thermal throttling actually is, how to monitor it, why it specifically affects AI workloads, and how to fix it — from free software changes to the permanent hardware solution.

What Is VRAM Thermal Throttling?

GDDR6X memory (used on the RTX 3090, 3090 Ti, and 4090) operates at extremely high data rates, up to 21 Gbps per pin on the 4090; the RTX 5090 uses GDDR7, which runs faster still. That speed generates substantial heat in each memory module. The memory controller monitors the junction temperature of the VRAM modules and begins reducing memory bandwidth when junction temperature exceeds 92-95C (varies by SKU).

This is not the same as GPU core throttling. You can have a GPU core at 70C (perfectly fine) while your VRAM is at 98C (actively throttling). The two temperatures are not correlated in the way most people expect.

Why VRAM Runs Hotter Than the GPU Die

  • Physical location: VRAM modules are spread around the GPU die on the PCB. Air coolers focus their cooling mass (heatpipes, vapor chamber) on the GPU die. VRAM modules get secondary cooling through the baseplate or small dedicated thermal pads — far less effective.
  • Power density: GDDR6X modules are small (each die is roughly 10x14mm) but each dissipates 2-4W. With 12 modules on a 4090 and 24 on a 3090, that is roughly 24-96W of heat from a surface area that is difficult to cool with passive contact alone.
  • Backside modules: On some GPU models, VRAM modules are on both sides of the PCB. The modules on the back (away from the cooler) are cooled only through the backplate — often poorly.

How VRAM Throttling Affects AI Workloads

VRAM throttling matters more for AI than for gaming because of how AI workloads use memory.

Gaming vs. AI Memory Access Patterns

Characteristic | Gaming | AI Inference (Ollama, vLLM, ComfyUI)
VRAM utilization | 4-12GB (partial) | 16-24GB (near-full for large models)
Access frequency | Burst reads per frame, idle between frames | Continuous reads during token generation
Bandwidth sensitivity | Moderate (GPU compute is usually the bottleneck) | High (model weights must be read from VRAM every token)
Impact of throttling | 1-3 FPS drop (barely noticeable) | 10-30% tokens/sec drop (very noticeable)

When Ollama generates text, it reads the model weights from VRAM on every forward pass. For a 70B model at 4-bit quantization, that is roughly 35GB of reads per token, which is also more than a single RTX 4090's 24GB can hold, so a model this size has to be split across cards or partly offloaded. The principle is the same for any model that fits: the RTX 4090's memory bandwidth is 1008 GB/s at full speed, and VRAM throttling can reduce effective bandwidth to 700-800 GB/s. That 20-30% hit translates directly to 20-30% fewer tokens per second.
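The bandwidth arithmetic can be sketched in a few lines. This is a rough upper-bound model only, assuming token generation is purely memory-bound and that every token reads all weights exactly once; the function name is illustrative, and the 35GB, 1008 GB/s, and 750 GB/s inputs are the article's figures (750 being the midpoint of the throttled range).

```python
def max_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once.

    Assumes a purely memory-bound workload; real throughput is lower.
    """
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


def fractional_drop(full_gb_s: float, throttled_gb_s: float) -> float:
    """Fraction of throughput lost when bandwidth falls (weights unchanged)."""
    return 1.0 - throttled_gb_s / full_gb_s


if __name__ == "__main__":
    full = max_tokens_per_sec(1008, 35)       # rated bandwidth
    throttled = max_tokens_per_sec(750, 35)   # assumed throttled bandwidth
    loss = fractional_drop(1008, 750)
    print(f"full: {full:.1f} t/s, throttled: {throttled:.1f} t/s, loss: {loss:.0%}")
```

Because the weight size cancels out of the ratio, the percentage loss depends only on how far bandwidth falls, which is why the same 20-30% figure applies to models of any size.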

Symptoms of VRAM Throttling During AI Workloads

  • Tokens/sec drops after 10-20 minutes: Performance starts strong (while VRAM is cool) and degrades as temperatures rise. If your first minute of generation is 40 t/s and it drops to 32 t/s after 15 minutes, VRAM throttling is the probable cause.
  • Stable Diffusion images take progressively longer: The first image in a batch renders fast. Each subsequent image takes slightly longer as VRAM heats up.
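  • Instability or unexpected errors at high temperature: Your model loads and runs fine at boot, but after sustained use you see crashes or CUDA errors. Throttling lowers memory speed rather than capacity, so a genuine out-of-memory error usually has another cause (a growing context cache or memory fragmentation, for example), but failures that appear only once the card is hot are a good reason to check VRAM temperature.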
  • nvidia-smi shows memory clock below rated speed: If you see memory clock dropping from the rated value (e.g., 10501 MHz on RTX 4090) during sustained load, memory is thermally throttling.
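The first symptom above, throughput that starts strong and then sags, can be checked with a small helper. This is a minimal sketch: the function names, the three-sample window, and the 10% threshold are assumptions chosen for illustration, and the samples are tokens/sec readings you would collect yourself at regular intervals.

```python
from statistics import mean


def throughput_drop(samples: list[float], window: int = 3) -> float:
    """Fractional drop from the mean of the first `window` samples
    to the mean of the last `window` samples."""
    if window < 1 or len(samples) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    start = mean(samples[:window])
    end = mean(samples[-window:])
    return (start - end) / start


def looks_throttled(samples: list[float], threshold: float = 0.10,
                    window: int = 3) -> bool:
    """True if throughput fell by at least `threshold` over the run."""
    return throughput_drop(samples, window) >= threshold


if __name__ == "__main__":
    # Hypothetical readings taken every couple of minutes during a long run.
    readings = [40, 40, 39, 37, 35, 33, 32, 32]
    print(f"drop: {throughput_drop(readings):.1%}, "
          f"suspect throttling: {looks_throttled(readings)}")
```

A drop on its own does not prove VRAM throttling (a growing context also slows generation), so treat a positive result as a prompt to check the memory temperature and clock.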

How to Monitor VRAM Temperature

Windows: HWiNFO64

HWiNFO64 is the most reliable way to read VRAM junction temperature on Windows.

  1. Download and install HWiNFO64 (free for personal use)
  2. Launch in "Sensors Only" mode
  3. Scroll to the GPU section
  4. Look for "GPU Memory Junction Temperature" — this is the VRAM junction temp
  5. Run your AI workload and watch the "Maximum" column to see peak VRAM temperature

Important: Not all GPU models expose this sensor. Most RTX 3090, 4090, and 5090 variants do. Some older or lower-end cards do not. If you do not see this reading, your GPU's firmware may not expose the VRAM thermal sensor.

Windows: GPU-Z

GPU-Z also reads VRAM temperature on supported cards. Open the Sensors tab and look for "Memory Temperature." It updates in real-time and can log to a file for long-term monitoring.
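If you log sensors to a file, the peak can be pulled out afterwards instead of watched live. The sketch below assumes a comma-separated log with a header row and a column whose name contains both "memory" and "temp"; the exact column names written by GPU-Z or HWiNFO64 vary by version, so check your own log header and adjust the matching if needed.

```python
import csv
import io


def peak_memory_temp(csv_text: str) -> float:
    """Return the highest value in the first column whose header mentions
    both 'memory' and 'temp' (case-insensitive)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip().lower() for name in next(reader)]
    matches = [i for i, name in enumerate(header)
               if "memory" in name and "temp" in name]
    if not matches:
        raise KeyError("no memory temperature column found in log header")
    col = matches[0]

    values = []
    for row in reader:
        if col >= len(row):
            continue  # short or trailing row
        try:
            values.append(float(row[col].strip()))
        except ValueError:
            continue  # skip blanks and non-numeric placeholders
    if not values:
        raise ValueError("memory temperature column has no numeric readings")
    return max(values)
```

To use it on a real file, read the log into a string first, for example with `open(path, encoding="latin-1").read()`; the encoding is also an assumption, since sensor logs often contain a degree symbol.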

Linux: nvidia-smi (Limited)

On Linux, nvidia-smi reports basic GPU temperature but often does not expose VRAM junction temperature separately. For NVIDIA GPUs on Linux, you have a few options:

  • nvidia-smi -q -d TEMPERATURE — shows all available temperature sensors, including memory temperature on supported GPUs
  • nvidia-smi --query-gpu=temperature.memory --format=csv — directly queries memory temperature (returns "N/A" on unsupported GPUs)
  • nvtop — some versions display VRAM temperature if available from the driver
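The query in the list above can be wrapped in a short script that makes the "N/A" case explicit. This is a sketch, not a complete monitor: it assumes nvidia-smi is on the PATH, uses the `temperature.gpu`, `temperature.memory`, and `clocks.mem` query fields, and treats anything non-numeric as "sensor not exposed".

```python
import shutil
import subprocess
from typing import Optional

QUERY = "temperature.gpu,temperature.memory,clocks.mem"


def parse_field(raw: str) -> Optional[float]:
    """Convert one CSV field to a float, or None if it is not numeric
    (for example 'N/A' or '[N/A]' on GPUs that hide the sensor)."""
    try:
        return float(raw.strip().strip("[]"))
    except ValueError:
        return None


def parse_line(line: str) -> dict:
    """Parse one 'core, memory, clock' line of nvidia-smi CSV output."""
    core, vram, clock = (parse_field(f) for f in line.split(","))
    return {"core_c": core, "vram_c": vram, "mem_clock_mhz": clock}


def read_gpus() -> list:
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found")
    else:
        try:
            for i, gpu in enumerate(read_gpus()):
                vram = gpu["vram_c"]
                shown = "not exposed" if vram is None else f"{vram:.0f}C"
                print(f"GPU {i}: core {gpu['core_c']}C, VRAM {shown}, "
                      f"mem clock {gpu['mem_clock_mhz']} MHz")
        except subprocess.CalledProcessError as err:
            print(f"nvidia-smi failed: {err}")
```

When the VRAM reading comes back as "not exposed", the memory clock column is still worth watching, since a clock below the rated value under load is the indirect sign of throttling described earlier.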

The Linux situation is less complete than Windows for VRAM monitoring. This is one of the challenges flagged by the r/LocalLLaMA community — "AI Linux enthusiasts running RTX GPUs: your cards can overheat silently" because standard Linux tools may not expose the critical VRAM sensor.

What Is a Safe VRAM Temperature?

Temperature Range | Status | Action
Under 80C | Excellent: no performance impact | No action needed
80-90C | Acceptable: no throttling but approaching limits | Monitor regularly, ensure cooling is not degrading
90-95C | Warning: memory controller begins throttling | Improve cooling: better thermal pads, case airflow, or water cooling
95-105C | Critical: active throttling, performance reduced | Immediate action: undervolt, add airflow, plan water cooling upgrade
Above 105C | Dangerous: risk of permanent damage | Stop the workload, investigate cooling failure
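The table translates directly into a lookup that a monitoring script could call on each reading. This is a minimal sketch; the function name is illustrative, and because the table's ranges share their endpoints, the choice to treat each lower bound as inclusive (so exactly 90C counts as "warning") is an assumption.

```python
def vram_status(temp_c: float) -> str:
    """Map a VRAM junction temperature in Celsius to the table's status band.

    Lower bounds are treated as inclusive: 80 -> acceptable, 90 -> warning,
    95 -> critical. Anything above 105 is dangerous.
    """
    if temp_c < 80:
        return "excellent"
    if temp_c < 90:
        return "acceptable"
    if temp_c < 95:
        return "warning"
    if temp_c <= 105:
        return "critical"
    return "dangerous"


if __name__ == "__main__":
    for reading in (72, 88, 93, 101, 108):
        print(f"{reading}C: {vram_status(reading)}")
```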

Fix 1: Thermal Pad Replacement (Air Cooler)

If you want to stay on air cooling, the most effective upgrade is replacing the stock thermal pads between the VRAM modules and the heatsink baseplate.

Why Stock Thermal Pads Are Often Inadequate

GPU manufacturers use thermal pads that balance cost, manufacturability, and adequate (not optimal) thermal performance. Stock pads are typically 1-3 W/mK thermal conductivity. Aftermarket pads reach 6-12.5 W/mK — a 2-6x improvement in thermal transfer.
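To see why conductivity matters, the temperature difference across a pad can be estimated with the one-dimensional conduction formula, delta T = P * t / (k * A). The sketch below is illustrative only: the 3W per module and 10x14mm footprint come from figures earlier in the article, the 1.5mm thickness is an assumed example, and the model ignores contact resistance and heat spreading, so real gains will differ.

```python
def pad_temp_drop(power_w: float, thickness_mm: float,
                  k_w_per_mk: float, area_mm2: float) -> float:
    """Temperature difference (in C) across a thermal pad, using
    delta_T = P * t / (k * A) with t in metres and A in square metres."""
    if k_w_per_mk <= 0 or area_mm2 <= 0:
        raise ValueError("conductivity and area must be positive")
    thickness_m = thickness_mm / 1000.0
    area_m2 = area_mm2 / 1_000_000.0
    return power_w * thickness_m / (k_w_per_mk * area_m2)


if __name__ == "__main__":
    area = 10 * 14  # mm^2, module footprint quoted in the article
    stock = pad_temp_drop(3.0, 1.5, 3.0, area)        # assumed stock pad, 3 W/mK
    upgraded = pad_temp_drop(3.0, 1.5, 12.5, area)    # aftermarket pad, 12.5 W/mK
    print(f"stock: {stock:.1f}C across the pad, upgraded: {upgraded:.1f}C")
```

The temperature difference scales inversely with conductivity, so moving from 3 to 12.5 W/mK cuts the drop across the pad by the same factor of about four, provided the pad still makes full contact.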

For used RTX 3090s specifically, thermal pad degradation is a primary reason for poor VRAM temperatures. After 2-3 years of use, pads compress, dry out, and lose contact. Replacing them with fresh high-quality pads can drop VRAM temperatures by 10-15C.

Pad Thickness Matters

Using the wrong thickness thermal pad is a common mistake. Too thin: the pad does not make contact, leaving an air gap (a thermal insulator). Too thick: the pad cannot compress enough and holds the heatsink away from the GPU die, reducing GPU die cooling.

Check your GPU model's teardown guide (available on YouTube and forums) for the correct pad thicknesses. Common values:

  • VRAM modules (front): 1.0-2.0mm depending on model
  • VRAM modules (back/backplate): 1.5-3.0mm depending on model
  • VRM/MOSFET: 1.0-1.5mm

Fix 2: Full-Cover Waterblock (The Permanent Solution)

A full-cover GPU waterblock solves the VRAM temperature problem at its root. Unlike air coolers that focus cooling on the GPU die, a full-cover block makes direct thermal contact with:

  • The GPU die (via thermal paste through a copper cold plate)
  • All front-side VRAM modules (via thermal pads against the block's baseplate)
  • VRM/MOSFET components (via thermal pads)
  • Back-side VRAM (via thermal pads against an active backplate, on blocks that include one)

Coolant flows directly over the areas where VRAM thermal pads make contact, carrying heat away continuously. This is fundamentally more effective than an air cooler where VRAM heat must conduct through a shared metal baseplate before reaching the fin stack.

Typical VRAM Temperature Improvement

GPU | VRAM Temp (Air, Sustained AI) | VRAM Temp (Water, Sustained AI) | Improvement
RTX 3090 (used, aged thermal pads) | 95-110C | 55-68C | -35 to -45C
RTX 3090 (fresh thermal pads on air) | 85-95C | 55-68C | -25 to -30C
RTX 4090 | 88-100C | 58-72C | -25 to -30C
RTX 5090 | 85-95C | 55-70C | -25 to -30C

The used RTX 3090 shows the most dramatic improvement because the starting point is often the worst — aged thermal pads, 2-3 years of heat cycling, and GDDR6X modules that were already marginal on air. A waterblock plus fresh thermal pads is effectively a thermal reset for the entire card. See our used RTX 3090 revival guide for the complete process.

Before and After: Real-World Data

Community-reported data from RTX 4090 builds running sustained AI inference workloads:

Metric | Stock Air Cooler | Air + Replaced Pads | Full-Cover Waterblock
GPU Core (sustained) | 83-87C | 80-84C | 45-55C
VRAM Junction (sustained) | 94-102C | 82-90C | 58-72C
Memory Clock (sustained) | 10001-10501 MHz (throttled) | 10501 MHz (stable) | 10501 MHz (stable)
Tokens/sec (Llama 70B 4-bit) | 34-38 t/s (degrades over time) | 38-40 t/s (stable) | 40-43 t/s (stable)
Noise Level | 50-65 dBA | 50-65 dBA | 25-32 dBA

Key finding: thermal pad replacement on air cooling stabilizes VRAM temperatures and prevents throttling, but does not significantly reduce GPU core temperature or noise. Water cooling addresses all three: VRAM, core, and noise.

Decision Framework

Your Situation | Recommendation | Expected Cost
VRAM above 95C but want to stay on air | Replace thermal pads with 12.5 W/mK pads | $15-30
Used RTX 3090 with unknown pad condition | Waterblock + fresh pads (complete thermal reset) | $80-130 for block + pads
RTX 4090/5090, 24/7 AI workload, noise matters | Full-cover waterblock + custom loop | $350-500 for complete loop
RTX 4060 Ti / 4070, intermittent Ollama use | Monitor VRAM temps, likely fine on air | $0

Start by monitoring. If you do not know your VRAM temperatures, you cannot make an informed decision. Open HWiNFO64 or GPU-Z, run your typical AI workload for 30 minutes, and check the peak VRAM junction temperature. If it is under 90C, you are fine. If it is above 92C, you are losing performance right now.

Linux Users: The Silent Overheating Problem

A specific warning for Linux AI builders: on many Linux distributions, VRAM junction temperature is not exposed through standard monitoring tools. The r/LocalLLaMA community has documented cases where Linux users ran RTX 3090s and 4090s at VRAM temperatures above 100C for weeks without realizing it — because neither nvidia-smi nor nvtop showed the VRAM sensor.

If you run AI workloads on Linux and cannot see VRAM temperature in your monitoring tools, assume it is 15-25C higher than your GPU core temperature under sustained load. For an air-cooled 4090 showing 82C core temperature during inference, that puts VRAM at roughly 97-107C, which means it is actively throttling.
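That rule of thumb is easy to encode for a script that only has the core temperature to work with. This is a sketch of the heuristic, not a measurement: the 15-25C offsets are the article's estimate, the 92C default is the lower end of the throttle range quoted earlier, and the function names are illustrative.

```python
def estimate_vram_range(core_c: float, offset_low: float = 15.0,
                        offset_high: float = 25.0) -> tuple:
    """Estimated (low, high) VRAM junction temperature from core temperature,
    using the article's rule of thumb of 15-25C above core under load."""
    return core_c + offset_low, core_c + offset_high


def may_be_throttling(core_c: float, throttle_c: float = 92.0) -> bool:
    """Conservative check: True if even the upper estimate reaches the
    assumed throttle point."""
    _, high = estimate_vram_range(core_c)
    return high >= throttle_c


if __name__ == "__main__":
    low, high = estimate_vram_range(82)
    print(f"core 82C -> estimated VRAM {low:.0f}-{high:.0f}C, "
          f"possible throttling: {may_be_throttling(82)}")
```

The check uses the upper estimate on purpose, since a false alarm costs a few minutes of investigation while a missed warning costs sustained throttling.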

The safest approach for Linux users is to use nvidia-smi's power limit feature (nvidia-smi -pl 350, which requires root) to reduce total heat output and thus VRAM temperature. The alternative is to install a waterblock, which makes the monitoring gap moot because VRAM temperatures stay well below the throttle point regardless.

Active Backplate Cooling: The VRAM Solution for RTX 3090

The RTX 3090 has a unique thermal challenge: VRAM modules on both sides of the PCB. The front-side modules contact the waterblock directly. The back-side modules are only cooled through the backplate — which on most blocks is a passive metal plate that relies on convection and radiation to dissipate heat.

Active backplate waterblocks route coolant through the backplate itself, providing direct liquid cooling to the back-side VRAM. This drops back-side VRAM temperatures by an additional 10-15C compared to passive backplates.

Bykski's active backplane (TC series) blocks for the RTX 3090 include this feature.

For used RTX 3090 buyers running large models where VRAM stability is critical, an active backplate block is the strongest thermal insurance available.

For the used RTX 3090 revival path, see our complete guide. For GPU-specific waterblock selection, browse our AI Workstation Cooling collection. And for the full cooling system design, read our radiator sizing guide and cost breakdown.
