Dual RTX 3090 NVLink for 70B LLMs: The Cooling Guide

The migration from 32B to 70B language models is real, and it changes the hardware equation. A single RTX 4090 or 3090 with 24GB of VRAM can run a quantized 70B model, but only at heavy quantization that sacrifices quality. Two RTX 3090s connected with NVLink give you 48GB of combined VRAM, enough to run Llama 3.3 70B at Q4_K_M with a usable context window (around 16K tokens) and quality that approaches the unquantized model. At $800-1,100 per used card, a dual-3090 NVLink build costs $1,600-2,200 for the GPUs alone -- still cheaper than a single RTX 5090 at $2,900-3,500 street price, and it gives you 48GB of VRAM instead of 32GB. The catch is cooling: two 350W GPUs generate up to 700W of sustained heat in a single chassis, and air cooling that load is not practical for 24/7 AI workloads. This guide covers the complete build, from hardware selection through loop design to expected performance.

Key Takeaways

  • Dual RTX 3090s with NVLink = 48GB combined VRAM for $1,600-2,200. This is the cheapest path to running 70B models without aggressive quantization.
  • NVLink is available on the RTX 3090 (NOT the 3090 Ti). The NVLink bridge connects both cards at 112.5 GB/s bidirectional bandwidth.
  • Two 350W cards = 700W sustained heat output. Water cooling is the only viable option for this thermal load in a standard tower case.
  • The loop requires: 2 GPU waterblocks, a 480mm radiator (minimum), a D5 pump, and either a series run between the two blocks or a distro plate for parallel flow.
  • Expected performance: Llama 3.3 70B Q4_K_M at approximately 18 tok/sec, stable 24/7, at roughly 30 dBA measured at 1 meter.

The 32B to 70B Migration: Why Single GPU Is Done

In 2024, 32B-parameter models were the frontier for local AI. Llama 2 70B existed but required extreme quantization on consumer hardware. By early 2026, the landscape had shifted. Llama 3.3 70B, Qwen 72B, and the 70B distill of DeepSeek R1 set new quality benchmarks that made 32B models feel noticeably worse for complex reasoning, coding assistance, and document analysis.

The r/LocalLLaMA community consensus is clear: once you experience 70B model quality, going back to 32B feels like a downgrade. But 70B models at Q4_K_M quantization need approximately 40-44GB of VRAM for the model weights plus KV cache at reasonable context lengths. No single consumer GPU has that much VRAM. The RTX 5090 has 32GB, which is tight for 70B Q4 with limited context. The RTX PRO 6000 has 96GB but costs $8,000+.
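The 40-44GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: the parameter count (70.6B), the average of about 4.8 bits per weight for Q4_K_M, and the Llama-3-style attention shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache values) are assumptions, and the function names are illustrative. Compute buffers and runtime overhead are not included.

```python
GIB = 1024 ** 3

def weights_gib(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params * bits_per_weight / 8 / GIB

def kv_cache_gib(tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: one K and one V vector per layer per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / GIB

# Assumed inputs: 70.6B parameters, ~4.8 bits/weight, 8K-token context.
weights = weights_gib(70.6e9, 4.8)
cache = kv_cache_gib(8192)
print(f"weights ~{weights:.1f} GiB + KV cache ~{cache:.1f} GiB = ~{weights + cache:.1f} GiB")
```

Raising the `tokens` argument shows how quickly longer context eats into the remaining headroom on a 48GB pair.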

Two RTX 3090s with NVLink give you 48GB of pooled VRAM at a fraction of the cost.

Why Dual 3090 Beats Every Other Sub-$2,000 Path

| Configuration | Total VRAM | GPU Cost | 70B Q4 tok/sec | Notes |
|---|---|---|---|---|
| Dual RTX 3090 + NVLink | 48GB | $1,600-2,200 | ~18 | Cheapest 48GB path; NVLink bandwidth helps |
| Single RTX 5090 | 32GB | $2,900-3,500 | ~44 | Faster but limited to Q4 with short context at 70B |
| Single RTX 4090 | 24GB | $1,800-2,200 | ~35 (Q2 quant) | Requires Q2 quantization at 70B = quality loss |
| Dual RTX 4090 (no NVLink) | 48GB (split) | $3,600-4,400 | ~30 | No NVLink on 4090; PCIe tensor parallelism is slower |
| Mac Studio M4 Ultra (192GB) | 192GB unified | $6,999+ | ~12 | Huge VRAM but much slower inference than NVIDIA |

Prices as of April 2026. tok/sec estimates for Llama 3.3 70B Q4_K_M on Ollama.

The dual 3090 is not the fastest option. The single RTX 5090 runs inference about 2.4x faster. But at the prices listed, the dual 3090 costs roughly 25-55% less (about 40% at the midpoints of both ranges) and has 50% more VRAM. For workloads where VRAM capacity matters more than raw speed -- serving a 70B model for personal use, running RAG with large context windows, or fine-tuning LoRAs on 70B base models -- the dual 3090 is the practical choice.
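To make the capacity-versus-speed trade-off concrete, the short sketch below works out cost per GB of VRAM and the speed ratio from the table's own figures. The dictionary layout and function name are illustrative; the numbers are the estimates quoted in this guide, not new measurements.

```python
# Figures copied from the comparison table (GPU cost range in USD, VRAM in GB, tok/sec).
configs = {
    "Dual RTX 3090 + NVLink": {"cost": (1600, 2200), "vram_gb": 48, "tok_s": 18},
    "Single RTX 5090": {"cost": (2900, 3500), "vram_gb": 32, "tok_s": 44},
}

def cost_per_gb(cfg: dict) -> tuple:
    """Low and high cost per GB of VRAM for one configuration."""
    low, high = cfg["cost"]
    return low / cfg["vram_gb"], high / cfg["vram_gb"]

for name, cfg in configs.items():
    low, high = cost_per_gb(cfg)
    print(f"{name}: ${low:.0f}-{high:.0f} per GB of VRAM, ~{cfg['tok_s']} tok/sec")

speed_ratio = configs["Single RTX 5090"]["tok_s"] / configs["Dual RTX 3090 + NVLink"]["tok_s"]
print(f"RTX 5090 speed advantage: {speed_ratio:.1f}x")
```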

NVLink on the 3090: What It Buys You

NVLink on the RTX 3090 provides 112.5 GB/s of bidirectional bandwidth between two GPUs (56.25 GB/s in each direction). That is roughly 3.6x the 31.5 GB/s that PCIe 4.0 x16 delivers in one direction, or about 1.8x when both links are compared per direction.
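As a rough illustration of what the bandwidth gap means in practice, the sketch below computes the ideal time to move a buffer over each link. It uses the two headline figures quoted in this section and ignores latency and protocol overhead; the 64 MB buffer size is a hypothetical value chosen only for the example.

```python
NVLINK_GB_S = 112.5     # RTX 3090 NVLink figure quoted in this guide
PCIE4_X16_GB_S = 31.5   # PCIe 4.0 x16 figure quoted in this guide

def transfer_ms(megabytes: float, gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds (1 GB = 1000 MB, no overhead)."""
    seconds = (megabytes / 1000) / gb_per_s
    return seconds * 1000

ratio = NVLINK_GB_S / PCIE4_X16_GB_S
buffer_mb = 64  # hypothetical exchange size, for illustration only
print(f"Headline ratio: {ratio:.2f}x")
print(f"{buffer_mb} MB over NVLink: {transfer_ms(buffer_mb, NVLINK_GB_S):.2f} ms")
print(f"{buffer_mb} MB over PCIe 4.0 x16: {transfer_ms(buffer_mb, PCIE4_X16_GB_S):.2f} ms")
```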

For language model inference, NVLink provides two main benefits:

  • Unified VRAM pool: With NVLink, the two GPUs can read each other's VRAM directly over the bridge, and the runtime treats them as one 48GB budget. Ollama and llama.cpp handle the split transparently -- you load a 70B model and the runtime distributes it across both GPUs.
  • Faster tensor parallelism: When the model is split across GPUs, each inference step requires the GPUs to exchange intermediate results. NVLink's high bandwidth makes this exchange fast, minimizing the performance penalty of splitting.

Critical note: NVLink is available on the RTX 3090 but NOT on the RTX 3090 Ti. The 3090 Ti removed the NVLink connector. If you are buying used 3090s for a dual build, verify that you are getting the standard 3090, not the 3090 Ti. Also note that the NVLink bridge is a separate purchase -- NVIDIA's 3-slot and 4-slot bridges for the 3090 are available on the used market for $40-80.

The Cooling Math: 700W Sustained in One Case

Two RTX 3090s under sustained Ollama load draw approximately 300-350W each, totaling 600-700W of GPU heat alone. Add CPU, motherboard, and SSD power, and the total system heat output approaches 800-900W.

Air cooling this is not viable for 24/7 operation. Here is why:

  • Two triple-slot air coolers in the same case block each other's airflow. The top card's intake pulls in the bottom card's exhaust, so the top GPU runs 10-15C hotter than the bottom.
  • At 700W, case fans must run at maximum RPM to move enough air. Total noise: 65-75 dBA -- louder than a vacuum cleaner.
  • VRAM on both cards hits 95-110C, triggering thermal throttling on at least one card.

Water cooling solves this by moving heat out of the case entirely. Two slim waterblocks replace two bulky air coolers, and the heat is carried by liquid to a radiator that can be mounted anywhere -- top, front, or external. The GPU temperatures are determined by radiator size and fan speed, not by inter-card airflow dynamics.

Parts List

GPUs

Two RTX 3090s. Same AIB variant is ideal (simplifies waterblock selection), but mixed variants work as long as both are the standard 3090 (not 3090 Ti) with NVLink connectors.

Waterblocks (2x)

You need two waterblocks matched to your specific 3090 AIB variant.

Active backplate blocks are strongly recommended for a dual build. The 3090 has GDDR6X on both sides of the PCB, and with two cards running at full power, every degree of cooling margin matters.

Radiator

480mm is the minimum. A Bykski 480mm copper radiator handles 700W with fans at moderate speeds (1,000-1,200 RPM). For quieter operation (fans at 600-800 RPM), add a second 240mm or 360mm radiator. The ideal configuration is 480mm + 360mm = 840mm of total radiator length, which keeps both GPUs under 55C with fans barely audible.
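One way to compare radiator options is heat load per 120mm fan section. The sketch below applies that to the 700W GPU load used in this guide. The function name is illustrative, and the result is only a relative measure: a lower load per section generally allows slower fans.

```python
def watts_per_120mm(heat_watts: float, radiator_mm: int) -> float:
    """Heat each 120mm radiator section must shed for a given total load."""
    sections = radiator_mm / 120
    return heat_watts / sections

GPU_HEAT_W = 700  # two RTX 3090s at full load, per this guide

for total_mm in (480, 480 + 240, 480 + 360):
    load = watts_per_120mm(GPU_HEAT_W, total_mm)
    print(f"{total_mm}mm of radiator: {load:.0f} W per 120mm section")
```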

Pump

A D5 pump is sufficient for a dual-GPU loop. The Barrow D5 pump + reservoir combo provides adequate flow for two waterblocks in series or parallel, with PWM control for noise management.

Distro Plate (Optional but Recommended)

A distro plate simplifies plumbing for a dual-GPU loop by providing a central distribution point for coolant. It eliminates the need for complex tubing runs between the pump, two GPU blocks, and the radiator. The Bykski Core P3 distro plate works in open-frame cases popular with multi-GPU builds.

Supporting Hardware

  • Motherboard: Must have 2x PCIe x16 slots spaced 3 or 4 slots apart, matching the NVLink bridge you buy. ASUS WS boards, ASRock Rack, and Gigabyte Aorus Creator boards are popular choices.
  • CPU: AMD Threadripper or Ryzen 9 for PCIe lane count. Intel Core i9 works if the motherboard supports bifurcated lanes.
  • NVLink bridge: NVIDIA 3-slot or 4-slot NVLink bridge for RTX 3090 (not A-series or Quadro bridges). Available used for $40-80.
  • PSU: 1,200W minimum. Two 3090s plus system = 800-900W total draw. A single 1,200W PSU handles this. For extra safety margin, some builders use two 850W PSUs with a dual-PSU adapter.
  • Case: Full tower with space for 480mm radiator mounting. The Thermaltake Core P3/P5 open-frame cases are popular for multi-GPU water-cooled builds because they provide easy access and unlimited radiator mounting options.

Building the Loop Step by Step

Loop Order

For a dual-GPU loop, the recommended flow order is: Pump → GPU 1 → GPU 2 → Radiator → Reservoir → Pump. Both GPUs in series keeps the plumbing simple and ensures equal flow through both blocks.

An alternative is parallel flow: pump splits into both GPUs simultaneously, then both outputs merge before the radiator. This reduces flow restriction but requires a Y-splitter or distro plate. Parallel flow provides slightly more even temperatures between the two GPUs (1-2C difference) but adds complexity.
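The temperature penalty of running the second GPU on pre-warmed coolant can be estimated from the heat balance ΔT = P / (ṁ · c_p). The sketch below is an approximation under stated assumptions: the coolant is treated as plain water, and the 4 L/min flow rate is a hypothetical value, not a measurement from this build.

```python
WATER_CP = 4186.0      # J/(kg*K), specific heat of water
WATER_DENSITY = 1.0    # kg/L, close enough near room temperature

def coolant_rise_c(heat_watts: float, flow_l_per_min: float) -> float:
    """Coolant temperature rise across a block that adds heat_watts."""
    mass_flow = flow_l_per_min * WATER_DENSITY / 60.0  # kg/s
    return heat_watts / (mass_flow * WATER_CP)

FLOW_L_MIN = 4.0  # assumed loop flow rate

# Series: GPU 2 receives coolant already warmed by GPU 1.
series_penalty = coolant_rise_c(350, FLOW_L_MIN)
# Parallel: both blocks see the same inlet temperature, each at half the flow.
parallel_rise = coolant_rise_c(350, FLOW_L_MIN / 2)

print(f"Series: GPU 2 inlet is ~{series_penalty:.1f}C warmer than GPU 1 inlet")
print(f"Parallel: each block warms its coolant by ~{parallel_rise:.1f}C")
```

Under these assumptions the coolant itself accounts for only part of the GPU-to-GPU difference; block mounting and thermal pad contact contribute the rest.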

Installation Sequence

  1. Install both 3090 waterblocks on the GPUs (fresh thermal pads and paste).
  2. Install the NVLink bridge on the GPU connectors (before installing in the case -- it is easier to align).
  3. Mount both GPUs in the PCIe slots. Verify the NVLink bridge is properly seated.
  4. Mount the radiator(s) and pump in the case.
  5. Connect tubing: pump out → GPU 1 in → GPU 1 out → GPU 2 in → GPU 2 out → radiator in → radiator out → reservoir → pump in.
  6. Fill and leak test for 24 hours with the GPUs unpowered.
  7. First boot: verify both GPUs appear in nvidia-smi and that NVLink is detected (nvidia-smi topo -m).
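The first-boot check in step 7 can be scripted. The sketch below shells out to `nvidia-smi -L` and `nvidia-smi topo -m` and looks for an `NV#` entry in the topology matrix, which is how nvidia-smi labels an NVLink connection. The helper names are illustrative, and the parsing assumes the usual whitespace-separated matrix layout, which may vary between driver versions.

```python
import re
import shutil
import subprocess

def has_nvlink(topo_text: str) -> bool:
    """True if any GPU row of the `nvidia-smi topo -m` matrix has an NV# cell."""
    for line in topo_text.splitlines():
        cells = line.split()
        if cells and re.fullmatch(r"GPU\d+", cells[0]):
            if any(re.fullmatch(r"NV\d+", cell) for cell in cells[1:]):
                return True
    return False

def count_gpus(list_text: str) -> int:
    """Count GPUs in `nvidia-smi -L` output, which prints one line per GPU."""
    return sum(1 for line in list_text.splitlines() if line.startswith("GPU "))

def run(args: list) -> str:
    """Run a command and return its stdout (empty string on failure)."""
    return subprocess.run(args, capture_output=True, text=True, check=False).stdout

if shutil.which("nvidia-smi"):
    gpus = count_gpus(run(["nvidia-smi", "-L"]))
    linked = has_nvlink(run(["nvidia-smi", "topo", "-m"]))
    print(f"GPUs detected: {gpus}, NVLink in topology: {linked}")
else:
    print("nvidia-smi not found on PATH")
```

If the script reports two GPUs but no NVLink, reseating the bridge is the first thing to try.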

Expected Performance: Ollama 70B on Dual 3090 NVLink

| Metric | Single 3090 (24GB, Q2 Quant) | Dual 3090 NVLink (48GB, Q4_K_M) |
|---|---|---|
| Llama 3.3 70B Quantization | Q2_K (poor quality) | Q4_K_M (good quality) |
| VRAM Used | ~22GB | ~42GB (split across both GPUs) |
| Prompt Processing (tok/sec) | ~80 | ~120 |
| Generation (tok/sec) | ~8 | ~18 |
| Max Context Length | ~4K tokens | ~16K tokens |
| Quality vs Full Precision | ~85% | ~96% |
| GPU Temp (water cooled) | 52C | 55C (top) / 52C (bottom) |
| VRAM Junction Temp | 65C | 68C (top) / 64C (bottom) |
| Noise at 1m | 28 dBA | 30 dBA |
| 24h Stability | Stable (but poor quality) | Stable (good quality) |

Based on community testing and manufacturer specifications -- actual results vary by loop configuration.

The dual 3090 build at 18 tok/sec with Q4_K_M quantization delivers a qualitatively different experience from a single 3090 forced to use Q2 quantization. The Q4 model produces more coherent reasoning, better code generation, and more accurate document analysis. At 18 tok/sec, the response speed is comfortable for interactive use -- not fast, but fast enough that you are reading the output as it generates rather than waiting for it.
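To check generation speed on your own build, Ollama's `/api/generate` endpoint reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in its final response. The sketch below is one way to turn those into tok/sec. It assumes a local Ollama server on the default port and a model tag of `llama3.3:70b`; adjust both to match your setup. The function names are illustrative.

```python
import json
import urllib.request

def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9

def generate(prompt: str, model: str = "llama3.3:70b",
             url: str = "http://localhost:11434/api/generate") -> dict:
    """Send one non-streaming request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)

# Example (requires a running Ollama server with the model pulled):
# result = generate("Explain NVLink in two sentences.")
# print(f"{tokens_per_second(result):.1f} tok/sec")
```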

Thermal and Noise Under 24-Hour Load

With a 480mm + 240mm radiator setup and a D5 pump at 50% PWM, the dual 3090 build runs at approximately 30 dBA measured at 1 meter. The top GPU runs 2-3C warmer than the bottom GPU in a series loop configuration. Both GPUs stay well below throttle thresholds.

Over 24 hours of continuous Ollama serving, coolant temperature stabilizes at approximately 35-38C with room temperature at 22-24C. Token generation speed remains consistent from hour 1 to hour 24. There is no thermal drift, no throttling, and no instability -- exactly what you need from a build that serves as your personal AI assistant or team-shared inference server.
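A long soak test is easier to trust with a log behind it. The sketch below polls `nvidia-smi` for GPU core temperature and power draw and reports drift from the first reading. It is a minimal example: the function names and polling interval are illustrative, and it assumes the driver reports numeric values for both fields.

```python
import csv
import io
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=index,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"]

def parse_samples(csv_text: str) -> list:
    """Parse nvidia-smi CSV rows into (gpu_index, temp_c, power_w) tuples."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        samples.append((int(row[0]), float(row[1]), float(row[2])))
    return samples

def temperature_drift(first: list, last: list) -> dict:
    """Temperature change per GPU between two sample sets, in degrees C."""
    start = {gpu: temp for gpu, temp, _ in first}
    return {gpu: temp - start[gpu] for gpu, temp, _ in last if gpu in start}

def monitor(interval_s: float = 60.0, rounds: int = 1440) -> None:
    """Poll nvidia-smi and print drift from the first reading (1440 x 60s = 24h)."""
    baseline = None
    for _ in range(rounds):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        samples = parse_samples(out)
        baseline = baseline or samples
        print(samples, temperature_drift(baseline, samples))
        time.sleep(interval_s)

# Call monitor() while Ollama is serving requests to record a 24-hour run.
```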

Common Pitfalls and How to Avoid Them

Buying 3090 Ti Instead of 3090

The RTX 3090 Ti does NOT have an NVLink connector. It looks similar, has better single-card performance, and costs about the same on the used market. But without NVLink, two 3090 Ti cards cannot pool their VRAM. They can still be used together via PCIe tensor parallelism (as Ollama and llama.cpp support), but the performance is significantly worse than NVLink: PCIe 4.0 provides 31.5 GB/s vs NVLink's 112.5 GB/s. Always verify NVLink connector presence before buying a used 3090 for a dual build.

Mismatched NVLink Bridge Slot Spacing

NVLink bridges come in 3-slot and 4-slot variants, referring to the physical distance between the two GPU cards. Your motherboard's PCIe slot spacing determines which bridge you need. Most ATX motherboards with two x16 slots space them 4 slots apart. Micro-ATX and some workstation boards may use 3-slot spacing. Measure or count slots before ordering the bridge.
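If you would rather measure than count, expansion slots sit on a 20.32mm (0.8 inch) pitch, so the center-to-center distance between the two x16 slots maps directly to a slot count. The helper below is an illustrative sketch of that conversion; its name and return strings are made up for the example.

```python
SLOT_PITCH_MM = 20.32  # standard expansion-slot pitch (0.8 inch)

def bridge_for_spacing(center_to_center_mm: float) -> str:
    """Map a measured slot-to-slot distance to the bridge size this guide covers."""
    slots = round(center_to_center_mm / SLOT_PITCH_MM)
    if slots in (3, 4):
        return f"{slots}-slot bridge"
    return f"no matching bridge for {slots}-slot spacing"

for measured_mm in (61.0, 81.3):
    print(f"{measured_mm} mm between slots -> {bridge_for_spacing(measured_mm)}")
```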

Insufficient PSU Wattage

Two 3090s under full AI load draw 600-700W for the GPUs alone. Add CPU, RAM, fans, pump, and motherboard power, and total system draw reaches 800-900W. A 1,000W PSU technically handles this, but with no headroom for transient power spikes during model loading. A 1,200W PSU is the practical minimum. Some builders run dual 850W PSUs with a sync adapter to split the load across two units.
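The headroom argument is simple arithmetic on the sustained draw quoted above. The sketch below expresses it as a percentage of PSU rating. The function name is illustrative, and transient spikes are not modeled.

```python
def psu_load_percent(system_watts: float, psu_watts: float) -> float:
    """Sustained system draw as a percentage of the PSU's rated output."""
    return system_watts / psu_watts * 100

for psu in (1000, 1200):
    low = psu_load_percent(800, psu)
    high = psu_load_percent(900, psu)
    print(f"{psu}W PSU: {low:.0f}-{high:.0f}% load at 800-900W sustained draw")
```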

Coolant Flow Issues in Long Loop Runs

A dual-GPU loop with 480mm + 240mm radiators and a distro plate has more restriction than a simple single-GPU loop. A DDC pump may struggle to keep flow up in 24/7 operation -- use a D5, which delivers a higher flow rate and handles the additional restriction without issue. If you notice temperature differences greater than 5C between the two GPUs in a series loop, increase pump speed or switch to parallel flow.

Ready to Build?

Find waterblocks with active backplate cooling for every major 3090 variant in our Used 3090 AI Revival collection. For the complete multi-GPU cooling solution -- 480mm radiator, D5 pump, distro plate, and all fittings -- explore the Sovereign AI Rig collection. Two used 3090s with NVLink and water cooling remain the most cost-effective path to running 70B models locally in 2026.
