Which NVIDIA GPU for Local AI in 2026? RTX 3090 vs 4060 Ti vs 4070 Ti Super vs 4090 vs 5090
The RTX 4090 is the best all-around GPU for running AI locally in 2026. It has 24GB of VRAM — enough to run 32-billion-parameter models at Q4 quantization — and pushes 104 tokens per second on an 8B model (Hardware Corner, 2025). At $1,599 new, it costs less than the 5090 while handling every mainstream local AI task from LLM chat to Stable Diffusion batch rendering without breaking a sweat.
But $1,599 isn't everyone's budget. A used RTX 3090 at $750 gets you the same 24GB VRAM with lower throughput. An RTX 4060 Ti 16GB at $399 lets you experiment with smaller models. And the new RTX 5090 with 32GB GDDR7 opens doors to model sizes no other consumer card can touch. This guide covers each GPU with real benchmark numbers, thermal data, and build recommendations — no guesswork, no filler.
Key Takeaways
- RTX 4090 delivers 104 tok/s on 8B models and fits 32B Q4 in 24GB VRAM — best value for serious local AI
- RTX 3090 at $750 used matches the 4090's VRAM capacity but runs 15-20% slower and needs a waterblock for VRAM thermals
- 16GB cards (4060 Ti, 4070 Ti Super) top out at ~14B parameters comfortably — fine for 7B-8B chat models and SDXL
- RTX 5090's 32GB GDDR7 is the only consumer GPU that runs 32B models at Q4 with room for long context windows
- Water cooling drops GPU temps 20-30°C under sustained AI load and eliminates fan noise for 24/7 home office use
GPU Specs at a Glance
VRAM and memory bandwidth are the two specs that matter most for local AI. VRAM determines the largest model you can load. Bandwidth determines how fast tokens come out. Everything else — CUDA cores, clock speeds — is secondary for inference workloads.
| GPU | VRAM | Bandwidth | TDP | CUDA Cores | Price (USD) |
|---|---|---|---|---|---|
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | 350W | 10,496 | ~$750 used |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 288 GB/s | 165W | 4,352 | ~$399 new |
| RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | 285W | 8,448 | ~$799 MSRP |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 450W | 16,384 | ~$1,599 new |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 575W | 21,760 | ~$1,999 new |
Notice the bandwidth gap. The 4060 Ti sits at 288 GB/s on a 128-bit bus — less than a third of the 4090's 1,008 GB/s. Same 16GB VRAM as the 4070 Ti Super, but it reads that memory 2.3x slower. For AI inference, where the GPU spends most of its time streaming model weights from VRAM, bandwidth is the bottleneck that decides your tokens-per-second.
The RTX 5090 stands apart with 32GB of GDDR7 at 1,792 GB/s. That's 1.78x the bandwidth of the 4090 and 6.2x the 4060 Ti. It's also 575 watts — the highest TDP of any consumer GPU — which has direct implications for cooling.
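A useful back-of-envelope check on why bandwidth dominates: generating one token on a memory-bound decoder means streaming roughly the entire set of weights out of VRAM, so bandwidth divided by model size gives a hard ceiling on tokens per second. The sketch below applies that rule to the spec table above, using an assumed ~5 GB for an 8B model at Q4; what you actually get is lower, but the ranking it predicts matches the measured numbers in the next section.

```python
# Back-of-envelope decode ceiling: each generated token streams (roughly) every
# model weight out of VRAM once, so memory bandwidth caps tokens per second.
# Bandwidth figures are from the spec table above; the 5 GB model size is an
# assumption for an ~8B-parameter model at Q4.

GPUS_GBPS = {
    "RTX 4060 Ti 16GB": 288,
    "RTX 4070 Ti Super": 672,
    "RTX 3090": 936,
    "RTX 4090": 1008,
    "RTX 5090": 1792,
}

MODEL_GB = 5.0  # ~8B params at Q4

for name, bandwidth in GPUS_GBPS.items():
    ceiling = bandwidth / MODEL_GB  # theoretical upper bound, tok/s
    print(f"{name:<18} ceiling ~{ceiling:4.0f} tok/s")
```

Comparing these ceilings with the measured results below puts real-world efficiency at roughly 40-60% of theoretical, which is why the tok/s ranking tracks bandwidth almost exactly.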
How Fast Can Each GPU Run a Local LLM?
The RTX 4090 generates 104 tokens per second on a Qwen3 8B model at Q4 quantization, and 69 tok/s on a 14B model (Hardware Corner, tested with llama.cpp on Ubuntu 24.04, CUDA 12.8). That's fast enough for real-time chat with a local model that rivals GPT-3.5-class quality.
| GPU | Qwen3 8B Q4 (tok/s) | Qwen3 14B Q4 (tok/s) | Prompt Processing 8B (tok/s) |
|---|---|---|---|
| RTX 4060 Ti 16GB | 34 | 22 | 1,481 |
| RTX 4070 Ti Super | 72 | 47 | 3,051 |
| RTX 3090 | 87 | 52 | 2,572 |
| RTX 4090 | 104 | 69 | 6,721 |
| RTX 5090 | 145 | 103 | 6,956 |
Source: Hardware Corner GPU Ranking for LLMs, tested with llama.cpp llama-bench, Q4_K_XL quantization, 16K context.
Look at the 4060 Ti column: 34 tok/s on 8B. That's usable for a casual chat — you'll see words appear at a readable pace — but not fast enough for batch processing or tool-use workflows where the model needs to think quickly. The 4070 Ti Super doubles that to 72 tok/s despite having the same 16GB VRAM. The difference? Memory bandwidth. The 4060 Ti's 128-bit bus is the choke point.
The RTX 3090, a card from 2020, still posts 87 tok/s on 8B — faster than the 4070 Ti Super. Its CUDA cores are two generations older, but its memory bandwidth is much higher (936 GB/s vs 672 GB/s). For pure LLM inference, the 3090 punches above its generation.
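The table above comes from llama.cpp's llama-bench harness. For a quick sanity check on your own card, a minimal sketch using the llama-cpp-python bindings works too (assuming the package is installed with CUDA support and you have a Q4 GGUF on disk; the filename below is a placeholder):

```python
# Quick decode-speed check with llama-cpp-python (pip install llama-cpp-python,
# built with CUDA). The GGUF path below is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-8b-q4_k_m.gguf",  # placeholder: point at your own GGUF
    n_gpu_layers=-1,                    # offload every layer to the GPU
    n_ctx=16384,                        # match the 16K context used above
    verbose=False,
)

start = time.perf_counter()
out = llm("Explain memory bandwidth in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```

A single-prompt timing like this won't match llama-bench exactly, since llama-bench separates prompt processing from generation, but it will tell you whether your card lands in the expected range.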
What's the Biggest Model Each Card Can Run?
VRAM is the hard wall. A model that doesn't fit won't run — or it spills to system RAM and drops to single-digit tok/s.
| VRAM | Max Model at Q4 | Max Model at FP16 | Cards |
|---|---|---|---|
| 16 GB | ~14B comfortably | ~7-8B | 4060 Ti 16GB, 4070 Ti Super |
| 24 GB | ~32B (tight) | ~12B | RTX 3090, RTX 4090 |
| 32 GB | ~32B with room | ~16B | RTX 5090 |
Qwen 2.5 32B at Q4 quantization takes about 20GB. It fits on a 24GB card with room for a moderate context window. On the 5090's 32GB, you get 12GB of headroom — enough for 32K+ context or running the model alongside other tools. Llama 3.1 70B at Q4 needs ~40GB. No single consumer GPU can run it. You'd need two GPUs or heavy CPU offloading, which kills throughput.
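If you want to estimate a model's footprint before downloading 20GB of weights, the arithmetic is simple: quantized weights take parameters times bits-per-weight divided by 8, and the KV cache grows linearly with context length. The sketch below uses illustrative architecture numbers for a 32B model with grouped-query attention (64 layers, 8 KV heads, head dim 128); read the real values from the model's config before trusting the result.

```python
# Rough VRAM estimate: quantized weights + FP16 KV cache + a little overhead.
# Architecture numbers are illustrative (roughly a 32B GQA model); read them
# from the model's config.json for anything you actually plan to run.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8          # params in billions -> GB

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, FP16 elements by default
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

w = weights_gb(32, 4.5)   # Q4 quants average roughly 4.5 bits per weight
kv = kv_cache_gb(n_layers=64, n_kv_heads=8, head_dim=128, context=32768)
print(f"weights ~{w:.0f} GB, KV cache at 32K ~{kv:.0f} GB, total ~{w + kv + 1:.0f} GB")
```

With these assumptions the weights alone come to roughly 18 GB and a 32K FP16 KV cache adds close to 9 GB more, which is why 32B at Q4 is tight on a 24GB card and comfortable on the 5090's 32GB.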
Which GPU for Stable Diffusion and Flux?
The RTX 5090 generates an SDXL image (1024x1024, 20 steps) in 2.2 seconds — roughly 2.5x faster than the 3090's 5.6 seconds and 5.5x faster than the 4060 Ti's 12 seconds (ComfyUI community benchmarks).
| GPU | SDXL 1024x1024 (sec) | Flux DEV FP16 (sec) |
|---|---|---|
| RTX 4060 Ti 16GB | ~12 | ~92 |
| RTX 3090 | ~5.6 | ~40 |
| RTX 4090 | ~3.2 | ~18 |
| RTX 5090 | ~2.2 | ~9 |
Sources: SDXL numbers from ComfyUI Discussion #2970. Flux numbers from Hardware Corner Flux GPU Guide and Furkan Gozukara's RTX 5090 wiki.
Flux is the model that separates the 16GB cards from the 24GB cards. Running Flux DEV at full FP16 precision needs 12GB+ VRAM. The 4060 Ti can technically do it, but at 92 seconds per image — that's a minute and a half for one picture. Drop to FP8 quantization and it speeds up to ~49 seconds, but quality takes a hit. The 3090 at 40 seconds FP16 and the 4090 at 18 seconds are in a different league entirely.
For SDXL, which is still the workhorse for most Stable Diffusion workflows, even the 4060 Ti is workable at 12 seconds per image. If you're generating one image at a time for a project, that's fine. Batch rendering 50 images for a product catalog? You want a 4090 or 5090.
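If you want to reproduce the SDXL timings on your own card, a minimal sketch with Hugging Face diffusers looks roughly like this (the community numbers above were gathered in ComfyUI, so expect small differences; the prompt is arbitrary and the model downloads on first run):

```python
# Rough SDXL timing check with diffusers (pip install diffusers transformers accelerate).
# The community benchmarks above come from ComfyUI, so treat this as a ballpark comparison.
import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

prompt = "a product photo of a custom water-cooled PC, studio lighting"
_ = pipe(prompt, num_inference_steps=20, height=1024, width=1024)   # warm-up run

start = time.perf_counter()
image = pipe(prompt, num_inference_steps=20, height=1024, width=1024).images[0]
print(f"1024x1024, 20 steps: {time.perf_counter() - start:.1f} s")
image.save("sdxl_test.png")
```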
Why Water Cooling Matters for AI Workloads
Running AI inference isn't like gaming. A game varies GPU load frame to frame — 60-90% utilization with dips during menus and cutscenes. AI inference pins your GPU at sustained high load for minutes or hours straight. That sustained heat is where stock coolers start to struggle, and where water cooling earns its cost.
RTX 3090: Water Cooling Is a Requirement, Not a Luxury
The RTX 3090 has a well-documented VRAM thermal problem. Under sustained compute workloads, the GDDR6X memory chips on the back of the PCB hit 110°C — Micron's thermal throttle limit (Tom's Hardware). The stock cooler was designed for gaming bursts, not 24/7 AI inference. A full-cover waterblock with an active backplate drops VRAM temps from 110°C to 62-70°C (Overclock.net). Without it, your 3090 will thermal throttle during any extended AI workload.
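If you're unsure whether a card is throttling during long runs, core temperature, power draw, and clocks are easy to log from Python with the NVML bindings. One caveat: NVML does not report the GDDR6X memory-junction temperature on consumer cards, so on a 3090 you still need a tool like HWiNFO for that specific reading; a sagging core clock at steady load is the usual sign of memory throttling. A minimal sketch:

```python
# Quick throttle check with NVML (pip install nvidia-ml-py). Core temp, power and
# clocks only -- the GDDR6X junction temperature is NOT exposed on consumer cards.
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):                                  # sample for ~10 seconds
    temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
    watts = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000       # reported in milliwatts
    clock = pynvml.nvmlDeviceGetClockInfo(gpu, pynvml.NVML_CLOCK_SM)
    print(f"core {temp:3d} C | {watts:5.0f} W | SM {clock} MHz")
    time.sleep(1)

pynvml.nvmlShutdown()
```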
RTX 5090: 575 Watts Need Real Cooling
The RTX 5090 pulls 575W — the highest TDP of any consumer GPU ever made. Stock cooler temps sit at 72-75°C for the GPU core and 88-90°C for the GDDR7 memory (GamersNexus). A custom waterblock drops the GPU to 48°C and the memory to 58°C — a 25°C and 30°C reduction respectively (TweakTown, from der8auer/Thermal Grizzly testing). Lower temps also mean slightly higher automatic boost clocks — 2,680 MHz vs 2,655 MHz in one test.
RTX 4090: Already Decent on Air, Better on Water
The 4090 runs at a manageable 67-72°C on the stock Founders Edition cooler (Tom's Hardware). It doesn't need water cooling to survive. But a waterblock drops it to the high 30s with near-silent operation — 31-32 dBA on a liquid-cooled MSI Suprim versus 45 dBA on the stock FE fan (Tom's Hardware). If you're running inference overnight in a bedroom or home office, that noise difference matters.
RTX 4060 Ti and 4070 Ti Super: Cool on Air
Both 16GB cards stay under 67°C on stock coolers even under sustained load (Tom's Hardware). The 4060 Ti only draws 165W — most aftermarket models barely spin their fans. Water cooling these cards is about noise elimination and aesthetics, not thermal necessity.
The Noise Factor
Stock air coolers run at 32-49 dBA under AI workloads depending on the card. A custom water loop drops that to roughly 25 dBA — pump hum only. The difference between hearing a persistent fan whine and near-silence. For a machine running 24/7 in your workspace, that's not a trivial detail.
Recommended Builds by Budget
Budget Build — $1,200 Total
| Component | Pick | Price |
|---|---|---|
| GPU | RTX 3090 (used) | $750 |
| CPU | AMD Ryzen 5 7600 | $180 |
| RAM | 32GB DDR5-5600 | $80 |
| PSU | 750W 80+ Gold | $90 |
| Storage | 1TB NVMe SSD | $70 |
Runs: 8B-14B models at 50-87 tok/s, SDXL in 5.6 seconds. The CPU doesn't matter much for inference — the GPU does all the work. Spend the saved CPU money on a waterblock. A used 3090 without water cooling will thermal throttle during extended AI sessions. Budget at least $100-150 for a Bykski or Barrow full-cover block with an active backplate.
Mid-Range Build — $2,500 Total
| Component | Pick | Price |
|---|---|---|
| GPU | RTX 4090 | $1,600 |
| CPU | AMD Ryzen 7 7700X | $250 |
| RAM | 64GB DDR5-5600 | $150 |
| PSU | 850W 80+ Gold | $110 |
| Storage | 2TB NVMe SSD | $120 |
Runs: 32B Q4 models at 37 tok/s, Flux in 18 seconds, SDXL batch rendering. 64GB system RAM is useful when loading large models — the CPU can assist with layers that don't fit in VRAM. This is the sweet spot for someone who uses local AI as a daily tool, not just a weekend experiment. Water cooling is optional here but recommended for noise.
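The "CPU can assist" part works through llama.cpp's layer offloading: you choose how many transformer layers live on the GPU and the rest execute from system RAM. A minimal sketch with llama-cpp-python, with the filename and layer count purely illustrative:

```python
# Partial GPU offload with llama-cpp-python: layers that don't fit in VRAM stay on
# the CPU and run from system RAM (slower, but the model still loads).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-32b-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=48,   # illustrative: 48 layers on the GPU, the rest in system RAM
    n_ctx=8192,
)

out = llm("Why does memory bandwidth matter for LLM inference?", max_tokens=128)
print(out["choices"][0]["text"])
```

Every layer left on the CPU costs throughput, so treat this as a way to fit an occasional oversized model, not a daily-driver configuration.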
High-End Build — $4,000+
| Component | Pick | Price |
|---|---|---|
| GPU | RTX 5090 | $2,000 |
| CPU | AMD Ryzen 9 9900X | $400 |
| RAM | 64GB DDR5-6000 | $180 |
| PSU | 1000W 80+ Platinum | $170 |
| Storage | 2TB NVMe SSD | $120 |
Runs: 32B models at Q4 with full context windows, Flux in 9 seconds, the biggest models that fit in 32GB. The 1000W PSU isn't optional — the 5090 alone can draw 575W. Water cooling is strongly recommended at this TDP level, both for thermals and noise. Running a 575W GPU on air in a home office means constant fan noise.
Entry-Level — $400 GPU Only
The RTX 4060 Ti 16GB at $399 is the cheapest way to get 16GB of VRAM for local AI. It runs 7B-8B models at a usable 34 tok/s and handles SDXL in 12 seconds. Don't expect fast Flux rendering (92 seconds per image at FP16) or smooth performance with 14B+ models. The 128-bit memory bus is the fundamental limit — no amount of software optimization fixes a narrow pipe. Good for learning and experimenting. Not for production use.
Which GPU Should You Buy?
Here's the short version:
- Just experimenting with local AI? RTX 4060 Ti 16GB ($399). Runs 8B models, SDXL, basic Flux. You'll know within a month whether you want more.
- Serious about running AI daily? RTX 4090 ($1,599). Best all-around. 24GB handles 32B models at Q4. Fast enough for everything except the very largest models.
- Need the maximum VRAM a consumer card offers? RTX 5090 ($1,999). 32GB lets you run 32B models at Q4 with headroom for long conversations and large context windows. Nothing else in this price range offers that.
- On a tight budget? Used RTX 3090 ($750). Same 24GB VRAM as the 4090 at half the price. 15-20% slower on inference. Factor in $100-150 for a waterblock — the 3090's VRAM thermals are a known problem under sustained load.
- Already have a 4070 Ti Super? Keep it. 16GB and 672 GB/s bandwidth is solid for 8B-14B models. Only upgrade if you need 24GB+ VRAM for larger models.
FormulaMod Waterblock Compatibility
If you decide to water cool your GPU, FormulaMod stocks full-cover waterblocks for all five GPUs covered in this guide. Bykski and Barrow blocks for 500+ card models — reference designs and AIB variants. The RTX 3090 active backplate, specifically, is worth looking at if you're buying a used 3090 for AI work.
Search by your exact card model in the GPU Waterblocks collection. For RTX 50 series blocks specifically, see the 50 Series collection. Volume pricing and B2B inquiries are handled through the Enterprise page.
Frequently Asked Questions
How much VRAM do I need for local AI?
16GB runs 7B-8B models at full speed and handles SDXL image generation. 24GB lets you run 32B models at Q4 quantization — a meaningful quality jump for reasoning and coding tasks. 32GB (RTX 5090 only) gives you headroom for 32B models with long context windows. For most users running Llama, Qwen, or DeepSeek distilled models, 24GB is the sweet spot.
Is the RTX 3090 still worth buying for AI in 2026?
Yes, if you buy used at $700-800 and budget for a waterblock. It has 24GB VRAM — same as the $1,599 RTX 4090 — and generates 87 tok/s on 8B models (Hardware Corner). The catch: its GDDR6X memory hits 110°C under sustained AI load with stock cooling (Tom's Hardware). A full-cover waterblock with active backplate is effectively mandatory.
Does water cooling improve AI inference speed?
Not directly — inference speed is determined by memory bandwidth, not temperature. But it prevents thermal throttling, which can degrade performance. The RTX 3090 throttles at 110°C VRAM junction temp, losing throughput. The RTX 5090 sees a small automatic boost clock increase (2,655 MHz to 2,680 MHz) from lower temps. The bigger win is noise reduction: stock fans run 32-49 dBA under AI load, while a water loop drops to ~25 dBA.
Can I run Llama 70B on a single GPU?
Not at standard quantization. Llama 3.1 70B at Q4 needs ~40GB of VRAM. The largest consumer GPU — RTX 5090 — has 32GB. At extreme quantization (Q2/Q3) it might squeeze into 32GB, but quality degrades noticeably. Practical options: run it across two GPUs, offload layers to system RAM (slow — expect single-digit tok/s), or use a smaller model. DeepSeek-R1-Distill-Qwen-32B at Q4 (~20GB) offers strong reasoning at a size that fits on any 24GB card.
RTX 4060 Ti 8GB vs 16GB — which for AI?
The 16GB version. The 8GB 4060 Ti can barely load a 7B model at Q4 (~5GB model + overhead). The 16GB version gives you room for 14B models and comfortable SDXL rendering. Both share the same compute hardware and 128-bit bus, so inference speed is identical on models that fit in 8GB. But most useful local AI models need 10-16GB, making the 8GB version a dead end. Pay the $50-100 premium for 16GB.
Last updated: April 2026. Benchmark data from Hardware Corner, ComfyUI Community Benchmarks, GamersNexus, and Tom's Hardware. Prices reflect approximate US market rates at time of writing.
Related Articles
- 12VHPWR Safer RTX 5090 Build: Why Water Cooling Lets You Skip the Cable Drama
- Dual RTX 3090 NVLink for 70B LLMs: The Cooling Guide
- H100 Water Cooling Guide: Liquid Cooling for AI Research GPUs
- RTX 4090 Water Cooling Guide: Silence Your 450W AI Workhorse