Best GPU for Stable Diffusion and Flux in 2026: RTX 4060 Ti vs 3090 vs 4090 vs 5090 Benchmarks
The RTX 4090 is the best GPU for Stable Diffusion in 2026 if you have $1,600 to spend. It handles every model at native precision without compromise. If your budget is tighter, a used RTX 3090 at $700–$800 is the single best value card for AI image generation today.
Key Takeaways
- VRAM is the bottleneck, not compute. Flux at FP16 needs 12 GB minimum. Training LoRAs on SDXL needs 16 GB+. Cards with 8 GB are a dead end.
- The RTX 4090 generates SDXL images in 3.2 seconds and Flux images in 18 seconds at native FP16. Nothing else under $2,000 comes close to this ratio of speed and VRAM headroom.
- A used RTX 3090 at ~$750 delivers 80% of the capability at half the price. Its 24 GB of VRAM handles every current model, including Flux at full precision.
- The RTX 5090 is the raw-speed king at 2.2s per SDXL image and 9s per Flux image, but at a $2,000+ street price, the per-dollar improvement over the 4090 is marginal.
- The RTX 4060 Ti 16 GB is a trap. It has enough VRAM to load big models but a 128-bit memory bus that makes everything 2–3x slower than it should be.
SDXL Benchmark Comparison
SDXL (Stable Diffusion XL) is still the workhorse model for most image generation workflows. It runs well on modest hardware, but GPU speed differences are stark when you look at the numbers. These benchmarks use SDXL 1.0 at 1024×1024 resolution, 20 sampling steps, measured in ComfyUI with community-standard settings.
| GPU | VRAM | SDXL 1024×1024 (20 steps) | Relative Speed | Street Price (Apr 2026) |
|---|---|---|---|---|
| RTX 4060 Ti 16GB | 16 GB | ~12.0s | 1.0x (baseline) | $380–$420 |
| RTX 3090 | 24 GB | ~5.6s | 2.1x | $700–$800 (used) |
| RTX 4090 | 24 GB | ~3.2s | 3.75x | $1,500–$1,700 |
| RTX 5090 | 32 GB | ~2.2s | 5.45x | $2,000–$2,200 |
Sources: ComfyUI community benchmarks, Prompting Pixels (avg it/s: 4060 Ti = 9.6, 3090 = 15.9, 5090 = 41.5)
The 4090 is 3.75x faster than the 4060 Ti, and its throughput is only 31% below the 5090's despite costing $500 less. The 3090, a two-generation-old card, still delivers 2.1x the speed of the 4060 Ti. For SDXL specifically, both the 3090 and 4090 sit in a comfortable zone where generation feels near-instant during interactive use.
The 5090 pulls ahead in absolute terms, but the jump from 3.2s to 2.2s is less noticeable in practice during single-image workflows. Where the 5090 matters is batch rendering, which we will cover below.
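The timings above come from ComfyUI, but if you want to sanity-check your own card, a minimal script against the Hugging Face diffusers library reproduces the same workload (1024×1024, 20 steps). This is a rough sketch, not the benchmark harness behind the table, so expect absolute numbers to differ slightly from ComfyUI's:

```python
# Minimal SDXL timing sketch using Hugging Face diffusers.
# Reproduces the article's workload (1024x1024, 20 steps) as a
# rough sanity check; ComfyUI timings will differ slightly.
import time

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Warm-up run: the first generation includes one-time caching overhead.
pipe("warm-up", num_inference_steps=20, height=1024, width=1024)

torch.cuda.synchronize()
start = time.perf_counter()
image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=20,
    height=1024,
    width=1024,
).images[0]
torch.cuda.synchronize()
print(f"SDXL 1024x1024, 20 steps: {time.perf_counter() - start:.1f}s")
```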
Flux Generation Speed by GPU
Flux changed the game when it shipped in mid-2024. It produces noticeably better images than SDXL—sharper text rendering, better anatomy, more coherent compositions—but it demands significantly more from your hardware. The model is larger, the architecture is heavier, and the VRAM requirements are brutal at native precision.
| GPU | Flux DEV FP16 (20 steps, 1MP) | Flux DEV FP8 | Flux DEV NF4 |
|---|---|---|---|
| RTX 4060 Ti 16GB | ~92s | ~49s | ~47s |
| RTX 3090 | ~40s | ~30s | — |
| RTX 4090 | ~18s | ~14s | — |
| RTX 5090 | ~9s | ~6s | — |
Sources: Hardware Corner, Furkan Gozukara benchmarks. 1MP = 1 megapixel output.
Flux at FP16 is where the 4060 Ti falls apart. At 92 seconds per image, it is borderline unusable for iterative creative work. Even at NF4 quantization (which degrades output quality), it still takes 47 seconds—longer than the 3090 at full FP16.
The 5090 is more than 4x faster than the 3090 on Flux and generates a full-quality FP16 image in 9 seconds. At FP8 quantization, it drops to 6 seconds, fast enough for rapid-fire prompt iteration.
If Flux is your primary model, the 4090 is the minimum card where the workflow feels responsive. Anything below it introduces enough latency to break creative flow.
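For reference, here is roughly what the full-precision-versus-offload tradeoff looks like in code, using the diffusers FluxPipeline. This is a sketch only; quantized FP8/NF4 runs are typically done with separate checkpoint files in ComfyUI, which this does not cover:

```python
# Loading Flux DEV with diffusers. Full-precision weights want ~24 GB;
# on smaller cards, CPU offload trades speed for fitting in VRAM,
# which is part of why low-VRAM cards post much slower times.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,  # Flux ships in bf16
)

# On a 3090/4090/5090 (24 GB+): keep everything resident on the GPU.
# pipe.to("cuda")

# On 12-16 GB cards: offload idle submodules to system RAM instead.
pipe.enable_model_cpu_offload()

image = pipe(
    "a neon sign that reads OPEN, photorealistic",
    num_inference_steps=20,
    height=1024,
    width=1024,
).images[0]
image.save("flux_test.png")
```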
VRAM Requirements: When 16 GB Is Not Enough
Every discussion about Stable Diffusion GPUs eventually comes down to VRAM. Here is what each model actually needs:
| Workflow | Minimum VRAM | Comfortable VRAM | Notes |
|---|---|---|---|
| SDXL inference (1024×1024) | 8 GB | 12–16 GB | Works on 8 GB with aggressive offloading; 16 GB avoids swapping |
| Flux DEV FP16 | 12 GB | 24 GB | Will OOM on 8 GB cards. 12 GB is tight—no room for ControlNet |
| Flux DEV FP8 / NF4 | 8 GB | 16 GB | Quantization reduces quality but makes it possible on smaller cards |
| SD 3.5 Medium FP8 | 8 GB | 16 GB | Lighter than Flux but still benefits from VRAM headroom |
| SDXL LoRA training | 16 GB | 24 GB | Training eats VRAM for optimizer states and gradients |
| Flux LoRA training | 24 GB | 32 GB+ | Practically requires a 3090/4090 minimum |
The pattern is clear: 8 GB cards are a dead end for anything beyond basic SDXL. The 4060 Ti's 16 GB lets you load Flux, but as the benchmarks show, loading and running are different things. The 3090 and 4090 with 24 GB each handle every current model at native precision and have headroom for ControlNet, IP-Adapter, and other pipeline extensions that pile on VRAM usage.
If you plan to train LoRAs—and most serious users eventually do—24 GB is the floor. The 5090's 32 GB gives genuine breathing room for Flux fine-tuning, but at a steep price premium.
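If you are unsure where your own workflow sits in the table above, PyTorch's allocator statistics give a quick peak-VRAM reading. A minimal sketch (note that these stats only count PyTorch tensors; the CUDA context itself adds more on top):

```python
# Rough peak-VRAM check for a pipeline run. torch's allocator stats
# only cover PyTorch tensors (the CUDA context adds ~0.5-1 GB on top),
# but they show clearly when a workflow is near the card's limit.
import torch

def report_peak_vram(label: str) -> None:
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{label}: peak {peak_gb:.1f} GB of {total_gb:.1f} GB")

torch.cuda.reset_peak_memory_stats()
# ... run one generation here (SDXL, Flux, ControlNet stack, etc.) ...
report_peak_vram("after one generation")
```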
Batch Rendering: How Many Images Per Hour?
Single-image speed is what reviewers benchmark, but real production work is batch rendering. You queue 50–200 images with different prompts, seeds, or ControlNet inputs, walk away, and come back. Here the speed differences compound.
| GPU | SDXL per image | 50 images | Images/hour | Flux FP16 per image | 50 Flux images |
|---|---|---|---|---|---|
| RTX 4060 Ti | 12.0s | 10 min 0s | 300 | 92s | 76 min 40s |
| RTX 3090 | 5.6s | 4 min 40s | 643 | 40s | 33 min 20s |
| RTX 4090 | 3.2s | 2 min 40s | 1,125 | 18s | 15 min 0s |
| RTX 5090 | 2.2s | 1 min 50s | 1,636 | 9s | 7 min 30s |
At 50 Flux images, the 4060 Ti takes over an hour and fifteen minutes. The 4090 finishes in 15 minutes. That is the difference between a quick lunch break and losing your entire afternoon.
For studios running overnight batch jobs (character sheet variations, texture generation for game assets, product mockups), the 4090 and 5090 produce roughly 4–5x the output of a 4060 Ti on SDXL and 5–10x on Flux in the same time window. Over weeks, that throughput difference translates directly into productivity and, for commercial users, revenue.
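A batch queue is trivial to script. The sketch below assumes a pipeline object `pipe` has already been loaded (as in the earlier snippets) and mirrors how the images-per-hour figures in the table are derived:

```python
# Batch-rendering sketch: queue N seeds against one prompt and measure
# throughput. Assumes `pipe` is an already-loaded SDXL or Flux pipeline.
import time

import torch

prompt = "product mockup of a ceramic mug, studio lighting"
seeds = range(50)

start = time.perf_counter()
for seed in seeds:
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, num_inference_steps=20, generator=generator).images[0]
    image.save(f"batch_{seed:04d}.png")

elapsed = time.perf_counter() - start
print(f"{len(seeds)} images in {elapsed / 60:.1f} min "
      f"({3600 * len(seeds) / elapsed:.0f} images/hour)")
```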
The 128-Bit Bus Problem: Why the 4060 Ti Is Slow Despite 16 GB
On paper, the RTX 4060 Ti 16 GB looks ideal for AI work. Sixteen gigabytes of VRAM at $400—what is not to like? The problem is memory bandwidth.
The 4060 Ti uses a 128-bit memory bus with 288 GB/s of bandwidth. The 4070 Ti Super, which also has 16 GB, uses a 256-bit bus with 672 GB/s. That is 2.3x more bandwidth for the same VRAM capacity. The 4090 pushes 1,008 GB/s on a 384-bit bus. The 5090 hits 1,792 GB/s on a 512-bit bus.
Stable Diffusion and Flux are bandwidth-hungry workloads. The denoising loop reads and writes large tensors every step. When the bus cannot feed data to the compute units fast enough, the GPU sits idle waiting for memory. This is why the 4060 Ti benchmarks 2–3x slower than cards with similar or even lower raw compute—it is perpetually starved for data.
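A back-of-envelope calculation shows the scale of the problem. Assume, as a simplification, that the denoiser's weights (roughly 2.6B FP16 parameters for the SDXL UNet, about 5.2 GB) must stream from VRAM once per step; dividing by memory bandwidth then gives a hard floor on step time. Real workloads add activation traffic and compute on top, so treat these numbers as illustrative only:

```python
# Back-of-envelope: lower bound on per-step time from weight traffic alone.
# Assumes the denoiser's ~2.6B FP16 parameters (SDXL UNet) are streamed
# from VRAM once per step -- a simplification, but it shows the scaling.
WEIGHT_BYTES = 2.6e9 * 2  # ~2.6B params x 2 bytes (FP16) = ~5.2 GB

for name, bandwidth_gbs in [
    ("RTX 4060 Ti", 288),
    ("RTX 3090", 936),
    ("RTX 4090", 1008),
    ("RTX 5090", 1792),
]:
    step_ms = WEIGHT_BYTES / (bandwidth_gbs * 1e9) * 1e3
    print(f"{name}: >= {step_ms:.1f} ms/step, "
          f"{step_ms * 20 / 1e3:.2f} s floor for 20 steps")
```

On these assumptions, the 4060 Ti's floor is about 18 ms per step against roughly 5 ms for the 4090, a 3.5x gap that closely tracks the 3.75x SDXL difference measured above.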
The 4060 Ti's VRAM capacity lets it load Flux at FP16, but its bandwidth means it runs Flux at a crawl. If you are considering a 4060 Ti for AI image generation, save up for a used 3090 instead. You get 24 GB of VRAM, 936 GB/s of bandwidth, and roughly 2x the actual generation speed—for about twice the price, which makes the cost per image nearly identical.
Cooling for Long Rendering Sessions
Generating a single image takes seconds. Rendering a batch of 200 takes minutes to hours. During sustained loads, GPU temperatures climb and stay elevated. The thermal behavior of these cards matters more for AI workloads than for gaming, because rendering sessions can run continuously for hours.
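If you want to watch this happen on your own card, logging temperature and power during a batch run takes a few lines of Python against NVIDIA's NVML bindings (the `nvidia-ml-py` package). One caveat: memory-junction temperature, the figure that matters most for the 3090, is not exposed through NVML on consumer cards; tools like HWiNFO or GPU-Z are needed to read it.

```python
# Log GPU core temperature and power during a long render, via NVML
# (pip install nvidia-ml-py). Note: memory-junction temperature is not
# exposed through NVML on consumer cards -- use HWiNFO or GPU-Z for that.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        temp_c = pynvml.nvmlDeviceGetTemperature(
            handle, pynvml.NVML_TEMPERATURE_GPU)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
        print(f"GPU {temp_c} C, {power_w:.0f} W")
        time.sleep(5)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```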
RTX 3090: The Known Problem
The 3090 is notorious for its VRAM thermals. Tom's Hardware and multiple reviewers have documented GDDR6X temperatures hitting 110°C under sustained load with the stock cooler—right at the thermal limit. At that temperature, the card throttles to protect itself, and long rendering sessions become inconsistent.
With a full-coverage GPU waterblock, VRAM temperatures drop to the 62–70°C range. That is a 40–48°C reduction. The card stops throttling, maintains consistent clock speeds, and the fan noise disappears entirely. For a card that many AI users run 8–12 hours a day, water cooling is not a luxury—it is the difference between reliable operation and thermal shutdowns.
RTX 5090: Better Stock, Even Better Cooled
The 5090 runs cooler at stock than the 3090 did—72–75°C GPU and 88–90°C VRAM under sustained AI workloads. That is livable, but VRAM is still running warm. Der8auer's testing with a custom waterblock showed 48°C GPU and 58°C VRAM—a 24–32°C drop that gives significant thermal headroom for overclocking or extended sessions in warm environments.
RTX 4090: The Easy One
The 4090 is the least problematic thermally. Stock temps sit at 67–72°C, and with a waterblock, it drops to the high 30s. If you are running a 4090 for AI image generation, water cooling is a nice-to-have rather than a necessity, but it does eliminate fan noise during those long overnight batch runs.
| GPU | Stock GPU Temp | Stock VRAM Temp | Waterblock GPU Temp | Waterblock VRAM Temp |
|---|---|---|---|---|
| RTX 3090 | ~80°C | ~110°C | ~62–70°C | ~62–70°C |
| RTX 4090 | 67–72°C | ~72°C | ~38°C | ~45°C |
| RTX 5090 | 72–75°C | 88–90°C | ~48°C | ~58°C |
Sources: Tom's Hardware (3090), der8auer (5090), various reviews (4090).
If you are running a 3090 for Stable Diffusion—and a lot of people are, given the price-to-performance ratio—a full-coverage GPU waterblock should be considered essential, not optional. The card was designed in 2020 for gaming bursts, not for the sustained compute loads that AI image generation demands.
FormulaMod carries full-coverage GPU waterblocks for RTX 3090, 4090, and 5090 reference and AIB designs. Copper base, full VRAM contact, compatible with standard G1/4 fittings.
Browse GPU Waterblocks
Frequently Asked Questions
Is 8 GB of VRAM enough for Stable Diffusion in 2026?
For basic SDXL inference at 1024×1024, yes—barely. You can generate images but you will be constantly managing VRAM with model offloading. For Flux at FP16, no. For LoRA training, no. If you are buying a GPU specifically for AI image generation in 2026, 16 GB is the minimum and 24 GB is the realistic target.
Should I buy a used RTX 3090 or a new RTX 4060 Ti 16 GB?
The used 3090, without question. It has 24 GB of VRAM (vs 16 GB), 936 GB/s of bandwidth (vs 288 GB/s), and benchmarks 2x faster on SDXL and 2.3x faster on Flux. The 3090 costs roughly twice as much as a 4060 Ti, but it generates images in half the time, so the cost per image is comparable, and you get a card that can handle every model at full precision.
Is the RTX 5090 worth it over the 4090 for Stable Diffusion?
For most home users, no. The 5090 is about 45% faster than the 4090 on SDXL and roughly twice as fast on Flux, but it costs 25–40% more. The 4090's 24 GB of VRAM handles every current model at full precision, and its single-image times are already fast enough for interactive work. The 5090's 32 GB is genuinely useful for Flux LoRA training and heavy batch rendering, but for typical inference workflows, the 4090 remains the better value.
Do I need water cooling for my GPU if I run Stable Diffusion?
It depends on the card and your workload. For the RTX 3090, water cooling is strongly recommended—VRAM hits 110°C at stock during sustained loads, which causes throttling and reduces card lifespan. For the 4090, it is optional but nice for noise reduction. For the 5090, it helps keep VRAM temps in check during long batch sessions. If you regularly run renders for more than 30 minutes at a time, a GPU waterblock pays for itself in consistent performance and quieter operation.
What about AMD GPUs for Stable Diffusion?
AMD GPUs have made progress with ROCm support, and cards like the RX 7900 XTX offer 24 GB of VRAM at competitive prices. However, the Stable Diffusion and ComfyUI ecosystems are still heavily optimized for CUDA. Many extensions, custom nodes, and optimization techniques (like TensorRT acceleration) are NVIDIA-only. If AI image generation is your primary use case, NVIDIA remains the safer choice in 2026 due to broader software compatibility.
About FormulaMod — We manufacture and sell custom PC water cooling components including GPU waterblocks, CPU blocks, radiators, fittings, and tubing. Based in Shenzhen, shipping worldwide. Visit us at formulamod.net.
Benchmark data in this article is sourced from ComfyUI community benchmarks, Prompting Pixels, Hardware Corner, Furkan Gozukara, Tom's Hardware, and der8auer. All prices are approximate street prices as of April 2026.
Shop Water Cooling Components
Related Articles
- 12VHPWR Safer RTX 5090 Build: Why Water Cooling Lets You Skip the Cable Drama
- Dual RTX 3090 NVLink for 70B LLMs: The Cooling Guide
- H100 Water Cooling Guide: Liquid Cooling for AI Research GPUs
- RTX 4090 Water Cooling Guide: Silence Your 450W AI Workhorse