Models grow 3x every year. Your GPU stays the same.
AI picks the optimal recipe. Pruning, quantization, and MoE surgery in one pipeline.
Profiling model... MoE detected (16 experts, top-2)
AI Route: expert_prune(8/16) + Q4_K_M
Pruning experts: 16 → 8 (removing lowest-impact)
Quantizing: FP16 → Q4_K_M
Converting to GGUF...
Done! 218 GB → 15.8 GB (13.8x compression)
Quality retained: 82.1% (MMLU) | 28.3 tok/s on RTX 4090
→ ollama run smelt/llama4-scout-109b-q4km
Parameter frontier
Consumer GPU VRAM
Fragmented ecosystems
Running uncompressed
The Problem
Frontier models scale from 70B to 671B parameters. Consumer GPUs are stuck at 24GB VRAM. The tools between them are fragmented across 6 incompatible ecosystems.
70B → 405B → 671B. Each generation demands more VRAM than any consumer card provides.
The RTX 4090 shipped with 24 GB in 2022. The RTX 5090 ships with 32 GB. Models grew 9x in the same window.
GGUF, GPTQ, AWQ, EXL2, ONNX, TensorRT. Each has its own tools, formats, and quality tradeoffs.
How It Works
Tell Smelt your model and target hardware. It analyzes architecture, parameter count, and MoE topology.
AI selects the optimal compression pipeline — quantization level, pruning targets, expert cuts — calibrated to your VRAM.
One command. Output: a GGUF file that runs on Ollama, llama.cpp, or LM Studio. Quality report included.
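The VRAM calibration in step two comes down to a bits-per-weight budget: how many bits per weight a quantization level costs, and whether the result fits your card with runtime headroom. A minimal sketch of that math — the bits-per-weight table reflects typical llama.cpp K-quant sizes, and the 20% overhead factor is an illustrative assumption, not Smelt's actual heuristic:

```python
# Back-of-envelope fit check for quantized models.
# Bits-per-weight values are typical for llama.cpp quant formats;
# the 1.2x overhead factor (KV cache, runtime buffers) is an assumption.
QUANT_BPW = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def est_size_gb(params_b: float, quant: str) -> float:
    """Estimated weight size in decimal GB for params_b billion parameters."""
    return params_b * 1e9 * QUANT_BPW[quant] / 8 / 1e9

def fits(params_b: float, quant: str, vram_gb: float, overhead: float = 1.2) -> bool:
    """True if quantized weights plus ~20% runtime overhead fit in VRAM."""
    return est_size_gb(params_b, quant) * overhead <= vram_gb

# A 70B model at Q4_K_M needs ~42 GB of weights alone -- too big for a
# 24 GB card -- while an 8B model at the same quant fits comfortably.
print(fits(70, "Q4_K_M", 24), fits(8, "Q4_K_M", 24))
```

This is why the pipeline reaches for pruning and expert cuts on top of quantization: past a certain parameter count, no quantization level alone gets under a 24 GB ceiling.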
Try it now
Select your hardware and use case. We'll show you the best open-source models that fit — ranked by quality.
No models fit in 16 GB for this task. Try a larger VRAM budget.
Features
Six compression techniques in one pipeline. From basic quantization to advanced MoE surgery.
Proof
Every compression job produces a quality report with before/after benchmarks. No guessing — you see exactly what you get.
8 model families supported, covering 95% of local LLM usage
| Metric | Value |
| --- | --- |
| Input | Llama 4 Scout 109B |
| Architecture | MoE (16 experts, top-2) |
| Original Size | 218 GB |
| Pipeline | expert_prune(8/16) + Q4_K_M |
| Output Size | 15.8 GB |
| Format | GGUF (Q4_K_M) |
| Compression | 13.8x |
| Quality (MMLU) | 82.1% retained |
| Perplexity | 5.42 → 6.18 (+14%) |
| Throughput | 28.3 tok/s on RTX 4090 |
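The headline sizes above are easy to sanity-check: FP16 stores two bytes per weight, so 109B parameters is 218 GB of weights, and 218 / 15.8 works out to the quoted 13.8x ratio. A quick check (decimal gigabytes, weights only):

```python
# Sanity-check the report's numbers: FP16 = 2 bytes/weight, decimal GB.
params = 109e9                     # Llama 4 Scout total parameter count
fp16_gb = params * 2 / 1e9         # 2 bytes per weight at FP16
ratio = 218 / 15.8                 # original size vs. compressed size

print(round(fp16_gb), round(ratio, 1))  # 218 13.8
```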
Pricing
Why Smelt
6 incompatible compression ecosystems with no standard recipe
50%+ of production deployments run uncompressed because tooling is too complex
No single tool combines pruning, quantization, and MoE surgery
AI picks the optimal recipe. Pruning, quantization, and MoE surgery in one pipeline.