Llama 4 Scout just dropped. Run it locally today.
The first MoE expert surgery tool. Compress any LLM to fit your hardware.
Profiling model... MoE detected (16 experts, top-2)
AI Route: expert_prune(8/16) + Q4_K_M
Pruning experts: 16 → 8 (removing lowest-impact)
Quantizing: FP16 → Q4_K_M
Converting to GGUF...
Done! 218GB → 15.8GB (13.8x compression)
Quality retained: 82.1% (MMLU) | 28.3 tok/s on RTX 4090
→ ollama run smelt/llama4-scout-109b-q4km
Parameter frontier
Consumer GPU VRAM
Fragmented ecosystems
Running uncompressed
The Problem
Frontier models scale from 70B to 671B parameters. Consumer GPUs are stuck at 24GB VRAM. The tools between them are fragmented across six incompatible ecosystems.
70B → 405B → 671B. Each generation demands more RAM than any consumer card provides.
RTX 4090 shipped at 24GB in 2022. RTX 5090 ships at 32GB. Models grew 9x in the same window.
GGUF, GPTQ, AWQ, EXL2, ONNX, TensorRT. Each has its own tools, formats, and quality tradeoffs.
How It Works
Tell Smelt your model and target hardware. It analyzes architecture, parameter count, and MoE topology.
AI selects the optimal compression pipeline — quantization level, pruning targets, expert cuts — calibrated to your VRAM.
One command. Output: a GGUF file that runs on Ollama, llama.cpp, or LM Studio. Quality report included.
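The pipeline selection above boils down to a memory-budget calculation: pick the highest-quality quantization whose weights fit your card, with headroom for KV cache and activations. A minimal sketch of that idea (illustrative only, not Smelt's actual heuristics; the bytes-per-weight figures are rough approximations):

```python
# Approximate bytes per weight for common GGUF quantization levels,
# ordered from highest quality to smallest footprint.
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8_0": 1.06,
    "Q6_K": 0.82,
    "Q5_K_M": 0.71,
    "Q4_K_M": 0.60,
    "Q3_K_M": 0.49,
}

def pick_quant(params_b: float, vram_gb: float, overhead_gb: float = 2.0):
    """Return the highest-quality quant level whose weights fit in VRAM,
    reserving `overhead_gb` for KV cache and activations."""
    budget_bytes = (vram_gb - overhead_gb) * 1e9
    for level, bpw in BYTES_PER_WEIGHT.items():
        if params_b * 1e9 * bpw <= budget_bytes:
            return level
    return None  # nothing fits: prune experts or pick a smaller model

print(pick_quant(8, 8))    # an 8B model on an 8GB card -> Q5_K_M
print(pick_quant(70, 24))  # a dense 70B on 24GB -> None, even at Q3
```

When nothing fits, that's exactly where expert pruning enters the pipeline: shrink the parameter count first, then quantize what remains.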
Try it now
Select your hardware and use case. We'll show you the best open-source models that fit — ranked by quality.
No models fit 16GB for this task. Try selecting more RAM.
Features
Six compression techniques in one pipeline. From basic quantization to advanced MoE surgery.
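The most involved of these techniques, MoE expert pruning, reduces to a simple idea: rank experts by how often the router actually selects them on calibration data, then drop the least-used ones. A toy sketch (not Smelt's implementation; the routing log below is made up for illustration):

```python
from collections import Counter

def prune_experts(routing_log, n_experts, keep):
    """Given a log of expert indices chosen by the router on calibration
    data, return the `keep` most-used experts; the rest are candidates
    for removal."""
    counts = Counter(routing_log)
    # Experts the router never selected still need a count of zero.
    for e in range(n_experts):
        counts.setdefault(e, 0)
    ranked = sorted(counts, key=lambda e: counts[e], reverse=True)
    return sorted(ranked[:keep])

# Toy routing log for 4 experts with top-2 routing over 6 tokens:
log = [0, 3, 0, 1, 3, 0, 3, 1, 0, 3, 2, 0]
print(prune_experts(log, 4, 2))  # [0, 3] -> experts 0 and 3 dominate
```

Production pruning also measures downstream quality impact, not just routing frequency, but frequency is the intuition behind "removing lowest-impact" experts in the demo above.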
Proof
Every compression job produces a quality report with before/after benchmarks. No guessing — you see exactly what you get.
Llama 4 Scout 109B compressed to 16GB GGUF with 82% quality retained
| Metric | Value |
| --- | --- |
| Input | Llama 4 Scout 109B |
| Architecture | MoE (16 experts, top-2) |
| Original Size | 218 GB |
| Pipeline | expert_prune(8/16) + Q4_K_M |
| Output Size | 15.8 GB |
| Format | GGUF (Q4_K_M) |
| Compression | 13.8x |
| Quality (MMLU) | 82.1% retained |
| Perplexity | 5.42 → 6.18 (+14%) |
| Throughput | 28.3 tok/s on RTX 4090 |
Pricing
Why Smelt
A 109B model needs 218GB of RAM in FP16; you have 16GB
Zero production tools exist for MoE expert pruning
Trial-and-error quantization wastes hours for every new model
The first MoE expert surgery tool. Compress any LLM to fit your hardware.