Models grow 3x every year. Your GPU stays the same.

Compress Any LLM to Fit Your Hardware

AI picks the optimal recipe. Pruning, quantization, and MoE surgery in one pipeline.

Join the waitlist — be first to compress when we launch.

See what fits your GPU right now
Profiling model... MoE detected (16 experts, top-2)
AI Route: expert_prune(8/16) + Q4_K_M
Pruning experts: 16 → 8 (removing lowest-impact)
Quantizing: FP16 → Q4_K_M
Converting to GGUF...
Done! 218GB → 15.8GB (13.8x compression)
Quality retained: 82.1% (MMLU) | 28.3 tok/s on RTX 4090
→ ollama run smelt/llama4-scout-109b-q4km
671B

Parameter frontier

24GB

Consumer GPU VRAM

6

Fragmented ecosystems

50%

Running uncompressed

The Problem

The Gap Is Growing

Frontier models scale from 70B to 671B parameters. Consumer GPUs are stuck at 24GB VRAM. The tools between them are fragmented across 6 incompatible ecosystems.

Model Size

3x/year growth

70B → 405B → 671B. Each generation demands more VRAM than any consumer card provides.

GPU VRAM

24GB ceiling

RTX 4090 shipped at 24GB in 2022. RTX 5090 ships at 32GB. Models grew 9x in the same window.

Tooling

6 ecosystems

GGUF, GPTQ, AWQ, EXL2, ONNX, TensorRT. Each has its own tools, formats, and quality tradeoffs.

How It Works

Three Steps to Compression

Profile
01

Profile

Tell Smelt your model and target hardware. It analyzes architecture, parameter count, and MoE topology.

Route
02

Route

AI selects the optimal compression pipeline — quantization level, pruning targets, expert cuts — calibrated to your VRAM.

Compress
03

Compress

One command. Output: a GGUF file that runs on Ollama, llama.cpp, or LM Studio. Quality report included.
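The profile → route → compress flow above boils down to a budget check. A minimal sketch of that decision logic, assuming made-up size heuristics (the function names, thresholds, and bits-per-weight figures here are illustrative assumptions, not Smelt's actual routing rules):

```python
# Illustrative profile -> route sketch. All names and thresholds are
# assumptions for demonstration, not Smelt's API or real logic.

def profile_model(params_b: float, n_experts: int = 0) -> dict:
    """Estimate FP16 size: roughly 2 bytes per parameter."""
    return {"fp16_gb": params_b * 2, "n_experts": n_experts}

def route(profile: dict, vram_gb: float) -> list[str]:
    """Pick a compression recipe that fits the VRAM budget."""
    steps = []
    size_gb = profile["fp16_gb"]
    # MoE surgery first: assume halving the experts roughly halves the weights.
    if profile["n_experts"] >= 8 and size_gb > vram_gb * 8:
        half = profile["n_experts"] // 2
        steps.append(f"expert_prune({half}/{profile['n_experts']})")
        size_gb /= 2
    # Then pick a quant level: Q4_K_M is ~4.5 bits/weight vs 16 for FP16.
    if size_gb * (4.5 / 16) <= vram_gb:
        steps.append("Q4_K_M")
    else:
        steps.append("Q2_K")  # last resort: ~2.6 bits/weight
    return steps

# A dense 32B model (64 GB FP16) quantizes down to ~18 GB, so it fits 24 GB.
recipe = route(profile_model(32), vram_gb=24)
print(" + ".join(recipe))  # → Q4_K_M
```

The point of the sketch is the ordering: structural cuts (expert pruning) happen before quantization, because quantization alone tops out around 4x while surgery plus quantization compounds.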

Try it now

What can you run on your hardware?

Select your hardware and use case. We'll show you the best open-source models that fit — ranked by quality.
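A recommender like this reduces to filtering a catalog by estimated memory footprint and sorting by quality. A sketch under stated assumptions — the catalog entries, quality scores, and the rough rule that a Q4 GGUF needs about 4.5 bits per weight plus flat overhead are all hand-made for illustration:

```python
# Illustrative model-recommender sketch; the catalog and quality scores
# below are invented assumptions, not Smelt's actual data.

CATALOG = [
    # (name, params in billions, quality score 0-100, supported tasks)
    ("llama3.1-70b", 70,  86, {"chat", "code"}),
    ("qwen2.5-32b",  32,  82, {"chat", "code"}),
    ("llama3.1-8b",   8,  69, {"chat", "code"}),
    ("phi-3-mini",   3.8, 62, {"chat"}),
]

def q4_footprint_gb(params_b: float, overhead_gb: float = 1.5) -> float:
    """~4.5 bits/weight for a Q4-class quant, plus KV cache/runtime overhead."""
    return params_b * 4.5 / 8 + overhead_gb

def recommend(ram_gb: float, task: str) -> list[str]:
    """Models that fit the RAM budget for the task, best quality first."""
    fits = [(quality, name) for name, params_b, quality, tasks in CATALOG
            if task in tasks and q4_footprint_gb(params_b) <= ram_gb]
    return [name for quality, name in sorted(fits, reverse=True)]

print(recommend(16, "chat"))  # → ['llama3.1-8b', 'phi-3-mini']
```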


Features

Everything You Need to Compress

Six compression techniques in one pipeline. From basic quantization to advanced MoE surgery.

Quantization
GGUF-native quantization from Q2_K to Q8_0. One command, any model, ready for Ollama.
Included in Free tier
MoE Expert Surgery
The first production tool for pruning MoE experts. Cut 109B models down to fit 16GB.
Depth Pruning
Remove redundant transformer layers. Reduce model depth while preserving output quality.
Width Pruning
Slim attention heads and FFN dimensions. Targeted compression for dense architectures.
AI Route Selector
AI analyzes your model and hardware, then picks the optimal compression recipe. Free.
Included in Free tier
Cloud Compression
Upload once, download a GGUF. No local GPU required. Works with any model on HuggingFace.
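The core idea behind MoE expert surgery can be shown in a few lines: rank experts by how often the router actually selects them on calibration data, keep the top half, and remap the router's indices. A conceptual sketch only — the function and its inputs are illustrative, not Smelt's implementation:

```python
# Conceptual MoE expert-pruning sketch. Purely illustrative assumptions,
# not Smelt's implementation.
from collections import Counter

def prune_experts(routing_trace: list[int], n_experts: int, keep: int) -> dict:
    """routing_trace: the expert index the router chose per calibration token."""
    counts = Counter(routing_trace)
    # Rank experts by usage; experts never routed to count as zero.
    ranked = sorted(range(n_experts), key=lambda e: counts[e], reverse=True)
    kept = sorted(ranked[:keep])
    # Map old expert indices to compacted ones so the router still resolves.
    remap = {old: new for new, old in enumerate(kept)}
    return {"kept": kept, "remap": remap}

# Toy trace over 16 experts where even-numbered experts dominate routing.
trace = [e for e in range(16) for _ in range(10 if e % 2 == 0 else 1)]
result = prune_experts(trace, n_experts=16, keep=8)
print(result["kept"])  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

Real surgery also has to rewrite the checkpoint's expert tensors and the router's output projection, but the selection step is the part that determines quality retention.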

Proof

Compression Quality Report

Every compression job produces a quality report with before/after benchmarks. No guessing — you see exactly what you get.

8 model families supported, covering 95% of local LLM usage

quality-report.json
Input: Llama 4 Scout 109B
Architecture: MoE (16 experts, top-2)
Original Size: 218 GB
Pipeline: expert_prune(8/16) + Q4_K_M
Output Size: 15.8 GB
Format: GGUF (Q4_K_M)
Compression: 13.8x
Quality (MMLU): 82.1% retained
Perplexity: 5.42 → 6.18 (+14%)
Throughput: 28.3 tok/s on RTX 4090
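The headline numbers in a report like this are mechanically checkable. A sketch that recomputes them from the figures shown above — the JSON field names are guesses at a plausible schema, not a documented format:

```python
import json

# Hypothetical quality-report.json contents; the field names are assumptions
# based on the report above, not a documented Smelt schema.
report = json.loads("""{
    "original_size_gb": 218.0,
    "output_size_gb": 15.8,
    "perplexity_before": 5.42,
    "perplexity_after": 6.18
}""")

# Compression ratio: original size over output size.
ratio = report["original_size_gb"] / report["output_size_gb"]
# Perplexity regression as a percentage increase.
ppl_delta = (report["perplexity_after"] / report["perplexity_before"] - 1) * 100

print(f"{ratio:.1f}x compression, perplexity +{ppl_delta:.0f}%")
# → 13.8x compression, perplexity +14%
```

Both recomputed values match the report: 218 / 15.8 ≈ 13.8x, and 6.18 / 5.42 is a 14% perplexity increase.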

Pricing

Start Free, Scale When Ready

Free

$0 forever
  • Quantization (all GGUF formats)
  • AI Route Selector
  • 1 cloud job / month
  • Community support
Most Popular

Pro

$19/mo
  • Everything in Free
  • MoE Expert Surgery
  • Depth + Width Pruning
  • 5 cloud jobs / month
  • Priority queue
  • Quality reports with benchmarks

Team

$49/mo
  • Everything in Pro
  • 20 cloud jobs / month
  • Team workspace
  • Custom compression recipes
  • Dedicated support
  • API access

Why Smelt

The Pain Is Real

6 incompatible compression ecosystems with no standard recipe

50%+ of production deployments run uncompressed because tooling is too complex

No single tool combines pruning, quantization, and MoE surgery

Stop guessing quant levels — AI decides the best compression path

AI picks the optimal recipe. Pruning, quantization, and MoE surgery in one pipeline.

Join the waitlist — be first to compress when we launch.

Or try the free Model Recommender