EU AI Act penalties start August 2, 2026 — up to 35M EUR or 7% of global revenue

Your Models. Your Hardware. Your Data.

EU AI Act hits August 2026. Compress LLMs for on-premise deployment — nothing leaves your network.

Join the waitlist — be first to compress when we launch.

See what fits your GPU right now
Profiling model... MoE detected (16 experts, top-2)
AI Route: expert_prune(8/16) + Q4_K_M
Pruning experts: 16 → 8 (removing lowest-impact)
Quantizing: FP16 → Q4_K_M
Converting to GGUF...
Done! 218GB → 15.8GB (13.8x compression)
Quality retained: 82.1% (MMLU) | 28.3 tok/s on RTX 4090
→ ollama run smelt/llama4-scout-109b-q4km
671B

Parameter frontier

24GB

Consumer GPU VRAM

6

Fragmented ecosystems


The Problem

The Gap Is Growing

Frontier models scale from 70B to 671B parameters. Consumer GPUs are stuck at 24GB VRAM. The tools between them are fragmented across 6 incompatible ecosystems.

Model Size

3x/year growth

70B → 405B → 671B. Each generation demands more VRAM than any consumer card provides.

GPU VRAM

24GB ceiling

RTX 4090 shipped at 24GB in 2022. RTX 5090 ships at 32GB. Models grew 9x in the same window.

Tooling

6 ecosystems

GGUF, GPTQ, AWQ, EXL2, ONNX, TensorRT. Each has its own tools, formats, and quality tradeoffs.
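The arithmetic behind the gap is simple: a model's weight footprint is roughly parameter count times bits per weight. A minimal sketch (the effective bits per quant format are rough community estimates, not Smelt's numbers):

```python
# Approximate weight footprint: params (billions) x bits per weight / 8 = GB (decimal).
def model_size_gb(n_params_b: float, bits_per_weight: float) -> float:
    return n_params_b * bits_per_weight / 8

# Llama 4 Scout's 109B parameters at FP16 (16 bits/weight):
print(f"{model_size_gb(109, 16):.0f} GB")   # 218 GB, as in the demo above
# The same weights at ~4.5 effective bits (a rough Q4_K_M estimate):
print(f"{model_size_gb(109, 4.5):.1f} GB")  # still ~61 GB
```

Even aggressive 4-bit quantization alone leaves 109B parameters far above a 24GB card, which is why the demo pipeline prunes half the experts before quantizing.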

How It Works

Three Steps to Compression

01

Profile

Tell Smelt your model and target hardware. It analyzes architecture, parameter count, and MoE topology.

02

Route

AI selects the optimal compression pipeline — quantization level, pruning targets, expert cuts — calibrated to your VRAM.

03

Compress

One command. Output: a GGUF file that runs on Ollama, llama.cpp, or LM Studio. Quality report included.

Try it now

What can you run on your hardware?

Select your hardware and use case. We'll show you the best open-source models that fit — ranked by quality.

RAM
Task
0 models fit

No models fit in 16 GB for this task. Try selecting more RAM.
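A fit check like the recommender's must account for more than weights: the KV cache grows with context length. A sketch of that accounting (the layer counts, head sizes, and overhead figure describe a generic 7B-class model and are assumptions):

```python
# Sketch of a fit check: total memory = quantized weights + KV cache + overhead.
def kv_cache_gib(n_layers: int, ctx_len: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, one entry per layer per position per KV head.
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem / 2**30

kv = kv_cache_gib(n_layers=32, ctx_len=4096, n_kv_heads=8, head_dim=128)
print(f"KV cache at 4k context: {kv:.1f} GiB")  # 0.5 GiB

def fits(weights_gib: float, kv_gib: float, ram_gib: float,
         overhead_gib: float = 1.0) -> bool:
    return weights_gib + kv_gib + overhead_gib <= ram_gib

print(fits(4.1, kv, 16.0))   # a ~4 GiB quantized 7B model fits in 16 GiB
print(fits(15.8, kv, 16.0))  # a 15.8 GiB model does not, once cache is counted
```

This is why a compressed file that is nominally smaller than your RAM can still fail to load with a long context window.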

Features

Everything You Need to Compress

Six compression techniques in one pipeline. From basic quantization to advanced MoE surgery.

Quantization
GGUF-native quantization from Q2_K to Q8_0. One command, any model, ready for Ollama.
Included in Free tier
MoE Expert Surgery
The first production tool for pruning MoE experts. Cut 109B models down to fit 16GB.
Depth Pruning
Remove redundant transformer layers. Reduce model depth while preserving output quality.
Width Pruning
Slim attention heads and FFN dimensions. Targeted compression for dense architectures.
AI Route Selector
AI analyzes your model and hardware, then picks the optimal compression recipe. Free.
Included in Free tier
Cloud Compression
Upload once, download a GGUF. No local GPU required. Works with any model on HuggingFace.
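Expert surgery can be illustrated with a toy version of one plausible criterion: rank a layer's experts by how often the router selects them on calibration data, then keep the top half. The counts and the keep rule below are hypothetical, not Smelt's method:

```python
# Toy expert pruning: keep the k most-used experts in an MoE layer,
# ranked by router selection counts from a calibration pass (made-up data).
def select_experts(usage_counts: list[int], keep: int) -> list[int]:
    ranked = sorted(range(len(usage_counts)),
                    key=lambda i: usage_counts[i], reverse=True)
    return sorted(ranked[:keep])  # kept expert indices, in original order

# 16 experts, selection counts from a pretend calibration run:
counts = [910, 40, 780, 55, 30, 860, 25, 700, 640, 40, 35, 720, 30, 690, 45, 810]
kept = select_experts(counts, keep=8)
print(kept)  # the 8 highest-traffic experts survive; the rest are removed
```

Dropping 8 of 16 experts roughly halves the expert weights, which combined with 4-bit quantization produces the order-of-magnitude reductions shown in the demo.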

Proof

Compression Quality Report

Every compression job produces a quality report with before/after benchmarks. No guessing — you see exactly what you get.

GGUF output runs on Ollama, llama.cpp, and LM Studio — fully air-gapped capable

quality-report.json
Input: Llama 4 Scout 109B
Architecture: MoE (16 experts, top-2)
Original Size: 218 GB
Pipeline: expert_prune(8/16) + Q4_K_M
Output Size: 15.8 GB
Format: GGUF (Q4_K_M)
Compression: 13.8x
Quality (MMLU): 82.1% retained
Perplexity: 5.42 → 6.18 (+14%)
Throughput: 28.3 tok/s on RTX 4090
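The report's derived figures can be recomputed from its own raw fields; a quick check using the numbers above:

```python
# Recompute the quality report's derived figures from its raw values.
original_gb, output_gb = 218.0, 15.8
ppl_before, ppl_after = 5.42, 6.18

compression = original_gb / output_gb
ppl_delta_pct = (ppl_after / ppl_before - 1) * 100

print(f"Compression: {compression:.1f}x")    # 13.8x, matching the report
print(f"Perplexity: +{ppl_delta_pct:.0f}%")  # +14%, matching the report
```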

Pricing

Start Free, Scale When Ready

Free

$0 forever
  • Quantization (all GGUF formats)
  • AI Route Selector
  • 1 cloud job / month
  • Community support
Most Popular

Pro

$19/mo
  • Everything in Free
  • MoE Expert Surgery
  • Depth + Width Pruning
  • 5 cloud jobs / month
  • Priority queue
  • Quality reports with benchmarks

Team

$49/mo
  • Everything in Pro
  • 20 cloud jobs / month
  • Team workspace
  • Custom compression recipes
  • Dedicated support
  • API access

Why Smelt

The Pain Is Real

67% of HIPAA AI breaches in 2025 came from data sent to cloud services

Attorney-client privilege does not cover cloud AI conversations (SDNY 2026 ruling)

Air-gapped classified networks need compressed models that fit limited hardware

Run AI on-premise without cloud risk or compliance headaches

EU AI Act hits August 2026. Compress LLMs for on-premise deployment — nothing leaves your network.

Join the waitlist — be first to compress when we launch.

Or try the free Model Recommender