Any open-source LLM on Hugging Face — dense and Mixture-of-Experts architectures are both supported.
MoE models contain many specialist sub-networks (experts), in some cases hundreds, but a router activates only a few of them per token. Smelt removes the least-used experts, which substantially reduces model size while largely preserving quality.
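To illustrate the idea of dropping the least-used experts, here is a minimal sketch. It is not Smelt's actual implementation: the function name `least_used_experts`, the `routing_trace` format (one list of activated expert ids per token), and the `keep` parameter are all assumptions made for this example.

```python
from collections import Counter


def least_used_experts(routing_trace, num_experts, keep):
    """Pick the experts to prune (hypothetical helper, not Smelt's API).

    routing_trace: one list of activated expert ids per token.
    num_experts:   total experts in the layer.
    keep:          how many experts should remain after pruning.
    """
    counts = Counter()
    for token_experts in routing_trace:
        counts.update(token_experts)

    # Rank every expert by activation count, lowest first.
    # Experts that never fired count as 0; ties break on expert id.
    ranked = sorted(range(num_experts), key=lambda e: (counts[e], e))
    return ranked[: num_experts - keep]


# Toy trace: 4 tokens, each routed to 2 of 4 experts.
trace = [[0, 3], [0, 1], [3, 0], [3, 0]]
to_prune = least_used_experts(trace, num_experts=4, keep=2)
print(to_prune)
```

In this toy trace, expert 2 never fires and expert 1 fires once, so those two are the pruning candidates. A real pipeline would collect the trace from a calibration dataset and measure quality after removal.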
No. Quantization and pruning run on CPU. Cloud jobs run on Smelt infrastructure — no GPU required on your end.
Typical results retain 85-95% of original benchmark scores. Smelt generates a quality report after every compression.
Smelt outputs standard GGUF files — compatible with Ollama, llama.cpp, LM Studio, vLLM, and any GGUF runtime.
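If you want to sanity-check that an output file really is GGUF before loading it into a runtime, the fixed-size header can be read in a few lines. This sketch assumes GGUF version 2 or later, where the 4-byte magic `GGUF` is followed by a little-endian uint32 version and two uint64 counts; the function name `read_gguf_header` is made up for this example and is not part of Smelt.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header (assumes GGUF v2+ layout)."""
    with open(path, "rb") as f:
        header = f.read(24)

    # Every GGUF file starts with the ASCII magic bytes "GGUF".
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")

    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

Calling `read_gguf_header("model.gguf")` on a valid file returns a small dictionary; a `ValueError` means the file is truncated or not GGUF at all.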
Cloud jobs process your model and deliver the result. We do not retain, train on, or redistribute your models. With the local CLI, your models never leave your machine.
llama.cpp handles quantization only. Smelt combines quantization with MoE expert surgery, pruning, and an AI agent that picks the optimal compression recipe.