Shimmy: A 5MB Rust Binary That Makes Ollama Look Bloated
I've been running Ollama for local inference for over a year. It works. It's fine. But "fine" is a low bar when you're running models on hardware you own, and every megabyte of overhead is a megabyte you could give to your model's context window.
Enter Shimmy, a local LLM inference server written in pure Rust that ships as a single 5MB binary. No Python runtime. No Docker container. No configuration files. Just download and run.
The Numbers That Matter
Let me put this in perspective:
| | Shimmy | Ollama |
|---|---|---|
| Binary size | ~5MB | ~200MB+ |
| Startup time | ~100ms | Several seconds |
| Idle RAM | ~50MB | ~300MB+ |
| Dependencies | Zero | Go runtime |
| Config files | None | Optional but common |
These aren't benchmarks from a lab. These are the differences you feel, especially on a Mac Mini or a small VPS where every resource counts.
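To make the "megabytes you could give to your context window" point concrete, here's a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 cache) is my assumption for a Llama-3-8B-class model, not something Shimmy reports, and the 250MB gap is simply the idle-RAM row from the table taken at face value.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def extra_context_tokens(freed_bytes: int, per_token_bytes: int) -> int:
    """Whole tokens of KV cache that fit into the freed memory."""
    return freed_bytes // per_token_bytes


# Assumed model shape (Llama-3-8B-like): 32 layers, 8 KV heads, head dim 128, fp16 cache.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Idle-RAM gap from the table: roughly 300MB - 50MB.
freed = (300 - 50) * 1024 * 1024

print(per_token)                               # 131072 bytes, i.e. 128 KiB per token
print(extra_context_tokens(freed, per_token))  # 2000
```

Under those assumptions the saved idle RAM is worth about 2,000 extra tokens of context. Different models and cache quantization settings will move that number a lot, so treat it as an order-of-magnitude estimate.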
Why It Works as a Drop-in Replacement
Shimmy implements the OpenAI API's /v1/chat/completions endpoint, the one nearly every tool already speaks, rather than a "compatible-ish" variant. Point your OPENAI_BASE_URL at http://localhost:11435/v1 (substitute whichever port your server is actually listening on), set the API key to literally anything (Shimmy ignores it), and your existing code just works:
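```python
from openai import OpenAI

# Point the standard OpenAI client at the local Shimmy server.
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")  # key is ignored

resp = client.chat.completions.create(
    model="your-model-name",  # use a name shown by `shimmy list`
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```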
This means Cursor, Continue.dev, VS Code extensions, and any OpenAI SDK can talk to it without code changes, as long as they let you override the base URL. That's the kind of compatibility that actually matters.
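If you'd rather not pull in an SDK at all, the same endpoint is plain JSON over HTTP. Here's a minimal sketch using only the standard library. The base URL and model name are the placeholders used earlier in this post, and `build_chat_request` and `send` are helper names I made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11435/v1"  # assumption: the port used earlier in this post


def build_chat_request(model: str, prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-local",  # placeholder key
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(send(build_chat_request("your-model-name", "Hello")))` should print the reply. Splitting build from send keeps the payload easy to inspect without a live server.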
Zero Configuration Isn't Marketing Speak
Shimmy auto-discovers GGUF models from your HuggingFace cache, existing Ollama installations, and local directories. Run shimmy list and it shows you what's available. Run shimmy serve and it picks a port, loads what it finds, and starts serving.
No YAML files. No model manifests. No Modelfile syntax to learn.
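To give a feel for what auto-discovery involves, here's a rough sketch of the idea: walk a few well-known directories and index every .gguf file by name. This is not Shimmy's actual code. The directory list is my assumption, and I've left out the Ollama store because, as far as I know, Ollama keeps weights as content-addressed blobs without a .gguf extension, so a real implementation would have to read its manifests instead.

```python
import os
from pathlib import Path


def candidate_dirs() -> list[Path]:
    """Directories to scan. These paths are assumptions, not Shimmy's actual list."""
    home = Path.home()
    hf_home = Path(os.environ.get("HF_HOME", home / ".cache" / "huggingface"))
    return [
        hf_home / "hub",        # default HuggingFace hub cache location
        Path.cwd() / "models",  # hypothetical local models directory
    ]


def discover_gguf(dirs: list[Path]) -> dict[str, Path]:
    """Map a model name (the file stem) to the first .gguf file found with that name."""
    found: dict[str, Path] = {}
    for root in dirs:
        if not root.is_dir():
            continue  # missing locations are skipped silently
        for path in sorted(root.rglob("*.gguf")):
            found.setdefault(path.stem, path)  # earlier directories win on name clashes
    return found
```

Calling `discover_gguf(candidate_dirs())` returns a name-to-path mapping you could print as a poor man's model list.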
Hot model swapping works too: request a different model in your API call and Shimmy loads it on the fly. On a machine with limited RAM, this is genuinely useful. You're not pre-loading three models you might need; you're loading what you need when you need it.
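The load-on-demand pattern is easy to picture as a small cache that loads a model the first time a request names it and evicts the least recently used one when memory is tight. The sketch below is my own illustration of that pattern, not how Shimmy implements it; `LazyModelCache` and the `loader` callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class LazyModelCache:
    """Keep at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], object], capacity: int = 1) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loader = loader
        self._capacity = capacity
        self._loaded: OrderedDict[str, object] = OrderedDict()

    def get(self, name: str) -> object:
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return self._loaded[name]
        while len(self._loaded) >= self._capacity:
            self._loaded.popitem(last=False)  # drop the least recently used model
        model = self._loader(name)  # load only when a request asks for it
        self._loaded[name] = model
        return model

    def loaded_names(self) -> list[str]:
        """Names of loaded models, least recently used first."""
        return list(self._loaded)
```

With `capacity=1` this behaves like a pure swap: asking for model B unloads model A first, which is the trade you want on a RAM-constrained box.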
The v1.9.0 GPU Story
The latest release bundles all GPU backends into the single binary. CUDA, Vulkan, OpenCL on Linux/Windows; MLX on Apple Silicon. No separate downloads, no feature flag confusion, no "did I compile with the right backend?" debugging sessions.
Your GPU gets detected at runtime. It just works or it falls back to CPU. This is how it should have always been.
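Runtime detection with a CPU fallback boils down to trying backends in preference order and taking the first one that answers. Here's a tiny sketch of that pattern. The probe functions are stand-ins I invented; real probes would try to load a driver or initialize a device, and Shimmy's actual detection logic may look nothing like this.

```python
from typing import Callable, Sequence

Probe = Callable[[], bool]


def pick_backend(probes: Sequence[tuple[str, Probe]], fallback: str = "cpu") -> str:
    """Return the first backend whose probe succeeds, else the CPU fallback."""
    for name, probe in probes:
        try:
            if probe():
                return name
        except Exception:
            continue  # a crashing probe counts as "backend unavailable"
    return fallback
```

For example, `pick_backend([("cuda", has_cuda), ("vulkan", has_vulkan)])` with hypothetical `has_cuda` and `has_vulkan` probes returns `"cpu"` when neither is usable.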
For larger models, there's MoE (Mixture of Experts) support that intelligently splits layers between CPU and GPU. Run 70B+ parameter models on consumer hardware by letting Shimmy figure out the optimal placement. The --cpu-moe flag gives you control when you want it.
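To illustrate what "splitting between CPU and GPU" can mean, here's a toy greedy placement: the weights used on every token (attention and shared layers) get VRAM first, and expert weights fill whatever is left. Everything here is an assumption for illustration, including the tensor sizes and the meaning I give the `cpu_moe` parameter (all expert weights stay on the CPU). Shimmy's real placement logic and the exact semantics of --cpu-moe may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size_mb: int
    is_expert: bool  # True for MoE expert weights, False for attention/shared weights


def place(tensors: list[Tensor], vram_mb: int, cpu_moe: bool = False) -> dict[str, str]:
    """Greedy placement: non-expert weights get VRAM first, experts fill what is left.

    With cpu_moe=True every expert tensor stays on the CPU regardless of free VRAM.
    """
    placement: dict[str, str] = {}
    free = vram_mb
    # sorted() is stable, so tensors keep their original order within each group.
    for t in sorted(tensors, key=lambda t: t.is_expert):
        on_gpu = t.size_mb <= free and not (cpu_moe and t.is_expert)
        if on_gpu:
            free -= t.size_mb
        placement[t.name] = "gpu" if on_gpu else "cpu"
    return placement
```

The intuition is that only a few experts fire per token, so parking them in system RAM costs less than evicting the attention weights that every token touches.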
What's the Catch?
Shimmy is newer and smaller in community than Ollama. The model ecosystem around Ollama (the web UIs, the management tools, the integrations) is more mature. If you need Open WebUI or similar frontends, check compatibility first.
But here's what I've found: most of my local LLM usage is API-driven. IDE completions, agent toolchains, scripts that call chat endpoints. For that workflow, Shimmy is strictly better. Lighter, faster, simpler.
The project is MIT-licensed (edit: Apache-2.0 per the repo badges) and the maintainer has made an explicit "free forever" commitment: no asterisks, no pivot-to-paid. At 5,000+ stars and growing, it's past the "weekend project" stage.
Try It in 30 Seconds
```bash
# macOS Apple Silicon
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &
./shimmy list
```
If you're running local models and you haven't tried Shimmy, you're leaving performance on the table. It's one of those tools where the engineering speaks for itself: small binary, fast startup, zero config, full compatibility. That's the Rust promise delivered.