mac-code: Run a 35B AI Coding Agent on a $600 Mac Mini for $0/Month

By Prahlad Menon 8 min read

On NVIDIA hardware, paging a 35B model from NVMe gives you about 1.6 tokens per second. On a $600 Mac Mini M4 with 16GB of RAM: 30 tokens per second. That’s an 18.6x difference — not because Apple Silicon has faster storage, but because of how it accesses it.

mac-code is an open-source AI coding agent built around that insight. It runs a 35B mixture-of-experts model entirely on consumer Mac hardware, for $0/month, with web search, shell commands, file operations, and code generation. No cloud API. No subscription.

The repo is MIT licensed with a one-command setup: bash setup.sh.

It also happens to validate in practice everything we’ve written about TurboQuant and RotorQuant — but we’ll get to that.

The Core Insight: Apple Silicon Flash-Paging

Most discussions of “running large models on small RAM” focus on quantization — shrinking the weights until they fit. mac-code takes a different approach: stream the weights from SSD, keep only what’s needed in RAM at any moment.

This sounds slow. On traditional hardware, it is slow — NVIDIA’s NVMe paging tops out around 1.6 tok/s because PCIe SSD reads go through the system memory controller, compete with CPU bandwidth, and then get transferred again to GPU VRAM.

Apple Silicon changes the equation. The unified memory architecture means the Neural Engine, GPU, and CPU all share the same physical memory bus. When the SSD streams weight data, it goes directly to where the computation happens — no PCIe handoff, no VRAM copy, no memory controller bottleneck. The result: 30 tok/s on a $600 machine.

How Flash Streaming Works

The model is split into two components:

Pinned in RAM (4–6 GB, stays forever):

  • Attention weights
  • Embeddings and layer norms
  • KV cache

Streamed from SSD per token:

  • FFN (feed-forward network) weights — the bulk of the model
  • Loaded layer-by-layer, used for one matrix multiply, then discarded
  • Memory footprint stays flat no matter how long the context gets

For each token, the loop is:

  1. Run attention — from RAM, instant
  2. Load FFN weights from SSD (~165–221 MB per layer)
  3. Run the FFN matmul on the GPU
  4. Discard FFN weights — memory never grows

For MoE (mixture-of-experts) models like Qwen3.5-35B, step 2 loads only the 8 active experts per token — roughly 14 MB — not all 256 experts. That’s why MoE models are 10x faster than equivalent dense models under flash-paging.

What You Can Actually Run

Every number below was measured on a 16 GB Mac Mini M4. Nothing estimated.

SetupRAMModelSpeed
Any Mac8 GBQwen3.5-9B (Q4_K_M, 5.3 GB)16–20 tok/s
Any Mac16 GBQwen3.5-9B, 64K context16–20 tok/s
Mac Mini M416 GBQwen3.5-35B-A3B (IQ2_M)30 tok/s
Mac Mini M416 GBQwen3-30B-A3B Q4 via Expert Sniper4.3 tok/s
Mac Mini M416 GBQwen3.5-35B-A3B Q4 via Flash Streaming1.54 tok/s
Mac Mini M4 Pro48 GB35B at full Q4 in RAM30+ tok/s

The fastest option — IQ2_M at 2.6-bit quantization — fits entirely in 16 GB RAM and runs at 30 tok/s. For users who need full 4-bit quality but only have 16 GB, Expert Sniper streams expert weights from SSD and achieves 5.4 tok/s with full Q4 precision.

Two Backends, Different Trade-offs

mac-code ships with two inference backends:

llama.cpp (Primary)

The default path. Runs Qwen3.5-35B at IQ2_M quantization — 10.6 GB fits entirely in RAM. The agent uses llama.cpp’s built-in tool-calling support for routing between search, shell, and chat actions.

llama-server \
  --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  --port 8000 --host 127.0.0.1 \
  --flash-attn on --ctx-size 12288 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --n-gpu-layers 99 -t 4

MLX (Research)

Apple’s ML framework — 25% faster generation than llama.cpp when models fit in RAM, plus a feature that matters for agents: persistent KV cache that saves to disk and syncs across Macs via Cloudflare R2.

The context persistence benchmarks are striking:

OperationTime
Reprocessing 141 tokens1.01 seconds
Loading same context from SSD0.0003 seconds

That’s a 6,677x speedup for resuming a session. Analyze a codebase once; come back tomorrow and pick up instantly without re-ingesting everything.

KV Cache Compression: TurboQuant in Production

This is where mac-code directly connects to our earlier coverage.

The --cache-type-k q4_0 --cache-type-v q4_0 flags in the llama.cpp invocation above are KV cache quantization — the same technique Google formalized in TurboQuant (ICLR 2026) and that RotorQuant took further.

The practical result on a 9B model at 16-bit:

ConfigContext WindowKV Cache Memory
Standard (FP16)32K1,024 MB
Q4_0 KV cache64K288 MB

Quantizing the KV cache to 4-bit doubles the context window for free and cuts cache memory by 72%. The mac-code repo also includes turboquant.py — an experimental TurboQuant implementation that compresses saved context files 4x (26.6 MB → 6.7 MB) with 0.993 cosine similarity. Near-zero information loss.

The repo README frames this explicitly as a research finding: quantized KV cache in llama.cpp, combined with Apple Silicon’s unified memory bus, is what makes 64K context on a 9B model practical on an 8 GB machine.

Text-Level Routing: Solving the 2-Bit JSON Problem

The 35B IQ2_M model is fast, but extreme quantization breaks JSON function calling — structured outputs fail because the model can’t reliably format tool invocations at 2.6 bits.

mac-code’s solution: don’t use JSON for routing. The LLM classifies its own intent at the text level — outputting a plain word like search, shell, or chat — and the Python agent parses that to decide which tool to call. No JSON schema involved.

The result: 8/8 routing accuracy in testing, even with a model that fails at JSON structure. The insight generalizes: for routing decisions (which are simple classifications), text output is more robust than structured output under heavy quantization.

The Agent Itself

mac-code is not a chat UI. It’s a Python agent (agent.py) with:

  • Web search via DuckDuckGo (ddgs)
  • Shell command execution — real filesystem access
  • File read/write — actual code editing
  • Code generation and explanation
  • Chain-of-thought reasoning before acting

The interface is a terminal REPL with commands: /agent (default), /raw (no tools), /search <q>, /stats, /clear. A web UI (web/server.py) is also included.

Setup is a single command:

git clone https://github.com/walter-grace/mac-code
bash setup.sh

The setup script handles llama.cpp installation, model download (10.6 GB for the 35B IQ2_M), and configuration.

Why This Matters

Claude Code, Cursor, and GitHub Copilot are excellent tools. They’re also API-dependent — every token costs money, every request leaves your machine, and the cost compounds fast for heavy users.

mac-code demonstrates that the alternative isn’t just possible, it’s approaching practical. A 35B model at 30 tok/s is fast enough for real coding work. The KV cache persistence means session context survives restarts. The text-level routing handles tool use even at extreme quantization.

The remaining limitation is honest: IQ2_M at 2.6 bits is meaningfully below full Q4 quality. You’ll notice it on complex reasoning tasks. The 4.3 tok/s Expert Sniper path gives you full Q4 at a speed penalty that’s tolerable for non-interactive tasks but frustrating for live chat.

The gap is closing. RotorQuant’s 10x KV compression, applied to a future 14B model running fully in RAM on M4 Pro, starts to look genuinely competitive with cloud-hosted inference — at zero marginal cost.


Repo: github.com/walter-grace/mac-code
License: MIT
Related: TurboQuant explained · RotorQuant: beating TurboQuant


Frequently Asked Questions

What is mac-code and what does it do?
mac-code is an open-source AI coding agent that runs a 35B parameter language model entirely on Apple Silicon Macs — including the $600 Mac Mini M4 with 16 GB RAM — at 30 tokens per second, with no cloud API and no monthly cost. It supports web search, shell commands, file operations, and code generation via a local Python agent.

How does mac-code run a 35B model on only 16 GB of RAM?
mac-code uses two techniques: IQ2_M quantization (2.6-bit compression that shrinks the 35B model to 10.6 GB, fitting in 16 GB RAM) for the fast path, and flash-paging (streaming FFN weights from SSD while keeping attention weights in RAM) for full 4-bit quality. Apple Silicon’s unified memory architecture makes the SSD-streaming path 18.6x faster than equivalent NVIDIA hardware.

Why is Apple Silicon 18.6x faster than NVIDIA for flash-paging models?
On NVIDIA, SSD data travels through PCIe lanes to system RAM and then to VRAM — two memory transfers with bandwidth competition between CPU and GPU. Apple Silicon’s unified memory means the SSD stream goes directly to the same physical memory that the GPU and Neural Engine use. No PCIe handoff, no VRAM copy, no bandwidth competition.

What is Expert Sniper in mac-code?
Expert Sniper is mac-code’s SSD-streaming engine for MoE (mixture-of-experts) models. Instead of loading all 256 experts per layer, it streams only the 8 active experts per token from SSD — about 14 MB vs. ~450 MB — enabling full 4-bit quality on a 16 GB Mac at 4–5 tok/s. It’s slower than the IQ2_M RAM-based path but maintains full model quality.

How does KV cache quantization in mac-code work?
mac-code uses llama.cpp’s --cache-type-k q4_0 --cache-type-v q4_0 flags to compress the KV cache from 16-bit to 4-bit precision. On a 9B model, this doubles the context window from 32K to 64K tokens and cuts cache memory from 1,024 MB to 288 MB — a 72% reduction with negligible quality loss. This is the same principle behind Google’s TurboQuant (ICLR 2026).

What is the MLX backend and what does KV cache persistence mean?
The MLX backend uses Apple’s ML framework for 25% faster inference when models fit in RAM. Its key feature is KV cache persistence: the agent saves conversation context to disk and can restore it in 0.0003 seconds instead of reprocessing 141 tokens (which takes 1.01 seconds) — a 6,677x speedup. This means you can analyze a large codebase once and resume instantly the next day.

What hardware do I need to run mac-code?
Any Apple Silicon Mac works. An 8 GB Mac can run Qwen3.5-9B at 16–20 tok/s with 4K context. A 16 GB Mac Mini M4 (~$600) runs the 35B model at 30 tok/s via IQ2_M quantization, or 5.4 tok/s at full Q4 via Expert Sniper. A Mac Mini M4 Pro with 48 GB runs the full 35B at Q4 in RAM at 30+ tok/s.

How does mac-code handle tool routing without JSON function calling?
At extreme quantization (IQ2_M, 2.6-bit), the model can’t reliably produce valid JSON schemas for function calls. mac-code solves this with text-level routing: the model outputs a plain word (search, shell, or chat) and the Python agent parses that text to decide which tool to invoke. This achieved 8/8 routing accuracy in testing, even at quantization levels that break structured output.