Notes on Gemma 4 E4B at 4-bit on Apple Silicon.

Vehla currently exposes three local Gemma 4 configurations: E2B at 4-bit, E4B at 4-bit, and E4B at 8-bit. The default is E2B 4-bit because it is the fastest path to a working local model.

The benchmark suite we actually use

Public benchmarks are useful, but they do not measure what Vehla does. We choose a local model mostly by how it handles the common palette actions: rewrite, summarize, translate, explain, extract, and draft.
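
As a rough illustration of what an action-shaped suite could look like, here is a minimal sketch. Only the six action names come from the list above; the prompt templates, the run_suite helper, and the stub model are hypothetical stand-ins, not Vehla's actual harness.

```python
from typing import Callable, Dict, List

# The six palette actions named in the text. The templates are illustrative
# assumptions, not Vehla's real prompts.
ACTION_TEMPLATES: Dict[str, str] = {
    "rewrite": "Rewrite the following text more clearly:\n\n{text}",
    "summarize": "Summarize the following text in two sentences:\n\n{text}",
    "translate": "Translate the following text into English:\n\n{text}",
    "explain": "Explain the following text in plain language:\n\n{text}",
    "extract": "Extract the key facts from the following text as a list:\n\n{text}",
    "draft": "Draft a short reply to the following text:\n\n{text}",
}


def run_suite(generate: Callable[[str], str], samples: List[str]) -> Dict[str, List[str]]:
    """Run every action over every sample and collect the raw outputs."""
    results: Dict[str, List[str]] = {}
    for action, template in ACTION_TEMPLATES.items():
        results[action] = [generate(template.format(text=s)) for s in samples]
    return results


def stub_model(prompt: str) -> str:
    """Stand-in for a local model call; returns the instruction line only."""
    return prompt.splitlines()[0]


if __name__ == "__main__":
    outputs = run_suite(stub_model, ["The meeting moved to Tuesday."])
    for action, texts in outputs.items():
        print(action, "->", texts[0])
```

Passing the model in as a plain callable keeps the suite independent of any particular runtime, so the same samples can be run against each local configuration and compared side by side.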

Quality profiles

Model	Profile	Best fit
Gemma 4 E2B (4-bit)	Fast	Default local model
Gemma 4 E4B (4-bit)	Higher quality	Better for nuanced writing
Gemma 4 E4B (8-bit)	Highest local quality	Best when memory is available

E2B is the best default for getting started. E4B is the upgrade path when output quality matters more than download size and memory use.
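
The guidance above can be written down as a small selection rule. This is a sketch only: the LocalModel type, the pick_model function, and its two boolean inputs are hypothetical names introduced for illustration, and the text does not say how Vehla actually decides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalModel:
    name: str
    bits: int
    profile: str


# The three configurations from the table above.
E2B_4BIT = LocalModel("Gemma 4 E2B", 4, "Fast")
E4B_4BIT = LocalModel("Gemma 4 E4B", 4, "Higher quality")
E4B_8BIT = LocalModel("Gemma 4 E4B", 8, "Highest local quality")


def pick_model(prefer_quality: bool = False, memory_available: bool = False) -> LocalModel:
    """Hypothetical rule mirroring the text: E2B by default, E4B as the upgrade."""
    if not prefer_quality:
        return E2B_4BIT   # fast default for getting started
    if memory_available:
        return E4B_8BIT   # best when memory is available
    return E4B_4BIT       # quality upgrade with a smaller footprint


if __name__ == "__main__":
    print(pick_model())                    # the E2B 4-bit default
    print(pick_model(prefer_quality=True)) # the E4B 4-bit upgrade
```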

Model size

Attribute	E2B	E4B
Estimated download	~2.5 GB	~4.5 GB
Best use	Fast everyday actions	Higher-quality writing
Resource use	Medium	High

Exact speed depends on the Mac, memory pressure, and prompt length.

Where 4-bit hurts

Quantization costs little at this scale, with two exceptions: numerical reasoning (don't ask E4B to add three-digit numbers) and code with unusual variable names. We added a quality nudge to the prompt that says "be careful with arithmetic", and the worst cases disappeared.
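
A prompt nudge of this kind might be wired in as follows. The nudge wording comes from the note above; the base prompt, the build_system_prompt name, and the choice to add the nudge only when the input contains a digit are assumptions made for this sketch. The text does not say whether Vehla applies the nudge conditionally or always.

```python
import re

# Wording taken from the note above.
ARITHMETIC_NUDGE = "Be careful with arithmetic."
# Made-up placeholder; not Vehla's real system prompt.
BASE_PROMPT = "You are a concise writing assistant."


def build_system_prompt(user_text: str, base: str = BASE_PROMPT) -> str:
    """Append the arithmetic nudge only when the input contains a digit (assumed rule)."""
    if re.search(r"\d", user_text):
        return f"{base} {ARITHMETIC_NUDGE}"
    return base


if __name__ == "__main__":
    print(build_system_prompt("Total the invoice: 120 + 345 + 210"))
    print(build_system_prompt("Make this paragraph friendlier."))
```

Gating on digits keeps the prompt short for the common writing actions, at the cost of missing numbers that are spelled out in words.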

Why not bigger?

We tried larger models: Llama 3.1 8B at 4-bit and Qwen 2.5 7B. Both are excellent, but neither was clearly better than E4B for our specific actions, and both made cold model load too slow to feel snappy. We ended up keeping E4B as the largest built-in model, with E2B 4-bit as the default, and exposing a "load any MLX model" option for power users.
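
For anyone who wants to compare load times on their own Mac, a timing wrapper can be as small as the sketch below. The time_cold_load helper and the fake loader are hypothetical; in practice the loader would be whatever function reads the model weights from disk.

```python
import time
from typing import Any, Callable, Tuple


def time_cold_load(loader: Callable[[], Any]) -> Tuple[Any, float]:
    """Call loader once and return (loaded object, elapsed seconds)."""
    start = time.perf_counter()
    model = loader()
    return model, time.perf_counter() - start


def fake_loader() -> str:
    """Stand-in for a real model load; sleeps briefly instead of reading weights."""
    time.sleep(0.05)
    return "model"


if __name__ == "__main__":
    _, seconds = time_cold_load(fake_loader)
    print(f"load took {seconds:.2f} s")
```

A load only counts as cold on the first run in a fresh process; later runs are usually faster because the operating system has already cached the weight files.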