Notes on Gemma 4 E4B at 4-bit on Apple Silicon.

How we think about Gemma 4 models, default local inference, and the trade-offs between speed, quality, and download size.

The Vehla team
Ottawa

Vehla currently exposes three local Gemma 4 configurations: E2B at 4-bit, E4B at 4-bit, and E4B at 8-bit. The default is E2B 4-bit because it is the fastest path to a working local model.

The benchmark suite we actually use

Public benchmarks are useful but not aligned with what Vehla does. Our local-model choice is mostly about the shape of common palette actions: rewrite, summarize, translate, explain, extract, and draft.
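As a rough illustration of what an action-shaped benchmark could look like, the sketch below runs one prompt per palette action through any callable that maps a prompt to text, and records wall-clock latency alongside the output. The `run_suite` helper, the prompt templates, and the stand-in `echo_model` are all hypothetical; this is not Vehla's actual harness.

```python
import time
from typing import Callable, Dict, List

# The six palette actions named above; the prompt templates are illustrative.
ACTIONS: Dict[str, str] = {
    "rewrite": "Rewrite this more clearly: {text}",
    "summarize": "Summarize in one sentence: {text}",
    "translate": "Translate to French: {text}",
    "explain": "Explain this simply: {text}",
    "extract": "Extract the dates mentioned in: {text}",
    "draft": "Draft a short reply to: {text}",
}


def run_suite(generate: Callable[[str], str], text: str) -> List[dict]:
    """Run every action once and record its output and wall-clock latency."""
    results = []
    for action, template in ACTIONS.items():
        prompt = template.format(text=text)
        start = time.perf_counter()
        output = generate(prompt)
        elapsed = time.perf_counter() - start
        results.append({"action": action, "seconds": elapsed, "output": output})
    return results


def echo_model(prompt: str) -> str:
    """Stand-in for a real local model call; returns the prompt upper-cased."""
    return prompt.upper()


if __name__ == "__main__":
    for row in run_suite(echo_model, "The meeting moved to 3 May."):
        print(f"{row['action']:<10} {row['seconds'] * 1000:.3f} ms")
```

Swapping `echo_model` for a function that calls a real local model would give per-action timings and outputs that can be compared side by side across configurations.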

Quality scores

Model               | Profile               | Best fit
Gemma 4 E2B (4-bit) | Fast                  | Default local model
Gemma 4 E4B (4-bit) | Higher quality        | Better for nuanced writing
Gemma 4 E4B (8-bit) | Highest local quality | Best when memory is available

E2B is the best default for getting started. E4B is the upgrade path when output quality matters more than download size and memory use.

Model Size

Attribute          | E2B                   | E4B
Estimated download | ~2.5 GB               | ~4.5 GB
Best use           | Fast everyday actions | Higher-quality writing
Resource use       | Medium                | High

Exact speed depends on the Mac, memory pressure, and prompt length.
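To make the trade-off concrete, here is a minimal sketch of how a selector might choose between the two sizes. It uses the download estimates from the table above and assumes the E4B figure refers to the 4-bit build. The `pick_model` function, the `quality_rank` field, and the 2x memory headroom factor are assumptions for illustration, not how Vehla decides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocalModel:
    name: str
    download_gb: float  # estimated download size from the table above
    quality_rank: int   # higher means better expected output (illustrative)


MODELS: List[LocalModel] = [
    LocalModel("Gemma 4 E2B (4-bit)", 2.5, 1),
    LocalModel("Gemma 4 E4B (4-bit)", 4.5, 2),
]

# E2B 4-bit is the default local model.
DEFAULT = MODELS[0]


def pick_model(free_memory_gb: float, prefer_quality: bool,
               headroom: float = 2.0) -> LocalModel:
    """Return the default unless quality is preferred and a larger model fits.

    `headroom` is an assumed safety factor: a model "fits" only when free
    memory is at least `headroom` times its download size.
    """
    if not prefer_quality:
        return DEFAULT
    fits = [m for m in MODELS if m.download_gb * headroom <= free_memory_gb]
    # Fall back to the default when nothing fits the assumed budget.
    return max(fits, key=lambda m: m.quality_rank, default=DEFAULT)


if __name__ == "__main__":
    print(pick_model(free_memory_gb=16, prefer_quality=True).name)
    print(pick_model(free_memory_gb=6, prefer_quality=True).name)
```

Under these assumptions a Mac with 16 GB free would get E4B when quality is preferred, while one with 6 GB free would stay on E2B.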

Where 4-bit hurts

At this scale, 4-bit quantization costs very little quality, except for two patterns: numerical reasoning (don't ask E4B to add three-digit numbers) and code with unusual variable names. We added a quality nudge to the prompt that mentions "be careful with arithmetic", and the worst cases disappeared.
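A prompt nudge of this kind can be as small as one appended sentence. The sketch below shows one way to add it exactly once to a system prompt. The exact wording of the nudge and the `with_quality_nudge` helper are assumptions; the real prompt text is not reproduced here.

```python
# Assumed wording; the note above only says the prompt mentions this phrase.
NUDGE = "Be careful with arithmetic."


def with_quality_nudge(system_prompt: str, nudge: str = NUDGE) -> str:
    """Append the nudge once; calling it again leaves the prompt unchanged."""
    base = system_prompt.strip()
    if nudge.lower() in base.lower():
        return base
    return f"{base}\n\n{nudge}" if base else nudge


if __name__ == "__main__":
    print(with_quality_nudge("You are a concise writing assistant."))
```

Keeping the helper idempotent means it is safe to apply at every call site without the nudge stacking up in the prompt.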

Why not bigger?

We tried larger models: Llama 3.1 8B at 4-bit and Qwen 2.5 7B. Both are excellent, but neither was clearly better than E4B for our specific actions, and both made cold model load too slow to feel snappy. We ended up keeping E4B as the largest built-in model and exposing a "load any MLX model" option for power users.
