Vehla currently exposes three local Gemma 4 configurations: E2B at 4-bit, E4B at 4-bit, and E4B at 8-bit. The default is E2B 4-bit because it is the fastest path to a working local model.
The benchmark suite we actually use
Public benchmarks are useful, but they do not measure what Vehla does. Our local-model choice depends mostly on how each model handles common palette actions: rewrite, summarize, translate, explain, extract, and draft.
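To make the idea of an action-shaped suite concrete, here is a minimal sketch of how per-action pass rates could be collected. It is not Vehla's actual harness: `Case`, `run_suite`, and the `model` callable signature are hypothetical names introduced for illustration, and the only thing taken from the text is the list of six actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The six palette actions named in the text.
ACTIONS = ("rewrite", "summarize", "translate", "explain", "extract", "draft")


@dataclass
class Case:
    """One test case (hypothetical structure, not Vehla's real format)."""
    action: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(model: Callable[[str, str], str], cases: List[Case]) -> Dict[str, float]:
    """Run every case through `model` and return the pass rate per action."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        if case.action not in ACTIONS:
            raise ValueError(f"unknown action: {case.action}")
        total[case.action] = total.get(case.action, 0) + 1
        output = model(case.action, case.prompt)
        if case.check(output):
            passed[case.action] = passed.get(case.action, 0) + 1
    # Only actions that had at least one case appear in the result.
    return {action: passed.get(action, 0) / count for action, count in total.items()}
```

Any callable that takes an action name and a prompt can stand in for `model`, so the same cases can be run against each local configuration and compared action by action.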
Quality profiles
| Model | Profile | Best fit |
|---|---|---|
| Gemma 4 E2B (4-bit) | Fast | Default local model |
| Gemma 4 E4B (4-bit) | Higher quality | Better for nuanced writing |
| Gemma 4 E4B (8-bit) | Highest local quality | Best when memory is available |
E2B is the best default for getting started. E4B is the upgrade path when output quality matters more than download size and memory use.
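The three configurations and the default can be written down as a small registry. This is a sketch under assumptions: the keys such as `e2b-4bit`, the `LocalModel` type, and `resolve_model` are hypothetical and are not Vehla's actual setting names. The model families, bit widths, and profiles come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalModel:
    key: str      # hypothetical identifier, not a real Vehla setting
    family: str
    bits: int
    profile: str


# The three configurations listed in the table above.
MODELS = {
    "e2b-4bit": LocalModel("e2b-4bit", "Gemma 4 E2B", 4, "Fast"),
    "e4b-4bit": LocalModel("e4b-4bit", "Gemma 4 E4B", 4, "Higher quality"),
    "e4b-8bit": LocalModel("e4b-8bit", "Gemma 4 E4B", 8, "Highest local quality"),
}

# E2B at 4-bit is the default local model.
DEFAULT_KEY = "e2b-4bit"


def resolve_model(requested: Optional[str] = None) -> LocalModel:
    """Return the requested configuration, falling back to the default
    when nothing is requested or the key is unknown."""
    if requested is None:
        return MODELS[DEFAULT_KEY]
    return MODELS.get(requested, MODELS[DEFAULT_KEY])
```

Falling back to the default on an unknown key is one possible design choice; raising an error instead would surface typos in a settings file earlier.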
Model size
| Attribute | E2B | E4B |
|---|---|---|
| Estimated download | ~2.5 GB | ~4.5 GB |
| Best use | Fast everyday actions | Higher-quality writing |
| Resource use | Medium | High |
Exact speed depends on the Mac, memory pressure, and prompt length.
Where 4-bit hurts
Quantization costs little quality at this scale, with two exceptions: numerical reasoning (don't ask E4B to add three-digit numbers) and code with unusual variable names. We added a quality nudge to the prompt that says "be careful with arithmetic", and the worst cases disappeared.
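A prompt nudge like this can be sketched in a few lines. The version below is an illustration, not Vehla's actual prompt builder: `build_prompt`, the prompt layout, and the rule of adding the nudge only when the input contains at least two numbers are all assumptions. The source only says that the prompt mentions being careful with arithmetic.

```python
import re

ARITHMETIC_NUDGE = "Be careful with arithmetic."

# Assumption: two or more numbers in the input is a rough proxy for
# "this request may involve numerical reasoning".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def build_prompt(action: str, text: str, always_nudge: bool = False) -> str:
    """Assemble a prompt, adding the arithmetic nudge when the text looks numeric."""
    parts = [f"Action: {action}"]
    if always_nudge or len(_NUMBER.findall(text)) >= 2:
        parts.append(ARITHMETIC_NUDGE)
    parts.append(text)
    return "\n\n".join(parts)
```

Setting `always_nudge=True` gives the simpler behavior of including the nudge in every prompt, which costs a few tokens per request and removes the heuristic.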
Why not bigger?
We tried larger models: Llama 3.1 8B at 4-bit and Qwen 2.5 7B. Both are excellent, but neither was clearly better than E4B for our specific actions, and both made cold model load too slow to feel snappy. We ended up keeping E4B as the largest built-in option and exposing a "load any MLX model" option for power users.