Chat 32 models
Ternary Bonsai 8B 1.6 GB
PrismML's 1.58-bit ternary model — top intelligence at 8B in ~1.6 GB
Q2 ~125 tok/s
Tight fit
Qwen 3.5 2B 1.3 GB
Compact multimodal chat — native vision, 262K context
FP16 ~48 tok/s
Runs great
Llama 3.2 1B 791 MB
Meta's solid all-rounder
FP16 ~74 tok/s
Runs great
Bonsai 8B (1-bit) 1.16 GB
PrismML's native 1-bit model — 8B params in 1.16 GB, needs forked llama.cpp
Q1 ~172 tok/s
Tight fit
Gemma 4 E2B 1.5 GB
Google's 2026 on-device model — 2.3B active / 5.1B total, multimodal vision and audio
FP16 ~44 tok/s
Runs great
Qwen 3.5 0.8B 560 MB
Alibaba's tiny multimodal model — vision-capable, 262K context, runs on phones
FP16 ~111 tok/s
Runs great
Gemma 3 12B 7.3 GB
Google's 12B open model with near-frontier quality
Q8 ~15 tok/s
Runs great
Mistral Nemo 12B 7.1 GB
Mistral's 12B — Tekken tokenizer, 128K context
Q8 ~16 tok/s
Runs great
Gemma 3 1B 786 MB
Google's compact model — great reasoning for its size
FP16 ~91 tok/s
Runs great
LFM2.5 1.2B 731 MB
Liquid AI's hybrid model — blazing fast CPU inference
FP16 ~85 tok/s
Runs great
Qwen 2.5 0.5B 477 MB
Lightweight chat — surprisingly capable
FP16 ~167 tok/s
Runs great
Llama 3.2 3B 2.0 GB
Meta's 3B — best small model for many tasks
FP16 ~30 tok/s
Runs great
SmolLM2 1.7B 1.0 GB
Largest SmolLM2 — punches above its weight
FP16 ~59 tok/s
Runs great
Gemma 3 4B 2.8 GB
Google's 4B — strong reasoning and instruction following
FP16 ~22 tok/s
Runs great
Qwen 3.5 4B 2.5 GB
Multimodal sweet spot — vision, strong reasoning, 262K context
FP16 ~24 tok/s
Runs great
Ternary Bonsai 4B 860 MB
PrismML's 1.58-bit ternary model — 4B-class intelligence in ~0.86 GB, ~9x smaller than fp16
Q2 ~200 tok/s
Tight fit
Gemma 4 E4B 3.0 GB
Google's most capable edge model — 4B effective, multimodal, 128K context
FP16 ~22 tok/s
Runs great
SmolLM2 360M 267 MB
Ultra-small language model
FP16 ~200 tok/s
Runs great
Qwen 2.5 3B 2.0 GB
Sweet-spot model — great quality-to-size ratio
FP16 ~31 tok/s
Runs great
SmolLM2 135M 99 MB
Ultra-tiny language model — runs anywhere
FP16 ~200 tok/s
Runs great
StableLM 2 1.6B 1.0 GB
Stability AI's efficient small model
FP16 ~61 tok/s
Runs great
Qwen 2.5 7B 4.4 GB
Top-tier 7B chat — rivals much larger models
FP16 ~13 tok/s
Runs well
Llama 3.1 8B 4.9 GB
Meta's workhorse 8B — excellent all-around
FP16 ~12 tok/s
Runs well
TinyLlama 1.1B 780 MB
Popular tiny model trained on 3T tokens
FP16 ~91 tok/s
Runs great
Ternary Bonsai 1.7B 380 MB
PrismML's 1.58-bit ternary model — 1.7B params in under 0.4 GB, runs anywhere
Q2 ~200 tok/s
Tight fit
Gemma 3n E4B 4.5 GB
Google's most capable on-device model — 4B effective, multimodal
FP16 ~15 tok/s
Runs great
Gemma 3n E2B 3.0 GB
Google's on-device multimodal — 2B effective params with vision and audio
FP16 ~22 tok/s
Runs great
Qwen 3.5 9B 5.5 GB
Flagship small Qwen — native multimodal, 262K context, rivals far larger models
FP16 ~11 tok/s
Tight fit
Gemma 4 31B 18 GB
Google's dense flagship open model — near server-grade quality, 256K context
Q5 ~10 tok/s
Tight fit
LFM2.5 8B A1B 4.7 GB
Liquid AI's on-device MoE — 8.3B total, 1.5B active, reasoning + tool calling at 3–4B quality
FP16 ~12 tok/s
Runs well
Gemma 4 26B A4B 16 GB
Google's sparse MoE — 26B total, 3.8B active per token, 256K context
Q6 ~10 tok/s
Tight fit
Qwen 3.5 35B A3B 20 GB
Alibaba's sparse MoE — 35B total, 3B active per token, multimodal, 262K context
Q4 ~10 tok/s
Tight fit
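
Several entries in the Chat list quote a file size alongside a bit width or a compression ratio (for example "8B params in 1.16 GB" and "~9x smaller than fp16"). The sketch below shows one way to reproduce that arithmetic. It assumes the nominal parameter count in each model name, 1 GB = 10^9 bytes, and that the listed size is all weight data; none of these assumptions is confirmed by the listing, and the helper names are hypothetical.

```python
def fp16_size_gb(params_billions: float) -> float:
    """Size in GB at 16 bits (2 bytes) per weight, assuming 1 GB = 1e9 bytes."""
    return params_billions * 2.0


def bits_per_weight(size_gb: float, params_billions: float) -> float:
    """Effective bits per weight implied by a file size and a parameter count."""
    return size_gb * 8.0 / params_billions


def compression_vs_fp16(size_gb: float, params_billions: float) -> float:
    """How many times smaller the file is than an FP16 copy of the same model."""
    return fp16_size_gb(params_billions) / size_gb


# (name, nominal parameters in billions, listed size in GB) taken from the list above.
listed = [
    ("Ternary Bonsai 8B", 8.0, 1.6),
    ("Ternary Bonsai 4B", 4.0, 0.86),
    ("Ternary Bonsai 1.7B", 1.7, 0.38),
    ("Bonsai 8B (1-bit)", 8.0, 1.16),
]

for name, params_b, size_gb in listed:
    print(
        f"{name}: {bits_per_weight(size_gb, params_b):.2f} bits/weight, "
        f"{compression_vs_fp16(size_gb, params_b):.1f}x smaller than FP16"
    )
```

Under these assumptions Ternary Bonsai 4B comes out at roughly 9.3x smaller than FP16, in line with the "~9x" in its description. The effective bits per weight computed this way need not match the nominal 1-bit or 1.58-bit figure exactly, since real parameter counts and file contents differ from the rounded numbers used here.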
Reasoning 7 models
Code 7 models
Vision 4 models
Embedding 4 models
Image 5 models
Voice 6 models
Transcription 6 models
Too heavy 4 models