Chat (19 models)
Llama 3.2 1B 791 MB
Meta's solid all-rounder
FP16 ~74 tok/s
Runs great →
Bonsai 8B (1-bit) 1.16 GB
PrismML's native 1-bit model — 8B params in 1.16 GB, needs a forked llama.cpp build
Q1 ~172 tok/s
Tight fit →
Gemma 3 12B 7.3 GB
Google's largest open model — near-frontier quality
Q8 ~15 tok/s
Runs great →
Mistral Nemo 12B 7.1 GB
Mistral's 12B — Tekken tokenizer, 128K context
Q8 ~16 tok/s
Runs great →
Gemma 3 1B 786 MB
Google's compact model — great reasoning for its size
FP16 ~91 tok/s
Runs great →
LFM2.5 1.2B 731 MB
Liquid AI's hybrid model — blazing-fast CPU inference
FP16 ~85 tok/s
Runs great →
Qwen 2.5 0.5B 477 MB
Lightweight chat — surprisingly capable
FP16 ~167 tok/s
Runs great →
Llama 3.2 3B 2.0 GB
Meta's 3B — best small model for many tasks
FP16 ~30 tok/s
Runs great →
SmolLM2 1.7B 1.0 GB
Largest SmolLM2 — punches above its weight
FP16 ~59 tok/s
Runs great →
Gemma 3 4B 2.8 GB
Google's 4B — strong reasoning and instruction following
FP16 ~22 tok/s
Runs great →
SmolLM2 360M 267 MB
Ultra-small language model
FP16 ~200 tok/s
Runs great →
Qwen 2.5 3B 2.0 GB
Sweet-spot model — great quality-to-size ratio
FP16 ~31 tok/s
Runs great →
SmolLM2 135M 99 MB
Ultra-tiny language model — runs anywhere
FP16 ~200 tok/s
Runs great →
StableLM 2 1.6B 1.0 GB
Stability AI's efficient small model
FP16 ~61 tok/s
Runs great →
Qwen 2.5 7B 4.4 GB
Top-tier 7B chat — rivals much larger models
FP16 ~13 tok/s
Runs well →
Llama 3.1 8B 4.9 GB
Meta's workhorse 8B — excellent all-around
FP16 ~12 tok/s
Runs well →
TinyLlama 1.1B 780 MB
Popular tiny model trained on 3T tokens
FP16 ~91 tok/s
Runs great →
Gemma 3n E4B 4.5 GB
Google's most capable on-device model — 4B effective params, multimodal
FP16 ~15 tok/s
Runs great →
Gemma 3n E2B 3.0 GB
Google's on-device multimodal — 2B effective params with vision and audio
FP16 ~22 tok/s
Runs great →
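The sizes and speeds above follow from simple arithmetic: a model's download size is roughly parameter count times bits per weight, and a reply's wall-clock time is its length divided by the decode rate. A minimal sketch (the helper names are illustrative, not part of any tool):

```python
def approx_model_size_gb(n_params_billion: float, bits_per_weight: float,
                         overhead: float = 1.05) -> float:
    """Rough download size: params × bits ÷ 8, plus ~5% assumed for
    embeddings and file metadata."""
    return n_params_billion * 1e9 * bits_per_weight / 8 * overhead / 1e9

def seconds_for_reply(n_tokens: int, tok_per_s: float) -> float:
    """Wall-clock time to stream a reply at a steady decode rate."""
    return n_tokens / tok_per_s

# An 8B model at 1 bit per weight lands near 1 GB, in line with the
# Bonsai 8B (1-bit) entry above (1.16 GB).
print(round(approx_model_size_gb(8, 1.0), 2))  # → 1.05

# A 256-token reply at Gemma 3 12B's ~15 tok/s takes roughly 17 s.
print(round(seconds_for_reply(256, 15)))  # → 17
```

The 5% overhead factor is a guess; real GGUF files vary with tokenizer size and quantization layout, so treat the estimate as a sanity check, not a spec.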
Reasoning (7 models)
Code (6 models)
Vision (4 models)
Embedding (4 models)
Image (4 models)
Voice (4 models)
Transcription (6 models)
Too heavy (1 model)