Chat (15 models)
| Model | Size | Quant | Speed | Fit | Notes |
|---|---|---|---|---|---|
| Bonsai 8B (1-bit) | 1.16 GB | Q1 | ~34 tok/s | Tight fit | PrismML's native 1-bit model; requires a forked llama.cpp |
| SmolLM2 135M | 99 MB | FP16 | ~148 tok/s | Runs great | Ultra-tiny language model; runs anywhere |
| SmolLM2 360M | 267 MB | FP16 | ~43 tok/s | Runs great | Ultra-small language model |
| Qwen 2.5 0.5B | 477 MB | FP16 | ~33 tok/s | Runs great | Lightweight chat model; surprisingly capable |
| Gemma 3 1B | 786 MB | FP16 | ~18 tok/s | Runs well | Google's compact model; great reasoning for its size |
| SmolLM2 1.7B | 1.0 GB | Q8 | ~22 tok/s | Runs well | Largest SmolLM2; punches above its weight |
| Qwen 2.5 7B | 4.4 GB | Q2 | ~16 tok/s | Tight fit | Top-tier 7B chat; rivals much larger models |
| Llama 3.1 8B | 4.9 GB | Q2 | ~15 tok/s | Tight fit | Meta's workhorse 8B; excellent all-around |
| Llama 3.2 3B | 2.0 GB | Q6 | ~15 tok/s | Tight fit | Meta's 3B; best small model for many tasks |
| Gemma 3 4B | 2.8 GB | Q4 | ~16 tok/s | Tight fit | Google's 4B; strong reasoning and instruction following |
| Llama 3.2 1B | 791 MB | FP16 | ~15 tok/s | Tight fit | Meta's solid all-rounder |
| StableLM 2 1.6B | 1.0 GB | Q8 | ~22 tok/s | Runs well | Stability AI's efficient small model |
| TinyLlama 1.1B | 780 MB | FP16 | ~18 tok/s | Runs well | Popular tiny model trained on 3T tokens |
| Qwen 2.5 3B | 2.0 GB | Q6 | ~16 tok/s | Tight fit | Sweet-spot model; great quality-to-size ratio |
| LFM2.5 1.2B | 731 MB | FP16 | ~17 tok/s | Tight fit | Liquid AI's hybrid model; blazing-fast CPU inference |
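As a rough rule of thumb (an approximation, not this catalog's actual sizing logic), a quantized model's file size is about parameter count × bits per weight, and the speed column translates directly into reply latency. A minimal sketch:

```python
def estimate_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # params × bits per weight, converted from bits to decimal gigabytes.
    # Real files run slightly larger: metadata plus a few tensors
    # (embeddings, norms) are kept at higher precision.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def reply_seconds(tokens: int, tok_per_s: float) -> float:
    # Wall-clock time to stream a reply at a sustained decode rate.
    return tokens / tok_per_s

print(estimate_size_gb(8, 1.16))  # ~1.16 GB: 8B params at ~1.16 effective bits/weight
print(reply_seconds(300, 15))     # 20.0 s for a 300-token reply at ~15 tok/s
```

The Bonsai row is consistent with this arithmetic (8B parameters at roughly 1.16 effective bits per weight is about 1.16 GB); the higher quant levels (Q4–Q8, FP16) scale the same way with more bits per weight.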
Other categories:
- Reasoning (6 models)
- Code (6 models)
- Vision (4 models)
- Embedding (4 models)
- Image (3 models)
- Voice (4 models)
- Transcription (6 models)
- Too heavy (7 models)