onmydevice.ai
iPhone 15 · 6GB · iOS · $699
Chip: Apple A16
RAM: 6GB
GPU: 5-core GPU
AI accelerator: 16-core Neural Engine
2/5
Best for: Transcription and basic TTS on a budget iPhone
The iPhone 16 Pro has 8GB of RAM and a much faster Neural Engine, a significant upgrade for AI tasks.
iPhone 16 Pro · $999 · +22 models
PICK A TOOL
💬 Chat
🎙️ Transcription
🔊 Voice / TTS
LocallyAI · Beginner · Free
Run AI models privately on your iPhone and iPad
MODELS YOU CAN RUN
SmolLM2 135M
99 MB
Ultra-tiny language model — runs anywhere
FP16 · ~109 tok/s · 0.3 GB
LM Studio · Ollama · +2
Runs great
🔢
Nomic Embed Text
137 MB
Open-source embedding with 8K context — long document search
FP16 · ~109 tok/s · 0.3 GB
Ollama
Runs great
🗣️
Kokoro 82M
183 MB
24 natural voices — instant TTS
FP16 · ~82 tok/s · 0.4 GB
Piper · Xybrid CLI · +2
Runs great
🗣️
KittenTTS Mini
80 MB
KittenML's expressive 80M TTS — 8 voices, quantization-aware, runs on any CPU
FP16 · ~184 tok/s · 0.2 GB
Piper · Xybrid CLI · +2
Runs great
Wav2Vec2 Base
231 MB
High-accuracy English ASR
FP16 · ~128 tok/s · 0.2 GB
MacWhisper · Whisper Transcription
Runs great
Whisper Small
488 MB
Good accuracy-speed balance — 99 languages
FP16 · ~60 tok/s · 0.5 GB
MacWhisper · Whisper Transcription
Runs great
🔢
all-MiniLM-L6-v2
23 MB
Fast sentence embeddings — ideal for semantic search
FP16 · ~200 tok/s · 0.0 GB
Ollama
Runs great
🔢
BGE Small
34 MB
Compact BAAI embedding — great for RAG pipelines
FP16 · ~200 tok/s · 0.1 GB
Ollama
Runs great
🗣️
KittenTTS Nano
19 MB
Tiny, fast voice synthesis
FP16 · ~200 tok/s · 0.0 GB
Piper · Xybrid CLI · +2
Runs great
Whisper Tiny
89 MB
Real-time speech recognition — runs on anything
FP16 · ~200 tok/s · 0.1 GB
MacWhisper · Whisper Transcription
Runs great
🔢
GTE Large
335 MB
High-quality embeddings — top of MTEB benchmark at its size
FP16 · ~44 tok/s · 0.7 GB
Ollama
Runs great
SmolLM2 360M
267 MB
Ultra-small language model
FP16 · ~32 tok/s · 0.9 GB
LM Studio · Ollama · +2
Runs great
Qwen 2.5 0.5B
477 MB
Lightweight chat — surprisingly capable
FP16 · ~25 tok/s · 1.2 GB
LM Studio · Ollama · +2
Runs great
Qwen 2.5 Coder 0.5B
477 MB
Tiny code completion model — autocomplete on any device
FP16 · ~25 tok/s · 1.2 GB
LM Studio · Ollama · +5
Runs great
🗣️
OuteTTS 0.3 500M
500 MB
Voice cloning and natural TTS — zero-shot voice synthesis
FP16 · ~29 tok/s · 1.0 GB
Xybrid CLI · Supertonic · +1
Runs great
👁️
SmolVLM 500M
490 MB
Tiny vision-language model — describe images on any device
FP16 · ~27 tok/s · 1.1 GB
LM Studio · Ollama
Runs great
🗣️
NeuTTS Air
450 MB
Neuphonic's on-device TTS — 748M, instant voice cloning, runs on CPU via llama.cpp
FP16 · ~20 tok/s · 1.5 GB
Piper · Xybrid CLI · +2
Runs great
Whisper Medium
1.5 GB
Strong multilingual transcription
FP16 · ~20 tok/s · 1.5 GB
MacWhisper · Whisper Transcription
Runs great
Whisper Large V3
3.1 GB
Best open transcription — near-human accuracy
Q8 · ~18 tok/s · 1.6 GB
MacWhisper · Whisper Transcription
Runs great
🎙️
Distil-Whisper Large V3
1.5 GB
6x faster than Whisper Large — nearly same accuracy
FP16 · ~20 tok/s · 1.5 GB
MacWhisper · Whisper Transcription
Runs great
🧠
DeepSeek R1 Distill 1.5B
1.1 GB
Distilled reasoning — chain-of-thought in a tiny package
Q8 · ~16 tok/s · 1.8 GB
LM Studio · Ollama · +1
Runs well
Qwen 3.5 0.8B
560 MB
Alibaba's tiny multimodal model — vision-capable, 262K context, runs on phones
FP16 · ~16 tok/s · 1.8 GB
LM Studio · Ollama · +2
Runs well
Qwen 2.5 Coder 1.5B
1.1 GB
Best small code model — great for autocomplete
Q8 · ~17 tok/s · 1.7 GB
LM Studio · Ollama · +5
Runs well
SmolLM2 1.7B
1.0 GB
Largest SmolLM2 — punches above its weight
Q8 · ~16 tok/s · 1.8 GB
LM Studio · Ollama · +2
Runs well
👁️
Moondream 2B
1.3 GB
Small but capable vision model — image Q&A and captioning
Q8 · ~15 tok/s · 2.0 GB
LM Studio · Ollama
Runs well
StableLM 2 1.6B
1.0 GB
Stability AI's efficient small model
Q8 · ~16 tok/s · 1.8 GB
LM Studio · Ollama · +2
Runs well
🗣️
Dia 1.6B
1.6 GB
Nari Labs dialogue TTS — multi-speaker with emotion
Q8 · ~16 tok/s · 1.8 GB
Xybrid CLI · Supertonic · +1
Runs well
🎨
Stable Diffusion Turbo
2.2 GB
1-step image generation — instant results
FP16 · ~14 tok/s · 2.1 GB
Runs well
Bonsai 8B (1-bit)
1.16 GB
PrismML's native 1-bit model — 8B params in 1.16 GB, needs forked llama.cpp
Q1 · ~25 tok/s · 1.2 GB
Tight fit
Ternary Bonsai 1.7B
380 MB
PrismML's 1.58-bit ternary model — 1.7B params in under 0.4 GB, runs anywhere
Q2 · ~77 tok/s · 0.4 GB
Tight fit
Ternary Bonsai 8B
1.6 GB
PrismML's 1.58-bit ternary model — top intelligence at 8B in ~1.6 GB
Q2 · ~18 tok/s · 1.6 GB
Tight fit
Ternary Bonsai 4B
860 MB
PrismML's 1.58-bit ternary model — 4B-class intelligence in ~0.86 GB, ~9x smaller than fp16
Q2 · ~34 tok/s · 0.9 GB
Tight fit
Qwen 2.5 7B
4.4 GB
Top-tier 7B chat — rivals much larger models
Q2 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +1
Tight fit
Llama 3.1 8B
4.9 GB
Meta's workhorse 8B — excellent all-around
Q2 · ~11 tok/s · 2.7 GB
LM Studio · Ollama · +1
Tight fit
🧠
DeepSeek R1 Distill 7B
4.7 GB
Distilled from DeepSeek R1 — strong step-by-step reasoning
Q2 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +1
Tight fit
🧠
DeepSeek R1 Distill 8B
4.9 GB
Llama-based R1 distill — best open reasoning at 8B
Q2 · ~11 tok/s · 2.7 GB
LM Studio · Ollama · +1
Tight fit
🧠
Qwen3 8B
5.2 GB
Alibaba's latest — thinking mode with strong reasoning
Q2 · ~11 tok/s · 2.7 GB
LM Studio · Ollama · +1
Tight fit
Qwen 2.5 Coder 7B
4.4 GB
Best open 7B code model — rivals GPT-4 on coding benchmarks
Q2 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +5
Tight fit
Llama 3.2 3B
2.0 GB
Meta's 3B — best small model for many tasks
Q6 · ~11 tok/s · 2.6 GB
LM Studio · Ollama · +2
Tight fit
Gemma 3 4B
2.8 GB
Google's 4B — strong reasoning and instruction following
Q4 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +1
Tight fit
Gemma 4 E2B
1.5 GB
Google's 2026 on-device model — 2.3B active / 5.1B total, multimodal vision and audio
Q8 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +2
Tight fit
Qwen 3.5 2B
1.3 GB
Compact multimodal chat — native vision, 262K context
Q8 · ~13 tok/s · 2.3 GB
LM Studio · Ollama · +2
Tight fit
Phi-4 Mini
8.7 GB
Microsoft's reasoning model — exceptional for its size
Q5 · ~11 tok/s · 2.7 GB
LM Studio · Ollama · +1
Tight fit
Qwen 3.5 4B
2.5 GB
Multimodal sweet spot — vision, strong reasoning, 262K context
Q4 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +1
Tight fit
Llama 3.2 1B
791 MB
Meta's solid all-rounder
FP16 · ~11 tok/s · 2.7 GB
LM Studio · Ollama · +2
Tight fit
Mistral 7B
4.3 GB
High-quality reasoning and analysis
Q2 · ~13 tok/s · 2.3 GB
LM Studio · Ollama · +1
Tight fit
Qwen 2.5 VL 7B
4.6 GB
State-of-the-art vision-language — image, video, document understanding
Q2 · ~11 tok/s · 2.6 GB
LM Studio · Ollama
Tight fit
Qwen 2.5 3B
2.0 GB
Sweet-spot model — great quality-to-size ratio
Q6 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +2
Tight fit
Qwen 2.5 Coder 3B
2.0 GB
Strong code generation and editing at 3B
Q6 · ~12 tok/s · 2.5 GB
LM Studio · Ollama · +5
Tight fit
LFM2.5 1.2B
731 MB
Liquid AI's hybrid model — blazing fast CPU inference
FP16 · ~13 tok/s · 2.3 GB
LM Studio · Ollama · +1
Tight fit
Gemma 3 1B
786 MB
Google's compact model — great reasoning for its size
FP16 · ~13 tok/s · 2.2 GB
LM Studio · Ollama · +2
Tight fit
Bonsai Image 4B
1.8 GB
PrismML's 1.58-bit ternary diffusion — first 4B image model to run on iPhone, ~3GB in-browser
Q2 · ~16 tok/s · 1.8 GB
Tight fit
👨💻
DeepSeek Coder 6.7B
3.8 GB
DeepSeek's code model — strong at generation and debugging
Q2 · ~13 tok/s · 2.2 GB
LM Studio · Ollama · +5
Tight fit
👨💻
StarCoder2 3B
1.9 GB
BigCode's multilingual code model — 600+ languages
Q6 · ~12 tok/s · 2.4 GB
LM Studio · Ollama · +5
Tight fit
👁️
LLaVA 1.6 7B
4.5 GB
Leading open vision model — image understanding and reasoning
Q2 · ~13 tok/s · 2.3 GB
LM Studio · Ollama
Tight fit
🎨
Stable Diffusion 3.5 Medium
4.6 GB
Latest SD architecture — excellent image quality
Q8 · ~11 tok/s · 2.7 GB
Tight fit
TinyLlama 1.1B
780 MB
Popular tiny model trained on 3T tokens
FP16 · ~13 tok/s · 2.2 GB
LM Studio · Ollama · +2
Tight fit
🎨
SDXL Turbo
6.5 GB
High-res 1-step generation — 1024×1024 in seconds
Q4 · ~13 tok/s · 2.2 GB
Tight fit
Gemma 3n E2B
3.0 GB
Google's on-device multimodal — 2B effective params with vision and audio
needs 3.0 GB
Too heavy
Gemma 3n E4B
4.5 GB
Google's most capable on-device model — 4B effective, multimodal
needs 4.5 GB
Too heavy
Gemma 3 12B
7.3 GB
Google's largest open model — near-frontier quality
needs 4.1 GB
Too heavy
Mistral Nemo 12B
7.1 GB
Mistral's 12B — Tekken tokenizer, 128K context
needs 4.0 GB
Too heavy
Gemma 4 E4B
3.0 GB
Google's most capable edge model — 4B effective, multimodal, 128K context
needs 3.0 GB
Too heavy
Gemma 4 26B A4B
16 GB
Google's sparse MoE — 26B total, 3.8B active per token, 256K context
needs 10.0 GB
Too heavy
Gemma 4 31B
18 GB
Google's dense flagship open model — near server-grade quality, 256K context
needs 10.5 GB
Too heavy
Qwen 3.5 9B
5.5 GB
Flagship small Qwen — native multimodal, 262K context, rivals far larger models
needs 3.1 GB
Too heavy
Qwen 3.5 35B A3B
20 GB
Alibaba's sparse MoE — 35B total, 3B active per token, multimodal, 262K context
needs 12.0 GB
Too heavy
LFM2.5 8B A1B
4.7 GB
Liquid AI's on-device MoE — 8.3B total, 1.5B active, reasoning + tool calling at 3–4B quality
needs 2.8 GB
Too heavy
Phi-4 Medium
8.0 GB
Microsoft's 14B reasoning model — frontier-class performance
needs 4.7 GB
Too heavy
Llama 4 Scout
63 GB
Meta's MoE model — 109B total params, 17B active per token
needs 42.0 GB
Too heavy
DeepSeek V4 Flash
156 GB
DeepSeek's V4 Flash MoE — 284B total, 13B active, 1M context (server-grade hardware)
needs 95.0 GB
Too heavy
Kimi K2.6
594 GB
Moonshot's 1T MoE — 32B active, frontier agentic coding; needs a multi-GPU workstation
needs 300.0 GB
Too heavy
👨💻
Laguna XS.2
19 GB
Poolside's open agentic coding MoE — 33B total, 3B active, runs locally on 36GB Macs
needs 11.0 GB
Too heavy
Qwen3 Coder Next
46 GB
Alibaba's local coding agent MoE — 80B total, 3B active, agentic long-horizon coding, 256K context
needs 28.0 GB
Too heavy
🎨
FLUX.1 Schnell
12.1 GB
Black Forest Labs' fast model — stunning quality in 4 steps
needs 4.0 GB
Too heavy
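The size claims on the low-bit cards above (8B parameters in 1.16 GB at 1 bit, 1.7B parameters in under 0.4 GB at 1.58 bits, a 4B model at about 0.86 GB being roughly 9x smaller than FP16) can be sanity-checked with simple arithmetic: weight storage is approximately parameter count times bits per weight, divided by 8 bits per byte. The sketch below is only that back-of-the-envelope check. It assumes decimal gigabytes and counts weights only, ignoring file metadata, any tensors kept at higher precision, and runtime memory such as the KV cache. The function name `approx_weight_gb` is hypothetical and is not something the site exposes.

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only size in decimal GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


# Parameter counts and bit widths are taken from the card descriptions above.
estimates = {
    "Bonsai 8B (1-bit)": approx_weight_gb(8, 1),
    "Ternary Bonsai 1.7B (1.58-bit)": approx_weight_gb(1.7, 1.58),
    "Ternary Bonsai 4B (1.58-bit)": approx_weight_gb(4, 1.58),
    "Ternary Bonsai 8B (1.58-bit)": approx_weight_gb(8, 1.58),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.2f} GB of weights")

# Ratio behind the "~9x smaller than fp16" claim, using the listed 0.86 GB size.
fp16_4b_gb = approx_weight_gb(4, 16)
listed_ratio = fp16_4b_gb / 0.86
print(f"4B at FP16: {fp16_4b_gb:.1f} GB, about {listed_ratio:.1f}x the listed 0.86 GB")
```

The estimates land slightly below the listed download sizes (for example about 1.0 GB versus the listed 1.16 GB for Bonsai 8B), which is what a weights-only figure would be expected to do; the gap is not broken down on this page.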
BENCHMARKS
View 58 real-world benchmarks
Measured tok/s, RAM usage, and more from community tests
Your iPhone 15 can run 22 AI models locally.