Chat (15 models)
| Model | Size | Quant | Speed | Fit | Notes |
|---|---|---|---|---|---|
| Bonsai 8B (1-bit) | 1.16 GB | Q1 | ~34 tok/s | Tight fit | PrismML's native 1-bit model; requires a forked llama.cpp |
| SmolLM2 135M | 99 MB | FP16 | ~148 tok/s | Runs great | Ultra-tiny language model; runs anywhere |
| SmolLM2 360M | 267 MB | FP16 | ~43 tok/s | Runs great | Ultra-small language model |
| Qwen 2.5 0.5B | 477 MB | FP16 | ~33 tok/s | Runs great | Lightweight chat model; surprisingly capable |
| Gemma 3 1B | 786 MB | FP16 | ~18 tok/s | Runs well | Google's compact model; great reasoning for its size |
| SmolLM2 1.7B | 1.0 GB | Q8 | ~22 tok/s | Runs well | Largest SmolLM2; punches above its weight |
| Qwen 2.5 7B | 4.4 GB | Q2 | ~16 tok/s | Tight fit | Top-tier 7B chat; rivals much larger models |
| Llama 3.1 8B | 4.9 GB | Q2 | ~15 tok/s | Tight fit | Meta's workhorse 8B; excellent all-around |
| Llama 3.2 3B | 2.0 GB | Q6 | ~15 tok/s | Tight fit | Meta's 3B; best small model for many tasks |
| Gemma 3 4B | 2.8 GB | Q4 | ~16 tok/s | Tight fit | Google's 4B; strong reasoning and instruction following |
| Llama 3.2 1B | 791 MB | FP16 | ~15 tok/s | Tight fit | Meta's solid all-rounder |
| StableLM 2 1.6B | 1.0 GB | Q8 | ~22 tok/s | Runs well | Stability AI's efficient small model |
| TinyLlama 1.1B | 780 MB | FP16 | ~18 tok/s | Runs well | Popular tiny model trained on 3T tokens |
| Qwen 2.5 3B | 2.0 GB | Q6 | ~16 tok/s | Tight fit | Sweet-spot model; great quality-to-size ratio |
| LFM2.5 1.2B | 731 MB | FP16 | ~17 tok/s | Tight fit | Liquid AI's hybrid model; blazing-fast CPU inference |
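As a rough rule of thumb (an approximation, not this catalog's actual sizing logic), a quantized model's file size is about parameter count × bits per weight, and the speed column translates directly into reply latency. A minimal sketch:

```python
def estimate_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # params × bits per weight, converted from bits to decimal gigabytes.
    # Real files run slightly larger: metadata plus a few tensors
    # (embeddings, norms) are kept at higher precision.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def reply_seconds(tokens: int, tok_per_s: float) -> float:
    # Wall-clock time to stream a reply at a sustained decode rate.
    return tokens / tok_per_s

print(estimate_size_gb(8, 1.16))  # ~1.16 GB: 8B params at ~1.16 effective bits/weight
print(reply_seconds(300, 15))     # 20.0 s for a 300-token reply at ~15 tok/s
```

The Bonsai row is consistent with this arithmetic (8B parameters at roughly 1.16 effective bits per weight is about 1.16 GB); the higher quant levels (Q4–Q8, FP16) scale the same way with more bits per weight.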
Other categories:
- Reasoning (6 models)
- Code (6 models)
- Vision (4 models)
- Embedding (4 models)
- Image (3 models)
- Voice (4 models)
- Transcription (6 models)
- Too heavy (7 models)