Local AI vs Cloud AI: What Actually Runs on Your Device
Every AI product launched in the past year claims to be “local,” “private,” or “on-device.” Apple Intelligence. Google Gemini Nano. A dozen startups with sleek landing pages promising your data never leaves your machine. The term has become so diluted that it risks meaning nothing at all.
But the engineering reality behind local AI is both more nuanced and more interesting than the marketing suggests. Running a language model on your laptop is genuinely possible now in ways it was not two years ago. The question is not whether it can be done, but what actually runs where, what the real trade-offs are, and how to tell the difference between products that deliver on the promise and those that are simply “local-washing.”
This is a technical breakdown of the current state of on-device AI, written for people who want specifics rather than slogans.
The Three Tiers of “Local AI”
Not all claims about local AI mean the same thing. There are three distinct tiers, and the differences between them are significant.
Tier 1: True Local — Everything On-Device
In the strictest definition, local AI means the model weights, the inference computation, and all data storage exist entirely on your hardware. Zero network requests are made during operation. You could pull the ethernet cable, disable Wi-Fi, and the system would function identically.
This is the hardest tier to achieve for a general-purpose AI assistant, but it is entirely possible for specific capabilities: embeddings, speech-to-text, small language models, and vector search all run comfortably on modern consumer hardware.
Tier 2: Hybrid — Local for Private, Cloud for Complex
The pragmatic middle ground. Simple, frequent, or privacy-sensitive tasks run locally. Complex reasoning or tasks requiring frontier-model capability escalate to cloud APIs. The critical architectural question is not whether hybrid exists, but how the boundary is drawn and who controls it.
A well-designed hybrid system makes the escalation decision transparent and consent-gated. The user sees what will be sent to the cloud before it happens. A poorly designed one makes the decision silently, often defaulting to cloud processing and only falling back to local when the network is unavailable.
Tier 3: “Local-Washing” — Marketing Without Substance
This is the most common category. Products that advertise “on-device AI” but actually send your data to cloud APIs, cache the responses locally, and call it local processing. Or products that run a trivial classifier on-device (to decide which cloud model to call) and market the classifier as “local AI.”
How to identify local-washing: monitor your network traffic while using the product. Tools like Little Snitch (macOS), Wireshark, or even the Network tab in developer tools will tell you whether your prompts are leaving your machine. If a product claims local AI but makes HTTPS requests to api.openai.com or api.anthropic.com every time you ask a question, you have your answer.
What Can Actually Run Locally in 2025-2026
The landscape of on-device AI has shifted dramatically. Here is a realistic, numbers-driven assessment of what consumer hardware can handle today.
Large Language Models
The open-weight model ecosystem has matured rapidly. The models worth knowing about for local inference:
Llama 3 (Meta, 2024-2025): The 8B parameter version is the workhorse of local LLM inference. It fits comfortably in 16GB of RAM when quantized to 4-bit precision and delivers genuinely useful output for summarization, rewriting, question answering, and code assistance. The 70B variant requires 32-48GB of RAM at 4-bit quantization, putting it within reach of high-end MacBooks and workstations but outside typical consumer hardware.
Qwen 2.5 (Alibaba, 2025): Particularly strong at mathematics and structured reasoning. The 7B variant punches well above its weight class on benchmarks like GSM8K and MATH, making it a compelling choice for applications that need analytical capability. We have benchmarked it extensively in our own math competition work, and its reasoning chains on algebra and number theory problems are remarkably coherent for a model of its size.
Phi-3 and Phi-3.5 (Microsoft, 2024-2025): Microsoft’s family of small language models. The 3.8B “mini” variant is notable because it runs on devices with as little as 8GB of RAM while still producing surprisingly coherent output. It demonstrates that careful training data curation can partially compensate for parameter count.
Mistral 7B and Mixtral (Mistral AI, 2024): Mistral 7B set the benchmark for what a 7-billion parameter model could achieve. Its sliding window attention mechanism makes it particularly efficient for long-context local inference. Mixtral, the mixture-of-experts variant, activates only two of its eight experts per token — roughly 13B active parameters out of 47B total — achieving quality competitive with much larger dense models at the inference cost of a 13B model.
Gemma 2 (Google, 2024-2025): Available at 2B, 9B, and 27B parameter counts. The 9B variant is well-suited for local use, and Google’s instruction-tuned versions handle conversational tasks cleanly. The 2B variant is small enough to run on mobile devices.
Hardware Requirements
The minimum hardware for useful local LLM inference:
- Apple Silicon M1-M4: The unified memory architecture is the single biggest advantage for local AI on consumer hardware. Because the CPU and GPU share the same memory pool, a MacBook with 16GB of unified memory can load a 7B model without the memory copy overhead that plagues discrete GPU setups. An M2 Pro with 16GB handles 7B models at 30-50 tokens per second. An M3 Max with 36GB or 48GB can run 70B models at usable speeds (8-15 tokens/sec).
- NVIDIA GPUs (RTX 4060 and above): Discrete GPU inference requires the model to fit in VRAM. An RTX 4060 with 8GB VRAM can run quantized 7B models. An RTX 4090 with 24GB VRAM handles quantized 30B+ models. The bottleneck is VRAM, not compute.
- RAM minimums: 16GB is the practical floor for running a 7B model alongside an operating system and application. 32GB is recommended for comfortable operation with larger models or multiple concurrent tasks.
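A useful back-of-envelope check on the throughput figures above: single-stream LLM decoding is typically memory-bandwidth bound, because every generated token requires streaming roughly the whole quantized model through the memory bus. Tokens per second is therefore bounded by bandwidth divided by model size. A sketch, where the ~200 GB/s bandwidth figure for the M2 Pro is an assumption from public spec sheets:

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on single-stream decoding speed.

    Decoding is memory-bandwidth bound: each token pass reads (roughly)
    the entire quantized model. Real-world speeds land below this bound
    due to compute overhead and cache effects.
    """
    return bandwidth_gb_s / model_size_gb

# Assumptions: ~200 GB/s unified memory bandwidth (M2 Pro),
# ~4.5 GB for a 7B model quantized to Q4_K_M.
print(round(decode_tokens_per_sec(200, 4.5)))  # ≈ 44 tok/s ceiling
```

The ~44 tok/s ceiling is consistent with the observed 30-50 tok/s range, which is why VRAM and memory bandwidth, not raw compute, dominate local inference performance.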
The GGUF Revolution
The technical enabler for consumer-grade local LLM inference is llama.cpp and the GGUF model format. Originally written by Georgi Gerganov in C/C++, llama.cpp performs quantized inference without requiring a full ML framework like PyTorch or TensorFlow.
Quantization compresses model weights from their original 16-bit or 32-bit floating point representation into lower bit widths. The key quantization levels in GGUF:
- Q4_K_M (4-bit, medium): The sweet spot for most users. Roughly 4.5 bits per weight on average. A 7B model compresses to approximately 4.5GB. Quality degradation is minimal for conversational and instructional tasks — typically within 1-3% of the full-precision model on standard benchmarks.
- Q5_K_M (5-bit, medium): Slightly higher quality at a modest size increase. A 7B model is approximately 5.5GB. The quality difference between Q4_K_M and Q5_K_M is measurable on benchmarks but rarely noticeable in practice for most use cases.
- Q8_0 (8-bit): Near-lossless quantization. A 7B model is approximately 7.5GB. Useful when you have the RAM headroom and want to minimize any quality loss, but the diminishing returns compared to Q5 are significant.
The practical implication: a high-quality 7B language model fits in a 4.5GB file and runs at conversational speed on a three-year-old laptop. That was not possible in 2023.
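The file sizes above follow directly from bits-per-weight arithmetic. A floor estimate can be sketched as below; real GGUF files come in somewhat higher than this floor because “7B” models often have 7.2B+ actual parameters, some layers are kept at higher precision, and quantization scales and metadata add overhead:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Floor estimate of a quantized model file: parameters x bits per weight.

    params_billion * bits gives gigabits; dividing by 8 converts to GB.
    """
    return params_billion * bits_per_weight / 8

print(round(gguf_size_gb(7, 4.5), 2))  # 3.94 — Q4_K_M floor for a nominal 7B model
print(round(gguf_size_gb(7, 8.0), 2))  # 7.0  — Q8_0 floor
```

The gap between the 3.94GB floor and the ~4.5GB real file is exactly that per-model overhead.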
Embeddings: The Strongest Case for Local AI
If there is one AI capability where local processing is an unambiguous win, it is text embeddings. Embedding models convert text into numerical vectors for semantic search, clustering, and retrieval-augmented generation (RAG). The models are small, fast, and produce output that is indistinguishable from cloud alternatives.
all-MiniLM-L6-v2: 384-dimensional output, approximately 80MB model file. Generates embeddings for a typical paragraph in under 10 milliseconds on any modern CPU. This model powers semantic search in thousands of applications and there is zero quality penalty for running it locally versus calling the OpenAI embeddings API.
BGE-small-en-v1.5: Also 384 dimensions, similar size, competitive with models several times larger on retrieval benchmarks. Developed by the Beijing Academy of Artificial Intelligence and widely used in open-source RAG pipelines.
nomic-embed-text: A newer entrant that achieves strong performance across both short queries and long documents. Its 768-dimensional output captures more semantic nuance at the cost of a slightly larger vector index.
These models run via ONNX Runtime, a cross-platform inference engine that achieves 2-5x speedups over native PyTorch for many model architectures. ONNX is particularly effective for embedding models because their computation graphs are relatively simple and benefit enormously from operator fusion and graph optimization.
The point cannot be overstated: there is no technical reason to send your text to a cloud API for embedding generation. The local models are just as good, orders of magnitude cheaper (free), and eliminate the privacy risk entirely.
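For a sense of what local semantic search actually involves: once the embedding model has produced vectors, retrieval is just a cosine-similarity scan (or an approximate-nearest-neighbor index for large corpora). A dependency-free sketch, using toy 4-dimensional vectors standing in for the 384-dimensional output of a model like all-MiniLM-L6-v2:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Brute-force nearest-neighbor search over an in-memory vector index."""
    ranked = sorted(index, key=lambda doc: cosine(query, index[doc]), reverse=True)
    return ranked[:k]

# Toy vectors; a real index would hold 384-d vectors from the embedding model.
index = {
    "grocery list":    [0.9, 0.1, 0.0, 0.1],
    "meeting notes":   [0.1, 0.9, 0.1, 0.0],
    "recipe: lasagna": [0.8, 0.0, 0.2, 0.1],
}
print(top_k([0.85, 0.05, 0.1, 0.1], index))  # → ['grocery list', 'recipe: lasagna']
```

A brute-force scan like this stays sub-millisecond up to tens of thousands of vectors; beyond that, approximate indexes (HNSW and similar) keep local retrieval fast.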
Speech-to-Text: Whisper Changes Everything
OpenAI’s Whisper models, released as open weights, made high-quality speech recognition a local-first capability.
whisper-base (74M parameters): Runs at real-time speed on an Apple M1 chip, meaning it transcribes one second of audio in roughly one second of compute. Accuracy is sufficient for dictation, voice notes, and conversational transcription in quiet environments. The model file is approximately 140MB.
whisper-small (244M parameters): A significant accuracy improvement over base, particularly for accented speech and noisy environments. Runs at roughly 2x real-time on an M1 (one second of audio takes about 0.5 seconds to process). Around 460MB.
whisper-large-v3 (1.5B parameters): Near-professional transcription quality across 99 languages. On an M2 Pro, it processes audio at roughly 3x real-time speed. The model is approximately 3GB. This is the quality ceiling for local speech-to-text, and it is remarkably high.
All Whisper variants can run through ONNX Runtime or via dedicated inference libraries like whisper.cpp (the same developer behind llama.cpp). On Apple Silicon, CoreML-optimized versions of Whisper leverage the Neural Engine — a dedicated 16-core ML accelerator that handles matrix operations while leaving the CPU and GPU free for other work.
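The real-time factors above translate directly into wall-clock estimates. A quick sketch, taking the quoted Apple Silicon speeds as given (actual times vary with hardware, audio quality, and batching):

```python
def transcribe_minutes(audio_minutes: float, speed_vs_realtime: float) -> float:
    """Wall-clock compute time for a recording, given speed relative to real time."""
    return audio_minutes / speed_vs_realtime

# A 60-minute meeting recording:
print(transcribe_minutes(60, 1.0))  # whisper-base on M1: ~60.0 minutes
print(transcribe_minutes(60, 3.0))  # whisper-large-v3 on M2 Pro: ~20.0 minutes
```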
The Apple Silicon Advantage
Apple’s Neural Engine deserves specific attention. Present in every Apple Silicon chip from M1 onward, it provides roughly 11 TOPS (trillion operations per second) on M1, 15.8 TOPS on M2, and up to 38 TOPS on M4. CoreML, Apple’s ML framework, automatically dispatches compatible operations to the Neural Engine, CPU, or GPU based on a cost model.
For embedding models and Whisper, CoreML-optimized versions can be 2-3x faster than generic ONNX inference because they exploit the Neural Engine’s fixed-function hardware for attention computations and matrix multiplications. This is dedicated silicon that draws minimal power compared to running the same operations on the GPU.
This hardware advantage is one reason local AI on Apple devices feels qualitatively different from local AI on many Windows machines. It is not just software optimization — there is purpose-built hardware accelerating the workload.
The Honest Comparison
Here is a side-by-side comparison across the dimensions that actually matter, with no marketing spin.
| Dimension | True Local | Cloud API | Hybrid (Done Right) |
|---|---|---|---|
| Privacy | Complete — data never leaves device | Provider sees all inputs/outputs | Private data stays local; only general queries reach cloud |
| Latency (first token) | Seconds for a cold model load; roughly 50-200ms once warm | 200-800ms (network + queue + inference) | Varies by routing decision |
| Throughput (tokens/sec) | 30-50 tok/s (7B on M2 Pro) | 50-100+ tok/s (frontier models) | Best of both depending on task |
| Output quality (7B local) | Good for focused tasks, weaker on complex reasoning | Frontier models (GPT-4o, Claude Opus) are significantly stronger | High-quality cloud for hard tasks, fast local for simple ones |
| Cost | Hardware only (one-time) | $0.002-$0.06 per 1K tokens, ongoing | Reduced cloud spend, hardware investment |
| Offline capability | Full functionality | None | Graceful degradation to local |
| Battery impact | Moderate to high during inference | Minimal (network request only) | Depends on local/cloud ratio |
| Setup complexity | Model downloads, hardware requirements | API key and HTTP client | More complex architecture |
The Quality Gap: Honest Assessment
Frontier cloud models — GPT-4o, Claude Opus, Gemini Ultra — are genuinely better than local 7B models at complex reasoning, nuanced writing, multi-step analysis, and tasks requiring broad world knowledge. This is not a controversial claim. A 7-billion parameter model running in 4-bit quantization on your laptop cannot match a model with hundreds of billions of parameters (or a mixture of experts with trillions of effective parameters) running on a data center full of H100 GPUs.
But the quality gap matters less than you might expect for most daily tasks. Consider what people actually use AI assistants for:
- Summarizing an email or document: A 7B model handles this well. The input constrains the output, so hallucination risk is low.
- Drafting a reply: Local models produce coherent, appropriate text for standard communication.
- Searching personal notes semantically: Embedding models are identical quality locally. This is purely a retrieval task.
- Transcribing a voice note: Whisper-large-v3 locally matches or exceeds most cloud transcription services.
- Answering factual questions about your own data: With RAG over local documents, the model’s job is primarily synthesis, not knowledge recall. A 7B model with good retrieval is often sufficient.
- Complex multi-step reasoning, creative writing, coding large systems: This is where frontier models still have a clear edge.
The practical conclusion: approximately 70-80% of typical AI assistant interactions can be handled by local models at acceptable quality. The remaining 20-30% genuinely benefit from frontier cloud models. A well-designed system routes accordingly.
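A router of the kind described can be sketched as a simple classifier over task categories. This is an illustrative heuristic, not any particular product’s implementation; the category names and the local-capable set are assumptions drawn from the task list above:

```python
from enum import Enum

class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"  # only ever taken with explicit user consent

# Task categories a local 7B model handles well (per the discussion above).
LOCAL_CAPABLE = {"summarize", "draft_reply", "semantic_search", "transcribe", "rag_qa"}

def route_task(category: str, user_consents_to_cloud: bool) -> Route:
    """Route local-capable tasks locally; escalate others only with consent."""
    if category in LOCAL_CAPABLE:
        return Route.LOCAL
    # Complex reasoning, long creative writing, etc.: cloud helps, but never silently.
    return Route.CLOUD if user_consents_to_cloud else Route.LOCAL

print(route_task("summarize", False))        # Route.LOCAL
print(route_task("creative_writing", True))  # Route.CLOUD
print(route_task("creative_writing", False)) # falls back to Route.LOCAL
```

The key design property is the default: absent consent, everything degrades to local rather than silently escalating to the cloud.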
The Trend Line
The gap is also closing. The trajectory of open-weight models over the past 18 months is striking:
- Llama 2 7B (July 2023) was noticeably worse than GPT-3.5 at most tasks.
- Llama 3 8B (April 2024) matched or exceeded GPT-3.5 on many benchmarks.
- Qwen 2.5 7B and Llama 3.1 8B (late 2024) approached early GPT-4 performance on specific task categories.
This improvement comes from three converging forces: better training data curation (quality over quantity), architectural innovations (grouped-query attention, sliding window attention, mixture of experts), and distillation techniques where smaller models learn from larger ones. Each generation of 7B models absorbs techniques that were exclusive to 100B+ models a year prior.
Quantization research is advancing in parallel. GPTQ, AWQ, and GGML/GGUF quantization methods have become sophisticated enough that 4-bit models retain 95-97% of their full-precision benchmark scores. Two years ago, 4-bit quantization caused significant quality degradation. Today, it is nearly transparent for most tasks.
How Morphee Approaches the Local-Cloud Boundary
Rather than picking a side in the local-versus-cloud debate, we built Morphee around a principle: local by default, cloud by explicit consent. The app is fully functional without any network connection. Cloud capabilities exist as opt-in enhancements, gated behind granular consent controls that explain exactly what data will be sent and to whom.
Local-First Processing
All core AI capabilities in Morphee run entirely on your device. Text embeddings, language model inference, speech-to-text, and semantic search all happen locally using the kinds of open models and inference techniques described earlier in this article. There is no phone-home behavior, no silent cloud calls, and no telemetry on your conversations.
When you save a note, have a conversation, or add any content to Morphee, it is immediately processed and indexed on your device. Semantic search — finding relevant context not by keyword matching but by meaning — runs against a local vector index with sub-millisecond latency.
On first launch, Morphee profiles the available hardware — CPU cores, available RAM, GPU capability — and recommends an appropriate model size and quantization level for your machine. A device with plenty of memory gets a higher-quality model; a more constrained machine gets a smaller, faster one. Either way, everything runs locally.
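A hardware-profiling step like this can be sketched as a lookup from available memory to a model and quantization tier. The tiers and thresholds below are illustrative assumptions, not Morphee’s actual logic:

```python
def recommend_model(ram_gb: float) -> tuple[str, str]:
    """Map available RAM to a (model size, quantization) recommendation.

    Rule of thumb: leave roughly half of RAM for the OS and applications,
    so the quantized model file must fit comfortably in the remainder.
    """
    if ram_gb >= 48:
        return ("70B", "Q4_K_M")   # ~40GB model file, high-end workstations
    if ram_gb >= 32:
        return ("8B", "Q8_0")      # headroom allows near-lossless quantization
    if ram_gb >= 16:
        return ("7B", "Q4_K_M")    # ~4.5GB file, the common case
    return ("3.8B", "Q4_K_M")      # small model for 8GB devices

print(recommend_model(16))  # ('7B', 'Q4_K_M')
```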
Memory You Can Inspect
Morphee’s memory system is designed around a philosophical position: your AI’s memory should be transparent, inspectable, and portable. You should be able to see exactly what the AI has learned, when it learned it, and edit or delete anything you choose. Your data is stored locally in human-readable formats, not locked inside opaque cloud databases.
Consent-Gated Cloud Fallback
When a task exceeds what the local model can handle well — complex multi-step reasoning, tasks requiring very recent world knowledge, or generating long-form creative content — Morphee can escalate to cloud APIs (Claude, GPT-4o, and others). But this escalation is never silent.
Before any data leaves the device, Morphee presents a consent dialog that shows exactly what will be sent, which provider will receive it, and what that provider’s data retention policy is. The user can approve, deny, or modify the request. This consent is granular: you can allow cloud processing for general knowledge queries while blocking it for anything involving personal or family data.
This is not a toggle buried in settings. It is an active, per-request decision point. The architecture enforces this — cloud providers cannot be called without explicit user consent scoped to the specific task at hand. Read more about our approach to privacy and family data and how we maintain GDPR compliance.
Why This Architecture Matters for Families
Morphee is designed for groups — families, classrooms, small teams. In these contexts, the local-versus-cloud question takes on additional weight.
A family’s AI assistant processes conversations about children’s homework, health questions, financial discussions, scheduling, and personal matters. This is exactly the category of data that should not transit through third-party servers by default. A child asking their AI assistant for help with math homework should not generate a training data point on a cloud provider’s servers.
Local-first architecture makes this the default behavior rather than a configuration option. The assistant works. The data stays home. If the family decides that cloud models would be helpful for certain tasks, they opt in with full visibility into what that means.
This is also why Morphee’s feature set focuses on the capabilities that work best locally: semantic search over personal knowledge, voice transcription, summarization, and conversational assistance grounded in your own data. These are the tasks where local models already match cloud quality, and where privacy matters most.
The Road Ahead
The trajectory of local AI capabilities points in one direction: more capability in less hardware. Quantization will continue improving. Architectures will become more efficient. Apple, Qualcomm, and Intel are all shipping dedicated ML accelerators in consumer silicon, increasing the TOPS available for on-device inference with each generation.
Within the next two to three years, a 7B-class model running on consumer hardware will likely match the quality of today’s frontier cloud models for most tasks. The question of local versus cloud will shift from “can local models handle this?” to “is there any reason to send this to the cloud at all?”
We are building Morphee for that future while making it useful today. The architecture is designed so that better local models can be adopted as they arrive — no infrastructure migration, no data exposure, just improved capability running on your own hardware.
Morphee is currently in early access. If you want an AI assistant that runs on your device, respects your family’s privacy, and gets better as local models improve, join the waitlist.
Morphee Team