The AI hype cycle is obsessed with one number: parameter count. GPT-4? Reportedly around 1 trillion (OpenAI has never confirmed a figure). Llama 3.1? 405 billion. Phi-4? 14 billion. Meanwhile, Microsoft slipped out Phi-4-mini: 3.8 billion parameters, trained with synthetic data and preference tuning, competitive on some benchmarks with much larger models.
And it fits on your laptop.
I started the AEON project—an on-device AI companion for my phone—with the assumption I'd ship GPT-4 API calls. Too expensive. Too slow. Too dependent on connectivity. Then Phi-4-mini appeared, and I realized local inference is actually viable in 2026.
Here's what I learned running a 3.8B model on 8GB RAM with Apple Silicon.
Why Local LLMs Matter
Three problems with API-based AI:
- Cost: $0.03 per 1K input tokens (GPT-4's original list price). AEON runs constantly in the background. That adds up to $50+/month per user.
- Latency: Network round trip + inference = 500ms–2s. Unacceptable for a real-time conversational companion.
- Privacy: Your thoughts go to OpenAI's servers. Some users won't accept that.
Local inference addresses all three: no per-token fee (the compute cost is amortized into hardware the user already owns), no network round trip, and no data leaving the device.
The tradeoff: the model has to be small enough to fit and fast enough to feel responsive. That's where Phi-4-mini wins.
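As a back-of-the-envelope check on the cost bullet, here is a minimal sketch of the monthly bill for an always-on companion. The price of $0.03 per 1K tokens and the daily token volume are assumptions chosen for illustration, not measured AEON numbers.

```python
def monthly_api_cost(tokens_per_day: float, usd_per_1k_tokens: float, days: int = 30) -> float:
    """Estimated monthly API spend in USD for a single user."""
    return tokens_per_day / 1000 * usd_per_1k_tokens * days


# Assumed workload: a background companion processing ~60K tokens per day
# (hypothetical figure), billed at an assumed $0.03 per 1K tokens.
cost = monthly_api_cost(tokens_per_day=60_000, usd_per_1k_tokens=0.03)
print(f"${cost:.2f} per user per month")
```

Even a modest background workload lands in the tens of dollars per user, and output tokens (typically priced higher) would push it further.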
The Phi-4-mini Spec
Released by Microsoft in early 2025:
- Parameters: 3.8 billion (vs Llama 3.1 405B, Mistral 7B, GPT-4 ~1T)
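- Training: A mix of synthetic and filtered public web data, followed by supervised fine-tuning and preference optimization
- Performance: Beats Llama 3.1 8B on many of the benchmarks Microsoft reports
- Context: 4,096 tokens, expandable to 131K via RoPE scaling (LongRoPE, not ALiBi)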
- License: MIT, commercially usable
The trick: Microsoft leaned on high-quality synthetic data rather than raw internet scrapes alone. Fewer jailbreak vectors, better reasoning, smaller footprint. It's what Llama should have done.
Size matters. On an 8GB M1:
- Phi-4-mini FP16: ~7.6 GB RAM (technically loads in 8GB, but leaves almost nothing for the OS or the KV cache)
- Phi-4-mini INT4 quantized: ~2.5 GB RAM in use, of which ~1.9 GB is weights (headroom for context)
- Llama 3.1 8B FP16: ~16 GB RAM (doesn't fit)
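The figures above follow from simple arithmetic: parameter count times bytes per weight. The sketch below reproduces them for the weights alone. It deliberately ignores the KV cache, activations, and quantization metadata, which is presumably why the in-use INT4 figure is higher than the raw weight size.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal GB.

    Ignores the KV cache, activations, and quantization metadata
    (scales/offsets), so real usage will be somewhat higher.
    """
    return n_params * bits_per_weight / 8 / 1e9


PHI_4_MINI = 3.8e9    # parameters
LLAMA_3_1_8B = 8e9    # parameters

for name, params, bits in [
    ("Phi-4-mini FP16", PHI_4_MINI, 16),
    ("Phi-4-mini INT4", PHI_4_MINI, 4),
    ("Llama 3.1 8B FP16", LLAMA_3_1_8B, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```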
MLX: The Framework That Makes It Work
Apple Silicon's unified memory means the CPU and GPU share one pool of RAM, so there are no CPU↔GPU copies to pay for. That makes it insanely good for inference. But you need a framework that understands this architecture.
MLX is Apple's answer: a lightweight ML framework designed for Apple Silicon. No CUDA, no complicated setup.
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Phi-4-mini-4bit")  # fetches the 4-bit weights on first run, then uses the local cache
prompt = "What is the capital of France?"
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)  # tokenizes, runs the decoding loop, and detokenizes
print(response)
Install MLX:
pip install mlx mlx-lm
That's it. No GPU drivers, no CUDA setup, no PyTorch complexity.
4-Bit Quantization: The Quality Trade-Off
Full-precision Phi-4-mini (FP32) is ~15 GB. Unacceptable. Solutions:
- FP16 (half precision): ~7.6 GB, imperceptible quality loss, 2x smaller than FP32
- INT8 (8-bit integer): ~3.8 GB, minor quality degradation, 4x smaller than FP32
- INT4 (4-bit integer): ~1.9 GB, noticeable quality drop, 8x smaller than FP32
I tested all three on a coding question ("Write a Python function to find primes"):
| Quantization | RAM | Tokens/sec | Quality |
|---|---|---|---|
| FP16 | 7.6 GB | 45 tok/s | Perfect |
| INT8 | 3.8 GB | 42 tok/s | Nearly identical |
| INT4 | 1.9 GB | 38 tok/s | Acceptable, occasional garble |
I'm shipping INT8 in AEON: good quality, good speed, and 3.8 GB leaves room for context window + system overhead.
How quantization works (simplified):
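- Original weights: 32-bit floats (e.g., 0.472618)
- INT8: Map each group of weights to integers in 0–255, and store a scale and offset per group for rescaling
- On inference: dequantize to FP16 on the fly and compute; the stored INT8 weights are never rewritten
- Result: 4x smaller than FP32, minimal accuracy loss (Phi's training helps here)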
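The steps above can be sketched in a few lines of plain Python. This is a minimal min/max affine scheme for a single group of weights, written for illustration; it is not MLX's implementation, which packs the integers and picks its own group sizes. The function names are hypothetical.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float, float]:
    """Affine-quantize one group of weights to unsigned 8-bit integers.

    Returns the integer codes plus the (scale, offset) metadata needed
    to map them back to floats.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: list[int], scale: float, offset: float) -> list[float]:
    """Reconstruct approximate float weights from codes and metadata."""
    return [c * scale + offset for c in codes]


weights = [0.472618, -0.113, 0.25, -0.4, 0.0031, 0.39]
codes, scale, offset = quantize_group(weights)
restored = dequantize_group(codes, scale, offset)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max error: {max_error:.6f} (half step: {scale / 2:.6f})")
```

The smaller the range within a group, the finer the step size, which is why a single outlier weight in a group costs precision for every other weight in it.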
My working theory for why Phi-4-mini tolerates quantization well: cleaner synthetic training data seems to produce more uniform weights with fewer outliers, and outliers are what low-bit quantization handles worst.
The 7-Weapon Concept
Raw inference is just the baseline. To make a convincing AI companion, you need:
- 1. Base Model (Phi-4-mini): Core reasoning
- 2. LoRA Adapters: Fine-tune for personality without retraining (merge in seconds)
- 3. Retrieval-Augmented Generation (RAG): Ground responses in user context (phone files, notes)
- 4. Monte Carlo Tree Search (MCTS): 50 sampled reasoning paths, pick best (slow but accurate)
- 5. Self-Play Refinement: Companion debates itself, picks winner response
- 6. Speculative Decoding: Draft fast, verify with main model (2-3x speedup)
- 7. KV Cache Management: Sliding window to keep context while staying under RAM
So far, only weapon #1 is implemented. Weapons 2-7 are the roadmap.
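To make weapon #6 concrete, here is a toy sketch of greedy speculative decoding. The "models" are stand-in functions over integer tokens (hypothetical, for illustration only), and the verifier is called once per position for clarity. A real implementation scores all drafted positions in a single batched forward pass of the main model, which is where the speedup comes from.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep what the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the proposal token by token.
        for tok in proposal:
            expected = target(out)
            if expected != tok:
                out.append(expected)  # first disagreement: take the target's token
                break                 # and discard the rest of the draft
            out.append(tok)
    return out


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference: plain one-token-at-a-time greedy decoding."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Toy stand-ins: the draft agrees with the target except after multiples of 4.
def target_model(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 11


def draft_model(tokens: List[int]) -> int:
    guess = (tokens[-1] * 3 + 1) % 11
    return guess if tokens[-1] % 4 else (guess + 1) % 11


fast = speculative_decode(target_model, draft_model, [2], max_new=12)
slow = greedy_decode(target_model, [2], max_new=12)
print(fast == slow)
```

The property worth noticing is that the output is identical to what the main model would have produced on its own. The draft model only changes how quickly you get there, not what you get.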
Local vs API: When to Use What
| Metric | Local (Phi-4-mini) | API (GPT-4) |
|---|---|---|
| Latency | ~100ms first token | ~500ms–2s |
| Cost per request | ~$0.00001 | ~$0.03 |
| Context limit | 4K–131K (tunable) | 128K (but costs scale) |
| Privacy | 100% on-device | Server-side (retention depends on the provider's policy) |
| Reasoning quality | Good, single-pass | Excellent, multi-sample |
| Customization | Full control | Prompt engineering only |
Use local for: Real-time conversational UX, privacy-sensitive data, always-on companions, custom personalities, offline-first apps.
Use API for: Complex reasoning, user trust (GPT-4 is better known), one-off queries, when latency is acceptable.
The Gotchas
Memory leaks on older MLX versions: KV cache wasn't freed after inference. Upgrade to mlx>=0.16.0.
Model download size: the FP16 weights are a ~8 GB download, which takes about 10 minutes on WiFi; the INT4 build is closer to 2 GB. Cache it locally.
No batching on M1: MLX doesn't efficiently batch multiple requests on unified memory. One conversation at a time.
Context window exhaustion: 4K context fills fast. Implement aggressive summarization or a sliding window over the KV cache (recent mlx-lm releases include a rotating KV cache; check your version).
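One simple way to stay inside a 4K window is to trim the conversation at the application layer before it reaches the model. The sketch below keeps the newest messages that fit a token budget. It is message-level trimming, not sliding-window attention inside the model, and the word-count tokenizer is a crude stand-in; with a real model you would count with something like `len(tokenizer.encode(s))`.

```python
from typing import Callable, List


def trim_history(messages: List[str], budget: int,
                 count_tokens: Callable[[str], int]) -> List[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                        # this and all older messages are dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order


def word_count(s: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(s.split())


history = [
    "hi there",
    "tell me about primes",
    "a prime has exactly two divisors",
    "thanks",
]
trimmed = trim_history(history, budget=8, count_tokens=word_count)
print(trimmed)
```

In practice the budget should be the context size minus whatever you reserve for the system prompt and the reply, and dropped messages are a natural input for the summarization step.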
Is It Worth It?
For AEON, yes. I control the UX, the personality, the privacy guarantees. No API dependency. Users get instant, offline-capable AI.
For a startup MVP? Probably not yet. The GPT-4 API is faster to ship with, easier to debug, and the quality difference matters when your metric is user retention.
But in 6 months? Every mobile app will have an on-device LLM layer. The cost of API calls will become unacceptable. Phi-4-mini (or its successors) will be the default.
Start experimenting now. The era of shipping LLMs to the edge is here.