Running 3.8B Models on 8GB RAM

The AI hype cycle is obsessed with one number: parameter count. GPT-4? Reportedly around 1 trillion. Llama 3.1? 405 billion. Phi-4? 14 billion. Meanwhile, Microsoft slipped out Phi-4-mini: 3.8 billion parameters, trained with synthetic data and RLHF, and competitive on some benchmarks with much larger models.

And it fits in your laptop.

I started the AEON project (an on-device AI companion for my phone) assuming I'd build it on GPT-4 API calls. Too expensive. Too slow. Too dependent on connectivity. Then Phi-4-mini appeared, and I realized local inference is actually viable in 2026.

Here's what I learned running a 3.8B model on 8GB RAM with Apple Silicon.

Why Local LLMs Matter

Three problems with API-based AI:

Cost: $0.03 per 1K input tokens (GPT-4). AEON runs constantly in the background. That's $50+/month per user at scale.
Latency: Network roundtrip + inference = 500ms–2s. Unacceptable for a real-time conversational companion.
Privacy: Your thoughts go to OpenAI's servers. Some users won't accept that.

Local inference solves all three: effectively free (the user's own hardware does the compute), instant (no network round trip), and private (data stays on-device).
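To see how an always-on companion reaches that kind of bill, here's a back-of-the-envelope sketch. The price of $0.03 per 1K tokens and the daily token volume are both assumptions for illustration; plug in current pricing and your own usage numbers.

```python
def monthly_api_cost(tokens_per_day: float,
                     usd_per_1k_tokens: float = 0.03,  # assumed price, not a quote
                     days: int = 30) -> float:
    """Estimated monthly API spend for a single always-on user."""
    return tokens_per_day / 1000 * usd_per_1k_tokens * days


# Hypothetical usage: a background companion that processes ~56K tokens a day.
cost = monthly_api_cost(56_000)
print(f"${cost:.2f} per user per month")
```

The local equivalent is whatever electricity and battery the device spends, which is why the per-request cost rounds to almost nothing.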

The tradeoff: the model has to be small enough to fit and fast enough to feel responsive. That's where Phi-4-mini wins.

The Phi-4-mini Spec

Released by Microsoft in early 2025:

Parameters: 3.8 billion (vs Llama 3.1 405B, Mistral 7B, GPT-4 ~1T)
Training: Synthetic data (MIT-licensed, no copyright concerns) + RLHF
Performance: Beats Llama 3.1 8B on most benchmarks
Context: 4,096 tokens in my setup (extendable to ~131K via LongRoPE scaling)
License: MIT, commercially usable

The trick: Microsoft trained it on high-quality synthetic data instead of internet scrapes. Fewer jailbreak vectors, better reasoning, smaller footprint. It's what Llama should have done.

Size matters. On M1:

Phi-4-mini FP16: ~7.6 GB RAM (squeezes under 8GB, with almost nothing left for the OS)
Phi-4-mini INT4 quantized: ~2.5 GB RAM in use (~1.9 GB of that is raw weights; headroom for context)
Llama 3.1 8B FP16: ~16 GB RAM (doesn't fit)
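These figures follow from simple arithmetic: parameter count times bits per weight. Here is a minimal sketch of that calculation. It counts weights only and uses decimal gigabytes; real quantized files also store per-group scale metadata, and the runtime adds KV cache and buffers, so measured usage will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


models = [
    ("Phi-4-mini FP16", 3.8e9, 16),
    ("Phi-4-mini INT4", 3.8e9, 4),
    ("Llama 3.1 8B FP16", 8.0e9, 16),
]

for name, params, bits in models:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```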

MLX: The Framework That Makes It Work

Apple Silicon's GPU shares unified memory with the CPU, so there are no CPU↔GPU copies, and that makes it insanely good for inference. But you need a framework that understands this architecture.

MLX is Apple's answer: a lightweight ML framework designed for Apple Silicon. No CUDA, no complicated setup.

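from mlx_lm import load, generate

# Load the 4-bit weights and tokenizer (downloaded from the Hub on first run).
# Repo name as used in this post; check mlx-community for the exact ID.
model, tokenizer = load("mlx-community/Phi-4-mini-4bit")

prompt = "What is the capital of France?"
# generate() handles tokenization, the token-by-token decode loop, and detokenization.
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)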

Install MLX:

pip install mlx mlx-lm

That's it. No GPU drivers, no CUDA setup, no PyTorch complexity.

Why MLX wins: It's a compact library tuned specifically for Apple Silicon. PyTorch on Mac runs through the MPS backend, but unsupported ops fall back to the CPU and performance suffers. MLX maps to the GPU's Metal primitives natively.

4-Bit Quantization: The Quality Trade-Off

Full-precision Phi-4-mini (FP32) is ~15 GB. Unacceptable. Solutions:

FP16 (half precision): ~7.6 GB, imperceptible quality loss, 2x smaller than FP32
INT8 (8-bit integer): ~3.8 GB, minor quality degradation, 4x smaller
INT4 (4-bit integer): ~1.9 GB, noticeable quality drop, 8x smaller

I tested all three on a coding question ("Write a Python function to find primes"):

Quantization	RAM	Tokens/sec	Quality
FP16	7.6 GB	45 tok/s	Perfect
INT8	3.8 GB	42 tok/s	Nearly identical
INT4	1.9 GB	38 tok/s	Acceptable, occasional garble

I'm shipping INT8 in AEON: good quality, good speed, and 3.8 GB leaves room for context window + system overhead.

How quantization works (simplified):

Original weights: 32-bit floats (e.g., 0.472618)
INT8: Map to 0–255, store metadata for rescaling
On inference: dequantize to FP16 on the fly and compute (the stored weights stay quantized)
Result: 4x smaller, minimal accuracy loss (Phi's training helps here)
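The steps above can be sketched in a few lines of plain Python. This is a simplified per-tensor min/max scheme for illustration only; production quantizers, MLX's included, typically work on small groups of weights with a scale and offset per group. The function names here are hypothetical.

```python
def quantize_int8(weights):
    """Map floats onto integers 0..255 and return the rescaling metadata."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo


def dequantize(q, scale, lo):
    """Recover approximate floats from the stored integers."""
    return [lo + v * scale for v in q]


weights = [0.472618, -0.31, 0.05, 0.9, -0.88]
q, scale, lo = quantize_int8(weights)
restored = dequantize(q, scale, lo)

# Rounding to the nearest step means each weight is off by at most half a step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"max round-trip error: {max_err:.5f} (half a step is {scale / 2:.5f})")
```

The error bound is the point: each weight moves by at most half a quantization step, so a tensor with a narrow value range loses very little.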

Why Phi-4-mini seems to tolerate quantization well: my guess is that cleaner synthetic training data produces more uniform weights with fewer outliers, and outliers are exactly what low-bit quantization struggles with.

The 7-Weapon Concept

Raw inference is just the baseline. To make a convincing AI companion, you need:

1. Base Model (Phi-4-mini): Core reasoning
2. LoRA Adapters: Fine-tune for personality without retraining (merge in seconds)
3. Retrieval-Augmented Generation (RAG): Ground responses in user context (phone files, notes)
4. Monte Carlo Tree Search (MCTS): 50 sampled reasoning paths, pick best (slow but accurate)
5. Self-Play Refinement: Companion debates itself, picks winner response
6. Speculative Decoding: Draft fast, verify with main model (2-3x speedup)
7. KV Cache Management: Sliding window to keep context while staying under RAM
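Weapon 2's "merge in seconds" follows from how LoRA works: the adapter is a pair of small matrices whose product is added to the frozen weight once, so W' = W + (alpha / rank) * B A. Here is a toy sketch in plain Python with made-up 2x3 numbers; a real merge would run over every adapted layer using a tensor library.

```python
def matmul(a, b):
    """Plain-Python matrix product, good enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, alpha, rank):
    """Fold a LoRA adapter into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(b, a)  # (out x rank) @ (rank x in) -> (out x in)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical base weight (2 outputs x 3 inputs) and a rank-1 adapter.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]       # out_features x rank
a = [[0.5, 0.0, -0.5]]   # rank x in_features

merged = merge_lora(w, a, b, alpha=2.0, rank=1)
print(merged)
```

After merging, inference costs the same as the base model, because the adapter no longer exists as a separate computation.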

Only weapon #1 is implemented so far. Weapons 2-7 are the roadmap.
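Speculative decoding (weapon 6) is the least intuitive item on the roadmap, so here is a toy sketch of the greedy variant. Both "models" are stand-in functions that map a token list to the next token; nothing here is AEON code. A real implementation verifies all drafted tokens in one batched forward pass, which is where the speedup comes from.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> next token (greedy)


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each position (one batched pass in a real system).
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i] + [expected])
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


# Toy models: the target counts upward mod 10; the draft is wrong after a 4.
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


result = speculative_decode(target, draft, [0], n_new=12)
print(result == greedy_decode(target, [0], 12))
```

The output always matches what the target model would have produced on its own; the draft only changes how many target calls it takes to get there.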

The real advantage of local inference: You control the whole stack. Add weapons one by one. No API rate limits, no vendor lock-in.

Local vs API: When to Use What

Metric	Local (Phi-4-mini)	API (GPT-4)
Latency	~100ms first token	~500ms–2s
Cost per request	~$0.00001	~$0.03
Context limit	4K–131K (tunable)	128K (but costs scale)
Privacy	100% on-device	Server-side (assumed deleted)
Reasoning quality	Good, single-pass	Excellent, multi-sample
Customization	Full control	Prompt engineering only

Use local for: Real-time conversational UX, privacy-sensitive data, always-on companions, custom personalities, offline-first apps.

Use API for: Complex reasoning, user trust (GPT-4 is better known), one-off queries, when latency is acceptable.

The Gotchas

Memory leaks on older MLX versions: KV cache wasn't freed after inference. Upgrade to mlx>=0.16.0.

Model download size: ~2 GB for INT4 (~8 GB for FP16), and the download can take 10 minutes on WiFi. Cache it locally.

No batching on M1: MLX doesn't efficiently batch multiple requests on unified memory. One conversation at a time.

Context window exhaustion: 4K context fills fast. Implement aggressive summarization or sliding-window attention (not yet in MLX, coming soon).
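Until something better is available, the simplest workaround is to trim the prompt yourself. Here is a minimal sketch that pins a prefix (say, the system prompt) and keeps only the most recent tokens. Note this is prompt-level truncation, not sliding-window attention inside the model, and the class name is hypothetical.

```python
from collections import deque
from typing import Iterable, List


class SlidingWindowContext:
    """Keep a pinned prefix plus as many recent tokens as the budget allows."""

    def __init__(self, max_tokens: int, pinned: List[int]):
        if len(pinned) >= max_tokens:
            raise ValueError("pinned prefix must be shorter than the window")
        self.pinned = list(pinned)
        # A bounded deque drops the oldest entries automatically when full.
        self.recent = deque(maxlen=max_tokens - len(pinned))

    def extend(self, tokens: Iterable[int]) -> None:
        self.recent.extend(tokens)

    def tokens(self) -> List[int]:
        return self.pinned + list(self.recent)


# Toy budget of 8 tokens with a 2-token pinned prefix.
ctx = SlidingWindowContext(max_tokens=8, pinned=[100, 101])
ctx.extend(range(10))
print(ctx.tokens())
```

Dropping tokens from the middle of a prompt usually invalidates any cached keys and values after the cut, so expect to re-run the prefill when the window slides.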

The toughest problem: Hallucinations. Quantized or not, Phi will confidently generate false information. Add retrieval (RAG) and self-verification (MCTS) to make it more trustworthy.
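Here is roughly what the retrieval half looks like, as a minimal sketch. It scores notes by bag-of-words cosine similarity so it runs with the standard library alone; a real system would more likely use an embedding model. The notes and function names are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def vectorize(text: str) -> Counter:
    """Lowercased word counts as a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, notes: List[str], k: int = 1) -> List[str]:
    q = vectorize(query)
    return sorted(notes, key=lambda n: cosine(q, vectorize(n)), reverse=True)[:k]


def build_prompt(query: str, notes: List[str], k: int = 1) -> str:
    context = "\n".join(f"- {n}" for n in retrieve(query, notes, k))
    return f"Answer using only these notes:\n{context}\n\nQuestion: {query}"


# Hypothetical on-device notes.
notes = [
    "Dentist appointment is on Friday at 3pm.",
    "Grocery list: eggs, rice, coffee.",
    "Flight to Lisbon departs Monday morning.",
]
print(build_prompt("When is my dentist appointment?", notes))
```

Grounding the prompt this way does not stop the model from making things up, but it gives it the right facts to copy from and gives you something to check the answer against.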

Is It Worth It?

For AEON, yes. I control the UX, the personality, the privacy guarantees. No API dependency. Users get instant, offline-capable AI.

For a startup MVP? Probably not yet. GPT-4 API is faster to ship, easier to debug, and the quality difference matters when your metric is user retention.

But in 6 months? Every mobile app will have an on-device LLM layer. The cost of API calls will become unacceptable. Phi-4-mini (or its successors) will be the default.

Start experimenting now. The era of shipping LLMs to the edge is here.