Redis creator antirez (Salvatore Sanfilippo) shipped ds4 (DwarfStar 4) in a week. Roughly a thousand lines of plain C. It is the first time DeepSeek V4 Flash truly runs on a Mac. In under three weeks the repo has 11,500+ stars and 30 contributors. The hardware bar is just as real: 96GB of unified memory is the floor; 128GB is the comfortable point. That maps to a Mac Studio starting around $3,000 and topping out beyond $10,000. This piece does not repeat the README. It answers three questions: why ds4 is not just another llama.cpp wrapper; why Apple Silicon UMA forces Metal to be the primary backend; and how to be running ds4 today without buying a top-spec Mac, by renting a high-memory Mac node.
ds4 went public on 2026-05-06. In under three weeks it accumulated 11,500+ stars and 30 contributors, all in pure C under MIT. Almost no one tracking local inference missed the news. Far fewer people actually compiled it, pulled the GGUF, and ran the server. The reason is plain: ds4's hardware bar rules out most everyday Macs. The six symptoms below are what every aspiring user runs into.
Stock MacBook Pro RAM is too small. 14"/16" MacBooks ship with 16/24/36GB. Even the 81GB q2 weights will not fit, let alone activations and KV.
Upgrading RAM is not cheap. Going from 64GB to 96GB or 128GB requires a top-spec M3/M4/M5 Max. The delta runs into thousands of dollars.
Mac Studio is not a casual purchase. A 128GB Mac Studio starts around $3,000. Pushing toward 512GB for a V4 Pro experiment costs well over $10,000. That is hard to justify for one experiment.
Windows or Linux workstations take a detour. Consumer GPUs with 24/32GB VRAM cannot hold ds4's working set. DGX Spark-class boxes carry their own price and operational cost.
Sharing one high-spec Mac across a team is painful. A single ds4-server eats most of the memory. Multiple users mean queueing and cross-session contamination.
You may swap models in six months. ds4 is self-described as alpha. DeepSeek V4 Flash is a preview. Buying a $10,000 Mac for one preview model is a real depreciation risk.
The conclusion writes itself: the software is ready; the hardware has not caught up. ds4 moved "DeepSeek V4 Flash on a Mac" from impossible to possible. The distance between possible and accessible is exactly one Mac Studio invoice away.
Knowing what ds4 is not matters as much as knowing what it is. The README is blunt: not a generic GGUF runner, not a wrapper, not a framework. It does exactly one thing — run DeepSeek V4 Flash on Metal and CUDA — and pushes that one thing to the limit. The table below places it next to the local inference tools you already know.
| Tool | Model coverage | Best fit | Key limit |
|---|---|---|---|
| ds4 (DwarfStar 4) | DeepSeek V4 Flash only | Maximum speed on a Mac for V4 Flash, paired with a coding agent for the long run | Single-model, alpha quality, 96–128GB RAM |
| llama.cpp | Most GGUF families | Trying a new model every week; broad portability | No V4 Flash-specific path or persistent KV like ds4 |
| Ollama | Mainstream GGUFs, one-line pull | Team-shared local models behind a clean API | Middle ground on speed and control; long context is uneven |
| vLLM / SGLang | Most HuggingFace weights | Cloud multi-GPU serving, shared team endpoint | Not built for a single Mac |
| Cloud API (e.g. DeepSeek) | Full-precision V4 Flash / Pro | Forget about hardware; aim for top quality | Data leaves your box; per-token billing on long sessions |
ds4 has three real engineering choices. First, a model-specific graph executor built around V4 Flash's tensor layout, tokenizer, and MoE routing — faster than a generic runner. Second, asymmetric 2-bit quantization: aggressive low precision on layers that tolerate it (the routed MoE experts use IQ2_XXS for gate and Q2_K for down), with higher precision kept on the critical path. The result is an 81GB q2 that fits 128GB UMA and still calls tools reliably. Third, an on-disk KV cache keyed by the SHA1 of token IDs. It survives session switches and server restarts, so the expensive 25k-token first prefill is paid once.
ds4 turns "V4 Flash on a Mac" from a slogan into an engineering path: do nothing else, and push this one path to what Apple Silicon and CUDA can deliver.
The ds4 backend order is deliberate. Metal first. CUDA next, with special attention to DGX Spark and GB10. ROCm only on a separate branch. CPU is correctness-only. That order is directly tied to Apple Silicon's Unified Memory Architecture (UMA).
On a Mac, CPU and GPU share the same physical memory. Loading an 81GB q2 GGUF does not require a host-to-device copy. Tensors are read directly by the GPU. Activations, KV state, and tokenizer buffers all sit in the same address space. Metal kernels touch them in place. For ds4 — a MoE engine that hits a large sparse pool of expert weights on every token — eliminating that copy directly lowers the inference latency floor.
Discrete GPU paths cannot match this. A 32GB consumer card will not hold the working set at all. An 80GB H100 means a data-center chassis with the cooling to match — not something you put on a desk. That is why antirez places Metal first and concentrates CUDA work on DGX Spark and GB10, the NVIDIA platforms that also expose unified memory. The goal is not yet another inference framework. It is to squeeze the only consumer hardware form factor where a GPU can directly touch huge memory.
# On an Apple Silicon Mac (96/128GB UMA): build and start ds4 from scratch
git clone https://github.com/antirez/ds4.git
cd ds4
make # Metal backend by default
# Download the V4 Flash q2-imatrix GGUF (~81GB into ./gguf/)
./download_model.sh q2-imatrix
# Start the server: 100k context + 8GB on-disk KV cache
./ds4-server --ctx 100000 \
--kv-disk-dir /tmp/ds4-kv \
--kv-disk-space-mb 8192
# Listens on http://127.0.0.1:8000/v1/chat/completions (OpenAI-compatible)
Once it is up, point Claude Code, Cursor, or opencode at http://127.0.0.1:8000/v1 and you have a fully offline V4 Flash endpoint. The permission boundary stays on the machine.
Restate the memory math before you buy or rent. The q2 GGUF lands on disk at roughly 81GB. Loaded weights plus activations, tokenizer state, and Metal buffers leave 96GB UMA as the reported floor — some users push context to 250k there. 128GB is the level antirez actually recommends. Pushing context toward 1M tokens (the V4 series ceiling) costs about 22GB for the indexer alone, roughly 26GB total. That squeezes 128GB hard. A 100–300k context window is the realistic sweet spot on 128GB.
| Form factor | Unified memory | Can it run ds4 (V4 Flash q2)? | Practical context size |
|---|---|---|---|
| MacBook Pro stock (16–36GB) | 16 / 24 / 36GB | No. Weights do not fit. | — |
| MacBook Pro mid (48–64GB) | 48 / 64GB | No. Weights consume everything. | — |
| MacBook Pro M3/M4/M5 Max 96GB | 96GB | Just barely. Close other heavy apps. | Community reports up to ~250k |
| Mac Studio / MacBook Pro 128GB | 128GB | Comfortable. Room for editor and agent. | 100–300k is stable |
| Mac Studio M3 Ultra 256GB+ | 256GB+ | Plenty. Long sessions, persistent KV. | Can approach 1M tokens |
| Mac Studio M3 Ultra 512GB (V4 Pro try) | 512GB | Not yet — ds4 only targets Flash. | — |
Tip: the on-disk KV cache earns its keep here. Point --kv-disk-dir at the Mac's internal SSD. Session switches, server restarts, and even next-day reuse skip thousands of tokens of prefill. This is the deepest user-facing difference versus a generic inference server.
Caution: the README states clearly that current macOS versions crash the kernel on the CPU path. Use Metal, do not build with make cpu on macOS. That is also why ds4's roadmap has no CPU fallback on Apple Silicon.
The five numbers below come from the ds4 README, the DeepSeek-V4-Flash model card on Hugging Face, and community reports. Together they answer one question: how short is my Mac, exactly?
Translate the numbers into a decision. Buying a top-spec Mac Studio is workable but expensive: it locks $6,000–$10,000 onto an alpha engine and a preview model. Cloud API delivers full precision, but data leaves your machine and long sessions get charged per token on every prefill, with the agent and permission boundary out of your hands. For developers who want the real local-inference feel of ds4 plus V4 Flash, yet do not want to bet their budget on a Mac whose resale value may erode, NodeMini's Mac Mini cloud rental is usually the better answer: SSH-in, ready to run, stop when done, data stays in your dedicated instance. Specs and pricing on the rental rates page; billing detail on SLA and commitment.
The order below is the minimal path from "no top-spec Mac" to "OpenAI-compatible V4 Flash endpoint on my desk". Each step maps to a constraint discussed above. End to end, under two hours.
Pick the spec from 128GB up. 2-bit weights with ~100k context need 128GB to be comfortable. For close to 1M context, jump to 256GB+. Do not save money on 96GB and then watch IDE, agent and browser fight for RAM.
Provision a NodeMini high-memory Mac node. Pick memory, region, and term on the order page. Seconds-level provisioning. An SSH key pair lands in your inbox. Connect with ssh user@host.
Clone, install deps, build. git clone https://github.com/antirez/ds4.git && cd ds4 && make. Metal by default on Apple Silicon. Do not try make cpu on macOS — the README is explicit about the kernel crash risk.
Pull the q2-imatrix GGUF and wire up the disk KV cache. Use the bundled download_model.sh for q2 / q2-imatrix / q4. Point --kv-disk-dir at a fixed local SSD path. Set --kv-disk-space-mb to 8–32GB so the disk KV really does the work.
Wire ds4-server into your coding agent. Start ./ds4-server --ctx 200000 --kv-disk-dir ... --kv-disk-space-mb 16384. Point Claude Code, Cursor, or opencode at http://127.0.0.1:8000/v1 via SSH port forwarding. Never expose the port publicly. OpenAI and Anthropic tool protocols are natively supported.
Lock in the access topology. SSH keys plus a private tunnel such as Tailscale turn the node into a zero-trust private endpoint. Stop the node when idle to stop billing. For always-on use, ship a launchd unit that starts ds4-server at boot, paired with the persistent KV cache.
Compare this to buying a Mac Studio. The local-purchase path has three real downsides: depreciation is glued to an alpha engine and a preview model; a long-running ds4 process competes with daily work for RAM; sharing one high-spec Mac across a team turns into a queue. For developers who want ds4 plus V4 Flash as part of daily work, while keeping depreciation risk on demand, NodeMini's Mac Mini cloud rental is usually the better answer. The fit aligns with the three-year TCO comparison and 24/7 cloud Mac automation. Access details on the help center.
Not today. ds4 is a DeepSeek V4 Flash-specific engine. Flash has 284B total and 13B activated parameters. Pro is 1.6T total and 49B activated. Quantized Pro still exceeds current Mac unified memory. For Pro, cloud vLLM or SGLang remains the realistic path.
96GB is the documented floor. Community reports show 2-bit quants running on 96GB Macs, sometimes at 250k context. For comfortable daily use with an editor and agent alongside, 128GB is what antirez actually recommends. Pushing context toward 1M tokens adds about 26GB more. The safe pick is a 256GB+ node — see the rental rates page.
Rent a high-memory Mac node from NodeMini. SSH in, git clone, make, download the GGUF, and start ./ds4-server — under two hours end to end. Access details are in the help center; for keeping an always-on agent paired with the node, see 24/7 cloud Mac automation.