Can ds4 run DeepSeek V4 Pro?

Not today. ds4 is a DeepSeek V4 Flash-specific engine. Flash has 284B total parameters and 13B activated; Pro has 1.6T total and 49B activated. Quantized Pro still exceeds what current Mac unified memory can hold. For Pro, cloud vLLM or SGLang remains the realistic path.

Does 96GB really work, or do I need 128GB?

96GB is the documented floor. Community reports show 2-bit quants running on 96GB Macs, sometimes at 250k context. For comfortable everyday use with an editor and an agent on the side, 128GB is the level antirez actually recommends. Pushing context closer to 1M tokens adds about 26GB more for the indexer alone.

I do not own a top-spec Mac. What is the fastest way to try ds4?

Renting a high-memory Mac node from NodeMini is the most direct route. SSH in, clone the repo, run make, download the GGUF, and start ds4-server. The full path takes under two hours. Specs and pricing are on the NodeMini rental rates page; access details are in the help center.

antirez ds4 brings DeepSeek V4 Flash to local Mac
the 96GB UMA wall, on-disk KV cache, and a remote Mac shortcut

Redis creator antirez (Salvatore Sanfilippo) shipped ds4 (DwarfStar 4) in a week. Roughly a thousand lines of plain C. It is the first time DeepSeek V4 Flash truly runs on a Mac. In under three weeks the repo has 11,500+ stars and 30 contributors. The hardware bar is just as real: 96GB of unified memory is the floor; 128GB is the comfortable point. That maps to a Mac Studio starting around $3,000 and topping out beyond $10,000. This piece does not repeat the README. It answers three questions: why ds4 is not just another llama.cpp wrapper; why Apple Silicon UMA forces Metal to be the primary backend; and how to be running ds4 today without buying a top-spec Mac, by renting a high-memory Mac node.

11.5k stars in three weeks, and a hardware wall worth tens of thousands of dollars

ds4 went public on 2026-05-06. In under three weeks it accumulated 11,500+ stars and 30 contributors, all in pure C under MIT. Almost no one tracking local inference missed the news. Far fewer people actually compiled it, pulled the GGUF, and ran the server. The reason is plain: ds4's hardware bar rules out most everyday Macs. The six symptoms below are what every aspiring user runs into.

01
Stock MacBook Pro RAM is too small. 14"/16" MacBooks ship with 16/24/36GB. Even the 81GB q2 weights will not fit, let alone activations and KV.
02
Upgrading RAM is not cheap. Going from 64GB to 96GB or 128GB requires a top-spec M3/M4/M5 Max. The delta runs into thousands of dollars.
03
Mac Studio is not a casual purchase. A 128GB Mac Studio starts around $3,000. Pushing toward 512GB for a V4 Pro experiment costs well over $10,000. That is hard to justify for one experiment.
04
Windows or Linux workstations take a detour. Consumer GPUs with 24/32GB VRAM cannot hold ds4's working set. DGX Spark-class boxes carry their own price and operational cost.
05
Sharing one high-spec Mac across a team is painful. A single ds4-server eats most of the memory. Multiple users mean queueing and cross-session contamination.
06
You may swap models in six months. ds4 is self-described as alpha. DeepSeek V4 Flash is a preview. Buying a $10,000 Mac for one preview model is a real depreciation risk.

The conclusion writes itself: the software is ready; the hardware has not caught up. ds4 moved "DeepSeek V4 Flash on a Mac" from impossible to possible. The distance between possible and accessible is exactly one Mac Studio invoice away.

ds4 is not another llama.cpp: model-specific design, asymmetric 2-bit quant, on-disk KV cache

Knowing what ds4 is not matters as much as knowing what it is. The README is blunt: not a generic GGUF runner, not a wrapper, not a framework. It does exactly one thing — run DeepSeek V4 Flash on Metal and CUDA — and pushes that one thing to the limit. The table below places it next to the local inference tools you already know.

Tool	Model coverage	Best fit	Key limit
ds4 (DwarfStar 4)	DeepSeek V4 Flash only	Maximum speed on a Mac for V4 Flash, paired with a coding agent for the long run	Single-model, alpha quality, 96–128GB RAM
llama.cpp	Most GGUF families	Trying a new model every week; broad portability	No V4 Flash-specific path or persistent KV like ds4
Ollama	Mainstream GGUFs, one-line pull	Team-shared local models behind a clean API	Middle ground on speed and control; long context is uneven
vLLM / SGLang	Most HuggingFace weights	Cloud multi-GPU serving, shared team endpoint	Not built for a single Mac
Cloud API (e.g. DeepSeek)	Full-precision V4 Flash / Pro	Forget about hardware; aim for top quality	Data leaves your box; per-token billing on long sessions

ds4 has three real engineering choices. First, a model-specific graph executor built around V4 Flash's tensor layout, tokenizer, and MoE routing — faster than a generic runner. Second, asymmetric 2-bit quantization: aggressive low precision on layers that tolerate it (the routed MoE experts use IQ2_XXS for gate and Q2_K for down), with higher precision kept on the critical path. The result is an 81GB q2 that fits 128GB UMA and still calls tools reliably. Third, an on-disk KV cache keyed by the SHA1 of token IDs. It survives session switches and server restarts, so the expensive 25k-token first prefill is paid once.

ds4 turns "V4 Flash on a Mac" from a slogan into an engineering path: do nothing else, and push this one path to what Apple Silicon and CUDA can deliver.

Why Metal is the primary backend: Apple Silicon UMA is the unfair advantage

The ds4 backend order is deliberate. Metal first. CUDA next, with special attention to DGX Spark and GB10. ROCm only on a separate branch. CPU is correctness-only. That order is directly tied to Apple Silicon's Unified Memory Architecture (UMA).

On a Mac, CPU and GPU share the same physical memory. Loading an 81GB q2 GGUF does not require a host-to-device copy. Tensors are read directly by the GPU. Activations, KV state, and tokenizer buffers all sit in the same address space. Metal kernels touch them in place. For ds4 — a MoE engine that hits a large sparse pool of expert weights on every token — eliminating that copy directly lowers the inference latency floor.

Discrete GPU paths cannot match this. A 32GB consumer card will not hold the working set at all. An 80GB H100 means a data-center chassis with the cooling to match — not something you put on a desk. That is why antirez places Metal first and concentrates CUDA work on DGX Spark and GB10, the NVIDIA platforms that also expose unified memory. The goal is not yet another inference framework. It is to squeeze the only consumer hardware form factor where a GPU can directly touch huge memory.

bash

# On an Apple Silicon Mac (96/128GB UMA): build and start ds4 from scratch
git clone https://github.com/antirez/ds4.git
cd ds4
make                        # Metal backend by default

# Download the V4 Flash q2-imatrix GGUF (~81GB into ./gguf/)
./download_model.sh q2-imatrix

# Start the server: 100k context + 8GB on-disk KV cache
./ds4-server --ctx 100000 \
             --kv-disk-dir /tmp/ds4-kv \
             --kv-disk-space-mb 8192
# Listens on http://127.0.0.1:8000/v1/chat/completions (OpenAI-compatible)

Once it is up, point Claude Code, Cursor, or opencode at http://127.0.0.1:8000/v1 and you have a fully offline V4 Flash endpoint. The permission boundary stays on the machine.

The memory bill: 96GB is the floor, 128GB is comfortable, 1M context costs 26GB more

Restate the memory math before you buy or rent. The q2 GGUF lands on disk at roughly 81GB. Loaded weights plus activations, tokenizer state, and Metal buffers leave 96GB UMA as the reported floor — some users push context to 250k there. 128GB is the level antirez actually recommends. Pushing context toward 1M tokens (the V4 series ceiling) costs about 22GB for the indexer alone, roughly 26GB total. That squeezes 128GB hard. A 100–300k context window is the realistic sweet spot on 128GB.

Form factor	Unified memory	Can it run ds4 (V4 Flash q2)?	Practical context size
MacBook Pro stock (16–36GB)	16 / 24 / 36GB	No. Weights do not fit.	—
MacBook Pro mid (48–64GB)	48 / 64GB	No. Weights consume everything.	—
MacBook Pro M3/M4/M5 Max 96GB	96GB	Just barely. Close other heavy apps.	Community reports up to ~250k
Mac Studio / MacBook Pro 128GB	128GB	Comfortable. Room for editor and agent.	100–300k is stable
Mac Studio M3 Ultra 256GB+	256GB+	Plenty. Long sessions, persistent KV.	Can approach 1M tokens
Mac Studio M3 Ultra 512GB (V4 Pro try)	512GB	Not yet — ds4 only targets Flash.	—

info

Tip: the on-disk KV cache earns its keep here. Point --kv-disk-dir at the Mac's internal SSD. Session switches, server restarts, and even next-day reuse skip thousands of tokens of prefill. This is the deepest user-facing difference versus a generic inference server.

warning

Caution: the README states clearly that current macOS versions crash the kernel on the CPU path. Use Metal, do not build with make cpu on macOS. That is also why ds4's roadmap has no CPU fallback on Apple Silicon.

Hard numbers: parameters, quant size, and the hardware wall

The five numbers below come from the ds4 README, the DeepSeek-V4-Flash model card on Hugging Face, and community reports. Together they answer one question: how short is my Mac, exactly?

Datum 1 · Model size: DeepSeek-V4-Flash has 284B total parameters and 13B activated, with native 1M-token context. V4-Pro is 1.6T total and 49B activated. ds4 only targets Flash today. Pro stays on cloud vLLM / SGLang for now.
Datum 2 · Quant footprint: the recommended q2-imatrix GGUF is around 81GB on disk. The trick is asymmetric distribution: routed MoE experts use IQ2_XXS for gate and Q2_K for down, while critical layers stay higher precision. The net effect: 96–128GB UMA can hold it and still call tools reliably.
Datum 3 · Memory budget: 1M-token context needs roughly 26GB extra (about 22GB for the indexer alone). Inside 128GB you also feed weights, KV, OS and apps. 100–300k tokens is the practical comfort zone on 128GB.
Datum 4 · Hardware cost: a Mac that can run ds4 well: 96GB MacBook Pro M3/M4/M5 Max from ~$3,500; 128GB Mac Studio from ~$3,000; 256GB Mac Studio Ultra from ~$6,000; 512GB Mac Studio M3 Ultra top spec ~$10,000+. That is the cost of "I want to try a new model".
Datum 5 · Project state: created 2026-05-06, last push 2026-05-24. 11,593 stars, 30 contributors, pure C, MIT. The author labels the code alpha. Interfaces and weight formats may still move in the coming months, so resale value of a $10,000 Mac bought for this stack is far from guaranteed.

Translate the numbers into a decision. Buying a top-spec Mac Studio is workable but expensive: it locks $6,000–$10,000 onto an alpha engine and a preview model. Cloud API delivers full precision, but data leaves your machine and long sessions get charged per token on every prefill, with the agent and permission boundary out of your hands. For developers who want the real local-inference feel of ds4 plus V4 Flash, yet do not want to bet their budget on a Mac whose resale value may erode, NodeMini's Mac Mini cloud rental is usually the better answer: SSH-in, ready to run, stop when done, data stays in your dedicated instance. Specs and pricing on the rental rates page; billing detail on SLA and commitment.

Six steps to run ds4 on a remote high-memory Mac, today

The order below is the minimal path from "no top-spec Mac" to "OpenAI-compatible V4 Flash endpoint on my desk". Each step maps to a constraint discussed above. End to end, under two hours.

01
Pick the spec from 128GB up. 2-bit weights with ~100k context need 128GB to be comfortable. For close to 1M context, jump to 256GB+. Do not save money on 96GB and then watch IDE, agent and browser fight for RAM.
02
Provision a NodeMini high-memory Mac node. Pick memory, region, and term on the order page. Seconds-level provisioning. An SSH key pair lands in your inbox. Connect with ssh user@host.
03
Clone, install deps, build. git clone https://github.com/antirez/ds4.git && cd ds4 && make. Metal by default on Apple Silicon. Do not try make cpu on macOS — the README is explicit about the kernel crash risk.
04
Pull the q2-imatrix GGUF and wire up the disk KV cache. Use the bundled download_model.sh for q2 / q2-imatrix / q4. Point --kv-disk-dir at a fixed local SSD path. Set --kv-disk-space-mb to 8–32GB so the disk KV really does the work.
05
Wire ds4-server into your coding agent. Start ./ds4-server --ctx 200000 --kv-disk-dir ... --kv-disk-space-mb 16384. Point Claude Code, Cursor, or opencode at http://127.0.0.1:8000/v1 via SSH port forwarding. Never expose the port publicly. OpenAI and Anthropic tool protocols are natively supported.
06
Lock in the access topology. SSH keys plus a private tunnel such as Tailscale turn the node into a zero-trust private endpoint. Stop the node when idle to stop billing. For always-on use, ship a launchd unit that starts ds4-server at boot, paired with the persistent KV cache.

Compare this to buying a Mac Studio. The local-purchase path has three real downsides: depreciation is glued to an alpha engine and a preview model; a long-running ds4 process competes with daily work for RAM; sharing one high-spec Mac across a team turns into a queue. For developers who want ds4 plus V4 Flash as part of daily work, while keeping depreciation risk on demand, NodeMini's Mac Mini cloud rental is usually the better answer. The fit aligns with the three-year TCO comparison and 24/7 cloud Mac automation. Access details on the help center.

FAQ

Frequently asked questions

Not today. ds4 is a DeepSeek V4 Flash-specific engine. Flash has 284B total and 13B activated parameters. Pro is 1.6T total and 49B activated. Quantized Pro still exceeds current Mac unified memory. For Pro, cloud vLLM or SGLang remains the realistic path.

96GB is the documented floor. Community reports show 2-bit quants running on 96GB Macs, sometimes at 250k context. For comfortable daily use with an editor and agent alongside, 128GB is what antirez actually recommends. Pushing context toward 1M tokens adds about 26GB more. The safe pick is a 256GB+ node — see the rental rates page.

Rent a high-memory Mac node from NodeMini. SSH in, git clone, make, download the GGUF, and start ./ds4-server — under two hours end to end. Access details are in the help center; for keeping an always-on agent paired with the node, see 24/7 cloud Mac automation.