On June 30, 2026, Huawei delivered on the promise made at HDC 2026: openPangu-2.0-Flash weights, inference code, and training operators went live on GitCode. This is the first frontier-scale open-source LLM trained entirely on Huawei Ascend 910B NPUs — no A100, no H100, no NVIDIA silicon anywhere in the pipeline. Two MoE variants share a 512K context window: Pro at 505B total / 18B active and Flash at 92B total / 6B active. Huawei plans to open-source seven components including pre-training and post-training code — a level of transparency almost unheard of at this scale. This guide covers the release timeline, architecture innovations (mHC, Muon, ModAttn, DSA+SWA), Ascend training metrics, competitor comparison tables, ModelArts API and GitCode deployment steps, hardware requirements, sovereign AI implications, and the openPangu License.
Richard Yu unveiled openPangu 2.0 at HDC 2026 in Dongguan on June 12, 2026. Flash weights shipped eighteen days later. Pro follows in July. The rest of the training stack rolls out through H2 2026.
| Date | Milestone |
|---|---|
| 2026-06-12 | HDC 2026 keynote: openPangu 2.0 officially announced |
| 2026-06-30 | Flash weights, base inference code, and training operators live on GitCode |
| 2026-07 (planned) | Pro model weights and inference code |
| H2 2026 (planned) | Pre-training code, post-training code (SFT/RLHF), additional training operators |
| Spec | openPangu 2.0 Pro | openPangu 2.0 Flash |
|---|---|---|
| Total parameters | 505B | 92B |
| Active parameters | 18B | 6B |
| Sparsity ratio | ~28:1 | ~15:1 |
| Context window | 512K tokens | 512K tokens |
| Status | July 2026 (planned) | Live since June 30 |
| License | openPangu License (permissive commercial use, royalty-free) | openPangu License (permissive commercial use, royalty-free) |
512K context is roughly eight full-length novels in a single prompt — entire codebases, contract bundles with appendices, or hours of transcript data without chunking.
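The sparsity ratios in the spec table are simply total parameters divided by active parameters. A minimal Python check, using only the figures from the table (the function name is illustrative):

```python
def sparsity_ratio(total_params_b: float, active_params_b: float) -> float:
    """Return the total-to-active parameter ratio for an MoE model (both in billions)."""
    if active_params_b <= 0:
        raise ValueError("active parameter count must be positive")
    return total_params_b / active_params_b


# Figures from the spec table.
variants = {"Pro": (505, 18), "Flash": (92, 6)}

for name, (total, active) in variants.items():
    ratio = sparsity_ratio(total, active)
    print(f"{name}: ~{ratio:.1f}:1 ({active / total:.1%} of weights active per token)")
```

This reproduces the table's roughly 28:1 and 15:1 figures and also shows the share of weights that is active for each token.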
Most labs release weights and inference code. Huawei is shipping the full stack in seven parts:
| # | Component | Status |
|---|---|---|
| 1 | Model architecture | Released June 30 |
| 2 | Model weights (Flash) | Released June 30 |
| 3 | Technical report | Released June 30 |
| 4 | Inference code + training operators | Released June 30 |
| 5 | Model weights (Pro) | Planned July 2026 |
| 6 | Pre-training code | Planned H2 2026 |
| 7 | Post-training code (SFT/RLHF) | Planned H2 2026 |
Items 6 and 7 are the differentiator. When they ship, researchers can reproduce the full training pipeline on Ascend hardware and enterprises can run domain-specific pre-training from scratch — not just fine-tune released weights.
openPangu 2.0 is a Mixture-of-Experts architecture with four headline innovations: mHC, Muon, ModAttn, and DSA+SWA.
If you are evaluating openPangu 2.0 against NVIDIA-trained alternatives, these six misconceptions will skew your decision:
"Open weights" equals "open training": DeepSeek, Qwen, and Kimi ship weights and inference. openPangu 2.0 will also ship pre-training and post-training code — a qualitatively different transparency level.
Context window is a checkbox: 512K is not marginally better than 128K. It is 4x the 128K window of DeepSeek V4 Pro and Qwen 3.7 Max and 2x the 256K window of Kimi K2.7, which changes what you can fit in a single Agent context without retrieval hacks.
Active parameters predict everything: Flash activates 6B per token. Pro activates 18B. DeepSeek V4 Pro activates ~200B. Raw reasoning depth still favors higher activation counts for the hardest tasks.
NVIDIA-free means CUDA-free inference only: The training pipeline also ran on Ascend 910B. Sovereignty covers the full stack — not just deployment on alternative hardware.
MoE train/inference mismatch is inevitable: openPangu reports >99% train/inference consistency — a metric that matters when routing and expert activation patterns diverge in production.
Benchmark leaderboards are current: Independent third-party benchmarks were not available as of July 1, 2026. Architecture-based expectations place Pro in the dense-70B class for general tasks, with clear long-context advantages. Treat leaderboard claims as provisional until Hugging Face Open LLM Leaderboard and LiveBench publish results.
Every other frontier open-source model — DeepSeek V4 Pro, Qwen 3.7 Max, Kimi K2.7, Llama 4 — was trained on NVIDIA hardware. openPangu 2.0 is the outlier: the entire 505B MoE pre-training run completed on Huawei Ascend 910B NPUs.
Huawei developed the Ascend 910B under US export controls that restricted China's access to A100 and H100 clusters. openPangu 2.0 is evidence that a frontier-quality training pipeline is now viable without CUDA.
| Metric | Result | Why it matters |
|---|---|---|
| Single-card throughput | 2x vs mainstream open-source models on Ascend | Native Ascend affinity — models not designed for NPU run slower |
| Hypernode training efficiency | +30% | Multi-node scaling on Ascend clusters |
| 512K long-sequence training | +50% throughput | Long-context training is usually the bottleneck; this is the headline feature |
| Train/inference consistency | >99% | MoE models often degrade when expert routing diverges at inference time |
| Inference latency | 1.2x better than comparable models | Lower time-to-first-token on Ascend-optimized paths |
| Flash-Int8 quantization | 40% less memory, <10% quality loss | W4A8 quantization for constrained deployments |
Ascend runs on CANN (Huawei's CUDA-equivalent stack, open-sourced in late 2025) and torch_npu — a PyTorch adapter that lets standard PyTorch code target Ascend NPUs with a single import. Edge deployment adds a 30B embedded variant with 50% faster inference and 20% lower memory, running offline on Kirin mobile chips.
"In my dictionary, there is no second place — only first. We will go from number one in China to number one in the world."
Richard Yu's HDC 2026 framing positions openPangu 2.0 as a sovereignty statement as much as a product launch — proof that export controls did not halt frontier-scale AI development.
Citable hard numbers: 505B total / 18B active (Pro), 92B total / 6B active (Flash), 512K context on both variants, 2x Ascend throughput, >99% train/inference consistency, seven planned open-source components.
English-language coverage of Chinese AI models often stops at DeepSeek. openPangu 2.0 occupies a different niche: sovereign hardware, extreme context, and full-stack openness. Here is how it compares on paper.
| Model | Total params | Active params | Context | License | Training HW | Open depth |
|---|---|---|---|---|---|---|
| openPangu 2.0 Pro | 505B | 18B | 512K | openPangu | Ascend NPU | Full stack (7 components) |
| openPangu 2.0 Flash | 92B | 6B | 512K | openPangu | Ascend NPU | Full stack (7 components) |
| DeepSeek V4 Pro | 1.6T | ~200B | 128K | MIT | NVIDIA | Weights + inference |
| Qwen 3.7 Max | ~400B+ | varies | 128K | Apache 2.0 | NVIDIA | Weights + inference + partial training |
| Kimi K2.7 | 1T | 32B | 256K | Modified MIT | NVIDIA | Weights + inference |
| Llama 4 405B | 405B | — | 128K | Llama License | NVIDIA | Weights + inference |
| Dimension | openPangu 2.0 Pro | DeepSeek V4 Pro | Qwen 3.7 Max | Kimi K2.7 |
|---|---|---|---|---|
| Code generation | Strong | Best in class | Very strong | Very strong |
| Complex reasoning | Good | Best in class | Best in class | Very strong |
| Tool use / Agents | Very strong | Very strong | Very strong | Best in class (MCP ecosystem) |
| Ultra-long context | Leader (512K) | Moderate (128K) | Moderate (128K) | Strong (256K) |
| Inference efficiency | Leader on Ascend | Moderate | Moderate | Strong |
| Sovereign / domestic HW | Only option | Low | Low | Low |
| Full training pipeline open | Leader (planned) | Partial | Partial | Partial |
| Your primary need | Recommended model | Reason |
|---|---|---|
| Code generation / complex reasoning | DeepSeek V4 Pro | ~200B active parameters, current open-source leader |
| Agent / multi-tool orchestration | Kimi K2.7 | MCP ecosystem maturity |
| Documents >256K tokens | openPangu 2.0 Pro | 512K context — no chunking required |
| Sovereign AI / no NVIDIA dependency | openPangu 2.0 | Only frontier model trained entirely off NVIDIA |
| Ascend / Huawei Cloud deployment | openPangu 2.0 | Native optimization, 2x throughput on Ascend |
| Edge / mobile (HarmonyOS) | openPangu Embedded (30B) | Kirin chip offline inference |
| Low-cost local inference | openPangu 2.0 Flash | 6B active, ~96GB unified memory |
Benchmark disclaimer: Independent third-party benchmarks were not available as of July 1, 2026. Capability ratings above are architecture-based inferences. This article will be updated when Hugging Face Open LLM Leaderboard, LiveBench, or peer-reviewed evaluations publish verified scores.
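The decision table can be expressed as a small routing function. This is a sketch only: the `Need` labels, the `pick_model` helper, and the 256,000-token cutoff are illustrative choices made here, and the model strings are the names used in the table, not API identifiers.

```python
from enum import Enum


class Need(Enum):
    CODE_OR_REASONING = "code generation / complex reasoning"
    AGENT_TOOLS = "agent / multi-tool orchestration"
    SOVEREIGN = "sovereign AI / no NVIDIA dependency"
    ASCEND_CLOUD = "Ascend / Huawei Cloud deployment"
    EDGE_MOBILE = "edge / mobile (HarmonyOS)"
    LOW_COST_LOCAL = "low-cost local inference"


# Mapping taken from the decision table.
RECOMMENDED = {
    Need.CODE_OR_REASONING: "DeepSeek V4 Pro",
    Need.AGENT_TOOLS: "Kimi K2.7",
    Need.SOVEREIGN: "openPangu 2.0",
    Need.ASCEND_CLOUD: "openPangu 2.0",
    Need.EDGE_MOBILE: "openPangu Embedded (30B)",
    Need.LOW_COST_LOCAL: "openPangu 2.0 Flash",
}

LONG_DOC_THRESHOLD = 256_000  # tokens; assumed encoding of the table's ">256K" row


def pick_model(need: Need, prompt_tokens: int = 0) -> str:
    """Return the table's recommendation; very long prompts override the stated need."""
    if prompt_tokens > LONG_DOC_THRESHOLD:
        return "openPangu 2.0 Pro"
    return RECOMMENDED[need]


print(pick_model(Need.AGENT_TOOLS))
print(pick_model(Need.CODE_OR_REASONING, prompt_tokens=300_000))
```

Letting prompt length override the stated need is a design choice in this sketch: a prompt that exceeds the other models' context windows cannot be served by them without chunking, whatever the task.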
Three paths: cloud API for zero hardware, GitCode self-deploy on Ascend, or PyTorch + torch_npu for custom integration. Follow these steps in order.
Choose your tier: Flash is live now for low-latency, high-concurrency workloads. Pro ships in July 2026 for maximum long-document quality. Flash-Int8 cuts memory by 40% with under 10% quality loss.
Cloud API (fastest): Register at huaweicloud.com, open ModelArts, navigate to AI Gallery, search "openPangu 2.0", subscribe to Flash or Pro, and copy your API endpoint and auth token.
Call the ModelArts Chat Completions API:
```bash
curl -X POST "https://modelarts.${REGION}.myhuaweicloud.com/v1/infers/openpangu-2-flash/chat/completions" \
  -H "Content-Type: application/json" \
  -H "X-Auth-Token: ${TOKEN}" \
  -d '{
    "model": "openpangu-2.0-flash",
    "messages": [
      {"role": "user", "content": "Explain MoE architecture in simple terms"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```
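The same request can be built from Python with only the standard library. The sketch below mirrors the curl call above: the endpoint path, header name, and model string are copied from it rather than verified against the ModelArts documentation, and `build_chat_request` is a hypothetical helper name. It constructs the request without sending it, so you can inspect it first.

```python
import json
import urllib.request


def build_chat_request(region: str, token: str, prompt: str,
                       max_tokens: int = 1024,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) the chat-completions POST shown in the curl example."""
    url = (f"https://modelarts.{region}.myhuaweicloud.com"
           "/v1/infers/openpangu-2-flash/chat/completions")
    payload = {
        "model": "openpangu-2.0-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


if __name__ == "__main__":
    # The region and token below are placeholders; substitute your own values.
    req = build_chat_request("cn-north-4", "YOUR_TOKEN",
                             "Explain MoE architecture in simple terms")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=60), then json.load() the response.
```

Keeping construction separate from sending makes it easy to log or unit-test the payload before any tokens are spent.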
Self-deploy from GitCode: Clone from Ascend Tribe. Key repos: openPangu-2.0-Flash, openPangu-2.0-Flash-Int8, openPangu-2.0-Infer, openPangu-2.0-Op (Ascend custom operators).
Run Flash inference on a single Ascend 910B:
```bash
python inference.py \
  --model_path ./openPangu-Flash \
  --device npu:0 \
  --context_length 512000 \
  --precision bf16
```
For the Int8 quantized build:
```bash
python inference.py \
  --model_path ./openPangu-Flash-Int8 \
  --device npu:0 \
  --quantization int8
```
Run Pro distributed inference (when weights ship):
```bash
python distributed_inference.py \
  --model_path ./openPangu-Pro \
  --num_devices 8 \
  --context_length 512000
```
Domain fine-tuning with LoRA:
```bash
python finetune.py \
  --model_path ./openPangu-Pro \
  --data_path ./domain_data \
  --output_dir ./fine_tuned_model \
  --method lora \
  --lora_rank 16
```
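To get a feel for what `--lora_rank 16` implies: LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors, which adds r × (d_in + d_out) trainable parameters per adapted matrix. The sketch below does that arithmetic. The 4096 × 4096 projection size is a made-up illustration, not an openPangu dimension, and the source does not say which layers `finetune.py` adapts.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in); the base matrix stays frozen.
    """
    return rank * (d_in + d_out)


# Hypothetical projection size, for illustration only.
d_in, d_out, rank = 4096, 4096, 16

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full matrix: {full:,} params")
print(f"LoRA rank {rank}: {lora:,} params ({lora / full:.2%} of the full matrix)")
```

Doubling the rank doubles the adapter size, so rank is the main knob for trading adapter capacity against memory and training time.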
PyTorch + torch_npu integration:
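```python
import torch
import torch_npu  # Registers the Ascend "npu" device type with PyTorch

# load_openpangu and input_ids are placeholders: supply your own model loader
# and a tensor of token IDs produced by the model's tokenizer.
model = load_openpangu("./openPangu-Flash")
model = model.to("npu:0")

output = model.generate(
    input_ids.to("npu:0"),  # inputs must live on the same device as the model
    max_new_tokens=512,
    temperature=0.7,
)
```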
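If the same script has to run on machines with and without an Ascend card, a guarded import avoids hard failures. A minimal sketch, assuming `torch_npu` exposes `torch.npu.is_available()` once imported; `pick_device` is an illustrative helper, not part of either library.

```python
import importlib.util


def pick_device() -> str:
    """Return "npu:0" when torch_npu is installed and an NPU is visible, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch itself is missing; nothing else to select
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"  # Ascend adapter not installed
    import torch
    import torch_npu  # noqa: F401  (the import registers the "npu" device type)
    return "npu:0" if torch.npu.is_available() else "cpu"


device = pick_device()
print(f"running on {device}")
```

The returned string can be passed straight to `model.to(device)`, so the rest of the script stays identical across hosts.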
| Version | Recommended hardware | Minimum config | Notes |
|---|---|---|---|
| Flash (6B active) | Single Ascend 910B | ~96GB unified memory | Community tests on large-memory systems |
| Flash-Int8 | Single Ascend Atlas A2 | ~48GB VRAM | W4A8 quantization, <10% quality loss |
| Pro (18B active) | 4+ Ascend 910B cards | Multi-card cluster | Weights available July 2026 |
| Embedded (edge) | Kirin mobile SoC | 30B on-device model | 50% faster inference, 20% less memory vs prior gen |
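A quick way to sanity-check memory figures like these is weights-only arithmetic: parameter count times bytes per parameter. The sketch below applies that to Flash's 92B total parameters. It ignores KV cache, activations, and runtime overhead, and it is not how the figures in the table were derived; the function name is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


FLASH_TOTAL_B = 92  # total parameters in billions, from the spec table

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(FLASH_TOTAL_B, bits):.0f} GB")
```

In a typical MoE deployment every expert has to be resident even though only 6B parameters are active per token, so total rather than active parameters drive the weight footprint.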
Pair this deployment guide with the OpenRouter June 2026 rankings if you are building a multi-model routing layer alongside Ascend-hosted inference.
US export controls have restricted China's access to NVIDIA's most advanced GPUs since 2022. Huawei responded by building the Ascend stack — CANN, torch_npu, and now a 505B MoE model trained entirely on domestic silicon. openPangu 2.0 is not an incremental release; it is the first proof that frontier-scale open-source AI can exist outside the CUDA ecosystem.
For enterprises under data-sovereignty, domestic procurement, or "信创" (information technology innovation) compliance requirements, openPangu 2.0 is currently the only frontier open-weight model with zero NVIDIA dependency across training and deployment.
Releasing pre-training and post-training code — not just weights — serves three goals:
openPangu 2.0 is the AI backbone for Huawei's broader platform strategy, not a standalone model release:
openPangu 2.0 is released on GitCode under the openPangu License, which permits royalty-free commercial use.
Pro weights in July. Pre-training and post-training code in H2. Each release is an opportunity to update routing policies, benchmark scores, and deployment playbooks. Watch GitCode Ascend Tribe and the HDC developer portal for component drops.
Disclaimer: Benchmark and capability assessments in this article are architecture-based inferences as of July 1, 2026. Independent third-party test results will replace provisional ratings when published. Last updated: July 1, 2026.
openPangu 2.0 solves the model and inference layer — especially for 512K document Agents and Ascend-native deployments. It does not solve the execution environment for developers who also run iOS CI, CLI coding Agents, and multi-hour SSH sessions alongside model API calls.
A personal MacBook thermal-throttles during 12-hour Agent loops. A cheap Linux VPS cannot run xcodebuild, notarytool, or Keychain-isolated signing. Chasing sovereign inference on Ascend while running Agents on unstable local hardware optimizes the wrong layer.
Teams building long-context document Agents on openPangu — or routing between openPangu, DeepSeek, and Claude via OpenRouter — need a fixed SSH execution node for the workloads models cannot run: iOS builds, notarization, and persistent CLI Agent sessions.
NodeMini Mac Mini cloud rental is the execution-layer complement: swap ModelArts endpoints or GitCode deployments in your gateway while SSH nodes and CI labels stay fixed. When Pro weights ship in July or pre-training code drops in H2, you update inference config — not rebuild a laptop environment.
For stable iOS CI/CD and AI Agent automation alongside openPangu or multi-model stacks, NodeMini's Mac Mini cloud rental is usually a better fit than personal hardware or API-only setups. See the help center for access setup and rental rates for M4 Pro and M4 Max tiers.
Yes, openPangu 2.0 runs end to end without NVIDIA hardware. The entire training pipeline ran on Ascend 910B NPUs with no A100 or H100 involvement. Inference uses CANN and torch_npu for native Ascend deployment. Flash weights are live on GitCode for self-hosted Ascend inference today.
DeepSeek V4 Pro leads on coding and complex reasoning with ~200B active parameters versus openPangu Pro's 18B. openPangu 2.0 wins on 512K context (4x most rivals), 2x Ascend throughput, planned full training-pipeline open source, and sovereign AI compliance with zero NVIDIA dependency. Route by task: DeepSeek for hardest reasoning, openPangu for ultra-long documents and Ascend environments.
Flash (6B active): single Ascend 910B or roughly 96GB unified memory. Flash-Int8: about 48GB VRAM on Ascend Atlas A2. Pro (18B active): multi-card Ascend 910B cluster when weights ship in July 2026. Compare infrastructure costs against NodeMini rental rates if you are splitting inference (Ascend) and Agent execution (cloud Mac).
Use Huawei Cloud ModelArts for API access without hardware, or self-deploy from GitCode on Ascend NPUs. Pair 512K document Agents with a dedicated cloud Mac for iOS CI and long-session CLI Agents. Start with the help center for provisioning and the six-step deployment guide in Section 04 above.