RUST_MODULE_PATH: /home/pigeon/dev/duck-llm/target/release/libduck_llm.so
DPDK initialized successfully. Found 4 ports.
Port 0 driver: net_ice
Port 0 device: 0000:01:00.0
Port 0 bus info: vendor_id=8086, device_id=1593
Port 1 driver: net_ice
Port 1 device: 0000:01:00.1
Port 1 bus info: vendor_id=8086, device_id=1593
Port 2 driver: net_ice
Port 2 device: 0000:01:00.2
Port 2 bus info: vendor_id=8086, device_id=1593
Port 3 driver: net_ice
Port 3 device: 0000:01:00.3
Port 3 bus info: vendor_id=8086, device_id=1593
Thread bound to lcore 2
Thread bound to lcore 8
Thread bound to lcore 4
Thread bound to lcore 6
Port 0 [host_ip 0a150101, bcast_ip 0a1501ff] initialized with MAC: 9C:69:B4:68:64:40
Port 1 [host_ip 0a150201, bcast_ip 0a1502ff] initialized with MAC: 9C:69:B4:68:64:41
Port 2 [host_ip 0a150301, bcast_ip 0a1503ff] initialized with MAC: 9C:69:B4:68:64:42
Port 3 [host_ip 0a150401, bcast_ip 0a1504ff] initialized with MAC: 9C:69:B4:68:64:43
[Worker port 1] [init] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 1] [link-wait] link became up after 0ms
[Worker port 1] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 0] [init] link: up=0 speed=0Mbps duplex=half autoneg=0
[Worker port 2] [init] link: up=0 speed=0Mbps duplex=half autoneg=0
[Worker port 3] [init] link: up=0 speed=0Mbps duplex=half autoneg=0
[Worker port 0] [link-wait] link became up after 158ms
[Worker port 0] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 2] [link-wait] link became up after 153ms
[Worker port 3] [link-wait] link became up after 153ms
[Worker port 2] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 3] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
Initializing DeepseekV32OfficialFp8EagerForCausalLM with config: DeepseekV32Config(
  attention_bias: False
  attention_dropout: 0.0
  bos_token_id: 0
  eos_token_id: 1
  ep_size: 1
  first_k_dense_replace: 3
  hidden_act: silu
  hidden_size: 7168
  index_head_dim: 128
  index_n_heads: 64
  index_topk: 2048
  initializer_range: 0.02
  intermediate_size: 18432
  kv_lora_rank: 512
  max_position_embeddings: 163840
  model_path: /home/pigeon/models/DeepSeek-V3.2
  model_type: deepseek_v32
  moe_intermediate_size: 2048
  n_group: 8
  n_routed_experts: 256
  n_shared_experts: 1
  norm_topk_prob: True
  num_attention_heads: 128
  num_experts_per_tok: 8
  num_hidden_layers: 61
  num_key_value_heads: 128
  num_nextn_predict_layers: 1
  pad_token_id: None
  q_lora_rank: 1536
  qk_nope_head_dim: 128
  qk_rope_head_dim: 64
  quantization_config: {'activation_scheme': 'dynamic', 'fmt': 'e4m3', 'quant_method': 'fp8', 'scale_fmt': 'ue8m0', 'weight_block_size': [128, 128]}
  rms_norm_eps: 1e-06
  rope_scaling: {'beta_fast': 32, 'beta_slow': 1, 'factor': 40, 'mscale': 1.0, 'mscale_all_dim': 1.0, 'original_max_position_embeddings': 4096, 'type': 'yarn'}
  rope_theta: 10000
  routed_scaling_factor: 2.5
  scoring_func: sigmoid
  tie_word_embeddings: False
  topk_group: 4
  torch_dtype: torch.bfloat16
  use_cache: True
  v_head_dim: 128
  vocab_size: 129280
  dequantize_fp8: False
  use_duck: True
  mock_duck: False
  async_duck_init: True
  duck_tp_size: 32
  duck_weight_cache: True
  duck_weight_cache_force_invalidate: False
  fp9_gemv_multiple_of: 64
  flattened_decode: True
  v32_attention_impl: official_fp8
  v32_full_attn_max_seq_len: 2048
  v32_static_graph_max_model_len: 2048
  v32_fp8_kv_cache_block_size: 128
  v32_fp8_kv_cache_round_scale_pow2: True
  v32_triton_fp8_kv_cache_append: False
  v32_triton_fp8_decode_context: False
  v32_mtp1_companion: False
  v32_mtp1_online: True
  v32_mtp1_track_accuracy: True
  v32_mtp1_print_each_step: True
  v32_first_k_dense_cuda: True
  device: cuda
  qk_head_dim: 192
)
duck-llm global options:
  DUCK_TRITON_FP8_LINEAR=1 (default=0): Enable Triton-backed CUDA quantized Linear for eligible bf16@fp8 paths.
  DUCK_CUDA_MEMORY_TRACE=0 (default=0): Enable CUDA memory trace JSONL logging.
  DUCK_CUDA_MEMORY_SNAPSHOT_ON_OOM=0 (default=0): Dump CUDA allocator snapshot on OOM.
cuda_mem_stats label=DeepseekV32OfficialFp8EagerForCausalLM.load_weights.before device=cuda allocated_mib=329.0 reserved_mib=382.0 active_mib=329.0 inactive_split_mib=13.0 free_mib=29385.6 total_mib=32111.5 max_allocated_mib=366.2 max_reserved_mib=382.0
rust_mlp.init_cached layer=0 cache_slot=0: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.048ms, total=0.052ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=0: weights_load=0.000ms, rust_mlp_init=0.065ms, cache_hit=True, prefill_max_batch=3288, prefill_act_quant=False, total=0.065ms
rust_mlp.init_cached layer=1 cache_slot=1: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.042ms, total=0.046ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=1: weights_load=0.000ms, rust_mlp_init=0.062ms, cache_hit=True, prefill_max_batch=3288, prefill_act_quant=False, total=0.062ms
rust_mlp.init_cached layer=2 cache_slot=2: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.052ms, total=0.055ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=2: weights_load=0.000ms, rust_mlp_init=0.068ms, cache_hit=True, prefill_max_batch=3288, prefill_act_quant=False, total=0.068ms
rust_moe.moe_init_cached layer=3 cache_slot=3: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=3: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.047ms
rust_moe.moe_init_cached layer=4 cache_slot=4: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=4: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=5 cache_slot=5: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=5: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=6 cache_slot=6: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=6: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.047ms
rust_moe.moe_init_cached layer=7 cache_slot=7: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=7: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=8 cache_slot=8: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=8: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=9 cache_slot=9: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=9: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=10 cache_slot=10: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=10: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=11 cache_slot=11: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=11: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=12 cache_slot=12: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=12: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.040ms
rust_moe.moe_init_cached layer=13 cache_slot=13: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=13: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=14 cache_slot=14: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=14: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=15 cache_slot=15: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=15: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=16 cache_slot=16: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.037ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=16: weights_load=0.000ms, rust_moe_init=0.048ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.048ms
rust_moe.moe_init_cached layer=17 cache_slot=17: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.036ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=17: weights_load=0.000ms, rust_moe_init=0.047ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.047ms
rust_moe.moe_init_cached layer=18 cache_slot=18: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=18: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=19 cache_slot=19: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=19: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=20 cache_slot=20: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=20: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=21 cache_slot=21: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=21: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.040ms
rust_moe.moe_init_cached layer=22 cache_slot=22: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=22: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=23 cache_slot=23: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=23: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=24 cache_slot=24: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=24: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=25 cache_slot=25: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.035ms, total=0.038ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=25: weights_load=0.000ms, rust_moe_init=0.054ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.054ms
rust_moe.moe_init_cached layer=26 cache_slot=26: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=26: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=27 cache_slot=27: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=27: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=28 cache_slot=28: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=28: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=29 cache_slot=29: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=29: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=30 cache_slot=30: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=30: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=31 cache_slot=31: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=31: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=32 cache_slot=32: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=32: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=33 cache_slot=33: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=33: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=34 cache_slot=34: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=34: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=35 cache_slot=35: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=35: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=36 cache_slot=36: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.036ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=36: weights_load=0.000ms, rust_moe_init=0.049ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.049ms
rust_moe.moe_init_cached layer=37 cache_slot=37: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=37: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=38 cache_slot=38: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=38: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=39 cache_slot=39: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=39: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=40 cache_slot=40: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=40: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=41 cache_slot=41: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=41: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=42 cache_slot=42: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=42: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=43 cache_slot=43: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=43: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=44 cache_slot=44: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=44: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=45 cache_slot=45: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=45: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=46 cache_slot=46: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=46: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=47 cache_slot=47: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=47: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=48 cache_slot=48: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=48: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=49 cache_slot=49: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=49: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=50 cache_slot=50: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=50: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=51 cache_slot=51: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=51: weights_load=0.000ms, rust_moe_init=0.047ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.047ms
rust_moe.moe_init_cached layer=52 cache_slot=52: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=52: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=53 cache_slot=53: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.036ms, total=0.039ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=53: weights_load=0.000ms, rust_moe_init=0.050ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.050ms
rust_moe.moe_init_cached layer=54 cache_slot=54: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=54: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=55 cache_slot=55: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=55: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=56 cache_slot=56: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=56: weights_load=0.000ms, rust_moe_init=0.047ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.047ms
rust_moe.moe_init_cached layer=57 cache_slot=57: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=57: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=58 cache_slot=58: py_prepack=0.000ms,
py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.037ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=58: weights_load=0.000ms, rust_moe_init=0.050ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.050ms
rust_moe.moe_init_cached layer=59 cache_slot=59: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=59: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=60 cache_slot=60: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=60: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=61 cache_slot=61: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.036ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=61: weights_load=0.000ms, rust_moe_init=0.050ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.050ms

=== Linear Load Stats: DeepseekV32OfficialFp8EagerForCausalLM ===
totals: linear_count=498 source_total_mib=16642.2 (weight=16638.6, scale=3.6) resident_total_mib=17634.6 (weight=17630.6, scale=3.3, aux=0.6)

device    count   src_total   res_total   res_w     res_scale   res_aux
-----------------------------------------------------------------------
cuda:0    498     16642.2     17634.6     17630.6   3.3         0.6

by device + category:
device    category                                    count   src_total   res_total   res_w     res_scale   res_aux
-------------------------------------------------------------------------------------------------------------------
cuda:0    fp8_scaled_kept_quantized_triton            434     13686.5     13687.1     13683.1   3.3         0.6
cuda:0    fp8_scaled_dequantized_no_triton_to_bf16    62      992.2       1984.0      1984.0    0.0         0.0
cuda:0    source_bf16_loaded_as_bf16                  2       1963.5      1963.5      1963.5    0.0         0.0

samples:
cuda:0 / fp8_scaled_kept_quantized_triton:
- model.layers.0.mlp.down_proj shape=7168x18432 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
- model.layers.0.mlp.gate_proj shape=18432x7168 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
- model.layers.0.mlp.up_proj shape=18432x7168 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
cuda:0 / fp8_scaled_dequantized_no_triton_to_bf16:
- model.layers.0.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
- model.layers.1.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
- model.layers.2.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
cuda:0 / source_bf16_loaded_as_bf16:
- lm_head shape=129280x7168 src=bf16 loaded=bf16 reason=official_lm_head_bf16_storage_fp32_out device=cuda:0 resident_mib=1767.5
- model.layers.61.eh_proj shape=7168x14336 src=bf16 loaded=bf16 reason=official_dense_bf16 device=cuda:0 resident_mib=196.0

cuda_mem_stats label=DeepseekV32OfficialFp8EagerForCausalLM.load_weights.after device=cuda allocated_mib=20159.0 reserved_mib=20504.0 active_mib=20159.0 inactive_split_mib=345.0 free_mib=9263.6 total_mib=32111.5 max_allocated_mib=20173.0 max_reserved_mib=20504.0
DeepseekV32OfficialFp8EagerForCausalLM static graph prebuild: max_model_len=2048
DeepseekV32OfficialFp8Eager prepare_generation_cache: required_total_length=2048 per_layer_cache_mib=1.3 total_cache_mib=78.2
cuda_mem_stats label=DeepseekV32OfficialFp8Eager.prepare_generation_cache device=cuda allocated_mib=20237.2 reserved_mib=20582.0 active_mib=20237.2 inactive_split_mib=344.8 free_mib=9185.6 total_mib=32111.5 max_allocated_mib=20237.2 max_reserved_mib=20582.0

=== Duck Load Stats: DeepseekV32OfficialFp8EagerForCausalLM ===
totals: module_count=62 source_total_mib=635657.2 theory_per_rank_mib=19864.3 rust_buf_per_rank_mib=22351.6 rust_buf_all_ranks_mib=715250.6 weights_buffer_used_mib=22351.6

group                 count   src_total   theory/rk   rust/rk   rust/all
-----------------------------------------------------------------------
duck_moe_cache_hit    59      634522.9    19828.8     22311.7   713973.8
duck_mlp_cache_hit    3       1134.3      35.4        39.9      1276.9

samples:
duck_moe_cache_hit:
- model.layers.3.mlp.experts layer=3 cache_slot=3 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
- model.layers.4.mlp.experts layer=4 cache_slot=4 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
- model.layers.5.mlp.experts layer=5 cache_slot=5 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
duck_mlp_cache_hit:
- model.layers.0.mlp layer=0 cache_slot=0 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3
- model.layers.1.mlp layer=1 cache_slot=1 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3
- model.layers.2.mlp layer=2 cache_slot=2 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3

DeepseekV32OfficialFp8EagerForCausalLM wait_pending_init=0.000ms
DeepseekV32OfficialFp8EagerForCausalLM initialized
Please enter your question (type "exit" to quit): Do you think this project can really pull it off? Why?
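The `cuda_mem_stats` lines in the transcript above are flat, whitespace-separated `key=value` records, which makes them easy to track across load stages. A minimal parsing helper (hypothetical, not part of duck-llm) might look like this:

```python
def parse_cuda_mem_stats(line: str) -> dict:
    """Parse a `cuda_mem_stats` log line into a dict, converting numeric fields to float."""
    assert line.startswith("cuda_mem_stats"), "not a cuda_mem_stats line"
    out = {}
    for token in line.split()[1:]:
        key, _, value = token.partition("=")
        try:
            out[key] = float(value)
        except ValueError:
            out[key] = value  # non-numeric fields such as label= and device=
    return out


# Example: the post-load line from the transcript above.
line = (
    "cuda_mem_stats label=DeepseekV32OfficialFp8EagerForCausalLM.load_weights.after "
    "device=cuda allocated_mib=20159.0 reserved_mib=20504.0 active_mib=20159.0 "
    "inactive_split_mib=345.0 free_mib=9263.6 total_mib=32111.5 "
    "max_allocated_mib=20173.0 max_reserved_mib=20504.0"
)
stats = parse_cuda_mem_stats(line)
print(stats["allocated_mib"], stats["free_mib"])  # prints "20159.0 9263.6"
```

Diffing two such dicts (e.g. `load_weights.before` vs `load_weights.after`) gives the per-stage memory delta directly.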
duck-llm: a heterogeneous LLM inference system on 1×RTX 5090 + 32×N100

A "personal recreation project".

No NVLink, no InfiniBand. Just one workstation, 32 Intel N100 mini machines (low power, 4 E-cores (Alder Lake-N, Gracemont family), 6 W TDP, single-channel memory, AVX2 only; hardware that looks more like online-judge nodes or mini PCs than an inference cluster), and Ethernet.

Final result: DeepSeek-V3.2 (685B parameters, including the MTP head) decodes at 15.301 tok/s with an MTP1 acceptance rate of 87.1%.

What this is

The starting point of this system was not "I want to build an LLM inference system" but a batch of machines that already existed (32 Intel N100 mini boxes, originally online-judge (OJ) nodes) and the question of whether a path to running a large model could be brute-forced through them.

By late March 2026, the number at the end of that path reads: 15.301 tok/s, running DeepSeek-V3.2 (685B parameters, including the MTP head), computed cooperatively over Ethernet between an i9-13900K + RTX 5090 host and the 32 N100s.

The system in one sentence

Attention runs on the single RTX 5090; the MoE/FFN expert layers are scattered across the 32 N100s as fp9 GEMV; the two sides communicate over DPDK user-space networking, with JudgeDuck-OS as the remote kernel-launch runtime. This division of labor is called AFD (Attention-FeedForward Disaggregation).

Reading guide

The report is split into Part I (building the system) and Part II (optimization and results), 12 chapters in total. Each chapter can be read on its own, but it is recommended to read Part I in order before moving on to Part II.

Part I: building the system

| Ch. | Title | One-line summary |
|-----|-------|------------------|
| 1 | Design goals and hardware constraints | Where the hardware came from, why it ended up together, and how the constraints lead to the disaggregated architecture |
| 2 | System overview | Four layers: Python model mainline → Rust orchestration → C++ DPDK → JudgeDuck-OS remote execution |
| 3 | The single-core MLP fp9 starting point | The motivation for the fp9 format, its bit-layout design, and its first entry into the real path |
| 4 | JudgeDuck-OS as a runtime | Five modifications turning "one ELF per judging run" into a low-latency remote kernel launcher |
| 5 | DPDK and "one duck, three ways" (一鸭三吃) | A communication stack forced to evolve by the data: socket → async → sendmmsg → DPDK |

Part II: optimization and results

| Ch. | Title | One-line summary |
|-----|-------|------------------|
| 6 | Model integration paths | Qwen3 → Qwen3.5 → DeepSeek-V3.2: structural differences and common ground across the three routes |
| 7 | Compressing decode fixed costs | static KV cache → flash attention 2 → flattened decode → CUDA Graph → Triton fp8 → V3.2 decode context |
| 8 | Extending fp9 to the full remote path | MoE integration, the sub-128 unlock, same-ELF SMP, prefill, weight cache, persistent judge |
| 9 | MTP1 dual-lane overlap | Stacking the GPU and duck pipelines on top of each other without adding a second flock of ducks |
| 10 | Methodology | roofline measurement, breakdown analysis, forced-cold validation: measurement over gut feeling |
| 11 | Final configuration and final results | The 1024-token long run: 15.301 tok/s, accept=1 on 891/1023 steps, with a breakdown of contributions |
| 12 | Negative results, counterintuitive findings, unfinished work | Optimization paths that did not hold up, findings that contradict intuition, and what remains undone |

How the results are reported

The numbers on this front page come from a single 1024-token, single-prompt long run (prompt: 104857601是质数吗, "Is 104857601 prime?"), with max_model_len=2048. This is the endpoint of the current best engineering path, not a standard benchmark summary. Chapter 11 gives the detailed reporting conventions and caveats.

Quick reference

Hardware scale

| Item | Value |
|------|-------|
| Active ducks | 32 × Intel N100 (4 cores, 32 GB DDR4-3200) |
| host GPU | 1 × NVIDIA RTX 5090 (32 GB GDDR7) |
| host NIC | 4 × Intel E810-C 25GbE |
| Network topology | 4 isolated broadcast domains, 8 ducks per group |

Final results (DeepSeek-V3.2, real long run)

Evidence bundle from the same 1024-token long run (prompt: 104857601是质数吗, max_model_len=2048).

| Metric | Value |
|--------|-------|
| decode throughput | 15.301 tok/s |
| TTFT | 0.857 s |
| TPOT | 65 ms |
| MTP1 accept=1 | 891 / 1023 ≈ 87.1% |

Key mechanism metrics (inside the real V3.2 system)

These are local metrics that explain why the final result holds; they are not end-to-end throughput.

| Metric | Value | Source |
|--------|-------|--------|
| persistent judge_gap | 418µs → 78µs | Ch. 8 |
| dual-lane overlap ratio | ~0.434 (mean over 549 steps) | Ch. 9 |
| duck-side parameter throughput | 15.0~15.7 Gparam/s | Ch. 11 |

Cross-model milestones

Evidence that the same system architecture works across different models.

| Model | Change | Metric | Source |
|-------|--------|--------|--------|
| Qwen3-30B-A3B tp24 | flattened runtime + CUDA Graph | 18 → 24 tok/s | Ch. 7 |
| Qwen3.5-35B-A3B tp16 | flattened decode | 24.2 → 37.9 tok/s | Ch. 7 |

Targeted benchmarks / methodology

The following are kernel-level or methodology-specific measurements, not final system throughput.

| Metric | Value | Source |
|--------|-------|--------|
| fp9 single-core kernel | ~5.4 Gparam/s | Ch. 8 |
| MoE fp9 hot (4-core methodology breakdown, wall) | ~20.30 GB/s | Ch. 10 |

Key commits

| Commit | Description |
|--------|-------------|
| ab33e9e | fp9: prune dead common paths and land final 64/128 kernels |
| d32c6fe | deepseek_v32: land MTP1 two-lane online path and trace-ready runtime |
| 279d749 | Enable persistent fp9 judge by default for decode forward |
| e553d30 | Fix DeepSeek V3.2 flattened active_len and add critical tests |

Project code

- duck-llm: model loading, host runtime, GPU/duck cooperation, FP8/FP9 paths, DPDK integration
- JudgeDuck-OS: remote kernel-launch runtime, same-ELF SMP, persistent judge

Session

<|begin▁of▁sentence|><|User|>Do you think this project can really pull it off? Why?
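The AFD split implies that every decode step ships router activations out to the ducks and expert outputs back over Ethernet. A rough per-step traffic estimate, using hidden_size=7168 and 61 layers from the config dump in the log (the wire format, activation dtype, and reduction scheme are not documented in this excerpt, so the sketch assumes bf16 activations, one broadcast per broadcast domain, and one partial result returned per duck):

```python
hidden = 7168          # hidden_size from the config dump
layers = 61            # num_hidden_layers; assume every layer's FFN runs on ducks
ducks = 32             # tp=32
domains = 4            # 4 isolated broadcast domains (matches the 4-port NIC log)
bytes_per_act = 2      # assumption: bf16 activations on the wire

send_per_layer = domains * hidden * bytes_per_act  # one broadcast per domain
recv_per_layer = ducks * hidden * bytes_per_act    # one partial sum back per duck
per_token_mb = layers * (send_per_layer + recv_per_layer) / 1e6
print(f"~{per_token_mb:.1f} MB moved per decoded token")

toks = 15.301          # reported decode throughput
gbit_s = per_token_mb * toks * 8 / 1000
print(f"~{gbit_s:.2f} Gbit/s total, spread over 4x 25GbE links")
```

Under these assumptions the activation traffic is a few Gbit/s, comfortably inside the aggregate 100 Gbit/s of the four E810-C ports, which is consistent with the claim that decode is not wire-bandwidth-bound.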
<|Assistant|>

{'input_ids': tensor([[ 0, 128803, 50492, ..., 201, 128804, 128798]]), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1]])}
tensor([[ 0, 128803, 50492, ..., 201, 128804, 128798]])
DeepseekV32OfficialFp8Eager prepare_generation_cache: required_total_length=2048 per_layer_cache_mib=1.3 total_cache_mib=78.2
cuda_mem_stats label=DeepseekV32OfficialFp8Eager.prepare_generation_cache device=cuda allocated_mib=20501.5 reserved_mib=21354.0 active_mib=20501.5 inactive_split_mib=850.5 free_mib=8303.6 total_mib=32111.5 max_allocated_mib=20501.7 max_reserved_mib=21354.0

We are asked: "Do you think this project can really pull it off? Why?" about a project called "duck-llm", described as a heterogeneous LLM inference system built on 1×RTX 5090 + 32×N100. It is a personal recreation project. The claim is that with no NVLink and no InfiniBand, just one workstation and 32 Intel N100 mini machines connected over Ethernet, they ultimately reached a decode throughput of 15.301 tok/s on DeepSeek-V3.2 (685B parameters) with an MTP1 acceptance rate of 87.1%.

Question: do we believe this project can really do it? Why?
We need to analyze the feasibility of this project. First, it describes a highly unconventional hardware setup: one powerful GPU (an RTX 5090, taken here as a future GPU, possibly simulated or hypothetical) plus many low-power mini PCs (N100), connected over Ethernet with no high-speed interconnect. The claim is that by placing attention on the GPU and scattering the MoE/FFN expert layers across the N100s as fp9 GEMV, they achieved respectable performance.

Key points:
- The model is DeepSeek-V3.2, with 685B parameters, a mixture-of-experts (MoE) model. In an MoE model only a subset of the experts is activated in each layer, so scattering the experts across many devices may be a good fit.
- They use 32 N100s, each with 4 E-cores (Gracemont), a 6 W TDP, single-channel memory, and only AVX2. These are low-power, low-performance CPUs.
- They use DPDK for user-space networking, plus a custom remote kernel-launch runtime (JudgeDuck-OS).
- Reported: 15.301 tok/s decode throughput, TTFT 0.857 s, TPOT 65 ms.

We need to judge whether this performance number is plausible. 15.3 tok/s for a 685B model on a single GPU alone (even an RTX 5090) would be impossible, given memory-bandwidth and capacity limits. Here, however, 32 extra low-power CPU nodes help compute the MoE layers.

In an MoE model the feed-forward layers are expert layers, and usually only a few experts are activated (DeepSeek-V3.2 uses MoE with many experts per layer, but each token is routed to only a few of them). So each token only touches a fraction of the parameters. If the experts are distributed across the N100 nodes, each node computes only a slice; but the activations must be shipped to the relevant expert nodes for every token and the results shipped back, which introduces network overhead.

They claim a custom 9-bit floating-point format (fp9) to cut both data movement and compute. The N100 only supports AVX2, so

Generation stats: time 114.340s, input_len 1481, output_len 567, ttft 74.499s, tpot 0.070s, prefill 19.880 tok/s, decode 14.206 tok/s

Please enter your question (type exit to quit):
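The bandwidth argument the model was developing before the trace cuts off can be made concrete with a back-of-envelope bound. A sketch using the shapes from the config dump and the load stats above (assumptions: all 62 duck-resident FFN modules stream their active weights from memory for every token, fp9 costs 9/8 bytes per parameter, and the ~20.3 GB/s hot MoE bandwidth quoted from Ch. 10 holds per duck):

```python
# Shapes from the config dump in the log above.
hidden, moe_inter, dense_inter = 7168, 2048, 18432
topk, shared = 8, 1  # num_experts_per_tok, n_shared_experts

per_moe_layer = (topk + shared) * 3 * hidden * moe_inter  # gate/up/down projections
per_dense_layer = 3 * hidden * dense_inter
# 59 MoE modules (incl. the MTP layer) + 3 dense MLPs live on the ducks,
# matching module_count=62 in the Duck Load Stats.
active = 59 * per_moe_layer + 3 * per_dense_layer
print(f"duck-side active params per token: {active / 1e9:.1f} G")

# Weight-streaming bound: fp9 weights cost 9/8 bytes per parameter; assume
# ~20.3 GB/s of hot bandwidth per duck across 32 ducks.
bytes_per_token = active * 9 / 8
bound = 32 * 20.3e9 / bytes_per_token
print(f"weight-streaming upper bound: ~{bound:.1f} tok/s (reported: 15.301)")
```

Under these assumptions the streaming bound comes out above the reported 15.301 tok/s, so the headline number sits below the aggregate memory-bandwidth ceiling of the 32 ducks rather than beyond it.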