RUST_MODULE_PATH: /home/pigeon/dev/duck-llm/target/release/libduck_llm.so
DPDK initialized successfully. Found 4 ports.
Port 0 driver: net_ice
Port 0 device: 0000:01:00.0
Port 0 bus info: vendor_id=8086, device_id=1593
Port 1 driver: net_ice
Port 1 device: 0000:01:00.1
Port 1 bus info: vendor_id=8086, device_id=1593
Port 2 driver: net_ice
Port 2 device: 0000:01:00.2
Port 2 bus info: vendor_id=8086, device_id=1593
Port 3 driver: net_ice
Port 3 device: 0000:01:00.3
Port 3 bus info: vendor_id=8086, device_id=1593
Thread bound to lcore 2
Thread bound to lcore 4
Thread bound to lcore 8
Thread bound to lcore 6
Port 0 [host_ip 0a150101, bcast_ip 0a1501ff] initialized with MAC: 9C:69:B4:68:64:40
Port 1 [host_ip 0a150201, bcast_ip 0a1502ff] initialized with MAC: 9C:69:B4:68:64:41
Port 2 [host_ip 0a150301, bcast_ip 0a1503ff] initialized with MAC: 9C:69:B4:68:64:42
Port 3 [host_ip 0a150401, bcast_ip 0a1504ff] initialized with MAC: 9C:69:B4:68:64:43
[Worker port 0] [init] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 0] [link-wait] link became up after 2ms
[Worker port 0] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 1] [init] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 2] [init] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 1] [link-wait] link became up after 1ms
[Worker port 2] [link-wait] link became up after 1ms
[Worker port 1] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 2] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 3] [init] link: up=0 speed=0Mbps duplex=half autoneg=0
[Worker port 3] [link-wait] link became up after 668ms
[Worker port 3] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
Initializing DeepseekV32OfficialFp8EagerForCausalLM with config: DeepseekV32Config(
  attention_bias: False
  attention_dropout: 0.0
  bos_token_id: 0
  eos_token_id: 1
  ep_size: 1
  first_k_dense_replace: 3
  hidden_act: silu
  hidden_size: 7168
  index_head_dim: 128
  index_n_heads: 64
  index_topk: 2048
  initializer_range: 0.02
  intermediate_size: 18432
  kv_lora_rank: 512
  max_position_embeddings: 163840
  model_path: /home/pigeon/models/DeepSeek-V3.2
  model_type: deepseek_v32
  moe_intermediate_size: 2048
  n_group: 8
  n_routed_experts: 256
  n_shared_experts: 1
  norm_topk_prob: True
  num_attention_heads: 128
  num_experts_per_tok: 8
  num_hidden_layers: 61
  num_key_value_heads: 128
  num_nextn_predict_layers: 1
  pad_token_id: None
  q_lora_rank: 1536
  qk_nope_head_dim: 128
  qk_rope_head_dim: 64
  quantization_config: {'activation_scheme': 'dynamic', 'fmt': 'e4m3', 'quant_method': 'fp8', 'scale_fmt': 'ue8m0', 'weight_block_size': [128, 128]}
  rms_norm_eps: 1e-06
  rope_scaling: {'beta_fast': 32, 'beta_slow': 1, 'factor': 40, 'mscale': 1.0, 'mscale_all_dim': 1.0, 'original_max_position_embeddings': 4096, 'type': 'yarn'}
  rope_theta: 10000
  routed_scaling_factor: 2.5
  scoring_func: sigmoid
  tie_word_embeddings: False
  topk_group: 4
  torch_dtype: torch.bfloat16
  use_cache: True
  v_head_dim: 128
  vocab_size: 129280
  dequantize_fp8: False
  use_duck: True
  mock_duck: False
  async_duck_init: True
  duck_tp_size: 32
  duck_weight_cache: True
  duck_weight_cache_force_invalidate: False
  fp9_gemv_multiple_of: 64
  flattened_decode: True
  v32_attention_impl: official_fp8
  v32_full_attn_max_seq_len: 2048
  v32_static_graph_max_model_len: 2048
  v32_fp8_kv_cache_block_size: 128
  v32_fp8_kv_cache_round_scale_pow2: True
  v32_triton_fp8_kv_cache_append: False
  v32_triton_fp8_decode_context: False
  v32_mtp1_companion: False
  v32_mtp1_online: True
  v32_mtp1_track_accuracy: True
  v32_mtp1_print_each_step: True
  v32_first_k_dense_cuda: True
  device: cuda
  qk_head_dim: 192
)
duck-llm global options:
  DUCK_TRITON_FP8_LINEAR=1 (default=0): Enable Triton-backed CUDA quantized Linear for eligible bf16@fp8 paths.
  DUCK_CUDA_MEMORY_TRACE=0 (default=0): Enable CUDA memory trace JSONL logging.
  DUCK_CUDA_MEMORY_SNAPSHOT_ON_OOM=0 (default=0): Dump CUDA allocator snapshot on OOM.
cuda_mem_stats label=DeepseekV32OfficialFp8EagerForCausalLM.load_weights.before device=cuda allocated_mib=329.0 reserved_mib=382.0 active_mib=329.0 inactive_split_mib=13.0 free_mib=29385.6 total_mib=32111.5 max_allocated_mib=366.2 max_reserved_mib=382.0
rust_mlp.init_cached layer=0 cache_slot=0: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.038ms, total=0.043ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=0: weights_load=0.000ms, rust_mlp_init=0.056ms, cache_hit=True, prefill_max_batch=3288, prefill_act_quant=False, total=0.056ms
rust_mlp.init_cached layer=1 cache_slot=1: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.046ms, total=0.051ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=1: weights_load=0.000ms, rust_mlp_init=0.066ms, cache_hit=True, prefill_max_batch=3288, prefill_act_quant=False, total=0.066ms
rust_mlp.init_cached layer=2 cache_slot=2: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.042ms, total=0.045ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=2: weights_load=0.000ms, rust_mlp_init=0.058ms, cache_hit=True, prefill_max_batch=3288, prefill_act_quant=False, total=0.058ms
rust_moe.moe_init_cached layer=3 cache_slot=3: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=3: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=4 cache_slot=4: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=4: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=5 cache_slot=5: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=5: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=6 cache_slot=6: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=6: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=7 cache_slot=7: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=7: weights_load=0.000ms, rust_moe_init=0.047ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.048ms
rust_moe.moe_init_cached layer=8 cache_slot=8: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=8: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=9 cache_slot=9: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=9: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=10 cache_slot=10: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=10: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=11 cache_slot=11: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.037ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=11: weights_load=0.000ms, rust_moe_init=0.049ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.049ms
rust_moe.moe_init_cached layer=12 cache_slot=12: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=12: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=13 cache_slot=13: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=13: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=14 cache_slot=14: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=14: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=15 cache_slot=15: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.036ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=15: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=16 cache_slot=16: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=16: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=17 cache_slot=17: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=17: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=18 cache_slot=18: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=18: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=19 cache_slot=19: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=19: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=20 cache_slot=20: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=20: weights_load=0.000ms, rust_moe_init=0.050ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.050ms
rust_moe.moe_init_cached layer=21 cache_slot=21: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=21: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=22 cache_slot=22: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=22: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=23 cache_slot=23: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=23: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=24 cache_slot=24: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=24: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=25 cache_slot=25: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=25: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=26 cache_slot=26: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=26: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=27 cache_slot=27: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=27: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=28 cache_slot=28: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=28: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=29 cache_slot=29: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.036ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=29: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.047ms
rust_moe.moe_init_cached layer=30 cache_slot=30: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=30: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=31 cache_slot=31: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.037ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=31: weights_load=0.001ms, rust_moe_init=0.050ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.050ms
rust_moe.moe_init_cached layer=32 cache_slot=32: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=32: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=33 cache_slot=33: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=33: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=34 cache_slot=34: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=34: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=35 cache_slot=35: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=35: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=36 cache_slot=36: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=36: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=37 cache_slot=37: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=37: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=38 cache_slot=38: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=38: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=39 cache_slot=39: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=39: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=40 cache_slot=40: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=40: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=41 cache_slot=41: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=41: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=42 cache_slot=42: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.036ms, total=0.039ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=42: weights_load=0.000ms, rust_moe_init=0.052ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.052ms
rust_moe.moe_init_cached layer=43 cache_slot=43: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.036ms, total=0.038ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=43: weights_load=0.000ms, rust_moe_init=0.049ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.049ms
rust_moe.moe_init_cached layer=44 cache_slot=44: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=44: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.045ms
rust_moe.moe_init_cached layer=45 cache_slot=45: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=45: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=46 cache_slot=46: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=46: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=47 cache_slot=47: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=47: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=48 cache_slot=48: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=48: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.043ms
rust_moe.moe_init_cached layer=49 cache_slot=49: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=49: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.040ms
rust_moe.moe_init_cached layer=50 cache_slot=50: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=50: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.040ms
rust_moe.moe_init_cached layer=51 cache_slot=51: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=51: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=52 cache_slot=52: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=52: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=53 cache_slot=53: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=53: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=54 cache_slot=54: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=54: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=55 cache_slot=55: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=55: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.042ms
rust_moe.moe_init_cached layer=56 cache_slot=56: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=56: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.044ms
rust_moe.moe_init_cached layer=57 cache_slot=57: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=57: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.041ms
rust_moe.moe_init_cached layer=58 cache_slot=58: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.026ms, total=0.028ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=58: weights_load=0.000ms, rust_moe_init=0.038ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.038ms
rust_moe.moe_init_cached layer=59 cache_slot=59: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.037ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=59: weights_load=0.000ms, rust_moe_init=0.048ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.048ms
rust_moe.moe_init_cached layer=60 cache_slot=60: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=60: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.046ms
rust_moe.moe_init_cached layer=61 cache_slot=61: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.037ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=61: weights_load=0.000ms, rust_moe_init=0.051ms, cache_hit=True, prefill_max_batch=328, prefill_act_quant=False, total=0.052ms

=== Linear Load Stats: DeepseekV32OfficialFp8EagerForCausalLM ===
totals: linear_count=498 source_total_mib=16642.2 (weight=16638.6, scale=3.6) resident_total_mib=17634.6 (weight=17630.6, scale=3.3, aux=0.6)

device   count  src_total  res_total    res_w  res_scale  res_aux
-----------------------------------------------------------------
cuda:0     498    16642.2    17634.6  17630.6        3.3      0.6

by device + category:
device  category                                  count  src_total  res_total    res_w  res_scale  res_aux
----------------------------------------------------------------------------------------------------------
cuda:0  fp8_scaled_kept_quantized_triton            434    13686.5    13687.1  13683.1        3.3      0.6
cuda:0  fp8_scaled_dequantized_no_triton_to_bf16     62      992.2     1984.0   1984.0        0.0      0.0
cuda:0  source_bf16_loaded_as_bf16                    2     1963.5     1963.5   1963.5        0.0      0.0

samples:
cuda:0 / fp8_scaled_kept_quantized_triton:
- model.layers.0.mlp.down_proj shape=7168x18432 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
- model.layers.0.mlp.gate_proj shape=18432x7168 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
- model.layers.0.mlp.up_proj shape=18432x7168 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
cuda:0 / fp8_scaled_dequantized_no_triton_to_bf16:
- model.layers.0.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
- model.layers.1.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
- model.layers.2.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
cuda:0 / source_bf16_loaded_as_bf16:
- lm_head shape=129280x7168 src=bf16 loaded=bf16 reason=official_lm_head_bf16_storage_fp32_out device=cuda:0 resident_mib=1767.5
- model.layers.61.eh_proj shape=7168x14336 src=bf16 loaded=bf16 reason=official_dense_bf16 device=cuda:0 resident_mib=196.0

cuda_mem_stats label=DeepseekV32OfficialFp8EagerForCausalLM.load_weights.after device=cuda allocated_mib=20159.0 reserved_mib=20504.0 active_mib=20159.0 inactive_split_mib=345.0 free_mib=9263.6 total_mib=32111.5 max_allocated_mib=20173.0 max_reserved_mib=20504.0
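The per-module sizes in the load-stats table above are internally consistent with the dtypes, which is worth a quick check. A minimal sketch, assuming MiB = 2^20 bytes, 1 byte per fp8_e4m3fn parameter, and 2 bytes per bf16 parameter (the small gap between the computed 992.0 MiB and the logged 992.2 MiB source total is plausibly the tiny per-block fp8 scale tensors, e.g. a 128x128 block grid):

```python
MIB = 2 ** 20  # mebibyte

# down_proj 7168x18432 kept quantized as fp8 (1 byte/param) -> 126.0 MiB resident
down_proj_mib = 7168 * 18432 / MIB

# kv_b_proj 32768x512 dequantized to bf16 (2 bytes/param) -> 32.0 MiB resident
kv_b_resident_mib = 32768 * 512 * 2 / MIB

# 62 kv_b_proj-like modules: fp8 source vs bf16 resident totals
kv_b_src_total_mib = 62 * 32768 * 512 / MIB       # 992.0, vs logged 992.2
kv_b_res_total_mib = 62 * 32768 * 512 * 2 / MIB   # 1984.0, matches the table

# lm_head 129280x7168 stored in bf16 -> 1767.5 MiB resident
lm_head_mib = 129280 * 7168 * 2 / MIB

print(down_proj_mib, kv_b_resident_mib, kv_b_res_total_mib, lm_head_mib)
# 126.0 32.0 1984.0 1767.5
```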
DeepseekV32OfficialFp8EagerForCausalLM static graph prebuild: max_model_len=2048
DeepseekV32OfficialFp8Eager prepare_generation_cache: required_total_length=2048 per_layer_cache_mib=1.3 total_cache_mib=78.2
cuda_mem_stats label=DeepseekV32OfficialFp8Eager.prepare_generation_cache device=cuda allocated_mib=20237.2 reserved_mib=20582.0 active_mib=20237.2 inactive_split_mib=344.8 free_mib=9185.6 total_mib=32111.5 max_allocated_mib=20237.2 max_reserved_mib=20582.0

=== Duck Load Stats: DeepseekV32OfficialFp8EagerForCausalLM ===
totals: module_count=62 source_total_mib=635657.2 theory_per_rank_mib=19864.3 rust_buf_per_rank_mib=22351.6 rust_buf_all_ranks_mib=715250.6 weights_buffer_used_mib=22351.6

group               count  src_total  theory/rk  rust/rk  rust/all
------------------------------------------------------------------
duck_moe_cache_hit     59   634522.9    19828.8  22311.7  713973.8
duck_mlp_cache_hit      3     1134.3       35.4     39.9    1276.9

samples:
duck_moe_cache_hit:
- model.layers.3.mlp.experts layer=3 cache_slot=3 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
- model.layers.4.mlp.experts layer=4 cache_slot=4 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
- model.layers.5.mlp.experts layer=5 cache_slot=5 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
duck_mlp_cache_hit:
- model.layers.0.mlp layer=0 cache_slot=0 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3
- model.layers.1.mlp layer=1 cache_slot=1 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3
- model.layers.2.mlp layer=2 cache_slot=2 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3

DeepseekV32OfficialFp8EagerForCausalLM wait_pending_init=0.000ms
DeepseekV32OfficialFp8EagerForCausalLM initialized

Please enter your question (type "exit" to quit): Help me rewrite this article so it reads more naturally, but without dropping any technical detail; it's a technical memoir.

One Duck, Three Ways: How I Turned My Judge Ducks into a Full-Strength DeepSeek-V3.2 Inference System
This system wasn't built for fundraising or productization. It's a heterogeneous inference project that grew out of judging ducks, remote executors, and personal curiosity — and in the end it really did run full-strength DeepSeek-V3.2 at a usable speed.

The final results first

1 × RTX 5090 + 32 × Intel N100, running DeepSeek-V3.2 (685B parameters, including the MTP head), in a real 1024-token long run (prompt: "104857601是质数吗", i.e. "is 104857601 prime?", with max_model_len=2048):

Metric              Value
decode throughput   15.301 tok/s
TTFT                0.857 s
TPOT                65 ms
MTP1 accept=1       891 / 1023 ≈ 87.1%

Terminal screenshot of the final result
Figure: 2026-03-30, the 1024-token long run. decode 15.301 tok/s; accept=1 occurred 891/1023 times.

This is not a theoretical peak, not a mock path, and no layer was skipped. The complete stdout and stderr evidence is preserved in the report appendix, and the statistical means over the 549 overlap samples can be checked in that same log.

But this whole thing didn't actually start with large models. It started with the judging ducks and the remote executors.

It didn't start with large models; it started with the judging ducks

In December 2024 I bought a batch of Intel N100 barebones — an order of 100 barebones plus 100 sticks of 32 GB DDR4-3200. The motivation at the time was to add judge machines for 评测鸭 / JudgeDuck (an online judge site; on GitHub). The judge machines run a homegrown operating system, JudgeDuck-OS — nothing to do with large models at all.

The N100 chip doesn't enjoy a good reputation in public opinion. It's a 6 W TDP, 4 E-core (Alder Lake-N, Gracemont family), single-memory-channel, low-power little-core x86; its spec sheet on Intel ARK reads like "it can't do anything serious." In 2025 people in public communities even questioned whether the N100 was suitable as a judging node, focusing on thermals, throttling, cache, and whether a 32 GB configuration made any sense — public intuition tends to see it as "extremely low-spec and unreliable."

My own observation: after BIOS tuning (adjusting the Turbo Ratio Limit, opening up PL1/PL2), a single N100 core holds the officially rated 3.4 GHz without trouble, and under sustained all-core load it stays steady around 2.9 GHz. As for giving each duck 32 GB of DDR4-3200 back then, it was purely "RAM is cheap anyway, more can't hurt" — not part of any LLM roadmap.

Taking this batch of ducks from procurement to assembly was manual labor:

Photo of duck assembly
Figure: early January 2025, ducks being assembled.

They were deployed as JudgeDuck-OS remote execution nodes — PXE-boot over the network, load the user's ELF, time it, return the result. No duck runs a general-purpose Linux: no glibc, no conventional process management, just a minimal x86-64 kernel plus sandboxed ELF execution. For judging, that flow is enough.

Up to this point the host machine and the duck fleet had no connection whatsoever. The host is my main workstation, self-funded in 2023 — i9-13900K + 128 GB DDR5 — used for everyday research and development. When I bought it, the duck fleet didn't exist yet.

memtest86 screenshot
Figure: memtest86 run during host assembly; 4 × 32 GB DDR5 passed stably.

There was no preplanned link between the two batches of hardware — until, later on, someone thought them through together.

What really planted the seed was the memory; after R1, the question was rewritten for the first time

After DeepSeek-R1 was released, for the first time I seriously posed the question as "can full-strength 685B be run locally?"

The starting point wasn't framework selection or paper reading, but rough arithmetic over capacity, bandwidth, and the roofline.

Capacity first. The official release format of the DeepSeek-V3 series is already fp8, and the full weights are about 688 GB on disk. The 32 ducks together have 32 × 32 GB ≈ 1 TB of total memory — it looks like it fits, but memory must be set aside for the OS, the ELF, and input/output buffers, so the actually usable amount isn't that generous.

Bandwidth next. Each N100's single-channel DDR4-3200 has a theoretical bandwidth of about 25.6 GB/s, roughly 819 GB/s across 32 machines. A single RTX 5090, meanwhile, has about 1792 GB/s of VRAM bandwidth — more than the duck side's total.

Put those two sets of numbers together and a natural division of labor emerges:

Attention-related parameters (KV cache, the MLA low-rank matrices, the attention projections) fit in the 5090's 32 GB of VRAM — concentrate them on the GPU.
The FFN/MoE expert weights far exceed any single machine's VRAM — spread them across the 32 ducks' combined memory.

This division of labor is known as AFD (Attention-FeedForward
Disaggregation). I arrived at this split independently from my own capacity and bandwidth constraints; when I later went back to check, I found that others had already discussed similar ideas, summarized under the name AFD — a Zhihu column, Emergent Mind, even an AFD feature proposal for TensorRT-LLM. kTransformers (SOSP 2025), from Tsinghua's 趋境 team, is also AFD plus CPU/GPU hybrid compute — it keeps Attention and the Shared Expert on the GPU and offloads the Routed Experts to the CPU of the same server, exploiting the server CPU's large memory bandwidth for the MoE computation. My route instead splits the FFN/MoE across 32 independent small machines and communicates over Ethernet. With an extra network layer in the middle, the communication pressure is a completely different story.

The 5090 wasn't chosen off a leaderboard either. There were only two core judgments: its 32 GB of VRAM can hold the Attention-side working set, and its 1792 GB/s of memory bandwidth is clearly higher than the duck side's theoretical total.

<|begin▁of▁sentence|><|User|><|Assistant|>

{'input_ids': tensor([[     0, 128803,   3950,  ...,    876, 128804, 128798]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}
tensor([[     0, 128803,   3950,  ...,    876, 128804, 128798]])
DeepseekV32OfficialFp8Eager prepare_generation_cache: required_total_length=2048 per_layer_cache_mib=1.3 total_cache_mib=78.2
cuda_mem_stats label=DeepseekV32OfficialFp8Eager.prepare_generation_cache device=cuda allocated_mib=20501.5 reserved_mib=21354.0 active_mib=20501.5 inactive_split_mib=850.5 free_mib=8303.6 total_mib=32111.5 max_allocated_mib=20501.7 max_reserved_mib=21354.0

Hmm, the user has given me a technical memoir and wants it rewritten in more idiomatic Chinese, without losing any technical detail. Let me see what this is about…

Oh, it's a technical article about turning a judging-duck system into a DeepSeek-V3.2 inference system. The author used one RTX 5090 and 32 Intel N100 mini machines and successfully ran a 685B-parameter model. Quite interesting.

The user wants more idiomatic Chinese, so I need to avoid stiff, translated-sounding phrasing and match the conventions of Chinese technical writing. But not a single technical detail can be dropped — all the parameters, configurations, and performance numbers must stay.

Looking at the structure… it leads with the final performance numbers, then tells the story from the judging-duck beginnings, then covers the technical choices and implementation. That narrative order is good; I can keep it.

Some expressions could be smoother. For instance, phrasing like "this system wasn't built for fundraising or productization" could flow better. Small headings like "the final results first" could have a bit more style too.

The technical details need special care: the numbers, terms, and performance metrics must not be wrong. Decode throughput 15.301 tok/s, TTFT 0.857 s, TPOT 65 ms… all of these must be kept verbatim.

Starting the story from the judging ducks is a great angle — it gives the piece narrative. The captions for the assembly photos could be livelier too.
The explanation of the AFD architecture matters a lot; it needs to state clearly why the work is split this way and what the GPU and the CPUs are each responsible for. The bandwidth math and the capacity-planning logic all have to come through.

I want the whole article to read like a fluent technical narrative rather than a cold technical report. I can add connectives to tighten the logic, but without changing the meaning.

OK, I roughly know how to revise it: keep every technical detail intact, but make the language more idiomatic and fluent, with the feel of a Chinese technical essay. I'll be careful not to over-embellish and to stay faithful to the original technical content.

**One Duck, Three Uses: How I Converted the "Judging Ducks" into a Full-Featured DeepSeek-V3.2 Inference System**

This system was not built for fundraising or productization. It grew out of the "judging ducks", remote executors, and pure personal technical curiosity, and matured into a heterogeneous inference project that successfully runs the complete DeepSeek-V3.2 model at a usable speed.

**The final results first**

Using **1 × RTX 5090 + 32 × Intel N100 mini hosts**, running **DeepSeek-V3.2 (685B parameters, including the MTP prediction head)**. In a real 1024-token generation task (prompt: "104857601是质数吗", i.e. "is 104857601 prime?", `max_model_len=2048`), performance was:

| Metric | Value |
| :--- | :--- |
| **Decode throughput** | 15.301 tokens/s |
| **Time to first token** | 0.857 s |
| **Time per output token** | 65 ms |
| **MTP accept=1 ratio** | 891 / 1023 ≈ 87.1% |

*(Terminal screenshot of the run, dated 2026-03-30: a 1024-token long task, decode speed 15.301 tok/s, accept=1 occurring 891 times.)*

This is not a theoretical peak; no mock path was used and no compute layer was skipped. Complete stdout and stderr log evidence is preserved in the project appendix, including 549
Generation stats: time 118.325s, input_len 1336, output_len 712, ttft 67.184s, tpot 0.072s, prefill 19.886 tok/s, decode 13.903 tok/s
Please enter your question (type exit to quit):
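The pasted memoir's sizing argument rests on a few lines of arithmetic: total duck memory versus the fp8 weight size, aggregate duck DRAM bandwidth versus the 5090's VRAM bandwidth, and the MTP1 accept ratio. A minimal sketch that reproduces those numbers (all constants are taken from the transcript above, nothing is measured; the tokens-per-step formula is the standard speculative-decoding expectation, an assumption, not something the log reports):

```python
# Back-of-envelope check of the article's AFD sizing numbers.
# All constants come from the transcript text; nothing here queries hardware.

DUCKS = 32                      # Intel N100 nodes in the fleet
RAM_PER_DUCK_GB = 32            # one 32 GB DDR4-3200 stick per duck
WEIGHTS_ON_DISK_GB = 688        # DeepSeek-V3.2 official fp8 release size

# DDR4-3200, single 64-bit channel: 3200 MT/s * 8 bytes = 25.6 GB/s per duck
duck_bw = 3200e6 * 8 / 1e9      # GB/s per duck
fleet_bw = DUCKS * duck_bw      # ~819 GB/s aggregate across the fleet
gpu_bw = 1792                   # RTX 5090 VRAM bandwidth, GB/s (article's figure)

fleet_ram = DUCKS * RAM_PER_DUCK_GB          # ~1 TB total
print(f"fleet RAM {fleet_ram} GB vs weights {WEIGHTS_ON_DISK_GB} GB")
print(f"fleet BW {fleet_bw:.1f} GB/s vs GPU BW {gpu_bw} GB/s")

# MTP1: accept=1 occurred 891 times over 1023 decode steps in the long run.
# Assuming one draft token per step, expected emitted tokens per step is 1 + p.
accept_rate = 891 / 1023
tokens_per_step = 1 + accept_rate
print(f"MTP1 accept rate {accept_rate:.1%}, ~{tokens_per_step:.2f} tokens/step")
```

The two comparisons are exactly the "two core judgments" the author cites for the 5090: the Attention-side working set fits in 32 GB of VRAM, and 1792 GB/s comfortably exceeds the ducks' ~819 GB/s theoretical aggregate.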