RUST_MODULE_PATH: /home/pigeon/dev/duck-llm/target/release/libduck_llm.so
DPDK initialized successfully. Found 4 ports.
Port 0 driver: net_ice
Port 0 device: 0000:01:00.0
Port 0 bus info: vendor_id=8086, device_id=1593
Port 1 driver: net_ice
Port 1 device: 0000:01:00.1
Port 1 bus info: vendor_id=8086, device_id=1593
Port 2 driver: net_ice
Port 2 device: 0000:01:00.2
Port 2 bus info: vendor_id=8086, device_id=1593
Port 3 driver: net_ice
Port 3 device: 0000:01:00.3
Port 3 bus info: vendor_id=8086, device_id=1593
Thread bound to lcore 2
Thread bound to lcore 4
Thread bound to lcore 8
Thread bound to lcore 6
Port 0 [host_ip 0a150101, bcast_ip 0a1501ff] initialized with MAC: 9C:69:B4:68:64:40
Port 1 [host_ip 0a150201, bcast_ip 0a1502ff] initialized with MAC: 9C:69:B4:68:64:41
Port 2 [host_ip 0a150301, bcast_ip 0a1503ff] initialized with MAC: 9C:69:B4:68:64:42
Port 3 [host_ip 0a150401, bcast_ip 0a1504ff] initialized with MAC: 9C:69:B4:68:64:43
[Worker port 2] [init] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 2] [link-wait] link became up after 1ms
[Worker port 2] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 0] [init] link: up=0 speed=0Mbps duplex=half autoneg=0
[Worker port 1] [init] link: up=0 speed=0Mbps duplex=half autoneg=0
[Worker port 3] [init] link: up=0 speed=0Mbps duplex=half autoneg=0
[Worker port 3] [link-wait] link became up after 259ms
[Worker port 3] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 0] [link-wait] link became up after 310ms
[Worker port 1] [link-wait] link became up after 310ms
[Worker port 0] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
[Worker port 1] [post-wait] link: up=1 speed=25000Mbps duplex=full autoneg=1
Initializing DeepseekV32OfficialFp8EagerForCausalLM with config: DeepseekV32Config(
  attention_bias: False
  attention_dropout: 0.0
  bos_token_id: 0
  eos_token_id: 1
  ep_size: 1
  first_k_dense_replace: 3
  hidden_act: silu
  hidden_size: 7168
  index_head_dim: 128
  index_n_heads: 64
  index_topk: 2048
  initializer_range: 0.02
  intermediate_size: 18432
  kv_lora_rank: 512
  max_position_embeddings: 163840
  model_path: /home/pigeon/models/DeepSeek-V3.2
  model_type: deepseek_v32
  moe_intermediate_size: 2048
  n_group: 8
  n_routed_experts: 256
  n_shared_experts: 1
  norm_topk_prob: True
  num_attention_heads: 128
  num_experts_per_tok: 8
  num_hidden_layers: 61
  num_key_value_heads: 128
  num_nextn_predict_layers: 1
  pad_token_id: None
  q_lora_rank: 1536
  qk_nope_head_dim: 128
  qk_rope_head_dim: 64
  quantization_config: {'activation_scheme': 'dynamic', 'fmt': 'e4m3', 'quant_method': 'fp8', 'scale_fmt': 'ue8m0', 'weight_block_size': [128, 128]}
  rms_norm_eps: 1e-06
  rope_scaling: {'beta_fast': 32, 'beta_slow': 1, 'factor': 40, 'mscale': 1.0, 'mscale_all_dim': 1.0, 'original_max_position_embeddings': 4096, 'type': 'yarn'}
  rope_theta: 10000
  routed_scaling_factor: 2.5
  scoring_func: sigmoid
  tie_word_embeddings: False
  topk_group: 4
  torch_dtype: torch.bfloat16
  use_cache: True
  v_head_dim: 128
  vocab_size: 129280
  dequantize_fp8: False
  use_duck: True
  mock_duck: False
  async_duck_init: True
  duck_tp_size: 32
  duck_weight_cache: True
  duck_weight_cache_force_invalidate: False
  fp9_gemv_multiple_of: 64
  flattened_decode: True
  v32_attention_impl: official_fp8
  v32_full_attn_max_seq_len: 2048
  v32_static_graph_max_model_len: 2048
  v32_fp8_kv_cache_block_size: 128
  v32_fp8_kv_cache_round_scale_pow2: True
  v32_triton_fp8_kv_cache_append: False
  v32_triton_fp8_decode_context: False
  v32_mtp1_companion: False
  v32_mtp1_online: True
  v32_mtp1_track_accuracy: True
  v32_mtp1_print_each_step: True
  v32_first_k_dense_cuda: True
  device: cuda
  qk_head_dim: 192
)
duck-llm global options:
  DUCK_TRITON_FP8_LINEAR=1 (default=0): Enable Triton-backed CUDA quantized Linear for eligible bf16@fp8 paths.
  DUCK_CUDA_MEMORY_TRACE=0 (default=0): Enable CUDA memory trace JSONL logging.
  DUCK_CUDA_MEMORY_SNAPSHOT_ON_OOM=0 (default=0): Dump CUDA allocator snapshot on OOM.
cuda_mem_stats label=DeepseekV32OfficialFp8EagerForCausalLM.load_weights.before device=cuda allocated_mib=329.0 reserved_mib=382.0 active_mib=329.0 inactive_split_mib=13.0 free_mib=29385.6 total_mib=32111.5 max_allocated_mib=366.2 max_reserved_mib=382.0
rust_mlp.init_cached layer=0 cache_slot=0: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.043ms, total=0.049ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=0: weights_load=0.000ms, rust_mlp_init=0.063ms, cache_hit=True, total=0.063ms
rust_mlp.init_cached layer=1 cache_slot=1: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.039ms, total=0.043ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=1: weights_load=0.001ms, rust_mlp_init=0.059ms, cache_hit=True, total=0.059ms
rust_mlp.init_cached layer=2 cache_slot=2: py_prepack=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.045ms, total=0.048ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=576
Qwen3MLPDuck load_weights layer=2: weights_load=0.000ms, rust_mlp_init=0.062ms, cache_hit=True, total=0.062ms
rust_moe.moe_init_cached layer=3 cache_slot=3: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=3: weights_load=0.000ms, rust_moe_init=0.047ms, cache_hit=True, total=0.047ms
rust_moe.moe_init_cached layer=4 cache_slot=4: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=4: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, total=0.045ms
rust_moe.moe_init_cached layer=5 cache_slot=5: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=5: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, total=0.041ms
rust_moe.moe_init_cached layer=6 cache_slot=6: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.036ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=6: weights_load=0.000ms, rust_moe_init=0.047ms, cache_hit=True, total=0.047ms
rust_moe.moe_init_cached layer=7 cache_slot=7: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=7: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=8 cache_slot=8: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=8: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=9 cache_slot=9: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=9: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, total=0.043ms
rust_moe.moe_init_cached layer=10 cache_slot=10: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=10: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=11 cache_slot=11: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.037ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=11: weights_load=0.000ms, rust_moe_init=0.048ms, cache_hit=True, total=0.048ms
rust_moe.moe_init_cached layer=12 cache_slot=12: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=12: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, total=0.043ms
rust_moe.moe_init_cached layer=13 cache_slot=13: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=13: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, total=0.043ms
rust_moe.moe_init_cached layer=14 cache_slot=14: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=14: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, total=0.043ms
rust_moe.moe_init_cached layer=15 cache_slot=15: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.037ms, total=0.040ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=15: weights_load=0.000ms, rust_moe_init=0.051ms, cache_hit=True, total=0.051ms
rust_moe.moe_init_cached layer=16 cache_slot=16: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=16: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, total=0.046ms
rust_moe.moe_init_cached layer=17 cache_slot=17: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=17: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, total=0.046ms
rust_moe.moe_init_cached layer=18 cache_slot=18: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=18: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=19 cache_slot=19: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=19: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, total=0.043ms
rust_moe.moe_init_cached layer=20 cache_slot=20: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=20: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, total=0.045ms
rust_moe.moe_init_cached layer=21 cache_slot=21: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=21: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=22 cache_slot=22: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=22: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=23 cache_slot=23: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=23: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=24 cache_slot=24: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=24: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, total=0.045ms
rust_moe.moe_init_cached layer=25 cache_slot=25: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=25: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, total=0.041ms
rust_moe.moe_init_cached layer=26 cache_slot=26: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=26: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, total=0.041ms
rust_moe.moe_init_cached layer=27 cache_slot=27: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=27: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=28 cache_slot=28: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=28: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=29 cache_slot=29: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=29: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, total=0.046ms
rust_moe.moe_init_cached layer=30 cache_slot=30: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.026ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=30: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=31 cache_slot=31: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=31: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, total=0.041ms
rust_moe.moe_init_cached layer=32 cache_slot=32: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=32: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=33 cache_slot=33: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=33: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, total=0.044ms
rust_moe.moe_init_cached layer=34 cache_slot=34: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=34: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=35 cache_slot=35: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=35: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=36 cache_slot=36: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=36: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=37 cache_slot=37: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=37: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=38 cache_slot=38: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.032ms, total=0.034ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=38: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, total=0.044ms
rust_moe.moe_init_cached layer=39 cache_slot=39: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=39: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, total=0.041ms
rust_moe.moe_init_cached layer=40 cache_slot=40: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=40: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=41 cache_slot=41: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.029ms, total=0.031ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=41: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=42 cache_slot=42: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.034ms, total=0.036ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=42: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, total=0.046ms
rust_moe.moe_init_cached layer=43 cache_slot=43: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.028ms, total=0.030ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=43: weights_load=0.000ms, rust_moe_init=0.041ms, cache_hit=True, total=0.041ms
rust_moe.moe_init_cached layer=44 cache_slot=44: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.033ms, total=0.035ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=44: weights_load=0.000ms, rust_moe_init=0.045ms, cache_hit=True, total=0.045ms
rust_moe.moe_init_cached layer=45 cache_slot=45: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=45: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=46 cache_slot=46: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.026ms, total=0.028ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=46: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=47 cache_slot=47: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=47: weights_load=0.000ms, rust_moe_init=0.044ms, cache_hit=True, total=0.044ms
rust_moe.moe_init_cached layer=48 cache_slot=48: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=48: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=49 cache_slot=49: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=49: weights_load=0.000ms, rust_moe_init=0.040ms, cache_hit=True, total=0.040ms
rust_moe.moe_init_cached layer=50 cache_slot=50: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.026ms, total=0.028ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=50: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=51 cache_slot=51: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=51: weights_load=0.000ms, rust_moe_init=0.042ms, cache_hit=True, total=0.042ms
rust_moe.moe_init_cached layer=52 cache_slot=52: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.026ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=52: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=53 cache_slot=53: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=53: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=54 cache_slot=54: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.027ms, total=0.029ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=54: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=55 cache_slot=55: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.026ms, total=0.028ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=55: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=56 cache_slot=56: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.038ms, total=0.040ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=56: weights_load=0.000ms, rust_moe_init=0.050ms, cache_hit=True, total=0.050ms
rust_moe.moe_init_cached layer=57 cache_slot=57: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=57: weights_load=0.000ms, rust_moe_init=0.046ms, cache_hit=True, total=0.046ms
rust_moe.moe_init_cached layer=58 cache_slot=58: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.026ms, total=0.028ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=58: weights_load=0.000ms, rust_moe_init=0.039ms, cache_hit=True, total=0.039ms
rust_moe.moe_init_cached layer=59 cache_slot=59: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.030ms, total=0.032ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=59: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, total=0.043ms
rust_moe.moe_init_cached layer=60 cache_slot=60: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.031ms, total=0.033ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=60: weights_load=0.000ms, rust_moe_init=0.043ms, cache_hit=True, total=0.044ms
rust_moe.moe_init_cached layer=61 cache_slot=61: py_prepack=0.000ms, py_concat=0.000ms, rust_wait_prev=0.000ms, rust_plan=0.038ms, total=0.042ms, fp9_gemv_multiple_of=64, hidden_size=7168, local_intermediate_size_padded=64, num_experts=256
DeepseekV32ExpertsDuck load_weights layer=61: weights_load=0.000ms, rust_moe_init=0.055ms, cache_hit=True, total=0.055ms
=== Linear Load Stats: DeepseekV32OfficialFp8EagerForCausalLM ===
totals: linear_count=498 source_total_mib=16642.2 (weight=16638.6, scale=3.6) resident_total_mib=17634.6 (weight=17630.6, scale=3.3, aux=0.6)
device         count  src_total  res_total      res_w  res_scale    res_aux
---------------------------------------------------------------------------------
cuda:0           498   16642.2   17634.6   17630.6       3.3       0.6
by device + category:
device       category                                       count  src_total  res_total      res_w  res_scale    res_aux
-------------------------------------------------------------------------------------------------------------------------
cuda:0       fp8_scaled_kept_quantized_triton                 434   13686.5   13687.1   13683.1       3.3       0.6
cuda:0       fp8_scaled_dequantized_no_triton_to_bf16          62     992.2    1984.0    1984.0       0.0       0.0
cuda:0       source_bf16_loaded_as_bf16                         2    1963.5    1963.5    1963.5       0.0       0.0
samples:
  cuda:0 / fp8_scaled_kept_quantized_triton:
    - model.layers.0.mlp.down_proj shape=7168x18432 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
    - model.layers.0.mlp.gate_proj shape=18432x7168 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
    - model.layers.0.mlp.up_proj shape=18432x7168 src=fp8_e4m3fn loaded=fp8_e4m3fn reason=official_fp8_kept_quantized device=cuda:0 resident_mib=126.0
  cuda:0 / fp8_scaled_dequantized_no_triton_to_bf16:
    - model.layers.0.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
    - model.layers.1.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
    - model.layers.2.self_attn.kv_b_proj shape=32768x512 src=fp8_e4m3fn loaded=bf16 reason=official_fp8_decode_predequantized_bf16 device=cuda:0 resident_mib=32.0
  cuda:0 / source_bf16_loaded_as_bf16:
    - lm_head shape=129280x7168 src=bf16 loaded=bf16 reason=official_lm_head_bf16_storage_fp32_out device=cuda:0 resident_mib=1767.5
    - model.layers.61.eh_proj shape=7168x14336 src=bf16 loaded=bf16 reason=official_dense_bf16 device=cuda:0 resident_mib=196.0
cuda_mem_stats label=DeepseekV32OfficialFp8EagerForCausalLM.load_weights.after device=cuda allocated_mib=20159.0 reserved_mib=20504.0 active_mib=20159.0 inactive_split_mib=345.0 free_mib=9263.6 total_mib=32111.5 max_allocated_mib=20173.0 max_reserved_mib=20504.0
DeepseekV32OfficialFp8EagerForCausalLM static graph prebuild: max_model_len=2048
DeepseekV32OfficialFp8Eager prepare_generation_cache: required_total_length=2048 per_layer_cache_mib=1.3 total_cache_mib=78.2
cuda_mem_stats label=DeepseekV32OfficialFp8Eager.prepare_generation_cache device=cuda allocated_mib=20237.2 reserved_mib=20582.0 active_mib=20237.2 inactive_split_mib=344.8 free_mib=9185.6 total_mib=32111.5 max_allocated_mib=20237.2 max_reserved_mib=20582.0
=== Duck Load Stats: DeepseekV32OfficialFp8EagerForCausalLM ===
totals: module_count=62 source_total_mib=635657.2 theory_per_rank_mib=19864.3 rust_buf_per_rank_mib=22351.6 rust_buf_all_ranks_mib=715250.6 weights_buffer_used_mib=22351.6
group                          count  src_total  theory/rk    rust/rk   rust/all
-----------------------------------------------------------------------------------
duck_moe_cache_hit                59  634522.9   19828.8   22311.7  713973.8
duck_mlp_cache_hit                 3    1134.3      35.4      39.9    1276.9
samples:
  duck_moe_cache_hit:
    - model.layers.3.mlp.experts layer=3 cache_slot=3 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
    - model.layers.4.mlp.experts layer=4 cache_slot=4 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
    - model.layers.5.mlp.experts layer=5 cache_slot=5 tp=32 source_total_mib=10754.6 source_per_rank_mib=336.1 rust_buf_per_rank_mib=378.2
  duck_mlp_cache_hit:
    - model.layers.0.mlp layer=0 cache_slot=0 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3
    - model.layers.1.mlp layer=1 cache_slot=1 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3
    - model.layers.2.mlp layer=2 cache_slot=2 tp=32 source_total_mib=378.1 source_per_rank_mib=11.8 rust_buf_per_rank_mib=13.3
DeepseekV32OfficialFp8EagerForCausalLM wait_pending_init=0.000ms
DeepseekV32OfficialFp8EagerForCausalLM initialized

请输入您的问题（输入exit退出）：<｜begin▁of▁sentence｜><｜User｜>我要部署一个 bsz=1 的 LLM 推理服务，请详细讲解 dp, tp, pp, ep 的区别，以及是否适用<｜Assistant｜><think>
{'input_ids': tensor([[     0, 128803,  15373,  15133,   1057,    291,  11528,     31,     19,
            223,    301,  33792,     47,    223,  36101,   2833,    303,   2788,
          12201,  21373,  18670,     14, 103982,     14,   5727,     14,   5089,
            223,  54229,    303,   3515,   5402,  18580, 128804, 128798]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
tensor([[     0, 128803,  15373,  15133,   1057,    291,  11528,     31,     19,
            223,    301,  33792,     47,    223,  36101,   2833,    303,   2788,
          12201,  21373,  18670,     14, 103982,     14,   5727,     14,   5089,
            223,  54229,    303,   3515,   5402,  18580, 128804, 128798]])
DeepseekV32OfficialFp8Eager prepare_generation_cache: required_total_length=2048 per_layer_cache_mib=1.3 total_cache_mib=78.2
cuda_mem_stats label=DeepseekV32OfficialFp8Eager.prepare_generation_cache device=cuda allocated_mib=20501.5 reserved_mib=21354.0 active_mib=20501.5 inactive_split_mib=850.5 free_mib=8303.6 total_mib=32111.5 max_allocated_mib=20501.7 max_reserved_mib=21354.0
我们讨论的是大规模语言模型（LLM）推理服务部署时的并行策略。在分布式训练和推理中，常见的并行方式有数据并行（Data Parallelism, DP）、张量并行（Tensor Parallelism, TP）、流水线并行（Pipeline Parallelism, PP）和专家并行（Expert Parallelism, EP）。下面我将详细解释这些概念，并讨论在批大小（batch size, bsz）为1的推理场景下它们的适用性。

## 1. 数据并行 (DP)
- **原理**：将训练数据分割成多个子集，每个设备上有一个完整的模型副本，各自处理一个子集的数据，然后同步梯度。
- **推理场景**：在推理时，如果批大小大于1，可以将不同的数据样本分发到不同设备上，每个设备独立进行前向传播。但是，当批大小为1时，数据并行无法提供任何加速，因为只有一个样本需要处理，无法分割到多个设备上。因此，在bsz=1的推理中，数据并行不适用。

## 2. 张量并行 (TP)
- **原理**：将模型的层内的权重矩阵进行分割，分布到多个设备上。每个设备只拥有权重的一部分，通过通信来组合结果。例如，一个线性层可以按列分割，每个设备计算一部分输出，然后通过AllReduce或AllGather等操作合并。
- **推理场景**：在bsz=1的推理中，张量并行是适用的。因为它是将模型本身进行分割，而不是依赖于批量大小。通过将模型参数分布到多个设备上，可以减少每个设备的内存占用，并利用多个设备同时计算一个样本的前向传播。但是，张量并行会引入设备间的通信开销，因此需要权衡模型大小和通信延迟。

## 3. 流水线并行 (PP)
- **原理**：将模型按层分成多个阶段，每个阶段放在不同的设备上。在训练时，通过微批次（micro-batch）来保持流水线的饱和。在推理时，对于单个样本，流水线并行会将这个样本依次经过各个阶段，不同阶段在不同设备上同时处理不同的样本（如果批量大于1）。但是，当bsz=1时，流水线并行的效率会很低，因为只有一个样本在流水线上，无法充分利用多个设备（只有一个设备在计算，其他设备在等待）。
- **推理场景**：在bsz=1的推理中，流水线并行不适用，因为它无法通过多个样本来填充流水线，导致设备利用率低，延迟可能反而增加（由于通信开销）。

## 4. 专家并行 (EP)
- **原理**：主要用于混合专家模型（Mixture of Experts, MoE）。在MoE模型中，每一层包含多个专家（通常是前馈网络），每个样本根据路由权重选择一部分专家进行计算。专家并行将不同的专家分布到不同的设备上，每个设备只负责一部分专家。
- **推理场景**：在bsz=1的推理中，如果模型是MoE架构，专家并行是适用的。因为单个样本可能只需要激活部分专家，可以将这些专家计算分布到多个设备上。但是，如果模型不是MoE，则专家并行不适用。

## 总结与建议
对于bsz=1的LLM推理服务，我们需要考虑如何利用多个设备来加速单个样本的推理，或者减少单个设备的内存压力。

- **数据并行**：不适用，因为无法分割单个样本。
- **张量并行**：适用，可以将模型参数分布到多个设备上，加速单个样本的推理，但通信开销需考虑。
- **流水线并行**：不适用，因为单个样本无法填充流水线，导致设备闲置。
- **专家并行**：仅当模型是MoE架构时适用。

在实际部署中，如果模型太大无法放入单个设备的内存，通常会结合使用张量并行和流水线并行。但在bsz=1的推理场景下，流水线并行的效率问题需要重新考虑。一种常见的做法是使用张量并行来分割模型，而避免使用流水线并行。如果模型是MoE，则可以考虑专家并行。

此外，还可以考虑模型并行（Model Parallelism）的其他形式，例如将不同的层分配到不同的设备上（类似于流水线并行，但针对单个样本，我们可以使用更细粒度的划分来减少闲置时间，但这需要复杂的调度）。实际上，在bsz=1时，更常用的策略是使用张量并行，或者将模型放在单个设备上（如果可能），以避免通信开销。

最后，请注意，在推理时，我们还可以使用诸如量化、内核优化、动态批处理（如果允许累积请求）等技术来提高效率。但是，对于严格的bsz=1（即每次只处理一个请求的一个样本），动态批处理不适用。

根据你的具体模型大小、设备数量、设备间通信带宽以及延迟要求，可以选择合适的并行策略。例如，如果模型较小，可以放在单个GPU上；如果模型较大，可以使用张量并行在多个GPU上分布模型。</think>我将详细讲解在大模型推理中常见的四种并行策略：DP、TP、PP、EP，并分析它们在批大小（bsz=1）的推理场景下的适用性。

## 一、四种并行策略详解

### 1. **数据并行（Data Parallelism, DP）**
- **原理**：将输入数据分割到多个设备，每个设备都有完整的模型副本，独立进行前向计算
- **通信模式**：前向传播无需通信，反向传播需要AllReduce同步梯度
- **推理场景**：
  - 训练中常用，可加速训练过程
  - **推理时**：如果bsz>1，可以将batch分到不同设备
  - **bsz=1时**：无法分割单个样本，DP不适用

### 2. **张量并行（Tensor Parallelism, TP）**
- **原理**：将模型的单个层（如Transformer的MLP或Attention）的权重矩阵水平或垂直切分到多个设备
- **通信模式**：每层都需要AllReduce或AllGather通信
- **典型实现**：Megatron-LM的切分方式
- **适用场景**：
  - 适合单个设备内存不足的大模型
  - 通信开销较大，适合设备间高速互联（如NVLink）

### 3. **流水线并行（Pipeline Parallelism, PP）**
- **原理**：将模型按层切分到多个设备，形成流水线
- **通信模式**：相邻设备间传递激活值（前向）和梯度（反向）
- **推理特点**：
  - 需要多个样本填充流水线才能达到高利用率
  - **bsz=1时**：流水线效率极低，存在大量设备空闲时间（bubble）

### 4. **专家并行（Expert Parallelism, EP）**
- **原理**：专为MoE（Mixture of Experts）模型设计，将不同专家分布到不同设备
- **通信模式**：需要All-to-All通信进行路由
- **适用场景**：
  - 仅适用于MoE架构模型（如Switch Transformer、Mixtral）
  - 每个token只激活部分专家，可节省计算

## 二、bsz=1推理场景下的适用性分析

### **推荐方案**：
```
bsz=1推理 → 优先考虑TP，避免PP，DP不适用
```

### **详细对比**：

| 并行策略 | bsz=1适用性 | 原因 | 建议 |
|---------|------------|------|------|
| **DP** | ❌ 不适用 | 无法分割单个样本 | 不应使用 |
| **TP** | ✅ 适用 | 可切分模型参数，降低单卡内存 | **推荐使用**，但需考虑通信开销 |
| **PP** | ⚠️ 不推荐 | 流水线气泡严重，设备利用率低 | 尽量避免，除非模型极大 |
| **EP** | ✅ 条件适用 | 仅MoE模型可用 | 如果是MoE架构则推荐 |

### **组合策略建议**：

1. **纯TP策略**（最推荐）：
   ```python
   # 例如：将70B模型用TP=8部署在8张A100上
   # 每张卡只需存储约8.75B参数
   ```

2. **TP+EP组合**（MoE模型）：
   ```python
   # 例如：Mixtral 8x7B使用EP=8, TP=2
   # 每个专家分布在不同设备，每设备用TP进一步切分
   ```

3. **TP+PP组合**（极端大模型）：
   ```python
   # 仅在模型太大，TP受限于单层内存时使用
   # 例如：540B模型用PP=4, TP=8
   # 但bsz=1时PP效率问题仍需考虑
   ```

## 三、实际部署考量因素

### **内存分析**：
```
bsz=1推理内存 ≈ 参数内存 + 激活值内存 + KV缓存
1. 参数内存：可通过TP/PP减少
2. KV缓存：bsz=1时较小，但序列长度影响大
3. 激活值：bsz=1时相对较小
```

### **延迟vs吞吐量**：
- **bsz=1关注延迟**：TP增加通信，可能增加延迟
- **需要实测**：在目标硬件上测试TP
Generation stats: time 135.389s, input_len 35, output_len 2013, ttft 2.104s, tpot 0.066s, prefill 16.638 tok/s, decode 15.095 tok/s


请输入您的问题（输入exit退出）：