📖 About 90 minutes

Chapter 3: FSDP, ZeRO, NCCL

This chapter builds a deep understanding of the memory-efficiency techniques for distributed training (ZeRO, FSDP) and of NCCL, the core of GPU-to-GPU communication. These technologies form the foundation of Checkpointless Training.

What you will learn in this chapter: the memory-optimization principles of ZeRO Stages 1/2/3, the mechanics of PyTorch FSDP, NCCL collective operations, and the communication algorithms used on large clusters.

1. ZeRO Overview

The memory duplication problem in DDP

Standard Data Parallelism (DDP) gives every GPU an identical model replica and distributes only the data. Each GPU runs the forward/backward pass independently, then synchronizes gradients with All-Reduce.

DDP's core problem: memory duplication. In DDP, every GPU replicates the full set of model parameters, gradients, and optimizer states. Whether there are 100 GPUs or 1,000, each GPU consumes the same amount of memory. As a result, a large model that does not fit on a single GPU cannot be trained with DDP at all.

Composition of the training state

With mixed-precision training (BF16/FP16 compute + FP32 master weights) and the Adam optimizer, the memory required per parameter is:

| Component | Data type | Size per parameter | Description |
|---|---|---|---|
| Model Weights | BF16/FP16 | 2 bytes | Weights used in forward/backward compute |
| Gradients | BF16/FP16 | 2 bytes | Gradients computed in the backward pass |
| Master Weights | FP32 | 4 bytes | High-precision weights for the optimizer update |
| Momentum (1st moment) | FP32 | 4 bytes | Adam's first moment (mean) |
| Variance (2nd moment) | FP32 | 4 bytes | Adam's second moment (variance) |
| Total | - | 16 bytes | Total memory per parameter |

Total memory = parameters × 16 bytes
Example: 70B model = 70,000,000,000 × 16 = 1,120 GB (1.12 TB)

Fitting this 1.12 TB onto a single GPU (80 GB VRAM) is impossible. ZeRO solves this problem.
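To make the arithmetic above reusable, here is a minimal sketch of the same calculation as a Python helper; the function name and defaults are illustrative, not from any library:

def training_state_gb(n_params: float, bytes_per_param: int = 16, n_shards: int = 1) -> float:
    """Model-state memory per GPU, in GB.

    16 bytes/param = 2 (BF16 weights) + 2 (BF16 grads) + 4 (FP32 master weights)
                   + 4 (Adam momentum) + 4 (Adam variance).
    n_shards=1 models DDP (full replication); n_shards=N approximates ZeRO-3.
    """
    return n_params * bytes_per_param / n_shards / 1e9

print(training_state_gb(70e9))                # 1120.0 GB: no single GPU fits this
print(training_state_gb(70e9, n_shards=256))  # 4.375 GB: ZeRO-3 across 256 GPUs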

2. ZeRO Stage 1: Optimizer State Partitioning

How it works

ZeRO Stage 1 ($P_{os}$) shards only the optimizer states, the largest consumer of memory, across the GPUs.

  • Each GPU keeps the optimizer states only for its own slice of the parameters
  • Model parameters and gradients remain replicated on every GPU (same as DDP)
  • After the optimizer step, the updated parameters are synchronized via All-Gather

Memory savings

Optimizer states = 12 bytes/param (master weights 4 B + momentum 4 B + variance 4 B)
Sharded across N GPUs: 12/N bytes/param

Example with 8 GPUs: 12/8 = 1.5 bytes/param (an 8x reduction from the original 12 B)

DeepSpeed ZeRO Stage 1 config (JSON)
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  }
}
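For context, a config like this is typically handed to deepspeed.initialize. A minimal sketch, assuming the JSON above is saved as ds_config.json and using a toy placeholder model:

import torch
import torch.nn as nn
import deepspeed

model = nn.Linear(1024, 1024)  # placeholder model

# deepspeed.initialize reads the ZeRO settings from the config and returns
# an engine that owns the backward() and step() logic
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # assumed path to the JSON above
)

loss = engine(torch.randn(4, 1024).to(engine.device)).sum()
engine.backward(loss)  # handles loss scaling and gradient synchronization
engine.step()          # each rank updates using its optimizer-state shard

A script like this is normally launched with the deepspeed launcher (or torchrun) so that each GPU gets one process.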

3. ZeRO Stage 2: + Gradient Partitioning

How it works

ZeRO Stage 2 ($P_{os+g}$) additionally shards the gradients on top of Stage 1.

  • After the backward pass, the full gradients are summed and distributed in a single Reduce-Scatter operation
  • Each GPU keeps only the gradients for the parameters it owns
  • Each GPU runs the optimizer step only on its own parameter shard

The Reduce-Scatter operation

Reduce-Scatter is the first half of an All-Reduce split into two phases:

All GPUs (full gradients) → Reduce (sum) → Scatter (distribute) → each GPU (1/N gradient shard)

Memory savings

Stage 1: optimizer states sharded = 12/N bytes/param
Stage 2: gradients sharded as well = 2/N bytes/param instead of 2

Total: (12 + 2)/N = 14/N bytes/param (excluding the still-replicated parameters)
DeepSpeed ZeRO Stage 2 Config JSON
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "gradient_clipping": 1.0,
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4
}

4. ZeRO Stage 3: + Parameter Partitioning

How it works

ZeRO Stage 3 ($P_{os+g+p}$) shards the model parameters as well. Technically, this is almost the same implementation as PyTorch FSDP.

  • Each GPU keeps only a 1/N slice of the model resident in memory
  • Parameters are fetched from the other GPUs via All-Gather only when a computation needs them
  • They are freed immediately after the computation to reclaim memory

All-Gather on Demand

During the forward pass, just before a given layer runs, its parameters are gathered from all the GPUs:

GPU 0 (shard 0) + GPU 1 (shard 1) + GPU 2 (shard 2) + GPU 3 (shard 3) → All-Gather → full parameters (temporary)

Memory savings (N-fold)

Stage 3: every state sharded
Total memory/GPU = (parameters 2 B + gradients 2 B + optimizer 12 B) / N = 16/N bytes/param

Example with 256 GPUs, 70B model:
= 70B × 16 / 256 = 4.375 GB/GPU (a 256x reduction from the original 1.12 TB)

DeepSpeed ZeRO Stage 3 config (JSON)
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 32
}

5. ZeRO-Infinity: NVMe Offloading

Concept

ZeRO-Infinity adds NVMe SSD offloading on top of ZeRO Stage 3, reaching past GPU memory into system memory and NVMe storage.

Memory Pool Hierarchy

NVMe SSD (TBs) ↔ CPU memory (hundreds of GB) ↔ GPU HBM (80 GB)

  • Offload Optimizer: offload the optimizer states to CPU memory
  • Offload Param: offload the parameters as well, to CPU/NVMe
  • NVMe offload: use the NVMe SSD when even CPU memory runs out
  • Asynchronous I/O (aio) with prefetching minimizes the slowdown

DeepSpeed ZeRO-Infinity config (JSON, including aio settings)
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "fast_init": false
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "aio": {
    "block_size": 1048576,
    "queue_depth": 8,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
  },
  "bf16": {
    "enabled": true
  },
  "train_batch_size": 512,
  "gradient_accumulation_steps": 64
}
ZeRO-Infinity in practice: a single DGX A100 node (8 GPUs × 80 GB = 640 GB) can train a one-trillion (1T) parameter model. NVMe pushes past even the system-memory limit, though training naturally becomes slower.

6. How FSDP Works

What is FSDP?

FSDP (Fully Sharded Data Parallel) is PyTorch's native implementation of ZeRO Stage 3. It shards parameters, gradients, and optimizer states across all GPUs.

Forward Pass

# FSDP forward pass, step by step

# 1. At rest: each GPU holds only its 1/N parameter shard
GPU_0: [Shard_0]  GPU_1: [Shard_1]  GPU_2: [Shard_2]  GPU_3: [Shard_3]

# 2. Before forward: rebuild the full parameters with All-Gather
All-Gather() → every GPU: [Full Parameters]

# 3. Forward compute: run the layer with the full parameters
output = layer(input)  # with full parameters

# 4. After forward: free the gathered parameters (reclaim memory)
del full_parameters  # keep only the local shard

Backward Pass

# FSDP backward pass, step by step

# 1. Before backward: rebuild the full parameters with All-Gather again
All-Gather() → every GPU: [Full Parameters]

# 2. Backward compute: compute the gradients
gradients = backward(loss)

# 3. After backward: Reduce-Scatter sums and distributes the gradients
Reduce-Scatter(gradients) → each GPU: [1/N Gradient Shard]

# 4. Optimizer step: each GPU updates only its own shard
optimizer.step(local_shard)  # only 1/N of the parameters

Full flow

Sharded params → All-Gather → Forward → Free → All-Gather → Backward → Reduce-Scatter
Why FSDP complicates checkpointing
  • Sharded state: parameters live as DTensor shards; no single GPU ever holds the whole model
  • Reconstruction required: saving a "full" checkpoint requires an All-Gather
  • Memory spike: collecting the full state temporarily doubles memory usage
  • Coordination: every rank must synchronize when the checkpoint is saved
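As an illustration of that reconstruction cost, this is roughly how a full (unsharded) checkpoint is gathered with the FSDP1 API; a sketch assuming `model` is already FSDP-wrapped and the process group is initialized:

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

# Gathering a full state dict forces an All-Gather of every shard;
# offload_to_cpu and rank0_only bound the memory spike described above
cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    state = model.state_dict()  # full tensors materialize (on rank 0 only)

if dist.get_rank() == 0:
    torch.save(state, "checkpoint.pt")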

7. FSDP Sharding Strategies

| Strategy | Behavior | Memory | Communication | Use case |
|---|---|---|---|---|
| FULL_SHARD | Free parameters after forward | Minimal | Most | Extreme memory pressure |
| SHARD_GRAD_OP | Keep parameters through forward | Higher | Less | When memory allows |
| HYBRID_SHARD | Shard within a node, replicate across nodes | Balanced | Topology-optimized | Large multi-node training |
| NO_SHARD | No sharding (identical to DDP) | Maximum | Least | Debugging, small models |
HYBRID_SHARD in detail

HYBRID_SHARD adapts the sharding layout to the network topology:

  • Intra-node: fast All-Gather/Reduce-Scatter over NVLink
  • Inter-node: model replication keeps cross-node traffic to a minimum

Example: on a 256-GPU cluster (8 GPUs/node × 32 nodes), the 8 GPUs within each node shard via FSDP, while the 32 nodes synchronize only gradients, DDP-style; see the sketch below.
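A minimal FSDP1-style sketch of that layout, assuming the process group is already initialized across all 256 ranks and `model` is a placeholder module:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node,
    device_id=torch.cuda.current_device(),            # replicate inter-node
)

By default FSDP derives the shard group from the GPUs within each node and the replicate group across nodes; a custom process-group pair can be passed in when the node boundary is not the right split.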

8. FSDP2 vs FSDP1

Key differences

| Feature | FSDP1 | FSDP2 |
|---|---|---|
| Underlying mechanism | FlatParameter | DTensor |
| API | FSDP(module) wrapper | fully_shard(module) function |
| Flexibility | Per-module | Per-parameter |
| Memory management | Manual tuning required | Automatically optimized |
| Recommended version | PyTorch 1.x - 2.3 | PyTorch 2.4+ |

What is DTensor?

DTensor (Distributed Tensor) is the distributed tensor abstraction introduced in PyTorch 2.0. It tracks, as metadata, how a tensor is laid out across multiple devices.

FSDP2 code example

import torch
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.device_mesh import init_device_mesh

# Initialize a 2D device mesh (DP x TP); dp_size and tp_size come from
# your launch configuration (placeholders here)
mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

# Mixed-precision policy: BF16 compute, FP32 gradient reduction
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
)

# FSDP2: apply fully_shard to the individual modules
# (`model` is assumed to be a transformer-style module built beforehand)
for layer in model.transformer.layers:
    fully_shard(layer, mesh=mesh["dp"], mp_policy=mp_policy)

# Apply it to the top-level module as well
fully_shard(model, mesh=mesh["dp"], mp_policy=mp_policy)

# model is now sharded with FSDP2
output = model(input_ids)

9. Memory Calculation Examples

Per-scenario memory for a 70B model

| Scenario | GPUs | Model-state memory/GPU | Feasible? |
|---|---|---|---|
| Single GPU (DDP) | 1 | 70B × 16 B = 1,120 GB | No (exceeds 80 GB VRAM) |
| 8-GPU ZeRO-3 | 8 | 1,120 / 8 = 140 GB | No |
| 32-GPU ZeRO-3 | 32 | 1,120 / 32 = 35 GB | Yes (+ activation memory needed) |
| 256-GPU ZeRO-3 | 256 | 1,120 / 256 = 4.375 GB | Comfortable |

Activation Memory formula

During the forward pass, the intermediate results (activations) must be stored for the backward pass:

Activation memory ≈ batch_size × seq_len × hidden_dim × num_layers × bytes_per_element

Example: Llama 70B (hidden=8192, layers=80, BF16)
batch=1, seq=4096: 1 × 4096 × 8192 × 80 × 2 ≈ 5.4 GB
batch=4, seq=4096: 4 × 4096 × 8192 × 80 × 2 ≈ 21.5 GB

Activation Checkpointing: if activation memory grows too large, gradient checkpointing skips storing the forward-pass intermediates and recomputes them during backward. It saves memory at the cost of roughly 30% more compute.
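A minimal sketch of that recompute-instead-of-store idea with torch.utils.checkpoint; the layer below is a toy stand-in for a transformer block:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# The activations inside `layer` are not kept during forward; they are
# recomputed during backward, trading extra compute for memory
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()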

10. NCCL Collective Operations

What is NCCL?

NCCL (NVIDIA Collective Communications Library) is the high-performance library that handles GPU-to-GPU communication in distributed training. Like a "network stack" for GPUs, it defines how they talk to one another.

Key Collective Operations

All-Reduce

Every GPU contributes, and every GPU receives the summed result. Essential for gradient synchronization in DDP.

# All-Reduce: sum the gradients from every GPU → identical result on every GPU
GPU 0: [1, 2, 3]     GPU 0: [10, 20, 30]
GPU 1: [2, 4, 6]  →  GPU 1: [10, 20, 30]  (sum)
GPU 2: [3, 6, 9]     GPU 2: [10, 20, 30]
GPU 3: [4, 8, 12]    GPU 3: [10, 20, 30]

All-Gather

๊ฐ GPU์˜ ์กฐ๊ฐ์„ ๋ชจ์•„ ์ „์ฒด ํ…์„œ๋ฅผ ๊ตฌ์„ฑ, ๋ชจ๋“  GPU๊ฐ€ ๋™์ผํ•œ ์ „์ฒด ํ…์„œ๋ฅผ ๋ฐ›์Šต๋‹ˆ๋‹ค. FSDP์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ์žฌ๊ตฌ์„ฑ์— ์‚ฌ์šฉ.

# All-Gather: ๊ฐ GPU์˜ ์ƒค๋“œ ์ˆ˜์ง‘ โ†’ ์ „์ฒด ํ…์„œ ์žฌ๊ตฌ์„ฑ
GPU 0: [A]           GPU 0: [A, B, C, D]
GPU 1: [B]        โ†’  GPU 1: [A, B, C, D]  (concatenate)
GPU 2: [C]           GPU 2: [A, B, C, D]
GPU 3: [D]           GPU 3: [A, B, C, D]

Reduce-Scatter

Every GPU contributes, the result is split N ways, and each GPU receives a different piece. Used in the FSDP backward pass to distribute gradients.

# Reduce-Scatter: sum + distribute
GPU 0: [1,2,3,4]     GPU 0: [10] (sum of chunk 0)
GPU 1: [2,4,6,8]  →  GPU 1: [20] (sum of chunk 1)
GPU 2: [3,6,9,12]    GPU 2: [30] (sum of chunk 2)
GPU 3: [4,8,12,16]   GPU 3: [40] (sum of chunk 3)

Broadcast

One GPU (the root) sends its data, and all GPUs receive it. Used to distribute the initial weights.

# Broadcast: copy the root GPU's data to every GPU
GPU 0: [W]  (root)   GPU 0: [W]
GPU 1: [?]        →  GPU 1: [W]  (copied from root)
GPU 2: [?]           GPU 2: [W]
GPU 3: [?]           GPU 3: [W]

All-to-All

๊ฐ GPU๊ฐ€ ๋‹ค๋ฅธ ๋ชจ๋“  GPU์—๊ฒŒ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณด๋ƒ…๋‹ˆ๋‹ค. MoE์—์„œ Expert Parallelism์— ์‚ฌ์šฉ.

# All-to-All: ๊ฐ GPU๊ฐ€ ๋‹ค๋ฅธ GPU๋“ค์—๊ฒŒ ๊ฐ๊ฐ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์ „์†ก
GPU 0: [A0,A1,A2,A3]     GPU 0: [A0,B0,C0,D0]
GPU 1: [B0,B1,B2,B3]  โ†’  GPU 1: [A1,B1,C1,D1]
GPU 2: [C0,C1,C2,C3]     GPU 2: [A2,B2,C2,D2]
GPU 3: [D0,D1,D2,D3]     GPU 3: [A3,B3,C3,D3]
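All of these operations are exposed in PyTorch through torch.distributed. A minimal sketch, assuming the process group was created with the NCCL backend (e.g. launched via torchrun):

import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already run on every rank
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank % torch.cuda.device_count())

t = torch.full((4,), float(rank + 1), device=device)
dist.all_reduce(t)                        # t becomes the elementwise sum everywhere

shard = torch.full((2,), float(rank), device=device)
full = torch.empty(2 * world, device=device)
dist.all_gather_into_tensor(full, shard)  # every rank receives all shards

out = torch.empty(2, device=device)
dist.reduce_scatter_tensor(out, full)     # sum, then keep only this rank's chunk

dist.broadcast(t, src=0)                  # copy rank 0's tensor to every rank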

11. Ring vs Tree Algorithm

Ring Algorithm

The GPUs are connected in a logical ring and data circulates around it.

# Ring All-Reduce (4-GPU example)

   GPU 0 ←→ GPU 1
     ↑         ↓
   GPU 3 ←→ GPU 2

# Step 1: each GPU sends a chunk to its neighbor
# Step 2: the received chunk is summed with the local data
# Step 3: after N-1 steps each GPU holds one fully summed chunk;
#         another N-1 all-gather steps give every GPU the full result

# Complexity
Bandwidth: O(data size)   # proportional to data size, independent of GPU count
Latency:   O(N)           # proportional to GPU count (the drawback)
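To see the two phases concretely, here is a pure-Python simulation of ring all-reduce; lists stand in for GPU buffers, and the per-step copies model the fact that all sends in a step happen simultaneously (illustration only):

import copy

def ring_all_reduce(data):
    """data[g][c] is GPU g's value for chunk c (n GPUs, n chunks)."""
    n = len(data)
    # Phase 1: reduce-scatter (n-1 steps); GPU g ends up owning the
    # fully summed chunk (g+1) % n
    for step in range(n - 1):
        snap = copy.deepcopy(data)
        for g in range(n):
            c = (g - step) % n                  # chunk GPU g forwards this step
            data[(g + 1) % n][c] += snap[g][c]
    # Phase 2: all-gather (n-1 steps); the completed chunks circulate
    for step in range(n - 1):
        snap = copy.deepcopy(data)
        for g in range(n):
            c = (g + 1 - step) % n
            data[(g + 1) % n][c] = snap[g][c]
    return data

print(ring_all_reduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# [[6, 6, 6], [6, 6, 6], [6, 6, 6]] after 2(N-1) total steps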

Tree Algorithm (Double Binary Tree)

The GPUs are connected in a tree and data is aggregated hierarchically.

# Tree All-Reduce

        Root (GPU 0)
       /           \
    GPU 1         GPU 2
   /     \       /     \
GPU 3   GPU 4  GPU 5   GPU 6

# Reduce phase: leaves → root (summing)
# Broadcast phase: root → leaves (distributing)

# Complexity
Bandwidth: O(data size)   # same as Ring
Latency:   O(log N)       # logarithmic in GPU count (the advantage)

Comparison

| Property | Ring Algorithm | Tree Algorithm |
|---|---|---|
| Bandwidth efficiency | Optimal | Optimal |
| Latency | O(N), high | O(log N), low |
| Small clusters | Excellent | Fair |
| Large clusters | Latency grows | Excellent |
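The latency gap is easy to see from step counts alone; back-of-the-envelope arithmetic that ignores per-step cost differences:

import math

def ring_steps(n: int) -> int:
    return 2 * (n - 1)                   # reduce-scatter + all-gather passes

def tree_steps(n: int) -> int:
    return 2 * math.ceil(math.log2(n))   # reduce up, then broadcast down

for n in (8, 256, 4096):
    print(n, ring_steps(n), tree_steps(n))
# 8 GPUs: 14 vs 6;  256 GPUs: 510 vs 16;  4096 GPUs: 8190 vs 24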
Automatic selection by NCCL: NCCL analyzes the network topology and message sizes and automatically chooses Ring, Tree, or a hybrid algorithm. In most cases there is no need to specify one manually.

12. NCCL Initialization

The TCPStore rendezvous process

To start distributed training, hundreds to thousands of processes (one per GPU) must discover each other's existence and location.

# Traditional NCCL initialization

# 1. The master node (rank 0) starts a TCPStore server
Rank 0: opens a TCPStore server (IP:PORT)

# 2. Every worker connects to the master and registers its own info
Rank 1 → Rank 0: "my address is 192.168.1.2:29501"
Rank 2 → Rank 0: "my address is 192.168.1.3:29501"
...
Rank N → Rank 0: "my address is ..."

# 3. Once every worker has registered, the NCCL unique ID is created
Rank 0: creates and broadcasts the NCCL unique ID

# 4. Each rank forms the communicator
ncclCommInitRank(comm, nranks, uniqueId, rank)

init_process_group code

import torch
import torch.distributed as dist
import os

# Read the distributed settings from environment variables
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])
master_addr = os.environ['MASTER_ADDR']
master_port = os.environ['MASTER_PORT']

# Initialize the process group (TCPStore-based rendezvous)
dist.init_process_group(
    backend='nccl',           # for GPU communication
    init_method=f'tcp://{master_addr}:{master_port}',
    rank=rank,
    world_size=world_size,
)

# Collective operations are now available
tensor = torch.ones(10).cuda()
dist.all_reduce(tensor)  # sum `tensor` across all GPUs
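In practice these environment variables are rarely set by hand; a launcher populates them for each process. For example, with torchrun (the node counts below are illustrative):

torchrun --nnodes=2 --nproc_per_node=8 --node_rank=$NODE_RANK \
         --master_addr=$MASTER_ADDR --master_port=29500 train.py

Each of the 16 processes then starts with RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT already set, and the code above runs unchanged.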
The TCPStore bottleneck: when thousands of ranks connect to rank 0 at the same time, the network at rank 0 becomes a bottleneck. Checkpointless Training solves this with rootless NCCL initialization (covered in detail in Chapter 6).

13. Topology Discovery

NCCL's automatic topology discovery

At startup, NCCL automatically discovers the hardware configuration (GPUs, PCIe/NVLink layout, and NICs).

Link hierarchy (P2P levels)

| Link type | Bandwidth | Latency | Where it is used |
|---|---|---|---|
| NVLink | 600-900 GB/s | Very low | GPU-to-GPU within a node |
| NVSwitch | 7.2 TB/s (aggregate) | Very low | All-to-all within a node |
| PCIe Gen5 | 64 GB/s | Low | GPU-CPU, or GPU-GPU without NVLink |
| InfiniBand HDR | 200 Gbps | ~1 µs | Between nodes |
| InfiniBand NDR | 400 Gbps | ~1 µs | Between nodes (newest) |
| EFA (AWS) | 3200 Gbps | Low | Between AWS nodes |

NCCL_TOPO_FILE

On complex systems you can specify the topology yourself with an XML file:

# Point NCCL at a topology file
export NCCL_TOPO_FILE=/path/to/topology.xml

# Dump the result of topology discovery
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.txt

14. NCCL Environment Variables

Key environment variables

| Variable | Description | Default | Suggested |
|---|---|---|---|
| NCCL_DEBUG | Debug log level | WARN | INFO (TRACE when troubleshooting) |
| NCCL_DEBUG_SUBSYS | Log only specific subsystems | ALL | INIT,COLL (init / collectives) |
| NCCL_ALGO | Force an algorithm | auto | Ring, Tree, CollnetDirect |
| NCCL_PROTO | Force a protocol | auto | Simple, LL, LL128 |
| NCCL_BUFFSIZE | Communication buffer size | 4 MB | 8388608 (8 MB at large scale) |
| NCCL_NTHREADS | Kernel thread count | auto | 512 (at large scale) |
| NCCL_IB_TIMEOUT | InfiniBand timeout | 18 | 22-23 (large clusters) |
| NCCL_IB_RETRY_CNT | IB retry count | 7 | 13 (for stability) |
| NCCL_IB_GID_INDEX | IB GID index | 0 | 3 for RoCE v2 |
| NCCL_SOCKET_IFNAME | Network interface | auto | eth0, ens5 (AWS EFA) |
| NCCL_P2P_LEVEL | P2P communication limit | 5 | NVL (NVLink only) |
| NCCL_SHM_DISABLE | Disable shared memory | 0 | 1 (when debugging) |

AWS EFA optimization environment variables
# AWS EFA (Elastic Fabric Adapter) optimization settings

# Basic EFA settings
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export FI_EFA_FORK_SAFE=1

# NCCL EFA plugin
export NCCL_NET=aws-ofi-nccl
export NCCL_DEBUG=INFO

# P5 instance (8x H100) tuning
export NCCL_NVLS_ENABLE=1  # NVLink SHARP
export NCCL_IB_TIMEOUT=22
export NCCL_MIN_NCHANNELS=4

# Bandwidth tuning
export NCCL_BUFFSIZE=8388608  # 8 MB
export NCCL_P2P_NET_CHUNKSIZE=524288  # 512 KB
NCCL debugging tips

# Verbose logging when something goes wrong
export NCCL_DEBUG=TRACE
export NCCL_DEBUG_SUBSYS=INIT,COLL,P2P,NET
export NCCL_DEBUG_FILE=/tmp/nccl_debug_%h_%p.log

# Inspect the topology
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml

# Hang detection (30-second timeout)
export NCCL_TIMEOUT=30
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

# Example debug output
[Rank 0] NCCL INFO Bootstrap: Using eth0:192.168.1.10<6379>
[Rank 0] NCCL INFO Trees [0] -1/-1/-1->0->1 [1] -1/-1/-1->0->1
[Rank 0] NCCL INFO Channel 00 : 0 1 2 3

Summary

Key points
  • ZeRO: Stage 1 (optimizer states) → Stage 2 (+ gradients) → Stage 3 (+ parameters) progressively maximizes memory efficiency
  • FSDP: PyTorch's native ZeRO-3 implementation, driven by All-Gather and Reduce-Scatter
  • FSDP2: DTensor-based, with the more flexible fully_shard() API
  • NCCL: handles collective communication such as All-Reduce, All-Gather, and Reduce-Scatter
  • Ring vs Tree: Ring favors small clusters, Tree large ones; NCCL chooses automatically
  • TCPStore: the bottleneck of traditional NCCL initialization, which Checkpointless Training removes with rootless init