📖 About 45 minutes

Chapter 9: Q&A & Glossary

Detailed answers to the questions you can expect after the talk, a slide-by-slide presentation guide, and definitions of the key technical terms.

์ด ์žฅ์˜ ํ™œ์šฉ๋ฒ• ๋ฐœํ‘œ ์ง์ „ ๋น ๋ฅด๊ฒŒ Q&A ์„น์…˜์„ ํ›‘์–ด๋ณด๋ฉฐ ์˜ˆ์ƒ ์งˆ๋ฌธ์— ๋Œ€๋น„ํ•˜์„ธ์š”. ์šฉ์–ด์ง‘์€ ๋ฐœํ‘œ ์ค‘ ์ฒญ์ค‘์˜ ์งˆ๋ฌธ์— ์ •ํ™•ํ•œ ์ •์˜๋กœ ๋‹ต๋ณ€ํ•  ๋•Œ ์ฐธ๊ณ ํ•ฉ๋‹ˆ๋‹ค.

1. 15 Anticipated Questions

Q1. Does Checkpointless Training create no checkpoints at all?

Answer: No. "Checkpointless" means checkpoints are not used on the failure-recovery path.

  • Failure recovery: performed via P2P memory replication, with no checkpoint load
  • Long-term retention: S3 checkpoints for model versioning and training resumption can still be written on a separate cadence
  • Fallback: if P2P recovery fails, the system automatically switches to checkpoint-based recovery

Key point: Checkpoints are not "eliminated"; the checkpoint I/O bottleneck is removed from the failure-recovery path. A minimal sketch of this two-path recovery follows.
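
The fallback behavior can be pictured as a simple two-path routine. The sketch below is illustrative only; `restore_from_peer`, `load_checkpoint`, and `PeerRecoveryError` are hypothetical names for this sketch, not actual HyperPod APIs.

```python
# Illustrative two-path recovery. All helper names are hypothetical.
class PeerRecoveryError(Exception):
    """Raised when no healthy peer holds a usable replica (hypothetical)."""

def restore_from_peer(replica_group) -> dict:
    """Hypothetical: pull model/optimizer state from a peer's memory via P2P."""
    raise PeerRecoveryError("no peer replica available")  # stub body

def load_checkpoint(uri: str) -> dict:
    """Hypothetical: load the most recent persisted checkpoint (e.g., from S3)."""
    return {"global_step": 0}

def recover(replica_group, checkpoint_uri: str) -> dict:
    try:
        # Primary path: in-memory P2P restore -- no checkpoint I/O on this path.
        return restore_from_peer(replica_group)
    except PeerRecoveryError:
        # Fallback path: conventional checkpoint-based recovery.
        return load_checkpoint(checkpoint_uri)
```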

Q2. How much of my existing PyTorch code needs to change?

Answer: A four-tier progressive adoption model is provided, and Tier 1 requires no code changes at all, only environment variables (a hypothetical Tier 1 launch is sketched after the table).

Tier | Code changes | Capability
--- | --- | ---
Tier 1 | None (environment variables only) | TCPStore-less NCCL initialization
Tier 2 | Minimal (data loader) | + MMAP data loading
Tier 3 | Moderate | + In-Process Recovery
Tier 4 | NeMo-based | + P2P State Replication (full feature set)
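
To give a feel for what "environment variables only" means in practice, here is a minimal launch sketch. The variable name `HYPERPOD_TCPSTORE_LESS_INIT` is a made-up placeholder, not a documented HyperPod setting; consult the official documentation for the real variable names.

```python
# Tier 1 sketch: no training-code changes, configuration via env vars only.
# The variable name below is a hypothetical placeholder.
import os
import subprocess

env = dict(os.environ)
env["HYPERPOD_TCPSTORE_LESS_INIT"] = "1"  # hypothetical toggle for rootless NCCL init

# The training script itself stays untouched.
subprocess.run(
    ["torchrun", "--nproc_per_node=8", "train.py"],
    env=env,
    check=True,
)
```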

Q3. Are frameworks other than NeMo supported?

Answer: The full feature set (Tier 4) is currently NeMo-based, but Tiers 1-3 can be used with plain PyTorch as well.

  • Tiers 1-3: compatible with vanilla PyTorch, PyTorch Lightning, Hugging Face Transformers, and others
  • Tier 4: P2P State Replication requires integration with NeMo's distributed training stack
  • Roadmap: AWS is expanding support to additional frameworks

Q4. How much GPU memory overhead does it add?

Answer: It depends on the P2P replication settings; with num_distributed_optimizer_instances=2, the optimizer state is held in two copies.

  • Model parameters: not replicated (already sharded by FSDP/ZeRO)
  • Optimizer state: memory grows in proportion to the number of replicas
  • Mitigation: offload to CPU to reduce GPU memory pressure

Additional memory = optimizer state size × (number of replicas − 1)
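
A quick worked example of the formula, using illustrative numbers (Adam keeps two FP32 states, momentum and variance, per parameter; the 70B figure is an assumption):

```python
# Worked example for the formula above; all figures are illustrative.
params = 70e9                  # assume a 70B-parameter model
bytes_per_state = 4            # FP32
adam_states = 2                # momentum (m) and variance (v)

optimizer_state = params * adam_states * bytes_per_state   # ~560 GB cluster-wide
replicas = 2                   # num_distributed_optimizer_instances=2
extra = optimizer_state * (replicas - 1)

print(f"extra optimizer-state memory: {extra / 1e9:.0f} GB, sharded across the cluster")
```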

Q5. How many hot spare nodes should I provision?

Answer: As a rule of thumb, 1-5% of the cluster size.

Cluster size | Recommended hot spares | Rationale
--- | --- | ---
100 GPUs | 1-2 nodes | Failures are infrequent at small scale
1,000 GPUs | 10-50 nodes | Statistically 1-2 failures per week
10,000+ GPUs | 100-500 nodes | Meta's statistic: 466 failures in 54 days

Cost consideration: Weigh the cost of hot spares against the cost of time lost to failure recovery to find the optimal count.

Q6. Is EKS required? What about Slurm?

Answer: Checkpointless Training requires the EKS-based HyperPod Training Operator. Slurm environments use the separate Auto-Resume feature instead.

  • EKS (recommended): the Training Operator manages hot spares and orchestrates automatic recovery
  • Slurm: Auto-Resume can restart the job, but In-Process Recovery is not supported

Q7. What happens to training progress (the step counter) during recovery?

Answer: Training resumes at exactly the step where the failure occurred.

  • The P2P-replicated state includes the current global step
  • MMAP data loading restores the exact position in the dataset
  • Replicating the RNG state preserves the same random sequence (see the sketch below)

Difference from conventional checkpointing: Conventional recovery rolls back to the last checkpoint (losing hundreds to thousands of steps). Checkpointless rolls back to the step just before the failure (minimal loss).
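
A minimal sketch of capturing and restoring the RNG state alongside the global step. The `torch` and `random` calls are real APIs; the packaging around them is an illustrative assumption, not the actual replication format.

```python
import random
import torch

def capture_resume_state(global_step: int) -> dict:
    """Bundle the step counter with every RNG stream that affects training."""
    state = {
        "global_step": global_step,
        "torch_cpu_rng": torch.get_rng_state(),
        "python_rng": random.getstate(),
    }
    if torch.cuda.is_available():
        state["torch_cuda_rng"] = torch.cuda.get_rng_state_all()
    return state

def restore_resume_state(state: dict) -> int:
    """Re-seed all streams so training replays identically from the failed step."""
    torch.set_rng_state(state["torch_cpu_rng"])
    random.setstate(state["python_rng"])
    if torch.cuda.is_available() and "torch_cuda_rng" in state:
        torch.cuda.set_rng_state_all(state["torch_cuda_rng"])
    return state["global_step"]
```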

Q8. Is it available in other AWS regions?

Answer: SageMaker HyperPod is currently available in 17 regions, but Seoul (ap-northeast-2) is not among them.

  • Closest region: Tokyo (ap-northeast-1)
  • North America: us-east-1, us-east-2, us-west-2, and others
  • Europe: eu-west-1, eu-central-1, and others

Note: Region availability changes frequently; check the official AWS documentation for the latest list.

Q9. How much network bandwidth does P2P replication use?

Answer: Replication runs in the background during training and is negligible next to EFA's 3,200 Gbps.

  • Replication timing: exploits idle gaps in the forward/backward computation
  • Bandwidth use: scheduled apart from training collectives such as AllReduce
  • Impact: virtually no effect on training throughput (<1%)

Q10. How is Silent Data Corruption (SDC) detected?

Answer: Through CudaHealthCheck and the CheckpointManager's global step consistency verification.

  • CudaHealthCheck: periodic GPU memory integrity checks
  • Step consistency: verifies that the global step agrees across all ranks
  • Gradient validation: NaN/Inf detection with automatic rollback (sketched below)

Why SDC matters: During Gemini Ultra training, Google reported SDC occurring every one to two weeks. Early detection is critical.

Q11. Does it add cost?

Answer: Hot spare nodes and the Checkpointless container image add some cost, but the shorter recovery time yields a net saving.

Cost item | Added cost | Savings
--- | --- | ---
Hot spare nodes | 1-5% of the cluster | Recovery time cut by 90%+
Container image | Negligible | -
Shorter recovery time | - | $4,693 per failure (1,000-GPU basis)

Net savings = (number of failures × recovery time saved × GPU cost per hour) − hot spare cost
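
Plugging illustrative numbers into the formula (only the $4,693-per-failure figure comes from the table above; every other input is an assumption, not AWS pricing):

```python
# All inputs except savings_per_failure are assumptions for demonstration.
failures_per_month = 6             # ~1-2 failures per week at 1,000 GPUs
savings_per_failure = 4693.0       # USD, from the cost table above
spare_nodes, gpus_per_node = 1, 8  # deliberately small spare pool
gpu_cost_per_hour = 4.0            # assumed blended $/GPU-hour
hours_per_month = 730

recovered_value = failures_per_month * savings_per_failure
spare_cost = spare_nodes * gpus_per_node * gpu_cost_per_hour * hours_per_month

print(f"monthly recovery savings: ${recovered_value:,.0f}")   # ~$28,158
print(f"monthly hot-spare cost:   ${spare_cost:,.0f}")        # ~$23,360
print(f"net:                      ${recovered_value - spare_cost:,.0f}")
```

The balance shifts quickly with the failure rate, which grows with cluster size, so rerun the numbers for your own cluster and spare pool.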

Q12. Was Amazon Nova actually trained with this technology?

Answer: Yes, the Amazon Nova model family was trained with Checkpointless Training.

  • Cluster scale: tens of thousands of accelerators
  • Goodput: above 95%
  • Significance: validated at large scale in a production environment

Talking point: "This is not theory. It is production technology that AWS used and validated while training its own models."

Q13. Can it be used on-premises?

Answer: No. Checkpointless Training is a managed capability exclusive to SageMaker HyperPod.

  • The Training Operator, hot-spare management, and EFA integration all depend on AWS infrastructure
  • On-premises alternative: consider NVIDIA NVRx (open source)

Q14. How do Elastic Training and Checkpointless Training differ?

Answer: They serve different purposes: Elastic targets dynamic scaling, Checkpointless targets failure recovery.

Aspect | Elastic Training | Checkpointless Training
--- | --- | ---
Purpose | Dynamically resize the cluster | Fast failure recovery
Scenario | Spot VM usage, resource elasticity | Stability for large, long-running training jobs
Recovery method | Checkpoint-based | In-memory replication

Complementary relationship: The two are not mutually exclusive; used together they deliver both elasticity and stability.

Q15. Can it be used alongside conventional checkpoints?

Answer: Yes; running both in parallel is possible and recommended.

  • P2P replication: the primary path for fast failure recovery
  • S3 checkpoints: long-term retention and version management (on a separate cadence)
  • Fallback: if P2P recovery fails, the system automatically switches to checkpoint-based recovery

```yaml
# Example recommended settings
p2p_replication_interval: 100    # P2P replication every 100 steps
s3_checkpoint_interval: 10000    # S3 checkpoint every 10,000 steps
```

2. ์Šฌ๋ผ์ด๋“œ๋ณ„ ๋ฐœํ‘œ ๊ฐ€์ด๋“œ (15๋ถ„ ์„ธ์…˜)

Time-management tip: In a 15-minute talk, nothing matters more than staying on time. Stick to the recommended duration for each section.

Slide # | Title | Key talking points | Time
--- | --- | --- | ---
1 | Title Slide | Brief self-introduction; session goals | 30 s
2 | Agenda | Three pillars: the problem, the solution, the results | 30 s
3-4 | The Problem | Meta statistic: 466 failures in 54 days; Llama 3 70B checkpoint = 521 GB; recovery takes 15-60 min | 2 min
5-6 | Cost of Failure | $4,693 per failure (1,000 GPUs); Goodput drops to 60-80%; stress that "time = money" | 2 min
7-8 | Introducing Checkpointless | Core idea: remove disk I/O; in-memory P2P replication; instant hot-spare takeover | 2 min
9-11 | The 5 Components | Rootless NCCL: removes the single point of failure; MMAP: instant data-position restore; IPR: keeps healthy nodes alive; P2P: the heart of state replication; Training Operator: orchestration | 3 min
12-13 | Performance Benchmarks | Recovery time: 15-60 min → <90 s; Goodput: 60% → 95%+; the Amazon Nova case | 2 min
14-15 | Getting Started | 4-tier adoption model; pre-configured recipes; GitHub links | 1 min 30 s
16-17 | Wrap-up & Q&A | Three-line summary; call to action; hand over to Q&A | 1 min 30 s
Core message of the talk (three-line summary):
1. At large scale, failure is not the exception; it is the norm.
2. Checkpointless Training removes disk I/O from recovery and cuts recovery time by 90%.
3. It is production technology proven in Amazon Nova training.

3. Glossary

Key technical terms, sorted alphabetically. Use Ctrl+F to jump to an entry quickly.
Term | Definition
--- | ---
AllGather | Collective operation in which every process sends its data to all other processes, so every process ends up with the complete data
AllReduce | Collective operation that combines (e.g., sums) data from all processes and distributes the result back to all of them; central to gradient synchronization in distributed training
Backpropagation | Algorithm that propagates the output error backward through a neural network to compute the gradient of each weight
BF16 (BFloat16) | Brain Floating Point 16-bit. A 16-bit floating-point format developed by Google that keeps FP32's exponent range while halving memory use
Checkpoint | Snapshot of model weights, optimizer state, and training progress saved during training; used for failure recovery and resuming training
CUDA | NVIDIA's parallel computing platform and programming model; enables general-purpose computation on GPUs
DDP (Distributed Data Parallel) | PyTorch's data-parallel module; keeps a model replica on each GPU and splits the data across them for parallel training
DeepSpeed | Large-scale training optimization library developed by Microsoft; best known for the ZeRO optimizer
EFA (Elastic Fabric Adapter) | AWS's high-performance network interface, designed for HPC and ML workloads; provides up to 3,200 Gbps of bandwidth
FSDP (Fully Sharded Data Parallel) | PyTorch's memory-efficient distributed training technique; shards model parameters, gradients, and optimizer state across GPUs
Goodput | Effective throughput: the productive share of total throughput after subtracting work lost to failures. Goodput = Throughput × (1 − Failure_Rate × Recovery_Time)
Gradient | Partial derivative of the loss function with respect to each parameter; determines the direction and size of parameter updates
HBM3 (High Bandwidth Memory 3) | Third-generation high-bandwidth memory; 3.35 TB/s on the H100 GPU, and 4.8 TB/s with the H200's HBM3e
Hot Spare | Standby node kept ready to take over immediately when a failure occurs
HyperPod | AWS SageMaker's managed cluster service for large-scale distributed training; provides automatic failure recovery and cluster management
In-Process Recovery (IPR) | Mechanism that recovers from a failure within the same process, without restarting it; processes on healthy nodes keep running
JLR (Job Level Restart) | Recovery approach that restarts the entire training job from the beginning; the slowest and most expensive option
Loss Function | Function that measures the gap between a model's predictions and the true values; training aims to minimize it
MMAP (Memory-Mapped I/O) | Technique that maps a file directly into virtual memory so file I/O behaves like memory access; used for fast restoration of data-loader state
Mixed Precision Training | Training with a mix of FP16/BF16 and FP32; cuts memory use and compute time while preserving accuracy
MTBF (Mean Time Between Failures) | Average time a system operates without a failure
NCCL (NVIDIA Collective Communications Library) | Library for high-performance communication between NVIDIA GPUs; provides collective operations such as AllReduce and AllGather
NeMo | NVIDIA's framework for training conversational AI models; optimized for large language model training
NVLink | NVIDIA's high-speed GPU-to-GPU interconnect; far higher bandwidth than PCIe (H100: 900 GB/s)
NVSwitch | Switch that connects multiple GPUs over NVLink; enables direct communication between all GPUs within a node
Optimizer State | State maintained by the optimizer; for Adam, the momentum (m) and variance (v), which take 2x the parameter memory
P2P (Peer-to-Peer) | Direct node-to-node communication without a central server; used for state replication in Checkpointless Training
Pipeline Parallelism (PP) | Parallelization technique that splits a model by layers across GPUs and processes micro-batches as a pipeline
PLR (Process Level Restart) | Recovery approach that restarts only the failed processes; faster than JLR but still requires a checkpoint load
RDMA (Remote Direct Memory Access) | Technology for accessing remote memory directly without CPU involvement; used by EFA, InfiniBand, and others
ReduceScatter | Collective operation that reduces data and then scatters the result; used in FSDP for sharding after gradient synchronization
Ring Algorithm | Collective algorithm that connects nodes in a ring and passes data sequentially; bandwidth-efficient
RNG State | Random Number Generator state; must be saved and restored to reproduce the same random sequence
Rootless NCCL | Technique for initializing NCCL in a decentralized way without a single "root" node; removes the TCPStore dependency
SageMaker | AWS's fully managed ML platform; supports the full ML lifecycle of model development, training, and deployment
Sharding | Splitting data or model state across multiple nodes; improves memory efficiency
SRD (Scalable Reliable Datagram) | Transport protocol used by AWS EFA; UDP-based but with reliability guarantees
Straggler | Node that computes more slowly than its peers and drags down overall training speed
TCPStore | Key-value store for sharing information between processes in PyTorch distributed training; by default runs on a single master node
Tensor Core | Matrix-math unit on NVIDIA GPUs; accelerates deep learning operations (matrix multiplication)
Tensor Parallelism (TP) | Technique that splits the tensors of a single layer across multiple GPUs; effective for very large layers
Training Operator | Kubernetes operator that manages distributed training jobs; in HyperPod, responsible for hot spares and automatic recovery orchestration
Tree Algorithm | Collective algorithm that aggregates and distributes data over a tree structure; optimized for latency
UltraCluster | AWS's large-scale GPU cluster configuration; connects multiple UltraServers over a high-speed network
UltraServer | AWS's high-performance GPU server node: 8x H100/H200 GPUs + NVSwitch + EFA
World Size | Total number of processes (GPUs) participating in a distributed training run
ZeRO (Zero Redundancy Optimizer) | DeepSpeed's memory-optimization technique; partitions optimizer state, gradients, and parameters in stages to minimize memory use
ZeRO-Infinity | Extension of ZeRO that offloads to CPU and NVMe storage to overcome GPU memory limits

4. References

4.1 AWS Official Documentation

4.2 AWS Blogs

4.3 GitHub Repositories

4.4 Academic Papers

Paper | Venue | Key contribution
--- | --- | ---
CheckFreq: Frequent, Fine-Grained DNN Checkpointing | USENIX FAST '21 | Decouples snapshot from persist; hides I/O latency
Bamboo: Making Preemptible Instances Resilient for Affordable Training | USENIX NSDI '23 | Redundant computation in pipeline bubbles; instant takeover
Varuna: Scalable, Low-cost Training of Massive Deep Learning Models | EuroSys '22 (Best Paper) | Dynamic model repartitioning; training on spot VMs
Oobleck: Resilient Distributed Training Using Pipeline Templates | ACM SOSP '23 | Dynamic reconfiguration based on pipeline templates
MegaScale: Scaling LLM Training to More Than 10,000 GPUs | arXiv 2024 | Large-scale training on 12,288 GPUs; 55.2% MFU
Pathways: Asynchronous Distributed Dataflow for ML | MLSys '22 | Google's asynchronous distributed dataflow

4.5 NVIDIA Documentation

4.6 Other References

You're done! If you have read every chapter of this document, you now have a deep understanding of SageMaker HyperPod Checkpointless Training, and you are ready to explain the technology with confidence and handle the Q&A.