📖 ~45 minutes

Chapter 4: GPU & EFA Networking

This chapter covers the hardware foundations of large-scale distributed training: NVIDIA GPU specifications, the NVLink/NVSwitch interconnect, AWS Trainium, EFA networking, and FSx for Lustre storage.

์ด ์žฅ์—์„œ ๋ฐฐ์šฐ๋Š” ๊ฒƒ H100/H200 GPU์˜ ์ƒ์„ธ ์ŠคํŽ™, NVLink ์„ธ๋Œ€๋ณ„ ์ง„ํ™”, AWS์˜ AI ๊ฐ€์†๊ธฐ(Trainium), P5/P5e/P5en ์ธ์Šคํ„ด์Šค ๋น„๊ต, EFA์˜ OS Bypass ์•„ํ‚คํ…์ฒ˜, ๊ทธ๋ฆฌ๊ณ  ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ์„ ์œ„ํ•œ FSx for Lustre ์„ฑ๋Šฅ์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

1. NVIDIA H100

H100 Overview

NVIDIA H100์€ Hopper ์•„ํ‚คํ…์ฒ˜ ๊ธฐ๋ฐ˜์˜ ๋ฐ์ดํ„ฐ์„ผํ„ฐ GPU๋กœ, ๋Œ€๊ทœ๋ชจ AI ํ•™์Šต๊ณผ ์ถ”๋ก ์— ์ตœ์ ํ™”๋˜์—ˆ์Šต๋‹ˆ๋‹ค. SXM๊ณผ NVL ๋‘ ๊ฐ€์ง€ ํผํŒฉํ„ฐ๋กœ ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค.

H100 SXM vs. NVL Spec Comparison

Spec | H100 SXM | H100 NVL
HBM3 memory | 80 GB | 94 GB (188 GB across a 2-GPU pair)
Memory bandwidth | 3.35 TB/s | 3.9 TB/s
FP16 Tensor Core | 1,979 TFLOPS | 1,979 TFLOPS
FP8 Tensor Core | 3,958 TFLOPS | 3,958 TFLOPS
BF16 Tensor Core | 1,979 TFLOPS | 1,979 TFLOPS
NVLink bandwidth | 900 GB/s (bidirectional) | 600 GB/s (NVLink Bridge)
TDP | 700 W | 400 W (per GPU)
MIG support | up to 7 instances | up to 7 instances
Form factor | SXM5 (servers only) | PCIe Gen5, dual slot
Use case | large-scale training clusters | LLM inference, RAG
MIG (Multi-Instance GPU): the H100 can be partitioned into up to 7 independent instances on a single physical GPU. Each instance has its own memory, SMs, and L2 cache and is fully isolated from the others. This is useful for raising GPU utilization on inference workloads; a partitioning sketch follows below.
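
Below is a hedged sketch of creating such a partition with the standard nvidia-smi MIG commands, driven from Python for consistency with this chapter's other examples. Run it as root on the instance. The GPU index 0 and profile ID 9 are illustrative only; read the valid profile IDs for your GPU from the nvidia-smi mig -lgip output first.

import subprocess

def run(cmd):
    # Print and execute one nvidia-smi invocation, failing loudly on error.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])        # enable MIG mode on GPU 0 (a GPU reset may be required)
run(["nvidia-smi", "mig", "-lgip"])                # list the GPU-instance profiles this GPU actually offers
# Create two GPU instances (profile ID 9 commonly maps to a 3g-class slice; verify against -lgip)
# and the matching compute instances in one step with -C.
run(["nvidia-smi", "mig", "-i", "0", "-cgi", "9,9", "-C"])
run(["nvidia-smi", "-L"])                          # MIG devices now appear as separate UUIDs for schedulers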

2. NVIDIA H200

H200 Overview

NVIDIA H200์€ H100์˜ ๋ฉ”๋ชจ๋ฆฌ ์—…๊ทธ๋ ˆ์ด๋“œ ๋ฒ„์ „์œผ๋กœ, HBM3e ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํƒ‘์žฌํ•˜์—ฌ ๋Œ€ํ˜• LLM ์ถ”๋ก  ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

H200 vs. H100 Comparison

Spec | H100 SXM | H200 SXM | Improvement
HBM capacity | 80 GB (HBM3) | 141 GB (HBM3e) | +76%
Memory bandwidth | 3.35 TB/s | 4.8 TB/s | +43%
FP8 Tensor Core | 3,958 TFLOPS | 3,958 TFLOPS | same
NVLink bandwidth | 900 GB/s | 900 GB/s | same
TDP | 700 W | 700 W | same

LLM Inference Performance Gains

H200์˜ ์ถ”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ์™€ ๋Œ€์—ญํญ์€ LLM ์ถ”๋ก ์—์„œ ํŠนํžˆ ํฐ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

๋ชจ๋ธH100 ์ฒ˜๋ฆฌ๋Ÿ‰H200 ์ฒ˜๋ฆฌ๋Ÿ‰ํ–ฅ์ƒ
Llama 2 70B๊ธฐ์ค€1.9๋ฐฐ+90%
Llama 3 70B๊ธฐ์ค€1.6๋ฐฐ+60%
GPT-3 175B๊ธฐ์ค€1.8๋ฐฐ+80%
H200์˜ ํ•ต์‹ฌ ๊ฐ€์น˜ H200์€ ์—ฐ์‚ฐ ์„ฑ๋Šฅ(TFLOPS)์€ H100๊ณผ ๋™์ผํ•˜์ง€๋งŒ, ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰/๋Œ€์—ญํญ ์ฆ๊ฐ€๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ(LLM ์ถ”๋ก , ๊ธด ์ปจํ…์ŠคํŠธ)์—์„œ ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. 70B ๋ชจ๋ธ์„ ๋‹จ์ผ GPU์— ๋กœ๋“œํ•˜๊ฑฐ๋‚˜, KV ์บ์‹œ๋ฅผ ๋” ๋งŽ์ด ์ €์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

3. NVLink & NVSwitch

NVLink Overview

NVLink is NVIDIA's high-speed interconnect for direct GPU-to-GPU connections, providing much higher bandwidth and lower latency than PCIe.

NVLink Bandwidth by Generation

Generation | Architecture | Per-GPU bandwidth (bidirectional) | Per-link speed
NVLink 1.0 | Pascal (P100) | 160 GB/s | 40 GB/s x 4 links
NVLink 2.0 | Volta (V100) | 300 GB/s | 50 GB/s x 6 links
NVLink 3.0 | Ampere (A100) | 600 GB/s | 50 GB/s x 12 links
NVLink 4.0 | Hopper (H100) | 900 GB/s | 50 GB/s x 18 links
NVLink 5.0 | Blackwell (B100/B200) | 1,800 GB/s | 100 GB/s x 18 links
NVLink 6.0 | Rubin (planned) | 3,600 GB/s | 200 GB/s x 18 links
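
A back-of-envelope sketch of what these numbers mean for collectives: the time for one ring all-reduce of a 1 GiB gradient bucket across 8 GPUs at the nominal per-GPU bandwidths quoted above. Real NCCL efficiency is lower, so treat these as best-case figures.

def ring_allreduce_seconds(size_bytes, n_gpus, bw_bytes_per_s):
    # A ring all-reduce moves 2*(N-1)/N of the buffer through each GPU's links.
    traffic = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return traffic / bw_bytes_per_s

GiB = 1024**3
for name, bw in [("PCIe Gen5 x16 (~64 GB/s)", 64e9),
                 ("NVLink 3.0 (600 GB/s)", 600e9),
                 ("NVLink 4.0 (900 GB/s)", 900e9)]:
    t = ring_allreduce_seconds(1 * GiB, 8, bw)
    print(f"{name:26s} ~ {t * 1e3:.2f} ms per 1 GiB bucket")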

NVSwitch

NVSwitch is a switch chip that fully connects (full mesh) all GPUs within a single node.

Spec | NVSwitch 3.0 (A100) | NVSwitch 4.0 (H100)
Total bandwidth | 4.8 TB/s | 7.2 TB/s
Port count | 36x NVLink 3 | 64x NVLink 4
GPU connectivity | 8-GPU all-to-all | 8-GPU all-to-all

NVL72 (GB200 NVL72)

NVIDIA์˜ ์ตœ์‹  GB200 NVL72 ์‹œ์Šคํ…œ์€ 72๊ฐœ์˜ Blackwell GPU๋ฅผ NVLink 5.0์œผ๋กœ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค:

  • ์ด GPU ๋ฉ”๋ชจ๋ฆฌ: 72 x 192GB = 13.8 TB HBM3e
  • ์ด NVLink ๋Œ€์—ญํญ: 130 TB/s
  • 1.4 ExaFLOPS AI ์—ฐ์‚ฐ ์„ฑ๋Šฅ (FP8)
  • ๋‹จ์ผ ์‹œ์Šคํ…œ์—์„œ 27์กฐ(27T) ํŒŒ๋ผ๋ฏธํ„ฐ ์‹ค์‹œ๊ฐ„ ์ถ”๋ก  ๊ฐ€๋Šฅ

4. AWS Trainium

Trainium Overview

AWS Trainium์€ AWS๊ฐ€ ์ž์ฒด ๊ฐœ๋ฐœํ•œ ML ํ•™์Šต ์ „์šฉ ๊ฐ€์†๊ธฐ์ž…๋‹ˆ๋‹ค. NVIDIA GPU ๋Œ€๋น„ ์ตœ๋Œ€ 50% ๋น„์šฉ ์ ˆ๊ฐ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.

Trainium Generations Compared

Spec | Trn1 (Trainium 1) | Trn2 (Trainium 2) | Trn3 (Trainium 3, planned)
Launch | October 2022 | November 2024 | expected late 2025
HBM per chip | 32 GB (HBM2e) | 96 GB (HBM3) | 192 GB (HBM3e)
TFLOPS per chip (BF16) | 210 TFLOPS | 750 TFLOPS | ~1,400 TFLOPS
Instance | trn1.32xlarge | trn2.48xlarge | TBD
Chips per instance | 16 | 16 | TBD
Total HBM per instance | 512 GB | 1.5 TB | 3 TB+
NeuronLink | Gen 1 | Gen 2 (4x bandwidth) | Gen 3
EFA bandwidth | 800 Gbps | 3,200 Gbps | TBD

NeuronLink & NeuronSwitch

  • NeuronLink: direct chip-to-chip links between Trainium chips (AWS's counterpart to NVLink)
  • NeuronSwitch: a switch that connects all Trainium chips within a node
  • Trn2's NeuronLink bandwidth is 4x that of Trn1

Price-Performance

Trainium์˜ ๋น„์šฉ ํšจ์œจ์„ฑ
  • Trn1: P4d (A100) ๋Œ€๋น„ ํ•™์Šต ๋น„์šฉ ์ตœ๋Œ€ 50% ์ ˆ๊ฐ
  • Trn2: P5 (H100) ๋Œ€๋น„ ์œ ์‚ฌ ์„ฑ๋Šฅ, ๋” ๋‚ฎ์€ ๊ฐ€๊ฒฉ
  • AWS Neuron SDK๋กœ PyTorch/JAX ๋„ค์ดํ‹ฐ๋ธŒ ์ง€์›
  • ์ œ์•ฝ: ์ผ๋ถ€ ์—ฐ์‚ฐ์ž/๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜ ํ˜ธํ™˜์„ฑ ํ™•์ธ ํ•„์š”
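
A minimal sketch of what the PyTorch path looks like, assuming the torch-neuronx package from the Neuron SDK (which exposes NeuronCores as PyTorch/XLA devices). Exact APIs, launch commands, and distributed setup depend on the Neuron SDK version, so treat this as orientation rather than a recipe.

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm    # installed alongside torch-neuronx

device = xm.xla_device()                 # a NeuronCore, not a CUDA device
model = nn.Linear(4096, 4096).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096).to(device)
loss = model(x).sum()
loss.backward()
opt.step()
xm.mark_step()                           # flush the lazily built XLA graph to the device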

5. AWS P5/P5e/P5en Instances

P5 Series Detailed Spec Comparison

Spec | P5.48xlarge | P5e.48xlarge | P5en.48xlarge
GPU | 8x H100 SXM | 8x H200 SXM | 8x H200 SXM
GPU memory | 8x 80 GB = 640 GB | 8x 141 GB = 1.1 TB | 8x 141 GB = 1.1 TB
vCPU | 192 | 192 | 192
System memory | 2 TB | 2 TB | 2 TB
NVSwitch | NVSwitch 4.0 | NVSwitch 4.0 | NVSwitch 4.0
GPU-to-GPU bandwidth | 900 GB/s per GPU | 900 GB/s per GPU | 900 GB/s per GPU
EFA bandwidth | 3,200 Gbps | 3,200 Gbps | 6,400 Gbps
EFA adapters | 32x EFA | 32x EFA | 64x EFA
NVMe storage | 8x 3.84 TB | 8x 3.84 TB | 8x 3.84 TB
Total NVMe | 30.7 TB | 30.7 TB | 30.7 TB
Use case | large-scale training | LLM inference, long context | ultra-large-scale training
P5en์˜ 6,400 Gbps EFA P5en์€ EFA ๋Œ€์—ญํญ์ด P5์˜ 2๋ฐฐ(6.4 Tbps)์ž…๋‹ˆ๋‹ค. ์ด๋Š” ์ˆ˜์ฒœ GPU ํด๋Ÿฌ์Šคํ„ฐ์—์„œ FSDP/ZeRO-3์˜ All-Gather/Reduce-Scatter ํ†ต์‹  ๋ณ‘๋ชฉ์„ ํฌ๊ฒŒ ์™„ํ™”ํ•ฉ๋‹ˆ๋‹ค.

6. EFA (Elastic Fabric Adapter)

EFA ์ •์˜

EFA (Elastic Fabric Adapter) is a high-performance network interface developed by AWS and designed for HPC and ML workloads.

OS Bypass ์•„ํ‚คํ…์ฒ˜

EFA์˜ ํ•ต์‹ฌ ๊ธฐ๋Šฅ์€ OS Bypass์ž…๋‹ˆ๋‹ค. ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์ด OS ์ปค๋„์„ ๊ฑฐ์น˜์ง€ ์•Š๊ณ  ๋„คํŠธ์›Œํฌ ํ•˜๋“œ์›จ์–ด์— ์ง์ ‘ ์ ‘๊ทผํ•ฉ๋‹ˆ๋‹ค:

# ์ „ํ†ต์ ์ธ ๋„คํŠธ์›Œํฌ ์Šคํƒ (๋†’์€ ์ง€์—ฐ์‹œ๊ฐ„)
Application โ†’ System Call โ†’ Kernel TCP/IP โ†’ NIC Driver โ†’ NIC โ†’ Network

# EFA OS Bypass (๋‚ฎ์€ ์ง€์—ฐ์‹œ๊ฐ„)
Application โ†’ Libfabric API โ†’ EFA Hardware โ†’ Network
         โ†‘
    ์ปค๋„ ์šฐํšŒ (Direct Memory Access)


# ASCII ๋‹ค์ด์–ด๊ทธ๋žจ: EFA OS Bypass
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      Application                         โ”‚
โ”‚                    (NCCL, MPI ๋“ฑ)                        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚ Libfabric API (User Space)
                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    EFA Device                            โ”‚
โ”‚              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                      โ”‚
โ”‚              โ”‚   SRD Protocol     โ”‚  โ† ํŒจํ‚ท ์†์‹ค ์‹œ ์žฌ์ „์†กโ”‚
โ”‚              โ”‚   (Scalable        โ”‚                      โ”‚
โ”‚              โ”‚    Reliable        โ”‚                      โ”‚
โ”‚              โ”‚    Datagram)       โ”‚                      โ”‚
โ”‚              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚ RDMA (Remote Direct Memory Access)
                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    AWS Network                           โ”‚
โ”‚                 (Petabit Scale)                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

SRD Protocol

SRD (Scalable Reliable Datagram) is a transport protocol developed by AWS:

  • Scales better than InfiniBand Reliable Connection
  • Automatically retransmits on packet loss (reliability)
  • No in-order delivery guarantee → higher throughput
  • Optimized for the characteristics of the AWS network

Libfabric

Libfabric์€ ๊ณ ์„ฑ๋Šฅ ํŒจ๋ธŒ๋ฆญ ์„œ๋น„์Šค๋ฅผ ์œ„ํ•œ ์‚ฌ์šฉ์ž ๊ณต๊ฐ„ API์ž…๋‹ˆ๋‹ค:

  • OpenFabrics ์žฌ๋‹จ์—์„œ ๊ฐœ๋ฐœ
  • InfiniBand, EFA, TCP ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐฑ์—”๋“œ ์ง€์›
  • NCCL, MPI๊ฐ€ Libfabric์„ ํ†ตํ•ด EFA์— ์ ‘๊ทผ
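
A hedged configuration sketch of that path: pointing NCCL at EFA through Libfabric before initializing torch.distributed. The environment variable names are the commonly documented ones; which of them are required (or already defaults) depends on the instance type, AMI, and aws-ofi-nccl version. Launch across nodes with torchrun or srun as usual.

import os
import torch
import torch.distributed as dist

os.environ.setdefault("FI_PROVIDER", "efa")            # ask Libfabric for the EFA provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")   # GPUDirect RDMA (needed on P4d; default on newer types)
os.environ.setdefault("NCCL_DEBUG", "INFO")            # log which transport/provider NCCL actually selected

dist.init_process_group(backend="nccl")                # aws-ofi-nccl plugs EFA in underneath NCCL
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)                                     # the INFO log should report an OFI/EFA provider
dist.destroy_process_group()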

7. EFA ๋Œ€์—ญํญ

EFA Bandwidth by Instance Type

Instance | GPU/accelerator | EFA bandwidth | EFA adapters
P6e (B200) | 8x B200 | 28,800 Gbps (28.8 Tbps) | TBD
P5en.48xlarge | 8x H200 | 6,400 Gbps | 64
P5.48xlarge | 8x H100 | 3,200 Gbps | 32
P5e.48xlarge | 8x H200 | 3,200 Gbps | 32
Trn2.48xlarge | 16x Trainium2 | 3,200 Gbps | 32
Trn1.32xlarge | 16x Trainium1 | 800 Gbps | 8
P4d.24xlarge | 8x A100 | 400 Gbps | 4
P4de.24xlarge | 8x A100 80GB | 400 Gbps | 4
๋Œ€์—ญํญ์˜ ์ค‘์š”์„ฑ FSDP/ZeRO-3์—์„œ All-Gather/Reduce-Scatter ํ†ต์‹ ๋Ÿ‰์€ ๋ชจ๋ธ ํฌ๊ธฐ์— ๋น„๋ก€ํ•ฉ๋‹ˆ๋‹ค. 70B ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ๋งค Forward/Backward๋งˆ๋‹ค ์ˆ˜๋ฐฑ GB์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ๋…ธ๋“œ ๊ฐ„ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค. EFA ๋Œ€์—ญํญ์ด ๋ถ€์กฑํ•˜๋ฉด ํ†ต์‹ ์ด ์—ฐ์‚ฐ๋ณด๋‹ค ์˜ค๋ž˜ ๊ฑธ๋ ค GPU๊ฐ€ ์œ ํœด ์ƒํƒœ๋กœ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค.

8. RDMA Support

What Is RDMA?

RDMA (Remote Direct Memory Access) is a technique for reading and writing a remote system's memory directly, without CPU involvement.

RDMA ๋™์ž‘

# RDMA Write (zero-copy)
Node A GPU Memory ─────────────────→ Node B GPU Memory
                    │
            no CPU/OS involvement;
            the network hardware moves the data directly

# Traditional transfer (multiple copies)
Node A GPU → Node A CPU → Kernel → NIC → Network → NIC → Kernel → Node B CPU → Node B GPU
        copy       copy              copy              copy       copy
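
A small sketch of how you might observe this difference in practice: a two-rank point-to-point transfer with torch.distributed over NCCL, which rides EFA (and GPUDirect RDMA where available) when the aws-ofi-nccl plugin is installed. Launch it with torchrun across two nodes; the 1 GiB buffer size and the crude timing method are purely illustrative.

import os, time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
buf = torch.empty(1024**3 // 2, dtype=torch.float16, device="cuda")   # 1 GiB payload

dist.barrier()
start = time.time()
if rank == 0:
    dist.send(buf, dst=1)          # blocking GPU-to-GPU send
else:
    dist.recv(buf, src=0)
torch.cuda.synchronize()
if rank == 1:
    gb_per_s = buf.numel() * 2 / (time.time() - start) / 1e9
    print(f"observed ~ {gb_per_s:.1f} GB/s")
dist.destroy_process_group()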

RDMA Support by AWS Nitro Generation

Nitro ๋ฒ„์ „์ธ์Šคํ„ด์ŠคEFA RDMA ์ง€์›GPUDirect RDMA
Nitro v4P4d, Trn1๋ถ€๋ถ„ (SRD)๋ฏธ์ง€์›
Nitro v5P5, Trn2์™„์ „ ์ง€์›์ง€์›
Nitro v6P5en, P6e์™„์ „ ์ง€์›์ง€์›
GPUDirect RDMA: supported on P5 and later, GPUDirect RDMA transfers data directly from GPU memory over EFA into another node's GPU memory. Because the path never touches CPU memory, latency drops sharply and there is no CPU overhead.

9. FSx for Lustre

FSx for Lustre Overview

Amazon FSx for Lustre is a high-performance parallel file system, optimized for checkpoint storage and data loading in ML training.

Performance Specs

Spec | Scratch (temporary) | Persistent
Max throughput | 200 MB/s per TiB | 50-1,000 MB/s per TiB
IOPS | millions | millions
Latency | sub-millisecond | sub-millisecond
Durability | none (ephemeral) | replication/backup
Use case | short-lived training, cache | checkpoints, datasets

Storage Classes

์Šคํ† ๋ฆฌ์ง€ ํด๋ž˜์Šค์ฒ˜๋ฆฌ๋Ÿ‰๋น„์šฉ์‚ฌ์šฉ ์‚ฌ๋ก€
SSD (PERSISTENT_1)50-200 MB/s/TiB๋†’์Œ์ง€์—ฐ์‹œ๊ฐ„ ๋ฏผ๊ฐ ์›Œํฌ๋กœ๋“œ
SSD (PERSISTENT_2)125-1000 MB/s/TiB์ค‘๊ฐ„ML ํ•™์Šต ๊ถŒ์žฅ
HDD (PERSISTENT)12-40 MB/s/TiB๋‚ฎ์Œ๋Œ€์šฉ๋Ÿ‰ Cold ๋ฐ์ดํ„ฐ
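
A quick sizing sketch of what these throughput classes mean for checkpointing: the time to write one full checkpoint of a 70B-parameter model, assumed here as BF16 weights plus FP32 master weights and Adam moments (roughly 14 bytes per parameter; actual sizes vary by framework and sharding). The filesystem configurations are illustrative.

params = 70e9
bytes_per_param = 2 + 4 + 4 + 4              # BF16 weights + FP32 master + Adam m, v
ckpt_gb = params * bytes_per_param / 1e9     # ~980 GB

# (filesystem size x per-TiB throughput) -> aggregate GB/s
for name, agg_gb_per_s in [("1.2 TiB @ 125 MB/s/TiB", 0.15),
                           ("4.8 TiB @ 1,000 MB/s/TiB", 4.8),
                           ("48 TiB @ 1,000 MB/s/TiB", 48.0)]:
    minutes = ckpt_gb / agg_gb_per_s / 60
    print(f"{name:26s}: ~{minutes:5.1f} min for a {ckpt_gb:.0f} GB checkpoint")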

File Striping

Lustre stripes large files across multiple OSTs (Object Storage Targets):

  • ํŒŒ์ผ ํฌ๊ธฐ > 100GB: ์ž๋™์œผ๋กœ stripe count 32
  • ๋ณ‘๋ ฌ I/O๋กœ ์ฒ˜๋ฆฌ๋Ÿ‰ ์„ ํ˜• ์ฆ๊ฐ€
  • ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ ์‹œ ๋ชจ๋“  ๋…ธ๋“œ๊ฐ€ ๋™์‹œ์— ์“ฐ๊ธฐ ๊ฐ€๋Šฅ

EFA Throughput

์—ฐ๊ฒฐ ๋ฐฉ์‹์ตœ๋Œ€ ์ฒ˜๋ฆฌ๋Ÿ‰์„ค๋ช…
Standard (ENA)100 Gbps์ผ๋ฐ˜ ๋„คํŠธ์›Œํฌ ์ธํ„ฐํŽ˜์ด์Šค
EFA700 GbpsEFA ์ง์ ‘ ์—ฐ๊ฒฐ
EFA + GPUDirect Storage1,200 GbpsGPU ๋ฉ”๋ชจ๋ฆฌ โ†” FSx ์ง์ ‘ ์ „์†ก
GPUDirect Storage (GDS): GPUDirect Storage supports direct data transfers between GPU memory and storage (FSx, NVMe). Checkpoints can be written from GPU memory to FSx without passing through the CPU, substantially shortening checkpoint save time.

10. EC2 UltraClusters

UltraCluster Overview

EC2 UltraClusters are AWS's cluster configuration for ultra-large-scale ML training.

Specs

Spec | Value
Max GPU count | 20,000+ GPUs
Max compute | 20 ExaFLOPS (FP8)
Network | petabit-scale non-blocking fabric
Storage | FSx for Lustre (TB/s-class throughput)
Supported instances | P5, P5e, P5en, Trn2

Characteristics

  • Non-blocking fabric: equal bandwidth between all nodes (fat-tree topology)
  • Single Availability Zone: all instances sit in the same AZ to minimize latency
  • Placement groups: cluster placement groups optimize network locality
  • Full EFA utilization: every EFA adapter can run at its maximum bandwidth
Example UltraCluster workloads:
  • Amazon Nova: trained with checkpointless training across tens of thousands of accelerators
  • Anthropic Claude: Claude models trained on AWS UltraClusters
  • Stability AI: Stable Diffusion model training

Summary

ํ•ต์‹ฌ ํฌ์ธํŠธ
  • H100 SXM: 80GB HBM3, 900 GB/s NVLink, ๋Œ€๊ทœ๋ชจ ํ•™์Šต์˜ ํ‘œ์ค€
  • H200: 141GB HBM3e, 4.8 TB/s ๋Œ€์—ญํญ, LLM ์ถ”๋ก  ์„ฑ๋Šฅ 1.6-1.9๋ฐฐ ํ–ฅ์ƒ
  • NVLink 4.0: GPU๋‹น 900 GB/s, NVSwitch๋กœ 8 GPU All-to-All 7.2 TB/s
  • Trainium 2: H100 ๋Œ€๋น„ ์œ ์‚ฌ ์„ฑ๋Šฅ, ์ตœ๋Œ€ 50% ๋น„์šฉ ์ ˆ๊ฐ
  • P5en: 6,400 Gbps EFA, ์ดˆ๋Œ€๊ทœ๋ชจ ํ•™์Šต์— ์ตœ์ 
  • EFA: OS Bypass + SRD + Libfabric์œผ๋กœ ์ €์ง€์—ฐ ๊ณ ๋Œ€์—ญํญ ํ†ต์‹ 
  • FSx for Lustre: EFA 700 Gbps, GDS 1,200 Gbps๋กœ ์ฒดํฌํฌ์ธํŠธ ๊ณ ์† ์ €์žฅ
  • UltraClusters: 20,000 GPU, 20 ExaFLOPS, Petabit ๋„คํŠธ์›Œํฌ