Distributed training

This page describes how to run distributed training jobs on Vertex AI.

Code requirements

Use an ML framework that supports distributed training. In your training code, you can use the CLUSTER_SPEC or TF_CONFIG environment variables to reference specific parts of your training cluster.

Structure of the training cluster

If you run a distributed training job with Vertex AI, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. A group of replicas with the same configuration is called a worker pool.

Each replica in the training cluster is given a single role or task in distributed training. For example:

  • Primary replica: Exactly one replica is designated the primary replica. This task manages the others and reports status for the job as a whole.

  • Workers: One or more replicas can be designated as workers. These replicas do their portion of the work as you designate in your job configuration.

  • Parameter servers: If your ML framework supports it, one or more replicas can be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.

  • Evaluators: If your ML framework supports it, one or more replicas can be designated as evaluators. These replicas can be used to evaluate your model. If you use TensorFlow, note that TensorFlow generally expects you to use no more than one evaluator.

Configure a distributed training job

You can configure any custom training job as a distributed training job by defining multiple worker pools. You can also run distributed training within a training pipeline or a hyperparameter tuning job.

To configure a distributed training job, define your list of worker pools (workerPoolSpecs[]), designating one WorkerPoolSpec for each type of task:

Position in workerPoolSpecs[]    Task performed in cluster
First (workerPoolSpecs[0])       Primary, chief, scheduler, or "master"
Second (workerPoolSpecs[1])      Secondary, replicas, workers
Third (workerPoolSpecs[2])       Parameter servers, Reduction Server
Fourth (workerPoolSpecs[3])      Evaluators

You must specify a primary replica, which coordinates the work done by all the other replicas. Use the first worker pool specification only for your primary replica, and set its replicaCount to 1:

{
  "workerPoolSpecs": [
     // `WorkerPoolSpec` for worker pool 0, primary replica, required
     {
       "machineSpec": {...},
       "replicaCount": 1,
       "diskSpec": {...},
       ...
     },
     // `WorkerPoolSpec` for worker pool 1, optional
     {},
     // `WorkerPoolSpec` for worker pool 2, optional
     {},
     // `WorkerPoolSpec` for worker pool 3, optional
     {}
   ]
   ...
}

Specify additional worker pools

Depending on your ML framework, you can specify additional worker pools for other purposes. For example, if you use TensorFlow, you can specify worker pools to configure worker replicas, parameter server replicas, and evaluator replicas.

The order of the worker pools in the workerPoolSpecs[] list determines the type of each worker pool. Set an empty value for any worker pool that you don't want to use. This lets you skip that position in the workerPoolSpecs[] list and still specify the worker pools that you do want to use. For example:

To specify a job that has only a primary replica and a parameter server worker pool, you must set an empty value for worker pool 1:

{
  "workerPoolSpecs": [
     // `WorkerPoolSpec` for worker pool 0, required
     {
       "machineSpec": {...},
       "replicaCount": 1,
       "diskSpec": {...},
       ...
     },
     // `WorkerPoolSpec` for worker pool 1, optional
     {},
     // `WorkerPoolSpec` for worker pool 2, optional
     {
       "machineSpec": {...},
       "replicaCount": 1,
       "diskSpec": {...},
       ...
     },
     // `WorkerPoolSpec` for worker pool 3, optional
     {}
   ]
   ...
}
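The position rule above can be sketched in Python. This is only an illustration under stated assumptions: the role labels in POOL_ORDER, the helper name build_worker_pool_specs, and the n1-standard-4 machine type are placeholders chosen for this sketch, not names defined by Vertex AI. The helper places each pool at the index its role requires and fills unused positions with empty objects.

```python
import json

# Illustrative role labels for this sketch (not Vertex AI identifiers).
# Their order mirrors workerPoolSpecs[0] through workerPoolSpecs[3].
POOL_ORDER = ["primary", "worker", "parameter_server", "evaluator"]


def build_worker_pool_specs(pools):
    """Return a four-element workerPoolSpecs list ordered by role.

    `pools` maps a label from POOL_ORDER to a WorkerPoolSpec-style dict.
    Roles that are not supplied become empty dicts so that every other
    pool keeps the position that determines its type.
    """
    unknown = set(pools) - set(POOL_ORDER)
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    if "primary" not in pools:
        raise ValueError("a primary replica pool is required")
    if pools["primary"].get("replicaCount") != 1:
        raise ValueError("the primary pool must set replicaCount to 1")
    return [pools.get(role, {}) for role in POOL_ORDER]


# Placeholder machine type, chosen only for illustration.
example_pool = {"machineSpec": {"machineType": "n1-standard-4"}, "replicaCount": 1}

specs = build_worker_pool_specs(
    {"primary": example_pool, "parameter_server": example_pool}
)
print(json.dumps({"workerPoolSpecs": specs}, indent=2))
```

With only a primary replica and a parameter server supplied, positions 1 and 3 come out as empty objects, matching the layout of the JSON example above.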

Reduce training time with Reduction Server

When you train a large ML model on multiple nodes, communicating gradients between nodes can add significant latency. Reduction Server is an all-reduce algorithm that can increase throughput and reduce latency for distributed training. Vertex AI makes Reduction Server available in a Docker container image that you can use for one of your worker pools during distributed training.

To learn how Reduction Server works, see Faster distributed GPU training with Reduction Server on Vertex AI.

Prerequisites

You can use Reduction Server if you meet the following requirements:

  • You are performing distributed training with GPU workers.

  • Your training code uses TensorFlow or PyTorch and is configured for multi-host data-parallel training on GPUs using NCCL all-reduce. (You might also be able to use other ML frameworks that use NCCL.)

  • The containers running on your primary node (workerPoolSpecs[0]) and workers (workerPoolSpecs[1]) support Reduction Server. Specifically, each container is one of the following:

Train using Reduction Server

To use Reduction Server, do the following when you create a custom training resource:

  1. Specify one of the following URIs in the containerSpec.imageUri field of the third worker pool (workerPoolSpecs[2]):

    • us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest
    • europe-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest
    • asia-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest

    Choosing the multi-region closest to the location where you perform custom training might reduce latency.

  2. When you select the machine type and number of nodes for the third worker pool, make sure that the total network bandwidth of the third worker pool matches or exceeds the total network bandwidth of the first and second worker pools.

    To learn about the maximum available bandwidth of each node in the second worker pool, see Network bandwidth and GPUs.

    Reduction Server nodes don't use GPUs. To learn about the maximum available bandwidth of each node in the third worker pool, see the "Maximum egress bandwidth (Gbps)" column in General-purpose machine family.

    For example, if you configure the first and second worker pools to use 5 n1-highmem-96 nodes, each with 8 NVIDIA_TESLA_V100 GPUs, then each node has a maximum available bandwidth of 100 Gbps, for a total bandwidth of 500 Gbps. To match this bandwidth in the third worker pool, you might use 16 n1-highcpu-16 nodes, each with a maximum bandwidth of 32 Gbps, for a total bandwidth of 512 Gbps.

    We recommend the n1-highcpu-16 machine type for Reduction Server nodes, because it offers relatively high bandwidth for its resources.

The following command shows an example of how to create a CustomJob resource that uses Reduction Server:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=n1-highmem-96,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,accelerator-count=8,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI \
  --worker-pool-spec=machine-type=n1-highmem-96,replica-count=4,accelerator-type=NVIDIA_TESLA_V100,accelerator-count=8,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI \
  --worker-pool-spec=machine-type=n1-highcpu-16,replica-count=16,container-image-uri=us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest

์ž์„ธํ•œ ๋‚ด์šฉ์€ CustomJob ๋งŒ๋“ค๊ธฐ ๊ฐ€์ด๋“œ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

Best practices for training using Reduction Server

Machine type and count

In Reduction Server training, each worker needs to connect to all of the reducer hosts. To minimize the number of connections on the worker host, use the machine type with the highest network bandwidth for your reducer hosts.

A good choice for reducer hosts is a general-purpose N1/N2 VM with at least 16 vCPUs that provides 32 Gbps of egress bandwidth, such as n1-highcpu-16 or n2-highcpu-16. Tier 1 VM bandwidth for N1/N2 VMs raises the maximum egress bandwidth to between 50 Gbps and 100 Gbps, which makes these VMs a good fit for reducer nodes.

The total egress bandwidth of the workers and the reducers should be the same. For example, if you use 8 a2-megagpu-16g VMs as workers, you should use at least 25 n1-highcpu-16 VMs as reducers.

`(8 worker VMs * 100 Gbps) / 32 Gbps egress = 25 reducer VMs`.
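The sizing rule can be written as a small calculation. The following is a minimal sketch; the function name reducers_needed is a hypothetical helper introduced here, and the bandwidth figures are the ones quoted in this page. It rounds up, because a fractional reducer would leave the reducer pool short of the workers' total bandwidth.

```python
import math


def reducers_needed(worker_count, worker_gbps, reducer_gbps):
    """Smallest reducer count whose total egress bandwidth is at least
    the total egress bandwidth of the workers."""
    total_worker_gbps = worker_count * worker_gbps
    return math.ceil(total_worker_gbps / reducer_gbps)


# 8 a2-megagpu-16g workers at 100 Gbps, n1-highcpu-16 reducers at 32 Gbps.
print(reducers_needed(8, 100, 32))  # 25

# 5 n1-highmem-96 nodes at 100 Gbps, n1-highcpu-16 reducers at 32 Gbps.
print(reducers_needed(5, 100, 32))  # 16
```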

์ž‘์€ ๋ฉ”์‹œ์ง€ ์ผ๊ด„ ์ฒ˜๋ฆฌ

Reduction Server๋Š” ์ง‘๊ณ„ํ•  ๋ฉ”์‹œ์ง€๊ฐ€ ์ถฉ๋ถ„ํžˆ ํฐ ๊ฒฝ์šฐ ๊ฐ€์žฅ ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ ML ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์ „์ฒด ์ถ•์†Œ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์ „์— ์ž‘์€ ๊ฒฝ์‚ฌ ํ…์„œ๋ฅผ ์ผ๊ด„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋‹ค๋ฅธ ์šฉ์–ด์˜ ๊ธฐ์ˆ ์„ ์ด๋ฏธ ์ œ๊ณตํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Horovod

Horovod supports Tensor Fusion to batch small tensors for all-reduce. Tensors are copied into a fusion buffer until the buffer is full, and then the all-reduce operation runs on the buffer. You can adjust the size of the fusion buffer by setting the HOROVOD_FUSION_THRESHOLD environment variable.

The recommended value for the HOROVOD_FUSION_THRESHOLD environment variable is at least 128 MB. In that case, set HOROVOD_FUSION_THRESHOLD to 134217728 (128 * 1024 * 1024).
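One way to apply this setting from Python is sketched below. It assumes the variable is set in the training process before Horovod is initialized; the helper name configure_horovod_fusion is hypothetical. You could equally set the variable in the container environment instead of in code.

```python
import os

# 128 MB expressed in bytes: 128 * 1024 * 1024 = 134217728.
FUSION_THRESHOLD_BYTES = 128 * 1024 * 1024


def configure_horovod_fusion(threshold_bytes=FUSION_THRESHOLD_BYTES):
    """Set HOROVOD_FUSION_THRESHOLD for this process.

    Assumption of this sketch: this runs before Horovod is initialized,
    so that Horovod sees the value when it starts.
    """
    os.environ["HOROVOD_FUSION_THRESHOLD"] = str(threshold_bytes)
    return os.environ["HOROVOD_FUSION_THRESHOLD"]


print(configure_horovod_fusion())  # 134217728
```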

PyTorch

PyTorch DistributedDataParallel supports batching messages as "gradient bucketing". Set the bucket_cap_mb parameter in the DistributedDataParallel constructor to control the size of the batch buckets. The default size is 25 MB.

Best practice: The recommended value of bucket_cap_mb is 64 (64 MB).
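A minimal sketch of passing this value follows. It assumes PyTorch is installed, that the process group has already been initialized (for example with the NCCL backend), and that the model is already on the device for this process. The helper name wrap_with_ddp is hypothetical, and torch is imported inside the function so the snippet can be loaded on a machine without PyTorch.

```python
# Recommended bucket size from this page, in megabytes.
RECOMMENDED_BUCKET_CAP_MB = 64


def wrap_with_ddp(model, bucket_cap_mb=RECOMMENDED_BUCKET_CAP_MB):
    """Wrap `model` in DistributedDataParallel with a larger gradient bucket.

    Assumptions of this sketch: torch is installed,
    torch.distributed.init_process_group() has already been called, and
    `model` already lives on the device used by this process.
    """
    from torch.nn.parallel import DistributedDataParallel

    return DistributedDataParallel(model, bucket_cap_mb=bucket_cap_mb)
```

In a training script you would call wrap_with_ddp(model) once, after moving the model to its GPU and before building the optimizer loop.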

Environment variables for your cluster

Vertex AI populates an environment variable, CLUSTER_SPEC, on every replica to describe how the overall cluster is set up. Like TensorFlow's TF_CONFIG, CLUSTER_SPEC describes every replica in the cluster, including its index and role (primary replica, worker, parameter server, or evaluator).

When you run distributed training with TensorFlow, TF_CONFIG is parsed to build tf.train.ClusterSpec. Similarly, when you run distributed training with another ML framework, you must parse CLUSTER_SPEC to populate any environment variables or settings that the framework requires.

Format of CLUSTER_SPEC

The CLUSTER_SPEC environment variable is a JSON string with the following format:

Key: Description

  • "cluster": The cluster description for your custom container. As with TF_CONFIG, this object is formatted as a TensorFlow cluster specification and can be passed to the constructor of tf.train.ClusterSpec. The cluster description contains a list of replica names for each worker pool that you specify.

    • "workerpool0": All distributed training jobs have one primary replica in the first worker pool.
    • "workerpool1": This worker pool contains worker replicas, if you specified them when you created your job.
    • "workerpool2": This worker pool contains parameter servers, if you specified them when you created your job.
    • "workerpool3": This worker pool contains evaluators, if you specified them when you created your job.

  • "environment": The string cloud.

  • "task": Describes the task of the particular node on which your code is running. You can use this information to write code for specific workers in a distributed job. This entry is a dictionary with the following keys:

    • "type": The type of worker pool that this task runs in. For example, "workerpool0" refers to the primary replica.
    • "index": The zero-based index of the task. For example, if your training job includes two workers, this value is set to 0 on one of them and 1 on the other.
    • "trial": The identifier of the hyperparameter tuning trial that is currently running. When you configure hyperparameter tuning for your job, you set a number of trials to train. This value lets your code distinguish between running trials. The identifier is a string value that contains the trial number, starting at 1.

  • "job": The CustomJobSpec that you provided to create the current training job, represented as a dictionary.

Example CLUSTER_SPEC

Here is an example value:

{
   "cluster":{
      "workerpool0":[
         "cmle-training-workerpool0-ab-0:2222"
      ],
      "workerpool1":[
         "cmle-training-workerpool1-ab-0:2222",
         "cmle-training-workerpool1-ab-1:2222"
      ],
      "workerpool2":[
         "cmle-training-workerpool2-ab-0:2222",
         "cmle-training-workerpool2-ab-1:2222"
      ],
      "workerpool3":[
         "cmle-training-workerpool3-ab-0:2222",
         "cmle-training-workerpool3-ab-1:2222",
         "cmle-training-workerpool3-ab-2:2222"
      ]
   },
   "environment":"cloud",
   "task":{
      "type":"workerpool0",
      "index":0,
      "trial":"TRIAL_ID"
   },
   "job": {
      ...
   }
}
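For frameworks other than TensorFlow, you parse this value yourself. The sketch below shows one possible mapping from CLUSTER_SPEC to the rank, world size, and coordinator address that many data-parallel frameworks expect. The function name describe_replica and the ranking scheme (primary replica first, then workers in index order, with parameter servers and evaluators excluded) are assumptions of this sketch, not behavior defined by Vertex AI.

```python
import json

# Assumption of this sketch: only these pools take part in data-parallel training.
TRAINING_POOLS = ("workerpool0", "workerpool1")


def describe_replica(cluster_spec_json):
    """Derive rank, world size, and coordinator address from CLUSTER_SPEC."""
    spec = json.loads(cluster_spec_json)
    cluster = spec["cluster"]
    task = spec["task"]

    pools = [name for name in TRAINING_POOLS if name in cluster]
    if task["type"] not in pools:
        raise ValueError(f"{task['type']} is not a training pool in this sketch")

    # Rank = number of replicas in earlier pools + index within the own pool.
    rank = 0
    for name in pools:
        if name == task["type"]:
            rank += task["index"]
            break
        rank += len(cluster[name])

    world_size = sum(len(cluster[name]) for name in pools)
    # Use the primary replica's address as the coordinator address.
    host, port = cluster["workerpool0"][0].rsplit(":", 1)
    return {
        "rank": rank,
        "world_size": world_size,
        "master_addr": host,
        "master_port": int(port),
    }
```

In a training script you might call describe_replica(os.environ["CLUSTER_SPEC"]) and pass the result to your framework's initialization routine.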

Format of TF_CONFIG

In addition to CLUSTER_SPEC, Vertex AI sets the TF_CONFIG environment variable on each replica of every distributed training job. Vertex AI does not set TF_CONFIG for single-replica training jobs.

CLUSTER_SPEC and TF_CONFIG share some values, but they have different formats. Both environment variables include additional fields beyond what TensorFlow requires.

Distributed training with TensorFlow works the same way with a custom container as it does with a prebuilt container.

The TF_CONFIG environment variable is a JSON string with the following format:

TF_CONFIG fields

  • cluster: The TensorFlow cluster description. A dictionary that maps one or more task names (chief, worker, ps, or master) to lists of the network addresses where these tasks run. For a given training job, this dictionary is the same on every VM.

    This is a valid first argument for the tf.train.ClusterSpec constructor. This dictionary never contains evaluator as a key, because evaluators are not considered part of the training cluster even if you use them for your job.

  • task: The task description of the VM where this environment variable is set. For a given training job, this dictionary is different on every VM. You can use this information to customize the code that runs on each VM in a distributed training job. You can also use it to change the behavior of your training code for different trials of a hyperparameter tuning job.

    This dictionary includes the following key-value pairs:

    task fields

    • type: The type of task that this VM performs. This value is set to worker on workers, ps on parameter servers, and evaluator on evaluators. On your job's master worker, the value is set to either chief or master.
    • index: The zero-based index of the task. For example, if your training job includes two workers, this value is set to 0 on one of them and 1 on the other.
    • trial: The ID of the hyperparameter tuning trial currently running on this VM. This field is set only if the current training job is a hyperparameter tuning job. For hyperparameter tuning jobs, Vertex AI runs your training code repeatedly in many trials, with different hyperparameters each time. This field contains the current trial number, starting at 1 for the first trial.
    • cloud: An ID that Vertex AI uses internally. You can ignore this field.

  • job: The CustomJobSpec that you provided to create the current training job, represented as a dictionary.

  • environment: The string cloud.

Example TF_CONFIG

The following example code prints the TF_CONFIG environment variable to your training logs:

import json
import os

# TF_CONFIG is not set for single-replica jobs, so fall back to an empty object
tf_config_str = os.environ.get('TF_CONFIG', '{}')
tf_config_dict = json.loads(tf_config_str)

# Convert back to string just for pretty printing
print(json.dumps(tf_config_dict, indent=2))

In a hyperparameter tuning job that runs in runtime version 2.1 or later and uses a master worker, two workers, and a parameter server, this code produces the following log for one of the workers during the first hyperparameter tuning trial. The example output hides the job field for conciseness and replaces some IDs with generic values.

{
  "cluster": {
    "chief": [
      "training-workerpool0-[ID_STRING_1]-0:2222"
    ],
    "ps": [
      "training-workerpool2-[ID_STRING_1]-0:2222"
    ],
    "worker": [
      "training-workerpool1-[ID_STRING_1]-0:2222",
      "training-workerpool1-[ID_STRING_1]-1:2222"
    ]
  },
  "environment": "cloud",
  "job": {
    ...
  },
  "task": {
    "cloud": "[ID_STRING_2]",
    "index": 0,
    "trial": "1",
    "type": "worker"
  }
}

When to use TF_CONFIG

TF_CONFIG is set only for distributed training jobs.

You likely don't need to interact with the TF_CONFIG environment variable directly in your training code. Access TF_CONFIG only if TensorFlow's distribution strategies and Vertex AI's standard hyperparameter tuning workflow, both described in the next sections, don't work for your job.

Distributed training

Vertex AI sets the TF_CONFIG environment variable to extend the specifications that TensorFlow requires for distributed training.

To perform distributed training with TensorFlow, use the tf.distribute.Strategy API. In particular, we recommend that you use the Keras API together with MultiWorkerMirroredStrategy or, if you specify parameter servers for your job, ParameterServerStrategy. Note, however, that TensorFlow provides only experimental support for these strategies.

These distribution strategies use the TF_CONFIG environment variable to assign roles to each VM in your training job and to facilitate communication between the VMs. You don't need to access TF_CONFIG directly in your training code, because TensorFlow handles it for you.
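As a rough illustration of this point, the sketch below builds a Keras model under MultiWorkerMirroredStrategy without reading TF_CONFIG in user code. It assumes a TensorFlow release that exposes tf.distribute.MultiWorkerMirroredStrategy (2.4 or later). The function name, layer sizes, optimizer, and loss are placeholders chosen for this sketch. TensorFlow is imported inside the function so the snippet can be loaded on a machine without it.

```python
def build_compiled_model():
    """Create and compile a small Keras model for multi-worker training.

    Assumption of this sketch: TensorFlow 2.4 or later is installed and
    TF_CONFIG has been set on this replica by the training service.
    """
    import tensorflow as tf

    # The strategy reads TF_CONFIG itself; the code here never parses it.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Variables created inside the scope are mirrored across workers.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(8,)),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
    return model
```

Every replica runs the same script; the strategy decides each replica's role from TF_CONFIG.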

Parse the TF_CONFIG environment variable directly only if you want to customize how the different VMs that run your training job behave.
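If you do need per-VM behavior, a sketch such as the following reads the task type and index from TF_CONFIG. The helper names task_info and is_chief are hypothetical, and treating a missing TF_CONFIG as the chief is a choice made for this sketch so that single-replica jobs still work.

```python
import json
import os


def task_info(tf_config_json=None):
    """Return (type, index) from TF_CONFIG, or (None, None) if it is unset."""
    raw = tf_config_json if tf_config_json is not None else os.environ.get("TF_CONFIG")
    task = json.loads(raw or "{}").get("task", {})
    return task.get("type"), task.get("index")


def is_chief(tf_config_json=None):
    """True on the master worker (type chief or master).

    Sketch assumption: a replica without TF_CONFIG is a single-replica
    job and is treated as the chief.
    """
    task_type, _ = task_info(tf_config_json)
    return task_type in (None, "chief", "master")
```

For example, a script could guard a one-time step such as exporting the final model with `if is_chief():` so that only one VM performs it.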

Hyperparameter tuning

When you run a hyperparameter tuning job, Vertex AI provides different arguments to your training code for each trial. Your training code does not necessarily need to know which trial is currently running. In addition, you can monitor the progress of hyperparameter tuning jobs in the Google Cloud console.

If needed, your code can read the current trial number from the trial field of the task field of the TF_CONFIG environment variable.
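Reading the trial number might look like the following minimal sketch. The function name current_trial is hypothetical. It relies only on the format described on this page, where trial is a string that holds a number starting at 1 and is present only for hyperparameter tuning jobs.

```python
import json
import os


def current_trial(tf_config_json=None):
    """Return the current trial number as an int, or None when the job
    is not a hyperparameter tuning job (no trial field is present)."""
    raw = tf_config_json if tf_config_json is not None else os.environ.get("TF_CONFIG")
    trial = json.loads(raw or "{}").get("task", {}).get("trial")
    return int(trial) if trial is not None else None
```

A script could use the returned number to, for instance, write each trial's outputs to a separate subdirectory.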

๋‹ค์Œ ๋‹จ๊ณ„