GKE์—์„œ ์—ฌ๋Ÿฌ GPU๋กœ LLM ์„œ๋น™


์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ๋Š” ํšจ์œจ์ ์ด๊ณ  ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ์ถ”๋ก ์„ ์œ„ํ•ด GKE์—์„œ GPU ์—ฌ๋Ÿฌ ๊ฐœ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(LLM)์„ ๋ฐฐํฌํ•˜๊ณ  ์„œ๋น™ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. L4 GPU ์—ฌ๋Ÿฌ ๊ฐœ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” GKE ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๋งŒ๋“ค๊ณ  ๋‹ค์Œ ๋ชจ๋ธ์„ ์„œ๋น™ํ•˜๋„๋ก ์ธํ”„๋ผ๋ฅผ ์ค€๋น„ํ•ฉ๋‹ˆ๋‹ค.

๋ชจ๋ธ์˜ ๋ฐ์ดํ„ฐ ํ˜•์‹์— ๋”ฐ๋ผ ํ•„์š”ํ•œ GPU ์ˆ˜๊ฐ€ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ ๊ฐ ๋ชจ๋ธ์€ L4 GPU 2๊ฐœ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ GPU ์–‘ ๊ณ„์‚ฐ์„ ์ฐธ์กฐํ•˜์„ธ์š”.

์ด ํŠœํ† ๋ฆฌ์–ผ์€ LLM์„ ์„œ๋น™ํ•˜๋Š” ๋ฐ Kubernetes ์ปจํ…Œ์ด๋„ˆ ์กฐ์ • ๊ธฐ๋Šฅ์„ ์‚ฌ์šฉํ•˜๋ ค๊ณ  ํ•˜๋Š” ๋จธ์‹ ๋Ÿฌ๋‹(ML) ์—”์ง€๋‹ˆ์–ด, ํ”Œ๋žซํผ ๊ด€๋ฆฌ์ž ๋ฐ ์šด์˜์ž, ๋ฐ์ดํ„ฐ ๋ฐ AI ์ „๋ฌธ๊ฐ€๋ฅผ ๋Œ€์ƒ์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. Google Cloud์ฝ˜ํ…์ธ ์—์„œ ์ฐธ์กฐํ•˜๋Š” ์ผ๋ฐ˜์ ์ธ ์—ญํ• ๊ณผ ์˜ˆ์‹œ ํƒœ์Šคํฌ๋ฅผ ์ž์„ธํžˆ ์•Œ์•„๋ณด๋ ค๋ฉด ์ผ๋ฐ˜ GKE ์‚ฌ์šฉ์ž ์—ญํ•  ๋ฐ ํƒœ์Šคํฌ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

์ด ํŽ˜์ด์ง€๋ฅผ ์ฝ๊ธฐ ์ „ ๋‹ค์Œ ๋‚ด์šฉ์„ ์ˆ™์ง€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Objectives

์ด ๊ฐ€์ด๋“œ์˜ ๋ชฉํ‘œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  1. Create a cluster and node pool.
  2. Prepare your workload.
  3. Deploy your workload.
  4. Interact with the LLM interface.

Before you begin

์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๋‹ค์Œ ํƒœ์Šคํฌ๋ฅผ ์ˆ˜ํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

  • Enable the Google Kubernetes Engine API.
  • ์ด ํƒœ์Šคํฌ์— Google Cloud CLI๋ฅผ ์‚ฌ์šฉํ•˜๋ ค๋ฉด gcloud CLI๋ฅผ ์„ค์น˜ํ•œ ํ›„ ์ดˆ๊ธฐํ™”ํ•˜์„ธ์š”. ์ด์ „์— gcloud CLI๋ฅผ ์„ค์น˜ํ•œ ๊ฒฝ์šฐ gcloud components update๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ์ตœ์‹  ๋ฒ„์ „์„ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.
  • Some models have additional requirements. Make sure that you meet the following requirements:

    • Hugging Face์˜ ๋ชจ๋ธ์— ์•ก์„ธ์Šคํ•˜๋ ค๋ฉด HuggingFace ํ† ํฐ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • Mixtral 8x7b ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ Mistral Mixtral ๋ชจ๋ธ ์กฐ๊ฑด์„ ์ˆ˜๋ฝํ•ฉ๋‹ˆ๋‹ค.
    • Llama 3 70b ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ Meta Llama ๋ชจ๋ธ์˜ ํ™œ์„ฑ ๋ผ์ด์„ ์Šค๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

Prepare your environment

  1. In the Google Cloud console, start a Cloud Shell instance:
    Open Cloud Shell

  2. Set the default environment variables:

    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export CONTROL_PLANE_LOCATION=us-central1
    

    Replace PROJECT_ID with your Google Cloud project ID.

GKE ํด๋Ÿฌ์Šคํ„ฐ ๋ฐ ๋…ธ๋“œ ํ’€ ๋งŒ๋“ค๊ธฐ

You can serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

  1. Cloud Shell์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

    gcloud container clusters create-auto l4-demo \
      --project=${PROJECT_ID} \
      --location=${CONTROL_PLANE_LOCATION} \
      --release-channel=rapid
    

    GKE๋Š” ๋ฐฐํฌ๋œ ์›Œํฌ๋กœ๋“œ์˜ ์š”์ฒญ์— ๋”ฐ๋ผ CPU ๋ฐ GPU ๋…ธ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Autopilot ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

  2. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}
    

Standard

  1. Cloud Shell์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์—ฌ GKE์— ๋Œ€ํ•ด ์›Œํฌ๋กœ๋“œ ์•„์ด๋ดํ‹ฐํ‹ฐ ์ œํœด๋ฅผ ์‚ฌ์šฉํ•˜๋Š” Standard ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

    gcloud container clusters create l4-demo \
      --location ${CONTROL_PLANE_LOCATION} \
      --workload-pool ${PROJECT_ID}.svc.id.goog \
      --enable-image-streaming \
      --node-locations=${CONTROL_PLANE_LOCATION}-a \
      --machine-type n2d-standard-4 \
      --enable-autoscaling \
      --num-nodes 1 --min-nodes 1 --max-nodes 5 \
      --release-channel=rapid
    

    Creating the cluster might take several minutes.

  2. ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ํด๋Ÿฌ์Šคํ„ฐ์— ๋Œ€ํ•ด ๋…ธ๋“œ ํ’€์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

    gcloud container node-pools create g2-standard-24 --cluster l4-demo \
      --location ${CONTROL_PLANE_LOCATION} \
      --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
      --machine-type g2-standard-24 \
      --enable-autoscaling --enable-image-streaming \
      --num-nodes=0 --min-nodes=0 --max-nodes=3 \
      --node-locations ${CONTROL_PLANE_LOCATION}-a,${CONTROL_PLANE_LOCATION}-c \
      --spot
    

    GKE creates the following resources for the LLM:

    • A public Google Kubernetes Engine (GKE) Standard edition cluster.
    • A node pool with the g2-standard-24 machine type scaled down to zero nodes. You aren't charged for any GPUs until you launch Pods that request GPUs. This node pool provisions Spot VMs, which are priced lower than the default standard Compute Engine VMs and provide no guarantee of availability. To use on-demand VMs instead, remove the --spot flag from this command and the cloud.google.com/gke-spot node selector in the text-generation-inference.yaml config.
  3. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}
    

์›Œํฌ๋กœ๋“œ ์ค€๋น„

์ด ์„น์…˜์—์„œ๋Š” ์‚ฌ์šฉํ•  ๋ชจ๋ธ์— ๋”ฐ๋ผ ์›Œํฌ๋กœ๋“œ๋ฅผ ์„ค์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ๋Š” Kubernetes ๋ฐฐํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ๋ฐฐํฌํ•ฉ๋‹ˆ๋‹ค. ๋ฐฐํฌ๋Š” ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ๋…ธ๋“œ ๊ฐ„์— ๋ฐฐํฌ๋˜๋Š” ์—ฌ๋Ÿฌ ํฌ๋“œ ๋ณต์ œ๋ณธ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” Kubernetes API ๊ฐ์ฒด์ž…๋‹ˆ๋‹ค.

Llama 3 70b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN
    

    HUGGING_FACE_TOKEN์„ HuggingFace ํ† ํฐ์œผ๋กœ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

  2. Create a Kubernetes Secret for the Hugging Face token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
    
  3. ๋‹ค์Œ text-generation-inference.yaml ๋ฐฐํฌ ๋งค๋‹ˆํŽ˜์ŠคํŠธ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: meta-llama/Meta-Llama-3-70B-Instruct
            - name: NUM_SHARD
              value: "2"
            - name: MAX_INPUT_TOKENS
              value: "2048"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              # mountPath is set to /tmp as it's the path where the HUGGINGFACE_HUB_CACHE environment
              # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
              # i.e. where the downloaded model from the Hub will be stored
              - mountPath: /tmp
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 150Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • ๋ชจ๋ธ์— 2๊ฐœ์˜ NVIDIA L4 GPU๊ฐ€ ํ•„์š”ํ•˜๋ฏ€๋กœ NUM_SHARD๋Š” 2์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค.
    • QUANTIZE๋Š” bitsandbytes-nf4๋กœ ์„ค์ •๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋ชจ๋ธ์ด 32๋น„ํŠธ ๋Œ€์‹  4๋น„ํŠธ์— ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด GKE๊ฐ€ ํ•„์š”ํ•œ GPU ๋ฉ”๋ชจ๋ฆฌ ์–‘์„ ์ค„์ด๊ณ  ์ถ”๋ก  ์†๋„๋ฅผ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ชจ๋ธ ์ •ํ™•์„ฑ์ด ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์š”์ฒญํ•  GPU๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์€ GPU ์–‘ ๊ณ„์‚ฐ์„ ์ฐธ์กฐํ•˜์„ธ์š”.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    deployment.apps/llm created
    
  5. ๋ชจ๋ธ์˜ ์ƒํƒœ๋ฅผ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

    kubectl get deploy
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           20m
    
  6. View the logs from the running Deployment:

    kubectl logs -l app=llm
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
    

Mixtral 8x7b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN
    

    HUGGING_FACE_TOKEN์„ HuggingFace ํ† ํฐ์œผ๋กœ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

  2. Create a Kubernetes Secret for the Hugging Face token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
    
  3. ๋‹ค์Œ text-generation-inference.yaml ๋ฐฐํฌ ๋งค๋‹ˆํŽ˜์ŠคํŠธ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
            resources:
              requests:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: mistralai/Mixtral-8x7B-Instruct-v0.1
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN          
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              # mountPath is set to /tmp as it's the path where the HF_HOME environment
              # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
              # i.e. where the downloaded model from the Hub will be stored
              - mountPath: /tmp
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 100Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • ๋ชจ๋ธ์— 2๊ฐœ์˜ NVIDIA L4 GPU๊ฐ€ ํ•„์š”ํ•˜๋ฏ€๋กœ NUM_SHARD๋Š” 2์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค.
    • QUANTIZE๋Š” bitsandbytes-nf4๋กœ ์„ค์ •๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋ชจ๋ธ์ด 32๋น„ํŠธ ๋Œ€์‹  4๋น„ํŠธ์— ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด GKE๊ฐ€ ํ•„์š”ํ•œ GPU ๋ฉ”๋ชจ๋ฆฌ ์–‘์„ ์ค„์ด๊ณ  ์ถ”๋ก  ์†๋„๋ฅผ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ชจ๋ธ ์ •ํ™•๋„๊ฐ€ ์ค„์–ด๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์š”์ฒญํ•  GPU๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์€ GPU ์–‘ ๊ณ„์‚ฐ์„ ์ฐธ์กฐํ•˜์„ธ์š”.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    deployment.apps/llm created
    
  5. ๋ชจ๋ธ์˜ ์ƒํƒœ๋ฅผ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

    watch kubectl get deploy
    

    ๋ฐฐํฌ๊ฐ€ ์ค€๋น„๋˜๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m
    

    ํ™•์ธ์„ ์ข…๋ฃŒํ•˜๋ ค๋ฉด CTRL + C๋ฅผ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.

  6. View the logs from the running Deployment:

    kubectl logs -l app=llm
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
    

Falcon 40b

  1. ๋‹ค์Œ text-generation-inference.yaml ๋ฐฐํฌ ๋งค๋‹ˆํŽ˜์ŠคํŠธ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: tiiuae/falcon-40b-instruct
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              # mountPath is set to /data as it's the path where the HUGGINGFACE_HUB_CACHE environment
              # variable points to in the TGI container image i.e. where the downloaded model from the Hub will be
              # stored
              - mountPath: /data
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 175Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • ๋ชจ๋ธ์— 2๊ฐœ์˜ NVIDIA L4 GPU๊ฐ€ ํ•„์š”ํ•˜๋ฏ€๋กœ NUM_SHARD๋Š” 2์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค.
    • QUANTIZE๋Š” bitsandbytes-nf4๋กœ ์„ค์ •๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋ชจ๋ธ์ด 32๋น„ํŠธ ๋Œ€์‹  4๋น„ํŠธ์— ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด GKE๊ฐ€ ํ•„์š”ํ•œ GPU ๋ฉ”๋ชจ๋ฆฌ ์–‘์„ ์ค„์ด๊ณ  ์ถ”๋ก  ์†๋„๋ฅผ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ชจ๋ธ ์ •ํ™•์„ฑ์ด ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์š”์ฒญํ•  GPU๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์€ GPU ์–‘ ๊ณ„์‚ฐ์„ ์ฐธ์กฐํ•˜์„ธ์š”.
  2. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    deployment.apps/llm created
    
  3. ๋ชจ๋ธ์˜ ์ƒํƒœ๋ฅผ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

    watch kubectl get deploy
    

    ๋ฐฐํฌ๊ฐ€ ์ค€๋น„๋˜๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m
    

    ํ™•์ธ์„ ์ข…๋ฃŒํ•˜๋ ค๋ฉด CTRL + C๋ฅผ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.

  4. View the logs from the running Deployment:

    kubectl logs -l app=llm
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
    

Create a Service of type ClusterIP

Expose your Pods internally within the cluster so that they can be discovered and accessed by other applications.

  1. ๋‹ค์Œ llm-service.yaml ๋งค๋‹ˆํŽ˜์ŠคํŠธ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: llm
      type: ClusterIP
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8080
    
  2. Apply the manifest:

    kubectl apply -f llm-service.yaml
    

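With the Service in place, other workloads in the cluster can reach the model at http://llm-service, and you can reach it from your workstation by running kubectl port-forward service/llm-service 8080:80 in a separate terminal. The sketch below is a minimal example, assuming that port-forward is active; it posts a prompt to the TGI server's /generate endpoint (the helper names are illustrative, not part of this tutorial):

```python
import json
import urllib.request

# TGI's /generate endpoint accepts a JSON body with the prompt under
# "inputs" and generation options under "parameters".
def build_generate_request(prompt: str, max_new_tokens: int = 128) -> dict:
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }

def query_llm(base_url: str, prompt: str) -> str:
    """POST the prompt to the TGI server and return the generated text."""
    body = json.dumps(build_generate_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["generated_text"]

# Example (with `kubectl port-forward service/llm-service 8080:80` running):
#   print(query_llm("http://localhost:8080", "What is Kubernetes?"))
```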
์ฑ„ํŒ… ์ธํ„ฐํŽ˜์ด์Šค ๋ฐฐํฌ

Use Gradio to build a web application that lets you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Llama 3 70b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "meta-llama/Meta-Llama-3-70B-Instruct"
            - name: USER_PROMPT
              value: "<|begin_of_text|><|start_header_id|>user<|end_header_id|> prompt <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
            - name: SYSTEM_PROMPT
              value: "prompt <|eot_id|>"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
    
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
    
  3. ์„œ๋น„์Šค์˜ ์™ธ๋ถ€ IP ์ฃผ์†Œ๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค.

    kubectl get svc
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
    
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP
    
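The USER_PROMPT value in the manifest wraps the user's message in the Llama 3 chat template. A minimal sketch of that substitution, assuming the chat application replaces the literal placeholder word prompt with the user's text (an assumption about the sample app, not documented behavior):

```python
# USER_PROMPT value from the gradio.yaml manifest; the word "prompt" is a
# placeholder for the user's message.
USER_PROMPT = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|> prompt "
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

def render_prompt(template: str, user_message: str) -> str:
    """Substitute the user's message into the chat template."""
    return template.replace("prompt", user_message, 1)

print(render_prompt(USER_PROMPT, "What is GKE?"))
```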

Mixtral 8x7b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "mixtral-8x7b"
            - name: USER_PROMPT
              value: "[INST] prompt [/INST]"
            - name: SYSTEM_PROMPT
              value: "prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
    
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
    
  3. ์„œ๋น„์Šค์˜ ์™ธ๋ถ€ IP ์ฃผ์†Œ๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค.

    kubectl get svc
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
    
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP
    

Falcon 40b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "falcon-40b-instruct"
            - name: USER_PROMPT
              value: "User: prompt"
            - name: SYSTEM_PROMPT
              value: "Assistant: prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
    
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
    
  3. ์„œ๋น„์Šค์˜ ์™ธ๋ถ€ IP ์ฃผ์†Œ๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค.

    kubectl get svc
    

    ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
    
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP
    

Calculating the amount of GPUs

GPU ์–‘์€ QUANTIZE ํ”Œ๋ž˜๊ทธ์˜ ๊ฐ’์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ๋Š” QUANTIZE๊ฐ€ bitsandbytes-nf4๋กœ ์„ค์ •๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋ชจ๋ธ์ด 4๋น„ํŠธ๋กœ ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.

A 70-billion-parameter model requires a minimum of 40 GB of GPU memory, which equals 70 billion times 4 bits (70 billion × 4 bits = 35 GB) plus 5 GB of overhead. In this case, a single L4 GPU doesn't have enough memory. Therefore, the examples in this tutorial use two L4 GPUs of memory (2 × 24 = 48 GB). This configuration is sufficient for running Falcon 40b or Llama 3 70b on L4 GPUs.
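The arithmetic above can be sketched as a small helper. This is a minimal example; the 5 GB overhead and the 24 GB of memory per L4 GPU are the figures this tutorial uses:

```python
import math

GPU_MEMORY_GB = 24   # memory of a single NVIDIA L4 GPU
OVERHEAD_GB = 5      # serving overhead assumed in this tutorial

def gpus_needed(num_params: float, bits_per_param: int) -> int:
    """Return the number of L4 GPUs needed to hold the model weights."""
    weights_gb = num_params * bits_per_param / 8 / 1e9  # bits -> gigabytes
    total_gb = weights_gb + OVERHEAD_GB
    return math.ceil(total_gb / GPU_MEMORY_GB)

# A 70B-parameter model quantized to 4 bits: 35 GB + 5 GB = 40 GB -> 2 GPUs.
print(gpus_needed(70e9, 4))  # → 2
```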

Clean up

์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ ์‚ฌ์šฉ๋œ ๋ฆฌ์†Œ์Šค ๋น„์šฉ์ด Google Cloud ๊ณ„์ •์— ์ฒญ๊ตฌ๋˜์ง€ ์•Š๋„๋ก ํ•˜๋ ค๋ฉด ๋ฆฌ์†Œ์Šค๊ฐ€ ํฌํ•จ๋œ ํ”„๋กœ์ ํŠธ๋ฅผ ์‚ญ์ œํ•˜๊ฑฐ๋‚˜ ํ”„๋กœ์ ํŠธ๋ฅผ ์œ ์ง€ํ•˜๊ณ  ๊ฐœ๋ณ„ ๋ฆฌ์†Œ์Šค๋ฅผ ์‚ญ์ œํ•˜์„ธ์š”.

Delete the cluster

์ด ๊ฐ€์ด๋“œ์—์„œ ๋งŒ๋“  ๋ฆฌ์†Œ์Šค์— ๋Œ€ํ•œ ๋น„์šฉ์ด Google Cloud ๊ณ„์ •์— ์ฒญ๊ตฌ๋˜์ง€ ์•Š๊ฒŒ ํ•˜๋ ค๋ฉด GKE ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์‚ญ์ œํ•ฉ๋‹ˆ๋‹ค.

gcloud container clusters delete l4-demo --location ${CONTROL_PLANE_LOCATION}

๋‹ค์Œ ๋‹จ๊ณ„