GKE에서 여러 GPU로 LLM 서빙

Autopilot Standard

이 튜토리얼에서는 효율적이고 확장 가능한 추론을 위해 GKE에서 GPU 여러 개를 사용하여 대규모 언어 모델(LLM)을 배포하고 서빙하는 방법을 보여줍니다. L4 GPU 여러 개를 사용하는 GKE 클러스터를 만들고 다음 모델을 서빙하도록 인프라를 준비합니다.

모델의 데이터 형식에 따라 필요한 GPU 수가 달라집니다. 이 튜토리얼에서 각 모델은 L4 GPU 2개를 사용합니다. 자세한 내용은 GPU 양 계산을 참조하세요.

이 튜토리얼은 LLM을 서빙하는 데 Kubernetes 컨테이너 조정 기능을 사용하려고 하는 머신러닝(ML) 엔지니어, 플랫폼 관리자 및 운영자, 데이터 및 AI 전문가를 대상으로 합니다. Google Cloud콘텐츠에서 참조하는 일반적인 역할과 예시 태스크를 자세히 알아보려면 일반 GKE 사용자 역할 및 태스크를 참조하세요.

이 페이지를 읽기 전 다음 내용을 숙지해야 합니다.

목표

이 가이드의 목표는 다음과 같습니다.

클러스터 및 노드 풀 만들기
워크로드 준비
워크로드 배포
LLM 인터페이스와 상호작용

시작하기 전에

시작하기 전에 다음 태스크를 수행했는지 확인합니다.

Google Kubernetes Engine API를 사용 설정합니다.

Google Kubernetes Engine API 사용 설정

이 태스크에 Google Cloud CLI를 사용하려면 gcloud CLI를 설치한 후 초기화하세요. 이전에 gcloud CLI를 설치한 경우 gcloud components update를 실행하여 최신 버전을 가져옵니다.
참고: 기존 gcloud CLI 설치의 경우 compute/region 속성을 설정해야 합니다. 주로 영역 클러스터를 사용하는 경우에는 대신 compute/zone을 설정합니다. 기본 위치를 설정하면 gcloud CLI에서 One of [--zone, --region] must be supplied: Please specify location과 같은 오류를 방지할 수 있습니다. 클러스터 위치가 설정한 기본값과 다른 경우 특정 명령어에서 위치를 지정해야 할 수 있습니다.

일부 모델에는 추가 요구사항이 있습니다. 다음 요구사항을 충족해야 합니다.
- Hugging Face의 모델에 액세스하려면 HuggingFace 토큰을 사용합니다.
- Mixtral 8x7b 모델의 경우 Mistral Mixtral 모델 조건을 수락합니다.
- Llama 3 70b 모델의 경우 Meta Llama 모델의 활성 라이선스가 있는지 확인합니다.
경고: Llama 모델에 대한 액세스 및 승인을 받는 데 최대 3일이 걸릴 수 있습니다.

개발 환경 준비

Google Cloud 콘솔에서 Cloud Shell 인스턴스를 시작합니다.
Cloud Shell 열기
기본 환경 변수를 설정합니다.
```
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CONTROL_PLANE_LOCATION=us-central1
```
PROJECT_ID를 Google Cloud프로젝트 ID로 바꿉니다.

참고: 튜토리얼 실행 과정에서 Cloud Shell 인스턴스가 연결 해제되면 이전 단계를 반복합니다.

GKE 클러스터 및 노드 풀 만들기

GPU를 활용하여 GKE Autopilot 또는 Standard 클러스터에서 LLM을 제공할 수 있습니다. 완전 관리형 Kubernetes 환경을 위해서는 Autopilot을 사용하는 것이 좋습니다. 워크로드에 가장 적합한 GKE 작업 모드를 선택하려면 GKE 작업 모드 선택을 참조하세요.

Autopilot

Cloud Shell에서 다음 명령어를 실행합니다.
```
gcloud container clusters create-auto l4-demo \
  --project=${PROJECT_ID} \
  --location=${CONTROL_PLANE_LOCATION} \
  --release-channel=rapid
```
GKE는 배포된 워크로드의 요청에 따라 CPU 및 GPU 노드를 사용하여 Autopilot 클러스터를 만듭니다.

클러스터와 통신하도록 kubectl을 구성합니다.

gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}

Standard

Cloud Shell에서 다음 명령어를 실행하여 GKE에 대해 워크로드 아이덴티티 제휴를 사용하는 Standard 클러스터를 만듭니다.

gcloud container clusters create l4-demo \
  --location ${CONTROL_PLANE_LOCATION} \
  --workload-pool ${PROJECT_ID}.svc.id.goog \
  --enable-image-streaming \
  --node-locations=${CONTROL_PLANE_LOCATION}-a \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --machine-type n2d-standard-4 \
  --num-nodes 1 --min-nodes 1 --max-nodes 5 \
  --release-channel=rapid

클러스터 만들기는 몇 분 정도 걸릴 수 있습니다.

다음 명령어를 실행하여 클러스터에 대해 노드 풀을 만듭니다.
```
gcloud container node-pools create g2-standard-24 --cluster l4-demo \
  --location ${CONTROL_PLANE_LOCATION} \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --machine-type g2-standard-24 \
  --enable-autoscaling --enable-image-streaming \
  --num-nodes=0 --min-nodes=0 --max-nodes=3 \
  --node-locations ${CONTROL_PLANE_LOCATION}-a,${CONTROL_PLANE_LOCATION}-c \
  --spot
```
GKE는 LLM에 다음 리소스를 만듭니다.
- 공개 Google Kubernetes Engine(GKE) Standard 버전 클러스터
- g2-standard-24 머신 유형이 0개 노드로 축소된 노드 풀. 포드를 실행하여 GPU를 요청하기 전까지는 GPU 비용이 청구되지 않습니다. 이 노드 풀은 기본 표준 Compute Engine VM보다 저렴한 Spot VM을 프로비저닝하고 가용성을 보장하지 않습니다. 주문형 VM을 사용하려면 이 명령어와 text-generation-inference.yaml 구성의 cloud.google.com/gke-spot 노드 선택기에서 --spot 플래그를 삭제할 수 있습니다.

클러스터와 통신하도록 kubectl을 구성합니다.

gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}

워크로드 준비

이 섹션에서는 사용할 모델에 따라 워크로드를 설정하는 방법을 보여줍니다. 이 튜토리얼에서는 Kubernetes 배포를 사용하여 모델을 배포합니다. 배포는 클러스터에서 노드 간에 배포되는 여러 포드 복제본을 실행할 수 있는 Kubernetes API 객체입니다.

Llama 3 70b

기본 환경 변수를 설정합니다.
```
export HF_TOKEN=HUGGING_FACE_TOKEN
```
HUGGING_FACE_TOKEN을 HuggingFace 토큰으로 바꿉니다.

HuggingFace 토큰에 Kubernetes 보안 비밀을 만듭니다.

kubectl create secret generic l4-demo \
    --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

다음 text-generation-inference.yaml 배포 매니페스트를 만듭니다.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: meta-llama/Meta-Llama-3-70B-Instruct
        - name: NUM_SHARD
          value: "2"
        - name: MAX_INPUT_TOKENS
          value: "2048"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          # mountPath is set to /tmp as it's the path where the HUGGINGFACE_HUB_CACHE environment
          # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
          # i.e. where the downloaded model from the Hub will be stored
          - mountPath: /tmp
            name: ephemeral-volume
      volumes:
        - name: dshm
          emptyDir:
              medium: Memory
        - name: ephemeral-volume
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: ephemeral
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: "premium-rwo"
                resources:
                  requests:
                    storage: 150Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"

이 매니페스트에서 각 항목은 다음을 수행합니다.

모델에 2개의 NVIDIA L4 GPU가 필요하므로 NUM_SHARD는 2여야 합니다.
QUANTIZE는 bitsandbytes-nf4로 설정됩니다. 즉, 모델이 32비트 대신 4비트에 로드됩니다. 이렇게 하면 GKE가 필요한 GPU 메모리 양을 줄이고 추론 속도를 개선할 수 있습니다. 하지만 모델 정확성이 떨어질 수 있습니다. 요청할 GPU를 계산하는 방법은 GPU 양 계산을 참조하세요.

매니페스트를 적용합니다.

kubectl apply -f text-generation-inference.yaml

출력은 다음과 비슷합니다.

deployment.apps/llm created

모델의 상태를 확인합니다.

kubectl get deploy

출력은 다음과 비슷합니다.

NAME          READY   UP-TO-DATE   AVAILABLE   AGE
llm           1/1     1            1           20m

실행 중인 배포의 로그를 봅니다.

kubectl logs -l app=llm

출력은 다음과 비슷합니다.

{"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
{"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
{"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
{"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

Mixtral 8x7b

기본 환경 변수를 설정합니다.
```
export HF_TOKEN=HUGGING_FACE_TOKEN
```
HUGGING_FACE_TOKEN을 HuggingFace 토큰으로 바꿉니다.

HuggingFace 토큰에 Kubernetes 보안 비밀을 만듭니다.

kubectl create secret generic l4-demo \
    --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

다음 text-generation-inference.yaml 배포 매니페스트를 만듭니다.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        resources:
          requests:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: mistralai/Mixtral-8x7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN          
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          # mountPath is set to /tmp as it's the path where the HF_HOME environment
          # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
          # i.e. where the downloaded model from the Hub will be stored
          - mountPath: /tmp
            name: ephemeral-volume
      volumes:
        - name: dshm
          emptyDir:
              medium: Memory
        - name: ephemeral-volume
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: ephemeral
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: "premium-rwo"
                resources:
                  requests:
                    storage: 100Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"

이 매니페스트에서 각 항목은 다음을 수행합니다.

모델에 2개의 NVIDIA L4 GPU가 필요하므로 NUM_SHARD는 2여야 합니다.
QUANTIZE는 bitsandbytes-nf4로 설정됩니다. 즉, 모델이 32비트 대신 4비트에 로드됩니다. 이렇게 하면 GKE가 필요한 GPU 메모리 양을 줄이고 추론 속도를 개선할 수 있습니다. 하지만 모델 정확도가 줄어들 수 있습니다. 요청할 GPU를 계산하는 방법은 GPU 양 계산을 참조하세요.

매니페스트를 적용합니다.

kubectl apply -f text-generation-inference.yaml

출력은 다음과 비슷합니다.

deployment.apps/llm created

모델의 상태를 확인합니다.
```
watch kubectl get deploy
```
배포가 준비되면 출력은 다음과 비슷합니다.
```
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
llm           1/1     1            1           10m
```
확인을 종료하려면 CTRL + C를 입력합니다.

실행 중인 배포의 로그를 봅니다.

kubectl logs -l app=llm

출력은 다음과 비슷합니다.

{"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
{"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
{"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
{"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

Falcon 40b

다음 text-generation-inference.yaml 배포 매니페스트를 만듭니다.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-40b-instruct
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          # mountPath is set to /data as it's the path where the HUGGINGFACE_HUB_CACHE environment
          # variable points to in the TGI container image i.e. where the downloaded model from the Hub will be
          # stored
          - mountPath: /data
            name: ephemeral-volume
      volumes:
        - name: dshm
          emptyDir:
              medium: Memory
        - name: ephemeral-volume
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: ephemeral
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: "premium-rwo"
                resources:
                  requests:
                    storage: 175Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"

이 매니페스트에서 각 항목은 다음을 수행합니다.

모델에 2개의 NVIDIA L4 GPU가 필요하므로 NUM_SHARD는 2여야 합니다.
QUANTIZE는 bitsandbytes-nf4로 설정됩니다. 즉, 모델이 32비트 대신 4비트에 로드됩니다. 이렇게 하면 GKE가 필요한 GPU 메모리 양을 줄이고 추론 속도를 개선할 수 있습니다. 하지만 모델 정확성이 떨어질 수 있습니다. 요청할 GPU를 계산하는 방법은 GPU 양 계산을 참조하세요.

매니페스트를 적용합니다.

kubectl apply -f text-generation-inference.yaml

출력은 다음과 비슷합니다.

deployment.apps/llm created

모델의 상태를 확인합니다.
```
watch kubectl get deploy
```
배포가 준비되면 출력은 다음과 비슷합니다.
```
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
llm           1/1     1            1           10m
```
확인을 종료하려면 CTRL + C를 입력합니다.

실행 중인 배포의 로그를 봅니다.

kubectl logs -l app=llm

출력은 다음과 비슷합니다.

{"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
{"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
{"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
{"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

ClusterIP 유형의 서비스 만들기

다른 애플리케이션에서 포드를 검색하고 액세스할 수 있도록 클러스터 내에서 포드를 내부적으로 노출합니다.

다음 llm-service.yaml 매니페스트를 만듭니다.

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

매니페스트를 적용합니다.
```
kubectl apply -f llm-service.yaml
```

채팅 인터페이스 배포

Gradio를 사용하여 모델과 상호작용할 수 있는 웹 애플리케이션을 빌드합니다. Gradio는 챗봇용 사용자 인터페이스를 만드는 ChatInterface 래퍼가 있는 Python 라이브러리입니다.

Llama 3 70b

gradio.yaml이라는 파일을 만듭니다.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "meta-llama/Meta-Llama-3-70B-Instruct"
        - name: USER_PROMPT
          value: "<|begin_of_text|><|start_header_id|>user<|end_header_id|> prompt <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
        - name: SYSTEM_PROMPT
          value: "prompt <|eot_id|>"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860

매니페스트를 적용합니다.
```
kubectl apply -f gradio.yaml
```

서비스의 외부 IP 주소를 찾습니다.

kubectl get svc

출력은 다음과 비슷합니다.

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m

EXTERNAL-IP 열에서 외부 IP 주소를 복사합니다.
외부 IP 주소와 노출된 포트를 사용하여 웹브라우저에서 모델 인터페이스를 확인합니다.
```
http://EXTERNAL_IP
```

Mixtral 8x7b

gradio.yaml이라는 파일을 만듭니다.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "mixtral-8x7b"
        - name: USER_PROMPT
          value: "[INST] prompt [/INST]"
        - name: SYSTEM_PROMPT
          value: "prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860

매니페스트를 적용합니다.
```
kubectl apply -f gradio.yaml
```

서비스의 외부 IP 주소를 찾습니다.

kubectl get svc

출력은 다음과 비슷합니다.

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m

EXTERNAL-IP 열에서 외부 IP 주소를 복사합니다.
외부 IP 주소와 노출된 포트를 사용하여 웹브라우저에서 모델 인터페이스를 확인합니다.
```
http://EXTERNAL_IP
```

Falcon 40b

gradio.yaml이라는 파일을 만듭니다.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "falcon-40b-instruct"
        - name: USER_PROMPT
          value: "User: prompt"
        - name: SYSTEM_PROMPT
          value: "Assistant: prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860

매니페스트를 적용합니다.
```
kubectl apply -f gradio.yaml
```

서비스의 외부 IP 주소를 찾습니다.

kubectl get svc

출력은 다음과 비슷합니다.

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m

EXTERNAL-IP 열에서 외부 IP 주소를 복사합니다.
외부 IP 주소와 노출된 포트를 사용하여 웹브라우저에서 모델 인터페이스를 확인합니다.
```
http://EXTERNAL_IP
```

GPU 양 계산

GPU 양은 QUANTIZE 플래그의 값에 따라 달라집니다. 이 튜토리얼에서는 QUANTIZE가 bitsandbytes-nf4로 설정됩니다. 즉, 모델이 4비트로 로드됩니다.

700억 개 매개변수 모델에는 700억 x 4비트(700억 x 4비트 = 35GB)에 해당하는 최소 40GB의 GPU 메모리가 필요하며 5GB의 오버헤드를 고려합니다. 이 경우 단일 L4 GPU에 충분한 메모리가 없습니다. 따라서 이 튜토리얼의 예시에서는 2개의 L4 GPU 메모리(2 x 24 = 48GB)를 사용합니다. 이 구성은 L4 GPU에서 Falcon 40b 또는 Llama 3 70b를 실행하는 데 충분합니다.

삭제

이 튜토리얼에서 사용된 리소스 비용이 Google Cloud 계정에 청구되지 않도록 하려면 리소스가 포함된 프로젝트를 삭제하거나 프로젝트를 유지하고 개별 리소스를 삭제하세요.

클러스터 삭제

이 가이드에서 만든 리소스에 대한 비용이 Google Cloud 계정에 청구되지 않게 하려면 GKE 클러스터를 삭제합니다.

gcloud container clusters delete l4-demo --location ${CONTROL_PLANE_LOCATION}

GKE에서 여러 GPU로 LLM 서빙 컬렉션을 사용해 정리하기 내 환경설정을 기준으로 콘텐츠를 저장하고 분류하세요.

목표

시작하기 전에

개발 환경 준비

GKE 클러스터 및 노드 풀 만들기

Autopilot

Standard

워크로드 준비

Llama 3 70b

Mixtral 8x7b

Falcon 40b

ClusterIP 유형의 서비스 만들기

채팅 인터페이스 배포

Llama 3 70b

Mixtral 8x7b

Falcon 40b

GPU 양 계산

삭제

클러스터 삭제

다음 단계

GKE에서 여러 GPU로 LLM 서빙