NVIDIA Data Center GPU Manager (DCGM)

This document describes how to configure your Google Kubernetes Engine deployment so that you can use Google Cloud Managed Service for Prometheus to collect metrics from NVIDIA Data Center GPU Manager. This document shows you how to do the following:

  • Set up an exporter for DCGM to report metrics.
  • Configure a PodMonitoring resource for Managed Service for Prometheus to collect the exported metrics.

These instructions apply only if you are using managed collection with Managed Service for Prometheus. If you are using self-deployed collection, then see the source repository for the DCGM exporter for installation information.

These instructions are provided as an example and are expected to work in most Kubernetes environments. For information about the managed DCGM offering, see Collect and view DCGM metrics.

If you are having trouble installing an application or exporter due to restrictive security or organizational policies, then we recommend that you consult the open-source documentation for support.

For information about NVIDIA Data Center GPU Manager, see NVIDIA DCGM.

Prerequisites

To collect metrics from DCGM by using Managed Service for Prometheus and managed collection, your deployment must meet the following requirements:

  • Your cluster must be running Google Kubernetes Engine version 1.28.15-gke.2475000 or later.
  • You must be running Managed Service for Prometheus with managed collection enabled. For more information, see Get started with managed collection.

  • Ensure that you have sufficient NVIDIA GPU quota. A sample command for spot-checking quota is shown after this list.

  • To enumerate the GPU nodes in your GKE cluster and their GPU types, run the following command in the relevant cluster:

    kubectl get nodes -l cloud.google.com/gke-gpu -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.labels.cloud\.google\.com/gke-accelerator}{"\n"}{end}'
    
  • Note that you might have to install a compatible NVIDIA GPU driver on the nodes if automatic installation is disabled or if your GKE version does not support automatic installation. To verify that the NVIDIA GPU device plugin is running, run the following command:

    kubectl get pods -n kube-system | grep nvidia-gpu-device-plugin
    

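To spot-check your regional NVIDIA GPU quota, one option is to inspect the region's quota list with the gcloud CLI, as in the following sketch; REGION_NAME is a placeholder, and the exact quota metric names depend on your GPU types:

gcloud compute regions describe REGION_NAME | grep -B 1 -A 1 NVIDIA
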
Install the DCGM exporter

We recommend that you install the DCGM exporter, DCGM-Exporter, by using the following configuration:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm
  namespace: gmp-public
  labels:
    app: nvidia-dcgm
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-dcgm
        app: nvidia-dcgm
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      volumes:
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
          type: Directory
      containers:
      - image: "nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04"
        command: ["nv-hostengine", "-n", "-b", "ALL"]
        ports:
        - containerPort: 5555
          hostPort: 5555
        name: nvidia-dcgm
        securityContext:
          privileged: true
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: gmp-public
  labels:
    app.kubernetes.io/name: nvidia-dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-dcgm-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      volumes:
      - name: nvidia-dcgm-exporter-metrics
        configMap:
          name: nvidia-dcgm-exporter-metrics
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
          type: Directory
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
      containers:
      - name: nvidia-dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        command: ["/bin/bash", "-c"]
        args:
        - hostname $NODE_NAME; dcgm-exporter --remote-hostengine-info $(NODE_IP) --collectors /etc/dcgm-exporter/counters.csv
        ports:
        - name: metrics
          containerPort: 9400
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
          value: "device-name"
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: DCGM_EXPORTER_KUBERNETES
          value: 'true'
        - name: DCGM_EXPORTER_LISTEN
          value: ':9400'
        volumeMounts:
        - name: nvidia-dcgm-exporter-metrics
          mountPath: "/etc/dcgm-exporter"
          readOnly: true
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-dcgm-exporter-metrics
  namespace: gmp-public
data:
  counters.csv: |
    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).

    # Temperature and power usage,,
    DCGM_FI_DEV_GPU_TEMP, gauge, Current temperature readings for the device in degrees C.
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature for the device.
    DCGM_FI_DEV_POWER_USAGE, gauge, Power usage for the device in Watts.

    # Utilization of IP blocks,,
    DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned
    DCGM_FI_PROF_SM_OCCUPANCY, gauge, The fraction of resident warps on a multiprocessor
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, The ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles)
    DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, The fraction of cycles the FP64 (double precision) pipe was active.
    DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, The fraction of cycles the FP32 (single precision) pipe was active.
    DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, The fraction of cycles the FP16 (half precision) pipe was active.

    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_FB_TOTAL, gauge, Total Frame Buffer of the GPU in MB.

    # PCIE,,
    DCGM_FI_PROF_PCIE_TX_BYTES, gauge, Total number of bytes transmitted through PCIe TX
    DCGM_FI_PROF_PCIE_RX_BYTES, gauge, Total number of bytes received through PCIe RX

    # NVLink,,
    DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, The number of bytes of active NvLink tx (transmit) data including both header and payload.
    DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, The number of bytes of active NvLink rx (read) data including both header and payload.

To verify that the DCGM exporter is emitting metrics on the expected endpoints, do the following:

  1. Set up port forwarding by using the following command:

    kubectl -n gmp-public port-forward POD_NAME 9400
    
  2. In another terminal session, use your browser or the curl utility to access the endpoint localhost:9400/metrics.
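
For example, a quick check with the curl utility that filters for one of the configured counters might look like the following; the specific metric name is only an illustration:

curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL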

You can customize the ConfigMap section to select which GPU metrics to emit.
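
For example, to also export the SM clock frequency, you could add a line such as the following to the counters.csv block; treat this as a sketch, because the set of fields worth enabling depends on your GPUs and workloads:

    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).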

Alternatively, you can use the official Helm chart to install the DCGM exporter.
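
If you choose the Helm route, the installation typically looks like the following sketch, based on the upstream DCGM-Exporter repository; chart values might need adjustment for GKE, for example the driver install path and node selectors:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter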

To apply configuration changes from a local file, run the following command:

kubectl apply -n NAMESPACE_NAME -f FILE_NAME
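
For example, if you saved the preceding manifests to a file named dcgm-exporter.yaml (the file name is only an assumption), the command becomes:

kubectl apply -n gmp-public -f dcgm-exporter.yaml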

You can also use Terraform to manage your configurations.

Define a PodMonitoring resource

For target discovery, the Managed Service for Prometheus Operator requires a PodMonitoring resource that corresponds to the DCGM exporter in the same namespace.

You can use the following PodMonitoring configuration:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: ClusterPodMonitoring
metadata:
  name: nvidia-dcgm-exporter
  labels:
    app.kubernetes.io/name: nvidia-dcgm-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s
  targetLabels:
    metadata: []

To apply configuration changes from a local file, run the following command:

kubectl apply -n NAMESPACE_NAME -f FILE_NAME
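
After applying the resource, one way to confirm that it exists and to inspect its status conditions is the following general kubectl pattern:

kubectl get clusterpodmonitoring nvidia-dcgm-exporter -o yaml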

You can also use Terraform to manage your configurations.

Verify the configuration

You can use Metrics Explorer to verify that you correctly configured the DCGM exporter. It might take one or two minutes for Cloud Monitoring to ingest your metrics.

To verify that the metrics are ingested, do the following:

  1. In the Google Cloud console, go to the Metrics Explorer page:

    Go to Metrics Explorer

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  2. In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
  3. Verify that PromQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.
  4. Enter and run the following query; an aggregated variant is sketched after these steps:
    DCGM_FI_DEV_GPU_UTIL{cluster="CLUSTER_NAME", namespace="gmp-public"}
    
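
If the query returns data, ingestion is working. As an additional illustrative query, you can aggregate across all GPUs in the cluster, for example:

avg(DCGM_FI_DEV_GPU_UTIL{cluster="CLUSTER_NAME", namespace="gmp-public"})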

Troubleshooting

For information about troubleshooting metric ingestion problems, see Problems with collection from exporters in Troubleshooting ingestion-side problems.