[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Translation issue","translationIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated (UTC): 2025-09-03."],[],[],null,["# Improve workload efficiency using NCCL Fast Socket\n\n[Autopilot](/kubernetes-engine/docs/concepts/autopilot-overview) [Standard](/kubernetes-engine/docs/concepts/choose-cluster-mode)\n\n*** ** * ** ***\n\nThis page shows you how to use the\n[NVIDIA Collective Communication Library (NCCL) Fast Socket plugin](https://github.com/google/nccl-fastsocket)\nto run more efficient workloads on your Google Kubernetes Engine (GKE) clusters.\n\nBefore you begin\n----------------\n\nBefore you start, make sure that you have performed the following tasks:\n\n- Enable the Google Kubernetes Engine API.\n[Enable Google Kubernetes Engine API](https://console.cloud.google.com/flows/enableapi?apiid=container.googleapis.com)\n- If you want to use the Google Cloud CLI for this task, [install](/sdk/docs/install) and then [initialize](/sdk/docs/initializing) the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running `gcloud components update`. **Note:** For existing gcloud CLI installations, make sure to set the `compute/region` [property](/sdk/docs/properties#setting_properties). If you primarily use zonal clusters, set the `compute/zone` property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: `One of [--zone, --region] must be supplied: Please specify location`. 
You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.\n\n### Limitations\n\n- [Compute Engine limitations](/compute/docs/networking/using-gvnic#limitations) apply.\n- [gVNIC limitations](/kubernetes-engine/docs/how-to/using-gvnic#limitations) apply.\n- NCCL Fast Socket is only supported on node pools that have [hardware accelerators](/kubernetes-engine/docs/how-to/gpus) enabled.\n\n### Requirements\n\nGKE Autopilot:\n\n- GKE Autopilot clusters must be running 1.30.2-gke.1023000 or later.\n\nFor details, see [Creating an Autopilot cluster](/kubernetes-engine/docs/how-to/creating-an-autopilot-cluster#set-version).\n\nGKE Standard:\n\n- Your node pools must have gVNIC enabled to use NCCL Fast Socket.\n- GKE nodes must use a Container-Optimized OS [node image](/kubernetes-engine/docs/concepts/node-images#cos).\n- Your clusters must be running GKE version 1.25.2-gke.1700 or later.\n\nFor details, see [Creating a regional cluster](/kubernetes-engine/docs/how-to/creating-a-regional-cluster).\n\nEnable NCCL Fast Socket in Standard clusters\n--------------------------------------------\n\nThis section shows you how to enable the NCCL Fast Socket plugin in\nGKE Standard node pools. If you use\nGKE Autopilot clusters, GKE automatically enables the plugin when you request NCCL Fast Socket in your workloads. For instructions, see the\n[NCCL Fast Socket in Autopilot](#nccl-fastsocket-autopilot) section.\n\nFor Standard clusters, create a node pool that uses the NCCL Fast Socket plugin. You can also\nupdate an existing node pool using\n[`gcloud container node-pools update`](/sdk/gcloud/reference/container/node-pools/update). 
\n\n    gcloud container node-pools create \u003cvar translate=\"no\"\u003eNODEPOOL_NAME\u003c/var\u003e \\\n        --accelerator type=\u003cvar translate=\"no\"\u003eACCELERATOR_TYPE\u003c/var\u003e,count=\u003cvar translate=\"no\"\u003eACCELERATOR_COUNT\u003c/var\u003e \\\n        --machine-type=\u003cvar translate=\"no\"\u003eMACHINE_TYPE\u003c/var\u003e \\\n        --cluster=\u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n        --enable-fast-socket \\\n        --enable-gvnic\n\nReplace the following:\n\n- \u003cvar translate=\"no\"\u003eNODEPOOL_NAME\u003c/var\u003e: the name of the new node pool.\n- \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: the name of the cluster.\n- \u003cvar translate=\"no\"\u003eACCELERATOR_TYPE\u003c/var\u003e: the type of [GPU accelerator](/compute/docs/gpus) that you use. For example, `nvidia-tesla-t4`.\n- \u003cvar translate=\"no\"\u003eACCELERATOR_COUNT\u003c/var\u003e: the number of GPUs per node.\n- \u003cvar translate=\"no\"\u003eMACHINE_TYPE\u003c/var\u003e: the type of machine you want to use. 
NCCL Fast Socket is not supported on [memory-optimized machine types](/compute/docs/machine-types#memory-optimized_machine_type_family).\n\nInstall NVIDIA GPU device drivers\n---------------------------------\n\nIn Autopilot, GPU device drivers are automatically installed.\n\nFor Standard clusters, follow the instructions in\n[Installing NVIDIA GPU device drivers](/kubernetes-engine/docs/how-to/gpus#installing_drivers)\nto install the required NVIDIA device drivers on your nodes.\n\nNCCL Fast Socket in Autopilot\n-----------------------------\n\nIn Autopilot clusters, you request NCCL Fast Socket in your workloads by using the `cloud.google.com/gke-nccl-fastsocket` node selector.\nWhen you request NCCL Fast Socket in a workload, GKE\nenables gVNIC and NCCL Fast Socket on nodes that GKE\nprovisions for the workload.\nYou can use NCCL Fast Socket with any GPU type that Autopilot supports.\n\nThe following Pod requests NCCL Fast Socket:\n\n    apiVersion: v1\n    kind: Pod\n    metadata:\n      name: my-gpu-pod\n    spec:\n      nodeSelector:\n        cloud.google.com/gke-accelerator: \u003cvar translate=\"no\"\u003eGPU_TYPE\u003c/var\u003e\n        cloud.google.com/gke-nccl-fastsocket: \"true\"\n      containers:\n      - name: my-gpu-container\n        image: nvidia/cuda:11.0.3-runtime-ubuntu20.04\n        command: [\"/bin/bash\", \"-c\", \"--\"]\n        args: [\"while true; do sleep 600; done;\"]\n        resources:\n          limits:\n            nvidia.com/gpu: \u003cvar translate=\"no\"\u003eGPU_QUANTITY\u003c/var\u003e\n\nReplace the following:\n\n- \u003cvar translate=\"no\"\u003eGPU_TYPE\u003c/var\u003e: the type of GPU hardware. 
Allowed values are the following:\n  - `nvidia-b200`: NVIDIA B200 (180GB)\n  - `nvidia-h200-141gb`: NVIDIA H200 (141GB)\n  - `nvidia-h100-mega-80gb`: NVIDIA H100 Mega (80GB)\n  - `nvidia-h100-80gb`: NVIDIA H100 (80GB)\n  - `nvidia-a100-80gb`: NVIDIA A100 (80GB)\n  - `nvidia-tesla-a100`: NVIDIA A100 (40GB)\n  - `nvidia-l4`: NVIDIA L4\n  - `nvidia-tesla-t4`: NVIDIA T4\n- \u003cvar translate=\"no\"\u003eGPU_QUANTITY\u003c/var\u003e: the number of GPUs to allocate to the container.\n\nVerify that NCCL Fast Socket is enabled\n---------------------------------------\n\nTo verify that NCCL Fast Socket is enabled, list the Pods in the `kube-system` namespace:\n\n    kubectl get pods -n kube-system\n\nThe output is similar to the following:\n\n    NAME                              READY   STATUS    RESTARTS   AGE\n    nccl-fastsocket-installer-qvfdw   2/2     Running   0          10m\n    nccl-fastsocket-installer-rtjs4   2/2     Running   0          10m\n    nccl-fastsocket-installer-tm294   2/2     Running   0          10m\n\nIn this output, the number of `nccl-fastsocket-installer` Pods should equal the number of nodes\nin the node pool.\n\nDisable NCCL Fast Socket\n------------------------\n\nIn GKE Autopilot clusters, the NCCL Fast Socket plugin is disabled by default. To disable the plugin on an existing workload, redeploy the workload without the NCCL Fast Socket node selector.\n\nTo disable NCCL Fast Socket for a node pool in Standard clusters, run the following command:\n\n    gcloud container node-pools update \u003cvar translate=\"no\"\u003eNODEPOOL_NAME\u003c/var\u003e \\\n        --cluster=\u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e \\\n        --no-enable-fast-socket\n\nExisting nodes still have the plugin installed. 
You must manually\n[resize the node pool](/kubernetes-engine/docs/how-to/resizing-a-cluster)\nto migrate workloads to new nodes.\n\nTroubleshooting\n---------------\n\nTo troubleshoot gVNIC, see\n[Troubleshooting Google Virtual NIC](/compute/docs/troubleshooting/gvnic).\n\nWhat's next\n-----------\n\n- Use [network policy logging](/kubernetes-engine/docs/how-to/network-policy-logging) to record when connections to Pods are allowed or denied by your cluster's [network policies](https://kubernetes.io/docs/concepts/services-networking/network-policies/)."]]