[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Translation issue","translationIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated (UTC): 2025-09-02."],[],[],null,["# Deploy generative AI models\n\nThis page provides guidance for deploying a generative AI model to an endpoint\nfor online inference.\n\nCheck the Model Garden\n----------------------\n\nIf the model is in Model Garden, you can deploy it by clicking\n**Deploy** (available for some models) or **Open Notebook**.\n\n[Go to Model Garden](https://console.cloud.google.com/vertex-ai/model-garden)\n\nOtherwise, you can do one of the following:\n\n- If your model is similar to one in Model Garden, you might be\n able to directly reuse one of the\n [Model Garden containers](https://us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers).\n\n- Build your own custom container that adheres to\n [Custom container requirements for inference](/vertex-ai/docs/predictions/custom-container-requirements)\n before [importing your model](/vertex-ai/docs/model-registry/import-model)\n into the [Vertex AI Model Registry](/vertex-ai/docs/model-registry/introduction).\n After it's imported, it becomes a [`model`](/vertex-ai/docs/reference/rest/v1/projects.locations.models)\n resource that you can [deploy to an endpoint](/vertex-ai/docs/general/deployment).\n\n You can use the [Dockerfiles and scripts](https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/community-content/vertex_model_garden)\n that we use to build our Model Garden containers as a reference or\n starting point to build your own custom containers.\n\nServing inferences with NVIDIA NIM\n----------------------------------\n\n[NVIDIA Inference Microservices 
(NIM)](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)\nare pre-trained and optimized AI models that are packaged as microservices.\nThey're designed to simplify the deployment of high-performance,\nproduction-ready AI into applications.\n\nNVIDIA NIM can be used together with\n[Artifact Registry](/artifact-registry/docs/overview) and Vertex AI\nto deploy generative AI models for online inference.\n\nTo see an example of using NVIDIA NIM,\nrun the \"NVIDIA NIM on Google Cloud Vertex AI\" notebook in one of the following\nenvironments:\n\n- [Open in Colab](https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/nvidia_nim_vertexai.ipynb)\n- [Open in Colab Enterprise](https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fgenerative_ai%2Fnvidia_nim_vertexai.ipynb)\n- [Open in Vertex AI Workbench](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fgenerative_ai%2Fnvidia_nim_vertexai.ipynb)\n- [View on GitHub](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/nvidia_nim_vertexai.ipynb)\n\nSettings for custom containers\n------------------------------\n\nThis section describes fields in your model's\n[`containerSpec`](/vertex-ai/docs/reference/rest/v1/ModelContainerSpec) that you may need to\nspecify when importing generative AI models.\n\nYou can specify these fields by using the Vertex AI REST API or the\n[`gcloud ai models upload` command](/sdk/gcloud/reference/ai/models/upload).\nFor more information, see\n[Container-related API 
fields](/vertex-ai/docs/predictions/use-custom-container#fields).\n\n`sharedMemorySizeMb`\n\n: Some generative AI models require more **shared memory**. Shared memory is\n an inter-process communication (IPC) mechanism that allows multiple\n processes to access and manipulate a common block of memory. The default\n shared memory size is 64 MB.\n\n Some model servers, such as [vLLM](https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/community-content/vertex_model_garden/model_oss/vllm)\n or NVIDIA Triton, use shared memory to cache internal data during model\n inferences. Without enough shared memory, some model servers cannot serve\n inferences for generative models. The amount of shared memory needed, if\n any, is an implementation detail of your container and model. Consult your\n model server documentation for guidelines.\n\n Also, because shared memory can be used for cross-GPU communication, using\n more shared memory can improve performance for accelerators without NVLink\n capabilities (for example, L4), if the model container requires\n communication across GPUs.\n\n For information on how to specify a custom value for shared memory, see\n [Container-related API fields](/vertex-ai/docs/predictions/use-custom-container#fields).\n\n`startupProbe`\n\n: A **startup probe** is an optional probe that is used to detect when the\n container has started. 
This probe is used to delay the health probe and\n [liveness checks](/vertex-ai/docs/predictions/custom-container-requirements#liveness_checks)\n until the container has started, which helps prevent slow-starting containers\n from getting shut down prematurely.\n\n For more information, see [Health checks](/vertex-ai/docs/predictions/custom-container-requirements#health).\n\n`healthProbe`\n\n: The **health probe** checks whether a container is ready to accept traffic.\n If a health probe is not provided, Vertex AI uses the default\n health check, which issues an HTTP request to the container's port and looks\n for a `200 OK` response from the model server.\n\n If your model server responds with `200 OK` before the model is fully\n loaded, which can happen, especially with large models, then the health\n check will succeed prematurely and Vertex AI will route traffic to\n the container before it is ready.\n\n In these cases, specify a custom health probe that succeeds only after the\n model is fully loaded and ready to accept traffic.\n\n For more information, see [Health checks](/vertex-ai/docs/predictions/custom-container-requirements#health).\n\nLimitations\n-----------\n\nConsider the following limitations when deploying generative AI models:\n\n- Generative AI models can only be deployed to a single machine. Multi-host deployment isn't supported.\n- For very large models that don't fit in the largest supported VRAM, such as [Llama 3.1 405B](/vertex-ai/generative-ai/docs/open-models/use-llama#llama_31), we recommend quantizing them to fit."]]