You can use the Dockerfiles and scripts
that we use to build our Model Garden containers as a reference or
starting point to build your own custom containers.
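For example, once you have a serving container image, you can import the model and
deploy it to an endpoint with the Vertex AI SDK for Python. The following is a
minimal sketch; the project, image URI, routes, port, and machine configuration are
placeholders that depend on your container and model.

```python
from google.cloud import aiplatform

# Initialize the SDK for your project and region (placeholders).
aiplatform.init(project="your-project-id", location="us-central1")

# Import the serving container into the Vertex AI Model Registry.
# The image URI, routes, and port are placeholders; use the values
# that your container actually exposes.
model = aiplatform.Model.upload(
    display_name="my-generative-model",
    serving_container_image_uri=(
        "us-docker.pkg.dev/your-project/your-repo/your-serving-image:latest"
    ),
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
    serving_container_ports=[8080],
)

# Deploy the model to an endpoint backed by a GPU machine type.
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
```

After deployment, `endpoint.predict()` sends online inference requests to the container.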
Serving inferences with NVIDIA NIM
NVIDIA Inference Microservices (NIM)
are pre-trained and optimized AI models that are packaged as microservices.
They're designed to simplify the deployment of high-performance,
production-ready AI into applications.
NVIDIA NIM can be used together with
Artifact Registry and Vertex AI
to deploy generative AI models for online inference. For an example, see the
"NVIDIA NIM on Google Cloud Vertex AI" notebook in the
GoogleCloudPlatform/vertex-ai-samples repository on GitHub.
Settings for custom containers
This section describes fields in your model's
containerSpec that you might need to
specify when importing generative AI models. You can specify these fields by
using the Vertex AI REST API or the gcloud ai models upload command.
sharedMemorySizeMb
Some generative AI models require more shared memory. Shared memory is
an inter-process communication (IPC) mechanism that allows multiple
processes to access and manipulate a common block of memory. The default
shared memory size is 64 MB.
Some model servers, such as vLLM
or NVIDIA Triton, use shared memory to cache internal data during model
inferences. Without enough shared memory, some model servers cannot serve
inferences for generative models. The amount of shared memory needed, if
any, is an implementation detail of your container and model. Consult your
model server documentation for guidelines.
Also, because shared memory can be used for cross-GPU communication, using
more shared memory can improve performance for accelerators without NVLink
capabilities (for example, L4), if the model container requires
communication across GPUs.
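As a rough illustration, the following sketch requests a larger shared memory size
when importing a model with the Vertex AI SDK for Python. It assumes a recent
google-cloud-aiplatform release that exposes the
serving_container_shared_memory_size_mb parameter, which maps to the
containerSpec.sharedMemorySizeMb field; the image URI and the 16 GiB value are
placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Request 16 GiB of shared memory (/dev/shm) for the serving container.
# The right value is specific to your model server and model; consult
# the model server documentation.
model = aiplatform.Model.upload(
    display_name="my-generative-model",
    serving_container_image_uri=(
        "us-docker.pkg.dev/your-project/your-repo/your-serving-image:latest"
    ),
    serving_container_ports=[8080],
    serving_container_shared_memory_size_mb=16 * 1024,
)
```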
startupProbe
A startup probe is an optional probe that is used to detect when the
container has started. This probe delays the health probe and
liveness checks
until the container has started, which helps prevent slow-starting containers
from being shut down prematurely.
healthProbe
The health probe checks whether a container is ready to accept traffic.
If a health probe is not provided, Vertex AI uses the default
health check, which issues an HTTP request to the container's port and looks
for a 200 OK response from the model server.
If your model server responds with 200 OK before the model is fully
loaded, which can happen, especially with large models, then the health
check succeeds prematurely and Vertex AI routes traffic to
the container before it is ready.
In these cases, specify a custom health probe that succeeds only after the
model is fully loaded and ready to accept traffic.
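As one way to put this together, the sketch below configures a startup probe and a
custom health probe through the Vertex AI SDK for Python. It assumes a recent
google-cloud-aiplatform release that exposes these probe parameters, which map to
the containerSpec.startupProbe and containerSpec.healthProbe fields; the curl
commands and the /health and /ready paths are hypothetical and should be replaced
with checks that succeed only after your model server has fully loaded the model.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model.upload(
    display_name="my-generative-model",
    serving_container_image_uri=(
        "us-docker.pkg.dev/your-project/your-repo/your-serving-image:latest"
    ),
    serving_container_ports=[8080],
    # Startup probe: runs inside the container until the server starts,
    # so that slow-starting containers aren't shut down prematurely.
    serving_container_startup_probe_exec=[
        "curl", "--fail", "http://localhost:8080/health",
    ],
    serving_container_startup_probe_period_seconds=30,
    serving_container_startup_probe_timeout_seconds=10,
    # Health probe: should succeed only after the model is fully loaded
    # and the server is ready to accept traffic.
    serving_container_health_probe_exec=[
        "curl", "--fail", "http://localhost:8080/ready",
    ],
    serving_container_health_probe_period_seconds=10,
    serving_container_health_probe_timeout_seconds=5,
)
```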
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-29 UTC."],[],[],null,["# Deploy generative AI models\n\nThis page provides guidance for deploying a generative AI model to an endpoint\nfor online inference.\n\nCheck the Model Garden\n----------------------\n\nIf the model is in Model Garden, you can deploy it by clicking\n**Deploy** (available for some models) or **Open Notebook**.\n\n[Go to Model Garden](https://console.cloud.google.com/vertex-ai/model-garden)\n\nOtherwise, you can do one of the following:\n\n- If your model is similar to one in the Model Garden, you might be\n able to directly reuse one of the\n [model garden containers](https://us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers).\n\n- Build your own custom container that adheres to\n [Custom container requirements for inference](/vertex-ai/docs/predictions/custom-container-requirements)\n before [importing your model](/vertex-ai/docs/model-registry/import-model)\n into the [Vertex AI Model Registry](/vertex-ai/docs/model-registry/introduction).\n After it's imported, it becomes a [`model`](/vertex-ai/docs/reference/rest/v1/projects.locations.models)\n resource that you can [deploy to an endpoint](/vertex-ai/docs/general/deployment).\n\n You can use the [Dockerfiles and scripts](https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/community-content/vertex_model_garden)\n that we use to build our Model Garden containers as a reference or\n starting point to build your own custom containers.\n\nServing inferences with NVIDIA NIM\n----------------------------------\n\n[NVIDIA Inference Microservices (NIM)](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)\nare pre-trained and optimized AI models that are packaged as microservices.\nThey're designed to simplify the deployment of high-performance,\nproduction-ready AI into applications.\n\nNVIDIA NIM can be used together with\n[Artifact Registry](/artifact-registry/docs/overview) and Vertex AI\nto deploy generative AI models for online inference. 
\n| To see an example of using NVIDIA NIM,\n| run the \"NVIDIA NIM on Google Cloud Vertex AI\" notebook in one of the following\n| environments:\n|\n| [Open in Colab](https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/nvidia_nim_vertexai.ipynb)\n|\n|\n| \\|\n|\n| [Open in Colab Enterprise](https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fgenerative_ai%2Fnvidia_nim_vertexai.ipynb)\n|\n|\n| \\|\n|\n| [Open\n| in Vertex AI Workbench](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fgenerative_ai%2Fnvidia_nim_vertexai.ipynb)\n|\n|\n| \\|\n|\n| [View on GitHub](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/nvidia_nim_vertexai.ipynb)\n\nSettings for custom containers\n------------------------------\n\nThis section describes fields in your model's\n[`containerSpec`](/vertex-ai/docs/reference/rest/v1/ModelContainerSpec) that you may need to\nspecify when importing generative AI models.\n\nYou can specify these fields by using the Vertex AI REST API or the\n[`gcloud ai models upload` command](/sdk/gcloud/reference/ai/models/upload).\nFor more information, see\n[Container-related API fields](/vertex-ai/docs/predictions/use-custom-container#fields).\n\n`sharedMemorySizeMb`\n\n: Some generative AI models require more **shared memory**. Shared memory is\n an Inter-process communication (IPC) mechanism that allows multiple\n processes to access and manipulate a common block of memory. The default\n shared memory size is 64MB.\n\n Some model servers, such as [vLLM](https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/community-content/vertex_model_garden/model_oss/vllm)\n or Nvidia Triton, use shared memory to cache internal data during model\n inferences. Without enough shared memory, some model servers cannot serve\n inferences for generative models. The amount of shared memory needed, if\n any, is an implementation detail of your container and model. Consult your\n model server documentation for guidelines.\n\n Also, because shared memory can be used for cross GPU communication, using\n more shared memory can improve performance for accelerators without NVLink\n capabilities (for example, L4), if the model container requires\n communication across GPUs.\n\n For information on how to specify a custom value for shared memory, see\n [Container-related API fields](/vertex-ai/docs/predictions/use-custom-container#fields).\n\n`startupProbe`\n\n: A **startup probe** is an optional probe that is used to detect when the\n container has started. 
This probe is used to delay the health probe and\n [liveness checks](/vertex-ai/docs/predictions/custom-container-requirements#liveness_checks)\n until the container has started, which helps prevent slow starting containers\n from getting shut down prematurely.\n\n For more information, see [Health checks](/vertex-ai/docs/predictions/custom-container-requirements#health).\n\n`healthProbe`\n\n: The **health probe** checks whether a container is ready to accept traffic.\n If health probe is not provided, Vertex AI will use the default\n health checks which issues a HTTP request to the container's port and looks\n for a `200 OK` response from the model server.\n\n If your model server responds with `200 OK` before the model is fully\n loaded, which is possible, especially for large models, then the health\n check will succeed prematurely and Vertex AI will route traffic to\n the container before it is ready.\n\n In these cases, specify a custom health probe that succeeds only after the\n model is fully loaded and ready to accept traffic.\n\n For more information, see [Health checks](/vertex-ai/docs/predictions/custom-container-requirements#health).\n\nLimitations\n-----------\n\nConsider the following limitations when deploying generative AI models:\n\n- Generative AI models can only be deployed to a single machine. Multi-host deployment isn't supported.\n- For very large models that don't fit in the largest supported vRAM, such as [Llama 3.1 405B](/vertex-ai/generative-ai/docs/open-models/use-llama#llama_31), we recommend quantizing them to fit."]]