Version 2.3 is a lightweight image that contains only core components,
reducing exposure to Common Vulnerabilities and Exposures (CVEs). For higher
security compliance requirements, use image version 2.3 or later when
creating a Dataproc cluster.
If you choose to install optional components when creating a
Dataproc cluster with a 2.3 image, they are downloaded and
installed during cluster creation, which can increase cluster startup time.
To avoid this delay, you can create a custom image with the optional
components pre-installed by running generate_custom_image.py with the
--optional-components flag.
Note: You must specify the optional components that you want to install when
you create the cluster. The following example shows the Google Cloud CLI
command for creating a cluster with optional components:

    gcloud dataproc clusters create CLUSTER_NAME \
        --optional-components=COMPONENT_NAME \
        ... other flags
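As a sketch, a custom image build with optional components pre-installed might look like the following. The image name, version, zone, bucket, and component choices are placeholders, and the empty customization script is used here only because the tool requires one:

```shell
# Clone the Dataproc custom-images tooling.
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images

# The tool requires a customization script; it can be empty if you
# only need the optional components baked into the image.
touch empty-init.sh

# Build a 2.3-based custom image with Flink and Docker pre-installed.
# All values below are placeholders for your own project settings,
# and the sub-minor version shown is only an example.
python generate_custom_image.py \
    --image-name my-dataproc-2-3-flink-docker \
    --dataproc-version 2.3.6-debian12 \
    --optional-components FLINK,DOCKER \
    --customization-script empty-init.sh \
    --zone us-central1-a \
    --gcs-bucket gs://my-staging-bucket
```

You can then pass the resulting image to cluster creation with the `--image` flag instead of `--image-version`.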
Notes:
The following are the optional components in 2.3 images:
Apache Flink
Apache Hive WebHCat
Apache Hudi
Apache Iceberg
Apache Pig
Delta Lake
Docker
JupyterLab Notebook
Ranger
Solr
Zeppelin Notebook
Zookeeper
yarn.nodemanager.recovery.enabled and HDFS Audit Logging
are enabled by default in 2.3 images.
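If you need different behavior, these defaults can be overridden with cluster properties at creation time. The following is an illustrative sketch (the cluster name and the choice to disable recovery are hypothetical):

```shell
# Hypothetical example: turn off NodeManager recovery on a 2.3 cluster
# by overriding the default at cluster-creation time. The "yarn:" prefix
# routes the property into yarn-site.xml.
gcloud dataproc clusters create my-cluster \
    --image-version 2.3-debian12 \
    --properties yarn:yarn.nodemanager.recovery.enabled=false
```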
micromamba is installed as part of the Python installation, replacing the
conda installation used in previous image versions.
2.3.x-*-arm images support only the pre-installed components and a limited
set of optional components. The other 2.3 optional components and all
initialization actions aren't supported.
Docker and Zeppelin installation issues:
Installation fails if the cluster has no public internet access. As a
workaround, create a cluster that uses a custom image with optional
components pre-installed. You can do this by running
generate_custom_image.py
with the
--optional-components flag.
Installation can fail if the cluster is pinned to an older sub-minor image
version: packages are installed on demand from public OSS repositories, and a
package might not be available upstream to support the installation.
As a workaround, create a cluster that uses a custom image with the optional
components pre-installed, by running
generate_custom_image.py
with the
--optional-components flag.
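To illustrate the pinning distinction, the commands below are a sketch; the cluster names, component choice, and sub-minor version are placeholders, and the sub-minor versions actually available change over time:

```shell
# Tracks the latest 2.3 sub-minor release; optional-component packages
# are most likely to still be available upstream.
gcloud dataproc clusters create unpinned-cluster \
    --image-version 2.3-debian12 \
    --optional-components DOCKER

# Pinned to a specific (placeholder) sub-minor version; on-demand
# package installs can fail if upstream repositories have moved on.
gcloud dataproc clusters create pinned-cluster \
    --image-version 2.3.0-debian12 \
    --optional-components DOCKER
```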
Image version 2.3 machine learning (ML) components
The Dataproc 2.3-ml-ubuntu image extends the 2.3 base image
with ML-specific software. It supports 2.3 image optional components and other
2.3 features, and adds the component versions listed in the following sections.
GPU-specific libraries
For Dataproc jobs that use GPU VMs,
the following NVIDIA driver and libraries are available in the
2.3-ml-ubuntu image. You can use them to accomplish the following
tasks:
Accelerate Spark batch workloads with the NVIDIA Spark RAPIDS library
Train machine learning workloads
Run distributed batch inference using Spark
XGBoost libraries
The following Maven package versions are available in the 2.3-ml-ubuntu image
to let you use XGBoost with Spark in Java or Scala.
Note: You cannot use distributed Spark XGBoost on a Dataproc job that has
autoscaling enabled (the default behavior), because new nodes that start
during elastic scaling cannot receive new tasks and remain idle. To use
XGBoost with a batch workload, you can set the
spark.dynamicAllocation.enabled=false property on a Dataproc job to disable
dynamic allocation.
Python libraries
The 2.3-ml-ubuntu image contains the following libraries, which support
different stages in the ML lifecycle.
R libraries
The following R library versions are included in the 2.3-ml-ubuntu image.
Last updated 2025-09-04 UTC.
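A job submission that disables dynamic allocation for a Spark XGBoost workload might look like the following sketch; the cluster name, jar path, and class name are hypothetical placeholders:

```shell
# Hypothetical Spark XGBoost training job; disabling dynamic allocation
# keeps executors from being added or removed while training runs.
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class com.example.XGBoostTrain \
    --jars gs://my-bucket/xgboost-train.jar \
    --properties spark.dynamicAllocation.enabled=false
```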