You can specify a custom container image to use with Dataproc on GKE.
Your custom container image must use one of the Dataproc on GKE
base Spark images.
Use a custom container image
To use a Dataproc on GKE custom container image, set the
spark.kubernetes.container.image property when you create a
Dataproc on GKE virtual cluster or submit a Spark job to the cluster.
Note: the spark: property prefix is needed when creating a cluster,
but omitted when submitting a job.
gcloud CLI cluster creation example:
gcloud dataproc clusters gke create "${DP_CLUSTER}" \
--properties=spark:spark.kubernetes.container.image=custom-image \
... other args ...
gcloud CLI job submit example:
gcloud dataproc jobs submit spark \
--properties=spark.kubernetes.container.image=custom-image \
... other args ...
Custom container image requirements and settings
Base images
You can use docker tools to build a customized Docker image based on one of
the published Dataproc on GKE base Spark images.
Container user
Dataproc on GKE runs Spark containers as the Linux spark user with
UID 1099 and GID 1099. Use this UID and GID when you set filesystem permissions.
For example, if you add a jar file at /opt/spark/jars/my-lib.jar in the image
as a workload dependency, you must give the spark user read permission to the file.
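A minimal Dockerfile sketch of this pattern follows; the base image tag and the jar name my-lib.jar are placeholders:
# Sketch only: base image tag and jar name are placeholders.
FROM us-central1-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.0:latest
USER root
# Copy the dependency with ownership set to the spark user (UID/GID 1099)
# so that the Spark processes can read it.
COPY --chown=spark:spark my-lib.jar /opt/spark/jars/my-lib.jar
USER spark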
Components
Java: The JAVA_HOME environment variable points to the location of the
Java installation. The current default value is /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64,
which is subject to change (see the
Dataproc release notes for updated
information).
If you customize the Java environment, make sure that JAVA_HOME
is set to the correct location and that PATH includes the path to the Java binaries.
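A minimal sketch, assuming a custom Java installation at the placeholder path /usr/lib/jvm/my-java:
# Assumption: /usr/lib/jvm/my-java is a placeholder for your Java installation path.
ENV JAVA_HOME=/usr/lib/jvm/my-java
ENV PATH="${JAVA_HOME}/bin:${PATH}"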
Python: Dataproc on GKE
base Spark images
have Miniconda3 installed at /opt/conda. CONDA_HOME points to this
location, ${CONDA_HOME}/bin is included in PATH, and PYSPARK_PYTHON
is set to ${CONDA_HOME}/python.
If you customize Conda, make sure that CONDA_HOME points to the
Conda home directory, ${CONDA_HOME}/bin is included in PATH, and
PYSPARK_PYTHON is set to ${CONDA_HOME}/python.
You can install, remove, and update packages in the default base environment,
or create a new environment, but it is strongly recommended that the environment
include all packages installed in the base environment of the base container image.
If you add Python modules, such as a Python script with utility functions,
to the container image, include the module directories in PYTHONPATH.
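A minimal Dockerfile sketch that adds a Conda package to the base environment and places a utility module on PYTHONPATH; the package name and test_util.py are illustrative:
# Illustrative package: any Conda package can be installed into the base environment.
RUN "${CONDA_HOME}/bin/conda" install -y numpy
# Place a utility module on PYTHONPATH (test_util.py is a placeholder module).
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY --chown=spark:spark test_util.py "${PYTHONPATH}"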
Spark: Spark is installed in /usr/lib/spark, and SPARK_HOME points to
this location. Spark cannot be customized. If it is changed, the container
image will be rejected or fail to operate correctly.
Jobs: You can customize Spark job dependencies. SPARK_EXTRA_CLASSPATH defines
the extra classpath for Spark JVM processes. Recommendation: put jars under
/opt/spark/jars, and set SPARK_EXTRA_CLASSPATH to /opt/spark/jars/*.
If you embed the job jar in the image, the recommended directory is
/opt/spark/job. When you submit the job, you can reference it with a
local path, for example, file:///opt/spark/job/my-spark-job.jar.
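A hedged job-submission sketch that references an embedded jar by local path; the cluster, region, and jar names are placeholders:
gcloud dataproc jobs submit spark \
    --cluster="${DP_CLUSTER}" \
    --region="${REGION}" \
    --properties=spark.kubernetes.container.image=custom-image \
    --jar=file:///opt/spark/job/my-spark-job.jar \
    ... other args ...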
Cloud Storage connector: The Cloud Storage connector
is installed at /usr/lib/spark/jars.
Utilities: The procps and tini utility packages are required to run
Spark. These utilities are included in the
base Spark images, so custom images do not need to
re-install them.
Entrypoint: Dataproc on GKE ignores any
changes made to the ENTRYPOINT and CMD primitives in the
container image.
Initialization scripts: You can add an optional initialization script at /opt/init-script.sh.
An initialization script can download files from Cloud Storage,
start a proxy within the container, call other scripts, and perform other startup
tasks.
The entrypoint script calls the initialization script with all command line args ($@)
before starting the Spark driver, Spark executor, and other processes. The initialization script
can select the type of Spark process based on the first arg ($1): possible values include spark-submit for driver containers, and executor
for executor containers.
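A hedged sketch of an /opt/init-script.sh that branches on the first arg; the setup commands are placeholders:
#!/bin/bash
# Hypothetical init script: the entrypoint passes the command line args,
# so branch on $1 to run process-specific setup before Spark starts.
case "$1" in
  spark-submit)
    # Driver container setup, for example downloading driver-side files.
    echo "driver init" >/tmp/init-script.out
    ;;
  executor)
    # Executor container setup.
    echo "executor init" >/tmp/init-script.out
    ;;
  *)
    echo "init for $1" >/tmp/init-script.out
    ;;
esac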
Configs: Spark configs are located under /etc/spark/conf.
The SPARK_CONF_DIR environment variable points to this location.
Don't customize Spark configs in the container image. Instead,
submit any properties via the Dataproc on GKE API for
the following reasons:
Some properties, such as executor memory size, are determined at runtime, not
at container image build time; they must be injected by Dataproc on GKE.
Dataproc on GKE places restrictions on the properties supplied by users.
Dataproc on GKE mounts configs from configMap into /etc/spark/conf
in the container, overriding settings embedded in the image.
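For example, a sketch that passes a runtime Spark property at job submission instead of baking it into the image; the memory value and cluster variables are illustrative:
gcloud dataproc jobs submit spark \
    --cluster="${DP_CLUSTER}" \
    --region="${REGION}" \
    --properties=spark.kubernetes.container.image=custom-image,spark.executor.memory=4g \
    ... other args ...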
Base Spark images
Dataproc supports the following base Spark container images:
Spark 3.5: ${REGION}-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.2
Sample custom container image build
Sample Dockerfile
FROM us-central1-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.0:latest

# Change to root temporarily so that it has permissions to create dirs and copy
# files.
USER root

# Add a BigQuery connector jar.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
    && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
COPY --chown=spark:spark \
    spark-bigquery-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}"

# Install Cloud Storage client Conda package.
RUN "${CONDA_HOME}/bin/conda" install google-cloud-storage

# Add a custom Python file.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Add an init script.
COPY --chown=spark:spark init-script.sh /opt/init-script.sh

# (Optional) Set user back to `spark`.
USER spark
Build the container image
Run the following commands in the Dockerfile directory.
1. Set the image name (example: us-central1-docker.pkg.dev/my-project/spark/spark-test-image:latest)
and change to the build directory.
IMAGE="custom-container-image"
BUILD_DIR=$(mktemp -d)
cd "${BUILD_DIR}"
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-01 UTC."],[[["\u003cp\u003eDataproc on GKE allows the use of custom container images, which must be based on one of the provided base Spark images.\u003c/p\u003e\n"],["\u003cp\u003eTo utilize a custom container image, set the \u003ccode\u003espark.kubernetes.container.image\u003c/code\u003e property when creating a virtual cluster or submitting a Spark job.\u003c/p\u003e\n"],["\u003cp\u003eCustom images run Spark containers as the \u003ccode\u003espark\u003c/code\u003e user (UID 1099, GID 1099), requiring specific filesystem permissions to be granted to this user for any added dependencies.\u003c/p\u003e\n"],["\u003cp\u003eWhile Java and Python environments can be customized, Spark itself cannot be modified within the container image, and changes to the \u003ccode\u003eENTRYPOINT\u003c/code\u003e and \u003ccode\u003eCMD\u003c/code\u003e are ignored.\u003c/p\u003e\n"],["\u003cp\u003eInitialization scripts at \u003ccode\u003e/opt/init-script.sh\u003c/code\u003e can be included in custom images to execute tasks before the Spark processes are started, and Spark configurations should not be customized in the image, but rather submitted via the API.\u003c/p\u003e\n"]]],[],null,["You can specify a custom container image to use with Dataproc on GKE .\nYour custom container image must use one of the Dataproc on GKE\n[base Spark images](#base_spark_images).\n\nUse a custom container image\n\nTo use a Dataproc on GKE custom container image, set the\n`spark.kubernetes.container.image property` when you\n[create a Dataproc on GKE virtual cluster](/dataproc/docs/guides/dpgke/quickstarts/dataproc-gke-quickstart-create-cluster)\nor [submit a Spark job](/dataproc/docs/guides/dpgke/quickstarts/dataproc-gke-quickstart-create-cluster#submit_a_spark_job) to the cluster.\n| **Note:** The `spark:` file prefix is needed when creating a cluster, but omitted when submitting a job (see [Cluster properties](/dataproc/docs/concepts/configuring-clusters/cluster-properties#formatting)).\n\n- gcloud CLI cluster creation example: \n\n ```\n gcloud dataproc clusters gke create \"${DP_CLUSTER}\" \\\n --properties=spark:spark.kubernetes.container.image=custom-image \\\n ... other args ...\n ```\n- gcloud CLI job submit example: \n\n ```\n gcloud dataproc jobs submit spark \\\n --properties=spark.kubernetes.container.image=custom-image \\\n ... other args ...\n ```\n\nCustom container image requirements and settings\n\nBase images\n\nYou can use `docker` tools for building customized docker based upon one of\nthe published Dataproc on GKE [base Spark images](#base_spark_images).\n\nContainer user\n\nDataproc on GKE runs Spark containers as the Linux `spark` user with a\n`1099` UID and a `1099` GID. Use the UID and GID for filesystem permissions.\nFor example, if you add a jar file at `/opt/spark/jars/my-lib.jar` in the image\nas a workload dependency, you must give the `spark` user read permission to the file.\n\nComponents\n\n- **Java:** The `JAVA_HOME` environment variable points to the location of the\n Java installation. 
The current default value is `/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64`,\n which is subject to change (see the\n [Dataproc release notes](/dataproc/docs/release-notes) for updated\n information).\n\n - If you customize the Java environment, make sure that `JAVA_HOME` is set to the correct location and `PATH` includes the path to binaries.\n- **Python:** Dataproc on GKE\n [base Spark images](/dataproc/docs/guides/dpgke/dataproc-gke-versions)\n have Miniconda3 installed at `/opt/conda`. `CONDA_HOME` points to this\n location, `${CONDA_HOME}/bin` is included in `PATH`, and `PYSPARK_PYTHON`\n is set to `${CONDA_HOME}/python`.\n\n - If you customize Conda, make sure that `CONDA_HOME` points to the\n Conda home directory ,`${CONDA_HOME}/bin` is included in `PATH`, and\n `PYSPARK_PYTHON` is set to `${CONDA_HOME}/python.`\n\n - You can install, remove, and update packages in the default base environment,\n or create a new environment, but it is strongly recommended that the environment\n include all packages installed in the base environment of the base container image.\n\n - If you add Python modules, such as a Python script with utility functions,\n to the container image, include the module directories in `PYTHONPATH`.\n\n- **Spark:** Spark is installed in `/usr/lib/spark`, and `SPARK_HOME` points to\n this location. **Spark cannot be customized.** If it is changed, the container\n image will be rejected or fail to operate correctly.\n\n - **Jobs** : You can customize Spark job dependencies. `SPARK_EXTRA_CLASSPATH` defines\n the extra classpath for Spark JVM processes. Recommendation: put jars under\n `/opt/spark/jars`, and set `SPARK_EXTRA_CLASSPATH` to `/opt/spark/jars/*`.\n\n If you embed the job jar in the image, the recommended directory is\n `/opt/spark/job`. When you submit the job, you can reference it with a\n local path, for example, `file:///opt/spark/job/my-spark-job.jar`.\n\n \u003cbr /\u003e\n\n - **Cloud Storage connector:** The Cloud Storage connector\n is installed at `/usr/lib/spark/jars`.\n\n - **Utilities** : The `procps` and `tini` utility packages are required to run\n Spark. These utilities are included in the\n [base Spark images](#base_spark_images), so custom images do not need to\n re-install them.\n\n - **Entrypoint:** **Dataproc on GKE ignores any\n changes made to the `ENTRYPOINT` and `CMD` primitives in the\n container image.**\n\n - **Initialization scripts:** you can add an optional initialization script at `/opt/init-script.sh`.\n An initialization script can download files from Cloud Storage,\n start a proxy within the container, call other scripts, and perform other startup\n tasks.\n\n The entrypoint script calls the initialization script with all command line args (`$@`)\n before starting the Spark driver, Spark executor, and other processes. The initialization script\n can select the type of Spark process based on the first arg (`$1`): possible values include `spark-submit` for driver containers, and `executor`\n for executor containers.\n\n \u003cbr /\u003e\n\n- **Configs:** Spark configs are located under `/etc/spark/conf`.\n The `SPARK_CONF_DIR` environment variable points to this location.\n\n Don't customize Spark configs in the container image. 
Instead,\n submit any properties via the Dataproc on GKE API for\n the following reasons:\n - Some properties, such as executor memory size, are determined at runtime, not at container image build time; they must be injected by Dataproc on GKE .\n - Dataproc on GKE places restrictions on the properties supplied by users. Dataproc on GKE mounts configs from `configMap` into `/etc/spark/conf` in the container, overriding settings embedded in the image.\n\nBase Spark images\n\nDataproc supports the following base Spark container images:\n\n- [Spark 3.5](/dataproc/docs/guides/dpgke/dataproc-gke-versions#spark_engine_35): ${REGION}-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.2\n\nSample custom container image build\n\nSample Dockerfile \n\n FROM us-central1-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.0:latest\n\n # Change to root temporarily so that it has permissions to create dirs and copy\n # files.\n USER root\n\n # Add a BigQuery connector jar.\n ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/\n ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'\n RUN mkdir -p \"${SPARK_EXTRA_JARS_DIR}\" \\\n && chown spark:spark \"${SPARK_EXTRA_JARS_DIR}\"\n COPY --chown=spark:spark \\\n spark-bigquery-with-dependencies_2.12-0.22.2.jar \"${SPARK_EXTRA_JARS_DIR}\"\n\n # Install Cloud Storage client Conda package.\n RUN \"${CONDA_HOME}/bin/conda\" install google-cloud-storage\n\n # Add a custom Python file.\n ENV PYTHONPATH=/opt/python/packages\n RUN mkdir -p \"${PYTHONPATH}\"\n COPY test_util.py \"${PYTHONPATH}\"\n\n # Add an init script.\n COPY --chown=spark:spark init-script.sh /opt/init-script.sh\n\n # (Optional) Set user back to `spark`.\n USER spark\n\nBuild the container image\n\nRun the following commands in the Dockerfile directory\n\n1. Set image (example: `us-central1-docker.pkg.dev/my-project/spark/spark-test-image:latest`) and change to build directory. \n\n ```\n IMAGE=custom container image \\\n BUILD_DIR=$(mktemp -d) \\\n cd \"${BUILD_DIR}\"\n ```\n2. Download the BigQuery connector.\n\n ```\n gcloud storage cp \\\n gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .\n ```\n\n \u003cbr /\u003e\n\n3. Create a Python example file.\n\n ```\n cat \u003etest_util.py \u003c\u003c'EOF'\n def hello(name):\n print(\"hello {}\".format(name))\n\n def read_lines(path):\n with open(path) as f:\n return f.readlines()\n EOF\n ```\n\n \u003cbr /\u003e\n\n4. Create an example init script.\n\n ```\n cat \u003einit-script.sh \u003c\u003cEOF\n echo \"hello world\" \u003e/tmp/init-script.out\n EOF\n ```\n\n \u003cbr /\u003e\n\n5. Build and push the image.\n\n ```\n docker build -t \"${IMAGE}\" . && docker push \"${IMAGE}\"\n ```\n\n \u003cbr /\u003e"]]