
Commit 48f228c

#16691 Providing more information in docs for DataprocCreateCluster operator migration (#19446)
1 parent 0c9ce54 commit 48f228c

3 files changed: +91 additions, 0 deletions


UPDATING.md

Lines changed: 52 additions & 0 deletions
@@ -1873,6 +1873,58 @@ https://cloud.google.com/compute/docs/disks/performance
Hence, the default value for `master_disk_size` in `DataprocCreateClusterOperator` has been changed from 500GB to 1TB.
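If you relied on the previous 500GB default, you can pin it explicitly when generating the cluster config. A minimal sketch, assuming `ClusterGenerator` (described below) and its `master_disk_size` parameter in GB; all values are illustrative:

```python
from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator

# Illustrative only: keep the old 500GB primary-node disk instead of the new 1TB default.
CLUSTER_CONFIG = ClusterGenerator(
    project_id="test",
    master_disk_size=500,  # assumed to be specified in GB
).make()
```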
##### Generating Cluster Config

If you are upgrading from Airflow 1.10.x and are not using **CLUSTER_CONFIG**, you can easily generate one using **make()** of `airflow.providers.google.cloud.operators.dataproc.ClusterGenerator`.

This has proved especially useful if you were using the **metadata** argument of the older API; refer to [AIRFLOW-16911](https://github.com/apache/airflow/issues/16911) for details.

For example, your cluster creation may look like this in **v1.10.x**:

```python
path = "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"

create_cluster = DataprocClusterCreateOperator(
    task_id="create_dataproc_cluster",
    cluster_name="test",
    project_id="test",
    zone="us-central1-a",
    region="us-central1",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=2,
    storage_bucket="test_bucket",
    init_actions_uris=[path],
    metadata={"PIP_PACKAGES": "pyyaml requests pandas openpyxl"},
)
```

After upgrading to **v2.x.x** and using **CLUSTER_CONFIG**, it will look like the following:

```python
path = "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"

CLUSTER_CONFIG = ClusterGenerator(
    project_id="test",
    zone="us-central1-a",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=2,
    storage_bucket="test",
    init_actions_uris=[path],
    metadata={"PIP_PACKAGES": "pyyaml requests pandas openpyxl"},
).make()

create_cluster_operator = DataprocClusterCreateOperator(
    task_id="create_dataproc_cluster",
    cluster_name="test",
    project_id="test",
    region="us-central1",
    cluster_config=CLUSTER_CONFIG,
)
```
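For reference, a minimal sketch of the same creation step using the provider package's operator rather than the old class name; this assumes `DataprocCreateClusterOperator` from `airflow.providers.google.cloud.operators.dataproc` accepts the generated config via `cluster_config` (with `CLUSTER_CONFIG` built as above):

```python
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

# Illustrative variant: the provider operator consumes the same generated config.
create_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    cluster_name="test",
    project_id="test",
    region="us-central1",
    cluster_config=CLUSTER_CONFIG,
)
```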
#### `airflow.providers.google.cloud.operators.bigquery.BigQueryGetDatasetTablesOperator`

We changed the signature of `BigQueryGetDatasetTablesOperator`.

airflow/providers/google/cloud/example_dags/example_dataproc.py

Lines changed: 26 additions & 0 deletions
@@ -24,7 +24,9 @@
from datetime import datetime

from airflow import models
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
    DataprocCreateWorkflowTemplateOperator,
    DataprocDeleteClusterOperator,
@@ -64,6 +66,30 @@

# [END how_to_cloud_dataproc_create_cluster]

# Cluster definition: Generating Cluster Config for DataprocClusterCreateOperator
# [START how_to_cloud_dataproc_create_cluster_generate_cluster_config]
path = "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"

CLUSTER_CONFIG = ClusterGenerator(
    project_id="test",
    zone="us-central1-a",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=2,
    storage_bucket="test",
    init_actions_uris=[path],
    metadata={"PIP_PACKAGES": "pyyaml requests pandas openpyxl"},
).make()

create_cluster_operator = DataprocClusterCreateOperator(
    task_id="create_dataproc_cluster",
    cluster_name="test",
    project_id="test",
    region="us-central1",
    cluster_config=CLUSTER_CONFIG,
)
# [END how_to_cloud_dataproc_create_cluster_generate_cluster_config]

# Update options
# [START how_to_cloud_dataproc_updatemask_cluster_operator]
CLUSTER_UPDATE = {

docs/apache-airflow-providers-google/operators/cloud/dataproc.rst

Lines changed: 13 additions & 0 deletions
@@ -57,6 +57,19 @@ With this configuration we can create the cluster:
    :start-after: [START how_to_cloud_dataproc_create_cluster_operator]
    :end-before: [END how_to_cloud_dataproc_create_cluster_operator]

Generating Cluster Config
^^^^^^^^^^^^^^^^^^^^^^^^^
You can also generate **CLUSTER_CONFIG** using the functional API;
this can easily be done using **make()** of
:class:`~airflow.providers.google.cloud.operators.dataproc.ClusterGenerator`.
You can generate and use the config as follows:

.. exampleinclude:: /../../airflow/providers/google/cloud/example_dags/example_dataproc.py
    :language: python
    :dedent: 0
    :start-after: [START how_to_cloud_dataproc_create_cluster_generate_cluster_config]
    :end-before: [END how_to_cloud_dataproc_create_cluster_generate_cluster_config]

Update a cluster
----------------
You can scale the cluster up or down by providing a cluster config and an updateMask.
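As a sketch of what such a scale operation can look like (the parameter names assume the provider's `DataprocUpdateClusterOperator`; in some provider versions the region argument is named `location`, and all values below are illustrative):

```python
from airflow.providers.google.cloud.operators.dataproc import DataprocUpdateClusterOperator

# Illustrative only: scale the primary worker group to 3 instances.
CLUSTER_UPDATE = {"config": {"worker_config": {"num_instances": 3}}}
UPDATE_MASK = {"paths": ["config.worker_config.num_instances"]}

scale_cluster = DataprocUpdateClusterOperator(
    task_id="scale_dataproc_cluster",
    project_id="test",
    region="us-central1",
    cluster_name="test",
    cluster=CLUSTER_UPDATE,
    update_mask=UPDATE_MASK,
    graceful_decommission_timeout={"seconds": 600},  # let running work finish before removing nodes
)
```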
