Commit 083c3c1

Simplified GCSTaskHandler configuration (#10365)
1 parent 439f7dc commit 083c3c1

8 files changed: +215 -127 lines changed

β€ŽUPDATING.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -561,6 +561,20 @@ better handle the case when a DAG file has multiple DAGs.
561561
Sentry is disabled by default. To enable these integrations, you need set ``sentry_on`` option
562562
in ``[sentry]`` section to ``"True"``.
563563

564+
#### Simplified GCSTaskHandler configuration
565+
566+
In previous versions, in order to configure the service account key file, you had to create a connection entry.
567+
In the current version, you can configure ``google_key_path`` option in ``[logging]`` section to set
568+
the key file path.
569+
570+
Users using Application Default Credentials (ADC) need not take any action.
571+
572+
The change aims to simplify the configuration of logging, to prevent corruption of
573+
the instance configuration by changing the value controlled by the user - connection entry. If you
574+
configure a backend secret, it also means the webserver doesn't need to connect to it. This
575+
simplifies setups with multiple GCP projects, because only one project will require the Secret Manager API
576+
to be enabled.
577+
564578
### Changes to the core operators/hooks
565579

566580
We strive to ensure that there are no changes that may affect the end user and your files, but this
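To illustrate the new ``google_key_path`` option described in this file's additions, a minimal ``[logging]`` section might look like the sketch below; the bucket name and key path are placeholders, not values from the commit:

    [logging]
    remote_logging = True
    remote_base_log_folder = gs://my-bucket/path/to/logs
    google_key_path = /etc/secrets/service-account.json

Leaving ``google_key_path`` empty keeps the default behavior of Application Default Credentials.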

β€Žairflow/config_templates/airflow_local_settings.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -196,13 +196,15 @@
196196

197197
DEFAULT_LOGGING_CONFIG['handlers'].update(CLOUDWATCH_REMOTE_HANDLERS)
198198
elif REMOTE_BASE_LOG_FOLDER.startswith('gs://'):
199+
key_path = conf.get('logging', 'GOOGLE_KEY_PATH', fallback=None)
199200
GCS_REMOTE_HANDLERS: Dict[str, Dict[str, str]] = {
200201
'task': {
201-
'class': 'airflow.utils.log.gcs_task_handler.GCSTaskHandler',
202+
'class': 'airflow.providers.google.cloud.log.gcs_task_handler.GCSTaskHandler',
202203
'formatter': 'airflow',
203204
'base_log_folder': str(os.path.expanduser(BASE_LOG_FOLDER)),
204205
'gcs_log_folder': REMOTE_BASE_LOG_FOLDER,
205206
'filename_template': FILENAME_TEMPLATE,
207+
'gcp_key_path': key_path
206208
},
207209
}
208210

@@ -222,7 +224,7 @@
222224

223225
DEFAULT_LOGGING_CONFIG['handlers'].update(WASB_REMOTE_HANDLERS)
224226
elif REMOTE_BASE_LOG_FOLDER.startswith('stackdriver://'):
225-
key_path = conf.get('logging', 'STACKDRIVER_KEY_PATH', fallback=None)
227+
key_path = conf.get('logging', 'GOOGLE_KEY_PATH', fallback=None)
226228
# stackdriver:///airflow-tasks => airflow-tasks
227229
log_name = urlparse(REMOTE_BASE_LOG_FOLDER).path[1:]
228230
STACKDRIVER_REMOTE_HANDLERS = {
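After this change, both the ``gs://`` and ``stackdriver://`` branches consult the same option. A minimal sketch of that shared lookup (runnable only where Airflow is installed; the print is illustrative):

    from airflow.configuration import conf

    # Returns None when [logging] google_key_path is unset; the Google
    # handlers treat None as "use Application Default Credentials".
    key_path = conf.get('logging', 'GOOGLE_KEY_PATH', fallback=None)
    if key_path is None:
        print('No key file configured; falling back to ADC.')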

β€Žairflow/config_templates/config.yml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -402,9 +402,9 @@
402402
type: string
403403
example: ~
404404
default: ""
405-
- name: stackdriver_key_path
405+
- name: google_key_path
406406
description: |
407-
Path to GCP Credential JSON file. If omitted, authorization based on `the Application Default
407+
Path to Google Credential JSON file. If omitted, authorization based on `the Application Default
408408
Credentials
409409
<https://cloud.google.com/docs/authentication/production#finding_credentials_automatically>`__ will
410410
be used.

β€Žairflow/config_templates/default_airflow.cfg

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -230,11 +230,11 @@ remote_logging = False
230230
# location.
231231
remote_log_conn_id =
232232

233-
# Path to GCP Credential JSON file. If omitted, authorization based on `the Application Default
233+
# Path to Google Credential JSON file. If omitted, authorization based on `the Application Default
234234
# Credentials
235235
# <https://cloud.google.com/docs/authentication/production#finding_credentials_automatically>`__ will
236236
# be used.
237-
stackdriver_key_path =
237+
google_key_path =
238238

239239
# Storage bucket URL for remote logging
240240
# S3 buckets should start with "s3://"

β€Žairflow/providers/google/cloud/log/gcs_task_handler.py

Lines changed: 66 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -16,48 +16,86 @@
1616
# specific language governing permissions and limitations
1717
# under the License.
1818
import os
19-
from urllib.parse import urlparse
19+
from typing import Collection, Optional
2020

2121
from cached_property import cached_property
22+
from google.api_core.client_info import ClientInfo
23+
from google.cloud import storage
2224

23-
from airflow.configuration import conf
24-
from airflow.exceptions import AirflowException
25+
from airflow import version
26+
from airflow.providers.google.cloud.utils.credentials_provider import get_credentials_and_project_id
2527
from airflow.utils.log.file_task_handler import FileTaskHandler
2628
from airflow.utils.log.logging_mixin import LoggingMixin
2729

30+
_DEFAULT_SCOPESS = frozenset([
31+
"https://www.googleapis.com/auth/devstorage.read_write",
32+
])
33+
2834

2935
class GCSTaskHandler(FileTaskHandler, LoggingMixin):
3036
"""
3137
GCSTaskHandler is a python log handler that handles and reads
3238
task instance logs. It extends airflow FileTaskHandler and
3339
uploads to and reads from GCS remote storage. Upon log reading
3440
failure, it reads from host machine's local disk.
41+
42+
:param base_log_folder: Base log folder to place logs.
43+
:type base_log_folder: str
44+
:param gcs_log_folder: Path to a remote location where logs will be saved. It must have the prefix
45+
``gs://``. For example: ``gs://bucket/remote/log/location``
46+
:type gcs_log_folder: str
47+
:param filename_template: template filename string
48+
:type filename_template: str
49+
:param gcp_key_path: Path to GCP Credential JSON file. Mutually exclusive with gcp_keyfile_dict.
50+
If omitted, authorization based on `the Application Default Credentials
51+
<https://cloud.google.com/docs/authentication/production#finding_credentials_automatically>`__ will
52+
be used.
53+
:type gcp_key_path: str
54+
:param gcp_keyfile_dict: Dictionary of keyfile parameters. Mutually exclusive with gcp_key_path.
55+
:type gcp_keyfile_dict: dict
56+
:param gcp_scopes: Comma-separated string containing GCP scopes
57+
:type gcp_scopes: str
58+
:param project_id: Project ID to read the secrets from. If not passed, the project ID from credentials
59+
will be used.
60+
:type project_id: str
3561
"""
36-
def __init__(self, base_log_folder, gcs_log_folder, filename_template):
62+
def __init__(
63+
self,
64+
*,
65+
base_log_folder: str,
66+
gcs_log_folder: str,
67+
filename_template: str,
68+
gcp_key_path: Optional[str] = None,
69+
gcp_keyfile_dict: Optional[dict] = None,
70+
# See: https://github.com/PyCQA/pylint/issues/2377
71+
gcp_scopes: Optional[Collection[str]] = _DEFAULT_SCOPESS, # pylint: disable=unsubscriptable-object
72+
project_id: Optional[str] = None,
73+
):
3774
super().__init__(base_log_folder, filename_template)
3875
self.remote_base = gcs_log_folder
3976
self.log_relative_path = ''
4077
self._hook = None
4178
self.closed = False
4279
self.upload_on_close = True
80+
self.gcp_key_path = gcp_key_path
81+
self.gcp_keyfile_dict = gcp_keyfile_dict
82+
self.scopes = gcp_scopes
83+
self.project_id = project_id
4384

4485
@cached_property
45-
def hook(self):
46-
"""
47-
Returns GCS hook.
48-
"""
49-
remote_conn_id = conf.get('logging', 'REMOTE_LOG_CONN_ID')
50-
try:
51-
from airflow.providers.google.cloud.hooks.gcs import GCSHook
52-
return GCSHook(
53-
google_cloud_storage_conn_id=remote_conn_id
54-
)
55-
except Exception as e: # pylint: disable=broad-except
56-
self.log.error(
57-
'Could not create a GoogleCloudStorageHook with connection id '
58-
'"%s". %s\n\nPlease make sure that airflow[gcp] is installed '
59-
'and the GCS connection exists.', remote_conn_id, str(e)
60-
)
86+
def client(self) -> storage.Client:
87+
"""Returns GCS Client."""
88+
credentials, project_id = get_credentials_and_project_id(
89+
key_path=self.gcp_key_path,
90+
keyfile_dict=self.gcp_keyfile_dict,
91+
scopes=self.scopes,
92+
disable_logging=True
93+
)
94+
return storage.Client(
95+
credentials=credentials,
96+
client_info=ClientInfo(client_library_version='airflow_v' + version.version),
97+
project=self.project_id if self.project_id else project_id
98+
)
6199

62100
def set_context(self, ti):
63101
super().set_context(ti)
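With the new keyword-only constructor above, the handler can be instantiated directly. A hedged sketch; every argument value here is hypothetical, and the key path can be omitted to fall back to ADC:

    from airflow.providers.google.cloud.log.gcs_task_handler import GCSTaskHandler

    handler = GCSTaskHandler(
        base_log_folder='/tmp/airflow/logs',           # local fallback location
        gcs_log_folder='gs://my-bucket/airflow/logs',  # must use the gs:// prefix
        filename_template='{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log',
        gcp_key_path='/etc/secrets/service-account.json',  # omit to use ADC
    )

Because of the bare ``*`` in the signature, passing these arguments positionally now raises a ``TypeError``.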
@@ -111,7 +149,8 @@ def _read(self, ti, try_number, metadata=None):
         remote_loc = os.path.join(self.remote_base, log_relative_path)
 
         try:
-            remote_log = self.gcs_read(remote_loc)
+            blob = storage.Blob.from_string(remote_loc, self.client)
+            remote_log = blob.download_as_string()
             log = '*** Reading remote log from {}.\n{}\n'.format(
                 remote_loc, remote_log)
             return log, {'end_of_log': True}
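The read path now talks to ``google-cloud-storage`` directly instead of going through a hook. A standalone sketch of the same two calls; the bucket and object names are hypothetical, and the client here uses ADC rather than the handler's explicit credentials:

    from google.cloud import storage

    client = storage.Client()  # credentials resolved via ADC in this sketch
    blob = storage.Blob.from_string('gs://my-bucket/dag/task/2020-08-20T00:00:00+00:00/1.log', client)
    remote_log = blob.download_as_string()  # returns bytes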
@@ -123,19 +162,9 @@ def _read(self, ti, try_number, metadata=None):
             log += local_log
         return log, metadata
 
-    def gcs_read(self, remote_log_location):
-        """
-        Returns the log found at the remote_log_location.
-
-        :param remote_log_location: the log's location in remote storage
-        :type remote_log_location: str (path)
-        """
-        bkt, blob = self.parse_gcs_url(remote_log_location)
-        return self.hook.download(bkt, blob).decode('utf-8')
-
     def gcs_write(self, log, remote_log_location):
         """
-        Writes the log to the remote_log_location. Fails silently if no hook
+        Writes the log to the remote_log_location. Fails silently if no log
         was created.
 
         :param log: the log to write to the remote_log_location
@@ -144,28 +173,16 @@ def gcs_write(self, log, remote_log_location):
         :type remote_log_location: str (path)
         """
         try:
-            old_log = self.gcs_read(remote_log_location)
+            blob = storage.Blob.from_string(remote_log_location, self.client)
+            old_log = blob.download_as_string()
             log = '\n'.join([old_log, log]) if old_log else log
         except Exception as e:  # pylint: disable=broad-except
             if not hasattr(e, 'resp') or e.resp.get('status') != '404':  # pylint: disable=no-member
                 log = '*** Previous log discarded: {}\n\n'.format(str(e)) + log
+                self.log.info("Previous log discarded: %s", e)
 
         try:
-            bkt, blob = self.parse_gcs_url(remote_log_location)
-            self.hook.upload(bkt, blob, data=log)
+            blob = storage.Blob.from_string(remote_log_location, self.client)
+            blob.upload_from_string(log, content_type="text/plain")
         except Exception as e:  # pylint: disable=broad-except
             self.log.error('Could not write logs to %s: %s', remote_log_location, e)
-
-    @staticmethod
-    def parse_gcs_url(gsurl):
-        """
-        Given a Google Cloud Storage URL (gs://<bucket>/<blob>), returns a
-        tuple containing the corresponding bucket and blob.
-        """
-        parsed_url = urlparse(gsurl)
-        if not parsed_url.netloc:
-            raise AirflowException('Please provide a bucket name')
-        else:
-            bucket = parsed_url.netloc
-            blob = parsed_url.path.strip('/')
-            return bucket, blob

β€Ždocs/howto/write-logs.rst

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -194,10 +194,11 @@ example:
194194
# configuration requirements.
195195
remote_logging = True
196196
remote_base_log_folder = gs://my-bucket/path/to/logs
197-
remote_log_conn_id = MyGCSConn
198197
199-
#. Install the ``google`` package first, like so: ``pip install 'apache-airflow[google]'``.
200-
#. Make sure a Google Cloud Platform connection hook has been defined in Airflow. The hook should have read and write access to the Google Cloud Storage bucket defined above in ``remote_base_log_folder``.
198+
#. By default Application Default Credentials are used to obtain credentials. You can also
199+
set ``google_key_path`` option in ``[logging]`` section, if you want to use your own service account.
200+
#. Make sure a Google Cloud Platform account have read and write access to the Google Cloud Storage bucket defined above in ``remote_base_log_folder``.
201+
#. Install the ``google`` package, like so: ``pip install 'apache-airflow[google]'``.
201202
#. Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.
202203
#. Verify that logs are showing up for newly executed tasks in the bucket you've defined.
203204
#. Verify that the Google Cloud Storage viewer is working in the UI. Pull up a newly executed task, and verify that you see something like:
@@ -311,7 +312,7 @@ For integration with Stackdriver, this option should start with ``stackdriver://``
 The path section of the URL specifies the name of the log e.g. ``stackdriver://airflow-tasks`` writes
 logs under the name ``airflow-tasks``.
 
-You can set ``stackdriver_key_path`` option in the ``[logging]`` section to specify the path to `the service
+You can set ``google_key_path`` option in the ``[logging]`` section to specify the path to `the service
 account key file <https://cloud.google.com/iam/docs/service-accounts>`__.
 If omitted, authorization based on `the Application Default Credentials
 <https://cloud.google.com/docs/authentication/production#finding_credentials_automatically>`__ will
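For example, a Stackdriver setup using the renamed option might look like the sketch below; the log name and key path are placeholders:

    [logging]
    remote_logging = True
    remote_base_log_folder = stackdriver://airflow-tasks
    google_key_path = /etc/secrets/service-account.json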
