Dataflow ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค ์‚ฌ์šฉ

Dataflow ๊ด€๋ฆฌํ˜• ์„œ๋น„์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ์ดํ”„๋ผ์ธ์„ ์‹คํ–‰ํ•˜๋Š” ๊ฒฝ์šฐ Dataflow ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Dataflow ์ž‘์—…์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Dataflow ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค๋Š” Google Cloud CLI์˜ ๋ช…๋ น์ค„ ๋„๊ตฌ์— ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

Google Cloud ์ฝ˜์†”์„ ์‚ฌ์šฉํ•˜์—ฌ Dataflow ์ž‘์—…์„ ๋ณด๊ณ  ์ƒํ˜ธ์ž‘์šฉํ•˜๋ ค๋ฉด Dataflow ๋ชจ๋‹ˆํ„ฐ๋ง ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Dataflow ๋ช…๋ น์ค„ ๊ตฌ์„ฑ์š”์†Œ ์„ค์น˜

๋กœ์ปฌ ํ„ฐ๋ฏธ๋„์—์„œ Dataflow ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์‚ฌ์šฉํ•˜๋ ค๋ฉด Google Cloud CLI๋ฅผ ์„ค์น˜ํ•˜๊ณ  ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Cloud Shell์˜ ๊ฒฝ์šฐ Dataflow ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค๋Š” ์ž๋™์œผ๋กœ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

๋ช…๋ น์–ด ์‹คํ–‰

์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์—ฌ Dataflow ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค์™€ ์ƒํ˜ธ์ž‘์šฉํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ Dataflow ๋ช…๋ น์–ด ๋ชฉ๋ก์„ ๋ณด๋ ค๋ฉด ์…ธ ๋˜๋Š” ํ„ฐ๋ฏธ๋„์— ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.

  gcloud dataflow --help

์ถœ๋ ฅ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด Dataflow ๋ช…๋ น์–ด์—๋Š” flex-template, jobs, snapshots sql ๋“ฑ ๊ทธ๋ฃน 4๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

Flex ํ…œํ”Œ๋ฆฟ ๋ช…๋ น์–ด

flex-template ํ•˜์œ„ ๋ช…๋ น์–ด ๊ทธ๋ฃน์„ ์‚ฌ์šฉ ์„ค์ •ํ•˜๋ฉด Dataflow Flex ํ…œํ”Œ๋ฆฟ์œผ๋กœ ์ž‘์—…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ง€์›๋˜๋Š” ์ž‘์—…์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • build: ์ง€์ •๋œ ๋งค๊ฐœ๋ณ€์ˆ˜์—์„œ Flex ํ…œํ”Œ๋ฆฟ ํŒŒ์ผ์„ ๋นŒ๋“œํ•ฉ๋‹ˆ๋‹ค.
  • run: ์ง€์ •๋œ ๊ฒฝ๋กœ์—์„œ ์ž‘์—…์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

ํ…œํ”Œ๋ฆฟ์„ ์‹คํ–‰ํ•˜๋ ค๋ฉด Cloud Storage ๋ฒ„ํ‚ท์— ์ €์žฅ๋˜๋Š” ํ…œํ”Œ๋ฆฟ ์‚ฌ์–‘ ํŒŒ์ผ์„ ๋งŒ๋“ค์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ…œํ”Œ๋ฆฟ ์‚ฌ์–‘ ํŒŒ์ผ์—๋Š” SDK ์ •๋ณด ๋ฐ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ์™€ ๊ฐ™์ด ์ž‘์—…์„ ์‹คํ–‰ํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๋ชจ๋“  ์ •๋ณด๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ metadata.json ํŒŒ์ผ์—๋Š” ์ด๋ฆ„, ์„ค๋ช…, ์ž…๋ ฅ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๊ฐ™์€ ํ…œํ”Œ๋ฆฟ์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. ํ…œํ”Œ๋ฆฟ ์‚ฌ์–‘ ํŒŒ์ผ์„ ๋งŒ๋“  ํ›„ Java ๋˜๋Š” Python ์ค‘ ํ•˜๋‚˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Flex ํ…œํ”Œ๋ฆฟ์„ ๋นŒ๋“œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Google Cloud CLI๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Flex ํ…œํ”Œ๋ฆฟ ๋งŒ๋“ค๊ธฐ ๋ฐ ์‹คํ–‰์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ Flex ํ…œํ”Œ๋ฆฟ ๋นŒ๋“œ ๋ฐ ์‹คํ–‰ ํŠœํ† ๋ฆฌ์–ผ์„ ์ฐธ์กฐํ•˜์„ธ์š”.

Jobs ๋ช…๋ น์–ด

jobs ํ•˜์œ„ ๋ช…๋ น์–ด ๊ทธ๋ฃน์„ ์‚ฌ์šฉํ•˜๋ฉด ํ”„๋กœ์ ํŠธ์—์„œ Dataflow ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ง€์›๋˜๋Š” ์ž‘์—…์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • cancel: ๋ช…๋ น์ค„ ์ธ์ˆ˜์™€ ์ผ์น˜ํ•˜๋Š” ๋ชจ๋“  ์ž‘์—…์„ ์ทจ์†Œํ•ฉ๋‹ˆ๋‹ค.
  • describe: Get API์—์„œ ๊ฐ€์ ธ์˜จ ์ž‘์—… ๊ฐ์ฒด๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
  • drain: ๋ช…๋ น์ค„ ์ธ์ˆ˜์™€ ์ผ์น˜ํ•˜๋Š” ๋ชจ๋“  ์ž‘์—…์„ ๋“œ๋ ˆ์ด๋‹ํ•ฉ๋‹ˆ๋‹ค.
  • list: ํŠน์ • ํ”„๋กœ์ ํŠธ์˜ ๋ชจ๋“  ์ž‘์—…์„ ๋‚˜์—ดํ•˜๋ฉฐ ์„ ํƒ์ ์œผ๋กœ ๋ฆฌ์ „๋ณ„๋กœ ํ•„ํ„ฐ๋ง๋ฉ๋‹ˆ๋‹ค.
  • run: ์ง€์ •๋œ ๊ฒฝ๋กœ์—์„œ ์ž‘์—…์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • show: ์ง€์ •๋œ ์ž‘์—…์— ๋Œ€ํ•œ ๊ฐ„๋‹จํ•œ ์„ค๋ช…์„ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค.

ํ”„๋กœ์ ํŠธ์˜ ๋ชจ๋“  Dataflow ์ž‘์—… ๋ชฉ๋ก์„ ๊ฐ€์ ธ์˜ค๋ ค๋ฉด ์…ธ ๋˜๋Š” ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

gcloud dataflow jobs list

์ด ๋ช…๋ น์–ด๋Š” ํ˜„์žฌ ์ž‘์—… ๋ชฉ๋ก์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ์€ ์ƒ˜ํ”Œ ์ถœ๋ ฅ์ž…๋‹ˆ๋‹ค.

  ID                                        NAME                                    TYPE   CREATION_TIME        STATE   REGION
  2015-06-03_16_39_22-4020553808241078833   wordcount-janedoe-0603233849            Batch  2015-06-03 16:39:22  Done    us-central1
  2015-06-03_16_38_28-4363652261786938862   wordcount-johndoe-0603233820            Batch  2015-06-03 16:38:28  Done    us-central1
  2015-05-21_16_24_11-17823098268333533078  bigquerytornadoes-johndoe-0521232402    Batch  2015-05-21 16:24:11  Done    europe-west1
  2015-05-21_13_38_06-16409850040969261121  bigquerytornadoes-johndoe-0521203801    Batch  2015-05-21 13:38:06  Done    us-central1
  2015-05-21_13_17_18-18349574013243942260  bigquerytornadoes-johndoe-0521201710    Batch  2015-05-21 13:17:18  Done    europe-west1
  2015-05-21_12_49_37-9791290545307959963   wordcount-johndoe-0521194928            Batch  2015-05-21 12:49:37  Done    us-central1
  2015-05-20_15_54_51-15905022415025455887  wordcount-johndoe-0520225444            Batch  2015-05-20 15:54:51  Failed  us-central1
  2015-05-20_15_47_02-14774624590029708464  wordcount-johndoe-0520224637            Batch  2015-05-20 15:47:02  Done    us-central1

๊ฐ ์ž‘์—…์— ๋Œ€ํ•ด ํ‘œ์‹œ๋˜๋Š” ์ž‘์—… ID๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ describe ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ์ž‘์—…์— ๋Œ€ํ•œ ์ถ”๊ฐ€ ์ •๋ณด๋ฅผ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

gcloud dataflow jobs describe JOB_ID

JOB_ID๋ฅผ ํ”„๋กœ์ ํŠธ์˜ Dataflow ์ž‘์—… ์ค‘ ํ•˜๋‚˜์˜ ์ž‘์—… ID๋กœ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด ์ž‘์—… ID 2015-02-09_11_39_40-15635991037808002875์˜ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๊ฒฝ์šฐ ์ƒ˜ํ”Œ ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

createTime: '2015-02-09T19:39:41.140Z'
currentState: JOB_STATE_DONE
currentStateTime: '2015-02-09T19:56:39.510Z'
id: 2015-02-09_11_39_40-15635991037808002875
name: tfidf-bchambers-0209193926
projectId: google.com:clouddfe
type: JOB_TYPE_BATCH

๊ฒฐ๊ณผ๋ฅผ JSON ํ˜•์‹์œผ๋กœ ์ง€์ •ํ•˜๋ ค๋ฉด --format=json ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

gcloud --format=json dataflow jobs describe JOB_ID

JOB_ID๋ฅผ ํ”„๋กœ์ ํŠธ์˜ Dataflow ์ž‘์—… ์ค‘ ํ•˜๋‚˜์˜ ์ž‘์—… ID๋กœ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

๋‹ค์Œ ์ƒ˜ํ”Œ ์ถœ๋ ฅ์€ JSON์œผ๋กœ ํ˜•์‹ ์ง€์ •๋ฉ๋‹ˆ๋‹ค.

{
  "createTime": "2015-02-09T19:39:41.140Z",
  "currentState": "JOB_STATE_DONE",
  "currentStateTime": "2015-02-09T19:56:39.510Z",
  "id": "2015-02-09_11_39_40-15635991037808002875",
  "name": "tfidf-bchambers-0209193926",
  "projectId": "google.com:clouddfe",
  "type": "JOB_TYPE_BATCH"
}

์Šค๋ƒ…์ƒท ๋ช…๋ น์–ด

snapshots ํ•˜์œ„ ๋ช…๋ น์–ด ๊ทธ๋ฃน์„ ์‚ฌ์šฉ ์„ค์ •ํ•˜๋ฉด Dataflow ์Šค๋ƒ…์ƒท์œผ๋กœ ์ž‘์—…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ง€์›๋˜๋Š” ์ž‘์—…์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • create: Dataflow ์ž‘์—…์˜ ์Šค๋ƒ…์ƒท์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
  • delete: Dataflow ์Šค๋ƒ…์ƒท์„ ์‚ญ์ œํ•ฉ๋‹ˆ๋‹ค.
  • describe: Dataflow ์Šค๋ƒ…์ƒท์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.
  • list: ์„ ํƒ์ ์œผ๋กœ ์ž‘์—… ID๋กœ ํ•„ํ„ฐ๋ง๋œ ์ง€์ •๋œ ๋ฆฌ์ „์˜ ํ”„๋กœ์ ํŠธ์— ์žˆ๋Š” ๋ชจ๋“  Dataflow ์Šค๋ƒ…์ƒท์„ ๋‚˜์—ดํ•ฉ๋‹ˆ๋‹ค.

Dataflow์—์„œ ์Šค๋ƒ…์ƒท ์‚ฌ์šฉ์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ Dataflow ์Šค๋ƒ…์ƒท ์‚ฌ์šฉ์„ ์ฐธ์กฐํ•˜์„ธ์š”.

SQL ๋ช…๋ น์–ด

sql ํ•˜์œ„ ๋ช…๋ น์–ด ๊ทธ๋ฃน์„ ์‚ฌ์šฉ ์„ค์ •ํ•˜๋ฉด Dataflow SQL๋กœ ์ž‘์—…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. gcloud Dataflow sql query ๋ช…๋ น์–ด๋Š” Dataflow์—์„œ ์‚ฌ์šฉ์ž ์ง€์ • SQL ์ฟผ๋ฆฌ๋ฅผ ์ˆ˜๋ฝํ•˜๊ณ  ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด BigQuery ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ์ฝ๊ณ  ๋‹ค๋ฅธ BigQuery ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ์“ฐ๋Š” Dataflow ์ž‘์—…์—์„œ ๊ฐ„๋‹จํ•œ SQL ์ฟผ๋ฆฌ๋ฅผ ์‹คํ–‰ํ•˜๋ ค๋ฉด ์…ธ ๋˜๋Š” ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

gcloud dataflow sql query 'SELECT word FROM
bigquery.table.PROJECT_ID.input_dataset.input_table
where count > 3'
    --job-name=JOB_NAME \
    --region=us-west1 \
    --bigquery-dataset=OUTPUT_DATASET \
    --bigquery-table=OUTPUT_TABLE

๋‹ค์Œ์„ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

  • PROJECT_ID: ํ”„๋กœ์ ํŠธ์˜ ์ „์—ญ์ ์œผ๋กœ ๊ณ ์œ ํ•œ ์ด๋ฆ„
  • JOB_NAME: Dataflow ์ž‘์—… ์ด๋ฆ„
  • OUTPUT_DATASET: ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ด๋ฆ„
  • OUTPUT_TABLE: ์ถœ๋ ฅ ํ…Œ์ด๋ธ” ์ด๋ฆ„

Dataflow SQL ์ž‘์—…์„ ์‹œ์ž‘ํ•˜๋Š” ๋ฐ ๋ช‡ ๋ถ„ ์ •๋„ ๊ฑธ๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ž‘์—…์„ ๋งŒ๋“  ํ›„์—๋Š” ์—…๋ฐ์ดํŠธํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. Dataflow SQL ์ž‘์—…์€ ์ž๋™ ํ™•์žฅ์„ ์‚ฌ์šฉํ•˜๋ฉฐ Dataflow๊ฐ€ ์ž๋™์œผ๋กœ ์‹คํ–‰ ๋ชจ๋“œ๋ฅผ ์ผ๊ด„ ๋˜๋Š” ์ŠคํŠธ๋ฆฌ๋ฐ์œผ๋กœ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. Dataflow SQL ์ž‘์—…์˜ ๊ฒฝ์šฐ ์ด ๋™์ž‘์„ ์ œ์–ดํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. Dataflow SQL ์ž‘์—…์„ ์ค‘์ง€ํ•˜๋ ค๋ฉด cancel ๋ช…๋ น์–ด๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. drain์„ ์‚ฌ์šฉํ•˜์—ฌ Dataflow SQL ์ž‘์—…์„ ์ค‘์ง€ํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

Dataflow์—์„œ SQL ๋ช…๋ น์–ด๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ Dataflow SQL ์ฐธ์กฐ ๋ฐ gcloud Dataflow sql query ๋ฌธ์„œ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

๋ฆฌ์ „์ด ์ง€์ •๋œ ๋ช…๋ น์–ด ์‚ฌ์šฉ

Dataflow ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค๋Š” gcloud CLI ๋ฒ„์ „ 176๋ถ€ํ„ฐ ๋ฆฌ์ „์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋“  ๋ช…๋ น์–ด์— --region ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž‘์—…์„ ๊ด€๋ฆฌํ•˜๋Š” ๋ฆฌ์ „์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด gcloud dataflow jobs list๋Š” ๋ชจ๋“  ๋ฆฌ์ „์˜ ์ž‘์—…์„ ๋‚˜์—ดํ•˜์ง€๋งŒ gcloud dataflow jobs list --region=europe-west1๋Š” europe-west1์—์„œ ๊ด€๋ฆฌ๋˜๋Š” ์ž‘์—…๋งŒ ๋‚˜์—ดํ•ฉ๋‹ˆ๋‹ค.