Hive to Cloud Storage template

Use the Dataproc Serverless Hive to Cloud Storage template to extract data from Hive to Cloud Storage.

Use the template

gcloud CLI ๋˜๋Š” Dataproc API๋ฅผ ์‚ฌ์šฉํ•ด์„œ ํ…œํ”Œ๋ฆฟ์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

gcloud

์•„๋ž˜์˜ ๋ช…๋ น์–ด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์ „์— ๋‹ค์Œ์„ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

  • PROJECT_ID: (Required) Your Google Cloud project ID listed in the IAM Settings.
  • REGION: (Required) Compute Engine region.
  • TEMPLATE_VERSION: (Required) Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta. To list available template versions, visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries.
  • SUBNET: (Optional) If a subnet is not specified, the subnet in the default network in the specified region is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • HOST and PORT: (Required) Hostname or IP address and port of the source Hive database host.

    Example: 10.0.0.33

  • TABLE: (Required) Hive input table name.
  • DATABASE: (Required) Hive input database name.
  • CLOUD_STORAGE_OUTPUT_PATH: (Required) Cloud Storage path where output will be stored.

    Example: gs://dataproc-templates/hive_to_cloud_storage_output

  • FORMAT: (Optional) Output data format. Options: avro, parquet, csv, or json. Default: avro. Note: If avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Dataproc Serverless jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar[, ... other jars]
  • HIVE_PARTITION_COLUMN: (Optional) Column to partition the Hive data on.
  • MODE: (Required) Write mode for the Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
  • SERVICE_ACCOUNT: (Optional) If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: (Optional) Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: (Optional) Comma-separated list of label=value pairs.
  • LOG_LEVEL: (Optional) Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: (Optional) The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME

๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

Linux, macOS ๋˜๋Š” Cloud Shell

gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --version="1.2" \
    --project="PROJECT_ID" \
    --region="REGION" \
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" \
    --subnet="SUBNET" \
    --kms-key="KMS_KEY" \
    --service-account="SERVICE_ACCOUNT" \
    --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" \
    --labels="LABEL=LABEL_VALUE" \
    -- --template=HIVETOGCS \
    --templateProperty log.level="LOG_LEVEL" \
    --templateProperty hive.input.table="TABLE" \
    --templateProperty hive.input.db="DATABASE" \
    --templateProperty hive.gcs.output.path="CLOUD_STORAGE_OUTPUT_PATH" \
    --templateProperty hive.gcs.output.format="FORMAT" \
    --templateProperty hive.partition.col="HIVE_PARTITION_COLUMN" \
    --templateProperty hive.gcs.save.mode="MODE"

Windows (PowerShell)

gcloud dataproc batches submit spark `
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
    --version="1.2" `
    --project="PROJECT_ID" `
    --region="REGION" `
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" `
    --subnet="SUBNET" `
    --kms-key="KMS_KEY" `
    --service-account="SERVICE_ACCOUNT" `
    --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" `
    --labels="LABEL=LABEL_VALUE" `
    -- --template=HIVETOGCS `
    --templateProperty log.level="LOG_LEVEL" `
    --templateProperty hive.input.table="TABLE" `
    --templateProperty hive.input.db="DATABASE" `
    --templateProperty hive.gcs.output.path="CLOUD_STORAGE_OUTPUT_PATH" `
    --templateProperty hive.gcs.output.format="FORMAT" `
    --templateProperty hive.partition.col="HIVE_PARTITION_COLUMN" `
    --templateProperty hive.gcs.save.mode="MODE"

Windows (cmd.exe)

gcloud dataproc batches submit spark ^
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
    --version="1.2" ^
    --project="PROJECT_ID" ^
    --region="REGION" ^
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ^
    --subnet="SUBNET" ^
    --kms-key="KMS_KEY" ^
    --service-account="SERVICE_ACCOUNT" ^
    --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" ^
    --labels="LABEL=LABEL_VALUE" ^
    -- --template=HIVETOGCS ^
    --templateProperty log.level="LOG_LEVEL" ^
    --templateProperty hive.input.table="TABLE" ^
    --templateProperty hive.input.db="DATABASE" ^
    --templateProperty hive.gcs.output.path="CLOUD_STORAGE_OUTPUT_PATH" ^
    --templateProperty hive.gcs.output.format="FORMAT" ^
    --templateProperty hive.partition.col="HIVE_PARTITION_COLUMN" ^
    --templateProperty hive.gcs.save.mode="MODE"
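When the commands above are generated by automation, the template-specific arguments after the `--` separator can be assembled programmatically. A minimal sketch (the property values shown are illustrative placeholders, not real resources):

```python
# Sketch: build the argument list passed after the "--" separator of
# `gcloud dataproc batches submit spark` for the HIVETOGCS template.
# All values below are illustrative placeholders.
template_properties = {
    "log.level": "INFO",
    "hive.input.table": "employees",
    "hive.input.db": "default",
    "hive.gcs.output.path": "gs://dataproc-templates/hive_to_cloud_storage_output",
    "hive.gcs.output.format": "avro",
    "hive.partition.col": "department",
    "hive.gcs.save.mode": "overwrite",
}

args = ["--template=HIVETOGCS"]
for key, value in template_properties.items():
    # Each template property is passed as a separate --templateProperty flag.
    args += ["--templateProperty", f"{key}={value}"]

print(" ".join(args))
```

The resulting list can be appended to the fixed part of the command shown above.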

REST

์š”์ฒญ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์ „์— ๋‹ค์Œ์„ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

  • PROJECT_ID: (Required) Your Google Cloud project ID listed in the IAM Settings.
  • REGION: (Required) Compute Engine region.
  • TEMPLATE_VERSION: (Required) Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta. To list available template versions, visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries.
  • SUBNET: (Optional) If a subnet is not specified, the subnet in the default network in the specified region is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • HOST and PORT: (Required) Hostname or IP address and port of the source Hive database host.

    Example: 10.0.0.33

  • TABLE: (Required) Hive input table name.
  • DATABASE: (Required) Hive input database name.
  • CLOUD_STORAGE_OUTPUT_PATH: (Required) Cloud Storage path where output will be stored.

    Example: gs://dataproc-templates/hive_to_cloud_storage_output

  • FORMAT: (Optional) Output data format. Options: avro, parquet, csv, or json. Default: avro. Note: If avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Dataproc Serverless jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar[, ... other jars]
  • HIVE_PARTITION_COLUMN: (Optional) Column to partition the Hive data on.
  • MODE: (Required) Write mode for the Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
  • SERVICE_ACCOUNT: (Optional) If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: (Optional) Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: (Optional) Comma-separated list of label=value pairs.
  • LOG_LEVEL: (Optional) Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: (Optional) The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME

HTTP ๋ฉ”์„œ๋“œ ๋ฐ URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches

Request JSON body:

{
  "environmentConfig":{
    "executionConfig":{
      "subnetworkUri":"SUBNET",
      "kmsKey": "KMS_KEY",
      "serviceAccount": "SERVICE_ACCOUNT"
    }
  },
  "labels": {
    "LABEL": "LABEL_VALUE"
  },
  "runtimeConfig": {
    "version": "1.2",
    "properties": {
      "spark.hadoop.hive.metastore.uris":"thrift://HOST:PORT",
      "PROPERTY": "PROPERTY_VALUE"
    }
  },
  "sparkBatch":{
    "mainClass":"com.google.cloud.dataproc.templates.main.DataProcTemplate",
    "args":[
      "--template","HIVETOGCS",
      "--templateProperty","log.level=LOG_LEVEL",
      "--templateProperty","hive.input.table=TABLE",
      "--templateProperty","hive.input.db=DATABASE",
      "--templateProperty","hive.gcs.output.path=CLOUD_STORAGE_OUTPUT_PATH",
      "--templateProperty","hive.gcs.output.format=FORMAT",
      "--templateProperty","hive.partition.col=HIVE_PARTITION_COLUMN",
      "--templateProperty","hive.gcs.save.mode=MODE"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/connector/spark-avro.jar",
      "gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar"
    ]
  }
}
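If you prefer to generate the request body programmatically rather than editing it by hand, it can be assembled as a plain dictionary and serialized to a file for submission. A minimal sketch using only the Python standard library (the uppercase values are the same placeholders used throughout this page):

```python
import json

# Sketch: build the batches.create request body shown above.
# Replace the uppercase placeholder strings with your own values.
body = {
    "environmentConfig": {
        "executionConfig": {
            "subnetworkUri": "SUBNET",
            "kmsKey": "KMS_KEY",
            "serviceAccount": "SERVICE_ACCOUNT",
        }
    },
    "labels": {"LABEL": "LABEL_VALUE"},
    "runtimeConfig": {
        "version": "1.2",
        "properties": {
            "spark.hadoop.hive.metastore.uris": "thrift://HOST:PORT",
        },
    },
    "sparkBatch": {
        "mainClass": "com.google.cloud.dataproc.templates.main.DataProcTemplate",
        "args": [
            "--template", "HIVETOGCS",
            "--templateProperty", "hive.input.table=TABLE",
            "--templateProperty", "hive.input.db=DATABASE",
            "--templateProperty", "hive.gcs.output.path=CLOUD_STORAGE_OUTPUT_PATH",
        ],
        "jarFileUris": [
            "gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar",
        ],
    },
}

# Save the body so it can be POSTed with a tool such as curl.
with open("request.json", "w") as f:
    json.dump(body, f, indent=2)
```

The remaining --templateProperty pairs (format, partition column, save mode) can be appended to the args list in the same way.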

์š”์ฒญ์„ ๋ณด๋‚ด๋ ค๋ฉด ๋‹ค์Œ ์˜ต์…˜ ์ค‘ ํ•˜๋‚˜๋ฅผ ํŽผ์นฉ๋‹ˆ๋‹ค.

๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ JSON ์‘๋‹ต์ด ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค.

{
  "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",
    "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID",
    "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583",
    "createTime": "2023-02-24T03:31:03.440329Z",
    "operationType": "BATCH",
    "description": "Batch"
  }
}