Cloud Storage Avro to Bigtable template

The Cloud Storage Avro to Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Pipeline requirements

  • The Bigtable table must exist and have the same column families as were exported in the Avro files.
  • The input Avro files must exist in a Cloud Storage bucket before you run the pipeline.
  • Bigtable expects a specific schema from the input Avro files.

Template parameters

Required parameters

  • bigtableProjectId: The ID of the Google Cloud project that contains the Bigtable instance that you want to write data to.
  • bigtableInstanceId: The ID of the Bigtable instance that contains the table.
  • bigtableTableId: The ID of the Bigtable table to import into.
  • inputFilePattern: The Cloud Storage path pattern where the data is located, for example: gs://<BUCKET_NAME>/FOLDER/PREFIX*.
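
As a quick illustration of the required parameters listed above, the following Python sketch checks a parameter dictionary before a job is submitted. The helper name check_parameters and the gs:// prefix check are assumptions made for this example; they are not part of the template, which performs its own validation.

```python
# Required parameter names, taken from the list above.
REQUIRED_PARAMETERS = (
    "bigtableProjectId",
    "bigtableInstanceId",
    "bigtableTableId",
    "inputFilePattern",
)


def check_parameters(params):
    """Return a list of problems found in a template parameter dict.

    Hypothetical helper: an empty list means nothing obvious is wrong.
    """
    problems = [
        f"missing required parameter: {name}"
        for name in REQUIRED_PARAMETERS
        if not params.get(name)
    ]
    pattern = params.get("inputFilePattern", "")
    # Assumption for this sketch: input files live in Cloud Storage,
    # so the pattern should use the gs:// scheme.
    if pattern and not pattern.startswith("gs://"):
        problems.append("inputFilePattern should start with gs://")
    return problems


good = {
    "bigtableProjectId": "my-project",
    "bigtableInstanceId": "my-instance",
    "bigtableTableId": "my-table",
    "inputFilePattern": "gs://mybucket/somefolder/prefix*",
}
bad = {
    "bigtableProjectId": "my-project",
    "bigtableInstanceId": "my-instance",
    "inputFilePattern": "/tmp/prefix*",
}
good_problems = check_parameters(good)
bad_problems = check_parameters(bad)
```

Here good_problems is empty, while bad_problems reports the missing bigtableTableId and the pattern that does not point at Cloud Storage.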

Optional parameters

  • splitLargeRows: A flag that enables splitting large rows into multiple MutateRows requests. Note that when a large row is split across multiple API calls, the updates to that row are not atomic.
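
To see why splitting affects atomicity, consider the following Python sketch. It only illustrates the idea of breaking one row's mutations into several batches; the function name split_mutations and the batch size are hypothetical, and the threshold the template actually uses is not stated here.

```python
def split_mutations(mutations, max_per_request):
    """Split one row's mutations into consecutive batches.

    Illustrative only: each batch stands for a separate request.
    """
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [
        mutations[i:i + max_per_request]
        for i in range(0, len(mutations), max_per_request)
    ]


# Five cell writes that all belong to the same row.
row_mutations = [f"cell-{n}" for n in range(5)]
batches = split_mutations(row_mutations, max_per_request=2)

# Each batch would travel in its own request. If the second request
# failed, the first two cells would already be written, so a reader
# could observe the row in a partially updated state.
for number, batch in enumerate(batches, start=1):
    print(f"request {number}: {batch}")
```

Without splitting, all five mutations would be applied to the row together or not at all.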

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Bigtable template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/GCS_Avro_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
    • the version name, for example 2023-09-12-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
  • REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project that contains the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import into
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*
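
The placeholder substitution described above can also be scripted. The following Python sketch assembles the same gcloud command as an argument list; the function name build_gcloud_command and the sample values (my-avro-import, my-project, and so on) are hypothetical, and the sketch assumes that no parameter value contains a comma.

```python
import shlex


def build_gcloud_command(job_name, region, version, parameters):
    """Return the argv list for the gcloud command shown above."""
    template_path = (
        f"gs://dataflow-templates-{region}/{version}/GCS_Avro_to_Cloud_Bigtable"
    )
    # --parameters takes comma-separated key=value pairs.
    joined = ",".join(f"{key}={value}" for key, value in parameters.items())
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        "--gcs-location", template_path,
        "--region", region,
        "--parameters", joined,
    ]


command = build_gcloud_command(
    job_name="my-avro-import",
    region="us-central1",
    version="latest",
    parameters={
        "bigtableProjectId": "my-project",
        "bigtableInstanceId": "my-instance",
        "bigtableTableId": "my-table",
        "inputFilePattern": "gs://mybucket/somefolder/prefix*",
    },
)

# shlex.join quotes arguments such as the wildcard pattern so the
# printed line can be pasted into a shell without glob expansion.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would launch the job, provided that the gcloud CLI is installed and authenticated.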

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_Avro_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the ID of the Google Cloud project where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
    • the version name, for example 2023-09-12-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
  • LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project that contains the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import into
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*
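
As a rough sketch of how the request above could be assembled programmatically, the following Python code builds the launch URL and JSON body with the standard library only. The function name build_launch_request and the sample values are hypothetical. The sketch omits the optional environment block and does not send the request, because sending it requires an OAuth 2.0 access token that is out of scope here.

```python
import json
from urllib.parse import quote, urlencode


def build_launch_request(project_id, location, version, job_name, parameters):
    """Return (url, body) for the templates:launch POST request."""
    gcs_path = (
        f"gs://dataflow-templates-{location}/{version}/GCS_Avro_to_Cloud_Bigtable"
    )
    url = (
        "https://dataflow.googleapis.com/v1b3/projects/"
        f"{quote(project_id)}/locations/{quote(location)}/templates:launch?"
        # urlencode percent-encodes the gs:// path for use in a query string.
        + urlencode({"gcsPath": gcs_path})
    )
    # json.dumps never emits trailing commas, so the body is valid JSON.
    body = json.dumps({"jobName": job_name, "parameters": parameters}, indent=2)
    return url, body


url, body = build_launch_request(
    project_id="my-project",
    location="us-central1",
    version="latest",
    job_name="my-avro-import",
    parameters={
        "bigtableProjectId": "my-project",
        "bigtableInstanceId": "my-instance",
        "bigtableTableId": "my-table",
        "inputFilePattern": "gs://mybucket/somefolder/prefix*",
    },
)
print(url)
print(body)
```

Generating the body with a JSON serializer instead of editing it by hand avoids syntax slips that the API would reject.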

What's next