Form Parsing with Document AI (Python)

1. Introduction

In this Codelab, you will learn how to use the Document AI Form Parser to parse a handwritten form with Python.

A simple medical intake form serves as the example, but the same procedure works with any general form that Document AI supports.

Prerequisites

This Codelab builds on content covered in other Document AI Codelabs.

We recommend completing those Codelabs, such as the Document AI OCR Codelab, before proceeding.

What you'll learn

  • How to parse and extract data from a scanned form using the Document AI Form Parser

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Knowledge of Python 3

2. Setup and requirements

This Codelab assumes that you have completed the Document AI setup steps listed in the Document AI OCR Codelab.

Complete those steps before proceeding.

You also need to install Pandas, an open source data analysis library for Python.

pip3 install --upgrade pandas
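
Before moving on, it can help to confirm that the libraries used in this Codelab are visible to your Python interpreter. The following is a minimal sketch that uses only the standard library; the helper name is_installed is illustrative and is not part of Pandas or Document AI.

```python
from importlib import util


def is_installed(module_name: str) -> bool:
    """Return True if module_name can be located (the module itself is not imported)."""
    try:
        return util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example `google`) is missing.
        return False


# Module names used by the scripts later in this Codelab.
for name in ("pandas", "google.cloud.documentai"):
    status = "installed" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If either line reports MISSING, repeat the corresponding installation step before continuing.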

3. Create a Form Parser processor

First, create a Form Parser processor instance on the Document AI Platform to use in this tutorial.

  1. In the console, go to the Document AI Platform Overview.
  2. Click Create Processor and select the Form Parser processor.
  3. Specify a processor name and select your region from the list.
  4. Click Create to create the processor.
  5. Copy the processor ID. You will need it in your code later.
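
The processor ID you copied is combined with your project ID and region to form the full resource name that the API expects. The scripts later in this Codelab build that name with the client's processor_path helper. The sketch below only illustrates the resulting strings. The function names and the sample values ("my-project", "us", "abc123") are hypothetical placeholders.

```python
def processor_resource_name(project_id: str, location: str, processor_id: str) -> str:
    """Build the full processor resource name from its three parts."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def api_endpoint(location: str) -> str:
    """Build the regional API endpoint, matching the pattern used in the scripts."""
    return f"{location}-documentai.googleapis.com"


# Hypothetical placeholder values, for illustration only.
print(processor_resource_name("my-project", "us", "abc123"))
print(api_endpoint("us"))
```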

Test the processor in the Cloud Console

You can test the processor in the console by uploading a document. Click Upload Document and select a form to parse. If you do not have a form available, you can download and use the sample form.

[Image: Health Form]

Your output should look like this: [Image: Parsed Form]

4. Download the sample form

A sample document containing a simple medical intake form is available.

You can download the PDF and then upload it to your Cloud Shell instance.

Alternatively, you can download it from the public Google Cloud Storage bucket using gsutil.

gsutil cp gs://cloud-samples-data/documentai/codelabs/form-parser/intake-form.pdf .

Confirm that the file was downloaded to Cloud Shell with the command below.

ls -ltr intake-form.pdf

5. Extract form key/value pairs

In this step, you use the online processing API to call the Form Parser processor that you created earlier. You then extract the key-value pairs found in the document.

Online processing sends a single document and waits for the response. If you want to send multiple files, or if a file exceeds the maximum number of pages for online processing, you can use batch processing instead. The OCR Codelab shows how to do this.
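
The choice between online and batch processing described above can be sketched as a small helper. This is an illustration only. The function choose_processing_mode and its max_online_pages parameter are hypothetical, and the actual page limit depends on the processor, so look it up in the Document AI documentation rather than relying on the value used here.

```python
from typing import Sequence


def choose_processing_mode(page_counts: Sequence[int], max_online_pages: int) -> str:
    """
    Pick a processing mode for a set of files.

    page_counts holds the number of pages in each file to send.
    max_online_pages is an assumed limit supplied by the caller.
    """
    if not page_counts:
        raise ValueError("At least one file is required")
    # Online processing handles one document per request.
    if len(page_counts) > 1:
        return "batch"
    # A single file that is too long also needs batch processing.
    if page_counts[0] > max_online_pages:
        return "batch"
    return "online"


# The limit of 10 pages below is a made-up value for the example.
print(choose_processing_mode([3], max_online_pages=10))
print(choose_processing_mode([3, 2], max_online_pages=10))
print(choose_processing_mode([25], max_online_pages=10))
```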

The code for making a process request is identical for every processor type apart from the processor ID.

The Document response object contains a list of the pages in the input document.

Each page object contains a list of form fields and their locations in the text.

The following code iterates through each page and extracts each key, value, and confidence score. The result is structured data that is easier to store in a database or use in other applications.

Create a file called form_parser.py and use the code below.

form_parser.py

import pandas as pd
from google.cloud import documentai_v1 as documentai


def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)

    # Use the Document AI client to process the sample form
    result = documentai_client.process_document(request=request)

    return result.document


def trim_text(text: str) -> str:
    """
    Strip leading and trailing whitespace and replace newlines with spaces
    """
    return text.strip().replace("\n", " ")


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "FORM_PARSER_ID"  # Create processor in Cloud Console

# The local file in your current working directory
FILE_PATH = "intake-form.pdf"
# Refer to https://cloud.google.com/document-ai/docs/processors-list
# for supported file types
MIME_TYPE = "application/pdf"

document = online_process(
    project_id=PROJECT_ID,
    location=LOCATION,
    processor_id=PROCESSOR_ID,
    file_path=FILE_PATH,
    mime_type=MIME_TYPE,
)

names = []
name_confidence = []
values = []
value_confidence = []

for page in document.pages:
    for field in page.form_fields:
        # Get the extracted field names
        names.append(trim_text(field.field_name.text_anchor.content))
        # Confidence - How "sure" the Model is that the text is correct
        name_confidence.append(field.field_name.confidence)

        values.append(trim_text(field.field_value.text_anchor.content))
        value_confidence.append(field.field_value.confidence)

# Create a Pandas Dataframe to print the values in tabular format.
df = pd.DataFrame(
    {
        "Field Name": names,
        "Field Name Confidence": name_confidence,
        "Field Value": values,
        "Field Value Confidence": value_confidence,
    }
)

print(df)

Run the code now. The extracted text should be printed to your console.

If you use the sample document, you should see the following output.

$ python3 form_parser.py
                                           Field Name  Field Name Confidence                                        Field Value  Field Value Confidence
0                                            Phone #:               0.999982                                     (906) 917-3486                0.999982
1                                  Emergency Contact:               0.999972                                         Eva Walker                0.999972
2                                     Marital Status:               0.999951                                             Single                0.999951
3                                             Gender:               0.999933                                                  F                0.999933
4                                         Occupation:               0.999914                                  Software Engineer                0.999914
5                                        Referred By:               0.999862                                               None                0.999862
6                                               Date:               0.999858                                            9/14/19                0.999858
7                                                DOB:               0.999716                                         09/04/1986                0.999716
8                                            Address:               0.999147                                     24 Barney Lane                0.999147
9                                               City:               0.997718                                             Towaco                0.997718
10                                              Name:               0.997345                                       Sally Walker                0.997345
11                                             State:               0.996944                                                 NJ                0.996944
...
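
Because the extracted fields are structured data, storing them in a database takes only a few lines. The following is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table name form_fields, its column names, and the store_fields helper are assumptions made for this example, and the three rows are copied from the sample output above. In form_parser.py you could build the same rows with zip(names, name_confidence, values, value_confidence).

```python
import sqlite3
from typing import Iterable, Tuple

FieldRow = Tuple[str, float, str, float]

# Rows copied from the sample output above.
rows = [
    ("Phone #:", 0.999982, "(906) 917-3486", 0.999982),
    ("City:", 0.997718, "Towaco", 0.997718),
    ("Name:", 0.997345, "Sally Walker", 0.997345),
]


def store_fields(connection: sqlite3.Connection, field_rows: Iterable[FieldRow]) -> None:
    """Create the (hypothetical) form_fields table if needed and insert the rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS form_fields ("
        "field_name TEXT, name_confidence REAL, "
        "field_value TEXT, value_confidence REAL)"
    )
    connection.executemany("INSERT INTO form_fields VALUES (?, ?, ?, ?)", field_rows)
    connection.commit()


# An in-memory database keeps the example free of side effects.
connection = sqlite3.connect(":memory:")
store_fields(connection, rows)

query = "SELECT field_name, field_value FROM form_fields ORDER BY field_name"
for field_name, field_value in connection.execute(query):
    print(field_name, field_value)
```

To keep the data between runs, pass a file path to sqlite3.connect instead of ":memory:".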

6. Parse tables

The Form Parser can also extract data from tables within documents. In this step, you download a new sample document and extract data from its table. Because the data is loaded into Pandas, it can be written to a CSV file and many other formats with a single method call.

Download the sample form with tables

A sample document containing a sample form and a table is available.

You can download the PDF and then upload it to your Cloud Shell instance.

Alternatively, you can download it from the public Google Cloud Storage bucket using gsutil.

gsutil cp gs://cloud-samples-data/documentai/codelabs/form-parser/form_with_tables.pdf .

Confirm that the file was downloaded to Cloud Shell with the command below.

ls -ltr form_with_tables.pdf

Extract table data

The processing request for table data is exactly the same as for extracting key-value pairs. The difference lies in which fields of the response you read the data from. Table data is stored in the pages[].tables[] field.

This example extracts the header rows and body rows of each table on each page, prints the table, and saves it as a CSV file.

Create a file called table_parsing.py and use the code below.

table_parsing.py

# type: ignore[1]
"""
Uses Document AI online processing to call a form parser processor
Extracts the tables and data in the document.
"""
from os.path import splitext
from typing import List, Sequence

import pandas as pd
from google.cloud import documentai


def online_process(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    """
    Processes a document using the Document AI Online Processing API.
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)

    # Use the Document AI client to process the sample form
    result = documentai_client.process_document(request=request)

    return result.document


def get_table_data(
    rows: Sequence[documentai.Document.Page.Table.TableRow], text: str
) -> List[List[str]]:
    """
    Get Text data from table rows
    """
    all_values: List[List[str]] = []
    for row in rows:
        current_row_values: List[str] = []
        for cell in row.cells:
            current_row_values.append(
                text_anchor_to_text(cell.layout.text_anchor, text)
            )
        all_values.append(current_row_values)
    return all_values


def text_anchor_to_text(text_anchor: documentai.Document.TextAnchor, text: str) -> str:
    """
    Document AI identifies table data by their offsets in the entirety of the
    document's text. This function converts offsets to a string.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in text_anchor.text_segments:
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += text[start_index:end_index]
    return response.strip().replace("\n", " ")


PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_PROJECT_LOCATION"  # Format is 'us' or 'eu'
PROCESSOR_ID = "FORM_PARSER_ID"  # Create processor before running sample

# The local file in your current working directory
FILE_PATH = "form_with_tables.pdf"
# Refer to https://cloud.google.com/document-ai/docs/file-types
# for supported file types
MIME_TYPE = "application/pdf"

document = online_process(
    project_id=PROJECT_ID,
    location=LOCATION,
    processor_id=PROCESSOR_ID,
    file_path=FILE_PATH,
    mime_type=MIME_TYPE,
)

header_row_values: List[List[str]] = []
body_row_values: List[List[str]] = []

# Input Filename without extension
output_file_prefix = splitext(FILE_PATH)[0]

for page in document.pages:
    for index, table in enumerate(page.tables):
        header_row_values = get_table_data(table.header_rows, document.text)
        body_row_values = get_table_data(table.body_rows, document.text)

        # Create a Pandas Dataframe to print the values in tabular format.
        df = pd.DataFrame(
            data=body_row_values,
            columns=pd.MultiIndex.from_arrays(header_row_values),
        )

        print(f"Page {page.page_number} - Table {index}")
        print(df)

        # Save each table as a CSV file
        output_filename = f"{output_file_prefix}_pg{page.page_number}_tb{index}.csv"
        df.to_csv(output_filename, index=False)

Run the code now. The extracted text should be printed to your console.

If you use the sample document, you should see the following output.

$ python3 table_parsing.py
Page 1 - Table 0
     Item    Description
0  Item 1  Description 1
1  Item 2  Description 2
2  Item 3  Description 3
Page 1 - Table 1
  Form Number:     12345678
0   Form Date:   2020/10/01
1        Name:   First Last
2     Address:  123 Fake St

You should also have two new CSV files in the directory where you ran the code.

$ ls
form_with_tables_pg1_tb0.csv form_with_tables_pg1_tb1.csv table_parsing.py
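
To see how text_anchor_to_text turns offsets into cell text, you can run the same logic without calling the API. The sketch below swaps the Document AI objects for SimpleNamespace stand-ins, and the document text and offsets are invented for this illustration. It shows that a cell whose text wraps onto a second line arrives as two segments, which are then joined into one string.

```python
from types import SimpleNamespace


def text_anchor_to_text(text_anchor, text: str) -> str:
    """Same logic as in table_parsing.py: join the segments, then tidy whitespace."""
    response = ""
    for segment in text_anchor.text_segments:
        response += text[int(segment.start_index):int(segment.end_index)]
    return response.strip().replace("\n", " ")


def make_anchor(*offsets):
    """Build a stand-in for a TextAnchor from (start, end) offset pairs."""
    segments = [SimpleNamespace(start_index=s, end_index=e) for s, e in offsets]
    return SimpleNamespace(text_segments=segments)


# Invented document text: the last cell wraps onto a second line.
document_text = "Item\nDescription\nItem 1\nDescription\n1\n"

single_segment_cell = make_anchor((17, 24))
wrapped_cell = make_anchor((24, 36), (36, 38))

print(text_anchor_to_text(single_segment_cell, document_text))  # Item 1
print(text_anchor_to_text(wrapped_cell, document_text))  # Description 1
```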

7. Congratulations

Congratulations, you have successfully used the Document AI API to extract data from a handwritten form. We encourage you to experiment with other form documents.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

  • In the Cloud Console, go to the Manage resources page.
  • In the project list, select your project, then click Delete.
  • In the dialog, type the project ID, then click Shut down to delete the project.

Learn more

Continue learning about Document AI with the follow-up Codelabs.

Resources

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.