Gemini Live API์—์„œ Vertex AI RAG Engine ์‚ฌ์šฉ

๊ฒ€์ƒ‰ ์ฆ๊ฐ• ์ƒ์„ฑ (RAG)์€ LLM์ด ๊ฒ€์ฆ ๊ฐ€๋Šฅํ•œ ๋Œ€๋‹ต์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ด€๋ จ ์ •๋ณด๋ฅผ ๊ฒ€์ƒ‰ํ•˜๊ณ  ์ œ๊ณตํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋Š” ๊ธฐ์ˆ ์ž…๋‹ˆ๋‹ค. ์ •๋ณด์—๋Š” ์ตœ์‹  ์ •๋ณด, ์ฃผ์ œ ๋ฐ ์ปจํ…์ŠคํŠธ ๋˜๋Š” ์ •๋‹ต์ด ํฌํ•จ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ํŽ˜์ด์ง€์—์„œ๋Š” RAG ์ฝ”ํผ์Šค์—์„œ ์ •๋ณด๋ฅผ ์ง€์ •ํ•˜๊ณ  ๊ฒ€์ƒ‰ํ•  ์ˆ˜ ์žˆ๋Š” Gemini Live API์™€ ํ•จ๊ป˜ Vertex AI RAG Engine์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๊ธฐ๋ณธ ์š”๊ฑด

๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ Live API์™€ ํ•จ๊ป˜ Vertex AI RAG Engine์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋‹ค์Œ ์„ ํ–‰ ์กฐ๊ฑด์„ ์™„๋ฃŒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  1. Vertex AI์—์„œ RAG API๋ฅผ ์‚ฌ์šฉ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.

  2. RAG ์ฝ”ํผ์Šค ๋งŒ๋“ค๊ธฐ ์˜ˆ์‹œ

  3. RAG ์ฝ”ํผ์Šค์— ํŒŒ์ผ์„ ์—…๋กœ๋“œํ•˜๋ ค๋ฉด RAG ํŒŒ์ผ ๊ฐ€์ ธ์˜ค๊ธฐ ์˜ˆ์‹œ API๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์„ค์ •

Vertex AI RAG Engine์„ ๋„๊ตฌ๋กœ ์ง€์ •ํ•˜์—ฌ Live API์™€ ํ•จ๊ป˜ Vertex AI RAG Engine์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ ์ฝ”๋“œ ์ƒ˜ํ”Œ์€ Vertex AI RAG ์—”์ง„์„ ๋„๊ตฌ๋กœ ์ง€์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋‹ค์Œ ๋ณ€์ˆ˜๋ฅผ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค.

  • YOUR_PROJECT_ID: Google Cloud ํ”„๋กœ์ ํŠธ์˜ ID.
  • YOUR_CORPUS_ID: ์ฝ”ํผ์Šค ID์ž…๋‹ˆ๋‹ค.
  • YOUR_LOCATION: ์š”์ฒญ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฆฌ์ „.
PROJECT_ID = "YOUR_PROJECT_ID"
RAG_CORPUS_ID = "YOUR_CORPUS_ID"
LOCATION = "YOUR_LOCATION"

TOOLS = {
    "retrieval": {
        "vertex_rag_store": {
            "rag_resources": {
                "rag_corpus": f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{RAG_CORPUS_ID}"
            }
        }
    }
}
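The rag_corpus field must be a fully qualified resource name. As an illustrative sanity check (the helper below is ours, not part of any Google API), the expected pattern is:

```python
# Illustrative helper: builds the fully qualified RAG corpus resource name
# that the vertex_rag_store tool expects.
def rag_corpus_resource_name(project_id: str, location: str, corpus_id: str) -> str:
    return f"projects/{project_id}/locations/{location}/ragCorpora/{corpus_id}"

print(rag_corpus_resource_name("my-project", "us-central1", "1234"))
# → projects/my-project/locations/us-central1/ragCorpora/1234
```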

์‹ค์‹œ๊ฐ„ ์ปค๋ฎค๋‹ˆ์ผ€์ด์…˜์— Websocket ์‚ฌ์šฉํ•˜๊ธฐ

ํด๋ผ์ด์–ธํŠธ์™€ ์„œ๋ฒ„ ๊ฐ„์˜ ์‹ค์‹œ๊ฐ„ ํ†ต์‹ ์„ ์‚ฌ์šฉ ์„ค์ •ํ•˜๋ ค๋ฉด Websocket๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ฝ”๋“œ ์ƒ˜ํ”Œ์€ Python API์™€ Python SDK๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Websocket๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Python API

import json

from IPython.display import Markdown, display
from websockets.asyncio.client import connect

CONFIG = {"response_modalities": ["TEXT"], "speech_config": {"language_code": "en-US"}}
headers = {
  "Content-Type": "application/json",
  # bearer_token holds an OAuth access token, for example from
  # `gcloud auth print-access-token`.
  "Authorization": f"Bearer {bearer_token[0]}",
}
HOST = f"{LOCATION}-aiplatform.googleapis.com"
SERVICE_URL = f"wss://{HOST}/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent"
MODEL = "gemini-2.0-flash-exp"

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
  # Setup the session
  await ws.send(
      json.dumps(
          {
              "setup": {
                  "model": MODEL,
                  "generation_config": CONFIG,
                  # Set up RAG as a retrieval tool
                  "tools": TOOLS,
              }
          }
      )
  )

  # Receive setup response
  raw_response = await ws.recv(decode=False)
  setup_response = json.loads(raw_response.decode("ascii"))

  # Send text message
  text_input = "What are popular LLMs?"
  display(Markdown(f"**Input:** {text_input}"))

  msg = {
      "client_content": {
          "turns": [{"role": "user", "parts": [{"text": text_input}]}],
          "turn_complete": True,
      }
  }

  await ws.send(json.dumps(msg))

  responses = []

  # Receive chunks of server response
  async for raw_response in ws:
      response = json.loads(raw_response.decode())
      server_content = response.pop("serverContent", None)
      if server_content is None:
          break

      model_turn = server_content.pop("modelTurn", None)
      if model_turn is not None:
          parts = model_turn.pop("parts", None)
          if parts is not None:
              display(Markdown(f"**parts >** {parts}"))
              responses.append(parts[0]["text"])

      # End of turn
      turn_complete = server_content.pop("turnComplete", None)
      if turn_complete:
          grounding_metadata = server_content.pop("groundingMetadata", None)
          if grounding_metadata is not None:
            grounding_chunks = grounding_metadata.pop("groundingChunks", None)
            if grounding_chunks is not None:
              for chunk in grounding_chunks:
                display(Markdown(f"**grounding_chunk >** {chunk}"))
          break

  # Print the server response
  display(Markdown(f"**Response >** {''.join(responses)}"))
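The per-message handling in the receive loop above can be factored into a small pure function, which is easy to unit-test without a live connection. This is an illustrative sketch — parse_server_message is our helper name, not part of the API:

```python
import json

def parse_server_message(raw: bytes) -> dict:
    """Extracts text parts, grounding chunks, and turn state from one
    BidiGenerateContent server message (illustrative helper)."""
    response = json.loads(raw.decode())
    server_content = response.get("serverContent") or {}
    parts = (server_content.get("modelTurn") or {}).get("parts") or []
    grounding = (server_content.get("groundingMetadata") or {}).get("groundingChunks") or []
    return {
        "texts": [p["text"] for p in parts if "text" in p],
        "grounding_chunks": grounding,
        "turn_complete": bool(server_content.get("turnComplete")),
    }

msg = b'{"serverContent": {"modelTurn": {"parts": [{"text": "LLMs are..."}]}}}'
print(parse_server_message(msg))
# → {'texts': ['LLMs are...'], 'grounding_chunks': [], 'turn_complete': False}
```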

Python SDK

์ƒ์„ฑํ˜• AI SDK๋ฅผ ์„ค์น˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ณด๋ ค๋ฉด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

from google import genai
from google.genai.types import Content, LiveConnectConfig, Modality, Part
from IPython import display

MODEL = "gemini-2.0-flash-exp"

client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION,
)

async with client.aio.live.connect(
    model=MODEL,
    config=LiveConnectConfig(
        response_modalities=[Modality.TEXT],
        tools=[TOOLS],
    ),
) as session:
    text_input = "What are core LLM techniques?"
    print("> ", text_input, "\n")
    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    async for message in session.receive():
        if message.text:
            display.display(display.Markdown(message.text))
            continue

Vertex AI RAG Engine์„ ์ปจํ…์ŠคํŠธ ์Šคํ† ์–ด๋กœ ์‚ฌ์šฉ

Vertex AI RAG Engine์„ Gemini Live API์˜ ์ปจํ…์ŠคํŠธ ์Šคํ† ์–ด๋กœ ์‚ฌ์šฉํ•˜์—ฌ ์„ธ์…˜ ์ปจํ…์ŠคํŠธ๋ฅผ ์ €์žฅํ•˜์—ฌ ๋Œ€ํ™”์™€ ๊ด€๋ จ๋œ ์ด์ „ ์ปจํ…์ŠคํŠธ๋ฅผ ํ˜•์„ฑํ•˜๊ณ  ๊ฒ€์ƒ‰ํ•˜๋ฉฐ ๋ชจ๋ธ ์ƒ์„ฑ์„ ์œ„ํ•œ ํ˜„์žฌ ์ปจํ…์ŠคํŠธ๋ฅผ ๋ณด๊ฐ•ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•˜์—ฌ ์—ฌ๋Ÿฌ Live API ์„ธ์…˜ ๊ฐ„์— ์ปจํ…์ŠคํŠธ๋ฅผ ๊ณต์œ ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

Vertex AI RAG Engine์€ ์„ธ์…˜ ์ปจํ…์ŠคํŠธ์—์„œ ๋‹ค์Œ ํ˜•์‹์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ์ƒ‰์ธ์„ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.

  • ํ…์ŠคํŠธ
  • ์˜ค๋””์˜ค ์Œ์„ฑ

MemoryCorpus ์œ ํ˜• ์ฝ”ํผ์Šค ๋งŒ๋“ค๊ธฐ

์„ธ์…˜ ์ปจํ…์ŠคํŠธ์˜ ๋Œ€ํ™” ํ…์ŠคํŠธ๋ฅผ ์ €์žฅํ•˜๊ณ  ์ƒ‰์ธ์„ ์ƒ์„ฑํ•˜๋ ค๋ฉด MemoryCorpus ์œ ํ˜•์˜ RAG ์ฝ”ํผ์Šค๋ฅผ ๋งŒ๋“ค์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ์ƒ‰์ธ ์ƒ์„ฑ์„ ์œ„ํ•ด Live API์—์„œ ์ €์žฅ๋œ ์„ธ์…˜ ์ปจํ…์ŠคํŠธ๋ฅผ ํŒŒ์‹ฑํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋นŒ๋“œํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ฝ”ํผ์Šค ๊ตฌ์„ฑ์—์„œ LLM ํŒŒ์„œ๋ฅผ ์ง€์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ด ์ฝ”๋“œ ์ƒ˜ํ”Œ์€ ์ฝ”ํผ์Šค๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋จผ์ € ๋ณ€์ˆ˜๋ฅผ ๊ฐ’์œผ๋กœ ๋ฐ”๊ฟ”์•ผ ํ•ฉ๋‹ˆ๋‹ค.

import vertexai
from vertexai import rag

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Currently supports Google first-party embedding models
EMBEDDING_MODEL = "YOUR_EMBEDDING_MODEL"  # Such as "publishers/google/models/text-embedding-005"
MEMORY_CORPUS_DISPLAY_NAME = "YOUR_MEMORY_CORPUS_DISPLAY_NAME"
LLM_PARSER_MODEL_NAME = "YOUR_LLM_PARSER_MODEL_NAME"  # Such as "projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.5-pro-preview-05-06"

memory_corpus = rag.create_corpus(
    display_name=MEMORY_CORPUS_DISPLAY_NAME,
    corpus_type_config=rag.RagCorpusTypeConfig(
        corpus_type_config=rag.MemoryCorpus(
            llm_parser=rag.LlmParserConfig(
                model_name=LLM_PARSER_MODEL_NAME,
            )
        )
    ),
    backend_config=rag.RagVectorDbConfig(
        rag_embedding_model_config=rag.RagEmbeddingModelConfig(
            vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
                publisher_model=EMBEDDING_MODEL
            )
        )
    ),
)

์ปจํ…์ŠคํŠธ๋ฅผ ์ €์žฅํ•  ๋ฉ”๋ชจ๋ฆฌ ์ฝ”ํผ์Šค ์ง€์ •

Live API์—์„œ ๋ฉ”๋ชจ๋ฆฌ ์ฝ”ํผ์Šค๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ๋ฉ”๋ชจ๋ฆฌ ์ฝ”ํผ์Šค๋ฅผ ๊ฒ€์ƒ‰ ๋„๊ตฌ๋กœ ์ง€์ •ํ•œ ๋‹ค์Œ store_context๋ฅผ true๋กœ ์„ค์ •ํ•˜์—ฌ Live API๊ฐ€ ์„ธ์…˜ ์ปจํ…์ŠคํŠธ๋ฅผ ์ €์žฅํ•˜๋„๋ก ํ—ˆ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ด ์ฝ”๋“œ ์ƒ˜ํ”Œ์€ ์ปจํ…์ŠคํŠธ๋ฅผ ์ €์žฅํ•  ๋ฉ”๋ชจ๋ฆฌ ์ฝ”ํผ์Šค๋ฅผ ์ง€์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋‹จ, ๋จผ์ € ๋ณ€์ˆ˜๋ฅผ ๊ฐ’์œผ๋กœ ๋ฐ”๊ฟ”์•ผ ํ•ฉ๋‹ˆ๋‹ค.

from google import genai
from google.genai import types
from google.genai.types import Content, LiveConnectConfig, Modality, Part
from IPython import display

PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_LOCATION"
TEXT_INPUT = "YOUR_TEXT_INPUT"
MODEL_NAME = "YOUR_MODEL_NAME"  # Such as "gemini-2.0-flash-exp"

client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION,
)

memory_store = types.VertexRagStore(
    rag_resources=[
        types.VertexRagStoreRagResource(
            rag_corpus=memory_corpus.name
        )
    ],
    store_context=True,
)

async with client.aio.live.connect(
    model=MODEL_NAME,
    config=LiveConnectConfig(
        response_modalities=[Modality.TEXT],
        tools=[types.Tool(retrieval=types.Retrieval(vertex_rag_store=memory_store))],
    ),
) as session:
    text_input = TEXT_INPUT
    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    async for message in session.receive():
        if message.text:
            display.display(display.Markdown(message.text))
            continue
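If you are driving the session over raw WebSockets instead of the SDK, the same memory-store tool can be sketched in dict form, mirroring the TOOLS payload from the setup section. This is a hedged sketch: it assumes the JSON payload accepts the same rag_resources and store_context fields as the SDK's types.VertexRagStore, and the resource name below is a placeholder.

```python
# Hypothetical dict-form equivalent of the SDK tool above; field names mirror
# types.VertexRagStore. Assumes store_context is accepted in the JSON payload.
MEMORY_TOOLS = {
    "retrieval": {
        "vertex_rag_store": {
            "rag_resources": [
                {"rag_corpus": "projects/PROJECT/locations/LOCATION/ragCorpora/CORPUS_ID"}
            ],
            "store_context": True,
        }
    }
}
```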

๋‹ค์Œ ๋‹จ๊ณ„

  • Vertex AI RAG Engine์„ ์ž์„ธํžˆ ์•Œ์•„๋ณด๋ ค๋ฉด Vertex AI RAG Engine ๊ฐœ์š”๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.
  • RAG API์— ๋Œ€ํ•ด ์ž์„ธํžˆ ์•Œ์•„๋ณด๋ ค๋ฉด Vertex AI RAG Engine API๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.
  • RAG ์ฝ”ํผ์Šค๋ฅผ ๊ด€๋ฆฌํ•˜๋ ค๋ฉด ์ฝ”ํผ์Šค ๊ด€๋ฆฌ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.
  • RAG ํŒŒ์ผ์„ ๊ด€๋ฆฌํ•˜๋ ค๋ฉด ํŒŒ์ผ ๊ด€๋ฆฌ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.
  • Vertex AI SDK๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Vertex AI RAG Engine ํƒœ์Šคํฌ๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ Python์šฉ RAG ๋น ๋ฅธ ์‹œ์ž‘์„ ์ฐธ์กฐํ•˜์„ธ์š”.