Constructing Speech-to-Text requests

This document is a guide to the basics of using Speech-to-Text. This conceptual guide covers the types of requests you can make to Speech-to-Text, how to construct those requests, and how to handle their responses. We recommend that all users of Speech-to-Text read this guide and one of the associated guides before using the API.

Speech requests

Speech-to-Text has three main methods for performing speech recognition:

  • Synchronous recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

  • Asynchronous recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a long-running operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.

  • Streaming recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, so results can appear while a user is still speaking.

Requests contain configuration parameters as well as audio data. The following sections describe these types of recognition requests, the responses they generate, and how to handle those responses in more detail.
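The duration limits above can be turned into a small selection helper. The following is a minimal sketch, not part of the API: the function name `choose_recognition_method` and the `live` flag are hypothetical, and only the 1-minute and 480-minute limits come from this guide.

```python
SYNC_LIMIT_SECONDS = 60          # synchronous requests: 1 minute or less
ASYNC_LIMIT_SECONDS = 480 * 60   # asynchronous requests: up to 480 minutes


def choose_recognition_method(duration_seconds: float, live: bool = False) -> str:
    """Return 'streaming', 'synchronous', or 'asynchronous' for the given audio.

    `live` marks audio that is still being captured (for example, from a
    microphone); this flag is an assumption made for the sketch.
    """
    if live:
        return "streaming"
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    raise ValueError("audio exceeds the 480-minute asynchronous limit")
```

For example, a 30-second clip maps to a synchronous request, while a 10-minute recording maps to an asynchronous one.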

Speech-to-Text API recognition

A Speech-to-Text API synchronous recognition request is the simplest method for performing recognition on speech audio data. Speech-to-Text can process up to 1 minute of speech audio data sent in a synchronous request. After Speech-to-Text processes and recognizes all of the audio, it returns a response.

A synchronous request is blocking, meaning that Speech-to-Text must return a response before processing the next request. Speech-to-Text typically processes audio faster than real time, processing 30 seconds of audio in 15 seconds on average. In cases of poor audio quality, a recognition request can take significantly longer.

Speech-to-Text has both REST and gRPC methods for calling Speech-to-Text API synchronous and asynchronous requests. This document demonstrates the REST API because it is simpler to show and explain basic use of the API. However, the basic makeup of a REST or a gRPC request is quite similar. Streaming recognition requests are supported only by gRPC.

Synchronous speech recognition requests

A synchronous Speech-to-Text API request consists of a speech recognition configuration and audio data. A sample request is shown below:

{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US"
    },
    "audio": {
        "uri": "gs://bucket-name/path_to_audio_file"
    }
}

All Speech-to-Text API synchronous recognition requests must include a speech recognition config field (of type RecognitionConfig). A RecognitionConfig contains the following sub-fields:

  • encoding - (required) specifies the encoding scheme of the supplied audio (of type AudioEncoding). If you have a choice of codec, prefer a lossless encoding such as FLAC or LINEAR16 for best performance. (For more information, see Audio encodings.) The encoding field is optional for FLAC and WAV files, where the encoding is included in the file header.
  • sampleRateHertz - (required) specifies the sample rate (in Hertz) of the supplied audio. (For more information on sample rates, see Sample rates below.) The sampleRateHertz field is optional for FLAC and WAV files, where the sample rate is included in the file header.
  • languageCode - (required) contains the language and region/locale to use for speech recognition of the supplied audio. The language code must be a BCP-47 identifier. Language codes typically consist of a primary language tag and a secondary region subtag (for example, in the sample above, 'en' stands for English and 'US' for the United States). (For a list of supported languages, see Supported languages.)
  • maxAlternatives - (optional, defaults to 1) indicates the number of alternative transcriptions to provide in the response. By default, the Speech-to-Text API provides one primary transcription. To evaluate different alternatives, set maxAlternatives to a higher value. Speech-to-Text returns alternatives only if the recognizer determines them to be of sufficient quality. In general, alternatives are more appropriate for real-time requests that require user feedback (for example, voice commands) and are therefore better suited to streaming recognition requests.
  • profanityFilter - (optional) indicates whether to filter out profane words or phrases. Filtered words contain their first letter and asterisks for the remaining characters (for example, f***). The profanity filter operates on single words; it does not detect abusive or offensive speech that is a phrase or a combination of words.
  • speechContexts - (optional) contains additional contextual information for processing this audio. A context contains the following sub-fields:
    • boost - contains a value that assigns a weight to recognizing a given word or phrase.
    • phrases - contains a list of words and phrases that provide hints to the speech recognition task. For more information, see the information on speech adaptation.

Audio is supplied to Speech-to-Text through the audio parameter of type RecognitionAudio. The audio field contains one of the following sub-fields:

  • content contains the audio to evaluate, embedded within the request. See Embedded audio content below for more information. Audio passed directly within this field is limited to 1 minute in duration.
  • uri contains a URI pointing to the audio content. The file must not be compressed (for example, with gzip). Currently, this field must contain a Google Cloud Storage URI (of the format gs://bucket-name/path_to_audio_file). See Pass audio referenced by a URI below.
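The config and audio fields described above can be assembled into a request body programmatically. The following is a minimal sketch that builds the JSON body as a plain dictionary; the helper name `build_recognize_request` and its keyword arguments are hypothetical, while the JSON keys are the ones shown in the sample request.

```python
import json
from typing import Optional


def build_recognize_request(
    language_code: str,
    encoding: Optional[str] = None,
    sample_rate_hertz: Optional[int] = None,
    uri: Optional[str] = None,
    content: Optional[str] = None,
) -> dict:
    """Build a request body with a config and exactly one audio source."""
    # The audio field holds either embedded content or a URI, never both.
    if (uri is None) == (content is None):
        raise ValueError("provide exactly one of uri or content")

    config = {"languageCode": language_code}
    # encoding and sampleRateHertz may be omitted for FLAC and WAV files
    # whose headers already carry them.
    if encoding is not None:
        config["encoding"] = encoding
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz

    audio = {"uri": uri} if uri is not None else {"content": content}
    return {"config": config, "audio": audio}


body = build_recognize_request(
    "en-US",
    encoding="LINEAR16",
    sample_rate_hertz=16000,
    uri="gs://bucket-name/path_to_audio_file",
)
print(json.dumps(body, indent=4))
```

Serializing with `json.dumps` also avoids hand-written mistakes such as trailing commas, which strict JSON parsers reject.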

More information on these request and response parameters appears below.

Sample rates

You specify the sample rate of your audio in the sampleRateHertz field of the request configuration. It must match the sample rate of the associated audio content or stream. Speech-to-Text supports sample rates between 8000 Hz and 48000 Hz. For a FLAC or WAV file, you can specify the sample rate in the file header instead of using the sampleRateHertz field. A FLAC file must contain the sample rate in the FLAC header in order to be submitted to the Speech-to-Text API.

If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this may impair speech recognition accuracy, and higher values have no appreciable effect on speech recognition quality.

However, if your audio data has already been recorded at an existing sample rate other than 16000 Hz, do not resample your audio to 16000 Hz. Most legacy telephony audio, for example, uses a sample rate of 8000 Hz, which may give less accurate results. If you must use such audio, provide it to the Speech API at its native sample rate.
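One way to follow the "native sample rate" advice is to read the rate from the file instead of hard-coding it. The following is a minimal sketch using Python's standard `wave` module; the function name `native_sample_rate` is hypothetical, and the range check simply mirrors the 8000 Hz to 48000 Hz limits stated above.

```python
import io
import wave

MIN_RATE_HZ = 8000
MAX_RATE_HZ = 48000


def native_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate from a WAV header without resampling the audio."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
    if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
        raise ValueError(f"unsupported sample rate: {rate} Hz")
    return rate


def config_for_wav(wav_bytes: bytes, language_code: str) -> dict:
    """Build a config whose sampleRateHertz matches the file's own rate."""
    return {
        "encoding": "LINEAR16",
        "sampleRateHertz": native_sample_rate(wav_bytes),
        "languageCode": language_code,
    }
```

With this approach an 8000 Hz telephony recording is submitted as 8000 Hz rather than being upsampled to 16000 Hz first.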

Languages

Speech-to-Text's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.

A full list of supported languages for each feature is available on the Language support page.

Time offsets (timestamps)

Speech-to-Text can include time offset values (timestamps) for the beginning and end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100 ms.

Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it in the original audio. Time offsets are supported for all recognition methods: recognize, streamingrecognize, and longrunningrecognize.

Time offset values are included only for the first alternative provided in the recognition response.

To include time offsets in the results of your request, set the enableWordTimeOffsets parameter to true in your request configuration. For examples that use the REST API or the client libraries, see Using time offsets (timestamps). For example, you can include the enableWordTimeOffsets parameter in the request configuration as shown here:

{
"config": {
  "languageCode": "en-US",
  "enableWordTimeOffsets": true
  },
"audio":{
  "uri":"gs://gcs-test-data/gettysburg.flac"
  }
}

The result returned by the Speech-to-Text API contains time offset values for each recognized word, as shown below:

{
  "name": "6212202767953098955",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-24T10:21:22.013650Z",
    "lastUpdateTime": "2017-07-24T10:21:45.278630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "Four score and twenty...(etc)...",
            "confidence": 0.97186122,
            "words": [
              {
                "startTime": "1.300s",
                "endTime": "1.400s",
                "word": "Four"
              },
              {
                "startTime": "1.400s",
                "endTime": "1.600s",
                "word": "score"
              },
              {
                "startTime": "1.600s",
                "endTime": "1.600s",
                "word": "and"
              },
              {
                "startTime": "1.600s",
                "endTime": "1.900s",
                "word": "twenty"
              },
              ...
            ]
          }
        ]
      },
      {
        "alternatives": [
          {
            "transcript": "for score and plenty...(etc)...",
            "confidence": 0.9041967
          }
        ]
      }
    ]
  }
}
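To locate a word in the original audio, the offset strings in the words list need to be converted to numbers. The following is a minimal sketch that assumes the response has already been parsed into Python dictionaries and that offsets use the "1.300s" string form shown above; the helper names `offset_to_seconds` and `find_word` are hypothetical.

```python
from typing import List, Tuple


def offset_to_seconds(offset: str) -> float:
    """Convert an offset string such as '1.300s' to seconds."""
    if not offset.endswith("s"):
        raise ValueError(f"unexpected offset format: {offset!r}")
    return float(offset[:-1])


def find_word(response: dict, target: str) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds for each occurrence of target.

    `response` is the object holding the "results" list (for a long-running
    operation, that is the operation's "response" field). Only the first
    alternative of each result is inspected, because only it carries offsets.
    """
    hits = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for info in alternatives[0].get("words", []):
            if info["word"].lower() == target.lower():
                hits.append(
                    (offset_to_seconds(info["startTime"]),
                     offset_to_seconds(info["endTime"]))
                )
    return hits
```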

Model selection

Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources.

When you send an audio transcription request to Speech-to-Text, you can improve the results that you receive by specifying the source of the original audio. This allows the Speech-to-Text API to process your audio files using a machine learning model trained to recognize speech audio from that particular type of source.

To specify a model for speech recognition, include the model field in the RecognitionConfig object for your request, specifying the model that you want to use.

For the available machine learning models, see the Speech-to-Text transcription models list.

Embedded audio content

Embedded audio is included in the speech recognition request when you pass a content parameter within the request's audio field. Embedded audio provided as content within a gRPC request must be compatible with Proto3 serialization and provided as binary data. Embedded audio provided as content within a REST request must be compatible with JSON serialization and must first be Base64-encoded. (For more information, see Base64 encoding your audio.)

When constructing a request using a Google Cloud client library, you generally write this binary (or Base64-encoded) data directly within the content field.
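For a REST request, the Base64 step can be done with the standard library. The following is a minimal sketch; the helper name `embed_audio` is hypothetical, and it only produces the audio part of a request body.

```python
import base64


def embed_audio(audio_bytes: bytes) -> dict:
    """Return a REST 'audio' object with Base64-encoded inline content."""
    # b64encode returns bytes; JSON needs a text string, so decode to ASCII.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {"content": encoded}
```

The resulting dictionary can be used as the value of the audio field in place of a uri entry.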

Pass audio referenced by a URI

More typically, you pass a uri parameter within the speech request's audio field, pointing to an audio file (in binary format, not Base64) located in Google Cloud Storage, in the following form:

gs://bucket-name/path_to_audio_file

For example, the following part of a speech request references the sample audio file used in the quickstart:

...
    "audio": {
        "uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
    }
...

You must have proper access permissions to read Google Cloud Storage files, such as one of the following:

  • Publicly readable (such as Google's sample audio files)
  • Readable by your service account, if you use service account authorization
  • Readable by a user account, if you use 3-legged OAuth for user account authorization

For more information about managing access to Google Cloud Storage, see Creating and managing access control lists in the Google Cloud Storage documentation.

Speech-to-Text API responses

As indicated previously, a synchronous Speech-to-Text API response may take some time to return results, proportional to the length of the supplied audio. Once processing is complete, the API returns a response as shown below:

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.98267895,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}

These fields are explained below:

  • results contains the list of results (of type SpeechRecognitionResult), where each result corresponds to a segment of audio (segments of audio are separated by pauses). Each result consists of one or more of the following fields:
    • alternatives contains a list of possible transcriptions, of type SpeechRecognitionAlternative. Whether more than one alternative appears depends both on whether you requested more than one alternative (by setting maxAlternatives to a value greater than 1) and on whether Speech-to-Text produced alternatives of high enough quality. Each alternative consists of the following fields:
      • transcript contains the transcribed text. See Handling transcriptions below.
      • confidence contains a value between 0 and 1 indicating how confident Speech-to-Text is of the given transcription. See Confidence values below.

If no speech from the supplied audio could be recognized, the returned results list contains no items. Unrecognized speech is commonly the result of very poor-quality audio, or of language code, encoding, or sample rate values that do not match the supplied audio.

The components of this response are explained in the following sections.

Each synchronous Speech-to-Text API response returns a list of results, rather than a single result containing all recognized audio. The list of recognized audio (within the transcript elements) appears in contiguous order.

Select alternatives

Each result within a successful synchronous recognition response can contain one or more alternatives (if the maxAlternatives value for the request is greater than 1). If Speech-to-Text determines that an alternative has a sufficient confidence value, that alternative is included in the response. The first alternative in the response is always the best (most likely) alternative.

Setting maxAlternatives to a value greater than 1 does not imply or guarantee that multiple alternatives will be returned. In general, more than one alternative is more appropriate for providing real-time options to users who get results through a streaming recognition request.

Handling transcriptions

Each alternative supplied within the response contains a transcript containing the recognized text. When you are provided with sequential alternatives, concatenate these transcriptions together.

The following Python code iterates over a result list and concatenates the transcriptions together. In all cases it takes the first alternative (the zeroth).

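# service_request is the prepared API request object from the surrounding client code.
response = service_request.execute()
recognized_text = 'Transcribed Text: \n'
# Concatenate the top (zeroth) alternative of each sequential result.
# The results list is empty or absent when no speech was recognized.
for result in response.get('results', []):
    recognized_text += result['alternatives'][0]['transcript']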

Confidence values

The confidence value is an estimate between 0.0 and 1.0. It is calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates a greater estimated likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis, and only for results where is_final=true. For example, you can use the confidence value to decide whether to show alternative results to the user or to ask the user for confirmation.

Be aware, however, that the model determines the "best" result based on more signals than the confidence score alone, such as sentence context. Because of this, there are occasional cases where the top result does not have the highest confidence score. If you have not requested multiple alternative results, the single "best" result returned may have a lower confidence value than anticipated. This can occur, for example, when rarely used words appear. A rare word can be assigned a low "likelihood" value even if it is recognized correctly. If the model determines from context that the rare word is the most likely option, that result is returned at the top even if its confidence value is lower than that of the alternatives.
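The "show or confirm" decision described above can be sketched as follows. This is only an illustration: the helper name `top_transcript` and the 0.8 threshold are assumptions, not values recommended by the API, and a missing confidence field is treated conservatively as needing confirmation.

```python
from typing import Tuple

CONFIRMATION_THRESHOLD = 0.8  # hypothetical cut-off; tune for your application


def top_transcript(result: dict,
                   threshold: float = CONFIRMATION_THRESHOLD) -> Tuple[str, bool]:
    """Return the best transcript and whether to ask the user to confirm it."""
    best = result["alternatives"][0]      # the first alternative is the top one
    confidence = best.get("confidence")   # may be absent, e.g. on interim results
    needs_confirmation = confidence is None or confidence < threshold
    return best["transcript"], needs_confirmation
```

Because the top result is chosen using more signals than confidence alone, this sketch never reorders alternatives by confidence; it only uses the value to decide how to present the top one.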

Asynchronous requests and responses

An asynchronous Speech-to-Text API request to the LongRunningRecognize method is identical in form to a synchronous Speech-to-Text API request. However, instead of returning a response, the asynchronous request initiates a long-running operation (of type Operation) and returns this operation to the caller immediately. You can use asynchronous speech recognition with audio of any length up to 480 minutes.

A typical operation response is shown below:

{
  "name": "operation_name",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 34,
    "startTime": "2016-08-30T23:26:29.579144Z",
    "lastUpdateTime": "2016-08-30T23:26:29.826903Z"
  }
}

Note that no results are present yet. Speech-to-Text continues to process the audio and uses this operation to store the results. Results appear in the response field of the operation returned when the LongRunningRecognize request is complete.

The following is a full response after completion of the request:

{
  "name": "1268386125834704889",
  "metadata": {
    "lastUpdateTime": "2016-08-31T00:16:32.169Z",
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "startTime": "2016-08-31T00:16:29.539820Z",
    "progressPercent": 100
  },
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [{
      "alternatives": [{
        "confidence": 0.98267895,
        "transcript": "how old is the Brooklyn Bridge"
      }]}]
  },
  "done": true
}

Note that done has been set to true and that the operation's response contains a set of results of type SpeechRecognitionResult, which is the same type returned by a synchronous Speech-to-Text API recognition request.

By default, an asynchronous REST response sets done to false, its default value. However, because JSON does not require default values to be present within a field, when testing whether an operation is complete, you should test both that the done field is present and that it is set to true.
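The completion check described above (the done field is present and true) can be sketched as a small polling loop. This is a minimal sketch under stated assumptions: `fetch` is a hypothetical callable that returns the latest operation as a parsed JSON dictionary, and the interval and poll count are illustrative defaults.

```python
import time
from typing import Callable


def is_done(operation: dict) -> bool:
    """True only when 'done' is present and set to true."""
    # .get() returns None when the field is omitted, which counts as not done.
    return operation.get("done") is True


def wait_for_operation(fetch: Callable[[], dict],
                       interval_seconds: float = 5.0,
                       max_polls: int = 60) -> dict:
    """Poll until the operation completes, then return it."""
    for _ in range(max_polls):
        operation = fetch()
        if is_done(operation):
            return operation
        time.sleep(interval_seconds)
    raise TimeoutError("operation did not complete within the polling budget")
```

Once the loop returns, the recognition results are available under the operation's response field.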

Streaming Speech-to-Text API recognition requests

A streaming Speech-to-Text API recognition call is designed for real-time capture and recognition of audio within a bi-directional stream. Your application can send audio on the request stream and receive interim and final recognition results on the response stream in real time. Interim results represent the current recognition result for a section of audio, while the final recognition result represents the last, best guess for that section of audio.

Streaming requests

Unlike synchronous and asynchronous calls, in which you send both the configuration and the audio within a single request, calling the streaming Speech API requires sending multiple requests. The first StreamingRecognizeRequest must contain a configuration of type StreamingRecognitionConfig without any accompanying audio. Subsequent StreamingRecognizeRequest messages sent over the same stream then consist of consecutive frames of raw audio bytes.

A StreamingRecognitionConfig consists of the following fields:

  • config - (required) contains configuration information for the audio, of type RecognitionConfig. It is the same as the configuration shown for synchronous and asynchronous requests.
  • single_utterance - (optional, defaults to false) indicates whether this request should automatically end after speech is no longer detected. If set, Speech-to-Text detects pauses, silence, or non-speech audio to determine when to end recognition. If not set, the stream continues to listen and process audio until either the stream is closed directly or the stream's length limit is exceeded. Setting single_utterance to true is useful for processing voice commands.
  • interim_results - (optional, defaults to false) indicates that this stream request should return temporary results that may be refined later, after more audio has been processed. Interim results are marked within responses by is_final being set to false.
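The "configuration first, audio afterwards" ordering can be sketched with a generator. This is a minimal sketch that uses plain dictionaries rather than the real gRPC message classes; the keys `streaming_config` and `audio_content` are assumptions chosen to mirror the request message, and the function name `streaming_requests` is hypothetical.

```python
from typing import Iterable, Iterator


def streaming_requests(recognition_config: dict,
                       audio_chunks: Iterable[bytes],
                       single_utterance: bool = False,
                       interim_results: bool = False) -> Iterator[dict]:
    """Yield one config-only request, then one request per audio chunk."""
    # First request: configuration only, no audio.
    yield {
        "streaming_config": {
            "config": recognition_config,
            "single_utterance": single_utterance,
            "interim_results": interim_results,
        }
    }
    # Subsequent requests: consecutive frames of raw audio bytes.
    for chunk in audio_chunks:
        yield {"audio_content": chunk}
```

With a real client, an iterator like this would be passed to the streaming call, with each dictionary replaced by the corresponding request message.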

Streaming responses

Streaming speech recognition results are returned within a series of responses of type StreamingRecognizeResponse. Such a response consists of the following fields:

  • speechEventType contains events of type SpeechEventType. The value of these events indicates when a single utterance has been determined to be complete. The speech events serve as markers within your stream's response.
  • results contains the list of results, which may be either interim or final, of type StreamingRecognitionResult. The results list contains the following sub-fields:
    • alternatives contains a list of alternative transcriptions.
    • isFinal indicates whether the results obtained within this list entry are interim or final. Google might return multiple isFinal=true results throughout a single stream, but an isFinal=true result is guaranteed only after the write stream is closed (half-closed).
    • stability indicates the volatility of the results obtained so far, with 0.0 indicating complete instability and 1.0 indicating complete stability. Unlike confidence, which estimates whether a transcription is correct, stability estimates whether the given partial result may change. If isFinal is set to true, stability is not set.
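A consumer of the response stream typically keeps final results and overwrites interim ones. The following is a minimal sketch that assumes each response has been converted to a dictionary with the field names listed above; the function name `collect_transcript` is hypothetical.

```python
from typing import Iterable, Tuple


def collect_transcript(responses: Iterable[dict]) -> Tuple[str, str]:
    """Return (final_text, pending_interim_text) from a series of responses."""
    final_parts = []
    interim = ""
    for response in responses:
        for result in response.get("results", []):
            alternatives = result.get("alternatives", [])
            if not alternatives:
                continue
            text = alternatives[0]["transcript"]  # top alternative only
            if result.get("isFinal"):
                # Final results are kept; they replace any pending interim text.
                final_parts.append(text)
                interim = ""
            else:
                # Interim results may still change, so only the latest is kept.
                interim = text
    return "".join(final_parts), interim
```

In a user interface, the interim text could be displayed in a lighter style until a final result for that section of audio arrives.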