1. Tokenizing

Tokenizing is the process of breaking data, such as text, into smaller, manageable pieces called _tokens_. Each token is then assigned a unique numerical identifier (ID). This is a fundamental step in preparing text for machine learning models, especially in natural language processing (NLP).

The goal of this initial phase is very simple: divide the input into tokens (IDs) in a way that makes sense.

How Tokenizing Works

  1. Splitting the Text:

  • Basic Tokenizer: A simple tokenizer might split text into individual words and punctuation marks and discard whitespace.

  • Example: Text: "Hello, world!" Tokens: ["Hello", ",", "world", "!"]

  2. Creating a Vocabulary:

  • To convert tokens into numerical IDs, a vocabulary is created. The vocabulary lists every unique token (words and symbols) and assigns each one a specific ID.

  • Special Tokens: Special symbols added to the vocabulary to handle particular scenarios:

    • [BOS] (Beginning of Sequence): Indicates the start of a text.

    • [EOS] (End of Sequence): Indicates the end of a text.

    • [PAD] (Padding): Used to make all sequences in a batch the same length.

    • [UNK] (Unknown): Represents tokens that are not in the vocabulary.

  • Example: If "Hello" is assigned ID 64, "," is 455, "world" is 78, and "!" is 467, then: "Hello, world!" → [64, 455, 78, 467]

  • Handling Unknown Words: If a word such as "Bye" is not in the vocabulary, it is replaced with [UNK]. "Bye, world!" → ["[UNK]", ",", "world", "!"] → [987, 455, 78, 467] (assuming the ID of [UNK] is 987)
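
The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: the function names (tokenize, build_vocab, encode) and the punctuation set in the regular expression are assumptions made for this example, and the IDs it produces will not match the illustrative IDs (64, 455, ...) used in the text.

```python
import re

SPECIAL_TOKENS = ("[BOS]", "[EOS]", "[PAD]", "[UNK]")

def tokenize(text):
    # Split on punctuation and whitespace, keeping punctuation as tokens
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Drop empty strings and whitespace-only pieces
    return [p for p in pieces if p.strip()]

def build_vocab(tokens):
    # Sorted unique tokens get consecutive IDs; special tokens are appended last
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    for special in SPECIAL_TOKENS:
        vocab[special] = len(vocab)
    return vocab

def encode(text, vocab):
    # Any token missing from the vocabulary falls back to the [UNK] ID
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in tokenize(text)]

tokens = tokenize("Hello, world!")
vocab = build_vocab(tokens)
print(tokens)                          # ['Hello', ',', 'world', '!']
print(encode("Hello, world!", vocab))  # [2, 1, 3, 0]
print(encode("Bye, world!", vocab))    # [7, 1, 3, 0]
```

In this toy vocabulary "Bye" was never seen, so it maps to the ID of [UNK] (7 here), which mirrors the "Bye, world!" example above.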

Advanced Tokenizing Methods

While the basic tokenizer works well for simple texts, it has limitations, especially with large vocabularies and with new or rare words. Advanced tokenizing methods address these issues by breaking text into smaller subword units or by optimizing the tokenization process.

  1. Byte Pair Encoding (BPE):

  • Purpose: Reduces the size of the vocabulary and handles rare or unknown words by breaking them down into frequently occurring byte pairs.

  • How It Works:

    • Starts with individual characters as tokens.

    • Iteratively merges the most frequent pair of adjacent tokens into a single token.

    • Continues until no frequent pairs remain to be merged (in practice, training usually stops at a target vocabulary size).

  • Benefits:

    • Eliminates the need for an [UNK] token, since any word can be represented by combining existing subword tokens, provided every base symbol is in the vocabulary.

    • Yields a more efficient and flexible vocabulary.

  • Example: "playing" might be tokenized as ["play", "ing"] if "play" and "ing" are frequent subwords.
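
The merge loop described above can be sketched as follows. This is a simplified, character-level illustration under stated assumptions: the helper names (get_pair_counts, merge_pair, train_bpe) and the toy corpus are hypothetical, ties are broken by first occurrence, and real implementations such as the GPT-2 tokenizer operate on bytes and add pre-tokenization rules not shown here.

```python
from collections import Counter

def get_pair_counts(words):
    # words maps a tuple of symbols to its frequency in the corpus
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from individual characters, one tuple per whitespace-separated word
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # {('low',): 3, ('low', 'e', 'r'): 1, ('low', 'e', 's', 't'): 1}
```

After two merges the frequent word "low" has become a single token, while the rarer "lower" and "lowest" are still expressed as "low" plus leftover characters.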

  2. WordPiece:

  • Used By: Models like BERT.

  • Purpose: Similar to BPE, it breaks words into subword units to handle unknown words and reduce vocabulary size.

  • How It Works:

    • Starts with a base vocabulary of individual characters.

    • Iteratively adds the candidate subword that most increases the likelihood of the training data, which is not always the most frequent pair.

    • Uses a probabilistic model to decide which subwords to merge.

  • Benefits:

    • Balances a manageable vocabulary size against effective word representation.

    • Handles rare and compound words efficiently.

  • Example: "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happy", "ness"], depending on the vocabulary.
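
To make the difference from BPE concrete, the sketch below compares the two selection criteria on a toy corpus. The scoring formula used here, count(pair) / (count(first) * count(second)), is a commonly described approximation of the likelihood criterion and should be read as an assumption, not as the exact procedure used to train BERT's vocabulary. The function names and word frequencies are hypothetical.

```python
from collections import Counter

def pair_stats(words):
    # words maps a tuple of symbols to its frequency in the corpus
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            symbol_freq[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return symbol_freq, pair_freq

def wordpiece_score(pair, symbol_freq, pair_freq):
    # Assumed score: pair count normalized by the counts of its two parts,
    # which favors pairs whose parts rarely occur apart from each other
    a, b = pair
    return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

# Hypothetical toy corpus: word (as characters) -> frequency
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

symbol_freq, pair_freq = pair_stats(words)
most_frequent = max(pair_freq, key=pair_freq.get)
best_scored = max(
    pair_freq, key=lambda p: wordpiece_score(p, symbol_freq, pair_freq)
)
print(most_frequent)  # ('u', 'g'): what a frequency-based BPE merge would pick
print(best_scored)    # ('g', 's'): what the normalized score picks
```

The pair ("u", "g") is the most frequent, but "u" appears in every word, so the normalized score prefers ("g", "s"), whose second part never appears anywhere else.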

  3. Unigram Language Model:

  • Used By: Models whose tokenizers are trained with SentencePiece (a tokenization library rather than a model), such as T5.

  • Purpose: Uses a probabilistic model to determine the most likely set of subword tokens.

  • How It Works:

    • Starts with a large set of candidate tokens.

    • Iteratively removes the tokens that contribute least to the likelihood of the training data.

    • Finalizes a vocabulary in which each word is represented by its most probable sequence of subword units.

  • Benefits:

    • Flexible, and can model language more naturally.

    • Often results in more efficient and compact tokenizations.

  • Example: "internationalization" might be tokenized into smaller, meaningful subwords such as ["international", "ization"].
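
The last step, choosing the most probable segmentation of a word once token probabilities are fixed, can be sketched with a small dynamic program. This covers segmentation only, not the pruning loop that trains the vocabulary. The function name segment and the token probabilities below are invented for illustration and do not come from any trained model.

```python
import math

def segment(word, log_probs):
    # best[i] holds (score, tokens) for the best segmentation of word[:i];
    # tokens is None while word[:i] cannot be segmented at all
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            prev_score, prev_tokens = best[start]
            if prev_tokens is None or piece not in log_probs:
                continue
            score = prev_score + log_probs[piece]
            if score > best[end][0]:
                best[end] = (score, prev_tokens + [piece])
    return best[-1][1]

# Hypothetical unigram probabilities for a tiny vocabulary
probs = {
    "international": 0.02,
    "ization": 0.03,
    "inter": 0.05,
    "national": 0.04,
    "iz": 0.01,
    "ation": 0.05,
}
log_probs = {tok: math.log(p) for tok, p in probs.items()}

print(segment("internationalization", log_probs))
# ['international', 'ization']
```

Under these assumed probabilities, two larger tokens beat any split into three or four smaller ones because every extra token multiplies in another probability below 1. The function returns None when no combination of vocabulary tokens covers the word.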

Code Example

Let's understand this better with a code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:

# Download a text to pre-train the model
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Read the downloaded file into a single string
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize the text using the GPT-2 tokenizer
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
# "<|endoftext|>" is the end-of-text special token in the GPT-2 vocabulary;
# allowed_special lets it appear in the input ("[EOS]" is not a GPT-2 token)
token_ids = tokenizer.encode(raw_text, allowed_special={"<|endoftext|>"})

# Print first 50 tokens
print(token_ids[:50])
#[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]
