[Hugging Face] apply_chat_template 함수에 대해 알아보자

AI & DL/Hugging Face

[Hugging Face] apply_chat_template 함수에 대해 알아보자

Giliit 2024. 4. 20. 16:28

728x90

안녕하세요, 오늘은 LLM에서 dictionary를 Chat 형식으로 변환하는 apply_chat_template 함수에 대해 알아보려고 합니다. 이 함수는 최근에 Chat Bot, Chat Model이 많아지면서, Chat 형식으로 변환하는 Tokenizer의 필요로 의해 만들어졌습니다. 그래서 이 함수에 대해 자세히 설명해드리려고 합니다.

선언 방법

사용하는 방법은 쉽습니다. Chat(Dialog)에 대해 apply_chat_template()를 사용하면 바로 출력이 나옵니다.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, cache_dir='/home/hgjeong/hdd1/hub')

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]
 
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

매개변수 목록

conversation: Union[List[Dict[str, str]], "Conversation"] : 대화내용입니다.
add_generation_prompt (bool, optional): 프롬프트의 끝에 모델로부터 응답을 생성하는 데 사용되는 토큰을 추가할지 여부를 결정합니다.
tokenize (bool, defaults to True): 출력을 토큰화할지 여부를 결정합니다. False인 경우, 출력은 문자열이 됩니다.
padding (bool, defaults to False): 시퀀스를 최대 길이까지 패딩할지 여부를 결정합니다.
truncation (bool, defaults to False): 시퀀스를 최대 길이에서 잘라낼지 여부를 결정합니다.
max_length (int, optional):패딩에 사용할 최대 길이(토큰 단위)를 지정합니다. 지정되지 않은 경우 토크나이저의 max_length 속성이 기본값으로 사용됩니다.
return_tensors (str or ~utils.TensorType, optional): 특정 프레임워크의 텐서를 반환합니다.

사용 예시

밑의 사용 예시 코드는 위의 이 코드가 동일하게 쓰였다고 생각하시면 됩니다.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, cache_dir='/home/hgjeong/hdd1/hub')

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]

tokenize 사용 여부

사용하지 않았을 때

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, return_tensors="pt")
print(tokenized_chat)

### 실행결과
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>

사용했을 때

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenized_chat)


### 실행결과 
tensor([[  523, 28766,  6574, 28766, 28767,    13,  1976,   460,   264, 10131,
         10706, 10093,   693,  1743,  2603,  3673,   297,   272,  3238,   302,
           264, 17368,   380,     2, 28705,    13, 28789, 28766,  1838, 28766,
         28767,    13,  5660,  1287, 19624,   410,  1532,   541,   264,  2930,
          5310,   297,   624,  6398, 28804,     2, 28705,    13, 28789, 28766,
           489, 11143, 28766, 28767,    13]])

Text가 토큰화된 것을 확인할 수 있습니다.

add_generation_prompt 사용 여부

사용하지 않았을 때

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))


### 실행결과
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s>

사용했을 때

ㄷtokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

### 실행 결과
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s> 
<|assistant|>

<| assistant |> 가 추가되어 있는 것을 확인할 수 있습니다.

결론

tokenizer에서 apply_chat_template를 이용해서 Chat(Dialog)를 다음과 같이 간편하게 변환할 수 있습니다. 이를 통해서 Chat(Dialog)를 모델에 입력할 때 다음과 같이 사용할 수 있게 있습니다. 감사합니다.

출처

https://huggingface.co/docs/transformers/v4.39.3/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template

저작자표시 (새창열림)

'AI & DL > Hugging Face' 카테고리의 다른 글

[Hugging Face] 내가 만든 Dataset을 Hugging Face에 올려보기 (0)	2024.04.30
[Hugging Face] 모델 가져오기(Read), 모델 및 데이터 업로드(Write)를 위한 Token 발급 받는 방법 (1)	2024.04.28
[Hugging Face] 벤치마크 클래스 만들기 ( Benchmark Class) (0)	2024.01.16
[Hugging Face] Dataset의 map 함수 사용법 (1)	2024.01.14
[Hugging Face] evaluate.evaluator을 이용하여 모델 평가하기 (2)	2024.01.04

현재글[Hugging Face] apply_chat_template 함수에 대해 알아보자

250x250

코딩 메모장

Silver, paper, 백준, c++, Metric, huggingface, LEVEL 2, Python, nlp, docker, numpy, 실버, pytorch, BOJ, 파이썬, ESconv, 오블완, 프로그래머스, PyTorch Lightning, DataSet,

Today :
Yesterday :

코딩 메모장