Avoid duplicate special tokens in chat formats #1439
Conversation
This is to ensure that we don't duplicate any special tokens. Hopefully I amended the existing formats correctly?
@CISC thank you (also for the other PRs), this looks good. I may update the warning to use the Python
me too
Hi, Matt from Hugging Face here. Just a note that we expect chat templates to contain all the special tokens the model expects, and therefore in Transformers, whenever a chat has been formatted with a chat template, it should be tokenized without adding additional special tokens. We really had no choice but to do it this way - many base models add EOS at the end of sequences, but this will always cause problems if we want to generate a chat completion following a user message, as user messages often don't end with EOS in many chat formats.
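To illustrate that convention, here is a minimal sketch using the public Transformers API (the model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [{"role": "user", "content": "Hello!"}]

# The chat template itself inserts every special token the model expects
# (BOS, role markers, etc.), so the rendered string is already complete.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Tokenize WITHOUT adding special tokens again; add_special_tokens=True
# here would prepend a second BOS on models that add one by default.
input_ids = tokenizer(prompt, add_special_tokens=False).input_ids
```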
@Rocketknight1 Thanks for checking in. :) Yeah, I perfectly understand the reasoning, it just wasn't obvious until recently that this, together with some of the built-in chat formats, was causing duplicate tokens.
Having multiple BOS tokens can ruin generation. This can occur in several ways, usually through the user adding them unnecessarily, so we now output a warning if we detect two in a row.
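A minimal sketch of such a check (illustrative names only, not the PR's exact code):

```python
import warnings

def warn_on_duplicate_bos(token_ids: list, bos_token_id: int) -> None:
    # Two consecutive BOS tokens almost always mean a BOS was added to a
    # prompt that already contained one, so emit a warning.
    for first, second in zip(token_ids, token_ids[1:]):
        if first == bos_token_id and second == bos_token_id:
            warnings.warn(
                "Detected duplicate BOS tokens in a row; is your prompt "
                "already formatted with a BOS?"
            )
            break
```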
However, several formats/templates also include BOS themselves, and we would then add another one later during completion; we therefore tokenize format/template prompts before completion to avoid this.
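Roughly, the avoidance looks like this (a sketch against `Llama.tokenize` from llama-cpp-python; `result` and its `added_special` flag refer to the new field described below):

```python
# result is the ChatFormatterResponse for the rendered chat prompt.
prompt_tokens = llama.tokenize(
    result.prompt.encode("utf-8"),
    add_bos=not result.added_special,  # skip BOS if the template already has one
    special=True,  # parse special tokens embedded in the template text
)
```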
Added a new `added_special` property to `ChatFormatterResponse` to detect when BOS/EOS has already been added to the prompt.
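For reference, a minimal sketch of the extended dataclass (field set abbreviated; only `added_special` is new here):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatFormatterResponse:
    prompt: str
    stop: Optional[str] = None
    # True when the chat format already embedded BOS/EOS in `prompt`,
    # so the caller must not add them again at tokenization time.
    added_special: bool = False
```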
Also added missing `token_cls`, `token_sep`, `add_bos_token` and `add_eos_token` to `_LlamaModel` because I thought I needed them (turns out I didn't).

Fixes #1501