Avoid duplicate special tokens in chat formats #1439
Conversation
This is to ensure that we don't duplicate any special tokens. Hopefully I amended the existing formats correctly?
@CISC thank you (also for the other PRs), this looks good. I may update the warning to use the Python
me too
Hi, Matt from Hugging Face here. Just a note that we expect chat templates to contain all the special tokens the model expects, and therefore in Transformers, whenever a chat has been formatted with a chat template, it should be tokenized without adding additional special tokens. We really had no choice but to do it this way - many base models add EOS at the end of sequences, but this will always cause problems if we want to generate a chat completion following a user message, as user messages often don't end with EOS in many chat formats.
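To illustrate that convention, here is a minimal sketch using the public Transformers API (the model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [{"role": "user", "content": "Hello!"}]

# The chat template itself inserts every special token the model expects
# (BOS, role markers, etc.), so the rendered string is already complete.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Tokenize WITHOUT adding special tokens again; add_special_tokens=True
# here would prepend a second BOS on models that add one by default.
input_ids = tokenizer(prompt, add_special_tokens=False).input_ids
```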
@Rocketknight1 Thanks for checking in. :) Yeah, I perfectly understand the reasoning, it just wasn't obvious until recently that this, together with some of the built-in chat formats, was causing duplicate tokens.
Having multiple BOS tokens can ruin generation. This can occur in several ways, usually through the user adding them unnecessarily, so we now output a warning if we detect two in a row.
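A minimal sketch of such a check (illustrative names only, not the PR's exact code):

```python
import warnings

def warn_on_duplicate_bos(token_ids: list, bos_token_id: int) -> None:
    # Two consecutive BOS tokens almost always mean a BOS was added to a
    # prompt that already contained one, so emit a warning.
    for first, second in zip(token_ids, token_ids[1:]):
        if first == bos_token_id and second == bos_token_id:
            warnings.warn(
                "Detected duplicate BOS tokens in a row; is your prompt "
                "already formatted with a BOS?"
            )
            break
```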
However, several formats/templates also include BOS themselves, and we would then add another one later during completion; we therefore tokenize format/template prompts before completion to avoid this.
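Roughly, the avoidance looks like this (a sketch against `Llama.tokenize` from llama-cpp-python; `result` and its `added_special` flag refer to the new field described below):

```python
# result is the ChatFormatterResponse for the rendered chat prompt.
prompt_tokens = llama.tokenize(
    result.prompt.encode("utf-8"),
    add_bos=not result.added_special,  # skip BOS if the template already has one
    special=True,  # parse special tokens embedded in the template text
)
```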
Added a new `added_special` property to `ChatFormatterResponse` to detect when BOS/EOS has already been added to the prompt.
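For reference, a minimal sketch of the extended dataclass (field set abbreviated; only `added_special` is new here):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatFormatterResponse:
    prompt: str
    stop: Optional[str] = None
    # True when the chat format already embedded BOS/EOS in `prompt`,
    # so the caller must not add them again at tokenization time.
    added_special: bool = False
```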
Also added missing `token_cls`, `token_sep`, `add_bos_token` and `add_eos_token` to `_LlamaModel` because I thought I needed them (turns out I didn't).

Fixes #1501