Fix UnicodeDecodeError permanently #118
They supposedly fixed this by changing the model converter in ggml-org/llama.cpp#79. According to this:

```python
>>> import llama_cpp
>>> lparams = llama_cpp.llama_context_default_params()
>>> ctx = llama_cpp.llama_init_from_file("./models/7B/ggml-model-q4_0.bin", lparams)
>>> def _tokenize(prompt, bos=True):
...     _arr = (llama_cpp.llama_token * (len(prompt) + 1))()
...     _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
...     return _arr[:_n]
...
>>> _tokenize("😀", False)
llama_tokenize: too many tokens
[]
>>> def _tokenize(prompt, bos=True):
...     _arr = (llama_cpp.llama_token * (len(prompt) + 6))()
...     _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
...     return _arr[:_n]
...
>>> _tokenize("😀", False)
[243, 162, 155, 131]
>>> [llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]
[b'\xf0', b'\x9f', b'\x98', b'\x80']
>>> b"".join([llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]).decode("utf8")
'😀'
```

(Also, yes, that's technically a bug in my main example; it'll get fixed when somebody submits an issue 😋.) The only real solution is to save invalid tokens until a valid output occurs (probably at most 4 tokens). According to this:

```python
>>> 243 & 240
240
```
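That bitwise test is the usual way to read a UTF-8 leading byte. As a gloss (my sketch, not code from the thread), the same mask logic covers every sequence length:

```python
def utf8_expected_len(first_byte: int) -> int:
    """How many bytes the UTF-8 sequence starting with first_byte should have.

    243 & 240 == 240 above is exactly the 4-byte-leader case (0b11110xxx).
    """
    if first_byte & 0b11111000 == 0b11110000:   # 0b11110xxx: 4-byte leader
        return 4
    if first_byte & 0b11110000 == 0b11100000:   # 0b1110xxxx: 3-byte leader
        return 3
    if first_byte & 0b11100000 == 0b11000000:   # 0b110xxxxx: 2-byte leader
        return 2
    return 1                                    # ASCII, or a continuation byte

assert utf8_expected_len(243) == 4  # 243 == 0xF3, the 😀 leader above
```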
I have tried to handle UTF-8 properly by detecting multibyte sequences and waiting for their completion.
@SagsMug can we reduce the use of […]
I have removed a bunch of cases and added a test for this.
Because of errors from Llama.generate() and the low-level API, I use the code snippet below.
https://docs.python.org/3/library/codecs.html#error-handlers
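Not the commenter's actual snippet, but a minimal sketch of that kind of workaround: feed the raw token bytes through Python's incremental UTF-8 decoder (part of the codecs machinery behind the error handlers linked above), which buffers incomplete sequences itself and, with errors="ignore", silently drops truly invalid bytes:

```python
import codecs

# The incremental decoder holds on to an incomplete multibyte sequence across
# chunks instead of raising UnicodeDecodeError mid-character.
decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")

def decode_stream(chunks):
    # chunks: an iterable of bytes objects, e.g. one per generated token
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush; drops a dangling partial
    if tail:
        yield tail

# The four single-byte tokens of 😀 come out as one character:
print("".join(decode_stream([b"\xf0", b"\x9f", b"\x98", b"\x80"])))  # 😀
```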
This detects a multibyte UTF-8 character and doesn't return it while it is incomplete.
If there otherwise weren't enough tokens, or the model somehow returns invalid bytes, we use errors="ignore" to remove the invalid characters.
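A minimal sketch of that buffering logic, where token_to_bytes is a hypothetical stand-in for a detokenizer such as llama_cpp.llama_token_to_str(ctx, token):

```python
def stream_text(token_ids, token_to_bytes):
    """Yield decoded text, holding bytes back until a UTF-8 sequence completes.

    token_to_bytes is a hypothetical stand-in for a detokenizer that returns
    the raw bytes of one token, e.g. llama_cpp.llama_token_to_str(ctx, token).
    """
    buf = b""
    for tok in token_ids:
        buf += token_to_bytes(tok)
        try:
            text = buf.decode("utf-8")
        except UnicodeDecodeError as err:
            # Bytes before err.start decode fine: emit them, keep the tail.
            if err.start > 0:
                yield buf[:err.start].decode("utf-8")
                buf = buf[err.start:]
            if len(buf) >= 4:
                # A UTF-8 character is at most 4 bytes, so the tail can't be
                # merely incomplete: remove the invalid bytes.
                salvaged = buf.decode("utf-8", errors="ignore")
                if salvaged:
                    yield salvaged
                buf = b""
            # Otherwise: possibly an incomplete multibyte character, so wait
            # for more tokens before returning anything.
        else:
            if text:
                yield text
            buf = b""
    if buf:
        # Ran out of tokens mid-character: drop whatever can't decode.
        tail = buf.decode("utf-8", errors="ignore")
        if tail:
            yield tail
```

Four buffered bytes is the worst case because UTF-8 never uses more than four bytes per code point, which matches the "at most 4 tokens" bound above.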
ASCII example:
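Reusing the hypothetical stream_text sketch above: every ASCII byte is a complete one-byte character, so each token decodes and is returned immediately, with nothing buffered.

```python
# Hypothetical single-byte tokens: one per ASCII character.
tokens = [ord(c) for c in "Hi!"]
print("".join(stream_text(tokens, lambda t: bytes([t]))))  # Hi!
```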
But a decode or encode error will never be thrown.
Fixes:
#36
#57
#100
#116