Microsoft.KernelMemory version 0.68+ compatibility fix #862
Merged
martindevans merged 8 commits into SciSharp:master from SpaceAntelope:kernel-memory-68-compatibility-fix on Jul 24, 2024
Member
I've submitted a few review comments. The one with the empty strings I'm not really sure how best to handle, and if you want to go ahead with the current implementation I'm happy with that as long as there's at least a test covering this weirdness and a comment explaining what's going on.
…of redundant tokens resulting from multi-token characters with ref to PR #862
Contributor, Author
@martindevans I pushed the relevant changes. I created a duplicate unit test with only the Unicode cases and added this comment (also referenced in the GetTokens implementations):
fixes #859
Issue details
The latest version of Microsoft.KernelMemory (0.68.240716.1 in my case) adds IReadOnlyList<string> GetTokens(string) to the interface Microsoft.KernelMemory.AI.ITextTokenizer.
This breaks any project that references the latest packages of LLamaSharp.kernel-memory and Microsoft.KernelMemory.Core together, mostly affecting developers who are just getting into LLamaSharp.
How it's solved in this commit
This commit provides a tentative implementation using LLamaContext.Tokenizer to get the tokens in embedding form and StreamingTokenDecoder to turn them back into (parts of) words and return them.
My assumptions for the overall expected behavior are based on the implementation of CountTokens in LLamaSharpTextEmbeddingGenerator and LLamaSharpTextGenerator. This means that it breaks on null input and returns an empty token that corresponds to the BOS embedding. Unit tests also check that the result of CountTokens matches the actual count of the tokens returned from GetTokens.
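To make the empty-string behavior concrete, here is a minimal, language-neutral sketch in Python. It is not the LLamaSharp implementation: it assumes a toy byte-level tokenizer (one token per UTF-8 byte plus a hypothetical BOS id) and uses Python's incremental UTF-8 decoder as a stand-in for a streaming token decoder. The names `BOS`, `tokenize`, `count_tokens` and `get_tokens` are illustrative only.

```python
import codecs
from typing import List

BOS = -1  # hypothetical id for the beginning-of-sequence marker


def tokenize(text: str, add_bos: bool = True) -> List[int]:
    """Toy tokenizer (assumption): one token per UTF-8 byte, optional BOS prefix."""
    if text is None:
        # Mirrors the "breaks on null input" behavior described above.
        raise ValueError("text must not be None")
    ids = list(text.encode("utf-8"))
    return ([BOS] if add_bos else []) + ids


def count_tokens(text: str) -> int:
    """Number of tokens, including BOS."""
    return len(tokenize(text))


def get_tokens(text: str) -> List[str]:
    """Decode tokens one at a time, keeping exactly one string per token."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces: List[str] = []
    for tok in tokenize(text):
        if tok == BOS:
            pieces.append("")  # BOS carries no visible text
            continue
        # decode() returns "" while a multi-byte character is still incomplete
        pieces.append(decoder.decode(bytes([tok])))
    return pieces


if __name__ == "__main__":
    tokens = get_tokens("hé")
    print(tokens)
    print(len(tokens) == count_tokens("hé"))
```

In this sketch a character that spans several tokens produces empty strings for every token except the last one, which is why the token list can contain entries that look redundant. Keeping them preserves the property that the length of the list equals the token count.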
Other considerations
In the unit tests I trim the 'actual' result to match the 'expected' to account for the added empty space that corresponds to the BOS token. Issues such as #856 indicate that further clarity will emerge with respect to how this should be properly handled.
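The trimming step can be sketched as follows. This is only an illustration in Python with made-up token pieces, not output from a real model and not the actual unit test code; `normalize` is a hypothetical helper name.

```python
from typing import List


def normalize(pieces: List[str]) -> str:
    """Join token pieces and trim surrounding whitespace before comparing."""
    return "".join(pieces).strip()


# Hypothetical decoded pieces: the first entry stands for the BOS token,
# which is assumed here to surface as a single leading space.
actual = [" ", "The", " quick", " fox"]
expected = "The quick fox"

assert normalize(actual) == expected
```

Trimming keeps the comparison independent of how the BOS token is rendered, at the cost of also hiding any genuine leading or trailing whitespace in the input.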