knowledge-support: Split large documents #777
anik120 wants to merge 2 commits into instructlab:main
4976192 to 8263b1b
4ff083c to 70894f2
Resolves #750
Co-authored-by: aajha <aajha@redhat.com>
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
70894f2 to eda8fef
Okay, @xukai92 @aartij22, I am fairly confident now that it's the functional test
@aartij22 with the change to a brute-force algorithm for splitting, we're eliminating the possibility of any risk being introduced by the introduction of … we should bring …
@xukai92 I am fairly confident in these changes now, at least in the fact that they do not break any existing functionality. I don't think we can justify investing any more time in looking at the functional tests, and I don't think this PR should be blocked because of them. We can look at the functional tests as a follow-up.
cli/utils.py
Outdated
no_tokens_per_doc = int(split_kd_wc * 1.3)  # 1 word =~ 1.3 token
if no_tokens_per_doc > int(ctx_window_size - 1024):
    logger.error("Error: Word count for each doc will exceed context window size")
    sys.exit(1)
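The check in the diff above can be read as a small, testable predicate. The following is only a sketch of that arithmetic, not the code in the PR: the function names `estimated_tokens` and `fits_context_window` and the two constants are hypothetical, while the 1.3 tokens-per-word heuristic and the 1024-token headroom are taken from the diff.

```python
# Hypothetical helper names; only the 1.3 ratio and the 1024-token
# headroom come from the diff under review.
TOKENS_PER_WORD = 1.3    # heuristic from the diff: 1 word =~ 1.3 tokens
RESERVED_TOKENS = 1024   # headroom subtracted from the context window


def estimated_tokens(split_kd_wc: int) -> int:
    """Estimate how many tokens a chunk of `split_kd_wc` words uses."""
    return int(split_kd_wc * TOKENS_PER_WORD)


def fits_context_window(split_kd_wc: int, ctx_window_size: int) -> bool:
    """Return True when the estimated chunk fits in the window minus headroom."""
    return estimated_tokens(split_kd_wc) <= ctx_window_size - RESERVED_TOKENS
```

Returning a boolean instead of calling `sys.exit(1)` inside the helper keeps the rule easy to unit-test; the caller can still log the error and exit.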
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
aea2619 to bb89e69
    show_default=True,
)
@click.option(
    "--kdoc-wc",
Can we simply call it --chunk-size?
--kdoc-wc is not accurate (it is not a parameter for the knowledge doc word count) and is hard to read (it uses an abbreviation).
        List[str]: List of split documents.
    """

    no_tokens_per_doc = int(split_kd_wc * 1.3)  # 1 word =~ 1.3 token
As said, we should just let the user input chunk_size, and use it to calculate the number of words we want to keep here.
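A rough sketch of what this suggestion could look like, under stated assumptions: `chunk_size` is taken to be a token budget (the comment does not say which unit it uses), the 1.3 tokens-per-word ratio is reused from the diff, and the function name `chunk_document` plus the plain whitespace-based split are hypothetical stand-ins rather than the PR's actual splitting code.

```python
from typing import List


def chunk_document(text: str, chunk_size: int,
                   tokens_per_word: float = 1.3) -> List[str]:
    """Split `text` into chunks of roughly `chunk_size` tokens each.

    Assumption: `chunk_size` is measured in tokens and converted to a
    word count with the 1 word =~ 1.3 tokens heuristic from the diff.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive number of tokens")
    # Number of words to keep per chunk; always at least one word.
    words_per_chunk = max(1, int(chunk_size / tokens_per_word))
    words = text.split()
    # Brute-force split: consecutive, non-overlapping runs of words.
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
```

For example, `chunk_document("a b c d e f g", chunk_size=4)` keeps three words per chunk, since 4 / 1.3 truncates to 3.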
…b#777)
**Description:** Thought that the skills_guide and knowlegde_guide should belong in the taxonomy repo instead of the community repo. Just copied the files and fixed the links. Still working on fixing the "Avoid these topics" section; might have to do that in another PR.
**Additional info:** Removing/fixing content in the community repo in instructlab/community#228
Signed-off-by: Kelly Brown <kelbrown@redhat.com>
Changes
Which issue is resolved by this Pull Request:
Resolves #750
Co-authored-by: aajha <aajha@redhat.com>