
knowledge-support: Split large documents#777

Closed
anik120 wants to merge 2 commits into instructlab:main from anik120:knowledge-split-docs

Conversation


@anik120 anik120 commented Apr 2, 2024

Changes

Which issue is resolved by this Pull Request:
Resolves #750

Co-authored-by: aajha <aajha@redhat.com>

@anik120 anik120 force-pushed the knowledge-split-docs branch 6 times, most recently from 4976192 to 8263b1b Compare April 3, 2024 02:53
@anik120 anik120 marked this pull request as draft April 3, 2024 04:48
@anik120 anik120 changed the title from knowledge-support: Split large documents to WIP knowledge-support: Split large documents Apr 3, 2024
@anik120 anik120 force-pushed the knowledge-split-docs branch 12 times, most recently from 4ff083c to 70894f2 Compare April 3, 2024 18:10
Resolves #750

Co-authored-by: aajha <aajha@redhat.com>
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
@anik120 anik120 force-pushed the knowledge-split-docs branch from 70894f2 to eda8fef Compare April 3, 2024 18:30
@anik120 anik120 changed the title from WIP knowledge-support: Split large documents to knowledge-support: Split large documents Apr 3, 2024
@anik120 anik120 marked this pull request as ready for review April 3, 2024 19:02

anik120 commented Apr 3, 2024

Okay, @xukai92 @aartij22, I am fairly confident now that it's the functional test test_ctx_size() that's the problem.

  1. In this PR, I'm starting us off with a simple, brute-force doc-splitting algorithm. This should be good enough to get us started; we can assess whether we need something smarter, with libraries like langchain, after we've used this algorithm for a while. Essentially, the algorithm in this PR (see the sketch after this list):

    • Takes a list of documents
    • Splits each document so that every resulting document contains at most 2400 words (based off of this math)
  2. See this test PR, where I only introduce split_knowledge_docs for ilab generate, and the tests fail at the same spot, with the same error.
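
For reference, a minimal sketch of the brute-force approach described above. The signature and word-based stride are my assumptions of how split_knowledge_docs might look, not the PR's exact implementation; the 2400-word default comes from the discussion:

from typing import List

def split_knowledge_docs(docs: List[str], max_words: int = 2400) -> List[str]:
    """Split each document into chunks of at most max_words words (sketch)."""
    chunks: List[str] = []
    for doc in docs:
        words = doc.split()
        # Walk the word list in fixed-size strides; each stride becomes a chunk.
        # (Original whitespace/newlines are not preserved in this sketch.)
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i : i + max_words]))
    return chunks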

@aartij22 with the change to a brute-force splitting algorithm, we're eliminating any risks that introducing langchain might bring. Once
a) it becomes apparent that we do need a smarter way to split, and
b) we've written some tests around these changes to give us the confidence to iterate and get smarter,

we should bring the langchain PR back up, because it looks very promising and possibly much more efficient than introducing our own logic (if we can lean on the community to do the heavy lifting for us, that's less maintenance for us 🎉; i.e. if we have to iterate on the algorithm I introduced, that's a higher maintenance burden, so langchain is definitely on the cards).

@xukai92 I am fairly confident in these changes now, at least in that they don't break any existing functionality. I don't think we can justify investing any more time in the functional tests, and I don't think this PR should be blocked because of them. We can look at the functional tests as a follow-up.

cli/utils.py Outdated
Comment on lines 228 to 231
no_tokens_per_doc = int(split_kd_wc * 1.3)  # 1 word =~ 1.3 tokens
# Leave 1024 tokens of headroom in the context window.
if no_tokens_per_doc > int(ctx_window_size - 1024):
    logger.error("Error: Word count for each doc will exceed context window size")
    sys.exit(1)

@xukai92 @aartij22 does this calculation look good?
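
For a concrete sanity check of that guard, a rough worked example (the 4096-token window is an illustrative number, not a value from the PR; the 1024-token reservation comes from the snippet above):

ctx_window_size = 4096                      # hypothetical context window
budget = ctx_window_size - 1024             # 3072 tokens left for the doc

split_kd_wc = 2400                          # requested words per chunk
no_tokens_per_doc = int(split_kd_wc * 1.3)  # 2400 * 1.3 = 3120 tokens

# 3120 > 3072, so this configuration would trip the guard and exit;
# the largest chunk that fits is roughly 3072 / 1.3 ≈ 2363 words.
print(no_tokens_per_doc > budget)           # True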

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
@anik120 anik120 force-pushed the knowledge-split-docs branch from aea2619 to bb89e69 Compare April 3, 2024 19:35
    show_default=True,
)
@click.option(
    "--kdoc-wc",

can we simply call it --chunk-size?
--kdoc-wc is not accurate (it's not really a knowledge-doc word count parameter) and is hard to read (an abbreviation).

List[str]: List of split documents.
"""

no_tokens_per_doc = int(split_kd_wc * 1.3)  # 1 word =~ 1.3 tokens

as said, we should just let the user input chunk_size, and use it to calculate the number of words we want to keep here.
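
A minimal sketch of that suggestion, assuming the user passes --chunk-size in tokens (the helper name is hypothetical; the 1.3 tokens-per-word heuristic is the one used above):

def max_words_for_chunk(chunk_size: int) -> int:
    """Derive the word budget from a user-supplied chunk size in tokens."""
    return int(chunk_size / 1.3)

# e.g. --chunk-size 3072 allows roughly 2363 words per chunk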


anik120 commented Apr 4, 2024


Turns out, it was #788 all along! Closing in favor of #772

@anik120 anik120 closed this Apr 4, 2024
jgato pushed a commit to jgato/instructlab that referenced this pull request Jun 21, 2024
…b#777)

**Description:** Thought that the skills_guide and knowlegde_guide
should belong in the taxonomy repo instead of the community repo. Just
copied the files and fixed the links. Still working on fixing the "Avoid
these topics" section; might have to do it in another PR.

**Additional info:** 
Removing/fixing content in community repo in
instructlab/community#228

Signed-off-by: Kelly Brown <kelbrown@redhat.com>
