
knowledge-support: Split large documents#777

Closed
anik120 wants to merge 2 commits into instructlab:main from anik120:knowledge-split-docs

Conversation


@anik120 anik120 commented Apr 2, 2024

Changes

Which issue is resolved by this Pull Request:
Resolves #750

Co-authored-by: aajha <aajha@redhat.com>

@anik120 anik120 force-pushed the knowledge-split-docs branch 6 times, most recently from 4976192 to 8263b1b Compare April 3, 2024 02:53
@anik120 anik120 marked this pull request as draft April 3, 2024 04:48
@anik120 anik120 changed the title from knowledge-support: Split large documents to WIP knowledge-support: Split large documents Apr 3, 2024
@anik120 anik120 force-pushed the knowledge-split-docs branch 12 times, most recently from 4ff083c to 70894f2 Compare April 3, 2024 18:10
Resolves #750

Co-authored-by: aajha <aajha@redhat.com>
Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
@anik120 anik120 force-pushed the knowledge-split-docs branch from 70894f2 to eda8fef Compare April 3, 2024 18:30
@anik120 anik120 changed the title from WIP knowledge-support: Split large documents to knowledge-support: Split large documents Apr 3, 2024
@anik120 anik120 marked this pull request as ready for review April 3, 2024 19:02

anik120 commented Apr 3, 2024

Okay, @xukai92 @aartij22, I am fairly confident now that it's the functional test test_ctx_size() that's the problem.

  1. In this PR, I'm starting us off with a simple, brute-force doc-splitting algorithm. This should be good enough to get us started; we can assess whether we need something smarter, with libraries like langchain, after we've used this algorithm for a while. Essentially, the algorithm in this PR (see the sketch after this list):

    • Takes a list of documents
    • Splits each document so that every resulting document contains at most 2400 words (based off of this math)
  2. See this test PR, where I only introduce split_knowledge_docs for ilab generate, and the tests fail at the same spot, with the same error.
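
For reference, a minimal sketch of the brute-force approach described above. The signature and word-based stride are my assumptions of how split_knowledge_docs might look, not the PR's exact implementation; the 2400-word default comes from the discussion:

from typing import List

def split_knowledge_docs(docs: List[str], max_words: int = 2400) -> List[str]:
    """Split each document into chunks of at most max_words words (sketch)."""
    chunks: List[str] = []
    for doc in docs:
        words = doc.split()
        # Walk the word list in fixed-size strides; each stride becomes a chunk.
        # (Original whitespace/newlines are not preserved in this sketch.)
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i : i + max_words]))
    return chunks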

@aartij22 with the change to a brute-force splitting algorithm, we're eliminating any risks that introducing langchain might bring. Once
a) it becomes apparent that we do need a smarter way to split, and
b) we've written some tests around these changes to give us the confidence to iterate and get smarter,

we should bring the langchain PR back up, because it looks very promising and possibly much more efficient than introducing our own logic (if we can lean on the community to do the heavy lifting for us, that's less maintenance for us 🎉; i.e. if we have to iterate on the algorithm I introduced, that's a higher maintenance burden, so langchain is definitely on the cards).

@xukai92 I am fairly confident in these changes now, at least in that they don't break any existing functionality. I don't think we can justify investing any more time in the functional tests, and I don't think this PR should be blocked because of them. We can look at the functional tests as a follow-up.

cli/utils.py Outdated
Comment on lines 228 to 231
no_tokens_per_doc = int(split_kd_wc * 1.3)  # 1 word =~ 1.3 tokens
# Leave 1024 tokens of headroom in the context window.
if no_tokens_per_doc > int(ctx_window_size - 1024):
    logger.error("Error: Word count for each doc will exceed context window size")
    sys.exit(1)

@xukai92 @aartij22 does this calculation look good?
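
For a concrete sanity check of that guard, a rough worked example (the 4096-token window is an illustrative number, not a value from the PR; the 1024-token reservation comes from the snippet above):

ctx_window_size = 4096                      # hypothetical context window
budget = ctx_window_size - 1024             # 3072 tokens left for the doc

split_kd_wc = 2400                          # requested words per chunk
no_tokens_per_doc = int(split_kd_wc * 1.3)  # 2400 * 1.3 = 3120 tokens

# 3120 > 3072, so this configuration would trip the guard and exit;
# the largest chunk that fits is roughly 3072 / 1.3 ≈ 2363 words.
print(no_tokens_per_doc > budget)           # True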

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
@anik120 anik120 force-pushed the knowledge-split-docs branch from aea2619 to bb89e69 Compare April 3, 2024 19:35
    show_default=True,
)
@click.option(
    "--kdoc-wc",

can we simply call it --chunk-size?
--kdoc-wc is not accurate (it's not really a knowledge-doc word count parameter) and is hard to read (an abbreviation).

List[str]: List of split documents.
"""

no_tokens_per_doc = int(split_kd_wc * 1.3)  # 1 word =~ 1.3 tokens

as said, we should just let the user input chunk_size, and use it to calculate the number of words we want to keep here.
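
A minimal sketch of that suggestion, assuming the user passes --chunk-size in tokens (the helper name is hypothetical; the 1.3 tokens-per-word heuristic is the one used above):

def max_words_for_chunk(chunk_size: int) -> int:
    """Derive the word budget from a user-supplied chunk size in tokens."""
    return int(chunk_size / 1.3)

# e.g. --chunk-size 3072 allows roughly 2363 words per chunk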


anik120 commented Apr 4, 2024


Turns out, it was #788 all along! Closing in favor of #772

@anik120 anik120 closed this Apr 4, 2024
jgato pushed a commit to jgato/instructlab that referenced this pull request Jun 21, 2024
…b#777)

**Description:** Thought that the skills_guide and knowlegde_guide
should belong in the taxonomy repo instead of the community repo. Just
copied the files and fixed the links. Still working on fixing the "Avoid
these topics" section; might have to do it in another PR.

**Additional info:** 
Removing/fixing content in community repo in
instructlab/community#228

Signed-off-by: Kelly Brown <kelbrown@redhat.com>
