Skip to content

Use instructlab-schema package to parse qna.yaml files#1962

Merged
mergify[bot] merged 1 commit intoinstructlab:mainfrom
bjhargrave:schema-taxonomy-parsing
Aug 19, 2024
Merged

Use instructlab-schema package to parse qna.yaml files#1962
mergify[bot] merged 1 commit intoinstructlab:mainfrom
bjhargrave:schema-taxonomy-parsing

Conversation

@bjhargrave
Copy link
Contributor

We replace parsing/validation code to use the shared code from instructlab-schema package.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@mergify mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing dependencies Relates to dependencies labels Aug 2, 2024
@bjhargrave bjhargrave requested a review from hickeyma August 2, 2024 16:35
@nathan-weinberg nathan-weinberg linked an issue Aug 5, 2024 that may be closed by this pull request
@mergify
Copy link
Contributor

mergify bot commented Aug 8, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @bjhargrave please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label Aug 8, 2024
@russellb russellb removed their request for review August 8, 2024 18:21
@bjhargrave bjhargrave force-pushed the schema-taxonomy-parsing branch from fc4518c to 2c90fb8 Compare August 8, 2024 19:17
@mergify mergify bot removed the needs-rebase This Pull Request needs to be rebased label Aug 8, 2024
@nathan-weinberg nathan-weinberg added this to the 0.19.0 milestone Aug 9, 2024
@mergify
Copy link
Contributor

mergify bot commented Aug 9, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @bjhargrave please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label Aug 9, 2024
@bjhargrave bjhargrave force-pushed the schema-taxonomy-parsing branch from 2c90fb8 to af90b58 Compare August 9, 2024 21:24
@mergify mergify bot removed the needs-rebase This Pull Request needs to be rebased label Aug 9, 2024
@mairin
Copy link
Member

mairin commented Aug 12, 2024

Right now taxonomy diff does 5 layers of checking. Those 5 layers will be handled by this now.

@nathan-weinberg can you describe the 5 layers here?

This came up in our team weekly meeting today, when it was suggested taxonomy diff should check the yaml schema version and let the user know if supported or not. Right now we flag at sdg generate and that's a bit late in the process.

@bjhargrave
Copy link
Contributor Author

The parsing/validation code will check the knowledge version for v3 and report an error for v1/v2. There is a test case in this PR for this.

https://github.com/instructlab/instructlab/pull/1962/files#diff-f2fb20ddc8bf7f3b80b1a99bfe9fd3e563c5675f9faf7f914ac7f74e276d15f1R348-R368

The parsing/validation code will handle the linting and json schema validation reporting errors if found. I don't know what the 5 layers are.

We replace parsing/validation code to use the shared code
from instructlab-schema package.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
@bjhargrave bjhargrave force-pushed the schema-taxonomy-parsing branch from af90b58 to 0c3a5b2 Compare August 16, 2024 15:34
@nathan-weinberg
Copy link
Member

5 layers are:

  1. file extension check (is it .yaml)
  2. content check (is the file empty)
  3. YAML lint check (is the file valid yaml)
  4. Schema check (does the file follow the taxonomy schema)
  5. Knowledge check (does the file have everything needed for knowledge if applicable)

This PR should defintely be a 0.19.0 priority IMO - we want to align the schema checking as much as possible while minimizing code duplication (there's some code for this in the SDG library as well)

@nathan-weinberg nathan-weinberg requested a review from a team August 19, 2024 02:51
Copy link
Member

@RobotSail RobotSail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mergify mergify bot added the one-approval PR has one approval from a maintainer label Aug 19, 2024
@mergify mergify bot removed the one-approval PR has one approval from a maintainer label Aug 19, 2024
@bjhargrave bjhargrave removed the request for review from hickeyma August 19, 2024 19:44
@mergify mergify bot merged commit 1bb13cc into instructlab:main Aug 19, 2024
@bjhargrave bjhargrave deleted the schema-taxonomy-parsing branch August 19, 2024 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI/CD Affects CI/CD configuration dependencies Relates to dependencies testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove yamllint from requirements.txt

4 participants