feat: Parameter to send custom page range when splitting pdf#125
Merged
When the client prepares the request, it turns list parameters into multiple instances of the same key. For instance:

`extract_image_block_types=["Image", "Table"]`

becomes

`extract_image_block_types[]="Image"`
`extract_image_block_types[]="Table"`

We need to account for this in our `parse_form_data` helper if we want to use list params in our hooks. Likewise, we need to go the other way when recreating the request in `create_request_body`.
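The round trip described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `collapse_list_fields` and `expand_list_fields` are hypothetical names, and the sketch assumes the form data is available as a list of `(name, value)` string pairs rather than the multipart structures that `parse_form_data` and `create_request_body` actually handle.

```python
from typing import Dict, List, Tuple, Union

FormValue = Union[str, List[str]]


def collapse_list_fields(pairs: List[Tuple[str, str]]) -> Dict[str, FormValue]:
    """Merge repeated `key[]` fields into a single list stored under `key`.

    Hypothetical helper: assumes a key is never sent both as a scalar
    and as a list in the same request.
    """
    result: Dict[str, FormValue] = {}
    for name, value in pairs:
        if name.endswith("[]"):
            key = name[:-2]  # drop the trailing "[]"
            bucket = result.setdefault(key, [])
            if not isinstance(bucket, list):
                raise ValueError(f"{key} was sent as both a scalar and a list")
            bucket.append(value)
        else:
            result[name] = value
    return result


def expand_list_fields(form: Dict[str, FormValue]) -> List[Tuple[str, str]]:
    """Inverse of collapse_list_fields: emit one `key[]` pair per list item."""
    pairs: List[Tuple[str, str]] = []
    for key, value in form.items():
        if isinstance(value, list):
            pairs.extend((f"{key}[]", item) for item in value)
        else:
            pairs.append((key, value))
    return pairs
```

Collapsing on the way in lets a hook read a list parameter as a normal Python list, and expanding on the way out keeps the rebuilt request in the same shape the client originally produced.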
awalker4 added a commit to Unstructured-IO/unstructured-js-client that referenced this pull request on Aug 7, 2024
To match the python feature: Unstructured-IO/unstructured-python-client#125 Add a client-side param called `splitPdfPageRange` which takes a list of two integers, `[start, end]`. If `splitPdfPage` is `true` and a range is set, slice the doc from `start` up to and including `end`. Only this page range will be sent to the API. The subset of pages is still split up as needed. If `[start, end]` is out of bounds, throw an error to the user.
awalker4 added a commit to Unstructured-IO/unstructured-js-client that referenced this pull request on Aug 9, 2024
To match the python feature: Unstructured-IO/unstructured-python-client#125

# New parameter

Add a client-side param called `splitPdfPageRange` which takes a list of two integers, `[start, end]`. If `splitPdfPage` is `true` and a range is set, slice the doc from `start` up to and including `end`. Only this page range will be sent to the API. The subset of pages is still split up as needed.

If `[start, end]` is out of bounds, throw an error to the user.

# Testing

Check out this branch and set up a request to your local API:

```
const client = new UnstructuredClient({
    serverURL: "http://localhost:8000",
    security: {
        apiKeyAuth: key,
    },
});

const filename = "layout-parser-paper.pdf";
const data = fs.readFileSync(filename);

client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: filename,
        },
        strategy: Strategy.Fast,
        splitPdfPage: true,
        splitPdfPageRange: [4, 8],
    }
}).then((res: PartitionResponse) => {
    if (res.statusCode == 200) {
        console.log(res.elements);
    }
}).catch((e) => {
    if (e.statusCode) {
        console.log(e.statusCode);
        console.log(e.body);
    } else {
        console.log(e);
    }
});
```

Test out various page ranges and confirm that the returned elements are within the range. Invalid ranges should throw a useful Error (pages are out of bounds, or end_page < start_page).
New parameter
Add a client-side param called `split_pdf_page_range` which takes a list of two integers, `[start_page, end_page]`. If `split_pdf_page` is `True` and a range is set, slice the doc from `start_page` up to and including `end_page`. Only this page range will be sent to the API. The subset of pages is still split up as needed.

Other changes
Allow our custom hooks to properly access list parameters, so we're able to intercept `split_pdf_page_range`. We need extra handling to get list params out of the request in `parse_form_data`, and to rebuild the payload in `create_request_body`.

Testing
Check out this branch and set up a request to your local API:

Test out various page ranges and confirm that the returned elements are within the range. Invalid ranges should throw a `ValueError` (pages are out of bounds, or `end_page < start_page`).
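For reviewers who want a mental model of the range check, here is a rough sketch of the validation and inclusive slicing described above. It is not the implementation in this PR: `validate_page_range` and `select_pages` are hypothetical names, pages are stood in for by a plain list of strings, and 1-based page numbering is an assumption.

```python
from typing import List, Tuple


def validate_page_range(page_range: Tuple[int, int], page_count: int) -> Tuple[int, int]:
    """Return (start_page, end_page) or raise ValueError for an invalid range.

    Assumption: page numbers are 1-based and the range is inclusive.
    """
    start_page, end_page = page_range
    if end_page < start_page:
        raise ValueError(
            f"Page range {page_range} is invalid: end_page is before start_page."
        )
    if start_page < 1 or end_page > page_count:
        raise ValueError(
            f"Page range {page_range} is out of bounds: "
            f"start_page and end_page must be within [1, {page_count}]."
        )
    return start_page, end_page


def select_pages(pages: List[str], page_range: Tuple[int, int]) -> List[str]:
    """Slice `pages` from start_page up to and including end_page."""
    start_page, end_page = validate_page_range(page_range, len(pages))
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return pages[start_page - 1 : end_page]
```

With a ten-page document, `select_pages(pages, (4, 8))` keeps pages 4 through 8, while `(8, 4)` or `(5, 11)` raises a `ValueError` before anything is sent.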