feat: reduce and update methods for DocumentArray and BaseDocument #1076
Merged
12 commits
a5ef6ca
feat: add reduce utils
745d1c2
feat: support sets reducing
9419ec5
feat: support sub docarrays reducing
5f0ecef
feat: finish feature implementation and testing
5c439b8
docs: add documentation and fix ruff
a517a33
Merge branch 'feat-rewrite-v2' into feat-reduce-v2
7c3d9f4
fix: apply comments and support dicts
5bb4262
feat: add update method to BaseDocument and fix reduce behavior
44c5b8a
refactor: move reduce docs to update
72684cd
refactor: use get origin instead of private _GenericAlias
samsja e921937
fix: fix ruff
samsja dadffb5
docs: add clarification about tuples
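
The refactor commit above that mentions "get origin" presumably refers to `typing.get_origin` from the standard library. The changed lines for that commit are not shown on this page, so the following is only a sketch of how a field annotation could be classified as a list, set, or dict through the public `typing` API instead of private classes. The helper name `container_kind` is hypothetical and is not taken from the PR.

```python
from typing import Dict, List, Optional, Set, Union, get_args, get_origin


def container_kind(annotation):
    """Return list, set, or dict if the annotation is (an Optional of) that
    container type, otherwise None. Hypothetical helper for illustration."""
    origin = get_origin(annotation)
    if origin is Union:
        # Optional[X] is Union[X, None]; inspect the non-None members.
        for arg in get_args(annotation):
            if arg is not type(None):
                kind = container_kind(arg)
                if kind is not None:
                    return kind
        return None
    if origin in (list, set, dict):
        return origin  # e.g. List[str], Set, Dict[str, int]
    if annotation in (list, set, dict):
        return annotation  # bare builtin used as the annotation
    return None


print(container_kind(Optional[List[str]]))
print(container_kind(Optional[int]))
```

A merge routine could use the returned kind to decide whether to concatenate, union, or key-merge two field values.
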
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| from docarray import DocumentArray | ||
| from typing import List, Optional, Dict | ||
| | ||
| | ||
| def reduce( | ||
|     left: DocumentArray, right: DocumentArray, left_id_map: Optional[Dict] = None | ||
| ) -> 'DocumentArray': | ||
|     """ | ||
|     Reduces left and right DocumentArray into one DocumentArray in-place. | ||
|     Changes are applied to the left DocumentArray. | ||
|     Reducing 2 DocumentArrays consists in adding Documents in the second DocumentArray | ||
|     to the first DocumentArray if they do not exist. | ||
|     If a Document exists in both DocumentArrays (identified by ID), | ||
|     the data properties are merged with priority to the left Document. | ||
| | ||
|     Nested DocumentArrays are also reduced in the same way. | ||
|     :param left: First DocumentArray to be reduced. Changes will be applied to it | ||
|     in-place | ||
|     :param right: Second DocumentArray to be reduced | ||
|     :param left_id_map: Optional parameter to be passed in repeated calls | ||
|     for optimizations, keeping a map of the Document ID to its offset | ||
|     in the DocumentArray | ||
|     :return: Reduced DocumentArray | ||
|     """ | ||
|     left_id_map = left_id_map or {doc.id: i for i, doc in enumerate(left)} | ||
| | ||
|     for doc in right: | ||
|         if doc.id in left_id_map: | ||
|             left[left_id_map[doc.id]].update(doc) | ||
|         else: | ||
|             left.append(doc) | ||
| | ||
|     return left | ||
| | ||
| | ||
| def reduce_all(docarrays: List[DocumentArray]) -> DocumentArray: | ||
|     """ | ||
|     Reduces a list of DocumentArrays into one DocumentArray. | ||
|     Changes are applied to the first DocumentArray in-place. | ||
| | ||
|     The resulting DocumentArray contains Documents of all DocumentArrays. | ||
|     If a Document exists (identified by their ID) in many DocumentArrays, | ||
|     data properties are merged with priority to the left-most | ||
|     DocumentArrays (that is, if a data attribute is set in a Document | ||
|     belonging to many DocumentArrays, the attribute value of the left-most | ||
|     DocumentArray is kept). | ||
|     Nested DocumentArrays belonging to many DocumentArrays | ||
|     are also reduced in the same way. | ||
|     .. note:: | ||
|         - Nested DocumentArrays order does not follow any specific rule. | ||
|         You might want to re-sort them in a later step. | ||
|         - The final result depends on the order of DocumentArrays | ||
|         when applying reduction. | ||
| | ||
|     :param docarrays: List of DocumentArrays to be reduced | ||
|     :return: the resulting DocumentArray | ||
|     """ | ||
|     if len(docarrays) <= 1: | ||
|         raise Exception( | ||
|             'In order to reduce DocumentArrays' | ||
|             ' we should have more than one DocumentArray' | ||
|         ) | ||
|     left = docarrays[0] | ||
|     others = docarrays[1:] | ||
|     left_id_map = {doc.id: i for i, doc in enumerate(left)} | ||
|     for da in others: | ||
|         reduce(left, da, left_id_map) | ||
|     return left | ||
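
To make the merge rules of `reduce` and `reduce_all` concrete, here is a minimal, dependency-free sketch. `Doc` is a hypothetical stand-in rather than a docarray class, plain lists replace `DocumentArray`, and its `update` only approximates the real method. One deliberate difference is labeled in the code: the sketch records the offset of each appended document in `left_id_map`. The diff above does not appear to do so, which suggests that a document missing from the first array but present in two later ones could be appended twice when the map is reused.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not the docarray class."""
    id: str
    text: Optional[str] = None
    tags: List[str] = field(default_factory=list)

    def update(self, other: "Doc") -> None:
        # Simplified merge: a non-None text from `other` wins, tags are concatenated.
        if other.text is not None:
            self.text = other.text
        self.tags.extend(other.tags)


def reduce(left: List[Doc], right: List[Doc],
           left_id_map: Optional[Dict[str, int]] = None) -> List[Doc]:
    # Build the id -> offset map only if the caller did not supply one.
    if left_id_map is None:
        left_id_map = {doc.id: i for i, doc in enumerate(left)}
    for doc in right:
        if doc.id in left_id_map:
            left[left_id_map[doc.id]].update(doc)
        else:
            left.append(doc)
            # Assumption (not in the diff): record the new offset so later
            # calls that reuse the map can find this document.
            left_id_map[doc.id] = len(left) - 1
    return left


def reduce_all(docarrays: List[List[Doc]]) -> List[Doc]:
    if len(docarrays) <= 1:
        raise ValueError("need more than one list to reduce")
    left, others = docarrays[0], docarrays[1:]
    left_id_map = {doc.id: i for i, doc in enumerate(left)}
    for da in others:
        reduce(left, da, left_id_map)
    return left


merged = reduce_all([
    [Doc("a", "first", ["x"])],
    [Doc("a", "second", ["y"]), Doc("b", "only in 2nd")],
    [Doc("b", None, ["z"])],
])
print([(d.id, d.text, d.tags) for d in merged])
```

With these inputs the result holds two documents: `a` carries the second list's text and both tag lists, and `b` is appended once and then updated by the third list.
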
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,25 @@ | ||
| from typing import Optional, List | ||
| from docarray.base_document.document import BaseDocument | ||
| | ||
| | ||
| def test_base_document_init(): | ||
| | ||
|     doc = BaseDocument() | ||
| | ||
|     assert doc.id is not None | ||
| | ||
| | ||
| def test_update(): | ||
|     class MyDocument(BaseDocument): | ||
|         content: str | ||
|         title: Optional[str] = None | ||
|         tags_: List | ||
| | ||
|     doc1 = MyDocument( | ||
|         content='Core content of the document', title='Title', tags_=['python', 'AI'] | ||
|     ) | ||
|     doc2 = MyDocument(content='Core content updated', tags_=['docarray']) | ||
| | ||
|     doc1.update(doc2) | ||
|     assert doc1.content == 'Core content updated' | ||
|     assert doc1.title == 'Title' | ||
|     assert doc1.tags_ == ['python', 'AI', 'docarray'] | ||
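
The assertions in `test_update` imply three rules: a scalar set on the right-hand document replaces the left value, a field left as `None` on the right leaves the left value alone, and lists are concatenated. The sketch below reproduces those rules with a plain dataclass so it runs without docarray. `MyDocument` here is a hypothetical stand-in, and its `update` is an approximation of the behavior under test, not the method added by this PR.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class MyDocument:
    """Hypothetical stand-in; not docarray's BaseDocument."""
    content: str
    title: Optional[str] = None
    tags_: List[str] = field(default_factory=list)

    def update(self, other: "MyDocument") -> None:
        for f in fields(self):
            new = getattr(other, f.name)
            if new is None:
                continue  # unset on the right: keep the left value
            current = getattr(self, f.name)
            if isinstance(current, list) and isinstance(new, list):
                current.extend(new)  # lists are concatenated in place
            else:
                setattr(self, f.name, new)  # scalars: the right value wins


doc1 = MyDocument(
    content="Core content of the document", title="Title", tags_=["python", "AI"]
)
doc2 = MyDocument(content="Core content updated", tags_=["docarray"])
doc1.update(doc2)
print(doc1)
```

Note that the right-hand value wins for `content` here, whereas the `reduce` docstring speaks of priority to the left document; the tests, not the docstring, are what this sketch follows.
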
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| import pytest | ||
| from typing import Optional, List, Dict, Set | ||
| from docarray import BaseDocument, DocumentArray | ||
| from docarray.documents import Image | ||
| | ||
| | ||
| class InnerDoc(BaseDocument): | ||
|     integer: int | ||
|     l: List | ||
| | ||
| | ||
| class MMDoc(BaseDocument): | ||
|     text: str = '' | ||
|     price: int = 0 | ||
|     categories: Optional[List[str]] = None | ||
|     image: Optional[Image] = None | ||
|     matches: Optional[DocumentArray] = None | ||
|     matches_with_same_id: Optional[DocumentArray] = None | ||
|     opt_int: Optional[int] = None | ||
|     test_set: Optional[Set] = None | ||
|     inner_doc: Optional[InnerDoc] = None | ||
|     test_dict: Optional[Dict] = None | ||
| | ||
| | ||
| @pytest.fixture | ||
| def doc1(): | ||
|     return MMDoc( | ||
|         text='hey here', | ||
|         categories=['a', 'b', 'c'], | ||
|         price=10, | ||
|         matches=DocumentArray[MMDoc]([MMDoc()]), | ||
|         matches_with_same_id=DocumentArray[MMDoc]( | ||
|             [MMDoc(id='a', matches=DocumentArray[MMDoc]([MMDoc()]))] | ||
|         ), | ||
|         test_set={'a', 'a'}, | ||
|         inner_doc=InnerDoc(integer=2, l=['c', 'd']), | ||
|         test_dict={'a': 0, 'b': 2, 'd': 4, 'z': 3}, | ||
|     ) | ||
| | ||
| | ||
| @pytest.fixture | ||
| def doc2(doc1): | ||
|     return MMDoc( | ||
|         id=doc1.id, | ||
|         text='hey here 2', | ||
|         categories=['d', 'e', 'f'], | ||
|         price=5, | ||
|         opt_int=5, | ||
|         matches=DocumentArray[MMDoc]([MMDoc()]), | ||
|         matches_with_same_id=DocumentArray[MMDoc]( | ||
|             [MMDoc(id='a', matches=DocumentArray[MMDoc]([MMDoc()]))] | ||
|         ), | ||
|         test_set={'a', 'b'}, | ||
|         inner_doc=InnerDoc(integer=3, l=['a', 'b']), | ||
|         test_dict={'a': 10, 'b': 10, 'c': 3, 'z': None}, | ||
|     ) | ||
| | ||
| | ||
| def test_update_complex(doc1, doc2): | ||
|     doc1.update(doc2) | ||
|     # doc1 is changed in place (no extra memory) | ||
|     assert doc1.text == 'hey here 2' | ||
|     assert doc1.categories == ['a', 'b', 'c', 'd', 'e', 'f'] | ||
|     assert len(doc1.matches) == 2 | ||
|     assert doc1.opt_int == 5 | ||
|     assert doc1.price == 5 | ||
|     assert doc1.test_set == {'a', 'b'} | ||
|     assert len(doc1.matches_with_same_id) == 1 | ||
|     assert len(doc1.matches_with_same_id[0].matches) == 2 | ||
|     assert doc1.inner_doc.integer == 3 | ||
|     assert doc1.inner_doc.l == ['c', 'd', 'a', 'b'] | ||
|     assert doc1.test_dict == {'a': 10, 'b': 10, 'c': 3, 'd': 4, 'z': None} | ||
| | ||
| | ||
| def test_update_simple(): | ||
|     class MyDocument(BaseDocument): | ||
|         content: str | ||
|         title: Optional[str] = None | ||
|         tags_: List | ||
| | ||
|     my_doc1 = MyDocument( | ||
|         content='Core content of the document', title='Title', tags_=['python', 'AI'] | ||
|     ) | ||
|     my_doc2 = MyDocument(content='Core content updated', tags_=['docarray']) | ||
| | ||
|     my_doc1.update(my_doc2) | ||
|     assert my_doc1.content == 'Core content updated' | ||
|     assert my_doc1.title == 'Title' | ||
|     assert my_doc1.tags_ == ['python', 'AI', 'docarray'] | ||
| | ||
| | ||
| def test_update_different_schema_fails(): | ||
|     class DocA(BaseDocument): | ||
|         content: str | ||
| | ||
|     class DocB(BaseDocument): | ||
|         image: Optional[Image] = None | ||
| | ||
|     docA = DocA(content='haha') | ||
|     docB = DocB() | ||
|     with pytest.raises(Exception): | ||
|         docA.update(docB) | ||
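
Beyond scalars and lists, `test_update_complex` implies that sets are unioned, that dicts are merged key by key with the right-hand value winning even when it is `None`, that nested documents are merged recursively, and that updating from a document with a different schema fails. The following dependency-free sketch illustrates those rules. `Record`, `Inner`, `Outer`, and `merge_value` are hypothetical names, raising `TypeError` on a schema mismatch is an assumption (the test only requires some `Exception`), and nested `DocumentArray` fields are left out.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional, Set


@dataclass
class Record:
    """Hypothetical base class standing in for a document (not docarray API)."""

    def update(self, other: "Record") -> None:
        # Assumption: a schema mismatch raises TypeError.
        if type(other) is not type(self):
            raise TypeError("cannot update from a document with a different schema")
        for f in fields(self):
            merged = merge_value(getattr(self, f.name), getattr(other, f.name))
            setattr(self, f.name, merged)


def merge_value(left: Any, right: Any) -> Any:
    """Merge right into left using the rules the assertions above imply."""
    if right is None:
        return left  # nothing set on the right: keep the left value
    if left is None:
        return right
    if isinstance(left, Record) and isinstance(right, Record):
        left.update(right)  # nested documents are merged recursively
        return left
    if isinstance(left, list) and isinstance(right, list):
        left.extend(right)  # lists: concatenate in place
        return left
    if isinstance(left, set) and isinstance(right, set):
        left.update(right)  # sets: union in place
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        left.update(right)  # dicts: right keys win, even when the value is None
        return left
    return right  # scalars: the right value wins


@dataclass
class Inner(Record):
    integer: int = 0
    l: List[str] = field(default_factory=list)


@dataclass
class Outer(Record):
    price: int = 0
    opt_int: Optional[int] = None
    test_set: Optional[Set[str]] = None
    test_dict: Optional[Dict[str, Any]] = None
    inner_doc: Optional[Inner] = None


a = Outer(price=10, test_set={"a"}, test_dict={"a": 0, "b": 2, "d": 4, "z": 3},
          inner_doc=Inner(integer=2, l=["c", "d"]))
b = Outer(price=5, opt_int=5, test_set={"a", "b"},
          test_dict={"a": 10, "b": 10, "c": 3, "z": None},
          inner_doc=Inner(integer=3, l=["a", "b"]))
a.update(b)
print(a)

try:
    Inner().update(Outer())
    mismatch_raised = False
except TypeError:
    mismatch_raised = True
print(mismatch_raised)
```

Merging containers in place mirrors the test's comment that `doc1` is changed in place; a variant that returns new containers would be equally valid for the assertions shown.
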