Commit 915406d

chore: add docstring for document array
Signed-off-by: Sami Jaghouar <sami.jaghouar@hotmail.fr>
1 parent 31d85d8 commit 915406d

1 file changed

Lines changed: 63 additions & 2 deletions

docarray/array/array.py

@@ -32,9 +32,70 @@ class DocumentArray(
     BaseNode,
 ):
     """
-    a DocumentArray is a list-like container of Document of the same schema
+    A DocumentArray is a container of Documents.
 
     :param docs: iterable of Document
+
+
+    A DocumentArray can only contain Documents that follow the same schema. To specify
+    this schema, use the `DocumentArray[Document]` syntax. This creates a
+    DocumentArray that can only contain Documents of the type Document. (Note that there
+    is a special schema (AnyDocument) that allows any Document to be stored in the
+    DocumentArray. Its use is not recommended; it exists mostly for
+    serialization with protobuf, when the receiver does not know the schema of the
+    DocumentArray but still wants to deserialize it.)
+
+    EXAMPLE USAGE
+    .. code-block:: python
+        from docarray import Document, DocumentArray
+        from docarray.typing import NdArray, ImageUrl
+        from typing import Optional
+        class Image(Document):
+            tensor: Optional[NdArray[100]]
+            url: ImageUrl
+
+        da = DocumentArray[Image](Image(url='http://url.com') for _ in range(10))
+
+
+    DocumentArray defines a getter and a setter for each field of the Document schema.
+    These getters and setters are created dynamically at runtime. This lets you access
+    the fields of the Documents in a natural way. For example, with a DocumentArray of
+    Images, `da.tensor` returns the tensors of all the Images in the
+    DocumentArray, and `da.tensor = np.random.random([10, 100])` sets the
+    tensors of all the Images.
+
+
+    A DocumentArray can be in one of two modes: row and stacked.
+
+
+    By default, it is in row mode (or unstacked mode): a DocumentArray is almost just a
+    list of Documents. You can append Documents to it, iterate over it, etc. Each
+    Document owns its data. You can see it as a row-based data structure. In this mode,
+    the getter shown above returns a list of the field values of each Document
+    (or a DocumentArray if the field is a nested Document). This list/DocumentArray is
+    created on the fly. The setter sets the field of each Document to the corresponding
+    value of the list/DocumentArray/Tensor passed as a parameter.
+
+    Nevertheless, this list-like behavior is not always optimal, especially when you
+    want to process your data in batches and do operations that involve matrix
+    computation, as in deep learning or when computing the cosine similarity of
+    embeddings. In that case you want to stack all the tensors in a single batch. This
+    is where the stacked mode of the DocumentArray comes in handy.
+    In stacked mode, each Tensor field is stored as a column: a single tensor whose
+    batch size is the size of the DocumentArray. This allows operating on the whole
+    batch instead of iterating over the DocumentArray.
+    In this mode, the Documents in the DocumentArray don't own the data anymore but just
+    reference the data in the tensor. For the user there is no difference: it
+    looks and feels the same.
+    In stacked mode, the getter and setter just return or replace the tensor of the
+    DocumentArray for the given field.
+    Finally, in stacked mode, operations like `da.append` are not allowed anymore
+    because they are too slow. Use the unstacked
+    mode for them instead.
+
+    To switch between stacked mode and unstacked mode, call `da.stack()` or
+    `da.unstack()`. There are also two context managers for these modes:
+    `with da.stack_mode():` and `with da.unstack_mode():`
     """
 
     document_type: Type[BaseDocument] = AnyDocument
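
The per-field getter and setter described in the docstring above can be illustrated with a small, self-contained sketch. It does not use docarray: `ToyImage` and `ToyDocArray` are hypothetical names, and routing attribute access through `__getattr__`/`__setattr__` is only an assumption about one way to obtain this behavior, not a description of how the library implements it.

```python
from dataclasses import dataclass, fields
from typing import Any, Iterable, List, Optional


@dataclass
class ToyImage:
    """Hypothetical stand-in for a Document schema (not a docarray class)."""

    url: str
    tensor: Optional[List[float]] = None


class ToyDocArray:
    """Hypothetical row-mode container: a list of documents sharing one schema."""

    def __init__(self, doc_type: type, docs: Iterable[Any]):
        # Use object.__setattr__ so internal attributes bypass the field setter below.
        object.__setattr__(self, "_doc_type", doc_type)
        object.__setattr__(self, "_fields", {f.name for f in fields(doc_type)})
        object.__setattr__(self, "_docs", [])
        for doc in docs:
            self.append(doc)

    def append(self, doc: Any) -> None:
        # Enforce the "same schema" rule: only instances of doc_type are accepted.
        if not isinstance(doc, self._doc_type):
            raise TypeError(f"expected {self._doc_type.__name__}, got {type(doc).__name__}")
        self._docs.append(doc)

    def __len__(self) -> int:
        return len(self._docs)

    def __getattr__(self, name: str) -> List[Any]:
        # Only reached when normal lookup fails; private names never map to fields.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._fields:
            # Row mode: the list is built on the fly from each document.
            return [getattr(doc, name) for doc in self._docs]
        raise AttributeError(name)

    def __setattr__(self, name: str, values: Any) -> None:
        if name in self._fields:
            values = list(values)
            if len(values) != len(self._docs):
                raise ValueError("need exactly one value per document")
            for doc, value in zip(self._docs, values):
                setattr(doc, name, value)
        else:
            object.__setattr__(self, name, values)


da = ToyDocArray(ToyImage, (ToyImage(url=f"http://url.com/{i}") for i in range(3)))
da.tensor = [[0.0], [1.0], [2.0]]  # sets the field on every document
urls = da.url  # gathers the field from every document
```

In this sketch each document keeps owning its data, which matches the row (unstacked) mode; the stacked mode would instead keep one shared column per field.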
@@ -194,7 +255,7 @@ def unstack(self):
         only method if the DocumentArray is already in stacked mode. (Calling it while
         being in unstacked mode will have no effect)
 
-        Unstack will unstack all the columns of the DocumentArray and put the data back
+        Calling unstack will unstack all the columns of the DocumentArray and put the data back
         in each Document of the DocumentArray.
 
         In unstacked mode DocumentArray behaves like a normal python list of Documents.
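
As a rough illustration of the stack/unstack behavior these docstrings describe, here is a minimal pure-Python sketch. All names (`ToyDoc`, `ToyStackableArray`, `column`) are hypothetical, a list of rows stands in for the batched tensor, and the `RuntimeError` raised by `append` is an assumption; none of this is taken from the docarray code base.

```python
from typing import Iterable, List, Optional


class ToyDoc:
    """Hypothetical document that owns its vector or views a row of a shared column."""

    def __init__(self, vector: List[float]):
        self._own: Optional[List[float]] = vector
        self._column: Optional[List[List[float]]] = None  # set while stacked
        self._row: int = -1

    @property
    def vector(self) -> List[float]:
        # While stacked, read through to the shared column instead of owned data.
        if self._column is not None:
            return self._column[self._row]
        return self._own


class ToyStackableArray:
    """Hypothetical container that switches between row mode and stacked mode."""

    def __init__(self, docs: Iterable[ToyDoc]):
        self.docs: List[ToyDoc] = list(docs)
        self.column: Optional[List[List[float]]] = None  # None means unstacked

    def is_stacked(self) -> bool:
        return self.column is not None

    def stack(self) -> None:
        if self.is_stacked():
            return
        # Move every vector into one shared column; documents keep only a reference.
        self.column = [doc._own for doc in self.docs]
        for row, doc in enumerate(self.docs):
            doc._column, doc._row, doc._own = self.column, row, None

    def unstack(self) -> None:
        if not self.is_stacked():
            return  # no effect when already unstacked, as the docstring states
        # Put the data back in each document.
        for doc in self.docs:
            doc._own = doc._column[doc._row]
            doc._column, doc._row = None, -1
        self.column = None

    def append(self, doc: ToyDoc) -> None:
        if self.is_stacked():
            raise RuntimeError("append is not allowed in stacked mode; unstack first")
        self.docs.append(doc)


arr = ToyStackableArray(ToyDoc([float(i), float(i)]) for i in range(3))
arr.stack()
arr.column[1] = [9.0, 9.0]  # a batch-level write is visible through the document
seen_while_stacked = arr.docs[1].vector
arr.unstack()
arr.append(ToyDoc([3.0, 3.0]))  # allowed again once unstacked
```

The point of the design is that a batch-level write lands in one place while each document still reads its own row, and `unstack` hands ownership of that row back to the document.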
