@@ -32,9 +32,70 @@ class DocumentArray(
3232 BaseNode,
3333):
3434 """
35- a DocumentArray is a list-like container of Document of the same schema
35+ A DocumentArray is a container of Documents.
3636
3737 :param docs: iterable of Document
38+
39+
40+ A DocumentArray can only contain Documents that follow the same schema. To specify
41+ this schema, use the `DocumentArray[Document]` syntax. This creates a
42+ DocumentArray that can only contain Documents of the type Document. (Note that there
43+ is a special schema (AnyDocument) that allows any Document to be stored in the
44+ DocumentArray, but it is not recommended and exists mostly for
45+ serialization with protobuf, when the receiver does not know the schema of the
46+ DocumentArray but still wants to deserialize it.)
47+
48+ EXAMPLE USAGE
49+ .. code-block:: python
50+     from typing import Optional
51+     from docarray import Document, DocumentArray
52+     from docarray.typing import NdArray, ImageUrl
53+
54+     class Image(Document):
55+         tensor: Optional[NdArray[100]]
56+         url: ImageUrl
57+
58+     da = DocumentArray[Image](Image(url='http://url.com') for _ in range(10))
59+
60+ DocumentArray defines a getter and a setter for each field of the Document schema. These
61+ getters and setters are defined dynamically at runtime, which lets you access the
62+ fields of the Documents in a natural way. For example, if you have a DocumentArray of
63+ Images, you can use `da.tensor` to get the tensors of all the Images in the
64+ DocumentArray. You can also use `da.tensor = np.random.random([10, 100])` to set the
65+ tensors of all the Images.
66+
67+
68+ A DocumentArray can be in one of two modes: row (unstacked) and stacked.
69+
70+
71+ By default, it is in row mode (or unstacked mode), where a DocumentArray is almost just a
72+ list of Documents. You can append Documents to it, iterate over it, etc. Each Document
73+ owns its data. You can see it as a row-based data structure. In this mode,
74+ the getter shown above returns a list of the field values of each Document
75+ (or a DocumentArray if the field is a nested Document). This list/DocumentArray is
76+ created on the fly. The setter sets the field of each Document to the corresponding value
77+ of the list/DocumentArray/Tensor passed as a parameter.
78+
79+ Nevertheless, this list-like behavior is not always optimal, especially when you want
80+ to process your data in batches and perform operations that involve matrix computation,
81+ as in deep learning or when computing the cosine similarity of embeddings. In this case
82+ you want to stack all the tensors into a single batch. This is where the stacked mode of
83+ the DocumentArray comes in handy.
84+ In stacked mode, each field that is a Tensor is stored as a column: a single tensor whose
85+ batch dimension is the size of the DocumentArray. This lets you operate on the whole batch
86+ instead of iterating over the DocumentArray.
87+ In this mode the Documents inside the DocumentArray don't own the data anymore but just
88+ hold a reference to the data in the tensor. For the user there is no difference: it
89+ looks and feels the same.
90+ In stacked mode the getter returns the stacked tensor of the given field, and the setter
91+ replaces it.
92+ Finally, in stacked mode, operations like `da.append` are not allowed anymore because
93+ they are too slow and not recommended. Use the unstacked mode
94+ for them instead.
95+
96+ To switch between the two modes, call `da.stack()` to enter stacked mode and
97+ `da.unstack()` to return to unstacked mode. There are also two context managers for these modes:
98+ `with da.stack_mode():` and `with da.unstack_mode():`
3899 """
39100
40101 document_type: Type[BaseDocument] = AnyDocument
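
The per-field getters and setters described in the docstring above can be illustrated with a small standalone sketch. This is not the code from this PR and does not use docarray: `Image` here is a plain dataclass standing in for a Document schema, and `ToyDocArray` is a hypothetical list subclass that resolves field names at runtime through `__getattr__` and `__setattr__`. It only mirrors the row-mode behavior, where the getter builds a list on the fly and the setter writes one value into each document.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Image:
    """Hypothetical stand-in for a Document schema (not docarray's Document)."""

    url: str
    tensor: Optional[List[float]] = None


class ToyDocArray(list):
    """Toy list-like container with per-field access; an illustrative assumption."""

    def __init__(self, docs, doc_type):
        super().__init__(docs)
        # Bypass our own __setattr__ while storing the schema's field names.
        object.__setattr__(self, "_field_names", {f.name for f in fields(doc_type)})

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in self.__dict__.get("_field_names", ()):
            # Row mode: build the list of field values on the fly.
            return [getattr(doc, name) for doc in self]
        raise AttributeError(name)

    def __setattr__(self, name, values):
        if name in self._field_names:
            values = list(values)
            if len(values) != len(self):
                raise ValueError("expected exactly one value per document")
            # Row mode: each document receives (and owns) its own value.
            for doc, value in zip(self, values):
                setattr(doc, name, value)
        else:
            object.__setattr__(self, name, values)


da = ToyDocArray([Image(url=f"http://url.com/{i}") for i in range(3)], Image)
urls = da.url                          # getter: one entry per document
da.tensor = [[1.0], [2.0], [3.0]]      # setter: one value per document
```

Deriving the accessible names from the schema, rather than hard-coding them, is what lets the same container class serve any document type.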
@@ -194,7 +255,7 @@ def unstack(self):
194255 only method if the DocumentArray is already in stacked mode. (Calling it while
195256 being in unstacked mode will have no effect)
196257
197- Unstack will unstack all the columns of the DocumentArray and put the data back
258+ Calling unstack will unstack all the columns of the DocumentArray and put the data back
198259 in each Document of the DocumentArray.
199260
200261 In unstacked mode DocumentArray behaves like a normal python list of Documents.
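
To make the stacked/unstacked distinction concrete, here is a minimal NumPy sketch of the idea, written under stated assumptions rather than taken from this PR. `ToyDoc` and `ToyStackableArray` are hypothetical names; the sketch assumes every document holds a tensor of the same shape. Stacking copies the tensors into one batch array and rebinds each document's field to a view of its row, so documents no longer own their data; unstacking hands each document its own copy again.

```python
import numpy as np


class ToyDoc:
    """Hypothetical document with a single tensor field."""

    def __init__(self, tensor):
        self.tensor = tensor


class ToyStackableArray:
    """Toy container with a row mode and a stacked mode (not docarray's code)."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.column = None  # holds the batch tensor only while stacked

    def stack(self):
        if self.column is not None:
            return  # already stacked: no effect
        self.column = np.stack([d.tensor for d in self.docs])
        for i, d in enumerate(self.docs):
            d.tensor = self.column[i]  # a view into the batch, not a copy

    def unstack(self):
        if self.column is None:
            return  # already unstacked: no effect
        for d in self.docs:
            d.tensor = d.tensor.copy()  # each document owns its data again
        self.column = None

    def append(self, doc):
        if self.column is not None:
            raise RuntimeError("append is not allowed in stacked mode")
        self.docs.append(doc)


arr = ToyStackableArray(ToyDoc(np.full(4, float(i))) for i in range(3))
arr.stack()
arr.column *= 2                                   # one batched operation
doubled = arr.docs[2].tensor.tolist()             # the document sees the change
shares = np.shares_memory(arr.docs[0].tensor, arr.column)
arr.unstack()
arr.append(ToyDoc(np.zeros(4)))                   # allowed again in row mode
```

Appending while stacked would require reallocating the whole batch tensor, which is one plausible reason for disallowing it in that mode.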