fix(array): fix getting docs by two ids by davidbp · Pull Request #29 · docarray/docarray

davidbp · 2022-01-11T12:41:55Z

The DocumentArray method __getitem__ is incorrect in the case two ids are passed:

Minimal working example to showcase bug:

In [2]: from docarray import Document, DocumentArray
   ...: 
   ...: da = DocumentArray([Document(id='1'),
   ...:                     Document(id='2'),
   ...:                     Document(id='3')])

Currently selecting with two doc ids produces

In [4]: da['1','2']

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/05/h71x7gh54sx_5y43ppkq9_dw0000gq/T/ipykernel_8535/2066379862.py in <module>
----> 1 da['1','2']

~/Documents/jina_stuff/docarray/docarray/array/document.py in __getitem__(self, index)
    151                 if isinstance(_attrs, str):
    152                     _attrs = (index[1],)
--> 153                 return _docs._get_attributes(*_attrs)
    154             elif isinstance(index[0], bool):
    155                 return DocumentArray(itertools.compress(self._data, index))

~/Documents/jina_stuff/docarray/docarray/document/mixins/attribute.py in _get_attributes(self, *fields)
     20                 value = dunder_get(self, k)
     21             else:
---> 22                 value = getattr(self, k)
     23 
     24             ret.append(value)

AttributeError: 'Document' object has no attribute '2'

This PR solves this and produces

In [3]: da['1']
Out[3]: <Document ('id',) at 1>

In [4]: da['1','2']
Out[4]: <DocumentArray (length=2) at 140552956821312>

In [5]: da['1','2','3']
Out[5]: <DocumentArray (length=3) at 140552956819056>

codecov · 2022-01-11T12:48:47Z

Codecov Report

Merging #29 (0fd2c7c) into main (79ad236) will increase coverage by 0.18%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   83.09%   83.27%   +0.18%     
==========================================
  Files          67       67              
  Lines        3135     3151      +16     
==========================================
+ Hits         2605     2624      +19     
+ Misses        530      527       -3

Flag	Coverage Δ
docarray	`83.27% <100.00%> (+0.18%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
docarray/array/document.py	`85.77% <100.00%> (+1.52%)`	⬆️
docarray/proto/io/ndarray.py	`94.78% <0.00%> (+0.09%)`	⬆️
docarray/array/mixins/io/binary.py	`95.55% <0.00%> (+0.31%)`	⬆️
docarray/document/mixins/porting.py	`92.59% <0.00%> (+0.92%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79ad236...0fd2c7c. Read the comment docs.

hanxiao · 2022-01-11T13:00:56Z

docarray/array/document.py

                and isinstance(index[0], (slice, Sequence))
            ):
+                # edge case where 2 ids are passed
+                if isinstance(index[0], str) and index[0] in self._id2offset:


this logic itself is also problematic, it does not work with column selector https://docarray.jina.ai/fundamentals/documentarray/access-attributes/

you can try da['2', 'id'] under your branch it will break

I would suggest a complete readthrough on "access attributes" chapter before fixing edge cases like this

Now it should be fixed. There is still the possibility that an id has exactly the same str than an element_selector or a column_selector in which case I guess the notation could be ambiguous as Nan mentions below.

I think this is a 'very stretched edge case' but it's a possibility:

from docarray import Document, DocumentArray da = DocumentArray([Document(id='@r', text='a1'), Document(id='2', text='a2'), Document(id='3', text='a3')])

Then da['@r'] should be a single item or a DocumentArray with all root items?

We could assume id strings do not start with @, otherwise the notation in getitem/setitem seems ambiguous. To me this is a reasonable assumption but I guess we should make a Note in the docs?

nan-wang · 2022-01-12T06:14:22Z

I think da['1','2'] is confusing. It is not clear whether we want to select by id or to select attributes. Anyway, the part of selecting by ids does not work. I suggest remove this part for the time being.

MWE:

from docarray import Document, DocumentArray

da = DocumentArray([Document(id='1'),
                    Document(id='2'),
                    Document(id='3')])

da['1', '2']

leads to the error,

Traceback (most recent call last):
  File "/Users/nanwang/Codes/jina-ai/docarray/toys/toy2.py", line 7, in <module>
    print(da['1', '2'])
  File "/Users/nanwang/Codes/jina-ai/docarray/docarray/array/document.py", line 154, in __getitem__
    return _docs._get_attributes(*_attrs)
  File "/Users/nanwang/Codes/jina-ai/docarray/docarray/document/mixins/attribute.py", line 22, in _get_attributes
    value = getattr(self, k)
AttributeError: 'Document' object has no attribute '2'

nan-wang · 2022-01-12T06:40:13Z

One principle we can stick to is to use the first arg to select ids\index\slices and the second arg to select the attributes. If we follow this rule, the grammar for selecting by integers should be refactored to

- da[1, 2, 3]
+ da[[1, 2, 3]]

This pattern is similar as the pandas DataFrame.loc. The id in DocArray is equivalent to index in pandas.DataFrame. Here is an example

import pandas as pd

df = pd.DataFrame([['a', 1, 'xx'], ['b', 2, 'xx'], ['c', 3, 'xx']], columns=['name', 'age', 'place'])
df.set_index('name', inplace=True)

# select by one id
print(df.loc['a'])

"""output
age       1
place    xx
Name: a, dtype: object
      age place
"""

# select by multiple ids
print(df.loc[['a', 'b']])

"""output
name           
a       1    xx
b       2    xx
1
"""

# select by one id & one attribute
print(df.loc['a', 'age'])

"""output
1
"""

# select by multiple ids & one attribute
print(df.loc[['a', 'b'], 'age'])

"""output
name
a    1
b    2
Name: age, dtype: int64
"""

# select by multiple ids & multiple attribute
print(df.loc[['a', 'b'], ['age', 'place']])

"""output
      age place
name           
a       1    xx
b       2    xx
"""

davidbp · 2022-01-12T07:06:32Z

This error that you point out @nan-wang is precisely what I explain in the example on top of the PR and should be solved by the PR.

fix(array): fix getting docs by two ids

7347eda

github-actions bot added size/s area/core area/testing component/array labels Jan 11, 2022

davidbp added 2 commits January 11, 2022 13:43

fix: remove bad assertion

b8809d5

fix: black formatting

fcf3a6d

github-actions bot added size/m and removed size/s labels Jan 11, 2022

refactor: black single quotes

049bff4

github-actions bot added size/s and removed size/m labels Jan 11, 2022

davidbp added 2 commits January 11, 2022 13:51

refactor: black single quotes

b23ede1

refactor: follow black standard

1516776

hanxiao reviewed Jan 11, 2022

View reviewed changes

fix: take into account all strs are ids when selecting item

5d22c66

davidbp added 2 commits January 12, 2022 08:18

test: cover sequence of id, attribute case

b559d50

fix: test len

0fd2c7c

hanxiao mentioned this pull request Jan 12, 2022

fix: fix edge case when two elements are strings #37

Merged

hanxiao closed this in #37 Jan 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(array): fix getting docs by two ids#29

fix(array): fix getting docs by two ids#29
davidbp wants to merge 9 commits intomainfrom
fix-docarray-by-ids

davidbp commented Jan 11, 2022 •

edited

Loading

Uh oh!

codecov bot commented Jan 11, 2022 •

edited

Loading

Uh oh!

hanxiao Jan 11, 2022

Uh oh!

hanxiao Jan 11, 2022

Uh oh!

hanxiao Jan 11, 2022 •

edited

Loading

Uh oh!

davidbp Jan 12, 2022 •

edited

Loading

Uh oh!

nan-wang commented Jan 12, 2022

Uh oh!

nan-wang commented Jan 12, 2022

Uh oh!

davidbp commented Jan 12, 2022 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

davidbp commented Jan 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jan 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

hanxiao Jan 11, 2022

Choose a reason for hiding this comment

Uh oh!

hanxiao Jan 11, 2022

Choose a reason for hiding this comment

Uh oh!

hanxiao Jan 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidbp Jan 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nan-wang commented Jan 12, 2022

Uh oh!

nan-wang commented Jan 12, 2022

Uh oh!

davidbp commented Jan 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davidbp commented Jan 11, 2022 •

edited

Loading

codecov bot commented Jan 11, 2022 •

edited

Loading

hanxiao Jan 11, 2022 •

edited

Loading

davidbp Jan 12, 2022 •

edited

Loading

davidbp commented Jan 12, 2022 •

edited

Loading