Skip to content

fix(array): fix getting docs by two ids#29

Closed
davidbp wants to merge 9 commits intomainfrom
fix-docarray-by-ids
Closed

fix(array): fix getting docs by two ids#29
davidbp wants to merge 9 commits intomainfrom
fix-docarray-by-ids

Conversation

@davidbp
Copy link
Copy Markdown
Contributor

@davidbp davidbp commented Jan 11, 2022

The DocumentArray method __getitem__ is incorrect in the case two ids are passed:

Minimal working example to showcase bug:

In [2]: from docarray import Document, DocumentArray
   ...: 
   ...: da = DocumentArray([Document(id='1'),
   ...:                     Document(id='2'),
   ...:                     Document(id='3')])

Currently selecting with two doc ids produces

In [4]: da['1','2']

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/05/h71x7gh54sx_5y43ppkq9_dw0000gq/T/ipykernel_8535/2066379862.py in <module>
----> 1 da['1','2']

~/Documents/jina_stuff/docarray/docarray/array/document.py in __getitem__(self, index)
    151                 if isinstance(_attrs, str):
    152                     _attrs = (index[1],)
--> 153                 return _docs._get_attributes(*_attrs)
    154             elif isinstance(index[0], bool):
    155                 return DocumentArray(itertools.compress(self._data, index))

~/Documents/jina_stuff/docarray/docarray/document/mixins/attribute.py in _get_attributes(self, *fields)
     20                 value = dunder_get(self, k)
     21             else:
---> 22                 value = getattr(self, k)
     23 
     24             ret.append(value)

AttributeError: 'Document' object has no attribute '2'

This PR solves this and produces

In [3]: da['1']
Out[3]: <Document ('id',) at 1>

In [4]: da['1','2']
Out[4]: <DocumentArray (length=2) at 140552956821312>

In [5]: da['1','2','3']
Out[5]: <DocumentArray (length=3) at 140552956819056>

@github-actions github-actions bot added size/m and removed size/s labels Jan 11, 2022
@codecov
Copy link
Copy Markdown

codecov bot commented Jan 11, 2022

Codecov Report

Merging #29 (0fd2c7c) into main (79ad236) will increase coverage by 0.18%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   83.09%   83.27%   +0.18%     
==========================================
  Files          67       67              
  Lines        3135     3151      +16     
==========================================
+ Hits         2605     2624      +19     
+ Misses        530      527       -3     
Flag Coverage Δ
docarray 83.27% <100.00%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/array/document.py 85.77% <100.00%> (+1.52%) ⬆️
docarray/proto/io/ndarray.py 94.78% <0.00%> (+0.09%) ⬆️
docarray/array/mixins/io/binary.py 95.55% <0.00%> (+0.31%) ⬆️
docarray/document/mixins/porting.py 92.59% <0.00%> (+0.92%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79ad236...0fd2c7c. Read the comment docs.

@github-actions github-actions bot added size/s and removed size/m labels Jan 11, 2022
and isinstance(index[0], (slice, Sequence))
):
# edge case where 2 ids are passed
if isinstance(index[0], str) and index[0] in self._id2offset:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic itself is also problematic, it does not work with column selector https://docarray.jina.ai/fundamentals/documentarray/access-attributes/

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can try da['2', 'id'] under your branch it will break

Copy link
Copy Markdown
Member

@hanxiao hanxiao Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a complete readthrough on "access attributes" chapter before fixing edge cases like this

Copy link
Copy Markdown
Contributor Author

@davidbp davidbp Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it should be fixed. There is still the possibility that an id has exactly the same str than an element_selector or a column_selector in which case I guess the notation could be ambiguous as Nan mentions below.

I think this is a 'very stretched edge case' but it's a possibility:

from docarray import Document, DocumentArray

da = DocumentArray([Document(id='@r', text='a1'),
                    Document(id='2', text='a2'),
                    Document(id='3', text='a3')])

Then da['@r'] should be a single item or a DocumentArray with all root items?

We could assume id strings do not start with @, otherwise the notation in getitem/setitem seems ambiguous. To me this is a reasonable assumption but I guess we should make a Note in the docs?

@nan-wang
Copy link
Copy Markdown
Collaborator

I think da['1','2'] is confusing. It is not clear whether we want to select by id or to select attributes. Anyway, the part of selecting by ids does not work. I suggest remove this part for the time being.

image

MWE:

from docarray import Document, DocumentArray

da = DocumentArray([Document(id='1'),
                    Document(id='2'),
                    Document(id='3')])

da['1', '2']

leads to the error,

Traceback (most recent call last):
  File "/Users/nanwang/Codes/jina-ai/docarray/toys/toy2.py", line 7, in <module>
    print(da['1', '2'])
  File "/Users/nanwang/Codes/jina-ai/docarray/docarray/array/document.py", line 154, in __getitem__
    return _docs._get_attributes(*_attrs)
  File "/Users/nanwang/Codes/jina-ai/docarray/docarray/document/mixins/attribute.py", line 22, in _get_attributes
    value = getattr(self, k)
AttributeError: 'Document' object has no attribute '2'

@nan-wang
Copy link
Copy Markdown
Collaborator

One principle we can stick to is to use the first arg to select ids\index\slices and the second arg to select the attributes. If we follow this rule, the grammar for selecting by integers should be refactored to

- da[1, 2, 3]
+ da[[1, 2, 3]]

This pattern is similar as the pandas DataFrame.loc. The id in DocArray is equivalent to index in pandas.DataFrame. Here is an example

import pandas as pd

df = pd.DataFrame([['a', 1, 'xx'], ['b', 2, 'xx'], ['c', 3, 'xx']], columns=['name', 'age', 'place'])
df.set_index('name', inplace=True)

# select by one id
print(df.loc['a'])

"""output
age       1
place    xx
Name: a, dtype: object
      age place
"""

# select by multiple ids
print(df.loc[['a', 'b']])

"""output
name           
a       1    xx
b       2    xx
1
"""

# select by one id & one attribute
print(df.loc['a', 'age'])

"""output
1
"""

# select by multiple ids & one attribute
print(df.loc[['a', 'b'], 'age'])

"""output
name
a    1
b    2
Name: age, dtype: int64
"""

# select by multiple ids & multiple attribute
print(df.loc[['a', 'b'], ['age', 'place']])

"""output
      age place
name           
a       1    xx
b       2    xx
"""

@davidbp
Copy link
Copy Markdown
Contributor Author

davidbp commented Jan 12, 2022

This error that you point out @nan-wang is precisely what I explain in the example on top of the PR and should be solved by the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants