HnswDocumentIndex treats document IDs as string, they can be str, int, ID #1850

@oytuntez

Initial Checks

  • I have read and followed the docs and still think this is a bug

Description

I noticed this behavior when I wanted to access multiple documents in the index:

@requests(on='/find')
def find(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
    return self._cache_di[docs.id]

And when I issue POST /find with body {"data":[{"id":"300055"}]}, this code yields:

  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in _get_docs_sqlite_doc_id
    hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in <genexpr>
    hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 445, in _to_hashed_id
    return int(hashlib.sha256(doc_id.encode('utf-8')).hexdigest… 16) % 10**18
AttributeError: 'int' object has no attribute 'encode'

Upon investigation, I saw that most of HnswDocumentIndex treats IDs as str. However, it is my understanding that IDs can be int, see this type definition:

class ID(str, AbstractType):
    """
    Represent an unique ID
    """

    @classmethod
    def _docarray_validate(
        cls: Type[T],
        value: Union[str, int, UUID],
...

I think ID values should be cast to str wherever the code relies on str methods. _to_hashed_id is one such place, since it calls .encode() on the ID.
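
A minimal sketch of that cast, written as a standalone function rather than a patch to docarray. The name to_hashed_id is hypothetical, and the `.hexdigest(), 16` completion of the truncated traceback line is an assumption, not confirmed library source:

```python
import hashlib
from typing import Union
from uuid import UUID


def to_hashed_id(doc_id: Union[str, int, UUID]) -> int:
    """Hypothetical stand-in for _to_hashed_id that tolerates non-str IDs."""
    # Cast first: int and UUID values have no .encode() method, str does.
    doc_id_str = str(doc_id)
    # Assumed completion of the expression truncated in the traceback.
    digest = hashlib.sha256(doc_id_str.encode('utf-8')).hexdigest()
    return int(digest, 16) % 10**18


# With the cast, an int ID and its string form hash to the same value.
print(to_hashed_id(300055) == to_hashed_id("300055"))
```

Casting inside the hashing helper keeps the change local, although normalizing IDs once at the index boundary might be the cleaner fix if other methods make the same str assumption.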

Example Code

No response

Python, DocArray & OS Version

Python 3.8.12
docarray==0.40.0

Affected Components
