Skip to content

feat(index): add epsilla connector#1835

Merged
JoanFM merged 7 commits into
docarray:mainfrom
tonyyanga:tonyyang-add-epsilla-db-connector
Dec 19, 2023
Merged

feat(index): add epsilla connector#1835
JoanFM merged 7 commits into
docarray:mainfrom
tonyyanga:tonyyang-add-epsilla-db-connector

Conversation

@tonyyanga

Copy link
Copy Markdown
Contributor

This PR adds the EpsillaDocumentIndex. It allows docarray users to use Epsilla vector database for queries.

It is the 1st PR for issue #1832. I intend to make a follow-up PR to add documentation for the connector, but want to limit the size of the PR to make it easier to review.

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga tonyyanga force-pushed the tonyyang-add-epsilla-db-connector branch from de419f5 to 91e9351 Compare December 4, 2023 19:06
@codecov

codecov Bot commented Dec 4, 2023

Copy link
Copy Markdown

Codecov Report

Attention: 30 lines in your changes are missing coverage. Please review.

Comparison is base (522811f) 85.05% compared to head (03c2b9c) 85.07%.
Report is 2 commits behind head on main.

Files Patch % Lines
docarray/index/backends/epsilla.py 87.44% 27 Missing ⚠️
docarray/index/__init__.py 50.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1835      +/-   ##
==========================================
+ Coverage   85.05%   85.07%   +0.02%     
==========================================
  Files         135      136       +1     
  Lines        9031     9254     +223     
==========================================
+ Hits         7681     7873     +192     
- Misses       1350     1381      +31     
Flag Coverage Δ
docarray 85.07% <86.42%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JoanFM JoanFM left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice contribution, I have some comments about the PR.

Comment thread docarray/index/backends/epsilla.py Outdated
Comment thread docarray/index/backends/epsilla.py Outdated
Comment thread docarray/index/backends/epsilla.py Outdated
Comment thread docarray/index/backends/epsilla.py Outdated
Comment thread docarray/index/backends/epsilla.py Outdated
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga

Copy link
Copy Markdown
Contributor Author

@JoanFM Thanks for the review!

I have updated the input validation logic to report specific parameters missing.

It is unfortunate that Epsilla's python library does not include pythonic handling of error (throw instead of status codes). It directly returns HTTP status codes. For now I replaced it with from http import HTTPStatus.

When there is a conflict on loading the db, Epsilla returns HTTP 500 when it should return HTTP 409. I have filed an issue, but parsing the string error message is the best workaround for now.

What do you think? Please let me know if there are other issues we should address.

Comment thread docarray/index/backends/epsilla.py Outdated
or "already exists" in response["message"]
if status_code != HTTPStatus.OK:
# Epsilla returns HTTP 500 when multiple clients connect to the same db
# Bug filed at https://github.com/epsilla-cloud/vectordb/issues/93

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the quick is fix on Epsilla side, would it be possible to wait for a patch on epsille side?

)

def num_docs(self) -> int:
raise NotImplementedError

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there a way to know the number of docs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no public API for counting the number of docs. One workaround is to run filter query without any condition, but I worry that is quite inefficient. Would you prefer to implement num_docs with that rather than throwing here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no API, okey. But I will double check if some functionality actually depends on this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked. num_docs impacts these two methods:

  • __len__
  • _is_index_empty

I pushed a _filter based implementation for _is_index_empty. Confirmed there is nothing that depends on num_docs is limited (nothing in abstract.py depends on this). But there is no easy way to check all the len() locations.

$ grep -rn -l --exclude-dir=./tests "num_docs()"
./docarray/index/abstract.py
./docarray/index/backends/milvus.py
./docarray/index/backends/redis.py
./docarray/index/backends/hnswlib.py
./docs/user_guide/storing/index_hnswlib.md
./docs/user_guide/storing/index_elastic.md
./docs/user_guide/storing/index_in_memory.md
./docs/user_guide/storing/index_qdrant.md
./docs/user_guide/storing/index_redis.md

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and _is_index_empty, I think it was checked in many queries. I am not sure if the result is kind of cached. I have little access to check properly now the code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For _is_index_empty, it should work because I have overriden the default implementation. See https://github.com/docarray/docarray/pull/1835/files#diff-2584da09c45f6eeaca7247dcdfd1a1f1cb1da5752f9a00289cb3f9a52e228969R361-R368

@JoanFM

JoanFM commented Dec 5, 2023

Copy link
Copy Markdown
Member

@JoanFM Thanks for the review!

I have updated the input validation logic to report specific parameters missing.

It is unfortunate that Epsilla's python library does not include pythonic handling of error (throw instead of status codes). It directly returns HTTP status codes. For now I replaced it with from http import HTTPStatus.

When there is a conflict on loading the db, Epsilla returns HTTP 500 when it should return HTTP 409. I have filed an issue, but parsing the string error message is the best workaround for now.

What do you think? Please let me know if there are other issues we should address.

Do you plan to have the fix for that behavior anytime soon on epsilla side?

@tonyyanga

Copy link
Copy Markdown
Contributor Author

Epsilla is going to merge a quick fix the load_db behavior: epsilla-cloud/vectordb#93 Will wait for them.

fix input validation logic

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga tonyyanga force-pushed the tonyyang-add-epsilla-db-connector branch from 744d7ee to c6deb79 Compare December 5, 2023 20:00
@tonyyanga

Copy link
Copy Markdown
Contributor Author

Epsilla fixed on their end. I've pushed a change to use HTTP 409 status codes rather than relying on error messages.

Love to know what you think about the num_docs implementation. :-)

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga tonyyanga force-pushed the tonyyang-add-epsilla-db-connector branch from a3923d0 to 1d3ff6d Compare December 6, 2023 03:19
@tonyyanga

Copy link
Copy Markdown
Contributor Author

@JoanFM Any other issues to address before we can merge this change?

@JoanFM

JoanFM commented Dec 9, 2023

Copy link
Copy Markdown
Member

I will review the implications of num_docs in detail on Monday.

@JoanFM

JoanFM commented Dec 11, 2023

Copy link
Copy Markdown
Member

Is it possible for you to have an API that decides whether the table is empty or not? Otherwise, u can try to override _is_index_empty to say always False. And make sure that all the methods work when the table is empty returning empty results properly.

Signed-off-by: Tony Yang <tonyyanga@gmail.com>
Signed-off-by: Tony Yang <tonyyanga@gmail.com>
@tonyyanga

Copy link
Copy Markdown
Contributor Author

@JoanFM

Unfortunately, there is no API for that. I have pushed a change to override _is_index_empty to return False, and added test cases with the index being empty.

Please let me know if there are other things to address.

@tonyyanga

Copy link
Copy Markdown
Contributor Author

Hey @JoanFM

Gently bumping this - please let me know if there are other things to address.

@JoanFM JoanFM merged commit ff00b60 into docarray:main Dec 19, 2023
@JoanFM

JoanFM commented Dec 19, 2023

Copy link
Copy Markdown
Member

Hello @tonyyanga ,

Thanks for the contribution, we will be waiting for the documentation PR

@tonyyanga

Copy link
Copy Markdown
Contributor Author

Thank you! will follow up with it, hopefully later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants