# Test with Jina/Weaviate
## Project Scope:
We would like to store files as vectors, so that we can perform A.I. tasks against the data and return
human-readable responses. For this first test we will only use a transcript. Once we have perfected that,
we will incorporate other modalities like audio, images, video, etc.
## Project Definitions:
Search: Refers to a search against the data by a user-defined term. This function is similar to
performing a Google search against the user's data.
Match: Refers to a search from one entire document to another entire document. The user selects
the document they want to find matches for, and the result is other similar documents.
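Put differently: a Search compares an encoded query term against every stored vector, while a Match uses a stored document's own vector as the query. A minimal, self-contained sketch of the two query types (toy 3-d vectors and illustrative names only; the real vectors come from the sentence encoder):

```python
# Sketch of the two query types over toy 3-d vectors. All names here are
# illustrative, not part of the actual system.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# pretend store: one vector per document
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

def search(query_vec, k=2):
    """Search: rank stored documents against an encoded user term."""
    ranked = sorted(store, key=lambda d: cosine(store[d], query_vec), reverse=True)
    return ranked[:k]

def match(doc_id, k=1):
    """Match: rank the other documents against one stored document."""
    query_vec = store[doc_id]
    others = (d for d in store if d != doc_id)
    return sorted(others, key=lambda d: cosine(store[d], query_vec), reverse=True)[:k]

print(search([1.0, 0.05, 0.0]))  # doc_a and doc_b rank highest
print(match("doc_a"))            # doc_b is the closest other document
```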
## Testing setup:
Weaviate is running in a Docker container, and on the same server I have Jina installed on Ubuntu.
My main app file looks like:
```python
from jina import Flow, Document, DocumentArray, Client
import sys  # used to see if there are command-line args

'''
Workflows
Ingest: encode Document, then store in DB
Term Search: encode search term, then look it up in DB
Match: look up Document in DB by ID, then look up similar documents in DB

The encoder works and is included in the yml file.
The encoder will only vectorize items in the data object labeled "text" and then
put them in the embedding object of the document. Since the model being used is a
sentence model, each sentence should be broken into a separate entry, like:
"data": [
    {"text": "you think about it, It's a little strange."},
    {"text": "We still call these things phones."},
    {"text": "They still make calls, but really, it's become less about hello and more about."},
    {"text": "But that's actually changing to increasingly, when you take out your phone, it's not to make a call, not to type your input."},
    {"text": "Your starting point is through the lens."},
    {"text": "Right now."},
    ...
]
'''

# This ties together all the executors from the flow.yml file
f = Flow.load_config('flow.yml')
d = Document()

# for passing a test using the command line
if len(sys.argv) > 1:
    if sys.argv[1] == "svg":
        f.plot('flow.svg')  # creates a flow diagram in the root directory

with f:
    if bool(d):
        docs = f.post(on='/', inputs=d, on_done=lambda resp: resp)  # starts the execution
    f.block()  # keeps the script running for the API calls
    # Nothing can run below this point, since block() halts further execution until the script stops.
print('\nGoodbye, your Jina API is no longer running.\n')
```

With the configuration file:
```yaml
jtype: Flow
with:
  protocol: 'http'
  port: 1234
  expose_endpoints:
    /match:
      summary: match endpoint
executors:
  - name: encoder
    uses: TransformerTorchEncoder
    volumes: '.cache/huggingface:.cache'
    uses_with:
      pretrained_model_name_or_path: 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
    uses_metas:
      description: 'A pre-trained sentence vector package.'
    py_modules: 'exec/executor-text-transformers-torch-encoder-main/transform_encoder.py'
  - name: weaviate
    uses: WeaviateExecutor
    uses_with:
      weaviate_port: '8080'
      weaviate_host: 'localhost'
    uses_metas:
      description: 'Controls data flow to DB'
    py_modules: 'exec/weaviate/weaviate-executor.py'
```

And here is the DB Executor:
```python
from jina import Executor, requests, Document, DocumentArray
from typing import Dict, Iterable, Optional, Tuple


class WeaviateExecutor(Executor):
    def __init__(
        self,
        weaviate_port,
        weaviate_host,
        traversal_paths: str = '@r,c,m',
        **kwargs
    ):
        '''
        Initiate the DB connection.
        :param weaviate_port: passed through from the flow.yml config file
        :param weaviate_host: passed through from the flow.yml config file
        :param traversal_paths: intended to traverse document depths, but never worked
        :param kwargs: reserved for Jina to inject metas and requests
            (representing the request-to-function mapping) when the Executor is used
            inside a Flow. You can access these values in the __init__ body via
            self.metas/self.requests/self.runtime_args, or modify them before passing
            them to super().__init__().
        '''
        super().__init__(**kwargs)
        # set DB connection config
        self.host = weaviate_host
        self.port = weaviate_port
        self.db = []
        self.connect_db("Document")

    def connect_db(self, class_name=''):
        if class_name == '':
            self.db = DocumentArray(
                storage="weaviate", config={"host": self.host, "port": self.port}
            )
        else:
            self.db = DocumentArray(
                storage="weaviate",
                config={"host": self.host, "port": self.port, "name": class_name}
            )

    @requests(on='/index')
    def index(self, docs: DocumentArray, parameters: Dict = {}, **kwargs):
        '''
        Take a document and store it in the DB.
        The Document contains a transcript broken into sentences, plus timestamps for
        each word. We break it into sentences because we are using a pre-trained model
        that is made for sentences. The vectors are created before we get to this point.
        :param parameters: parameters sent by the API that apply to the whole document
        :param docs: the sentences of the document, already vectorized
        :param kwargs: configuration settings for Jina
        '''
        '''
        At this point docs looks like:
        [{
            "id": "1efd06c05078e7b91f9b6c7454cba48d",
            "parent_id": null,
            "granularity": null,
            "adjacency": null,
            "blob": null,
            "tensor": null,
            "mime_type": null,
            "text": "Yeah,dana bash,appreciate it.",
            "weight": null,
            "uri": null,
            "tags": {
                "end_time": 420940.0,
                "start_time": 419240.0,
                "start_word_index": 1203.0,
                "end_word_index": 1207.0,
                "asset_id": "_zPr-H-J6AE"
            },
            "offset": null,
            "location": null,
            "embedding": [-0.1570819765329361, 0.08105587214231491, 0.025140587240457535, ...],
            "modality": null,
            "evaluations": null,
            "scores": null,
            "chunks": null,
            "matches": null
        }, ...]
        '''
        # asset_id is required, so check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")
        # @TODO: query the DB by asset_id to see if the doc already exists before adding it
        # now store the document
        self.db.extend(DocumentArray(docs))  # doesn't seem to store the tags that were sent

    @requests(on='/search')
    def search(self, docs: DocumentArray, parameters: Dict = {}, **kwargs):
        '''
        A search-term lookup: a user provides a string and we return the vector
        comparison from the DB. We still need to figure out how to add sorting,
        filters and limits.
        :param parameters: parameters sent by the API that apply to the whole document
        :param docs: the search term, already vectorized
        :param kwargs: configuration settings for Jina
        '''
        '''
        The API request looks like:
        {
            "data": [
                {"text": "where are my car keys?"}
            ],
            "parameters": {
                "limit": 10
            }
        }
        '''
        # @TODO: figure out a way to apply filters specified by the user
        # send the search request, which has been vectorized already
        results = self.db.find(
            docs,
            limit=parameters["limit"]
        )
        return results

    '''
    The response does not include the tags from ingestion:
    {
        "id": "00001cf94cd5d87713844a98d62bbc5c",
        "parent_id": null,
        "granularity": null,
        "adjacency": null,
        "blob": null,
        "tensor": null,
        "mime_type": null,
        "text": "So and so this car.",
        "weight": null,
        "uri": null,
        "tags": {
            "wid": "3aa9fde4-3114-5922-8b74-8959f72c48b2"  # ????? Where did the ingested tags go???
        },
        "offset": null,
        "location": null,
        "embedding": [0.32719624042510986, 0.3085232079029083, -0.06597057729959488, ...],
        "modality": null,
        "evaluations": null,
        "scores": {
            "cosine_similarity": {
                "value": 0.5244436264038086,
                "op_name": null,
                "description": null,
                "ref_id": null
            },
            "weaviate_certainty": {
                "value": 0.7622218132019043,
                "op_name": null,
                "description": null,
                "ref_id": null
            }
        },
        "chunks": null,
        "matches": null
    }
    '''

    @requests(on='/match')
    def match(self, parameters: Dict = {}, **kwargs):
        '''
        A Document Match lookup: a user provides a document ID and we return the
        vector similarity to other documents in the DB. We still need to figure out
        how to add sorting, filters and limits.
        Not done yet...
        :param parameters: parameters sent by the API that apply to the whole document
        :param kwargs: configuration settings for Jina
        '''
        # asset_id is required, so check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")
        print(parameters)  # shows the parameters sent with the request

    @requests(on='/delete')
    def delete(self, parameters: Dict = {}, **kwargs):
        '''
        Take a document ID and remove the document from the DB.
        Not done yet...
        :param parameters: parameters sent by the API that apply to the whole document
        :param kwargs: configuration settings for Jina
        '''
        # asset_id is required, so check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")
        print(parameters)  # shows the parameters sent with the request

    @requests(on='/update')
    def update(self, docs: DocumentArray, parameters: Dict = {}, **kwargs):
        '''
        Take a document ID and update the document in the DB. This may require some
        other parameters so we know what is being updated.
        Not done yet...
        :param parameters: parameters sent by the API that apply to the whole document
        :param docs: the sentences of the document that will be vectorized
        :param kwargs: configuration settings for Jina
        '''
        # asset_id is required, so check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")
        print(parameters)  # shows the parameters sent with the request
```

## Current Understanding:
The concept of DocArray is simple and straightforward. We chose Weaviate as the storage backend because
its HNSW indexing is a good fit for what we need. So I will jump into what I think the
document should look like for what we are doing.
We are using a sentence-trained Transformer model (sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2),
which is why it makes sense for us to break the transcript into sentences.
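Since the encoder only vectorizes `{"text": ...}` items, the transcript has to be pre-split into sentences before ingestion. A hypothetical sketch of producing that payload shape (a real pipeline would use a proper sentence segmenter; this naive regex split is only a stand-in):

```python
# Hypothetical helper: break a transcript into the per-sentence payload
# shape the encoder expects ({"text": ...} items). The regex split is a
# naive stand-in for a real sentence segmenter.
import re

def to_sentence_payload(transcript: str):
    # split on ., ! or ? followed by whitespace; the punctuation is kept
    sentences = re.split(r'(?<=[.!?])\s+', transcript.strip())
    return [{"text": s} for s in sentences if s]

payload = to_sentence_payload(
    "We still call these things phones. They still make calls. Right now."
)
print(payload)
# → [{'text': 'We still call these things phones.'},
#    {'text': 'They still make calls.'},
#    {'text': 'Right now.'}]
```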
That makes the document look like:
```
<Document (tags:[type:"video", asset_id:"unique file ID", name:"file name", etc:"other file related data"])>
└─ chunks
   └─ <Document (text:"This is the first sentence.", tags:[
          asset_id:"unique file ID",
          start_time:"timestamp of the beginning of the first word",
          end_time:"timestamp of the end of the last word",
          start_word_index:"word index of the first word",
          end_word_index:"word index of the last word",
          etc:"other sentence related data"
      ])>
   └─ <Document (text:"This is the second sentence.", tags:[
          ...
      ])>
   └─ <Document (text:"This is the third sentence.", tags:[
          ...
      ])>
   ...
```
However, I was told that this parent/child structure is not supported for searching against a large data set.
This is really a bummer, because for what I need it seems I'll have to make each sentence a root-level document.
Currently, it looks like this:
```
<Document (text:"This is the first sentence.", tags:[
    type:"video",
    name:"file name",
    file_etc:"other file related data",
    asset_id:"unique file ID",
    start_time:"timestamp of the beginning of the first word",
    end_time:"timestamp of the end of the last word",
    start_word_index:"word index of the first word",
    end_word_index:"word index of the last word",
    sentence_etc:"other sentence related data"
])>
<Document (text:"This is the second sentence.", tags:[
    ... (same file-level and sentence-level tags)
])>
<Document (text:"This is the third sentence.", tags:[
    ... (same file-level and sentence-level tags)
])>
...
```
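The flattening above can be sketched in plain Python: copy the file-level tags onto every sentence record so each sentence stands alone at root level (plain dicts and illustrative field names, not the actual DocArray API):

```python
# Sketch of flattening: merge file-level tags into every sentence record.
# Field names are illustrative and mirror the tags shown above.
def flatten(file_tags, sentences):
    out = []
    for s in sentences:
        # file tags first, sentence tags second so sentence values win on clashes
        out.append({"text": s["text"], "tags": {**file_tags, **s["tags"]}})
    return out

file_tags = {"type": "video", "asset_id": "A1", "name": "clip.mp4"}
sentences = [
    {"text": "This is the first sentence.",
     "tags": {"start_time": 0.0, "end_time": 1.7}},
    {"text": "This is the second sentence.",
     "tags": {"start_time": 1.7, "end_time": 3.2}},
]
flat = flatten(file_tags, sentences)
print(flat[0]["tags"]["asset_id"])  # → A1
```

The obvious cost, as noted below, is that every file-level tag is duplicated into every sentence.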
This can get very cumbersome and confusing, especially if we
want to move towards having multiple modalities.
It would be good to hear ideas on how this can be improved.
## Problems while experimenting with the Jina/DocArray/Weaviate combo
### Issue with class names in Weaviate
The class name I am using is "Document" for now, because Weaviate has a schema where you can
separate types of data into classes and perform queries against them. I have tried letting DocArray decide,
but the seemingly random strings it generates repeat; they are not unique. I loaded 854 transcripts into
the system and about halfway through it started throwing an error saying the class name was already taken.
If I run them all as one batch and give them all the same class name, they all load into the DB.
However, if I then try to load more, it tells me the class name is taken, which doesn't make sense.
Also, even when I am not loading any docs into the DB, I get the same error after stopping and starting the
program a few times. I think this is because I connect to the DB in the init function. I just want the init function
to open a connection, and then have a way
to close the connection when the request is done, but I can't find any way to do this in the documentation.
This is the only way I have found to connect to the DB.
Another issue with class names is that if I do not specify which class name to search against, search
returns no results. When I loaded some files with a specified class name, I could make search requests and get
results, but if I did not specify the class name when loading, search always returned empty.
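One workaround sketch for the repeated-random-name problem: derive a deterministic class name from the asset_id instead of letting DocArray generate one. To my understanding Weaviate class names must start with a capital letter and be alphanumeric, so non-alphanumeric characters are stripped here; the `Asset` prefix and the helper name are arbitrary choices of mine, not part of the existing code:

```python
# Hypothetical helper: deterministic Weaviate-style class name per asset.
# Assumption: class names must start with a capital letter and contain
# only alphanumeric characters.
import re

def class_name_for(asset_id: str) -> str:
    safe = re.sub(r'[^0-9A-Za-z]', '', asset_id)  # drop '_', '-', etc.
    return "Asset" + safe

print(class_name_for("_zPr-H-J6AE"))  # → AssetzPrHJ6AE
```

The same asset_id then always maps to the same class, so re-running an ingest would hit the same class rather than colliding with a stale random one.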
### Issue with child/parent searches
I did try putting the sentences in as chunks of the main document, and that seemed to work on ingestion.
However, search never worked; it always returned empty. When I found the documentation on searching
nested structures I tried it, but I always got an error.
This is the code:
```python
results = self.db.find(
    docs,
    limit=parameters["limit"],
).traverse_flat("@c")  # per the documentation
```
This is the result:
```
weaviate/rep-0@20779[E]:ValueError('`path`:@c is invalid, please refer to https://docarray.jina.ai/fundamentals/documentarray/access-elements/#index-by-nested-structure')
 add "--quiet-error" to suppress the exception details
Traceback (most recent call last):
  File ".local/lib/python3.8/site-packages/jina/serve/runtimes/worker/__init__.py", line 101, in process_data
    return await self._data_request_handler.handle(requests=requests)
  File ".local/lib/python3.8/site-packages/jina/serve/runtimes/request_handlers/data_request_handler.py", line 95, in handle
    return_data = await self._executor.__acall__(
  File ".local/lib/python3.8/site-packages/jina/serve/executors/__init__.py", line 232, in __acall__
    return await self.__acall_endpoint__(req_endpoint, **kwargs)
  File ".local/lib/python3.8/site-packages/jina/serve/executors/__init__.py", line 241, in __acall_endpoint__
    return await run_in_threadpool(func, self._thread_pool, self, **kwargs)
  File ".local/lib/python3.8/site-packages/jina/helper.py", line 1237, in run_in_threadpool
    return await get_or_reuse_loop().run_in_executor(
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File ".local/lib/python3.8/site-packages/jina/serve/executors/decorators.py", line 115, in arg_wrapper
    return fn(*args, **kwargs)
  File "weaviate/testWithJina/search/exec/weaviate/weaviate-executor.py", line 156, in search
    results = self.db.find(
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 193, in traverse_flat
    return self._flatten(leaves)
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 232, in _flatten
    return DocumentArray(list(itertools.chain.from_iterable(sequence)))
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 105, in traverse
    for p in _re_traversal_path_split(traversal_paths):
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 59, in _re_traversal_path_split
    raise ValueError(
ValueError: `path`:@c is invalid, please refer to https://docarray.jina.ai/fundamentals/documentarray/access-elements/#index-by-nested-structure
```
### Issue with Match
I know there is already a Match concept built into DocArray, which is similar to the concept we want to use it for.
From the documentation I see that a Match can be stored in a Document as it is placed into the DB. I'm not sure
whether this is helpful or harmful for our use case, but it was worth exploring. After I figured out Search I intended
to move on to Match, but then I discovered that for Search to work I could not store the document in a parent/child
structure. This leads me to believe that a horizontal Match won't work either, since I cannot relate the entirety of
one transcript to the entirety of another; I was forced to flatten the schema so Search would work.
If I cannot search the large dataset without flattening the schema, then how will I be able to perform
a Match against the data?
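One possible answer, sketched over the flattened schema: run the sentence-level search as usual, then group the hits by `asset_id` and aggregate them into a document-level score. Everything here is illustrative: `sentence_hits` is a hypothetical pre-computed list of `(asset_id, similarity)` pairs, and the mean is an arbitrary aggregation choice.

```python
# Sketch of a "horizontal" Match over a flattened store: group sentence-level
# hits by asset_id and rank documents by mean similarity. All names are
# illustrative assumptions, not the existing API.
from collections import defaultdict

def match_documents(sentence_hits, exclude_asset=None):
    """sentence_hits: [(asset_id, similarity), ...] from a sentence search."""
    grouped = defaultdict(list)
    for asset_id, score in sentence_hits:
        if asset_id != exclude_asset:  # skip the query document itself
            grouped[asset_id].append(score)
    # mean similarity per document, best first
    ranked = sorted(
        ((sum(v) / len(v), a) for a, v in grouped.items()), reverse=True
    )
    return [(a, s) for s, a in ranked]

hits = [("vid1", 1.0), ("vid2", 0.7), ("vid1", 0.5), ("vid3", 0.95)]
print(match_documents(hits, exclude_asset="vid3"))
# → [('vid1', 0.75), ('vid2', 0.7)]
```

This keeps the root-level schema that Search requires, at the cost of doing the document-level aggregation outside the DB.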