Issues with my first round of testing #348

@jonathan-rowley

Description

Test with Jina/Weaviate

Project Scope:

We would like to store files as vectors so that we can perform A.I. tasks against the data and return
human-readable responses. For this first test we will only use a transcript. Once that is working well,
we will incorporate other modalities such as audio, images, and video.

Project Definitions:

Search: Refers to a search against the data by a user-defined term. This function is similar to
performing a Google search against the user's data.

Match: Refers to a search from one entire document to another entire document. The user selects
the document they want to find matches for and the result is other similar documents.

## Testing setup:

Weaviate is running in Docker, and Jina is installed on the same Ubuntu server.

My main app file looks like:

from jina import Flow, Document, DocumentArray, Client
import sys  # used to see if there are command line args
'''
Workflows
Ingest: encode Document, then store in DB
Term Search: encode search term, then lookup in DB
Match: lookup Document in DB by ID, then lookup similar documents in DB

The encoder works and is included in the yml file.
The encoder will only vectorize items in the data object labeled "text" and then put them in the embedding field of the
document. Since the model being used is a sentence model, each sentence should be broken into a separate entry like:
"data": [
    {"text": "you think about it, It's a little strange."},
    {"text": "We still call these things phones."},
    {"text": "They still make calls, but really, it's become less about hello and more about."},
    {"text": "But that's actually changing to increasingly, when you take out your phone, it's not to make a call, not to type your input."},
    {"text": "Your starting point is through the lens."},
    {"text": "Right now."},
    ...
  ]
'''


# This is used to tie in all the executors from the flow.yml file
f = Flow.load_config('flow.yml')

d = Document()

# for passing a test flag via the command line.
if len(sys.argv) > 1:
    if sys.argv[1] == "svg":
        f.plot('flow.svg')  # this command will create a flow diagram in the root directory

with f:
    if bool(d):
        docs = f.post(on='/', inputs=d, on_done=lambda resp: resp)  # This starts the execution
    f.block()  # keeps the script running for the API calls

# Nothing can happen below here, since block() halts any further execution until the script is stopped.
print('\nGoodbye, Your Jina API is no longer running.\n')

With the configuration file:

jtype: Flow
with:
  protocol: 'http'
  port: 1234
  expose_endpoints:
    /match:
      summary: match endpoint
executors:
  - name: encoder
    uses: TransformerTorchEncoder
    volumes: '.cache/huggingface:.cache'
    uses_with: {
      'pretrained_model_name_or_path':'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
    }
    uses_metas: {
      'description': 'A pre-trained sentence vector package.'
    }
    py_modules: 'exec/executor-text-transformers-torch-encoder-main/transform_encoder.py'
  - name: weaviate
    uses: WeaviateExecutor
    uses_with: {
      'weaviate_port': '8080',
      'weaviate_host': 'localhost'
    }
    uses_metas: {
      'description': 'Controls data flow to DB'
    }
    py_modules: 'exec/weaviate/weaviate-executor.py'

And here is the DB Executor:

from jina import Executor, requests, Document, DocumentArray
from typing import Dict, Iterable, Optional, Tuple

class WeaviateExecutor(Executor):

    def __init__(
            self,
            weaviate_port,
            weaviate_host,
            traversal_paths: str = '@r,c,m',
            **kwargs
    ):
        '''
        Initiate the DB connection

        :param weaviate_port: Passed through on the flow.yml config file
        :param weaviate_host: Passed through on the flow.yml config file
        :param traversal_paths: Idea to traverse document depths but never worked
        :param kwargs: Here, kwargs are reserved for Jina to inject metas and requests
            (representing the request-to-function mapping) values when the Executor is used inside a Flow.
            You can access the values of these arguments in the __init__ body via
            self.metas/self.requests/self.runtime_args, or modify their values before passing them to super().__init__().
        '''
        super().__init__(**kwargs)
        # set DB connection config
        self.host = weaviate_host
        self.port = weaviate_port
        self.db = []
        self.connect_db("Document")

    def connect_db(
            self,
            class_name=''
    ):
        if class_name == '':
            self.db = DocumentArray(
                storage="weaviate", config={"host": self.host, "port": self.port}
            )
        else:
            self.db = DocumentArray(
                storage="weaviate", config={"host": self.host, "port": self.port, "name": class_name}
            )

    @requests(on='/index')
    def index(
            self,
            docs: DocumentArray,
            parameters: Dict={},
            **kwargs
    ):
        '''
        Here I would like to take a document and store it into the DB.
        There will be a few things in the Document. There will be a transcript broken out into sentences
        and timestamps for each word. The reason for breaking into sentences is that we are using a
        pre-trained model that is made for sentences. The vectors are created before we get to this point.

        :param parameters: Parameters or variables sent by the API that apply to the whole document
        :param docs: the sentences in the document that will be vectorized
        :param kwargs: Configuration settings for Jina
        '''
        
        
        '''
        At this point the docs looks like:
        [{
            "id": "1efd06c05078e7b91f9b6c7454cba48d",
            "parent_id": null,
            "granularity": null,
            "adjacency": null,
            "blob": null,
            "tensor": null,
            "mime_type": null,
            "text": "Yeah,dana bash,appreciate it.",
            "weight": null,
            "uri": null,
            "tags": {
                "end_time": 420940.0,
                "start_time": 419240.0,
                "start_word_index": 1203.0,
                "end_word_index": 1207.0,
                "asset_id": "_zPr-H-J6AE"
            },
            "offset": null,
            "location": null,
            "embedding": [-0.1570819765329361, 0.08105587214231491, 0.025140587240457535, ...]
            "modality": null,
            "evaluations": null,
            "scores": null,
            "chunks": null,
            "matches": null
        },...]
        '''

        # asset_id is required, so we check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")

        # @TODO: we need to figure out how to query the DB by asset_id to see if the doc exists before adding it

        # now store the document
        self.db.extend(docs)  # docs is already a DocumentArray; doesn't seem to be storing the tags sent

    @requests(on='/search')
    def search(
            self,
            docs: DocumentArray,
            parameters: Dict={},
            **kwargs
    ):
        '''
        This is going to be used as a search term lookup. A user will provide a string and we will
        get the vector comparison from the DB. Here we need to figure out how to add sorting, filters and limits.

        :param parameters: Parameters or variables sent by the API that apply to the whole document
        :param docs: the sentences in the document that will be vectorized
        :param kwargs: Configuration settings for Jina
        '''
        
        '''
        API Request looks like:
        {
          "data": [
            {
              "text": "where are my car keys?"
            }
          ],
          "parameters": {
              "limit": 10
          }
        }
        '''
        # @TODO: need to figure out a way to apply filters specified by the user

        # send the search request, which has been vectorized already
        results = self.db.find(
            docs,
            limit=int(parameters.get("limit", 10))  # fall back to a default limit if the caller omits one
        )
        return results
        
        '''
        Response does not include tags from ingestion:
        {
            "id": "00001cf94cd5d87713844a98d62bbc5c",
            "parent_id": null,
            "granularity": null,
            "adjacency": null,
            "blob": null,
            "tensor": null,
            "mime_type": null,
            "text": "So and so this car.",
            "weight": null,
            "uri": null,
            "tags": {
                "wid": "3aa9fde4-3114-5922-8b74-8959f72c48b2" #  ????? Where they at???
            },
            "offset": null,
            "location": null,
            "embedding": [0.32719624042510986,0.3085232079029083,-0.06597057729959488,...],
            "modality": null,
            "evaluations": null,
            "scores": {
                "cosine_similarity": {
                    "value": 0.5244436264038086,
                    "op_name": null,
                    "description": null,
                    "ref_id": null
                },
                "weaviate_certainty": {
                    "value": 0.7622218132019043,
                    "op_name": null,
                    "description": null,
                    "ref_id": null
                }
            },
            "chunks": null,
            "matches": null
        }
        '''

    @requests(on='/match')
    def match(
            self,
            parameters: Dict={},
            **kwargs
    ):
        '''
        This is going to be used as a Document Match lookup. A user will provide a document ID and we will
        get the vector similarity to other documents from the DB.
        Here we need to figure out how to add sorting, filters and limits.
        
        Not done yet...

        :param parameters: Parameters or variables sent by the API that apply to the whole document
        :param kwargs: Configuration settings for Jina
        '''
        
        # asset_id is required, so we check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")

        print(parameters)  # log the incoming parameters while this endpoint is a stub

    @requests(on='/delete')
    def delete(
            self,
            parameters: Dict={},
            **kwargs
    ):
        '''
        This endpoint will take a document ID and remove it from the DB.
        
        Not done yet...

        :param parameters: Parameters or variables sent by the API that apply to the whole document
        :param kwargs: Configuration settings for Jina
        '''

        # asset_id is required, so we check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")

        print(parameters)  # log the incoming parameters while this endpoint is a stub

    @requests(on='/update')
    def update(
            self,
            docs: DocumentArray,
            parameters: Dict={},
            **kwargs
    ):
        '''
        This endpoint will take a document ID and update it in the DB. This may require some other parameters
        so we know what is being updated.
        
        Not done yet...

        :param parameters: Parameters or variables sent by the API that apply to the whole document
        :param docs: the sentences in the document that will be vectorized
        :param kwargs: Configuration settings for Jina
        '''

        # asset_id is required, so we check for it
        if "asset_id" not in parameters:
            raise ValueError("ERROR: You are missing the required asset_id parameter.")

        print(parameters)  # log the incoming parameters while this endpoint is a stub

## Current Understanding:

The concept of DocArray is very simple and straightforward. We chose Weaviate as the storage backend since its
HNSW indexing is a good fit for what we need. So I will jump into what I think the
document should look like for what we are doing.

We are using a sentence trained Transformer model (sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
which is why it makes sense for us to break the transcript into sentences.
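To illustrate the splitting step, here is a minimal sketch that breaks a transcript into sentences and shapes them into the `"data"` payload the encoder expects. It assumes a naive punctuation-based split; the real pipeline derives sentence boundaries from the word-level timestamps, and both helper names are made up for this example:

```python
import re

def split_transcript(transcript):
    """Split a transcript into sentence strings on ., ! or ? boundaries.

    NOTE: naive punctuation-based splitting, for illustration only; the
    real pipeline uses word-level timestamps to find sentence boundaries.
    """
    parts = re.split(r'(?<=[.!?])\s+', transcript.strip())
    return [p for p in parts if p]

def to_encoder_payload(sentences):
    # One {"text": ...} entry per sentence under the "data" key, matching
    # the request shape the encoder vectorizes.
    return {"data": [{"text": s} for s in sentences]}

payload = to_encoder_payload(
    split_transcript("We still call these things phones. They still make calls.")
)
```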

That makes the document look like:

 <Document (tags:[type:"video", asset_id:"unique file ID", name:"file name", etc:"other file related data"]>
    └─ chunks
          └─ <Document (text:"This is the first sentence.",tags:[
            asset_id:"unique file ID",
            start_time:"timestamp of the beginning of the first word",
            end_time:"timestamp of the end of the last word",
            start_word_index:"word index of the first word",
            end_word_index:"word index of the last word",
            etc:"other sentence related data"
          ])>
          └─ <Document (text:"This is the second sentence.",tags:[
            ...
          ])>
          └─ <Document (text:"This is the third sentence.",tags:[
            ...
          ])>
          ...

However, I was told that this parent/child structure is not supported for searching against a large data set.
This is really a bummer, because for what I need it seems I'll have to make each sentence a root-level document.

Currently, it is looking like this:


  <Document (text:"This is the first sentence.",tags:[
    type:"video",
    name:"file name",
    file_etc:"other file related data",
    asset_id:"unique file ID",
    start_time:"timestamp of the beginning of the first word",
    end_time:"timestamp of the end of the last word",
    start_word_index:"word index of the first word",
    end_word_index:"word index of the last word",
    sentence_etc:"other sentence related data"
  ])>
  
  <Document (text:"This is the second sentence.",tags:[
    type:"video",
    name:"file name",
    file_etc:"other file related data",
    asset_id:"unique file ID",
    start_time:"timestamp of the beginning of the first word",
    end_time:"timestamp of the end of the last word",
    start_word_index:"word index of the first word",
    end_word_index:"word index of the last word",
    sentence_etc:"other sentence related data"
  ])>
  
  <Document (text:"This is the third sentence.",tags:[
    type:"video",
    name:"file name",
    file_etc:"other file related data",
    asset_id:"unique file ID",
    start_time:"timestamp of the beginning of the first word",
    end_time:"timestamp of the end of the last word",
    start_word_index:"word index of the first word",
    end_word_index:"word index of the last word",
    sentence_etc:"other sentence related data"
  ])>
  
  ...

This can get very cumbersome and confusing, especially if we
want to move towards having multiple modalities.
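For what it's worth, the duplication can at least be generated mechanically rather than by hand, e.g. by merging the file-level tags into each sentence's own tags at ingest time. A sketch using plain dicts (the tag names are the ones from the example above; `flatten_sentences` is a hypothetical helper):

```python
def flatten_sentences(file_tags, sentences):
    """Merge file-level tags into each sentence's tags so every sentence
    can stand alone as a root-level document.

    Sentence-level keys win on collision, so per-sentence data such as
    start_time is never overwritten by file-level metadata.
    """
    return [
        {"text": s["text"], "tags": {**file_tags, **s["tags"]}}
        for s in sentences
    ]

file_tags = {"type": "video", "asset_id": "unique file ID", "name": "file name"}
sentences = [
    {"text": "This is the first sentence.", "tags": {"start_time": 0.0, "end_time": 1.7}},
    {"text": "This is the second sentence.", "tags": {"start_time": 1.7, "end_time": 3.2}},
]
flat = flatten_sentences(file_tags, sentences)
```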

It would be good to hear ideas on how this can be improved.

## Problems while experimenting with the Jina/DocArray/Weaviate combo

### Issue with class names in Weaviate

The class name I am trying to use is "Document" for now, because Weaviate has a schema where you can
separate types of data into classes and perform queries against them. I have tried letting DocArray decide,
but the seemingly random strings it generates repeat; they are not unique. I loaded 854 transcripts into
the system, and about halfway through it started throwing an error saying the class name was already taken.
If I run them all as one batch and give them all the same class name, then they all load into the DB.
However, if I then try to load more, it tells me the class name is taken, which doesn't make sense.

Also, even if I'm not loading any docs into the DB, I get the same error after a while of stopping and starting the
program. I think this is because I connect to the DB in the __init__ function. I just want __init__
to open a connection and then have a way to close the connection when the request is done,
but I can't find any way to do this in the documentation.
This is the only way I have found to connect to the DB.
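One pattern worth trying is to make the connection lazy and disposable instead of opening it once in __init__. A plain-Python sketch of the idea (the `LazyConnection` class is made up for this example; whether the weaviate-backed `DocumentArray` tolerates being rebuilt per request, or leaks connections when dropped, is exactly the open question here):

```python
class LazyConnection:
    """Create the underlying handle on first use; drop it on close().

    `factory` is whatever builds the real connection, e.g. a lambda that
    constructs the weaviate-backed DocumentArray.
    """
    def __init__(self, factory):
        self._factory = factory
        self._handle = None

    @property
    def handle(self):
        # Build the connection only when a request actually needs it.
        if self._handle is None:
            self._handle = self._factory()
        return self._handle

    def close(self):
        # Forget the handle so the next request builds a fresh one.
        self._handle = None

calls = []
conn = LazyConnection(lambda: calls.append("connect") or object())
conn.handle
conn.handle   # second access reuses the cached handle
conn.close()
conn.handle   # rebuilt after close
```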

Another issue with the class names is that if I do not specify which class name to search against, it
returns no results. When I loaded some files with a specified class name, I could make search requests and get
results, but if I did not specify the class name when loading, search always returned empty.
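Since the auto-generated names are the unreliable part, one option is to always derive the class name yourself and pass it through the `name` config value. Weaviate class names must start with a capital letter and contain only alphanumerics (GraphQL naming rules), so a small sanitizer could guard whatever string you derive it from. A sketch (`to_class_name` is a hypothetical helper, not part of DocArray or Weaviate):

```python
import re

def to_class_name(raw):
    """Turn an arbitrary string into a Weaviate-legal class name:
    keep only alphanumeric runs, CamelCase them, and make sure the
    result starts with a letter."""
    words = re.findall(r"[A-Za-z0-9]+", raw)
    if not words:
        raise ValueError(f"cannot derive a class name from {raw!r}")
    name = "".join(w.capitalize() for w in words)
    # Class names may not start with a digit, so prefix one if needed.
    return name if name[0].isalpha() else "C" + name
```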

### Issue with child/parent searches

I did try putting the sentences as chunks of the main document, and that seemed to work on ingestion.
However, search never worked; it always returned empty. When I found the documentation for searching
nested structures, I tried it and always got an error.

This is the code:

results = self.db.find(
    docs,
    limit=parameters["limit"],
).traverse_flat("@c")  # per documentation

This is the result:

 weaviate/rep-0@20779[E]:ValueError('`path`:@c is invalid, please refer to https://docarray.jina.ai/fundamentals/documentarray/access-elements/#index-by-nested-structure')
 add "--quiet-error" to suppress the exception details
Traceback (most recent call last):
  File ".local/lib/python3.8/site-packages/jina/serve/runtimes/worker/__init__.py", line 101, in process_data
    return await self._data_request_handler.handle(requests=requests)
  File ".local/lib/python3.8/site-packages/jina/serve/runtimes/request_handlers/data_request_handler.py", line 95, in handle
    return_data = await self._executor.__acall__(
  File ".local/lib/python3.8/site-packages/jina/serve/executors/__init__.py", line 232, in __acall__
    return await self.__acall_endpoint__(req_endpoint, **kwargs)
  File ".local/lib/python3.8/site-packages/jina/serve/executors/__init__.py", line 241, in __acall_endpoint__
    return await run_in_threadpool(func, self._thread_pool, self, **kwargs)
  File ".local/lib/python3.8/site-packages/jina/helper.py", line 1237, in run_in_threadpool
    return await get_or_reuse_loop().run_in_executor(
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File ".local/lib/python3.8/site-packages/jina/serve/executors/decorators.py", line 115, in arg_wrapper
    return fn(*args, **kwargs)
  File "weaviate/testWithJina/search/exec/weaviate/weaviate-executor.py", line 156, in search
    results = self.db.find(
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 193, in traverse_flat
    return self._flatten(leaves)
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 232, in _flatten
    return DocumentArray(list(itertools.chain.from_iterable(sequence)))
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 105, in traverse
    for p in _re_traversal_path_split(traversal_paths):
  File ".local/lib/python3.8/site-packages/docarray/array/mixins/traverse.py", line 59, in _re_traversal_path_split
    raise ValueError(
ValueError: `path`:@c is invalid, please refer to https://docarray.jina.ai/fundamentals/documentarray/access-elements/#index-by-nested-structure

### Issue with Match

I know there is already a Match concept built into DocArray, which is similar to the concept we want to use it for.
From the documentation I see that a Match can be stored in a Document as it is placed into the DB. I'm not sure
if this is helpful or harmful to our use case, but it was worth exploring. After I figured out Search, I intended to move
on to Match. But then I discovered that, in order for Search to work, I couldn't store the document in a parent/child
structure. This leads me to believe that a horizontal Match wouldn't work, since I could not relate an entire
transcript to the entirety of another transcript; I was forced to flatten the schema so Search would work.
If I cannot Search the large data set without flattening the schema, then how will I be able to perform
a Match against the data?
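With the flattened schema, one possible workaround is to keep Match at the application layer: search with the sentences of the source transcript (or a sample of them), then aggregate the sentence-level scores per asset_id on the client side. A sketch with made-up hit data (the field names follow the /search response example above; mean scoring is an arbitrary choice, max or sum are alternatives):

```python
from collections import defaultdict

def match_assets(hits, top_k=3):
    """Aggregate sentence-level search hits into per-asset scores.

    Each hit is an (asset_id, cosine_similarity) pair; an asset's score
    is the mean of its sentence scores, and the top_k assets are
    returned best-first.
    """
    buckets = defaultdict(list)
    for asset_id, score in hits:
        buckets[asset_id].append(score)
    ranked = sorted(
        ((sum(v) / len(v), k) for k, v in buckets.items()), reverse=True
    )
    return [(k, s) for s, k in ranked[:top_k]]

# Hypothetical sentence-level hits from three source transcripts.
hits = [("vidA", 0.9), ("vidA", 0.7), ("vidB", 0.85), ("vidC", 0.2)]
ranked = match_assets(hits)
```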
