lancedb/python/python/lancedb/embeddings/instructor.py at main · zebin-code/lancedb

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List

import numpy as np

from ..util import attempt_import_or_raise
from .base import TextEmbeddingFunction
from .registry import register
from .utils import TEXT, weak_lru


@register("instructor")
class InstructorEmbeddingFunction(TextEmbeddingFunction):

    """
    An embedding function that uses the InstructorEmbedding library. Instructor models
    support multi-task learning, and can be used for a variety of tasks, including
    text classification, sentence similarity, and document retrieval. If you want to
    calculate customized embeddings for specific sentences, you may follow the unified
    template to write instructions:

    "Represent the `domain` `text_type` for `task_objective`":

    * domain is optional, and it specifies the domain of the text, e.g., science,
      finance, medicine, etc.
    * text_type is required, and it specifies the encoding unit, e.g., sentence,
      document, paragraph, etc.
    * task_objective is optional, and it specifies the objective of embedding,
      e.g., retrieve a document, classify the sentence, etc.

    For example, if you want to calculate embeddings for a document, you may write the
    instruction as follows:

    "Represent the document for retrieval"

    Parameters
    ----------
    name: str
        The name of the model to use. Available models are listed at
        https://github.com/xlang-ai/instructor-embedding#model-list;
        The default model is hkunlp/instructor-base
    batch_size: int, default 32
        The batch size to use when generating embeddings
    device: str, default "cpu"
        The device to use when generating embeddings
    show_progress_bar: bool, default True
        Whether to show a progress bar when generating embeddings
    normalize_embeddings: bool, default True
        Whether to normalize the embeddings
    quantize: bool, default False
        Whether to quantize the model
    source_instruction: str, default "represent the document for retrieval"
        The instruction for the source column
    query_instruction: str, default "represent the document for retrieving the most
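
The instruction template described in the docstring above can be illustrated with a small standalone sketch. It is not part of this module: `build_instruction` and `pair_with_instruction` are hypothetical helper names introduced only for illustration. The `[instruction, text]` pairing reflects how Instructor models are commonly fed input, but that is an assumption here and should be checked against the InstructorEmbedding documentation.

```python
from typing import List, Optional


def build_instruction(
    text_type: str,
    domain: Optional[str] = None,
    task_objective: Optional[str] = None,
) -> str:
    """Fill the template "Represent the `domain` `text_type` for `task_objective`".

    Hypothetical helper: only `text_type` is required, matching the docstring.
    """
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        # Optional domain, e.g. "science", "finance", "medicine".
        parts.append(domain)
    # Required encoding unit, e.g. "sentence", "document", "paragraph".
    parts.append(text_type)
    if task_objective:
        # Optional objective, e.g. "retrieval".
        parts.append(f"for {task_objective}")
    return " ".join(parts)


def pair_with_instruction(instruction: str, texts: List[str]) -> List[List[str]]:
    """Pair every text with the same instruction (assumed model input shape)."""
    return [[instruction, text] for text in texts]


source_instruction = build_instruction("document", task_objective="retrieval")
pairs = pair_with_instruction(source_instruction, ["LanceDB is a vector database."])
```

Here `source_instruction` reproduces the docstring's example instruction, and `pairs` holds one `[instruction, text]` entry per input text. Keeping separate instructions for the source column and for queries (as `source_instruction` and `query_instruction` suggest) lets stored documents and search queries be embedded with different task framing.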
