# lancedb/python/python/lancedb/index.py at main · zebin-code/lancedb

from dataclasses import dataclass
from typing import Literal, Optional

from ._lancedb import (
    IndexConfig,
)

lang_mapping = {
    "ar": "Arabic",
    "da": "Danish",
    "du": "Dutch",
    "en": "English",
    "fi": "Finnish",
    "fr": "French",
    "de": "German",
    "gr": "Greek",
    "hu": "Hungarian",
    "it": "Italian",
    "no": "Norwegian",
    "pt": "Portuguese",
    "ro": "Romanian",
    "ru": "Russian",
    "es": "Spanish",
    "sv": "Swedish",
    "ta": "Tamil",
    "tr": "Turkish",
}

@dataclass
class BTree:
    """Describes a btree index configuration

A btree index is an index on scalar columns. The index stores a copy of the

column in sorted order. A header entry is created for each block of rows

(currently the block size is fixed at 4096). These header entries are stored

in a separate cacheable structure (a btree). To search for data the header is

used to determine which blocks need to be read from disk.

For example, a btree index in a table with 1Bi rows requires

sizeof(Scalar) * 256Ki bytes of memory and will generally need to read

sizeof(Scalar) * 4096 bytes to find the correct row ids.
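
The arithmetic above can be sketched as follows. This is an illustration only
and not part of the library: the function names `btree_header_bytes` and
`btree_read_bytes` are hypothetical, and each header entry is assumed to hold
exactly one scalar value.

```python
# Illustrative estimate of the memory and read sizes described above.
BLOCK_SIZE = 4096  # rows per block, as stated in this docstring


def btree_header_bytes(num_rows: int, scalar_size: int) -> int:
    # One header entry per block of rows (assumption: one scalar per entry).
    num_blocks = -(-num_rows // BLOCK_SIZE)  # ceiling division
    return num_blocks * scalar_size


def btree_read_bytes(scalar_size: int) -> int:
    # A highly selective lookup reads roughly one block of sorted values.
    return BLOCK_SIZE * scalar_size


# 1Bi rows of 8-byte scalars: 256Ki header entries of 8 bytes each.
print(btree_header_bytes(2**30, 8))  # 2097152
print(btree_read_bytes(8))  # 32768
```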

This index is good for scalar columns with mostly distinct values and does best

when the query is highly selective. It works with numeric, temporal, and string

columns.

The btree index does not currently have any parameters, though parameters such as

the block size may be added in the future.

    """

    pass


@dataclass
class Bitmap:
    """Describe a Bitmap index configuration.

A `Bitmap` index stores one bitmap for each distinct value in the column, with

one bit per row.

This index works best for low-cardinality numeric or string columns,

where the number of unique values is small (i.e., fewer than a few thousand).

A `Bitmap` index can accelerate the following filters:

- `<`, `<=`, `=`, `>`, `>=`

- `IN (value1, value2, ...)`

- `between (value1, value2)`

- `is null`

For example, a bitmap index on a table with 1Bi rows and 128 distinct values

requires 128 / 8 * 1Bi bytes on disk.
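
A minimal sketch of the idea, assuming uncompressed bitmaps held in Python
integers. The helper names `build_bitmaps`, `rows_in`, and
`bitmap_index_bytes` are hypothetical and do not reflect the on-disk format.

```python
def build_bitmaps(column):
    # One bitmap (a Python int used as a bit set) per distinct value;
    # bit i is set when row i holds that value.
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps


def rows_in(bitmaps, values):
    # Answer `IN (value1, value2, ...)` by OR-ing the per-value bitmaps.
    combined = 0
    for value in values:
        combined |= bitmaps.get(value, 0)
    return [i for i in range(combined.bit_length()) if (combined >> i) & 1]


def bitmap_index_bytes(num_rows, num_distinct):
    # One bit per row for each distinct value, 8 bits per byte.
    return num_distinct * num_rows // 8


bitmaps = build_bitmaps(["a", "b", "a", "c"])
print(rows_in(bitmaps, ["a", "c"]))  # [0, 2, 3]
print(bitmap_index_bytes(2**30, 128))  # 17179869184, i.e. 128 / 8 * 1Bi
```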

    """

    pass


@dataclass
class LabelList:
    """Describe a LabelList index configuration.

`LabelList` is a scalar index that can be used on `List<T>` columns to

support queries with `array_contains_all` and `array_contains_any`

using an underlying bitmap index.

For example, it works with `tags`, `categories`, `keywords`, etc.
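
The intended semantics of the two queries can be sketched with plain Python
sets standing in for the underlying bitmaps. This is only an illustration:
`build_label_index`, `contains_any`, and `contains_all` are hypothetical
helpers, not library functions.

```python
def build_label_index(rows):
    # Map each label to the set of row ids whose list contains it.
    index = {}
    for row_id, labels in enumerate(rows):
        for label in labels:
            index.setdefault(label, set()).add(row_id)
    return index


def contains_any(index, labels):
    # Rows whose list contains at least one of the labels (set union).
    result = set()
    for label in labels:
        result |= index.get(label, set())
    return sorted(result)


def contains_all(index, labels):
    # Rows whose list contains every label (set intersection).
    # For an empty query list this sketch returns no rows.
    sets = [index.get(label, set()) for label in labels]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


index = build_label_index([["a", "b"], ["b"], ["a", "c"], []])
print(contains_any(index, ["a", "c"]))  # [0, 2]
print(contains_all(index, ["a", "b"]))  # [0]
```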

    """

    pass


@dataclass
class FTS:
    """Describe an FTS index configuration.

`FTS` is a full-text search index that can be used on `String` columns.

For example, it works with `title`, `description`, `content`, etc.

Attributes

----------

with_position : bool, default True

Whether to store the position of the token in the document. Setting this

to False can reduce the size of the index and improve indexing speed,

but it will disable support for phrase queries.

base_tokenizer : str, default "simple"

The base tokenizer to use for tokenization. Options are:

- "simple": Splits text by whitespace and punctuation.

- "whitespace": Splits text by whitespace, but not punctuation.

- "raw": No tokenization. The entire text is treated as a single token.

language : str, default "English"

The language to use for tokenization.

max_token_length : int, default 40

The maximum token length to index. Tokens longer than this length will be

ignored.

lower_case : bool, default True

Whether to convert the token to lower case. This makes queries case-insensitive.

stem : bool, default False

Whether to stem the token. Stemming reduces words to their root form.

For example, in English "running" and "runs" would both be reduced to "run".

remove_stop_words : bool, default False

Whether to remove stop words. Stop words are common words that are often

removed from text before indexing. For example, in English "the" and "and".

ascii_folding : bool, default False

Whether to apply ASCII folding. This converts accented characters to

their ASCII equivalent. For example, "café" would be converted to "cafe".
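
The effect of several of these attributes can be approximated in pure Python.
This is a rough sketch under stated assumptions, not the tokenizer the index
actually uses: `split_simple`, `fold_to_ascii`, and `tokenize` are
hypothetical helpers, the order of the filters is assumed, and stemming and
stop-word removal are left out.

```python
import unicodedata


def split_simple(text):
    # Approximates "simple": keep runs of letters and digits, drop the rest.
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def fold_to_ascii(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, base_tokenizer="simple", max_token_length=40,
             lower_case=True, ascii_folding=False):
    if base_tokenizer == "raw":
        tokens = [text]  # the entire text is a single token
    elif base_tokenizer == "whitespace":
        tokens = text.split()  # punctuation stays attached to tokens
    else:
        tokens = split_simple(text)
    if max_token_length is not None:
        # Tokens longer than the limit are ignored.
        tokens = [t for t in tokens if len(t) <= max_token_length]
    if lower_case:
        tokens = [t.lower() for t in tokens]
    if ascii_folding:
        tokens = [fold_to_ascii(t) for t in tokens]
    return tokens


print(tokenize("Hello, World!"))  # ['hello', 'world']
print(tokenize("Hello, World!", base_tokenizer="whitespace"))  # ['hello,', 'world!']
```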

    """

    with_position: bool = True
    base_tokenizer: Literal["simple", "raw", "whitespace"] = "simple"
    language: str = "English"
    max_token_length: Optional[int] = 40
    lower_case: bool = True
    stem: bool = False
    remove_stop_words: bool = False
    ascii_folding: bool = False


@dataclass
class HnswPq:
    """Describe an HNSW-PQ index configuration.

HNSW-PQ stands for Hierarchical Navigable Small World - Product Quantization.

It is a variant of the HNSW algorithm that uses product quantization to compress

the vectors. To create an HNSW-PQ index, you can specify the following parameters:

Parameters

----------

distance_type: str, default "l2"

The distance metric used to train the index.

The following distance types are available:

"l2" - Euclidean distance. This is a very common distance metric that

accounts for both magnitude and direction when determining the distance

between vectors. L2 distance has a range of [0, ∞).

"cosine" - Cosine distance. Cosine distance is a distance metric

calculated from the cosine similarity between two vectors. Cosine

similarity is a measure of similarity between two non-zero vectors of an

inner product space. It is defined to equal the cosine of the angle

between them. Unlike L2, the cosine distance is not affected by the

magnitude of the vectors. Cosine distance has a range of [0, 2].

"dot" - Dot product. Dot distance is the dot product of two vectors. Dot

distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their

L2 norm is 1), then dot distance is equivalent to the cosine distance.

num_partitions, default sqrt(num_rows)

The number of IVF partitions to create.

For HNSW, we recommend a small number of partitions. Setting this to 1 works

well for most tables. For very large tables, training just one HNSW graph

will require too much memory. Each partition becomes its own HNSW graph, so

setting this value higher reduces the peak memory use of training.

num_sub_vectors, default is vector dimension / 16

Number of sub-vectors of PQ.

This value controls how much the vector is compressed during the

quantization step. The more sub-vectors there are, the less the vector is

compressed. The default is the dimension of the vector divided by 16.

If the dimension is not evenly divisible by 16 we use the dimension

divided by 8.

The above two cases are highly preferred. Having 8 or 16 values per

sub-vector allows us to use efficient SIMD instructions.

If the dimension is not divisible by 8 then we use 1 sub-vector. This is not

ideal and will likely result in poor performance.

num_bits: int, default 8

Number of bits to encode each sub-vector.

This value controls how much the sub-vectors are compressed. The more bits,

the more accurate the index but the slower the search. Only 4 and 8 are supported.

max_iterations, default 50

Max iterations to train kmeans.

When training an IVF index we use kmeans to calculate the partitions. This

parameter controls how many iterations of kmeans to run.

Increasing this might improve the quality of the index but in most cases the

parameter is unused because kmeans will converge with fewer iterations. The

parameter is only used in cases where kmeans does not appear to converge. In

those cases it is unlikely that setting this larger will lead to convergence

anyway.

sample_rate, default 256

The rate used to calculate the number of training vectors for kmeans.

When an IVF index is trained, we need to calculate partitions. These are

groups of vectors that are similar to each other. To do this we use an

algorithm called kmeans.

Running kmeans on a large dataset can be slow. To speed this up we

run kmeans on a random sample of the data. This parameter controls the

size of the sample. The total number of vectors used to train the index

is `sample_rate * num_partitions`.

Increasing this value might improve the quality of the index but in

most cases the default should be sufficient.

m, default 20

The number of neighbors to select for each vector in the HNSW graph.

This value controls the tradeoff between search speed and accuracy.

The higher the value the more accurate the search but the slower it will be.

ef_construction, default 300

The number of candidates to evaluate during the construction of the HNSW graph.

This value controls the tradeoff between build speed and accuracy.

The higher the value the more accurate the build but the slower it will be.

150 to 300 is the typical range. 100 is a minimum for good quality search

results. In most cases, there is no benefit to setting this higher than 500.

This value should be no smaller than the `ef` value used in the

search phase.
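
A few of the rules described above can be sketched in plain Python. These
helpers are hypothetical and only restate this docstring: the distance
functions use the textbook definitions (the values the index reports may be
scaled differently, for example a squared L2), and
`default_num_sub_vectors` and `num_training_vectors` simply mirror the stated
defaults.

```python
import math


def l2_distance(a, b):
    # Textbook Euclidean distance; range [0, inf).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 - cosine similarity; range [0, 2]. Undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def default_num_sub_vectors(dimension):
    # Prefer 16 values per sub-vector, then 8, else a single sub-vector.
    if dimension % 16 == 0:
        return dimension // 16
    if dimension % 8 == 0:
        return dimension // 8
    return 1


def num_training_vectors(sample_rate, num_partitions):
    # Size of the random sample used to train kmeans.
    return sample_rate * num_partitions


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0, magnitude is ignored
print(l2_distance([1.0, 0.0], [2.0, 0.0]))  # 1.0, magnitude matters
print(default_num_sub_vectors(128))  # 8
print(num_training_vectors(256, 100))  # 25600
```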

    """

    distance_type: Literal["l2", "cosine", "dot"] = "l2"
    num_partitions: Optional[int] = None
    num_sub_vectors: Optional[int] = None
    num_bits: int = 8
    max_iterations: int = 50
    sample_rate: int = 256
    m: int = 20
    ef_construction: int = 300


@dataclass
class HnswSq:
    """Describe an HNSW-SQ index configuration.

HNSW-SQ stands for Hierarchical Navigable Small World - Scalar Quantization.

It is a variant of the HNSW algorithm that uses scalar quantization to compress

the vectors.