RAG Application Foundation Example

What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval with text generation to improve the accuracy and relevance of text generated by large language models (LLMs). Because an LLM's knowledge is limited to its training data, it may not have access to up-to-date information.

For example, when I asked GPT about the latest version of MatrixOne, it didn't give an answer.

In addition, these models sometimes produce misleading or factually incorrect content. For example, when I asked GPT about the relationship between Lu Xun and Zhou Shuren (Zhou Shuren is Lu Xun's real name, so they are the same person), it confidently made up an answer.

One way to solve these problems is to retrain the LLM, but this is expensive. The main advantage of RAG is that it avoids retraining the model for specific tasks. Because it is easy to adopt and has a low barrier to entry, RAG has become one of the most popular LLM scenarios, and many LLM applications are built on it. The core idea of RAG is that, when generating a response, the model relies not only on what it learned during training but also on external, up-to-date, or proprietary information sources. Users can therefore improve the model's output by enriching the input with knowledge from external knowledge bases that fit their actual situation.

RAG's workflow typically consists of the following steps:

Retrieve: Find and extract the information most relevant to the current query from a large data set or knowledge base.
Augment: Combine the retrieved information with the user's query so that the LLM receives additional context, which improves the accuracy of its output.
Generate: Use the LLM to generate new text or responses based on the retrieved information.
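The three steps can be sketched in a few lines of plain Python. This is a minimal, self-contained illustration under stated assumptions, not how MatrixOne or Ollama implement RAG: retrieval here ranks documents by simple word overlap (a stand-in for vector search), and the "LLM" is a hypothetical placeholder function that just echoes its prompt so the example runs without a model.

```python
import re
from typing import Callable, List, Set


def words(text: str) -> Set[str]:
    # Crude stand-in for an embedding: the set of lower-case words in the text.
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    # Retrieve: rank documents by how many words they share with the query.
    q = words(query)
    ranked = sorted(corpus, key=lambda doc: len(q & words(doc)), reverse=True)
    return ranked[:k]


def augment(query: str, docs: List[str]) -> str:
    # Augment: place the retrieved text in the prompt next to the question.
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    # Generate: hand the augmented prompt to whatever LLM callable is supplied.
    return llm(prompt)


corpus = [
    "The latest version of MatrixOne is v25.3.0.2.",
    "Matrix is a collection of numbers arranged in a rectangular array.",
    "MatrixOne supports real-time HTAP.",
]
question = "What is the latest version of MatrixOne?"
top_docs = retrieve(question, corpus, k=1)
augmented_prompt = augment(question, top_docs)
# Hypothetical "LLM" that echoes the prompt, so no model is needed to run this.
answer = generate(augmented_prompt, llm=lambda p: p)
print(answer)
```

In the real application, `retrieve` becomes a vector query against MatrixOne and `llm` becomes a call to a model served by Ollama.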

The following is a flow chart for Native RAG:

As you can see, retrieval plays a crucial role in the RAG architecture, and MatrixOne's vector retrieval capabilities provide strong data retrieval support for building RAG applications.

Role of MatrixOne in RAG

As a hyper-converged database, MatrixOne has built-in vector capabilities that play an important role in RAG applications in the following ways:

Efficient information retrieval: MatrixOne provides vector data types designed to store and process high-dimensional vector data. It supports KNN (k-nearest neighbor) queries and vector indexes to quickly find the data items most similar to a query vector.
Large-scale data processing: MatrixOne can efficiently manage and process large-scale vector data. This is a core capability of the retrieval component of a RAG system, allowing it to quickly find the information most relevant to a user query in vast amounts of data.
Improved generation quality: Using MatrixOne's vector retrieval, a RAG system can bring in information from an external knowledge base, so the generated text is more accurate, richer, and better grounded in context.
Security and privacy protection: MatrixOne also protects data with security measures such as encrypted storage and access control, which is particularly important for RAG applications that handle sensitive data.
Simpler development: MatrixOne provides an efficient mechanism for storing and retrieving vectorized data, which reduces the data-management burden on developers building RAG applications.
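To make "find the data items most similar to a query vector" concrete, the sketch below performs an exact (brute-force) KNN search in plain Python: it computes the L2 distance from the query to every stored vector and keeps the k smallest. This is only a conceptual model of what a query such as `ORDER BY l2_distance(...) LIMIT k` computes, not MatrixOne's implementation; the two-dimensional vectors are toy data.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Euclidean (L2) distance between two vectors of equal length.
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    # Exact KNN: score every row, then keep the k rows with the smallest distance.
    scored = ((content, l2_distance(query, vec)) for content, vec in rows)
    return heapq.nsmallest(k, scored, key=lambda item: item[1])


rows = [
    ("doc-a", [0.0, 0.0]),
    ("doc-b", [1.0, 1.0]),
    ("doc-c", [3.0, 4.0]),
]
print(knn([0.9, 1.2], rows, k=2))
```

Because every row is scored, the cost grows linearly with the table size, which is why vector indexes matter for large datasets.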

This tutorial uses Ollama to run Llama2 and mxbai-embed-large, and uses MatrixOne's vector capabilities to quickly build a Native RAG application.

MatrixOne Python SDK Documentation

This tutorial uses the MatrixOne Python SDK. For complete API reference and advanced features, please refer to the MatrixOne Python SDK Documentation.

Prepare before you start

Relevant knowledge

Ollama: Ollama is an open-source tool for serving large language models that lets users easily deploy and run large pre-trained models on their own hardware. Its primary function is to deploy and manage LLMs locally (it can also run inside a Docker container). After a simple installation, users can run an open-source LLM locally with a single command.

Llama2: Llama 2 is an open-source large language model for understanding and generating text that can be used for both research and commercial purposes.

Mxbai-embed-large: mxbai-embed-large is an open-source embedding model designed for text embedding and retrieval tasks. It produces 1024-dimensional embedding vectors.

Software Installation

Before you begin, confirm that you have downloaded and installed the following software:

Verify that you have completed the standalone deployment of MatrixOne.
Verify that you have installed Python 3.8 or later. Check the installation by printing the Python version:

python3 -V

Verify that you have completed installing the MySQL client.
Install matrixone-python-sdk and the ollama Python library (imported by the example code) with the following command:

pip3 install matrixone-python-sdk ollama

Verify that you have installed Ollama. Check the installation by printing the Ollama version:

ollama -v

Download the LLM model llama2 and embedding model mxbai-embed-large:

ollama pull llama2
ollama pull mxbai-embed-large

Build your app

Set Up the MatrixOne Client and Define the Table Model

Create the Python file rag_example.py. First, set up the MatrixOne client, define the data model with the ORM, and create the table.

import ollama
from matrixone import Client
from matrixone.orm import declarative_base
from sqlalchemy import Column, Integer, Text
from matrixone.sqlalchemy_ext import create_vector_column

# Create client and connect to MatrixOne
client = Client()
client.connect(
    host='127.0.0.1',
    port=6001,
    user='root',
    password='111',
    database='db1'
)

# Define the RAG table model using MatrixOne ORM
Base = declarative_base()

class RagDocument(Base):
    __tablename__ = 'rag_tab'
    id = Column(Integer, primary_key=True, autoincrement=True)
    content = Column(Text, nullable=False)
    embedding = create_vector_column(1024, "f32")

# Create table
client.create_table(RagDocument)
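The embedding column is declared with `create_vector_column(1024, "f32")`, so every vector written to it needs exactly 1024 components; if you swap in a different embedding model, its output size may differ. A small pre-insert check catches such mismatches early. The helper `check_embeddings` below is hypothetical (it is not part of the MatrixOne SDK) and is only a sketch of that validation.

```python
from typing import Iterable, List, Sequence

# Assumed to match create_vector_column(1024, "f32") in the table model.
EMBEDDING_DIM = 1024


def check_embeddings(
    embeddings: Iterable[Sequence[float]], dim: int = EMBEDDING_DIM
) -> List[List[float]]:
    # Reject any vector whose length differs from the column dimension,
    # and convert the values to plain floats before they are stored.
    checked = []
    for i, vec in enumerate(embeddings):
        if len(vec) != dim:
            raise ValueError(
                f"embedding {i} has {len(vec)} dimensions, expected {dim}"
            )
        checked.append([float(x) for x in vec])
    return checked
```

For example, `check_embeddings([response["embedding"]])` would raise a `ValueError` before the insert if the model returned a vector of the wrong size.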

Text Vectorization and Storage

Next, we'll vectorize the textual information using the mxbai-embed-large embedding model and save it to MatrixOne's rag_tab table.

# Document data
documents = [
    "MatrixOne is a hyper-converged cloud & edge native distributed database with a structure that separates storage, computation, and transactions to form a consolidated HSTAP data engine. This engine enables a single database system to accommodate diverse business loads such as OLTP, OLAP, and stream computing. It also supports deployment and utilization across public, private, and edge clouds, ensuring compatibility with diverse infrastructures.",
    "MatrixOne touts significant features, including real-time HTAP, multi-tenancy, stream computation, extreme scalability, cost-effectiveness, enterprise-grade availability, and extensive MySQL compatibility. MatrixOne unifies tasks traditionally performed by multiple databases into one system by offering a comprehensive ultra-hybrid data solution. This consolidation simplifies development and operations, minimizes data fragmentation, and boosts development agility.",
    "MatrixOne is optimally suited for scenarios requiring real-time data input, large data scales, frequent load fluctuations, and a mix of procedural and analytical business operations. It caters to use cases such as mobile internet apps, IoT data applications, real-time data warehouses, SaaS platforms, and more.",
    "Matrix is a collection of complex or real numbers arranged in a rectangular array.",
    "The latest version of MatrixOne is v25.3.0.2, released on 2025/09/26.",
    "We are excited to announce MatrixOne v22.0.8.0 release on 2023/6/30."
]

# Generate embeddings and prepare data for batch insert
rag_data = []
for doc in documents:
    response = ollama.embeddings(model="mxbai-embed-large", prompt=doc)
    embedding = response["embedding"]
    rag_data.append({
        'content': doc,
        'embedding': embedding
    })

# Batch insert data
client.batch_insert(RagDocument, rag_data)
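If you prefer to insert rows with plain SQL instead of the ORM (for example through a DB-API driver), the embedding has to be passed as a vector value. The sketch below assumes MatrixOne accepts vectors written as text literals such as `'[0.1,0.2,0.3]'` and that the driver uses `%s` placeholders (as PyMySQL does); `to_vector_literal` and `insert_statement` are hypothetical helper names.

```python
import json
from typing import Sequence, Tuple


def to_vector_literal(vec: Sequence[float]) -> str:
    # Serialize the embedding as a compact '[x,y,...]' string;
    # allow_nan=False makes NaN or infinity raise instead of producing invalid text.
    return json.dumps([float(x) for x in vec], separators=(",", ":"), allow_nan=False)


def insert_statement(content: str, vec: Sequence[float]) -> Tuple[str, Tuple[str, str]]:
    # Parameterized INSERT for rag_tab; pass both parts to cursor.execute(sql, params).
    sql = "INSERT INTO rag_tab (content, embedding) VALUES (%s, %s)"
    return sql, (content, to_vector_literal(vec))


sql, params = insert_statement("hello", [0.1, 0.2])
print(sql, params)
```

Using placeholders rather than string formatting keeps document text containing quotes from breaking the statement.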

Check the number of rows in the `rag_tab` table:

mysql> select count(*) from rag_tab;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)

As you can see, the data was successfully stored into the database.

Create Vector Index (Optional but Recommended)

For large-scale, high-dimensional data, a full (brute-force) search must compute the similarity between the query and every vector in the dataset, which causes significant computational overhead and latency. A vector index solves this by organizing the vectors into data structures that narrow the search to a small set of candidates, which improves retrieval performance and reduces computing cost. Therefore, we build an IVF-FLAT vector index on the vector column.

# Create IVF-FLAT vector index using client API
client.vector_ops.create_ivf(
    RagDocument,
    name='idx_rag_embedding',
    column='embedding',
    lists=100,
    op_type='vector_l2_ops'
)
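To see why an IVF-FLAT index speeds up search, here is a toy model in plain Python (a conceptual sketch, not MatrixOne's implementation). Vectors are assigned to the nearest of several centroids, giving one inverted list per centroid; this is what the `lists` parameter counts. A query then probes only the closest list(s) and computes exact ("flat") distances on those candidates. The centroids are hard-coded here for brevity; a real index learns them from the data (typically with k-means), and probing fewer lists trades some recall for speed. With only six rows in `rag_tab`, most of the 100 lists would be empty, so the value matters mainly for large tables.

```python
import math
from typing import Dict, List, Sequence


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(
    vectors: Dict[str, Sequence[float]], centroids: List[Sequence[float]]
) -> List[List[str]]:
    # One inverted list per centroid; each vector joins its nearest centroid's list.
    lists: List[List[str]] = [[] for _ in centroids]
    for key, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(key)
    return lists


def ivf_search(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    centroids: List[Sequence[float]],
    lists: List[List[str]],
    k: int = 1,
    nprobe: int = 1,
) -> List[str]:
    # Probe only the nprobe lists whose centroids are closest to the query...
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [key for c in order[:nprobe] for key in lists[c]]
    # ...then rank those candidates by exact distance.
    return sorted(candidates, key=lambda key: l2(query, vectors[key]))[:k]


vectors = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [5.0, 5.1], "d": [5.2, 4.9]}
centroids = [[0.0, 0.0], [5.0, 5.0]]  # normally learned during index build
lists = build_ivf(vectors, centroids)
print(lists)
print(ivf_search([4.9, 5.0], vectors, centroids, lists, k=1))
```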

Vector Retrieval

Once the data is ready, you can search the database for the content most similar to the question. This step relies on MatrixOne's vector retrieval, which supports several similarity measures; here we use l2_distance and return the top 3 results.

# Define the query question
prompt = "What is the latest version of MatrixOne?"

# Generate query embedding
response = ollama.embeddings(
    prompt=prompt,
    model="mxbai-embed-large"
)
query_embedding = response["embedding"]

# Perform vector similarity search using MatrixOne client
results = client.query(
    RagDocument.content,
    RagDocument.embedding.l2_distance(query_embedding).label("distance")
).order_by(
    RagDocument.embedding.l2_distance(query_embedding)
).limit(3).execute()

# Extract content from results
retrieved_docs = [row[0] for row in results.rows]
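The query above orders rows by `l2_distance`, where a smaller value means more similar. The plain-Python sketch below shows what that measure computes, next to two other common measures, inner product and cosine similarity (larger means more similar for both). It also checks a useful identity: for unit-length vectors, the squared L2 distance equals 2 - 2 × cosine similarity, so ranking by either gives the same order. The function names here are illustrative, not the SDK's API.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Euclidean distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product: larger means more similar, but it also grows with vector length.
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Compares directions only: 1.0 means the vectors point the same way.
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def normalize(v: Sequence[float]) -> List[float]:
    # Scale a vector to unit length.
    n = math.sqrt(inner_product(v, v))
    return [x / n for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
print(l2_distance(a, b), inner_product(a, b), cosine_similarity(a, b))

# For unit-length vectors: squared L2 distance == 2 - 2 * cosine similarity.
ua, ub = normalize(a), normalize(b)
print(l2_distance(ua, ub) ** 2, 2 - 2 * cosine_similarity(a, b))
```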

Alternative: Using Pinecone-Compatible Interface

The MatrixOne Python SDK also provides a Pinecone-compatible interface for vector operations, making it easy for developers familiar with Pinecone to migrate or integrate with MatrixOne. This interface offers a simplified API for common vector search operations.

Setup Pinecone-Compatible Vector Store

from matrixone.vector_store import MatrixOneVectorStore

# Initialize the Pinecone-compatible vector store
vector_store = MatrixOneVectorStore(
    client=client,
    table_name='rag_tab',
    vector_column='embedding',
    content_column='content',
    dimension=1024
)

Insert Documents with Pinecone-Style API

# Prepare documents with embeddings (Pinecone-style)
vectors_data = []
for i, doc in enumerate(documents):
    response = ollama.embeddings(model="mxbai-embed-large", prompt=doc)
    embedding = response["embedding"]
    vectors_data.append({
        'id': str(i + 1),
        'values': embedding,
        'metadata': {'content': doc}
    })

# Upsert vectors (Pinecone-compatible method)
vector_store.upsert(vectors=vectors_data)
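In Pinecone terminology, "upsert" means insert-or-update by id: a record with a new id is added, and a record whose id already exists is overwritten rather than duplicated. The hypothetical in-memory class below only illustrates those semantics; whether `MatrixOneVectorStore.upsert` behaves exactly this way (for example, when the table already holds rows from the earlier batch insert) should be confirmed in the MatrixOne Python SDK documentation.

```python
from typing import Any, Dict, List


class InMemoryIndex:
    """Hypothetical stand-in used only to illustrate upsert semantics."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}

    def upsert(self, vectors: List[Dict[str, Any]]) -> int:
        # Add records with new ids; replace values and metadata of existing ids.
        for v in vectors:
            self.records[v["id"]] = {
                "values": list(v["values"]),
                "metadata": dict(v.get("metadata", {})),
            }
        return len(vectors)


index = InMemoryIndex()
index.upsert([{"id": "1", "values": [0.1, 0.2], "metadata": {"content": "old text"}}])
index.upsert([
    {"id": "1", "values": [0.3, 0.4], "metadata": {"content": "new text"}},
    {"id": "2", "values": [0.5, 0.6], "metadata": {"content": "another doc"}},
])
print(len(index.records), index.records["1"]["metadata"]["content"])
```

Re-running an ingestion script with the same ids is therefore safe with upsert, whereas a plain insert would add duplicate rows.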

Query with Pinecone-Style Interface

# Generate query embedding
prompt = "What is the latest version of MatrixOne?"
response = ollama.embeddings(
    prompt=prompt,
    model="mxbai-embed-large"
)
query_embedding = response["embedding"]

# Perform similarity search using Pinecone-compatible query
search_results = vector_store.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

# Extract content from Pinecone-style results
retrieved_docs_pinecone = [
    match['metadata']['content']
    for match in search_results['matches']
]

# Display results with similarity scores
for i, match in enumerate(search_results['matches'], 1):
    print(f"\n--- Result {i} (Score: {match['score']:.4f}) ---")
    print(match['metadata']['content'][:200] + "...")

Benefits of Pinecone-Compatible Interface:

🔄 Easy Migration: Familiar API for developers coming from Pinecone
🚀 Simplified Operations: Cleaner syntax for common vector operations
📊 Automatic Scoring: Built-in similarity score calculation
🔍 Metadata Support: Easy inclusion of metadata in search results
💡 Developer-Friendly: Intuitive methods like upsert() and query()

Enhanced Generation

Finally, combine the documents retrieved in the previous step with the question and pass them to the LLM to generate an answer.

# Combine retrieved documents as context
context = " ".join(retrieved_docs)

# Generate enhanced response using LLM
output = ollama.generate(
    model="llama2",
    prompt=f"Using this data: {context}. Respond to this prompt: {prompt}"
)

print(output['response'])

# Close database connection
client.disconnect()
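The f-string prompt above works, but a more explicit template often helps a small model stay on the retrieved facts. The sketch below is one hypothetical way to build such a prompt: it numbers each retrieved passage and tells the model to answer only from the context. `build_prompt` is an illustrative name, and the wording of the instruction is an assumption you may want to tune.

```python
from typing import List


def build_prompt(question: str, docs: List[str]) -> str:
    # Number each retrieved passage so they are easy to tell apart,
    # and ask the model to answer only from the supplied context.
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, 1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


docs = [
    "The latest version of MatrixOne is v25.3.0.2, released on 2025/09/26.",
    "We are excited to announce MatrixOne v22.0.8.0 release on 2023/6/30.",
]
print(build_prompt("What is the latest version of MatrixOne?", docs))
```

In rag_example.py, the result would replace the inline f-string, e.g. `ollama.generate(model="llama2", prompt=build_prompt(prompt, retrieved_docs))`.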

The console prints the following answer:

Based on the provided data, the latest version of MatrixOne is v25.3.0.2, which was released on 2025/10/11.

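After augmentation, the model reports the correct latest version, v25.3.0.2. Note that the release date in the answer does not match the 2025/09/26 date in the stored document, so generated details should still be checked against the retrieved context.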