Custom Datasets

Build AI knowledge bases with semantic search capabilities. Upload documents, add records, and let your bots retrieve relevant information from any knowledge source.

Every AI assistant is only as good as the knowledge behind it. Datasets give your bots access to your specific information - product documentation, support articles, policies, FAQs, or any content your users need. When someone asks a question, your bot searches the dataset semantically to find the most relevant answers, not just keyword matches.

Unlike traditional search that looks for exact words, dataset search understands meaning. Ask "how do I change my subscription" and it finds records about billing modifications, plan upgrades, and account management - even if those exact words never appear. This retrieval-augmented generation (RAG) approach means your bots give accurate, contextual answers grounded in your actual content.

Key Capabilities

Semantic Search with Vector Embeddings

Every record in your dataset is converted into a vector embedding that captures its meaning. When your bot receives a question, the question is converted into the same kind of vector and compared against the records in your knowledge base to find the most semantically similar content. Results are ranked by relevance score, so the closest matches are returned first.
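As a rough illustration of ranking by semantic similarity, the sketch below scores records against a query using cosine similarity. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the function names are hypothetical. The platform's actual scoring method is not described here, so treat this as one common approach rather than the implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_records(query_vector, records, top_k=3):
    """Return (score, text) pairs sorted by descending similarity."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hand-made vectors standing in for real embeddings (assumption).
RECORDS = [
    ("How to upgrade your plan", [0.9, 0.1, 0.0]),
    ("Resetting your password", [0.1, 0.9, 0.0]),
    ("Shipping times by region", [0.0, 0.1, 0.9]),
]
# Pretend embedding of "how do I change my subscription".
QUERY = [0.8, 0.2, 0.1]

results = rank_records(QUERY, RECORDS, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query shares no words with "How to upgrade your plan", yet that record ranks first because its vector points in nearly the same direction as the query vector.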

Document Processing

Upload PDFs, Word documents, spreadsheets, text files, and other common formats. The platform automatically extracts text, splits it into optimal chunks, and indexes everything for search. Large documents are processed asynchronously - upload and walk away while the system handles the heavy lifting.
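To give a feel for what splitting text into chunks involves, here is a minimal sketch of fixed-size chunking with overlap. The word-based strategy, the default sizes, and the function name are all assumptions for illustration. The platform chooses its own chunking strategy automatically, and it may differ from this.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Larger chunks keep more context together. Smaller chunks make each search hit more specific. Overlap trades a little extra storage for fewer answers lost at chunk boundaries.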

Flexible Record Management

Add knowledge directly through records - text snippets with optional source URLs and metadata. Build FAQ databases by adding question-answer pairs. Import data from CSV or JSON files. Each record becomes searchable the moment it's indexed, giving your bots immediate access to new information.
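The sketch below shows one way to turn a CSV of question-answer pairs into record-shaped dictionaries before importing them. The `question` and `answer` column names and the `text` and `source` field names are assumptions made for this example, not the platform's documented schema.

```python
import csv
import io


def faq_rows_to_records(csv_text, source=None):
    """Convert CSV rows with 'question' and 'answer' columns into records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if not question or not answer:
            continue  # skip rows that are missing either half of the pair
        record = {"text": f"Q: {question}\nA: {answer}"}
        if source:
            record["source"] = source  # optional source URL (hypothetical field)
        records.append(record)
    return records


SAMPLE_CSV = """question,answer
How do I reset my password?,Use the reset link on the sign-in page.
What is the refund window?,
Can I change plans?,"Yes, from the billing page."
"""

records = faq_rows_to_records(SAMPLE_CSV, source="https://example.com/faq")
for record in records:
    print(record)
```

Keeping the question and the answer in the same record means a search that matches the question wording also retrieves the answer.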

Multiple Storage Backends

Each storage backend is paired with an embedding model, so you can choose one based on your needs. Ada Sprout uses OpenAI's text-embedding-ada-002 for reliable general-purpose search. Lingo Sprout uses text-embedding-3-small for improved multilingual support. Select the storage backend when creating your dataset to match your use case.

Result Reranking

After the initial semantic search retrieves candidates, optional reranking models like BGE v2 M3 or Cohere 3.5 can further refine results. Reranking improves accuracy for complex queries where multiple records might seem equally relevant, helping your bot select the strongest matches.
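The two-stage idea can be sketched as follows: a first pass produces candidates, then a second scoring function re-orders them. Here a toy word-overlap score stands in for a real reranking model purely so the example runs on its own. It is not how BGE or Cohere rerankers score text, and the function names are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance score: Jaccard overlap of lowercase word sets."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words or not text_words:
        return 0.0
    return len(query_words & text_words) / len(query_words | text_words)


def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Re-order first-stage candidates using a second scoring function."""
    rescored = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return rescored[:top_n]


# Candidates in the order a first-stage search might have returned them.
CANDIDATES = [
    "billing overview and invoices",
    "cancel your subscription plan",
    "change your subscription plan",
]

best = rerank("change subscription plan", CANDIDATES)
print(best)
```

Because the reranker only sees the short candidate list, it can afford a more expensive comparison per record than the first-stage search over the whole dataset.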

Custom Search Instructions

Configure what your bot does when it finds matches - and when it doesn't. Set a match instruction that is prepended to the found records, helping your bot understand how to use the information. Set a mismatch instruction that guides responses when no relevant records exist, reducing the risk of hallucination by having the bot acknowledge knowledge gaps.
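Conceptually, the match and mismatch instructions behave like the branch in the sketch below. The function name, the bullet formatting of records, and the instruction wording are assumptions for illustration. The platform assembles the actual context for you.

```python
def build_context(records, match_instruction, mismatch_instruction):
    """Return the context text passed to the model alongside the question."""
    if not records:
        # Nothing relevant was found: steer the bot toward admitting the gap.
        return mismatch_instruction
    body = "\n".join(f"- {text}" for text in records)
    # Matches were found: prepend guidance on how to use them.
    return f"{match_instruction}\n{body}"


MATCH = "Answer using only the records below."
MISMATCH = "No relevant records were found. Say you do not know rather than guessing."

found = build_context(["Plans can be changed from the billing page."], MATCH, MISMATCH)
missing = build_context([], MATCH, MISMATCH)

print(found)
print(missing)
```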

Real-World Use Cases

Support Knowledge Base

Upload your entire support documentation and let your bot answer customer questions instantly. Instead of customers searching through help articles, they ask questions naturally and get precise answers. The bot retrieves relevant sections from your docs and synthesizes helpful responses grounded in your actual policies and procedures.

Product Information System

Create a dataset with product specifications, pricing, compatibility requirements, and feature comparisons. Sales and support bots can instantly retrieve accurate product details when customers ask. Update the dataset when products change, and your bot has current information as soon as the new records are indexed.

Internal Knowledge Management

Build datasets from company policies, procedures, and institutional knowledge. Employees ask questions in natural language and get answers from your actual documentation. New hires get instant access to organizational knowledge without hunting through wikis and shared drives.

Legal and Compliance

Index contracts, regulations, and compliance documentation. When questions arise about specific clauses or requirements, your bot retrieves the exact relevant text. This suits quick lookups where accuracy matters and you need citations back to source documents.

Educational Content

Upload course materials, textbooks, and reference documents. Students interact with an AI tutor that retrieves relevant explanations, examples, and definitions from the actual curriculum. The bot stays grounded in course content rather than generating generic information.

How It Works

Datasets connect to your bots through simple configuration. Here's the typical workflow:

  1. Create a dataset - Give it a name and optionally select a storage backend and reranker
  2. Add content - Upload files or create records directly through the dashboard or API
  3. Connect to bots - Link the dataset to one or more bots that should have access to the knowledge
  4. Test searches - Use the search endpoint to verify your content is properly indexed and returning relevant results
  5. Deploy - Your bot automatically searches the dataset during conversations to find relevant context

When your bot receives a message, it can search connected datasets to find information relevant to the user's question. The retrieved records provide context that helps the bot generate accurate, grounded responses. All of this happens automatically once your dataset is connected - no additional coding required.
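The "test searches" step in the workflow can be made repeatable with a small check like the one below. It takes any search function and a list of (query, expected record) pairs, then reports the queries whose expected record did not appear in the top results. `fake_search` is a canned stand-in so the example runs offline. In practice you would swap in a call to the dataset search endpoint, whose request and response shape is not shown here.

```python
def check_searches(search_fn, cases, top_k=3):
    """Return the queries whose expected record is missing from the top_k results."""
    failures = []
    for query, expected in cases:
        results = search_fn(query)[:top_k]
        if expected not in results:
            failures.append(query)
    return failures


def fake_search(query):
    """Stand-in for a real search call; returns canned results per query."""
    canned = {
        "change subscription": ["Upgrading your plan", "Billing overview"],
        "reset password": ["Billing overview"],
    }
    return canned.get(query, [])


CASES = [
    ("change subscription", "Upgrading your plan"),
    ("reset password", "Resetting your password"),
]

failures = check_searches(fake_search, CASES)
print(failures)
```

A failing query usually points to missing content or to records that are worded too differently from how users ask. Both are easier to fix before the dataset is connected to a live bot.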

Getting Started

Create your first dataset from the ChatBotKit dashboard under the Datasets section. Start by adding a few records manually to understand how the system works, then upload documents to build out your knowledge base.

For developers, the Dataset API provides complete programmatic control - create datasets, manage records, search content, and integrate with your own systems. The SDK handles all the complexity of working with embeddings and vector search.

Datasets transform your bots from generic assistants into domain experts with access to your specific knowledge, so responses are grounded in accurate, relevant information.