Data Sources

Data sources provide the knowledge your chatbot uses to answer questions. You can add website URLs, upload files, or enter text directly. All data sources are managed in the Data Sources tab of your chatbot's detail page.

Website URLs

Enter URLs to crawl and index. WebChatAgent automatically crawls the pages and extracts content for your chatbot's knowledge base.

Adding a Website

  1. Click Add Website in the Data Sources tab
  2. Enter the website URL (e.g. example.com; https:// is added automatically)
  3. Configure optional advanced settings (see below)
  4. Click Add to start crawling

Advanced Settings

Field | Description | Default
Crawl Depth | How many link levels deep to crawl. 0 = only this page, empty = unlimited depth. | Empty (unlimited)
CSS Selectors | Comma-separated CSS selectors to target specific content areas (e.g. article, .content, #main-body). When set, only content within these selectors is extracted. | Empty (full page)
Exclude URL Patterns | Comma-separated URL patterns to skip during crawling (e.g. /login, /impressum, */amp/*). | Empty
Auto Re-index Interval | How often to automatically re-crawl and update the content. Options: Every 3 days, Every 7 days, Every 30 days. | Disabled

Note: Auto Re-index is available on Standard plans and above only. Free and Basic plans must re-index manually.
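
The Crawl Depth setting can be pictured as a breadth-first walk over links. The sketch below is only an illustration over a hypothetical in-memory link graph (LINKS), not WebChatAgent's actual crawler; the function name crawl and its parameters are assumptions made for this example.

```python
from collections import deque
from typing import Optional

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup"],
    "/blog": ["/blog/post-1"],
    "/docs/setup": [],
    "/blog/post-1": ["/"],
}

def crawl(start: str, max_depth: Optional[int] = None) -> list[str]:
    """Return pages reachable from start within max_depth link levels.

    max_depth=0 visits only the start page; None means unlimited depth.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

print(crawl("/", 0))     # ['/']
print(crawl("/", 1))     # ['/', '/docs', '/blog']
print(crawl("/", None))  # every page reachable from "/"
```

A depth of 1 therefore covers the start page plus everything it links to directly, which is often enough for a small site whose navigation lists all pages.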

When to Use CSS Selectors

CSS selectors help you extract only the relevant content from a page. This is useful when:

  • Your pages have navigation menus, footers, or sidebars you want to exclude
  • You want to focus on the main article content only
  • The page contains ads or unrelated widgets

Example: To extract only the main content area and skip navigation:

article, .main-content, #post-body
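
To make the effect of these selectors concrete, here is a simplified sketch of selector-based extraction. It assumes well-formed HTML and supports only plain tag, .class, and #id selectors; SelectorExtractor, parse_selectors, and matches are hypothetical names for this illustration, and the real extractor may behave differently.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

def parse_selectors(raw: str) -> list[str]:
    """Split a comma-separated selector string into clean selectors."""
    return [s.strip() for s in raw.split(",") if s.strip()]

def matches(selector: str, tag: str, attrs: dict) -> bool:
    """Match a tag against a simple tag, .class, or #id selector."""
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    return tag == selector

class SelectorExtractor(HTMLParser):
    """Collect text that sits inside any element matching a selector."""

    def __init__(self, selectors: list[str]):
        super().__init__()
        self.selectors = selectors
        self.depth = 0  # nesting depth inside a matched element; 0 = outside
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif any(matches(s, tag, dict(attrs)) for s in self.selectors):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

html = """
<nav>Home | About</nav>
<article><h1>Pricing</h1><p>Plans start at $10.</p></article>
<div class="sidebar">Ads</div>
<footer>Copyright</footer>
"""
extractor = SelectorExtractor(parse_selectors("article, .main-content, #post-body"))
extractor.feed(html)
print(extractor.chunks)  # ['Pricing', 'Plans start at $10.']
```

The navigation, sidebar, and footer text never reach the output because none of those elements match a selector.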

When to Use Exclude URL Patterns

Exclude patterns prevent specific pages from being indexed (added to your chatbot's knowledge base). Common use cases:

  • Login/admin pages: /login, /admin, /wp-admin
  • Legal pages: /impressum, /privacy-policy, /terms
  • Duplicate content: */amp/*, */print/*
  • Category/tag archives: /category/*, /tag/* (often duplicate content)
  • Query parameters: ?hitsPerPage — excludes all URLs containing this parameter (e.g. search result pages with pagination)

Important details:

  • Patterns are case-sensitive. /About and /about are treated as different patterns. Make sure your patterns match the exact casing used in your URLs.
  • Excluded pages are not indexed, but their links are still followed. The crawler skips only the excluded page itself and continues to discover pages linked from it. For example, excluding a category page like /blog/category/news keeps that listing page out of the index, but the individual blog posts it links to are still crawled and indexed (unless they also match an exclude pattern).
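
The two details above can be sketched in a few lines. The matching rules here are an assumption for illustration only (patterns containing * are treated as case-sensitive globs, all other patterns as plain substrings); is_excluded, crawl_and_index, and the LINKS graph are hypothetical and do not describe WebChatAgent's internals.

```python
from fnmatch import fnmatchcase

def is_excluded(url_path: str, patterns: list[str]) -> bool:
    """Assumed semantics: '*' patterns are case-sensitive globs,
    everything else is a case-sensitive substring match."""
    for pattern in patterns:
        if "*" in pattern:
            if fnmatchcase(url_path, pattern):
                return True
        elif pattern in url_path:
            return True
    return False

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/blog": ["/blog/category/news"],
    "/blog/category/news": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": [],
    "/blog/post-2": [],
}

def crawl_and_index(start: str, patterns: list[str]) -> list[str]:
    """Visit every reachable page, but index only non-excluded ones."""
    indexed, seen, stack = [], {start}, [start]
    while stack:
        page = stack.pop()
        if not is_excluded(page, patterns):
            indexed.append(page)
        # Links are followed even when the page itself is excluded.
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return sorted(indexed)

print(is_excluded("/About", ["/about"]))          # False (case-sensitive)
print(is_excluded("/blog/amp/post", ["*/amp/*"]))  # True
print(crawl_and_index("/blog", ["/blog/category/news"]))
# ['/blog', '/blog/post-1', '/blog/post-2']
```

The category page is missing from the result, yet both posts that are only reachable through it are still indexed.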

Language Considerations

If your website has multilingual content (e.g. /en/about and /de/about), it is strongly recommended to index only one language. Indexing the same content in multiple languages leads to:

  • Duplicate information consuming your page quota unnecessarily
  • Lower retrieval quality as the AI may pull answers from the wrong language
  • Wasted token budget on redundant content

Choose the language your customers most frequently use, or the language that matches your chatbot's primary audience.

Text Data Sources

You can manually enter text content as a data source. This is ideal for FAQs, internal knowledge, or content that doesn't exist on a website.

Adding a Text Document

  1. Click Add Text in the Data Sources tab
  2. Fill in the fields:
     Field | Description | Constraints
     Document Name | A descriptive name for this document. Leave empty for AI-generated name. | Optional
     Category | Organize documents by category. Select an existing category or type a new one. Leave empty for auto-detection. | Optional
     Content | The actual text content the chatbot will use. | Required, max 50,000 characters
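
If you prepare text documents in bulk before pasting them in, a small pre-check against the constraints above can save a failed submission. This is a minimal sketch; TextDocument and validate are hypothetical names, and treating an empty string as "missing" is an assumption about how the form behaves.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CONTENT_CHARS = 50_000  # limit stated in the field table

@dataclass
class TextDocument:
    content: str
    name: Optional[str] = None      # optional; left empty for an AI-generated name
    category: Optional[str] = None  # optional; left empty for auto-detection

def validate(doc: TextDocument) -> list[str]:
    """Return a list of problems; an empty list means the document looks valid."""
    errors = []
    if not doc.content:
        errors.append("Content is required.")
    elif len(doc.content) > MAX_CONTENT_CHARS:
        over = len(doc.content) - MAX_CONTENT_CHARS
        errors.append(f"Content is {over} characters over the 50,000 limit.")
    return errors

print(validate(TextDocument(content="Our return window is 30 days.")))  # []
print(validate(TextDocument(content="")))  # ['Content is required.']
print(validate(TextDocument(content="x" * 50_001)))
# ['Content is 1 characters over the 50,000 limit.']
```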

File Upload

Upload documents directly to your chatbot's knowledge base.

Supported Formats

Format | Description
PDF | Manuals, reports, brochures, whitepapers
DOCX | Word documents
TXT | Plain text files
MD | Markdown documents
XLSX | Spreadsheets

File Size Limits

Plan | Max File Size
Free | 1 MB
Basic | 5 MB
Standard | 10 MB
Premium | 50 MB
Enterprise | Unlimited
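
The format and size rules can be combined into a quick pre-upload check. The sketch below is illustrative only: can_upload and the lowercase plan keys are hypothetical, and it assumes 1 MB means 1,048,576 bytes, which the documentation does not specify.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes

# Per-plan size limits from the table; None means unlimited.
MAX_FILE_SIZE = {
    "free": 1 * MB,
    "basic": 5 * MB,
    "standard": 10 * MB,
    "premium": 50 * MB,
    "enterprise": None,
}

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".xlsx"}

def can_upload(filename: str, size_bytes: int, plan: str) -> bool:
    """Check the file extension and the plan's size limit."""
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        return False
    limit = MAX_FILE_SIZE[plan]
    return limit is None or size_bytes <= limit

print(can_upload("manual.pdf", 3 * MB, "basic"))          # True
print(can_upload("manual.pdf", 3 * MB, "free"))           # False (over 1 MB)
print(can_upload("photo.png", 1_000, "premium"))          # False (unsupported format)
print(can_upload("catalog.xlsx", 500 * MB, "enterprise")) # True
```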

AI Auto-Filename

On paid plans, if you upload a file with a generic name (e.g. document.txt or untitled.pdf), the system automatically generates a descriptive filename based on the file content.

Categories

Categories help you organize your data sources. You can:

  • Assign a category when adding or editing any data source
  • Filter the data source list by category
  • Use categories to group related content (e.g. "Products", "Support", "Legal")

Categories are stored in lowercase and are shared across all data source types.
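
Because categories are stored in lowercase, differently cased inputs end up in the same group regardless of source type. A minimal sketch of that behavior follows; normalize_category and the sample sources are hypothetical, and trimming surrounding whitespace is an added assumption.

```python
from collections import defaultdict

def normalize_category(raw: str) -> str:
    """Lowercase the category; stripping whitespace is an assumption."""
    return raw.strip().lower()

# Hypothetical sources: (name, source type, category as typed by the user).
sources = [
    ("https://example.com/pricing", "website", "Products"),
    ("returns-faq", "text", "support"),
    ("manual.pdf", "file", "SUPPORT "),
]

by_category = defaultdict(list)
for name, source_type, category in sources:
    by_category[normalize_category(category)].append(name)

print(dict(by_category))
# {'products': ['https://example.com/pricing'], 'support': ['returns-faq', 'manual.pdf']}
```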

Data Source List

The data source list shows each source with its current status:

Status | Description
Indexed | Successfully processed and available to the chatbot
Pending | Currently being processed or queued for indexing
Error | Processing failed; check the source URL or file and retry

You can filter the list by:

  • Search — Find sources by name or URL
  • Type — Filter by Website or File
  • Category — Filter by assigned category

Editing Data Sources

  • Website URLs: Click a website source to edit the URL, crawl depth, CSS selectors, exclude patterns, and re-index interval
  • Text documents: Click to edit the name, category, and content
  • File documents: Click to edit the name and category (content cannot be changed — re-upload the file instead)

Re-indexing

You can manually re-index individual data sources to update their content. This is useful when:

  • Website content has changed
  • You want to refresh after fixing CSS selectors or exclude patterns
  • A previously errored source has been corrected

Page Limits

Plan | Max Pages
Free | 50
Basic | 200
Standard | 1,000
Premium | 3,000
Enterprise | 50,000+
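
To estimate which plan a site needs, compare its page count against these limits. The helper below is a rough sketch: smallest_plan_for is a hypothetical name, and Enterprise is treated as the fallback tier because its limit is listed as "50,000+" rather than as a fixed number.

```python
# Fixed page limits from the table, ordered from smallest to largest plan.
FIXED_LIMITS = [
    ("free", 50),
    ("basic", 200),
    ("standard", 1_000),
    ("premium", 3_000),
]

def smallest_plan_for(page_count: int) -> str:
    """Return the smallest plan whose page limit covers page_count."""
    for plan, limit in FIXED_LIMITS:
        if page_count <= limit:
            return plan
    return "enterprise"  # listed as 50,000+; used here as the fallback

print(smallest_plan_for(120))      # basic
print(smallest_plan_for(120 * 2))  # standard
```

The second call shows why indexing one language matters: the same 120-page site crawled in two languages counts as 240 pages and no longer fits the Basic limit.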

Best Practices

  • Keep content clear and well-structured — The AI performs better with organized, readable content
  • Index only relevant pages — Exclude login pages, admin areas, and duplicate content
  • Use CSS selectors to focus on main content and exclude navigation, footers, and ads
  • Monitor the Questions dashboard to identify knowledge gaps and add missing content
  • Update data sources regularly to ensure accurate, current answers
  • Use categories to organize large numbers of data sources
  • Limit to one language when your site has identical content in multiple languages