Data Sources

Data sources provide the knowledge your chatbot uses to answer questions. You can add website URLs, upload files, or enter text directly. All data sources are managed in the Data Sources tab of your chatbot's detail page.

Website URLs

Enter URLs to crawl and index. WebChatAgent automatically crawls the pages and extracts content for your chatbot's knowledge base.

Adding a Website

Click Add Website in the Data Sources tab
Enter the website URL (e.g. example.com — https:// is added automatically)
Configure optional advanced settings (see below)
Click Add to start crawling

Advanced Settings

Field	Description	Default
Crawl Depth	How many link levels deep to crawl. `0` = only this page, empty = unlimited depth.	Empty (unlimited)
CSS Selectors	Comma-separated CSS selectors to target specific content areas (e.g. `article, .content, #main-body`). When set, only content within these selectors is extracted.	Empty (full page)
Exclude URL Patterns	Comma-separated URL patterns to skip during crawling (e.g. `/login, /impressum, /amp/`).	Empty
Auto Re-index Interval	How often to automatically re-crawl and update the content. Options: Every 3 days, Every 7 days, Every 30 days.	Disabled

Auto Re-index is available on Standard plans and above only. Free and Basic plans must re-index manually.

When to Use CSS Selectors

CSS selectors help you extract only the relevant content from a page. This is useful when:

Your pages have navigation menus, footers, or sidebars you want to exclude
You want to focus on the main article content only
The page contains ads or unrelated widgets

Example: To extract only the main content area and skip navigation:

article, .main-content, #post-body

When to Use Exclude URL Patterns

Exclude patterns prevent specific pages from being indexed (added to your chatbot's knowledge base). Common use cases:

Login/admin pages: /login, /admin, /wp-admin
Legal pages: /impressum, /privacy-policy, /terms
Duplicate content: */amp/*, */print/*
Category/tag archives: /category/*, /tag/* (often duplicate content)
Query parameters: ?hitsPerPage — excludes all URLs containing this parameter (e.g. search result pages with pagination)

Important details:

Patterns are case-sensitive. /About and /about are treated as different patterns. Make sure your patterns match the exact casing used in your URLs.
Excluded pages are not indexed, but their links are still followed. The crawler will still discover and crawl pages linked from excluded URLs — only the excluded page itself is skipped for indexing. This means excluding a category page like /blog/category/news will prevent that listing page from being indexed, but the individual blog posts linked from it will still be crawled and indexed (unless they also match an exclude pattern).

Language Considerations

If your website has multilingual content (e.g. /en/about and /de/about), it is strongly recommended to index only one language. Indexing the same content in multiple languages leads to:

Duplicate information consuming your page quota unnecessarily
Lower retrieval quality as the AI may pull answers from the wrong language
Wasted token budget on redundant content

Choose the language your customers most frequently use, or the language that matches your chatbot's primary audience.

Text Data Sources

You can manually enter text content as a data source. This is ideal for FAQs, internal knowledge, or content that doesn't exist on a website.

Adding a Text Document

Click Add Text in the Data Sources tab
Fill in the fields:

Field	Description	Constraints
Document Name	A descriptive name for this document. Leave empty for AI-generated name.	Optional
Category	Organize documents by category. Select an existing category or type a new one. Leave empty for auto-detection.	Optional
Content	The actual text content the chatbot will use.	Required, max 50,000 characters

File Upload

Upload documents directly to your chatbot's knowledge base.

Supported Formats

Format	Description
PDF	Manuals, reports, brochures, whitepapers
DOCX	Word documents
TXT	Plain text files
MD	Markdown documents
XLSX	Spreadsheets

File Size Limits

Plan	Max File Size
Free	1 MB
Basic	5 MB
Standard	10 MB
Premium	50 MB
Enterprise	Unlimited

AI Auto-Filename

On paid plans, if you upload a file with a generic name (e.g. document.txt or untitled.pdf), the system automatically generates a descriptive filename based on the file content.

Data Source List

The data source list shows each source with its current status:

Status	Description
Indexed	Successfully processed and available to the chatbot
Pending	Currently being processed or queued for indexing
Error	Processing failed — check the source URL or file and retry

You can filter the list by:

Search — Find sources by name or URL
Type — Filter by Website or File
Category — Filter by assigned category

Editing Data Sources

Website URLs: Click a website source to edit the URL, crawl depth, CSS selectors, exclude patterns, and re-index interval
Text documents: Click to edit the name, category, and content
File documents: Click to edit the name and category (content cannot be changed — re-upload the file instead)

Re-indexing

You can manually re-index individual data sources to update their content. This is useful when:

Website content has changed
You want to refresh after fixing CSS selectors or exclude patterns
A previously errored source has been corrected

Page Limits

Plan	Max Pages
Free	50
Basic	200
Standard	1,000
Premium	3,000
Enterprise	50,000+

Best Practices

Keep content clear and well-structured — The AI performs better with organized, readable content
Index only relevant pages — Exclude login pages, admin areas, and duplicate content
Use CSS selectors to focus on main content and exclude navigation, footers, and ads
Monitor the Questions dashboard to identify knowledge gaps and add missing content
Update data sources regularly to ensure accurate, current answers
Use categories to organize large numbers of data sources
Limit to one language when your site has identical content in multiple languages