Data Sources
Data sources provide the knowledge your chatbot uses to answer questions. You can add website URLs, upload files, or enter text directly. All data sources are managed in the Data Sources tab of your chatbot's detail page.
Website URLs
Enter URLs to crawl and index. WebChatAgent automatically crawls the pages and extracts content for your chatbot's knowledge base.
Adding a Website
- Click Add Website in the Data Sources tab
- Enter the website URL (e.g. example.com; https:// is added automatically)
- Configure optional advanced settings (see below)
- Click Add to start crawling
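The automatic https:// prefix can be pictured roughly as follows. This is a minimal sketch, not WebChatAgent's actual code: the function name `normalize_url` is hypothetical, and keeping an explicit http:// prefix unchanged is an assumption.

```python
def normalize_url(raw: str) -> str:
    """Prepend https:// when the user typed a bare domain (assumed behavior)."""
    url = raw.strip()
    # Assumption: a URL that already names a scheme is left untouched.
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return url


print(normalize_url("example.com"))           # https://example.com
print(normalize_url("https://example.com"))   # https://example.com
```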
Advanced Settings
| Field | Description | Default |
|---|---|---|
| Crawl Depth | How many link levels deep to crawl. 0 = only this page, empty = unlimited depth. | Empty (unlimited) |
| CSS Selectors | Comma-separated CSS selectors to target specific content areas (e.g. article, .content, #main-body). When set, only content within these selectors is extracted. | Empty (full page) |
| Exclude URL Patterns | Comma-separated URL patterns to skip during crawling (e.g. /login, /impressum, */amp/*). | Empty |
| Auto Re-index Interval | How often to automatically re-crawl and update the content. Options: Every 3 days, Every 7 days, Every 30 days. | Disabled |
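To illustrate what Crawl Depth means, here is a small sketch of a depth-limited, breadth-first crawl over an in-memory link map. The `crawl` function and the `links` dictionary are hypothetical stand-ins for illustration; the real crawler's traversal order is not documented here.

```python
from collections import deque


def crawl(links, start, max_depth=None):
    """Return reachable pages in breadth-first order.

    max_depth=None models an empty Crawl Depth field (unlimited);
    max_depth=0 models "only this page".
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        # Stop following links once the configured depth is reached.
        if max_depth is not None and depth >= max_depth:
            continue
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site: the home page links to /a and /b, and /a links to /a/1.
links = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"], "/a/1": []}
print(crawl(links, "/", 0))  # ['/']
print(crawl(links, "/", 1))  # ['/', '/a', '/b']
print(crawl(links, "/"))     # ['/', '/a', '/b', '/a/1']
```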
When to Use CSS Selectors
CSS selectors help you extract only the relevant content from a page. This is useful when:
- Your pages have navigation menus, footers, or sidebars you want to exclude
- You want to focus on the main article content only
- The page contains ads or unrelated widgets
Example: To extract only the main content area and skip navigation:
article, .main-content, #post-body
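The effect of such a selector list can be sketched with Python's standard library. This is an illustration only: it handles just simple tag, .class, and #id selectors, assumes well-formed HTML with explicit closing tags, and all names (`parse_selectors`, `SelectorTextExtractor`, `extract_text`) are hypothetical rather than part of WebChatAgent.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_selectors(field):
    """Split the comma-separated selector field into a clean list."""
    return [s.strip() for s in field.split(",") if s.strip()]


def matches(selector, tag, attrs):
    """Match a simple selector: '#id', '.class', or a bare tag name."""
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    return tag == selector


class SelectorTextExtractor(HTMLParser):
    def __init__(self, selectors):
        super().__init__()
        self.selectors = selectors
        self.depth = 0      # > 0 while inside a matched element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif any(matches(s, tag, dict(attrs)) for s in self.selectors):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside a matched element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, selector_field):
    parser = SelectorTextExtractor(parse_selectors(selector_field))
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<nav>Menu</nav><article><h1>Title</h1><p>Body text</p></article><footer>Footer</footer>"
print(extract_text(page, "article, .main-content, #post-body"))  # Title Body text
```

The navigation and footer text is dropped because neither element matches any selector in the list.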
When to Use Exclude URL Patterns
Exclude patterns prevent specific pages from being indexed (added to your chatbot's knowledge base). Common use cases:
- Login/admin pages: /login, /admin, /wp-admin
- Legal pages: /impressum, /privacy-policy, /terms
- Duplicate content: */amp/*, */print/*
- Category/tag archives: /category/*, /tag/* (often duplicate content)
- Query parameters: ?hitsPerPage excludes all URLs containing this parameter (e.g. search result pages with pagination)
Important details:
- Patterns are case-sensitive. /About and /about are treated as different patterns. Make sure your patterns match the exact casing used in your URLs.
- Excluded pages are not indexed, but their links are still followed. The crawler still discovers and crawls pages linked from excluded URLs; only the excluded page itself is skipped for indexing. This means excluding a category page like /blog/category/news will prevent that listing page from being indexed, but the individual blog posts linked from it will still be crawled and indexed (unless they also match an exclude pattern).
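The two rules above can be sketched in a few lines. The matching semantics here are an assumption made for illustration: a pattern matches if it occurs anywhere in the URL, and `*` stands for any run of characters. The helper names and the `links` map are hypothetical; the crawler's real matching logic may differ.

```python
import re
from collections import deque


def compile_pattern(pattern):
    """Turn an exclude pattern into a case-sensitive regex ('*' = any characters)."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def parse_patterns(field):
    """Split the comma-separated pattern field and compile each entry."""
    return [compile_pattern(p.strip()) for p in field.split(",") if p.strip()]


def is_excluded(url, patterns):
    return any(p.search(url) for p in patterns)


def crawl_and_index(links, start, patterns):
    """Visit every reachable page, but index only the non-excluded ones."""
    seen, indexed, queue = {start}, [], deque([start])
    while queue:
        url = queue.popleft()
        if not is_excluded(url, patterns):
            indexed.append(url)
        # Links are followed even when the page itself was excluded.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


links = {
    "/blog": ["/blog/category/news"],
    "/blog/category/news": ["/blog/post-1", "/blog/post-2"],
}
print(crawl_and_index(links, "/blog", parse_patterns("/blog/category/*")))
# ['/blog', '/blog/post-1', '/blog/post-2']
```

The category listing is skipped, yet both posts it links to still end up indexed.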
Language Considerations
If your website has multilingual content (e.g. /en/about and /de/about), it is strongly recommended to index only one language. Indexing the same content in multiple languages leads to:
- Duplicate information consuming your page quota unnecessarily
- Lower retrieval quality as the AI may pull answers from the wrong language
- Wasted token budget on redundant content
Choose the language your customers most frequently use, or the language that matches your chatbot's primary audience.
Text Data Sources
You can manually enter text content as a data source. This is ideal for FAQs, internal knowledge, or content that doesn't exist on a website.
Adding a Text Document
- Click Add Text in the Data Sources tab
- Fill in the fields:
| Field | Description | Constraints |
|---|---|---|
| Document Name | A descriptive name for this document. Leave empty for an AI-generated name. | Optional |
| Category | Organize documents by category. Select an existing category or type a new one. Leave empty for auto-detection. | Optional |
| Content | The actual text content the chatbot will use. | Required, max 50,000 characters |
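If you prepare text documents programmatically before pasting them in, a pre-check along these lines can catch constraint violations early. It is a sketch only: `validate_text_document` is a hypothetical helper, and treating whitespace-only content as empty is an assumption.

```python
MAX_CONTENT_CHARS = 50_000  # limit taken from the table above


def validate_text_document(content, name="", category=""):
    """Return a list of problems; an empty list means the document looks valid."""
    errors = []
    # Name and category are optional, so only the content is checked.
    if not content.strip():
        errors.append("Content is required.")
    elif len(content) > MAX_CONTENT_CHARS:
        errors.append(
            f"Content has {len(content)} characters; the maximum is {MAX_CONTENT_CHARS}."
        )
    return errors


print(validate_text_document("Our support hours are 9:00 to 17:00."))  # []
print(validate_text_document(""))  # ['Content is required.']
```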
File Upload
Upload documents directly to your chatbot's knowledge base.
Supported Formats
| Format | Description |
|---|---|
| PDF | Manuals, reports, brochures, whitepapers |
| DOCX | Word documents |
| TXT | Plain text files |
| MD | Markdown documents |
| XLSX | Spreadsheets |
File Size Limits
| Plan | Max File Size |
|---|---|
| Free | 1 MB |
| Basic | 5 MB |
| Standard | 10 MB |
| Premium | 50 MB |
| Enterprise | Unlimited |
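A quick way to check a file against these limits before uploading is sketched below. The plan keys and the `fits_plan_limit` helper are hypothetical, and counting 1 MB as 1,048,576 bytes is an assumption; the product may count 1,000,000 bytes instead.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Limits from the table above; None models "Unlimited".
MAX_FILE_SIZE = {
    "free": 1 * MB,
    "basic": 5 * MB,
    "standard": 10 * MB,
    "premium": 50 * MB,
    "enterprise": None,
}


def fits_plan_limit(plan, size_bytes):
    """Return True if a file of size_bytes is within the plan's upload limit."""
    limit = MAX_FILE_SIZE[plan.lower()]
    return limit is None or size_bytes <= limit


print(fits_plan_limit("Basic", 3 * MB))   # True
print(fits_plan_limit("Free", 3 * MB))    # False
```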
AI Auto-Filename
On paid plans, if you upload a file with a generic name (e.g. document.txt or untitled.pdf), the system automatically generates a descriptive filename based on the file content.
Categories
Categories help you organize your data sources. You can:
- Assign a category when adding or editing any data source
- Filter the data source list by category
- Use categories to group related content (e.g. "Products", "Support", "Legal")
Categories are stored in lowercase and are shared across all data source types.
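Because categories are stored in lowercase, "Support" and "support" end up in the same group. The sketch below illustrates that effect; the helper names and the dictionary shape are hypothetical, and trimming surrounding whitespace is an assumption.

```python
from collections import defaultdict


def normalize_category(name):
    """Lowercase the category (whitespace trimming is an assumption)."""
    return name.strip().lower()


def group_by_category(sources):
    """Group source names by their normalized category."""
    groups = defaultdict(list)
    for source in sources:
        groups[normalize_category(source["category"])].append(source["name"])
    return dict(groups)


sources = [
    {"name": "pricing.pdf", "category": "Products"},
    {"name": "https://example.com/help", "category": "Support"},
    {"name": "Refund FAQ", "category": "support"},
]
print(group_by_category(sources))
# {'products': ['pricing.pdf'], 'support': ['https://example.com/help', 'Refund FAQ']}
```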
Data Source List
The data source list shows each source with its current status:
| Status | Description |
|---|---|
| Indexed | Successfully processed and available to the chatbot |
| Pending | Currently being processed or queued for indexing |
| Error | Processing failed — check the source URL or file and retry |
You can filter the list by:
- Search — Find sources by name or URL
- Type — Filter by Website or File
- Category — Filter by assigned category
Editing Data Sources
- Website URLs: Click a website source to edit the URL, crawl depth, CSS selectors, exclude patterns, and re-index interval
- Text documents: Click to edit the name, category, and content
- File documents: Click to edit the name and category (content cannot be changed — re-upload the file instead)
Re-indexing
You can manually re-index individual data sources to update their content. This is useful when:
- Website content has changed
- You want to refresh after fixing CSS selectors or exclude patterns
- A previously errored source has been corrected
Page Limits
| Plan | Max Pages |
|---|---|
| Free | 50 |
| Basic | 200 |
| Standard | 1,000 |
| Premium | 3,000 |
| Enterprise | 50,000+ |
Best Practices
- Keep content clear and well-structured — The AI performs better with organized, readable content
- Index only relevant pages — Exclude login pages, admin areas, and duplicate content
- Use CSS selectors to focus on main content and exclude navigation, footers, and ads
- Monitor the Questions dashboard to identify knowledge gaps and add missing content
- Update data sources regularly to ensure accurate, current answers
- Use categories to organize large numbers of data sources
- Limit to one language when your site has identical content in multiple languages
