Data Sources
Data sources provide the knowledge your chatbot uses to answer questions. You can add website URLs, upload files, enter text directly, or connect external platforms like Confluence and Notion. All data sources are managed in the Data Sources tab of your chatbot's detail page.
Website URLs
Enter URLs to crawl and index. WebChatAgent automatically crawls the pages and extracts content for your chatbot's knowledge base.
Adding a Website
- Click Add Website in the Data Sources tab
- Enter the website URL (e.g. example.com — https:// is added automatically)
- Configure optional advanced settings (see below)
- Click Add to start crawling
Advanced Settings
| Field | Description | Default |
|---|---|---|
| Crawl Depth | How many link levels deep to crawl. 0 = only this page, empty = unlimited depth. | Empty (unlimited) |
| CSS Selectors | Comma-separated CSS selectors to target specific content areas (e.g. article, .content, #main-body). When set, only content within these selectors is extracted. | Empty (full page) |
| Exclude URL Patterns | Comma-separated URL patterns to skip during crawling (e.g. /login, /impressum, */amp/*). | Empty |
| Auto Re-index Interval | How often to automatically re-crawl and update the content. Options: Every 3 days, Every 7 days, Every 30 days. | Disabled |
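Conceptually, the Crawl Depth setting caps a breadth-first traversal of the site's link graph. The sketch below is illustrative only, not WebChatAgent's actual crawler: `get_links` stands in for fetching and parsing a page, and `None` models the "empty = unlimited" setting.

```python
from collections import deque

def crawl(start_url, get_links, max_depth=None):
    """Breadth-first crawl. max_depth=0 indexes only the start page;
    max_depth=None (the "empty" setting) means unlimited depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, link distance from the start page)
    indexed = []
    while queue:
        url, depth = queue.popleft()
        indexed.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links past the configured depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return indexed
```

With a depth of 0 only the entered page is indexed; each extra level follows one more hop of links.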
When to Use CSS Selectors
CSS selectors help you extract only the relevant content from a page. This is useful when:
- Your pages have navigation menus, footers, or sidebars you want to exclude
- You want to focus on the main article content only
- The page contains ads or unrelated widgets
Example: To extract only the main content area and skip navigation:
article, .main-content, #post-body
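Selector-scoped extraction drops everything outside the matched elements. The standard-library sketch below illustrates the idea for a bare tag selector like article only; class and ID selectors (.main-content, #post-body) would need a full CSS engine, which is assumed rather than shown, and this is not WebChatAgent's actual extractor.

```python
from html.parser import HTMLParser

class ArticleTextExtractor(HTMLParser):
    """Collects text only while inside <article> elements — the same idea
    a CSS-selector filter applies: content outside the selector is dropped."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting level of <article> tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def extract_article_text(html: str) -> str:
    parser = ArticleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Navigation, footer, and sidebar text never enters the knowledge base because it sits outside the matched element.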
When to Use Exclude URL Patterns
Exclude patterns prevent specific pages from being indexed (added to your chatbot's knowledge base). Common use cases:
- Login/admin pages: /login, /admin, /wp-admin
- Legal pages: /impressum, /privacy-policy, /terms
- Duplicate content: */amp/*, */print/*
- Category/tag archives: /category/*, /tag/* (often duplicate content)
- Query parameters: ?hitsPerPage — excludes all URLs containing this parameter (e.g. search result pages with pagination)
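The matching semantics sketched below are an illustrative guess, not WebChatAgent's documented behavior: plain patterns match as substrings of the URL's path and query, and patterns containing * are matched as case-sensitive globs against the path.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

def is_excluded(url: str, patterns: list[str]) -> bool:
    """Return True if the URL matches any exclude pattern.

    Assumed semantics (a sketch, not the product's exact rules):
    - patterns containing * are matched as globs against the URL path
    - all other patterns match as substrings of the path + query string
    fnmatchcase keeps matching case-sensitive on every platform.
    """
    parts = urlsplit(url)
    path_and_query = parts.path + ("?" + parts.query if parts.query else "")
    for pattern in patterns:
        if "*" in pattern:
            if fnmatchcase(parts.path, pattern):
                return True
        elif pattern in path_and_query:
            return True
    return False
```

Under these rules /login catches the login page, */amp/* catches AMP duplicates anywhere in the path, and ?hitsPerPage catches paginated search results.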
Important details:
- Patterns are case-sensitive. /About and /about are treated as different patterns, so make sure your patterns match the exact casing used in your URLs.
- Excluded pages are not indexed, but their links are still followed. The crawler still discovers and crawls pages linked from excluded URLs — only the excluded page itself is skipped. For example, excluding a category page like /blog/category/news prevents that listing page from being indexed, but the individual blog posts linked from it are still crawled and indexed (unless they also match an exclude pattern).
Language Considerations
If your website has multilingual content (e.g. /en/about and /de/about), it is strongly recommended to index only one language. Indexing the same content in multiple languages leads to:
- Duplicate information consuming your page quota unnecessarily
- Lower retrieval quality as the AI may pull answers from the wrong language
- Wasted token budget on redundant content
Choose the language your customers most frequently use, or the language that matches your chatbot's primary audience.
Text Data Sources
You can manually enter text content as a data source. This is ideal for FAQs, internal knowledge, or content that doesn't exist on a website.
Adding a Text Document
- Click Add Text in the Data Sources tab
- Fill in the fields:
| Field | Description | Constraints |
|---|---|---|
| Document Name | A descriptive name for this document. Leave empty for AI-generated name. | Optional |
| Category | Organize documents by category. Select an existing category or type a new one. Leave empty for auto-detection. | Optional |
| Content | The actual text content the chatbot will use. | Required, max 50,000 characters |
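A client-side check against the Content constraint from the table can catch errors before submitting; the error messages below are illustrative, only the 50,000-character cap comes from the docs.

```python
MAX_CONTENT_CHARS = 50_000  # limit stated in the table above

def validate_content(content: str) -> list[str]:
    """Return a list of validation errors (empty list = valid)."""
    errors = []
    if not content.strip():
        errors.append("Content is required.")
    elif len(content) > MAX_CONTENT_CHARS:
        errors.append(f"Content exceeds {MAX_CONTENT_CHARS:,} characters.")
    return errors
```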
File Upload
Upload documents directly to your chatbot's knowledge base.
Supported Formats
| Format | Description |
|---|---|
| PDF | Manuals, reports, brochures, whitepapers |
| DOCX | Word documents |
| TXT | Plain text files |
| MD | Markdown documents |
| XLSX | Spreadsheets |
File Size Limits
| Plan | Max File Size |
|---|---|
| Free | 1 MB |
| Basic | 5 MB |
| Standard | 10 MB |
| Premium | 50 MB |
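Checking a file against these limits is a simple lookup; this sketch uses the values from the table and treats Enterprise's "Unlimited" as no cap.

```python
# MB limits per plan, taken from the table above; None = unlimited (Enterprise).
MAX_FILE_SIZE_MB = {
    "free": 1,
    "basic": 5,
    "standard": 10,
    "premium": 50,
    "enterprise": None,
}

def within_file_size_limit(plan: str, size_bytes: int) -> bool:
    """Return True if a file of size_bytes may be uploaded on the given plan."""
    limit_mb = MAX_FILE_SIZE_MB[plan.lower()]
    return limit_mb is None or size_bytes <= limit_mb * 1024 * 1024
```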
| Enterprise | Unlimited |
AI Auto-Filename
On paid plans, if you upload a file with a generic name (e.g. document.txt or untitled.pdf), the system automatically generates a descriptive filename based on the file content.
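What counts as a "generic name" is not documented; a plausible heuristic might look like the following, where the stem list and the pattern are pure assumptions for illustration.

```python
import re

# Illustrative heuristic only — the product's actual detection rules
# are not documented. Matches stems like "document", "untitled_2", "scan-001".
GENERIC_STEM = re.compile(
    r"(document|untitled|file|scan|new[ _-]?document)[ _-]?\d*",
    re.IGNORECASE,
)

def is_generic_filename(filename: str) -> bool:
    """Return True if the filename stem looks like a placeholder name."""
    stem = filename.rsplit(".", 1)[0]
    return bool(GENERIC_STEM.fullmatch(stem))
```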
Confluence Cloud
Connect your Confluence Cloud instance to import wiki pages directly into your chatbot's knowledge base.
Adding a Confluence Source
- Click Add Data Source and select Confluence Cloud
- Enter your Confluence URL (e.g. https://yourcompany.atlassian.net)
- Enter the email associated with your Atlassian account
- Enter an API token from Atlassian API Tokens
- Click Test Connection — available spaces will be loaded
- Select the spaces to index (leave empty to import all accessible spaces)
- Configure the sync interval and click Create & Index
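Under the hood, the Test Connection step amounts to an authenticated call against the public Confluence Cloud REST API. The sketch below uses only the standard library; GET /wiki/rest/api/space with Basic auth (email + API token) is Atlassian's documented endpoint, but how WebChatAgent actually performs the check is an assumption.

```python
import base64
import json
from urllib import request

def basic_auth_header(email: str, api_token: str) -> str:
    """Confluence Cloud uses HTTP Basic auth with email:api_token."""
    return "Basic " + base64.b64encode(f"{email}:{api_token}".encode()).decode()

def list_spaces(base_url: str, email: str, api_token: str) -> list[str]:
    """Test the connection by listing space keys (GET /wiki/rest/api/space)."""
    req = request.Request(
        f"{base_url.rstrip('/')}/wiki/rest/api/space?limit=100",
        headers={
            "Authorization": basic_auth_header(email, api_token),
            "Accept": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return [space["key"] for space in data["results"]]
```

If the call returns spaces, the credentials are valid and the space picker can be populated.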
Editing a Confluence Source
After adding a Confluence source, click the edit icon to change:
- Space selection — Add or remove spaces to index
- Include child pages — Whether to include nested pages
- Re-indexing interval — How often to automatically sync
Changes take effect on the next re-index.
Incremental Sync
Confluence sources support incremental sync. Only pages that have changed since the last sync are re-indexed, making updates fast and efficient.
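Incremental sync reduces to comparing each page's last-modified timestamp against the previous sync time; a minimal sketch, where the shape of the page records is illustrative:

```python
from datetime import datetime, timezone

def pages_to_reindex(pages: list[dict], last_sync: datetime) -> list[dict]:
    """Incremental sync: re-index only pages modified after the last sync."""
    return [p for p in pages if p["last_modified"] > last_sync]
```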
Notion
Connect Notion to import pages and databases into your chatbot's knowledge base.
Adding a Notion Source
- Click Add Data Source and select Notion
- Create an Internal Integration at Notion Integrations
- Share the desired pages or databases with your integration in Notion
- Enter the integration token (starts with ntn_ or secret_)
- Click Test Connection — available databases will be loaded
- Select the databases to index (leave empty to import all accessible pages)
- Configure the sync interval and click Create & Index
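The token prefixes and the Test Connection step can be sketched against Notion's public API. POST /v1/search with a Notion-Version header is Notion's documented endpoint for listing shared content; WebChatAgent's exact usage is an assumption.

```python
import json
from urllib import request

def looks_like_notion_token(token: str) -> bool:
    """Notion internal-integration tokens start with ntn_ (newer) or secret_ (older)."""
    return token.startswith(("ntn_", "secret_"))

def list_databases(token: str) -> list[str]:
    """List database ids shared with the integration via POST /v1/search."""
    body = json.dumps({"filter": {"property": "object", "value": "database"}}).encode()
    req = request.Request(
        "https://api.notion.com/v1/search",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return [item["id"] for item in json.load(resp)["results"]]
```

Only pages and databases explicitly shared with the integration appear in the search results, which is why step 3 above is required.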
Editing a Notion Source
After adding a Notion source, click the edit icon to change:
- Database selection — Add or remove databases to index
- Include child pages — Whether to include nested pages
- Re-indexing interval — How often to automatically sync
Incremental Sync
Notion sources support incremental sync. Only pages modified since the last sync are re-processed.
Categories
Categories help you organize your data sources. You can:
- Assign a category when adding or editing any data source
- Filter the data source list by category
- Use categories to group related content (e.g. "Products", "Support", "Legal")
Categories are stored in lowercase and are shared across all data source types.
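Because categories are stored in lowercase, a normalization step like the one below keeps "Products" and "products" from becoming two separate categories. Only the lowercasing is documented; the whitespace trimming and collapsing are added assumptions.

```python
def normalize_category(raw: str) -> str:
    """Lowercase a category name (documented) and collapse stray
    whitespace (an assumed extra cleanup step)."""
    return " ".join(raw.split()).lower()
```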
Data Source List
The data source list shows each source with its current status:
| Status | Description |
|---|---|
| Indexed | Successfully processed and available to the chatbot |
| Pending | Currently being processed or queued for indexing |
| Error | Processing failed — check the source URL or file and retry |
You can filter the list by:
- Search — Find sources by name or URL
- Type — Filter by Website, File, Confluence, or Notion
- Category — Filter by assigned category
Editing Data Sources
- Website URLs: Click the edit icon to change URL, crawl depth, CSS selectors, exclude patterns, and re-index interval
- Text documents: Click the edit icon to change the name, category, and content
- File documents: Click the edit icon to change the name and category (content cannot be changed — re-upload the file instead)
- Confluence sources: Click the edit icon to change space selection, child page inclusion, and sync interval
- Notion sources: Click the edit icon to change database selection, child page inclusion, and sync interval
Re-indexing
You can manually re-index individual data sources to update their content. This is useful when:
- Website content has changed
- You want to refresh after fixing CSS selectors or exclude patterns
- A previously errored source has been corrected
Page Limits
| Plan | Max Pages |
|---|---|
| Free | 50 |
| Basic | 200 |
| Standard | 1,000 |
| Premium | 3,000 |
| Enterprise | 50,000+ |
Best Practices
- Keep content clear and well-structured — The AI performs better with organized, readable content
- Index only relevant pages — Exclude login pages, admin areas, and duplicate content
- Use CSS selectors to focus on main content and exclude navigation, footers, and ads
- Monitor the Questions dashboard to identify knowledge gaps and add missing content
- Update data sources regularly to ensure accurate, current answers
- Use categories to organize large numbers of data sources
- Limit to one language when your site has identical content in multiple languages
