The InsiteChat website crawler trains your AI chatbot on any URL by automatically extracting, chunking, and indexing accessible pages. Unlike basic single-page crawlers, InsiteChat supports CSS selectors (extract only main content, skip footer and nav), URL path filters, recursion depth control, custom HTTP headers, and sitemap.xml input, so you index exactly the content you want and nothing else.
Add a website source
Enter your URL
Enter your website’s root URL (e.g., https://example.com) or a sitemap URL (e.g., https://example.com/sitemap.xml).
Configure advanced options (optional)
Click Advanced options to configure crawling behavior; see below for details.
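When the source is a sitemap URL, the pages to crawl come from the sitemap's `<loc>` entries rather than from link discovery. The sketch below shows how such a file can be read, assuming a standard sitemaps.org `urlset` document; `urls_from_sitemap` is a hypothetical helper for illustration, not InsiteChat's implementation, and it does not handle sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the page URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Sample sitemap (illustrative data only).
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
```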
Advanced crawling options
Click Advanced options when adding a URL source to access these controls:
Recursion depth
Control how deep the crawler follows links from your starting URL.
- 0 (unlimited): Crawls all discoverable pages (default)
- 1: Only the starting page
- 2: Starting page + pages linked from it
- 3–5: Progressively deeper link following
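The depth values above can be read as levels in a breadth-first walk of your site's links. The following is a minimal sketch of that reading, assuming the starting page counts as level 1; `crawl_order` and the in-memory `site` link graph are hypothetical and stand in for real page fetching.

```python
from collections import deque


def crawl_order(links: dict[str, list[str]], start: str, depth: int = 0) -> list[str]:
    """Breadth-first walk of a link graph, honoring a recursion depth.

    depth=0 means unlimited, depth=1 is the starting page only,
    depth=2 adds the pages linked from it, and so on.
    """
    seen = {start}
    order = []
    queue = deque([(start, 1)])  # (url, level); the starting page is level 1
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if depth and level >= depth:
            continue  # depth reached: do not follow links from this page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


# Toy link graph (illustrative data only).
site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup"],
    "/blog": ["/"],
    "/docs/setup": [],
}

print(crawl_order(site, "/", depth=2))
```

With this toy graph, a depth of 1 visits only `/`, a depth of 2 adds `/docs` and `/blog`, and 0 (unlimited) also reaches `/docs/setup`.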
URL path filters
Control which pages get crawled based on their URL path. Include paths: only pages matching these patterns will be crawled.
CSS selectors
Extract content from specific parts of your pages, or exclude unwanted sections. Include selectors: only content from these elements will be extracted.
If you specify include selectors, only those elements are extracted. If you specify only exclude selectors, everything except those elements is extracted. If both are specified, includes take priority.
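The include/exclude precedence can be illustrated with a small extractor. This is a simplified sketch, not the crawler's actual logic: it matches bare tag names only (such as `main` or `footer`) rather than full CSS selectors, and it assumes "includes take priority" means exclude selectors are ignored whenever include selectors are present. `TagFilterExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagFilterExtractor(HTMLParser):
    """Collect text, filtered by include/exclude tag names (not full CSS)."""

    def __init__(self, include=(), exclude=()):
        super().__init__()
        self.include = set(include)
        # Assumption: when include selectors exist, excludes are ignored.
        self.exclude = set() if self.include else set(exclude)
        self.open_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        ancestors = set(self.open_tags)
        if self.include:
            keep = bool(ancestors & self.include)
        else:
            keep = not (ancestors & self.exclude)
        if keep:
            self.chunks.append(text)


def extract_text(html: str, include=(), exclude=()) -> str:
    parser = TagFilterExtractor(include, exclude)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Sample page (illustrative data only).
page = ("<html><body><nav>Menu</nav><main><h1>Docs</h1><p>Hello</p></main>"
        "<footer>Legal</footer></body></html>")

print(extract_text(page, include=["main"]))
print(extract_text(page, exclude=["nav", "footer"]))
```

On this sample page, both calls keep only the text inside `main`: the first by selecting it, the second by dropping `nav` and `footer`.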
Custom HTTP headers
Add custom headers for accessing protected or authenticated content, in the format Header-Name: value.
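As an illustration of the `Header-Name: value` format, the sketch below turns header text into a name-to-value mapping. It assumes one header per line, which this page does not confirm; `parse_headers` is a hypothetical helper and the header values are placeholders.

```python
def parse_headers(text: str) -> dict[str, str]:
    """Parse 'Header-Name: value' lines into a dict, skipping blank lines."""
    headers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons (URLs).
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'Header-Name: value', got {line!r}")
        headers[name.strip()] = value.strip()
    return headers


# Placeholder values for illustration only.
example = "Authorization: Bearer abc123\nReferer: https://example.com"
print(parse_headers(example))
```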
Auto-sync frequency
Set how often this source automatically re-syncs:
| Option | Minimum Plan Required |
|---|---|
| Never (manual only) | All plans |
| Monthly | Starter |
| Weekly | Growth |
| Daily | Scale |
You cannot select a frequency higher than your plan allows. For example, Growth plan users can choose never, monthly, or weekly — but not daily.
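The plan gating in the table can be expressed as a simple lookup. This sketch assumes the plans are ordered Free, Starter, Growth, Scale and that "All plans" means Free and up; `allowed_frequencies` is a hypothetical helper, not part of any InsiteChat API.

```python
# Assumed plan order, lowest to highest.
PLAN_ORDER = ["free", "starter", "growth", "scale"]

# Minimum plan for each auto-sync frequency, taken from the table above.
MIN_PLAN = {
    "never": "free",
    "monthly": "starter",
    "weekly": "growth",
    "daily": "scale",
}


def allowed_frequencies(plan: str) -> list[str]:
    """Return the auto-sync frequencies selectable on the given plan."""
    rank = PLAN_ORDER.index(plan.lower())
    return [
        freq for freq, min_plan in MIN_PLAN.items()
        if PLAN_ORDER.index(min_plan) <= rank
    ]


print(allowed_frequencies("Growth"))
```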
Bulk source management
The Sources page includes tools for managing multiple sources at once:
- Search: Filter sources by title or URL
- Status filter: Show only pending, crawling, processing, done, or failed sources
- Bulk select: Check multiple sources with checkboxes
- Bulk re-sync: Re-sync all selected sources at once
- Bulk delete: Delete all selected sources at once
What gets indexed
The crawler captures product pages, blog posts, FAQs, and informational pages. It extracts the main text content from each page (or only the content matching your CSS selectors, if configured).
Limitations
- Pages blocked by robots.txt are not crawled
- Maximum crawl time per source: 10 minutes
- Pages with fewer than 50 characters of text are skipped
- Duplicate content (same text on multiple URLs) is automatically deduplicated
- Crawl page limits depend on your plan (Free: 30, Starter: 200, Growth: 1,000, Scale: 5,000)
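Three of these limits (the 50-character minimum, deduplication, and the per-plan page cap) can be sketched as a single filter. The details here are assumptions: text is compared after whitespace normalization, duplicates are detected by an exact hash, and the cap counts kept pages. `select_pages` is a hypothetical helper, and robots.txt and the 10-minute limit are not modeled.

```python
import hashlib

# Per-plan page limits from the list above.
PAGE_LIMITS = {"free": 30, "starter": 200, "growth": 1000, "scale": 5000}
MIN_CHARS = 50


def select_pages(pages: list[tuple[str, str]], plan: str) -> list[str]:
    """Return the URLs kept from (url, text) pairs given in crawl order."""
    limit = PAGE_LIMITS[plan.lower()]
    seen_hashes = set()
    kept = []
    for url, text in pages:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if len(normalized) < MIN_CHARS:
            continue  # too little text to index
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # same text already indexed under another URL
        seen_hashes.add(digest)
        kept.append(url)
        if len(kept) >= limit:
            break  # plan's page limit reached
    return kept


# Illustrative data only.
pages = [
    ("/a", "x" * 60),
    ("/b", "short"),
    ("/c", "x" * 60),  # duplicate of /a
    ("/d", "y" * 50),
]
print(select_pages(pages, "free"))
```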
