Website Crawler — Train AI Chatbot on Any Site

The InsiteChat website crawler trains your AI chatbot on any URL by automatically extracting, chunking, and indexing accessible pages. Unlike basic single-page crawlers, InsiteChat supports CSS selectors (extract only main content, skip footer and nav), URL path filters, recursion depth control, custom HTTP headers, and sitemap.xml input — so you index exactly the content you want and nothing else.

Add a website source

Open your dashboard

Go to Sources

Click the Sources tab, then click Add source.

Choose URL

Select URL as the source type.

Enter your URL

Enter your website’s root URL (e.g., https://example.com) or a sitemap URL (e.g., https://example.com/sitemap.xml).

Configure advanced options (optional)

Click Advanced options to configure crawling behavior — see below for details.

Add source

Click Add source. InsiteChat begins indexing your site automatically.

Advanced crawling options

Click Advanced options when adding a URL source to access these controls:

Recursion depth

Control how deep the crawler follows links from your starting URL.

0 (unlimited) — Crawls all discoverable pages (default)
1 — Only the starting page
2 — Starting page + pages linked from it
3–5 — Progressively deeper link following

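Use depth 1 to index a single page, or depth 2 to also index the pages it links to (e.g., your pricing page and the pages linked from it). Use 0 for full-site indexing, subject to the limits listed under Limitations.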
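The depth values above can be illustrated with a breadth-first walk over a toy link graph. This is a minimal sketch of the documented semantics, not InsiteChat's implementation; the function name `crawl_order` and the in-memory `SITE` graph are hypothetical stand-ins for real page fetches.

```python
from collections import deque

def crawl_order(links, start, depth):
    """Return pages in breadth-first order under the documented depth rules.

    depth == 0 means unlimited, depth == 1 means only the starting page,
    depth == 2 adds the pages linked from it, and so on.
    `links` is a hypothetical link graph: page -> list of linked pages.
    """
    seen = {start}
    order = []
    queue = deque([(start, 1)])  # the starting page sits at level 1
    while queue:
        page, level = queue.popleft()
        order.append(page)
        if depth != 0 and level >= depth:
            continue  # depth budget reached: do not follow this page's links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order

# Toy site: the home page links to pricing and blog; blog links to one post.
SITE = {
    "/": ["/pricing", "/blog"],
    "/blog": ["/blog/post-1"],
}
```

With this toy graph, depth 1 yields only `/`, depth 2 adds `/pricing` and `/blog`, and depth 0 (or 3) also reaches `/blog/post-1`.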

URL path filters

Control which pages get crawled based on their URL path.

Include paths (only pages matching these patterns are crawled):

/blog
/docs
/help

Exclude paths (pages matching these patterns are skipped):

/blog/drafts
/internal
/admin

Patterns match as substrings in the URL path. One pattern per line.
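Substring matching on the URL path can be sketched as below. The function name `should_crawl` is hypothetical, and two behaviors are assumptions not stated in this page: an exclude match wins over an include match, and an empty include list allows every page.

```python
from urllib.parse import urlparse

def should_crawl(url, include=(), exclude=()):
    """Decide whether a URL passes the path filters (illustrative only)."""
    path = urlparse(url).path
    # Assumption: an exclude pattern overrides any include pattern.
    if any(pattern in path for pattern in exclude):
        return False
    # Assumption: with no include patterns, every remaining page is allowed.
    if include:
        return any(pattern in path for pattern in include)
    return True

INCLUDE = ["/blog", "/docs", "/help"]
EXCLUDE = ["/blog/drafts", "/internal", "/admin"]
```

Because patterns match as substrings, `/blog` also matches a path such as `/blogroll`. Use a longer pattern like `/blog/` if you need a tighter match.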

CSS selectors

Extract content from specific parts of your pages, or exclude unwanted sections.

Include selectors (only content from these elements is extracted):

main
.content
#article

Exclude selectors (content from these elements is removed):

footer
.sidebar
#comments
nav

If you specify include selectors, only those elements are extracted. If you only specify exclude selectors, everything except those elements is extracted. If both are specified, includes take priority.
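The include and exclude rules above can be sketched with Python's standard `html.parser`. This is a simplified illustration, not InsiteChat's extractor: it supports only bare tag, `.class`, and `#id` selectors, expects well-formed HTML, and the names `SelectorExtractor` and `extract_text` are hypothetical. It also assumes that "includes take priority" means exclude selectors are ignored whenever include selectors are set; the product may behave differently.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must never be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

def matches(selector, tag, attrs):
    """Match a bare tag, .class, or #id selector (deliberately simplified)."""
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return tag == selector.lower()

class SelectorExtractor(HTMLParser):
    def __init__(self, include=(), exclude=()):
        super().__init__()
        self.include, self.exclude = list(include), list(exclude)
        self.stack = []   # one (is_included, is_excluded) pair per open element
        self.chunks = []  # kept text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        inc = any(matches(s, tag, attrs) for s in self.include)
        exc = any(matches(s, tag, attrs) for s in self.exclude)
        self.stack.append((inc, exc))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.include:
            # Assumption: excludes are ignored when includes are set.
            keep = any(inc for inc, _ in self.stack)
        else:
            keep = not any(exc for _, exc in self.stack)
        if keep:
            self.chunks.append(text)

def extract_text(html, include=(), exclude=()):
    parser = SelectorExtractor(include, exclude)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

PAGE = ('<html><body><nav>Menu</nav><main><h1>Title</h1>'
        '<p class="content">Body text</p></main>'
        '<footer>Copyright</footer></body></html>')
```

On this sample page, including `main` and excluding `nav` plus `footer` produce the same text, while no selectors at all keeps the menu and copyright line as well.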

Custom HTTP headers

Add custom headers for accessing protected or authenticated content:

Authorization: Bearer your-api-token
Cookie: session=abc123
User-Agent: MyBot/1.0

One header per line, format: Header-Name: value.

Be careful with authentication tokens. Only use this for content you own or have permission to access.
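The one-header-per-line format described above can be parsed roughly as follows. The function name `parse_headers` is hypothetical, and how InsiteChat handles blank or malformed lines is not documented here; this sketch skips blank lines and rejects lines without a colon.

```python
def parse_headers(text):
    """Parse 'Header-Name: value' lines into a dict (illustrative only)."""
    headers = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Split on the first colon only, so values may contain colons (e.g. URLs).
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"line {line_no}: expected 'Header-Name: value'")
        headers[name.strip()] = value.strip()
    return headers

EXAMPLE = """Authorization: Bearer your-api-token
Cookie: session=abc123
User-Agent: MyBot/1.0"""
```

Checking the text this way before you paste it into the field can catch a missing colon or an accidentally merged line.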

Auto-sync frequency

Set how often this source automatically re-syncs:

Frequency	Minimum plan required
Never (manual only)	All plans
Monthly	Starter
Weekly	Growth
Daily	Scale

You cannot select a frequency higher than your plan allows. For example, Growth plan users can choose never, monthly, or weekly — but not daily.
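The plan gating in the table can be expressed as a small lookup. This is an illustrative sketch only: the names `PLAN_ORDER`, `MIN_PLAN`, and `allowed_frequencies` are hypothetical, and the ordering Free < Starter < Growth < Scale is inferred from the table and the plan limits listed on this page.

```python
# Plans from lowest to highest tier (ordering inferred, not an official API).
PLAN_ORDER = ["Free", "Starter", "Growth", "Scale"]

# Minimum plan for each sync frequency, taken from the table above.
# "never" is available on all plans, so its minimum is the lowest tier.
MIN_PLAN = {
    "never": "Free",
    "monthly": "Starter",
    "weekly": "Growth",
    "daily": "Scale",
}

def allowed_frequencies(plan):
    """Return the sync frequencies a given plan may select."""
    rank = PLAN_ORDER.index(plan)
    return [freq for freq, min_plan in MIN_PLAN.items()
            if PLAN_ORDER.index(min_plan) <= rank]
```

For the Growth plan this returns never, monthly, and weekly, matching the example above.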

Bulk source management

The Sources page includes tools for managing multiple sources at once:

Search — Filter sources by title or URL
Status filter — Show only pending, crawling, processing, done, or failed sources
Bulk select — Check multiple sources with checkboxes
Bulk re-sync — Re-sync all selected sources at once
Bulk delete — Delete all selected sources at once

What gets indexed

The crawler captures product pages, blog posts, FAQs, and informational pages. It extracts the main text content from each page (or only the content matching your CSS selectors, if configured).

The crawler does not process images or embedded videos. Only text content is indexed. Pages behind authentication are not crawled unless you provide custom headers.

Limitations

Pages blocked by robots.txt are not crawled
Maximum crawl time per source: 10 minutes
Pages with fewer than 50 characters of text are skipped
Duplicate content (the same text on multiple URLs) is indexed only once
Crawl page limits depend on your plan (Free: 30, Starter: 200, Growth: 1,000, Scale: 5,000)
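To estimate how these limits interact, the sketch below applies the minimum-length rule, deduplication, and the plan page limit to a list of already-fetched pages. It is a rough model under stated assumptions, not the crawler's actual logic: the name `select_pages` is hypothetical, duplicates are detected by hashing the exact text, whitespace is trimmed before counting characters, and the page limit is assumed to count indexed pages.

```python
import hashlib

# Page limits per plan, as listed above.
PAGE_LIMITS = {"Free": 30, "Starter": 200, "Growth": 1000, "Scale": 5000}
MIN_TEXT_CHARS = 50  # pages with fewer characters are skipped

def select_pages(pages, plan):
    """Return the URLs that would be indexed from (url, text) pairs."""
    limit = PAGE_LIMITS[plan]
    seen_digests = set()
    kept = []
    for url, text in pages:
        if len(kept) >= limit:
            break  # assumption: the limit counts indexed pages
        text = text.strip()  # assumption: surrounding whitespace is not counted
        if len(text) < MIN_TEXT_CHARS:
            continue
        # Assumption: exact-text hashing stands in for the real dedup method.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        kept.append(url)
    return kept

SAMPLE = [
    ("/a", "x" * 60),   # kept
    ("/b", "short"),    # skipped: fewer than 50 characters
    ("/c", "x" * 60),   # skipped: same text as /a
    ("/d", "y" * 50),   # kept: exactly 50 characters
]
```

On the sample input, only `/a` and `/d` survive on any plan.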
