Monitoring Sitemap of a Website Using a Crawler

Sitemap monitoring is currently in alpha and available in the Professional plan and above.

Introduction

A sitemap is an essential tool that helps search engines understand the structure of a website and crawl its content. It usually contains all URLs of a website, including those that may not be easily discoverable by search engines or visitors.

There are different formats of sitemaps, but the most common and widely supported format is the XML sitemap. However, XML sitemaps are not always updated regularly or generated automatically, so monitoring a website's sitemap file may not reflect the site's current state, and you may miss new or updated content. To overcome this limitation, you can use Distill’s crawler to index all URLs on a website. A crawler navigates through a website, discovering and indexing pages just like a search engine would. This ensures that all URLs are extracted from the website, including hidden or dynamically generated pages.

Sitemap Monitoring with Distill: How It Works

You can keep track of URLs belonging to a particular website using sitemap monitoring. Distill achieves this by first creating a list of URLs through the use of crawlers. The crawlers start at the source URL and search for all links within it. These links are then added to a list, and the crawlers proceed to go through each link one by one, crawling them to search for any additional links on the new page. The crawlers continue this process, crawling all the links they find until there are no more left to crawl.

However, there are some exceptions to which links the crawlers will crawl. They will not crawl links outside the website’s domain, links that are not a subpath of the start URL, or links that match the regular expression specified in the Exclude option when the crawler is set up. By following these rules, Distill creates a comprehensive and accurate list of all URLs belonging to a website that can be monitored for changes.
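
The crawl loop described above can be sketched as follows. This is a conceptual illustration of the behavior, not Distill's implementation; fetch_links is a hypothetical helper that returns the links found on a page.

```python
# Conceptual sketch of the crawl loop (not Distill's implementation):
# breadth-first discovery of links, limited to the start URL's host and subpath,
# with an optional Exclude regular expression.
import re
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_sitemap(start_url, fetch_links, exclude_pattern=None):
    start = urlparse(start_url)
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    queue, seen = deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):                     # hypothetical helper
            link = urljoin(url, link)                     # resolve relative links
            parsed = urlparse(link)
            if parsed.netloc != start.netloc:             # outside the (sub)domain
                continue
            if not parsed.path.startswith(start.path):    # not under the start subpath
                continue
            if exclude and exclude.search(link):          # matches the Exclude option
                continue
            if link not in seen:                          # crawl each URL only once
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```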

Create and Monitor Sitemap for a Website

If you only need to monitor updates to the links of a website, you just need to create a sitemap monitor. However, if you also want to monitor the contents of those URLs, follow these two steps:

  1. Crawler: Finds all the links present on a website and creates a list of crawled URLs. You can set up the frequency and exclusions for the crawlers at this stage. This creates the sitemap for the monitor.
  2. Monitor: You can export the links found during crawling and add them in bulk for monitoring the contents of those URLs. This can be done using CSV import. The crawled list of URLs is added for tracking. You can configure the monitor’s settings, such as actions and conditions, at this step.

Here are the steps to monitor the sitemap:

  1. Open the Watchlist from the Web app at https://monitor.distill.io

  2. Click Add Monitor -> Sitemap.

    button to add a sitemap monitor

  3. On the source page, add the Start URL. Links are crawled from the same subdomain and the same subpath, not the full domain, so add the URL from where you want the crawl to start. For example, if the URL is https://distill.io, all subpaths such as https://distill.io/blog, https://distill.io/help, etc. will be crawled. However, it will not crawl URLs like https://forums.distill.io, as that is a separate subdomain.

  4. If you want to exclude any links from crawling, use the regular expression filter described below.

  5. Click “Done”. This opens the Options page for the monitor’s configuration, where you can configure actions and conditions. Save when done.

Whenever URLs are added to or removed from the sitemap, you will get a change alert. You can view the change history of a sitemap monitor by clicking on its preview.

Change history of a sitemap monitor

You can use the regular expression filter to exclude links from crawling. This option is available on the source page when you first set up a crawler for sitemap monitoring. Alternatively, you can modify the configuration of any existing crawler from its detail page. By default, all image links are excluded from crawling.

Regex filter for exclusion
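
As an illustration, the snippet below tests a hypothetical exclusion pattern against a few URLs; the pattern and URLs are examples only, and the exact regex flavor used by Distill may differ.

```python
import re

# Hypothetical exclusion pattern: skip PDF downloads and paginated archive pages.
exclude = re.compile(r"\.pdf$|/page/\d+")

print(bool(exclude.search("https://example.com/guide.pdf")))    # True  -> excluded
print(bool(exclude.search("https://example.com/blog/page/3")))  # True  -> excluded
print(bool(exclude.search("https://example.com/blog/post-1")))  # False -> crawled
```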

Frequency of Crawling

By default, the crawler frequency is set to 1 day. However, you have the option to modify the crawling frequency according to your requirements. You can do this by navigating to the crawler’s page as shown below:

view crawler's detail page

Then you can click on the “Edit Crawler” option and change the schedule.

edit crawler's config

Page Macro: Execute Steps Before a Page is Crawled

Before a page is crawled, it is often useful to run validations on it. A Page Macro runs on each page before that page is crawled.

Steps to add Page Macros:

  1. Click the hamburger icon in the watchlist -> Select Crawlers.
  2. Click on the crawler you want to add Page Macros to -> Click Edit Crawler.
  3. Beside Page Macros, click Edit Steps.
Adding steps to be executed before crawling a page
  4. Add the steps you want to execute before a URL is crawled.

Use Cases for Page Macro

Error Out a Crawl Job When a Website is Unavailable

When a page is temporarily unavailable (for example, it returns a 503 error), the loaded page will not contain the original links. If such pages are crawled successfully, the final list of crawled links will be incomplete, which triggers a false alert. To prevent this, the crawler should stop the crawl by raising an error. You can do this by adding validations in a Page Macro.

  1. Add an assert step and click on the caret icon to expand the options.
  2. Check if element_has_text and input general keywords that the site might use to indicate maintenance, such as “Maintenance”.
  3. Click on Set optional Arguments and type in the error message you want to display, for example: “Crawler stopped: Site under maintenance”.
Stop crawl when under maintenance
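
Conceptually, the assert behaves like the sketch below. This is illustrative logic only, not Distill's macro syntax, and the keyword list is an assumption.

```python
# Illustrative logic only (not Distill's macro syntax): raise an error when the
# page looks like a maintenance page so the crawl stops instead of saving an
# incomplete list of links.
def assert_not_under_maintenance(page_text):
    keywords = ["maintenance", "temporarily unavailable"]  # assumed keywords
    if any(word in page_text.lower() for word in keywords):
        raise RuntimeError("Crawler stopped: Site under maintenance")
```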

Skipping URLs Based on Content on the Page

In some cases, you may want to skip crawling URLs based on certain criteria to avoid irrelevant crawler notifications. For example, you can skip crawling the URLs of Out of Stock products when monitoring a list of available products. The following steps show how to do that:

  1. Add an if...else block.
  2. Check if element_has_text contains “in stock” and negate this step using the overflow button. This will add a NOT to your block.
  3. In the condition body, use skipURL.

This negation in the page macro will skip all URLs that don’t have the text “in stock”.

Stop crawling irrelevant pages
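
The resulting page macro corresponds roughly to the logic below. The names are assumptions for illustration, not Distill's macro syntax.

```python
# Illustrative logic only (not Distill's macro syntax): skip any URL whose page
# does not contain the text "in stock".
def page_macro(page_text, skip_url):
    if "in stock" not in page_text.lower():  # negated element_has_text check
        skip_url()                           # skipURL: drop this page from the sitemap
```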

URL Rewrite

URL rewriting optimizes crawling by modifying URLs before they are crawled. This process can normalize URLs for websites with duplicate pages, manage redirects, and consolidate multiple URLs into a single preferred version.

You can use the following “Presets” for common cases of URL rewriting:

Test URL: https://example.com/products?size=10&color=blue&category=shoes

  1. Sort query parameters: Sorts query parameters in ascending order.

    Reconstructed URL: https://example.com/products?category=shoes&color=blue&size=10

    After sorting the query parameters, you will need to use the Return constructed URL preset.

  2. Remove all query parameters: Removes all query parameters from URLs.

    Reconstructed URL: https://example.com/products

  3. Return constructed URL: Returns the URL after it has been reconstructed based on the specified function.
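
To make the effect of these presets concrete, here is a minimal sketch using Python's urllib as an assumed stand-in for Distill's rewrite functions; it is not the product's internal implementation.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

url = "https://example.com/products?size=10&color=blue&category=shoes"
parts = urlsplit(url)

# Sort query parameters, then return the constructed URL.
sorted_query = urlencode(sorted(parse_qsl(parts.query)))
print(urlunsplit(parts._replace(query=sorted_query)))
# https://example.com/products?category=shoes&color=blue&size=10

# Remove all query parameters.
print(urlunsplit(parts._replace(query="")))
# https://example.com/products
```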

Steps to rewrite URL:

  1. Navigate to the crawler and click “Edit Crawler”.

  2. Next to “Rewrite URL” click “Edit Steps”.

Rewrite URL
  3. You have the option to use different presets.
    Rewrite URL
    Rewrite URL

Use Cases for URL Rewrite

Eliminating Duplicate URLs by Sorting Query Parameters

URLs with multiple parameters in the query string can cause web crawlers to identify them as different pages, even if they lead to the same content. This happens because the order of the parameters is different. By normalizing these URLs—sorting the parameters in a preset order—and consolidating them into a single sitemap entry, we can prevent false notifications of duplicate entries.

For example, a crawler might find the following two URLs leading to the same page:

  • https://bookstore.com/search?sort=price_asc&category=fiction
  • https://bookstore.com/search?category=fiction&sort=price_asc

The sort parameter is placed before the category parameter in the first URL and after it in the second, but both result in a search for fiction books sorted by price in ascending order. Normalizing these URLs ensures they are recognized as the same entry in the sitemap.
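
With the same urllib-based approximation of the Sort Query Params and Return Constructed URL presets, both bookstore URLs reduce to an identical string:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    # Approximation of Sort Query Params + Return Constructed URL using urllib.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=urlencode(sorted(parse_qsl(parts.query)))))

a = "https://bookstore.com/search?sort=price_asc&category=fiction"
b = "https://bookstore.com/search?category=fiction&sort=price_asc"
print(normalize(a) == normalize(b))  # True -> a single sitemap entry, no false alert
```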

You can use URL rewrite to organize and optimize URLs:

  1. Click Edit Steps in URL rewrite.
  2. Select the preset Sort Query Params.
  3. Then, click Return Constructed URL from the presets.
  4. Observe that the parameters in the URL are now sorted by the crawler after the rewrite function is performed.

Ensure the final step outputs a fully formed and valid URL. For example, the return join_url_fragments function combines the individual parts of a URL, such as the hash, path, and query, with the origin into a single, complete URL, so that all the components you have been working with are correctly joined back together.

View and Manage Crawlers

  1. Click the hamburger icon and select Crawlers from the drop-down.
view and manage crawler
  2. The crawler list opens up. It shows the details of each crawler: Name, Start URL, Creation date, Last Run Summary, and State.
view and manage crawler
  3. Click on any crawler to view its details. This will show a list of jobs the crawler has run. You can also click Edit Crawler to modify settings such as the schedule, excluded URLs, Macro Action, and Rewrite URL.
viewing crawler jobs and status

The Jobs section displays all the jobs that the crawler has executed, along with the details for each run. Each job summary includes the following information:

  • Total: The total number of URLs found in the sitemap.
  • Queued: The number of URLs waiting to be crawled.
  • Crawled: The number of URLs that have been successfully crawled.
  • Errored: The number of URLs that encountered errors during the crawling process.
  4. Click on the Started on field for a job. It will show you the list of URLs found by the crawler in that run. It will also show the status of each URL, indicating whether it was crawled or not.
viewing crawler jobs

You can further click on the Caret > icon next to a URL to view the crawled path of the URL.

viewing crawler jobs
viewing crawl path

Export Crawled URLs

You can export crawled URLs in CSV format by following these steps:

  1. From your watchlist, click on the sitemap monitor’s preview.
  2. Click on the Download button to export the data.
    export URLs to CSV

This will download the list of URLs found by the crawler. The CSV file will contain the following fields: URL, Content Type, Status Code, and Diff Type. The “Diff Type” field indicates changes compared to the previous crawl:

  • Addition: Newly found URL
  • Unchanged: URL present in the previous crawl
  • Deleted: URL that has been removed
export URLs to CSV
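
As a small illustration of working with the export, the sketch below reads the downloaded file and prints only newly added URLs. The file name is hypothetical, and the column headers are assumed to match the field names listed above.

```python
import csv

# Hypothetical file name; column names assumed to match the exported fields.
with open("sitemap-export.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["Diff Type"] == "Addition":
            print(row["URL"], row["Status Code"])
```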

How Crawler Data is Cleaned Up

Crawler jobs generate a substantial volume of data with each run. To manage this data effectively, a systematic cleanup process is in place. Specifically, only the latest 10 jobs are retained, and any excess data from previous jobs is removed.

However, jobs that have detected changes in the past are preserved until the change history is truncated. This ensures that users have access to significant historical job data, even if it extends beyond the most recent 10 jobs.
