The web is a huge collection of information and data (although the signal-to-noise ratio could be improved, to say the least) as well as a navigation system for exploring that information, built on top of the internet. Information is contained in web pages (organized within websites), and users can navigate from one page to another by following hyperlinks.
It was just a matter of time before someone started to retrieve and exploit all this data for various purposes, the first of them being search engines and web indexing. Web crawling consists in retrieving web pages automatically; depending on the purpose, it may also be called web scraping.
Common web crawling applications
Web crawling is an essential technique in the digital field and has numerous applications:
- Search engines: Search engines like Google use indexing robots to navigate the web and index pages. This allows them to deliver relevant results to user queries through in-depth analysis of online content.
- Search engine optimization (SEO): SEO is the process of improving the quality and quantity of traffic to a website or a web page from search engines. SEO tools study the web graph, that is, the links between pages and sites, and derive metrics from it.
- Price monitoring: Many companies use web scraping to track competitor prices in real time. This enables them to quickly adjust to market fluctuations and maintain a competitive edge.
- Data aggregation: Platforms such as price comparison sites or news aggregators leverage web crawling to consolidate information from multiple sources.
A basic crawler architecture
It would be a bit tedious if one had to feed a web crawler with page URLs one by one (URL stands for Uniform Resource Locator, i.e., an address on the web) to retrieve their data. A basic web crawler therefore includes a mechanism to automatically discover URLs within the content of downloaded pages. Because it follows the hyperlinks between pages, such a web crawler is known as a web robot, or simply a bot, or a spider, by analogy with a spider moving through its own web.
In addition to a URL discovery module, the following modules are found in a basic crawler architecture (a minimal sketch combining them follows the list):
- Prioritization: A web page usually contains a fair number of hyperlinks which can be extracted from the page content, and if you extract URLs from every retrieved web page, your collection of harvested URLs will grow exponentially. Among these URLs, some are more interesting than others, and finding the most pertinent ones is crucial. A smart crawling process uses intrinsic data about a web page and metrics computed on the web graph to drive the prioritization process. The selected policy also influences the type of exploration you are performing: you can favor crawl depth to get more data about a single website, or a broader exploration across many sites.
- Crawl limitation and budget: to reduce the footprint of crawling a website, limitations are put in place to ensure that the website's resources are not monopolized by a single crawler. Symmetrically, a crawler cannot spend all of its resources on a single website. For this purpose, a crawling budget is usually allocated, which may also prevent the crawler from being trapped in a loop. In both cases, these limits impact the prioritization algorithm above.
- Index management: it is not a good use of crawling resources to crawl the same site again within a short period of time. Crawlers therefore need to know which pages have been crawled and when they should be crawled again, using an index management module. The recrawl policy is linked to the crawl budget defined above and, like the prioritization algorithm, it should use crawled data and metrics to determine whether a website (or one of its pages) should be recrawled.
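To make these modules more concrete, here is a minimal sketch in Python of a crawler combining a URL frontier with a naive prioritization score, a crawl budget, and a rudimentary index of visited pages. The scoring function and the absence of a recrawl policy are simplifying assumptions, not a production design.

```python
import heapq
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags (the URL discovery module)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def score(url):
    # Placeholder prioritization: prefer short URLs (closer to the site root).
    return len(url)

def crawl(seeds, budget=50):
    frontier = [(score(u), u) for u in seeds]   # priority queue of (score, url)
    heapq.heapify(frontier)
    visited = {}                                # naive index: url -> crawl timestamp
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        if url in visited:                      # already crawled, skip (no recrawl policy here)
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        visited[url] = time.time()
        budget -= 1                             # the crawl budget caps the total number of fetches
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)       # resolve relative links against the current page
            if absolute.startswith("http") and absolute not in visited:
                heapq.heappush(frontier, (score(absolute), absolute))
    return visited
```

In a real crawler, the scoring function would rely on web-graph metrics and the visited dictionary would be replaced by a persistent index implementing the recrawl policy.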
Web crawler (self-)identification
Every client accessing a web page identifies itself through a user-agent string, web browsers as well as crawlers. For example, one of the Google bots has the following user-agent: “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”. In addition to its identifier, a user-agent gives some information about the client: type, version, platform, etc. For crawlers, it usually links to an information page and/or an abuse contact email address. Each crawler should have its own user-agent, but it is technically free to send any user-agent, even that of an existing browser.
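For instance, with the widely used requests library in Python, a crawler could announce itself like this; the bot name and information URL below are purely illustrative:

```python
import requests

# A descriptive user-agent: name/version plus a page explaining the bot (hypothetical values).
HEADERS = {"User-Agent": "MyCrawler/1.0 (+https://example.com/bot-info)"}

response = requests.get("https://example.com/", headers=HEADERS, timeout=10)
print(response.status_code)
```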
Web crawling good practices
The fact that data is usually freely available on the web does not mean that you are allowed to fetch it for your own purposes: some material is copyrighted and, for European websites, the GDPR also introduces restrictions. But leaving aside the legal basis, which is specific to each crawl, there is also an unofficial code of conduct for well-behaved crawlers, and one of its most important elements is the robots.txt file.
The robots.txt is a file available at the root level of a website which contains instructions about which resources of the website are open to crawlers. It may completely or partially deny access to one or more user-agents, or simply impose a slower pace for crawling the website. Unless you are a famous search engine, a crawl of a website is often viewed as a burden by its owner, so respecting the robots.txt file is a cornerstone of bot acceptance. It also implies that a bot uses a dedicated user-agent and does not try to spoof an existing one. See this page about the robots.txt file format.
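Python's standard library includes a parser for this file, and a well-behaved crawler might check it before each fetch, roughly as in this sketch (the bot name and URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()                                   # fetch and parse the robots.txt file

user_agent = "MyCrawler"                        # hypothetical bot name
url = "https://example.com/some/page"

if robots.can_fetch(user_agent, url):
    delay = robots.crawl_delay(user_agent)      # Crawl-delay directive, if any (may be None)
    print(f"Allowed to fetch {url}, crawl delay: {delay}")
else:
    print(f"robots.txt disallows {url} for {user_agent}")
```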
It should go without saying that a website cannot be crawled at maximum speed. Smaller websites often have resources that match their usual load, and a sudden burst of resource usage caused by an unattended bot usually leads to a permanent ban.
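A simple way to stay polite is to enforce a minimum delay between two requests to the same host, as in the following sketch; the one-second default is an arbitrary assumption, not a universal rule:

```python
import time
from urllib.parse import urlparse

last_fetch = {}   # host -> timestamp of the last request to that host

def polite_wait(url, min_delay=1.0):
    """Sleep if needed so that two requests to the same host are at least min_delay seconds apart."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(host, 0.0)
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    last_fetch[host] = time.time()
```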
Web crawler types and examples
Here are some of the most common crawler types you can encounter in your website logs:
- Search engine bots are the most famous crawlers. They crawl the web to feed search engines with data. There are many search engines, and each one has a dedicated crawler. For example, the crawlers of the Google and Bing search engines are Googlebot and Bingbot.
- AI bots are rather recent and are at the heart of the most controversial issues about AI and privacy. AI bots crawl the web to build training datasets for LLMs (Large Language Models) or other models. OpenAI's GPTBot is one of them. There are also AI bots used for search, like Applebot, which feeds Siri.
- SEO bots (Search Engine Optimization) are more interested in the knowledge of the web graph than in its contents, i.e., in the hyperlinks between web pages and in metrics such as Google's PageRank algorithm family. One purpose of SEO crawlers is to study website rankings. Barkrowler, Babbar's SEO crawler, is an example.
- Advertising & Marketing bots such as “Google AdsBot” are specialized bots dedicated to fine tuned the advertisement service by providing accurate and recent information about website. They also index the web, but in a more limited fashion. Marketing bots may be used to evaluate web marketing campaign.
Writing your own web crawler
While downloading web pages is within everyone's reach thanks to the many available HTTP libraries, we have seen above that you will soon need some basic modules to build a proper web crawler. Fortunately, there is a collection of open-source software to help you build your own crawler, as well as some commercial offerings. Among them are:
- Nutch, from the Apache Foundation, written in Java
- Scrapy, an accessible Python scraper
- Heritrix, the crawler used by the Internet Archive
- BUbiNG, a Java crawler from the University of Milano
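As an illustration, a minimal Scrapy spider that records page titles and follows the links it discovers could look like the sketch below; the spider name, seed URL, and extracted field are placeholders:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                               # arbitrary spider name
    start_urls = ["https://example.com/"]          # seed URL, replace with your own

    def parse(self, response):
        # Extract a piece of data from the page (here, its <title>).
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Discover and follow new links; Scrapy handles scheduling and deduplication.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Such a spider can be run with Scrapy's command-line tools, for example scrapy runspider spider.py -o pages.json, with Scrapy taking care of request scheduling, politeness settings, and duplicate filtering.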
Q&A about web crawling
- How do search engines use crawlers to explore web pages? Web crawlers are instrumental in the workings of search engines. They explore web pages, gather data, and play a fundamental role in determining the relevance and ranking of websites on search engine result pages. By leveraging the power of crawlers, search engines can effectively organize and present a vast array of web pages to users, facilitating their search for information on the web.
- What role do indexing robots play in SEO? The SEO (Search Engine Optimization) industry heavily relies on web crawlers to analyze websites and gather insights. By understanding how search engines view websites, SEO experts can make necessary adjustments to optimize a site’s visibility and potential search engine rankings. Overall, web crawling, indexing robots, and scraping play a fundamental role in the functioning of search engines and the SEO industry. They enable the collection of vast amounts of data from the web, empowering search engines to deliver better search results and helping businesses and website owners improve their online presence.
- How can a robots.txt file influence a crawler’s behavior on a website? A robots.txt file is a text file that is placed in the root directory of a website. It serves as a set of instructions for crawlers, telling them which parts of the website they are allowed to crawl and which parts to ignore. This file can greatly influence a crawler’s behavior and the data it collects from a website. When a crawler visits a website, it first checks the robots.txt file to see if there are any specific instructions for its behavior. For example, if a website has sensitive information that the owner does not want to be indexed by search engines, they can use the robots.txt file to block the crawlers from accessing certain pages or directories. By using the robots.txt file, webmasters have control over how crawlers interact with their website. They can specify which parts of the website to crawl and which to exclude, thus helping in optimizing the crawling process. It also helps in managing bandwidth and ensuring that crawlers focus on relevant content.
- What valuable insights can be derived from analyzing the crawl data of a website? Crawl data analysis can provide valuable information for search engine optimization (SEO) efforts. By examining the crawl data, website owners can understand how search engine bots interact with their site, identify crawl errors and accessibility issues, and optimize their site structure to improve search engine rankings. Additionally, by examining the content found during the crawling process, webmasters can identify opportunities for content improvement and keyword optimization to enhance the site’s visibility in search engine results. Furthermore, looking at crawl data can reveal valuable insights about a website’s internal linking structure. By analyzing the links discovered during the crawling process, webmasters can identify any broken or dead links, assess the depth and breadth of their site’s link architecture, and strategically plan internal link building campaigns to improve the overall usability and navigation of their website. Crawl data analysis can also be beneficial in monitoring competitor websites. By comparing crawl data from different websites, one can identify potential areas of improvement, benchmark their site’s performance against competitors, and glean insights into successful SEO strategies employed by others in the industry.
- How do web crawlers handle dynamic content on web pages? Dynamic content refers to web page elements that change frequently or are generated on the fly based on user interactions. It can include elements such as real-time updates, interactive features, or personalized content. The challenge for web crawlers is to capture and process this dynamic content accurately. To handle dynamic content, crawlers rely on a variety of techniques. One common approach is to execute JavaScript, the programming language commonly used to create dynamic web content. By executing JavaScript, crawlers can simulate user interactions and capture the resulting changes in the page content, but this is costly and cannot be done for every web page. Another technique used by crawlers is to monitor the network traffic between the browser and the server. They can capture requests and responses and analyze the data to identify dynamic content elements. This allows them to extract the relevant data and update their index accordingly.
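As an illustration of the first approach, a crawler could render a page with a headless browser such as Playwright before extracting its content; this is only a sketch (the URL is a placeholder) and, as noted above, far heavier than a plain HTTP fetch:

```python
from playwright.sync_api import sync_playwright

def fetch_rendered(url):
    """Return the HTML of a page after its JavaScript has run."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for dynamic content to settle
        html = page.content()
        browser.close()
    return html

print(len(fetch_rendered("https://example.com/")))
```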
- What precautions can be taken to prevent data scraping on a website? To prevent data scraping, the simplest method is to use the robots.txt file. If a crawler ignores this file, it is possible to block the crawler's IPs, if they are identifiable. Many tools can manage this protection by handling blacklists on your behalf. Unfortunately, these systems often err on the side of caution and may block legitimate visitors or crawlers that are falsely flagged as spammers.
- How does web scraping differ from web crawling? Between web crawling and web scraping, the difference is more about purpose than technique. Web scraping involves extracting specific data from websites, often in real time, for various purposes such as data analysis, research, or monitoring. Although similar to web crawling, web scraping may involve specific tools or software that extract data by simulating human browsing behavior. Web crawling is more generic: it focuses on discovering and fetching pages at scale rather than on extracting particular fields.
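To illustrate the difference in intent, a scraper typically targets specific fields on a page while a crawler mainly harvests links to visit next; with requests and BeautifulSoup the two intents might look like the following sketch, where the URL and CSS selectors are purely hypothetical:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Scraping: extract specific data fields from the page (selectors are illustrative).
title = soup.select_one("h1")
price = soup.select_one(".price")
print(title.get_text(strip=True) if title else None,
      price.get_text(strip=True) if price else None)

# Crawling: collect outgoing links to feed the frontier.
links = [a.get("href") for a in soup.find_all("a", href=True)]
print(len(links), "links discovered")
```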