When browsing websites, not all displayed content is relevant to a search engine or to a user. Some sections can be considered the main content, while the rest is referred to as boilerplate. It is this task of distinguishing (or rather classifying) main content from boilerplate that we will address.
Why Extract Main Content?
Here are a few reasons:
- The main content is used to index websites based on the information that is most relevant to users.
- Audio readers for visually impaired users or the conversion of web pages into “reader mode” rely on this content extraction.
- Main content can be used to create various data corpora, making boilerplate a form of “noise” for AI training.
What’s the Difference Between Main Content and Boilerplate?
The definition can vary. Generally speaking, in the context of indexing a page for a search engine:
“We can assume that the main content corresponds either to the central text of an article on the page (if it exists) or to anything that does not belong to the recurring template of the site. One could thus define the main content as the part of the page that most visitors would expect to see if they arrived from elsewhere, for example, from a search engine.” (Janek Bevendorff et al.)
Alright, this definition isn’t universal, but we can already consider headers, footers, navigation menus, copyright notices, advertisements… as part of the boilerplate.
What about user comments? Are they considered main content?
Well, for certain page structures, it depends on the type of page and on what the user is looking for. According to the definition above, comments aren’t part of the recurring template, so they count as main content. But:
- Are blog or news site comments part of the user’s primary search intent? Not really.
- Would I want to see product reviews or comments on a booking site? Possibly.
How to Extract the Main Content from a Web Page?
There are three main approaches – the first two follow a similar logic, while the last one attempts a completely different strategy.
The Heuristic Approach
The heuristic approach relies on hand-written rules: strategies defined manually by humans after studying the structure of web pages. These strategies are based on text characteristics, the most commonly used being:
- Text density – How many words per line in the analyzed block.
- Link density – How many links per line in the analyzed block.
Many other characteristics exist depending on the algorithm used, such as the average sentence length, the ratio of uppercase letters, the number of commas per block… essentially any form of text counting.
When a block sufficiently meets the manually written conditions, it is considered part of the main content.
Some well-known and effective heuristic extractors include Trafilatura and Readability.
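To make the density idea concrete, here is a minimal sketch of such hand-written rules. The thresholds and the lxml-based block walk are illustrative assumptions, not how Trafilatura or Readability are actually implemented:

```python
from lxml import html

# Illustrative thresholds; real extractors tune these against annotated corpora.
MIN_TEXT_DENSITY = 10   # words per line in the block
MAX_LINK_DENSITY = 0.3  # share of the block's words that sit inside links

def looks_like_main_content(block) -> bool:
    """Crude density-based rule applied to a single DOM block."""
    text = block.text_content()
    words = text.split()
    lines = max(text.count("\n") + 1, 1)
    link_words = sum(len(a.text_content().split()) for a in block.iter("a"))

    text_density = len(words) / lines
    link_density = link_words / max(len(words), 1)
    return text_density >= MIN_TEXT_DENSITY and link_density <= MAX_LINK_DENSITY

tree = html.fromstring(open("page.html", "rb").read())
main_blocks = [b.text_content() for b in tree.iter("p") if looks_like_main_content(b)]
print("\n\n".join(main_blocks))
```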
The Machine Learning Approach
Machine learning approaches are structurally much more diverse than heuristic approaches but generally rely on the same characteristics and sometimes the same strategies.
The rules are learned automatically by the model, under the supervision of an annotated dataset. Models like Dragnet and Boilerpipe are built on this principle.
Some models go further by integrating semantic analysis of blocks and their relationship with adjacent blocks. For example, Boilernet uses LSTMs (Long Short-Term Memory) in addition to learned rules.
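As a toy illustration of what “learning the rules” means, the sketch below trains a small classifier on hypothetical block-level features. This is not how Dragnet, Boilerpipe, or Boilernet are implemented; it only shows the shift from hand-written thresholds to supervised learning:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature vectors for DOM blocks:
# [text_density, link_density, avg_sentence_length, comma_count]
# Labels: 1 = main content, 0 = boilerplate (annotated by humans).
X_train = [
    [22.0, 0.05, 18.0, 4],  # long article paragraph
    [3.0, 0.90, 2.0, 0],    # navigation menu
    [15.0, 0.10, 14.0, 2],  # article paragraph
    [1.5, 0.75, 3.0, 0],    # footer full of links
]
y_train = [1, 0, 1, 0]

# The model learns the decision boundaries instead of a human writing them.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

print(clf.predict([[18.0, 0.08, 16.0, 3]]))  # -> [1], classified as main content
```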
The Visual Approach
For certain use cases, extractors from the previous two approaches can be severely limited: depending on the characteristics they analyze, the language of the page can significantly affect their performance.
But aren’t the rules language-independent?
Not explicitly, but they are implicitly tied to the language. I’m thinking in particular of writing systems where a single character can represent a word or a syllable, scripts that have no uppercase or lowercase letters, or languages that don’t use commas. Counting these elements becomes impossible, which inevitably biases models that rely on such features to identify the main content.
Sites in Chinese, Japanese, Korean, Russian, and many other languages are therefore not always well handled, even by the most advanced extractors. This compatibility issue becomes even more apparent when the text’s semantics are analyzed.

The visual approach is therefore based on two principles:
- The main elements are usually located in the center of the screen, unlike menus, ads, and other “noisy” elements.
- Humans can detect the main content without needing to read it, which means that the structure of the page and its visual rendering can be more effective than text analysis.
This method is language-agnostic and relies on analyzing the page rendering and the DOM tree to extract central blocks.
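As a rough sketch of that idea, the snippet below renders a page with Playwright and ranks blocks by how close their bounding box sits to the horizontal center of the viewport. The selector, viewport size, and scoring are assumptions for illustration, not a published algorithm:

```python
from playwright.sync_api import sync_playwright

URL = "https://example.com"  # placeholder URL
VIEWPORT_WIDTH = 1280

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": VIEWPORT_WIDTH, "height": 800})
    page.goto(URL)

    candidates = []
    for el in page.query_selector_all("p, article"):
        box = el.bounding_box()  # rendered position and size, or None if hidden
        if not box or box["width"] == 0:
            continue
        block_center = box["x"] + box["width"] / 2
        distance_to_center = abs(block_center - VIEWPORT_WIDTH / 2)
        candidates.append((distance_to_center, el.inner_text()))

    # Blocks rendered near the center are the most likely main-content candidates.
    candidates.sort(key=lambda c: c[0])
    print(candidates[0][1][:300] if candidates else "no candidate found")
    browser.close()
```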

Heuristic vs Machine Learning – Which is the Better Approach?
In the article “An Empirical Comparison of Web Content Extraction Algorithms”, the authors compare 14 main-content extractors (plus 5 HTML-to-text conversion tools) across 8 public datasets. Because its paradigm is entirely different, the visual approach is difficult to compare with the first two and is not part of this study. So, what did they conclude?
Let’s focus on the top-performing algorithms: Trafilatura, Readability, and Boilerpipe. Trafilatura and Readability, based on the heuristic approach, are less adaptable to new structures due to their manual rules. However, they are generally faster, less resource-intensive, and even more effective than machine learning approaches like Boilerpipe.
Why Are Automatic and More Complex Strategies Less Effective Compared to Handwritten Rules?
Well, they aren’t always worse. For instance, Boilerpipe, which was trained on one of the benchmark datasets, outperforms all others specifically on that dataset. Machine learning approaches depend on their structure, the strategies they retain, and the training data. This raises questions about the quality of that data and its impact on the effectiveness of machine learning-based extractors.
In practice, English is overrepresented, the datasets are often considered too small, and the page types consist almost exclusively of press articles and blog posts.
As of today, for languages written in the Latin script, heuristic extractors are the preferred choice. They perform very well on blog and press articles but are neither language-agnostic nor genre-agnostic.
Want to Try It at Home?
Grab your favorite Python interpreter, and let’s try extracting the main content of a randomly chosen web page 👀 with a few lines of code and one of the top-performing heuristic models:
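For instance, with Trafilatura (the URL below is just a placeholder; pick any article you like):

```python
import trafilatura

# Any publicly accessible article works here; this URL is a placeholder.
url = "https://example.com/some-article"

downloaded = trafilatura.fetch_url(url)      # fetch the raw HTML
main_text = trafilatura.extract(downloaded)  # strip the boilerplate

print(main_text)
```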

References
- An Empirical Comparison of Web Content Extraction Algorithms (Benchmark and definitions)
- An Overview of Web Page Content Extraction – Joy Bose Roy (Inspiration)
- Don’t Read, Just Look: Main Content Extraction from Web Pages Using Visual Features (GCE)
- Boilerplate Detection using Shallow Text Features (Boilerpipe)
- Mozilla Readability – GitHub (Source Code)
FAQ
What Are the Common Techniques for Extracting Data from Web Pages?
The heuristic, machine learning, and visual approaches are all widely used to extract information from web pages, with the heuristic approach being arguably the easiest to put into practice.
Which Scraping Tools Do You Recommend for Extracting Information from a Web Page?
As mentioned earlier, Trafilatura is very easy to deploy and performs exceptionally well in terms of both speed and quality of results. Readability is also a good tool.
How to Extract Raw Data from a Web Page Using Python?
If you need all the text, BeautifulSoup and Html-Text are more than sufficient tools.
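For example, a plain-text dump with BeautifulSoup might look like this (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

# get_text() returns everything, main content and boilerplate alike.
print(soup.get_text(separator="\n", strip=True))
```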
What Is the Difference Between Online and Local Web Page Scraping?
If you already have a local corpus of pages to extract, the mentioned libraries can also process local files as input. The methodology remains unchanged.
What Are the Steps to Extract Data from a Web Page Using a Scraper?
The first step is to render the page, in case content is loaded dynamically by a script. Save this rendered version, then extract the desired content with the appropriate scraping libraries.
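A minimal sketch of that two-step workflow, assuming Playwright for the rendering and Trafilatura for the extraction (the URL is a placeholder):

```python
from playwright.sync_api import sync_playwright
import trafilatura

# Step 1: render the page so script-injected content ends up in the HTML.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")  # placeholder URL
    rendered_html = page.content()
    browser.close()

# Step 2: extract the main content from the rendered source.
print(trafilatura.extract(rendered_html))
```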
How to Identify and Extract Specific Elements from a Web Page in HTML?
If you know exactly what you want to scrape, you can use the BeautifulSoup library to create custom extraction rules.
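For instance, assuming a page whose article body lives in a container with the class "article-body" (a hypothetical class name), a custom rule could look like this:

```python
from bs4 import BeautifulSoup

html = open("page.html", encoding="utf-8").read()
soup = BeautifulSoup(html, "html.parser")

# Hypothetical rules: the title is the first <h1>, the body sits in a
# container assumed to carry the class "article-body".
title = soup.find("h1")
paragraphs = soup.select("div.article-body p")

print(title.get_text(strip=True) if title else "no title found")
for p in paragraphs:
    print(p.get_text(strip=True))
```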
How to Differentiate Between div and class Tags to Extract Data from a Web Page in HTML?
BeautifulSoup (again, in Python) lets you filter the extracted content by tag name, class attribute, or even the text itself.
What Are the Advantages and Disadvantages of Automatic Data Extraction from Websites?
The main advantages of web scraping are undoubtedly the time it saves and its flexibility, since it is generally applied to large volumes of pages. The downsides are that the target site may add content through scripts, which requires extra tooling to render the page before extraction, and that the output can be messy and is not always robust across languages or page layouts.
How to Optimize the Data Extraction Process to Obtain Accurate Information from Web Pages?
The optimization of extractors depends on the type, language, size, and complexity of the pages. The more homogeneous the data across these factors, the easier it is to learn the rules through a machine learning approach. However, it will be challenging to generalize for drastically different pages.