Classifying Web Pages by Genre: Focus on the N-Gram Method

In the previous section, we saw that classifying websites by genre is a complex challenge because categories evolve over time. Genres that were relevant 20 years ago do not necessarily reflect today’s internet. Moreover, classification can be based on various elements: visual appearance, structure, or even the function of the page. This variability complicates the creation of a universal taxonomy.

Several approaches exist for classifying web pages. As with main content extraction, the first step is to analyze the information the page provides, much as internal search engines do. Shepherd and Watters, who introduced the concept of “cybergenres,” identified three key dimensions for categorizing a page: its content (text and semantic information), its form (page structure, including HTML tags), and its function (how users interact with the page through links and actions).

Traditional approaches use techniques such as word counting, punctuation frequency, or Bag of Words (BOW) models to analyze websites. However, a more effective method has emerged: the n-gram approach, first applied to web-genre classification in 2007, which significantly improved how pages are classified by genre.

A Common Dataset: 7-Genre

Before diving in, I decided to restrict this overview to studies that have compared their results on the same dataset: the 7-Genre dataset.

This dataset helps standardize evaluations and compare the impact of different methods on classification quality. Its seven genres are blog, e-shop, FAQ, front page, listing, personal home page (PHP), and search page (Spage).

What is an n-gram?

An n-gram representation of a text is the set of all contiguous sequences of n characters or words, taken in order. For example, the word “pomme” in bigrams is represented as: {‘po’, ‘om’, ‘mm’, ‘me’}. Here, n corresponds to the number of characters per token.
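To make the sliding window concrete, here is a minimal Python sketch (my own illustration, not taken from any of the studies discussed) that extracts character n-grams:

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all contiguous character n-grams of `text`, preserving order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("pomme", 2))  # ['po', 'om', 'mm', 'me']
print(char_ngrams("pomme", 3))  # ['pom', 'omm', 'mme']
```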

Below, in yellow, you can see character n-grams ranging from 2-gram to 5-gram to help you visualize the concept. The yellow highlights represent a sliding window that moves across the entire text line.

What Are N-Grams Used For?

This method is quite recent—only 77 years old. Most of you weren’t born when Claude Shannon introduced this idea in 1948 to predict a sequence of characters or words based on the beginning of a text. Neither was I, for that matter. Shannon was a pioneer in probabilistic Natural Language Processing (NLP).

N-grams allow us to model the probability of a sequence of symbols and analyze redundant structures. One of the key features of this approach is its ability to automatically capture word roots. Since character sequences representing roots occur more frequently than affixes (prefixes, suffixes, etc.), n-grams help identify them.

For example, let’s take the trigrams from the following words: lucide, élucider, luciole, Lucas, Lucie. I won’t list every possible combination, but by counting the occurrences across the sequence, we can isolate the root “luc”, which, by the way, comes from the Latin “lux, lucis”, meaning light.
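As a quick check, here is a small sketch (again my own illustration) that counts the character trigrams of those words; “luc” is the only trigram shared by all five:

```python
from collections import Counter

words = ["lucide", "élucider", "luciole", "lucas", "lucie"]  # lowercased for counting

trigram_counts = Counter(
    w[i:i + 3]
    for w in words
    for i in range(len(w) - 2)
)

print(trigram_counts.most_common(3))
# [('luc', 5), ('uci', 4), ('cid', 2)] -- the root 'luc' dominates
```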

📖 You picked an example that works well, but when I try with aéroport, port, transport, your three-character word chunks don’t work anymore, and we have to adjust the size each time.

Glue to Build More Relevant N-Grams

In their study published in 2007, Ioannis Kanaris and Efstathios Stamatatos were not only among the first to apply n-grams for classifying web pages by genre, but they also didn’t use the easiest method to explain!

Their goal was to train an SVM (Support Vector Machine) using a set of n-grams considered dominant for a given genre. For example, “price”, “sale”, and “buy” are likely dominant n-grams for the e-shop genre.

📖 Why is the size variable?

That’s exactly the core of their strategy! Since words and roots vary in length, they compute all possible n-gram combinations from 2-grams to 5-grams (at least in their study).

Then, they take inspiration from the LocalMaxs method, originally proposed by Silva and Lopes, to detect dominant n-grams. The version used by Kanaris and Stamatatos calculates an indicator—essentially a measure of “cohesion strength” or glue—using Symmetrical Conditional Probability (SCP):
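In its simplest form, splitting the sequence into a left part x and a right part y:

SCP(x, y) = p(x, y)² / ( p(x) · p(y) )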

Where p(x, y) is the probability of observing the character sequence x followed by y, and p(x), p(y) are the probabilities of observing x and y separately.

For an n-gram of length n (e.g., “expo”), different internal cut points are considered (e.g., “e|xpo”, “ex|po”, etc.), and the SCP of the left and right segments of the split is calculated. The LocalMaxs method then compares these values to determine whether the n-gram is “dominant” (strongly cohesive) or if it is “outperformed” by a longer or shorter n-gram that contains nearly the same sequence.

LocalMaxs

The algorithm checks, for each n-gram C, whether it has a higher glue score (SCP) than:

  • Its shorter n-grams (predecessors),
  • Its longer n-grams (successors).

In simple terms, if the SCP of C is strictly greater than that of its predecessors and successors, then C is declared dominant.
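Here is a simplified sketch of that idea (my own reading, not the authors’ code): the “glue” averages the SCP denominator over every internal split point, and a candidate n-gram is kept only if its glue beats that of its shorter and longer neighbours.

```python
from collections import Counter

def ngram_probs(text: str, n_min: int = 1, n_max: int = 6) -> dict[str, float]:
    """Relative frequency of every character n-gram from n_min to n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    total = max(len(text), 1)  # crude normalization by text length
    return {g: c / total for g, c in counts.items()}

def glue(gram: str, p: dict[str, float]) -> float:
    """SCP-style glue: p(gram)^2 over the average of p(left)*p(right)
    across all internal split points of the n-gram."""
    if len(gram) < 2 or gram not in p:
        return 0.0
    splits = [p.get(gram[:i], 0.0) * p.get(gram[i:], 0.0) for i in range(1, len(gram))]
    denom = sum(splits) / len(splits)
    return p[gram] ** 2 / denom if denom > 0 else 0.0

def is_dominant(gram: str, p: dict[str, float]) -> bool:
    """Simplified LocalMaxs test: the glue of `gram` must be strictly higher than
    the glue of its shorter neighbours and of the longer n-grams that contain it."""
    g = glue(gram, p)
    shorter = [gram[:-1], gram[1:]]
    longer = [x for x in p if len(x) == len(gram) + 1 and gram in x]
    return all(g > glue(x, p) for x in shorter) and all(g > glue(y, p) for y in longer)

# Tiny usage example on a toy text (a real page would give more meaningful results):
text = "sale price, best price, sale sale, buy at sale price"
p = ngram_probs(text)
dominant = [g for g in p if 2 <= len(g) <= 5 and is_dominant(g, p)]
print(sorted(dominant, key=lambda g: glue(g, p), reverse=True)[:10])
```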

The rest of the method is relatively straightforward:

  1. The web page is represented using these dominant n-grams, and their frequency is counted.
  2. A Support Vector Machine (SVM) is trained to classify web pages by genre based on these dominant n-grams (a sketch follows below).
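To give an idea of that last step, here is a minimal scikit-learn sketch (mine, not the authors’ code); `dominant_ngrams`, `pages`, and `genres` are placeholder data standing in for the output of the previous step and the training corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder inputs: dominant n-grams from the LocalMaxs step, plus labelled pages.
dominant_ngrams = ["price", "sale", "buy", "faq", "search"]
pages = ["great price, buy now, sale ends soon", "search our faq for answers"]
genres = ["eshop", "faq"]

model = make_pipeline(
    # Count only the dominant n-grams in each page.
    CountVectorizer(vocabulary=dominant_ngrams),
    LinearSVC(),
)
model.fit(pages, genres)
print(model.predict(["cheap laptops on sale, buy today"]))  # likely ['eshop']
```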

This method primarily focuses on text-based classification, though the study also mentions structural analysis of HTML tags, which slightly improves the results.

What Is the Ideal N-Gram Size for Classifying Web Page Genres?

The “glue” method with variable-length n-grams is computationally expensive. So the question arises: if we want to keep things simple and settle on a single n-gram size, which one should we use?

In 2014, Kumari et al. decided to compare the performance of an SVM trained on different n-gram sizes ranging from 3 to 8. For each length, they counted the occurrences of each n-gram to construct a representation vector. They found that pentagrams (5-grams) were the best compromise.

SVM Performance on the 7-Genre Dataset by Varying N-Gram Length
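An experiment of this kind is easy to reproduce in spirit with scikit-learn (a sketch under my own assumptions, with a tiny stand-in corpus instead of the 7-Genre pages):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def score_for_ngram_size(pages, genres, n):
    """Cross-validated accuracy of an SVM fed with character n-grams of a fixed size n."""
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(n, n)),
        LinearSVC(),
    )
    return cross_val_score(model, pages, genres, cv=2).mean()

# Tiny stand-in corpus; in the study this would be the 7-Genre pages and labels.
pages = [
    "buy now, sale price, add to cart", "discount price, buy today",
    "frequently asked questions and answers", "read our faq before asking",
]
genres = ["eshop", "eshop", "faq", "faq"]

for n in range(3, 9):  # n-gram sizes 3 to 8, as in the study
    print(n, round(score_for_ngram_size(pages, genres, n), 2))
```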

How Can This Be Explained?

With n-grams that are too short, words like chanter, nous chantons, and chaussette pose a problem for root detection: they all share the trigram “cha”, so a trigram-based representation would be unable to distinguish chaussette from chanson, leading to incorrect associations.

Similarly, with n-grams that are too long (e.g., 6-grams), it becomes impossible to extract a meaningful root in this example. More broadly, the number of distinct n-grams increases, introducing noise into the classification process.

However, one criticism of this study is that the optimal n-gram length was determined using the 7-Genre dataset, which contains only English-language web pages. While this study effectively illustrates that there is always an optimal n-gram length and highlights the risks of n-grams that are too short or too long, it does not confirm whether this applies to all languages, all datasets, or even all web page genres.

Can We Achieve Good Classification Without Machine Learning?

That’s a great question, and yes, Mason et al. did it in 2009!

First, they observed that the frequency of n-grams in a text follows a Zipf distribution, just like the vocabulary of a natural language.

A Zipf distribution looks like this:

Zipf Distribution of Words in the English Language

📖 What Does This Mean?

It means that only a small number of the words in a vocabulary are actually used very frequently.

Representing a text as n-gram frequency vectors is problematic because it creates very large vectors, where most n-grams have low occurrences.

Here is the bigram distribution from Mason et al.’s study:

Do You See the Curve? The first time, I missed it.
What this curve tells us is that, in the end, almost all of the vocabulary occurs at very low frequency.

J. Mason’s first decision was to “cut off” the excess. From now on, only the 500 most frequent n-grams will be considered.

Unlike the first two approaches, no SVM is used. Instead:

  1. The normalized frequency of n-grams in the page is calculated.
  2. A genre-wide average of these frequencies is computed.
  3. The 500 most frequent n-grams per genre are isolated, creating a unique vector for each genre—its “typical profile.”
  4. The distance between a candidate and each typical profile is calculated using the following formula proposed by Kešelj et al. (2003):
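Spelled out, the dissimilarity sums the squared relative frequency differences over the two profiles:

d(S₁, S₂) = Σ_m ( (f₁(m) − f₂(m)) / ( (f₁(m) + f₂(m)) / 2 ) )²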

Where:

  • S₁ and S₂ are the two n-gram profiles being compared (e.g., a test page and a genre).
  • f₁(m) is the normalized frequency of the n-gram m in profile S₁.
  • f₂(m) is the normalized frequency of the same n-gram in profile S₂.
  • The sum is calculated over all n-grams present in either of the profiles (S₁ ∪ S₂).
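Putting steps 1 to 4 together, here is a minimal sketch of my reading of the method (character bigrams, as in the distribution shown above; all names and defaults are mine):

```python
from collections import Counter

def page_freqs(text, n=2):
    """Normalized character n-gram frequencies of a single page."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def genre_profile(pages, n=2, top_k=500):
    """Average the per-page frequencies, then keep the top_k n-grams: the 'typical profile'."""
    avg = Counter()
    for text in pages:
        for g, f in page_freqs(text, n).items():
            avg[g] += f / len(pages)
    return dict(avg.most_common(top_k))

def distance(p1, p2):
    """Kešelj-style dissimilarity: squared relative differences over the union of profiles."""
    return sum(
        ((p1.get(m, 0.0) - p2.get(m, 0.0)) / ((p1.get(m, 0.0) + p2.get(m, 0.0)) / 2)) ** 2
        for m in set(p1) | set(p2)
    )

def classify(text, genre_profiles, n=2):
    """Assign the genre whose typical profile is closest to the page's own frequencies."""
    page = page_freqs(text, n)
    return min(genre_profiles, key=lambda genre: distance(page, genre_profiles[genre]))
```

Truncating each profile to the 500 most frequent n-grams keeps the vectors small, which is exactly the point of the Zipf-based observation above.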

This study also proposes an approach that subdivides the “listing” genre, which was considered too heterogeneous, into checklist, hotlist, sitemap, and table. The results show that this subdivision significantly improved accuracy.

From this study, we can conclude that it is crucial for a category to be homogeneous in its textual content to avoid biasing the created typical profile. This method is particularly interesting due to its simplicity, but it depends on the taxonomy used.

Which Sub-Approach Wins?

I recreated this table from Kumari et al.’s 2014 publication (recreated because it included studies not covered here, and because the authors mistakenly rounded down their peers’ performance).

  • Kanaris et al.: 96.5% (glue)
  • Kumari et al.: 95.78% (n-gram size)
  • Mason et al.: 94.6% (without SVM)

Each study was at least evaluated on the 7-Genre dataset, allowing for a comparative overview of their performance. It is immediately noticeable that the most effective method appears to be the glue approach. However, if we consider computational cost or approach complexity, the non-SVM method deserves recognition for achieving a very strong score, even in third place.

Conclusion

This approach may seem outdated in 2025, considering what LLMs can achieve, but that doesn’t mean it has nothing to offer.

First, a lighter method is sometimes beneficial—an LLM is costly in terms of time, water, and electricity, especially when compared to this type of approach.

Now, back to n-grams.

  • We saw that if n-grams are too short, they fail to distinguish meaningful segments within a context.
  • On the other hand, if they are too long, they become almost unique vocabulary elements, turning into noise.
  • We also saw that including all possible vocabulary elements is unnecessary.

Therefore, it is crucial not only to choose the right n-gram size but also to select the most relevant n-grams—either by:

  1. Using the normalized average of n-grams within the same genre,
  2. Applying the glue algorithm to extract dominant n-grams.

The n-gram method is far from being the only interesting approach in genre classification.

For instance, we could explore:

  • Using links and neighboring pages to enrich the analyzed page’s textual content with more relevant n-grams.
  • Multi-label outputs, to handle cases where a page exhibits hybrid genres.
  • The creation of “design patterns”, to classify a web page based on its structure rather than its content alone.

We’ll dive into these methods next time!