Links, internal links and the web graph hell from a crawler perspective

At Babbar, we currently crawl 3.5 billion pages per day, and in doing so we encounter a wide variety of oddities that we constantly have to fight. Sometimes we must intervene manually, but we usually try to develop methods and heuristics that automatically correct the undesired behaviors.

What does an SEO crawler do (and not do)?

We have a few objectives: store the web graph, compute metrics about the graph, store the anchor text of links, and keep the metadata of web pages. We do not store the content of web pages; instead, we compute an embedding from the page text.
The graph is stored in several databases: 
    • URLs
    • Fetches (the biggest DB, which includes everything we keep from a crawl, including outbound links)
    • Backlinks

A very important fact that we took into account early in the design process is that internal links (links that point to a web page on the same host) outnumber external links by a factor of 10. To compute our semantic value, which is an equivalent of the topical PageRank, we need the embedding plus all the metrics of the source page. Backlinks are a big part of the equation for computing metrics, but internal backlinks are “local”, i.e. they do not need to be distributed in the same way as external backlinks.

Consequently, we chose early on to store internal links only as outbound links in the Fetches database, in order to limit their impact on storage.
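To make the internal/external distinction concrete, here is a minimal Python sketch that splits the links found on a page according to the definition above (same host means internal). The function name `split_links` and the example URLs are hypothetical, and this is not a description of Babbar's actual pipeline, only an illustration of the rule.

```python
from urllib.parse import urljoin, urlsplit


def split_links(page_url, hrefs):
    """Partition the hrefs found on page_url into (internal, external) absolute URLs.

    A link is treated as internal when its host equals the host of the page,
    which is the definition used in the text. This is an illustrative sketch.
    """
    page_host = urlsplit(page_url).hostname or ""
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, and similar schemes
        host = parts.hostname or ""  # hostname is already lowercased
        (internal if host == page_host else external).append(absolute)
    return internal, external


internal, external = split_links(
    "https://example.com/blog/post",
    [
        "/about",
        "page2",
        "https://example.com/contact",
        "https://other.example.org/",
        "mailto:me@example.com",
    ],
)
```

Note that with a strict same-host rule, `www.example.com` and `example.com` count as different hosts; whether to merge them is a design decision this sketch does not make.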

How many links are too many?

While fighting web spam, we have encountered, time and time again, host farms that clearly abuse the reliance of web search engines on the web graph to compute page authority. Those spam hosts tend to have lots of pages, probably generated automatically, and sometimes have up to 10,000 outbound links on each page.


In a “normal” setting, most of these links are useless, because all our metrics rely on a fundamental rule of internet browsing: the “Reasonable Surfer model”, which has been the basis of Google’s PageRank for a very long time. 


Google’s patent describes the features that their algorithms use:
“A system generates a model based on feature data relating to different features of a link from a linking document to a linked document and user behavior data relating to navigational actions associated with the link. The system also assigns […] a rank for a particular document, generating the rank including determining particular feature data […], determining a weight indicating a probability of the link being selected, the weight is determined based on the particular feature data and selection data, the selection data identifying user behavior relating to links to other documents […] the weight indicating a higher probability of the link being selected when the particular feature data corresponds to feature data associated with the one or more links than when the particular feature data corresponds to feature data associated with the one or more other links […] words in anchor text associated with the links, and a quantity of the words in the anchor text”


Google’s patent does not describe the secret sauce, but one key ingredient is the position of the link in the page: user behavior is determined by the user’s attention, which is focused on the central blocks rather than the peripheral ones, and on the beginning of a page rather than the end.


At some point, Google was rumored to advise keeping the number of internal links below 100. While this information is hardly to be trusted, if the Reasonable Surfer model has anything to teach us, it is that a link’s position is very probably a strong indicator of its weight. Additionally, even in the earlier “Random Surfer Model”, the number of outbound links on a page determines how much value each of those links passes on.


Consequently, even if there is no “hard limit” on the number of links (and perhaps more specifically, internal links) that are taken into account, the further down the page a link sits and the more links there are, the less important that link will be.
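The effect of link count and link position can be illustrated with a small Python sketch. The first function is the uniform split of the Random Surfer model (each of n links gets 1/n). The second is a purely hypothetical position-based weighting, in which the raw weight halves every `half_life` positions; this decay curve and its parameter are assumptions chosen for illustration and are not taken from Google's patent or from our own metrics.

```python
def random_surfer_weights(n_links):
    """Random Surfer: every outbound link receives the same share, 1/n."""
    return [1.0 / n_links] * n_links


def position_decay_weights(n_links, half_life=50):
    """Hypothetical position-based weights (an assumption, not a known formula).

    The raw weight halves every `half_life` positions, then all weights are
    normalized so that they sum to 1.
    """
    raw = [0.5 ** (position / half_life) for position in range(n_links)]
    total = sum(raw)
    return [w / total for w in raw]


def tail_share(weights, cutoff):
    """Total weight carried by the links after the first `cutoff` links."""
    return sum(weights[cutoff:])


# A page with 1000 links: how much weight goes to links past position 500?
uniform_tail = tail_share(random_surfer_weights(1000), 500)
decayed_tail = tail_share(position_decay_weights(1000), 500)
```

On a 1000-link page, the uniform model gives half of the total weight to the links past position 500, while under the assumed decay those same links together receive about 0.1% of it. The exact figure depends entirely on the chosen curve; the point is only that any position-sensitive weighting makes late links nearly irrelevant.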

How many internal links are used on a typical web page?

And now for some real data: we studied a sample of representative websites to learn how many internal links they use. It turns out that most websites do follow an unwritten rule.
We have analyzed 50k websites, totaling 150M pages and 18 billion internal links.
As usual, we do not focus on averages or even medians; instead, we look at “extreme” quantiles.

The graph above shows that 27% of websites have at least one page that exceeds the 500-link limit (red bar), i.e. 73% of websites don’t have a single page with more than 500 internal outbound links. Likewise, 14% of websites have at least one page that exceeds the 1000-link limit (yellow bar), i.e. 86% don’t have a single page with more than 1000 internal outbound links.

In this graph, we can see that, of all the web pages analyzed, only 3% have more than 500 internal outbound links, and only 1% have more than 1000.

And finally, the graph above shows that these 3% of pages (respectively 1%) account for a very impressive 18% (resp. 12%) of internal links, namely the links that sit past position 500 (resp. 1000).

In other words, a small number of websites have a small number of web pages that account for a very sizable number of internal links, most of which are very likely useless.
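The three measurements above can be sketched in a few lines of Python. The data below is a made-up toy sample (four hypothetical hosts), not our crawl data, and the function `link_limit_stats` is only one plausible way to compute these shares, assuming that "links past the limit" means the links beyond position `limit` on each page.

```python
def link_limit_stats(sites, limit):
    """Compute three shares for a given per-page link limit.

    `sites` maps a host to the list of internal outbound link counts of its
    pages. Returns the share of sites with at least one page over the limit,
    the share of pages over the limit, and the share of all internal links
    that sit past position `limit` on their page.
    """
    all_pages = [n for counts in sites.values() for n in counts]
    sites_over = sum(
        1 for counts in sites.values() if any(n > limit for n in counts)
    )
    pages_over = sum(1 for n in all_pages if n > limit)
    links_past = sum(max(0, n - limit) for n in all_pages)
    return {
        "sites_over": sites_over / len(sites),
        "pages_over": pages_over / len(all_pages),
        "links_past_limit": links_past / sum(all_pages),
    }


# Toy sample, invented for illustration only.
toy_sites = {
    "a.example": [40, 80, 120],
    "b.example": [60, 1500],
    "c.example": [30, 50, 70, 90, 160],
    "d.example": [100, 200],
}
stats = link_limit_stats(toy_sites, limit=500)
```

In this toy sample a single page out of twelve exceeds the limit, yet the links past position 500 on that page make up 40% of all internal links, which is the same kind of skew the real data shows at a much larger scale.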

Is it bad to have too many internal links?

To answer a bit bluntly, it’s certainly not good to have too many internal links. Also, many very suspicious websites have a lot of internal (and external) links. While this does not imply causality, a high link count may well be used as one of the indicators that a website should be regarded with caution.
We have no indication that having too many internal links has a negative impact on a website’s ranking, but it cannot be helpful in any way, AND it consumes a lot of resources unnecessarily.
In general, limiting the links to the most useful and relevant ones is the best approach, and this is true both for external and internal links.

FAQ

Will adding internal links to a page improve its search engine ranking and overall website authority?
Yes, adding internal links to a page can positively impact its search engine ranking and improve the overall authority of a website. Internal links help distribute link equity across different pages, signaling their importance to search engines. By strategically placing relevant anchor texts and interlinking related content, websites can establish a strong internal linking structure that enhances their SEO performance.

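How does the number of internal links on a web page affect its PageRank and search engine ranking?
Understanding the impact of internal linking on a website’s authority and PageRank is crucial for optimizing SEO. As discussed above, the more links a page carries and the further down the page a link sits, the less weight that link is likely to receive.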

What are the best practices for building internal links on a website?
Implementing effective internal linking strategies is essential for enhancing the user experience and improving website navigation. By strategically incorporating anchor texts within the content, users can easily discover related information and explore different pages on the site.

How can internal linking improve SEO for a website?
By creating a network of internal links, websites can establish a clear and organized structure that search engines like Google can crawl and index. This aids in improving website visibility and increasing organic traffic. Moreover, internal linking allows search engines to determine the relevance and contextual relationships between different pages, which can positively impact SEO.

Why is it important to have a solid internal linking strategy for SEO?
A well-crafted internal linking strategy not only helps search engines understand the architecture of a website but also enhances the user experience. By providing easy access to relevant information, internal linking can keep users engaged and encourage them to spend more time exploring the site. This can improve user satisfaction and ultimately contribute to higher search engine rankings.