How duplicate content is estimated in Yourtextguru

Duplicate content was one of the first problems faced by web search engines. Back in the early 1990s, a search would frequently return several very similar web pages.

And that’s bad, first for the user, who doesn’t want to manually filter out content they have already discarded, and second for the search engine, which must store more data in its index. Search is generally targeted toward original content, not toward every version of a piece of content or variation on a theme. Google quickly came up with a solution, and the drawback was that having duplicate content on your website became a bad signal for Google and for your ranking. Hence the multitude of SEO articles on the topic, and of SEO tools that help you analyze your website to detect whether your content is duplicated. Some of these tools can crawl your website on demand, analyze the content of your web pages and list the URLs that are highly similar to each other. Other tools, like Yourtext.guru for example, use already-crawled content to give you a quick overview of the duplicate-content distribution. Finally, in another category of SEO tools, there are plagiarism detectors that try to find copies of your web pages on other websites.

What is duplicate content and why is it a problem?

First, what is content? For Google, at least, it’s the core of the web page: the text minus all the boilerplate, the menus, the side navigation bars, and so on.

Your website is considered to have duplicate content if the search engine can find several URLs with the same (or very similar) content. So if one of your URLs allows for a lot of variations (for example because of filter, ordering or display-type parameters), then unless you prevent it somehow, the search engine will crawl these variations and detect their similarity.

Duplicate content within a website is bad because it creates confusion for the search engine when determining which version of the content to prioritize and display in search results. This can dilute the ranking potential of a page, leading to reduced visibility and effectiveness. Furthermore, duplicate content can negatively affect your site’s crawl budget, as search engines might spend unnecessary resources indexing duplicated pages instead of focusing on unique and valuable content.

Search engines aim to provide users with the most relevant and diverse set of results possible. When multiple pages serve essentially the same information, it impairs the user experience by cluttering the search results with redundant entries. This is particularly problematic for e-commerce sites, where product listings might appear under different categories or with varying URLs due to tracking parameters or session IDs.

To tackle these issues, techniques such as canonical URLs, noindex meta tags, and robots.txt rules can help the search engine decide whether to crawl a page or consider it when checking for duplicate content.

So, how is duplicate content measured in Yourtext.guru?

It all starts with the content of each page, which is, in itself, a problem: the easiest way to get the content is to take the whole HTML document into account, but we can also use the core content extracted from the HTML file. To extract the content of a page, we use the same methodology as for the semantic vectors that we use to compute semantically similar sites and domains. That is to say: we keep everything (we have tested many methods for extracting core content, but the benefit was not easy to demonstrate in the general case).

We can then apply different normalization methods to the content (converting to lowercase, removing accents, etc.). But in practice, what we do will depend heavily on the final use case. In the case of internal duplication, we often want to detect not plagiarism, but rather identical content served in different contexts, so normalization is less important. Furthermore, normalization requires specific work depending on the language.

The concept of content signature

Since it’s out of the question to keep the full content to calculate duplication, various methods of compressed representation have been proposed to speed up the process while keeping a minimum amount of data. These representations are called fingerprints or signatures. We will cover two of the most used signatures here:

MinHash

One of these methods is implemented in the “simhash” tool available on several GNU/Linux distributions (beware of the confusion related to the tool’s name) and uses the principle proposed by Mark Manasse [1].

This principle is a variant of the MinHash technique applied to “shingles.” To understand the principle of this algorithm, let’s first imagine that we break down a document into words.

The basic principle of MinHash is as follows: to create the signature of a document, a hash function is applied to each of the words it contains. This hash function returns a number.

The signature kept is the smallest hash observed for a document. Two documents that contain the word that produces the smallest hash (and do not contain a word that produces an even smaller hash) will therefore have the same MinHash.

In practice, having a single value based on the hash of one word is too fragile, so two variants can be used:

  • Either we use several independent hash functions,
  • Or we keep the K smallest hashes observed for a document.

The technique used in the “simhash” tool that can be found on most GNU/Linux distributions corresponds to the second method, with K=128 by default.

This tool generates 128 * 32 = 4096-bit signatures by default, i.e. 512 bytes. In practice, since the kept hashes are concentrated toward the bottom of the number space, they can quite easily be stored in a compressed form, and the similarity between two signatures can be computed quite efficiently. This computation is simply a Jaccard similarity (number of identical hashes divided by the number of hashes kept).
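
To make the principle concrete, here is a minimal Python sketch of the bottom-K variant described above (hash every word, keep the K smallest hashes, compare signatures with the identical-hashes ratio). The hash function, the value of K and the word-level tokenization are illustrative assumptions, not the exact choices made by the “simhash” tool or by Yourtextguru.

import hashlib

def minhash_signature(text, k=128):
    # Hash every word and keep the K smallest hash values as the signature.
    hashes = {int.from_bytes(hashlib.md5(w.encode("utf-8")).digest()[:4], "big")
              for w in text.lower().split()}
    return set(sorted(hashes)[:k])

def minhash_similarity(sig_a, sig_b):
    # Number of identical hashes divided by the number of hashes kept.
    kept = max(len(sig_a), len(sig_b))
    return len(sig_a & sig_b) / kept if kept else 1.0

sig1 = minhash_signature("duplicate content is a problem for web search engines")
sig2 = minhash_signature("duplicate content is a real problem for search engines")
print(minhash_similarity(sig1, sig2))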

A brief note on the advantages and disadvantages of this approach: 1/ just by adding a little vocabulary (e.g., a few typos here and there), the distance increases quickly, because only the vocabulary matters: whether a word appears often or rarely is irrelevant. 2/ Computing the similarity between many signatures is very time-consuming.

Shingling somewhat mitigates these drawbacks: rather than breaking a document down into words, and thus having a large distortion in the distribution between frequent and rare words, we break it down into sequences of N letters using a sliding window over the content. The advantage is that with a fixed window size, the frequency bias on short words (typically grammatical words) is much less present. On the other hand, there are many more tokens to hash when computing the signature, as the number of shingles is roughly the number of letters in the document (exactly L - N + 1 shingles for a document of L letters).
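
As an illustration, a character-level shingling tokenizer is a one-liner; feeding its output to the signature function sketched above (instead of words) gives the shingled variant. The window size of 8 is an arbitrary choice for the example.

def shingles(text, n=8):
    # Sliding window of n characters: a document of L characters
    # yields exactly L - n + 1 shingles.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(sorted(shingles("SEO made simple")))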

SimHash

The algorithm proposed in 2003 by Moses Charikar from Google [2] allows us to retain information about the statistical distribution of words or shingles in the document. The idea is as follows:

  • Each token is hashed. The result of the hash is then viewed in terms of its binary representation and converted into a sequence of 0s and 1s. Example: An 8-letter shingle “SEO made” produces the hash 0x8ABB, which gives the sequence {1,0,0,0,1,0,1,0,1,0,1,1,1,0,1,1}.
  • The zeros are then converted to -1, giving us {1,-1,-1,-1,1,-1,1,-1,1,-1,1,1,1,-1,1,1}.
  • The sequence is added, position by position, to a global integer vector that will represent the “pre-signature” of our document.
  • Once all tokens of the document are processed, we have a vector in which all token vectors are added together, for example: {-12,67,1,-95,12,3,8,0,-11,…}.
  • The final signature is obtained by converting each value to 0 or 1 depending on whether it is negative or positive, so we get: {0,1,1,0,1,1,1,1,0,…}

To compute the similarity between two documents, we simply take their respective signatures and compute the Hamming distance, which is the (very fast) operation bitcount(a XOR b); the similarity is then the proportion of identical bits.
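
Here is a minimal Python sketch of this signature and of the bit-level similarity, using 64-bit signatures, word tokens and MD5 as the per-token hash; all three are illustrative assumptions rather than the exact choices made in Yourtextguru.

import hashlib

BITS = 64

def simhash(tokens):
    # Add +1 or -1 per bit position over all token hashes (the "pre-signature"),
    # then keep only the sign of each position as the final bit.
    counts = [0] * BITS
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)

def hamming_similarity(a, b):
    # Proportion of identical bits: 1 - bitcount(a XOR b) / BITS.
    return 1 - bin(a ^ b).count("1") / BITS

sig1 = simhash("duplicate content is a problem for search engines".split())
sig2 = simhash("duplicate content is a big problem for web search engines".split())
print(hamming_similarity(sig1, sig2))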

In the implementation done on Yourtextguru, we have improved some aspects of the algorithm to gain performance and reduce the impact of overly frequent tokens. Although the principle is slightly different, the result is very close to the results of Charikar’s method as cited in the 2007 Google paper [3].

Distribution Calculation

For the distribution calculation, that is, the representation provided by Yourtextguru, we extract the list of signatures for a site or domain, and then calculate the similarity matrix. This matrix is then presented in a synthetic way through a histogram. Each bar represents the number of pairs of pages with a given similarity. To get the list of URLs in a given bucket, you can just click on it, and it will display the list of pairs of URLs with their similarity. 
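
As a sketch of the general idea (not Yourtextguru’s actual pipeline), the histogram can be built from the pairwise similarities in a few lines, assuming a dictionary mapping each URL to its signature and the hamming_similarity function from the sketch above:

from itertools import combinations
from collections import Counter

def similarity_histogram(signatures, similarity):
    # Count the number of page pairs falling into each similarity bucket (0-100),
    # and keep the URL pairs so that a bucket can be inspected afterwards.
    buckets = Counter()
    pairs = {}
    for (url_a, sig_a), (url_b, sig_b) in combinations(signatures.items(), 2):
        pct = round(100 * similarity(sig_a, sig_b))
        buckets[pct] += 1
        pairs.setdefault(pct, []).append((url_a, url_b))
    return dict(sorted(buckets.items())), pairs

# hist, pairs = similarity_histogram(signatures, hamming_similarity)
# pairs[89] would list the URL pairs in the 89% bucket.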

An example of a bad distribution with high duplication is the following (one can see two groups of duplicates, a small one around 33 and a very high one around 89, probably caused by a template that is not used everywhere on the website):

Note that randomly drawn signatures of the same dimension should produce a Gaussian distribution of similarities centered around 50%. So, for 100-bit signatures, the Gaussian should be centered around 50. However, we often observe this kind of distribution:

When the distribution is Gaussian but the average similarity is shifted to the right, it means that templates, boilerplate, or other redundant elements of the web pages account for a significant share of the extracted content.
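
The 50% baseline mentioned above is easy to check with a quick simulation over purely random signatures (illustrative only):

import random

BITS = 100
docs = [random.getrandbits(BITS) for _ in range(200)]
sims = [100 * (BITS - bin(a ^ b).count("1")) / BITS
        for i, a in enumerate(docs) for b in docs[i + 1:]]
print(round(sum(sims) / len(sims), 1))  # close to 50, with a narrow Gaussian spread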

The two following distributions were observed on two well-known news websites, where the content is clearly original and carries enough weight to completely mask the redundant parts of the web pages:

Those variations in the “average similarity” are one of the reasons why it’s quite difficult to extend the duplicate content detection across websites. Our initial experiments in this regard have shown that the text extraction step is crucial for the reliability of this measure.

Frequently asked questions about duplicate content

How does duplicate content affect SEO rankings?

Duplicate content can confuse search engines, leading them to struggle with deciding which version of the content is the most relevant for a given query. This can dilute the visibility of the page in search engine results, as it might cause search engines to rank the less favorable duplicate or even to split ranking signals. Google Search, for example, may choose to consolidate the signals by selecting a canonical version through its own algorithms, but this doesn’t guarantee the outcome webmasters might prioritize. Moreover, if the content appears to be intentionally duplicated across different domains, it could be perceived as an attempt to manipulate rankings, potentially resulting in penalties. Implementing the “rel=canonical” tag can help indicate the preferred version of the content, guiding search engines in appropriately attributing ranking signals.

What are the potential issues of having duplicate content on a website?

Having duplicate content on a website can lead to several issues that impact its performance in search engine rankings. One major problem is the dilution of link equity; when multiple pages with similar content compete for the same set of keywords, backlinks that could strengthen a single page’s authority are instead distributed across several duplicates, weakening their individual impact. Additionally, duplicate content can generate crawling inefficiencies. Search engines have a limited crawl budget for each site, and when they come across multiple instances of similar content, they may waste resources indexing redundant pages instead of new or unique pages that could boost the site’s relevance and authority. This could delay the indexing of fresh or updated content, ultimately affecting how promptly a site can respond to trends or shifting user interests.

How can canonical tags help resolve duplicate content issues?

Canonical tags play a critical role in resolving duplicate content issues by signaling to search engines which version of a page should be considered the primary one. When properly implemented, a canonical tag informs search engines of the preferred URL, consolidating the ranking signals such as backlinks and click-through rates towards this chosen version. This not only helps in unifying disparate metrics that might otherwise be fragmented across multiple duplicates but also optimizes the crawl budget by directing search engine spiders to index the canonical page, thereby saving resources and improving the site’s overall efficiency in search engine results.

Moreover, canonical tags can prevent unintentional duplicate content situations, such as session identifiers in URLs or print versions of pages, from impacting the site’s SEO health, by effectively managing which content search engines recognize as authoritative.

How can robots.txt rules help resolve duplicate content issues?

Your robots.txt file can specify which URLs are explicitly disallowed to web search engine crawlers. For example, if all your product pages have optional parameters that produce display variations (an orderby parameter, a filter parameter, a display-type parameter such as list or thumbnails, etc.), you can simply add rules that disallow patterns such as:

User-agent: *
Disallow: /products/*?*order=*
Disallow: /products/*?*filter=*
Disallow: /*?*session_id=*

With these rules in place, not only will search engines avoid crawling duplicate content, but you will also save crawl budget and reduce server and bandwidth costs. By implementing SEO best practices such as using canonical tags, optimizing your website’s URLs, and addressing any content issues, you can ensure that search engines like Google will prioritize indexing the correct versions of your content. This will enhance your website’s visibility in search results and improve the overall user experience. You can also take advantage of Google Search Console to identify and rectify duplicate content issues and make sure your site is optimized for search engines.


How does Google Search Console help identify duplicate content?

Google Search Console provides invaluable insights into how Google perceives your website, offering tools to identify potential duplicate content issues. Within the platform, you can access reports that highlight indexing errors, including those related to duplicate content. The “Coverage” report is particularly useful, as it shows which URLs are indexed and which are not, along with reasons such as “Duplicate without user-selected canonical”. This information is critical for diagnosing content issues that might otherwise go unnoticed. Google Search Console also allows you to inspect individual URLs, giving you a detailed view of how Google’s crawlers see each page. By submitting sitemaps and using the URL Inspection tool, you can better understand canonicalization issues and address them promptly.

For checking websites other than your own, however, Yourtext.guru is obviously much better suited than Google Search Console.

How can duplicate content arise on a website?

Duplicate content can arise on a website in several ways. One common cause is having multiple URLs that display the same or similar content, which can occur due to tracking parameters or session IDs in URLs. E-commerce sites are particularly prone to this, as products might be accessible through different category pages or filtered results, leading to multiple variations of a page. Another source of duplicate content is printer-friendly versions of web pages or translated versions that don’t use the proper canonical tags. Additionally, content syndication without proper attribution or canonicalization can lead to duplicates. 

To summarize, the two most common causes of duplicate content that arise on a website are:

  1. A technical addition (a new plugin or equivalent) that creates variations of web pages for UI, UX, or tracking purposes. Most plugins handle the canonical tag correctly, but some manual tweaking can easily break it.
  2. A theme change that shifts the balance between the relevant core content and the boilerplate that is taken into account when checking for duplicate content.

It’s important to resolve these issues to avoid dilution of link equity, ranking penalties, and to ensure that search engines recognize the authoritative version of each page.

How can search engines distinguish between original and duplicate content?

Search engines can distinguish between original and duplicate content by using sophisticated algorithms that evaluate various factors. One method is through the analysis of the content’s publication date; the earlier version is often considered the original, provided it appears on a reputable site. Additionally, search engines examine backlink profiles, with pages having more authoritative backlinks being prioritized as the original source. The implementation of a rel=canonical tag also instructs search engines on which version of a page should be treated as the authoritative source, simplifying the process of canonicalization. Google’s algorithms are designed to recognize patterns and attributes in content that indicate originality, such as text uniqueness and the presence of unique images or media. Furthermore, meta tags and sitemaps submitted via platforms like Google Search Console can aid in identifying the original source.

Why is it important to identify duplicate content on a website and how can I do it?

A living website changes all the time: new plugins are added, new sections are created. Checking for duplicate content regularly is necessary because it is almost impossible to know beforehand whether a change will lead to duplicate content. Search engines, on the other hand, constantly crawl and recrawl your website, and will discover your duplicate content as soon as a bad update is applied. Identifying duplicate content is crucial to maintaining a site’s SEO health and ensuring that search engines can accurately index and rank the most relevant pages. One effective way to identify duplicate content is to use specialized tools such as Yourtext.guru for a quick and efficient check, or Screaming Frog for an in-depth analysis. These tools can scan your website for similarities and flag potential issues. Regular audits should be part of your maintenance routine, focusing on title tags, headings, and body content. Using the “site:” search operator in Google can also help spot duplicated entries by listing all indexed pages, allowing you to manually verify their uniqueness. Employing a canonical tag consistently across similar pages provides clarity to search engines about the preferred version of the content. Content management systems often have plugins designed to monitor and manage duplicate content, ensuring that any new duplicates are caught early.

Is there a possibility of duplicate content with product descriptions?

Indeed, there is a significant possibility of encountering duplicate content with product descriptions. This problem commonly arises in e-commerce websites where similar or identical products from different vendors are frequently listed with the same descriptions. The problem is further exacerbated when manufacturer-provided descriptions are reused across multiple sites. This presents a challenge for search engines such as Google, as they struggle to identify the original version, potentially resulting in lower rankings or exclusion from search results. To mitigate this risk, it is recommended to create unique and captivating product descriptions that set your listings apart from competitors. Incorporating original insights or customer reviews can enhance the distinctiveness of your product pages. Additionally, implementing structured data markup can assist search engines in better comprehending and accurately displaying your content. Regularly reviewing and updating product descriptions ensures their ongoing relevance while reducing the likelihood of duplication.

How can URL parameters contribute to duplicate content problems?

Duplicate content can have a significant impact on a website’s SEO as it diminishes the value of original content and causes confusion for search engines in terms of determining which version to rank for a given query. This confusion may result in lower rankings for all versions of the content, ultimately decreasing the site’s search visibility and traffic. Several instances can lead to duplicate content, such as utilizing identical product descriptions across multiple e-commerce sites or having different URL parameters that lead to the same content page. To check for duplicate content, it is advisable to regularly audit the site using tools like Screaming Frog or Yourtext.guru, which can help identify duplicated pages and ensure the uniqueness of the content. One effective means of indicating to search engines the definitive version of the content is through the use of the rel=”canonical” tag. 

How can website owners prevent duplicate content issues from arising?

One of the most effective methods for preventing these issues is to use sitemaps and robots.txt rules. Duplicate content can have a negative impact on a site’s SEO, as it can confuse search engines when deciding which version of the content to index and rank. This can lead to reduced visibility for each duplicate piece and potentially result in ranking penalties. For example, if a site displays identical product descriptions across multiple pages without implementing the appropriate rel=canonical tags, it can result in duplicate content. It is recommended to regularly conduct audits using tools such as Yourtext.guru to efficiently detect and resolve duplicate content problems.

To distinguish original content from duplicates, search engines employ advanced algorithms. The implementation of canonical tags plays a critical role in this process as they inform search engines which version of a page should be considered the authoritative source, ultimately resolving any duplicate content issues.

References

[1] Mark Manasse. Finding similar things quickly in large collections. MSR Silicon Valley, 2003.

[2] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC2002). https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

[3] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web (WWW ’07). http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf