Quickly understanding which sites compete with a client is a broad topic when you are not an expert in their market. An SEO consultant needs at least a minimal understanding of the ecosystem in which the client operates in order to propose an SEO strategy and actions that will boost one or several pages on Google.
Some preliminary points to consider:
From the perspective of a search engine, we already know that content is represented in the form of a vector (embedding). A website is therefore a grouping of several embeddings and can have a similar representation: mathematically, we can group vectors to form a global one for the entire site.
In two dimensions, it would look like this:

This is also where the notion of a site must be defined. A domain makes no sense as the unit for a site: the topic would be too blurry. If we had to compare wordpress.com with a competitor, there would be too many associated websites. We must therefore speak of hosts (or subdomains, if you prefer) to get a more accurate view. And since we should not compare incomparable things, every site must be considered at host level.
bakery.example.com
mechanics.example.com
example.com
www.example.com
Here we have an example of 4 sites on the same domain (example.com) but on 4 different subdomains (hosts): bakery, mechanics, www and the apex (no subdomain). We will necessarily compare hosts with other hosts, and not with other domains (even the one at the apex).
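To make the host-level view concrete, here is a minimal sketch that groups page URLs by host using Python's standard library. The URLs and the `group_by_host` helper are hypothetical and only illustrate the idea; this is not how any particular crawler does it.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group page URLs by host, the unit this article calls a "site"."""
    pages_by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # lowercased host name, without the port
        if host:
            pages_by_host[host].append(url)
    return dict(pages_by_host)


# Hypothetical URLs on the four hosts listed above.
urls = [
    "https://bakery.example.com/bread",
    "https://mechanics.example.com/brakes",
    "https://example.com/",
    "https://www.example.com/about",
    "https://www.example.com/contact",
]
sites = group_by_host(urls)
for host in sorted(sites):
    print(host, len(sites[host]))
```

Note that `example.com` and `www.example.com` end up in separate groups, which is exactly the distinction made above.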
We simply need to compare different vectors to obtain sites that talk about the same things as you. The more specialized the sites are, the more precise this comparison will be.
A currently imprecise approach for defining competition
Today on the web, the tools that present competition will only display sites that have the same keywords as you.
Let’s take an example: an iPhone case seller has a website, and if he relies on the “keywords” approach, a marketplace like Amazon could be considered his competitor. However, Amazon does not consider him a competitor at all, because Amazon has many other verticals through which it sells many other products. Does it make sense for our case seller to compare himself to Amazon? Not really.
This approach has a major issue: the size of the websites varies, and some sites are more generalist while others are more specialized. By only taking into account common keywords, we therefore do not have a sufficiently refined approach to distinguish between generalists and specialists.
Another issue is that if the proposed list of competitors is truncated, the tool will generally propose generalist sites because they have many keywords, which again leads to too imprecise a definition of competition.
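The bias toward generalists can be illustrated with a small sketch. The keyword sets, host names and the `rank_by_common_keywords` helper below are invented for the example; real tools work on far larger keyword databases and may weight keywords differently.

```python
def rank_by_common_keywords(target_keywords, candidates):
    """Rank candidate sites by the raw number of keywords shared with the target."""
    scores = {
        host: len(target_keywords & keywords)
        for host, keywords in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical keyword sets, for illustration only.
case_seller = {"iphone case", "iphone 15 case", "leather iphone case", "magsafe case"}
candidates = {
    "marketplace.example": {
        "iphone case", "iphone 15 case", "leather iphone case", "magsafe case",
        "garden hose", "office chair", "dog food",
    },
    "cases.example": {"iphone case", "leather iphone case", "magsafe case"},
}

ranking = rank_by_common_keywords(case_seller, candidates)
for host, common in ranking:
    print(host, common, "common keywords out of", len(candidates[host]))
```

With these toy sets the marketplace comes first simply because it ranks on everything, even though the specialist is the closer match.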
The common keywords method is nevertheless interesting, as it identifies the sites that occupy a share of visibility in search engines, a share that one might want to recover.
In SEO, how can we define competition?
In SEO, we can talk about competition when referring to the websites that aim to capture the same pool of prospects on key Google queries or on more general intents. However, not all websites have the same means, nor do they all address the same target.
In reality, talking about competition is rather vague, because everyone uses their own definition when designating a site as a competitor.
On a case-by-case basis (query by query), a competing page is one that ranks in the SERP (search engine results page) on the same query as you, or on a query you are targeting. It is therefore a page that covers roughly the same topics, or that is at least considered as legitimate as your page to rank on Google (or any other engine) for that query.
From a broader perspective, websites have different content policies: some are more specific, others more generalist. Overall, the proximity of content policies can very well serve as a method to define who is trying to compete directly with a particular site.
Which alternative method can define competition?
Today, few other methods allow a comparable approximation of competition. By using the average embedding of a site, we should be able to compute a similarity score between two sites, in the same way that a search engine measures the similarity between a document and a query. To be completely precise, one should rather speak of the inverse of the distance between two average embeddings.
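As a rough illustration, the sketch below averages toy page embeddings per site and compares sites in two ways: cosine similarity and an inverse-distance score. Both scoring functions, the 2-dimensional vectors and all names are assumptions made for the example; the text does not say which embedding model or distance is actually used.

```python
import math


def average_embedding(page_embeddings):
    """Component-wise mean of a site's page embeddings."""
    count = len(page_embeddings)
    return [sum(component) / count for component in zip(*page_embeddings)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def inverse_distance(a, b):
    """1 / (1 + Euclidean distance): one possible "inverse of the distance"."""
    return 1.0 / (1.0 + math.dist(a, b))


# Toy 2-dimensional page embeddings for three hypothetical hosts.
gym_a = average_embedding([[0.9, 0.1], [0.8, 0.2]])
gym_b = average_embedding([[0.7, 0.3], [0.9, 0.1]])
bakery = average_embedding([[0.1, 0.9], [0.2, 0.8]])

print(cosine_similarity(gym_a, gym_b), cosine_similarity(gym_a, bakery))
print(inverse_distance(gym_a, gym_b), inverse_distance(gym_a, bakery))
```

Whatever the exact measure, the two gym sites should score closer to each other than either does to the bakery.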
This method is completely independent of the common keywords approach, since it “only” requires having a sufficiently large web crawl available to define which sites are similar. (It’s a bit more complicated than that, especially when one considers the size that such an index must be, and that comparing the sites comes down to creating a gigantic matrix to compare the entire list). We will therefore turn to an operator who has already done the work for part of it: Babbar, which provides its API clients with a route allowing them, by language, to identify up to 100 hosts similar to an input host, through the embedding method.
Unfortunately, the embedding approach alone is not refined enough to give results that the end customer will find relevant. One can end up with a list of sites that are in the right thematic field but are small players, invisible in the SERPs, that the general public will neither know nor trust.
Using a method that takes into account embeddings is using a search engine’s perspective to define competitors. Using common keywords is using a perspective of visibility in the SERPs.
Therefore, we could combine the two approaches to find more relevant results.
How to combine common keywords and similar sites?
It’s quite simple. A drawing shows what can be obtained by combining these two approaches:

The competing sites are distributed according to the similarity score (on the x-axis) and the normalized number of common keywords (on the y-axis). The size of each point varies with the total number of keywords of the competing site.
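The normalization itself is not spelled out here. A minimal sketch, assuming the raw count of common keywords is simply divided by the largest count among the candidates (which would be consistent with the best-placed site getting exactly 1.0 in the tables further down); the host names and counts are hypothetical.

```python
def normalize_by_max(counts):
    """Scale raw common-keyword counts to [0, 1] by dividing by the largest count."""
    largest = max(counts.values(), default=0)
    if largest == 0:
        # No candidate shares any keyword: everything stays at zero.
        return {host: 0.0 for host in counts}
    return {host: count / largest for host, count in counts.items()}


# Hypothetical raw counts of keywords shared with the target site.
raw_counts = {"a.example": 200, "b.example": 50, "c.example": 0}
normalized = normalize_by_max(raw_counts)
print(normalized)
```

Other choices (min-max scaling, a logarithmic scale) would change the spacing of the points along the y-axis, and therefore the final ranking.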

Drawing a line that passes through (0,0) and (1,1) then gives us a single dimension along which to compare all sites.
We can thus identify the top competitors by orthogonally projecting each competitor’s point onto the line and ranking the projections from the highest value to the lowest.

Based on our approach, the most important area for the initial site is the upper right, close to the diagonal.
We can then draw up a list of competitors, sorted by their position on the diagonal.
Let’s take an example:
Here is the result when considering only the top 20 keywords, in the English-speaking US market, for the site www.planetfitness.com:
(Interactive graph of the competitors; here the x-axis is not normalized.)
The diagonal is the line that starts at (0,0) and passes through (1,1). Its equation is:

$$y = x$$

For a point (x, y), the orthogonal projection onto the line is given by:

$$t\,(1, 1), \qquad t = \frac{x + y}{2}$$
In this case we only need the coefficient t: multiplying by (1,1) is not useful for comparing values, since it is the same vector for every projected point.
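The ranking step can be sketched in a few lines. Projecting (x, y) onto the direction (1, 1) gives t = (x + y) / 2, which matches the t_projection column of the table that follows (for example, (0.7 + 1.0) / 2 = 0.85). The function names are illustrative, and the sample values are the first four rows of that table with the common-keyword value rounded.

```python
def t_projection(x, y):
    """Coefficient t of the orthogonal projection of (x, y) onto the line y = x."""
    return (x + y) / 2


def rank_competitors(points):
    """Sort hosts by decreasing t, given {host: (similarity, common_keywords_norm)}."""
    scored = [(host, t_projection(x, y)) for host, (x, y) in points.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# (similarity_score, common_keywords_norm), rounded from the top-20 table.
points = {
    "www.planetfitness.ca": (0.87, 0.01273),
    "www.orangetheory.com": (0.68, 0.37391),
    "gym.com": (0.70, 0.41607),
    "www.gymbird.com": (0.70, 1.0),
}

ranking = rank_competitors(points)
for host, t in ranking:
    print(f"{host}: {t:.4f}")
```

Because both axes carry the same weight, a very similar site with almost no common keywords (such as www.planetfitness.ca here) ends up below less similar sites that share many keywords.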
We therefore obtain, for www.planetfitness.com (on keywords and similar sites in English from the US), the following table, sorted by decreasing t_projection since we have reduced the competitors to a single dimension:
similar | similarity_score | common_keywords_norm | t_projection |
www.gymbird.com | 0.7 | 1.0 | 0.85 |
gym.com | 0.7 | 0.41607000795544946 | 0.5580350039777247 |
www.orangetheory.com | 0.68 | 0.3739061256961018 | 0.526953062848051 |
www.planetfitness.ca | 0.87 | 0.012728719172633254 | 0.44136435958631665 |
chuzefitness.com | 0.67 | 0.11429329090426943 | 0.39214664545213473 |
youfit.com | 0.69 | 0.05595332802970034 | 0.37297666401485013 |
www.o2fitnessclubs.com | 0.7 | 0.030495889684433838 | 0.3652479448422169 |
www.thegymgroup.com | 0.7 | 0.01988862370723946 | 0.3599443118536197 |
www.thefitnessdistrictgym.com | 0.69 | 0.002916998143728454 | 0.3464584990718642 |
gymdues.com | 0.69 | 0.002916998143728454 | 0.3464584990718642 |
planetfitnessteenfitpass.com.au | 0.68 | 0.012198355873773535 | 0.34609917793688677 |
wellbridge.com | 0.68 | 0.010607265977194379 | 0.34530363298859723 |
titanfitness24.com | 0.68 | 0.0015910898965791568 | 0.3407955449482896 |
bodyfuelfitness.com | 0.68 | 0.0013259082471492973 | 0.34066295412357467 |
www.leadfitness.com | 0.68 | 0.0010607265977194379 | 0.3405303632988597 |
transform180training.com | 0.68 | 0.0010607265977194379 | 0.3405303632988597 |
theironplate.com | 0.68 | 0.0007955449482895784 | 0.3403977724741448 |
steelfitnesspremier.com | 0.68 | 0.0005303632988597189 | 0.34026518164942987 |
info.o2fitnessclubs.com | 0.67 | 0.006894722885176346 | 0.3384473614425882 |
www.racmn.com | 0.67 | 0.002916998143728454 | 0.33645849907186426 |
fdgyms.com | 0.67 | 0.0010607265977194379 | 0.3355303632988597 |
ellisathleticcenter.com | 0.67 | 0.0007955449482895784 | 0.3353977724741448 |
www.blastfitness.com | 0.67 | 0.0007955449482895784 | 0.3353977724741448 |
www.raincityfit.com | 0.67 | 0.00026518164942985947 | 0.33513259082471497 |
www.fairviewlfc.com | 0.67 | 0.00026518164942985947 | 0.33513259082471497 |
www.goldsgymdcmetro.com | 0.65 | 0.01909307875894988 | 0.3345465393794749 |
www.clubfit30.com | 0.66 | 0.0021214531954388757 | 0.33106072659771946 |
www.gymcompany.co.za | 0.66 | 0.0010607265977194379 | 0.3305303632988597 |
www.survive41.com | 0.66 | 0.0007955449482895784 | 0.3303977724741448 |
There is undoubtedly a limit to be found in order to keep only the best results.
We can perform the same exercise with the top 100 keywords. The ranking is quite different, but we still find the same top competitors:
similar | similarity_score | common_keywords_norm | t_projection |
www.orangetheory.com | 0.68 | 1.0 | 0.8400000000000001 |
www.gymbird.com | 0.7 | 0.9092111959287532 | 0.8046055979643766 |
gym.com | 0.7 | 0.48346055979643765 | 0.5917302798982188 |
chuzefitness.com | 0.67 | 0.4355216284987277 | 0.5527608142493639 |
www.thegymgroup.com | 0.7 | 0.22615776081424938 | 0.46307888040712464 |
www.planetfitness.ca | 0.87 | 0.031246819338422393 | 0.4506234096692112 |
youfit.com | 0.69 | 0.1146055979643766 | 0.4023027989821883 |
www.o2fitnessclubs.com | 0.7 | 0.0789821882951654 | 0.38949109414758265 |
gymbills.com | 0.7 | 0.011297709923664122 | 0.355648854961832 |
wellbridge.com | 0.68 | 0.030127226463104326 | 0.35506361323155217 |
gymdues.com | 0.69 | 0.013842239185750636 | 0.3519211195928753 |
gymbigot.com | 0.7 | 0.001424936386768448 | 0.3507124681933842 |
www.beactivefitness.co.nz | 0.7 | 0.00010178117048346055 | 0.3500508905852417 |
www.thefitnessdistrictgym.com | 0.69 | 0.0015267175572519084 | 0.3457633587786259 |
www.gymcompany.co.za | 0.66 | 0.03094147582697201 | 0.34547073791348604 |
planetfitnessteenfitpass.com.au | 0.68 | 0.0071246819338422395 | 0.3435623409669211 |
titanfitness24.com | 0.68 | 0.0031552162849872775 | 0.34157760814249366 |
steelfitnesspremier.com | 0.68 | 0.0030534351145038168 | 0.34152671755725195 |
transform180training.com | 0.68 | 0.002544529262086514 | 0.34127226463104327 |
www.leadfitness.com | 0.68 | 0.001119592875318066 | 0.34055979643765905 |
bodyfuelfitness.com | 0.68 | 0.0008142493638676844 | 0.34040712468193385 |
theironplate.com | 0.68 | 0.0006106870229007634 | 0.3403053435114504 |
www.crossfitchicagoheights.com | 0.68 | 0.00010178117048346055 | 0.34005089058524174 |
www.shalomwellnesscenter.org | 0.68 | 0.00010178117048346055 | 0.34005089058524174 |
www.racmn.com | 0.67 | 0.006921119592875318 | 0.3384605597964377 |
info.o2fitnessclubs.com | 0.67 | 0.006513994910941475 | 0.3382569974554708 |
www.plusfitness.com.au | 0.66 | 0.01648854961832061 | 0.3382442748091603 |
ellisathleticcenter.com | 0.67 | 0.002849872773536896 | 0.33642493638676846 |
fdgyms.com | 0.67 | 0.0026463104325699744 | 0.33632315521628503 |
(The corresponding interactive graph, without the diagonal line; here the x-axis is normalized.)
It is interesting to see that in both cases, the top 3 competitors of the initial site are the same (gym.com, www.gymbird.com and www.orangetheory.com), although in a different order. This suggests that the best setting for finding the main competitors lies somewhere between the top 20 and the top 100, since the top 100 brings more information but also a little more noise.
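This stability check is easy to automate. The sketch below compares the first entries of the two rankings shown in the tables above; `top_overlap` is an illustrative helper, not part of any tool.

```python
def top_overlap(ranking_a, ranking_b, n):
    """Hosts present in the first n entries of both rankings."""
    return set(ranking_a[:n]) & set(ranking_b[:n])


# First five hosts of each table, in table order.
top20_ranking = [
    "www.gymbird.com", "gym.com", "www.orangetheory.com",
    "www.planetfitness.ca", "chuzefitness.com",
]
top100_ranking = [
    "www.orangetheory.com", "www.gymbird.com", "gym.com",
    "chuzefitness.com", "www.thegymgroup.com",
]

shared_top3 = top_overlap(top20_ranking, top100_ranking, 3)
shared_top5 = top_overlap(top20_ranking, top100_ranking, 5)
print(sorted(shared_top3))
print(len(shared_top5), "hosts shared in the top 5")
```

Running the same comparison at several keyword depths would be one way to look for the setting where the head of the list stops changing.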
Will the result always be sufficient to find the client’s major business competitors?
Not necessarily. Remember, we retrieve the top 100 sites according to the similarity score, which is based on the average embedding of each site visited by a crawler. While it is unlikely that a crawler visiting 3 to 6 billion pages per day would miss a site, it is very likely that many small sites appear among the most similar sites without holding many positions. Adding the second dimension, the number of common keywords, refines the approach and filters these sites out. However, we are limited to 100 similar sites, from which we only keep those that have common keywords.
Therefore, everything depends on the keyword database and the analyzed site database:
For the list of similar sites, everything depends on the quantity of pages analyzed on the web and the selected embedding method.
For common keywords, everything depends on the size of the tool’s database, and the update frequency.
The competitors that emerge from this analysis remain entirely exploitable as competitors for a search engine competition analysis.
Would performing the work in the other direction (first the common keywords then the similar ones) yield better results?
It is possible, but at a much higher cost. It would still be necessary to sort out the “whales”: those sites that touch on a very large number of keywords, some of which might be unknown to the SEO expert, and then launch the similarity calculation afterwards, which implies crawling the sites concerned to calculate their average embedding. It is another approach that requires very different on-demand computing and processing power, without really guaranteeing better results.
What are the drawbacks of this approach?
The major drawback concerns sites that are at the extremes: the whales and other generalist sites will have a general embedding that is not very precise and will be close to sites that have little to do with the themes. Small sites that have few positions will struggle to provide common keywords. For the vast majority of web players who are neither whales nor unknown sites, however, this approach is entirely usable.
Understanding competition, through keyword analysis tools and site analysis tools, as well as an in-depth study of advertisers and competing sites, is essential to develop a high-performing SEO strategy. The goal is to position the client’s pages at the top of Google’s search results.
In summary, the proposed hybrid approach – combining embeddings and keyword analysis – offers a mathematical definition of online competition. The example of Planet Fitness shows that, despite the variations between the top 20 and the top 100, main competitors can be found.
Without giving you the exhaustive list of competitors that your prospect or client has in mind, this approach gives you a list of competitors that they know, or partners that they may have. It is a good way to save time and obtain results that will speak to your interlocutor, demonstrating an SEO’s ability to adapt to the client’s theme and quickly identify at least a few serious competitors.