Automating Internal Linking for SEO with the Babbar API

In SEO, one of the commonly suggested and relatively “easy” recommendations is to improve a website’s internal linking.
I recently wrote an article about internal linking, how search engines like Google use it, and how to do it effectively. Today, I’m sharing a script to make the process easy and fast.

To do this, we’ll use the Babbar API, which enables us to retrieve semantic proximity between internal pages quickly.

How does the script work?

Crawling and storing information

Automating internal linking starts with an obvious prerequisite: having a minimal site map, including internal pages and the links between them. The script begins with a recursive crawl from a root URL, respecting the robots.txt directives.

The crawl runs up to a defined depth (default: 4 levels), with a limit on the number of links explored per page. For each page encountered, the script collects:

  • its canonical URL, if defined,
  • the list of internal links present in the HTML (up to the imposed limit of 100 links per page),
  • for each link: its position in the page, its type (follow or nofollow), and its validity (same host, allowed by robots).
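To make this first step concrete, here is a minimal sketch of such a crawl loop, assuming requests, BeautifulSoup and the standard robotparser module (the constants and helper names are placeholders, not the script's actual ones):

    # Minimal crawl sketch: breadth-first, robots.txt-aware, depth- and link-limited.
    from urllib.parse import urljoin, urlparse
    from urllib.robotparser import RobotFileParser
    import requests
    from bs4 import BeautifulSoup

    MAX_DEPTH = 4             # default crawl depth
    MAX_LINKS_PER_PAGE = 100  # imposed limit on links explored per page

    def crawl(root_url):
        host = urlparse(root_url).netloc
        robots = RobotFileParser(urljoin(root_url, "/robots.txt"))
        robots.read()

        seen, edges = set(), []                 # edges = (source, target, position, total)
        queue = [(root_url, 0)]
        while queue:
            url, depth = queue.pop(0)
            if url in seen or depth > MAX_DEPTH or not robots.can_fetch("*", url):
                continue
            seen.add(url)

            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
            internal = [l for l in links if urlparse(l).netloc == host][:MAX_LINKS_PER_PAGE]

            for position, target in enumerate(internal):
                edges.append((url, target, position, len(internal)))
                queue.append((target, depth + 1))
        return edges

The real script also handles canonical URLs and follow/nofollow detection, which are omitted here for brevity.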

The data collected by the crawl is stored in a DataFrame, with the following fields for each link:

  • source, target (the URLs),
  • link_position (index in the page),
  • total_links (total number of links on the page).

These last two are used to compute a position weight: a link higher up on the page, among a reasonable number of links, carries more weight than one at the bottom among 200 others.
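The exact weighting scheme lives in the script, but one plausible way to derive such a position weight looks like this (the formula below is an illustration, not necessarily the script's own):

    # Illustrative position weight: earlier links on shorter pages get closer to 1.
    def position_weight(link_position: int, total_links: int) -> float:
        return 1.0 - link_position / max(total_links, 1)

    print(position_weight(3, 20))     # 0.85 -> near the top of a short list of links
    print(position_weight(150, 200))  # 0.25 -> buried at the bottom of a long page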

In parallel, the raw text content of each page is extracted (via BeautifulSoup or Trafilatura), then normalized and stored locally. This avoids re-fetching the same content if the URL is requested multiple times later.
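That step could look roughly like this, assuming Trafilatura with a BeautifulSoup fallback and a simple on-disk cache keyed by a hash of the URL (the cache location is an arbitrary placeholder):

    import hashlib, os
    import trafilatura
    from bs4 import BeautifulSoup

    CACHE_DIR = "page_cache"  # placeholder location for the local text store
    os.makedirs(CACHE_DIR, exist_ok=True)

    def extract_text(url: str, html: str) -> str:
        cache_file = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest() + ".txt")
        if os.path.exists(cache_file):            # already extracted: reuse it
            with open(cache_file, encoding="utf-8") as f:
                return f.read()

        # Trafilatura extracts the main content; BeautifulSoup is the fallback.
        text = trafilatura.extract(html) or BeautifulSoup(html, "html.parser").get_text(" ")
        text = " ".join(text.split())             # normalize whitespace

        with open(cache_file, "w", encoding="utf-8") as f:
            f.write(text)
        return text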

Finally, for each collected page, the script calls the Babbar API, sending the page text and the host. Babbar returns a list of semantically related internal URLs with a similarity score (the identical URL is removed from the list). These scores help build an internal semantic graph, which is the foundation for optimization.
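In code, that call could be sketched as follows. The endpoint path, payload fields and authentication details are placeholders here; check the Babbar API documentation for the real ones:

    import requests

    BABBAR_API_KEY = "YOUR_API_KEY"                      # assumption: token-based auth
    BABBAR_ENDPOINT = "https://www.babbar.tech/api/..."  # placeholder, not the actual route

    def semantic_neighbors(page_url: str, page_text: str, host: str) -> list:
        payload = {"url": page_url, "content": page_text, "host": host}  # hypothetical fields
        resp = requests.post(BABBAR_ENDPOINT,
                             params={"api_token": BABBAR_API_KEY},
                             json=payload,
                             timeout=30)
        resp.raise_for_status()
        # Assumed response shape: a list of {"url": ..., "score": ...} entries.
        return [n for n in resp.json() if n["url"] != page_url]  # drop the identical URL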

This first processing block forms the observable basis of the site: a real topology (links), available content, and a quantitative estimation of proximity between pages. The remaining processing relies on this structure to suggest improvements.

Retrieving priorities

The script doesn’t aim to “improve” the internal linking globally but to restructure attention around a selected set of high-priority pages. These are listed manually by the user in a file named prios.txt.

The approach is based on a simple principle:

On a website, not all pages deserve the same level of internal exposure.

For various reasons (business, policy, strategic…), some pages need to be made more accessible, better linked, or better integrated into the graph. These are defined as “priority” pages.

At this stage, the script:

  • reads the list of pages from prios.txt or prompts for it if the file is missing or outdated,
  • checks if they are already part of the collected graph,
  • and downloads their content if they were not included in the initial crawl.

This last step ensures their integration into the analysis, even if they aren’t naturally linked or reachable from the starting URL. It guarantees that the pages we want to boost are factored into the internal linking calculations, even if they are orphaned or deeply nested.

In short, this phase is a critical step for the upcoming optimization: the rest of the process will aim to improve these pages’ positions in the site graph by adding or modifying internal links pointing to them.
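A simplified version of this phase could look like this, assuming one URL per line in prios.txt and a hypothetical fetch_and_store() helper that reuses the crawl's download and extraction code:

    import os

    def load_priorities(graph_urls: set, path: str = "prios.txt") -> list:
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                prios = [line.strip() for line in f if line.strip()]
        else:
            raw = input("Priority URLs (comma-separated): ")
            prios = [u.strip() for u in raw.split(",") if u.strip()]

        # Pages absent from the crawl are fetched so they enter the analysis,
        # even if they are orphaned or unreachable from the root URL.
        for url in prios:
            if url not in graph_urls:
                fetch_and_store(url)  # hypothetical helper, not defined here
        return prios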

Adapted Internal PageRank Calculation

To measure each page’s structural visibility within the site, the script uses an adapted version of PageRank, incorporating two types of weighting:

  • the position of the link in the source page,
  • the semantic proximity between source and target.

Each link is thus weighted by a composite score, defined as:

total_weight = ((1 – α) + α × semantic_score) × ((1 – β) + β × position_weight)

The coefficients α (semantic) and β (position) are defined at the top of the script and can be adjusted. By default, position is slightly favored (β = 0.65) while still accounting for thematic proximity (α = 0.35), but these can be changed if semantics are more important in your context.
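Translated into code, with the default coefficients and both scores assumed to lie in [0, 1]:

    ALPHA = 0.35  # semantic coefficient
    BETA = 0.65   # position coefficient

    def total_weight(semantic_score: float, position_weight: float,
                     alpha: float = ALPHA, beta: float = BETA) -> float:
        return ((1 - alpha) + alpha * semantic_score) * ((1 - beta) + beta * position_weight)

    # A well-placed link between closely related pages keeps most of its weight:
    print(total_weight(0.9, 0.85))  # ~0.87
    # A bottom-of-page link between unrelated pages is strongly discounted:
    print(total_weight(0.1, 0.25))  # ~0.35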

This model helps rebalance the analysis in favor of:

  • visually emphasized links (higher up in the HTML, fewer in number),
  • and contextually justified links — i.e., between semantically close pages.

Once the weights are applied to all links, the script computes a standard PageRank (damping factor d = 0.85), iterating until convergence. The result is a weighted centrality score for each page, representative of its structural importance in the actual site graph.
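As a sketch, this can be done with networkx, assuming edges is a list of (source, target, total_weight) tuples built from the crawl:

    import networkx as nx

    def weighted_pagerank(edges, damping: float = 0.85) -> dict:
        graph = nx.DiGraph()
        graph.add_weighted_edges_from(edges)  # the composite weight becomes the edge weight
        return nx.pagerank(graph, alpha=damping, weight="weight")

    # scores = weighted_pagerank(edges)
    # baseline = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)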

This calculation serves two purposes:

  • it provides a baseline for before/after comparison,
  • and it identifies underexposed pages — priority pages poorly positioned in the current graph.

Based on this analysis, the script will begin suggesting modifications, additions, or removals of internal links.

Suggestion Phase

Once the graph is built and priority pages identified, the script enters an active phase: suggesting internal linking modifications to help these pages rise in the graph.

This phase is organized into three types of actions, applied iteratively:

1. Modifying existing links

The script starts by adjusting the position of existing links to priority pages.

The goal is simple: if a link to a priority page already exists in a source page, it is moved higher in the HTML — giving it a stronger position weight.

This increases its contribution to the PageRank calculation without adding new links.
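A simplified version of that adjustment, assuming the DataFrame described earlier and the illustrative position-weight formula used above:

    import pandas as pd

    def promote_links(links_df: pd.DataFrame, priority_urls: list) -> pd.DataFrame:
        df = links_df.copy()
        # Move every existing link pointing to a priority page to the top of its source page.
        df.loc[df["target"].isin(priority_urls), "link_position"] = 0
        # Recompute the position weight for every link (same illustrative formula as above).
        df["position_weight"] = 1.0 - df["link_position"] / df["total_links"].clip(lower=1)
        return df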

2. Creating new links

If priority pages are still insufficiently visible despite the above adjustments, the script suggests creating new internal links.

To do this, it queries the Babbar API to retrieve up to 50 semantically close pages for each priority page.

Among these suggestions, it identifies pages that don’t yet link to the priority page and adds a virtual link, at the top of the page, to test its impact.

The additions are iterative: one link is created at a time, PageRank is recalculated, and the process stops once the priority page reaches the target position (by default: ranking in the top N pages, with N equal to the number of defined priorities).
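A sketch of that loop, reusing the weighted_pagerank helper from earlier; candidate_sources would be the semantically close pages returned by Babbar that don't yet link to the priority page:

    def push_priority(edges, priority_url, candidate_sources, target_rank):
        for source in candidate_sources:
            edges.append((source, priority_url, 1.0))  # virtual top-of-page link, full weight
            scores = weighted_pagerank(edges)          # recompute PageRank after each addition
            ranking = sorted(scores, key=scores.get, reverse=True)
            if ranking.index(priority_url) < target_rank:
                break                                  # target position reached, stop adding
        return edges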

3. Removing links

Lastly, if some non-priority pages rank higher in PageRank than the lowest-ranking priority page — and if they receive internal links — the script suggests removing those links. Removal is never automatic and only applies when a link structurally competes with the priorities.

This helps reallocate internal visibility in favor of the designated targets.
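One way to identify those competing links, assuming the PageRank scores and link DataFrame from the previous steps:

    def removal_candidates(scores: dict, links_df, priority_urls):
        # Score of the weakest priority page: anything above it is a structural competitor.
        threshold = min(scores.get(u, 0.0) for u in priority_urls)
        competitors = {u for u, s in scores.items()
                       if s > threshold and u not in priority_urls}
        # Only links pointing to those competitors are flagged for removal or review.
        return links_df[links_df["target"].isin(competitors)]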

All suggested changes (links to create, modify, or remove) are exported in separate files for human review.

This phase is deterministic, controlled, and progressive. It aims to reallocate the site’s internal visibility capital toward strategically chosen pages without disrupting the architecture or cluttering content.

Finally, all results are exported into separate files for review: recommended linking, links to create or modify, and PageRank comparison. The final graph version can be compared to the initial state to evaluate the real impact of the suggested changes.
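The export itself is straightforward; a sketch using the file names shown in the use case below:

    import pandas as pd

    def export_results(current, recommended, to_create, to_edit, to_remove,
                       pr_before: dict, pr_after: dict) -> None:
        current.to_csv("current_linking.csv", index=False)
        recommended.to_csv("recommended_linking.csv", index=False)
        to_create.to_csv("links_to_create.csv", index=False)
        to_edit.to_csv("existing_links_to_edit.csv", index=False)
        to_remove.to_csv("existing_links_to_remove.csv", index=False)
        # Side-by-side PageRank comparison, before vs. after the suggested changes.
        pd.DataFrame({"before": pr_before, "after": pr_after}).to_csv("PR_comparisons.csv")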

What are the script’s limitations?

Limits from the indexability of pages

The Babbar API relies on prior indexing — only known pages can be used to perform semantic analysis and generate link recommendations. In other words, if a page hasn’t been indexed by Babbar or isn’t part of its semantic graph, it can’t be used to generate internal link suggestions.

This presents a real limitation when certain pages aren’t indexed.

If a page is unknown to the tool (e.g., non-existent, recently added, isolated, or too deep and not yet crawled by Babbar), it’s impossible to get semantic recommendations for it. This means the tool can’t suggest relevant links to or from it, and its role in internal linking optimization is limited.

This is especially problematic for new or low-visibility pages that haven’t yet received significant internal or external links.

That said, these limits do not compromise the overall effectiveness of the script, as the approach still focuses on analyzing the site as a whole. Missing recommendations for certain pages can be manually addressed if necessary, by adjusting priorities or reevaluating existing links.

To address this, you can contact support and ask Babbar to crawl your entire site, either by providing the domain or, better, a complete list of URLs. The tool will then crawl your pages, and you'll be able to use the script within about ten days (the time needed for the metrics to update).

It’s also possible that Babbar cannot crawl your site. This is a separate issue and may require action on your part to allow Barkrowler to access your URLs. Contact support if you’re unsure what’s blocking the crawl.

Limitations of internal link modifications

The script aims to enhance the structural visibility of a set of priority pages. It does so by using available levers: adjusting link positions, adding relevant links, and reducing dispersion to non-priority pages.

But it can only work within the boundaries of the existing graph.

Sometimes, even after multiple iterations, certain priority pages fail to reach top internal PageRank positions. This isn’t a script error but a structural limitation: the semantic distance between those pages and the rest of the site.

If a priority page is too far from the site’s overall content — in terms of topic — internal pages won’t have enough thematic proximity to justify links to it.

In such cases, the script won’t force links. It stops iterations once it determines that PageRank propagation is structurally unfavorable.

In other words, a page cannot be artificially brought closer to the site if no contextual link justifies it.

In these situations, editorial action may be needed:

It may be worthwhile to create intermediate pages, better linked to existing content, to act as thematic bridges toward the priority page.

These “pivot” pages will serve as semantic relays, allowing the graph to open up more fluidly.

Use Case: www.visitscotland.com

Internal PageRank before

Before running the script, the crawl helps identify how PageRank is distributed across the internal linking structure.
This allows you to quickly assess whether the existing linking meets your expectations or if certain pages need to be given more prominence.

For the purposes of this example, we’ll assume that some URLs (randomly chosen) represent our priority pages.
We’ll use 4 articles as examples (this approach works with more URLs, but it’s better to keep the list limited):

We’ll add them to the prios.txt file (or add them when the command line asks for it).

Internal PageRank afterwards

After letting the tool do its work, the following files are generated:
– PR_comparisons.csv
– current_linking.csv
– recommended_linking.csv
– links_to_create.csv
– existing_links_to_edit.csv
– existing_links_to_remove.csv

The new ranking is the following:

Luckily, there is enough content on this website to build the linking the way we want and push the prioritized pages to the top of the list.

Our priorities are here:

Our pages are now in the top 4 of the adapted internal PageRank ranking. But what does that mean in practice? How many links should I create, how many should I delete, and how many should I edit?

Here, the script gives us the list of links to create: up to 150 links pointing to the priority pages.

There are also many links to delete (or obfuscate): up to 32,765. The idea is to withdraw these links from a crawler's view of the graph.

The heaviest part concerns the links to edit: the script recommends up to 372,569 of them.

For the editing and deletion parts, the work may be manageable if the recommendations concern sitewide links, since those changes could then be applied automatically across the website.

Otherwise, it represents a considerable amount of work. There are ways to improve the result (mainly through position updates when modifying or creating links), but the core principles still hold.

We could also soften the constraints (on link deletion, for instance) to improve the internal linking of the prioritized pages overall without forcing them to the top of the list, thereby reducing the expected workload.

If you're interested in this script, you can find it here.