Non-regression testing in SEO practice

SEOs always have to deal with the same problems: among them, untimely
modifications to elements that count for ranking can be stressful.
Especially when the person making the change gives no warning (yes, I
know, it’s in the definition of untimely), messes everything up, has
little or no SEO training, or, above all, is unaware of the impact of
their actions.

It’s a problem frequently encountered by in-house SEOs, who have to
fight a constant battle because many users have editing rights on a
website without knowing the implications of their modifications. It’s a
problem that can also be found in consulting, where the client, however
big or small, doesn’t take the time to tell the consultant about the
modifications they’re going to make.
In SEO, the stability of the elements that matter is a real issue. And to
prevent Google from discovering an unstable element, or to react as
quickly as possible and avoid a ranking drop, there’s a solution:
non-regression testing.


What is a non-regression test?

In the context of software development, a non-regression test (NRT) is a
methodology for verifying that an important element or behavior is still
in its expected state, in order to ensure the stability of that element
or behavior.


What is a non-regression test in SEO?


For SEO, there are plenty of elements to test. Since stability is a key
factor for a URL, a non-regression test helps you avoid errors, be
warned when problems occur, or at the very least identify the date on
which a problem appeared. In the context of ongoing SEO support, it’s
important not to forget this kind of asset, which helps prevent behavior
that looks unintentional from Google’s point of view. These tests can be
coupled with e-mail alerts to warn of any problems.
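
As a minimal sketch of such an alert (the SMTP host, credentials and addresses below are placeholders; the alerting channel is entirely up to you), using Python’s standard library:

import smtplib
from email.message import EmailMessage

def send_alert(subject, body):
    # Placeholder SMTP settings: adapt to your own mail provider
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@my-site.example"
    msg["To"] = "seo-team@my-site.example"
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("alerts@my-site.example", "app-password")  # placeholder credentials
        server.send_message(msg)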


What elements can be tested to check for non-regression of important SEO signals?


There are many elements that can be tested, but the main ones are the basics: the title tag, the content of the meta description tag, the H1 to H6 headings…* But also the links that come from outside, because their presence is important for the authority of your page, especially if they were obtained through an acquisition strategy or a partnership.
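
To make this concrete, here is a minimal sketch of how these on-site basics could be harvested (the choice of requests and BeautifulSoup is mine; the script shared below may do it differently):

import requests
from bs4 import BeautifulSoup

def harvest_onsite(url):
    # Fetch the page and parse its HTML
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.string if soup.title else None,
        "meta_description": meta.get("content") if meta else None,
        # h1 to h6: the "htags"
        "htags": {f"h{i}": [h.get_text(strip=True) for h in soup.find_all(f"h{i}")]
                  for i in range(1, 7)},
    }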


So how do you use a non-regression script?


On the on-site side, the frequency of the non-regression script should
depend on how often the site is generally updated. If the site has many regular contributors who publish, modify or delete pages several times a week, a high update frequency implies a high frequency of verification tests. You can therefore check for non-regression several times a day on the strategic elements of one URL (or several) to know roughly when a problem arises.
As part of a migration, non-regression tests are particularly useful for monitoring the impact of a website version change, even if only the header* or footer* has changed.
On the off-site side, specifically to check the presence of a link, the script can be run several times a month, once a week, or even once a day if you wish. It all depends on your capacity to store this information.
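
At its simplest, this scheduling can be a plain loop (a sketch; run_checks() is a hypothetical stand-in for whatever harvesting and comparison functions you use):

import time

FREQUENCY_MINUTES = 180  # e.g. every 3 hours for a frequently updated site

def run_checks():
    # Hypothetical placeholder: harvest the elements, compare, alert on differences
    pass

while True:
    run_checks()
    time.sleep(FREQUENCY_MINUTES * 60)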


What does a non-regression script look like?


Overall, non-regression has two parts, and you have to think like a web
crawler, or like an application that simulates tests the way a search
engine bot would: the data harvesting part, and the interpretation part.
Harvesting data is pretty straightforward. All you need to do is retrieve
the information at regular intervals and store it in a database that you
can query alongside.
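
As an illustration of the storage side, here is a minimal sketch using Python’s built-in sqlite3 (the table and column names follow the export format described further down, but the exact schema of the shared script is an assumption on my part):

import sqlite3, json, datetime

conn = sqlite3.connect("seo_data.db")
conn.execute("""CREATE TABLE IF NOT EXISTS seo_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT, type TEXT, element TEXT, data TEXT)""")

def store_snapshot(kind, element, data):
    # Serialize the harvested state and keep it with a UTC timestamp
    conn.execute("INSERT INTO seo_data (timestamp, type, element, data) VALUES (?, ?, ?, ?)",
                 (datetime.datetime.utcnow().isoformat(), kind, element, json.dumps(data)))
    conn.commit()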
Using and interpreting this data is a little more complex, because
depending on the elements you’re checking or monitoring, a simple
update may involve corrections and modifications that will make the
system flag a change. For example, if you want to check the location of
an incoming link and the source site makes a graphic modification, the
XPath* may change, and you end up with a potential false positive when
exploiting the data. On the off-site side, this is particularly
impactful, as you have no control and are not warned of any
modifications made by your partners (hence the importance of a
non-regression test).
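
Here is a minimal sketch of that off-site check, recording the XPath, rel attribute and occurrence count of a link (lxml is my choice for computing the XPath; the shared script may proceed differently):

import requests
from lxml import html

def check_link(source_url, target_url):
    # Parse the source page and look for anchors pointing at the target
    tree = html.fromstring(requests.get(source_url, timeout=10).content)
    anchors = [a for a in tree.xpath("//a[@href]") if a.get("href") == target_url]
    if not anchors:
        # All three fields at None: the link has been deleted
        return {"xpath": None, "rel_attribute": None, "links_list": None}
    first = anchors[0]
    return {
        "xpath": tree.getroottree().getpath(first),  # position of the first occurrence
        "rel_attribute": first.get("rel"),
        "links_list": len(anchors),  # how many times the link appears on the page
    }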


What to test as part of an SEO non-regression test?


Several elements can be tested, but in this example I’ve chosen to run
rather basic tests: for the on-site part, instead of checking the text
content (p tags and others)*, I’ve chosen to stick to the title, meta description and h1, h2, etc.*

With the script, I make sure to test the presence and stability of these elements, and I record any changes that may occur in a separate table. This ensures that the stability signal is respected.
For the off-site part, I’ve chosen to keep things fairly simple: we’re
looking to retrieve information about the presence of a link and its
location on the page. You could add the rel* information of the link tag,
the meta robots of the source page*, etc. Please note: the idea here is
to keep things simple, so that you can test quickly and use the results
effectively.
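
To give an idea of the comparison step, here is a minimal sketch of how a new snapshot could be diffed against the previous one and recorded in a separate differences table (the schema mirrors the export columns described later, but remains an assumption):

import json, datetime

def diff_states(old, new):
    # Build a human-readable comment for every field that changed
    changes = [f"{key} changed from {old.get(key)} to {new.get(key)}"
               for key in new if new.get(key) != old.get(key)]
    return ", ".join(changes)

def record_difference(conn, kind, element, new_data, difference):
    conn.execute("""INSERT INTO differences (timestamp, type, element, data, difference)
                    VALUES (?, ?, ?, ?, ?)""",
                 (datetime.datetime.utcnow().isoformat(), kind, element,
                  json.dumps(new_data), difference))
    conn.commit()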

To find out more:

You can crawl the site at regular intervals in order to add the internal
PageRank distribution, or go a little further by testing semantic
optimization via the Yourtext Guru API (I think I’ll write some specific
code on this if enough people ask for it), which will also let you test
changes in the vocabulary expected by the Google results page.
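
As a hint of what the internal PageRank part could look like, here is a sketch assuming you already have the list of internal (source, target) link pairs from your crawl (networkx is my choice, not necessarily the author’s):

import networkx as nx

def internal_pagerank(edges):
    # edges: list of (source_url, target_url) pairs collected by your crawler
    graph = nx.DiGraph()
    graph.add_edges_from(edges)
    return nx.pagerank(graph)  # {url: score}, a proxy for the internal PageRank distribution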
Overall, a script or web application performing non-regression tests is
an essential tool of the SEO profession, as it allows you to assess the
stability of a website’s strategic pages.
I’m sharing a Python script with you. It’s pretty basic in terms of the
tests carried out, but once set up on a machine running continuously, you
can monitor the evolution of these basic elements for your strategic URL
(or your important links).

The code in question is available here.


How do I use this code?


Open your terminal and start by moving to the location of the downloaded
and unzipped folder. Example:

cd D:\user\folder\dezipped_folder\


Then install the dependencies (the libraries used) by typing:

pip install -r requirements.txt


Then make sure you’ve created a file with the existing links you want to
check (a CSV with a “source” column and a “target” column, without quotation marks), and/or a file with the pages whose title, meta description, h1, h2, h3, h4, h5, h6 elements you wish to test for non-regression.
For instance, I’ll call my links file “links.csv” and my pages file
“pages.txt”.
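
For example, the two files could look like this (the URLs are placeholders, and I’m assuming one URL per line for the pages file):

links.csv:
source,target
https://partner-site.example/article,https://my-site.example/landing-page

pages.txt:
https://my-site.example/landing-page
https://my-site.example/category/product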

All that’s left to do is run the script (which will run as long as the
computer is switched on and connected to the Internet, and until you
stop it: Ctrl+C interrupts the script).

To execute it, here’s the command (press Enter on the first line and
answer with y (yes) or n (no) to include your files in the verification
request). Here, I’ve programmed the script to run every 3 hours
(3 x 60 min = 180 min).

The seo_data.db file created is an SQLite database. Not everyone knows
how to use an SQLite database, so I’ve planned ahead:


python basics_seo_non_regression_tests.py
INFO:root:Initialized DB : [here the folder path]\seo_data.db
Do you want to provide a CSV file with source and target URLs? (y/n) : y
Enter the path to the CSV file with source and target URLs: links.csv
Do you want to provide a text file with URLs of pages to check? (y/n) : y
Enter the path to the text file with URLs of pages to check: pages.txt
Enter the frequency of the task in minutes (default is 1 minute):180

While your script is running, or after interrupting it, if you wish to
export the contents of the database in CSV format, you can do so using
the second Python script I’ve made available in the folder:
“non_regression_to_csv.py”.

This script allows you to export the contents of the two tables present
in the database: seo_data and differences. seo_data is the table that
lists all the passes and the state of the elements each time the script
goes over them. This is the essential element for all comparisons:
without data, there can be no comparison. differences is the table that
lists all the modifications identified between each script execution.

And to lighten the storage load on your computer, this CSV export script
also lets you delete the contents of both tables after export (for
safety, seo_data always keeps the last entry, to enable comparison on
the next run, which can happen quickly if your initial script is still
running). I wouldn’t necessarily recommend doing this, especially if
you’re worried about losing information, but sometimes the amount of
stored data is so great that you have no choice.
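
For the curious, the cleanup can be expressed as a single SQL statement that keeps only the most recent row per element (a sketch based on the schema assumed above):

import sqlite3

conn = sqlite3.connect("seo_data.db")
# Delete every row except the most recent one for each element
conn.execute("""DELETE FROM seo_data
                WHERE id NOT IN (SELECT MAX(id) FROM seo_data GROUP BY element)""")
conn.commit()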


How do I use it?


In the same way, you need to go to the folder: make sure it’s the folder
containing the Python file and the seo_data.db file, then run the
following:


cd D:\user\folder\dezipped_folder\
python non_regression_to_csv.py
Do you want to export the differences? (y/n) : y
Do you want to export the SEO data? (y/n) : y
Differences have been exported to differences.csv.
Do you want to clear the 'differences' table after export? (y/n) : y
The 'differences' table has been cleared.
SEO data has been exported to seo_data.csv.
Do you want to clean the 'seo_data' table after export (keep only the last entry for each element)? (y/n) : y
The 'seo_data' table has been cleaned to keep only the last entry for each element.

Here’s an example of data export under differences.csv and seo_data.csv,
with database information removed.
Please note: I have not handled the case where you export one file after
another. If you export twice without changing the names of the
differences.csv and seo_data.csv files, you risk, at worst, overwriting
the files and losing the historical information, or at best, not being
able to export the data because the file is open on your computer, which
would produce an error mentioning a lack of rights.


Output files:
differences.csv, with the following columns:

id: line number

timestamp: precise date and time (in UTC) of entry into the database
(and therefore of comparison: this is the date on which the modification
was observed)

type: link or page, depending on whether a link or a URL is being tested

element: the source and target of the link tested, or the URL tested

data: a “dictionary” storing all the information tested, in the last
state known to the script at the time of comparison

difference: an automatic comment on the observed change.


To explain what you are about to see:

For links: when you see the XPath change, it means that the link (or at
least the first occurrence of the link) has been moved within the HTML
structure of the page. This may simply be due to a graphic modification
of the site, but it may also be a modification of the content or of the
page itself.

When you see “xpath changed from […] to None, rel_attribute changed
from […] to None, links_list changed from […] to None”, the link has
been deleted. As the three elements are tested, it makes sense to see
the three changes indicated rather than just “link deleted”, which in my
opinion would be a little too simple: here we indicate where the link
was before it was deleted, which rel attribute it had, and how many
times it appeared on the page.

For pages: changes in the title, the description, the htags (my name for
h1 to h6) and the number of htags are all stored in the same way, to
give a precise idea of the change that has taken place on the page.