Copywriting methodology for search engines

In the world of search engines, understanding text works very differently from how a human understands it. In this article, I’d like to walk you through the evolution of how search engines perceive our pages and the best practices to adopt. We’ll see why it’s crucial to exploit the live SERP, how contextualization is essential for an engine to even consider your page for its primary index, and which reflexes to adopt to meet readers’ real expectations.

Fundamental understanding

TF-IDF

In the 1960s and early 1970s, Karen Spärck Jones worked on information retrieval in a library setting.

The beginnings of search engines stem from the need to search a corpus of texts, or a library, for a document that matches a query. This was the work of Karen Spärck Jones, and it led to the conception of TF-IDF: a word-level summary of a document’s content, in which each word is scored according to how often it is used in the document and how rare it is in the rest of the corpus.
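To make this concrete, here is a minimal sketch of TF-IDF in Python (the toy corpus and the exact smoothing of the IDF are illustrative assumptions; real engines later moved to refinements such as BM25):

```python
import math
from collections import Counter

# A toy corpus: each "document" is just a list of words (illustrative only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "muffins with strawberry and sugar".split(),
]

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    tf = Counter(doc)[term] / len(doc)
    # Inverse document frequency: terms that are rare across the corpus score higher.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df)) + 1  # one smoothed variant among many
    return tf * idf

print(tf_idf("cat", corpus[0], corpus))         # appears in 2 of 3 docs -> moderate weight
print(tf_idf("strawberry", corpus[2], corpus))  # appears in 1 doc -> higher weight
```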

This approach enabled mathematicians to conceptualize the notion of text “comprehension” by a machine. The weighting scheme was later improved by Stephen Robertson and integrated into search engines (at Google, Amit Singhal was responsible for this). It feeds a vector model in which the distance between two documents (a query and a text, for example) is measured via a cosine: this is the approach initiated by Gerard Salton.

As you’ll see in the rest of this article, the TF-IDF approach is no longer sufficient when trying to understand a search engine. A method based on TF-IDF alone is too outdated to produce truly conclusive results over the long term.

      Salton Cosine

Gerard Salton invented the vector space model to compare two documents and determine their distance. In this approach, the cosine of the angle between two vectors serves as the distance measure, yielding a value that can be compared to assess the relevance of a result to a query, but also to identify documents that discuss subjects similar to a given document (such as a query, for example).
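A minimal sketch of Salton’s cosine, assuming each document has already been reduced to a vector of term weights (the vectors below are hand-written for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-weight vectors for a query and two documents
# (same vocabulary order in each vector).
query = [0.8, 0.0, 0.3]
doc_same_topic = [0.7, 0.1, 0.4]
doc_other_topic = [0.0, 0.9, 0.1]

print(cosine_similarity(query, doc_same_topic))   # close to 1 -> relevant
print(cosine_similarity(query, doc_other_topic))  # close to 0 -> off-topic
```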

When the web appeared and search engines became widespread, this approach to semantics was essential for every engine, including Google, particularly in building the index of answers to web users’ queries.

      (What will make the difference for Google is its use of authority metrics like PageRank, but I talk more about that in my article on backlink acquisition methodology.)

      Today, text comprehension has come a long way from the search engine’s point of view. The notion of vector models has been significantly enhanced, thanks to the improved efficiency (cost-performance ratio) of embeddings.

      Vector model (embedding)

        More on this later, but embeddings are more complex than Salton’s vector model: they take contextualization into account and can predict (statistically) a word from a context (Continuous Bag of Words model) or a context from a word (skip-gram).

(A very good illustration of CBOW and skip-gram can be found in this article: https://spotintelligence.com/2023/12/05/fasttext/)

What are embedding models? We can’t talk about embeddings without mentioning Tomas Mikolov and his work on Word2Vec.

His work brings significant advantages over the vector model: it is more compact (a real saving in storage resources), it is more effective (context is genuinely taken into account; to take a frequently used example: Paris − France + Poland ≈ Warsaw), and the embeddings learned on a corpus can be reused across a variety of tasks.
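As an illustration, here is a hedged sketch using the gensim library and a set of pretrained vectors (the model name is just an example; the exact neighbours returned depend on the vectors you load):

```python
# pip install gensim
import gensim.downloader as api

# Load a small set of pretrained word vectors (downloaded on first use).
# Any pretrained embedding works; this model name is just an example.
vectors = api.load("glove-wiki-gigaword-100")

# The classic vector-arithmetic property: Paris - France + Poland ≈ Warsaw.
print(vectors.most_similar(positive=["paris", "poland"], negative=["france"], topn=3))

# To train your own embeddings instead, gensim's Word2Vec class exposes the two
# architectures mentioned above via its sg flag (sg=0 -> CBOW, sg=1 -> skip-gram).
```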

We can also mention FastText, which goes further by taking sub-words into account (sub-words are the decomposition of a word into several parts, some of which carry more weight: in English, “moreover” could be decomposed into sub-parts such as “more” and “over”).
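FastText’s sub-word idea can be illustrated with plain character n-grams (FastText’s defaults use n-grams of 3 to 6 characters, with “<” and “>” marking word boundaries; the function below is a simplified sketch, not FastText’s actual code):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # FastText wraps the word in boundary markers, then collects all its
    # character n-grams; the word vector is built from its n-gram vectors.
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

print(char_ngrams("moreover"))
# includes '<mo', 'more', 'over', 'ver>', ... so rare or unseen words still
# share pieces with words the model has already learned
```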

An aside on linking:

I discussed this in my article on backlink acquisition: embeddings aren’t just used to rank SERPs, they’re also used to measure the distance between the source and target of a link, which, as you’ll have gathered, is far more advanced than a human categorization of a site.

Embeddings therefore serve more than a semantically relevant classification function. They can also underpin a language model.

        So why is it important to understand this approach to web copywriting?

There’s a lot to be said about web copywriting, but it’s important to remember that your primary reader in SEO is Google. By understanding how it works, you’ll understand that the text you write matters, and not just from the point of view of the human who’s going to read it.

        So how has Google adapted the use of an embedding vector model?

        Language model (BERT)

On Google’s side, embeddings built from large corpora (the whole web provides a nicely sized corpus, even if there’s some cleaning up to do) make it possible to train contextualized models like BERT. This is an advance over Word2Vec, which is based solely on words, and FastText, which is based on word decomposition: a contextual approach refines understanding and enables more precise results, because the final “vector” for a word differs depending on its context and can therefore be better understood.
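To see what “contextualized” means in practice, here is a hedged sketch using the Hugging Face transformers library (the checkpoint name bert-base-uncased and the example sentences are assumptions; any BERT-style model shows the same behaviour):

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

# The model name is just an example of a small English BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    # Return the contextual vector BERT assigns to `word` inside `sentence`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = vector_for("i park my car near the office", "park")
v2 = vector_for("we had a picnic in the park", "park")

# Unlike Word2Vec, the two vectors differ: the representation of "park"
# depends on the surrounding context.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```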

Language models can also be used in other situations: how can we use them in our content creation process?

The use of language models and LLMs in writing

          Recognizing a language model

              The principle of a language model is essentially to use statistical formulas to predict the terms or context that will follow or precede a word or context.

We’re all familiar with one impressive example: the conversational tool ChatGPT has revolutionized the way the web is used. ChatGPT is based on an LLM (Large Language Model), and such a tool works as follows:

Given the context of the prompt, the language model predicts the sequence of words most likely to follow it.

The probability of one term following another, depending on the context, can be represented in the form of a table:

Prompt: “I park”

Continuation                     Probability (fictitious)
“my car”                         0.50
“my motorbike”                   0.20
“my bike”                        0.15
“then I’m off again”             0.07
“the truck”                      0.05
“and I’m going shopping”         0.03

These probabilities will vary according to the seed (a random or pseudo-random value that varies which elements are selected from the top of the probability distribution) and, above all, the prompt (the context).
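A minimal sketch of that sampling step, reusing the fictitious table above (real LLMs compute these probabilities with a neural network; here they are hard-coded, and the pick_next helper is purely illustrative):

```python
import random

# The fictitious continuation table for the prompt "I park".
continuations = {
    "my car": 0.50,
    "my motorbike": 0.20,
    "my bike": 0.15,
    "then I'm off again": 0.07,
    "the truck": 0.05,
    "and I'm going shopping": 0.03,
}

def pick_next(table, seed):
    # The seed makes the pseudo-random draw reproducible; changing it
    # changes which continuation is picked from the probability top.
    rng = random.Random(seed)
    words, probs = zip(*table.items())
    return rng.choices(words, weights=probs, k=1)[0]

print(pick_next(continuations, seed=42))
print(pick_next(continuations, seed=7))
```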

              It’s very important to understand that this is an approach based on the probability of one term following another. 

              The importance of contextualization and QBST

                I’ve mentioned this contextualization several times now. But what is it exactly?

Context (literally “around the text”) refers to the set of terms used around a given term. In BERT, this context is used to identify the context expected for a query. What is sometimes called “intent” can be covered by the context of the query or of the expected responses, but in a much more precise way than marketing intent.

                What is QBST and what does it have to do with it?

QBST stands for Query Based Salient Terms. Guillaume Peyronnet explains it better than I can in his article (referenced in the sources at the end of this post).

Thanks to QBST, we can identify the words in a document, or in a corpus of documents, that are deemed particularly salient for a query. This lets us identify which terms to use for the contextualized query we want to rank on: this is where we exploit semantics to become relevant in the eyes of a robot, and it’s precisely here that we need a tool to identify the terms to exploit, and specifically a tool that uses the SERP.
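QBST itself is internal to Google, so any implementation can only approximate the idea. Here is a rough sketch of the principle: score terms by how much more frequent they are in the documents ranking for the query than in a general reference corpus (the corpora and the scoring formula are illustrative assumptions):

```python
from collections import Counter

def salient_terms(serp_docs, reference_docs, top_k=10):
    # Rough approximation of the QBST idea: terms markedly more frequent in the
    # documents ranking for the query than in a generic reference corpus.
    serp_counts = Counter(w for doc in serp_docs for w in doc.lower().split())
    ref_counts = Counter(w for doc in reference_docs for w in doc.lower().split())
    serp_total = sum(serp_counts.values())
    ref_total = sum(ref_counts.values())
    scores = {
        w: (c / serp_total) / ((ref_counts[w] + 1) / ref_total)
        for w, c in serp_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# serp_docs would be the texts of the pages ranking for your target query,
# reference_docs a generic corpus; both are placeholders here.
serp = ["strawberry muffin recipe with fresh strawberries and flour",
        "easy strawberry muffins: flour, sugar, butter, strawberries"]
reference = ["the weather is nice today", "flour is sold in the supermarket"]
print(salient_terms(serp, reference, top_k=5))
```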

                SERP usage more important than ever

We can separate the tools that offer semantic optimization into several categories, but one category is essential: tools that fetch information live to give you an up-to-date context. Because Google’s SERP changes regularly, QBST is not fixed; it varies according to user interaction. And so, as the results page improves, the recommendations become more useful for content writing.

What we know from the Google Leaks of 2024 is that QBST pre-selects a list of pages from the index that match the query, before the ranking is reworked with Navboost information. This is a vital element when it comes to the selection of your page for the query: the more your page meets the need, the greater its chances of being indexed and moving up. But working on semantics alone is not enough.

Of course, this has one major drawback: you must fetch the results from the results page, live, to retrieve the texts that make up the corpus of documents. This method is vulnerable to the whims of Google, which sometimes pushes updates to prevent its SERPs from being harvested by automated extraction methods.

                  Another disadvantage is that you need to be able to extract the content of the web page in question, and there can be quite a few reasons why this extraction would be hampered by the site.

The alternative, a tool that doesn’t retrieve the SERP, amounts to using a copywriting tool that isn’t actually trying to position content on the search engine.

Using the SERP, a tool can suggest a list of contextualization words. This is quite useful for a human writer, who can then choose an angle of attack that best matches the expected context. Providing context is even more important when a machine is trying to imitate a human writer, which brings us back to the use of LLMs, this time on the copywriting side.

                  Uncanny Valley, the repulsor

Writing, or having someone write, with a generative tool is undoubtedly an excellent time-saver for the writer. However, even with the best context and the most reliable sources of information, an LLM can be off the mark in its writing style, in the information it provides, or in its grasp of what the main subject of your article actually is.

                    The impacts are diverse: 

                    • If the article doesn’t make sense, there may be legal implications. 
• If the writing style seems too smooth, too perfect, even after re-reading, a reader may not appreciate the content, despite the quality of the information delivered. This is known as the “Uncanny Valley” phenomenon.

The uncanny valley, and its impact on web readers, is a concept that appeals to the emotional affinity a human feels when confronted with something that seems human versus something that is human. It’s an instinctive mechanism that distinguishes the robot that looks very human from the real human. This phenomenon also applies to text production:

We feel an affinity with a text that was actually produced by a human, an affinity we might not feel with a text produced 100% automatically if it isn’t well framed.

For user signals (Navboost again), texts produced 100% by an LLM are a potential disaster: despite the quality of the content provided, the bounce rate of users repelled by their instinct has an impact on the article’s visibility.

                    EEAT: capturing and keeping attention (Google’s communication)

                      In view of the use of user signals in understanding the perceived quality of texts, the criteria mentioned in Google’s official communication make sense:

Experience, Expertise, Authoritativeness and Trustworthiness are the elements that make up the general recommendations for enhancing the value of writers and the content they produce.

The aim here is to capture and hold users’ attention, by demonstrating that the page is a source of quality, reliable information, citing sources to back up the points made, and benefiting from the reader’s perception of it as a quality source.

                      What it means exactly: 

                      On the web, there are as many levels of writing quality as there are pages, subjects and writers. Google wants to reward the best and, above all, encourage the best to speak out. A writer who is an expert on his or her subject, a recognized expert, a journalist who documents his or her sources, a researcher who produces a quality, well-sourced paper – this is what Google seeks to value, whatever the topic, but particularly those that have an impact on the lives of search engine users (Your Money Your Life).

In the field of web copywriting, a copywriter will use a text structure that matches this methodology: AIDA (Attention, Interest, Desire, Action), which should enable retention once attention has been captured. The aim is to turn a prospect into a customer by the end of the session, taking them through the stages of the hierarchy of the sequence of effects discussed below.

                      Official recommendations are incentives to speak out. To make an analogy, we could say that content will be judged: the jury will be the public who will react behaviorally to the content, and the executor of the sentence will be Google and its rankings.

                      Steps to optimized copywriting for the web

                      Find a subject

                          First and foremost, you need to know what you’re talking about. And for that, there are a multitude of tools to help you make a choice.

                          Various methods to help you:

                          • Exploratory methods (you enter a theme and retrieve keywords associated with this theme from a database): the principle is that an exploratory method should enable you to find a list of keywords that will make up topics associated with the main themes you want to cover. This method will be quite effective in terms of quantity of keywords, provided the subject:
• Is not news-related
                            • Is represented in the database you’re querying (the most important data provider in this field is the Google Ads keyword database). No keyword database can claim to contain every topic. 
• Competitive analysis (identifying competitors’ topics and using them as inspiration to write similar content): the principle is quite similar to the exploratory method, filtered by the subjects covered by competitors. This filter lets you discover broader themes than those initially considered, but has the following disadvantages:
                            • There will always be a need to filter competitors’ brand queries
• The limits of the source database still apply
                            • The competitors listed must be specific to your theme.
                            • You’re dependent on the content your competitors have sent out (and you’re a follower).
                          • Identifying specific needs (you know what your article needs to cover, and you have a general idea of the topic’s strategic relevance to your overall content). This method is the most promising in terms of producing original topics or independent content. It’s part of the logic of offering readers added value.

                          The first two methods are easy to activate:

                          A tool can help you search a list of keywords based on the exploratory method and the competitive analysis. You can then generally sort according to the metrics you wish to use.

                          A few remarks about the metrics that are generally provided for sorting purposes:

                          • Search volume (usually monthly): This metric has several drawbacks:
                            • It’s mainly a metric provided by Google, and many professionals have criticized its reliability.
                            • It gives the impression that working on a piece of content will bring in huge amounts of traffic, when the click-through rate is very low.
                          • CPC, min & max bid: once again, provided by Google, these metrics are designed to help you budget or identify the level of competition in the subject concerned.
                          • Position, url, date: When your aim is to optimize a piece of content, or to work on content that your competitors already have, seeing the positions (as up to date as possible) of these pieces of content can give you a clue as to how the content you want to overtake is performing.
• The category: these are generally human-defined categories, usually automated on the basis of matching a sequence of characters rather than a precise context. I don’t recommend using them, but if you accept that your final list will be less exhaustive as a result, feel free.
                          • Difficulty and competition: two metrics with a fairly similar objective: to help establish what is realistic to achieve when working to target a particular query. The highest levels of difficulty and competition are generally branded queries, and the lowest levels of difficulty are generally queries of limited interest.
                          • Trends: this information enables you to identify seasonal topics and news needs. In the first case, it’s essentially to identify how the publication calendar can best adapt to traffic. In the second case, it’s more about prioritizing hot topics, if the data is up to date.

                          Above all, these metrics should help you choose the keywords you feel most urgently need to be addressed and integrated into your content roadmap.

Finally, you’ve got your subject and your target query (and if you don’t have one, you need to put yourself in the shoes of the surfer searching for the subject in order to identify what’s expected). All you need to do now is find out what the user is expecting.

                          Analyze the expected context

                            To make sure you’re relevant, you need to put yourself in the surfer’s shoes to find out what they’re looking for, and this can be done by asking several questions:

                            • Does your targeted query match the content you were planning to produce?
                              • Make the query yourself: what do you find on the search results page? Is it texts that are highlighted? Do you need a simulator? A definition? A video? Infographics or PDFs?
                            • Who are you addressing?
                              • You may want to identify your target audience: in this case, you may have some typical personas, and an approach to adopt to reach them.
• You may want to know which level of the conversion funnel your target is at, and orient your content to match their information needs. This level corresponds to the hierarchy of the sequence of effects:
                                • The cognitive stage (Awareness, Knowledge)
                                • Affective stage (Attraction, Preference, Belief)
                                • The conative stage (Decision, Action)
                            • What is the semantic context of the query?
• This is a little more technical than the first question. Here, a tool won’t go amiss: you need to identify the semantic field that matters to Google around the query (this semantic field shifts with user interactions, so it may change frequently, but the main terms generally remain the most important). You need a tool that compares the corpus of texts appearing live on the SERP with an up-to-date language model. From this comparison, the list of important terms is generally displayed in order of importance, and you should concentrate on the first ones.

                            But when a tool gives you expected terms, do you have to use them all? 

                            No. Look at it from a statistical point of view: the most important terms are most likely to have an impact on the expected semantic signal. Using important terms that others don’t use can give you the upper hand over certain SERP competitors who won’t dare to use certain terms, or who have missed the opportunity to use them. However, using all the terms is not necessarily the solution: for an engine that knows this principle of statistically important terms, detecting artificial text is easy when you see that certain texts contain ALL the terms from QBST. 

From a statistical point of view, and to stay on the safe side, take care to use the most important terms, and below a minimum importance threshold, use only a few of them.

                            How many times should these terms be used?

The answer is relative to the typical usage of each context term across the documents in the query corpus. There are ranges of values that are statistically acceptable, compared with that typical usage and depending on the overall size of your content.
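A sketch of that reasoning, assuming you already have the texts of the documents in the query corpus (the documents and the mean-plus-or-minus-one-standard-deviation window are illustrative choices, not an official threshold):

```python
import statistics

def usage_range(term, corpus_docs):
    # Occurrences of the term per 1,000 words in each document of the
    # query corpus, then a rough "acceptable" window around the mean.
    rates = []
    for doc in corpus_docs:
        words = doc.lower().split()
        rates.append(1000 * words.count(term.lower()) / max(len(words), 1))
    mean = statistics.mean(rates)
    spread = statistics.pstdev(rates)
    return max(mean - spread, 0), mean + spread

docs = [
    "strawberry muffins need flour sugar butter and ripe strawberry pieces",
    "this muffin recipe uses strawberry jam flour and a pinch of salt",
]
low, high = usage_range("strawberry", docs)
print(f"aim for roughly {low:.1f} to {high:.1f} uses per 1,000 words")
```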

                            What size content should I use?

Here it’s trickier: content size doesn’t necessarily have much to do with the level of optimization. With greater content size also comes the risk of semantic dilution: your compatibility with the expected semantics may decrease as content grows. Another risk is that, with content much larger than your competitors’, you may trigger a filter linked to keyword stuffing. There’s a real balance to strike between the right level of optimization, content size and information relevance.

                            Yes, but if you’re careful to have well-themed and well-contextualized paragraphs, big content is better than small content, isn’t it?

Again, not necessarily. In a context where all the texts in the query corpus are long (as is often the case in Germany), it’s legitimate to have very large content. In a context where the texts in the corpus are of more moderate size, there’s another real risk in playing with the optimization level: Transition Rank (which applies to pages that are being re-optimized).

(Figure: an example of a curve where we focused on the most important terms but didn’t use all of them, especially toward the right of the list. Guide: “strawberry muffin recipe”, date: 8/01/25, expected SOSEO score 72-80, DSEO <19.)

                            Transition Rank in semantics

As with backlink acquisition, Transition Rank also monitors semantic optimizations. It relies on a definition of spam that Google engineers set out in their patent.

Overall, Ross Koningstein, the inventor of Transition Rank, does not wish to see positions modified by spamming practices. His proposal therefore helps Google limit the impact of practices aimed at manipulating a URL’s signals to obtain a better position.

Its algorithm consists of two parts:

• Detection
• The quarantine phase

Detection: when a URL is suspected of spam through actions such as semantic optimization or link acquisition, it enters a Transition Rank phase lasting from a few hours to 3 months, during which the page is assigned an arbitrary position lower than its initial one. If, during this period, the page is re-optimized, it enters a new Transition Rank phase and loses positions once again.

                              The only way to avoid Transition Rank is not to touch the content or linking of a page for 3 months.

With semantic work, we can still avoid the detection phase entirely if we stay within acceptable usage of the expected terms.

                              Doing your research

If you want to meet the needs of the content philosophy pushed by the EEAT recommendations, you need to do your research. Consider offering sources, because they explain and justify your points of view. These sources can be online or offline (don’t overlook the value of a trip to the library if your subject requires it). Always note the reference of your source and try to provide a link that mentions the source and where to find it.

                                Don’t be afraid to question what you were planning to say. You never stop learning, and even more so when you’re writing a paper.

Nowadays, with advanced search engines and generative tools, you can certainly try to find sources to support your content, but re-read them before citing them (you never know) in order to assess their relevance and quality. Don’t leave this proofreading to a generative tool: even with a “my life depends on it” or “you’ll get a paycheck” prompt, you’re not safe from a response that contradicts your content.

Finally, if you’ve already published and want to redirect your readers to other articles in your production, you can do so via an author page that brings together links to your previous content, for example.

SEO criteria to-do list

                                The basics

Historically, many criteria have counted a great deal, a little, or not at all. I’m going to try to list the basics, while recalling what each is used for:

                                    Content displayed in search results:

• The Title tag: the only real tag in this list that has a direct impact on ranking. There is a limited number of pixels (between 500 and 550 at the time of writing) for this tag, which must be respected if you don’t want your page title to be cut off in the results and risk being clicked less, which is never good for Navboost and click skip (see the sketch after this list for a rough way to check the width). Sometimes, for reassurance, the title includes the brand name if it’s well known.
• The meta description tag: the text field that must convince and reassure the user to click on the result and visit the page. One of the most important and yet least-used tags: in addition to a question of size in pixels (variable according to the device: one limit for mobile, and between 900 and 1,000 on widescreen at the time of writing), its tone and content must be adapted to the target of the query.
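As a rough aid (see the note in the Title bullet above), here is a back-of-the-envelope pixel estimate; the per-character widths are assumptions, since the real width depends on Google’s font rendering:

```python
# Very rough pixel-width estimate for a title at Google's ~20px SERP font.
# The per-character widths below are approximations, not official values.
NARROW = set("iljtf.,;:!'| ")
WIDE = set("mwMW@")

def estimated_title_width(title, narrow_px=6, default_px=10, wide_px=16):
    return sum(
        narrow_px if c in NARROW else wide_px if c in WIDE else default_px
        for c in title
    )

title = "Copywriting methodology for search engines | ExampleBrand"
width = estimated_title_width(title)
print(width, "px (aim for < 500-550 px to avoid truncation)")
```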

                                    Contents on the page:

• The H1: supposed to be unique, this is the tag that represents the title of the content, somewhat like a book title. It also has a limited size, this time in number of characters (around 70).
• Subheadings (H2, H3, H4, H5, H6): these tags are supposed to structure content. In the real world of the web, they are often misused for cosmetic purposes (how many times have we seen a single styled line marked up as an H3?), which renders them useless as semantic signals. If you can, avoid that kind of usage: it’s one of the extra steps that will help you overtake your competitors in the SERPs, and every element that goes in the right direction can only be positive.

Avoid using the same text for the title and the H1: the two have different objectives. The title should entice by promising an answer to the desired content, and the H1 should confirm that the page corresponds to the expected content by announcing what the page covers in full.

Another element to exploit, straddling semantics and linking, is to interlink content within your site, ensuring that the links you make to and from this new URL connect pages that are not too distant from the point of view of semantic embeddings.
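A hedged sketch of that idea using the sentence-transformers library (the model name, URLs and page summaries are illustrative assumptions):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# The model name is just an example of a small general-purpose embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

new_page = "Strawberry muffin recipe: ingredients, baking time and tips"
candidate_pages = {
    "/blog/chocolate-muffins": "Chocolate muffin recipe for beginners",
    "/blog/garden-furniture": "How to choose garden furniture for a small terrace",
    "/blog/red-fruit-desserts": "Our favourite red fruit desserts for summer",
}

new_vec = model.encode(new_page, convert_to_tensor=True)
for url, text in candidate_pages.items():
    score = util.cos_sim(new_vec, model.encode(text, convert_to_tensor=True)).item()
    print(f"{url}: {score:.2f}")
# Link from/to the pages with the highest similarity; skip the distant ones.
```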

                                    Relay content on social networks

Thanks to the leaks of Google’s internal API documentation and the antitrust lawsuit, 2024 allowed us to confirm what some SEOs already suspected: traffic is an important element of a URL’s authority. What this means is that, to give your content the best chance of performing, sharing it on social networks is essential.

                                      If your objective is to ensure that your content has every chance of performing, don’t hesitate to share it on your personal and professional social networks.

                                      To drive traffic, from a logical point of view, you even need to publish several times a day so that your content is seen by most of your social network.

                                      In fact, not everyone logs on at the same time, and social algorithms will naturally only show a portion of the recent posts you may wish to see. For more coverage, you need to publish to the same page several times a day, several times a week.

                                      Track the performance of the content you produce

                                        Logic would dictate that you make sure you create content that serves and is read. That would be the whole point of spending time producing quality content. Engage with the community you’re sharing your content with, and make sure they read and appreciate it. Feedback can always provide a constructive touch to improve your style.

To get a broader picture of the content you’re producing, also rely on clicks, visits and bounce rates. Set up position tracking to find out whether your content at least appears for the targeted query (entry into the primary index rewards content that matches an expected context).

If you’ve read this article carefully, you’ll have understood that copywriting alone is not enough to achieve good positioning. However, it is an essential foundation for the robot to identify legitimate pages, and the writing style will help convert that visibility with readers. You’ll also need to complete the picture and go after better positions via additional practices (internal linking, link acquisition, clicks from social networks or geographic zones, etc.).

                                        But all these ancillary practices aren’t going to do much if your content doesn’t meet the user’s needs or preferences.

                                        SOURCES

TF-IDF:

                                        Title: A statistical interpretation of term specificity and its application in retrieval

                                        Author: Karen Sparck Jones

                                        https://www.emerald.com/insight/content/doi/10.1108/eb026526/full/html

                                        Salton Cosine:

                                        Title: Introduction to modern information retrieval

                                        Authors: Gerard Salton, M.J. McGill

                                        https://www.google.fr/books/edition/Introduction_to_Modern_Information_Retri/7f5TAAAAMAAJ

                                        Transition Rank:

                                        Title: Changing a rank of a document by applying a rank transition function 
                                        Author: Ross Koningstein
                                        https://patents.google.com/patent/US8924380B1/en

                                        Word2Vec:

                                        Title: Efficient Estimation of Word Representations in Vector Space.

                                        Authors: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

                                        https://arxiv.org/abs/1301.3781

                                        Title: Distributed Representations of Words and Phrases and their Compositionality

                                        Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean

                                        https://arxiv.org/abs/1310.4546

FastText:

                                        Title: Enriching Word Vectors with Subword Information

                                        Authors: Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

                                        https://aclanthology.org/Q17-1010

                                        BERT:

                                        Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

                                        Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

                                        https://arxiv.org/abs/1810.04805

                                        The Uncanny Valley:

Title: The Uncanny Valley [From the Field]

                                        Authors: Masahiro Mori, Karl F. MacDorman, Norri Kageki

                                        https://ieeexplore.ieee.org/document/6213238

                                        Other sources:

                                        Search Quality Rater Guidelines: EEAT

                                        Author: Google

                                        https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf

                                        QBST: 

                                        Title: QBST: In 2025, writing for Google is more than ever about using the right words

                                        Author: Guillaume Peyronnet

FastText (tutorial):

                                        Title: What Is FastText? Compared To Word2Vec & GloVe [How to Tutorial In Python]?

                                        Author: Neri Van Otten

                                        https://spotintelligence.com/2023/12/05/fasttext