When dealing with web data it is very important not to get overwhelmed. When you have a ton of long texts to process, it is often better to process a ton of shorter texts instead. Depending on what you intend to do, it may be sufficient to work only with summaries of those texts. When presenting search results to a user, it is much better to show excerpts of documents. If you don’t have relevant summaries of your documents, you are bound to show the first few sentences of each document as a makeshift summary, which may not be representative enough for longer documents.
Obviously, summarizing texts by hand, however good the result may be, lacks the efficiency required to process millions of texts per day, so one should look into automatic summarization techniques. There are several ways to obtain an automatic summary of a source document. If you are looking to summarize millions of documents per day, you want fast and reliable techniques. You can of course take X sentences at random from the source document, or just its first Y sentences. While very fast, these methods cannot guarantee that you will end up with a good summary of the original content.
Since we are talking about the summarization of web documents, we will make the assumption that the important text of the web page has already been extracted with a tool like trafilatura, readability, boilerpipe, etc. Before diving further into extractive summaries, let’s give a definition and look at the differences with abstractive summaries.
Extractive vs Abstractive summaries
The objective of both techniques is to produce a shorter text that is as faithful as possible to the original. We will talk later about how to evaluate the quality of a summary, but for now we will focus on how to obtain summaries. As is often the case, neither technique is strictly better than the other; it depends on what you want to do with the automatically computed summary.
Extractive summaries are the easiest form of machine-made summary, since they consist in extracting some parts of the original text to build the summary. Typically, the extracted parts are sentences, but one can imagine an automatic extractive summarization method working at the word or n-gram level. In order to produce easy-to-read summaries, rather than just perform keyword extraction from a source document, we will focus on methods that extract whole sentences. Usually, the extracted sentences are chosen according to their content and their similarity with the whole text, as well as their novelty with respect to the sentences already extracted for the summary. We want a shorter version of the document but would prefer to avoid blatant repetition. The task of computing an extractive summary for a document can be seen as a binary classification task on its sentences. Algorithms can either decide directly whether each sentence belongs to the summary or compute, for each sentence, a probability of being part of the summary and pick sentences according to these probabilities.
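To make this classification view concrete, here is a minimal Python sketch: any model that assigns each sentence a probability of belonging to the summary can feed a selection routine like the hypothetical `select_sentences` below. The probabilities are made up for the example.

```python
# Minimal sketch of the classification view of extractive summarization:
# given a probability per sentence (from any model), keep the most likely
# sentences and return them in document order. The probabilities below are
# made up; `select_sentences` is a hypothetical helper, not a library call.
def select_sentences(sentences, probabilities, max_sentences=3, threshold=0.5):
    ranked = sorted(range(len(sentences)),
                    key=lambda i: probabilities[i], reverse=True)
    kept = sorted(i for i in ranked[:max_sentences]
                  if probabilities[i] >= threshold)
    return [sentences[i] for i in kept]


sentences = ["First sentence.", "Second sentence.", "Third sentence."]
probabilities = [0.9, 0.2, 0.7]  # hypothetical output of a sentence classifier
print(select_sentences(sentences, probabilities, max_sentences=2))
# ['First sentence.', 'Third sentence.']
```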
Abstractive summarization is a much easier task for a human than for a machine: to write your summary you only need to keep the meaning of the original text, without being limited by its vocabulary and sentences. You can just write a short text that describes the content of the original. With recent advances in machine learning, machines have become quite capable of writing abstractive summaries, and with the progress made by large language models it is now a well-known task at which these models have become quite good.
There is no single best way to get a summary. Extractive summarization, using only sentences from the original content, is guaranteed not to introduce false information, but depending on its length the resulting summary might not be exhaustive and may only cover certain aspects of the original document. Abstractive summaries, on the other hand, may not capture the appropriate semantics and can deviate from the meaning of the source document.
Deciding between extractive and abstractive summarization depends on the context and what we wish to do with the summary. Basically, if the summary is to be consumed by a machine, it might be better to use an extractive approach, since you are then sure that the computer will work on parts of the original text. If the summary is to be shown to a human, it might be better, for readability’s sake, to use an abstractive approach, since stitching together sentences that were not necessarily next to each other in the source document can make the summary unpleasant to read.
Right below are examples of extractive and abstractive summaries of the same source document.

On the left is the original content, on the top right with a red border is an abstractive summary of the text, and on the bottom right with a green border is an extractive summary. The extracted sentences are highlighted in green in the original text.
In this article we will focus on extractive summaries and present several methods to obtain them from a collection of documents. Obviously, you can extract summaries yourself, but manual summarization won’t scale up, so we will focus on the automatic extraction of summaries from source content.
We will distinguish two kinds of methods: those that use deep learning algorithms and those that don’t. We will first briefly introduce the latter, as they were the first to emerge and can be explained easily, before talking a bit more about more recent techniques based on deep learning.
Standard methods
When people first tried to automatically summarize content, they started by assigning a score to each sentence of the source document and extracting the sentences with the highest scores. Many methods were invented to compute these scores. At first, scores were based on the relevance of the sentences with respect to the whole document, or to a specific query, using some variant of TF-IDF. To avoid redundancy in the produced summaries, researchers then added a notion of novelty for a sentence before adding it to the summary under construction. This way it is possible to ensure that the summary covers more aspects of the source material.
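As a rough illustration of this family of methods (not any specific published algorithm), the sketch below scores sentences by their TF-IDF similarity to the whole document and greedily penalizes redundancy with the already selected sentences. It assumes scikit-learn is available and that sentence splitting has already been done; the 0.7/0.3 weighting is arbitrary.

```python
# Hedged sketch: TF-IDF relevance to the whole document plus a greedy
# novelty penalty against already selected sentences (an MMR-like idea).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_extractive_summary(sentences, num_sentences=3):
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)
    document_vector = np.asarray(sentence_vectors.mean(axis=0))

    # Relevance of each sentence to the document as a whole.
    relevance = cosine_similarity(sentence_vectors, document_vector).ravel()

    selected = []
    while len(selected) < min(num_sentences, len(sentences)):
        best, best_score = None, -np.inf
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Novelty: penalize similarity to sentences already picked.
            if selected:
                redundancy = cosine_similarity(
                    sentence_vectors[i], sentence_vectors[selected]
                ).max()
            else:
                redundancy = 0.0
            score = 0.7 * relevance[i] - 0.3 * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]  # keep document order
```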
Researchers also built many features, like the number of named entities or the number of pronouns in a sentence. These features are then used to build classifiers that extract summaries.
Other researchers tried to use graph ranking algorithms to extract sentences from a document. Algorithms like HITS or PageRank can be used to score the sentences of a text. You first have to transform your text into a graph: sentences become nodes, and an edge is added between two nodes, for example, if the sentences are judged similar, with the similarity score as the edge weight. An advantage of graph techniques is that they do not require human labelling of the sentences, which can be a strong requirement in domains where labels cannot be obtained.
TextRank defines sentence similarity as a function of their content overlap. This produces highly connected graphs on which it is then possible to run ranking algorithms. TextRank then selects the sentences with the highest scores as the summary.
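Here is a hedged sketch in the spirit of TextRank, not the original implementation: build a graph whose nodes are sentences, weight edges by word overlap, then rank the nodes with PageRank. It assumes networkx is installed, and the overlap measure is a simplified version of the one in the paper.

```python
# TextRank-style sentence ranking: sentence graph + PageRank.
import math

import networkx as nx


def overlap_similarity(s1, s2):
    """Normalized word overlap, close in spirit to the TextRank similarity."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))


def textrank_summary(sentences, num_sentences=3):
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            weight = overlap_similarity(sentences[i], sentences[j])
            if weight > 0:
                graph.add_edge(i, j, weight=weight)
    scores = nx.pagerank(graph, weight="weight")
    best = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(best)]  # keep document order
```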
Deep learning comes to the rescue
Deep learning has proven very useful in many NLP tasks, and automatic summarization is one of them. One does not necessarily need a very deep model to obtain very good results. Many architectures can be used to obtain extractive summaries from a source document. We are not going to present all the deep-learning-based methods that tackle automatic summarization; this article will focus on a few methods that represent milestones in extracting summaries with deep learning.
The first architecture that was used efficiently is the Recurrent Neural Network. Recurrent Neural Networks, or RNNs for short, are well suited to NLP tasks since they carry some memory of previous states and can make decisions using this context. SummaRuNNer is a method based on two GRU (Gated Recurrent Unit) layers, one at the word level and one at the sentence level, that extracts summaries automatically. SummaRuNNer optimizes its summaries along several axes, such as the information contained in each sentence, its salience, and its novelty with respect to the other sentences in the summary. The interest of this method is that SummaRuNNer can be trained on abstractive summaries if one does not have extractive labels at hand, and it relies on a very small model.
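Below is a simplified, hypothetical PyTorch sketch of this hierarchical idea: a word-level GRU, average pooling into sentence vectors, a sentence-level GRU, and a per-sentence probability. It illustrates the architecture only; it does not reproduce SummaRuNNer, which also models salience, novelty and position explicitly.

```python
# Hedged sketch of a hierarchical GRU sentence classifier in the spirit of
# SummaRuNNer. Untrained toy model, for illustration only.
import torch
import torch.nn as nn


class TinyExtractiveRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.word_gru = nn.GRU(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.sent_gru = nn.GRU(2 * hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, documents):
        # documents: (batch, num_sentences, num_words) of token ids
        batch, num_sents, num_words = documents.shape
        words = self.embedding(documents.view(batch * num_sents, num_words))
        word_states, _ = self.word_gru(words)
        sentence_vectors = word_states.mean(dim=1)       # average pooling
        sentence_vectors = sentence_vectors.view(batch, num_sents, -1)
        sentence_states, _ = self.sent_gru(sentence_vectors)
        logits = self.classifier(sentence_states).squeeze(-1)
        return torch.sigmoid(logits)  # probability of keeping each sentence


if __name__ == "__main__":
    model = TinyExtractiveRNN(vocab_size=5000)
    fake_doc = torch.randint(1, 5000, (1, 6, 20))  # 1 doc, 6 sentences, 20 tokens
    print(model(fake_doc).shape)  # torch.Size([1, 6])
```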
Transformer-based architectures stormed the field of NLP and greatly improved the results obtained. With the attention mechanism, it is possible to use a much wider context window to establish connections between tokens, sentences, etc.
Examples like BertSUM, which builds on already pretrained models for the automatic extractive summarization task, work pretty well. Using the BERT architecture, inserting a marker token at the start of each sentence instead of only at the start of the document, alternating segment embeddings between sentences to distinguish them, and fine-tuning the model for the extractive summarization task yields excellent results.
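The following sketch illustrates the core BertSUM idea, prepending a [CLS] token to every sentence and scoring each sentence from its [CLS] vector, using the HuggingFace transformers library. It is a simplified, assumption-laden illustration, not the authors' code: real BertSUM also alternates segment embeddings and fine-tunes the whole stack, whereas here the linear head is untrained.

```python
# Hedged sketch of BertSUM-style sentence scoring with a pretrained BERT.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
scorer = nn.Linear(encoder.config.hidden_size, 1)  # would need fine-tuning


def score_sentences(sentences):
    # Build one sequence: [CLS] sent1 [SEP] [CLS] sent2 [SEP] ...
    text = " ".join(f"{tokenizer.cls_token} {s} {tokenizer.sep_token}"
                    for s in sentences)
    encoded = tokenizer(text, add_special_tokens=False,
                        return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**encoded).last_hidden_state[0]    # (seq_len, hidden)
    cls_positions = encoded["input_ids"][0] == tokenizer.cls_token_id
    cls_vectors = hidden[cls_positions]                     # one vector per sentence
    return torch.sigmoid(scorer(cls_vectors)).squeeze(-1)   # probability per sentence
```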
Both SummaRuNNer and BertSUM need to be trained on some documents to give good results. It is still unknown whether a training dataset exists that can encompass the whole Web.
The Transformer architecture gave birth to Large Language Models (LLMs). LLMs are very popular at the moment because they obtain spectacular results on NLP tasks, and they are great at generating abstractive summaries. The appeal of using LLMs for an NLP task is that you don’t have to train your model from scratch, which is very costly and time consuming. Having LLMs perform extractive summarization is a bit trickier, but can be achieved by fine-tuning the models. LLMs are very good at computing abstractive summaries, but the problem here is the hallucinations LLMs suffer from: an abstractive summary could be false, and to be sure the LLM produces a correct summary it is better to make it produce extractive summaries. Still, with models getting bigger, fine-tuning a model for extractive summarization can become intractable if you must update all of its parameters. Using LoRA (Low-Rank Adaptation of LLMs) and Parameter-Efficient Fine-Tuning, researchers developed models based on Llama2 and ChatGLM2 to perform extractive summarization of documents.
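As an illustration, here is roughly what attaching LoRA adapters with the PEFT library looks like before fine-tuning on (document, extracted sentences) pairs. The checkpoint name and target modules are assumptions for the example, not the exact setup used in the papers mentioned above.

```python
# Hedged sketch: wrap a causal LLM with LoRA adapters so that only a small
# set of low-rank matrices is trained during extractive-summarization
# fine-tuning. Checkpoint and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint, gated on the Hub
model = AutoModelForCausalLM.from_pretrained(base_name)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trained
```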
New models will keep getting better at this task, but the problem with using huge models for extractive summarization is that it is neither cost- nor time-efficient. Using smaller models might prove better, since they can be easier to fine-tune and to specialize for the task of automatic extractive summarization.
Evaluation methods for automatic summaries
There are several ways to evaluate the quality of a summary. You can ask a human to read both the source content and the extracted summary and tell you what they think of the summary. Of course, this way of evaluating automatic summaries does not scale if you have thousands or millions of summaries to evaluate.
It is also possible to compare the automatically extracted summaries with a “gold standard” summary produced by a human. You can compare the summaries on a small number of documents, train a method on this data, and if everything goes well, the method should generalize well to other cases.
More specific ways to evaluate automatic summaries arose during the last decades, the best known being the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) family of metrics. The ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, …) can be used when you possess a gold-standard summary of your source text. The idea is to compare the n-grams of the automatic summary with the n-grams of the reference summary. Many other metrics are also used, like BLEU or METEOR; these were originally designed to evaluate the quality of machine translation but can be used to evaluate the quality of a summarization method.
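For instance, here is how ROUGE scores can be computed with the rouge_score package, one common implementation of the metrics (the original ROUGE is a Perl toolkit).

```python
# Example: ROUGE-1, ROUGE-2 and ROUGE-L between a reference summary and a
# candidate summary, using Google's rouge_score package.
from rouge_score import rouge_scorer

reference = "The cat sat on the mat and watched the birds outside."
candidate = "The cat sat on the mat."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```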
Obviously, depending on what you intend to do with the summaries, there are other ways to evaluate automatic summarization. It is also important to note that humans do not always agree on what a good summary is, and therefore having a more objective way of evaluating summaries can prove useful.
For example, if you want to query the summaries in order to store less content in your index, it is interesting to have a summarization method that generates extracted summaries that are as close as possible, semantically speaking, to the original content, to make sure they will be relevant to the same queries.
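One possible way to measure this, sketched below, is to embed the source document and its summary with a sentence-embedding model and compare them with cosine similarity. The model name is an assumption; any sentence-embedding model would do.

```python
# Hedged sketch: semantic closeness between a summary and its source,
# measured with sentence embeddings and cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

document = "Long original web page content goes here ..."
summary = "Extracted summary goes here ..."

doc_emb, sum_emb = model.encode([document, summary], convert_to_tensor=True)
similarity = util.cos_sim(doc_emb, sum_emb).item()
print(f"semantic similarity between source and summary: {similarity:.3f}")
```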
You should choose the metric that best fits your use case while remaining efficient in computation time and memory. There are many good answers.
This document focuses on single-document, single-language summarization using extractive methods. The field of automatic summarization is much bigger and includes many different problems, like summarizing several documents at once, producing summaries in a different language than the source document, producing query-dependent summaries, etc. I encourage readers who find these problems interesting to check the resources below, as they provide some starting points on these problems as well as detailing the methods introduced in this article.
Resources
TextRank – https://aclanthology.org/W04-3252.pdf
SummaRuNNer – https://ojs.aaai.org/index.php/AAAI/article/view/10958
BertSUM – https://arxiv.org/abs/1908.08345
EYEGLAXS – https://arxiv.org/pdf/2408.15801