Extractive Text Summarization Using spaCy in Python


Traditionally, TF-IDF (Term Frequency-Inverse Document Frequency) is often used in information retrieval and text mining to calculate the importance of a sentence for text summarization.

The TF-IDF weight is composed of two terms (a short worked example follows the definitions):

  • TF: Term Frequency — Measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length, such as the total number of terms in the document, as a way of normalization.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
  • IDF: Inverse Document Frequency — Measures how important a term is. While computing the term frequency, all terms are considered equally important. However, it is known that certain terms may appear a lot of times but have little importance in the document. We usually term these words stopwords. For example: is, are, they, and so on.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
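
To make the formulas concrete, here is a minimal sketch in plain Python. The toy documents and the term below are made up purely for illustration and are not part of the original tutorial.

import math

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make great pets",
]

term = "cat"
tokens = documents[0].split()

# TF: how often the term appears in the first document, normalized by document length
tf = tokens.count(term) / len(tokens)  # 1 / 6

# IDF: log of total documents over documents containing the term
docs_with_term = sum(term in d.split() for d in documents)
idf = math.log(len(documents) / docs_with_term)  # log(3 / 1)

print(tf * idf)  # TF-IDF weight of "cat" in the first document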

TF-IDF is not a good choice if you are dealing with multiple domains. An unbalanced dataset will bias the weights, and that bias greatly affects the result.

A term that is common in one domain might be an important term in another. As the saying goes: “One man’s meat is another man’s poison.”

Let’s try to do it differently by using just the important terms. This piece focuses on identifying the top sentences in an article as follows:

  1. Tokenize the article using spaCy’s language model.
  2. Extract important keywords and calculate normalized weight.
  3. Calculate the importance of each sentence in the article based on keyword appearance.
  4. Sort the sentences based on the calculated importance.

Let’s move on to the next section to start installing the necessary modules.

1. Setup

We will be using pip to install the spaCy module. It is highly recommended to create a virtual environment before you proceed.

On Windows, run the terminal in administrator mode, as admin privileges may be needed to create a symlink when we download the language model later on.

pip install -U spacy

Once you are done with the installation, the next step is to download the language model. I will be using the large English model for this tutorial. Feel free to check the official website for the complete list of available models.

python -m spacy download en_core_web_lg

It should take some time to download the model as it is about 800MB in size. If you experience issues while downloading the model, you can try one of the smaller language models instead.
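
For example, the small English model is a much lighter download (the exact size varies by spaCy version):

python -m spacy download en_core_web_sm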

When you’re done, run the following command to check whether spaCy is working properly. It also indicates the models that have been installed.

python -m spacy validate

Let’s move on to the next section. We will be writing some code in Python.

2. Implementation

Import

Add the following import declaration at the top of your Python file.

import spacy
from collections import Counter
from string import punctuation

Counter will be used to count keyword frequencies, while punctuation is a string containing the standard ASCII punctuation characters.
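
As a quick illustration, here is a tiny sketch; the sample token list is made up for this example only.

print(Counter(["spacy", "python", "spacy"]))  # Counter({'spacy': 2, 'python': 1})
print(punctuation)                            # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~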

Load spaCy Model

The spaCy model can be loaded in two ways. The first one is via the built-in load function.

nlp = spacy.load("en_core_web_lg")

If you cannot load the model even though it is installed, you can load it via the second method: import the model package directly and use its load function.

import en_core_web_lg
nlp = en_core_web_lg.load()
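
If you want the code to keep working on machines where the large model is not installed, one possible approach (an assumption on my part, not part of the original tutorial) is to fall back to the small model:

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    # Fall back to the small English model if the large one is missing.
    nlp = spacy.load("en_core_web_sm")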

Top Sentence Function

We will be writing our code inside a function. It is always a good idea to modularize your code whenever possible.

def top_sentence(text, limit):

The function accepts two input parameters:

  • text — The input text. Can be a short paragraph or a large chunk of text.
  • limit — The number of sentences to be returned.

The first part is to tokenize the input text and find out the important keywords in it.

keyword = []
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
doc = nlp(text.lower()) #1
for token in doc: #2
    if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
        continue #3
    if(token.pos_ in pos_tag):
        keyword.append(token.text) #4

keyword.py

  • #1 — Convert the input text to lower case and tokenize it with spaCy’s language model.
  • #2 — Loop over each of the tokens.
  • #3 — Ignore the token if it is a stopword or punctuation.
  • #4 — Append the token to a list if it is the part-of-speech tag that we have defined.

I have covered a tutorial on extracting keywords and hashtags from text previously. Feel free to check it out.

The next step is to normalize the weight of the keywords.

freq_word = Counter(keyword) #5
max_freq = Counter(keyword).most_common(1)[0][1] #6
for w in freq_word:
    freq_word[w] = (freq_word[w]/max_freq) #7
  • #5 — Counter converts the list into a dictionary mapping each keyword to its frequency.
  • #6 — Get the frequency of the most common keyword.
  • #7 — Loop over each item in the dictionary and normalize the frequency. The most common keyword will have a frequency value of 1.

If you were to print out the freq_word variable after this step, you should see a Counter object containing a dictionary with normalized values ranging from 0 to 1.

Counter({'ghosn': 1.0, 'people': 0.625, 'equipment': 0.625, 'japan': 0.625, 'yamaha': 0.5, 'musical': 0.5, 'escape': 0.5, 'cases': 0.375, 'japanese': 0.375})

Let’s proceed to calculate the importance of each sentence by identifying the occurrences of important keywords and summing up their values.

sent_strength={}
for sent in doc.sents: #8
    for word in sent: #9
        if word.text in freq_word.keys(): #10
            if sent in sent_strength.keys():
                sent_strength[sent]+=freq_word[word.text]#11
            else:
                sent_strength[sent]=freq_word[word.text]#12

sent_strength.py

  • #8 — Loop over each sentence in the text. The sentences are segmented by spaCy’s language model.
  • #9 — Loop over each word in a sentence based on spaCy’s tokenization.
  • #10 — Determine if the word is a keyword based on the keywords that we extracted earlier.
  • #11 — Add the normalized keyword value to the key-value pair of the sentence.
  • #12 — Create a new key-value in the sent_strength dictionary using the sentence as key and the normalized keyword value as value.

You should be able to get a cumulative normalized value for each sentence. We will use this value to determine the top sentences.

summary = []
sorted_x = sorted(sent_strength.items(), key=lambda kv: kv[1], reverse=True) #13
    
counter = 0
for i in range(len(sorted_x)): #14
    summary.append(str(sorted_x[i][0]).capitalize()) #15
    counter += 1
    if(counter >= limit):
        break #16
            
return ' '.join(summary) #17

summary.py

  • #13 — Sort the dictionary based on the normalized value. Set the reverse parameter to True for descending order.
  • #14 — Loop over each of the sorted items.
  • #15 — Append the results to a list. The first letter of the sentence is capitalized since we converted the text to lower case during tokenization. Note that all other words in the sentence remain lowercase; implement your own mapping back to the original text if you intend to keep the original casing.
  • #16 — Break out of the loop once the counter reaches the limit that we have set. This determines how many sentences are returned from the function.
  • #17 — Return the list as a string by joining each element with a space.

Your function should be as follows.

def top_sentence(text, limit):
    keyword = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    doc = nlp(text.lower())
    for token in doc:
        if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
            continue
        if(token.pos_ in pos_tag):
            keyword.append(token.text)
    
    freq_word = Counter(keyword)
    max_freq = Counter(keyword).most_common(1)[0][1]
    for w in freq_word:
        freq_word[w] = (freq_word[w]/max_freq)
        
    sent_strength={}
    for sent in doc.sents:
        for word in sent:
            if word.text in freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent]+=freq_word[word.text]
                else:
                    sent_strength[sent]=freq_word[word.text]
    
    summary = []
    
    sorted_x = sorted(sent_strength.items(), key=lambda kv: kv[1], reverse=True)
    
    counter = 0
    for i in range(len(sorted_x)):
        summary.append(str(sorted_x[i][0]).capitalize())

        counter += 1
        if(counter >= limit):
            break
            
    return ' '.join(summary)

text-summarization.py

Result

Let’s test the function that we have just created. Feel free to use any kind of text for your test. I will be using the following text as an example for this tutorial.

example_text = '''Yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan Carlos Ghosn reportedly was smuggled out of Japan in one. In a tweet over the weekend, the Japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases. Yamaha (YAMCY) warned people not to get into, or let others get into, its cases to avoid "unfortunate accidents." Multiple media outlets have reported that Ghosn managed to sneak through a Japanese airport to a private jet that whisked him out of the country by hiding in a large, black music equipment case with breathing holes drilled in the bottom. CNN Business has not independently confirmed those details of his escape. The former Nissan (NSANF) CEO had been out on bail awaiting trial in Japan on charges of financial wrongdoing before making his stunning escape to Lebanon at the end of December. Ghosn has referred to his departure as an effort to "escape injustice." In an interview with CNN\'s Richard Quest last week, Ghosn did not comment on the nature of his escape, saying he didn\'t want to endanger any of the people who aided in the operation. Ghosn did, however, respond to a question about what it felt like to ride through the airport in a packing case by first declining to comment but then adding: "Freedom, no matter the way it happens, is always sweet." In a press conference in Lebanon ahead of the CNN interview last Wednesday, Ghosn\'s first public appearance since fleeing Japan, Ghosn said he decided to leave the country because he believed he would not receive a fair trial, a claim Japanese authorities have disputed. Brands sometimes capitalize on their tangential relationship to big news in order to attract attention on social media. Yamaha is one of Japan\'s best known brands and Ghosn was one of Japan\'s top executives before being ousted from Nissan — a match made in social media heaven. Not surprisingly, Yamaha\'s post went viral on Twitter over the weekend.'''

Remember to pass in the number of sentences to be returned to the top_sentence function.

print(top_sentence(example_text, 3))

I obtained the following result:

"Yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan carlos ghosn reportedly was smuggled out of japan in one. In a press conference in lebanon ahead of the cnn interview last wednesday, ghosn's first public appearance since fleeing japan, ghosn said he decided to leave the country because he believed he would not receive a fair trial, a claim japanese authorities have disputed. In a tweet over the weekend, the japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases."

Future Improvement

Although we have implemented a simple extractive text summarization function, there are still a lot of improvements that can be made based on your use cases. Examples include:

  • Retaining the original casing of the result.
  • Preserving the connection between sentences: the output is ordered by importance rather than by position in the original text, so the result will read poorly for story-based text.
  • Normalizing scores between long and short sentences, since longer sentences naturally accumulate higher scores (see the sketch after this list).
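
As a small illustration of the last point, one possible tweak (my own sketch, not part of the original code) is to divide each sentence score by its token count just before the sorting step inside top_sentence, so long sentences do not automatically outrank short ones:

# Hypothetical length normalization: each sent is a spaCy Span,
# so len(sent) gives its token count.
for sent in sent_strength:
    sent_strength[sent] = sent_strength[sent] / len(sent)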

3. Conclusion

Let’s recap what we have learned today.

We started off with a simple explanation of TF-IDF and the difference in our approach. Then, we moved on to install the necessary modules and language model.

Next, we implemented a custom function to get the top sentences from a chunk of text.

During each function call, we extract the keywords, normalize the frequency values, calculate the importance of each sentence, sort the sentences based on their importance values, and return the result based on the limit that we passed to the function.

Finally, we explored how we can improve it further based on our own use cases.

Thanks for reading and hope to see you again in the next article!
