How to build a legal document summarizer?

Have you ever wondered how legal experts manage a series of court statements effectively?

Shifu Jain

Associate - DSG

Reading a document of roughly 500 pages and drawing out its general context along with the salient details involves understanding, interpreting, explaining and researching a wide variety of legal documents. Isn’t it tedious?

Automated text summarization was developed to solve this problem.

Text summarization

Text summarization is the process of creating short and meaningful summaries from a larger text. In this blog, we will briefly discuss some of the interesting work done towards building an automated text summarizer and then discuss an approach to building a legal document summarizer.

There are two main approaches to summarizing text documents:

1. Extractive Methods: These methods extract several parts of a text, such as phrases and sentences, and stack them together to create a summary. Identifying the right sentences is therefore of utmost importance in an extractive method. Most summarization approaches today are extractive in nature. Two well-known examples, both graph-based ranking algorithms inspired by the PageRank algorithm used in the Google search engine, are:

A. TextRank

B. LexRank


2. Abstractive Methods: These methods use advanced NLP techniques to generate an entirely new summary, made up of its own phrases and sentences, that reads more coherently, like one a human would write. This approach is more appealing, but much more difficult than the extractive one. Examples include:


A. Encoder-Decoder Model (TextSum)

B. Sequence-to-Sequence RNNs

Combination Approach:

This approach combines both Extractive and Abstractive Methods: the text is first shortened with an extractive method, and the result is then passed to an abstractive model.

1. Extract then Abstract model


Data Used for Training

Australian legal cases from the Federal Court of Australia (FCA) were used as the dataset for training this model. The original documents had an average length of 15,000 characters. Using the TextRank extractive approach, the text length was reduced to an average of 3,000 characters (composed of the top-ranking sentences in the text, based on the PageRank algorithm).

Text Preprocessing

It is always good practice to make textual data as noise-free as possible. So, let’s do some basic text cleaning.

  1. Remove punctuation and special characters from the text
  2. Convert all letters to lower case
  3. Remove stopwords
  4. Do not remove numbers and dates; they can be very important for a legal case
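
The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the exact pipeline used for the model: the function name `clean_text` and the tiny `STOPWORDS` set are assumptions made for the example, and a real pipeline would use a fuller stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would use
# a fuller list such as the ones shipped with NLTK or spaCy.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and",
             "is", "was", "by", "for", "with"}


def clean_text(text: str) -> str:
    """Lower-case, strip punctuation and special characters, drop stopwords."""
    text = text.lower()
    # Replace anything that is not a letter, digit or whitespace with a space,
    # so numbers and dates such as "12 March 2005" survive as tokens.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("The appeal was dismissed on 12 March 2005, with costs."))
# prints: appeal dismissed 12 march 2005 costs
```

Note that this simple rule splits formats such as "4,500" or "12/03/2005" into separate tokens, so dates and amounts written with punctuation may need a dedicated rule before the punctuation is stripped.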

Entity Recognition

Recognising entities in a legal document is very important. Knowing the names of the parties involved, the dates, the location of the incident and so on solves about 30% of your task. spaCy is one library that helps extract the underlying entities and their labels from the text.
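
As a rough sketch of how entity extraction with spaCy looks, the helper below returns the text and label of every entity a pipeline finds. The helper name `extract_entities` and the sample sentence are assumptions for illustration. So that the example runs without downloading a model, it uses a rule-based `entity_ruler` on a blank English pipeline; for real documents a pretrained pipeline such as `en_core_web_sm` would normally take its place.

```python
import spacy


def extract_entities(nlp, text):
    """Return (entity text, entity label) pairs found by a spaCy pipeline."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Rule-based stand-in so the sketch runs without a model download (assumption).
# For real documents, a pretrained pipeline would replace these lines:
#     nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Federal Court of Australia"},
    {"label": "GPE", "pattern": "Sydney"},
])

sample = "The Federal Court of Australia heard the matter in Sydney."
for ent_text, label in extract_entities(nlp, sample):
    print(f"{label:5} {ent_text}")
```

With a pretrained pipeline the same helper also returns labels such as PERSON and DATE, which correspond to the parties and dates mentioned above.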


Extractive Method

Gensim’s summarization module provides functions for summarizing texts. Summarization is based on ranking the sentences of the text with a variation of the TextRank algorithm. The module lets you choose either the word count or the ratio of the summary to the original text. Here, the algorithm is applied to extract a 100-word summary from the text.
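
To make the ranking idea concrete, here is a dependency-free sketch of a TextRank-style summarizer with a word budget. It is not Gensim's implementation: the function names, the naive sentence splitter and the parameter defaults are assumptions for illustration. The `gensim.summarization` module itself was removed in Gensim 4.0, so its `summarize(text, word_count=100)` call needs an older Gensim release.

```python
import math
import re


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (an assumption);
    # legal text with abbreviations such as "v." needs a proper tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity, normalised by the log of the sentence lengths."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(text, word_count=100, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    # Weighted sentence graph: edge weight = similarity between two sentences
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style power iteration over the sentence graph
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    # Take the best-ranked sentences until the word budget is used up
    chosen, used = [], 0
    for i in sorted(range(n), key=lambda k: scores[k], reverse=True):
        length = len(sentences[i].split())
        if chosen and used + length > word_count:
            break
        chosen.append(i)
        used += length
    # Present the selected sentences in their original order
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `textrank_summary(document_text, word_count=100)` returns the top-ranked sentences, in document order, within roughly 100 words; `document_text` here is a placeholder for the case text.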

Abstractive Method

A recursive RNN algorithm is then applied to generate a summary from the text. At each step, the algorithm takes the article content and the summary built up so far, and predicts the next character of the summary.
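
The character-by-character loop described above can be sketched as follows. The trained network is not reproduced here: `predict_next_char` is a hypothetical stand-in for the model, `END` is an assumed end-of-summary marker, and `toy_predictor` is a placeholder that simply copies the first sentence so the loop can be run.

```python
END = "\0"  # hypothetical end-of-summary marker


def generate_summary(article, predict_next_char, max_len=200):
    """Build a summary one character at a time.

    predict_next_char(article, summary_so_far) stands in for the trained RNN.
    """
    summary = ""
    while len(summary) < max_len:
        next_char = predict_next_char(article, summary)
        if next_char == END:
            break
        # The extended summary is fed back in on the next iteration
        summary += next_char
    return summary


def toy_predictor(article, summary_so_far):
    """Placeholder 'model': copies the article's first sentence, then stops."""
    first_sentence = article.split(".")[0] + "."
    if len(summary_so_far) < len(first_sentence):
        return first_sentence[len(summary_so_far)]
    return END


print(generate_summary("The appeal is dismissed. Costs follow the event.",
                       toy_predictor))
# prints: The appeal is dismissed.
```

In a real system the placeholder would be replaced by a call to the trained model, and `max_len` would cap the summary length if the model never emits the end marker.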

The details of the model implementation can be found on GitHub: GITHUB1, GITHUB2.



ROUGE-N is a word n-gram measure between the model summary and the gold summary. Specifically, it is the ratio of the number of n-grams that occur in both the model and gold summaries to the number of all n-grams in the gold summary. It is a form of “recall”, because it measures how much of the gold summary is covered and ignores n-grams in the model summary that do not appear in the gold summary. We use ROUGE-1, ROUGE-2 and ROUGE-L to evaluate and compare the quality of the summaries generated by our system. While ROUGE-N focuses on n-gram overlaps, ROUGE-L uses the longest common subsequence to measure the quality of the summary.
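
The definitions above translate into a short sketch. It assumes whitespace tokenisation and a single gold summary, and the function names are illustrative. Library implementations may tokenise differently and may combine precision and recall into the F-score in a different way, so their numbers need not match this sketch exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(model_summary, gold_summary, n=1):
    """Recall, precision and F1 of the n-gram overlap."""
    model = ngrams(model_summary.lower().split(), n)
    gold = ngrams(gold_summary.lower().split(), n)
    overlap = sum((model & gold).values())  # clipped co-occurrence count
    recall = overlap / max(sum(gold.values()), 1)
    precision = overlap / max(sum(model.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"r": recall, "p": precision, "f": f1}


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(model_summary, gold_summary):
    """Recall, precision and F1 based on the longest common subsequence."""
    model = model_summary.lower().split()
    gold = gold_summary.lower().split()
    lcs = lcs_length(model, gold)
    recall = lcs / max(len(gold), 1)
    precision = lcs / max(len(model), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"r": recall, "p": precision, "f": f1}


gold = "the court dismissed the appeal"
model = "the court dismissed appeal"
print(rouge_n(model, gold, n=1)["r"])  # 4 of 5 gold unigrams covered: 0.8
print(rouge_n(model, gold, n=2)["r"])  # 2 of 4 gold bigrams covered: 0.5
print(rouge_l(model, gold)["r"])       # LCS of length 4 over 5 gold words: 0.8
```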

Out[]: [{'rouge-1': {'f': 0.8297872291240379, 'p': 0.9512195121951219, 'r': 0.7358490566037735},
         'rouge-2': {'f': 0.6788990776634964, 'p': 0.7872340425531915, 'r': 0.5967741935483871},
         'rouge-l': {'f': 0.7833864406466188, 'p': 0.926829268292683, 'r': 0.7169811320754716}}]


BLEU score is a modified form of “precision”, extensively used in machine translation evaluation. In its simplest (unigram) form, it is the ratio of the number of words that co-occur in both the gold and model summaries to the number of words in the model summary. Unlike ROUGE-N, BLEU directly accounts for phrases of variable length (unigrams, bigrams, trigrams and so on) by taking a weighted geometric mean of their precisions, and it applies a brevity penalty to model summaries that are shorter than the gold summary.
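
A simplified BLEU can be sketched in the same way. The version below is an illustration under stated assumptions: a single gold summary, whitespace tokenisation, uniform weights over n-gram orders up to `max_n`, and no smoothing. The function names are illustrative, and library implementations offer more options.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(model_summary, gold_summary, max_n=2):
    """Simplified single-reference BLEU with uniform weights, no smoothing."""
    model = model_summary.lower().split()
    gold = gold_summary.lower().split()
    if not model:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        model_ngrams = ngram_counts(model, n)
        gold_ngrams = ngram_counts(gold, n)
        # Clipped count: an n-gram is credited at most as often as it
        # appears in the gold summary
        clipped = sum((model_ngrams & gold_ngrams).values())
        total = sum(model_ngrams.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise model summaries shorter than the gold summary
    if len(model) > len(gold):
        bp = 1.0
    else:
        bp = math.exp(1 - len(gold) / len(model))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)


gold = "the court dismissed the appeal"
print(bleu("the court dismissed the appeal", gold))  # identical text scores 1.0
print(bleu("the court dismissed appeal", gold))      # shorter summary scores lower
```

For the shorter summary, the unigram precision is 4/4 and the bigram precision is 2/3, so the score is the brevity penalty exp(1 - 5/4) multiplied by the square root of 2/3.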

End Notes

I hope this post helps you in some way to build your own custom document summariser. Please feel free to comment.
