Semantic Literature Search Powered by Sentence-BERT

The most appropriate strategy would be flicking through the books to figure out the theme based on its words/sentences/paragraphs.

Dr. Monika Singh

Consultant - DSG

Suppose you were given a challenge to pick out a horror novel from a small collection of books, without any prior information regarding these books. How would you go about this task in the most efficient way possible? The most appropriate strategy would be flicking through the books to figure out the theme based on its words/sentences/paragraphs. Another approach could be reading the reviews on the cover page for context.

This article discusses how we can use a pre-trained BERT model to accomplish a similar task. Kaggle recently released an Open Research Dataset Challenge called CORD-19 , with over 52,000 research articles on COVID-19, SARS-CoV-2, and other related topics. The problem statement was to provide insights on the “tasks” using this vast research corpus. The required solution must fetch all the relevant research articles based on questions user submits.

Our approach to finding the most relevant research articles (related to the task), was to compare the question with abstracts of all the research articles present. By finding the similarity score, we could easily rank the top-N articles. Comparing the question with the entire corpus is as easy as flicking through the small collection of novels to pick out the horror genre. This automation takes place by using Sentence-Transformers library, pre-trained BERT model and Spacy library

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019 ACL paper by Nils Reimers and Iryna Gurevych mention the architecture, and advantages of Siamese and Triplet Network Structures, some of which are:

  • Finding similar pairs of sentences from a corpus of 10,000 sentences requires around 50 million inference computations and ~65 hours with BERT/RoBERTa. Sentence BERT can quite significantly reduce the embeddings construction time for the same 10,000 sentences to ~5 seconds!
  • Fine-tuning a pre-trained BERT network and using siamese/triplet network structures to derive semantically meaningful sentence embeddings, which can be compared using cosine similarity.

Methodology:

1. Install sentence-transformers and load a pre-trained BERT model. We have used ‘bert-base-nli-mean-tokens’ due to its high performance on STS benchmark dataset.

from sentence_transformers import SentenceTransformer model = SentenceTransformer(‘bert-base-nli-mean-tokens’)

2. Vectorize/encode the abstracts of each article using pre-trained Bert embedding. Suppose the abstracts are contained in a list named ‘abstract’.

abstract_embeddings = model.encode(abstract)

The abstracts are encoded using BERT pre-trained embeddings and the shape would be (size of abstract list) rows × 768 columns.

3. Dump the abstract_embeddings as pickle.

with open(“my.pkl”, “wb”) as f: pickle.dump(abstract_embeddings, f)

4. Inference from the trained embeddings in real time.

i) Load the embedding file “my.pkl”.

with open(“my.pkl”, “rb”) as f: df = pickle.load(f)

ii) Encode the question into a 768-D embedding using step 1-2. For e.g., the question “What do we know about COVID-19 risk factors?” would be represented by a 768 dimensional embedding.

iii) Compare the question with all the abstracts’ embeddings and find the cosine similarity to rank and give top-N results. As the question and the abstracts are in numerical form, this can be done easily using scipy function.

distances = scipy.spatial.distance.cdist([query_embedding], df, “cosine”)[0]

This article discusses how to use pre-trained Sentence-BERT library for a downstream NLP task, which is to find relevant research articles based on user’s questions. We can compare the question with the title or research body. Comparing the question with only the title could be less informative as the text-length of the title will be too small. Contrary to this, comparing the question with the research body requires heavy computations. Additionally, shrinking the huge research body to 768 dimensions will lead to a huge loss of information. However, feel free to experiment with embeddings of the title and research body. Your findings may be fascinating.

Reference(s):

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks
https://github.com/UKPLab/sentence-transformers
https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md
https://www.aclweb.org/anthology/D19-1410.pdf
https://www.aclweb.org/anthology/D19-1410/

About Author

Dr. Monika Singh

Recommended Blogs & Articles

Copyright © 2024 Affine All Rights Reserved

Manas Agrawal

CEO & Co-Founder

Add Your Heading Text Here

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.