Semantic Literature Search Powered by Sentence-BERT

Suppose you were given a challenge to pick out a horror novel from a small collection of books, without any prior information about them. How would you go about this task in the most efficient way possible? The most practical strategy would be to flick through the books and figure out each one's theme from its words, sentences, and paragraphs. Another approach could be reading the reviews on the cover for context.

This article discusses how we can use a pre-trained BERT model to accomplish a similar task. Kaggle recently released an Open Research Dataset Challenge called CORD-19, with over 52,000 research articles on COVID-19, SARS-CoV-2, and related topics. The problem statement was to provide insights on the “tasks” using this vast research corpus. The required solution must fetch all the relevant research articles based on the questions a user submits.

Our approach to finding the most relevant research articles (related to the task) was to compare the question with the abstracts of all the research articles present. By computing a similarity score, we could easily rank the top-N articles. Comparing the question with the entire corpus is as easy as flicking through the small collection of novels to pick out the horror genre. This automation is achieved using the Sentence-Transformers library, a pre-trained BERT model, and the spaCy library.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, a 2019 EMNLP paper by Nils Reimers and Iryna Gurevych, describes the architecture and advantages of Siamese and triplet network structures, some of which are:

  • Finding the most similar pair of sentences in a corpus of 10,000 sentences requires around 50 million inference computations (~65 hours) with BERT/RoBERTa. Sentence-BERT reduces the embedding-construction time for the same 10,000 sentences to about 5 seconds!
  • Fine-tuning a pre-trained BERT network and using siamese/triplet network structures derives semantically meaningful sentence embeddings, which can be compared using cosine similarity.

Methodology:

1. Install sentence-transformers and load a pre-trained BERT model. We have used ‘bert-base-nli-mean-tokens’ due to its high performance on the STS benchmark dataset.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

2. Vectorize/encode the abstracts of each article using pre-trained Bert embedding. Suppose the abstracts are contained in a list named ‘abstract’.

abstract_embeddings = model.encode(abstract)

The abstracts are encoded using BERT pre-trained embeddings and the shape would be (size of abstract list) rows × 768 columns.

3. Dump the abstract_embeddings to a pickle file.

import pickle

with open("my.pkl", "wb") as f:
    pickle.dump(abstract_embeddings, f)

4. Inference from the trained embeddings in real time.

i) Load the embedding file “my.pkl”.

with open("my.pkl", "rb") as f:
    df = pickle.load(f)

ii) Encode the question into a 768-D embedding using steps 1-2. For example, the question “What do we know about COVID-19 risk factors?” would be represented by a 768-dimensional embedding.

iii) Compare the question with all the abstracts’ embeddings using cosine similarity to rank and return the top-N results. As the question and the abstracts are both in numerical form, this can be done easily with a SciPy function.

distances = scipy.spatial.distance.cdist([query_embedding], df, "cosine")[0]
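A short sketch of the ranking step that could follow (assuming the distances array and the abstract list from the snippets above are in scope; the choice of top_n is arbitrary):

import numpy as np

# Smaller cosine distance = higher similarity; rank and keep the top-N abstracts.
top_n = 5
for rank, idx in enumerate(np.argsort(distances)[:top_n], start=1):
    print(rank, round(1 - distances[idx], 4), abstract[idx][:100])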

This article discussed how to use the pre-trained Sentence-BERT library for a downstream NLP task: finding relevant research articles based on a user’s questions. We could instead compare the question with the title or with the full research body. Comparing the question with only the title could be less informative, as the text length of the title is too small. In contrast, comparing the question with the research body requires heavy computation, and shrinking a huge research body to 768 dimensions would lead to a large loss of information. However, feel free to experiment with embeddings of the title and research body. Your findings may be fascinating.

Reference(s):

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks
https://github.com/UKPLab/sentence-transformers
https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md
https://www.aclweb.org/anthology/D19-1410.pdf
https://www.aclweb.org/anthology/D19-1410/

Bidirectional Encoder Representations from Transformers (BERT) Simplified

In the past, Natural Language Processing (NLP) models struggled to differentiate words based on context due to the use of shallow embedding methods for text analysis.

Bidirectional Encoder Representations from Transformers (BERT) has revolutionized the NLP research space. It excels at handling language problems considered “context-heavy” by mapping vectors onto words only after reading the entire sentence, in contrast to traditional directional NLP models.

This blog sheds light on the term BERT by explaining its components.

BERT (Bidirectional Encoder Representations from Transformers)

Bidirectional – Reads text in both directions. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore, it is considered bidirectional, though it would be more accurate to call it non-directional.

Encoder – Encodes the text in a format that the model can understand. It maps an input sequence of symbol representations to a sequence of continuous representations. In the original Transformer it is composed of a stack of 6 identical layers (BERT-base stacks 12, BERT-large 24). Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. The key feature of layer normalization is that it normalizes the inputs across the features.

Representation – To handle a variety of downstream tasks, BERT’s input representation can unambiguously represent both a single sentence and a pair of sentences (e.g., for question answering) in one token sequence in the form of transformer representations.

Transformers – The Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.

Transformers are built from a few key components; in this blog, we will only talk about the attention mechanism.

Limitations of RNNs compared to Transformers:

  • RNNs and their derivatives process tokens sequentially, which conflicts with one of the main benefits of a GPU, i.e., parallel processing
  • LSTMs, GRUs and their derivatives can learn long-term dependencies, but in practice they can only remember sequences of hundreds of tokens, not thousands or tens of thousands

Attention Concept

When driving, for instance, attention must be paid to the stop sign; similarly, in the example sentence, the verb “eating” receives higher attention in relation to “oats”.

Transformers use attention mechanisms to gather information about the relevant context of a given word, then encode that context in the vector that represents the word. Thus, attention and transformers together form smarter representations.

Types of Attention:

  1. Self-Attention
  2. Scaled Dot-Product Attention
  3. Multi-Head Attention

Self-Attention

Self-attention, also called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, etc.

Scaled Dot-Product Attention

Scaled dot-product attention takes queries Q and keys K of dimension dk, and values V of dimension dv. We compute the dot products of the query with all keys, divide each of them by √dk, and apply a softmax function to obtain the weights on the values.
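A minimal NumPy sketch of this computation (an illustration, not the exact implementation inside BERT):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of the values

Q = np.random.randn(3, 8)    # 3 query positions, d_k = 8
K = np.random.randn(4, 8)    # 4 key positions, d_k = 8
V = np.random.randn(4, 16)   # 4 value positions, d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)           # (3, 16)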

The two most commonly used attention functions are:

  • Dot-product (multiplicative) attention: This is identical to the scaled dot-product algorithm above, except for the scaling factor of 1/√dk.
  • Additive attention: Computes the compatibility function using a feed-forward network with a single hidden layer.

While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice as it uses a highly optimized matrix multiplication code.

Multi-Head Attention

Instead of performing a single attention function with dmodel-dimensional keys, values and queries, it is beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. The attention function is then performed in parallel on each of these projected versions, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
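A minimal NumPy sketch of the idea, reusing the scaled_dot_product_attention function from the earlier snippet; the random matrices merely stand in for the learned projection weights, and d_v is set equal to d_k for simplicity:

import numpy as np

def multi_head_attention(X, num_heads=8, d_model=64):
    # Project X into per-head queries, keys and values, attend in parallel,
    # then concatenate the heads and apply a final output projection.
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))  # stand-ins for W_i^Q, W_i^K, W_i^V
        heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))
    W_o = np.random.randn(num_heads * d_k, d_model)          # stand-in for W^O
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.randn(5, 64)                                   # 5 token positions, d_model = 64
print(multi_head_attention(X).shape)                         # (5, 64)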

Applications of BERT

  1. Context-based Question Answering: It is the task of finding an answer to a question over a given context (e.g., a paragraph from Wikipedia), where the answer to each question is a segment of the context.
  2. Named Entity Recognition (NER): It is the task of tagging entities in text with their corresponding type.
  3. Natural Language Inference: Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
  4. Text Classification

Conclusion:

Recent experimental improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of most language understanding systems. It is in our interest to further generalize these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broader set of NLP tasks.

References:

  1. https://arxiv.org/abs/1810.04805
  2. https://arxiv.org/abs/1706.03762

TensorFlow Lite: The Future of AI in Mobile Devices

Do you use Google services on your mobile phone? If so, you have probably noticed that their predictive capability has recently improved in both speed and accuracy, thanks to TensorFlow Lite working with your phone’s GPU.

What is TensorFlow Lite?

TFLite is TensorFlow’s lightweight solution for mobile, embedded and other IoT devices. It can be described as a toolkit that helps developers run TensorFlow models on such devices.

What is the need for TensorFlow Lite?

Running machine learning models on mobile devices is not easy due to limited resources like memory, power, storage, etc. Ensuring that deployed AI models are optimized for performance under such constraints becomes a necessary step in these scenarios. This is where TFLite comes into the picture. TFLite models are hyper-optimized with model pruning and quantization to preserve accuracy at a small binary size with low latency, allowing them to overcome these limitations and operate efficiently on such devices.

TensorFlow Lite consists of two main components:

  • The TensorFlow Lite converter, which converts TensorFlow models into an efficient form and applies optimizations to improve binary size and performance.
  • The TensorFlow Lite interpreter, which runs the optimized models on different types of hardware, including mobile phones, embedded Linux devices, and microcontrollers.

TensorFlow Lite Under the Hood

Before deploying the model on any platform, the trained model needs to go through a conversion process. The diagram below depicts the standard flow for deploying a model using TensorFlow Lite.

Fig: TensorFlow Lite conversion and inference flow diagram.

Step 1: Train the model in TensorFlow with any high-level API, e.g. Keras, and save the trained model (.h5, .hdf5, etc.).

Step 2: Once the trained model has been saved, convert it into a TFLite flat buffer using the TFLite converter. A flat buffer, a.k.a. the TFLite model, is a special serialized format optimized for performance. The TFLite model is saved as a file with the extension .tflite.

Step 3: Once the trained model has been converted into a TFLite flat buffer, it can be deployed to mobile or other embedded devices. After the interpreter loads the TFLite model on the device, we can go ahead and perform inference using the model.

Converting your trained model (‘my_model.h5’) into a TFLite model (‘my_model.tflite’) can be done with just a few lines of code as shown below:
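A minimal sketch of that conversion, assuming a TensorFlow 2.x environment and a Keras model saved as ‘my_model.h5’:

import tensorflow as tf

# Load the trained Keras model saved in step 1.
model = tf.keras.models.load_model("my_model.h5")

# Convert it into a TFLite flat buffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the serialized model with the .tflite extension.
with open("my_model.tflite", "wb") as f:
    f.write(tflite_model)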

How does TFLite overcome these challenges?

TensorFlow Lite uses a popular technique called quantization. Quantization is an optimization technique that constrains an input from a large set of values (such as the real numbers) to a discrete set (such as the integers); it essentially reduces the precision of a model’s representation. For instance, in a typical deep neural network, all the weights and activation outputs are represented by 32-bit floating-point numbers. Quantization converts this representation to the nearest 8-bit integers, which drastically reduces the model’s overall memory requirement and makes it ideal for deployment on mobile devices. While these 8-bit representations can be less precise, certain techniques can be applied to ensure that the inference accuracy of the quantized model is not affected significantly. This means that quantization can be used to make models smaller and faster without sacrificing accuracy.
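As one hedged illustration, post-training quantization can be requested from the same TFLite converter with a single optimization flag (again assuming the ‘my_model.h5’ file from the earlier sketch):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(tf.keras.models.load_model("my_model.h5"))
# Post-training quantization: weights are stored as 8-bit integers instead of 32-bit floats.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

with open("my_model_quantized.tflite", "wb") as f:
    f.write(converter.convert())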

Stay tuned for the follow-up blog, which will be a walkthrough of how to run a deep learning model on a Raspberry Pi 4. In the meantime, you can keep track of all the latest additions to TensorFlow Lite at https://www.tensorflow.org/lite/

AI in Robotic Process Automation – The Missing Link

Robotic Process Automation as we know it today is a framework through which large scale processes can be automated. The biggest advantage of current RPA systems is that they can be easily integrated with current IT and Software systems.

RPA can be used to automate repetitive process-driven work with well-defined outcomes. However, there is a catch. If this is Process Automation, then why don’t we just call it Process Automation? Apart from the marketing angle of using the word Robotic in Process Automation, whoever coined the term had something beyond just process automation in mind.

The intent and endeavor to automate workflows are not new. In traditional systems, automation was achieved by software developers building a comprehensive list of APIs or scripts to cover all pre-conceived and possible tasks. However, this approach had a serious drawback: it was not scalable.

In modern RPA systems, instead of writing a finite number of scripts, the software is trained to follow any number of steps by recording the actual process and then replicating it exactly as recorded on the RPA platform. The 90s generation used Excel macros in much the same way: a set of standard activities was recorded and stored so that, every time those processes were run, a complex but well-defined sequence executed automatically. Current RPA platforms operate on similar principles, although at a much larger scale of complexity and size and with more advanced technology.

While this overcomes the problem of scale, there was still one last challenge. The significance of the word Robotic comes from the fact that there is an element of intelligence expected from the process automation undertaken by an RPA platform. This intelligence allows the platform to take autonomous decisions based on a trigger. While the trigger can be programmed, the same is not the case with the decision tree that activates the trigger.

Most RPA software vendors, like UiPath, Blue Prism and Automation Anywhere, have so far come out with platforms that are very good at process automation, being programmed to follow a certain set of standard processes. However, they fall short when it comes to making the whole process intelligent. They fail to take their platforms from automatic to autonomous.

Let’s illustrate this with an example. Assume there is a complex senior management report that is generated by collating specific lines of data from various enterprise databases like Oracle, SAP, and others. After the report is generated, an automated mail is sent out to 500 users with content specific to each user.

The process is repetitive because multiple reports need to be generated from multiple sources of data. It is a complex process, with enormous volumes of data having a multiplier effect on each of the customized reports generated for more than 500 stakeholders, which are then emailed to the intended recipients.

It sounds like quite a complex process, but today’s RPA platforms are equipped to handle it easily and repeat the automation with minimal errors. This does not require a lot of development on existing RPA platforms, and it can be easily integrated with the platforms mentioned earlier.

However, current implementations still fall short of one critical feature required to certify it as a true autonomous implementation.

Take one particular use case as an example: if the RPA platform had to decide whom to send the reports to based on some critical content of the reports, which could be a picture, text or a number and could appear in an unpredictable pattern, the platform would fail to do so. It fails for the simple reason that it does not know how to detect and handle unknown situations, and the decisions they require, because they are not part of the standard process.

Similarly, many other use cases are missing from current RPA platforms. While they claim that some of them are AI-enabled, most of them are not there yet.

The main reason for this shortcoming is that AI is not a core competency of process automation developers. Naturally, it is an area they are skeptical about investing in, and rightfully so; it is going to be difficult for them to develop such specialized competency. RPA platforms should, therefore, drop the word Robotic unless their platforms are truly autonomous.

RPA enterprise customers who want a real, large-scale implementation of their process automation platforms should work with AI service providers like Affine to add the element of intelligence and true autonomous capabilities to the process automation already implemented.

One should not have apprehensions about integration here: just as RPA platforms can be easily integrated into existing systems, AI modules can be integrated either into the RPA modules or directly into the end systems.

That is when RPA implementations will truly become Robotic in nature.

Capsule Network: A step towards AI mimicking human learning systems

1. A quick introduction to Convolution Neural Networks

The field of computer vision has witnessed a paradigm shift after the introduction of Convolutional Neural Network (CNN) architectures, which have pushed AI performance to be on par with humans. There has been significant progress in CNN-driven architectures, right from the first AlexNet architecture published in 2012 to newer architectures like ResNet, DenseNet, NASNet and, more recently, EfficientNet; each focuses on improving accuracy while rationalizing the computing cost of adoption (through a smaller number of parameters).

An Illustration of CNN architecture

Evolution of various CNN architectures Ref: https://arxiv.org/pdf/1905.11946.pdf

CNNs learn faster and with higher accuracy than traditional non-convolutional image models owing to features like:

  • Local connectivity: While this limits learning to nearby pixels, it is sufficient to capture the correlations required to evaluate an image
  • Parameter sharing across spatial locations: This makes learning easier and faster by reducing redundancy, e.g. if the algorithm has learned to detect a horizontal edge at point A, it need not learn horizontal edge detection again at point B

2. Drawbacks of CNN

While CNNs have worked remarkably well, they fall short on two key aspects.

Lack of invariance: Human beings perceive in a translation-invariant way, which means we are capable of identifying an object even if its location and orientation change in our field of view. An example below:

Humans can identify the cat in each of the above scenarios. However, a CNN needs to be trained on multiple orientations for accurate inference. While image augmentation techniques have helped overcome the orientation challenge, they lead to higher processing and data management costs and might not work in all scenarios.

Loss of information on related features:

In the scenario above, CNN will recognize both Figures A and B as a face.

This happens due to the Pooling layers in the CNN architecture. In simple terms, we know that initial layers in CNN learn about the low-level features e.g. edges, points, curves, arcs, etc., while the later layers learn about high-level features e.g. eyes, nose, lips which are further enhanced to identify the actual object i.e. the human face in subsequent layers. The intermediate Pooling layers help in regulating these high order features while reducing the spatial size of the data flowing through the network. The dynamics of Pooling layers don’t take into account the spatial relationships between simple and complex objects.

Thus, in both Figure A and B, CNN recognizes a face by evaluating the presence of high-level features like eyes, nose, lips, without applying any cognizance to their spatial relationships.

3. How can Capsule Network address the problem?

According to Hinton, for accurate image classification a CNN should be able to learn and preserve the spatial relationships and hierarchies between features. Capsule networks, introduced by Hinton and his team, are a step towards learning a better and more complete knowledge representation.

What is a capsule? A capsule is a group of neurons that captures both the likelihood and the parameters of a specific feature. Capsule networks use a vector representation, as compared to the scalar representations used by the neurons in CNN architectures. This capsule vector representation encodes (1) whether an object exists, (2) what its key features are, and (3) where it is located. Thus, for an object capsule, the activations consist of:

  1. Presence probability ak
  2. Feature vector ck
  3. Pose matrix OVk, which represents the geometrical relationship between the object and its parts, making it possible to infer various feature orientations and hierarchies

Capsule networks will also be able to identify an abstract version of an object even if they have not been trained on that particular image; an example below:

Capsule networks will be able to identify Image2 with moderate accuracy even if it hasn’t been part of their training images. This is not possible with regular CNN architectures.

4. What is the current status in Capsule Networks research?

A few popular techniques for implementing a capsule architecture are discussed below:

A. Dynamic Routing Algorithm (source)

The original capsule network developed by Hinton and his team uses a dynamic routing algorithm to group child capsules to form a parent capsule. The vectors of an input capsule are transformed to match the output, and if a close match is available it forms a vote; capsules with similar votes are grouped.

If the activity vector has a close similarity with the prediction vector, then both capsules are inferred to be related to each other. The main drawback of this approach is that it takes a long time during both training and inferencing. Since the voting is done iteratively, each part could start by initially disagreeing and voting on different objects before converging to the relevant object. Hence this iterative manner of voting is highly time-consuming and inefficient.

B. Stacked Capsule Autoencoders (SCAE) (source)

Dynamic routing can be thought of as a bottom-up approach, where the parts are used to learn the parts → object relationship. Since this relationship is learnt in an iterative manner (iterative routing), it leads to many inefficiencies. SCAE takes a top-down approach, where the parts of a given object are predicted, thus removing the dependency on iterative routing. A further advantage of this version of capsules is that it can perform unsupervised learning.

SCAE consists of two networks:

  1. Part Capsule Autoencoder (PCAE): Detects parts and recombines them into an image in the following manner
  • Part Capsule Encoder: Segments an image into constituent parts and infers their poses
  • Part Capsule Decoder: Learns an image template for each part and reconstructs each image pixel

  2. Object Capsule Autoencoder (OCAE): Organizes parts into objects
  • Object Capsule Encoder: Tries to organize discovered parts and their poses into a smaller set of objects
  • Object Capsule Decoder: Makes predictions for each object part into one of the object types

Capsule networks have achieved better performance than CNNs on the MNIST (98.5% to 99%) and SVHN (55% to 67%) datasets. The major drawback of SCAE is that the part decoder uses fixed templates, which are insufficient to model complicated real-world images. These networks can be perceived as networks with reasoning ability; with more expressive templates in the future, they could help infer complex real-world images efficiently with less data.

Are Streaming-services like Stadia the future of Gaming?

1. Introduction

Uber has revolutionized commuting since its launch. Traveling short distances has never been this hassle-free. Earlier, people used their personal vehicles to cover small distances; the other alternative was public transport, which is time-consuming and inconvenient. Uber, on the other hand, provides flexibility to non-frequent travelers and to those who commute over shorter distances, as they do not have to spend on purchasing a vehicle and can still move around very conveniently.

The same might hold true for the future of gaming! How would you feel if technology giants like Google and Amazon owned the expensive hardware to process games with the best possible CPUs and GPUs, allowing you to simply stream the games? This could potentially eliminate the need to purchase an expensive console and let you pay in proportion to usage! It could be a game changer, especially for someone who has not been able to commit to an INR 30,000 console to play a single game. Can the entry of Google and Amazon into the gaming industry make this possible?

At the Game Developers Conference (GDC) 2019, Google unveiled its cloud-streaming service called Stadia. Just as humans have built stadiums for sports over hundreds of years, Google believes it is building a virtual stadium, Stadia, to let thousands of players play or spectate games simultaneously while interacting with each other. Free-to-play games like Fortnite would stand out on Stadia if Google can increase the number of players participating in an instance from 100 to, say, thousands. Whether Stadia will really live up to its hype is a tricky question that only time may answer.

2. How does it work?

Google will make use of its massive data centers across the globe, which will act as the computational power for this service. These servers will use advanced CPUs, GPUs, RAM and storage to render games and stream the enhanced audio/visual output to users. The players’ inputs are uploaded directly to the server via keyboard or the custom Stadia controller. Let’s look at how Stadia stands against conventional console-based gaming.

3. Comes with advantages over console-based gaming

3.1. No Hardware (other than a remote): The bare minimum piece of hardware required is a device that can run Chrome, such as a laptop, PC, mobile, tablet or even a smart TV.

3.2. No Upgrade costs, as they are taken care of by the shared infrastructure hosted by Google. In the recent past, we had games that were below 10 GB in size, while the recent RDR2 was above 100 GB with its patches. One can imagine how the need to upgrade hardware is the biggest driver for upgrading to next-gen consoles.

3.3. No RAM/ROM or loading-time limitations: Apart from these, YouTube integration will enable users to live-broadcast their gameplay and will allow others to join as well in the case of multiplayer games. In addition, the Google Assistant present on the Stadia controller will provide immediate help in case one is stuck at some point and needs to clear the stage.

The benefits of this concept are really promising. But will the drawbacks offset these promises? Let’s go through each of them.

4. Need to overcome challenges to expand at scale

The drawbacks can potentially be addressed over time, but for now, scaling this remains the biggest hindrance. There are various challenges that Google (and users) will face, such as latency, pricing, markets and the game library. There are other pointers as well, but these are going to be the biggest ones.

4.1. Latency effect

The video footage must get to you, and the controller inputs must get from you to the server. Hence it is obvious that there is going to be extra latency. Latency will depend upon elements such as:

– The amount of time to encode and decode the video feed: Google has tons of experience in this field thanks to YouTube

– The quality of internet infrastructure at the end user: This is the more worrisome problem. Internet speed will be good in tier-1 cities, but not necessarily in rural areas. You will also need a data connection without any cap. As per Google, a minimum speed of 25 Mbps is required for Stadia to function, which means about 11.25 GB of data transferred per hour. That’s about 90 hours of game streaming before a 1 TB data cap is exhausted; in other words, 3 hours of gaming per day in a 30-day month. This is under the assumption that there is only one user and the connection is used only for gaming.

4.2. Dilemma for developers

That was the issue the end user will face; let’s look at the situation from the game developer’s perspective. With the advent of a new platform, developers will have yet another platform to port and test games on. They will have to do more research, which will increase the cost of production, and more time will be required to release a game. This will be a big challenge for franchises that launch games every year. Google has partnered with Ubisoft and has promised to feature Ubisoft games at launch. Time will tell how many more developers will be willing to go a step ahead to support this concept. If not, this could mean that a lot of games will never be available. From a consumer’s perspective, it will then be hard to justify the purchase, as they won’t be able to play all the games available in the market.

4.3. Optimal pricing

Another challenge will be pricing. There is no information regarding the pricing of the overall model. Is this going to be a subscription service? Do we have to buy games? How is the revenue going to be shared with developers? Will the pricing be the same for hardcore gamers and casual gamers? Consider Activision (developer of games like Call of Duty), for example. Historical analysis tells us that slightly more than one-fourth of purchasers do not even play the game for a few hours, while others play it day in and day out. The cost each user has to pay for the game is $60, which goes to Activision and the platform on which it is sold. If Activision decides to release the game on Stadia, all the casual purchasers who would have bought the game to test out the hype would now just stream it on Stadia at a much lower cost. Will Activision take that chance and release the game on Stadia? If the pricing differs by type of user, how will the revenue be shared with the developers? Let’s assume this will be a subscription model and users will be charged $30 per month, which comes out to $360 per year. For a casual gamer, this is very high, as they can buy a console for $300 and play for years. All these questions will have to be answered before the launch. Running a cloud gaming service is expensive. If the whole selling point is making gaming accessible to more and more people, then a high price point is not going to help the cause.

4.4. Available markets

At the GDC event, the team said that the service will be available in the US, Canada, the UK, and Europe at launch. These regions have a high penetration of console-based gamers, and Google will have to make a lot of effort to get these people to switch. The penetration of the PlayStation and Microsoft Xbox is in single digits in India and China. With Stadia not available in Asia, Google is missing a lot of developing countries like India and China, where people are not inclined towards consoles, thus hampering its user coverage. Given the high cost of consoles in developing countries like India, Stadia could become the go-to gaming platform there.

4.5. Array of games available

The games library will be another hurdle in the race. We have no information regarding the list of games available at launch. Third-party support isn’t enough for a gaming platform to survive; you need a list of exclusive games to bring people aboard. Google even unveiled its own Stadia Games and Entertainment studio to create Stadia-exclusive titles, but it didn’t mention any details on what games it will be building. In addition, it is highly unlikely that console exclusives (1P titles) like Spider-Man or Halo will be available for Stadia. 1P games play a significant role in console sales, and Sony and Microsoft will never let this happen as long as they stick to console-based gaming. So, Google will have to come up with its own exclusive titles to be dominant in the market. Making exclusive games takes a lot of research and time; it took Sony a good 5-6 years to develop one of its best-selling games, “God of War”. If Google has not already started on its exclusive games, then it will be a mountain to climb for them.

4.6. What about other browsers?

Stadia will only be available through Chrome, Chromecast, and on Android devices initially. There was no mention of iOS support through a dedicated app or Apple’s Safari mobile browser. Will Apple be comfortable letting its user base shift completely from Safari to Chrome? Will Apple charge Google additional money for the subscriptions that Google gets on Apple’s devices? All these questions will be answered over time.

4.7. What if…?

Last but not least, in case Google decides to drop the idea of Stadia in the years after its launch, as it has done in the past with products like Google+, gamers will lose all their progress and games despite their subscription fees. Apart from the above drawbacks, Google is not the only company to step into this field; it already has some serious competition from existing players in the game-streaming sector.

5. Any competition that Google might face?

Sony already streams games to its consoles and PCs via its PlayStation Now service. Microsoft is also planning its own cloud game streaming service and can leverage its Azure data centers. Also, both Sony and Microsoft don’t require developers to port their games for their cloud streaming service. Apart from these two players, Nvidia has been quite successful in this domain allowing users to stream games from its library. This means Google has some strong competition and looks like the cloud gaming war is just getting started.

6. Conclusion

What is the incremental change you get from one version of a device to another? It is the absolute bare minimum they can give to make people switch. Take the example of the PS4 Slim and PS4 Pro: the only difference is that the Pro supports 4K while the Slim doesn’t, and we have still seen 30% of people switch from the Slim to the Pro. The entrance of Google into the gaming industry will make PlayStation better, it will make Xbox better, and it will make internet infrastructure better. The success or failure of Google Stadia will cost the consumer nothing, and at the same time it will be a net positive for the gaming industry as well.

Thanks for reading this blog. For any feedback/suggestions/comments,
please drop a mail to marketing@affine.ai

Contributors:
Shailesh Singh – Delivery Manager
Akash Mishra – Senior Business Analyst

NB-NET – A Convolutional Neural Network based Classification Approach for Detecting Foot Overstep No Balls in Cricket

During the seventh match of the ongoing 12th season of the Indian Premier League, between Royal Challengers Bangalore and Mumbai Indians at Bengaluru, Mumbai Indians defeated Royal Challengers Bangalore by 6 runs to register their first win of IPL 2019.

In what was a nerve-wracking match, it went down to the last ball. It was legendary batsman AB de Villiers who brought the equation down from 40 off 18 to 17 off the last over. With seven runs needed off the last ball, veteran bowler Lasith Malinga managed to close out the match. However, once the match was over, replays confirmed that Malinga had bowled a front-foot no ball. With the umpire missing it, there were some unhappy faces in the Bangalore camp.

“If this (the Malinga no-ball) had happened in the first, second or third overs of the game, I can tell you 100% surely that the umpire would have called it. The soil on the pitch becomes loose after a few overs. The line tends to go off even if it’s re-drawn. It’s too tough for the umpires to call no balls towards the end of the games. On a loose turf, when the bowler keeps landing on the pitch, the line doesn’t stay. From side view, we’ll see the line because it’s a two-dimensional image, and sometimes the lines are clearer outside the turf so the viewer thinks it’s there. There will be a parallax error.” – Says Hariharan, Former ICC Veteran Umpire

Cricket is a globally popular game where a single delivery can change the fate of a match; every delivery is a crucial moment for both teams. Umpires make the decisions regarding no balls, and different technologies are being used to help them do so. But often, due to the limits of human perception, deciding whether a delivery is a no ball or a legal ball creates controversy. So, it is very important to make an accurate decision regarding a no ball. Simultaneously, with the advent and advances of Artificial Intelligence and Computer Vision, their application in different domains has become an emerging trend, and applying computer vision techniques to analyse different cricket events and automatically arrive at decisions has become popular in recent days.

We have deployed a CNN-based classification method with ResNet50 to automatically detect and differentiate foot-overstepping no balls from fair deliveries. Our goal is to estimate, for a given image, the probability that it shows a no ball, building towards an automated umpiring system and eliminating the shortcomings of human perception.

Transfer learning uses the knowledge gained from solving one problem and applies it to another related problem. When collecting enough training data to build models from scratch is difficult, transfer learning transfers knowledge from a large dataset (the source domain) to a smaller dataset (the target domain), boosting performance on the target task even when the feature spaces of the domains differ or the source and target tasks focus on different topics.

Using the state-of-the-art deep learning techniques, we have devised a methodology to achieve our goal –

In our model to classify no balls, we use images as input. Our input dataset contains images collected from Google image search. The images are manually annotated and contain two classes: no ball and legal ball. We have used Keras and TensorFlow to build our model and generate results. Our model produces a score for each of the two possible outcomes, which are then converted to probabilities by a softmax.

All the collected images were resized, and their resolution was enhanced using the techniques given below before being processed through a customized data augmentation pipeline. Some of the techniques used to enhance the resolution of the images:
A] Contrast Limited Adaptive Histogram Equalization (CLAHE)
B] Enhancing image resolution using super resolution (OpenCV)
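A minimal OpenCV sketch of the CLAHE step (the file names here are hypothetical):

import cv2

# Load a frame in grayscale and boost its local contrast with CLAHE.
img = cv2.imread("delivery_frame.jpg", cv2.IMREAD_GRAYSCALE)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)
cv2.imwrite("delivery_frame_clahe.jpg", enhanced)
# For super resolution, OpenCV's dnn_superres module (opencv-contrib) can upscale
# the frame with a pre-trained model such as EDSR or FSRCNN.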

Fig: Original photo vs. enhanced photo.

We have used ResNet50 as the classifying CNN in our model. In this step, we keep the parameters of the earlier layers fixed, remove the final layer, and retrain a new last layer on our dataset. The new last layer is trained by the backpropagation algorithm, with a cross-entropy cost function used to adjust the weights by calculating the error between the output of the softmax layer and the label vector of the given category.
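A minimal Keras sketch of this transfer-learning setup (the input size, optimizer and commented file names are illustrative, not the exact configuration used in the project):

from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

# ResNet50 pre-trained on ImageNet, without its final classification layer.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # keep the parameters of the earlier layers fixed

# New last layer for the two classes: no ball vs. legal ball.
model = models.Sequential([
    base,
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_generator, epochs=10, validation_data=val_generator)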

To improve our results further we have used differential learning rates for the model layers. In the pretrained model, the layers closer to the input are more likely to have learned more general features. Thus, we don’t want to change them much. For these early layers, we set the LR to be very low. We increase the LR per layer gradually as we move deeper into the model.

Fig: Sample model outputs (no ball vs. legal ball).

To make the classification more robust and accurate, we have used test time augmentation (TTA). Test time augmentation performs random modifications to the test images. Instead of showing the regular, “clean” image to the trained model only once, we show it several augmented versions, then average the predictions for each corresponding image and take that as the final prediction. This boosts the model’s results at testing time.
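A small Keras/NumPy sketch of the idea (the augmentation parameters and the helper name tta_predict are hypothetical):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly perturb each test image several times and average the predictions.
tta_gen = ImageDataGenerator(rotation_range=5, width_shift_range=0.05,
                             height_shift_range=0.05, horizontal_flip=True)

def tta_predict(model, image, n_aug=8):
    batch = np.stack([tta_gen.random_transform(image) for _ in range(n_aug)])
    return model.predict(batch).mean(axis=0)   # averaged class probabilities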

Using this model, we eliminated the shortcomings of the umpire's perception in deciding an overstep no ball. Compared to many other no-ball detection approaches and applications, our approach is more effective and efficient.


Isolating Toxic Comments to prevent Cyber Bullying

Online communities are susceptible to Personal Aggression, Harassment, and Cyberbullying. This is expressed in the usage of Toxic Language, Profanity or Abusive Statements. Toxicity is the use of threats, obscenity, insults, and identity-based hateful language.

Our objective was to create an algorithm which could identify Toxicity and other Categories with an Accuracy of > 90%. We will now discuss how we created a model with an accuracy of 98%.

Approach Part 1: Engineer the right features for Emotion Capture

  • No doubt Toxicity can be detected and measured by the presence of cuss words, obscenities etc. But is that all? Is the Bag of Words approach, which stresses content alone, sufficient to define Toxicity? Or are we missing information carried by the sequential nature of language and expression, which comes not just from the chosen words but also from the way they are written or used?
  • This additional thought process of using Emotion as well as content was validated on the Kaggle Toxic Comment Challenge data.
  • The Emotion and Intensity of people’s written comments were captured via Feature Engineering and Deep Learning models, while the content was captured using Machine Learning models that follow the Bag of Words approach.
  • The inclusion of Emotions, coupled with our Ensembling approach, raised accuracy from 60% (Bag of Words) to 98%.

Approach Part 2: Create Disparate Models for the best Ensemble

With the Feature Engineering taken care of, we have all the information that needed to be extracted from our Corpus. The next step is getting the best possible Ensemble model.

We ran Latent Dirichlet Allocation on the Corpus using the Bag of Words approach and found that some of the Categories were quite similar to one another when we examined the Probability Distributions. Categories such as “Toxic”, “Severe Toxic” and “Obscene” have only a small margin of Decision Boundary between them. This at least confirms that creating one model that predicts all categories may be Time Consuming.

Our Strategy – Come up with the best model for a category. Concentrate on Parameter Tuning of Individual Models. Parameter Tune it till we get the best possible model. Finally, ensemble these models.

This paved the way for arriving at the best ensemble model, as each of the individual models now has less correlation with the others.
Also, conceptually we had two classes of Models: the Deep Learning Sequential Models and the Machine Learning Bag of Words models. This again contributed to our Ensembling idea, meaning there are certain categories which the Sequential Models are particularly good at, and likewise for the Bag of Words models.

LSTM is a special case of Recurrent Neural Networks. Our Sequential Models consisted of LSTM models each of which were tuned in a manner as to have an edge in predicting one or at most two categories.

Our Bag of words model consisted of Light GBM models. Each of these models with their own different parameters and strengths in a category.

Illustration: As a part of our Ensembling Strategy, Sequential and Bag of Words Models were built for each Toxic Category.

Conclusion: Text Categorization can be dealt with efficiently if we combine the Bag of Words and the Sequential approaches. Also, emotions play a key role in Text Categorization for Classification Problems such as Toxicity.

Contributors

Anshuman Neog: Anshuman is a Consultant at Affine. He is a Deep Learning & Natural Language Processing enthusiast.

Ansuman Chand: Ansuman is a Business Analyst at Affine. His interests involve real-world implementation of Computer Vision, Machine Intelligence and Cognitive Science. He wishes to actively leverage ML and DL techniques to build solutions.

Learn How to Classify Documents Using Computer Vision and NLP

Many companies, especially those in BFSI and Legal sectors, deal with a large volume of handwritten and scanned documents. It is difficult to easily use the granular information in these documents to perform an analysis or even browse through the documents in a convenient manner. A simple classification of the documents into meaningful bins or folders would make it a lot easier to leverage the information within the documents.

This blog focuses on a Document/Text Classification solution that we developed for an insurance industry client, aimed at grouping medical/health insurance claims into pre-defined categories. The existing categorization process was done manually by a panel of experts. These experts had their own biases and heuristics, which led to inconsistencies.

We developed a Deep Learning based framework which ensembled learnings from a document’s layout and structure, the content/text within the document, and an amalgamation of consistent and coherent expert opinions. The framework helped automate the existing process, leading to better efficiency and efficacy.

We had a set of 40k scanned images of medical insurance documents and built an algorithm to classify those documents into 5 given categories. These scanned documents exhibited characteristics of each class based on the document structure and the token sequences present in the document.

Document Sample:

Sample Image Features: Document Structure e.g. QR Code on Top Left Corner, Gridlines etc.

Sample Text Features: Presence of text sequences such as Health Claim Insurance Form, Field information e.g. Insured’s ID Number, Patient’s Name, Patient’s Address, Date of Service etc.

ANALYTICAL APPROACH

Over the past few years, Deep Learning (DL) architectures and algorithms have made impressive advances in fields such as image recognition and speech processing.

Their application to Natural Language Processing (NLP) has now proven to make significant contributions, yielding state-of-the-art results for some common NLP tasks. Named entity recognition (NER), topic modelling and sentiment analysis are some of the problems where neural network models have outperformed traditional approaches.
Convolutional neural networks (CNN) have also been widely used in automatic image classification systems, Object Detection and Recognition, Neural Style Transfer and many more applications. Image classification is the task of taking an input image and outputting a class or a probability of classes that best describes the image.

For the given problem, we decided to use the features of images as well as text in the document. Since these documents were scanned images, the first challenge was to extract text out of them, and second to draw meaningful insights from this text and image. Following is the overall solution architecture:

To extract the text sequences out of these images, we performed OCR (Optical Character Recognition) using Tesseract. For Python users, there is an OCR library called “pytesseract” that provides an “image_to_string” function.
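A minimal pytesseract sketch (the file name is hypothetical, and the Tesseract binary must be installed on the system):

from PIL import Image
import pytesseract

# OCR a scanned claim form into raw text.
text = pytesseract.image_to_string(Image.open("claim_form_scan.png"))
print(text[:500])   # first few hundred characters of the extracted text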

CHOICE OF MODEL

Through the OCR step we were able to extract the text sequences from these scanned images. This gave us both text- and image-based features, which were leveraged to develop the classification algorithm.

Bi-directional LSTM:

For text features such as ‘Patient’s Name’, ‘Patient’s Address’, ‘Insured’s ID Number’, ‘Type of Bill’, ‘Patient’s Control Number’ etc. we decided to implement a bidirectional LSTM. Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems. It involves duplicating the first recurrent layer in the network so that there are now two layers side-by-side, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the second.

The text sequences were converted into word vectors using pre-trained Glove embedding vector matrix and then passed through bidirectional LSTM (Long Short-Term Memory) Layer and 2 Dense Layers with ‘ReLU’ and ‘Softmax’ activations respectively.
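A minimal Keras sketch of such a network (the dimensions are illustrative, and the random embedding_matrix is only a placeholder for the real GloVe matrix):

import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len, num_classes = 20000, 100, 200, 5
embedding_matrix = np.random.rand(vocab_size, embed_dim)       # placeholder for the GloVe weights

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
                     input_length=max_len, trainable=False),   # frozen pre-trained embeddings
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])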

Convolutional Neural Networks:

To deal with the image features, we decided to implement a Convolutional Neural Network.
The image features were extracted and enhanced through 2 convolution layers and 3 dense layers, with ReLU activations and a softmax activation for the output layer. Data augmentation was done by flipping the scanned images to add more features for robustness. Various transformations like resizing, rescaling, etc. were also experimented with.
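A minimal Keras sketch matching that description of 2 convolution layers and 3 dense layers (the filter counts and input size are illustrative):

from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(256, 256, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),   # 5 document classes
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])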

Combination of LSTM & CNN:

A combination of LSTM and convolution layers was also experimented with on the text features, which resulted in good classification accuracy.

Ensembling: Ensembling was done to capture features from the different models and improve classification accuracy. This helped achieve 90%-plus overall accuracy in correctly classifying the documents into the 5 respective classes.

Please feel free to comment in case of any queries.

Contributors

Shifu Jain: Shifu is a Senior Business Analyst and a part of Affine Artificial Intelligence CoE “AICoE”. Her interests involve exploring and learning new researches in NLP, Topic Modelling, Recurrent Neural Networks etc.

Karthik Devaraj: Karthik is a Consultant at Affine with 6+ years’ experience in the field of Computer Vision, Machine Translation, Generative Algorithms.

Transmorphosizing Banking Through Artificial Intelligence

Banking is re-inventing itself as it always has

“Help us build the kind of bank you want to use” – that’s what Monzo says to its customers. Monzo is a UK-based banking app with no usage fees, no branches, no mortgages and zero charges for spending abroad. While a few banks in the UK might have similar offers, where Monzo has gained ground is in how it emotionally connects with its users. It auto-budgets your expenses, highlights where you are spending more, and helps in finding the best deal, e.g. it puts money in the best savings accounts. Monzo has formed partnerships with Google and Amazon to ensure real-time banking. For generations of users who spend a considerable amount of their time on their smartphones, it’s more than a conventional bank. Not surprisingly, it has been termed the new “wonder kid” of the banking industry and is aiming for a billion-user base in the next 5 years (ref: www.theguardian.com).

Seems like fiction, doesn’t it? Not quite, because innovation has been the one constant in the banking sector. History presents us with numerous pieces of evidence in the form of:

  • Giovanni di Medici setting up the first credit banking systems in Renaissance Europe, where charging interest was considered a sin
  • The Dutch East India Company trading in tulips as stocks
  • The bond markets fueling the Rothschilds to become one of the richest and most influential families
  • The evolution of e-wallets, UPIs and e-money transfers giving a new definition to funds re-allocation

While the initial transitions required a couple of centuries, the later ones took only a few decades. With the rate at which better technologies and techniques are being made available, innovations are hitting us at a faster pace than previously imagined. The change is already here to be seen:

  • From physical branch-based model to mobile platforms
  • Human interactions with virtual interface like chat-bots/virtual assistants
  • Single player bank to multiple players who are traditionally not banks e.g. blockchain
  • Inclusion of social risks along with traditional-risk analysis
  • Mass market products to hyper-personalized products in real time

With Fintech startups gaining greater prominence and technology behemoths such as Facebook and Amazon posing indirect challenges to traditional banking systems, competition in this sector is going to get more serious. In the Indian context, with non-traditional players like Paytm and Reliance making in-roads, the banking landscape might undergo a complete transformation.

Why is it happening now and what changed?

Two key factors – Creation and Consumption. While “Creation” is being driven by mining better insights through “data democratization” enabled on Big Data platforms, “Consumption” has increased through relevant and accurate outcomes on this data, catalyzed by Machine Learning and, more recently, Artificial Intelligence.

Artificial Intelligence (AI) has accelerated the rate of innovation in the last couple of years, much like “fire” in human evolution. Its ability to mimic the human brain has led to identifying patterns and insights which were historically impossible or took a long time to manifest. For example, while matching an image it tries to think like a human, using processes like Convolutional Neural Networks, unlike pre-defined algorithms that were tediously trained to function within fixed mathematical boundaries. AI is extending traditional analytics to look beyond just numbers and statistics; it is revolutionizing the concept of “sensing” rather than “predicting”. These technologies are not new. While AI has existed since the 1950s, adoption has been high in recent times due to the availability of huge computing power, big data and the reduced cost of implementation.

AI has already been adopted across several use cases like:

  • Identifying Frauds with better accuracies
  • Cross Selling and Up Selling using Chatbots/Virtual Assistance
  • AI-assisted portfolio management
  • IoT to better manage marketing across multi-customer touch points
  • Process automation like Contract Optimization

But are these actions sufficient for survival?

Unfortunately, history has been etched with battles between unconventional domains: instances like the downfall of Kodak when it delayed the adoption of digital cameras, airlines being hugely impacted by video-conferencing technologies, or the Amazon Kindle impacting publishing houses. Innovation is essential, and at a faster pace. The marketplace has seen the entry of unconventional players like:

  • Reliance Jio: It already has a huge customer base of over 120 M, and this reach may increase exponentially, posing a threat to conventional bankers
  • Paytm: It has expanded from a wallet to a marketplace with a customer base of 200 M. With a humongous gamut of offers in its kitty and a favorable customer perception, it has already started putting a dent in the online transactions & payments market
  • In global markets: Jawbone and Amex have developed the fitness + payment tracker “UP4”, helping payment channels become a natural extension of everyday life, while Westpac has come up with geo-location-specific marketing
  • P2P lenders and their alarming growth rate of 300% in the last couple of years

Banking needs to leapfrog from the traditional way of doing things to new ways of thinking.

Can Banking be done differently to “thrive than just survive”?

With AI getting more “Context-Aware” and frameworks like Ubiquitous Computing gaining prominence, “Cyber-Physical” systems will become a reality, or already have, as discussed in the “UP4” example. Banking cannot survive on innovation/automation alone; it needs to “Transform”. Bankers need fiction writers, futurists who can dream of the future. All we can say is that there needs to be a “Trans-Morphosis”, from natural genetic change to a non-predefined transition.

If we re-visualize the opportunities through the prism of “Trans-Morphosis” few of those things can be done differently:

  • Stopping a fraud before it happens – like manufacturing, can we develop something along the lines of “preventive maintenance” to sense trends before the “boiler” fails?
  • Incentivizing customers for having a high credit score – like the insurance sector, where better drivers are awarded “No Claim” bonuses
  • Transforming Virtual Assistants into Virtual Mentors – helping customers not just with the best offers but with suggestions to fulfill their dreams, e.g. helping to maximize savings for the down payment of a car by seamlessly liquidating a portion of existing savings portfolios like SIPs, mutual funds, etc.

These are just a few possibilities. Just think of additional opportunities like

  • Shared Risk Platforms across multiple partners or other banks – like blockchain, where risks are shared among multiple stakeholders. A recent defaulter with a good credit history applies for a loan; his current requirement can be funded by multiple banking partners, each sharing a portion of the risk. The risk value, loan amount and lending time/rates can be evaluated through AI to mitigate risk and give the customer a better experience while ensuring he is still part of the banking system
  • Understanding customer needs in real time based on preferences influenced by one’s social setting, e.g. what others who can influence him/her are spending on. Our decisions may be correlated with some facts, but we humans may not behave rationally. So rather than predicting, sense the possibilities through AI

What this decade is seeing is data-based decision making moving from being a decision aid to being the decision engine to almost being the decision maker.

To conclude “Data is the new nuclear fuel and AI is the reactor”. If controlled and curated we can fulfill numerous next-gen possibilities.
