The transformer revolution in video recognition. Are you ready for it?

Imagine how many lives could be saved if caretakers or medical professionals could be alerted when an unmonitored patient showed the first signs of sickness. Imagine how much more secure our public spaces could be if police or security personnel could be alerted upon suspicious behavior. Imagine how many tournaments could be won if activity recognition could inform teams and coaches of flaws in athletes’ form and functioning.

With Human Activity Recognition (HAR), all these scenarios can be effectively tackled. HAR has been one of the most complex challenges in the domain of computer vision. It has a wide variety of potential applications, such as sports, post-injury rehabilitation, analytics, security surveillance, and traffic monitoring. The complexity of HAR arises from the fact that an action spans both spatial and temporal dimensions. In typical computer vision tasks, where a model is trained to classify or detect objects in a picture, only the spatial dimension is involved. In HAR, however, learning from multiple frames together over time helps classify the action better. Hence, the model must be able to track both the spatial and temporal components.

Architectures used for video activity recognition include 2D convolutions, 3D CNN volume filters that capture spatio-temporal information (Tran et al., 2015), 3D convolutions factorized into separate spatial and temporal convolutions (Tran et al., 2018), CNN features fused over time (Karpathy et al., 2014), LSTMs for temporal modeling, as well as combinations and enhancements of these techniques.

TimeSformer – A revolution by Facebook!

Transformers have been making waves in Natural Language Processing for the last few years. They employ self-attention within an encoder-decoder architecture to make accurate predictions by extracting contextual information. In the domain of computer vision, the first prominent implementation of transformers came through the Vision Transformer (ViT) developed by Google. In ViT, a picture is divided into patches of size 16×16 (see Figure 1), which are flattened into 1D vectors, embedded, and passed through an encoder. Self-attention is calculated for each patch with respect to all the other patches.


Figure 1. The Vision Transformer treats an input image as a sequence of patches, akin to a series of word embeddings generated by an NLP Transformer. (Source: Google AI Blog: Transformers for Image Recognition at Scale (googleblog.com))

Recently, Facebook developed TimeSformer, one of the first architectures to apply transformers to HAR. As with other HAR methods, the input to TimeSformer is a block of continuous frames from the video clip, for example, 16 continuous frames of size 3x224x224. To calculate the self-attention for a patch in a frame, two sets of other patches are used:

 a) other patches of the same frame (spatial attention).

 b) patches of the adjacent frames (temporal attention).

There are several different ways to use these patches. We have utilized only “divided space-time attention” (Figure 2) for this purpose. It uses all the patches of the current frame and the patches at the same position in adjacent frames. In “divided attention”, temporal attention and spatial attention are applied separately within each block, which leads to the best video classification accuracy (Bertasius et al., 2021).

Figure 2. Divided space-time attention in a block of frames (Link: TimeSformer: A new architecture for video understanding (facebook.com))

It must be noted that TimeSformer does not use any convolutions, which brings down the computational cost significantly. Convolution is a linear operator in which the kernel uses neighboring pixels in its computations. Vision transformers (Dosovitskiy et al., 2020), on the other hand, are permutation invariant and require sequences of data, so for the transformer’s input, the spatially non-sequential image is converted into a sequence of patches. Learnable positional embeddings are added per patch (analogous to those used in NLP tasks) to allow the model to learn the structure of the image.
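To make the patch-and-embed step concrete, here is a minimal PyTorch sketch of ViT-style patch embedding with learnable positional embeddings. It is a hypothetical helper for illustration (the sizes follow the ViT base configuration), not the actual TimeSformer code:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing out patches and
        # applying a shared linear projection to each one.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable positional embedding per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, embed_dim)
        return x + self.pos_embed            # sequence ready for the encoder

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])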

TimeSformer is roughly three times faster to train than 3D CNNs and requires less than one-tenth the compute for inference, even though it has 121,266,442 trainable parameters, compared to 40,416,074 in our 2DCNN model and 78,042,250 in our 3DCNN model.

In addition, TimeSformer has the advantages of customizability and scalability over convolutional models: the user can choose the frame size and the number of frames used as input to the model. The original study utilized frames as large as 556×556 and clips as deep as 96 frames without the computational cost growing prohibitively.

But challenges abound…

The following are some challenges encountered while tracking motion:

  1. The challenge of angles: A video can be shot from multiple angles, and the pattern of motion could appear different from different angles.
  2. The challenge of camera movement: Depending on whether the camera moves with the motion of the object or not, the object could appear to be static or moving. Shaky cameras add further complexity.
  3. The challenge of occlusion: During motion, the object could be temporarily hidden by another object in the foreground.
  4. The challenge of delineation: Sometimes it is not easy to differentiate where one action ends and the next begins.
  5. The challenge of multiple actions: Different objects in a video could be performing different actions, adding complexity to recognition.
  6. The challenge of change in relative size: Depending on whether an object is moving towards or away from the camera, its relative size could change continuously, adding further complexity to recognition.

Dataset

In order to evaluate video recognition capability, we have used a modified UCF11 dataset.

We removed the Swing class as it has many mislabeled samples. The goal is to recognize 10 activities – basketball, biking, diving, golf swing, horse riding, soccer juggling, tennis swing, trampoline jumping, volleyball spiking, and walking. Each class has 120-200 videos of varying lengths, ranging from 1-21 seconds. The dataset is available at (CRCV | Center for Research in Computer Vision at the University of Central Florida (ucf.edu)).

How we trained the model

We carried out different experiments to arrive at the best-performing model. Our input block contained 8 frames of size 3x224x224. The base learning rate was 0.005, reduced by a factor of 0.1 at the 11th and 14th steps. For augmentation, color jitter (within 40), random horizontal flip, and random crop (from 256×320 to 224×224) were applied. We trained the model on an Amazon AWS Tesla M60 GPU with a batch size of 2 (due to memory limitations) for 15 epochs.

Metrics are everything

In the original code, TimeSformer samples one input block per clip for training and validation. For test videos, it takes 3 different crops and averages over the predictions (we term this samplewise accuracy). As a result, several of the models we trained could achieve over 95% validation accuracy. However, in our humble opinion, this is not satisfactory because it does not examine all the different spatio-temporal possibilities in the video. To address that, we take two other metrics into consideration.

  • Blockwise accuracy – a video clip is treated as a sequence of contiguous, non-overlapping building blocks. The model makes a prediction for every input block, and the accuracy of these block-level predictions is measured. This is more suitable for real-time scenarios.
  • Clipwise accuracy – the predictions for all the blocks of a video are considered, and their mode is assigned as the prediction for the clip; the accuracy of these clip-level predictions is measured. This helps to understand real-time accuracy over a larger timeframe. (A minimal sketch of both computations follows this list.)
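Here is a minimal sketch of how both metrics can be computed for a single clip, assuming a hypothetical model_predict helper that maps one block of frames to a class id; the block size of 8 matches our input configuration:

def evaluate_clip(model_predict, clip_frames, true_label, block_size=8):
    """Return (blockwise accuracy, clipwise correctness) for one video clip."""
    # Split the clip into contiguous, non-overlapping blocks.
    num_blocks = len(clip_frames) // block_size
    preds = [model_predict(clip_frames[i * block_size:(i + 1) * block_size])
             for i in range(num_blocks)]
    # Blockwise: fraction of blocks predicted correctly.
    blockwise = sum(p == true_label for p in preds) / num_blocks
    # Clipwise: the mode of the block predictions becomes the clip's label.
    clip_pred = max(set(preds), key=preds.count)
    return blockwise, clip_pred == true_label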

The final outcome

Our best model had the following performance metric values:

  • Samplewise accuracy – 97.3%
  • Blockwise accuracy – 86.8%
  • Clipwise accuracy – 92.2%

The confusion matrix for the clipwise accuracy is given in Figure 3(a). For comparison, the confusion matrices for the 2DCNN and 3DCNN models are shown in Figures 3(b) and 3(c) respectively.

These metrics are quite impressive and far better than the results we obtained using 2D convolution (VGG16) and 3D convolution (C3D): 81.3% and 74.6% respectively. This points to the potential of TimeSformer.

Figure 3(a). Clipwise accuracy confusion matrix for TimeSformer

Concluding Remarks

In this work, we have explored the effectiveness of TimeSformer for the Human Activity Recognition task on the modified UCF11 dataset. This convolution-free model outperformed the 2DCNN and 3DCNN models and performed extremely well on some hard-to-classify classes such as ‘basketball’ and ‘walking’. Future work includes trying more augmentation techniques to fine-tune this model and applying vision-based transformers to other video-related tasks such as video captioning and action localization.

When done right, TimeSformer can truly change the game for Human Activity Recognition. With it, use cases across healthcare, sports, safety, and security can come to life.

References

[1] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 4489-4497).

[2] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 6450-6459).

[3] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 1725-1732).

[4] Bertasius, G., Wang, H., & Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding?. arXiv preprint arXiv:2102.05095.

[5] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Optimizing Inventory with the Power of AI

Inventory management is a critical aspect for businesses that store products for the ultimate purpose of sales. Stocking the right number of goods at the right place and time, taking into consideration the scenarios of demand and supply, is vital in order to fulfil consumer expectations in a timely manner, reduce wastage, stay efficient, and, in turn, earn substantial profits. Needless to say, the advent of breakthrough technologies such as Artificial Intelligence (AI) has simplified and revolutionized inventory management, enabling business owners to act diligently and optimize their stocks, right from the manufacturing stage through product distribution.

Uncertain Times Call for Stronger Measures

We are in the midst of uncertain, tough times led by Covid-19, which took businesses by surprise and jolted the operations of many. Ever since, demand and supply have been prone to constant fluctuations, and consumer behaviour keeps taking new turns. Understanding and optimizing inventory has emerged as a serious challenge, and businesses are actively on the lookout for solutions to effectively track and manage their stocks. Artificial Intelligence, with its potential to churn and distil large volumes of data, provides cutting-edge inventory insights and visibility to businesses, enabling them to enhance their revenues, customer experience, and brand image.

AI’s capability to leverage large swathes of real-time data on the dynamics that affect inventory stock levels differentiates it from traditional tools. AI can predict scenarios, recommend actions, and even act, independently or with human approval. Take the example of a digital twin with a global view of all suppliers, manufacturers, transportation, warehouses, and retailers. Data from IoT devices such as GPS and RFID tags, business applications like ERP and WMS, and third-party sources can be interconnected to model, monitor, and manage real-world supply chain environments, giving a real-time view of product requirements.

The Positive Impact of AI-Led Inventory Optimization

Let us understand how the use of AI can provide positive outcomes across various scenarios in inventory optimization. 

a) Analysis of consumer shopping behavior – In today’s unpredictable times, the buying behavior of the consumer is no longer constant. For businesses, it means staying agile and alert at all times to determine what will appeal to the consumer at any given point in time. AI intelligently slices the wealth of consumer data, including purchase history, browsing patterns, social media activity, etc., to accurately catch the pulse of consumer behavior. These powerful behavioral analytics empower businesses to determine their stocks and quantities effectively.

b) Accurate prediction of demand – The multiple datasets generated by businesses are intelligently harnessed by AI to identify forthcoming market demand patterns with greater ease and accuracy. With Machine Learning (ML), real-time data can be leveraged to accurately predict demand and thereby determine how much inventory is actually needed to fulfill it. This helps businesses take care of out-of-stock and over-stock issues intelligently. As per McKinsey, AI-powered demand forecasting has the potential to reduce supply chain errors by 30-50%.

c) Improved Warehouse Management – With the correct amount of goods held at warehouses, space logistics and productivity are automatically optimized. Over-stocking entails huge costs and eats up chunks of storage space, issues that are duly resolved with the help of AI-powered insights. As a result, warehouse teams are able to operate efficiently, which brightens the prospects of growth. McKinsey highlights that the use of AI reduces warehousing costs by approximately 10-40%.

d) Time Management – As machines take over processing large amounts of otherwise inaccessible data, businesses gain access to important information such as the expected time of arrival of any particular good or goods that might be out of stock. This can then be communicated to customers, which helps strengthen the brand-customer relationship.

e) Scalability – AI-based inventory management, with its accurate forecasting capabilities, allows businesses to respond instantly to unexpected fluctuations in demand or supply by scaling their stock up or down. This ability to act in real time helps businesses provide quick, quality service to their consumers and clear their stock profitably.

The right solution tackles the problem in the right manner at the right time.

Enter, Affine

Affine’s new-age, AI-powered Inventory Replenishment System helps businesses improve their inventory efficiency by optimizing a wide range of variables.

  • Demand Forecast – Accurate demand prediction algorithms used by Affine allow businesses to know what is needed at what time.
  • Lead Time – Intelligence on lead time helps businesses combat issues such as delayed supply or low inventory.
  • Inventory Carrying Cost – The cost of holding unwanted inventory is automatically eliminated, since AI-powered smart insights allow just the right inventory to be held in warehouses.
  • Ordering Cost – Ordering just the right inventory keeps the ordering costs, including overheads, in control. (A classic worked example of how these variables interact appears after this list.)
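As a textbook illustration of how these four levers interact, the sketch below computes an Economic Order Quantity (EOQ) and a reorder point from a demand forecast, lead time, carrying cost, and ordering cost. The numbers are invented for illustration, and this is standard inventory arithmetic, not Affine's proprietary algorithm:

from math import sqrt

annual_demand = 12_000    # forecasted units per year
ordering_cost = 50.0      # cost per purchase order
holding_cost = 2.5        # carrying cost per unit per year
lead_time_days = 7        # supplier lead time
safety_stock = 150        # buffer against forecast and lead-time error

# Economic Order Quantity: the order size that balances ordering and carrying costs.
eoq = sqrt(2 * annual_demand * ordering_cost / holding_cost)

# Reorder point: order when stock falls to lead-time demand plus the buffer.
daily_demand = annual_demand / 365
reorder_point = daily_demand * lead_time_days + safety_stock

print(f"Order about {eoq:.0f} units whenever stock drops to {reorder_point:.0f} units")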

In fact, Affine has extensive experience in implementing AI and ML solutions across verticals in supply chain management. We have empowered a carrier fleet with route optimization, optimized inventory for a large shoe manufacturer, built demand forecasting for a coffee giant, and improved on-time delivery for an e-commerce giant. These engagements are just the tip of the iceberg that is our experience and expertise in AI-led optimization for organizations across sectors and verticals.

Before we go…

Inventory optimization leveraging AI is all set to take the business world by storm. The large swathes of data now available to organizations, their investment in cutting-edge 4IR technologies, and unpredictable consumer behaviour are all leading this change from the front.

If you too are on the look-out for an advanced inventory optimization tool, experience the world of AI with Affine and elevate your inventory experiences.

About The Author(s):

The blog is a result of research efforts conducted by Affine’s Manufacturing CoE team, our Centre of Excellence which exists for the sole purpose of hyper-innovation in the manufacturing space. The Manufacturing CoE is a dedicated in-house team responsible for continuously innovating manufacturing solutions and services powered by AI, AE & Cloud capabilities. Our enterprise-grade solutions are new, hot, happening, futuristic, and the next big thing in Industry 4.0.

Human Activity Recognition: Fusing Modalities for Better Classification

Human Activity Recognition using high-dimensional visual streams has been gaining popularity in recent times. Using video input to categorize human activity is primarily applied in surveillance of different kinds. At hospitals and nursing homes, it can immediately alert caretakers when any resident displays signs of sickness – clutching their chest, falling, vomiting, etc. At public places like airports, railway stations, bus stations, malls, or even in your neighborhood, activity recognition becomes the means to alert authorities upon recognizing suspicious behavior. This AI solution is even useful for identifying flaws in, and improving, the form of athletes and professionals, ensuring improved performance and better training.

Why Use Multimodal Learning for Activity Recognition?

Our experience of the world is multimodal in nature, as noted by Baltrušaitis et al. Humans are blessed with multiple modalities: we can touch, see, smell, hear, and taste, and thus understand the world around us in a better way. Most parents would remember how their kid’s understanding of “what a dog is” keeps improving after seeing actual dogs, videos of dogs, photographs of dogs, and cartoon dogs, and being told that they are all dogs. Seeing a single video of an actual dog does not help the kid identify the character “Goofy” as a dog. It is the same for machines. Multimodal machine learning models can process and relate information from multiple modalities, learning in a more holistic way.

This blog serves as the captain’s log of how we combined the effectiveness of two modalities – static images and videos – to improve the classification of human activities from videos. Algorithms for video activity recognition deal with either spatial information alone (images) or both spatial and temporal information (videos). We used algorithms of both kinds for activity recognition modeling, and fusing the two models together made the resultant multimodal model far better than each of the individual unimodal models.

Multimodal Learning for Human Activity Recognition – Our Recipe

Our goal was to recognize 10 activities – basketball, biking, diving, golf swing, horse riding, soccer juggling, tennis swing, trampoline jumping, volleyball spiking, and walking. We created the multimodal models for activity recognition by fusing the two unimodal models – image-based and video-based – using the ensemble method, thus enhancing the effect of the classifier.

The Dataset Used: We have used a modified UCF11 dataset (we removed the Swing class as it has many mislabeled samples). For the 10 activities we need to classify, the dataset has 120-200 videos per class, with lengths ranging from 1-21 seconds. The dataset is available at (CRCV | Center for Research in Computer Vision at the University of Central Florida (ucf.edu)).

One Modality at a Time

There are different methodologies for multimodal learning, as described by Song et al., 2016, Tzirakis et al., 2017, and Yoon et al., 2018. One technique is ensemble learning, in which the 2DCNN and 3DCNN models are trained separately and their final softmax probabilities are combined to get predictions. Other techniques include joint representation, coordinated representation, etc. A detailed overview is available in Baltrušaitis et al., 2018.

The First Modality – Images: We trained a 2DCNN model with the VGG-16 architecture and equipped it with batch normalization and other regularization methods, since the enormous number of frames caused overfitting. We observed that deeper architectures were less accurate. We achieved 81% clip-wise accuracy. However, the accuracies are very poor on some classes such as basketball, soccer juggling, tennis, and walking, as can be seen below.

The Second Modality – Videos: The second model was trained with a 3DCNN architecture, which took 16 frames as one chunk, called a block. We incorporated various augmentations, batch normalization, and regularization methods. During the exercise, we observed that the 3D architecture is sensitive to the learning rate. In the end, we achieved a clip-wise accuracy of 73%. Looking at the accuracies for individual classes, this model is not the best, but it is much better than the 2DCNN for the Soccer Juggling and Golf classes. Class-wise accuracies can be seen below.

Our next objective was to ensure that the learnings from both modalities are combined to create a more accurate and robust model.

Our Secret Sauce for Multimodal learning – The Ensemble Method

For fusing the two modalities, we resorted to ensemble methods (Zhao, 2019). We experimented with two ensemble methods for multimodal learning:

1. Maximum Vote – The mode of the predictions from both models is taken as the predicted label.

2. Averaging and Maximum Pooling – A weighted sum of the probabilities from both models is calculated to decide the predicted label.

Maximum Vote Ensemble Approach: Consider a video consisting of 64 frames. We feed these frames to both the 2D and 3D models. The 2D model produces an output for each frame, resulting in 64 tensors, each with a predicted label. Similarly, the 3D model produces a label for every 16 frames, generating 4 tensors for a 64-frame video. To balance the weightage of the 3D model, we augment its output tensors by repeating them multiple times. Finally, we concatenate the output tensors from both models and select the modal class (the class label with the highest occurrence) as the predicted label.
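A minimal sketch of this scheme is below, assuming per-frame labels from the 2D model (64 of them) and per-block labels from the 3D model (4 of them); the weightage is approximated by replicating each model's vote list in proportion to its weight:

import numpy as np

def max_vote(preds_2d, preds_3d, weight_2d=0.6, weight_3d=0.4):
    """Maximum-vote ensemble over predicted labels (illustrative sketch)."""
    # Repeat each 3D block label 16 times so both models cast 64 votes.
    votes_3d = list(np.repeat(preds_3d, 16))
    # Approximate the weightage by replicating each model's votes.
    votes = list(preds_2d) * int(weight_2d * 10) + votes_3d * int(weight_3d * 10)
    values, counts = np.unique(votes, return_counts=True)
    return int(values[np.argmax(counts)])  # the modal class wins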

We iteratively experimented with different weightages for both models and checked the corresponding overall accuracy, as shown below. A maximum overall accuracy of 84% was achieved at 60% weightage to the 2D model and 40% weightage to the 3D model.

At a 60:40 weightage for 2D:3D, the Maximum Vote ensemble method performed better than either of the unimodal models, except for the Biking and Soccer Juggling classes.

Averaging and Maximum Pooling Ensemble Method: Consider feeding a video with 64 frames to both models: the 2D model generates 64 tensors with probabilities for every class, and the 3D model generates 4 tensors with probabilities for each of the 10 classes. We calculate the average of all 64 tensors from the 2D model to get the average probability for every class, and similarly average the 4 tensors from the 3D model. Finally, we take the weighted sum of the resulting tensors from both models and select the class with the maximum probability as the predicted class.

The equation for the weighted combination in the Averaging and Maximum Pooling method is:

Multimodal score = α × (2D model probabilities) + (1 − α) × (3D model probabilities)
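A minimal sketch of this fusion, assuming per-frame softmax outputs of shape (64, 10) from the 2D model and per-block outputs of shape (4, 10) from the 3D model:

import numpy as np

def weighted_average_ensemble(probs_2d, probs_3d, alpha=0.9):
    """Averaging and Maximum Pooling ensemble (illustrative sketch)."""
    avg_2d = probs_2d.mean(axis=0)        # average over the 64 frame tensors
    avg_3d = probs_3d.mean(axis=0)        # average over the 4 block tensors
    fused = alpha * avg_2d + (1 - alpha) * avg_3d
    return int(np.argmax(fused))          # class with the maximum probability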

We experimented with different values of α and compared the resulting overall accuracies, as shown below. With equal weightage to both models, we achieved 77% accuracy. As α, i.e., the weightage of the 2D model, increased, the multimodal accuracy increased. We achieved the best accuracy of 87% at 90% weightage to the 2D model and 10% weightage to the 3D model.

The class-wise accuracy for the Averaging and Maximum Pooling method is low for the Basketball and Walking classes. However, the class-wise accuracies are better than, or at least the same as, the unimodal models. The accuracies for Soccer Juggling and Tennis Swing improved the most: from 71% and 67% with the 2D model to 90% and 85% respectively.

Conclusion

Beyond doubt, our multimodal models performed better than the unimodal ones. Comparing the multimodal engines, Averaging and Maximum Pooling performed better than the Maximum Vote method, as is evident from their overall accuracies of 87% and 84% respectively. The reason is that Averaging and Maximum Pooling considers the confidence of the predicted label, whereas Maximum Vote considers only the label with the maximum probability.

In Human Activity Recognition, we believe the multimodal learning approach can be improved further by incorporating other modalities as well, such as the outputs of Facebook’s Detectron model or a pose estimation method.

Our next plan of action is to explore more forms of multimodal learning for activity recognition. Fusing features directly, through feature addition or fusion layers, can be effective in learning better representations. Another way forward would be to add modalities like pose detection, motion detection, and object detection feeds to provide better results. No matter the approach, fusing modalities has a corresponding cost: while we have 40.4 million trainable parameters in the 2DCNN model and 78 million in the 3DCNN model, the multimodal model has 118.5 million parameters to train. But this is a small price to pay considering the limitless applications made viable by the performance improvement the multimodal models provide.

Natural Language Inferencing (NLI) Task: Demonstration Using Kaggle Dataset

The Natural Language Inferencing (NLI) task is one of the most important subsets of Natural Language Processing (NLP) and has seen a series of developments in recent years. There are standard, publicly available benchmark datasets like the Stanford Natural Language Inference (SNLI) Corpus, the Multi-Genre NLI (MultiNLI) Corpus, etc., which are dedicated to NLI tasks. A few state-of-the-art models trained on these datasets achieve decent accuracy. In this blog, I will start by briefing the reader on NLI terminologies, applications of NLI, and NLI state-of-the-art model architectures, and eventually demonstrate the NLI task using the Kaggle Contradictory, My Dear Watson challenge dataset.

Prerequisites:

  • Basics of NLP
  • Moderate Python coding

What is NLI?

Natural Language Inference, also known as Recognizing Textual Entailment (RTE), is the task of determining whether a given “hypothesis” and “premise” logically follow each other (entailment), contradict each other (contradiction), or are undetermined (neutral). For example, consider the hypothesis “The game is played by only males” and the premise “Female players are playing the game”. The task of the NLI model is to predict whether the two sentences form an entailment, a contradiction, or are neutral. In this case, it is a contradiction.

How Is NLI Different from NLP?

The main difference between NLP and NLI is that NLP is a broader set containing two subsets: Natural Language Understanding (NLU) and Natural Language Generation (NLG). We are more concerned with NLU, as NLI comes under it. NLU is basically making the computer capable of comprehending what a given text block represents. NLI, which comes under NLU, is the task of understanding two given statements and categorizing them as entailment, contradiction, or neutral. Most NLP tasks include pre-processing steps like removing stop words, special characters, etc. In the case of NLI, however, one just provides the model with two sentences; the model processes the data itself and outputs the relationship between them.

Figure 1: NLP vs NLI. (Source)

Applications of NLI

NLI is being used in many domains like banking, retail, finance, etc. It is widely used where there is a requirement to check whether a generated or obtained result from the end-user follows a hypothesis. One use case is automatic auditing: NLI can replace human auditing to some extent by checking whether sentences in a generated document entail those in the reference documents.

Models used to Demonstrate NLI Task

In this blog, I have demonstrated the NLI task using two models: RoBERTa and XLM-RoBERTa. Let us understand these models in this section.

In order to understand the RoBERTa model, one should first have a brief knowledge of the BERT model.

BERT

Bidirectional Encoder Representations from Transformers (BERT) was published by Google AI researchers in 2018. It has shown state-of-the-art results in many NLP tasks like question answering, NLI, etc. It is basically the encoder stack of the transformer architecture. It has two versions: BERT base, with 12 encoder layers and 110M total parameters, and BERT large, with 24 layers and 340M total parameters.

BERT pre-training consists of two tasks:

  1. Masked Language Model (MLM)
  2. Next Sentence Prediction (NSP)

Masked LM

In the input sequence sent to the model, 15% of the words are randomly masked, and the model is tasked with predicting these masked words by understanding the context from the unmasked words. This helps the model understand the context of the sentence.

Next Sentence Prediction

The model is fed sentence pairs as input and must predict whether the second sentence follows the first. This helps in understanding the relationship between two sentences, which is the major objective for tasks like question answering, NLI, etc.

Both tasks are executed simultaneously during training.

Model Input

The input to the BERT model is a sequence of tokens that are converted to embeddings. Each token embedding is a combination of 3 embeddings:

Figure 2: BERT input representation [1]
  1. Token Embeddings – Word embeddings from the WordPiece token vocabulary.
  2. Segment Embeddings – As the BERT model takes a pair of sentences as input, these embeddings help the model distinguish between the two sentences. In the figure above, E_A represents embeddings for sentence A, while E_B represents embeddings for sentence B.
  3. Position Embeddings – These embeddings express the position of each word in a sentence, capturing “sequence” or “order” information.

Model Output

The output of the BERT model has the same number of tokens as the input, plus an additional classification token that gives the classification result, i.e., whether sentence B follows sentence A or not.

Figure 3: Pre-training procedure of BERT [1]

Fine-Tuning BERT

A pre-trained BERT model can be fine-tuned to achieve a specific task on specific data. Fine-tuning uses the same architecture as the pre-trained model; only an additional output layer is added, depending on the task. In the case of the NLI task, the classification token is fed into an output classification layer that determines the probabilities of the entailment, contradiction, and neutral classes. A minimal illustrative sketch follows Figure 4.

Figure 4: Illustration for BERT fine-tuning on sentence pair specific tasks like MNLI, QQP, QNLI, RTE, SWAG etc. [1]
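Below is a minimal, hypothetical sketch of this setup using the Hugging Face transformers library: a 3-way classification head is added on top of a pre-trained BERT encoder, and the premise and hypothesis are packed into a single sequence. The label mapping (2 = contradiction) is an illustrative assumption:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pre-trained encoder plus a randomly initialized 3-way classification head
# (entailment, neutral, contradiction).
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Sentence pairs are packed as: [CLS] premise [SEP] hypothesis [SEP]
batch = tokenizer('Female players are playing the game',
                  'The game is played by only males',
                  return_tensors='pt')
logits = model(**batch).logits                               # shape: (1, 3)
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([2]))  # 2 = contradiction (assumed mapping)
loss.backward()                                              # gradients for one fine-tuning step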

BERT GLUE Task Results

Figure 5: GLUE test results [1]

As you can see in Figure 5, BERT outperformed all the previous models on the GLUE tests.

RoBERTa

A Robustly Optimised BERT Pre-training Approach (RoBERTa) was proposed by Facebook researchers. They found that with a much more robust pre-training recipe, the BERT model could perform even better on GLUE tasks. The RoBERTa model is a BERT model with a modified pre-training approach.

Below are the key changes incorporated in the RoBERTa model compared to the BERT model.

  1. Data – The RoBERTa model is trained on much more data than BERT: 160GB of uncompressed text.
  2. Static vs Dynamic Masking – In the BERT model, data was masked only once during pre-processing, resulting in a single static mask used for all training iterations. In contrast, the data used for RoBERTa training was duplicated 10 times with 10 different mask patterns and trained over 40 epochs, so a single mask pattern is seen in only 4 epochs. In dynamic masking, a different mask pattern is generated for every epoch during training.
Figure 6: Static vs Dynamic masking results [2]

3. Removal of the Next Sentence Prediction (NSP) objective – Researchers found that removing the NSP loss significantly improved model performance on GLUE tasks.

Figure 7: Model results comparison when different input formats are used [2]

4. Trained on Large Batch Sizes – Training the model with large batch sizes improved model accuracy.

Figure 8: Model results when trained with different batch sizes [2]

5. Tokenization – RoBERTa uses a byte-level Byte-Pair Encoding (BPE) scheme with a 50K-token vocabulary, in contrast to BERT’s character-level BPE with a 30K vocabulary.

RoBERTa Results on GLUE Tasks

Figure 9: RoBERTa results on GLUE tasks. [2]

RoBERTa clearly outperforms the previous models.

XLM-RoBERTa

XLM-R is a large multilingual model trained on 100 different languages. It is basically an update to Facebook’s XLM-100 model, which was also trained on 100 different languages. It uses the same training procedure as the RoBERTa model, i.e., only the Masked Language Model (MLM) objective, without Next Sentence Prediction (NSP).

Noticeable changes in XLM-R model are:

  1. Data – The XLM-R model is trained on cleaned CommonCrawl data scaled up to 2.5TB, which is far larger than the Wiki-100 corpus used to train other multilingual models.
  2. Vocabulary – The XLM-R vocabulary contains 250K tokens, in contrast to RoBERTa’s 50K. It uses one large shared SentencePiece Model (SPM) to tokenize words of all languages, whereas the XLM-100 model uses different tokenizers for different languages. The XLM-R authors assume that similar words across languages have similar representations in the embedding space.
  3. Training objective – XLM-R is self-supervised, whereas XLM-100 is a supervised model. XLM-R samples streams of text from each language and trains the model to predict masked tokens, while the XLM-100 model required parallel sentences (sentences with the same meaning) in two different languages as input, which is a supervised method.

XLM-R Results on Cross-Lingual-Classification on XNLI dataset

Figure 10: XLM-R results on XNLI dataset. [3]

XLM-R is now the state-of-the-art multilingual model, outperforming all previous multi-language models.

Demonstration of NLI Task Using Kaggle Dataset

In this section, we will implement the NLI task using Kaggle dataset.

Kaggle launched the Contradictory, My Dear Watson challenge to detect contradiction and entailment in multilingual text. It provides training and validation datasets containing 12120 and 5195 text pairs respectively. The dataset contains text pairs from 15 different languages – Arabic, Bulgarian, Chinese, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, and Vietnamese. Sentence pairs are classified into three classes: entailment (0), neutral (1), and contradiction (2).

We will be using the training dataset of this challenge to demonstrate the NLI task. One can run the following code blocks in Google Colab and download the dataset from this link.

Code Flow

1. Install transformers library

!pip install transformers

2. Load XLM-RoBERTa model –

Since our dataset contains multilingual text, we will be using XLM-R model for checking the accuracy on training dataset.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

xlmr = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

3. Load training dataset –

import pandas as pd

train_data = pd.read_csv(<dataset path>)
train_data.head(3)

4. XLM-R model classes –

Before going further, do a sanity check to confirm that the model’s class notation and the dataset’s class notation are the same:

xlmr.config.label2id

{'contradiction': 0, 'entailment': 2, 'neutral': 1}

We can see that the model’s class notation and the Kaggle dataset’s class notation (entailment (0), neutral (1), and contradiction (2)) are different. Therefore, we change the training dataset’s class notation to match the model.

5. Change training dataset classes notation –

train_data['label'] = train_data['label'].replace([0, 2], [2, 0])
train_data.head(3)

6. EDA on dataset –

Check the distribution of training data based on language

train_data_lang = train_data.groupby('language').count().reset_index()[['language', 'id']]

# plot pie chart
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10, 10))
plt.pie(train_data_lang['id'], labels=train_data_lang['language'], autopct='%1.1f%%')
plt.title('Distribution of Train data based on Language')
plt.show()

We can see that English constitutes more than 50% of the training data.

7. Sample data creation –

Since the training data has 12120 text pairs, evaluating all of them would be time-consuming. Therefore, we create a sample out of the training data using representative sampling, i.e., the sample will have the same distribution of text pairs across languages as the full training data.

# create a column which tells how many random rows should be extracted for each language
train_data_lang['sample_count'] = train_data_lang['id'] / 10

# sample data
sample_train_data = pd.DataFrame(columns=train_data.columns)
for i in range(len(train_data_lang)):
    df = train_data[train_data['language'] == train_data_lang['language'][i]]
    n = int(train_data_lang['sample_count'][i])
    df = df.sample(n).reset_index(drop=True)
    # note: on pandas >= 2.0, use pd.concat([sample_train_data, df]) instead of append
    sample_train_data = sample_train_data.append(df)
sample_train_data = sample_train_data.reset_index(drop=True)

# plot distribution of sample data based on language
sample_train_data_lang = sample_train_data.groupby('language').count().reset_index()[['language', 'id']]
plt.figure(figsize=(10, 10))
plt.pie(sample_train_data_lang['id'], labels=sample_train_data_lang['language'], autopct='%1.1f%%')
plt.title('Distribution of Sample Train data based on Language')
plt.show()

We can see that sample data created and the training data have nearly same distribution of text pairs based on language.

8. Functions to get predictions from XLM-R model –

def get_tokens_xlmr_model(data):
    '''
    Function which creates tokens for the passed data using the xlmr model
    input  - DataFrame
    output - list of tokens
    '''
    batch_tokens = []
    for i in range(len(data)):
        tokens = tokenizer.encode(data['premise'][i], data['hypothesis'][i],
                                  return_tensors='pt', truncation_strategy='only_first')
        batch_tokens.append(tokens)
    return batch_tokens


def get_predicts_xlmr_model(tokens):
    '''
    Function which creates predictions for the passed tokens using the xlmr model
    input  - list of tokens
    output - list of predictions
    '''
    batch_predicts = []
    for i in tokens:
        predict = xlmr(i)[0][0]
        predict = int(predict.argmax())
        batch_predicts.append(predict)
    return batch_predicts

9. Predictions on sample data –

sample_train_data_tokens = get_tokens_xlmr_model(sample_train_data)
sample_train_data_predictions = get_predicts_xlmr_model(sample_train_data_tokens)

10. Find model accuracy on the predictions –

# plot the confusion matrix and classification report for original vs predicted labels
import numpy as np
import seaborn as sns
from sklearn.metrics import classification_report

sample_train_data['label'] = sample_train_data['label'].astype(str).astype(int)
x = np.array(sample_train_data['label'])
y = np.array(sample_train_data_predictions)

cm = np.zeros((3, 3), dtype=int)
np.add.at(cm, [x, y], 1)
sns.heatmap(cm, cmap="YlGnBu", annot=True, annot_kws={'size': 16}, fmt='g',
            xticklabels=['contradiction', 'neutral', 'entailment'],
            yticklabels=['contradiction', 'neutral', 'entailment'])

matrix = classification_report(x, y, labels=[0, 1, 2],
                               target_names=['contradiction', 'neutral', 'entailment'])
print('Classification report : \n', matrix)

The model gives 93% accuracy on the sample data without any fine-tuning.

11. Find model accuracy at language level –

sample_train_data['prediction'] = sample_train_data_predictions
sample_train_data['true_prediction'] = np.where(
    sample_train_data['label'] == sample_train_data['prediction'], 1, 0)

sample_train_data_predicted_lang = sample_train_data.groupby('language').agg(
    {'id': 'count', 'true_prediction': 'sum'}).reset_index()[['language', 'id', 'true_prediction']]
sample_train_data_predicted_lang['accuracy'] = round(
    sample_train_data_predicted_lang['true_prediction'] / sample_train_data_predicted_lang['id'], 2)
sample_train_data_predicted_lang = sample_train_data_predicted_lang.sort_values(
    by=['accuracy'], ascending=False)
sample_train_data_predicted_lang

Except for English, the rest of the languages have accuracy greater than 94%. Therefore, we use a different model for English pairs to further improve accuracy.

12. RoBERTa model for English text pairs prediction –

!pip install regex requests
!pip install omegaconf
!pip install hydra-core

import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()

The RoBERTa model’s class notation is the same as the XLM-R model’s. Therefore, we can use the sample data directly without any class notation changes.

13. Extract only English pairs from sample data –

sample_train_data_en = sample_train_data[sample_train_data['language'] == 'English'].reset_index(drop=True)
sample_train_data_en.shape

(687, 8)

14. Functions to get predictions from RoBERTa model –

def get_tokens_roberta(data):
    '''
    Function which generates tokens for the passed data using the roberta model
    input  - DataFrame
    output - list of tokens
    '''
    batch_tokens = []
    for i in range(len(data)):
        tokens = roberta.encode(data['premise'][i], data['hypothesis'][i])
        batch_tokens.append(tokens)
    return batch_tokens


def get_predictions_roberta(tokens):
    '''
    Function which generates predictions for the passed tokens using the roberta model
    input  - list of tokens
    output - list of predictions
    '''
    batch_predictions = []
    for i in range(len(tokens)):
        prediction = roberta.predict('mnli', tokens[i]).argmax().item()
        batch_predictions.append(prediction)
    return batch_predictions

15. Predictions with RoBERTa model –

sample_train_data_en_tokens = get_tokens_roberta(sample_train_data_en)
sample_train_data_en_predictions = get_predictions_roberta(sample_train_data_en_tokens)

16. Accuracy of RoBERTa model –

sample_train_data_en['prediction'] = sample_train_data_en_predictions

# roberta model accuracy
sample_train_data_en['true_prediction'] = np.where(
    sample_train_data_en['label'] == sample_train_data_en['prediction'], 1, 0)
roberta_accuracy = round(sum(sample_train_data_en['true_prediction']) / len(sample_train_data_en), 2)
print("Accuracy of RoBERTa model {}".format(roberta_accuracy))

Accuracy of RoBERTa model 0.92

The accuracy on English text pairs increased from 89% to 92%. Therefore, for predictions on the Kaggle challenge’s test dataset, we use RoBERTa for English pairs and XLM-R for all other language pairs.

By this approach, I was able to score 94.167% accuracy on the test dataset.

Conclusion

In this blog, we learned what the NLI task is and how to tackle it using two state-of-the-art models. There are many more pre-trained models for NLI beyond those discussed here. A few are language-specific, such as German BERT, French BERT, and Finnish BERT, while others are multilingual, such as Multilingual BERT and XLM-100.

As future steps, one can further achieve task-specific accuracy by finetuning these models with specific data.

References

[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[2] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

[3] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Accelerate Your eCommerce Sales with Big Data and AI for 2021

Holiday season is the most exciting time of the year for businesses. It has always driven some of the highest sales of the year. In 2019, online holiday sales in the US alone touched $135.35 billion and the average order value hit $152.95. After an unprecedented 2020, retailers are performing many bold maneuvers for turning the tide around in the new year.

A successful holiday strategy in 2021 requires much more than just an online presence. To compete during one of the strangest seasons following the strangest year yet, brands are trying to create more meaningful connections with consumers, offering hyper-personalized online experiences, and ensuring that holiday shoppers experience nothing short of pure convenience and peace of mind.

In 2020, retailers faced some novel challenges and many unknowns. To begin with, here are a few key things that could not be ignored:

  • Customer behaviors changed significantly during the pandemic, and expectations have only burgeoned since
  • Gen Z and Millennial shoppers, who have the maximum purchasing power, became focused on sustainability and peace of mind
  • The ecommerce industry saw five years of digital transformation in two months, courtesy of the pandemic. Immersive cutting-edge technology like voice-aided shopping, AI-assisted browsing, and machine learning were no longer seen as optional; they became must-haves for facilitating a superior customer experience

Here are ten ways big data and AI tech are helping businesses accelerate ecommerce sales:

1. Hyper-Personalized product recommendations through Machine Learning

Providing people with exactly what they want is the best way to attract new customers and to retain existing ones. So, having intelligent systems to surface products or services that people would be inclined to buy only seems natural. To enable this, data and machine learning are playing big roles. They are helping businesses put the right offers in front of the right customers at the right time. Research has proven that serving relevant product recommendations can have a sizable impact on sales. As per a study, 45% of customers reveal they are likely to shop on a site that preempts their choices, while 56% are more likely to return to such a site. Smart AI systems are allowing deep dive into buyer preferences and sentiments and helping retailers and e-commerce companies provide their customers with exactly what they might be looking for.

2. Enabling intelligent search leveraging NLP

The whole point of effective search is to understand user intent correctly and deliver exactly what the customer wants. More and more companies are using modern, customer-centric search powered by AI, which enables it to think like humans. It deploys advanced image and video recognition and natural language processing tools to constantly improve and contextualize results for customers, which eventually helps companies close leads more effectively.

3. One-to-one marketing using advanced analytics

With one-to-one marketing, retailers take a more targeted approach to delivering a personalized experience than they would with personalized product recommendations or intelligent search engines alone. Data like page views and clickstream behavior forms the foundation of one-to-one marketing. As this data is harvested and processed, commonalities emerge that correspond with broad customer segments. As the data is further refined, a clearer picture emerges of an individual’s preferences and 360° profile, informing real-time action on the retailer’s end.

4. Optimized pricing using big data

There are numerous variables that impact a consumer’s decision to purchase something: product seasonality, availability, size, color, etc. But many studies zero in on price as the number one factor in determining whether the customer will buy the product.

Pricing is a domain that has traditionally been handled by an analyst after diving deep into reams of data. But big data and machine learning-based methods today are helping retailers accelerate this analysis and create an optimized price, often several times in a single day. This keeps the price low enough not to turn off potential buyers or cannibalize other products, but high enough to ensure a healthy profit.

5. Product demand forecasting and inventory planning

In the initial months of the pandemic, many retailers had their inventory of crucial items like face coverings and hand sanitizers exhausted prematurely. In certain product categories, the supply chains could not recover soon enough, and some have not even recovered yet. Nobody could foretell the onslaught of the coronavirus and its impending shadow on retailers, but the disastrous episode that followed sheds urgent light on the need for better inventory optimization and planning in the consumer goods supply chain.

Retailers and distributors who leveraged machine learning-based approaches for supply chain planning early on fared better than their contemporaries who continued to depend solely on analysts. With a working model in place, the data led to smarter decisions. Incorporating external data modules like social media data (Twitter, Facebook), macroeconomic indicators, and market performance data (stocks, earnings, etc.) into the forecasting model, in addition to past samples of inventory data and seasonality changes, helps correctly determine the product demand pattern.

6. Blending digital and in-store experiences through omnichannel ecommerce offerings

The pandemic has pushed many people who would normally shop in person to shop online instead. Retailers are considering multiple options for getting goods in the hands of their customers, including contactless transactions and curbside pickups. Not that these omnichannel fulfillment patterns were not already in place before the coronavirus struck, but they have greatly accelerated under COVID-19. AI is helping retailers expedite such innovations as e-commerce offerings, blending of digital and in-store experiences, curbside pickup and quicker delivery options, and contactless delivery and payments.

7. Strengthening cybersecurity and fighting fraud using AI

Fraud is always a threat around the holidays. And given the COVID-19 pandemic and the subsequent shift to everything online, fraud levels have jumped by 60% this season. An increase in card-not-present transactions incites fraudsters to abuse compromised cards. Card skimming, lost and stolen cards, phishing scams, account takeovers, and application fraud present other loopholes for nefarious exploits. In a nutshell, fraudsters are projected to extort innocent customers by about 5.5% more this year. Card issuers and merchants alike, armed with machine learning and AI, are analyzing huge volumes of transactions, identifying instances of attempted fraud, and automating the response.

8. AI-powered chatbots for customer service

Chatbots that can automatically respond to repetitive and predictable customer requests are one of the fastest-growing applications of big data and AI. Thanks to advances in NLP and natural language generation, chatbots can now correctly understand complex and highly nuanced written and spoken queries. These smart assistants are already saving companies millions of dollars per year by supplementing human customer service reps in resolving issues with purchases, facilitating returns, helping find stores, answering repetitive queries concerning hours of operation, etc.

9. AI guides for enabling painless gift shopping

As this is the busiest time of the year, when customers throng websites and stores for gift shopping, gaps in customer service can seriously confuse and dissuade the already indecisive shopper. In such a scenario, tools like interactive AI-powered gift finders engage shoppers in a conversation, asking a few questions about the gift recipient’s personality and immediately providing gifting ideas, helping even the most unsettled shopper find the perfect gift with little wavering. This helps customers overcome choice paralysis and indecision, and helps companies boost conversions, benefiting both sides of the transaction.

10. AR systems for augmented shopping experience

AR is taking eCommerce shopping and the customer experience to the next level. From visual merchandising to hyper-personalization, augmented reality offers several advantages. Gartner indicated in a 2019 predictions report that by 2020, up to 100 million consumers were expected to use augmented reality in their shopping experiences, and the prophecy came true. The lockdown and isolation necessitated by Covid-19 rapidly increased the demand for AR systems.

Based on the “try-before-you-buy” approach, augmented shopping appeals to customers by allowing them to interact with their choice of products online before they proceed to buy any. For instance, AR is helping buyers visualize what their new furniture will look and feel like by moving their smartphone cameras around the room in real-time and getting a feel of the size of the item and texture of the material for an intimate understanding before purchase. In another instance, AR is helping women shop for makeup by providing them with a glimpse of the various looks on their own face at the click of a button.

To survive the competitive landscape of eCommerce and meet holiday revenue goals this year, merchants and retailers are challenging the status quo and adopting AI-powered technology to meet customer expectations. AI is truly the future of retail, and not leveraging the power of artificial intelligence, machine learning, and related tech means losing out.

Recommendation Systems for Marketing Analytics

The way I perceive recommendation systems, they are something traditional shopkeepers have always used.

Remember when, as children, we used to go shopping with our mother at a specific shop? The shopkeeper used to give the best recommendations for products, and we used to buy from the same shop because we knew that this shopkeeper knew us best.

What the shopkeeper did was understand our taste, our priorities, and the price range we were comfortable with, and then present the products that best matched our requirements. This is what businesses are truly doing now.

They want to know their customers personally through their browsing behaviour and then recommend products they might like; the only difference is that they want to do it on a large scale.

For example, Amazon and Netflix understand your behaviour through what you browse, add to your basket, and order, or the movies you watch and like, and then recommend the products you are most likely to enjoy.

In a nutshell, they combine what you might call business understanding with some mathematics, so that they can essentially learn about the products the customer likes.

So basically, a recommendation system for marketing analytics is a subclass of information filtering systems that seeks similarities between users and items in different combinations.

Below are some of the most widely used types of recommendation systems:

  1. Collaborative Recommendation system
  2. Content-based Recommendation system
  3. Demographic-based Recommendation system
  4. Utility-based Recommendation system
  5. Knowledge-based Recommendation system
  6. Hybrid Recommendation system

Let us go into the most useful ones which the industry is using:

  • Content Based Recommendation System

The point of content-based filtering is that we should know the content of both user and item. Usually, we construct a user profile and an item profile using the content of a shared attribute space. Product attributes like images (size, dimension, colour, etc.) and text descriptions about the product lend themselves to content-based recommendation.

This essentially means that, based on the content I watch on Netflix, I can run an algorithm to find the most similar movies and then recommend them to other users.

For example, when you open Amazon and search for a product, similar products pop up below; this is the item-item similarity they have computed across all the products on Amazon. This gives us a very simple yet effective idea of how products behave with each other.

Bread and butter could be similar products in the true sense, as they go together, even though their attributes can be very different. In the case of the movie industry, features like genres and reviews can tell us which movies are similar, and that is the type of similarity we get for movies.
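To make this concrete, here is a minimal sketch of content-based item-item similarity over a few hypothetical product descriptions (TF-IDF profiles compared by cosine similarity; the items and text are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions standing in for real product attributes
descriptions = [
    "soft white sandwich bread loaf",   # item 0
    "salted dairy butter block",        # item 1
    "whole wheat bread loaf",           # item 2
]

# Build item profiles from the text content and compare them pairwise
item_profiles = TfidfVectorizer().fit_transform(descriptions)
similarity = cosine_similarity(item_profiles)

# similarity[0] scores every item against item 0: the two bread items
# score high together, while butter scores low on attributes even
# though it sells together with bread
print(similarity[0])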

  • Collaborative Recommendation System:

Collaborative algorithms use "User Behaviour" for recommending items. They exploit the behaviour of other users and items in terms of transaction history, ratings, selections, purchase information, etc. In this case, the features of the items are not known.

When you do not want to look at the features of the products to calculate a similarity score, and instead check the interactions of the products with the users, you call it a collaborative approach.

We figure out from the interactions between products and users which products are similar, and then take up a recommendation strategy to target the audience.

Two users who watched the same movie on Netflix can be called similar, and when the first user watches another movie, the second user gets that movie as a recommendation, based on the shared likes these people have.

  • Hybrid Recommendation System:

Combining any two of these systems in a manner that suits the industry is known as a Hybrid Recommendation system. It combines the strengths of two or more recommendation systems and eliminates the weaknesses that exist when only one recommendation system is used.

When we only use Collaborative Filtering, we have a problem called the "cold start" problem. Since we rely on the interactions of users with products, if a user comes to the website for the first time, I do not have any recommendations to make to that customer, as no interactions are available yet.

To eliminate such a problem, we use hybrid recommendation systems, which combine content-based systems and collaborative systems to get rid of the cold start problem. Think of it this way: item-item, user-user, and user-item interactions all combine to give the best recommendations to the users and more value to the business.

From here, we will focus on the Hybrid Recommendation Systems and introduce you to a very strong Python library called lightfm which makes this implementation very easy.

LightFM:

The official documentation can be found in the below link:

https://github.com/lyst/lightfm


LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback.

User and item latent representations are expressed in terms of their features' representations.

It also makes it possible to incorporate both item and user metadata into the traditional matrix factorization algorithms. When multiplied together, these representations produce scores for every item for a given user; items scored highly are more likely to be interesting to the user.

Interactions : The matrix containing user-item interactions.

User_features : Each row contains that user’s weights over features.

Item_features : Each row contains that item’s weights over features.

Note : These matrices should be sparse (a sparse matrix is a matrix that contains very few non-zero elements).

Predictions

fit_partial : Fit the model. Unlike fit, repeated calls to this method will cause training to resume from the current model state.

This is mainly useful for appending new users to the training matrix.

Predict : Compute the recommendation score for user-item pairs.

The scores are sorted in descending order, and the top n items are recommended.

Model evaluation

AUC Score : In the binary case (clicked/not clicked), the AUC score has a nice interpretation: it expresses the probability that a randomly chosen positive item (an item the user clicked) will be ranked higher than a randomly chosen negative item (an item the user did not click). Thus, an AUC of 1.0 means that the resulting ranking is perfect: no negative item is ranked higher than any positive item.

Precision@K : Precision@K measures the proportion of positive items among the K highest-ranked items. As such, this is focused on the ranking quality at the top of the list: it does not matter how good or bad the rest of your ranking is as long as the first K items are mostly positive.

Ex: if only one item of your top 5 items is correct, then your precision@5 is 0.2.

Note : If the first K recommended items are not available anymore (say, they are out of stock), you need to move further down the ranking. A high AUC score will then give you confidence that your ranking is of high quality throughout.

Enough of the theory now, we will move to the code and see how the implementation for lightfm works:

I have taken the dataset from Kaggle, you can download it below:

E-Commerce Data: actual transactions from a UK retailer (www.kaggle.com)
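Before the full notebook, here is a minimal sketch of the LightFM workflow described above; the tiny interaction matrix is a toy stand-in for the one you would build from the Kaggle transactions:

import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.evaluation import auc_score, precision_at_k

# Toy stand-in: 3 users x 4 items, a 1 wherever a user bought an item
rows = np.array([0, 0, 1, 2, 2])   # user indices
cols = np.array([1, 3, 0, 2, 3])   # item indices
data = np.ones_like(rows, dtype=np.float32)
interactions = coo_matrix((data, (rows, cols)), shape=(3, 4))

# WARP loss works well for implicit feedback such as purchases
model = LightFM(loss='warp', no_components=30)
model.fit(interactions, epochs=20, num_threads=2)
# model.fit_partial(...) would instead resume from the current state

# Score every item for user 0 and sort descending to get the top-n
scores = model.predict(0, np.arange(interactions.shape[1]))
print("Ranked items for user 0:", np.argsort(-scores))

# Evaluation (here on the training matrix, purely for illustration)
print("AUC:", auc_score(model, interactions).mean())
print("Precision@2:", precision_at_k(model, interactions, k=2).mean())

To make the model hybrid, user_features and item_features matrices can be passed to fit and predict in the same way.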

Hope you liked the coding part and are ready to implement it in your own setting. A further enhancement is possible if you have product and user features.

These can also be passed as inputs into the lightfm model, and the embeddings the model creates will then be based on all those attributes. The more data you push into lightfm, the more training signal the model gets and the better its accuracy.

That’s all from my end for now. Keep Learning!! Keep Rocking!!

CatBoost – A new game of Machine Learning

Gradient Boosted Decision Trees and Random Forests are among the best ML models for tabular, heterogeneous datasets.

CatBoost is an algorithm for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is the successor of the MatrixNet algorithm that is widely used within the company for ranking tasks, forecasting and making recommendations. It is universal and can be applied across a wide range of areas and to a variety of problems.

Catboost, the new kid on the block, has been around for a little more than a year now, and it is already threatening XGBoost, LightGBM and H2O.

Why Catboost?

Better Results

Catboost achieves the best results on the benchmark, and that's great. When you look at datasets where categorical features play a large role, though, the improvement becomes significant and undeniable.

GBDT Algorithms Benchmark

Faster Predictions

While training can take longer than with other GBDT implementations, prediction is 13–16 times faster than the other libraries, according to the Yandex benchmark.

Left: CPU, Right: GPU

Batteries Included

Catboost's default parameters are a better starting point than those of other GBDT algorithms, which is good news for beginners who want a plug-and-play model to start experimenting with tree ensembles or Kaggle competitions.

GBDT Algorithms with default parameters Benchmark

Some more noteworthy advancements in Catboost are feature interactions, object importance, and snapshot support. In addition to classification and regression, Catboost supports ranking out of the box.

Battle Tested

Yandex is relying heavily on Catboost for ranking, forecasting and recommendations. This model is serving more than 70 million users each month.

The Algorithm

Classic Gradient Boosting

Gradient Boosting on Wikipedia

Catboost Secret Sauce

Catboost introduces two critical algorithmic advances – the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features.

Both techniques use random permutations of the training examples to fight the prediction shift caused by a special kind of target leakage present in all previously existing implementations of gradient boosting algorithms.

Categorical Feature Handling

Ordered Target Statistic

Most GBDT algorithms and Kaggle competitors are already familiar with the use of the Target Statistic (or target mean encoding). It's a simple yet effective approach in which we encode each categorical feature with an estimate of the expected target y conditioned on the category. Well, it turns out that applying this encoding carelessly (the average value of y over the training examples with the same category) results in target leakage.

To fight this prediction shift, CatBoost uses a more effective strategy. It relies on an ordering principle and is inspired by online learning algorithms, which receive training examples sequentially in time. In this setting, the value of the TS for each example relies only on the observed history.

To adapt this idea to a standard offline setting, Catboost introduces an artificial "time": a random permutation σ1 of the training examples. Then, for each example, it uses all of the available "history" to compute its Target Statistic. Note that using only one random permutation leaves the earlier examples with a higher-variance Target Statistic than the later ones, so CatBoost uses different permutations for different steps of gradient boosting.
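To make the ordering principle concrete, here is a minimal sketch of an ordered Target Statistic; the helper name and the smoothing prior a are illustrative assumptions, not CatBoost's internal code:

import numpy as np

def ordered_target_statistic(cat, y, prior, a=1.0, seed=0):
    # The random permutation acts as an artificial "time" axis
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cat))
    sums, counts = {}, {}
    encoded = np.empty(len(cat), dtype=float)
    for pos in perm:
        c = cat[pos]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Each example is encoded using only the targets seen "before" it
        encoded[pos] = (s + a * prior) / (n + a)
        sums[c] = s + y[pos]
        counts[c] = n + 1
    return encoded

cat = np.array(['a', 'b', 'a', 'a', 'b'])
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_statistic(cat, y, prior=y.mean()))

Because an example's own target never enters its encoding, the leakage described above is avoided; averaging over several permutations reduces the extra variance of the early examples.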

One Hot Encoding

Catboost uses a one-hot encoding for all the features with at most one_hot_max_size unique values. The default value is 2.


Ordered Boosting

CatBoost has two modes for choosing the tree structure, Ordered and Plain. Plain mode corresponds to the standard GBDT algorithm combined with an ordered Target Statistic. In Ordered mode, we perform a random permutation of the training examples, σ2, and maintain n different supporting models M1, . . . , Mn, such that the model Mi is trained using only the first i samples in the permutation. At each step, in order to obtain the residual for the j-th sample, we use the model Mj−1. Unfortunately, this algorithm is not feasible in most practical tasks due to the need to maintain n different models, which increases the complexity and memory requirements n times. Catboost therefore implements a modification of this algorithm on top of the gradient boosting algorithm, using one tree structure shared by all the models to be built.

Catboost Ordered Boosting and Tree Building

In order to avoid prediction shift, Catboost uses permutations such that σ1 = σ2. This guarantees that the target yi is not used for training Mi, neither for the Target Statistic calculation nor for the gradient estimation.

Tuning Catboost

Important Parameters

cat_features — This parameter is a must in order to leverage Catboost's preprocessing of categorical features. If you encode the categorical features yourself and don't pass the column indices as cat_features, you are missing the essence of Catboost.

one_hot_max_size — As mentioned before, Catboost uses a one-hot encoding for all features with at most one_hot_max_size unique values. In our case, the categorical features have a lot of unique values, so we won't use one-hot encoding; depending on the dataset, though, it may be a good idea to adjust this parameter.

learning_rate & n_estimators — The smaller the learning_rate, the more n_estimators are needed to fully fit the model. The usual approach is to start with a relatively high learning_rate, tune the other parameters, and then decrease the learning_rate while increasing n_estimators.

max_depth — Depth of the base trees; this parameter has a high impact on training time.

subsample — Sample rate of rows; it cannot be used with the Bayesian boosting type.

colsample_bylevel, colsample_bytree, colsample_bynode— Sample rate of columns.

l2_leaf_reg — L2 regularization coefficient

random_strength — Every split gets a score, and random_strength adds some randomness to that score; this helps to reduce overfitting.

Check out the recommended spaces for tuning here
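As a rough illustration of how these fit together, here is a hypothetical starting configuration; the values and the cat_cols indices are assumptions to adapt to your own dataset:

from catboost import CatBoostClassifier

cat_cols = [0, 3]  # hypothetical indices of the categorical columns

model = CatBoostClassifier(
    cat_features=cat_cols,
    one_hot_max_size=10,     # one-hot only low-cardinality categoricals
    learning_rate=0.1,       # start relatively high, decrease later
    n_estimators=500,        # increase as learning_rate goes down
    max_depth=6,             # strong impact on training time
    subsample=0.8,           # row sample rate (not with Bayesian bootstrap)
    colsample_bylevel=0.8,   # column sample rate
    l2_leaf_reg=3.0,         # L2 regularization
    random_strength=1.0,     # randomness added to split scores
)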

Model Exploration with Catboost

In addition to feature importance, which is quite popular for GBDT models to share, Catboost provides feature interactions and object (row) importance, as sketched below.

Catboost’s Feature Importance

Catboost’s Feature Interactions

Catboost’s Object Importance

SHAP values can be used for other ensembles as well
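Here is a minimal sketch of pulling these diagnostics from a fitted model; the tiny synthetic dataset and names are illustrative:

from catboost import CatBoostClassifier, Pool

# Tiny synthetic dataset: two numeric features and one categorical (index 2)
X = [[1, 4, 'a'], [2, 5, 'b'], [3, 6, 'a'], [4, 7, 'b'], [5, 8, 'a'], [6, 9, 'b']]
y = [0, 1, 0, 1, 0, 1]
train_pool = Pool(X, y, cat_features=[2])

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(train_pool)

print(model.get_feature_importance(prettified=True))     # per-feature importance
print(model.get_feature_importance(type='Interaction'))  # pairwise interactions

# Object (row) importance: how strongly each training row affects predictions
indices, scores = model.get_object_importance(train_pool, train_pool)

# SHAP values: one row per object, one column per feature plus a bias term
shap_values = model.get_feature_importance(data=train_pool, type='ShapValues')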

Not only does CatBoost build one of the most accurate models on whatever dataset you feed it, requiring minimal data prep, it also gives you by far the best open-source interpretation tools available today, AND a way to productionize your model fast.

That’s why CatBoost is revolutionising the game of Machine Learning, forever. And that’s why learning to use it is a fantastic opportunity to up-skill and remain relevant as a data scientist. But more interestingly, CatBoost poses a threat to the status quo of the data scientist (like myself) who enjoys a position where it’s supposedly tedious to build a highly accurate model given a dataset. CatBoost is changing that. It’s making highly accurate modeling accessible to everyone.

Image taken from CatBoost official documentation: https://catboost.ai/

Building highly accurate models at blazing speeds

Installation

Installing CatBoost, on the other hand, is a piece of cake. Just run

pip install catboost

Data prep needed

Unlike most Machine Learning models available today, CatBoost requires minimal data preparation. It handles:

  • Missing values for Numeric variables
  • Non encoded Categorical variables. Note missing values have to be filled beforehand for Categorical variables. Common approaches replace NAs with a new category ‘missing’ or with the most frequent category.
  • For GPU users only, it does handle Text variables as well. Unfortunately I couldn’t test this feature as I am working on a laptop with no GPU available. [EDIT: a new upcoming version will handle Text variables on CPU. See comments for more info from the head of CatBoost team.]

Building models

As with XGBoost, you have the familiar sklearn syntax with some additional features specific to CatBoost.

from catboost import CatBoostClassifier  # Or CatBoostRegressor

model_cb = CatBoostClassifier()
model_cb.fit(X_train, y_train)

Or if you want a cool sleek visual about how the model learns and whether it starts overfitting, use plot=True and insert your test set in the eval_set parameter:

from catboost import CatBoostClassifier  # Or CatBoostRegressor

model_cb = CatBoostClassifier()
model_cb.fit(X_train, y_train, plot=True, eval_set=(X_test, y_test))

Note that you can display multiple metrics at the same time, even more human-friendly metrics like Accuracy or Precision. Supported metrics are listed here. See example below:

Monitoring both Logloss and AUC at training time on both training and test sets

You can even use cross-validation and observe the average & standard deviation of accuracies of your model on the different splits:
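For example, a sketch with CatBoost's built-in cv helper, assuming X_train, y_train, and cat_cols from earlier:

from catboost import Pool, cv

params = {
    'loss_function': 'Logloss',     # required by cv
    'iterations': 200,
    'custom_metric': ['Accuracy'],  # extra human-friendly metric
}
cv_results = cv(Pool(X_train, y_train, cat_features=cat_cols),
                params, fold_count=5, plot=True)

# Per-iteration mean and standard deviation of each metric across the folds
print(cv_results[['test-Accuracy-mean', 'test-Accuracy-std']].tail())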

Finetuning

CatBoost is quite similar to XGBoost. To fine-tune the model appropriately, first set early_stopping_rounds to a finite number (like 10 or 50) and start tweaking the model's parameters.
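A minimal sketch of that loop, reusing the train/test split from above:

from catboost import CatBoostClassifier

model_cb = CatBoostClassifier(learning_rate=0.1, n_estimators=1000)
model_cb.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    early_stopping_rounds=50,  # stop once the eval metric stops improving
    verbose=100,
)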

Python Newbie – Doctest

Any Software Development process consists of five stages:

  1. Requirement Analysis
  2. Design
  3. Development
  4. Testing
  5. Maintenance

Though each and every stage mentioned above is important in the SDLC, this post will mainly focus on the importance of testing and show how we can use doctest, a module in Python, to perform testing.

Importance of testing

We all make mistakes and if left unchecked, some of these mistakes can lead to failures or bugs that can be very expensive to recover from. Testing our code helps to catch these mistakes or avoid getting them into production in the first place.

Testing therefore is very important in software development.

Used effectively, tests help to identify bugs, ensure the quality of the product, and to verify that the software does what it is meant to do.

Python module- doctest

Doctest helps you test your code by running examples embedded in the documentation and verifying that they produce the expected results. It works by parsing the help text to find examples, running them, then comparing the output text against the expected value.

To make things easier, let us start by understanding the idea through a simple example.

Python inline function

So, in the above snippet, I have written a basic inline function that adds a number to itself.

For this function, I run manually a couple of test cases to do some basic verification (to do sanity check of my function).
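For illustration, such a session might look like the following (the function itself is a made-up example):

>>> add = lambda n: n + n
>>> add(2)
4
>>> add(5)
10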

Now, consider a scenario in which Python could read the above output and perform the sanity check for us at run time. This is exactly the idea behind doctest.

Now, let’s see how we can implement one.

Let's take a very simple example: calculating which day of the week it will be a given number of days ahead of the current weekday. We write a docstring for our function, which helps us understand what the function does, what inputs it takes, and so on. In this docstring, I have added a couple of test cases which will be read by the doctest module at run time while testing is carried out.
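A minimal sketch of such a script, with the test cases embedded in the docstring (the function name and cases are illustrative):

import doctest

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

def day_ahead(current, days):
    """Return the day of the week `days` days after `current`.

    >>> day_ahead('Monday', 3)
    'Thursday'
    >>> day_ahead('Saturday', 2)
    'Monday'
    """
    return WEEKDAYS[(WEEKDAYS.index(current) + days) % 7]

if __name__ == '__main__':
    doctest.testmod(verbose=True)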

Implementation of doctest

When we run the above script from the command line, we will get the following output:

Doctest Output

We can see from the above snippet that all the test cases mentioned in the docstring passed, as the resulting output matched the expected output.

But what happens if any test fails, or the script does not behave as expected?

To test this, we add a false test case as we know our script can only take integers as input.

What will happen if we give a string as an input? Let us check out.

Test case with strings as input

I have used the same script but made a small change in the test cases, passing a string as input.
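Continuing the sketch above, the docstring would now carry a case like:

    >>> day_ahead('Monday', 'three')
    'Thursday'

Since the function does integer arithmetic on the days argument, this call raises a TypeError, and doctest reports the unexpected traceback as a failed example.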

So, what is the outcome for the above test case?

Failed test case output

Voila! We can see that the doctest has clearly identified the failed test case and listed the reason for the failure of the above-mentioned test case.

Conclusion

That is all you require to get started with doctest. In this post, we have seen the power of doctest and how it makes it a lot easier to carry out automated testing for most scripts. Many developers find doctest easier than unittest because, in its simplest form, there is no API to learn before using it. However, as the examples become more complex, the lack of fixture management can make writing doctest tests more cumbersome than using unittest. Still, thanks to its ease of use, doctest is worth adding to our code.

In upcoming blogs, we are going to discuss more handy Python modules that can ease our tasks, and we will also dive into other concepts like Machine Learning and Deep Learning. Until then, keep learning!

Evolution of Human Resources in the New World of Technology

How have Human Resources changed with time?

Of all the departments and functions in a corporate organization, Human Resources is the one function concerned with employees' personal aspects. The entire employee job cycle is taken care of by Human Resources (HR), from hiring, compensation, leave management, employee satisfaction, development, and growth through to exit. This function requires personal involvement and conscience that may vary from person to person.

Another rather unique aspect of an organization is technology: the practical and scientific application of skills, processes, methods, and organization techniques. It has no personalized features and derives the same result irrespective of who, where, or when someone brings it into the organizational process.

But what if these two aspects are related?

The challenges related to HR, like employee engagement, employee retention, development of leaders, competitive compensation, the global outreach of businesses, and various other factors, have stimulated extensive innovation in the HR field. For instance, more than 92% of recruiters have turned to social media hiring in the recent decade rather than organic hiring methods. More than 3% of recruiters use "Snapchat" as a recruiting channel, moving beyond LinkedIn, Facebook, and Twitter. Below are some instances that bring HR and technology together.

HR and Virtual Reality

COVID-19 has undoubtedly helped change the mindset of protagonists across the corporate industry, especially in India. Holding appraisal meetings, taking interviews, onboarding, and even celebrations as part of work today take place over video calls. Technology and Virtual Reality help HR with talent management, training, onboarding and inductions, hiring, etc. as the new normal.

HR and Machine Learning

Machine Learning (ML) uses algorithms for automated data analysis to create automated analytical models. HR deals with massive data sets from recruitment and the employee database. ML technology helps HR improve the efficiency of initial research, freeing up dedicated hours for acquiring next-level results. So far, in Human Resources, machine learning applications are confined mainly to the recruitment process. However, it will be exciting to watch the advancements in this field.

HR and Cloud Computing

Cloud Computing means using a network of remote servers hosted on the internet instead of a personal computer or a local server. It helps data processing by storing and managing valuable information over the cloud, enabling the HR department to push its expertise into middle and higher-level leadership, resulting in efficient business performance and execution. When the data on performance, attendance, time tracking, etc. is automated, the focus can shift to increasing productivity, transforming the HR department from a cost center into a revenue generator.

The instances mentioned are only a broad sample of the technology inter-dependencies in the HR department. Ever-changing and fast-paced technological advances are only making HR strive towards innovation, ultimately making it even more intertwined with technology.

Significantly, the global pandemic has altered stigmas and helped people become adaptive. Several theories suggest that remote working may continue as the "new normal" once we overcome this pandemic.

It can push the who's who of the corporate world to refine the entire workplace experience, with HR as the bridge that connects the extremes of an organization, exposing them to and expanding them with the latest technological developments.

It will be interesting to see how the Corporates fit into this new reality.

Super-Resolution with Deep Learning for Image Enhancement

Have you ever looked at your old photographs and wished they had better quality? Or wanted to convert all your photos to a higher resolution to get more likes? Well, deep learning can do it!

Image Super-Resolution can be defined as increasing the size of small images while keeping the drop in quality to a minimum, or restoring High Resolution (HR) images from the rich details available in Low Resolution (LR) images.

The process of enhancing an image is quite complicated due to the multiple issues within a given low-resolution image. An image may have a “lower resolution” as it is smaller in size or as a result of degradation. Super Resolution has numerous applications like:

  • Satellite Image Analysis
  • Aerial Image Analysis
  • Medical Image Processing
  • Compressed Image
  • Video Enhancement, etc.

We can relate the HR and LR images through the following equation:

LR = degradation(HR)

The goal of super resolution is to recover a high-resolution image from a low-resolution input.

Deep learning can estimate the High Resolution of an image given a Low Resolution copy. Using the HR image as a target (or ground-truth) and the LR image as an input, we can treat this like a supervised learning problem.

One of the most used techniques for upscaling an image is interpolation. Although simple to implement, this method faces several issues in terms of visual quality, as the details (e.g., sharp edges) are often not preserved.

Most common interpolation methods produce blurry images. Several types of interpolation techniques are used:

  • Nearest Neighbor Interpolation
  • Bilinear Interpolation
  • Bicubic Interpolation

Image processing for resampling often uses Bicubic Interpolation over Bilinear or Nearest Neighbor Interpolation when speed is not an issue. In contrast to Bilinear Interpolation, which only takes 4 pixels (2×2) into account, bicubic interpolation considers 16 pixels (4×4).
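As a quick illustration with OpenCV (the file path is a placeholder):

import cv2

img = cv2.imread('input.png')  # placeholder path

# 4x upscaling with the three classic interpolation methods
nearest = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_LINEAR)
bicubic = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)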

SRCNN was the first Deep Learning method to outperform traditional ones. It is a convolutional neural network consisting of only three convolutional layers:

  • Pre-Processing and Feature Extraction
  • Non-Linear Mapping
  • Reconstruction

Before being fed into the network, an image needs up-sampling via Bicubic Interpolation. It is then converted to the YCbCr color space (Y Component plus Blue-difference and Red-difference Chroma Components), and the network uses only the luminance channel (Y). The network's output is then merged with the interpolated CbCr channels to produce the final color image. The intent of this procedure is to change the brightness (the Y channel) of the image while leaving its color (the CbCr channels) unchanged.

The SRCNN consists of the following operations:

  • Pre-Processing: Up-scales LR image to desired HR size.
  • Feature Extraction: Extracts a set of feature maps from the up-scaled LR image.
  • Non-Linear Mapping: Maps the feature maps representing LR to HR patches.
  • Reconstruction: Produces the HR image from HR patches.

As a super-resolution requirement, the SR image must preserve most of the information contained in the LR image. Super-resolution models therefore mainly learn the residuals between the LR and HR images, which makes residual network designs essential.

Up-Sampling Layer

The up-sampling layer used is a sub-pixel convolution layer. Given an input of size H×W×C and an up-sampling factor s, the sub-pixel convolution layer first creates a representation of size H×W×s²C via a convolution operation and then reshapes it to sH×sW×C, completing the up-sampling operation. The result is an output spatially scaled by the factor s.
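A minimal TensorFlow sketch of this layer (the sizes are illustrative):

import tensorflow as tf

def subpixel_upsample(x, s, c):
    # Convolve to s^2 * c channels, then rearrange channels into space:
    # (H, W, s^2 * c) -> (sH, sW, c)
    x = tf.keras.layers.Conv2D(c * s ** 2, 3, padding='same')(x)
    return tf.nn.depth_to_space(x, s)

x = tf.random.normal([1, 24, 24, 64])
y = subpixel_upsample(x, s=4, c=64)
print(y.shape)  # (1, 96, 96, 64)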

Enhanced Deep Residual Networks (EDSR)

As super-resolution techniques started gaining momentum, EDSR was developed; it addresses the shortcomings of SRCNN and produces much more refined results.

The EDSR network consists of:

  • 32 residual blocks with 256 channels
  • pixel-wise L1 loss instead of L2
  • no batch normalization layers to maintain range flexibility
  • scaling factor of 0.1 for residual addition to stabilize training

On comparing several techniques, we can clearly see a trade-off between performance and speed. ESPCN looks the most efficient while EDSR is the most accurate and expensive. You may choose the most suitable method depending on the application.

Further, I have implemented Super-Resolution using EDSR on a few images, as shown below:

  • Importing Modules

This block of code imports all the necessary modules needed for training. DIV2K is a dataset that consists of diverse 2K-resolution, high-quality images used in various image processing tasks. The EDSR algorithm is also imported here for training.

  • Defining the Parameters

This block of code sets the number of residual blocks to 16. Furthermore, the super-resolution factor and the downgrade operator are defined. The higher the number of residual blocks, the better the model will be at capturing minute features, although it also becomes harder to train. Hence, we stick with 16 residual blocks.

A directory is created where the model weights will be stored during training.

Images are downloaded from DIV2K into two folders, train and valid, each consisting of both low- and high-resolution images.

Once the dataset has been loaded, both the train and valid images need to be converted into TensorFlow dataset objects.

  • Training the Model

Now the model is trained with the number of steps set to 30k, and the output image is evaluated every 1000th step. Training takes around 12 hours on a Google Colab GPU. The overall workflow is sketched below.
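Putting the steps above together, here is a sketch of the workflow; DIV2K, edsr, and EdsrTrainer mirror the interfaces of a common open-source EDSR implementation and are assumptions here, not a fixed API:

from data import DIV2K                 # assumed DIV2K loader
from model.edsr import edsr            # assumed EDSR model builder
from train import EdsrTrainer          # assumed trainer with checkpointing

depth = 16        # residual blocks
scale = 4         # super-resolution factor
downgrade = 'bicubic'

# Download the DIV2K train/valid splits and wrap them as TensorFlow datasets
train_ds = DIV2K(scale=scale, downgrade=downgrade, subset='train').dataset(
    batch_size=16, random_transform=True)
valid_ds = DIV2K(scale=scale, downgrade=downgrade, subset='valid').dataset(
    batch_size=1, random_transform=False, repeat_count=1)

# Train for 30k steps, evaluating every 1000th step; weights are
# checkpointed to the directory created earlier
trainer = EdsrTrainer(model=edsr(scale=scale, num_res_blocks=depth),
                      checkpoint_dir='.ckpt/edsr-16-x4')
trainer.train(train_ds, valid_ds, steps=30000, evaluate_every=1000)

# Save the final weights for later use
trainer.model.save_weights('weights/edsr-16-x4/weights.h5')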

  • Obtaining the Result

Furthermore, the models are saved and can be used in the future for further modifications.

For output, the model is loaded from the weight files, and the low-resolution and super-resolved images are plotted side by side for comparison.

High-Resolution Images showcased on the right are obtained by applying the Super-Resolution algorithm on the Low-Resolution Images showcased on the left.

Conclusion

You can clearly observe a significant improvement in the resolution of images post-application of the Super-Resolution algorithm, making it exceptionally useful for Spacecraft Images, Aerial Images, and Medical Procedures that require highly accurate results.
