Human Activity Recognition: Fusing Modalities for Better Classification

Human Activity Recognition from high-dimensional visual streams has been gaining popularity in recent times. Categorizing human activity from video input is primarily applied in surveillance of different kinds. At hospitals and nursing homes, it can immediately alert caretakers when a resident shows a sign of sickness: clutching their chest, falling, vomiting, etc. At public places like airports, railway stations, bus stations, malls, or even in your neighborhood, activity recognition becomes a means of alerting authorities to suspicious behavior. It is even useful for identifying flaws in, and improving, the form of athletes and professionals, leading to better training and performance.

Why Use Multimodal Learning for Activity Recognition?

Our experience of the world is multimodal in nature, as Baltrušaitis et al. put it. Humans are blessed with multiple modalities: we can touch, see, smell, hear, and taste, and so understand the world around us better. Most parents will remember how their kid’s understanding of “what a dog is” kept improving after seeing actual dogs, videos of dogs, photographs of dogs, and cartoon dogs, and being told that they are all dogs. Seeing a single video of an actual dog does not help the kid identify the character “Goofy” as a dog. It is the same for machines. Multimodal machine learning models can process and relate information from multiple modalities, learning in a more holistic way.

This blog serves as the captain’s log on how we combined two modalities – static images and videos – to improve the classification of human activities from videos. Algorithms for video activity recognition deal either with spatial information only (images) or with both spatial and temporal information (videos). We trained one model on static frames and another on video clips. Fusing the two made the resulting multimodal model better than either of the unimodal models.

Multimodal Learning for Human Activity Recognition – Our Recipe

Our goal was to recognize 10 activities – basketball, biking, diving, golf swing, horse riding, soccer juggling, tennis swing, trampoline jumping, volleyball spiking, and walking. We created the multimodal models for activity recognition by fusing the two unimodal models – image-based and video-based – using the ensemble method, to improve classification accuracy.

The Dataset Used: We used a modified UCF11 dataset (we removed the Swing class as it has many mislabelled samples). For the 10 activities we need to classify, the dataset has 120-200 videos per activity, with lengths ranging from 1 to 21 seconds. The dataset is available from CRCV, the Center for Research in Computer Vision at the University of Central Florida.

One Modality at a Time

There are different methodologies for multimodal learning, as described by Song et al., 2016, Tzirakis et al., 2017, and Yoon et al., 2018. One technique is ensemble learning, in which a 2DCNN model and a 3DCNN model are trained separately and their final softmax probabilities are combined to get predictions. Other techniques include joint representation, coordinated representation, etc. A detailed overview is available in Baltrušaitis et al., 2018.

The First Modality – Images: We trained a 2DCNN model with the VGG-16 architecture and equipped it with batch normalization and other regularization methods, since the enormous number of frames caused overfitting. We observed that deeper architectures were less accurate. We achieved 81% clip-wise accuracy. However, the accuracies are very poor on some classes such as basketball, soccer juggling, tennis, and walking, as can be seen below.

Second Modality – Videos: The second model was trained with a 3DCNN architecture, which takes 16 frames at a time as one chunk, called a block. We incorporated various augmentations, batch normalization, and regularization methods. During the exercise, we observed that the 3D architecture is sensitive to the learning rate. In the end, we achieved a clip-wise accuracy of 73%. Looking at the accuracies for individual classes, this model is not the best overall, but it is much better than the 2DCNN for the Soccer Juggling and Golf classes. Class-wise accuracies can be seen below.

Our next objective was to ensure that the learnings from both modalities are combined to create a more accurate and robust model.

Our Secret Sauce for Multimodal learning – The Ensemble Method

For fusing the two modalities, we resorted to ensemble methods (Zhao, 2019). We experimented with two ensemble methods for the Multimodal learning:

1. Maximum Vote – Mode from the predictions of both models is taken as the predicted label.

2. Averaging and Maximum Pooling – Weighted sum of probabilities of both models is calculated to decide the predicted label.

Maximum Vote Ensemble Approach: Let us consider a video consisting of 64 frames. We feed these frames to both the 2D and 3D models. The 2D model produces an output for each frame, resulting in 64 tensors, each with a predicted label. The 3D model produces one label per block of 16 frames, generating 4 tensors for the 64-frame video. Left as is, the 64 frame-level votes would swamp the 4 block-level votes, so to balance the weightage of the 3D model we repeat each of its output tensors multiple times. Finally, we concatenate the output tensors from both models and select the modal class (the class label with the highest occurrence) as the predicted label.
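The voting step can be sketched in a few lines of Python. This is an illustrative toy rather than our training code: the helper names `repeats_for_weight` and `max_vote`, and the example labels, are assumptions made up for this sketch.

```python
from collections import Counter


def repeats_for_weight(n_2d, n_3d, weight_2d):
    """Repetitions of each 3D label so that 2D votes make up ~weight_2d of all votes."""
    return max(1, round(n_2d * (1 - weight_2d) / (weight_2d * n_3d)))


def max_vote(labels_2d, labels_3d, repeat_3d=1):
    """Concatenate both label lists (3D labels repeated) and return the modal label."""
    votes = list(labels_2d) + list(labels_3d) * repeat_3d
    return Counter(votes).most_common(1)[0][0]


# Toy 64-frame clip: 64 frame-level labels (2D) and 4 block-level labels (3D).
labels_2d = ["tennis_swing"] * 36 + ["golf_swing"] * 28
labels_3d = ["golf_swing"] * 4

repeat = repeats_for_weight(len(labels_2d), len(labels_3d), weight_2d=0.6)
print(repeat)
print(max_vote(labels_2d, labels_3d, repeat_3d=1))       # 3D votes barely count
print(max_vote(labels_2d, labels_3d, repeat_3d=repeat))  # 3D votes weigh ~40%
```

Without repetition the 2D model decides alone; with the repeated 3D votes, the block-level prediction can overturn a narrow frame-level majority.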

We iteratively experimented with different weightages for both models and checked the corresponding overall accuracy as shown below. Maximum overall accuracy of 84% was achieved at 60% weightage to the 2D model and 40% weightage to the 3D model.

At 60:40 weightage for 2D:3D, the Maximum Vote ensemble method performed better than either of the unimodal models, except for Biking and Soccer Juggling.

Averaging and Maximum Pooling Ensemble Method: Consider again a video with 64 frames fed to both models. The 2D model generates 64 tensors with probabilities for every class, and the 3D model generates 4 tensors with probabilities for each of the 10 classes. We average the 64 tensors from the 2D model to get the average probability for every class. Similarly, we average the 4 tensors from the 3D model. Finally, we take the weighted sum of the two resulting tensors and select the class with the maximum probability as the predicted class.

The weighted sum in the Averaging and Maximum Pooling method is:

Multimodal = α*(2D Model) + (1-α)*(3D Model)

We experimented with different values of α and compared the resulting overall accuracies as shown below. With equal weightage to both models, we achieved 77% accuracy. As α (the weightage of the 2D model) increased, multimodal accuracy increased. We achieved the best accuracy of 87% at 90% weightage to the 2D model and 10% weightage to the 3D model.

The class-wise accuracy for the Averaging and Maximum Pooling method is low for the Basketball and Walking classes. However, the class-wise accuracies are better than, or at least the same as, those of the unimodal models. The accuracies for Soccer Juggling and Tennis Swing improved the most, from 71% and 67% for the 2D model to 90% and 85% respectively.
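As a rough illustration of the weighted sum, here is a minimal NumPy sketch. The function name `fuse_average` and the three-class toy probabilities are assumptions for the example; the real models output 10 classes.

```python
import numpy as np


def fuse_average(probs_2d, probs_3d, alpha):
    """Average each model's softmax outputs over time, then mix them with weight alpha."""
    mean_2d = probs_2d.mean(axis=0)  # (num_frames, num_classes) -> (num_classes,)
    mean_3d = probs_3d.mean(axis=0)  # (num_blocks, num_classes) -> (num_classes,)
    fused = alpha * mean_2d + (1 - alpha) * mean_3d
    return int(np.argmax(fused)), fused


# Toy example with 3 classes: 64 frame-level and 4 block-level probability vectors.
probs_2d = np.tile([0.5, 0.4, 0.1], (64, 1))
probs_3d = np.tile([0.1, 0.8, 0.1], (4, 1))

for alpha in (0.5, 0.9):
    label, fused = fuse_average(probs_2d, probs_3d, alpha)
    print(alpha, label, fused.round(2))
```

Because the probabilities themselves are mixed, a confident model can outweigh an unsure one even when its weight is small, which a pure label vote cannot capture.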


Beyond doubt, our multimodal models performed better than the unimodal ones. Comparing the two, Averaging and Maximum Pooling performed better than the Maximum Vote method, as is evident from the overall accuracies of 87% and 84% respectively. The reason is that the Averaging and Maximum Pooling method takes the confidence of each prediction into account, whereas the Maximum Vote method considers only the label with the maximum probability.

In Human Activity Recognition, we believe the multimodal learning approach can be improved further by incorporating other modalities as well, such as Facebook’s Detectron model or a pose estimation method.

Our next plan of action is to explore more forms of multimodal learning for activity recognition. Fusing features inside the network, for example by adding feature maps at intermediate layers, can help the model learn features better. Another way of proceeding would be to add different modalities like a pose detection feed, a motion detection feed, and an object detection feed to provide better results. No matter the approach, fusing modalities has a corresponding cost. While we have 40.4 million trainable parameters in the 2DCNN model and 78 million parameters in the 3DCNN model, the multimodal model has 118.5 million parameters to train. But this is a small price to pay considering the many applications that become viable because of the performance improvement provided by multimodal models.

Measuring Impact: Top 4 Attribution and Incrementality Strategies

I believe you have gone through part 1 and understood what Attribution and Incrementality mean and why it is important to measure them. Below, we will discuss some methods that are commonly used across the industry to do so.

Before we dive into the methods, let us understand the term Randomized Controlled Trials (RCT). By the way, in common jargon they are popularly known as A/B tests.

What are Randomized Controlled Trials (RCT)?

Simply put, an RCT is an experiment that tests our hypothesis. Suppose we believe (hypothesis) that the new email creative (experiment) will perform (measure) better than the old email creative. We randomly split our audience into 2 groups. One of them, the control group, keeps receiving the old emails, and the other, the test group, receives the new email creative.

Now how do you quantify your measure? How do you understand your experiment is performing better? Choose any metric that you think should determine the success of the new email creatives. Say, Click-through Rate (CTR). Thus, if the test group has a better CTR than the control group, you can say that the new email creative is performing better than the old email creative.
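A higher CTR in the test group is only convincing if the gap is larger than random noise. The sketch below compares the two CTRs with a two-proportion z-test, which is one common choice and an assumption on our part; the click and send counts are made up for illustration.

```python
import math


def ctr(clicks, sends):
    """Click-through rate: clicks divided by emails sent."""
    return clicks / sends


def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic and two-sided p-value for the difference CTR_b - CTR_a."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (ctr(clicks_b, n_b) - ctr(clicks_a, n_a)) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical campaign: control gets the old creative, test gets the new one.
z, p_value = two_proportion_z(clicks_a=200, n_a=10_000, clicks_b=260, n_b=10_000)
print(f"control CTR={ctr(200, 10_000):.3f}, test CTR={ctr(260, 10_000):.3f}")
print(f"z={z:.2f}, p={p_value:.4f}")
```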

Some popular methods to run experiments:

Method 1:

User-Level Analysis

Illustration for Incrementality

One of the simplest ways to quantify incrementality is to run an experiment like the one in the diagram. Divide your sample into two random groups and expose them to different treatments; for example, one group receives a particular email/ad, and the other does not.

The difference in outcomes between the groups is the true measure of the treatment’s effect. This helps us quantify the impact of sending an email or showing a particular ad.
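In numbers, the comparison looks like the sketch below. The function name `incrementality` and the conversion counts are illustrative assumptions.

```python
def incrementality(test_conversions, test_size, control_conversions, control_size):
    """Absolute lift, relative lift and incremental conversions of the treatment."""
    test_rate = test_conversions / test_size
    control_rate = control_conversions / control_size
    lift = test_rate - control_rate
    return {
        "absolute_lift": lift,
        "relative_lift": lift / control_rate,
        "incremental_conversions": lift * test_size,
    }


# Hypothetical experiment: 10,000 users per group, only the test group gets the email.
result = incrementality(test_conversions=500, test_size=10_000,
                        control_conversions=400, control_size=10_000)
print(result)
```

Of the 500 conversions in the test group, only the part above the control baseline can be credited to the email; the rest would have happened anyway.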

Method 2:

Pre/Post Analysis

This is an experiment that can be used to measure the effect of a certain event or action by taking a measurement before (pre) and after (post) the start of the experiment.

You can introduce a new email campaign during the test period and measure the impact over time of any metric of your interest by comparing it against the time when the new email campaign was not introduced.

Thus, by comparing the metric before and after the launch, you can estimate the effect of your experiment.

Things to keep in mind while performing pre/post analysis:

  • Keep the control period long enough to collect a significant number of data points
  • Keep in mind that there might be spillover in results during the test phase, and make sure its impact is not missed
  • Leave enough time for the disruption period, i.e., the transient time just after you have launched the experiment
  • Ideally, avoid peak seasons or other highly volatile periods in the business so that these experiments yield conclusive results
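A minimal sketch of a pre/post comparison that skips the disruption period is shown below. The daily values, the 14-day control period, and the 3-day disruption window are made-up numbers for illustration.

```python
from statistics import mean


def pre_post_effect(daily_metric, launch_day, disruption_days):
    """Mean of the metric after the disruption window minus its mean before launch."""
    pre = daily_metric[:launch_day]
    post = daily_metric[launch_day + disruption_days:]
    if not pre or not post:
        raise ValueError("need data both before launch and after the disruption period")
    return mean(post) - mean(pre)


# 14 days before launch, 3 noisy days right after launch, then 14 settled days.
daily_orders = [100] * 14 + [150] * 3 + [110] * 14
effect = pre_post_effect(daily_orders, launch_day=14, disruption_days=3)
print(effect)
```

Including the three transient days would overstate the effect, which is why the disruption period is dropped before comparing.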

Method 3:

Natural Experiment

A natural experiment is similar to an A/B test in that you observe the effect of a treatment (event, feature) on different samples, but you do not have the ability to define or control the samples. So it is similar to a Randomized Controlled Trial, except that you cannot control the environment of the experiment.

Suppose you want to understand the impact of a certain advertisement. If you do what we explained in Method 1, creating a control group that is not shown the advertisement and a test group that is shown it, and then try to measure the impact of the advertisement, you might make a basic mistake. The groups may not be homogeneous to start with. Their behavior can differ from the very beginning, so you can expect very different results regardless of the ad and cannot be sure of its effectiveness.

We need to decrease this bias using resampling and reweighting techniques. One way to do so is to create Synthetic Control Groups (SCGs).

Find below an example with an illustration for the scenario:

We will create SCGs within the unexposed groups. We will try to understand which households (HHs) missed an ad, but based on their viewing habits are similar to those households (HHs) which have seen one.
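One simple way to build such a group is nearest-neighbour matching on viewing habits, sketched below with NumPy. The feature layout (weekly hours of action and comedy viewing), the household data, and the function name `match_controls` are all assumptions for this illustration; real SCG pipelines typically use richer features and matching rules.

```python
import numpy as np


def match_controls(exposed, unexposed):
    """For each exposed household, index of the most similar unexposed household."""
    diffs = exposed[:, None, :] - unexposed[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))  # Euclidean distance per pair
    return distances.argmin(axis=1)


# Columns: weekly hours of action movies, weekly hours of comedy (hypothetical).
exposed_habits = np.array([[10.0, 1.0], [2.0, 8.0]])
unexposed_habits = np.array([[9.0, 2.0], [1.0, 9.0], [5.0, 5.0]])
exposed_purchased = np.array([1, 1])
unexposed_purchased = np.array([1, 0, 0])

matches = match_controls(exposed_habits, unexposed_habits)
synthetic_control_rate = unexposed_purchased[matches].mean()
incremental_rate = exposed_purchased.mean() - synthetic_control_rate
print(matches, synthetic_control_rate, incremental_rate)
```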


Image credits for the illustration: James Bond and Mr Bean photos by unknown authors, licensed under CC BY-NC; Jurassic Park and Fast and Furious photos by unknown authors, licensed under CC BY-NC-ND.

Another sub-method, which is out of scope for this blog, is to attach a weight to every household based on its demographic attributes (gender, age, income, etc.) using iterative proportional fitting, and then compare the weighted results.

Method 4:

Geo Measurement

Geo measurement is a method that utilizes the ability to spend and/or market in one geographic area (hence “geo”) vs. another. A typical experiment consists of advertising in one geo (the “on” geo) and holding out another geo (the “off” geo), and then measuring the difference between them, i.e., the incrementality caused by the treatment. One also needs to account for pre-test differences between the on and off geographies, either by normalizing them before evaluation or by adjusting for them in post-hoc analysis.
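The two ways of accounting for pre-test differences can be sketched as follows. Both helpers and the sales figures are hypothetical; the first is a plain difference-in-differences, the second scales the off geo to the on geo's pre-test level.

```python
def geo_lift_did(on_pre, on_post, off_pre, off_post):
    """Post-hoc adjustment: change in the on geo minus change in the off geo."""
    return (on_post - on_pre) - (off_post - off_pre)


def geo_lift_normalized(on_pre, on_post, off_pre, off_post):
    """Normalize first: scale the off geo to the on geo's pre-test level."""
    expected_on_post = off_post * (on_pre / off_pre)
    return on_post - expected_on_post


# Hypothetical weekly sales before and during the campaign.
sales = dict(on_pre=1000, on_post=1300, off_pre=800, off_post=900)
print(geo_lift_did(**sales))
print(geo_lift_normalized(**sales))
```

The two estimates differ because one assumes the geos would have changed by the same amount, and the other by the same percentage; which assumption fits is a judgment call for each business.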

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Hope this serves as a good starting point for understanding what attribution and incrementality are and how they are measured in the industry.

Detectron2 FPN + PointRend Model for Amazing Satellite Image Segmentation

Satellite image segmentation has been in practice for the past few years, and it has a wide range of real-world applications such as monitoring deforestation, urbanization, and traffic, identifying natural resources, urban planning, etc. Image segmentation is the process of assigning each pixel of an image to one of the training classes, usually shown as a color code. Satellite image segmentation applies the same process to landscape images taken from satellites. Typical training classes include vegetation, land, buildings, roads, cars, water bodies, etc. Many Convolutional Neural Network (CNN) models have shown decent accuracy in satellite image segmentation, and the U-Net model stands out among them. Though the U-Net model gives decent accuracy, it still has some drawbacks: it struggles to separate classes with very similar features, it cannot predict precise boundaries, and so on. To address these drawbacks, we performed satellite image segmentation using the Basic FPN + PointRend model from the Detectron2 library, which significantly reduced the drawbacks mentioned above and showed a 15% increase in accuracy over the U-Net model on the validation dataset used.

In this blog, we will start by describing the objective of our experiment and the dataset we used, then explain the FPN + PointRend model architecture, and finally show predictions from both the U-Net and Detectron2 models for comparison.


The main objective of this task is to perform semantic segmentation on satellite images, assigning each pixel of an image to one of the five classes considered: greenery, soil, water, building, or utility. As a constraint, anything related to plants and trees, such as forests, fields, bushes, etc., is treated as the single class greenery. The same applies to the soil, water, building, and utility (roads, vehicles, parking lots, etc.) classes.

Data Preparation for Modeling

We collected 500 random RGB satellite images using the Google Maps API for modeling. For segmentation tasks, we need annotated images in the form of RGB masks with the same height and width as the input image, where each pixel value is the color code of its class (i.e., greenery – [0,255,0], soil – [255,255,0], water – [0,0,255], buildings – [255,0,0], utility – [255,0,255]). For the annotation process, we chose the LabelMe tool. Additionally, we performed image augmentations like horizontal flip, random crop, and brightness alterations so that the model learns the features robustly. Once the annotations were done, we split the dataset into train and validation sets in the ratio of 90:10. Below is a sample image from the training dataset and the corresponding RGB masked image.

Fig 1: A sample image with a corresponding annotated RGB mask from the training dataset
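Most training pipelines want one class index per pixel rather than an RGB color, so a conversion step along the following lines is usually needed. This is a sketch: the color codes come from the list above, while the index order and the function name `mask_to_labels` are assumptions.

```python
import numpy as np

# Color codes from the annotation scheme; the index order is an arbitrary choice.
CLASS_COLORS = {
    "greenery": (0, 255, 0),
    "soil": (255, 255, 0),
    "water": (0, 0, 255),
    "building": (255, 0, 0),
    "utility": (255, 0, 255),
}


def mask_to_labels(rgb_mask):
    """Convert an (H, W, 3) RGB mask into an (H, W) array of class indices."""
    labels = np.full(rgb_mask.shape[:2], -1, dtype=np.int64)  # -1 = unknown color
    for index, color in enumerate(CLASS_COLORS.values()):
        matches = np.all(rgb_mask == np.array(color, dtype=rgb_mask.dtype), axis=-1)
        labels[matches] = index
    return labels


toy_mask = np.array([[[0, 255, 0], [0, 0, 255]],
                     [[255, 0, 0], [12, 34, 56]]], dtype=np.uint8)
print(mask_to_labels(toy_mask))
```

Pixels whose color is not in the scheme are marked -1 so that annotation slips show up instead of silently becoming a class.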

Model Understanding

For modeling, we have used the Basic FPN segmentation model + PointRend model from Facebook’s Detectron2 library. Now let us understand the architecture of both models.

Basic FPN Model

FPN (Feature Pyramid Network) mainly consists of two parts: an encoder and a decoder. An image is processed into a final output by passing first through the encoder, then through the decoder, and finally through a segmentation head that generates pixel-wise class probabilities. The encoder follows a bottom-up approach and is implemented with a ResNet encoder; the decoder follows a top-down approach and is implemented with a suitably structured CNN.

Fig 2: Feature Pyramid Network (FPN) model process flow (Image Source [1])

In the bottom-up approach, the RGB image is passed as input and processed through all the convolution layers. The output of each layer is sent as input to the next layer and also to the corresponding convolution layer in the top-down path, as seen in fig 2. As we go up the bottom-up path, each layer reduces the image resolution to 1/4th of its input resolution, thereby downsampling the input image.

In the top-down path, each layer receives input from the layer above and from the corresponding bottom-up layer, as seen in fig 2. These two inputs are merged and then sent to the layer for processing. Before merging, to equalize the channel counts of the two inputs, the input from the bottom-up path is passed through a 1*1 convolution layer, which produces an output of 256 channels, and the input from the top-down layer above is upsampled 2 times using nearest-neighbor interpolation. The two are then added and sent as input to the top-down layer. The output of the top-down layer is passed twice in succession through a 3*3 convolution layer, which results in a feature map with 128 channels. This process continues till the last layer in the top-down path. The outputs of the layers in the top-down path together form the feature pyramid.

Each feature map is upsampled so that its resolution is 1/4th of the input RGB image. Once upsampling is done, the maps are added and sent as input to the segmentation head, where 3*3 convolution, batch normalization, and ReLU activation are performed. To reduce the number of output channels to the number of classes, we apply a 1*1 convolution. Then spatial dropout and bilinear interpolation upsampling are performed to get a prediction vector with the same resolution as the input image.

Technically, in the FPN network, the segmentation predictions are made on a feature map that has a resolution of 1/4th of the input image. Because of this, we must compromise on the accuracy of boundary predictions. To address this issue, the PointRend model is used.
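The merge step of the top-down path can be sketched in plain NumPy. This is a shape-level illustration with random weights, not Detectron2 code: the function names, the 512-channel bottom-up input, and the 8*8 and 16*16 resolutions are assumptions.

```python
import numpy as np


def lateral_1x1(feature, weight):
    """1*1 convolution: mixes channels at every pixel. feature is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feature)


def upsample_nearest_2x(feature):
    """Nearest-neighbor upsampling: every pixel becomes a 2*2 block."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)


def merge(top_down, bottom_up, lateral_weight):
    """Add the upsampled top-down map to the channel-matched bottom-up map."""
    return upsample_nearest_2x(top_down) + lateral_1x1(bottom_up, lateral_weight)


rng = np.random.default_rng(0)
top_down = rng.normal(size=(256, 8, 8))      # coarser map from the layer above
bottom_up = rng.normal(size=(512, 16, 16))   # finer map from the encoder
lateral_weight = rng.normal(size=(256, 512)) # 1*1 conv mapping 512 -> 256 channels

merged = merge(top_down, bottom_up, lateral_weight)
print(merged.shape)
```

Both inputs must agree in channels (via the 1*1 convolution) and in resolution (via the 2x upsampling) before they can be added element-wise.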

PointRend Model

The basic idea of the PointRend model is to treat segmentation as a computer graphics rendering task. Just as rendering refines pixels with high variance through subdivision and adaptive sampling techniques, the PointRend model takes the most uncertain pixels in the semantic segmentation output, upsamples them, and makes point-wise predictions, which results in more refined predictions. The PointRend model performs two main tasks to generate final predictions:

  • Points Selection – how uncertain points are selected during inference and training
  • Point-Wise Predictions – how predictions are made for these selected uncertain points

Points Selection Strategy

During inference, points are selected where the class probabilities in the coarse prediction output from the FPN model (the prediction vector whose resolution is 1/4th of the input image) are close to 1/(number of classes), i.e., 0.2 in our case as we have 5 classes. During training, instead of selecting points based on probabilities alone, the model first samples kN random points from a uniform distribution. It then selects the βN most uncertain points (points with low class probabilities) from among these kN points. Finally, the remaining (1 – β)N points are sampled from a uniform distribution. For the segmentation task, k=3 and β=0.75 have shown good results during training. See fig 3 for more details on the point selection strategy during training.

Fig 3: Point selection strategy demonstration (Image source [2])
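The training-time strategy can be sketched as below. Two simplifications are assumptions of this sketch and not necessarily what Detectron2 does: points are picked as whole-pixel indices rather than continuous coordinates, and uncertainty is measured as the gap between the two highest class probabilities (a small gap means an uncertain point).

```python
import numpy as np


def sample_training_points(probs, n_points, k=3, beta=0.75, rng=None):
    """Pick n_points flat pixel indices from a (num_classes, H, W) probability map."""
    if rng is None:
        rng = np.random.default_rng()
    num_pixels = probs.shape[1] * probs.shape[2]
    flat = probs.reshape(probs.shape[0], -1)

    # Step 1: over-generate k*N candidate points uniformly at random.
    candidates = rng.integers(0, num_pixels, size=k * n_points)

    # Step 2: keep the beta*N candidates with the smallest top-2 probability gap.
    top2 = np.sort(flat[:, candidates], axis=0)[-2:]
    margin = top2[1] - top2[0]
    n_uncertain = int(beta * n_points)
    most_uncertain = candidates[np.argsort(margin)[:n_uncertain]]

    # Step 3: fill the remaining (1 - beta)*N points uniformly at random.
    uniform = rng.integers(0, num_pixels, size=n_points - n_uncertain)
    return np.concatenate([most_uncertain, uniform])


rng = np.random.default_rng(0)
coarse_probs = np.moveaxis(rng.dirichlet(np.ones(5), size=(16, 16)), -1, 0)
points = sample_training_points(coarse_probs, n_points=8, rng=rng)
print(points)
```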

Point-Wise Predictions

Point-wise predictions are made by combining two feature vectors:

  • Fine-Grained Features – At each selected point, a feature vector is extracted from the CNN feature maps. The feature vector can be extracted from a single feature map of res2 or multiple feature maps, e.g., res2 to res5 or feature pyramids.
  • Coarse Prediction Features – Feature maps are usually generated at low resolutions, so fine feature information is lost. To compensate, instead of relying completely on feature maps, the coarse prediction output from the FPN at the selected points is also included in the extracted feature vector.

The combined feature vector is passed through an MLP (multi-layer perceptron) head to get predictions. The MLP shares its weights across all points, and these weights are updated during training using the same kind of loss as the FPN model, which is categorical cross-entropy.

Combined Model (FPN + PointRend) Flow

Now that we understand the main tasks of the PointRend model, let’s understand the flow of the complete task.

First, the input image is sent to the CNN network (in our case, the FPN model) to get a coarse prediction output (in our case, a vector of ¼th the resolution of the input image with 5 channels). This coarse prediction vector is sent as input to the PointRend model, where it is upsampled 2x using bilinear interpolation, and N uncertain points are selected using the PointRend point selection strategy. Then, at these points, new predictions are made using the point-wise prediction strategy with an MLP (multi-layer perceptron) head. This process is repeated till we reach the desired output resolution. For example, if the coarse prediction output resolution is 7*7 and the desired output resolution is 224*224, the PointRend upsampling is done 5 times. In our case, the coarse prediction resolution is ¼th of the desired output resolution, so PointRend upsampling is performed twice. Refer to fig. 4 and 5 for a better understanding of the flow.

Fig 4: PointRend model process flow (Image source [2])
Fig 5: PointRend model upsampling and point-wise prediction demo for a 4*4 coarse prediction vector (Image source [2])
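The number of PointRend upsampling steps follows directly from the fact that each step doubles the resolution. A tiny helper (the name `num_upsampling_steps` and the 56-to-224 example are ours, chosen only to illustrate a 1/4th ratio) makes the arithmetic explicit:

```python
def num_upsampling_steps(coarse_size, target_size):
    """How many 2x upsampling steps take coarse_size to target_size."""
    steps, size = 0, coarse_size
    while size < target_size:
        size *= 2
        steps += 1
    if size != target_size:
        raise ValueError("target size must be the coarse size times a power of two")
    return steps


print(num_upsampling_steps(7, 224))   # 7 -> 14 -> 28 -> 56 -> 112 -> 224
print(num_upsampling_steps(56, 224))  # a coarse map at 1/4th of the target
```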


For model training, we used Facebook’s Detectron2 library. Training was done on an Nvidia Titan XP GPU with 12GB of VRAM and ran for 1 lakh (100,000) steps with an initial learning rate of 0.00025. The best validation IoU was obtained at the 30,000th step. Detectron2 FPN + PointRend outperformed the U-Net model in accuracy for all classes. Below are some of the predictions from both models. As you can see, the Detectron2 model was able to distinguish the features of the greenery and water classes, where U-Net failed in almost all cases. Even the boundary predictions of the Detectron2 model are far better than U-Net’s.

Fig 6: Sample predictions from the U-Net and Detectron2 models. For each image, the left is the prediction from the U-Net model, the middle is the original RGB image, and the right is the prediction from the Detectron2 model
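For readers unfamiliar with the IoU (intersection over union) metric mentioned above, here is a minimal per-class sketch in NumPy. The function name and the toy label maps are assumptions; Detectron2 ships its own evaluators, which this does not reproduce.

```python
import numpy as np


def class_iou(pred, target, num_classes):
    """Intersection over union for each class between two label maps."""
    ious = []
    for cls in range(num_classes):
        intersection = np.logical_and(pred == cls, target == cls).sum()
        union = np.logical_or(pred == cls, target == cls).sum()
        ious.append(intersection / union if union else float("nan"))
    return ious


pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
ious = class_iou(pred, target, num_classes=2)
print(ious)
```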


In this blog, we have seen how the Detectron2 FPN + PointRend model performs segmentation on the input image. The PointRend model can be applied as an extension to any image segmentation task to get better predictions at class boundaries. As further steps to improve accuracy, we can increase the size of the training dataset, apply more augmentations to the images, and tune hyperparameters like learning rate, decay, thresholds, etc.


[1] Feature Pyramid Network for Multi-Class Land Segmentation

[2] PointRend: Image Segmentation as Rendering

HYPER DASH: How To Manage The Progress Of Your Algorithm In Real-time?

Most of our readers who work with Machine Learning or Deep Learning models daily understand the struggle of peeking at the terminal to check for the completion status of the model training. Models training for hundreds of epochs can take several hours to complete. When you train a deep learning model on Google Colab, you’ll want to know your training progress to proceed further.

The saying, “A watched pot never boils,” seems to be relevant in the case of Machine Learning or Deep Learning model training. Hyper Dash could be a helpful tool for all those situations.

Hyper Dash can monitor model training remotely on iOS, Android, or a web URL. You can check the progress, stay informed about significant training changes, and get notified once your training is complete.

You can simultaneously monitor all your models running on different GPUs or TPUs and arrive at the best result. It maintains a log for results and hyper-parameters used, notifying you of the best one completed.

The Hyper Dash Convenience:

  • It is fast and user friendly.
  • Tracks hyper-parameters of the experiments and different functions.
  • It graphs performance metrics in real time.
  • It can be viewed remotely on the Web, iOS, and Android without the self-hosting that tools such as TensorBoard need.
  • It saves the print output of the user’s experiment (standard out / error) as a local log file.
  • It notifies the user when a long-running experiment is completed.

Implementing Hyper Dash:

  • Installation on the terminal/Jupyter notebook.
      • Hyper Dash is an easy-to-install PyPI package for Python.
      • Next, sign up for a Hyper Dash account. You can sign up via email or a GitHub account. Once you log in, you will see a message on your terminal/Jupyter notebook.
  • Installation on the phone.
      • Download the mobile application on your phone from Google Play or the App Store.
      • Once you have signed in, you can access experiments and their results on your phone.
  • Implementing real-time tracking for modeling experiments.
      • First, we import the required libraries. Then we read the dataset and split it for training and testing. After this, we store the target column as y_train and y_test and drop it from the feature sets.

Now, we import the hyper dash library and its required modules. And after that, we define each experiment with an experiment name shown on your phone to identify your experiment uniquely.

We fit the model and, as a sanity check, test it on the test data set. We can also view the confusion matrix of the model on our phone once training is completed. To do so, we call exp.metric() to record values such as the False Positives, False Negatives, True Positives, and True Negatives of our experiment.

The steps to use the Hyper Dash application are as follows:

  • Name: Declare the experiment object in each run with Experiment(name).
  • Parameters: Record the value of any hyper-parameter of the model with exp.param('name', value).
  • Metrics: Record the value of any metric we wish to see with exp.metric('name', score).
  • End: Mark the end of the experiment with exp.end().

An added decorator experiment:

An experiment never closes without an exp.end() call, as that is what marks its end. To avoid leaving experiments open by mistake, we can wrap the entire experiment in a decorator.
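The lifecycle and the decorator idea can be sketched without any network access. Note the assumptions: the `Experiment` class below is a local stand-in that only mirrors the four calls listed earlier (with the real package you would import `Experiment` from `hyperdash` instead), and `monitored` is a hypothetical decorator written for this sketch, not the library's own.

```python
import functools


class Experiment:
    """Local stand-in, NOT the real Hyper Dash class; it only mirrors the four calls."""

    def __init__(self, name):
        self.name, self.params, self.metrics, self.ended = name, {}, {}, False

    def param(self, name, value):
        self.params[name] = value
        return value

    def metric(self, name, value):
        self.metrics[name] = value
        return value

    def end(self):
        self.ended = True


def monitored(name):
    """Hypothetical decorator: creates the experiment and always ends it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            exp = Experiment(name)
            try:
                func(exp, *args, **kwargs)
            finally:
                exp.end()  # runs even if training raises
            return exp
        return wrapper
    return decorator


def confusion_counts(y_true, y_pred):
    """Binary confusion-matrix counts from two equally long label lists."""
    pairs = list(zip(y_true, y_pred))
    return {
        "true_positives": sum(t == 1 and p == 1 for t, p in pairs),
        "true_negatives": sum(t == 0 and p == 0 for t, p in pairs),
        "false_positives": sum(t == 0 and p == 1 for t, p in pairs),
        "false_negatives": sum(t == 1 and p == 0 for t, p in pairs),
    }


@monitored("toy-classifier")
def train(exp):
    exp.param("learning_rate", 0.01)
    for name, value in confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]).items():
        exp.metric(name, value)


finished = train()
print(finished.ended, finished.metrics)
```

Putting exp.end() in a finally clause is the point of the pattern: the experiment is closed even when the training code fails midway.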

So once you start the modeling, you can track the experiment in real time on your phone.

Take, for instance, the following:

These are experiments from different GPUs. You can select any model to see the total time taken to run it. Additionally, you get notified as soon as model training is complete.

We can see the different parameters and logs for every experiment. There is also a chart of the confusion matrix values recorded for each experiment.


Hyper Dash is a user-friendly application for tracking your model training runs. You can use it with TensorFlow, PyTorch, etc., with compatibility across platforms and notifications on your phone.

DECIPHERING: How do Consumers Make Purchase Decisions?


Suppose you are looking for a product on a particular website. As soon as you set out on the journey of first searching for the product, wavering over whether or not to buy it, and finally purchasing it, you are targeted or tempted by various marketing strategies through various channels to buy the product.

You may start seeing ads for the particular product on social media websites and on the side of various web pages, receive promotional emails, etc. The different channels you interact with during this experience are referred to as touchpoints.

Customer Touchpoint Journey | Source

So, to sum up, whenever you provide an interest/signal to a platform that you are going to purchase a certain product, you may interact with these touchpoints mentioned above.

The job of a marketing team of a particular company is to utilize the marketing budget in a way that they get the maximum return on the marketing spend, i.e. to ensure that you buy their product.

So to achieve this, the marketing team uses a technique called Attribution.

What is Attribution?

Attribution, also known as Multi-Touch Attribution, is the identification of the set of user actions/events/touchpoints that drive a certain outcome or result, and the assignment of value to each of those events/touchpoints.

Why is Attribution Important?

The main aim of marketing attribution is to quantify the influence of various touchpoints on the desired outcome and optimize their contribution to the user journey to maximize the return on marketing spend.

How does Attribution Work?

Assume you have shown an interest in buying sneakers on Amazon. You receive an email tempting you to make the purchase and finally, after some deliberation, you click on it and make the purchase. In a simple scenario, the marketing team will attribute your purchase to this email, i.e., they will feel that the email channel is what caused the purchase. They will think that there is a causal relationship between the targeted email and the purchase decision.

Suppose this occurrence is replicated across tens of thousands of users. The marketing team feels that email has the best conversion when compared to other channels. They start allocating more budget to it. They spend money on aesthetic email creatives, hire better designers, send more emails as they feel email is the primary driver.

But, after a month, they notice that conversion is dropping. People are not interacting with the email. Alas! The marketing team has wasted the budget on a channel that they thought was causing the purchases.

Where did the Marketing Team go Wrong?

Attribution models are not causal, signifying that they give the credit of a transaction to a channel that may not necessarily cause that transaction. So, it was not only the emails that were causing the transactions; but there might have been another important touchpoint/touchpoints that were actually driving the purchase.

Understanding Causal Inference

The main goal of the marketing team is to use the attribution model to infer causality, and as we have discussed, they are not necessarily doing so. We need Causal Inference to truly understand the process of cause and effect of our marketing channels. Causal Inference deals with the counterfactual; it is imaginative and retrospective. Causal inference will instead help us understand what would have happened in the absence of a marketing channel.

Ta-Da!! Enter Incrementality. (Incrementality has been waiting the entire time to make its entrance in this blog.)

What is Incrementality?

Incrementality is the process of identifying an interaction that caused a customer to do a certain transaction.

In fact, it finds the interaction in whose absence the transaction would not have occurred. Therefore, incrementality is the art of finding causal relationships in the data.

It is tricky to quantify the inherent relationships among touchpoints, so I have dedicated part 2 to discussing the various strategies that are used to measure incrementality and how a marketing team can better distribute its budget across marketing channels.

Natural Language Inferencing (NLI) Task: Demonstration Using Kaggle Dataset

The Natural Language Inferencing (NLI) task is one of the most important subsets of Natural Language Processing (NLP) and has seen a series of developments in recent years. There are standard, publicly available benchmark datasets dedicated to NLI tasks, such as the Stanford Natural Language Inference (SNLI) Corpus and the Multi-Genre NLI (MultiNLI) Corpus. A few state-of-the-art models trained on these datasets achieve good accuracy. In this blog, I will start by briefing the reader on NLI terminology, applications of NLI, and state-of-the-art NLI model architectures, and then demonstrate the NLI task using the Kaggle Contradictory, My Dear Watson challenge dataset.


  • Basics of NLP
  • Moderate Python coding

What is NLI?

Natural Language Inference, also known as Recognizing Textual Entailment (RTE), is the task of determining whether a given “hypothesis” logically follows from a “premise” (entailment), contradicts it (contradiction), or is undetermined with respect to it (neutral). For example, let us consider the hypothesis “The game is played by only males” and the premise “Female players are playing the game”. The task of an NLI model is to predict whether the relationship between the two sentences is entailment, contradiction, or neutral. In this case, it is a contradiction.

How is NLI different from NLP?

The main difference between NLP and NLI is that NLP is a broader set that contains two subsets: Natural Language Understanding (NLU) and Natural Language Generation (NLG). We are more concerned with NLU, as NLI comes under it. NLU is basically about making the computer capable of comprehending what a given block of text represents. NLI, which comes under NLU, is the task of understanding two given statements and categorizing their relationship as entailment, contradiction, or neutral. Most NLP tasks include pre-processing steps like removing stop words, special characters, etc. In the case of NLI, however, one just has to provide the model with the two sentences. The model then processes the data itself and outputs the relationship between the two sentences.

Figure 1: NLP vs NLI. (Source)

Applications of NLI

NLI has been used in many domains like banking, retail, finance, etc. It is widely used in cases where there is a requirement to check whether a generated result, or a result obtained from the end-user, follows the hypothesis. One of the use cases is automatic auditing. NLI can replace human auditing to some extent by checking whether the sentences in a generated document are entailed by the reference documents.

Models used to Demonstrate NLI Task

In this blog, I have demonstrated the NLI task using two models: RoBERTa and XLM-RoBERTa. Let us understand these models in this section.

To understand the RoBERTa model, one should first have a brief understanding of the BERT model.

BERT

Bidirectional Encoder Representations from Transformers (BERT) was published by Google AI researchers in 2018. It has shown state-of-the-art results in many NLP tasks like question answering, NLI, etc. It is basically the encoder stack of the transformer architecture. It has two versions: BERT base and BERT large. BERT base has 12 layers in its encoder stack and 110M total parameters, whereas BERT large has 24 layers and 340M total parameters.

BERT pre-training consists of two tasks:

  1. Masked Language Model (MLM)
  2. Next Sentence Prediction (NSP)

Masked LM

In each input sequence sent to the model, 15% of the words are randomly masked, and the model is trained to predict the masked words from the context provided by the unmasked words. This helps the model learn the context of the sentence.
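The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual pre-processing code: it always substitutes a `[MASK]` token, whereas BERT replaces only 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged. The function name and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Mask roughly mask_prob of the tokens and remember the originals as labels."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]   # what the model must predict
        masked[pos] = MASK_TOKEN    # what the model actually sees
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

During pre-training, the loss is computed only at the masked positions, by comparing the model's predictions with the stored labels.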

Next Sentence Prediction

The model is fed pairs of sentences as input. In this task, the model is trained to predict whether the second sentence actually follows the first one or not. This helps it learn the relationship between two sentences, which is the major objective for tasks like question answering, NLI, etc.

Both tasks are trained simultaneously.

Model Input

Input to the BERT model is a sequence of tokens that are converted to embeddings. Each token embedding is a combination of 3 embeddings.

Figure 2: BERT input representation [1]
  1. Token Embeddings – These are word embeddings from the WordPiece token vocabulary.
  2. Segment Embeddings – As the BERT model takes a pair of sentences as input, these embeddings help the model distinguish the tokens of one sentence from those of the other. In the above picture, EA represents the embeddings of sentence A while EB represents the embeddings of sentence B.
  3. Position Embeddings – These embeddings express the position of each word in the sentence, capturing the “sequence” or “order” information.

Model Output

The output of the BERT model has the same number of tokens as the input, plus an additional classification token which gives the classification result, i.e. whether sentence B follows sentence A or not.

Figure 3: Pre-training procedure of BERT [1]

Fine-Tuning BERT

A pre-trained BERT model can be fine-tuned to perform a specific task on specific data. Fine-tuning uses the same architecture as the pre-trained model; only an additional output layer is added, depending on the task. In the case of the NLI task, the classification token is fed into the output classification layer, which determines the probabilities of the entailment, contradiction, and neutral classes.
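The last step, turning the output layer's raw scores into class probabilities, is typically a softmax. The sketch below assumes made-up logits and a hypothetical label ordering; a real model defines its own ordering.

```python
import math

# Hypothetical label order, for illustration only.
LABELS = ["entailment", "contradiction", "neutral"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.3, 2.1, -0.5]  # made-up scores from the classification layer
probs = softmax(logits)
predicted = LABELS[probs.index(max(probs))]
print(probs, predicted)
```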

Figure 4: Illustration for BERT fine-tuning on sentence pair specific tasks like MNLI, QQP, QNLI, RTE, SWAG etc. [1]

BERT GLUE Task Results

Figure 5: GLUE test results [1]

As you can see in Figure 5, BERT outperforms all the previous models on GLUE tests.


RoBERTa

The Robustly Optimized BERT Pretraining Approach (RoBERTa) was proposed by Facebook researchers. They found that, with a much more robust pre-training procedure, the BERT model can perform even better on GLUE tasks. The RoBERTa model is a BERT model with a modified pre-training approach.

Below are a few of the changes incorporated in the RoBERTa model when compared to the BERT model.

  1. Data – The RoBERTa model is trained using much more data than BERT. It is trained on 160GB of uncompressed data.
  2. Static vs Dynamic Masking – In the original BERT model, data was masked only once during pre-processing, which results in a single static mask that is reused in every training iteration. To avoid repeating the same mask in every epoch, the static-masking setup evaluated in the RoBERTa paper duplicated the data 10 times with 10 different mask patterns and trained over 40 epochs, so each mask pattern is still seen in 4 epochs. In dynamic masking, which RoBERTa adopts, a new mask pattern is generated every time a sequence is fed to the model during training.
Figure 6: Static vs Dynamic masking results [2]

3. Removal of Next Sentence Prediction (NSP) objective – The researchers found that removing the NSP loss matched or slightly improved model performance on downstream tasks.

Figure 7: Model results comparison when different input formats are used [2]

4. Trained on Large Batch Sizes – Training model on large batch sizes improved the model accuracy.

Figure 8: Model results when trained with different batch sizes [2]

5. Tokenization – RoBERTa uses a byte-level Byte-Pair Encoding (BPE) scheme with a vocabulary of 50K tokens, in contrast to BERT’s character-level BPE with a 30K vocabulary.
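The difference between static and dynamic masking (item 2 in the list above) can be illustrated with a small sketch. It is a toy version under simplifying assumptions: a single sequence, a plain `<mask>` substitution, and hypothetical helper names. With 10 duplicated copies and 40 epochs, each static pattern is reused 4 times, whereas dynamic masking draws a fresh pattern on every pass.

```python
import random

def random_mask(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with about mask_prob of them replaced by <mask>."""
    n = max(1, round(len(tokens) * mask_prob))
    chosen = set(rng.sample(range(len(tokens)), n))
    return ["<mask>" if i in chosen else tok for i, tok in enumerate(tokens)]

def static_masking(tokens, epochs, n_copies=10, seed=0):
    """Masks are fixed up front (n_copies patterns) and cycled through the epochs."""
    rng = random.Random(seed)
    copies = [random_mask(tokens, rng) for _ in range(n_copies)]
    return [copies[epoch % n_copies] for epoch in range(epochs)]

def dynamic_masking(tokens, epochs, seed=0):
    """A new mask pattern is drawn every time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [random_mask(tokens, rng) for _ in range(epochs)]

tokens = "a premise and a hypothesis are fed to the model as one sequence".split()
static = static_masking(tokens, epochs=40)
dynamic = dynamic_masking(tokens, epochs=40)

print(len(set(map(tuple, static))), "distinct static patterns (at most 10)")
print(len(set(map(tuple, dynamic))), "distinct dynamic patterns")
```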

Figure 9: RoBERTa results on GLUE tasks. [2]

RoBERTa Results on GLUE Tasks

RoBERTa clearly outperforms the previous models.

XLM-RoBERTa

XLM-R is a large multilingual model trained on 100 different languages. It is basically an update to Facebook's XLM-100 model, which is also trained on 100 different languages. It uses the same training procedure as the RoBERTa model, which relies only on the Masked Language Model (MLM) objective and drops the Next Sentence Prediction (NSP) objective.

Noticeable changes in XLM-R model are:

  1. Data – The XLM-R model is trained on a large, cleaned CommonCrawl dataset scaled up to 2.5TB, which is far larger than the Wiki-100 corpus used to train other multilingual models.
  2. Vocabulary – The XLM-R vocabulary contains 250k tokens, in contrast to RoBERTa, which has 50k tokens in its vocabulary. It uses one large shared Sentence Piece Model (SPM) to tokenize the words of all languages, unlike the XLM-100 model, which uses different tokenizers for different languages. The XLM-R authors assume that similar words across all the languages have similar representations in space.
  3. XLM-R is self-supervised, whereas XLM-100 is a supervised model. XLM-R samples streams of text from each language and trains the model to predict masked tokens. The XLM-100 model required parallel sentences (sentences that have the same meaning) in two different languages as input, which is a supervised method.

XLM-R Results on Cross-Lingual-Classification on XNLI dataset

Figure 10: XLM-R results on XNLI dataset. [3]

At the time of its release, XLM-R was the state-of-the-art multilingual model, outperforming all the previous multilingual models.

Demonstration of NLI Task Using Kaggle Dataset

In this section, we will implement the NLI task using Kaggle dataset.

Kaggle launched the Contradictory, My Dear Watson challenge to detect contradiction and entailment in multilingual text. It has shared a training and validation dataset that contain 12120 and 5195 text pairs respectively. This dataset contains text pairs from 15 different languages – Arabic, Bulgarian, Chinese, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, and Vietnamese. Sentence pairs are classified into three classes: entailment (0), neutral (1), and contradiction (2).

We will be using the training dataset of this challenge to demonstrate the NLI task. One can run the following code blocks using Google Colab and can download a dataset from this link.

Code Flow

1. Install transformers library

!pip install transformers

2. Load XLM-RoBERTa model –

Since our dataset contains multilingual text, we will be using XLM-R model for checking the accuracy on training dataset.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

xlmr = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

3. Load training dataset –

import pandas as pd

train_data = pd.read_csv(<dataset path>)
train_data.head(3)

4. XLM-R model classes –

Before going further, do a sanity check to confirm that the model's class notation and the dataset's class notation are the same.


{'contradiction': 0, 'entailment': 2, 'neutral': 1}

We can see that the model's class notation and the Kaggle dataset's class notation (entailment (0), neutral (1), and contradiction (2)) are different. Therefore, change the training dataset's class notation to match the model's.

5. Change training dataset classes notation –

train_data['label'] = train_data['label'].replace([0, 2], [2, 0])
train_data.head(3)
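The intended effect of this step is a swap of labels 0 and 2, with 1 left untouched. The same mapping can be written out explicitly in plain Python, which makes it easy to verify on a few values; the dictionary and function names here are hypothetical.

```python
# Kaggle notation: entailment=0, neutral=1, contradiction=2
# Model notation:  contradiction=0, neutral=1, entailment=2
KAGGLE_TO_MODEL = {0: 2, 1: 1, 2: 0}

def remap_labels(labels):
    """Translate Kaggle label ids into the model's label ids."""
    return [KAGGLE_TO_MODEL[label] for label in labels]

kaggle_labels = [0, 1, 2, 2, 0]
model_labels = remap_labels(kaggle_labels)
print(model_labels)  # [2, 1, 0, 0, 2]
```

Because the mapping is a swap, applying it twice returns the original labels, which is a handy sanity check.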

6. EDA on dataset –

Check the distribution of training data based on language

train_data_lang = train_data.groupby('language').count().reset_index()[['language', 'id']]

# plot pie chart
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10, 10))
plt.pie(train_data_lang['id'], labels=train_data_lang['language'], autopct='%1.1f%%')
plt.title('Distribution of Train data based on Language')

We can see that English constitutes more than 50% of the training data.

7. Sample data creation –

Since the training data has 12120 text pairs, evaluating all the pairs would be time-consuming. Therefore, we will create a representative sample of the training data, i.e. the sample will have the same distribution of text pairs across languages as the training data.

# create a column which tells how many random rows should be extracted for each language
train_data_lang['sample_count'] = train_data_lang['id'] / 10

# sample data, language by language
samples = []
for i in range(len(train_data_lang)):
    df = train_data[train_data['language'] == train_data_lang['language'][i]]
    n = int(train_data_lang['sample_count'][i])
    samples.append(df.sample(n))

# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead
sample_train_data = pd.concat(samples).reset_index(drop=True)

# plot distribution of sample data based on language
sample_train_data_lang = sample_train_data.groupby('language').count().reset_index()[['language', 'id']]
plt.figure(figsize=(10, 10))
plt.pie(sample_train_data_lang['id'], labels=sample_train_data_lang['language'], autopct='%1.1f%%')
plt.title('Distribution of Sample Train data based on Language')

We can see that the sample and the training data have nearly the same distribution of text pairs across languages.

8. Functions to get predictions from XLM-R model –

def get_tokens_xlmr_model(data):
    '''
    Function which creates tokens for the passed data using xlmr model
    input - Dataframe
    Output - list of tokens
    '''
    batch_tokens = []
    for i in range(len(data)):
        # recent versions of transformers use truncation= in place of the deprecated truncation_strategy=
        tokens = tokenizer.encode(data['premise'][i], data['hypothesis'][i], return_tensors='pt', truncation='only_first')
        batch_tokens.append(tokens)
    return batch_tokens

def get_predicts_xlmr_model(tokens):
    '''
    Function which creates predictions for the passed tokens using xlmr model
    input - list of tokens
    Output - list of predictions
    '''
    batch_predicts = []
    for i in tokens:
        # first element of the model output holds the logits; take the highest-scoring class
        predict = xlmr(i)[0][0]
        predict = int(predict.argmax())
        batch_predicts.append(predict)
    return batch_predicts

9. Predictions on sample data –

sample_train_data_tokens = get_tokens_xlmr_model(sample_train_data)
sample_train_data_predictions = get_predicts_xlmr_model(sample_train_data_tokens)

10. Find model accuracy on the predictions –

# plot the confusion matrix and classification report for original labels to the predicted labels
import numpy as np
import seaborn as sns
from sklearn.metrics import classification_report

sample_train_data['label'] = sample_train_data['label'].astype(str).astype(int)
x = np.array(sample_train_data['label'])
y = np.array(sample_train_data_predictions)

# rows are true labels, columns are predicted labels
cm = np.zeros((3, 3), dtype=int)
np.add.at(cm, (x, y), 1)

sns.heatmap(cm, cmap="YlGnBu", annot=True, annot_kws={'size': 16}, fmt='g',
            xticklabels=['contradiction', 'neutral', 'entailment'],
            yticklabels=['contradiction', 'neutral', 'entailment'])

matrix = classification_report(x, y, labels=[0, 1, 2], target_names=['contradiction', 'neutral', 'entailment'])
print('Classification report : \n', matrix)
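To make the confusion matrix less of a black box, here is a dependency-free sketch of how it is accumulated and how overall accuracy falls out of it. The label lists are made up for illustration and are not results from the Kaggle data.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(true_labels, predicted_labels):
        matrix[true][pred] += 1
    return matrix

def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Made-up labels: 0 = contradiction, 1 = neutral, 2 = entailment
true_labels = [0, 1, 2, 2, 1, 0]
predicted = [0, 1, 2, 1, 1, 2]

cm = confusion_matrix(true_labels, predicted)
for row in cm:
    print(row)
print("accuracy:", round(accuracy(cm), 2))
```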

The model is able to give 93% accuracy on the sample data without any finetuning.

11. Find model accuracy at language level –

sample_train_data['prediction'] = sample_train_data_predictions
sample_train_data['true_prediction'] = np.where(sample_train_data['label'] == sample_train_data['prediction'], 1, 0)

sample_train_data_predicted_lang = sample_train_data.groupby('language').agg({'id': 'count', 'true_prediction': 'sum'}).reset_index()[['language', 'id', 'true_prediction']]
sample_train_data_predicted_lang['accuracy'] = round(sample_train_data_predicted_lang['true_prediction'] / sample_train_data_predicted_lang['id'], 2)
sample_train_data_predicted_lang = sample_train_data_predicted_lang.sort_values(by=['accuracy'], ascending=False)
sample_train_data_predicted_lang

Every language except English has an accuracy greater than 94%. Therefore, we use a different model for the English pairs to further improve accuracy.

12. RoBERTa model for English text pairs prediction –

!pip install regex requests
!pip install omegaconf
!pip install hydra-core

import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()

The RoBERTa model's class notation is the same as the XLM-R model's. Therefore, we can directly use the sample data without any changes to the class notation.

13. Extract only English pairs from sample data –

sample_train_data_en = sample_train_data[sample_train_data['language'] == 'English'].reset_index(drop=True)
sample_train_data_en.shape

(687, 8)

14. Functions to get predictions from RoBERTa model –

def get_tokens_roberta(data):
    '''
    Function which generates tokens for the passed data using roberta model
    input - Dataframe
    Output - list of tokens
    '''
    batch_tokens = []
    for i in range(len(data)):
        tokens = roberta.encode(data['premise'][i], data['hypothesis'][i])
        batch_tokens.append(tokens)
    return batch_tokens

def get_predictions_roberta(tokens):
    '''
    Function which generates predictions for the passed tokens using roberta model
    input - list of tokens
    Output - list of predictions
    '''
    batch_predictions = []
    for i in range(len(tokens)):
        prediction = roberta.predict('mnli', tokens[i]).argmax().item()
        batch_predictions.append(prediction)
    return batch_predictions

15. Predictions with RoBERTa model –

sample_train_data_en_tokens = get_tokens_roberta(sample_train_data_en)
sample_train_data_en_predictions = get_predictions_roberta(sample_train_data_en_tokens)

16. Accuracy of RoBERTa model –

sample_train_data_en['prediction'] = sample_train_data_en_predictions

# roberta model accuracy
sample_train_data_en['true_prediction'] = np.where(sample_train_data_en['label'] == sample_train_data_en['prediction'], 1, 0)
roberta_accuracy = round(sum(sample_train_data_en['true_prediction']) / len(sample_train_data_en), 2)
print("Accuracy of RoBERTa model {}".format(roberta_accuracy))

Accuracy of RoBERTa model 0.92

Accuracy of English text pairs increased from 89% to 92%. Therefore, for predictions on test dataset of Kaggle challenge use RoBERTa for English pairs prediction and XLM-R for predictions of other language pairs.
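The routing described above (RoBERTa for English pairs, XLM-R for everything else) can be sketched as a small dispatcher. The two predictor functions below are stand-in stubs with hard-coded outputs so the example runs on its own; in the actual workflow they would wrap the RoBERTa and XLM-R prediction functions defined earlier.

```python
def route_predictions(rows, predict_english, predict_other):
    """Send English pairs to one model and all other languages to another."""
    predictions = []
    for row in rows:
        predictor = predict_english if row["language"] == "English" else predict_other
        predictions.append(predictor(row["premise"], row["hypothesis"]))
    return predictions

# Stub predictors (hypothetical): each returns a fixed class id.
def fake_roberta(premise, hypothesis):
    return 2

def fake_xlmr(premise, hypothesis):
    return 0

rows = [
    {"language": "English", "premise": "A man is running.", "hypothesis": "A person moves."},
    {"language": "German", "premise": "Ein Mann rennt.", "hypothesis": "Niemand bewegt sich."},
    {"language": "English", "premise": "It rains.", "hypothesis": "The ground gets wet."},
]

predictions = route_predictions(rows, fake_roberta, fake_xlmr)
print(predictions)  # [2, 0, 2]
```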

By this approach, I was able to score 94.167% accuracy on the test dataset.


Conclusion

In this blog, we have learned what the NLI task is and how to perform it using two state-of-the-art models. There are many more pre-trained models for NLI tasks beyond the ones discussed in this blog. A few of them are language-specific, like German BERT, French BERT, Finnish BERT, etc., and others are multilingual, like Multilingual BERT, XLM-100, etc.

As a future step, one can further improve task-specific accuracy by fine-tuning these models on task-specific data.


[1] BERT official paper

[2] RoBERTa official paper

[3] XLM-RoBERTa official paper

In-Store Traffic Analytics: Retail Sensing with Intelligent Object Detection

1. What is Store Traffic Analytics?

In-store traffic analytics allows data-driven retailers to collect meaningful insights about customer’s behavioral data.

The retail industry receives millions of visitors every year. Along with fulfilling its primary objective, a store can also extract valuable insights from this constant stream of traffic.

The footfall data, or the count of people in a store, creates an alternate source of value for retailers. One can collect traffic data and analyze key metrics to understand what drives the sales of their product, customer behavior, preferences, and related information.

2. How does it help store potential?

Customer Purchase Experience

Store Traffic Analytics provides insights into and in-depth knowledge of customers' shopping and purchasing habits, their in-store journey, etc., by capturing key data points such as the footfall at different periods, the preferred product categories, and the traffic intensity across departments, among others. Retailers can leverage such analytics to strategize and target their customers in ways that enhance customer experience and drive sales.

Customer Dwell Time Analysis

Dwell time is the length of time a person spends looking at the display or remains in a specific area. It grants an understanding of what in a store holds customer attention and helps in optimizing store layouts and product placements for higher sales.

Demographics Analysis

Demographic analysis separates store visitors into categories based on their age and gender, aiding in optimizing product listing. For instance, suppose the footfall analysis of a footwear store shows that the prevalent customers are young men in the 18-25 age group. This information helps the store manager list products that appeal to this demographic group, ensuring better conversion rates.

Human Resource Scheduling

With the help of store traffic data, workforce productivity can also be enhanced by effective management of staff schedules according to peak shopping times to meet demands and provide a better customer experience, directly impacting operational costs.

3. Customer Footfall Data

The first step for Store Traffic Analytics is to have a mechanism to capture customer footfall data. Methods to count people entering the store (People Counting) have been evolving rapidly. Some of them are as follows –

  • Manual tracking
  • Mechanical counters
  • Pressure mats
  • Infrared beams
  • Thermal counters
  • Wi-Fi counting
  • Video counters.

This article will take a closer look into the components of an AI-based object detection and tracking framework for Video counters using Python, Deep Learning and OpenCV, by leveraging CCTV footage of a store.

4. People Counting (Video Counters)

Following are the key components involved in building a framework for people counting in CCTV footage:

  1. Object Detection – Detecting objects (persons) in each frame, or once every fixed number of frames, in the video.
  2. Object Tracking – Assigning unique IDs to the persons detected and tracking their movement in the video stream.
  3. Identifying the entry/exit area – Based on the angle of the CCTV footage, identifying the entry/exit area tracks the people entering and exiting a store.

Since object detection algorithms are computationally expensive, we can use a hybrid approach where objects are detected once every N frames (and not in each frame). When not in the detecting phase, the objects are tracked as they move around the video frames. Tracking continues until the Nth frame, and then the object detector is re-run. We then repeat the entire process. The benefit of such an approach is that it allows highly accurate object detection algorithms to be applied without much computational burden.
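The control flow of this hybrid approach can be sketched as follows. The `detect` and `track` callables are placeholders for a real detector and tracker, and the value of N (`detect_every`) is an arbitrary choice for illustration.

```python
def process_stream(frames, detect, track, detect_every=5):
    """Run the expensive detector every N frames and the cheap tracker in between."""
    tracked_objects = []
    phases = []
    for index, frame in enumerate(frames):
        if index % detect_every == 0:
            # Detection phase: refresh the object list from scratch.
            tracked_objects = detect(frame)
            phases.append("detect")
        else:
            # Tracking phase: follow the objects found by the last detection.
            tracked_objects = track(frame, tracked_objects)
            phases.append("track")
    return phases, tracked_objects

# Stand-in detector and tracker so the sketch runs without a video or a model.
def dummy_detect(frame):
    return [("person", frame)]

def dummy_track(frame, objects):
    return objects

phases, last_objects = process_stream(range(12), dummy_detect, dummy_track)
print(phases.count("detect"), "detections,", phases.count("track"), "tracking steps")
```

A larger N lowers the compute cost but gives the tracker more time to drift before the detector corrects it.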

4.1 Object Detection

Object detection is a computer vision technique that allows us to determine where the object is in an image/frame. Some object detection algorithms include Faster R-CNN, Single Shot Detectors (SSD), You Only Look Once (YOLO), etc.

Illustration of YOLO Object Detector Pipeline (Source)

YOLO, being significantly faster while remaining accurate, can be used for video/real-time object detection. The YOLOv3 model is pre-trained on the COCO dataset to classify 80 different classes, including people, cars, etc. Using the machine learning concept of transfer learning (where knowledge gained from solving one problem helps solve similar problems), the pre-trained model weights developed by the darknet team can be leveraged for people counting, to detect persons in the frames of the video stream.

Following are the steps involved for object detection in each frame using the YOLO model and OpenCV –

1. Load the pre-trained YOLOv3 model using OpenCV’s DNN function-

2. Determine the output layer names/classes from YOLO model and construct blob from the frame –

3. For each object detected in ‘layerOutputs,’ filter objects labeled ‘Person’ to identify all the persons present in the video frame.
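The three steps above can be sketched roughly as below. This is an illustration under assumptions rather than the exact code used for this article: `cfg_path` and `weights_path` are hypothetical paths to the Darknet YOLOv3 files, the 416x416 input size and the 0.5 confidence threshold are common defaults rather than values taken from the text, and each detection row is assumed to follow the usual YOLO layout of four box values, an objectness score, and then one score per class.

```python
PERSON_CLASS_ID = 0  # 'person' is the first class in the COCO label list

def load_yolo(cfg_path, weights_path):
    """Step 1: load the pre-trained YOLOv3 network through OpenCV's DNN module."""
    import cv2  # imported lazily so filter_persons below can run without OpenCV
    return cv2.dnn.readNetFromDarknet(cfg_path, weights_path)

def run_yolo(net, frame, input_size=(416, 416)):
    """Step 2: find the output layer names, build a blob from the frame, run a forward pass."""
    import cv2
    output_names = net.getUnconnectedOutLayersNames()
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, input_size, swapRB=True, crop=False)
    net.setInput(blob)
    return net.forward(output_names)

def filter_persons(layer_outputs, min_confidence=0.5):
    """Step 3: keep only detections whose best-scoring class is 'person'."""
    persons = []
    for output in layer_outputs:
        for detection in output:
            scores = list(detection[5:])          # per-class scores
            class_id = scores.index(max(scores))
            confidence = scores[class_id]
            if class_id == PERSON_CLASS_ID and confidence >= min_confidence:
                box = tuple(detection[:4])        # centre x, centre y, width, height
                persons.append((box, confidence))
    return persons

# Fake output with two classes only, to exercise the filter without a model.
fake_outputs = [[
    [0.5, 0.5, 0.2, 0.4, 0.9, 0.8, 0.1],  # confident person
    [0.3, 0.3, 0.1, 0.1, 0.9, 0.2, 0.7],  # some other class
    [0.1, 0.1, 0.1, 0.1, 0.9, 0.4, 0.1],  # person, but below the threshold
]]
persons = filter_persons(fake_outputs)
print(persons)
```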

4.2 Object Tracking

The persons detected by the object detection algorithm are then handed to an object tracking algorithm, which accepts the (x, y) coordinates of where each person is in the frame and assigns a unique ID to that person. As the person moves around the video stream (across frames), the tracker predicts the object's new location in the next frame based on factors such as gradient, speed, etc. A few object tracking algorithms are Centroid Tracking, Kalman Filter tracking, etc.

Since the position of a person in the next frame is determined to a great extent by their velocity and position in the current frame, the Kalman Filter tracking algorithm is used to track both previously seen and newly detected persons. The Kalman Filter models motion in terms of position and velocity and predicts the likely next positions, representing its uncertainty with Gaussians. When it receives a new reading, it weighs the measurement against its prediction probabilistically and updates itself. Each detection is then either matched to an existing ID or assigned a new unique ID. This blog explains the maths behind Kalman Filter.
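To give a feel for the predict/update cycle, here is a minimal constant-velocity Kalman filter for a single coordinate, written in plain Python. It is a teaching sketch under simplifying assumptions: a real tracker would use a 2D state per person and add the step that matches detections to IDs, and the noise values and initial covariance below are arbitrary illustrative choices.

```python
class KalmanFilter1D:
    """Constant-velocity Kalman filter for one coordinate (state: position, velocity)."""

    def __init__(self, process_noise=0.01, measurement_noise=1.0):
        self.position = 0.0
        self.velocity = 0.0
        # Entries of the 2x2 state covariance; large values mean "initial state unknown".
        self.p00, self.p01, self.p10, self.p11 = 100.0, 0.0, 0.0, 100.0
        self.q = process_noise
        self.r = measurement_noise

    def predict(self, dt=1.0):
        """Project the state forward: position advances by velocity * dt."""
        self.position += self.velocity * dt
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.position

    def update(self, measured_position):
        """Blend the prediction with a new position measurement."""
        residual = measured_position - self.position
        s = self.p00 + self.r        # innovation variance
        k0 = self.p00 / s            # Kalman gain for position
        k1 = self.p10 / s            # Kalman gain for velocity
        self.position += k0 * residual
        self.velocity += k1 * residual
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

# A person moving 2 pixels per frame along one axis (noise-free for clarity).
kf = KalmanFilter1D()
for t in range(1, 21):
    kf.predict()
    kf.update(2.0 * t)

next_position = kf.predict()
print(round(kf.velocity, 2), round(next_position, 2))
```

Only positions are ever measured, yet the filter recovers the velocity, which is what lets it predict where a person will be in the next frame.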

4.3 Identifying the entry/exit area

To keep track of people entering/exiting a particular area of the store, based on the CCTV angle, the entry/exit area in the video stream is specified to accurately collect data of the customer journey and traffic in the store.

In the image below (the checkout counter), the yellow boundary specifies the entry/exit area, and the status of store traffic is updated accordingly.
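One simple way to update the traffic count from tracked positions is to test whether a person's centroid has moved from outside to inside the marked area. The sketch below assumes an axis-aligned rectangular zone and made-up pixel coordinates; a real boundary drawn to match a camera angle may need a polygon test instead.

```python
def inside(zone, point):
    """zone is (x_min, y_min, x_max, y_max); point is an (x, y) centroid."""
    x_min, y_min, x_max, y_max = zone
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

def count_zone_entries(zone, track):
    """Count outside-to-inside transitions along one person's track.

    A track that starts inside the zone counts as one entry.
    """
    entries = 0
    was_inside = False
    for point in track:
        now_inside = inside(zone, point)
        if now_inside and not was_inside:
            entries += 1
        was_inside = now_inside
    return entries

# Hypothetical zone and centroid positions for one tracked ID.
checkout_zone = (100, 100, 200, 200)
track = [(50, 150), (90, 150), (120, 150), (150, 150), (250, 150), (150, 150)]

entries = count_zone_entries(checkout_zone, track)
print(entries)  # 2
```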

A reference frame from the video (Source of the original video)

5. Key Challenges –

Some key challenges observed during object detection and tracking framework for footfall data capturing are listed below –

  • Speed for real-time detection – The object detection and prediction time needs to be incredibly fast to accurately capture the traffic in the store with frequently visiting customers.
  • The angle of CCTV Cameras – The camera angle should accurately capture the footfall traffic.
  • Video Clarity – For object detection and tracking algorithms to accurately capture the people in a video, the quality of the video plays an important role. The footage should not be too blurry and should have proper lighting.

6. Conclusion

The need for Store Traffic Analytics has become apparent with the growing complexity of the industry. Retail businesses face fierce competition that pressures them to guarantee that the right products and services are available to their customers.

Collecting and analyzing data that accurately reveals customer behavior has therefore become a crucial part of the process.


The Evolution of Movies : How has it changed with time?

“Cinema is a mirror by which we often see ourselves.” – Alejandro Gonzalez Inarritu

If 500 people saw a movie, there exist 500 different versions of the same idea conveyed by the movie. Often movies reflect culture, either what the society is or what it aspires to be.

From “Gone with the Wind” to “Titanic” to “Avengers: Endgame,” movies have come a long way. The technology, scale, and medium might have changed, but the essence, storytelling, has remained the same. Any idea or story that the mind can conceive can be given life in the form of movies.

The next few minutes would be a humble attempt to share the What, Why, and How of the Movie Industry.


  1. The process behind movies
  2. How did the movie industry operate?
  3. How is it functioning now?
  4. What will be the future for movies?

1. The process behind movies

We see the end output in theatres or in the comfort of our homes (on Over-The-Top (OTT) platforms like Netflix, Amazon Prime, HBO Max, etc.). But in reality, there is a long and arduous process behind movie-making.

As with all things, it starts with an idea, an idea that transcribes into a story, which then takes the form of a script detailing the filmmaker’s vision scene by scene.

They then pitch this script to multiple studios (Disney, Universal, Warner Bros., etc.). When a studio likes this script, they decide to make the film using its muscle for pre-production, production, and post-production.

The filmmaker and studio start with pre-production activities, ranging from hiring the cast and crew and choosing filming locations to constructing sets. After this, the movie goes into the production phase, where it gets filmed. In post-production, the movie gets sculpted with music, visual effects, video & sound editing, etc.

And while moviemakers often understand the art of movie-making, it becomes a product to be sold on completing production. Now they have to sell this story to the audience.

Marketing & distribution step in to ensure that this product is now available to its worldwide audience by promoting it in all possible mediums like billboards and social media. After this, it’s all about delivering an immersive, entertaining experience where your surroundings go dark and the mind lights up.

An Overview Of The Movie-Making Process

In recent times some creative flexibility has been observed in the above process. For example, studios who own certain characters or intellectual property like Marvel & DC characters hire the right talent to get their movies made. In such cases, big studios control significant aspects of the film, from content creation to distribution.

2. How did the movie industry operate?

For a considerable period, movies used to stay in theatres for a long time after their initial release before reaching other available forms of Home Entertainment (based on the technological choices/era). For example, let's take a movie that acted as the bridge between two distinct generations. Titanic transformed the industry from the old school blockbusters to the new school global hits (with technology, CGI, worldwide markets, etc.). Before the 2010s, blockbuster movies like Titanic used to run in theatres for several months. Titanic was the undisputed leader of the Box Office for nearly four months, both in terms of the number of tickets sold and worldwide revenue generated.

After a theatrical run of approximately four months, blockbuster titles used to become available in Home Entertainment (HE) formats (such as DVD, VCD, etc.). These formats were available in various options based on the decade or era. Options such as rental or purchasable DVDs ruled the HE domain for a considerable amount of time, until the emergence of the internet.

The dawn of the internet brought in other sources of entertainment to compete with traditional movies, sports, etc. These options gave the consumer alternate forms of entertainment (which resulted in shortened theatrical runs of approximately three months or less). They also gave the studios another platform on which to sell their content. Hence, the Home Entertainment release windows were fast-tracked as a natural consequence, to capitalize the most on a movie's potential.

The following is an example of the pre-2020/pandemic norms in Hollywood.

  1. December 25: Movie releases in Theatres (TH). Ex: Fast and Furious
  2. March 10 19: EST (Electronic Sell Through) release of the movie (Ex: Amazon, iTunes)
  3. April 2: iVOD/cVOD (Internet/Cable Video on Demand) release of the movie (Ex: YouTube, Comcast, Amazon)
  4. April 30: PST (Physical Sell Through) release of the movie (Ex: DVDs, Blu-ray discs)
  5. After this, the movie becomes available on Linear TV networks (Ex: HBO)

An Overview Of Movie Releases Before And After Pandemic

3. How is it functioning now?

Amid all the uncertainty surrounding the COVID pandemic, the movie industry came to a halt, as did many other integral aspects of people’s lives. Around March 2020, most theatres worldwide shut down to curb the spread of the pandemic. The forced shutdown of the movie industry immobilized crucial aspects of the filmmaking process, such as the filming & theatrical release of movies. Since it was not safe for people to gather in numbers, theatres closed, as did other forms of entertainment such as concerts, live sports, etc. A change of this magnitude, where major entertainment activities worldwide were shut down, was the first since the World Wars.

Every problem carries an opportunity, and with this change, innovation was the name of the game. Those businesses that innovated survived, and the rest were likely to perish. The foundation for this innovation was laid a long time back. With the spread of the internet, OTT (Over-The-Top) & VOD (Video on Demand) platforms were growing rapidly. OTT platforms like Netflix & Amazon Prime were significant players in the US and worldwide even before the pandemic began.

The shutting down of theatres meant that some movies slated for 2020 had to wait for release dates. In the movie industry, movies are often planned well in advance. Major studios are likely to have tentative release dates for the upcoming 2 to 3 years. Delaying the movies of the current year not only cumulatively delays the subsequent year’s release dates but also erodes the potential of the film (due to factors like heavy competition later, loss of audience interest, etc.).

Major studios & industry leaders led the way with innovation. A new format (Premium Video on Demand) and a new release strategy were the most viable options to ensure the movie’s release, guaranteeing both financial and viewership success.

The new format, PVOD (Premium Video on Demand), essentially released the iVOD/cVOD rental formats earlier by shortening the pre-pandemic norm of a 12-week window after the theatrical release.

There were two ways of doing this. The first was a Day and Date release in PVOD, which meant the audience could watch a new movie (Ex: Trolls World Tour, Scoob!) on its first release date in the comfort of their homes via the PVOD/rental channels (Ex: Amazon/iTunes).

The second way was to release the movie in PVOD 2 to 8 weeks after its release in theatres. This happened once people got used to the new normal during the pandemic, when theatres across the world opened partially with limited seating capacity (50%). This meant that a movie would release exclusively in theatres first (as it did previously). However, the traditional Home Entertainment window of 12 weeks was bypassed to release in PVOD at an early window of 2 to 8 weeks after the theatrical release. This was key in catering to a cautious audience during the pandemic between 2020 and 2021. It enabled them to watch a newly released movie in the comfort of their homes within a couple of weeks of its initial release.

A similar strategy was also tried with EST, where an early EST release (Premium EST or PEST) was offered to people at an early release window. The key difference is that PEST and PVOD were sold at higher price points (25% higher than EST/iVOD/cVOD) due to their exclusivity and early access.

The other strategy was a path-breaking option that opened the movie industry to numerous viable release possibilities: a direct OTT release. A movie that is waiting for its release and does not want to use the PVOD route, due to profitability issues or other reasons, can now be released directly on OTT platforms like Netflix & Amazon Prime. These platforms, which were previously producing small to medium-scale series & movies, now have the chance to release potential blockbuster movies. Studios also get to reach millions of customers across the globe at the same time by skipping certain cumbersome aspects of the conventional theatrical distribution networks (which include profit-sharing mechanisms). This OTT release route offers many advantages to all parties involved (Studios, OTT Platforms & Audiences) and widens the pool of potential customers.

The studios either get a total remuneration paid for the movie upfront (e.g., Netflix made a $200 Million offer for Godzilla vs. Kong to its producers, Legendary Entertainment & Warner Bros.), get paid later based on the number of views gathered in a given period, or get a combination of both (depending upon the scale & genre of the movie). The OTT platforms now have a wide array of the latest movies across all genres to attract & retain customers. The audience now gets to watch new movies on their preferred OTT platforms at their convenience and gets great value for the money spent (OTT 1-month subscription ~$10 for the new movie + the existing array of movies & series vs. ~$10 Theatre Ticket Price for one movie).

Overview of Major OTT Platform in US

Given that there are two new gateways (OTT & PVOD) to release films, in addition to the existing conventional mediums such as Theatres, EST, iVOD, cVOD, and PST, there are numerous beneficial ways a movie can be released to reach the maximum number of people & make the most profit for the filmmakers & studios.

Release Strategy Examples

In the above example, releasing a movie directly in OTT in parallel to theatrical release attracts more subscribers to the OTT platform and covers the traditional theatrical audiences.

In the second example, let’s take a direct-to-Home-Entertainment approach, targeting audiences directly through PVOD & early OTT releases, similar to movies that were released during the pandemic, like Trolls World Tour & Scoob!

The third example shows a possibility where a movie can leverage all existing major platforms for a timely release.

Since there are hundreds of possibilities for any studio or filmmaker to release their movies, how would one know the best release strategy for a movie? Does a one-size-fits-all method work? Or do we scale and change release strategies according to the budget/content of the movie? Are films of certain genres better suited to the large-scale theatre experience than to Home Entertainment? Who decides this? That should be a straightforward answer. In most cases, the one who finances the film decides the release strategy. But how would they know what combination ensures the maximum success to recoup the amount invested and guarantee a profit for all involved?

In such an uncertain industry, where more movies fail than succeed (considering the bare minimum of breaking even), the pandemic-induced multiple release strategies compound the existing layers of complexity.

In an ocean of uncertainties, the ship with a compass is likely to reach the shore safely. The compass, in this case, is Analytics. Analytics, Insights & Strategy provide the direction to take the movie across to the shores safely and profitably.

Analytics, Insights & Strategy (AIS) helps deal with the complex nature of movies and provides a clear direction for decision making, from optimal marketing spend recommendations to profitable release strategies. There are thousands of films with numerous data points. When complex machine learning models leverage all this data, they yield eye-opening insights that help industry leaders make smart decisions. Capitalizing on such forces eases the difficulties in creating an enjoyable & profitable movie.

4. What will be the future for movies?

The Entertainment industry evolves as society progresses. Movies & theatres have stood the test of time for decades. There will always be a need for a convincing story, and there will always be people to appreciate good stories. What seems to be a pandemic-induced shift into the world of online entertainment & OTT platforms was in fact inevitable; unexpected external factors only fast-tracked it.

What the future holds for this industry is exciting for both the filmmakers and the audiences. The audiences have the liberty to watch movies across their preferred mediums early on, rather than through the conventional, long-drawn, theatrical-only route. The studios now have more ways to engage audiences with their content. In addition to the theatrical experience, they can reach more people faster while ensuring they run a profitable business.

We will soon start seeing more movies & studios using the OTT platforms for early releases and the conventional theatre first releases with downstream combinations of other Home Entertainment forms to bring the movie early to the audience on various platforms.

On an alternate note, in the future, we might be in a stage where Artificial Intelligence (AI) could be generating scripts or stories based on user inputs for specific genres. An AI tool could produce numerous scripts for filmmakers to choose from. It is exciting to think of its potential, for example, say in the hands of an ace director like Christopher Nolan with inputs given to the AI tool based on movies like Tenet or Inception.

Post-Pandemic, when life returns to normal, we are likely to see star-studded, big-budget movies directly being released on Netflix or HBO Max, skipping the conventional theatrical release. Many filmmakers have expressed concerns that the rise of OTT may even lead to the death of theatres.

That said, I do not think that the theatres would perish. Theatres were and will always be a social experience to celebrate larger-than-life movies. The number of instances where people go to theatres might reduce since new movies will be offered in the comfort of their homes.

With all this discussion surrounding making profitable movies, with the help of Analytics, Insights & Strategy, why don’t filmmakers and studios stop after making a couple of profitable movies?

The answer is clear, as stated by Walt Disney, one of the brightest minds of the 20th century, “We don’t make movies to make money, we make money to make more movies.”



Hotel Recommendation Systems: What is it and how to effectively build one?

What is a Hotel Recommendation System?

A hotel recommendation system aims at suggesting properties/hotels to a user such that they would prefer the recommended property over others.

Why is a Hotel Recommendation System required?

In today’s data-driven world, it would be nearly impossible to follow the traditional heuristic approach to recommend millions of users an item that they would actually like and prefer.

Hence, a Recommendation System solves our problem where it incorporates user’s input, historical interaction, and sometimes even user’s demographics to build an intelligent model to provide recommendations.


In this blog, we will cover all the steps required to build a Hotel Recommendation System for the problem statement mentioned below. We will do an end-to-end implementation covering data understanding, data pre-processing, and the algorithms used, along with their PySpark code.

Problem Statement: Build a recommendation system providing hotel recommendations to users for a particular location they have searched for on

What type of data are we looking for?

Building a recommendation system requires two sources of data, explicit and implicit signals.

Explicit data is the user’s direct input, like filters (4 star rated hotel or preference of pool in a hotel) that a user applies while searching for a hotel. Information such as age, gender, and demographics also comes under explicit signals.

Implicit data can be obtained by users’ past interactions, for example, the average star rating preferred by the user, the number of times a particular hotel type (romantic property) is booked by the user, etc.

What data are we going to work with?

We are going to work with the following:

  1. Explicit signals where a user provides preferences for what type of amenities they are looking for in a property
  2. Historical property bookings of the user
  3. Users’ current search results from where we may or may not get information regarding the hotel that a user is presently interested in

Additionally, we have the property information table (hotel_info table), which looks like the following:

hotel_info table

Note: We can create more property types beyond the above 4 (Wi-Fi, couple, etc.), chosen so that as many properties as possible fall under at least one property type. However, for simplicity, we will continue with these 4 property types.

Data Understanding and Preparation:

Consider that the searches data is in the following format:

user_search table

Understanding user_search table:

From the table, we can extract information about the user (user ID), the location they are searching in (Location ID), their check-in and check-out dates, the preferences applied while making the search (Amenity Filters), the property specifically looked into while searching (Property ID), and whether they are about to book that property (Abandoned cart = ‘yes’ means that they have yet to make the booking and only the payment is left).

Clearly, we do not have all the information for every search made by a user. Hence, we are going to split the users into 3 categories: explicit users (users whose amenity filter column is not null), abandoned users (users whose abandoned cart column is ‘yes’), and historical users (users for whom we have historical booking information).

Preparing the data:

To split the users into the 3 categories (explicit, abandoned, historical), we give preference in the following order: abandoned users > explicit users > historical users. This order is chosen for the following reasons:

The abandoned cart gives us information about the product the user was just about to purchase. We can exploit this information to give recommendations similar to the product in the cart, since the abandoned product represents what the user prefers. Hence, abandoned users get the highest priority.

An explicit signal is an input given directly by the user, who states their preferences through the Amenities column. Hence, explicit users come next in the order.

Splitting the users can be done following the steps below:

First, create a new column, user_type, under which each user is assigned one of the types: abandoned, explicit, or historical.

Creating a user_type column can be done using the following logic:

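# Assumption: the read call is truncated in the source; spark.read.parquet stands in for whichever reader matches the data
df_user_searches = spark.read.parquet('xyz…….')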

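from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit

# Abandoned users: cart left at the payment step and a property ID is present
df_abandon = (df_user_searches
    .withColumn('abandon_flag', F.when(col('Abandon_cart').like('yes') & col('Property_ID').isNotNull(), lit(1)).otherwise(lit(None)))
    .filter('abandon_flag = 1')
    .withColumn('user_type', lit('abandoned_users'))
    .drop('abandon_flag'))

# Explicit users: not already classified as abandoned, with at least one known amenity filter
df_explicit = (df_user_searches
    .join(df_abandon.select('user_ID'), 'user_ID', 'left_anti')
    .withColumn('expli_flag', F.when(col('Amenity_Filters').like('%Wifi Availibility%') | col('Amenity_Filters').like('%Nature Friendly%') | col('Amenity_Filters').like('%Budget Friendly%') | col('Amenity_Filters').like('%Couple Friendly%'), lit(1)).otherwise(lit(None)))
    .filter('expli_flag = 1')
    .withColumn('user_type', lit('explicit_users'))
    .drop('expli_flag'))

# Historical users: everyone who is neither abandoned nor explicit
df_classified = df_abandon.select('user_ID').union(df_explicit.select('user_ID')).distinct()
df_historical = df_user_searches.join(df_classified, 'user_ID', 'left_anti').withColumn('user_type', lit('historical_user'))

# unionByName matches columns by name, since the joins above can reorder them
df_final = df_explicit.unionByName(df_abandon).unionByName(df_historical)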
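The priority rule above (abandoned > explicit > historical) can also be checked on a handful of rows without Spark. The following is only a minimal plain-Python sketch: the sample rows, the KNOWN_AMENITIES tuple, and the assign_user_types helper are hypothetical and exist purely for illustration.

```python
# Hypothetical rows mirroring the user_search columns described above.
searches = [
    {"user_ID": "U1", "Abandon_cart": "yes", "Property_ID": "X", "Amenity_Filters": None},
    {"user_ID": "U1", "Abandon_cart": "no", "Property_ID": None, "Amenity_Filters": "Budget Friendly"},
    {"user_ID": "U2", "Abandon_cart": "no", "Property_ID": None, "Amenity_Filters": "Wifi Availibility"},
    {"user_ID": "U3", "Abandon_cart": "no", "Property_ID": None, "Amenity_Filters": None},
]

KNOWN_AMENITIES = ("Wifi Availibility", "Nature Friendly", "Budget Friendly", "Couple Friendly")


def assign_user_types(rows):
    """Return {user_ID: user_type}, applying abandoned > explicit > historical."""
    abandoned, explicit, everyone = set(), set(), set()
    for row in rows:
        user = row["user_ID"]
        everyone.add(user)
        # Abandoned: cart left at payment and a concrete property is known
        if row["Abandon_cart"] == "yes" and row["Property_ID"] is not None:
            abandoned.add(user)
        # Explicit: at least one recognised amenity filter was applied
        filters = row["Amenity_Filters"] or ""
        if any(amenity in filters for amenity in KNOWN_AMENITIES):
            explicit.add(user)
    user_types = {}
    for user in everyone:
        if user in abandoned:
            user_types[user] = "abandoned_users"
        elif user in explicit:
            user_types[user] = "explicit_users"
        else:
            user_types[user] = "historical_user"
    return user_types


user_types = assign_user_types(searches)
```

Note that U1 also applied an amenity filter in another search, but the abandoned cart takes precedence, so U1 is treated as an abandoned user.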

Now, the user_search table has the user_type as well. Additionally, we prepare the user features for each user type.

For explicit users, user_feature columns will look like this:

explicit_users_info table

For abandoned users, after joining the property ID provided by the user with that in the hotel_info table, the output will look like the following:

abandoned_users_info table

For historical users, sum over the user and calculate the total number of times the user has booked a particular property type; the data will look like the following:

historical_users_info table

For U4 in the historical_users_info table, we have information that tells us that the user prefers an average star rating of 4, has booked a WiFi property 5 times, and so on. Together, these values tell us the attribute preferences of the user.
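The aggregation for historical users can be sketched in plain Python as follows. This is an illustration only: the booking rows are made up, and the input column names are assumed to match the hotel attribute names used later in this blog.

```python
from collections import defaultdict

# Hypothetical booking history: one row per past booking, carrying the booked
# property's 0/1 type flags and its star rating (column names are assumptions).
bookings = [
    {"user_ID": "U4", "Wifi_Availibility": 1, "Couple_Friendly": 0,
     "Budget_Friendly": 1, "Nature_Friendly": 0, "Star_Rating": 4},
    {"user_ID": "U4", "Wifi_Availibility": 1, "Couple_Friendly": 1,
     "Budget_Friendly": 0, "Nature_Friendly": 0, "Star_Rating": 5},
    {"user_ID": "U4", "Wifi_Availibility": 0, "Couple_Friendly": 0,
     "Budget_Friendly": 1, "Nature_Friendly": 1, "Star_Rating": 3},
]

# Maps each hotel attribute to the user-level feature it feeds
FLAG_COLUMNS = {
    "Wifi_Availibility": "wifi_flag",
    "Couple_Friendly": "couple_flag",
    "Budget_Friendly": "budget_flag",
    "Nature_Friendly": "nature_flag",
}


def summarise_history(rows):
    """Count bookings per property type and average the star rating per user."""
    counts = defaultdict(lambda: dict.fromkeys(FLAG_COLUMNS.values(), 0))
    star_totals = defaultdict(float)
    n_bookings = defaultdict(int)
    for row in rows:
        user = row["user_ID"]
        for hotel_col, user_col in FLAG_COLUMNS.items():
            counts[user][user_col] += row[hotel_col]
        star_totals[user] += row["Star_Rating"]
        n_bookings[user] += 1
    return {
        user: dict(flags, avg_star_rating=star_totals[user] / n_bookings[user])
        for user, flags in counts.items()
    }


history = summarise_history(bookings)
```

In PySpark, the same result would typically come from a groupBy on the user ID followed by sum and avg aggregations.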

Building the Recommendation System:

Data at hand:

We have the users split by type, with each user’s preferences stored as user_features

We have the hotel attributes from the hotel_type table. Assume that it contains the following values:

hotel_type table

We will use content-based filtering in building our recommendation model. For each of the splits, we will use the algorithm that gives us the best result. To gain a better understanding of recommendation systems and content-based filtering, one can refer here.

Note: We have to give recommendations based on the location searched by the user. Hence, we will perform a left join on the key Location ID to get all the properties that are there in the location.

Building the system:

For Explicit users, we will proceed in the following way:

We have user attributes like wifi_flag, budget_flag, etc. Join this with the hotel_type table on the location ID key to get all the properties and their attributes

Performing Pearson correlation will give us a score (in the range [-1, 1]) between the user and hotel features, which eventually helps us provide recommendations in that location

Code for explicit users:

import numpy as np
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

explicit_users_info = explicit_users_info.drop('Property_ID')

expli_dataset = explicit_users_info.join(hotel_type, ['location_ID'], 'left').drop('star_rating')

header_user_expli = ['wifi_flag', 'couple_flag', 'budget_flag', 'nature_flag']

header_hotel_features = ['Wifi_Availibility', 'Couple_Friendly', 'Budget_Friendly', 'Nature_Friendly']

assembler_features = VectorAssembler(inputCols=header_user_expli, outputCol='user_features')

assembler_features_2 = VectorAssembler(inputCols=header_hotel_features, outputCol='hotel_features')

tmp = [assembler_features, assembler_features_2]

pipeline = Pipeline(stages=tmp)

# Assumed completion: the right-hand side is missing in the source
baseData = pipeline.fit(expli_dataset).transform(expli_dataset)

df_final = baseData

def pearson(a, b):
    # Convert Spark vectors (dense or sparse) to NumPy arrays
    a, b = a.toArray(), b.toArray()
    denominator = np.std(a) * np.std(b) * len(a)
    # Pearson is undefined when either vector is constant; return null then
    if denominator == 0:
        return None
    numerator = np.sum(np.multiply(a - np.average(a), b - np.average(b)))
    return float(numerator / denominator)

pearson_sim_udf = udf(pearson, FloatType())

pearson_final = df_final.withColumn('pear_correlation_res', pearson_sim_udf('user_features', 'hotel_features'))
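To get a feel for the scores before running this on Spark, the same Pearson formula can be tried on toy vectors. The sketch below uses only the standard library; the four-element vectors (in the order wifi, couple, budget, nature) are hypothetical examples, not values from the tables in this blog.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists; None when undefined."""
    n = len(a)
    a_avg = sum(a) / n
    b_avg = sum(b) / n
    cov = sum((x - a_avg) * (y - b_avg) for x, y in zip(a, b))
    a_var = sum((x - a_avg) ** 2 for x in a)
    b_var = sum((y - b_avg) ** 2 for y in b)
    # A constant vector has zero variance, so the correlation is undefined
    if a_var == 0 or b_var == 0:
        return None
    return cov / math.sqrt(a_var * b_var)


user = [1, 0, 1, 0]               # user asked for wifi and budget friendly
hotel_match = [1, 0, 1, 0]        # offers exactly those two
hotel_opposite = [0, 1, 0, 1]     # offers only the other two
hotel_everything = [1, 1, 1, 1]   # offers all four amenities

score_match = pearson(user, hotel_match)
score_opposite = pearson(user, hotel_opposite)
score_everything = pearson(user, hotel_everything)
```

One thing this makes visible: a hotel (or user) whose flags are all identical has zero variance, so its Pearson score is undefined. Such rows need a fallback rule, which is a design decision left open here.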


Our output will look like the following:

explicit users

For abandoned and historical users, we will proceed as follows:

Using the data created above, i.e., abandoned_users_info and historical_users_info tables, we obtain user preferences in the form of WiFi_Availibility or wifi_flag, star_rating or avg_star_rating, and so on

Join it with the hotel_type table on the location ID key to get all the hotels and their attributes

Perform Cosine Similarity to find the best hotel to recommend to the user in that particular location

Code for abandoned users:

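# Assumed completion: the lines after the continuation are missing in the source;
# the a_ prefixed renames are inferred from the header_user_aban columns used below
abandoned_users_info = abandoned_users_info.drop('Property_ID')\
.withColumnRenamed('Wifi_Availibility', 'a_Wifi_Availibility')\
.withColumnRenamed('Couple_Friendly', 'a_Couple_Friendly')\
.withColumnRenamed('Budget_Friendly', 'a_Budget_Friendly')\
.withColumnRenamed('Nature_Friendly', 'a_Nature_Friendly')\
.withColumnRenamed('Star_Rating', 'a_Star_Rating')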
abandoned_dataset = abandoned_users_info.join(hotel_type, ['location_ID'], 'left')

header_user_aban = ['a_Wifi_Availibility', 'a_Couple_Friendly', 'a_Budget_Friendly', 'a_Nature_Friendly', 'a_Star_Rating']

header_hotel_features = ['Wifi_Availibility', 'Couple_Friendly', 'Budget_Friendly', 'Nature_Friendly', 'Star_Rating']

assembler_features = VectorAssembler(inputCols=header_user_aban, outputCol='user_features')

assembler_features_2 = VectorAssembler(inputCols=header_hotel_features, outputCol='hotel_features')

tmp = [assembler_features, assembler_features_2]

pipeline = Pipeline(stages=tmp)

# Assumed completion: the right-hand side is missing in the source
baseData = pipeline.fit(abandoned_dataset).transform(abandoned_dataset)

df_final = baseData

def cos_sim(value, vec):
    # Convert Spark vectors (dense or sparse) to NumPy arrays
    value, vec = value.toArray(), vec.toArray()
    norm_product = np.linalg.norm(value) * np.linalg.norm(vec)
    # Cosine similarity is undefined for an all-zero vector; return null then
    if norm_product == 0:
        return None
    return float(np.dot(value, vec) / norm_product)

cos_sim_udf = udf(cos_sim, FloatType())

abandon_final = df_final.withColumn('cosine_dis', cos_sim_udf('user_features', 'hotel_features'))


Code for historical users:

historical_dataset = historical_users_info.join(hotel_type, ['location_ID'], 'left')

header_user_hist = ['wifi_flag', 'couple_flag', 'budget_flag', 'nature_flag', 'avg_star_rating']

header_hotel_features = ['Wifi_Availibility', 'Couple_Friendly', 'Budget_Friendly', 'Nature_Friendly', 'Star_Rating']

assembler_features = VectorAssembler(inputCols=header_user_hist, outputCol='user_features')

assembler_features_2 = VectorAssembler(inputCols=header_hotel_features, outputCol='hotel_features')

tmp = [assembler_features, assembler_features_2]

pipeline = Pipeline(stages=tmp)

# Assumed completion: the right-hand side is missing in the source
baseData = pipeline.fit(historical_dataset).transform(historical_dataset)

df_final = baseData

def cos_sim(value, vec):
    # Convert Spark vectors (dense or sparse) to NumPy arrays
    value, vec = value.toArray(), vec.toArray()
    norm_product = np.linalg.norm(value) * np.linalg.norm(vec)
    # Cosine similarity is undefined for an all-zero vector; return null then
    if norm_product == 0:
        return None
    return float(np.dot(value, vec) / norm_product)

cos_sim_udf = udf(cos_sim, FloatType())

historical_final = df_final.withColumn('cosine_dis', cos_sim_udf('user_features', 'hotel_features'))


Our output will look like the following:

historical users

abandoned users

Giving Recommendations:

Giving 3 recommendations per user, our final output will look like the following:


One can notice that hotel X is not given as the first recommendation for the abandoned user U1. We avoid this because the user’s features were created from that same property ID, so it would always be at rank 1.

Unlike cosine similarity, where 0’s are treated as a negative preference, Pearson correlation does not penalize the user if no input is given; hence, we use the latter for explicit users.
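The ranking step, including the exclusion of the abandoned property itself, can be sketched in plain Python as follows. The hotel vectors (in the order wifi, couple, budget, nature, star rating), the property IDs A to D, and the top_n helper are hypothetical and only illustrate the idea.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length lists; None for a zero vector."""
    norm_product = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm_product == 0:
        return None
    return sum(x * y for x, y in zip(u, v)) / norm_product


def top_n(user_vec, hotels, n=3, exclude=None):
    """Rank hotels by similarity to the user, skipping the excluded property."""
    scored = []
    for property_id, hotel_vec in hotels.items():
        if property_id == exclude:
            continue
        score = cosine_similarity(user_vec, hotel_vec)
        if score is not None:
            scored.append((property_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [property_id for property_id, _ in scored[:n]]


# Hypothetical hotels in the searched location
hotels = {
    "X": [1, 0, 1, 0, 4],   # the property left in U1's cart
    "A": [1, 0, 1, 0, 5],
    "B": [0, 1, 0, 1, 3],
    "C": [1, 1, 1, 0, 4],
    "D": [0, 0, 0, 1, 2],
}

# U1's features are copied from the abandoned property X
u1_features = hotels["X"]

with_x = top_n(u1_features, hotels)
without_x = top_n(u1_features, hotels, exclude="X")
```

Without the exclusion, X ranks first simply because it is being compared with itself. Also note that the star rating is on a larger scale than the 0/1 flags, so it dominates the cosine score; whether to rescale it is a modelling choice this sketch does not settle.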


In the end, the objective is to fully understand the problem statement, work around the data available, and provide recommendations with a nascent system.

Marketing Mix Modelling: What drives your ROI?

There was a time when we considered traditional marketing practices, and the successes or failures they yield, as an art form. With mysterious, untraceable results, marketing efforts lacked transparency and were widely regarded as being born out of the creative talents of star marketing professionals. But the dynamics switched, and the regime of analytics came into power. It has evolved over time, and numerous methodologies have been developed in this regard. The marketing mix model is one of the most popular of those methods.

The key purpose of a Marketing Mix Model is to understand how various marketing activities together contribute to driving the sales of a given product. Through MMM, the effectiveness of each marketing input/channel can be assessed in terms of Return on Investment (ROI). In other words, a marketing input/channel with a higher ROI is more effective than others with a lower ROI. Such understanding facilitates effective marketing decisions with regard to spend allocation across channels.

Marketing Mix Modelling is a statistical technique of determining the effectiveness of marketing campaigns by breaking down aggregate data and differentiating between contributions from marketing tactics and promotional activities, and other uncontrollable drivers of success. It is used as a decision-making tool by brands to estimate the effectiveness of various marketing initiatives in increasing Return on Investment (ROI).

Whenever we change our methodologies, it is human nature to have questions. Let’s dive deep into the MMM technique and address these questions in detail.

Question 1: How is the data collected? How much minimum data is required?

An MMM requires a brand’s product data to collectively capture the impact of key drivers such as marketing spends, price, discounts, social media presence/sentiment of the product, event information, etc. As with any analytical method, the more data there is, the better the modelling technique can be implemented and the more robust the results will be. Hence, these methods are highly driven by the quantum of data available to develop the model.

Question 2: What level of data granularity is required/best for MMM?

A best practice for any analytical methodology that aims to generate valuable insights is to have data that is as granular as possible. For example, Point-of-Sale data at the Customer-Transaction-Item level will yield recommendations for a highly focused marketing strategy at a similar granularity. However, if needed, the data can always be rolled up to any aggregated level suitable for the business requirement.

Question 3: Which sales drivers are included in the marketing mix model?

In order to develop a robust and stable Market Mix Model, various sales drivers such as Price, Distribution, Seasonality, Macroeconomic variables, Brand Affinity, etc. play a pivotal role in understanding consumer behaviour towards the product. Even more important are the features that capture the impact of marketing efforts for the product. Such features provide an insight into how consumers react to the respective marketing efforts or the impact of these efforts on the product.

Question 4: How do you ensure the accuracy of the data inputs?

Ensuring the accuracy of data inputs is very subjective and depends on the business. On many occasions, direct imputation is not very helpful and would skew the results. Further sanity checks and statistical tests, such as examining the distribution of each feature, can be carried out.

MMM Components –

In Market Mix Modelling sales are divided into 2 components:

Base Sales:

Base Sales is what marketers get if they do not do any advertisement. It is sales due to brand equity built over the years. Base Sales are usually fixed unless there is some change in economic or environmental factors.

Base Drivers:
  1. Price: The price of a product is a significant base driver of sales as price determines both the consumer segment that a product is targeted toward and the promotions which are implemented to market the product to the chosen audience.
  2. Distribution: The number of store locations, their respective inventories, and the shelf life of that stock are all considered base drivers of sales. Store locations and inventory are static and are implicitly understood by customers without any marketing intervention.
  3. Seasonality: Seasonality refers to variations that occur in a periodic manner. Seasonal opportunities are enormous, and often they are the most commercially critical times of the year. For example, a major share of electronics sales occurs around the holiday season.
  4. Macro-Economic Variables: Macro-economic factors greatly influence businesses and hence, their marketing strategies. Understanding of macro factors like GDP, unemployment rate, purchase power, growth rate, inflation and consumer sentiment is very critical as these factors are not under the control of businesses yet substantially impact them.

Incremental Sales:

These are the sales generated by marketing activities such as TV advertisements, print advertisements, digital spends, promotions, etc. Total incremental sales are split into the sales from each input to calculate its contribution to total sales.

Incremental Drivers:
  1. Media Ads: Promotional media ads form the core of MMM. They penetrate the market and the competition deeply and create awareness about the product’s key features & other aspects. Numerous media channels are available, such as TV, print ads, digital ads, social media, direct mail marketing campaigns, in-store marketing, etc.
  2. Product Launches: Marketers invest carefully to position the new product into the market and plan marketing strategies to support the new launch.
  3. Events & Conferences: Brands need to look for opportunities to build relationships with prospective customers and promote their product through periodic events and conferences.
  4. Behavioural Metrics: Variables like touch points, online behaviour metrics and repurchase rate provide deeper insights into customers for businesses.
  5. Social Metrics: Brand reach or recognition on social platforms like Twitter, Facebook, YouTube, blogs, and forums can be measured through indicative metrics like followers, page views, comments, views, subscriptions, and other social media data. Other social media data like the types of conversations and trends happening in your industry can be gathered through social listening.

Ad-stock Theory –

Ad-stock, or goodwill, is the cumulative value of a brand’s advertising at a given point in time. For example, if a company advertises its product over 10 weeks, then for any given week t the effective spend would be X (that week’s spend) plus a fractional amount carried over from past weeks.

Ad-stock theory states that the effect of advertising is not immediate and has diminishing returns, meaning that its influential power decreases over time, even if more money is allocated to it. Therefore, time regression analysis will help marketers understand the potential timeline for advertising effectiveness and how to optimize the marketing mix to compensate for these factors.

  1. Diminishing Returns: The underlying principle for TV advertisement is that exposure to TV ads creates awareness in the customers’ minds up to a certain extent. Beyond that, the impact of exposure to ads starts diminishing. Each incremental amount of GRP (which stands for “Gross Rating Point” and measures the impact of an advertisement) has a lower effect on sales or awareness. So, the incremental sales generated from incremental GRP start to diminish and eventually saturate. This effect can be seen in the above graph, where the relationship between TV GRP and sales is non-linear. This type of relationship is captured by taking the exponential or log of GRP.
  2. Carry over effect or Decay Effect: The impact of past advertisement on present sales is known as the carry over effect. A small component termed lambda is multiplied with the past month’s GRP value. This component is also known as the decay effect, as the impact of previous months’ advertisement decays over time.
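The two effects above can be sketched in a few lines of Python. The recursion used here (this period’s GRP plus lambda times the previous ad-stock) is the commonly used geometric form and is an assumption, as are the sample GRP values, the decay of 0.5, and the choice of log(1 + x) as the saturating transform.

```python
import math


def adstock(grps, decay):
    """Geometric ad-stock: this period's GRP plus a decayed share of the past."""
    carried = 0.0
    result = []
    for grp in grps:
        carried = grp + decay * carried
        result.append(carried)
    return result


def saturate(value):
    """Log transform so that each extra GRP adds less than the one before."""
    return math.log1p(value)


weekly_grps = [100, 0, 0, 50]          # hypothetical weekly TV GRPs
stock = adstock(weekly_grps, decay=0.5)

# Diminishing returns: the second 100 GRPs add less than the first 100
first_hundred = saturate(100) - saturate(0)
second_hundred = saturate(200) - saturate(100)
```

With a decay of 0.5, the burst of 100 GRPs in week 1 still contributes 50 in week 2 and 25 in week 3, even though nothing was aired in those weeks. In practice, the decay value is usually tuned per channel rather than fixed in advance.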

Implementation details:

The most common marketing mix modelling regression techniques used are:

  1. Linear regression
  2. Multiplicative regression

1. Linear Regression Model:

Linear regression can be applied when the DV (dependent variable) is continuous and the relationship between the DV and the IDVs (independent variables) is assumed to be linear. The relationship can be defined using the equation:

Sales = β0 + β1X1 + β2X2 + … + βnXn + ε

Here ‘Sales’ is the dependent variable to be estimated, the Xi are the independent variables, and ε is the error term. The βi’s are the regression coefficients. The difference between the observed outcome Sales and the predicted outcome sales is known as the prediction error. Regression analysis is mainly used for causal analysis, forecasting the impact of a change, forecasting trends, etc. However, this method does not perform well on large amounts of data, as it is sensitive to outliers, multicollinearity, and cross-correlation.
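As a small illustration of how the βi’s are estimated, the sketch below fits an additive model by ordinary least squares using the normal equations. It is written with the standard library only so that it stays self-contained; the data are synthetic (generated from Sales = 100 + 2·TV − 3·Price with no noise), and the helper names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bn] minimising the sum of squared errors."""
    design = [[1.0] + list(row) for row in rows]   # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * target for d, target in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: columns are (TV GRP, price)
drivers = [(10, 5), (20, 4), (30, 6), (40, 5), (50, 7)]
sales = [100 + 2 * tv - 3 * price for tv, price in drivers]

coefficients = fit_ols(drivers, sales)
```

Because the data contain no noise, the fit recovers the intercept and the two coefficients used to generate them. With real data, a statistics library would normally be used instead, since it also reports standard errors and diagnostics.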

2. Multiplicative Regression Models-

Additive models imply a constant absolute effect of each additional unit of the explanatory variables. They are suitable only if businesses operate in stable environments and are not affected by interactions among the explanatory variables. But in scenarios such as when pricing is zero, the sales (DV) will become infinite.

To overcome the limitations inherent in linear models, multiplicative models are often preferred. These models offer a more realistic representation of reality than additive linear models do. In these models, IDVs are multiplied together instead of added.

There are two kinds of multiplicative models:

Semi-Logarithmic Models-

In Log-Linear models, the exponentials of the independent variables are multiplied: Sales = exp(β0) × exp(β1X1) × … × exp(βnXn) × exp(ε).

A logarithmic transformation of the target variable linearizes the model form to ln(Sales) = β0 + β1X1 + … + βnXn + ε, which can then be estimated as an additive model. The log transformation of the dependent variable is the only difference between the additive model and the semi-logarithmic model.

Some of the benefits of Log-Linear models are:

  1. The coefficients β can be interpreted as the approximate % change in the business outcome (sales) for a unit change in the corresponding independent variable.
  2. Each independent variable in the model works on top of what has been already achieved by other drivers.
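
A small sketch of the first benefit: fit a straight line to ln(sales) and read the slope as a percentage effect. The helper `fit_line`, the predictor `promo_units` and the data are hypothetical, and the data are generated with a known effect of 0.05 so the result can be checked.

```python
import math

def fit_line(x, y):
    """Closed-form least squares for y = beta0 + beta1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    beta1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - beta1 * mean_x, beta1

# Hypothetical data: sales = 100 * exp(0.05 * promo_units).
promo_units = [0.0, 1.0, 2.0, 3.0, 4.0]
sales = [100.0 * math.exp(0.05 * p) for p in promo_units]

# Regress ln(sales) on the untransformed predictor.
log_sales = [math.log(s) for s in sales]
beta0, beta1 = fit_line(promo_units, log_sales)

# Exact % change in sales for one extra unit; close to 100 * beta1 when beta1 is small.
pct_change_per_unit = (math.exp(beta1) - 1.0) * 100.0
```

Here the slope is 0.05, so one extra unit lifts sales by about 5% (5.13% exactly). The "% change" reading of β is an approximation that holds for small coefficients.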

Logarithmic Models-

In Log-Log models, the independent variables are also log-transformed, in addition to the target variable:

ln(Sales) = β0 + β1·ln(X1) + … + βn·ln(Xn) + ε

In the case of non-linear regression models, the elasticity formula defined above needs to be adjusted according to the model equation. Refer to the table below.
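
The sketch below assumes the standard definition of elasticity, (dY/dX) · (X/Y), and applies it to the three model forms discussed above: linear, log-linear and log-log. The function names and numbers are illustrative only, and a numerical derivative is used to cross-check each closed-form expression.

```python
import math

def elasticity_linear(beta, x, y):
    """Linear model y = b0 + beta*x: elasticity depends on the point (x, y)."""
    return beta * x / y

def elasticity_log_linear(beta, x):
    """Log-linear model ln(y) = b0 + beta*x: elasticity scales with x."""
    return beta * x

def elasticity_log_log(beta):
    """Log-log model ln(y) = b0 + beta*ln(x): the coefficient is the elasticity."""
    return beta

def numeric_elasticity(f, x, h=1e-6):
    """Central-difference estimate of (dy/dx) * (x / y) at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h) * x / f(x)

x0 = 25.0
linear = lambda x: 50.0 + 2.0 * x                         # y = 100 at x0
log_linear = lambda x: math.exp(3.0 + 0.02 * x)
log_log = lambda x: math.exp(1.0 + 0.5 * math.log(x))

checks = [
    (elasticity_linear(2.0, x0, linear(x0)), numeric_elasticity(linear, x0)),
    (elasticity_log_linear(0.02, x0), numeric_elasticity(log_linear, x0)),
    (elasticity_log_log(0.5), numeric_elasticity(log_log, x0)),
]
```

All three hypothetical models have an elasticity of 0.5 at x = 25, but only in the log-log form is that value constant across the whole range of x.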

Statistical significance –

Once the model has been generated, it should be checked for validity and prediction quality. Depending on the nature of the problem, various model statistics are used for evaluation. The following are the most common statistical measures in marketing mix modelling:

  1. R-squared – R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination. R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates that the model explains all of it. The general formula for R-squared is R² = 1 − SSE/SST, where SSE = sum of squared errors and SST = total sum of squares.
  2. Adjusted R-squared: The adjusted R-squared is a refined version of R-squared that is penalized for the number of predictors in the model. It increases only if a new predictor improves the model by more than the penalty it incurs. The adjusted R-squared can be used to compare the explanatory power of regression models that contain different numbers of predictors.
  3. Coefficient: Regression coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response. In linear regression, coefficients are the values that multiply the predictor values. The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable. A positive sign indicates that as the predictor variable increases, the response variable also increases. A negative sign indicates that as the predictor variable increases, the response variable decreases.
  4. Variance Inflation Factor: A variance inflation factor (VIF) detects multicollinearity in regression analysis. Multicollinearity occurs when predictors (i.e. independent variables) in a model are correlated with one another. The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity in the model. Each variable in the model is regressed against all the other available variables to calculate its VIF. The VIF is calculated as VIFi = 1 / (1 − Ri²), where Ri² is the R-squared value obtained by regressing the predictor variable “i” against all other variables.
  5. Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions. It is the average of the absolute differences between prediction and actual observation, where all individual differences have equal weight: MAE = (1/n) · Σ |yt − ŷt|, where yt is the actual value at time ‘t’, ŷt is the predicted value at time ‘t’ and n is the number of observations.
  6. Mean Absolute Percentage Error (MAPE): MAPE is the average absolute percentage error across observations, i.e. the absolute difference between actual and predicted values divided by the actuals: MAPE = (100/n) · Σ |(yt − ŷt) / yt|, where yt is the actual value at time ‘t’, ŷt is the predicted value at time ‘t’ and n is the number of observations.
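
The measures above are simple to compute directly. The following is a minimal sketch with made-up actual and predicted values; the function names are illustrative, and the adjusted R-squared uses the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1) for n observations and p predictors.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def vif(r2_i):
    """VIF of predictor i, given the R-squared of regressing it on the other predictors."""
    return 1.0 / (1.0 - r2_i)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actuals must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual vs. predicted sales.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
example_vif = vif(0.8)   # a predictor that is 80% explained by the others
```

Note that MAE weights every error equally in absolute terms, while MAPE weights errors relative to the size of the actual value, so the same 10-unit miss counts for more on small observations.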

MMM Output –

Marketing Mix Model outputs provide the contribution of each marketing vehicle/channel, which, together with marketing spends, yields marketing ROIs. They also capture time decay and diminishing returns on different media vehicles, as well as the effects of the non-marketing factors discussed above and of interactions like the halo effect and cannibalization. The model output provides all the components and parameters required to arrive at the best media mix under any condition.

Expected Benefit & Limitation –

Benefits of Marketing Mix Modelling –

  • Enables marketers to prove the ROI of their efforts across marketing channels
  • Returns insights that allow for effective budget allocation
  • Facilitates superior sales trend forecasting

Limitations of Marketing Mix Modelling –

  • Lacks the convenience of real-time modern data analytics
  • Critics argue that modern attribution methods are more effective because they use one-to-one, individual-level data
  • Marketing Mix Modelling does not analyze customer experience (CX)

Application/Scope for Optimization, Extension of MMM Model

1. Scope for Optimization

Marketing optimization is the process of improving marketing efforts to maximize desired business outcomes. Since the relationships captured by MMM are mostly non-linear, non-linear constrained optimization algorithms are used. Some of the use cases for marketing mix optimization are:

To improve current sales level by x%, what is the level of spends required in different marketing channels? E.g., To increase sales by 10%, how much to invest in TV ads or discounts or sales promotions?

What happens to the outcome metric (sales, revenue, etc.), if the current level of spends is increased by x%? E.g., On spending additional $20M on TV, how much more sales can be obtained? Where are these additional spends to be distributed?
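
To make the second use case concrete, here is a toy budget allocation. It is a sketch under strong assumptions: each channel follows a hypothetical response curve coef · ln(1 + spend), the coefficients are invented, and a simple greedy rule (give each budget step to the channel with the highest marginal gain) stands in for a proper non-linear constrained solver.

```python
import math

def response(coef, spend):
    """Assumed diminishing-returns response curve for one channel."""
    return coef * math.log1p(spend)

def allocate_budget(coefs, total_budget, step=1.0):
    """Greedy allocation: each step of budget goes to the best marginal return."""
    spends = {channel: 0.0 for channel in coefs}
    remaining = total_budget
    while remaining >= step:
        best = max(
            coefs,
            key=lambda ch: response(coefs[ch], spends[ch] + step)
            - response(coefs[ch], spends[ch]),
        )
        spends[best] += step
        remaining -= step
    return spends

def total_response(coefs, spends):
    """Total modelled outcome for a given spend plan."""
    return sum(response(coefs[ch], spends[ch]) for ch in coefs)

# Hypothetical channel coefficients and a budget of 60 units.
coefs = {"tv": 30.0, "digital": 20.0, "print": 10.0}
plan = allocate_budget(coefs, total_budget=60.0)
equal_split = {channel: 20.0 for channel in coefs}
```

Because every curve flattens as spend grows, the plan spreads budget across channels rather than putting everything into the strongest one, and its total response is higher than that of an equal split.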

2. Halo and Cannibalization Impact

Halo effect is a term for a consumer’s favouritism towards a product from a brand because of positive experiences they have had with other products from the same brand. Halo effect can be the measure of a brand’s strength and brand loyalty. For example, consumers favour Apple iPad tablets based on the positive experience they had with Apple iPhones.

Cannibalization effect refers to the negative impact on a product from a brand because of the performance of other products from the same brand. This mostly occurs in cases when brands have multiple products in similar categories. For example, a consumer’s favouritism towards iPads can cannibalize MacBook sales.

In Marketing Mix Models, base variables, or incremental variables of other products of the same brand are tested to understand the halo or cannibalizing impact on the business outcome of the product under consideration.


Marketing mix modelling techniques can minimize much of the risk associated with new product launches or expansions. Developing a comprehensive marketing mix model can be the key to sustainable long-term growth for a company. It will become a key driver for business strategy and can improve the profitability of a company’s marketing initiatives. While some companies develop models through their in-house marketing and analytics departments, many choose to collaborate with an external company to develop the most efficient model for their business.

Developers of marketing mix models need to have a complete understanding of the marketing environment they operate within and of the latest advanced market research techniques. Only through this will they be able to fully comprehend the complexities of the numerous marketing variables that need to be accounted for and calculated in a marketing mix model. While numerical and statistical expertise is undoubtedly crucial, an insightful understanding of market research and market environments is just as important to develop a holistic and accurate marketing mix model. With these techniques, you can get started on developing a watertight marketing mix model that can maximise performance and sales of a new product.


Copyright © 2024 Affine All Rights Reserved

Manas Agrawal

CEO & Co-Founder
