Accelerate Your eCommerce Sales with Big Data and AI for 2021

Holiday season is the most exciting time of the year for businesses. It has always driven some of the highest sales of the year. In 2019, online holiday sales in the US alone touched $135.35 billion and the average order value hit $152.95. After an unprecedented 2020, retailers are making bold moves to turn the tide in the new year.

A successful holiday strategy in 2021 requires much more than just an online presence. To compete during one of the strangest seasons after the strangest year yet, brands are trying to create more meaningful connections with consumers, offering hyper-personalized online experiences, and ensuring that holiday shoppers experience nothing short of pure convenience and peace of mind.

In 2020, retailers faced some novel challenges and many unknowns. To begin with, here are a few key things that could not be ignored:

  • Customer behaviors changed significantly during the pandemic, and expectations have only grown since
  • Gen Z and Millennial shoppers, who hold the most purchasing power, became more focused on sustainability and peace of mind
  • The ecommerce industry saw five years' worth of digital transformation in two months courtesy of the pandemic. Immersive, cutting-edge technology like voice-aided shopping, AI-assisted browsing and machine learning is no longer seen as optional; these have become must-haves for facilitating a superior customer experience

Here are ten ways big data and AI are helping businesses accelerate ecommerce sales:

1. Hyper-Personalized product recommendations through Machine Learning

Providing people with exactly what they want is the best way to attract new customers and to retain existing ones. So, having intelligent systems to surface products or services that people would be inclined to buy only seems natural. To enable this, data and machine learning are playing big roles, helping businesses put the right offers in front of the right customers at the right time. Research shows that serving relevant product recommendations can have a sizable impact on sales: in one study, 45% of customers said they are more likely to shop on a site that anticipates their choices, while 56% are more likely to return to such a site. Smart AI systems allow a deep dive into buyer preferences and sentiments, helping retailers and e-commerce companies provide their customers with exactly what they might be looking for.

2. Enabling intelligent search leveraging NLP

The whole point of effective search is to understand user intent correctly and deliver exactly what the customer wants. More and more companies are using modern, customer-centric search powered by AI that interprets queries more like a human would. It deploys advanced image and video recognition and natural language processing to continuously improve and contextualize results for customers, which ultimately helps companies close more leads.

3. One-to-one marketing using advanced analytics

With one-to-one marketing, retailers take a more targeted approach to delivering a personalized experience than they would with personalized product recommendations or intelligent search alone. Data like page views and clickstream behavior forms the foundation of one-to-one marketing. As this data is harvested and processed, commonalities emerge that correspond with broad customer segments. As the data is further refined, a clearer picture emerges of an individual's preferences and 360° profile, which informs real-time action on the retailer's end.

4. Optimized pricing using big data

There are numerous variables that impact a consumer's decision to purchase something: product seasonality, availability, size, color, etc. But many studies point to price as the number one factor in determining whether the customer will buy the product.

Pricing is a domain that has traditionally been handled by an analyst after diving deep into reams of data. But big data and machine learning-based methods today are helping retailers accelerate that analysis and create an optimized price, often several times in a single day. This helps keep the price low enough not to turn off potential buyers or cannibalize other products, but high enough to ensure a healthy profit.

5. Product demand forecasting and inventory planning

In the initial months of the pandemic, many retailers had their inventory of crucial items like face coverings and hand sanitizers exhausted prematurely. In certain product categories, the supply chains could not recover soon enough, and some have not recovered yet. Nobody could have foreseen the onslaught of the coronavirus and its impact on retailers, but the episode that followed shed urgent light on the need for better inventory optimization and planning in the consumer goods supply chain.

Retailers and distributors who leveraged machine learning-based approaches for supply chain planning early on fared better than their contemporaries who continued to depend solely on analysts. With a working model in place, the data led to smarter decisions. Incorporating external data such as social media signals (Twitter, Facebook), macroeconomic indicators, and market performance data (stocks, earnings, etc.) into the forecasting model, in addition to historical inventory and seasonality data, helps determine product demand patterns more accurately.

6. Blending digital and in-store experiences through omnichannel ecommerce offerings

The pandemic has pushed many people who would normally shop in person to shop online instead. Retailers are considering multiple options for getting goods in the hands of their customers, including contactless transactions and curbside pickups. Not that these omnichannel fulfillment patterns were not already in place before the coronavirus struck, but they have greatly accelerated under COVID-19. AI is helping retailers expedite such innovations as e-commerce offerings, blending of digital and in-store experiences, curbside pickup and quicker delivery options, and contactless delivery and payments.

7. Strengthening cybersecurity and fighting fraud using AI

Fraud is always a threat around the holidays. And given the COVID-19 pandemic and the subsequent shift to everything online, fraud levels have jumped by 60% this season. An increase in card-not-present transactions incites fraudsters to abuse cards that have been compromised. Card skimming, lost and stolen cards, phishing scams, account takeovers, and application fraud present other loopholes for nefarious exploits. In a nutshell, fraudsters are projected to extract about 5.5% more from customers this year. Card issuers and merchants alike, armed with machine learning and AI, are analyzing huge volumes of transactions, identifying instances of attempted fraud, and automating the response to it.

8. AI-powered chatbots for customer service

Chatbots that can automatically respond to repetitive and predictable customer requests are one of the fastest-growing applications of big data and AI. Thanks to advances in natural language processing and natural language generation, chatbots can now correctly understand complex and nuanced written and spoken queries. These smart assistants are already saving companies millions of dollars per year by supplementing human customer service reps in resolving issues with purchases, facilitating returns, helping find stores, answering repetitive queries about hours of operation, and so on.

9. AI guides for enabling painless gift shopping

As this is the busiest time of the year, when customers throng websites and stores for gift shopping, gaps in customer service can seriously confuse and dissuade the already indecisive shopper. In such a scenario, tools like interactive AI-powered gift finders engage shoppers in a conversation, asking a few questions about the gift recipient's personality and immediately providing gifting ideas, helping even the most unsettled gift shopper find the perfect gift with little wavering. This helps customers overcome choice paralysis and indecision while helping companies boost conversions, benefiting both sides of the transaction.

10. AR systems for augmented shopping experience

AR is taking the eCommerce shopping and customer experience to the next level. From visual merchandising to hyper-personalization, augmented reality offers several advantages. In a 2019 predictions report, Gartner indicated that up to 100 million consumers would use augmented reality in their shopping experiences by 2020, and the prediction held true. The lockdowns and isolation necessitated by COVID-19 rapidly increased the demand for AR systems.

Based on the “try-before-you-buy” approach, augmented shopping appeals to customers by allowing them to interact with their choice of products online before they proceed to buy any. For instance, AR is helping buyers visualize what their new furniture will look and feel like by moving their smartphone cameras around the room in real-time and getting a feel of the size of the item and texture of the material for an intimate understanding before purchase. In another instance, AR is helping women shop for makeup by providing them with a glimpse of the various looks on their own face at the click of a button.

To survive the competitive landscape of eCommerce and meet holiday revenue goals this year, merchants and retailers are challenging the status quo and adopting AI-powered technology to meet customer expectations. AI is truly the future of retail, and not leveraging the power of artificial intelligence, machine learning, and related tech means losing out.

Recommendation Systems for Marketing Analytics

The way I see it, recommendation systems are something traditional shopkeepers have always used.

Remember going shopping with our mothers as children, always to the same shop? The shopkeeper would give the best recommendations for products, and we kept buying from that shop because we knew that this shopkeeper knew us best.

What the shopkeeper did was understand our tastes, priorities, and the price range we were comfortable with, and then present the products that best matched our requirements. This is what businesses are doing, in the true sense, now.

They want to know their customers personally through their browsing behaviour and then recommend products they might like; the only difference is that they want to do it at a large scale.

For example, Amazon and Netflix understand your behaviour through what you browse, add to basket and order, or the movies you watch and like, and then recommend the products which you are most likely to like.

In a nutshell, they combine business understanding with some mathematics so that they can learn which products a customer likes.

So basically, a recommendation system for marketing analytics is a subclass of information filtering system that seeks similarities between users and items in different combinations.

Below are some of the most widely used types of recommendation systems:

  1. Collaborative Recommendation system
  2. Content-based Recommendation system
  3. Demographic based Recommendation system
  4. Utility based Recommendation system
  5. Knowledge based Recommendation system
  6. Hybrid Recommendation system

Let us go into the most useful ones which the industry is using:

  • Content Based Recommendation System

The point of content-based filtering is that we know the content of both the user and the item. Usually we construct a user profile and an item profile over a shared attribute space. Product attributes like images (size, dimension, colour, etc.) and the text description of the product lend themselves to content-based recommendation.

This essentially means that based upon the content that I watch on Netflix, I can run an algorithm to see what the most similar movies are and then recommend the same to the other users.

For example, when you open Amazon and search for a product, similar products pop up below; this is the item-item similarity that Amazon has computed across all the products in its catalogue. It gives us a very simple yet effective idea of how products behave with respect to each other.

Bread and butter could be similar products in the true sense, as they go together, even though their attributes can be quite different. In the case of the movie industry, features like genres and reviews tell us which movies are similar, and that is the type of similarity we get for movies.

  • Collaborative Recommendation System:

Collaborative algorithms use “user behaviour” to recommend items. They exploit the behaviour of other users and items in terms of transaction history, ratings, selections, purchase information, etc. In this case, the features of the items are not known.

When you do not look at the features of the products to calculate the similarity score and instead check the interactions of the products with the users, it is called a collaborative approach.

From the interactions of products with users, we figure out which products are similar and then build a recommendation strategy to target the audience.

Two users who watched the same movie on Netflix can be called similar, and when the first user watches another movie, the second user gets that movie as a recommendation based on the likes these two people share.

  • Hybrid Recommendation System:

Combining any two of these systems in a manner that suits the industry is known as a hybrid recommendation system. It combines the strengths of two or more recommendation systems and mitigates the weaknesses that exist when only one recommendation system is used.

When we only use collaborative filtering, we run into the “cold start” problem. Since we rely on the interactions of users with products, if a user comes to the website for the first time, we have no interactions available and hence no recommendations to make to that customer.

To eliminate this problem, we use hybrid recommendation systems, which combine content-based and collaborative systems to get rid of the cold start problem. Think of it this way: item-item, user-user, and user-item interactions are all combined to give the best recommendations to users and more value to the business.

From here, we will focus on the Hybrid Recommendation Systems and introduce you to a very strong Python library called lightfm which makes this implementation very easy.

LightFM:

The official documentation can be found in the lyst/lightfm repository on github.com.

LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback.

User and item latent representations are expressed in terms of their features' representations.

It also makes it possible to incorporate both item and user metadata into the traditional matrix factorization algorithms. When multiplied together, these representations produce scores for every item for a given user; items scored highly are more likely to be interesting to the user.

  • Interactions: The matrix containing user-item interactions.
  • User_features: Each row contains that user's weights over features.
  • Item_features: Each row contains that item's weights over features.

Note: These matrices should be sparse (a sparse matrix is one that contains very few non-zero elements).

Predictions

  • fit_partial: Fit the model. Unlike fit, repeated calls to this method will cause training to resume from the current model state. This is mainly useful for appending new users to the training matrix.
  • predict: Compute the recommendation score for user-item pairs. The scores are sorted in descending order and the top n items are recommended.

Model evaluation

AUC Score: In the binary case (clicked/not clicked), the AUC score has a nice interpretation: it expresses the probability that a randomly chosen positive item (an item the user clicked) will be ranked higher than a randomly chosen negative item (an item the user did not click). Thus, an AUC of 1.0 means that the resulting ranking is perfect: no negative item is ranked higher than any positive item.

Precision@K : Precision@K measures the proportion of positive items among the K highest-ranked items. As such, this is focused on the ranking quality at the top of the list: it does not matter how good or bad the rest of your ranking is as long as the first K items are mostly positive.

For example, if only one item among your top 5 is correct, then your precision@5 is 0.2.

Note: If the first K recommended items are not available anymore (say, they are out of stock), you need to move further down the ranking; a high AUC score then gives you confidence that your ranking is of high quality throughout.

Enough of the theory; now we will move to the code and see how an implementation with lightfm works:

I have taken the “E-Commerce Data” dataset (actual transactions from a UK retailer) from Kaggle; you can download it from www.kaggle.com.
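Since the code in the original post is embedded as screenshots, here is a minimal, self-contained sketch of how such a lightfm pipeline could look. The file name, the CustomerID/StockCode column names, and all hyperparameters are assumptions for illustration, not the post's exact code.

```python
import numpy as np
import pandas as pd
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score, precision_at_k

# Load the UK retailer transactions (file name and column names are assumptions).
df = pd.read_csv("data.csv", encoding="ISO-8859-1").dropna(subset=["CustomerID"])

# Build a sparse user-item interaction matrix from (customer, product) pairs.
dataset = Dataset()
dataset.fit(users=df["CustomerID"].unique(), items=df["StockCode"].unique())
interactions, _ = dataset.build_interactions(zip(df["CustomerID"], df["StockCode"]))

# Hold out a share of the interactions for evaluation.
train, test = random_train_test_split(interactions, test_percentage=0.2)

# Fit the model; only interactions are used here, but user/item feature matrices
# could be passed as well to make it a true hybrid model.
model = LightFM(loss="warp", no_components=30)
model.fit(train, epochs=10, num_threads=2)

print("Train AUC:", auc_score(model, train).mean())
print("Test AUC:", auc_score(model, test, train_interactions=train).mean())
print("Test precision@5:", precision_at_k(model, test, train_interactions=train, k=5).mean())

# Score every item for the first user (internal id 0) and print the top 5 stock codes.
_, _, item_id_map, _ = dataset.mapping()
inv_item_map = {v: k for k, v in item_id_map.items()}
scores = model.predict(0, np.arange(len(item_id_map)))
print([inv_item_map[i] for i in np.argsort(-scores)[:5]])
```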

Hope you liked the coding part of it and are ready to implement it on your own. One enhancement that can be made is to include the product and user features.

These can also be passed as inputs to the lightfm model, and the embeddings that the model creates will then be based on all of those attributes. The more data that is pushed into lightfm, the more the model has to train on and the better its accuracy.

That’s all from my end for now. Keep Learning!! Keep Rocking!!

Python Newbie – Doctest

Any Software Development process consists of five stages:

  1. Requirement Analysis
  2. Design
  3. Development
  4. Testing
  5. Maintenance

Though each and every stage mentioned above is important in the SDLC, this post will mainly focus on the importance of testing and show how we can use doctest, a module in Python, to perform testing.

Importance of testing

We all make mistakes and if left unchecked, some of these mistakes can lead to failures or bugs that can be very expensive to recover from. Testing our code helps to catch these mistakes or avoid getting them into production in the first place.

Testing therefore is very important in software development.

Used effectively, tests help to identify bugs, ensure the quality of the product, and to verify that the software does what it is meant to do.

Python module- doctest

Doctest helps you test your code by running examples embedded in the documentation and verifying that they produce the expected results. It works by parsing the help text to find examples, running them, then comparing the output text against the expected value.

To make things easier, let us start by understanding the above implementation using a simple example.

Python inline function

So, in the above snippet, I have written a basic inline function that adds up a number to itself.

For this function, I manually run a couple of test cases to do some basic verification (a sanity check of my function).
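The snippet itself appears as an image in the original post; a minimal sketch of that kind of inline function, checked by hand in the interpreter (names assumed), could look like this:

```python
>>> double = lambda x: x + x   # a basic inline function that adds a number to itself
>>> double(10)                 # manual sanity checks typed into the interpreter
20
>>> double(7)
14
```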

Now, consider a scenario in which Python can read the above output and perform the sanity check for us at runtime. This is exactly the idea behind a doctest.

Now, let’s see how we can implement one.

Let's take a very simple example: calculating what day of the week it will be a given number of days ahead of the current weekday. We write a docstring for our function, which helps us understand what the function does, what inputs it takes, and so on. In this docstring, I have added a couple of test cases which will be read by the doctest module at runtime while testing is carried out.

Implementation of doctest
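The script in the post is shown as an image; a sketch of such a script, with assumed names and a 0 = Monday weekday numbering, might look like this:

```python
import doctest

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def day_ahead(current_day, days_ahead):
    """Return the weekday `days_ahead` days after weekday number `current_day`.

    >>> day_ahead(0, 3)
    'Thursday'
    >>> day_ahead(5, 2)
    'Monday'
    """
    return WEEKDAYS[(current_day + days_ahead) % 7]

if __name__ == "__main__":
    doctest.testmod(verbose=True)
```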

When we run the above script from the command line, we will get the following output:

Doctest Output

We can see from the above snippet that all the test cases mentioned in the docstring were successful, as the resulting outcome matched the expected outcome.

But what happens if any test fails, or the script does not behave as expected?

To test this, we add a false test case as we know our script can only take integers as input.

What will happen if we give a string as an input? Let us check out.

Test case with strings as input

I have used the same script but made a small change in the test case where I have passed strings as an input.
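Continuing the assumed example from above, the modified docstring might add a case like the one below; when the script is run again, doctest flags it as a failed example because the call raises an exception instead of returning the expected value.

```python
import doctest

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def day_ahead(current_day, days_ahead):
    """Return the weekday `days_ahead` days after weekday number `current_day`.

    >>> day_ahead(0, 3)
    'Thursday'
    >>> day_ahead("Monday", "Tuesday")   # deliberately wrong: strings instead of integers
    'Thursday'
    """
    return WEEKDAYS[(current_day + days_ahead) % 7]

if __name__ == "__main__":
    doctest.testmod(verbose=True)
```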

So, what is the outcome for the above test case?

Failed test case output

Voila! We can see that the doctest has clearly identified the failed test case and listed the reason for the failure of the above-mentioned test case.

Conclusion

That is all you require to get started with doctest. In this post, we have seen the power of doctest and how it makes it much easier to carry out automated testing for most scripts. Many developers find doctest easier than unittest because, in its simplest form, there is no API to learn before using it. However, as the examples become more complex, the lack of fixture management can make writing doctest tests more cumbersome than using unittest. Still, because the module is so easy to use, doctest is worth adding to our code.

In upcoming blogs, we are going to discuss more handy Python modules that can ease our tasks, and we will also dive into other concepts like Machine Learning and Deep Learning. Until then, keep learning!

Recommendation Systems for Beginners

Why do we need a recommendation system?

Let us take the simplest and the most relatable example of E-commerce giant, Amazon. When we shop at Amazon, it gives us the options of bundles and products that are usually bought along with the product you are currently going to buy. For example, if you are to buy a smartphone, it recommends you to buy a back cover for the product as well.

For a second let us think and try to figure out what Amazon is trying to do in the figure below:

What does a recommendation system do?

A recommendation system recommends you products or items that can be of your interest or liking. Let’s take another example:

It's quite easy to notice that they are trying to sell the equipment that is generally required for a camera (a memory card and a case). Now, the main question is: how do they do it for the millions of items listed on their website? This is where a recommendation system comes in handy.

When we first set up our Netflix account, they ask us what our preferences are, which movie or TV show is most likely to be watched by us or what genre of movie is our favorite. So as the first layer of recommendation, Netflix gives us recommendations based on our input, it shows us movies or shows similar to the input that we had provided to it. Once we are a frequent user, it has gathered enough data and gives recommendations more accurately based on our preferences of genres, cast, star rating, and so on…

The ultimate aim here is to recommend an item that the user will watch or buy (in the case of Amazon); this in turn makes sure that the user stays engaged with the platform and the customer lifetime value (CLTV) is maintained.

Objective of this blog

By the end of this blog, one will have a basic understanding of how to approach building a recommendation system. To make things more lucid, let us take an example and try building a hotel recommendation system. In the process, we will cover data understanding and the algorithms that can be used, to see how a nascent recommendation engine is built. We will draw analogies with everyday products like Amazon and Netflix for a clearer understanding.

Understanding the data required for building a recommendation system

To build a recommendation system, we must be clear with the problem statement and the end objective to provide accurate recommendations. For example, consider the following scenarios:

  1. Providing a user with a hotel recommendation based on his/her current search and historical behavior (giving a recommendation knowing that a user is looking for a hotel in Las Vegas and prefers hotels with casinos).
  2. Providing a hotel recommendation based on the user’s historical behavior, targeting those users who are not actively engaged (searching) but can be incentivized towards making a booking by targeting through a relevant recommendation (a general recommendation can be based on metrics such as a user’s historical star rating preference or historical budget preference).

These are two different objectives, and hence, the approach towards achieving both of them is different.

One must be aware of what type of data is available and also needs to know how to leverage that data to proceed towards building a recommendation engine.

There are two types of data which are of importance in our use case:

Explicit Data:

Explicit signals or inputs are where a user directly gives feedback about a particular item/product. This can be a star value, say in the range of 1 to 5, or just a binary 1 (like) and 0 (dislike). For example, when we rate an item on Amazon or rate a movie on IMDb, these are explicit signals where we are directly giving our feedback on an item. One thing to keep in mind is that not every individual is the same, i.e. for an item X, user A and user B can have different ratings. User A can be generous with his ratings and give 5 stars, whereas user B is a critic who gives item X 3.5 stars and reserves 5 stars for exceptional items only.

Translating this to our hotel recommendation use case: the filters that a user applies while searching for a hotel, say swimming pool or WiFi, are explicit signals; here the user is explicitly saying that he is interested in properties which have WiFi and a swimming pool.

Additionally, explicit data is sparse in most cases, as it is not practically possible for a user to give ratings to each and every item. Logically, I would not have seen each and every movie on Netflix and hence can only provide feedback for the set of movies that I have seen. These ratings reflect how much a user likes or approves of an item.

Implicit Data:

Implicit signals are obtained by capturing a user's interactions with items. This can be a continuous value, like the number of times a user has clicked on an item or the number of times a user has watched an action movie, or a binary value, such as clicked or not clicked. For example, while scrolling through amazon.com, the amount of time spent viewing an item or the number of times you have clicked on it can act as implicit feedback.

Drawing parallels for hotel recommendations with implicit signals can be understood as follows. Consider that we have the historical hotel bookings of a user A, and we see that in the 4 out of 5 bookings made by the user, it was a property that was near the beach. This can act as an implicit signal where we can say that user A prefers hotels near the beach.

Types of Recommendation Systems

Let us take a specific example given below to further explain the recommendation models:

While making a hotel recommendation system, we have the users' explicit and implicit signals. However, we do not have all the signals for all the users; for a set of users E we have explicit signals, and for a set of users I we have implicit signals.

Further, let us assume that a hotel property has the following attributes:

WiFi Availability, Couple Friendly, Budget Friendly and Nature Friendly (closer to nature)

For simplicity, let us assume that these are flags, such that if a property A has WiFi in it, the WiFi availability column will be 1. Hence our hotels data will look something like the following:

Let us name this table Hotel_Type for further use.
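The table in the original post is an image; a hypothetical version of it, with flag values invented purely for illustration (and reused in the sketches further below), can be built like this:

```python
import pandas as pd

# Hypothetical Hotel_Type table: 1 means the property has the attribute, 0 means it does not.
hotel_type = pd.DataFrame(
    {
        "WiFi": [1, 1, 1, 1],
        "Couple": [1, 1, 1, 0],
        "Budget": [0, 1, 1, 0],
        "Nature": [0, 1, 0, 1],
    },
    index=["P", "Q", "X", "Z"],
)
print(hotel_type)
```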

Content Based Filtering:

This technique is used when explicit signals are provided by the user or when we have the user and item attributes and the interaction of the user with that item. The objective here is to show items/products which are similar to the item/product that a person has already purchased or shows a liking for, or in another case, show a product to a user where he explicitly says that he is looking for something in particular. Taking our example, consider that you are booking a hotel from xyz.com, you apply filters for couple-friendly properties, here you are explicitly saying that you are looking for a couple-friendly property and hence, xyz.com will show you properties that are couple friendly. Similarly, while providing recommendations, if we have explicit signals from a user we try to get the best match for that signal with the list of items that we have and provide recommendations accordingly.

Model Algorithms:

Cosine Similarity: It is a measure of similarity between two non-zero vectors. The values range from 0 to 1. Here the vectors can be either user-based or item-based.

Let us take an example. Assume that user A has specifically shown interest in property X from the Hotel_Type table (the user has previously booked the property or has searched for property X multiple times recently); we now have to recommend properties that are similar to property X. To do so, we find the similarity between property X and the rest of the properties in the table.

We can clearly see that property Q has the highest similarity with property X, followed by property P. So if we are to recommend a property to user A, we will recommend property Q, knowing that he has a preference for property X.
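A minimal sketch of this comparison, reusing the hypothetical flags from above (since the values are invented, the exact similarity numbers will differ from the post's table):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical flag vectors (WiFi, Couple, Budget, Nature) for each property.
hotels = {
    "P": [1, 1, 0, 0],
    "Q": [1, 1, 1, 1],
    "X": [1, 1, 1, 0],
    "Z": [1, 0, 0, 1],
}

x = np.array(hotels["X"]).reshape(1, -1)
for name, flags in hotels.items():
    if name == "X":
        continue
    sim = cosine_similarity(x, np.array(flags).reshape(1, -1))[0, 0]
    print(f"similarity(X, {name}) = {sim:.2f}")
```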

Pearson Correlation: It is a measure of linear correlation between two variables. The values range from -1 to 1.

Let us take an example where we are getting explicit input from the user, who is shown the 4 categories (WiFi, Budget, Couple, Nature). The user has the option to provide input by selecting as many as he wants; he can even select none. Consider the case where user B has selected at least one of the 4 options, and assume user B's input looks like the following:

One could argue that we can use cosine similarity in this case by simply filling in the null values as 0. However, it is not advised to do so, since cosine similarity treats the 0s as a negative preference, and from this explicit signal we cannot say for sure that user B is not looking for a couple-friendly or a budget-friendly property just because the user has not given an input in that field.

Hence, to avoid this, we use Pearson correlation and the output of the similarity measuring technique would look like the following:

We can see that property Z is highly correlated to user B’s explicit signal and hence, we will provide Z as a recommendation for user B.
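A sketch of the same comparison with scipy, again using the invented flag values (property Q is omitted because its all-ones row is constant, so its Pearson correlation is undefined):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical explicit input from user B over (WiFi, Couple, Budget, Nature):
# 1 = explicitly selected, 0 = no input given.
user_b = np.array([1, 0, 0, 1])

# Hypothetical property flags, as in the earlier Hotel_Type sketch.
properties = {
    "P": np.array([1, 1, 0, 0]),
    "X": np.array([1, 1, 1, 0]),
    "Z": np.array([1, 0, 0, 1]),
}

# Pearson centres both vectors, so an unanswered 0 is not treated as a hard dislike.
for name, flags in properties.items():
    r, _ = pearsonr(user_b, flags)
    print(f"corr(user B, {name}) = {r:.2f}")
```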

So, for the set of users E (explicitly proving us their preference) we will use Pearson Correlation and for the set of users I (implicitly telling us that he/she is looking for a property with a certain set of attributes), we will use Cosine Similarity.

Note: A user’s explicit signal is always preferred over an implicit signal. For example, in the past, I have only booked hotels in the urban areas, however, now I want to book a hotel near the beach (nature friendly). In my explicit search, I specify this, but if you are making an implicit signal from my past bookings you will see that I do not prefer hotels near the beach and would recommend me hotels in the city. In conclusion, Pearson correlation and Cosine similarity are the most widely used similarity techniques, however, we need to always use the correct similarity measuring technique as per our use case. More information on different types of similarity techniques can be found here.

Collaborative Filtering:

This modeling technique is used by leveraging user-item interaction. Here, we try to match or group similar users and recommend based on the preferences of similar users. Let us consider a user-item interaction matrix (rating matrix) where we have the hotel rating a user has given a particular hotel:

Rating Matrix

Now let us compare user A and user E. We can see that they both have similar tastes and have rated Hotel Y as 4; seeing this, let us assume that user A would rate Hotel X as 5 and Hotel R as 3. Hence, we can recommend Hotel X to user A by noticing the similarity between user A and user E (expecting that he will like Hotel X and rate it 5).

So, if we are provided with user-item interactions where the user has given feedback on the item, we can use collaborative filtering (for example, on the rating matrix above). This feedback can be an explicit rating, such as the star rating given by the user, or an implicit signal, such as a flag indicating whether the user has booked a property.

Model Algorithms:

Memory and Model-Based Approach are the two types of techniques to implement collaborative filtering. The key difference between the two is that in the memory-based approach we do not use parametric machine learning models.

Memory-Based Approach: It can be divided into two subdivisions, user-item filtering and item-item filtering. In the user-item approach, we identify clusters of similar users and utilize the interactions of a particular user in that cluster to predict the interactions of the whole cluster. For example, to predict the rating user C gives to hotel X, we take a weighted sum of hotel X's ratings by the other users, where the weight is the similarity between user C and each of those users. Adjusted cosine similarity can also be used to remove differences in the nature of individuals, which brings critics and the general public onto the same scale.
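A small, self-contained sketch of such a user-based prediction (the ratings, users, and hotels are made up; cosine similarity is computed only over the hotels both users have rated):

```python
import numpy as np

# Hypothetical rating matrix (rows = users A..E, columns = hotels X, Y, R); 0 marks a missing rating.
ratings = np.array([
    [0, 4, 2],   # user A (rating for hotel X is unknown)
    [3, 2, 0],   # user B
    [0, 5, 1],   # user C
    [4, 1, 0],   # user D
    [5, 4, 3],   # user E
], dtype=float)

def predict_rating(ratings, user, item):
    """User-based prediction: similarity-weighted average of other users' ratings for the item."""
    mask = ratings[:, item] > 0          # users who actually rated this item
    mask[user] = False
    others = np.where(mask)[0]
    sims = []
    for o in others:
        common = (ratings[user] > 0) & (ratings[o] > 0)   # hotels rated by both users
        u, v = ratings[user][common], ratings[o][common]
        sims.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
    sims = np.array(sims)
    return np.dot(sims, ratings[others, item]) / (sims.sum() + 1e-9)

print(round(predict_rating(ratings, user=0, item=0), 2))   # user A's predicted rating for hotel X
```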

Item-item filtering is similar to user-item filtering, but here we take an item, see the users who liked that item, and find other items that those users or similar users also liked. It takes items, finds similar items, and outputs those items as recommendations.

Model-Based Approach: In this technique, we use machine learning models to predict the rating for an item that could have been given by a user and hence, provide recommendations.

Several ML models are used; to name a few: matrix factorization, SVD (singular value decomposition), ALS, and SVD++. Some also use neural networks, decision trees, and latent factor models to enhance the results. We will delve into matrix factorization below.

Matrix Factorization:

In matrix factorization, the goal is to complete the matrix and fill in the null values in the rating matrix.

The preferences of the users are identified by a small number of hidden features of the users and items. Here there are two hidden feature matrices, one for users (a 4×2 user matrix) and one for items (a 2×4 item matrix). Once we multiply the user and item matrices back together, we get back our ratings matrix with the null values replaced by predicted values. Once we have the predicted values, we can recommend the item with the highest rating for a user (not considering the items already interacted with).

Note: Here we are not providing any feature vectors for the users or the items; the computation decomposes the matrix and creates the vectors on its own, and finally predicts by filling in the null values.

If we have user demographics information and user’s features and preference information and item features, we can use SVD++ where we can pass users and item feature vectors as well to get the best recommendation results.
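A bare-bones sketch of matrix factorization with plain NumPy and stochastic gradient descent, on a made-up rating matrix (a real system would more likely use a library such as Surprise, implicit, or lightfm):

```python
import numpy as np

# Hypothetical 4x4 rating matrix; 0 marks a missing rating to be predicted.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # hidden user factors
V = rng.normal(scale=0.1, size=(n_items, k))   # hidden item factors

lr, reg = 0.01, 0.02
for _ in range(5000):
    for u, i in zip(*np.nonzero(R)):           # only observed ratings drive the updates
        err = R[u, i] - U[u] @ V[i]
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

print(np.round(U @ V.T, 1))   # reconstructed matrix; the former zeros now hold predictions
```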

Hybrid Models:

A hybrid model combines multiple models/algorithms to build a recommendation system. To improve the effectiveness of the recommendations, we can combine collaborative filtering and content-based filtering, giving appropriate weights to the individual models, and finally use this hybrid system to give out recommendations.

For example, we can combine the results of the following by giving weights:

  1. Using Matrix factorization (collaborative filtering) on ratings matrix to match similar users and predict a rating for the user-item pair.
  2. Using Pearson correlation (content-based filtering) to find similarity between users who provide explicit filters and the hotels with feature vectors.

The combined results can be used to provide recommendations to users.
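A toy sketch of that blending step, with invented scores and weights (in practice the weights would be tuned on validation data):

```python
import numpy as np

# Hypothetical scores for the same candidate hotels from the two models, both already scaled to [0, 1].
cf_scores = np.array([0.9, 0.2, 0.6, 0.4])        # matrix-factorization (collaborative) predictions
content_scores = np.array([0.3, 0.8, 0.7, 0.1])   # Pearson-based content-filtering scores

w_cf, w_content = 0.7, 0.3                        # assumed blending weights
hybrid = w_cf * cf_scores + w_content * content_scores

hotels = ["P", "Q", "X", "Z"]
print([hotels[i] for i in np.argsort(-hybrid)])   # final recommendation order
```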

Conclusion:

Building a recommendation system highly depends on the end objective and the data we have at hand. Understanding the algorithms and knowing which one to use to get recommendations plays a vital role in building a suitable recommendation system. Additionally, a sound recommendation system also uses multiple algorithms and combines the results to provide the final recommendations.


Isolating Toxic Comments to prevent Cyber Bullying

Online communities are susceptible to Personal Aggression, Harassment, and Cyberbullying. This is expressed in the usage of Toxic Language, Profanity or Abusive Statements. Toxicity is the use of threats, obscenity, insults, and identity based hateful language.

Our objective was to create an algorithm which could identify Toxicity and other Categories with an Accuracy of > 90%. We will now discuss how we created a model with an accuracy of 98%.

Approach Part 1: Engineer the right features for Emotion Capture

  • No doubt Toxicity can be detected and measured by the presence of cuss words, obscenities, etc. But is that all? Is the bag-of-words approach, which stresses content alone, sufficient to define Toxicity? Are we missing information that can be assessed from the sequential nature of language and expression, which comes not just from the chosen words but also from the way they are written or used?
  • This additional thought process of using Emotion as well as content was validated on the Kaggle Toxic Challenge problem data.
  • Emotion and intensity of people's written comments were captured via Feature Engineering and Deep Learning Models. The content was captured using Machine Learning Models that follow a Bag of Words approach.
  • The inclusion of Emotions, coupled with our Ensembling Approach, raised the accuracy levels from 60% (Bag of Words) to 98%.

Approach Part 2: Create Disparate Models for the best Ensemble

With the Feature Engineering taken care of, we have all the information that needed to be extracted from our Corpus. The next step is getting the best possible Ensemble model.

We ran Latent Dirichlet Allocation on the Corpus using the Bag of Words approach and found that some of the Categories were quite similar to one another when we looked at the Probability Distributions, for example “Toxic”, “Severe Toxic”, and “Obscene”, which have only a small margin between their Decision Boundaries. This at least confirms that creating one model that predicts all categories may be Time Consuming.

Our Strategy – Come up with the best model for each category. Concentrate on Parameter Tuning of the Individual Models, and tune each until we get the best possible model. Finally, ensemble these models.

This paved the way for arriving at the best ensemble model, as the individual models now have less correlation with one another.
Also, conceptually we had two classes of Models: the Deep Learning Sequential Models and the Machine Learning Bag of Words models. This again contributed to our Ensembling Idea, meaning there will be certain categories which the Sequential Models will be particularly good at, and likewise for the Bag of Words Models.

LSTM is a special case of Recurrent Neural Networks. Our Sequential Models consisted of LSTM models, each of which was tuned to have an edge in predicting one or at most two categories.

Our Bag of Words branch consisted of LightGBM models, each with its own parameters and strengths in a particular category.

Illustration: As a part of our Ensembling Strategy, Sequential and Bag of Words Models were built for each Toxic Category.
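Purely as an illustration of the shape of such a pipeline (the data, features, parameters, and ensemble weights below are invented, not the models from the original work), a per-category bag-of-words branch ensembled with sequential-model probabilities might look like this:

```python
import numpy as np
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training data: comments and one binary label per toxicity category.
comments = ["you are awful", "have a nice day", "I will hurt you", "great point, thanks"]
labels = {"toxic": [1, 0, 1, 0], "obscene": [1, 0, 0, 0]}

# Bag-of-words branch: TF-IDF features feeding one LightGBM model per category.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(comments)

bow_preds = {}
for category, y in labels.items():
    model = lgb.LGBMClassifier(n_estimators=50, min_child_samples=1)  # toy settings for toy data
    model.fit(X, y)
    bow_preds[category] = model.predict_proba(X)[:, 1]

# The sequential branch (an LSTM per category) would produce its own probabilities;
# they are stubbed with random values here just to show the ensembling step.
lstm_preds = {c: np.random.rand(len(comments)) for c in labels}

# Simple weighted ensemble of the two branches, one weight pair per category.
final = {c: 0.6 * lstm_preds[c] + 0.4 * bow_preds[c] for c in labels}
print(final["toxic"])
```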

Conclusion: Text Categorization can be dealt with efficiently if we combine the Bag of Words and the Sequential approaches. Also, emotions play a key role in Text Categorization for Classification Problems such as Toxicity.

Contributors

Anshuman Neog: Anshuman is a Consultant at Affine. He is a Deep Learning & Natural Language Processing enthusiast.

Ansuman Chand: Ansuman is a Business Analyst at Affine. His Interests involve real-world implementation of Computer Vision, Machine Intelligence and Cognitive Science. Wishes to actively leverage ML, DL techniques to build solutions.

HowStat – Application Of Data Science In Cricket

Data science helps us to extract knowledge or insights from data, either structured or unstructured, by using scientific methods like mathematical or statistical models. In the last two decades, it has become one of the most popular fields with the rise of big data technologies. A lot of companies, such as Amazon, Netflix, and Google Play, have been using recommendation engines to promote their products and suggestions in accordance with users' interests. A lot of other applications like image recognition, gaming, or airline route planning also involve the use of big data and data science.

Sports is another field which is using data science extensively to improve strategies and predict match outcomes. Cricket is a sport where machine learning has scope to dive into quite a large outfield. It can go a long way towards suggesting optimal strategies for a team to win a match or for a franchise to bid on a valuable player.

Under the International Cricket Council (ICC), there are 10 full-time member countries, 57 affiliate member countries, and 38 associate member countries, which adds up to 105 member countries. We cannot imagine the amount of data that is generated every day, 365 days a year, from the ball-by-ball information of 531,253 cricket players in close to 540,290 cricket matches at 11,960 cricket grounds across the world. Databases have been maintained in cricket for a long time, and simple analysis has also been used in the past. We have the scores of each match with all the details, which have been used to generate stats like highest run scorer, highest wicket-taker, best batting/bowling average, the highest number of centuries in away matches, best strike rate, the highest run scorer in successful chases, and much more. In recent years, the depth of analysis has reached a whole new level.

The most popular use of mathematics in cricket is the Duckworth-Lewis system (D/L). The brainchild of Frank Duckworth and Tony Lewis, this method helps in resetting targets in rain-affected limited overs cricket matches. The D/L method is widely used in all limited overs international matches to predict the target score. It is a statistical formula to set a fair target for the team batting second, based on the score achieved by the first team. It takes into consideration the chasing side’s wickets lost and overs remaining. The predicted par score is calculated at each ball and is proportional to a percentage of the combination of wickets in hand and overs remaining. It is simple mathematics and has a lot of flaws. This method seems to be more beneficial for the team batting second. It does not account for changes in the proportion of the innings for which field restrictions are in place compared to a completed inning. V Jayadevan, an engineer from Kerala, also created a mathematical model alternative to the D/L method but it did not become popular because of certain limitations.

Machine Learning algorithms can be used to identify complex yet meaningful patterns in the data, which then allows us to predict or classify future instances or events. We can use data from the first innings, such as the number of deliveries bowled, wickets left, runs scored per deliveries faced and partnership for the last wicket, and compare that against total runs scored. Machine learning techniques like SVM, Neural Network, Random Forest can be used to create a model from the historical first innings data, considering the teams playing the match. The same model can be used to predict the second innings which is interrupted by rain. This will give a more accurate prediction than the D/L method, as we are using a lot of historical data and all relevant variables.
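As a sketch of that idea (with entirely made-up numbers and a deliberately tiny dataset), such a first-innings projection could be prototyped with a random forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical first-innings snapshots:
# [deliveries bowled, wickets left, runs per delivery faced, last-wicket partnership]
X = np.array([
    [120, 8, 0.85, 35],
    [180, 6, 0.92, 20],
    [240, 4, 1.05, 44],
    [300, 2, 0.98, 12],
])
y = np.array([265, 248, 281, 237])   # final first-innings totals (made up)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(model.predict([[150, 7, 0.90, 25]]))   # projected total for a new match state
```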

Another application is the WASP (Winning and Scoring Prediction), which has used machine learning techniques that predict the final score in the first innings and estimates the chasing team’s probability of winning in the second innings. However, this technology has been used in very few tournaments as of now. WASP was created by Scott Brooker as part of his Ph.D. research, along with his supervisor Seamus Hogan, at the University of Canterbury. New Zealand’s Sky TV first introduced the WASP during the coverage of their domestic limited overs cricket. The models are based on a database of all non-shortened ODI and 20-20 games played between top-eight countries since late 2006 (slightly further back for 20-20 games). The first-innings model estimates the additional runs likely to be scored as a function of the number of balls and wickets remaining. The second innings model estimates the probability of winning as a function of balls and wickets remaining, runs scored to date, and the target score. Let V(b,w) be the expected additional runs for the rest of the innings when b (legitimate) balls have been bowled and w wickets have been lost, and let r(b,w) and p(b,w) be, respectively, the estimated expected runs and the probability of a wicket on the next ball in that situation. The equation is –

V(b,w) = r(b,w) + p(b,w) V(b+1,w+1) + (1 - p(b,w)) V(b+1,w)

Factors like the history of games at that venue and conditions on the day (pitch, weather etc.) are considered and scoring rates and probabilities of dismissals are used to make the predictions.
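A minimal sketch of this backward recursion, with purely hypothetical scoring-rate and dismissal-probability functions standing in for the estimated r(b,w) and p(b,w):

```python
# WASP-style expected-runs recursion over balls bowled (b) and wickets lost (w).
MAX_BALLS, MAX_WICKETS = 300, 10        # a 50-over innings

def r(b, w):
    return 0.9 * (1 - w / 12)           # assumed: scoring rate slows as wickets fall

def p(b, w):
    return 0.02                         # assumed: flat 2% dismissal chance per ball

# V[b][w] = expected additional runs with b balls bowled and w wickets down.
V = [[0.0] * (MAX_WICKETS + 1) for _ in range(MAX_BALLS + 1)]
for b in range(MAX_BALLS - 1, -1, -1):
    for w in range(MAX_WICKETS - 1, -1, -1):
        V[b][w] = r(b, w) + p(b, w) * V[b + 1][w + 1] + (1 - p(b, w)) * V[b + 1][w]

print(round(V[0][0], 1))                # projected first-innings total from the first ball
```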

Other successful applications of data science in cricket are –

  • “ScoreWithData”, an analytics innovation from IBM, had predicted that the South African cricketer Imran Tahir would be ranked as the power bowler, 7 hours before the first quarter final of the 2015 World Cup. South Africa went on to win the match on the back of an outstanding performance by Tahir.

  • “Insights”, an interactive cricket analysis tool developed by ESPNCricInfo, is an amalgamation of cricket and big data analytics.
  • In the last T20 World cup in 2016, ESPNCricInfo did some advanced statistical analysis before the start of each match, viz. when Ravichandran Ashwin takes 3 wickets, India’s chance of winning the match increases by 40%.

But the application of data science has been even more extensive in other sports like football. The German Football Association (DFB) and SAP developed a “Match Insights” software system which helped the German national football team win the 2014 World Cup. Billy Beane of “Moneyball” fame was successful by taking the drastic step of disregarding traditional scouting methods in favor of detailed analysis of statistics. This enabled him to identify the most productive players irrespective of the all-around athleticism and merchandise-shifting good looks that clubs had previously coveted.

The future of big data and machine learning is indeed very bright in the world of cricket. While the bowlers shout “Howzat” to try and clinch wickets, we as data scientists, with the help of machine learning and big data, can pose the question: HowStat?


The Evolution Of Data Analytics – Then, Now And Later

The loose definition of data science is analyzing a business's data to produce actionable insights and recommendations for the business. The simplicity or complexity of the analysis, aka the level of “Data Science Sophistication”, also impacts the quality and accuracy of results. That sophistication is essentially a function of 3 main data science components: technological skills, math/stats skills, and the business acumen necessary to define and deliver a relevant business solution. These 3 pillars have very much been the mainstay of data science ever since businesses started embracing it over the past two decades, and they should continue to be in the future. What has changed, or will change in the future, is the underlying R&D in the areas of technology and statistical techniques. I have not witnessed many other industries where these skills become obsolete at such a fast rate. Data science is unique in requiring data scientists and consulting firms to constantly update their skills and be very forward-looking in adopting new and upcoming skills. This article is an attempt to look at how the tool/tech aspects of data science have evolved over the past few decades, and more importantly what the future holds for this fascinating tech- and innovation-driven field.


THEN > NOW > LATER

When businesses first started embracing data science, the objective was to find more accurate and reliable solutions than those obtained using business heuristics, while at the same time keeping the solutions simple enough so as not to overwhelm the business users. The choice of technology was kept simple for easier implementation/consumption, and the same went for math/stats for easier development and explanation. The earlier use cases were more exploratory than predictive in nature, and that also influenced the choice of tools/techs. Another important factor was market availability, both of the products and, more importantly, of analysts with those skills.

  • Data Processing

SAS used to be one of the workhorses of the industry during the 2000s when it came to data processing/EDA jobs and building backend data for reporting and modeling. A few companies used SAS for EDW too, a space otherwise dominated by IBM Netezza, Teradata, and Oracle. SPSS found good use as well, owing to its easy-to-use GUI and the solution suite it offered, which included easy-to-develop (but quite handy) solutions like CHAID, PCA, etc.

  • Predictive Modeling

The so-called “Shallow Learning” techniques were the most common choices (due to the availability of products and resources) when it came to building statistical models. These mostly included linear regression, Naïve Bayes, logistic regression, CHAID, and univariate and exogenous time series methods like smoothing, ARIMA, ARIMAX, etc. for supervised use cases, and K-Means clustering, PCA, etc. for unsupervised use cases. Toolkits like IBM CPLEX or Excel solvers were mostly used to address optimization problems due to their ease of implementation.

  • Visualization

Reports were mostly developed and delivered in Excel, with VBA for complex functionality. Cognos and MicroStrategy were some of the other enterprise tools, typically used by large organizations.

  • Sought Skillsets

Due to the nature of the work described above, the skillsets required were quite narrow and limited to what was available off the shelf. Data science firms typically hired people with statistics degrees and trained them on the job in the required programming skills, which were mainly SQL, SAS, and sometimes VBA.

THEN > NOW > LATER

  • Data Processing

Python & R are the main technologies for the daily data processing chores for today’s data scientist. They are open source tools, have vast and ever-evolving libraries, and also an ability to integrate with big data platforms as well as visualization products. Both R & Python are equally competent and versatile and can handle a variety of use cases. However, in general, R is preferred when the main objective is to derive insights for the business using exploratory analysis or modeling. Due to its general-purpose programming functionality, Python is typically preferred for developing applications which also have an analytics engine embedded in them. These two are not only popular today but they are here to stay for some more years to come.

An important disrupter has been the area of distributed processing frameworks, pioneered by two Apache open source projects: Hadoop and Spark. Hadoop picked up steam in the early 2010s and is still very popular. When it was first introduced, Hadoop's capabilities were limited compared to a relational database system. However, due to its low cost, flexibility, and ability to scale quickly, but more importantly with the development of many MapReduce-based enablers like Hive, Pig, Mahout, etc., it started to deliver benefits and is still the technology of choice in many organizations that produce TBs of data daily.

While Hadoop has been a pioneer in the distributed data processing space, it lacked performance when it came to use cases like iterative data processing, predictive modeling/machine learning (again iterative in nature due to the several steps involved), and real-time/stream processing. This is mainly because MapReduce reads and writes the data back at each step, which increases latency. This was addressed with the advent of Apache Spark, an in-memory distributed framework that holds the data in memory while performing a full operation (the concept of Resilient Distributed Datasets (RDDs) makes this possible). This makes it many times faster than Hadoop's MapReduce-based operations for the use cases mentioned before. More importantly, it is also compatible with many programming languages like Scala, Python, or Java, so developers can use the language of their choice to develop a Spark-based application.
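For instance, a small PySpark sketch (the file name and columns are hypothetical) that caches a dataset in memory once and reuses it across several aggregations illustrates exactly the pattern where Spark's in-memory model beats a read-write-heavy MapReduce job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical clickstream file; cache() keeps the parsed data in memory so the
# two aggregations below reuse it instead of re-reading and re-parsing from disk.
events = spark.read.csv("clickstream.csv", header=True, inferSchema=True).cache()

daily_counts = events.groupBy("date").count()
top_pages = events.groupBy("page").count().orderBy("count", ascending=False).limit(10)

daily_counts.show()
top_pages.show()
```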

  • Predictive Modeling

The machine learning space has also witnessed many advancements, with organizations and data scientists using more and more “deeper” techniques. These are far better than the linear and logistic regressions of the world, as they can uncover complex patterns, non-linearity, variable interactions, etc. and provide higher accuracy. Some of these techniques are captured below:

Supervised – GBM, XGBoost, Random Forests, Parametric GAMs, Support Vector Machines, Multilayer Perceptron

Unsupervised – K-nearest Neighbours, Matrix Factorization, Autoencoders, Restricted Boltzmann Machines

NLP & Text Mining – Latent Dirichlet Allocation (to generate keyword topic tags), Maximum Entropy/SVM/NN (for sentiment classification), TensorFlow etc.

Optimization – Genetic Algorithms, Simulated Annealing, Tabu Search etc.

Ensembling (blending) of various techniques is also being adopted to improve the prediction accuracy by some of the organizations.

While the techniques described above are “deep” in the sense that they are more complex than their predecessors, they should not be confused with the quite different area of “Deep Learning” which, as of today, finds more applications in the AI/computer vision spaces. While “Deep Learning” models, especially deep convolutional networks, can be trained on structured data to solve the usual business use cases, they are mostly employed in the areas of image classification/recognition and image feature learning. One of the reasons Deep Learning has not made headway into regular business use cases is that these models are more resource-intensive to develop, implement, and maintain. They typically require advanced GPUs for development and may not be worthwhile for a regular business use case unless justified by the ROI from increased accuracy. However, a few (non-tech) organizations have started using them for non-AI predictive use cases because the accuracy gains translated into higher ROIs.

  • Visualization

While most organizations favored off-the-shelf products like Tableau, QlikView, Elasticsearch Kibana, etc., many are also adopting open source technologies like D3 and Angular as a low-cost option to develop customized, visually appealing, and interactive web and mobile dashboards. These libraries offer several reusable components and modules which make development fast.

  • Sought Skillsets

With the advancements on both the technology and algorithm fronts, as well as the variety of business use cases organizations are asking about, data science firms have started looking for open-minded thinking, fundamental programming techniques, and basic mathematical skill sets. People with such skills are not only agile at solving any business problem but also flexible in learning new and evolving technologies. It is far easier for such data scientists to not only master R or Python but also quickly climb the learning curve for any emerging technology.

THEN > NOW > LATER

Given the current data science trends, the ongoing R&D, and more importantly some of the use cases that businesses have already started asking about, the future of data science will be heavily focused on 3 things: Automation, Real-Time Data Science Development (not scoring) aka Embedded Data Science, and obviously “Bigger Data”. This should spark the need for, and emergence of, new data science paradigms in the areas of database handling, programming ecosystems, and newer algorithms. More importantly, it will become critical for data scientists to be constantly aware of the ongoing R&D and proactively learn the emerging tools and techniques, not just play a catch-up game – something that is not good for their career.

Amongst the technologies already in use, Scala, Python, PySpark, and the Spark ecosystem should remain the main technology choices for at least the coming 2-3 years. Julia hasn't picked up much steam in mainstream data science work, but it is a fairly useful option due to its similarity with Python and the better execution speeds it offers on single-threaded systems for a good number of use cases. However, Julia may require more development and maturing of its libraries before it really starts being adopted as one of the default choices.

One of the main bets, however, would be Google's Go programming language. One of the main advantages Golang offers is that it enables data scientists to develop "production-ready" data science code, services, and applications. Code written in single-threaded languages is typically very hard to productionize, and a huge amount of effort is required to transition a model from the data scientist's machine to a production platform (testing, error handling, and deployment). Go has performed tremendously well in production, allowing data scientists to build scalable and efficient applications right from the beginning rather than a heavyweight Python-based application. Also, its ability to handle and report errors cleanly ensures that the integrity of the application is maintained over time.

On the algorithm front, we hope to see more development and adoption of deep learning for regular business problems. As explained before, most applications of deep learning today are in image recognition, image feature analysis, and AI. While this area will continue to develop and we will see highly innovative use cases, it would be great to see these algorithms applied to regular business use cases as well. We should also see more adoption of boosting-based algorithms – traditionally these required meticulous training (in order not to overfit) and a large amount of time due to the huge number of iterations involved. However, with the latest advancements like XGBoost and LightGBM, we can expect to see further improved versions of boosting as well as increased adoption.
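As a hedged illustration of how the newer boosting libraries tackle the overfitting and training-time concerns mentioned above, the sketch below uses XGBoost's native API with early stopping on synthetic data; every parameter value here is a placeholder rather than a recommendation.

```python
# Gradient boosting with early stopping: training halts once validation AUC
# stops improving, which guards against the overfitting risk noted above.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "binary:logistic", "eta": 0.05,
          "max_depth": 4, "eval_metric": "auc"}

booster = xgb.train(params, dtrain, num_boost_round=2000,
                    evals=[(dval, "validation")],
                    early_stopping_rounds=50, verbose_eval=False)
print("Stopped at boosting round:", booster.best_iteration)
```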

The area of "big data visualization" should also see more adoption. Interactive visualizations built on top of streaming data require a versatile skill set. Building NodeJS and AngularJS applications on top of Spark (Spark Streaming, Kafka) and tapping into the right database (MongoDB or Cassandra) will remain one of the viable options. Apache Zeppelin with Spark should also see more adoption, especially as it continues to mature.
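The sketch below (a minimal, assumed setup, not production code) shows the Spark-plus-Kafka half of that pattern in PySpark: reading a stream from a Kafka topic and producing windowed aggregates that a NodeJS/AngularJS dashboard could then visualize. The broker address, topic name, schema, and console sink are all illustrative assumptions.

```python
# Reading a Kafka stream with Spark Structured Streaming and producing
# per-minute aggregates that a dashboard layer could visualize.
# Broker, topic, schema and sink below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("streaming-viz-feed").getOrCreate()

schema = (StructType()
          .add("event_time", TimestampType())
          .add("product", StringType())
          .add("amount", DoubleType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "sales_events")                  # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# One-minute windowed totals per product, with a watermark for late events
per_minute = (events
              .withWatermark("event_time", "2 minutes")
              .groupBy(window("event_time", "1 minute"), "product")
              .sum("amount"))

query = (per_minute.writeStream
         .outputMode("update")
         .format("console")   # swap for a MongoDB/Cassandra sink in practice
         .start())
query.awaitTermination()
```

Running this also assumes the spark-sql-kafka connector package is available on the cluster; in a real deployment the console sink would be replaced with MongoDB or Cassandra as mentioned above.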

The data science industry is evolving at an exponential pace – whether in the type of use cases it addresses or in the shifts and availability in the tool/technology space. The key to a successful data science practice is to hire individuals who stay constantly aware of the R&D in the tool/tech space and are not afraid to embrace a new technology. Like a good forecast model, a good data scientist should evaluate the available inputs (i.e., tech options already in use or under development) and predict what the best option(s) for tomorrow will be.

Analytics for Non-Profit Organisations

Analytics has been growing at a rapid pace across the world. Well-established companies have realized the importance of analytics in their business, where it informs the crucial decisions that drive revenue. But why should only well-established corporates leverage this statistical and computational modus operandi when it can also be applied in an arena that needs it badly?

The idea is to put analytics to work for non-profit social organizations and provide a breakthrough. These are organizations that strive for the upliftment of society by taking on social responsibilities, covering a wide variety of causes such as education, health, food, and shelter.

There are three main categories where the power of analytics could be utilized to its full potential:

  • Fundraising
  • Churn analysis
  • SROI (Social Return On Investment)

Fundraising analytics

One of the major factors that helps NGOs grow financially is fundraising. It involves planning and executing offline and digital campaigns to spread awareness among the public and let the outside world know about the work happening in the organization.

Fundraising analytics comes down to studying donor behavior. The first step towards understanding that behavior is careful segmentation of the donor population, which then makes it possible to categorize each donor into a segment. Later, targeting and recommendations can be carried out by considering distinguishing factors such as a donor's previous donation patterns, the average number of calls made to the donor, their financial stability, and so on. The cause being funded will always be a major driver, as different audiences respond to different social causes such as LGBT rights, cancer awareness, women's empowerment, etc.
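As a minimal sketch of that segmentation step, the snippet below clusters donors with k-means; the file name (donors.csv) and column names are our assumptions, purely for illustration.

```python
# Donor segmentation sketch: cluster donors on behavioural features with k-means.
# donors.csv and its columns are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

donors = pd.read_csv("donors.csv")  # hypothetical donor file
feature_cols = ["total_donated", "donations_last_year",
                "avg_calls_per_month", "income_estimate"]  # assumed columns

scaled = StandardScaler().fit_transform(donors[feature_cols])

# Group donors into a handful of behavioural segments
donors["segment"] = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(scaled)

# Profile each segment to decide which campaign or cause to target it with
print(donors.groupby("segment")[feature_cols].mean())
```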

Churn Analysis

Churn analysis throws light on child education in non-profit organizations. A pressing issue in every state in India is the number of students dropping out of school. These students need financial support to resume their education. Let's assume each student needs Rs 5,000 of support for a year. If the organization targets 10,000 kids, its expenditure would be Rs 50 Mn. By applying analytics – say at an additional cost of around Rs 1,00,000 – it can instead concentrate on the 10% of kids who have a very high propensity to drop out. The expenditure comes down to about Rs 5 Mn, and the support is focused on the children most at risk of dropping out.
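A hedged sketch of that propensity idea is below: train a classifier on historical data, score the current cohort, and fund the top 10% by predicted dropout risk. The file names, feature columns, and label are all hypothetical.

```python
# Dropout-propensity sketch: train on historical data, score the current cohort,
# and shortlist the top 10% by predicted risk. File and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

students = pd.read_csv("students.csv")  # hypothetical historical data with outcomes
features = ["attendance_rate", "family_income",
            "distance_to_school_km", "num_siblings"]               # assumed features
X_tr, X_te, y_tr, y_te = train_test_split(students[features],
                                          students["dropped_out"],  # assumed label
                                          test_size=0.2, random_state=7)
model = GradientBoostingClassifier(random_state=7).fit(X_tr, y_tr)

# Score current students and keep the 10% most likely to drop out
current = pd.read_csv("current_cohort.csv")  # hypothetical current cohort
current["risk"] = model.predict_proba(current[features])[:, 1]
high_risk = current.nlargest(int(0.10 * len(current)), "risk")

print("Students shortlisted for support:", len(high_risk))
print("Estimated spend (Rs):", len(high_risk) * 5000)
```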

SROI (Social Return on Investment)

SROI is similar to the concept of Return on Investment (ROI), except that in addition to financial factors, social and environmental outcomes also play a major role in determining the health of a non-profit organisation.

SROI = (Tangible + Intangible value to community) / Total resource investment
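As a toy illustration of the formula (all figures below are made up):

```python
# Toy SROI calculation; every figure here is made up for illustration.
tangible_value = 250_000    # e.g. fees covered, incomes generated
intangible_value = 100_000  # e.g. awareness and community impact, in money terms
total_investment = 125_000  # total resources invested

sroi = (tangible_value + intangible_value) / total_investment
print(f"SROI = {sroi:.2f}")  # 2.80: each unit invested returns 2.8 in social value
```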

The concept of SROI in NGOs rests on the principle of translating social and environmental outcomes into a dollar value. Let us take the example of gender discrimination, which happens in many parts of the world.

Possible outcomes after addressing gender discrimination:

  • Women getting good-quality education equal to men
  • Women getting job opportunities and earning a living through them
  • Awareness spreading to the wider community

Now, to measure SROI in this scenario, one should associate a financial equivalent with each of the above outcomes so that every achievement can be communicated as a dollar value.

In the first case, women attaining education would be valued at the fees paid for each girl student. Secondly, the job opportunities would be valued at the income earned by the women, which contributes to SROI. Finally, scaling the idea up across society amplifies the first two outcomes across communities and geographies.

Under the umbrella of analytics, SROI estimation can be implemented with the help of a classifier that predicts each outcome. With the above example, there are three business questions that can be answered:

  • Whether the women will get a quality education or not
  • Whether the women will get a job opportunity or not
  • Whether the women will be spreading awareness or not

Based on several factors such as demographics, the number of siblings, and the qualifications and occupations of the parents, we can predict, for each woman, whether the expected outcomes are likely to be achieved. That gives us the share of the women population achieving the expected outcomes in the problem of gender equality. Eventually, the net financial worth associated with those outcomes (similar to the net worth in the GDP of a country), together with the time and resources involved in achieving them, gives us the investment side of the SROI calculation.
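One possible, hedged sketch of such a classifier is below: a multi-output model that predicts all three outcomes at once. The data file, feature columns, and outcome labels are our assumptions for illustration only.

```python
# Multi-output classifier sketch: predict, per beneficiary, whether each of the
# three outcomes is likely. Data file, features and labels are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

data = pd.read_csv("beneficiaries.csv")  # hypothetical programme data
features = ["age", "num_siblings", "parent_education_level", "parent_occupation_code"]
outcomes = ["gets_education", "gets_job", "spreads_awareness"]  # assumed labels

X_tr, X_te, y_tr, y_te = train_test_split(data[features], data[outcomes],
                                          test_size=0.2, random_state=3)

clf = MultiOutputClassifier(RandomForestClassifier(random_state=3)).fit(X_tr, y_tr)

# Expected counts per outcome feed the financial (numerator) side of the SROI
predicted = pd.DataFrame(clf.predict(X_te), columns=outcomes)
print(predicted.sum())
```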

The discussion above demonstrates how analytics can be applied in a domain that needs enormous and sustained support. Followed carefully, this set of procedures can guide many social organizations to leverage analytics while making careful use of the resources available to them.

