Announcing GeneraX – Affine’s Generative AI Product Suite

Affine has a rich legacy of developing AI-powered solutions. Right from its inception, there has been a strong emphasis on not just developing superior quality solutions but enhancing our learning curves and innovation opportunities. This approach helped us open up new avenues to solve business problems effectively. Thus, it has been the single most important differentiator, allowing us to build production-grade AI solutions for several global businesses.

Our accolades from global AI hackathons across multiple industries are a testament to the depth of our AI knowledge and the maturity of our practices. In hackathons such as Datacentric AI, Hackerearth, and Kaggle, we were the only AI company to secure a spot in the top percentile among dedicated academic researchers in the field.

In the World of NLP:

Affine’s mastery in leveraging Transformer technology is displayed well in our NLP solutions. We were able to combine our Deep Learning expertise with open-source technologies like BERT, RoBERTa, etc., to deliver ground-breaking solutions that helped organizations reduce a significant amount of manual effort and deliver more accurate results. Some of the most recent solutions we developed are a Document summarizer, Context-based enhanced search, and a Contextual AI Chatbot. You can contact us to learn how these solutions can help your business.

In the World of Vision:

Specialization in Stable Diffusion matured during the development of our Satellite Image Segmentation product – Telescope. We used Stable Diffusion to create synthetic data that could be used to train the Image Segmentation Model. Telescope was thus developed with the intent to save millions of dollars and months of effort that would go into land surveys in multiple industries. We also created a mechanism using GAN models to create new gaming characters.

The Upcoming Generative AI Product Suite – GeneraX

The last few months have witnessed the widespread adoption of Generative AI, such as OpenAI’s GPT for text generation, DALL-E 2 for image generation, and Google’s Bard chatbot. Despite some limitations, these AI implementations are revolutionary and provide excellent results. However, they are not completely business ready. A significant effort is required to ensure that these implementations give professional-grade, meaningful, and usable outcomes to businesses.

The grueling hours spent learning the in-depth workings of different AI technologies have always been guided by our intent to build the best real-world solutions that businesses can use and benefit from. Affine’s knowledge of how things work under the hood is coming together with GPT-3 and DALL-E 2 to create enterprise-level SaaS products. The GPT and DALL-E APIs have helped us speed up development, widen our scope, and convert the boutique solutions we pride ourselves on into plug-and-play products.

We’re kicking off our Generative AI product suite – GeneraX – with CreAItive!


“Are you a marketer frustrated with the prolonged ideation involved in designing creatives? Do you spend hundreds of thousands of dollars on marketing-ready creatives, only to get a handful of variations? It’s time to get over this creative generation cycle. Introducing Affine’s Image Segmentation and Stable Diffusion powered CreAItive. It’s a one-stop shop for design ideation, experimentation, and creation of 100+ market-ready images on the go at a fraction of the time and cost.”

What is Web3? What are its Use Cases?

In recent years, we have witnessed a massive shift towards digitization across various industries, from finance to healthcare, education, and entertainment. Digital Transformation has brought numerous benefits, such as convenience, efficiency, and accessibility. However, it has also created new challenges, such as centralization, data breaches, and privacy concerns.

Here comes Web3! It’s a new generation of the internet that promises to address these challenges by leveraging the power of decentralized networks. In this blog, we will explore the exciting world of Web3 and its potential to revolutionize how we interact with cyberspace. So, buckle up and get ready to uncover the future of decentralized digitization with Web3!

What is Web3?

The web we currently use to access and share information is the 2nd generation. In this version of the web, the content we produce is saved on a central server controlled by an authority. Data ranging from emails, health tracker data, shopping interests, social media posts, photos, and entertainment choices to web browsing patterns is collected from users on a regular basis and stored by a centralized service provider, leaving users with no control over their data.

This data has never truly been owned by the user, but rather by the central authority controlling the service. Web3, the 3rd generation of the web, aims to solve this critical ownership problem by shifting control of content from the central authority back to the users. Users have complete control over what they share and with whom, and can revoke permissions at any time. Web3 is all about less trust and more truth.

How will Web3 be different from Web2?

The real necessity of Web3 – Let’s look at real-life use cases that have facilitated the design thinking towards web3:

Use Case 1: Many of us have played or heard of FarmVille, the popular Flash-based game designed by Zynga for Facebook. In 2020, after 11 years of service, the game was discontinued, leaving millions of fans unable to access the game assets they had purchased over the years. Web3 can solve this problem by transferring the ownership of those assets as limited-time collectibles to the fans who bought them on an open decentralized marketplace.

Use Case 2: When the popular social media site Orkut shut down, millions of users lost access to the photos and posts they had shared on the platform, which were actual memories from the early days of the web in the 2000s. Web3 can solve this problem by returning control of user data (posts, media) to the users and, by making that data interoperable, giving them the freedom to take it to the platform of their choice.

Use Case 3: Free speech is a powerful principle of democracy and should be censorship resistant. There are many cases of social media accounts being banned simply for criticizing the flaws of an authority, even when the criticism is true, which suppresses the free flow of open speech. Essentially, the users are permanently locked out of their previous posts. An existing Web3-based decentralized social media platform like Mastodon solves this problem: users control the data they publish and can interoperate with other platforms of their choice, so that there is always a single source of truth that is censorship resistant.

What are the benefits of providing access to user data?

Healthcare data, for instance, can be shared peer-to-peer with various medical sources to advance medical research. Our photos and media, meanwhile, can be made available to Facebook, Instagram, Flickr, etc., without being uploaded to each one individually. The most important aspect of any Web3 application should be its incentive structure: users who choose to provide access to their data should be rewarded for that contribution when companies use it, which is clearly lacking in the Web2 world.

Is Web3 based on blockchain?

One common misconception is that Web3 is completely blockchain-based. In truth, Web3 is a culmination of technologies, and blockchain is just one part of it. For instance, blockchains like Bitcoin and Ethereum provide solid, trustless, permissionless cross-border payments between individuals without any central banking authority controlling the transaction. Blockchains are an excellent fit for Web3, since public platforms for incentive structures, decentralized access, decentralized finance, NFTs, and DAOs can be built on them to support the principles of the Web3 ideology. Even standard technologies can be part of Web3 application development, provided the application implements the basic principles of user privacy, ownership, and censorship-resistant data flow.

Web3 and Gaming Applications

As the trend towards adoption of Web3 grows, we will see more games built around incentivizing users. Game designers will release limited game assets as collectible NFTs to fans, making them partners in the development process and creating a win-win scenario for companies and fans alike when the game performs well. Users can be assured that they will still own the game assets as collectibles even if the game shuts down in the future.

Web3 and DeFi (Decentralized Finance)

The true potential of finance will be unlocked when more financial products are implemented around the principles of Web3 and Decentralized Finance. Existing applications like Uniswap and Airswap have already taken the first steps in the evolution of Web3 financial products. Imagine finance becoming peer-to-peer between any two parties in the world, where the transaction rules are governed by a contract running autonomously on a trustless network. This removes a great deal of unnecessary paperwork and intermediary fees and, most importantly, saves a lot of time by giving instant access to various financial products, even in remote parts of the world where banking is a luxury. Decentralized cross-border payments are the future.

Web3 and Metaverse

The Metaverse is a digital platform that provides an immersive experience to users using AR and VR technologies. We can view this as a 3D web where users can have 3D interactions with other users, bots, and applications. Metaverse as a platform will be there for enhanced social connections. Imagine Facebook as a 2D place where you can add a friend, chat with someone, join a group, etc. The same actions can take place in Metaverse in 3D with enhanced user experience and social connections. Web3, in some ways, will be a component of this digital social experience by powering apps that are censorship resistant, decentralized, and secure.

Web3 and AI

AI is where the full potential of Web3 principles ultimately comes into play. By owning their data in its various forms, users will have complete control over whom they give access to, and can earn an incentive for doing so. Imagine companies building AI models with access to reliable, high-quality data from real users who are willing to participate in their development activity. The users have the right to control what information they share and get incentivized, and the companies get access to golden data to build AI models that perform better than ones trained on noisy data. Web3 principles will govern the flow of and access to this data, creating a more inclusive environment.

Summing up!

Privacy by design and by default, less trust and more truth, and decentralized, censorship-resistant ownership are among the principles of any future Web3 application. Following these principles can enable an ecosystem where humans, bots, devices, and applications operate securely on a trustless network. While Web3 is primarily a concept under development today, some early applications have demonstrated its implementation, such as Odysee, a decentralized video-sharing app; NFT marketplaces, where users have the freedom to sell an NFT on a platform of their choice just by connecting their wallet; and the Mastodon social network. In Web3, we can even imagine building decentralized machine learning models that can perform more efficiently.

Deep Learning Demystified 2: Dive Deep into Convolutional Neural Networks

The above photo was not created by a specialized app or Photoshop. It was generated by a Deep Learning algorithm that uses convolutional networks to learn artistic features from various paintings and transforms any photo to depict how an artist would have painted it.

Convolutional Neural Networks have become part of state-of-the-art solutions in areas like

  • Image recognition
  • Object recognition
  • Self-driving cars (identifying pedestrians and objects)
  • Emotion recognition
  • Natural Language Processing

A few days back, Google surprised me with a video called Smiles 2016, which put together all my photos from 2016 where I was partying with family, friends, and colleagues. It was a collection of photos in which everyone was smiling: emotion recognition at work. In this blog, we will discuss a couple of Deep Learning architectures that power these applications.

Before we dive into CNNs, let’s try to understand why we would not simply use a Feed-Forward Neural Network. According to the universality theorem, which we discussed in the previous blog, a network can approximate any function just by adding neurons (functions), but there is no guarantee on how long it will take to reach the optimal solution. Feed-Forward Neural Networks flatten images into a flat vector, thus losing all the spatial information that comes with an image. So for problems where spatial features are important, CNNs tend to achieve higher accuracy in a much shorter time than Feed-Forward Neural Networks.

Before we look at what a Convolutional Neural Network is, let’s get comfortable with the nuts and bolts that form it.


Let’s start with how a computer looks at an image.

What we see
What a computer sees

A computer sees images and videos as matrices of numbers. A common way of representing an image in computer vision is as a matrix of dimensions Width * Height * Channels, where the channels are Red, Green, and Blue, and sometimes an alpha channel as well.
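To make the matrix view concrete, here is a minimal sketch in plain Python. The pixel values and the `to_grayscale` helper are made up for illustration, and the nested lists are indexed rows first (height, then width, then channel), which is how most array libraries store images.

```python
# A tiny 2x3 RGB "image" as nested lists: image[row][column] -> [R, G, B].
# The pixel values are illustrative only; each channel ranges from 0 to 255.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],        # red, green, blue
    [[255, 255, 255], [0, 0, 0], [128, 128, 128]],  # white, black, gray
]

height = len(image)          # number of rows
width = len(image[0])        # number of columns
channels = len(image[0][0])  # 3 for RGB, 4 if an alpha channel is present


def to_grayscale(img):
    """Average the colour channels of every pixel into a single intensity."""
    return [[sum(pixel) // len(pixel) for pixel in row] for row in img]


gray = to_grayscale(image)
print(width, height, channels)
print(gray)
```

Averaging the channels is only one simple way to collapse colour into a single number per pixel; it is used here because it keeps the example short.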


Filters are small matrices of numbers, usually of size 3*3*3 (width, height, channels) or 7*7*3. Filters perform various operations on a given image, such as blur, sharpen, and outline. Historically, these filters were carefully hand-picked to extract various features of an image. In our case, the CNN learns these filters automatically using a combination of techniques like gradient descent and backpropagation. Filters are moved across an image from the top left to the bottom right to capture all the essential features. In neural networks, they are also called kernels.


In a convolutional layer, we convolve the filter with patches across an image. For example, the left-hand side of the image below is a matrix representation of a dummy image, and the middle is the filter or kernel. The right-hand side shows the output of the convolution layer. Look at the formula in the image to understand how the kernel and a part of the image are combined to form a new pixel.

First pixel in the image being calculated

Let’s see another example of how the next pixel in the image is being generated.

The second pixel in the output image is being calculated.
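The same calculation can be sketched in a few lines of plain Python. The 5*5 image and 3*3 kernel below are illustrative values, not necessarily the ones in the figures above, and `convolve2d` is a hypothetical helper that assumes a single channel, a stride of 1, and no padding.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Multiply the image patch element-wise with the kernel and sum.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Each output value is one "new pixel": the first is computed from the top-left 3*3 patch, the second from the patch one column to the right, and so on.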


Max pooling is used for reducing dimensionality and down-sampling an input. The best way to understand max pooling is through an example. The image below describes what a 2*2 max pooling layer does.

In both the convolution and max pooling examples, the images show the calculation for only 2 pixels, but in reality the same technique is applied to the entire image.
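A 2*2 max pooling layer applied to a whole matrix can be sketched as follows. The input values are illustrative, and `max_pool2d` is a hypothetical helper that assumes non-overlapping windows (stride equal to the window size).

```python
def max_pool2d(matrix, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    pooled = []
    for top in range(0, len(matrix) - size + 1, size):
        row = []
        for left in range(0, len(matrix[0]) - size + 1, size):
            block = [
                matrix[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4*4 input shrinks to 2*2, which is the down-sampling effect described above.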

Now, with an understanding of all the important components, let’s take a look at what a Convolutional Neural Network looks like.

The example used in Stanford CNN classes

As you can see from the above image, a CNN is a combination of layers stacked together. The above architecture can be simply depicted as (CONV-RELU-CONV-RELU-POOL) * 3 + Fully Connected Layer.


Convolutional Neural Networks need huge amounts of labeled data and lots of computation power to train. Architectures like AlexNet, ZF Net, VGG Net, GoogLeNet, and Microsoft ResNet typically take weeks of training to achieve state-of-the-art performance. Does that mean an organization without huge volumes of data and computation power cannot take advantage of them? The answer is no.

Transfer Learning to the Rescue

Most of the winners of the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) competition have open sourced their architectures and the weights associated with these networks. It turns out that most of the weights, particularly those of the filters, can be reused after fine-tuning to domain-specific problems. So to take advantage of these convolutional neural networks, all we need to do is re-train the last few layers of the network, which in general takes very little data and computation power. For several of our scenarios, we were able to train models with state-of-the-art performance on GPU machines in a few minutes to hours.


Apart from use cases like image recognition, CNNs are widely used in various network topologies for object recognition (what objects are located in images), GANs (a recent breakthrough in helping computers create realistic images), converting low-resolution images to high-resolution images, revolutionizing the health sector through various forms of cancer detection, and many more. In recent months, architectures have also been built for NLP that achieve state-of-the-art results.

Statistical Model Lifecycle Management

Organizations have realized quantum jumps in business outcomes through the institutionalization of data-driven decision making. Predictive Analytics, powered by the robustness of statistical techniques, is one of the key tools leveraged by data scientists to gain insight into probabilistic future trends. Various mathematical models form the DNA of Predictive Analytics.

A typical model development process includes identifying factors/drivers, data hunting, cleaning and transformation, development, validation (business and statistical), and finally productionization. In the production phase, as actual data is included in the model environment, the true accuracy of the model is measured. Quite often there are gaps (errors) between predicted and actual numbers. Business teams have their own heuristic definitions and benchmarks for this gap, and any deviation leads to a search for additional features/variables and data sources, and finally to rebuilding the model.

Needless to say, this leads to delays in business decisions and has several cost implications.

Can this gap (error) be better defined, tracked and analyzed before declaring model failure? How can stakeholders assess the Lifecycle of any model with minimal analytics expertise?

At Affine, we have developed a robust and scalable framework which can address the above questions. In the next section, we will highlight the analytical approach and present a business case where this was implemented in practice.


The solution was developed based on the concepts of Statistical Quality Control, especially the Western Electric rules. These are decision rules for detecting “out-of-control” or non-random conditions using the principle of process control charts. The distribution of the observations relative to the control chart indicates whether the process in question should be investigated for anomalies.

X is the Mean error of the analytical model based on historical (model training) data. Outlier analysis needs to be performed to remove any exceptional behavior.
Zone A = Between Mean ± (2 x Std. Deviation) & Mean ± (3 x Std. Deviation)
Zone B = Between Mean ± Std. Deviation & Mean ± (2 x Std. Deviation)
Zone C = Between Mean & Mean ± Std. Deviation.
Alternatively, Zone A, B, and C can be customized based on the tolerance of Std. Deviation criterion and business needs.

  1. Any single data point falls outside the 3σ limit from the centerline (i.e., any point that falls outside Zone A, beyond either the upper or lower control limit)
  2. Two out of three consecutive points fall beyond the 2σ limit (in zone A or beyond), on the same side of the centerline
  3. Four out of five consecutive points fall beyond the 1σ limit (in zone B or beyond), on the same side of the centerline
  4. Eight consecutive points fall on the same side of the centerline (in zone C or beyond)

If any of the rules are satisfied, it indicates that the existing model needs to be re-calibrated.
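The four rules above can be checked with a short routine. This is a minimal sketch, not the framework described in this article: the function names are hypothetical, the centerline and standard deviation are assumed to come from the historical (training) errors after outlier removal, and each rule is evaluated only on the most recent window of points.

```python
import statistics


def control_limits(training_errors):
    """Centerline (mean) and standard deviation of historical model errors."""
    return statistics.mean(training_errors), statistics.stdev(training_errors)


def western_electric_violations(errors, mean, std):
    """Return the rule numbers (1 to 4) triggered by the latest points."""
    z = [(e - mean) / std for e in errors]  # distance from centerline in sigmas
    violations = set()
    if not z:
        return violations

    # Rule 1: the latest point lies beyond the 3-sigma limit.
    if abs(z[-1]) > 3:
        violations.add(1)

    def count_beyond(window, limit, side):
        # Points farther than `limit` sigmas from the centerline on one side
        # (side = +1 for above the centerline, -1 for below it).
        return sum(1 for value in window if side * value > limit)

    for side in (1, -1):
        # Rule 2: two of the last three points beyond 2 sigma, same side.
        if len(z) >= 3 and count_beyond(z[-3:], 2, side) >= 2:
            violations.add(2)
        # Rule 3: four of the last five points beyond 1 sigma, same side.
        if len(z) >= 5 and count_beyond(z[-5:], 1, side) >= 4:
            violations.add(3)
        # Rule 4: the last eight points all on the same side.
        if len(z) >= 8 and count_beyond(z[-8:], 0, side) == 8:
            violations.add(4)
    return violations


recent_errors = [0.1, 2.5, 0.2, 2.4]  # illustrative values, in sigma units
triggered = western_electric_violations(recent_errors, mean=0.0, std=1.0)
needs_recalibration = bool(triggered)
print(triggered, needs_recalibration)
```

In practice `mean` and `std` would be taken from `control_limits` on the training errors, and the check would be re-run each time a new actual value arrives.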

Business Case

A large beverage company wanted to forecast industry-level demand for a specific product segment in multiple sales geographies. Affine evaluated multiple analytical techniques and identified a champion model based on accuracy, robustness, and scalability. Since the final model was to be owned by the client’s internal teams, Affine enabled them to assess the lifecycle stage of the model through an automated process. A visualization tool was developed that included an alert system to help users proactively identify any red flags. A detailed escalation mechanism was outlined to address any queries or red flags related to model performance or accuracy.

Fig 1: The most recent data available runs through Jun-16. An amber alert indicates that an anomaly has been identified, but it is most likely an exceptional case.

Following are possible scenarios based on actual data for Jul-16.

Case 1:

Process in control and no change to model required.

Case 2:

A red alert is generated, which indicates that the model is not able to capture some macro-level shift in industry behavior.

Key Impact and Takeaways

  1. Quantify and develop benchmarks for error limits.
  2. A continuous monitoring system to check if predictive model accuracies are within the desired limit.
  3. Prevent undesirable escalations thus rationalizing operational costs.
  4. Enabled through a visualization platform, and hence does not require strong analytical expertise.

Product Life Cycle Management in Apparel Industry

Product Life Cycle Estimation

“Watch the product life cycle; but more important, watch the market life cycle”

~Philip Kotler


The product life cycle describes the period over which an item is developed, brought to market and eventually removed from the market.

This paper describes a simple method to estimate Life Cycle stages – Growth, Maturity, and Decline (as seen in the traditional definitions) of products that have historical data of at least one complete life cycle.

Here, two different calculations have been done, which help the business identify the number of weeks after which a product moves to a different stage and apply the PLC to improve demand forecasting.

A log-growth model is fit using Cumulative Sell Through and Product Age, which helps to identify the various stages of the product. A log-linear model is fit to determine the rate of change of product sales due to a shift in its stage, ceteris paribus.

The life span of a product and how fast it goes through the entire cycle depend on market demand and on how marketing instruments are used, and they vary across products. Products of fashion, by definition, have a shorter life cycle, and they thus have a much shorter time in which to reap their reward.

An Introduction to Product Life Cycle (PLC)

Historically, PLC is a concept that has been researched as early as 1957 (refer Jones 1957, p.40). The traditional definitions mainly described 4 stages – Introduction, Growth, Maturity, and Decline. This was used mainly from a marketing perspective – hence referred to as Marketing-PLC.

With the development of new types of products and additional research in the field, Life Cycle Costing (LCC) and Life Cycle Assessment (LCA) were added to the traditional definition to give the Engineering PLC (or E-PLC). This definition considers the cost of using the product during its lifetime, services necessary for maintenance and decommissioning of the product.

According to Philip Kotler, ‘The product life cycle is an attempt to recognize distinct stages in sales history of the product’. In general, PLC has 4 stages – Introduction, Growth, Maturity, and Decline. But for some industries with fast-moving products, such as apparel, PLC can be defined in 3 stages. PLC helps to study the degree of product acceptance by the market over time, including major rises or falls in sales.

PLC also varies based on product type that can be broadly divided into:

  1. Seasonal – Products that are seasonal (e.g., mufflers, which are on shelves mostly in winter) have a steeper incline/decline due to the short growth and decline periods.
  2. Non-Seasonal – Products that are non-seasonal (e.g., jeans, which are promoted in all seasons) have longer maturity and decline periods, as sales tend to continue as long as stocks last.

Definition of Various Stages of PLC

Market Development & Introduction

This is when a new product is first brought to market before there is a proved demand for it. In order to create demand, investments are made with respect to consumer awareness and promotion of the new product in order to get sales going. Sales and Profits are low and there are only a few competitors in this stage.


Growth

In this stage, demand begins to accelerate and the size of the total market expands rapidly. Production costs fall and high profits are generated.

Maturity

The sales growth reaches a point above which it will not grow. The number of competitors increases and so market share decreases. The sales will be maintained for some period with a good profit.

Decline

Here, the market becomes saturated and the product is no longer sold and becomes unpopular. This stage can occur as a natural result but can also be due to the introduction of new and innovative products and better product features from competitors.

This paper deals with the traditional definition of PLC and the application in Fashion products.

Why do Businesses Need PLC and How Does it Help Them?

Businesses have always invested significant amounts of resources to estimate PLC and demand. Estimating the life cycle of a new product accurately helps business take several key decisions, such as:

  • Provide promotions and markdowns at the right time.
  • Plan inventory levels better by incorporating PLC in demand prediction.
  • Plan product launch dates/season.
  • Determine the optimal discount percentages based on a product’s PLC stage (as discussed later in this paper).

Businesses primarily rely on the business sense and experience of their executives to estimate a product’s life cycle. Any data driven method to easily estimate PLC can help reduce costs and improve decision making.

How Does the Solution in this Paper Help?

The solution detailed in this paper can help businesses use data of previously launched products to predict the life cycles of similar new products. The age at which products transition from one life cycle phase to another as well as the life cycle curves of products can be obtained through this process. This also helps to identify the current stage of the products and the rate of sales growth during stage transition.

Below is an overview of the steps followed to achieve these benefits:

  • To identify products similar to a newly released product, we clustered products based on the significant factors affecting sales. This gives us a chance to obtain a data based PLC trend.
  • Next, sales data is used to plot the Cumulative Sell Through Rate vs Product Age (in weeks).
  • A log-growth model fit across this plot will provide the Life Cycle trend of that product or cluster of products.
  • The second derivative of this curve can be analyzed to identify shifts in PLC phases and to estimate the duration of each PLC phase.

Detailed Approach to Estimate PLC

The process followed to determine the different PLC stages is a generic one that can be incorporated into any model. However, in this paper, we have described how it was employed to help determine the effect of different PLC stages on sales for the apparel industry.

The procedure followed has been described in detail in the steps below:

i. Product Segmentation

The first step in estimating PLC is to segment products based on the features that primarily influence sales.

To predict the life cycle factor in demand prediction of a new product, we need to find similar products among those launched previously. The life cycle of the new product can be assumed to be similar to these.

ii. Identification of PLC Stages

To identify various stages, factors like Cumulative Sell through rate and Age of product were considered. The number of weeks in each stage was calculated at category level which consists of a group of products.

Cumulative sell through is defined as the cumulative sales over the period divided by the total inventory at the start of the period. Sales of products were aggregated at category level by summing sales at the same product age. For example, the sales of all products at an age of 1 week were aggregated to get the sales of that category in week 1.
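The aggregation and the cumulative sell through calculation can be sketched as below. The sales figures are invented, `cumulative_sell_through` is a hypothetical helper, and the starting inventory is assumed to be the category total across all products.

```python
from itertools import accumulate


def cumulative_sell_through(weekly_sales_by_product, starting_inventory):
    """Cumulative sell-through rate of a category by product age.

    weekly_sales_by_product: one list per product, where index 0 holds the
    units sold when the product was 1 week old, index 1 at 2 weeks, etc.
    starting_inventory: total category inventory at the start of the period.
    """
    n_weeks = max(len(sales) for sales in weekly_sales_by_product)
    # Sum the sales of all products at the same product age.
    category_sales = [
        sum(sales[week] for sales in weekly_sales_by_product if week < len(sales))
        for week in range(n_weeks)
    ]
    # Running total of sales divided by the starting inventory.
    return [total / starting_inventory for total in accumulate(category_sales)]


# Two illustrative products; the second has one more week of history.
sales = [
    [10, 20, 30],
    [5, 15, 25, 5],
]
rates = cumulative_sell_through(sales, starting_inventory=200)
print(rates)
```

The resulting list is the y-axis series (one value per week of product age) that a growth curve can then be fitted to.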

After exploring multiple methods to determine the different stages, we have finally used a log-growth model to fit a curve between age and cumulative sell through. Its equation is given below for reference:

Note: Φ1, Φ2 & Φ3 are parameters that control the asymptote and growth of the curve.

The cut-offs for the different phases of the product life cycle were obtained using the inflexion points of the fitted curve.

The fitted curve had 2 inflexion points that made it easy to differentiate the PLC stages.
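As an illustration of how such cut-offs could be derived, the sketch below assumes a common three-parameter logistic growth form, y = Φ1 / (1 + exp((Φ2 − age) / Φ3)), with Φ1 as the asymptote, Φ2 as the midpoint age, and Φ3 as the scale. That form, the parameter roles, the function names, and the parameter values are all assumptions and may differ from the model actually fitted. The cut-offs are taken where the second derivative of the curve peaks and troughs, which for this form happens at Φ2 ± Φ3 · ln(2 + √3).

```python
import math


def log_growth(age, phi1, phi2, phi3):
    """Assumed logistic form: phi1 = asymptote, phi2 = midpoint, phi3 = scale."""
    return phi1 / (1.0 + math.exp((phi2 - age) / phi3))


def stage_cutoffs(phi2, phi3):
    """Ages at which the curvature (second derivative) is at its extremes.

    For the logistic curve these lie symmetrically around the midpoint,
    at a distance of phi3 * ln(2 + sqrt(3)).
    """
    offset = phi3 * math.log(2.0 + math.sqrt(3.0))
    return phi2 - offset, phi2 + offset


def plc_stage(age, phi2, phi3):
    """Label a product age with a life cycle stage using the two cut-offs."""
    growth_end, maturity_end = stage_cutoffs(phi2, phi3)
    if age < growth_end:
        return "Growth"
    if age < maturity_end:
        return "Maturity"
    return "Decline"


# Illustrative parameters: 90% final sell through, midpoint at week 10.
phi1, phi2, phi3 = 0.9, 10.0, 2.0
growth_end, maturity_end = stage_cutoffs(phi2, phi3)
print(round(growth_end, 2), round(maturity_end, 2))
print([plc_stage(week, phi2, phi3) for week in (5, 10, 15)])
```

With real data, the three parameters would first be estimated by non-linear least squares on the cumulative sell through series; the cut-offs then follow directly from the fitted Φ2 and Φ3.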

The plot above shows the variation of Cumulative sell through rate (y-axis) vs Age (x-axis). The data points are colored based on the PLC life stage identified:  Green for “Growth Stage”, Blue for “Maturity Stage” and Red for “Decline Stage”.

Other Methods Explored

Several other methods were explored before determining the approach discussed in the previous section. The decision was based on the advantages and drawbacks of each of the methods given below:

Method 1:

Identification of PLC stages by analyzing the variation in Sell through and Cumulative Sell through.

Steps followed:

  • Calculated (Daily Sales / Total Inventory) across Cumulative Sell through rate at a category level
  • A curve between Cumulative Sell through rate (x-axis) and (Daily Sales / Total Inventory) in the y-axis was fitted using non-linear least square regression
  • The cut-offs for the different phases of the product life cycle are obtained using the inflexion points of the fitted curve

Advantages: The fitted curve followed a ‘bell-curve’ shape in many cases that made it easier to identify PLC stages visually.

Drawbacks: There weren’t enough data points in several categories to fit a ‘bell-shaped’ curve, leading to issues in the identification of PLC stages.

The plot above shows the variation of Total Sales (y-axis) vs Age (x-axis). The data points are colored based on the PLC life stage identified:  Green for “Growth Stage”, Blue for “Maturity Stage” and Red for “Decline Stage”.

Method 2:

Identification of PLC stages by analyzing the variation in cumulative sell through rates with age of a product (Logarithmic model).

Steps followed:

  • Calculated cumulative sell through rate across age at a category level.
  • A curve between age and cumulative sell through rate was fitted using a log linear model.
  • The cut-offs for the different phases of the product life cycle are obtained using the inflexion points of the fitted curve.


Drawbacks:

  1. Visual inspection of the fitted curve does not reveal any PLC stages.
  2. This method could not capture the trend as accurately as the log-growth models.

The plot above shows the variation of Cumulative sell through rate (y-axis) vs Age (x-axis). The data points are colored based on the PLC life stage identified:  Green for “Growth Stage”, Blue for “Maturity Stage” and Red for “Decline Stage”.

Application of PLC stages in Demand Prediction

After identifying the different PLC phases for each category, this information can be used directly to determine when promotions need to be provided to sustain product sales. It can also be incorporated into a model as an independent categorical variable to understand the impact of the different PLC phases on predicting demand.

In the context of this paper, we used the PLC phases identified as a categorical variable in the price elasticity model to understand the effect of each phase separately. The process was as follows:

The final sales prediction model had data aggregated at a cluster and sales week level. PLC phase information was added to the sales forecasting model by classifying each week in the cluster-week data into “Growth”, “Maturity” or “Decline”, based on the average age of the products in that cluster and week.

This PLC classification variable was treated as a factor variable so that we can obtain coefficients for each PLC stage.

The modeling equation obtained was:

In the above equation, “PLC_Phase” represents the PLC classification variable. The output of the regression exercise gave beta coefficients for the PLC stages “Growth” and “Maturity” with respect to “Decline”.

The “Growth” and “Maturity” coefficients were then treated such that they were always positive. This was because the “Growth” and “Maturity” coefficients were obtained with respect to “Decline”, and since “Decline” had a factor of 1, the other two had to be greater than 1.

The treated coefficients obtained for each cluster were used in the simulation tool in the following manner (more details given in tool documentation):

  • If there is a transition from “Growth” to “Maturity” stages in a product’s life cycle – then the PLC factor multiplied to sales is (“Maturity” coefficient / “Growth” coefficient).
  • If there is a transition from “Maturity” to “Decline” stages in a product’s life cycle – then the PLC factor multiplied to sales is (“Decline” coefficient / “Maturity” coefficient).
  • If there is no transition of stages in a product’s life cycle, then PLC factor is 1.
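The three rules above amount to a ratio of stage coefficients. A minimal sketch follows, assuming hypothetical coefficient values for one cluster; the function name and the rejection of any transition not listed above are choices made for this illustration, not details taken from the tool documentation.

```python
# Hypothetical treated coefficients for one cluster; "Decline" is the base level.
coefficients = {"Growth": 1.8, "Maturity": 1.4, "Decline": 1.0}

# Only the two transitions described in the text are supported in this sketch.
ALLOWED_TRANSITIONS = {("Growth", "Maturity"), ("Maturity", "Decline")}

def plc_factor(previous_stage, current_stage, coefficients):
    """Return the multiplier applied to sales when a product changes PLC stage."""
    if previous_stage == current_stage:
        return 1.0  # no transition, so sales are left unchanged
    if (previous_stage, current_stage) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"Unsupported transition: {previous_stage} -> {current_stage}")
    return coefficients[current_stage] / coefficients[previous_stage]

growth_to_maturity = plc_factor("Growth", "Maturity", coefficients)
maturity_to_decline = plc_factor("Maturity", "Decline", coefficients)
no_change = plc_factor("Maturity", "Maturity", coefficients)
```

Because every coefficient is at least 1 and falls from “Growth” to “Decline”, both transition factors are below 1, so predicted sales are scaled down as a product ages.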


The method described in this paper enables identification of PLC stages for the apparel industry and demand prediction for old and new products. This is a generalized method and can be used for different industries as well, where a product may exhibit 4 or 5 stages of life cycle.

One of the drawbacks of product life cycle is that it is not always a reliable indicator of the true lifespan of a product and adhering to the concept may lead to mistakes. For example, a dip in sales during the growth stage can be temporary instead of a sign the product is reaching maturity. If the dip causes a company to reduce its promotional efforts too quickly, it may limit the product’s growth and prevent it from becoming a true success.

Also, if many promotional activities or discounts are applied, it is difficult to identify the true life cycle.



Leveraging Advanced Analytics for Competitive Advantage Across FMCG Value Chain


According to World Bank, FMCG (Fast Moving Consumer Goods) market in India is expected to grow at a CAGR of 20.6% and is expected to reach US$ 103.7 billion by 2020 from US$ 49 billion in 2016. Some of the key changes that are fueling this growth are:

  • Industry Expansion – ITC Ltd has forayed into the frozen foods market with plans to launch frozen vegetables and fruits, aiming for US$ 15.5 billion in revenues by 2030. Similarly, Patanjali Ayurveda is targeting 10x growth by 2020, riding on ‘ethnic’ recipes and winning consumer share of wallet.
  • Rural and semi-urban segments are growing at a rapid pace with FMCG accounting for 50% of total rural spending. There is an increasing demand for branded products in rural India. Rural FMCG market in India is expected to grow at a CAGR of 14.6%, and reach US$ 220 billion by 2025 from US$ 29.4 billion in 2016.
  • Logistics sector will see operational efficiencies with GST reforms. Historically, firms had installed hubs and transit points in multiple states to evade state value added tax (VAT). This is because the hub-to-hub transfer is treated as a stock transfer, and does not attract VAT. Firms can now focus on centralized hub operations, thus gaining efficiencies.
  • The rising share of the organized market in FMCG sector, coupled with the slow adoption of GST by wholesalers has led many FMCGs to explore alternative distribution channels such as direct distribution and cash-and-carry. Dabur, Marico, Britannia, and Godrej have already started making structural shifts in this direction.
  • Many leading FMCGs have started selling their brands through online grocery portals such as Grofers, Big Basket, and AaramShop. The trend is expected to grow with the push towards a cashless economy and evolving payment mechanisms.
  • Traditional advertising mediums have seen a dip with the advent of YouTube, Netflix, and Hotstar. The digital medium is being used more and more for branding and customer connect.

On top of this, barriers to new entrants in FMCG sector are eroding, owing to a wider consciousness of consumer needs, availability of finance and product innovations. This has raised the level of competition in the industry and generated a need to rethink the consumer offer, route to market, digital consumer engagement and premiumization.

In the above context, a few buzzwords are circulating in the FMCG corridors, such as Analytics, Big Data, Cloud, Predictive and Artificial Intelligence (AI). They are being discussed in the light of preparing for the future: improved processes, innovations and transformations. Some disruptive use cases are:

  • With the growing focus on direct distribution, AI becomes all the more important in helping sales personnel offer the right trade promotions on the go.
  • With the rising organized sector in the urban segment, machine learning can improve the effectiveness of the Go to Marketing strategy by allowing customized shelf planning for various kinds of retailers.
  • AI can help recognize customer perceptions based on market research interviews, make predictions about their likes/dislikes, and design new targeted product offerings. For instance, a leading FMCG brand uses AI to recognize micro facial expressions in focus group research for a fragrance to predict whether the customer liked the product or not. On the same lines, Knorr is using AI to recommend recipes to consumers based on their favorite ingredients. Consumers can share this information via SMS with Knorr.
  • AI-enabled vending machines can help personalize the consumer experience. Coca Cola has come up with an AI-powered app for vending machines. The app will personalize the ordering experience for the user and allow ordering of multiple drinks ahead of time. It will also customize in-app offers to keep people coming back to the vending machine.
  • With increasing adoption of the digital medium, Internet of Things (IoT) enabled smart manufacturing is disrupting the manufacturing function. An IoT framework allows sensing of data from machine logs, controllers, sensors, equipment, etc. in real time. This data can be used to boost product quality compliance monitoring, predictive maintenance and scheduling.

Let’s go through some of the analytics use cases across FMCG functions that promise immediate value.

Function Wise Analytics Use Cases in FMCG

Go To Marketing

Go to Marketing plays a very important function in FMCG value chain by enabling products to reach the market. FMCG distribution models range from direct store delivery to retailer warehousing to third party distributor networks. Further complications arise due to the structure of the Indian market – core market vs. organized retail. Analytics can help optimize the GTM processes in multiple ways:

  • Network planning can help in minimizing logistics cost by optimizing fleet routes, number of retail outlets touched, order of contact and product mix on trucks in sync with each retailer’s demand.
  • Inventory orders can be optimized to reduce inventory pile-ups for slow moving products, and stock outs for faster moving products. SKU level demand forecasting followed by safety stock scenario simulations can help in capturing the impact of demand variability and lead time variability on stock outs.
  • Assortment intelligence promises a win-win situation for both FMCGs and retailers. Retailers increase margins by localizing assortments to local demand, while FMCGs ensure a fluid movement of the right products in the right markets.
  • Smart Visi cooler allocations can help in increasing brand visibility and performance. Visi coolers come in different shapes and sizes. Traditionally, sales personnel decide which kind of visi cooler to give to which retailer based on gut-based judgment. Machine learning can be used to learn from retailer demand, performance and visibility data to make an optimal recommendation, thus improving brand visibility and performance.
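To make the safety stock scenario simulation mentioned in the inventory bullet concrete, here is a small sketch. It assumes normally distributed daily demand and lead time, uses the common textbook safety stock formula z·sqrt(L·σd² + d²·σL²), and then checks the resulting stock-out rate by simulation. All parameter values and function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def safety_stock(service_level, mean_demand, sd_demand, mean_lead_time, sd_lead_time):
    """Textbook safety stock covering both demand and lead time variability."""
    z = NormalDist().inv_cdf(service_level)
    return z * math.sqrt(mean_lead_time * sd_demand ** 2
                         + mean_demand ** 2 * sd_lead_time ** 2)

def stockout_rate(safety_units, mean_demand, sd_demand, mean_lead_time,
                  sd_lead_time, runs=20000, seed=7):
    """Share of simulated replenishment cycles in which demand exceeds stock."""
    rng = random.Random(seed)
    reorder_point = mean_demand * mean_lead_time + safety_units
    stockouts = 0
    for _ in range(runs):
        # Draw a lead time in whole days, then daily demand for each of those days.
        lead_time = max(1, round(rng.gauss(mean_lead_time, sd_lead_time)))
        demand = sum(max(0.0, rng.gauss(mean_demand, sd_demand))
                     for _ in range(lead_time))
        if demand > reorder_point:
            stockouts += 1
    return stockouts / runs

# Hypothetical SKU: 100 units/day (sd 20), 7-day lead time (sd 2 days).
sku = dict(mean_demand=100, sd_demand=20, mean_lead_time=7, sd_lead_time=2)
required = safety_stock(0.95, **sku)
rate_without_buffer = stockout_rate(0, **sku)
rate_with_buffer = stockout_rate(required, **sku)
```

Re-running the simulation with different service levels or variability assumptions gives the kind of scenario comparison the bullet describes.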

Supply Chain & Operations

Analytics has deeply permeated the supply chain. IoT is widely seen as the technology framework that will drive major disruption, with pooled data sources such as telemetry (fleet, machines, mobile), inventory and other supply chain process data. A couple of key applications are:

  • Vendor selection using risk scores based on contract, responsiveness, pricing, quantity and quality KPIs. Traditionally, this was done based on qualitative assessment. Now, vendor risk modeling can be used to predict vendor risk scores, and high-performance, low-risk vendors can be selected from the contenders.
  • ‘Smart’ warehousing with IoT sensing frameworks. Traditionally, warehouses have functioned only as a facility to store inventory. IoT and AI have transformed them into a ‘Smart’ efficiency booster hub. For instance, AI can be used to automatically place incoming batches on the right shelves so that picking them up for distribution consumes fewer resources and hence costs less.


Sales

Since the FMCG sales structure is very personnel-oriented, use cases such as incentive compensation, sales force sizing, territory alignment and trade promotion decisions continue to be very relevant.

  • Sales force sizing can be improved through a data-driven segmentation of retailer base followed by algorithmic estimation of sales effort required in a territory.
  • Trade promotion recommendations can be automated and personalized for a retailer based on historical performance and context. This will allow sales personnel to meet the retailer, key in some KPIs and recommend a personalized trade promotion in real-time.


Marketing

Analytics has always been a cornerstone for enabling marketing decisions. It can help improve the accuracy and speed of these decisions.

  • Market mix modeling can be improved by simulating omnichannel spend attribution scenarios, thus optimizing overall marketing budget allocation and ROI.
  • Brand performance monitoring can be made intuitive by using rich web based visualizations that promise multi-platform consumption and quick decision making.
  • Sentiment analysis can help monitor the voice of customers on social media. Web based tools can provide a real-time platform to answer business queries such as what is important to customers, concerns/highlights, response to new launches & promotions etc.


Manufacturing

With the advent of IoT, AI and Big Data systems, use cases such as predictive maintenance have become more feasible. Traditionally, the manufacturer had to wait for a failure scenario to occur a few times, learn from it, and then predict the re-occurrence of that scenario. Companies are now focusing on sensing failures before they happen so that the threat of new failures can be minimized. This can result in immense cost savings through continued operations and quality control. The benefits can be further extended to:

  • Improved production forecasting systems using POS data (enabled by retail data sharing), sophisticated machine learning algorithms and external data sources such as weather, macroeconomics that can be either scraped or bought from third party data vendors.
  • Product design improvements using attribute value modeling. The idea is to algorithmically learn which product attributes are most valued by consumers, and use the insights to design better products in the future.

Promotions and Revenue Management

Consumer promotions are central for gaining short term sales lifts and influencing consumer perceptions. Analytics can help in designing and monitoring these promotions. Also, regional and national events can be monitored to calculate the promotional lifts, which could be used for designing better future promotional strategies.

  • Automated consumer promotion recommendations based on product price elasticity, consumer feedback from market research data, cannibalization scenarios and response to historical promotions.

A key step in adopting and institutionalizing analytics use cases is to assess where you are in the analytics maturity spectrum.

Analytics Maturity Assessment

A thorough analytics maturity assessment can help companies understand their analytics positioning in the industry and gain competitive advantages by enhancing analytical capabilities. Here are a few high-level parameters to assess analytics maturity:

Now that we understand where we are in the journey, let’s look at “How do we get there?”

Levers of Change

FMCGs need to adopt a multi-dimensional approach with respect to adapting to the changing trends. The dimensions could be:

  1. Thought Leadership: Companies need to invest considerable effort in developing a research and innovation ecosystem. We are talking about leapfrogging the traditional process improvement focus and getting on the innovation bandwagon. This requires hiring futurists, building think tanks inside the company, and creating an “Edison mindset” (a progressive trial-error-learning mindset).
  2. Technology: Traditionally, companies have preferred the second-mover route when it comes to adoption of a newer technology. The rationale is that of risk avoidance and surety. With analytics enablement technologies such as Big Data and cloud, this rationale falls to pieces, primarily because you are not the second mover but probably a double or triple digit mover, owing to widespread adoption across industries. Analytics enablement technologies have become a necessity for organizations.
  3. Learning from Others: Human beings are unique in having the ability to learn from the experience of others. This ability helps them not only correct their errors but also find new possibilities. Similarly, can FMCG learn from modern fast fashion retailers and revolutionize the speed to market? Can it learn from telecom and hyper-personalized offerings? Can it learn from banking and touch the consumer in multiple ways?

Harmonizing the above levers along with relevant FMCG contextualization can lead to the desired transformation.

New Product Forecasting Using Deep Learning – A Unique Way


Forecasting demand for new product launches has been a major challenge for industries, and the cost of error has been high. Underpredict demand and you lose potential sales; overpredict it and there is excess inventory to take care of. Multiple studies suggest that new products contribute one-third of an organization’s sales across industries. Industries like apparel retail or gaming thrive on new launches and innovation, and this number can easily rise to as high as 70%. Hence, the accuracy of demand forecasts has been a top priority for marketers and inventory planning teams.

There are a whole lot of analytics techniques adopted by analysts and decision scientists to better forecast potential demand, the popular ones being:

  • Market Test Methods – Delphi/Survey based exercise
  • Diffusion modeling
  • Conjoint & Regression based look alike models

While Market Test Methods are still popular, they need a lot of domain expertise and cost-intensive processes to drive the desired results. In recent times, techniques like Conjoint and Regression based methods have been leveraged more frequently by marketers and data scientists. A typical demand forecasting process for the same is highlighted below:

Though the process brings an analytical approach to quantifying cross synergies between business drivers and is scalable enough to generate dynamic business scenarios on the go, it falls short of expectations on the following two aspects:

  • It includes a heuristic exercise of identifying analogous products by manually defining product similarity. Besides, the robustness of this exercise is influenced by domain expertise. The manual process, coupled with its subjectivity, might lead to questionable accuracy standards.
  • It is still a supervised model and the key demand drivers need manual tuning to generate better forecasting accuracy.

For retailers and manufacturers, especially in apparel, food, etc., where the rate of innovation is high and assortments keep refreshing from season to season, a heuristic method would lead to high cost and error in any demand forecasting exercise.

With the advent of Deep Learning’s image processing capabilities, the heuristic method of identifying feature similarity can be automated with a high degree of accuracy through techniques like the Convolutional Neural Network (CNN). It also minimizes the need for domain expertise, as it self-learns feature similarity without much supervision. Since the primary reason for including product features in a demand forecasting model is to understand the cognitive influence on customer purchase behavior, a deep learning based approach can capture the same with much higher accuracy. Besides, techniques like the Recurrent Neural Network (RNN) can be employed to make the models better at adaptive learning, making the system self-reliant with negligible manual intervention.


In practice, CNN and RNN are two distinct methodologies and this article highlights a case where various Deep Learning models were combined to develop a self-learning demand forecasting framework.

Case Background

An apparel retailer wanted to forecast demand for its newly launched “Footwear” styles across various lifecycle stages. The current forecasting engine implemented various supervised techniques, which were ensembled to generate the demand forecast. It had the following major shortcomings:

  • The analogous product selection mechanism was heuristic and led to low accuracy levels in downstream processes.
  • The heuristic exercise was a significant road block in evolving the current process to a scalable architecture, making the overall experience a cost intensive one.
  • The engine was not able to replicate the product life cycle accurately.

Proposed Solution

We proposed to tackle the problem through an intelligent, automated and scalable framework:

  • Leverage Convolutional Neural Networks (CNN) to facilitate the process of identifying the analogous product. CNN techniques have been proven to generate high accuracies in image matching problems.
  • Leverage Recurrent Neural Networks (RNN) to better replicate product lifecycle stages. Since RNN memory layers are better predictors of the next likely event, it is an apt tool to evaluate upcoming time-based performances.
  • Since the objective was to devise a scalable method, a cloud-ready, easy-to-use UI was proposed, where the user can upload the image of an upcoming style and the demand forecasts are generated instantly.

Overall Approach

The entire framework was developed in Python using Deep Learning platforms like TensorFlow, with an interactive user interface powered by Django. The Deep Learning systems were supported by NVIDIA GPUs hosted on Google Cloud.

The demand prediction framework consists of the following components to ensure an end-to-end analytical implementation and consumption.

1. Product Similarity Engine

An image classification algorithm was developed by leveraging Deep Learning techniques like Convolutional Neural Networks. The process included:

Data Collation

  • Developed an Image bank consisting of multi-style shoes across all categories/sub-categories e.g. sports, fashion, formals etc.
  • Included multiple alignments of the shoe images.

Data Cleaning and Standardization

  • Removed duplicate images.
  • Standardized the image to a desired format and size.

Define High-Level Features

  • A few key features were defined, such as brand, sub-category and shoe design (color, heel, etc.).

Image Matching Outcomes

  • Implemented a CNN model with 5+ hidden layers.

The following image is an illustrative representation of the CNN architecture implemented:

  • Input Image: holds raw pixel values of the image with features being width, height & RGB values.
  • Convolution: extracts features from the input data. The matrix formed by sliding filters over the image and computing dot products is called a “Feature Map”.
  • Non-Linearity – RELU: this layer applies an element-wise activation function that introduces non-linear relationships, as in a standard ANN.
  • Pooling: reduces the dimensionality of each feature map while retaining important information. Helps in arriving at a scale invariant representation of an image.
  • Dropout: random connections are severed to prevent overfitting.
  • SoftMax Layer: Output layer that classifies the image to appropriate category/subcategory/heel height classes.
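The layer types listed above can be illustrated with a toy forward pass in plain Python. This is only a sketch of the operations, not the network built for the case: the 5×5 image, the vertical-edge filter and the class scores passed to the softmax are all hypothetical, dropout is left out because it only acts during training, and a real model would use a framework such as TensorFlow with learned filters.

```python
import math

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) to build a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Element-wise non-linearity: negative responses are set to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the strongest response in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical 5x5 grayscale image with a dark left side and a bright right side.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = convolve2d(image, vertical_edge)
activated = relu(feature_map)
pooled = max_pool(activated)
# Hypothetical class scores; a real network derives them from a dense layer.
class_probabilities = softmax([pooled[0][0], 0.0, -1.0])
```

The filter responds strongly wherever the image goes from dark to bright, and pooling keeps that response while shrinking the map, which is the sense in which pooling retains the important information.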

Identified the Top N matching shoes and calculated their probability scores. Classified image orientation as top or side (right/left) alignment of the same image.

Similarity Index: calculated based on the normalized overall probability scores.
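One way to read the Top N and Similarity Index steps is sketched below. The exact normalization used in the case is not stated, so dividing each score by the sum of the Top N scores is an assumption, and the style names and probability values are made up for the example.

```python
def top_n_similar(probability_scores, n=3):
    """Rank catalogue styles by match probability and normalize the top n scores."""
    ranked = sorted(probability_scores.items(),
                    key=lambda item: item[1], reverse=True)[:n]
    total = sum(score for _, score in ranked)
    # Assumed similarity index: each score as a share of the top-n total.
    return [(style, score / total) for style, score in ranked]

# Hypothetical CNN output for one newly uploaded shoe image.
probability_scores = {
    "style_A": 0.62,
    "style_B": 0.21,
    "style_C": 0.09,
    "style_D": 0.05,
    "style_E": 0.03,
}
matches = top_n_similar(probability_scores, n=3)
```

The normalized values can then serve as weights when the sales histories of the analogous styles are combined.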

Analogous Product: Attribute Similarity Snapshot (Sample Attributes Highlighted)

2. Forecasting Engine

A demand forecasting engine was developed on the available data by evaluating various factors like:

  • Promotions – Discounts, Markdown
  • Pricing changes
  • Seasonality – Holiday sales
  • Average customer rating
  • Product Attributes – This was sourced from the CNN exercise highlighted in the previous step
  • Product Lifecycle – High sales in initial weeks followed by declining trend

The following image is an illustrative representation of the demand forecasting model based on RNN architecture.

The RNN was implemented using the Keras Sequential model, with mean squared error as the loss function.

Demand Forecast Outcome

The accuracy from the proposed Deep Learning framework was in the range of 85-90% which was an improvement on the existing methodology of 60-65%.

Web UI for Analytical Consumption

An illustrative snapshot is highlighted below:

Benefits and Impact

  • Higher accuracy through better learning of the product lifecycle.
  • The overall process is self-learning and hence can be scaled quickly.
  • Automation of decision intensive processes like analogous product selection led to reduction in execution time.
  • Long-term cost benefits are higher.

Key Challenges & Opportunities

  • The image matching process requires huge data to train.
  • The feature selection method can be automated through unsupervised techniques like Deep Auto Encoders, which will further improve scalability.
  • Managing image data is a cost intensive process but it can be rationalized over time.
  • The process accuracies can be improved by creating a deeper architecture of the network and an additional one-time investment of GPU configurations.

Bayesian Theorem: Breaking it to Simple Using PyMC3 Modelling


This article on Bayesian analysis with Python introduces some basic concepts of Bayesian inference along with practical implementations in Python using PyMC3, a state-of-the-art open-source probabilistic programming framework for exploratory analysis of Bayesian models.

The main concepts of Bayesian statistics are covered using a practical and computational approach. The article covers the main concepts of the Bayesian and Frequentist approaches, the Naive Bayes algorithm and its assumptions, the challenge of computational intractability in high dimensional data, approximation and sampling techniques to overcome this challenge, etc. The results of a Bayesian Linear Regression are interpreted and discussed to illustrate the concepts.


Frequentist vs. Bayesian approaches for inferential statistics are interesting viewpoints worth exploring. Given the task at hand, it is always better to understand the applicability, advantages, and limitations of the available approaches.

In this article, we will be focusing on explaining the idea of Bayesian modeling and its difference from the frequentist counterpart. To make the discussion a little bit more intriguing and informative, these concepts are explained with a Bayesian Linear Regression (BLR) model and a Frequentist Linear Regression (LR) model.

Bayesian and Frequentist Approaches

The Bayesian Approach:

The Bayesian approach is based on the idea that, given the data and a probabilistic model (which we assume can model the data well), we can find the posterior distribution of the model’s parameters. For example:

In the Bayesian Linear Regression approach, not only the dependent variable y but also the parameters (β) are assumed to be drawn from a probability distribution, such as a Gaussian distribution with mean = βᵀX and variance = σ²I (refer equation 1). The output of BLR is a distribution, which can be used for inferring new data points.
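Written out in LaTeX, the Gaussian model described in the sentence above, which is presumably what equation 1 refers to, takes the standard form:

```latex
y \sim \mathcal{N}\left(\beta^{T} X,\; \sigma^{2} I\right)
```

Here the mean of y is the linear predictor βᵀX and σ²I is the noise covariance; in the Bayesian setting β is itself given a prior distribution rather than a single fixed value.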

The Frequentist Approach, on the other hand, is based on the idea that given the data, the model and the model parameters, we can use this model to infer new data. This is commonly known as the Linear Regression Approach. In the LR approach, the dependent variable (y) is a linear combination of the weights times the independent variables (x), plus an error term e due to random noise.

Ordinary Least Squares (OLS) is the method of estimating the unknown parameters of the LR model. In the OLS method, the parameters which minimize the sum of squared errors on the training data are chosen. The output of OLS is a “single point” estimate of the best model parameters.
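For a single independent variable, the OLS estimates have a closed form, sketched below in plain Python. The data points are invented for illustration and the function name is hypothetical.

```python
def fit_ols(xs, ys):
    """Closed-form OLS for one feature: minimizes the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_ols(xs, ys)
prediction = intercept + slope * 6  # a single predicted value for x = 6
```

The fit returns exactly one intercept and one slope, with no measure of how uncertain they are, which is the contrast with the Bayesian approach that the rest of the article builds on.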

With such point estimates, we can predict only one value of y for a given input, so the prediction is itself a point estimate. Let’s now get started with the Naive Bayes algorithm, which is the backbone of Bayesian machine learning algorithms.

Naive Bayes Algorithm for Classification

Discussions on Bayesian Machine Learning models require a thorough understanding of probability concepts and Bayes’ theorem. So, now we discuss Bayes’ theorem. Bayes’ theorem finds the probability of an event occurring, given the probability of an already occurred event. Suppose we have a dataset with 7 features/attributes/independent variables (x1, x2, x3,…, x7); we call this data tuple X. Assume H is the hypothesis that the tuple belongs to class C. In Bayesian terminology, X is known as the evidence, and y is the dependent variable/response variable (i.e., the class in a classification problem). Then, mathematically, Bayes’ theorem is stated as:
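For reference, the standard form of Bayes’ theorem in the symbols used here (assumed to correspond to the equation 3 cited later) is:

```latex
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
```

The four terms are: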


  1. P(H|X) is the probability that the hypothesis H holds, given that we know the ‘evidence’ or attribute description of X. P(H|X) is the probability of H conditioned on X, a.k.a. the Posterior Probability.
  2. P(X|H) is the probability of X conditioned on H, and is also known as the ‘Likelihood’.
  3. P(H) is the prior probability of H. This is the fraction of occurrences for each class out of total number of samples.
  4. P(X) is the prior probability of evidence (data tuple X), described by measurements made on a set of attributes (x1, x2, x3,…, x7).

As we can see, the posterior probability of H conditioned on X is directly proportional to likelihood times prior probability of class and is inversely proportional to the ‘Evidence’.
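A small worked example makes the proportionality concrete. The two classes, their priors and their likelihoods below are hypothetical numbers, and the evidence is obtained by summing likelihood times prior over the classes.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem to every class: likelihood * prior / evidence."""
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}

# Hypothetical two-class problem for one observed data tuple X.
priors = {"high_sales": 0.3, "low_sales": 0.7}        # P(H)
likelihoods = {"high_sales": 0.8, "low_sales": 0.2}   # P(X|H)

posteriors = posterior(priors, likelihoods)            # P(H|X)
```

Even though “low_sales” is the more common class a priori, the much higher likelihood of the observed tuple under “high_sales” makes it the more probable class after seeing X.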

Bayesian approach for a regression problem: assumptions of Naive Bayes, given a sales prediction problem with 7 independent variables.

i) Each pair of features in the dataset is independent of each other. For example, feature x1 has no effect on x2, and x2 has no effect on feature x7.
ii) Each feature makes an equal contribution towards the dependent variable.

Finding the posterior distribution of model parameters is computationally intractable for continuous variables, so we use Markov Chain Monte Carlo and Variational Inferencing methods to overcome this issue.

From Bayes’ theorem (equation 3), the posterior calculation needs a prior, a likelihood and the evidence. The prior and likelihood are calculated easily, as they are defined by the assumed model. As P(X) doesn’t depend on H, and the values of the features are given, the denominator is constant. So, P(X) is just a normalization constant, and we need to maximize the value of the numerator in equation 3. However, the evidence (probability of the data) is calculated as:
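In its standard form, the evidence is the likelihood averaged over the prior. In the notation used above, with the integral running over all possible hypotheses or parameter values H (presumably equation 4), this reads:

```latex
P(X) = \int P(X \mid H)\, P(H)\, dH
```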

Calculating the integral is computationally intractable with high dimensional data. In order to build faster and scalable systems, we require sampling or approximation techniques to calculate the posterior distribution of the parameters given the observed data. In this section, two important methods for approximating intractable computations are discussed: a sampling-based approach, Markov chain Monte Carlo (MCMC) sampling, and an approximation-based approach known as Variational Inferencing (VI). A brief introduction to these techniques is given below:

  • MCMC – We use sampling techniques like MCMC to draw samples from the posterior distribution, and then approximate the posterior from those samples. Refer to George’s blog [1] for more details on MCMC initialization, sampling and trace diagnostics.
  • VI – The Variational Inferencing method tries to find the best approximation of the posterior distribution from a parametric family. It uses an optimization process over the family’s parameters to find the best approximation. In PyMC3, we can use Automatic Differentiation Variational Inference (ADVI), which tries to minimize the Kullback–Leibler (KL) divergence between the approximating distribution from the chosen parametric family and the true posterior.
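To show what sampling from a posterior looks like in practice, here is a minimal random-walk Metropolis sampler, one of the simplest MCMC algorithms. It is a teaching sketch rather than what PyMC3 runs internally (PyMC3 defaults to the more sophisticated NUTS sampler). The data, prior and step size are hypothetical, and the model, a Gaussian mean with known noise, was chosen because its exact posterior is available for comparison.

```python
import math
import random

def log_posterior(mu, data, prior_mean=0.0, prior_sd=10.0, noise_sd=1.0):
    """Unnormalized log posterior: log prior + log likelihood (evidence not needed)."""
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    log_likelihood = sum(-0.5 * ((y - mu) / noise_sd) ** 2 for y in data)
    return log_prior + log_likelihood

def metropolis(data, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis: propose a move, accept with probability min(1, ratio)."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples

# Hypothetical data: 50 noisy observations around a true mean of 2.0.
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(50)]

samples = metropolis(data)
kept = samples[1000:]                      # discard burn-in
estimated_mean = sum(kept) / len(kept)

# Exact posterior mean for this conjugate model, used only as a check.
analytic_mean = sum(data) / (1 / 10.0 ** 2 + len(data))
```

Note that the sampler only ever uses the ratio of posterior values, so the intractable evidence term cancels out; that is why sampling sidesteps the integral discussed above.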

Prior Selection: Where is the prior in data, from where do I get one?

Bayesian modelling offers a way to include prior information in the modelling process. If we have domain knowledge or an intelligent guess about the weight values of the independent variables, we can make use of this prior information. This is unlike the frequentist approach, which assumes that the weight values of the independent variables come from the data itself. According to Bayes’ theorem:

Now that the methods for finding the posterior distribution of model parameters have been discussed, the next obvious question based on equation 5 is how to find a good prior. Refer to [2] to understand how to select a good prior for the problem statement. Broadly speaking, the information contained in the prior has a direct impact on the posterior calculations. If we have a more “revealing prior” (a.k.a. a strong belief about the parameters), we need more data to “alter” this belief; otherwise, the posterior is mostly driven by the prior. Conversely, if we have a “vague prior” (a.k.a. no information about the distribution of the parameters), the posterior is driven largely by the data. It also means that if we have a lot of data, the likelihood will wash away the prior assumptions [3]. In BLR, the prior knowledge modelled by a probability distribution is updated with every new sample (which is modelled by some other probability distribution).
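The effect of prior strength can be seen in the simplest conjugate case, a Gaussian mean with known noise, where the posterior mean is a precision-weighted average of the prior mean and the data. The sketch below uses hypothetical observations and priors; the function name is invented for the example.

```python
def posterior_mean_sd(prior_mean, prior_sd, data, noise_sd=1.0):
    """Normal-Normal update: precision-weighted blend of prior and data."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(data) / noise_sd ** 2
    posterior_precision = prior_precision + data_precision
    posterior_mean = (prior_precision * prior_mean
                      + sum(data) / noise_sd ** 2) / posterior_precision
    return posterior_mean, posterior_precision ** -0.5

# Hypothetical observations with a sample mean of 2.0.
few_points = [1.8, 2.2, 2.1, 1.9, 2.0]
many_points = few_points * 200             # same mean, 1,000 observations

strong_few, _ = posterior_mean_sd(0.0, 0.1, few_points)     # revealing prior at 0
vague_few, _ = posterior_mean_sd(0.0, 10.0, few_points)     # vague prior at 0
strong_many, _ = posterior_mean_sd(0.0, 0.1, many_points)   # revealing prior, more data
```

With five observations the revealing prior keeps the estimate near 0 while the vague prior lets it sit near the sample mean of 2.0; with a thousand observations even the revealing prior is pulled most of the way towards the data.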

Modelling Using PyMC3 Library for Bayesian Inferencing

The following code snippets (borrowed from [4]) show Bayesian linear model initialization using the PyMC3 Python package. A PyMC3 model is initialized using the “with pm.Model()” statement. The variables are assumed to follow a Gaussian distribution, and Generalized Linear Models (GLMs) are used for modelling. For an in-depth understanding of the PyMC3 library, I recommend Davidson-Pilon’s book [5] on Bayesian methods.

Fig. 1 Traceplot: the posterior distributions of the model parameters are shown on the left-hand side, and the progression of the samples drawn in the trace for each variable is shown on the right-hand side.

We can use “Traceplot” to show the posterior distribution of the model parameters, as seen on the left-hand side of Fig. 1. The samples drawn in the trace for the independent variables and the intercept over 1,000 iterations are shown on the right-hand side of Fig. 1. The two colours, orange and blue, represent the two Markov chains.

After convergence, we get the coefficient of each feature, which measures how effective that feature is in explaining the dependent variable. The values shown in red are the maximum a posteriori (MAP) estimates, i.e. the peak of each posterior distribution, which coincides with the mean when the posterior is symmetric. The sales can be predicted using the formula:

As it is a Bayesian approach, the model parameters are distributions. The following plots show the posterior distributions as histograms. Here the variables are shown with their 94% HPD (Highest Posterior Density) intervals. In Bayesian statistics, the HPD is a credible interval: it tells us that, given the data and the model, the parameter of interest falls in the given interval with 94% probability (for variable x6, the interval is -0.023 to 0.36).
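As a rough sketch of what an HPD interval is, the function below finds the narrowest window that contains 94% of a set of posterior draws. The draws here are simulated from a Gaussian whose centre and spread were picked only to resemble the x6 example; they are not the model's actual trace.

```python
import random

def hpd_interval(samples, mass=0.94):
    """Return the narrowest interval that contains `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, int(round(mass * n)))   # number of draws inside the interval
    best_lo, best_hi = ordered[0], ordered[window - 1]
    for start in range(1, n - window + 1):
        lo, hi = ordered[start], ordered[start + window - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi

# Simulated stand-in for a posterior trace (assumed mean 0.17, sd 0.1).
rng = random.Random(1)
draws = [rng.gauss(0.17, 0.1) for _ in range(4000)]
low, high = hpd_interval(draws)
```

A narrower interval for the same 94% mass means less uncertainty about the parameter.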

We can see that the posteriors are spread out, which indicates that few data points were used for modelling: the range of values each independent variable can take is wide, so the uncertainty in the parameter values is high. For example, for variable x6, the values range from -0.023 to 0.36, and the mean is 0.17. As we add more data, the Bayesian model can shrink this range to a smaller interval, resulting in more precise values for the weight parameters.

Fig. 2 Plots showing the posterior distribution in the form of histogram.

When to use Linear Regression, BLR, MAP, etc.: Do we go Bayesian or Frequentist?

The equation for linear regression on the same dataset is obtained as:

Comparing the linear regression equation (eq. 7) with the Bayesian linear regression equation (eq. 6), there is only a slight change in the weight values. So, which approach should we take up, Bayesian or Frequentist, given that both yield approximately the same results?

When we have a prior belief about the distributions of the weight variables (without seeing the data) and want this information to be included in the modelling process, followed by automatic belief adaptation as we gather more data, Bayesian is the preferable approach. If we don’t want to include any prior belief or model adaptation, and are happy with the weight variables as point estimates, go for linear regression. Why are the results of both models approximately the same?

The maximum a posteriori (MAP) estimate for each variable is the peak value of that variable’s distribution (shown in Fig. 2), which is close to the point estimate for the same variable in the LR model. That is the theoretical explanation. For real-world problems, try using both approaches, as the performance can vary widely based on the number of data points and the data characteristics.
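A one-feature sketch shows why the two estimates agree when the prior is weak. It assumes a model y = w·x + noise with no intercept, Gaussian noise, and a zero-mean Gaussian prior on w; the data are invented for illustration.

```python
def ols_slope(xs, ys):
    """Least-squares slope for y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def map_slope(xs, ys, prior_sd, noise_sd=1.0):
    """MAP slope under a N(0, prior_sd^2) prior on w and Gaussian noise.
    The prior adds a shrinkage term to the denominator."""
    shrinkage = (noise_sd / prior_sd) ** 2
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + shrinkage)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w_ols = ols_slope(xs, ys)
w_map_vague = map_slope(xs, ys, prior_sd=100.0)   # nearly flat prior
w_map_strong = map_slope(xs, ys, prior_sd=0.1)    # strong belief that w is near 0
```

With the nearly flat prior the MAP slope matches the least-squares slope to several decimal places; the strong prior pulls the estimate towards zero.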


This blog is an attempt to discuss the concepts of Bayesian inferencing and its implementation using PyMC3. It started off with the decades-old Frequentist-Bayesian debate and moved on to the backbone of Bayesian modelling, which is Bayes’ theorem. With the foundations set, we discussed why posterior distributions of continuous variables are intractable to evaluate, along with the solutions, viz. MCMC sampling and VI. We then discussed the strong connection between the posterior, prior, and likelihood, taking into consideration the data available in hand. Next, Bayesian linear regression modelling using PyMC3 was discussed, along with the interpretation of the results and graphs. Lastly, we discussed why and when to use Bayesian linear regression.


The following are the resources to get started with Bayesian inferencing using PyMC3.





[5] Davidson-Pilon, Cameron. Bayesian methods for hackers: probabilistic programming and Bayesian inference. Addison-Wesley Professional, 2015

Explainable AI

Advances in AI have enabled us to solve many problems with technology working alongside us. The complexity of AI models is growing, and so is the need to understand them. A growing concern is regulating bias in AI, which can occur for several reasons. One of them is skewed or incomplete input data, which biases the trained model. So, it becomes increasingly essential to comprehend how an algorithm arrived at a result. Explainable AI is a set of tools and frameworks that help you understand and interpret the predictions made by your machine learning models.

“Explainable artificial intelligence (XAI) is a set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms. Explainable AI is used to describe an AI model, its expected impact, and potential biases.” -IBM

Feature Attributions

Feature attributions depict how much each feature in your model contributed to the prediction for each instance of test data. When you run explainable AI techniques, you get the predictions together with the feature attribution information.

Understanding the Feature Attribution Methods


LIME

LIME stands for “Local Interpretable Model-agnostic Explanations.” It explains any model by approximating it locally with an interpretable model. It first creates a sample dataset locally by perturbing the feature values of the original test instance. Then, a linear model is fitted to the perturbed dataset to understand the contribution of each feature. The weights of the fitted linear model are the LIME values of the features. LIME provides several explainers, depending on the model architecture and the type of input data.
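The perturb-then-fit idea can be sketched in a few lines of NumPy. This is not the `lime` package's implementation: the Gaussian perturbation, the RBF proximity kernel, and the toy `black_box` function are all assumptions made for illustration.

```python
import numpy as np

def lime_like_explain(predict_fn, instance, n_samples=500, scale=0.5,
                      kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = np.random.default_rng(seed)
    # 1. Build a local dataset by perturbing the instance.
    perturbed = instance + rng.normal(0.0, scale, size=(n_samples, instance.size))
    preds = predict_fn(perturbed)
    # 2. Weight each perturbed sample by its closeness to the instance.
    dist = np.linalg.norm(perturbed - instance, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Weighted least squares with an intercept column.
    design = np.hstack([np.ones((n_samples, 1)), perturbed])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], preds * sw, rcond=None)
    return coef[1:]   # per-feature local weights (intercept dropped)

def black_box(X):
    """Hypothetical model to explain; the third feature is irrelevant."""
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.0 * X[:, 2]

attributions = lime_like_explain(black_box, np.array([1.0, 2.0, 0.5]))
```

Because the toy black box is itself linear, the surrogate recovers its coefficients; for a non-linear model the weights describe only the neighbourhood of the explained instance.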

LIME provides a tabular explainer that explains predictions based on tabular data. We implemented LIME for classification on the Iris dataset and for regression on the Boston housing dataset. In the image below, we can see that petal length contributes positively to the Setosa class, whereas sepal length contributes negatively. Similarly, LSTAT and RM are the features that contribute most to the predicted Boston house prices.

Figure 1: (a) LIME explanation for logistic regression model trained on iris dataset and (b) LIME explanation for linear regression trained on Boston housing dataset.

For text data, the sample dataset is created by randomly perturbing (for example, removing) words from the original text instance. The relevance of each term to the prediction is then determined by fitting a linear model on the sample dataset.

In the following text classification example, LIME highlights the words that contribute positively and negatively towards classifying the text into the business class. The term “WorldCom” contributes the most positively.

Figure 2: LIME explanation for BERT model trained on BBC news dataset.

For image classification tasks, LIME finds the region of an image (a set of super-pixels) with the strongest association with a prediction label. It generates perturbations by turning some of the super-pixels in the image on or off. The model to be explained then predicts the target class on each perturbed image. Next, a linear model is trained on the dataset of perturbed samples and their responses, which provides the weight of each super-pixel.
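A toy version of the super-pixel procedure is sketched below. The 4×4 image, the fixed 2×2 block segmentation, and the `predict_fn` that only looks at the top-right block are all hypothetical, and the proximity weighting that LIME applies is left out to keep the sketch short.

```python
import numpy as np

def explain_superpixels(image, segments, predict_fn, n_samples=200, seed=0):
    """Switch super-pixels on/off at random and fit a linear model on the masks."""
    rng = np.random.default_rng(seed)
    n_segments = int(segments.max()) + 1
    masks = rng.integers(0, 2, size=(n_samples, n_segments))
    masks[0] = 1                                  # keep the unperturbed image
    preds = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = image * mask[segments]        # zero out switched-off segments
        preds[i] = predict_fn(perturbed)
    design = np.hstack([np.ones((n_samples, 1)), masks])
    coef, *_ = np.linalg.lstsq(design, preds, rcond=None)
    return coef[1:]                               # one weight per super-pixel

image = np.ones((4, 4))
segments = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [2, 2, 3, 3],
                     [2, 2, 3, 3]])

def predict_fn(img):
    """Hypothetical classifier score that depends only on the top-right block."""
    return img[:2, 2:].mean()

superpixel_weights = explain_superpixels(image, segments, predict_fn)
```

The largest weight lands on super-pixel 1, the only region the toy classifier uses.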

In the example below, LIME shows the area that has a strong association with the prediction of “Labrador”.

Figure 3: LIME explanation for image classification model on cats and dog dataset.


GradCAM

GradCAM stands for Gradient-weighted Class Activation Mapping. It uses the gradients of any target concept (say, “dog” in a classification network) flowing into the final convolutional layer. It works by evaluating the gradients of the predicted class’s score with respect to the convolutional layer’s feature maps, which are then pooled to obtain one weight per feature map. The weighted combination of the feature maps is passed through the ReLU activation function to keep only the positive activations, producing a coarse localization map that highlights the regions of the image that are critical for predicting the concept.
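The pooling and weighting steps can be written down directly once the feature maps and their gradients are available. The sketch below assumes both have already been extracted from a network as arrays of shape (K, H, W); the tiny arrays are invented to make the arithmetic easy to follow.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heat map from feature maps and class-score gradients, both (K, H, W)."""
    weights = gradients.mean(axis=(1, 2))                 # global-average-pool the gradients
    cam = np.tensordot(weights, feature_maps, axes=1)     # weighted sum of feature maps
    cam = np.maximum(cam, 0.0)                            # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                             # scale to [0, 1] for display
    return cam

# Two 2x2 feature maps: the first fires top-left, the second bottom-right.
feature_maps = np.array([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 0.0], [0.0, 1.0]]])
# The class score increases with the first map and decreases with the second.
gradients = np.array([np.full((2, 2), 2.0),
                      np.full((2, 2), -1.0)])

cam = grad_cam(feature_maps, gradients)
```

The top-left cell ends up with the highest value, while the bottom-right cell, whose feature map counts against the class, is clipped to zero by the ReLU.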

We implement GradCAM to understand a CNN model trained on a human activity recognition dataset. In the following image, the most positive contributing area to the predicted class is shown in red, whereas the area in blue has no positive contribution towards the predicted class.

Figure 4: GradCAM result for VGG16 model trained on human activity recognition dataset.

Deep Taylor

Deep Taylor decomposition is a method to decompose the classification output of a network into the contributions (relevance) of its input elements using Taylor’s theorem, which gives an approximation of a differentiable function. First, the relevance of the output neuron is decomposed onto the neurons that feed into it. Then, the relevance of these neurons is redistributed to their own inputs, and the redistribution process is repeated until the input variables are reached. Thus, we get the relevance of the input variables to the classification output.
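One redistribution step can be sketched with the z+ rule, a propagation rule commonly associated with Deep Taylor decomposition for layers with non-negative (ReLU) inputs. Whether the implementation behind the figures used this exact rule is an assumption, and the layer sizes and values below are invented.

```python
import numpy as np

def zplus_relevance(activations, weights, relevance_out, eps=1e-12):
    """Redistribute relevance from a layer's outputs to its inputs (z+ rule).

    activations:   (n_in,)  non-negative inputs to the layer
    weights:       (n_in, n_out) layer weights
    relevance_out: (n_out,) relevance already assigned to the outputs
    """
    positive_weights = np.maximum(weights, 0.0)
    z = activations[:, None] * positive_weights   # contribution of input i to output j
    totals = z.sum(axis=0) + eps                  # normalizer per output neuron
    return (z / totals) @ relevance_out           # share of each output's relevance

activations = np.array([1.0, 2.0, 0.0])
weights = np.array([[1.0, -1.0],
                    [0.5,  2.0],
                    [3.0,  1.0]])
relevance_out = np.array([2.0, 4.0])

relevance_in = zplus_relevance(activations, weights, relevance_out)
```

The relevance is conserved: the input relevances add up to the output relevance, and an inactive input receives none. Repeating this step layer by layer carries the relevance back to the input pixels.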

As we can see in the image below, the pixels of the bicycle have the maximum contribution in the predicted biking class.

Figure 5: Deep Taylor result for VGG16 model trained on human activity recognition dataset.


SHAP

Shapley values are a concept from cooperative game theory that assigns credit to each player in a game for a particular outcome. Applied to machine learning models, each model feature is treated as a “player” in the game, and each feature is allocated a proportionate share of the credit for the prediction. SHAP provides various methods, based on model architecture, to calculate Shapley values for different models.
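For a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over every ordering of the features. The sketch below does this in plain Python; the feature names and the scoring rule in `toy_model` are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

def toy_model(features):
    """Hypothetical score given the set of features that are 'present'."""
    score = 0.0
    if "petal_length" in features:
        score += 4.0
    if "sepal_length" in features:
        score -= 1.0
    if "petal_length" in features and "petal_width" in features:
        score += 2.0   # interaction shared between the two petal features
    return score

players = ["petal_length", "petal_width", "sepal_length"]
phi = shapley_values(players, toy_model)
```

The interaction bonus is split equally between the two petal features, and the attributions add up to the score with all features present. The cost grows factorially with the number of features, which is why SHAP relies on approximations.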

The SHAP gradient explainer is a method to derive SHAP values on image data. It calculates the gradient of the output score with respect to the input, i.e., the pixel intensities. This feature attribution method is designed for differentiable models like convolutional neural networks.

In the following image, the pixels in red are most positively contributing to the biking class. In contrast, the pixels in blue are confusing the model with a different class and hence negatively contributing.

Figure 6: SHAP result for VGG16 model trained on human activity recognition dataset.

The SHAP kernel explainer is a model-agnostic method for calculating Shapley values. It is an extended and adapted version of linear LIME. The Kernel Explainer builds a weighted linear regression using your data and predictions; whatever function produces the predicted values, the coefficients of the weighted linear regression solution are the Shapley values. Object detection models are non-differentiable, so a gradient-based explanation method cannot be used on them. However, we can use the SHAP kernel explainer.
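The weighted regression can be sketched for a three-feature toy problem by enumerating every coalition. This is an illustration of the idea, not the `shap` library's implementation: the toy value function is invented, and a large finite weight stands in for the infinite kernel weight on the empty and full coalitions.

```python
import numpy as np
from itertools import product
from math import comb

def kernel_shap(value_fn, n_features, big_weight=1e6):
    """Shapley values via weighted linear regression over all coalitions."""
    rows, targets, weights = [], [], []
    for z in product([0, 1], repeat=n_features):
        size = sum(z)
        if size == 0 or size == n_features:
            w = big_weight   # approximates the infinite weight on these two coalitions
        else:
            w = (n_features - 1) / (comb(n_features, size) * size * (n_features - size))
        rows.append((1.0,) + z)          # intercept column plus on/off indicators
        targets.append(value_fn(z))
        weights.append(w)
    X = np.array(rows, dtype=float)
    y = np.array(targets, dtype=float)
    sw = np.sqrt(np.array(weights))
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]             # base value, per-feature Shapley values

def toy_value(z):
    """Hypothetical model output: features 0 and 1 only count together."""
    return 10.0 * (z[0] and z[1]) + 2.0 * z[2]

base_value, shap_values = kernel_shap(toy_value, n_features=3)
```

Features 0 and 1 split their joint contribution of 10 equally, and feature 2 keeps its own contribution of 2. Because only the outputs of `toy_value` are used, the same procedure applies to non-differentiable models.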

This method explains one detection in one image. Here we are explaining the person in the middle. We see that the dark red patches with the highest contribution are located within the bounding box of our target. Interestingly, the highest contribution seems to come from head and shoulders.

Figure 7: SHAP results for YOLOv3 model trained on the COCO dataset.

SHAP also has the functionality to derive explanations for NLP models. We demonstrate the use of SHAP for text classification and text summarization, which explains the contribution of words or a combination of words towards a prediction. 

As we can see below, in the text classification example, the word “company” is the highest contributor to classifying the text into the business class. In the text summarization example, the phrase “but it is almost like we are neglected” is the highest contributor to the term “neglected” in the summary.

Figure 8: (a) SHAP result for text classification using BERT and (b) SHAP result for text summarization using sshleifer/distilbart-cnn-12-6 model.


SODEx

The Surrogate Object Detection Explainer (SODEx) explains an object detection model using LIME, which explains a single prediction with a linear surrogate model. It first segments the image into super-pixels and generates perturbed samples. Then, the black box model predicts the result for every perturbed sample. Using the dataset of perturbed samples and their responses, it trains a surrogate linear model that provides the super-pixel weights.

It gives explanations for all the detected objects in an image. The green-colored patches show positive contributions, and the red-colored patches negatively contribute. Here, the model focuses on hands and legs to detect a person.

Figure 9: SODEx result for YOLOv3 model trained on the COCO dataset.


Seg-Grad-CAM

SEG-GRAD-CAM is an extension of Grad-CAM for semantic segmentation. It can generate heat maps that explain the relevance of individual pixels or regions of the input image to the model’s decisions. GradCAM uses the gradient of the logit for the predicted class with respect to chosen feature layers to determine their general relevance. A CNN for semantic segmentation, however, produces logits for every pixel and class. This lets us adapt GradCAM to a semantic segmentation network flexibly, since we can take the gradient of the logit of just a single pixel, of the pixels of an object instance, or simply of all pixels of the image.

Like in GradCAM, the red pixels are the most positively contributing, whereas the pixels in blue have zero positive contributions. Here, the pixels of the car’s windscreen have the maximum contribution.

Figure 10: SegGradCAM result for U-Net model trained on CamVid dataset.


Explainable AI builds confidence in the model’s behavior by helping us check that the model does not focus on idiosyncratic details of the training data that will not generalize to unseen data. It therefore helps in assessing the fairness of ML models. We have implemented explainable AI techniques for models trained on tabular, text, and image data, thus enhancing these models’ transparency and interpretability. You can use this information to verify that the model is behaving as expected, recognize bias in your models, and get ideas for improving your model and training data.



Top 5 Challenges- Implementing Industry 4.0!

Today, the entire world is grappling with the COVID-19 pandemic, which has intensified supply chain concerns and prompted many businesses to rethink their sourcing strategies. Several businesses are focusing on localization for two reasons: one, to be closer to the source, and the other, to minimize the risk of disruption.

Well, necessity is the mother of invention, and this is undeniably true for technological innovations, particularly Industry 4.0 solutions. Evolution is the result of real hardship. In the case of manufacturing, the move to Industry 4.0 is driven by volatile market demand for better and quicker production techniques, shrinking margins, and intense competition among enterprises, none of which can be addressed without smart technology.

Smart Manufacturing will be based on digitization and Industry 4.0, and large enterprises are inclining towards digital innovation. However, SMEs and MSMEs are still struggling with several challenges in adopting Digital Transformation and Industry 4.0 initiatives. These obstacles may dissuade some manufacturing companies from adopting these technologies, causing them to fall behind their peers.

The Top Five Challenges!

SMEs and MSMEs still experience difficulties achieving Industry 4.0 goals, although smart manufacturing is often associated with Industry 4.0 and digital transformation. Here are the five challenges:

1. Organization Culture:

Evolving from ad-hoc decisions to data-based decision-making is one of the biggest challenges for any organization. Part of this is driven by data availability and by the CXOs’ awareness of, and willingness to adopt, new digital technologies. Balancing culture and technology together is one of the toughest challenges of digital transformation.

2. Data Readiness/Digitization:

Any Digital Revolution succeeds on the availability of data. Unfortunately, this is one of the biggest areas for improvement for SMEs. Most manufacturing plants in SMEs lack basic data capture and storage infrastructure.

Most plants run PLCs from different vendors (e.g., Siemens, Rockwell, Hitachi, Mitsubishi), each with its own protocol, and the data is encrypted and locked. This either requires the control systems providers to unlock the encryption or calls for separate sensor or gateway installations. This is a huge added cost, and SMEs, which have been running their businesses frugally, have not seen any benefits from it so far.

3. Data Standardization and Normalization:

This is a crucial step in the Digital Transformation journey, enabling the data to be used for real-time visibility, benchmarking, and machine learning.

Most SMEs grow organically, with the intent to grow as profitably as possible. Typically, IT and OT technology investments are kept to a bare minimum. As a result, most SMEs are missing SCADA/MES systems that integrate the data in a meaningful way and help store it centrally. Without this middleware, most of the data needs to be sourced directly from different sensors or PLCs and sent via gateways.

This data cannot be consumed directly for visualization and needs an expensive middleware solution (viz., LIMS (Abbott, Thermo Fisher), and LEDs- GE Proficy), which is again an added cost.

Additionally, the operational data is not all stored in a centralized database. Instead, it is available in real-time from Programmable Logic Controllers (PLCs), machine controllers, Supervisory Control and Data Acquisition (SCADA) systems, and time-series databases throughout the factory. This increases the complexity of data acquisition and storage.

4. Lack of Talent for Digital:

Believe it or not, we have been reeling under a massive talent crunch for digital technologies. As of 2022, a huge war for digital talent is under way across services, consulting, and product-based companies.

As a result, we don’t have enough people who have seen the actual physical shop floor, understand day-to-day challenges, and have enough digital and technical skills to enable digital transformation. A systematic approach is needed to help up-skill existing resources and develop new digital talent across all levels.

5. CXO Sponsorship:

This is a key foundation for any digital transformation and Industry 4.0 initiative. Unless there’s CXO buy-in and sponsorship, any digital transformation initiative is bound to fail. For CXOs to start believing in the cause, they need to be onboarded, starting with just awareness of what’s possible, emphasizing benefits and ROI as reasons to believe.

Once there’s a top-down willingness and drive, things will become much easier regarding funding, hiring of technical talent or consulting companies, and execution.

Final Takeaway

It should go without saying that the above list does not include all the challenges manufacturers encounter when they embark on the Industry 4.0 journey. Additionally, more industries and professionals should actively engage in skill improvement initiatives, both for immediate implementation and to prepare employees for the future. Because of the rapid pace and ever-changing nature of IIoT technologies, this list of challenges will continue to change over time.

We would love to hear back from you on your experiences implementing Industry 4.0 and digital transformation projects and the challenges you faced.

Please feel free to comment and share your experiences.

#Affine is conducting an Event, Demystifying Industry 4.0, on Industry 4.0 and Digital Transformation for CXOs on March 25th, 2022. The Event aims to provide major Industry 4.0 use cases for automotive suppliers and ecosystems.

Stay tuned for more information!



