Deep Learning Demystified 2: Dive Deep into Convolutional Neural Networks

The above photo was not created by a specialized app or Photoshop. It was generated by a deep learning algorithm that uses convolutional networks to learn artistic features from various paintings and transform any photo into a rendition of how an artist might have painted it.

Convolutional Neural Networks have become part of state-of-the-art solutions in areas such as:

  • Image recognition
  • Object Recognition
  • Self-driving cars (identifying pedestrians and other objects)
  • Emotion recognition
  • Natural Language Processing

A few days back, Google surprised me with a video called Smiles 2016, which stitched together the photos from 2016 in which I was with family, friends, and colleagues – a collection of photos where everyone was smiling. That is emotion recognition at work. In this blog, we will discuss a couple of deep learning architectures that power such applications.

Before we dive into CNNs, let's understand why a feed-forward neural network is not enough. According to the universality theorem discussed in the previous blog, a network can approximate any function just by adding neurons (functions), but there is no guarantee on how long it will take to reach a good solution. Feed-forward neural networks also flatten images into a single vector, losing all the spatial information that comes with an image. So for problems where spatial features matter, CNNs achieve higher accuracy in far less time than feed-forward neural networks.

Before we dive into what a Convolutional Neural Network is, let's get comfortable with the nuts and bolts that form it.

Images

Before we dive into CNNs, let's take a look at how a computer sees an image.

What we see
What a computer sees

It is worth pausing on the fact that a computer sees images and videos as matrices of numbers. A common way of looking at an image in computer vision is as a matrix of dimensions Width * Height * Channels, where the channels are Red, Green, and Blue (and sometimes an alpha channel as well).
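
As a minimal illustration (assuming NumPy and Pillow are available; the file name is a placeholder), an image really is just such a matrix of numbers:

    import numpy as np
    from PIL import Image

    # Load an RGB image; "photo.jpg" is a placeholder file name.
    img = Image.open("photo.jpg").convert("RGB")

    # NumPy stores it as a (height, width, channels) array of 0-255 integers.
    pixels = np.array(img)
    print(pixels.shape)   # e.g. (480, 640, 3)
    print(pixels[0, 0])   # the red, green and blue values of the top-left pixel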

Filters

Filters are small matrices of numbers, usually of size 3*3*3 (width, height, channels) or 7*7*3. Filters perform operations like blurring, sharpening, or outlining on a given image. Historically, these filters were carefully hand-picked to extract various features of an image. In our case, the CNN learns these filters automatically using a combination of techniques like gradient descent and backpropagation. Filters are moved across an image from the top left to the bottom right to capture all the essential features. They are also called kernels in neural networks.

Convolution

In a convolution layer, we convolve the filter with patches across an image. For example, the left-hand side of the image below is a matrix representation of a dummy image and the middle is the filter or kernel. The right side shows the output of the convolution layer. Look at the formula in the image to understand how the kernel and a patch of the image are combined to form a new pixel.

The first pixel in the output image is being calculated.

Let’s see another example of how the next pixel in the image is being generated.

The second pixel in the output image is being calculated.
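
A rough NumPy sketch of the arithmetic described above (the image and kernel values are made up for illustration): each output pixel is the sum of the element-wise product of the kernel and one image patch.

    import numpy as np

    # A dummy 5x5 grayscale image and a 3x3 sharpen-like kernel (values are illustrative).
    image = np.array([
        [1, 2, 3, 0, 1],
        [4, 5, 6, 1, 0],
        [7, 8, 9, 2, 1],
        [1, 0, 1, 3, 2],
        [2, 1, 0, 1, 4],
    ], dtype=float)

    kernel = np.array([
        [ 0, -1,  0],
        [-1,  5, -1],
        [ 0, -1,  0],
    ], dtype=float)

    # Slide the kernel from top-left to bottom-right ("valid" convolution, no padding).
    out_h = image.shape[0] - kernel.shape[0] + 1
    out_w = image.shape[1] - kernel.shape[1] + 1
    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + 3, j:j + 3]
            output[i, j] = np.sum(patch * kernel)   # one output pixel
    print(output)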

Max-Pooling

Max pooling is used to reduce dimensionality and down-sample an input. The best way to understand max pooling is with an example. The image below describes what a 2*2 max-pooling layer does.

In both the examples, for convolution and max pooling, the image shows the computation for only 2 pixels, but in reality the same technique is applied across the entire image.
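
A minimal NumPy sketch of 2*2 max pooling with stride 2 on a dummy 4*4 feature map:

    import numpy as np

    feature_map = np.array([
        [1, 3, 2, 4],
        [5, 6, 1, 2],
        [7, 2, 9, 1],
        [3, 4, 6, 8],
    ])

    # 2x2 max pooling with stride 2: keep the maximum of each non-overlapping 2x2 block.
    pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)
    # [[6 4]
    #  [7 9]]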

Now, with an understanding of all the important components, let's take a look at what a Convolutional Neural Network looks like.

The example used in Stanford CNN classes

As you can see from the above image, a CNN is a combination of layers stacked together. The above architecture can be depicted simply as (CONV-RELU-CONV-RELU-POOL) * 3 followed by a fully connected layer.
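
A hedged sketch of such a stack in Keras (the layer sizes and the 10-class output are assumptions for illustration, not the exact Stanford example):

    from tensorflow.keras import layers, models

    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=(32, 32, 3)))   # a small RGB image

    # Three CONV-RELU-CONV-RELU-POOL blocks
    for filters in (32, 64, 128):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))

    # Fully connected layers at the end
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(10, activation="softmax"))        # assumed 10 output classes

    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.summary()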

Challenges

Convolutional Neural Networks need huge amounts of labeled data and lots of computation power to train. Architectures like AlexNet, ZF Net, VGG Net, GoogLeNet, and Microsoft ResNet typically take weeks of training to achieve state-of-the-art performance. Does that mean an organization without huge volumes of data and computation power cannot take advantage of them? The answer is no.

Transfer Learning to the Rescue

Most of the winners of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition have open-sourced their architectures and the weights associated with these networks. It turns out that most of the weights, particularly those of the filters, can be reused after fine-tuning to domain-specific problems. So, to take advantage of these convolutional neural networks, all we need to do is retrain the last few layers of the network, which in general takes very little data and computation power. For several of our scenarios, we were able to train models with state-of-the-art performance on GPU machines in minutes to hours.
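
A minimal sketch of this kind of fine-tuning in Keras (the VGG16 base, image size, and 5-class head are assumptions, not the specific setup described above):

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    # Reuse convolutional filters learned on ImageNet and freeze them.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False

    # Only the small classification head on top is trained on the new, domain-specific data.
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(5, activation="softmax"),   # assumed 5 domain-specific classes
    ])

    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=5)   # hypothetical training data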

Conclusion

Apart from use cases like image recognition, CNNs are widely used in various network topologies for object recognition (what objects are located in images), GANs (a recent breakthrough in helping computers create realistic images), converting low-resolution images to high-resolution images, and revolutionizing the health sector in various forms of cancer detection, among many others. In recent months, architectures built for NLP have also achieved state-of-the-art results.

Statistical Model Lifecycle Management

Organizations have realized quantum jumps in business outcomes through the institutionalization of data-driven decision making. Predictive Analytics, powered by the robustness of statistical techniques, is one of the key tools leveraged by data scientists to gain insight into probabilistic future trends. Various mathematical models form the DNA of Predictive Analytics.

A typical model development process includes identifying factors/drivers, data hunting, cleaning and transformation, development, validation (business and statistical) and finally productionization. In the production phase, as actual data flows into the model environment, the true accuracy of the model is measured. Quite often there are gaps (errors) between predicted and actual numbers. Business teams have their own heuristic definitions and benchmarks for this gap, and any deviation leads to a hunt for additional features/variables and data sources, finally resulting in rebuilding the model.

Needless to say, this leads to delays in business decisions and has several cost implications.

Can this gap (error) be better defined, tracked and analyzed before declaring model failure? How can stakeholders assess the Lifecycle of any model with minimal analytics expertise?

At Affine, we have developed a robust and scalable framework which can address the above questions. In the next section, we will highlight the analytical approach and present a business case where this was implemented in practice.

Approach

The solution was developed based on the concepts of Statistical Quality Control, especially the Western Electric rules. These are decision rules for detecting "out-of-control" or non-random conditions using the principle of process control charts. The distribution of observations relative to the control chart limits indicates whether the process in question should be investigated for anomalies.

X is the Mean error of the analytical model based on historical (model training) data. Outlier analysis needs to be performed to remove any exceptional behavior.
Zone A = Between Mean ± (2 x Std. Deviation) & Mean ± (3 x Std. Deviation)
Zone B = Between Mean ± Std. Deviation & Mean ± (2 x Std. Deviation)
Zone C = Between Mean & Mean ± Std. Deviation.
Alternatively, Zone A, B, and C can be customized based on the tolerance of Std. Deviation criterion and business needs.

Rule 1: Any single data point falls outside the 3σ limit from the centerline (i.e., any point that falls outside Zone A, beyond either the upper or lower control limit).
Rule 2: Two out of three consecutive points fall beyond the 2σ limit (in Zone A or beyond), on the same side of the centerline.
Rule 3: Four out of five consecutive points fall beyond the 1σ limit (in Zone B or beyond), on the same side of the centerline.
Rule 4: Eight consecutive points fall on the same side of the centerline (in Zone C or beyond).

If any of the rules are satisfied, it indicates that the existing model needs to be re-calibrated.
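
A minimal Python sketch of these four rules applied to a series of model errors (the error values, mean, and standard deviation below are illustrative; in practice they come from the training-period errors after outlier removal):

    import numpy as np

    def western_electric_alerts(errors, mean, std):
        """Return the indices of points that trigger any of the four rules."""
        z = (np.asarray(errors) - mean) / std
        alerts = set()
        for i, zi in enumerate(z):
            # Rule 1: a single point beyond 3 sigma.
            if abs(zi) > 3:
                alerts.add(i)
            # Rule 2: 2 of 3 consecutive points beyond 2 sigma, same side.
            if i >= 2:
                w = z[i - 2:i + 1]
                for side in (1, -1):
                    if np.sum(side * w > 2) >= 2:
                        alerts.add(i)
            # Rule 3: 4 of 5 consecutive points beyond 1 sigma, same side.
            if i >= 4:
                w = z[i - 4:i + 1]
                for side in (1, -1):
                    if np.sum(side * w > 1) >= 4:
                        alerts.add(i)
            # Rule 4: 8 consecutive points on the same side of the centerline.
            if i >= 7:
                w = z[i - 7:i + 1]
                if np.all(w > 0) or np.all(w < 0):
                    alerts.add(i)
        return sorted(alerts)

    # Illustrative usage with made-up forecast errors.
    training_mean, training_std = 0.0, 1.5
    recent_errors = [0.4, -0.2, 1.1, 2.9, 3.2, 4.8, 0.3, 0.1]
    print(western_electric_alerts(recent_errors, training_mean, training_std))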

Business Case

A large beverage company wanted to forecast industry-level demand for a specific product segment in multiple sales geographies. Affine evaluated multiple analytical techniques and identified a champion model based on accuracy, robustness, and scalability. Since the final model was to be owned by the client's internal teams, Affine enabled assessment of the model's lifecycle stage through an automated process. A visualization tool was developed which included an alert system to help users proactively identify any red flags. A detailed escalation mechanism was outlined to address any queries or red flags related to model performance or accuracy.

Fig1: The most recent data available is till Jun-16. An amber alert indicates that an anomaly is identified but this is most likely an exception case.

Following are possible scenarios based on actual data for Jul-16.

Case 1

The process is in control and no change to the model is required.

Case 2:

A red alert is generated, which indicates the model is not able to capture some macro-level shift in the industry behavior.

In this case, one or more of the Western Electric rules listed earlier (for example, a single data point falling outside the 3σ limit from the centerline) are satisfied, indicating that the existing model needs to be re-calibrated.

Key Impact and Takeaways

  1. Quantify and develop benchmarks for error limits.
  2. A continuous monitoring system to check if predictive model accuracies are within the desired limit.
  3. Prevent undesirable escalations thus rationalizing operational costs.
  4. Enabled through a visualization platform; hence it does not require strong analytical expertise.

Product Life Cycle Management in Apparel Industry

Product Life Cycle Estimation

“Watch the product life cycle; but more important, watch the market life cycle”

~Philip Kotler

Abstract

The product life cycle describes the period over which an item is developed, brought to market and eventually removed from the market.

This paper describes a simple method to estimate Life Cycle stages – Growth, Maturity, and Decline (as seen in the traditional definitions) of products that have historical data of at least one complete life cycle.

Here, two different calculations have been done which help the business identify the number of weeks after which a product moves to a different stage and apply the PLC for improving demand forecasting.

A log-growth model is fit using Cumulative Sell-Through and Product Age, which helps identify the various stages of the product. A log-linear model is fit to determine the rate of change of product sales due to a shift in its stage, ceteris paribus.

The life span of a product and how fast it goes through the entire cycle depends on market demand and how marketing instruments are used and vary for different products. Products of fashion, by definition, have a shorter life cycle, and they thus have a much shorter time in which to reap their reward.

An Introduction to Product Life Cycle (PLC)

Historically, PLC is a concept that has been researched as early as 1957 (refer Jones 1957, p.40). The traditional definitions mainly described 4 stages – Introduction, Growth, Maturity, and Decline. This was used mainly from a marketing perspective – hence referred to as Marketing-PLC.

With the development of new types of products and additional research in the field, Life Cycle Costing (LCC) and Life Cycle Assessment (LCA) were added to the traditional definition to give the Engineering PLC (or E-PLC). This definition considers the cost of using the product during its lifetime, services necessary for maintenance and decommissioning of the product.

According to Philip Kotler, "The product life cycle is an attempt to recognize distinct stages in the sales history of the product." In general, PLC has 4 stages – Introduction, Growth, Maturity, and Decline. But for some industries with fast-moving products, such as apparel, PLC can be defined in 3 stages. PLC helps to study the degree of product acceptance by the market over time, including major rises or falls in sales.

PLC also varies based on product type that can be broadly divided into:

  1. Seasonal – Products that are seasonal (for e.g. mufflers, that are on shelves mostly in winter) have a steeper incline/decline due to the short growth and decline periods.
  2. Non-Seasonal – Products that are non-seasonal (for e.g. jeans, that are promoted in all seasons) have longer maturity and decline periods as sales tend to continue as long as stocks last.

Definition of Various Stages of PLC

Market Development & Introduction

This is when a new product is first brought to market, before there is a proven demand for it. In order to create demand, investments are made in consumer awareness and promotion of the new product to get sales going. Sales and profits are low, and there are only a few competitors in this stage.

Growth

In this stage, demand begins to accelerate and the size of the total market expands rapidly. Production costs fall and high profits are generated.

Maturity

Sales growth reaches a point beyond which it will not increase. The number of competitors increases, and so market share decreases. Sales are maintained for some period with a good profit.

Decline

Here, the market becomes saturated and the product becomes unpopular and eventually stops selling. This stage can occur as a natural result, but can also be due to the introduction of new and innovative products and better product features from competitors.

This paper deals with the traditional definition of PLC and the application in Fashion products.

Why do Businesses Need PLC and How Does it Help Them?

Businesses have always invested significant amounts of resources to estimate PLC and demand. Estimating the life cycle of a new product accurately helps businesses make several key decisions, such as:

  • Provide promotions and markdowns at the right time.
  • Plan inventory levels better by incorporating PLC in demand prediction.
  • Plan product launch dates/season.
  • Determine the optimal discount percentages based on a product’s PLC stage (as discussed later in this paper).

Businesses primarily rely on the business sense and experience of their executives to estimate a product’s life cycle. Any data driven method to easily estimate PLC can help reduce costs and improve decision making.

How Does the Solution in this Paper Help?

The solution detailed in this paper can help businesses use data of previously launched products to predict the life cycles of similar new products. The age at which products transition from one life cycle phase to another as well as the life cycle curves of products can be obtained through this process. This also helps to identify the current stage of the products and the rate of sales growth during stage transition.

Below is an overview of the steps followed to achieve these benefits:

  • To identify products similar to a newly released product, we clustered products based on the significant factors affecting sales. This gives us a data-based PLC trend.
  • Next, sales data is used to plot the Cumulative Sell-Through Rate vs Product Age (in weeks).
  • A log-growth model fit across this plot will provide the Life Cycle trend of that product or cluster of products.
  • The second differential of this curve can be analyzed to identify shifts in PLC phases, to estimate the durations of each of the PLC phases.

Detailed Approach to Estimate PLC

The process followed to determine the different PLC stages is a generic one that can be incorporated into any model. However, in this paper, we have described how it was employed to help determine the effect of different PLC stages on sales for the apparel industry.

The procedure followed has been described in detail in the steps below:

i. Product Segmentation

The first step in estimating PLC is to segment products based on the features that primarily influence sales.

To predict the life cycle factor in demand prediction of a new product, we need to find similar products among those launched previously. The life cycle of the new product can be assumed to be similar to these.

ii. Identification of PLC Stages

To identify various stages, factors like Cumulative Sell through rate and Age of product were considered. The number of weeks in each stage was calculated at category level which consists of a group of products.

Cumulative sell-through is defined as cumulative sales over the period divided by the total inventory at the start of the period. Sales of products were aggregated at the category level by summing sales at the same product age. For example, the sales of all products at an age of 1 week are aggregated to get the sales of that category in week 1.

After exploring multiple methods to determine the different stages, we finally used a log-growth model to fit a curve between age and cumulative sell-through. Its equation is given below for reference:

Note: Φ1, Φ2 & Φ3 are parameters that control the asymptote and growth of the curve.
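
The equation image from the original is not reproduced here; a common three-parameter log-growth (logistic) form consistent with the note above is CST(age) = Φ1 / (1 + exp((Φ2 − age) / Φ3)), although this exact form is an assumption. A minimal SciPy sketch of fitting such a curve and deriving phase cut-offs from its curvature (the data and the ±1.32·Φ3 convention are illustrative):

    import numpy as np
    from scipy.optimize import curve_fit

    def log_growth(age, phi1, phi2, phi3):
        # phi1: asymptote (max cumulative sell-through); phi2, phi3: location and growth rate.
        return phi1 / (1.0 + np.exp((phi2 - age) / phi3))

    # Illustrative weekly data: product age (weeks) and cumulative sell-through rate.
    np.random.seed(0)
    age = np.arange(1, 27)
    cst = 0.9 / (1.0 + np.exp((10 - age) / 3.0)) + np.random.normal(0, 0.01, age.size)

    params, _ = curve_fit(log_growth, age, cst, p0=[1.0, 10.0, 3.0])
    phi1, phi2, phi3 = params

    # One possible convention for the two cut-offs: the points of maximum curvature of the
    # logistic lie at roughly phi2 +/- 1.32 * phi3, which can serve as the growth->maturity
    # and maturity->decline boundaries.
    growth_end = phi2 - 1.32 * phi3
    maturity_end = phi2 + 1.32 * phi3
    print(round(growth_end, 1), round(maturity_end, 1))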

Using the inflexion points of the fitted curve, cut-offs for the different phases of the product life cycle were obtained.

The fitted curve had 2 inflexion points that made it easy to differentiate the PLC stages.

The plot above shows the variation of Cumulative sell through rate (y-axis) vs Age (x-axis). The data points are colored based on the PLC life stage identified:  Green for “Growth Stage”, Blue for “Maturity Stage” and Red for “Decline Stage”.

Other Methods Explored

Several other methods were explored before determining the approach discussed in the previous section. The decision was based on the advantages and drawbacks of each of the methods given below:

Method 1:

Identification of PLC stages by analyzing the variation in Sell through and Cumulative Sell through.

Steps followed:

  • Calculated (Daily Sales / Total Inventory) across the Cumulative Sell-Through rate at a category level.
  • A curve between Cumulative Sell-Through rate (x-axis) and (Daily Sales / Total Inventory) (y-axis) was fitted using non-linear least squares regression.
  • Using the inflexion points of the fitted curve, cut-offs for the different phases of the product life cycle were obtained.

Advantages: The fitted curve followed a ‘bell-curve’ shape in many cases that made it easier to identify PLC stages visually.

Drawbacks: There weren’t enough data points in several categories to fit a ‘bell-shaped’ curve, leading to issues in the identification of PLC stages.

The plot above shows the variation of Total Sales (y-axis) vs Age (x-axis). The data points are colored based on the PLC life stage identified:  Green for “Growth Stage”, Blue for “Maturity Stage” and Red for “Decline Stage”.

Method 2:

Identification of PLC stages by analyzing the variation in cumulative sell through rates with age of a product (Logarithmic model).

Steps followed:

  • Calculated cumulative sell through rate across age at a category level.
  • A curve between age and cumulative sell through rate was fitted using a log linear model.
  • Using the inflexion points of the fitted curve, cut-offs for the different phases of the product life cycle were obtained.

Drawbacks:

  1. Visual inspection of the fitted curve does not reveal any PLC stages.
  2. This method could not capture the trend as accurately as the log-growth models.

The plot above shows the variation of Cumulative sell through rate (y-axis) vs Age (x-axis). The data points are colored based on the PLC life stage identified:  Green for “Growth Stage”, Blue for “Maturity Stage” and Red for “Decline Stage”.

Application of PLC stages in Demand Prediction

After identifying the different PLC phases for each category, this information can be used directly to determine when promotions need to be provided to sustain product sales. It can also be incorporated into a model as an independent categorical variable to understand the impact of the different PLC phases on predicting demand.

In the context of this paper, we used the PLC phases identified as a categorical variable in the price elasticity model to understand the effect of each phase separately. The process was as follows:

The final sales prediction model had data aggregated at a cluster and sales week level. PLC phase information was added to the sales forecasting model by classifying each week in the cluster-week data into “Growth”, “Maturity” or “Decline”, based on the average age of the products in that cluster and week.

This PLC classification variable was treated as a factor variable so that we can obtain coefficients for each PLC stage.

The modeling equation obtained was:
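
The original equation was presented as an image; a log-linear specification of the following general shape is assumed here (the exact regressors are illustrative, with "Decline" as the reference level):

    log(Sales) = β0 + Σi βi · Xi + βGrowth · I(PLC_Phase = Growth) + βMaturity · I(PLC_Phase = Maturity) + ε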

In the above equation, “PLC_Phase” represents the PLC classification variable. The output of the regression exercise gave beta coefficients for the PLC stages “Growth” and “Maturity” with respect to “Decline”.

The “Growth” and “Maturity” coefficients were then treated such that they were always positive. This was because “Growth” and “Maturity” coefficients were obtained w.r.t. “Decline” and since “Decline” had a factor of 1, the other 2 had to be greater than 1.

The treated coefficients obtained for each cluster were used in the simulation tool in the following manner (a small illustrative sketch follows this list; more details are given in the tool documentation):

  • If there is a transition from “Growth” to “Maturity” stages in a product’s life cycle – then the PLC factor multiplied to sales is (“Maturity” coefficient / “Growth” coefficient).
  • If there is a transition from “Maturity” to “Decline” stages in a product’s life cycle – then the PLC factor multiplied to sales is (“Decline” coefficient / “Maturity” coefficient).
  • If there is no transition of stages in a product’s life cycle, then PLC factor is 1.
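
A minimal sketch of this factor logic (the coefficient values are hypothetical):

    def plc_factor(previous_stage, current_stage, coefficients):
        """Multiplier applied to predicted sales when a product changes PLC stage."""
        if previous_stage == current_stage:
            return 1.0
        return coefficients[current_stage] / coefficients[previous_stage]

    # Hypothetical treated coefficients for one cluster ("Decline" is the reference level).
    coeffs = {"Growth": 1.8, "Maturity": 1.3, "Decline": 1.0}

    print(plc_factor("Growth", "Maturity", coeffs))   # Maturity / Growth
    print(plc_factor("Maturity", "Decline", coeffs))  # Decline / Maturity
    print(plc_factor("Growth", "Growth", coeffs))     # no transition -> 1.0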

Conclusion

The method described in this paper enables identification of PLC stages for the apparel industry and demand prediction for old and new products. This is a generalized method and can be used for different industries as well, where a product may exhibit 4 or 5 stages of life cycle.

One of the drawbacks of the product life cycle is that it is not always a reliable indicator of the true lifespan of a product, and adhering to the concept may lead to mistakes. For example, a dip in sales during the growth stage can be temporary rather than a sign that the product is reaching maturity. If the dip causes a company to reduce its promotional efforts too quickly, it may limit the product's growth and prevent it from becoming a true success.

Also, if there are a lot of promotional activities or discounts applied, it is difficult to identify the true life cycle.


Leveraging Advanced Analytics for Competitive Advantage Across FMCG Value Chain

Introduction

According to the World Bank, the FMCG (Fast Moving Consumer Goods) market in India is expected to grow at a CAGR of 20.6% and reach US$ 103.7 billion by 2020, up from US$ 49 billion in 2016. Some of the key changes fueling this growth are:

  • Industry Expansion – ITC Ltd has forayed into the frozen market with plans to launch frozen vegetables and fruits, aiming US$ 15.5 billion in revenues by 2030. Similarly, Patanjali Ayurveda is targeting a 10x growth by 2020, riding on the ‘ethnic’ recipes and winning consumer share of wallet.
  • Rural and semi-urban segments are growing at a rapid pace with FMCG accounting for 50% of total rural spending. There is an increasing demand for branded products in rural India. Rural FMCG market in India is expected to grow at a CAGR of 14.6%, and reach US$ 220 billion by 2025 from US$ 29.4 billion in 2016.
  • Logistics sector will see operational efficiencies with GST reforms. Historically, firms had installed hubs and transit points in multiple states to evade state value added tax (VAT). This is because the hub-to-hub transfer is treated as a stock transfer, and does not attract VAT. Firms can now focus on centralized hub operations, thus gaining efficiencies.
  • The rising share of the organized market in FMCG sector, coupled with the slow adoption of GST by wholesalers has led many FMCGs to explore alternative distribution channels such as direct distribution and cash-and-carry. Dabur, Marico, Britannia, and Godrej have already started making structural shifts in this direction.
  • Many leading FMCGs have started selling their brands through online grocery portals such as Grofers, Big Basket, and AaramShop. The trend is expected to increase with the drive towards a cashless economy and evolving payment mechanisms.
  • Traditional advertising mediums have seen a dip with the advent of YouTube, Netflix, and Hotstar.

Digital medium is being used more and more for branding and customer connect.

On top of this, barriers to new entrants in FMCG sector are eroding, owing to a wider consciousness of consumer needs, availability of finance and product innovations. This has raised the level of competition in the industry and generated a need to rethink the consumer offer, route to market, digital consumer engagement and premiumization.

In the above context, a few buzzwords are circulating in the FMCG corridors, such as Analytics, Big Data, Cloud, Predictive, Artificial Intelligence (AI), etc. They are being discussed in the light of preparing for the future – improved processes, innovations and transformations. Some disruptive use cases are:

  • With a growing focus on direct distribution, AI becomes all the more important in helping sales personnel offer the right trade promotions on the go.
  • With the rising organized sector in the urban segment, machine learning can improve the effectiveness of go-to-market strategy by allowing customized shelf planning for various kinds of retailers.
  • AI can help recognize customer perceptions based on market research interviews, make predictions about their likes/dislikes, and design new targeted product offerings. For instance, a leading FMCG brand uses AI to recognize micro facial expressions in focus group research for a fragrance to predict whether the customer liked the product or not. On the same lines, Knorr is using AI to recommend recipes to consumers based on their favorite ingredients. Consumers can share this information via SMS with Knorr.
  • AI enabled vending machines can help personalize consumer experience. Coca Cola has come with an AI powered app for vending machines. The app will personalize the ordering experience to the user, and allow ordering of multiple drinks ahead of time. It will also customize in-app offers to keep people coming back to the vending machine.
  • With the increasing adoption of the digital medium, Internet of Things (IoT) enabled smart manufacturing is disrupting the manufacturing function. An IoT framework allows sensing of data from machine logs, controllers, sensors, equipment, etc. in real time. This data can be used to boost product quality compliance monitoring and predictive maintenance & scheduling.

Let’s go through some of the analytics use cases across FMCG functions that promise immediate value.

Function Wise Analytics Use Cases in FMCG

Go To Marketing

Go to Marketing plays a very important function in FMCG value chain by enabling products to reach the market. FMCG distribution models range from direct store delivery to retailer warehousing to third party distributor networks. Further complications arise due to the structure of the Indian market – core market vs. organized retail. Analytics can help optimize the GTM processes in multiple ways:

  • Network planning can help in minimizing logistics cost by optimizing fleet routes, number of retail outlets touched, order of contact and product mix on trucks in sync with each retailer’s demand.
  • Inventory orders can be optimized to reduce inventory pile-ups for slow moving products, and stock outs for faster moving products. SKU level demand forecasting followed by safety stock scenario simulations can help in capturing the impact of demand variability and lead time variability on stock outs.
  • Assortments intelligence promises a win-win situation for both FMCGs and retailers. Retailers increase margins by localizing assortments to local demand while FMCGs ensure a fluid movement of right products in right markets.
  • Smart Visi cooler allocations can help in increasing brand visibility and performance. Visi coolers come in different shapes and sizes. Traditionally, the sales personnel decide what kind of visi cooler to give to which retailer based on gut based judgment. Machine learning can be used to learn from retailer demand, performance and visibility data to make an optimal recommendation, thus improving brand visibility and performance.

Supply Chain & Operations

Analytics has percolated the supply value chain deeply. IoT is being popularly identified as the technology framework that will lend major disruptions with pooled data sources such as telemetry (fleet, machines, mobile), inventory and other supply chain process data. A couple of key applications are:

  • Vendor selection using risk scores based on contract, responsiveness, pricing, quantity and quality KPIs. Traditionally, this was done based on the qualitative assessment. But now, vendor risk modeling can be done to predict vendor risk scores, and high performance-low risk vendors can be selected from the contenders.
  • ‘Smart’ warehousing with IoT sensing frameworks. Traditionally, warehouses have functioned as a facility to only store inventory. But, IoT and AI have transformed it into a ‘Smart’ efficiency booster hub. For instance, AI can be used to automatically place the incoming batches on the right shelves such that picking them up for distribution consumes lesser resources and hence lesser cost.

Sales 

Since FMCG sales structure is very personnel oriented, use cases such as incentive compensation, sales force sizing, territory alignment and trade promotion decisions continue to be very relevant.

  • Sales force sizing can be improved through a data-driven segmentation of retailer base followed by algorithmic estimation of sales effort required in a territory.
  • Trade promotion recommendations can be automated and personalized for a retailer based on historical performance and context. This will allow sales personnel to meet the retailer, key in some KPIs and recommend a personalized trade promotion in real-time.

Marketing

Analytics has always been a cornerstone for enabling marketing decisions. It can help improve the accuracy and speed of these decisions.

  • Market mix modeling can be improved by simulating omnichannel spend attribution scenarios, thus optimizing overall marketing budget allocation and ROI.
  • Brand performance monitoring can be made intuitive by using rich web based visualizations that promise multi-platform consumption and quick decision making.
  • Sentiment analysis can help monitor the voice of customers on social media. Web based tools can provide a real-time platform to answer business queries such as what is important to customers, concerns/highlights, response to new launches & promotions etc.

Manufacturing

With the advent of IoT, AI and Big Data systems, use cases such as predictive maintenance have become more feasible. Traditionally, a manufacturer would need to wait for a failure scenario to occur a few times, learn from it, and then predict the recurrence of that scenario. Companies are now focusing on sensing failures before they happen, so that the threat of new failures can be minimized. This can result in immense cost savings through continued operations and quality control. The benefits can be further extended to:

  • Improved production forecasting systems using POS data (enabled by retail data sharing), sophisticated machine learning algorithms and external data sources such as weather, macroeconomics that can be either scraped or bought from third party data vendors.
  • Product design improvements using attribute value modeling. The idea is to algorithmically learn which product attributes are most valued by consumers, and use the insights to design better products in the future.

Promotions and Revenue Management

Consumer promotions are central for gaining short term sales lifts and influencing consumer perceptions. Analytics can help in designing and monitoring these promotions. Also, regional and national events can be monitored to calculate the promotional lifts, which could be used for designing better future promotional strategies.

  • Automated consumer promotion recommendations based on product price elasticity, consumer feedback from market research data, cannibalization scenarios and response to historical promotions.

A key step in adopting and institutionalizing analytics use cases is to assess where you are in the analytics maturity spectrum.

Analytics Maturity Assessment

A thorough analytics maturity assessment can help companies understand their analytics positioning in the industry, and gain competitive advantage by enhancing analytical capabilities. Here are a few high-level parameters to assess analytics maturity:

Now that we understand where we are in the journey, let's look at "How do we get there?"

Levers of Change

FMCGs need to adopt a multi-dimensional approach with respect to adapting to the changing trends. The dimensions could be:

  1. Thought Leadership: Companies need to invest considerable effort in developing a research and innovation ecosystem. We are talking about leapfrogging the traditional process-improvement focus and getting on the innovation bandwagon. This requires hiring futurists, building think tanks inside the company, and creating an "Edison mindset" (a progressive trial-error-learning mindset).
  2. Technology: Traditionally, companies have preferred the second-mover route when it comes to adopting newer technology. The rationale is risk avoidance and surety. With analytics enablement technologies such as Big Data and cloud, this rationale falls to pieces, primarily because you are not the second mover but probably a double- or triple-digit mover, owing to widespread adoption across industries. Analytics enablement technologies have become a necessity for organizations.
  3. Learning from Others: Human beings are unique in their ability to learn from the experience of others. This ability helps them not only correct their errors but also find new possibilities. Similarly, can FMCG learn from modern fast-fashion retailers and revolutionize speed to market? Can it learn from telecom and deliver hyper-personalized offerings? Can it learn from banking and touch the consumer in multiple ways?

Harmonizing above levers along with relevant FMCG contextualization can lead to the desired transformation.

New Product Forecasting Using Deep Learning – A Unique Way

Background

Forecasting demand for new product launches has been a major challenge for industries, and the cost of error is high. Under-predict demand and you lose potential sales; over-predict it and there is excess inventory to take care of. Multiple research studies suggest that new products contribute about one-third of an organization's sales across various industries. For industries like apparel retail or gaming, which thrive on new launches and innovation, this number can easily inflate to as high as 70%. Hence, the accuracy of demand forecasts has been a top priority for marketers and inventory planning teams.

There are a whole lot of analytics techniques adopted by analysts and decision scientists to better forecast potential demand, the popular ones being:

  • Market Test Methods – Delphi/Survey based exercise
  • Diffusion modeling
  • Conjoint & Regression based look alike models

While Market Test Methods are still popular, they need a lot of domain expertise and cost-intensive processes to deliver the desired results. In recent times, techniques like conjoint and regression-based methods are more frequently leveraged by marketers and data scientists. A typical demand forecasting process is highlighted below:

Though the process brings an analytical temper of quantifying cross-synergies between business drivers and is scalable enough to generate dynamic business scenarios on the go, it falls short of expectations on the following two aspects:

  • It includes a heuristic exercise of identifying analogous products by manually defining product similarity. Besides, the robustness of this exercise depends on domain expertise. The manual process, coupled with the subjectivity involved, might lead to questionable accuracy standards.
  • It is still a supervised model and the key demand drivers need manual tuning to generate better forecasting accuracy.

For retailers and manufacturers, especially in apparel, food, etc., where the rate of innovation is high and assortments keep refreshing from season to season, a heuristic method would lead to high cost and error in any demand forecasting exercise.

With the advent of deep learning's image processing capabilities, the heuristic method of identifying feature similarity can be automated with a high degree of accuracy through techniques like Convolutional Neural Networks (CNN). It also minimizes the need for domain expertise, as the network learns feature similarity without much supervision. Since the primary reason for including product features in a demand forecasting model is to understand their cognitive influence on customer purchase behavior, a deep learning based approach can capture this with much higher accuracy. Besides, techniques like Recurrent Neural Networks (RNN) can be employed to make the models better at adaptive learning, making the system self-reliant with negligible manual intervention.

“Since the primary reason of including product features in demand forecasting model is to understand cognitive influence on customer purchase behavior, a deep learning framework is a better and accurate approach to capture the same”

In practice, CNN and RNN are two distinct methodologies and this article highlights a case where various Deep Learning models were combined to develop a self-learning demand forecasting framework.

Case Background

An apparel retailer wanted to forecast demand for its newly launched footwear styles across various lifecycle stages. The existing forecasting engine implemented various supervised techniques which were ensembled to generate the desired demand forecasts. It had a few major shortcomings:

  • The analogous product selection mechanism was heuristic and led to low accuracy levels in downstream processes.
  • The heuristic exercise was a significant roadblock in evolving the current process into a scalable architecture, making the overall experience a cost-intensive one.
  • The engine was not able to replicate the product life cycle accurately.

Proposed Solution

We proposed to tackle the problem through an intelligent, automated and scalable framework:

  • Leverage Convolutional Neural Networks (CNN) to facilitate the process of identifying analogous products. CNN techniques have been proven to generate high accuracies in image matching problems.
  • Leverage Recurrent Neural Networks (RNN) to better replicate product lifecycle stages. Since RNN memory layers are good predictors of the next likely event, they are an apt tool to evaluate upcoming time-based performance.
  • Since the objective was to devise a scalable method, a cloud-ready, easy-to-use UI was proposed, where the user can upload the image of an upcoming style and demand forecasts are generated instantly.

Overall Approach

The entire framework was developed in Python using deep learning platforms like TensorFlow, with an interactive user interface powered by Django. The deep learning systems were supported by NVIDIA GPUs hosted on Google Cloud.

The demand prediction framework consists of the following components to ensure end-to-end analytical implementation and consumption.

1. Product Similarity Engine

An image classification algorithm was developed by leveraging deep learning techniques like Convolutional Neural Networks. The process included:

Data Collation

  • Developed an Image bank consisting of multi-style shoes across all categories/sub-categories e.g. sports, fashion, formals etc.
  • Included multiple alignments of the shoe images.

Data Cleaning and Standardization

  • Removed duplicate images.
  • Standardized the image to a desired format and size.

Define High-Level Features

  • Few key features were defined like brands, sub-category, shoe design – color, heel etc.

Image Matching Outcomes

  • Implemented a CNN model with 5+ hidden layers.

The following image is an illustrative representation of the CNN architecture implemented

  • Input Image: holds the raw pixel values of the image, with width, height & RGB values as features.
  • Convolution: extracts features from the input data. The matrix formed by sliding a filter over the image and computing the dot product is called a "feature map".
  • Non-Linearity – RELU: applies an element-wise activation function, introducing the non-linear relationships used in a standard ANN.
  • Pooling: reduces the dimensionality of each feature map while retaining the important information, helping arrive at a scale-invariant representation of the image.
  • Dropout: random connections are dropped during training to prevent overfitting.
  • SoftMax Layer: the output layer that classifies the image into the appropriate category/sub-category/heel-height classes.

The top N matching shoes were identified and their probability scores calculated. The image orientation was classified as the top or side (right/left) alignment of the same image.

Similarity Index – calculated based on the normalized overall probability scores.
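
A rough sketch of how such a similarity index could be computed from the CNN's softmax probability vectors; the cosine-similarity scoring shown here is an assumption, not the exact production logic:

    import numpy as np

    def similarity_index(query_probs, catalog_probs, top_n=5):
        """Rank catalog shoes by similarity of their class-probability vectors to the query."""
        query = np.asarray(query_probs)
        scores = {}
        for shoe_id, probs in catalog_probs.items():
            p = np.asarray(probs)
            # Cosine similarity between the two probability vectors.
            scores[shoe_id] = float(np.dot(query, p) / (np.linalg.norm(query) * np.linalg.norm(p)))
        # Normalize the overall scores to a 0-1 similarity index and return the top N matches.
        max_score = max(scores.values())
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        return [(shoe_id, score / max_score) for shoe_id, score in ranked]

    # Hypothetical softmax outputs (category/sub-category probabilities) from the CNN.
    query = [0.70, 0.20, 0.10]
    catalog = {"style_A": [0.65, 0.25, 0.10], "style_B": [0.10, 0.30, 0.60]}
    print(similarity_index(query, catalog, top_n=2))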

Analogous Product: Attribute Similarity Snapshot (Sample Attributes Highlighted)

2. Forecasting Engine

A demand forecasting engine was developed on the available data by evaluating various factors like:

  • Promotions – Discounts, Markdown
  • Pricing changes
  • Seasonality – Holiday sales
  • Average customer rating
  • Product Attributes – This was sourced from the CNN exercise highlighted in the previous step
  • Product Lifecycle – High sales in initial weeks followed by declining trend

The following image is an illustrative representation of the demand forecasting model based on RNN architecture.

The RNN implementation was done using a Keras Sequential model, with mean squared error as the loss function.
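
A minimal sketch of such a Keras Sequential RNN (the layer sizes, 12-week look-back window, and six demand drivers are assumptions; the data below is synthetic):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    # Hypothetical training data: 500 samples, 12-week look-back window, 6 demand drivers
    # (promotions, price, seasonality, rating, product attributes, lifecycle stage).
    X = np.random.rand(500, 12, 6)
    y = np.random.rand(500, 1)

    model = Sequential([
        LSTM(64, input_shape=(12, 6)),
        Dense(32, activation="relu"),
        Dense(1),                       # predicted demand for the next period
    ])

    model.compile(optimizer="adam", loss="mean_squared_error")
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)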

Demand Forecast Outcome

The accuracy of the proposed deep learning framework was in the range of 85-90%, an improvement over the existing methodology's 60-65%.

Web UI for Analytical Consumption

An illustrative snapshot is highlighted below:

Benefits and Impact

  • Higher accuracy through better learning of the product lifecycle.
  • The overall process is self-learning and hence can be scaled quickly.
  • Automation of decision intensive processes like analogous product selection led to reduction in execution time.
  • Long-term cost benefits are higher.

Key Challenges & Opportunities

  • The image matching process requires huge data to train.
  • The feature selection method can be automated through unsupervised techniques like deep autoencoders, which will further improve scalability.
  • Managing image data is a cost intensive process but it can be rationalized over time.
  • The process accuracy can be improved by creating a deeper network architecture, with an additional one-time investment in GPU configuration.

Bayesian Theorem: Breaking it to Simple Using PyMC3 Modelling

Abstract

This article introduces some basic concepts of Bayesian inference, along with practical implementations in Python using PyMC3, a state-of-the-art open-source probabilistic programming framework for exploratory analysis of Bayesian models.

The main concepts of Bayesian statistics are covered using a practical and computational approach. The article covers the Bayesian and Frequentist approaches, the Naive Bayes algorithm and its assumptions, the challenge of computational intractability in high-dimensional data, and the approximation and sampling techniques used to overcome it. The results of a Bayesian Linear Regression are inferred and discussed for brevity of concepts.

Introduction

Frequentist vs. Bayesian approaches for inferential statistics are interesting viewpoints worth exploring. Given the task at hand, it is always better to understand the applicability, advantages, and limitations of the available approaches.

In this article, we will be focusing on explaining the idea of Bayesian modeling and its difference from the frequentist counterpart. To make the discussion a little bit more intriguing and informative, these concepts are explained with a Bayesian Linear Regression (BLR) model and a Frequentist Linear Regression (LR) model.

Bayesian and Frequentist Approaches

The Bayesian Approach:

The Bayesian approach is based on the idea that, given the data and a probabilistic model (which we assume can model the data well), we can find the posterior distribution of the model's parameters. For example:

In the Bayesian Linear Regression approach, not only the dependent variable y but also the parameters (β) are assumed to be drawn from a probability distribution, such as a Gaussian distribution with mean = βᵀX and variance = σ²I (refer to equation 1). The output of BLR is a distribution, which can be used for inferring new data points.

The Frequentist approach, on the other hand, is based on the idea that given the data, the model and the model parameters, we can use this model to infer new data. This is commonly known as the Linear Regression approach. In the LR approach, the dependent variable (y) is a linear combination of the weights times the independent variables (x), plus an error term e due to random noise.

Ordinary Least Squares (OLS) is the method of estimating the unknown parameters of the LR model. In the OLS method, the parameters which minimize the sum of squared errors on the training data are chosen. The outputs of OLS are "single point" estimates of the best model parameters: we can predict only one value of y, so it is basically a point estimation.

Let's get started with the Naive Bayes algorithm, which is the backbone of Bayesian machine learning algorithms.

Naive Bayes Algorithm for Classification       

Discussions of Bayesian machine learning models require a thorough understanding of probability concepts and Bayes' theorem, so we now discuss it. Bayes' theorem finds the probability of an event occurring given the probability of an event that has already occurred. Suppose we have a dataset with 7 features/attributes/independent variables (x1, x2, x3, …, x7); we call this data tuple X. In Bayesian terminology, X is known as the evidence. Assume H is the hypothesis that the tuple belongs to class C, and y is the dependent/response variable (i.e., the class in a classification problem). Mathematically, Bayes' theorem is stated as:

P(H|X) = [P(X|H) × P(H)] / P(X)

Where:  

  1. P(H|X) is the probability that the hypothesis H holds correct, given that we know the ‘evidence’ or attribute description of X. P(H|X) is the probability of H conditioned on X, a.k.a., Posterior Probability.                              
  2. P(X|H) is the probability of X conditioned on H, also known as the 'Likelihood'.
  3. P(H) is the prior probability of H. This is the fraction of occurrences of each class out of the total number of samples.
  4. P(X) is the prior probability of evidence (data tuple X), described by measurements made on a set of attributes (x1, x2, x3,…, x7).

As we can see, the posterior probability of H conditioned on X is directly proportional to likelihood times prior probability of class and is inversely proportional to the ‘Evidence’.

Bayesian approach for a regression problem: assumptions of the Naive Bayes model, given a sales prediction problem with 7 independent variables:

i) Each pair of features in the dataset are independent of each other. For e.g., feature x1 has no effect on x2, & x2 has no effect on feature x7.
ii) Each feature makes an equal contribution towards the dependent variable.

Finding the posterior distribution of the model parameters is computationally intractable for continuous variables, so we use Markov Chain Monte Carlo and Variational Inference methods to overcome this issue.

From the Naive Bayes theorem (equation 3), the posterior calculation needs a prior, a likelihood and the evidence. The prior and likelihood are calculated easily, as they are defined by the assumed model. Since P(X) does not depend on H and the values of the features are given, the denominator is constant; P(X) is just a normalization constant, and we need to maximize the numerator of equation 3. However, the evidence (the probability of the data) is calculated as:
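
The integral referenced here was shown as an image in the original; written in terms of the model parameters θ, the evidence (marginal likelihood) is:

    P(X) = ∫ P(X | θ) · P(θ) dθ

integrating over all possible parameter values.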

Calculating this integral is computationally intractable for high-dimensional data. In order to build faster and more scalable systems, we require sampling or approximation techniques to calculate the posterior distribution of the parameters given the observed data. In this section, two important methods for approximating intractable computations are discussed: a sampling-based approach, Markov chain Monte Carlo (MCMC) sampling, and an approximation-based approach known as Variational Inference (VI). A brief introduction of these techniques is given below:

  • MCMC– We use sampling techniques like MCMC to draw samples from the distribution, followed by approximating the distribution of the posterior. Refer to George’s blog [1], for more details on MCMC initialization, sampling and trace diagnostics.
  • VI – The Variational Inference method tries to find the best approximation of the distribution from a parametric family. It uses an optimization process over the parameters to find the best approximation. In PyMC3, we can use Automatic Differentiation Variational Inference (ADVI), which tries to minimize the Kullback–Leibler (KL) divergence between a given parametric family distribution and the distribution proposed by the VI method.

Prior Selection: Where is the prior in data, from where do I get one?

Bayesian modelling gives us a way to include prior information in the modelling process. If we have domain knowledge or an intelligent guess about the weight values of the independent variables, we can make use of this prior information. This is unlike the frequentist approach, which assumes that the weight values of the independent variables come from the data itself. According to Bayes theorem:
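
The relation referenced here as equation 5 was shown as an image in the original; in proportional form it is:

    P(θ | data) ∝ P(data | θ) × P(θ),   i.e.   posterior ∝ likelihood × prior   (equation 5)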

Now that the method for finding the posterior distribution of model parameters has been discussed, the next obvious question based on equation 5 is how to find a good prior. Refer to [2] for guidance on selecting a good prior for the problem statement. Broadly speaking, the information contained in the prior has a direct impact on the posterior calculations. If we have a more "revealing" prior (i.e., a strong belief about the parameters), we need more data to "alter" this belief, and the posterior is mostly driven by the prior. Similarly, if we have a "vague" prior (i.e., no information about the distribution of the parameters), the posterior is mostly driven by the data. It means that if we have a lot of data, the likelihood will wash away the prior assumptions [3]. In BLR, the prior knowledge modelled by a probability distribution is updated with every new sample (which is modelled by some other probability distribution).

Modelling Using PyMC3 Library for Bayesian Inferencing

The following snippet of code (borrowed from [4]) shows Bayesian linear model initialization using the PyMC3 Python package. A PyMC3 model is initialized using a "with pm.Model()" statement. The variables are assumed to follow a Gaussian distribution, and Generalized Linear Models (GLMs) are used for the modelling. For an in-depth understanding of the PyMC3 library, I recommend Davidson-Pilon's book [5] on Bayesian methods.
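
Since the original snippet image is not reproduced here, the following is a minimal sketch of the same idea with explicit priors (the synthetic data, weakly informative Gaussian priors, and two chains are assumptions):

    import numpy as np
    import pymc3 as pm

    # Synthetic data standing in for the sales dataset used in the article.
    np.random.seed(42)
    X = np.random.randn(100, 3)
    true_betas = np.array([0.5, -0.2, 0.8])
    y = 1.0 + X.dot(true_betas) + np.random.randn(100) * 0.3

    with pm.Model() as model:
        # Weakly informative Gaussian priors on the intercept and coefficients.
        intercept = pm.Normal("intercept", mu=0, sigma=10)
        betas = pm.Normal("betas", mu=0, sigma=10, shape=3)
        sigma = pm.HalfNormal("sigma", sigma=1)

        # Expected value of the dependent variable.
        mu = intercept + pm.math.dot(X, betas)

        # Likelihood of the observed data.
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

        # Draw posterior samples with two Markov chains (as in the traceplot).
        trace = pm.sample(1000, tune=1000, chains=2)

    pm.traceplot(trace)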

Fig. 1 Traceplot shows the posterior distribution for the model parameters as shown on the left hand side. The progression of the samples drawn in the trace for variables are shown on the right hand side.

We can use a "traceplot" to show the posterior distributions of the model parameters, as shown on the left-hand side of Fig. 1. The samples drawn in the trace for the independent variables and the intercept over 1,000 iterations are shown on the right-hand side of Fig. 1. The two colours, orange and blue, represent the two Markov chains.

After convergence, we get the coefficient of each feature, which measures its effectiveness in explaining the dependent variable. The values represented in red are the maximum a posteriori (MAP) estimates of the variables, taken from their posterior distributions. The sales can be predicted using the formula:

As it is a Bayesian approach, the model parameters are distributions. The following plots show the posterior distributions in the form of histograms. Here the variables show the 94% HPD (Highest Posterior Density) interval. The HPD in Bayesian statistics is the credible interval, which tells us we are 94% sure that the parameter of interest falls in the given interval (for variable x6, the value range is -0.023 to 0.36).

We can see that the posteriors are spread out, which is indicative of the small number of data points used for modelling; the range of values each independent variable can take is not confined to a small interval (the uncertainty in the parameter values is very high). For example, for variable x6 the value ranges from -0.023 to 0.36, with a mean of 0.17. As we add more data, the Bayesian model can shrink this range to a smaller interval, resulting in more accurate values for the weight parameters.

Fig. 2 Plots showing the posterior distribution in the form of histogram.

When to Use LR, BLR, MAP, etc.: Do We Go Bayesian or Frequentist?

The equation for linear regression on the same dataset is obtained as:

If we compare the Linear Regression equation (eq. 7) and the Bayesian Linear Regression equation (eq. 6), there is a slight change in the weight values. So, which approach should we take: Bayesian or Frequentist, given that both yield approximately the same results?

When we have a prior belief about the distributions of the weight variables (without seeing the data) and want this information to be included in the modelling process, followed by automatic belief adaptation as we gather more data, Bayesian is the preferable approach. If we don't want to include any prior belief or model adaptation, and just want the weight variables as point estimates, go for Linear Regression. Why are the results of both models approximately the same?

The maximum a posteriori (MAP) estimate for each variable is the peak value of that variable's posterior distribution (shown in Fig. 2), which is close to the point estimate for the variable in the LR model. This is the theoretical explanation; for real-world problems, try using both approaches, as the performance can vary widely based on the number of data points and the data characteristics.

Conclusion

This blog is an attempt to discuss the concepts of Bayesian inference and its implementation using PyMC3. It started off with the decades-old Frequentist vs. Bayesian perspective and moved on to the backbone of Bayesian modelling, Bayes' theorem. After setting the foundations, the intractability of evaluating posterior distributions of continuous variables was discussed, along with the solutions via sampling and approximation methods, viz. MCMC and VI. The strong connection between the posterior, the prior, and the likelihood, given the data at hand, was also discussed. Next, Bayesian linear regression modelling using PyMC3 was covered, along with interpretations of the results and graphs. Lastly, we discussed why and when to use Bayesian linear regression.

Resources:

The following are the resources to get started with Bayesian inference using PyMC3.

[1] https://eigenfoo.xyz/bayesian-modelling-cookbook/

[2] https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations

[3] https://stats.stackexchange.com/questions/58564/help-me-understand-bayesian-prior-and-posterior-distributions

[4] https://towardsdatascience.com/bayesian-linear-regression-in-python-using-machine-learning-to-predict-student-grades-part-2-b72059a8ac7e

[5] Davidson-Pilon, Cameron. Bayesian methods for hackers: probabilistic programming and Bayesian inference. Addison-Wesley Professional, 2015

Explainable AI

Advances in AI have enabled us to solve many problems with technology working alongside people. The complexity of these AI models is growing, and so is the need to understand them. A growing concern is regulating bias in AI, which can arise for several reasons; one of them is partial or unrepresentative input data that biases the trained model. It therefore becomes increasingly essential to comprehend how an algorithm arrived at a result. Explainable AI is a set of tools and frameworks that help you understand and interpret the predictions made by your machine learning models.

“Explainable artificial intelligence (XAI) is a set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms. Explainable AI is used to describe an AI model, its expected impact, and potential biases.” -IBM

Feature Attributions

Feature attributions indicate how much each feature in your model contributed to the prediction for each test instance. When you run explainable AI techniques, you get both the predictions and the feature attribution information.

Understanding the Feature Attribution Methods

LIME

LIME stands for "Local Interpretable Model-agnostic Explanations." It explains any model by approximating it locally with an interpretable model. It first creates a sample dataset locally by perturbing the feature values of the original test instance. Then, a linear model is fitted to the perturbed dataset to understand the contribution of each feature. After fitting, the linear model's weight for each feature is that feature's LIME value. LIME provides several explainers, chosen according to the model architecture and the type of input data.

LIME provides a tabular explainer for predictions based on tabular data. We implemented LIME for classification on the Iris dataset and for regression on the Boston housing dataset. In the image below, we can see that petal length contributes positively to the Setosa class, whereas sepal length contributes negatively. Similarly, LSTAT and RM are the features contributing most to the prediction of Boston house prices.

Figure 1: (a) LIME explanation for logistic regression model trained on iris dataset and
(b) LIME explanation for linear regression trained on Boston housing dataset.
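As a hedged sketch of how a tabular explanation like Figure 1(a) can be produced, the snippet below trains a stand-in logistic regression on Iris and explains one prediction; the exact training details are assumptions, not our original setup.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch: a plain logistic regression stands in for the model we
# explained; all training details here are assumptions.
iris = load_iris()
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

explainer = LimeTabularExplainer(
    iris.data,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    mode="classification",
)
exp = explainer.explain_instance(iris.data[0], clf.predict_proba, num_features=4)
print(exp.as_list())   # signed per-feature contributions for this prediction
```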

For text data, the sample dataset is created by randomly removing words from the original text instance. The relevance of the terms contributing to the prediction is then determined by fitting a linear model on this sample dataset.

In the following text classification example, LIME highlights the words contributing both positively and negatively towards classifying the text into the business class. The term "WorldCom" contributes the most positively.

Figure 2: LIME explanation for BERT model trained on BBC news dataset.
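A text explanation like Figure 2 can be generated along these lines; `news_text` and `predict_proba` (a wrapper returning class probabilities from the BERT classifier) are hypothetical stand-ins.

```python
from lime.lime_text import LimeTextExplainer

# Hypothetical sketch: `predict_proba` is assumed to wrap the BERT classifier
# and return class probabilities for a list of raw texts; `news_text` is one
# article from the BBC news dataset.
class_names = ["business", "entertainment", "politics", "sport", "tech"]
explainer = LimeTextExplainer(class_names=class_names)

exp = explainer.explain_instance(news_text, predict_proba, num_features=10, top_labels=1)
top_label = exp.available_labels()[0]
print(exp.as_list(label=top_label))   # words weighted for/against the top class
```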

For image classification tasks, LIME finds the regions of an image (sets of super-pixels) with the strongest association with a prediction label. It generates perturbations by turning some of the super-pixels in the image on or off. The model being explained then predicts the target class on each perturbed image. Next, a linear model is trained on the dataset of perturbed samples and their responses, which provides the weight of each super-pixel.

In the example below, LIME shows the areas that have a strong association with the prediction of "Labrador".

Figure 3: LIME explanation for image classification model on cats and dog dataset.
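A hedged sketch of the image explainer is shown below; `image` and `classifier_fn` (a function mapping a batch of images to class probabilities) are assumptions standing in for our actual model and data.

```python
import matplotlib.pyplot as plt
from lime import lime_image
from skimage.segmentation import mark_boundaries

# Hypothetical sketch: `image` is the photo as a numpy array and `classifier_fn`
# takes a batch of images and returns class probabilities for the CNN being explained.
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(image, classifier_fn, top_labels=1, num_samples=1000)

# Keep the super-pixels contributing most to the top predicted class ("Labrador").
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False
)
plt.imshow(mark_boundaries(img / 255.0, mask))
plt.show()
```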

GradCAM

GradCAM stands for Gradient-weighted Class Activation Mapping. It uses the gradients of any target concept (say, "dog" in a classification network) flowing into the final convolutional layer. It computes the gradients of the predicted class's score with respect to the convolutional layer's feature maps, pools them to obtain per-channel weights, and forms the weighted combination of the feature maps. This combination is passed through a ReLU activation to keep only the positive contributions, producing a coarse localization map that highlights the regions of the image most important for predicting the concept.
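To make this mechanism concrete, here is a rough Grad-CAM sketch in Keras/TensorFlow; `model`, the layer name `last_conv_layer`, and the preprocessed `img` batch are assumptions rather than our exact training code.

```python
import tensorflow as tf

# Rough Grad-CAM sketch; `model`, `last_conv_layer`, and the preprocessed
# `img` batch are assumptions, not our exact setup.
grad_model = tf.keras.Model(
    model.inputs, [model.get_layer(last_conv_layer).output, model.output]
)

with tf.GradientTape() as tape:
    conv_maps, preds = grad_model(img)
    top_class = tf.argmax(preds[0])
    class_score = tf.gather(preds, top_class, axis=1)   # score of the predicted class

grads = tape.gradient(class_score, conv_maps)            # d(score) / d(feature maps)
weights = tf.reduce_mean(grads, axis=(0, 1, 2))          # pool gradients per channel
cam = tf.nn.relu(tf.reduce_sum(conv_maps[0] * weights, axis=-1))  # weighted sum + ReLU
cam = cam / (tf.reduce_max(cam) + 1e-8)                  # normalise before overlaying as a heat map
```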

We implemented GradCAM to understand a CNN model trained on a human activity recognition dataset. In the following image, the areas contributing most positively to the predicted class are shown in red, whereas the areas in blue have no positive contribution towards the predicted class.

Figure 4: GradCAM result for VGG16 model trained on human activity recognition dataset.

Deep Taylor

Deep Taylor decomposition is a method for decomposing the network's classification output into contributions (relevance) of its input elements, using Taylor's theorem, which gives a local approximation of a differentiable function. The output neuron is first decomposed onto the neurons of the preceding layer; the relevance of those neurons is then redistributed to their own inputs, and the redistribution is repeated layer by layer until the input variables are reached. Thus, we obtain the relevance of each input variable to the classification output.

As we can see in the image below, the pixels of the bicycle have the maximum contribution in the predicted biking class.

Figure 5: Deep Taylor result for VGG16 model trained on human activity recognition dataset.

SHAP

Shapley values come from cooperative game theory, where they assign each player credit for a particular outcome of the game. When applied to machine learning models, each model feature is treated as a "player" in the game, and each feature is allocated a proportionate share of the credit for the prediction. The SHAP library provides various methods, chosen according to the model architecture, to calculate Shapley values for different models.

The SHAP gradient explainer is a method for deriving SHAP values on image data. It calculates the gradient of the output score with respect to the input, i.e., the pixel intensities. This feature attribution method is designed for differentiable models such as convolutional neural networks.

In the following image, the pixels in red are most positively contributing to the biking class. In contrast, the pixels in blue are confusing the model with a different class and hence negatively contributing.

Figure 6: SHAP result for VGG16 model trained on human activity recognition dataset.
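A minimal sketch of the gradient explainer is shown below; `model`, `background`, and `images` are assumed to be the differentiable CNN, a small reference batch, and the preprocessed images to explain.

```python
import shap

# Minimal sketch of the gradient explainer; `model` is the differentiable CNN
# (e.g., the VGG16 classifier), `background` is a small reference batch, and
# `images` are the preprocessed images to explain. All three are assumptions.
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(images)
shap.image_plot(shap_values, images)   # red = positive, blue = negative contribution
```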

The SHAP kernel explainer is the only fully model-agnostic method for calculating Shapley values. It extends and adapts linear LIME: the kernel explainer builds a specially weighted linear regression using your data and the model's predictions, and the coefficients of that regression are the Shapley values. This matters for object detection: since object detection pipelines are effectively non-differentiable, gradient-based explanation methods cannot be used, but the SHAP kernel explainer can.

This method explains one detection in one image. Here we are explaining the person in the middle. We see that the dark red patches with the highest contribution are located within the bounding box of our target. Interestingly, the highest contribution seems to come from the head and shoulders.

Figure 7: SHAP results for YOLOv3 model trained on the COCO dataset.
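A hedged sketch of the kernel explainer is given below; `predict_fn`, `background`, and `instance` are hypothetical stand-ins for the wrapper around the detector's score and the perturbation data.

```python
import shap

# Hypothetical sketch of the model-agnostic kernel explainer. `predict_fn` is
# assumed to map a 2-D array of perturbed feature rows (e.g., super-pixel on/off
# masks) to the detection score being explained; `background` is a small
# reference sample and `instance` the row describing the detection of interest.
explainer = shap.KernelExplainer(predict_fn, background)
shap_values = explainer.shap_values(instance, nsamples=500)
```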

SHAP also has the functionality to derive explanations for NLP models. We demonstrate the use of SHAP for text classification and text summarization, which explains the contribution of words or a combination of words towards a prediction. 

As we can see below, in the text classification example, the word "company" is the highest contributor to classifying the text into the business class. In the text summarization example, the phrase "but it is almost like we are neglected" is the highest contributor to the generated term "neglected".

Figure 8: (a) SHAP result for text classification using BERT and (b) SHAP result for text summarization using Schleifer/distilbart-cnn-12-6 model.

SODEx

The Surrogate Object Detection Explainer (SODEx) explains an object detection model using LIME, which explains a single prediction with a linear surrogate model. It first segments the image into super pixels and generates perturbed samples. Then, the black box model predicts the result of every perturbed observation. Using the dataset of perturbed samples and their responses, it trains a surrogate linear model that provides super pixel weights.

It gives explanations for all the detected objects in an image. The green-colored patches show positive contributions, and the red-colored patches show negative contributions. Here, the model focuses on hands and legs to detect a person.

Figure 9: SODEx result for YOLOv3 model trained on the COCO dataset.

SegGradCAM

SEG-GRAD-CAM is an extension of Grad-CAM to semantic segmentation. It can generate heat maps explaining the relevance of individual pixels or regions of the input image to the decision. GradCAM uses the gradient of the logit of the predicted class with respect to chosen feature layers to determine their general relevance. A CNN for semantic segmentation, however, produces logits for every pixel and class. This allows us to adapt GradCAM to a semantic segmentation network flexibly, since we can take the gradient of the logit of just a single pixel, of the pixels of an object instance, or of all the pixels in the image.

Like in GradCAM, the red pixels are the most positively contributing, whereas the pixels in blue have zero positive contributions. Here, the pixels of the car’s windscreen have the maximum contribution.

Figure 10: SegGradCAM result for U-net model trained on camvid dataset.

Conclusion

Explainable AI builds confidence in a model's behaviour by checking that the model does not focus on idiosyncratic details of the training data that will not generalize to unseen data, and it helps assess the fairness of ML models. We have implemented explainable AI techniques for models trained on tabular, text, and image data, enhancing these models' transparency and interpretability. You can use this information to verify that a model is behaving as expected, recognize bias in your models, and get ideas for improving your model and training data.

References

  1. Selvaraju, R. R., et al. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." International Journal of Computer Vision, 2019.
  2. http://www.heatmapping.org/slides/2018_ICIP_2.pdf
  3. https://towardsdatascience.com/shap-shapley-additive-explanations-5a2a271ed9c3
  4. Sejr, J. H., Schneider-Kamp, P., & Ayoub, N. "Surrogate Object Detection Explainer (SODEx) with YOLOv4 and LIME."
  5. https://www.steadforce.com/blog/explainable-object-detection
  6. Vinogradova, K., Dibrov, A., & Myers, G. (2020). "Towards Interpretable Semantic Segmentation via Gradient-Weighted Class Activation Mapping."

Top 5 Challenges in Implementing Industry 4.0!

Today, the entire world is grappling with the COVID-19 pandemic, which has intensified supply chain concerns and prompted many businesses to rethink their sourcing strategies. Several businesses are focusing on localization for two reasons: one, to be closer to the source, and the other, to minimize the risk of disruption.

Well, necessity is the mother of invention, and this is undeniably true for technological innovations, particularly Industry 4.0 solutions. Evolution is born of real hardship. In manufacturing, the Industry 4.0 movement is driven by volatile market demand for better and quicker production techniques, shrinking margins, and intense competition among enterprises that cannot be met without smart technology.

Smart manufacturing will be based on digitization and Industry 4.0, and large enterprises are leaning towards digital innovation. However, SMEs and MSMEs are still struggling with several challenges in adopting digital transformation and Industry 4.0 initiatives. These obstacles may dissuade some manufacturing companies from adopting these technologies, causing them to fall behind their peers.

The Top Five Challenges!

Although smart manufacturing is often associated with Industry 4.0 and digital transformation, SMEs and MSMEs still experience difficulties achieving Industry 4.0 goals. Here are the five main challenges:

1. Organization Culture:

Evolving from ad-hoc decisions to data-based decision-making is one of the biggest challenges for any organization. Part of this is driven by data availability and by the CXOs' awareness of and willingness to adopt new digital technologies. Navigating the balance between culture and technology is one of the toughest challenges of digital transformation.

2. Data Readiness/Digitization:

Any digital revolution depends on the availability of data. Unfortunately, this is one of the most significant gaps, and hence opportunities, for SMEs: most SME manufacturing plants lack basic data capture and storage infrastructure.

Most plants use different PLC protocols (e.g., Siemens, Rockwell, Hitachi, Mitsubishi), and the data is encrypted and locked. Accessing it either requires the control system providers to unlock the encryption or calls for separate sensor or gateway installations. This is a huge added cost, and SMEs, which have been running their businesses frugally, have not yet seen the benefits.

3. Data Standardization and Normalization:

This is a crucial step in the Digital Transformation journey, enabling the data to be used for real-time visibility, benchmarking, and machine learning.

Most SMEs grow organically, with the intent to grow as profitably as possible, so IT and OT technology investments are typically kept to a bare minimum. As a result, most SMEs lack SCADA/MES systems that integrate the data in a meaningful way and store it centrally. Without this middleware, most of the data must be sourced directly from individual sensors or PLCs and sent via gateways.

All this data cannot be directly consumed for visualization and needs an expensive middleware solution (viz., LIMS from vendors such as Abbott or Thermo Fisher, or platforms such as GE Proficy); this is again an added cost.

Additionally, the operational data is not all stored in a centralized database. Instead, it is available in real-time from Programmable Logic Controllers (PLCs), machine controllers, Supervisory Control and Data Acquisition (SCADA) systems, and time-series databases throughout the factory. This increases the complexity of data acquisition and storage.

4. Lack of Talent for Digital:

Believe it or not, we have been reeling under a massive talent crunch for digital technologies. As of 2022, a huge war for digital talent is raging across services, consulting, and product-based companies.

As a result, we don’t have enough people who have seen the actual physical shop floor, understand day-to-day challenges, and have enough digital and technical skills to enable digital transformation. A systematic approach is needed to help up-skill existing resources and develop new digital talent across all levels.

5. CXO Sponsorship:

This is a key foundation for any digital transformation and Industry 4.0 initiative. Without CXO buy-in and sponsorship, any digital transformation initiative is bound to fail. For CXOs to start believing in the cause, they need to be onboarded, starting with awareness of what's possible and emphasizing benefits and ROI as reasons to believe.

Once there’s a top-down willingness and drive, things will become much easier regarding funding, hiring of technical talent or consulting companies, and execution.

Final Takeaway

It should go without saying that the list above does not cover every challenge manufacturers encounter when they embark on the Industry 4.0 journey. Additionally, more industries and professionals should actively engage in skill improvement initiatives, both for immediate implementation and to prepare employees for the future. Because of the ever-changing nature and rapid pace of IIoT technologies, this list of challenges will continue to evolve over time.

We would love to hear back from you on your experiences implementing Industry 4.0 and digital transformation projects and the challenges you faced.

Please feel free to comment and share your experiences.

Affine is conducting an event, Demystifying Industry 4.0, on Industry 4.0 and digital transformation for CXOs on March 25th, 2022. The event aims to present major Industry 4.0 use cases for automotive suppliers and ecosystems.

Stay tuned for more information!


The transformer revolution in video recognition. Are you ready for it?

Imagine how many lives could be saved if caretakers or medical professionals could be alerted when an unmonitored patient showed the first signs of sickness. Imagine how much more secure our public spaces could be if police or security personnel could be alerted upon suspicious behavior. Imagine how many tournaments could be won if activity recognition could inform teams and coaches of flaws in athletes’ form and functioning.

With Human Activity Recognition (HAR), all these scenarios can be effectively tackled. HAR has been one of the most complex challenges in the domain of computer vision. It has a wide variety of potential applications, such as sports, post-injury rehabilitation, analytics, security surveillance, traffic monitoring, etc. The complexity of HAR arises from the fact that an action spans both spatial and temporal dimensions. In typical computer vision tasks, where a model is trained to classify or detect objects in a picture, only the spatial dimension is involved. In HAR, however, learning from multiple frames together over time helps classify the action better. Hence, the model must be able to track both the spatial and temporal components.

Architectures used for video activity recognition include 2D convolutions, 3D CNN volume filters that capture spatio-temporal information (Tran et al., 2015), 3D convolutions factorized into separate spatial and temporal convolutions (Tran et al., 2018), LSTMs for spatio-temporal information (Karpathy et al., 2014), as well as combinations and enhancements of these techniques.

TimeSformer – A revolution by Facebook!

Transformers have been making waves in Natural Language Processing for the last few years. They employ self-attention in an encoder-decoder architecture to make accurate predictions by extracting contextual information. In the domain of computer vision, the first implementation of transformers came through ViT (the Vision Transformer) developed by Google. In ViT, a picture is divided into patches of size 16×16 (see Figure 1), which are then flattened into 1D vectors, embedded, and passed through an encoder. Self-attention for each patch is calculated with respect to all the other patches.


Figure 1. The Vision Transformer treats an input image as a sequence of patches, akin to a series of word embeddings generated by an NLP Transformer. (Source: Google AI Blog: Transformers for Image Recognition at Scale (googleblog.com))

Recently, Facebook developed TimeSformer, the first instance of transformers being used for HAR. As in other HAR methods, the input to TimeSformer is a block of consecutive frames from the video clip, for example, 16 consecutive frames of size 3x224x224. To calculate the self-attention for a patch in a frame, two sets of other patches are used:

 a) other patches of the same frame (spatial attention).

 b) patches of the adjacent frames (temporal attention).

There are several different ways to use these patches. We have utilized only "divided space-time attention" (Figure 2) for this purpose. It uses all the patches of the current frame and the patches at the same position in the adjacent frames. In divided attention, temporal attention and spatial attention are applied separately within each block, which leads to the best video classification accuracy (Bertasius et al., 2021).

Figure 2. Divided space-time attention in a block of frames (Link: TimeSformer: A new architecture for video understanding (facebook.com))
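For orientation, a usage sketch along the lines of the public facebookresearch/TimeSformer repository is shown below; the checkpoint path is a placeholder and the class count corresponds to Kinetics-400 rather than our 10-class setup.

```python
import torch
from timesformer.models.vit import TimeSformer

# Sketch following the public TimeSformer repository's README; the checkpoint
# path is a placeholder and num_classes here is for Kinetics-400.
model = TimeSformer(
    img_size=224,
    num_classes=400,
    num_frames=8,
    attention_type="divided_space_time",
    pretrained_model="/path/to/pretrained/model.pyth",
)

dummy_video = torch.randn(2, 3, 8, 224, 224)  # (batch, channels, frames, height, width)
preds = model(dummy_video)                    # (2, 400) class scores
```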

It must be noted that TimeSformer does not use any convolutions, which brings down the computational cost significantly. A convolution is a linear operator in which the kernel uses neighbouring pixels in its computation. Vision transformers (Dosovitskiy et al., 2020), on the other hand, are permutation invariant and require sequences of data, so the spatially non-sequential input is converted into a sequence of patches. Learnable positional embeddings are added per patch (analogous to those used in NLP tasks) to allow the model to learn the structure of the image.

TimeSformer is roughly three times faster to train than 3DCNNs, requires less than one-tenth the amount of compute for inference, and has 121,266,442 trainable parameters compared to only 40,416,074 trainable parameters in the 2DCNN model and 78,042,250 parameters in 3DCNN models.

In addition, TimeSformer has the advantage of customizability and scalability over convolutional models: the user can choose the frame size and the number of frames used as input to the model. The original study used frames as large as 556×556 and clips as deep as 96 frames without the compute cost growing prohibitively.

But challenges abound…

The following are some of the challenges in tracking motion:

  1. The challenge of angles: A video can be shot from multiple angles; the pattern of motion could appear different from different angles.
  2. The challenge of camera movement: Depending on whether the camera moves with the motion of the object or not, the object could appear to be static or moving. Shaky cameras add further complexity.
  3. The challenge of occlusion: During motion, the object could be hidden by another object temporarily in the foreground.
  4. The challenge of delineation: Sometimes it is not easy to differentiate where one action ends and the other begins.
  5. The challenge of multiple actions: Different objects in a video could be performing different actions, adding complexity to recognition.
  6. The challenge of change in relative size: Depending on whether an object is moving towards or away from the camera, its relative size could change continuously adding further complexity to recognition.

Dataset

In order to determine video recognition capability, we have used a modified UCF11 dataset.

We have removed the Swing class as it contains many mislabeled videos. The goal is to recognize 10 activities – basketball, biking, diving, golf swing, horse riding, soccer juggling, tennis swing, trampoline jumping, volleyball spiking, and walking. Each class has 120-200 videos of different lengths, ranging from 1-21 seconds. The dataset is available at CRCV | Center for Research in Computer Vision at the University of Central Florida (ucf.edu).

How we trained the model

We carried out different experiments to get the best-performing model. Our input block contained 8 frames of size 3x224x224. The base learning rate was 0.005, reduced by a factor of 0.1 at the 11th and 14th steps. For augmentation, colour jitter (within 40), random horizontal flips, and random crops (from 256×320 to 224×224) were allowed. We trained the model on an Amazon AWS Tesla M60 GPU with a batch size of 2 (due to memory limitations) for 15 epochs.

Metrics are everything

In the original code, TimeSformer samples one input block per clip for training and validation. For test videos, it takes three different crops and averages the predictions (we term this samplewise accuracy). Several of the models we trained could achieve over 95% validation accuracy this way. However, in our opinion, this is not satisfactory because it does not examine all the different spatio-temporal possibilities in the video. To address that, we take two other metrics into consideration (a small sketch of how we compute them follows the list below).

  • Blockwise accuracy – a video clip is treated as a sequence of non-overlapping building blocks. The model makes a prediction for every input block, and the accuracy of these per-block predictions is measured. This is more suitable for real-time scenarios.
  • Clipwise accuracy – the predictions for all the blocks of a video are collected, their mode is assigned as the prediction for the clip, and that accuracy is measured. This helps to understand real-time accuracy over a larger timeframe.
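The sketch below illustrates how these two metrics can be computed for a single clip; `predict_block`, `blocks`, and `true_label` are hypothetical names, not our actual code.

```python
import numpy as np

# Illustrative sketch for one clip, assuming `predict_block(block)` returns the
# predicted class id for one 8-frame input block and `blocks` is the list of
# non-overlapping blocks cut from the clip with ground-truth `true_label`.
block_preds = np.array([predict_block(b) for b in blocks])

block_hits = (block_preds == true_label)            # per-block correctness
clip_pred = np.bincount(block_preds).argmax()       # mode of the block predictions

# Averaging `block_hits` over all clips gives blockwise accuracy; the fraction
# of clips with clip_pred == true_label gives clipwise accuracy.
```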

The final outcome

Our best model had the following performance metric values:

  • Samplewise accuracy – 97.3%
  • Blockwise accuracy – 86.8%
  • Clipwise accuracy – 92.2%

The confusion matrix for clipwise accuracy is given in Figure 3(a). For comparison, the confusion matrices for the 2DCNN and 3DCNN models are shown in Figures 3(b) and 3(c), respectively.

These metrics are quite impressive and far better than the results we obtained using 2D convolution (VGG16) and 3D convolution (C3D), which achieved 81.3% and 74.6% respectively. This points to the potential of TimeSformers.

Figure 3(a). Clipwise accuracy confusion matrix for TimeSformer

Concluding Remarks

In this work, we have explored the effectiveness of the TimeSformer for the Human Activity Recognition task on the modified UCF11 dataset. This non-convolutional model has outperformed the 2DCNN and 3DCNN models and performed extremely well on some hard-to-classify classes such as 'basketball' and 'walking'. Future work includes trying more augmentation techniques to fine-tune this model and using vision-based transformers in other video-related tasks such as video captioning and action localization.

When done right, TimeSformer can truly change the game for Human Activity Recognition. Its use cases across healthcare and sports, safety and security can truly come to life with TimeSformer.

References

[1] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 4489-4497).

[2] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 6450-6459).

[3] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 1725-1732).

[4] Bertasius, G., Wang, H., & Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding?. arXiv preprint arXiv:2102.05095.

[5] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

Operational Transformation: Top 5 Key Success Factors in the Auto Industry

In a world driven by transformational change, one industry that has remained relatively stable has been automobiles.

While incremental changes regularly occurred, the way vehicles were made and operated remained fundamentally rooted around the Internal Combustion Engine. Computerized fuel injection systems replaced carburetors, suspension systems learned to automatically adjust to terrain changes and telematics enabled proactive failure avoidance. But, at the heart of it, the core technology chugged on regardless.

This, however, is set to change, and change drastically. Emission regulations, societal pressure due to climate change, and rising fuel costs are rapidly forcing the industry to move towards electric vehicles. The demand for autonomous vehicles is growing, given the shortage of skilled drivers to keep supply chains running. But, the greatest change, and challenge, that the automobile industry faces is that it will need a workforce with drastically different skills from what it has at present.

Continuous Change Requires Continuous Learning


The rise of companies like Apple and Alphabet in the automotive industry underlines an important trend: the lines between the automotive and technology industries are blurring.

With AI and robotics playing an increasingly large role in automobile design, manufacturing, and service, the skill sets that entry-level employees need have not only changed but will keep changing rapidly. Building the vehicles of the future also requires brand-new and specialized skills, such as cloud systems, UX design, driver assistance systems, and autonomous systems.

There is also the reality that, with fewer humans on the shop floor in Industry 4.0, many skills that were once spread across different people – mechanical engineering, electrical engineering and IT programming – are now required of one person. Therefore, learning something that will stay relevant for 25 years is already a thing of the past: a skill that an employee learns today may become obsolete even before the employee has thoroughly learned it!

Replacing an existing workforce is not the answer. Upskilling is.

Changing with the times does not require changing human resources. On the contrary, such a move can be counterproductive. Hiring and firing is costly, socially traumatic and, often, legally impossible, besides which, existing workers have core automotive production skills that those with newer skills like software engineering lack. 

“Most auto companies, in fact, have people with the necessary skills in other departments. Identifying them and cross-skilling them for new roles can often help companies overcome the talent gap.”

What is required, therefore, is a workforce that is mentally geared to an environment of continuous upskilling. The industry needs to examine closely how it can arrive at an optimal mix of experienced and new-age workers, and invest in training, reskilling, and upskilling to make the most of this mix. It also needs to anticipate change and stay ahead of the curve by upgrading its workforce to operate incoming tools and technologies.

Cross-skilling in IT and OT is the answer

The latest technologies and machinery in Information Technology (IT) and Operational Technology (OT) have converged to a significant extent. Connected machines, connected factories, smart metering, and many other ingredients of Industry 4.0 are rooted in the convergence of IT and OT to make way for IoT. This means that the automotive sector will have to find professionals who bring expertise and experience in both IT and OT. They can do this by cross-skilling.

Cross-skilling is when organizations train employees in more than one job function and skill set. Increasing the number of employees who are experts in both IT and OT can raise the bar for operational excellence through technology in the automotive sector, and that is the need of our times.

Collaboration with the educational sector is vital


The auto industry cannot in the long run manage this change in isolation. It has to collaborate with educational institutions and regulatory bodies to ensure a steady pipeline of talent that is geared towards quick learning and quick relearning and is application-oriented. This process should begin at a young age and, importantly, should treat industry-oriented courses on par with academic ones.

The AI industry is a vital part of this human transformation


As machines replace humans, the AI industry has become a core part of most aspects of the auto industry. In fact, companies like Affine can offer AI-enabled training that identifies the technologies required in the future, determines which employees need skilling in them, and designs training programs that optimize skill levels and course times.

It is also crucial that companies and educational institutions give more emphasis to basic training in AI and robotics. This need not result in an in-depth knowledge of AI, but an understanding of how using AI can help employees perform and learn better. 

Manas Agrawal

CEO & Co-Founder
