Python Newbie – Doctest

Any Software Development process consists of five stages:

  1. Requirement Analysis
  2. Design
  3. Development
  4. Testing
  5. Maintenance

Though each stage mentioned above is important in the SDLC, this post will focus mainly on the importance of testing and show how we can use doctest, a module in Python, to perform testing.

Importance of testing

We all make mistakes and if left unchecked, some of these mistakes can lead to failures or bugs that can be very expensive to recover from. Testing our code helps to catch these mistakes or avoid getting them into production in the first place.

Testing therefore is very important in software development.

Used effectively, tests help to identify bugs, ensure the quality of the product, and to verify that the software does what it is meant to do.

Python module – doctest

Doctest helps you test your code by running examples embedded in the documentation and verifying that they produce the expected results. It works by parsing the help text to find examples, running them, then comparing the output text against the expected value.

To make things easier, let us start by understanding this idea with a simple example.

Python inline function

So, in the above snippet, I have written a basic inline function that adds a number to itself.

For this function, I manually run a couple of test cases to do some basic verification (a sanity check of my function).
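The original screenshot is not reproduced here, so below is a minimal sketch of what that snippet and the manual checks might look like in an interactive session (the function name and values are illustrative):

>>> def double(a):
...     return a + a
...
>>> double(2)
4
>>> double(-3)
-6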

Now, consider a scenario in which Python can read the above output and perform the sanity check for us at run time. This is exactly the idea behind a doctest.

Now, let’s see how we can implement one.

Let’s take a very simple example: calculating what day of the week it will be a given number of days ahead of the current weekday. We write a docstring for our function, which helps us understand what the function does, what inputs it takes, and so on. In this docstring, I have added a couple of test cases which will be read by the doctest module at run time while testing is carried out.

Implementation of doctest
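Since the original code image is not shown here, the following is a sketch of what such a script might look like; the function name, the weekday encoding, and the test values are assumptions made for illustration:

import doctest

def day_of_week_ahead(current_day, days_ahead):
    """
    Return the weekday it will be `days_ahead` days after `current_day`.

    `current_day` is an integer index from 0 (Monday) to 6 (Sunday).

    >>> day_of_week_ahead(0, 3)
    'Thursday'
    >>> day_of_week_ahead(5, 4)
    'Wednesday'
    """
    days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']
    return days[(current_day + days_ahead) % 7]

if __name__ == "__main__":
    # Read the examples embedded in the docstring and run them as tests.
    doctest.testmod(verbose=True)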

When we run the above script from the command line, we will get the following output:

Doctest Output

We can see from the above snippet that all the test cases mentioned in the docstring were successful, as the resulting output matched the expected output.

But what happens if any test fails, or the script does not behave as expected?

To test this, we add a failing test case; we know our script can only take integers as input.

What will happen if we give a string as the input? Let us find out.

Test case with strings as input

I have used the same script but made a small change in the test case, where I have passed a string as the input.
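For illustration, the extra example added to the docstring of the sketch above might look like this:

>>> day_of_week_ahead('Monday', 3)
'Thursday'

When the script is run again, doctest reports this example as a failure: instead of the expected 'Thursday', the call raises a TypeError because a string and an integer cannot be added.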

So, what is the outcome for the above test case?

Failed test case output

Voila! We can see that doctest has clearly identified the failed test case and listed the reason for its failure.

Conclusion

That is all you need to get started with doctest. In this post, we have seen the power of doctest and how it makes it much easier to carry out automated testing for most scripts. Many developers find doctest easier than unittest because, in its simplest form, there is no API to learn before using it. However, as the examples become more complex, the lack of fixture management can make writing doctest tests more cumbersome than using unittest. Still, owing to its simplicity, doctest is worth adding to our code.

In upcoming blogs, we are going to discuss more handy Python modules that can ease our tasks, and we will also dive into other concepts like Machine Learning and Deep Learning. Until then, keep learning!

Recommendation Systems for Beginners

Why do we need a recommendation system?

Let us take the simplest and most relatable example: the E-commerce giant, Amazon. When we shop on Amazon, it gives us options for bundles and products that are usually bought along with the product we are currently going to buy. For example, if you are buying a smartphone, it recommends that you buy a back cover for it as well.

For a second let us think and try to figure out what Amazon is trying to do in the figure below:

What does a recommendation system do?

A recommendation system recommends you products or items that can be of your interest or liking. Let’s take another example:

It’s quite easy to notice that they are trying to sell the equipment that is generally required for a camera (a memory card and a case). Now, the main question is: how do they do it for millions of items listed on their website? This is where a recommendation system comes in handy.

When we first set up our Netflix account, they ask us what our preferences are, which movie or TV show is most likely to be watched by us or what genre of movie is our favorite. So as the first layer of recommendation, Netflix gives us recommendations based on our input, it shows us movies or shows similar to the input that we had provided to it. Once we are a frequent user, it has gathered enough data and gives recommendations more accurately based on our preferences of genres, cast, star rating, and so on…

The ultimate aim here is to recommend to a user an item that they will watch or buy (in the case of Amazon); this in turn makes sure that the user stays engaged with the platform and the customer lifetime value (CLTV) is maintained.

Objective of this blog

By the end of this blog, one will have a basic understanding of how to approach building a recommendation system. To make things more lucid, let us take an example and try building a hotel recommendation system. In the process, we will cover data understanding and the algorithms that can be used to understand how a nascent recommendation engine is built. We will draw analogies with everyday services like Amazon and Netflix for a clearer understanding.

Understanding the data required for building a recommendation system

To build a recommendation system, we must be clear with the problem statement and the end objective to provide accurate recommendations. For example, consider the following scenarios:

  1. Providing a user with a hotel recommendation based on his/her current search and historical behavior (giving a recommendation knowing that a user is looking for a hotel in Las Vegas and prefers hotels with casinos).
  2. Providing a hotel recommendation based on the user’s historical behavior, targeting those users who are not actively engaged (searching) but can be incentivized towards making a booking by targeting through a relevant recommendation (a general recommendation can be based on metrics such as a user’s historical star rating preference or historical budget preference).

These are two different objectives, and hence, the approach towards achieving both of them is different.

One must be aware of what type of data is available and also needs to know how to leverage that data to proceed towards building a recommendation engine.

There are two types of data which are of importance in our use case:

Explicit Data:

Explicit signals or input are where a user directly gives feedback on a particular item/product. This can be a star value, say in the range of 1 to 5, or just a binary 1 (like) and 0 (dislike). For example, when we rate an item on Amazon or rate a movie on IMDb, these are explicit signals where we are directly giving our feedback on an item. One thing to keep in mind is that each individual is not the same, i.e. for an Item X, User A and User B can have different ratings. User A can be generous with his ratings and give 5 stars, whereas User B is a critic who gives Item X 3.5 stars and reserves 5 stars for exceptional items.

Translating this to our hotel recommendation use case: the filters that a user applies while searching for a hotel, say swimming pool or WiFi, are explicit signals; here the user is explicitly saying that he is interested in properties that have WiFi and a swimming pool.

Additionally, explicit data is sparse in most cases, as it is not feasible for a user to give ratings to each and every item. Logically thinking, I would not have seen each and every movie on Netflix and hence can only provide feedback for the set of movies that I have seen. These ratings reflect how much a user likes or approves of an item.

Implicit Data:

Implicit signals are obtained by capturing a user’s interactions with the item. This can be a continuous value, like the number of times a user has clicked on an item or the number of times a user has watched an Action movie, or a binary value, such as clicked or not clicked. For example, while scrolling through amazon.com, the amount of time spent viewing an item or the number of times you have clicked on the item can act as implicit feedback.

Drawing parallels for hotel recommendations with implicit signals can be understood as follows. Consider that we have the historical hotel bookings of a user A, and we see that in the 4 out of 5 bookings made by the user, it was a property that was near the beach. This can act as an implicit signal where we can say that user A prefers hotels near the beach.

Types of Recommendation Systems

Let us take a specific example given below to further explain the recommendation models:

While making a hotel recommendation system, we have the user’s explicit and implicit signals. However, we do not have all the signals for all the users: for a set of users E we have explicit signals, and for a set of users I we have implicit signals.

Further, let us assume that a hotel property has the following attributes:

WiFi Availability, Couple Friendly, Budget Friendly and Nature Friendly (closer to nature)

For simplicity, let us assume that these are flags, such that if a property A has WiFi in it, the WiFi availability column will be 1. Hence our hotels data will look something like the following:

Let us name this table Hotel_Type for further use.
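Since the original table is shown as an image, here is a minimal sketch of what Hotel_Type might look like, built with pandas and purely hypothetical flag values:

import pandas as pd

# Hypothetical Hotel_Type table: 1 means the property has the attribute, 0 means it does not.
hotel_type = pd.DataFrame(
    {
        "WiFi":   [1, 1, 0, 1],
        "Couple": [0, 1, 0, 1],
        "Budget": [1, 0, 1, 0],
        "Nature": [1, 0, 0, 1],
    },
    index=["P", "Q", "R", "X"],
)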

Content Based Filtering:

This technique is used when explicit signals are provided by the user or when we have the user and item attributes and the interaction of the user with that item. The objective here is to show items/products which are similar to the item/product that a person has already purchased or shows a liking for, or in another case, show a product to a user where he explicitly says that he is looking for something in particular. Taking our example, consider that you are booking a hotel from xyz.com, you apply filters for couple-friendly properties, here you are explicitly saying that you are looking for a couple-friendly property and hence, xyz.com will show you properties that are couple friendly. Similarly, while providing recommendations, if we have explicit signals from a user we try to get the best match for that signal with the list of items that we have and provide recommendations accordingly.

Model Algorithms:

Cosine Similarity: It is a measure of similarity between two non-zero vectors. For the non-negative attribute vectors used here, the values range from 0 to 1. The vectors can be either user-based or item-based.

Let us take an example, Assume that a user A has specifically shown interest towards property X from the hotel_type table (the user has previously booked the property or has searched for the property X multiple times recently), we now have to recommend him properties that are similar to property X. To do so, we are going to find the similarity between property X and the rest of the properties in the table.
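A minimal sketch of this computation with scikit-learn, reusing the hypothetical Hotel_Type values sketched above:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same hypothetical flags as in the earlier sketch.
hotel_type = pd.DataFrame(
    {"WiFi": [1, 1, 0, 1], "Couple": [0, 1, 0, 1], "Budget": [1, 0, 1, 0], "Nature": [1, 0, 0, 1]},
    index=["P", "Q", "R", "X"],
)

# Cosine similarity between property X and every property in the table.
sims = cosine_similarity(hotel_type.loc[["X"]], hotel_type)[0]
ranking = pd.Series(sims, index=hotel_type.index).drop("X").sort_values(ascending=False)
print(ranking)   # for these assumed values, Q scores highest, followed by P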

We can clearly see that property Q has the highest similarity with property X, followed by property P. So if we are to recommend a property to user A, we will recommend property Q, knowing that he has a preference for property X.

Pearson Correlation: It is a measure of linear correlation between two variables. The values range from -1 to 1.

Let us take an example where we are getting explicit input from the user, who is shown the 4 categories (WiFi, Budget, Couple, Nature). The user has the option to select as many as he wants; he can even select none. Consider the case where a user B has selected at least one of the 4 options. Now, assume user B’s input looks like the following:

One could argue that we can use cosine similarity in this case by simply filling in the null values with 0. However, this is not advised, since cosine similarity treats a 0 as a negative preference, and from this explicit signal we cannot say for sure that user B is not looking for a couple-friendly or a budget-friendly property just because the user has not given an input in that field.

Hence, to avoid this, we use Pearson correlation and the output of the similarity measuring technique would look like the following:

We can see that property Z is highly correlated to user B’s explicit signal and hence, we will provide Z as a recommendation for user B.
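A small sketch of this idea, assuming hypothetical attribute flags for property Z (and two other properties for comparison) and a graded explicit input from user B in which untouched filters are left as NaN rather than 0:

import numpy as np
import pandas as pd

# Hypothetical attribute flags for three properties.
hotels = pd.DataFrame(
    {"WiFi": [1, 1, 0], "Couple": [0, 1, 1], "Budget": [0, 1, 1], "Nature": [1, 0, 1]},
    index=["Z", "Y", "W"],
)

# User B's explicit input: a graded preference where a filter was used, NaN where nothing was said.
user_b = pd.Series({"WiFi": 5, "Couple": np.nan, "Budget": 2, "Nature": 4})

# Series.corr uses pairwise-complete observations, so the unanswered attribute is simply
# ignored instead of being treated as a 0 (a dislike), which is the pitfall described above.
corr = hotels.apply(lambda row: user_b.corr(row), axis=1)
print(corr.sort_values(ascending=False))   # property Z comes out on top for these assumed values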

So, for the set of users E (explicitly providing us their preference) we will use Pearson Correlation, and for the set of users I (implicitly telling us that he/she is looking for a property with a certain set of attributes) we will use Cosine Similarity.

Note: A user’s explicit signal is always preferred over an implicit signal. For example, in the past, I have only booked hotels in the urban areas, however, now I want to book a hotel near the beach (nature friendly). In my explicit search, I specify this, but if you are making an implicit signal from my past bookings you will see that I do not prefer hotels near the beach and would recommend me hotels in the city. In conclusion, Pearson correlation and Cosine similarity are the most widely used similarity techniques, however, we need to always use the correct similarity measuring technique as per our use case. More information on different types of similarity techniques can be found here.

Collaborative Filtering:

This modeling technique leverages the user-item interactions. Here, we try to match or group similar users and recommend items based on the preferences of similar users. Let us consider a user-item interaction matrix (rating matrix) where each entry is the rating a user has given a particular hotel:

Rating Matrix

Now let us compare user A and user E. We can see that they both have similar tastes and have rated Hotel Y as 4. Seeing this, let us assume that user A will rate Hotel X as 5 and hotel R as 3. Hence, we can recommend hotel X to user A by noticing the similarity between user A and user E (considering that he will like Hotel X and rate it 5).

So, if we are provided with a user's interactions with items where the user has given feedback towards those items, we can use collaborative filtering (for example, the rating matrix). The feedback can be explicit, such as the star rating given by the user, or implicit, such as a flag indicating whether or not the user has booked a property.

Model Algorithms:

Memory and Model-Based Approach are the two types of techniques to implement collaborative filtering. The key difference between the two is that in the memory-based approach we do not use parametric machine learning models.

Memory-Based Approach: It can be divided into two subdivisions, user-item filtering and item-item filtering. In the user-item approach, we identify clusters of similar users and utilize the interactions of the users in that cluster to predict the interaction of a particular user. For example, to predict the rating user C gives to a hotel X, we will take a weighted sum of hotel X's ratings by the other users, where the weight is the similarity score between user C and the other users. Adjusted cosine similarity can also be used to remove the difference in the nature of individuals, which brings critics and the general public onto the same scale.

Item-item filtering is similar to user-item filtering, but here we take an item and see the users that liked that item and find other sets of items that those set of users or similar users also liked. It takes items, finds similar items, and outputs those items as recommendations.

Model-Based Approach: In this technique, we use machine learning models to predict the rating for an item that could have been given by a user and hence, provide recommendations.

Several ML models are used; to name a few: Matrix Factorization, SVD (singular value decomposition), ALS, and SVD++. Some also use neural networks, decision trees, and latent factor models to enhance the results. We will delve into Matrix Factorization below.

Matrix Factorization:

In matrix factorization, the goal is to complete the matrix and fill in the null values in the rating matrix.

The preferences of the users are captured by a small number of hidden features of the users and items. Here there are two hidden features, giving a user matrix (4×2) and an item matrix (2×4). Once we multiply the user and item matrices back together, we get back our rating matrix with the null values replaced by predicted values. Once we get the predicted values, we can recommend the item with the highest predicted rating for a user (not considering the items already interacted with).

Note: Here we are not providing any feature vector for the users or the items, the computation decomposes and creates vectors on its own and, finally, predicts by filling in the null values.
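A toy sketch of this decomposition with two hidden features, using plain NumPy gradient descent on a made-up rating matrix (production systems would typically use a library implementation such as ALS or SVD):

import numpy as np

# Hypothetical 4x4 rating matrix (users x hotels); 0 marks a missing rating.
R = np.array([
    [0, 4, 3, 0],
    [5, 0, 0, 2],
    [4, 0, 5, 0],
    [0, 4, 0, 5],
], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2   # k hidden features
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))      # user matrix (4x2)
V = rng.normal(scale=0.1, size=(n_items, k))      # item matrix (2x4 once transposed)
mask = R > 0                                      # only observed ratings contribute to the error

lr, reg = 0.01, 0.1
for _ in range(5000):
    err = mask * (R - U @ V.T)                    # error on observed entries only
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

pred = U @ V.T                                    # completed matrix: null cells now hold predictions
print(np.round(pred, 2))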

If we have user demographics information and user’s features and preference information and item features, we can use SVD++ where we can pass users and item feature vectors as well to get the best recommendation results.

Hybrid Models:

The hybrid model combines multiple models/algorithms to build a recommendation system. To improve the effectiveness of the recommendation, we can combine collaborative filtering and content-based filtering giving appropriate weights to the individual models and finally using this hybrid system to give out a recommendation.

For example, we can combine the results of the following by giving weights:

  1. Using Matrix factorization (collaborative filtering) on ratings matrix to match similar users and predict a rating for the user-item pair.
  2. Using Pearson correlation (content-based filtering) to find similarity between users who provide explicit filters and the hotels with feature vectors.

The combined results can be used to provide recommendations to users.
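A toy sketch of such a weighted blend, assuming both models' scores have already been scaled to a comparable 0-1 range and the weights have been chosen by validation:

import numpy as np

cf_scores = np.array([0.80, 0.40, 0.65])   # collaborative-filtering predictions (hypothetical)
cb_scores = np.array([0.55, 0.90, 0.60])   # content-based (Pearson) scores (hypothetical)

w_cf, w_cb = 0.6, 0.4                      # illustrative weights
hybrid = w_cf * cf_scores + w_cb * cb_scores
ranking = hybrid.argsort()[::-1]           # candidate indices, best first
print(hybrid, ranking)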

Conclusion:

Building a recommendation system highly depends on the end objective and the data we have at hand. Understanding the algorithms and knowing which one to use to get recommendations plays a vital role in building a suitable recommendation system. Additionally, a sound recommendation system also uses multiple algorithms and combines the results to provide the final recommendations.


Bring your Art to Life with Pix2Pix

As an artist, I always wondered if I could bring my art to life. Although it may sound far-fetched, what if I told you that this is possible with Machine Learning? Imagine a machine learning algorithm that can take a simple line sketch as a reference point and convert it into an oil painting, based on its understanding of real-world shapes and patterns from human drawings and photos. For an accomplished artist, the results can be quite interesting.

Pix2Pix is a Generative Adversarial Network, or GAN, model designed for general-purpose image-to-image translation. Image-to-Image translation is a problem where you have to translate an image from a given domain into a target domain. For example, let’s say the input domain images are of cats, and the target domain images are of dogs. In this case, the Image-to-Image translation algorithm learns a mapping from the input to the target domain in such a way that if you input the image of a cat, it can change it into an image of a dog.

Pix2pix can also be used to:

  • Convert satellite imagery into a Google Maps-style street view
  • Translate images from daytime to nighttime
  • Translate product sketches into product photographs, e.g., for shoe commercials
  • Convert high intensity images into low intensity and vice-versa

The Pix2Pix algorithm is one of the first successful general Image-to-Image translation algorithms that use a “GAN loss” to generate realistic image outputs. It is shorthand for an implementation of generic image-to-image translation using conditional adversarial networks. Compared to other GAN models for conditional image generation, Pix2Pix is relatively simple and capable of generating large, high-quality images across a variety of image translation tasks.

The comparison below should give you an idea of its potential:

The GAN architecture is comprised of a Generator model that outputs new plausible synthetic images, and a Discriminator model that classifies images as Real (from the dataset) or Fake (generated). The discriminator model is updated directly, whereas the generator model is updated via the discriminator model. The two models are trained simultaneously in an adversarial process where the generator seeks to better fool the discriminator, while the discriminator seeks to better identify the counterfeit images.

The Pix2Pix model is a type of conditional GAN, or cGAN, where the generation of the output image is conditioned on an input, in this case a source image. The discriminator is provided with a source image and the target image, and must determine whether the target is a plausible transformation of the source image.

The Generator’s Network

The Generator network uses a U-Net-based architecture. The U-Net architecture is similar to an AutoEncoder network, as it uses an Encoder and a Decoder for processing.

  • U-Net’s network has skip connections between Encoder layers and Decoder layers.

As shown in the picture, the output of the first layer of the Encoder is directly passed to the last layer of the Decoder, the output of the second layer of the Encoder is passed to the second-last layer of the Decoder, and so on.

If there are a total of N layers in the U-Net (including the middle layer), then there is a skip connection from the kth layer in the Encoder network to the (N-k+1)th layer in the Decoder network, where 1 ≤ k ≤ N/2.

‘x’ and ‘y’ represent input and output channels, respectively.

The Generator’s Architecture

The Generator network is made up of these two networks:

• The Encoder network is a downsampler
• The Decoder network is an upsampler

The Generator’s Encoder Architecture

• The Encoder network of the Generator network has seven convolutional blocks
• Each convolutional block has a convolutional layer, followed by a Leaky ReLU activation function
• Each convolutional block also has a batch normalization layer, except for the first block

The Generator’s Decoder Architecture

• The Decoder network of the Generator network has seven upsampling convolutional blocks
• Each upsampling block has an upsampling layer, followed by a convolutional layer, a batch normalization layer and a ReLU activation function

There are six skip-connections in a Generator network. The concatenation happens along the channel axis.
• The output from the 1st Encoder block is concatenated to the 6th Decoder block.
• The output from the 2nd Encoder block is concatenated to the 5th Decoder block.
• The output from the 3rd Encoder block is concatenated to the 4th Decoder block.
• The output from the 4th Encoder block is concatenated to the 3rd Decoder block.
• The output from the 5th Encoder block is concatenated to the 2nd Decoder block.
• The output from the 6th Encoder block is concatenated to the 1st Decoder block.
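As a sketch of how such blocks can be written in Keras (loosely following the public TensorFlow Pix2Pix tutorial; the filter counts, kernel size, and the use of transposed convolutions for upsampling are assumptions):

import tensorflow as tf

def encoder_block(filters, apply_batchnorm=True):
    # Conv -> (BatchNorm) -> LeakyReLU, downsampling by stride 2.
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False))
    if apply_batchnorm:                       # the first encoder block skips batch normalization
        block.add(tf.keras.layers.BatchNormalization())
    block.add(tf.keras.layers.LeakyReLU())
    return block

def decoder_block(filters, apply_dropout=False):
    # Transposed Conv -> BatchNorm -> (Dropout) -> ReLU, upsampling by stride 2.
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False))
    block.add(tf.keras.layers.BatchNormalization())
    if apply_dropout:
        block.add(tf.keras.layers.Dropout(0.5))
    block.add(tf.keras.layers.ReLU())
    return block

# Example: the first encoder block (no batch norm) halves the spatial size.
x = tf.random.normal([1, 256, 256, 3])
print(encoder_block(64, apply_batchnorm=False)(x).shape)   # (1, 128, 128, 64)

The skip connections are added when the blocks are wired together: the output of encoder block k is concatenated along the channel axis with the input of the matching decoder block.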

Discriminator’s Architecture

The Discriminator network uses a PatchGAN architecture. The PatchGAN network contains five convolutional blocks.

Pix2Pix Network’s Training

Pix2Pix is a conditional GAN. The loss function for a conditional GAN can be written as below:
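The image of the formula is not reproduced here; in the notation of the Pix2Pix paper (Isola et al., listed in the references), the conditional GAN objective and the full Pix2Pix objective are:

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]

G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G), \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z)\rVert_{1}\big]

where λ is the L1 weight (set to 100 in the implementation described below).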

Following are the steps that involve training the model for the Pix2Pix algorithm:

1. Import TensorFlow and required Libraries

2. Load the Dataset

3. Input Pipeline

4. Build the Generator

• The architecture of the generator is a modified U-Net.
• Each block in the encoder is (Conv -> Batchnorm -> Leaky ReLU)
• Each block in the decoder is (Transposed Conv -> Batchnorm -> Dropout (applied to the first three blocks) -> ReLU)
• There are skip connections between the encoder and decoder (as in U-Net).

5. Generator loss

• It is a sigmoid cross entropy loss of the generated images and an array of ones
• It includes an L1 loss, which is the MAE (mean absolute error) between the generated image and the target image
• This allows the generated image to become structurally similar to the target image
• The total generator loss = gan_loss + LAMBDA * l1_loss, where LAMBDA = 100
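A sketch of this loss in TensorFlow, consistent with the bullets above (the helper names are illustrative):

import tensorflow as tf

LAMBDA = 100
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)   # sigmoid cross entropy

def generator_loss(disc_generated_output, gen_output, target):
    # The generator is rewarded when the discriminator marks its output as real (an array of ones).
    gan_loss = bce(tf.ones_like(disc_generated_output), disc_generated_output)
    # L1 loss (mean absolute error) keeps the generated image structurally close to the target.
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    total_gen_loss = gan_loss + LAMBDA * l1_loss
    return total_gen_loss, gan_loss, l1_loss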

The training procedure for the generator is shown below:

6. Build the Discriminator

  • The Discriminator is a PatchGAN.
  • Each block in the discriminator is (Conv -> BatchNorm -> Leaky ReLU)
  • The shape of the output after the last layer is (batch_size, 30, 30, 1)
  • Each 30×30 patch of the output classifies a 70×70 portion of the input image (such an architecture is called a PatchGAN).
  • Discriminator receives 2 inputs:
  • Input image and the target image, which it should classify as real.
  • Input image and the generated image (output of the generator), which it should classify as fake.
  • We concatenate these 2 inputs together in the code (tf.concat([inp, tar], axis=-1))

7. Discriminator loss

  • The discriminator loss function takes 2 inputs: real images and generated images
  • real_loss is a sigmoid cross entropy loss of the real images and an array of ones (since these are the real images)
  • generated_loss is a sigmoid cross entropy loss of the generated images and an array of zeros (since these are the fake images)
  • Then the total_loss is the sum of real_loss and the generated_loss
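A matching sketch of the discriminator loss (again with illustrative names):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)   # sigmoid cross entropy

def discriminator_loss(disc_real_output, disc_generated_output):
    real_loss = bce(tf.ones_like(disc_real_output), disc_real_output)                   # real pair -> ones
    generated_loss = bce(tf.zeros_like(disc_generated_output), disc_generated_output)   # fake pair -> zeros
    return real_loss + generated_loss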

8. Define the Optimizers and Checkpoint-saver

9. Generate Images

Write a function to plot some images during training.

  • We pass images from the test dataset to the generator
  • The generator will then translate the input image into the output
  • Last step is to plot the prediction
10. Training
  • For each example input, generate an output
  • The discriminator receives the input_image and the generated image as the first input. The second input is the input_image and the target_image
  • Next, we calculate the generator and the discriminator loss
  • Then, we calculate the gradients of loss with respect to both the generator and the discriminator variables (inputs) and apply those to the optimizer
  • Then log the losses to TensorBoard
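Putting these bullets together, a sketch of a single training step; it assumes the generator, the discriminator, the two loss functions, the two optimizers, and a TensorBoard summary_writer have already been created in the earlier steps:

import tensorflow as tf

@tf.function
def train_step(input_image, target, step):
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        gen_output = generator(input_image, training=True)

        disc_real_output = discriminator([input_image, target], training=True)
        disc_generated_output = discriminator([input_image, gen_output], training=True)

        gen_total_loss, gen_gan_loss, gen_l1_loss = generator_loss(disc_generated_output, gen_output, target)
        disc_loss = discriminator_loss(disc_real_output, disc_generated_output)

    generator_gradients = gen_tape.gradient(gen_total_loss, generator.trainable_variables)
    discriminator_gradients = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(generator_gradients, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(discriminator_gradients, discriminator.trainable_variables))

    # Log the losses to TensorBoard.
    with summary_writer.as_default():
        tf.summary.scalar("gen_total_loss", gen_total_loss, step=step)
        tf.summary.scalar("disc_loss", disc_loss, step=step)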

The Training Loop:

  • Iterates over the number of epochs
  • On each epoch, it clears the display, and runs generate_images to show its progress
  • On each epoch it iterates over the training dataset, printing a ‘.’ for each example
  • It saves a checkpoint every 20 epochs

The beauty of a trained Pix2Pix network is that it will generate an output from any arbitrary input. Following are some inputs and their corresponding outputs generated after applying Pix2Pix.

Conclusion

Pix2Pix is a whole new strategy for Image-to-Image translation, using a combination of a Generator and a Discriminator. It gives us a chance to bring our art to life. It also proves useful in various spheres, such as exploring satellite images and various Augmented Reality techniques. This technique could open new opportunities for Virtual Reality and give it a whole new approach.

Reference

  • https://arxiv.org/pdf/1611.07004.pdf
  • https://openaccess.thecvf.com/content_CVPR_2019/papers/Qu_Enhanced_Pix2pix_Dehazing_Network_CVPR_2019_paper.pdf
  • https://openaccess.thecvf.com/content_cvpr_2017/papers/Isola_Image-To-Image_Translation_With_CVPR_2017_paper.pdf

Evolution Of Human Resource In The New World Of Technology

How has Human Resources changed with time?

Of all the departments and functions in a corporate organization, Human Resources is the one function that deals with employees’ personal aspects. The entire employee job cycle is taken care of by Human Resources (HR), ranging from hiring, compensation, leave management, employee satisfaction, development, and growth through to exit. This function requires personal involvement and conscience that may vary from person to person.

A rather different aspect of an organization is technology: the practical and scientific application of skills, processes, methods, and organizational techniques. It does not involve any personalized features and produces the same result irrespective of who brings it into the organizational process, or where or when.

But what if these two aspects are related?

The challenges related to HR, like employee engagement, employee retention, development of leaders, competitive compensation, the global outreach of businesses, and various other factors, have stimulated extensive innovation in the HR field. For instance, more than 92% of recruiters have turned to Social Media hiring in the recent decade rather than relying only on organic hiring methods. More than 3% of recruiters use “Snapchat” as a recruiting channel, moving beyond LinkedIn, Facebook, and Twitter. Below are some instances that bring HR and Technology together.

HR and Virtual Reality

COVID-19 has undoubtedly helped change the mindset of protagonists across the corporate industry, especially in India. Holding appraisal meetings, conducting interviews, onboarding, and even celebrating as part of work now take place over video calls. Technology and Virtual Reality help HR with talent management, training, onboarding and inductions, hiring, etc. as the new normal.

HR and Machine Learning

Machine Learning (ML) uses algorithms for automated data analysis to create automated analytical models. HR deals with massive data sets from recruitment and the employee database. ML technology helps HR improve the efficiency of initial research, freeing up dedicated hours for acquiring next-level results. So far, machine learning applications in Human Resources are confined mainly to the recruitment process. However, it will be exciting to see the advancements in this field.

HR and Cloud Computing

Cloud Computing means using a network of remote servers hosted on the internet instead of a personal computer or a local server. It helps data processing by storing and managing valuable information over the cloud, enabling the HR department to push its expertise into middle and higher-level leadership, resulting in efficient business performance and execution. When the data on performance, attendance, time tracking, etc. gets automated, the focus can shift to increasing productivity, transforming the HR department from a cost center into a revenue generator.

The instances mentioned above are only a broad sample of the technology inter-dependencies in the HR department. The ever-changing and fast-paced technological advances are only making HR strive towards innovation, which ultimately intertwines it even more with technology.

Significantly, the global pandemic has altered stigmas and helped people become adaptive. Several theories suggest that remote working may continue as the “new normal” once we overcome this pandemic.

It can let the who’s who of the corporate world refine the entire workplace experience, with HR as the bridge that connects the extremes in an organization, exposing and expanding them with the latest technological developments.

It will be interesting to see how the Corporates fit into this new reality.

Making our Data Scientists and ML engineers more efficient (Part 2)

In the last post, we briefly touched upon the concept of MLOps and one of its elements, namely the Feature Store. We intend to cover a few more interconnected topics that are key to successful ML implementations and realizing sustained business impact.

At Affine, we ensure that a lion’s share of our focus in a ML project is on:

  1. Investing more time on building Great Features and Feature Stores (than ML algos – yes, please don’t frown upon this).
  2. Setting up Robust and Reproducible Data and ML Pipelines to ensure faster and accurate Re-training and Serving.
  3. Following ML and Production Standard Coding Practices.
  4. Incorporating Model Monitoring Modules to ensure that the model stays healthy.
  5. Reserving the last hour of each business day on Documentation.
  6. And last but not least, regularly reciting the Magic words – Automate, Automate, Automate!

The following architecture is an attempt to simplify and encapsulate the above points:

This is the Utopian version of the ML architecture that every team aspires for. This approach attempts to address 3 aspects with respect to ML Training and Serving – Reproducibility, Continuous (or Inter-Connected-ness) and Collaboration.

Reproducibility of model artifacts/predictions – The ML components (Feature Engineering, Dataset Creation, ML tuning, Feature Contribution, etc.) should be built in such a way that it is simple to reproduce the output of each of those components seamlessly and accurately at a later point in time. From an application standpoint, this may be required for a root-cause analysis (or model compliance), or to re-trigger the ML pipeline on a new dataset at a later point in time. Equally importantly, if you can reproduce results accurately, it also guarantees that the ML pipeline (Data to ML training) is stable and robust! This ultimately leads to reliable Model Serving.

Continuous Training aspect – This is known as Continuous Integration in the context of software engineering, and refers to the automation of components like code merge, unit/regression testing, build, etc. to enable continuous delivery. A typical ML pipeline also comprises several components (Feature Selection, Model Tuning, Feature Contribution, Model Validation, Model Serialization, and finally Model Registry and Deployment). The Continuous aspect ensures that each module of our ML pipeline is fully automated and fully integrated (parameterized) with the other modules, such that Data runs, Model runs, and ML Deployment runs happen seamlessly when the pipeline is re-triggered.

However, it all starts with “Collaboration” – Right Team and Right Mindset!

Before we delve deeper into each of these points, it is essential to touch upon one more topic – the need for a tightly-knit cross-functional team. It’s not pragmatic to expect the Data Scientists to handle all of the above aspects, and the same is true for the ML engineers. However, for a successful MLOps strategy, it’s important to get outside of our comfort zones, learn cross-functional skills and collaborate closely. This means that the Data Scientists should learn to write production-grade code (modularization, testing, versioning, documentation, etc.). Similarly, the ML engineers should understand ML aspects like Feature Engineering and Model Selection to appreciate why these seemingly complex ML artifacts are critical in solving the business problem.

As for managing such a cross-functional team of people who bring different niches, we need people who thrive in this knowledge-based ecosystem, where the stack keeps getting bigger every day. We need a ‘Jack of all Trades’, someone who knows a little bit of everything and possesses the articulation skills to bring out the best from the team.

Getting the right cross functional team in place is the first (and unarguably the most important) piece of the puzzle. In the next article, we will go a bit deeper on each of the components described above. Please stay tuned…

Making our Data Scientists and ML Engineers more Efficient (Part 1)

There is a lot of backend grunt work involved in deploying an ML model successfully before we even begin to realize its business benefits. More than 60-70% of the effort that goes into an ML project involves what we call the “glue work” – from EDAs to getting the QAed features/analytical dataset ready, to moving the final model to production (btw, real-time deployments are a nightmare!). This glue work is usually the non-jazzy side of ML, but it’s what makes the models do what they are supposed to do –

  • Ensure Reproducible and Consistent Training and Serving Pipelines and Datasets
  • Ensure Upstream and Downstream QA for stable Feature Values and Predictions
  • Ensure 99.99% Availability
  • Sometimes, just re-write the entire ML serving part on a completely different server-side environment

Imagine doing the above time and again, every time a model needs to be refreshed or a new model/use case needs to be trained. The times we are living in are fluid, and so are the shelf lives of our models. The traditional approaches to building and deploying models prove to be a huge bottleneck while building models at scale, simply because there is a substantial amount of glue work.

At Affine, we are inculcating a mindset change (read MLOps) amongst our frontline DS and ML engineers to ensure faster training and deployment of ML models (Continuous Training and Continuous Delivery/Deployment) across a variety of use cases.

The key to a successful MLOps implementation begins with a mindset change and motivation to adopt new practices, which may initially look like an overhead for a specific project, but guarantees tremendous gains in efficiency and business value over the mid to long run.

One such practice is the adoption of an Enterprise Feature Store. For those familiar with DS/ML terminologies, this is essentially the most comprehensive Analytical Dataset that you can think of – it comprises all possible features at the lowest possible granularity (think user-product-day level) and is refreshed at a very high frequency. Think of it as a living data source that has the ability to serve multiple ML use cases – through a simple API/SDK.

The Feature Store not only facilitates faster creation of Training AD but also Serving/Scoring AD and across multiple use cases within an org function (Ex: Marketing or Supply Chain)
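As a purely hypothetical illustration (the client and method names below are invented for this sketch, not a real SDK), pulling features for training and for serving could look like this:

from feature_store_client import FeatureStore   # assumed in-house SDK, not a real package

fs = FeatureStore(project="marketing")

# Training: point-in-time-correct features joined to labelled events.
labelled_events = ...   # a DataFrame of user-product-day keys plus the label (assumed to exist)
train_ad = fs.get_historical_features(
    entity_df=labelled_events,
    features=["user_7d_clicks", "product_ctr_30d"],   # illustrative feature names
)

# Serving: the same features fetched online for a single user-product pair.
feature_vector = fs.get_online_features(
    entity={"user_id": 123, "product_id": 456},
    features=["user_7d_clicks", "product_ctr_30d"],
)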

What this means for Data Scientists

  • Faster Dev Cycle
  • Zero Data Errors: Feature Store being the Single Source of Truth
  • Leverage collective intellectual thought leadership from multiple DS who designed those features

What this means for ML Engineers

  • Consistency of Data between Training and Deployment environment
  • Faster Deployment in Production
  • No loss of information between the DS team and ML engineers

What this means for Business

  • Adaptive Solution means faster and relevant response to market dynamics
  • Reusability of components across use cases means standardization and cost-effectiveness

Machine Learning, to many, is the combination of Science and Art. However, we sincerely believe that there is a third component – that of Process. The Feature Store is one such ML hygiene practice, and when coupled with a few other elements it results in scalable and sustainable ML impact that is very much the need of the world today.

Please stay tuned for more on ML Hygiene.

Super-Resolution with Deep Learning for Image Enhancement

Have you ever looked at your old photographs and hoped it had better quality? Or wished to convert all your photos to a better resolution to get more likes? Well, Deep learning can do it!

Image Super Resolution can be defined as increasing the size of small images while keeping the drop in quality to a minimum, or restoring High Resolution (HR) images using the details obtained from Low Resolution (LR) images.

The process of enhancing an image is quite complicated due to the multiple issues within a given low-resolution image. An image may have a “lower resolution” as it is smaller in size or as a result of degradation. Super Resolution has numerous applications like:

  • Satellite Image Analysis
  • Aerial Image Analysis
  • Medical Image Processing
  • Compressed Image
  • Video Enhancement, etc.

We can relate the HR and LR images through the following equation:

LR = degradation(HR)

The goal of super resolution is to recover a high-resolution image from a low-resolution input.

Deep learning can estimate the High Resolution of an image given a Low Resolution copy. Using the HR image as a target (or ground-truth) and the LR image as an input, we can treat this like a supervised learning problem.

One of the most used techniques for upscaling an image is interpolation. Although simple to implement, this method faces several issues in terms of visual quality, as the details (e.g., sharp edges) are often not preserved.

Most common interpolation methods produce blurry images. Several types of interpolation techniques used are:

  • Nearest Neighbor Interpolation
  • Bilinear Interpolation
  • Bicubic Interpolation

Image processing for resampling often uses Bicubic Interpolation over Bilinear or Nearest Neighbor Interpolation when speed is not an issue. In contrast to Bilinear Interpolation, which only takes 4 pixels (2×2) into account, bicubic interpolation considers 16 pixels (4×4).
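A quick sketch of these up-scaling methods with OpenCV (file names and the scale factor are illustrative):

import cv2

lr = cv2.imread("low_res.png")                 # assumed input image
scale = 4
size = (lr.shape[1] * scale, lr.shape[0] * scale)

nearest = cv2.resize(lr, size, interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(lr, size, interpolation=cv2.INTER_LINEAR)
bicubic = cv2.resize(lr, size, interpolation=cv2.INTER_CUBIC)   # 4x4 neighbourhood, slower but sharper

cv2.imwrite("bicubic_x4.png", bicubic)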

SRCNN was the first Deep Learning method to outperform traditional ones. It is a convolutional neural network consisting of only three convolutional layers:

  • Pre-Processing and Feature Extraction
  • Non-Linear Mapping
  • Reconstruction

Before being fed into the network, an image is up-sampled via Bicubic Interpolation. It is then converted to the YCbCr (Y Component and Blue-difference to Red-difference Chroma Component) color space, while the network uses only the luminance channel (Y). The network’s output is then merged with the interpolated CbCr channels to produce a final color image. The intent of this procedure is to change the brightness (the Y channel) of the image while leaving the color (the CbCr channels) unchanged.

The SRCNN consists of the following operations:

  • Pre-Processing: Up-scales LR image to desired HR size.
  • Feature Extraction: Extracts a set of feature maps from the up-scaled LR image.
  • Non-Linear Mapping: Maps the feature maps representing LR to HR patches.
  • Reconstruction: Produces the HR image from HR patches.

A key super-resolution requirement is that most of the information in the LR image is preserved in the SR image. Super-resolution models therefore mainly learn the residuals between LR and HR images, which makes residual network designs essential.

Up-Sampling Layer

The up-sampling layer used is a sub-pixel convolution layer. Given an input of size H×W×C and an up-sampling factor s, the sub-pixel convolution layer first creates a representation of size H×W×s²C via a convolution operation and then reshapes it to sH×sW×C, completing the up-sampling operation. The result is an output spatially scaled by the factor s.
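A minimal TensorFlow sketch of one sub-pixel up-sampling step (the filter count and kernel size are assumptions):

import tensorflow as tf

def subpixel_upsample(x, scale, channels):
    # The convolution produces H x W x (scale^2 * C) feature maps ...
    x = tf.keras.layers.Conv2D(channels * scale ** 2, 3, padding="same")(x)
    # ... which depth_to_space rearranges into (scale*H) x (scale*W) x C.
    return tf.nn.depth_to_space(x, scale)

x = tf.random.normal([1, 32, 32, 64])     # a batch of LR feature maps
y = subpixel_upsample(x, scale=2, channels=64)
print(y.shape)                            # (1, 64, 64, 64)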

Enhanced Deep Residual Networks (EDSR)

Furthermore, as super-resolution techniques started gaining momentum, EDSR was developed; it addresses the shortcomings of SRCNN and produces much more refined results.

The EDSR network consists of:

  • 32 residual blocks with 256 channels
  • pixel-wise L1 loss instead of L2
  • no batch normalization layers to maintain range flexibility
  • scaling factor of 0.1 for residual addition to stabilize training
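A sketch of a single EDSR residual block reflecting these points (the layer parameters are assumptions):

import tensorflow as tf

def edsr_res_block(x, filters=256, scaling=0.1):
    # Conv -> ReLU -> Conv, with no batch normalization (to keep range flexibility).
    skip = x
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    x = x * scaling                       # residual scaling stabilizes training
    return x + skip                       # residual addition

x = tf.random.normal([1, 48, 48, 256])
print(edsr_res_block(x).shape)            # (1, 48, 48, 256)

The full EDSR network stacks 32 such blocks and is trained with a pixel-wise L1 loss, as listed above.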

On comparing several techniques, we can clearly see a trade-off between performance and speed. ESPCN looks the most efficient while EDSR is the most accurate and expensive. You may choose the most suitable method depending on the application.

Further, I have implemented Super Resolution using EDSR on a few images, as shown below:

  • Importing Modules

This block of code imports all the necessary modules needed for training. DIV2K is a dataset consisting of diverse 2K-resolution, high-quality images used in various image processing tasks. The EDSR algorithm is also imported here for further training.

  • Defining the Parameters

This block of code sets the number of residual blocks to 16. Furthermore, the super-resolution factor and the downgrade operator are defined. The higher the number of residual blocks, the better the model will be at capturing minute features, even though it becomes more complicated to train. Hence, we stick with 16 residual blocks.

A directory is created where the model weights will be stored during training.

The images are downloaded from DIV2K into two folders, train and valid, each containing both low- and high-resolution images.

Once the dataset has been loaded, both the train and valid images need to be converted into TensorFlow dataset objects.

  • Training the Model

Now the model is trained for 30k steps, with the output evaluated every 1000th step. The training takes around 12 hours on a Google Colab GPU.

  • Obtaining the Result

Furthermore, the models are saved and can be used in the future for further modifications.

For output, the model is loaded using weight files and further both the images of low resolution and high resolution are plotted simultaneously for comparison.

High-Resolution Images showcased on the right are obtained by applying the Super-Resolution algorithm on the Low-Resolution Images showcased on the left.

Conclusion

You can clearly observe a significant improvement in the resolution of images post-application of the Super-Resolution algorithm, making it exceptionally useful for Spacecraft Images, Aerial Images, and Medical Procedures that require highly accurate results.


Semantic Literature Search Powered by Sentence-BERT

Suppose you were given a challenge to pick out a horror novel from a small collection of books, without any prior information regarding these books. How would you go about this task in the most efficient way possible? The most appropriate strategy would be flicking through the books to figure out the theme based on its words/sentences/paragraphs. Another approach could be reading the reviews on the cover page for context.

This article discusses how we can use a pre-trained BERT model to accomplish a similar task. Kaggle recently released an Open Research Dataset Challenge called CORD-19, with over 52,000 research articles on COVID-19, SARS-CoV-2, and other related topics. The problem statement was to provide insights on the “tasks” using this vast research corpus. The required solution must fetch all the relevant research articles based on the questions a user submits.

Our approach to finding the most relevant research articles (related to the task) was to compare the question with the abstracts of all the research articles present. By finding the similarity score, we could easily rank the top-N articles. Comparing the question with the entire corpus is as easy as flicking through the small collection of novels to pick out the horror genre. This automation is achieved using the Sentence-Transformers library, a pre-trained BERT model, and the spaCy library.

“Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, a 2019 EMNLP-IJCNLP paper by Nils Reimers and Iryna Gurevych, describes the architecture and advantages of Siamese and Triplet network structures, some of which are:

  • Finding similar pairs of sentences from a corpus of 10,000 sentences requires around 50 million inference computations and ~65 hours with BERT/RoBERTa. Sentence BERT can quite significantly reduce the embeddings construction time for the same 10,000 sentences to ~5 seconds!
  • Fine-tuning a pre-trained BERT network and using siamese/triplet network structures to derive semantically meaningful sentence embeddings, which can be compared using cosine similarity.

Methodology:

1. Install sentence-transformers and load a pre-trained BERT model. We have used ‘bert-base-nli-mean-tokens’ due to its high performance on STS benchmark dataset.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

2. Vectorize/encode the abstracts of each article using pre-trained Bert embedding. Suppose the abstracts are contained in a list named ‘abstract’.

abstract_embeddings = model.encode(abstract)

The abstracts are encoded using BERT pre-trained embeddings and the shape would be (size of abstract list) rows × 768 columns.

3. Dump the abstract_embeddings as pickle.

import pickle

with open("my.pkl", "wb") as f:
    pickle.dump(abstract_embeddings, f)

4. Inference from the trained embeddings in real time.

i) Load the embedding file “my.pkl”.

with open("my.pkl", "rb") as f:
    df = pickle.load(f)

ii) Encode the question into a 768-D embedding using step 1-2. For e.g., the question “What do we know about COVID-19 risk factors?” would be represented by a 768 dimensional embedding.

iii) Compare the question with all the abstracts’ embeddings and find the cosine similarity to rank and give top-N results. As the question and the abstracts are in numerical form, this can be done easily using scipy function.

import scipy.spatial

distances = scipy.spatial.distance.cdist([query_embedding], df, "cosine")[0]
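Continuing the snippet above, the top-N articles can then be read off the distance array (N and the printed slice are arbitrary choices):

import numpy as np

top_n = 5
best = np.argsort(distances)[:top_n]      # smallest cosine distance = most similar
for idx in best:
    print(abstract[idx][:200])            # preview the first 200 characters of each matching abstract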

This article discusses how to use the pre-trained Sentence-BERT library for a downstream NLP task, which is to find relevant research articles based on a user’s questions. We could also compare the question with the title or the research body. Comparing the question with only the title could be less informative, as the text length of the title is too small. In contrast, comparing the question with the research body requires heavy computation. Additionally, shrinking the huge research body to 768 dimensions will lead to a huge loss of information. However, feel free to experiment with embeddings of the title and research body. Your findings may be fascinating.

Reference(s):

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks
https://github.com/UKPLab/sentence-transformers
https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md
https://www.aclweb.org/anthology/D19-1410.pdf
https://www.aclweb.org/anthology/D19-1410/

Bidirectional Encoder Representations from Transformers (BERT) Simplified

In the past, Natural Language Processing (NLP) models struggled to differentiate words based on context due to the use of shallow embedding methods for text analysis.

Bidirectional Encoder Representations from Transformers (BERT) has revolutionized the NLP research space. It excels at handling language problems considered to be “context-heavy” by attempting to map vectors onto words only after reading the entire sentence, in contrast to traditional NLP methods.

This blog sheds light on the term BERT by explaining its components.

BERT (Bidirectional Encoder Representation from Transformers)

Bidirectional – Reads text from both the directions. As opposed to the directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore, it is considered bidirectional, though it would be more accurate to call it non-directional.

Encoder – Encodes the text in a format that the model can understand. It maps an input sequence of symbol representations to a sequence of continuous representations. It is composed of a stack of 6 identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by Layer Normalization. The key feature of layer normalization is that it normalizes the inputs across the features.

Representation – To handle a variety of down-stream tasks, our input representation can unambiguously represent both a single sentence and a pair of sentences, e.g. Question & Answering, in one token sequence in the form of transformer representations.

Transformers – Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.

Transformers are a combination of 3 things:

In this blog, we will only talk about the Attention Mechanism.

Limitations of RNNs over transformers:

  • RNNs and their derivatives are sequential, which conflicts with one of the main benefits of a GPU, i.e. parallel processing
  • LSTM, GRU and derivatives can learn a lot of long-term information, but they can only remember sequences of 100s, not 1000s or 10,000s and above

Attention Concept

As you can see in the image above, attention must be paid at the stop sign. And for the text, eating (verb) has higher attention in relation to oats.

Transformers use attention mechanisms to gather information about the relevant context of a given word, then encode that context in the vector that represents the word. Thus, attention and transformers together form smarter representations.

Types of Attention:

  1. Self-Attention
  2. Scaled Dot-Product Attention
  3. Multi-Head Attention

Self-Attention

Self-attention, also called intra-attention is an attention mechanism that links different positions of a single sequence to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, etc.

Scaled Dot-Product Attention

Scaled Dot-Product Attention consists of queries Q and keys K of dimension dk, and values V of dimension dv. We compute the dot products of the query with all keys, divide each of them by √dk, and apply a SoftMax function to obtain the weights on the values.
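In the notation of the Transformer paper (see the references), this computation is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V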

The two most commonly used attention functions are:

  • Dot-product (multiplicative) attention: This is identical to the algorithm above, except for the scaling factor of 1/√dk.
  • Additive attention: Computes the compatibility function using a feed-forward network with a single hidden layer.

While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice as it uses a highly optimized matrix multiplication code.

Multi-Head Attention

Instead of performing a single attention function with dmodel-dimensional keys, values and queries, it is beneficial to linearly project the queries, keys and values h times with different, trained linear projections to dk, dk and dv dimensions, respectively. We then perform the attention function in parallel on each of these projected versions, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

Applications of BERT

  1. Context-based Question Answering: It is the task of finding an answer to a question over a given context (e.g., a paragraph from Wikipedia), where the answer to each question is a segment of the context.
  2. Named Entity Recognition (NER): It is the task of tagging entities in text with their corresponding type.
  3. Natural Language Inference: Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
  4. Text Classification

Conclusion:

Recent experimental improvements due to transfer learning with language models have demonstrated that rich and unsupervised pre-training is an integral part of most language understanding systems. It is in our interest to further generalize these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broader set of NLP tasks.

References:

  1. https://arxiv.org/abs/1810.04805
  2. https://arxiv.org/abs/1706.03762

TensorFlow Lite: The Future of AI in Mobile Devices

Do you use Google Services on your mobile phones? If so, you would have probably noticed that the predictive capability of this service has recently improved with respect to speed and accuracy. The enhanced predictive capability of Google is now faster, thanks to TensorFlow Lite working with your phone’s GPU.

What is TensorFlow Lite?

TFLite is TensorFlow’s lightweight solution for mobile, embedded and other IoT devices. It can be described as a toolkit that helps developers run TensorFlow models on such devices.

What is the need for TensorFlow Lite?

Running Machine Learning models on mobile devices is not easy due to limitations on resources like memory, power, storage, etc. Ensuring that the deployed AI models are optimized for performance under such constraints becomes a necessary step in such scenarios. This is where TFLite comes into the picture. TFLite models are hyper-optimized with model pruning and quantization to ensure accuracy for a small binary size with low latency, allowing them to overcome these limitations and operate efficiently on such devices.

TensorFlow Lite consists of two main components:

  • The TensorFlow Lite converter: converts TensorFlow models into an efficient form and applies optimizations to improve binary size and performance.
  • The TensorFlow Lite interpreter: runs the optimized models on different types of hardware, including mobile phones, embedded Linux devices, and microcontrollers.

TensorFlow Lite Under the Hood

Before deploying the model on any platform, the trained model needs to go through a conversion process. The diagram below depicts the standard flow for deploying a model using TensorFlow Lite.

Fig: TensorFlow Lite conversion and inference flow diagram.

Step 1: Train the model in TensorFlow with any API, e.g. Keras. Save the model (h5, hdf5, etc.)

Step 2: Once the trained model has been saved, convert it into a TFLite flat buffer using the TFLite converter. A Flat buffer, a.k.a. TFLite model is a special serialized format optimized for performance. The TFLite model is saved as a file with the extension .tflite

Step 3: Once the trained model has been converted into a TFLite flat buffer, it can be deployed to mobile or other embedded devices. Once the TFLite model is loaded by the interpreter on a mobile platform, we can go ahead and perform inferences using the model.

Converting your trained model (‘my_model.h5’) into a TFLite model (‘my_model.tflite’) can be done with just a few lines of code as shown below:
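A sketch of that conversion in TensorFlow 2 (file names are illustrative):

import tensorflow as tf

# Load the trained Keras model saved earlier.
model = tf.keras.models.load_model("my_model.h5")

# Convert it into a TFLite flat buffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the flat buffer as a .tflite file.
with open("my_model.tflite", "wb") as f:
    f.write(tflite_model)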

How does TFLite overcome these challenges?

TensorFlow Lite uses a popular technique called Quantization. Quantization is a type of optimization technique that constrains an input from a large set of values (such as the real numbers) to a discrete set (such as the integers). Quantization essentially reduces the precision of a model’s representation. For instance, in a typical deep neural network, all the weights and activation outputs are represented by 32-bit floating-point numbers. Quantization converts this representation to the nearest 8-bit integers. By doing so, the overall memory requirement for the model reduces drastically, which makes it ideal for deployment on mobile devices. While these 8-bit representations can be less precise, certain techniques can be applied to ensure that the inference accuracy of the new quantized model is not affected significantly. This means that quantization can be used to make models smaller and faster without sacrificing accuracy.
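A sketch of post-training quantization with the TFLite converter (file names are illustrative):

import tensorflow as tf

model = tf.keras.models.load_model("my_model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Ask the converter to quantize the model (e.g. float32 weights -> int8) during conversion.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

with open("my_model_quant.tflite", "wb") as f:
    f.write(quantized_model)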

Stay tuned for the follow-up blog, which will be a walkthrough of how to run a Deep Learning model on a Raspberry Pi 4. In the meantime, you can keep track of all the latest additions to TensorFlow Lite at https://www.tensorflow.org/lite/
