Isolating Toxic Comments to prevent Cyber Bullying

Affine

AI & AE Team

Online communities are susceptible to Personal Aggression, Harassment, and Cyberbullying. These show up as Toxic Language, Profanity, or Abusive Statements. By Toxicity we mean the use of threats, obscenity, insults, and identity-based hateful language.

Our objective was to create an algorithm that could identify Toxicity and the other Categories with an Accuracy above 90%. Below, we discuss how we built a model that reached an Accuracy of 98%.

Approach Part 1: Engineer the right features for Emotion Capture

No doubt, Toxicity can be detected and measured by the presence of cuss words, obscenities, and the like. But is that all? Is the Bag of Words approach, which relies on content alone, sufficient to define Toxicity? Or are we missing information carried by the sequential nature of language, where expression comes not only from the chosen words but also from the way they are written or used?
This idea of using Emotion as well as content was validated on the Kaggle Toxic Challenge problem data.
Emotion and Intensity in people’s written comments were captured via Feature Engineering and Deep Learning Models. The content was captured using Machine Learning Models that follow the Bag of Words approach.
The inclusion of Emotions, coupled with our Ensembling Approach, raised the accuracy from 60% (Bag of Words) to 98%.
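
As a rough illustration of what hand-crafted Emotion and Intensity features could look like, here is a minimal sketch. The specific features (share of upper-case letters, exclamation marks, shouted words, elongated words) and the function name `intensity_features` are assumptions made for this example; the features actually engineered for the model are not listed in this post.

```python
import re


def intensity_features(comment: str) -> dict:
    """Return a few illustrative 'how it is written' features for one comment.

    These features are assumptions for illustration, not the original feature set.
    """
    letters = [ch for ch in comment if ch.isalpha()]
    words = re.findall(r"[A-Za-z']+", comment)
    # Share of letters typed in upper case (0.0 when there are no letters).
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "upper_ratio": upper_ratio,
        # Number of exclamation marks in the comment.
        "exclamations": comment.count("!"),
        # Words of two or more characters written entirely in capitals.
        "shouted_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        # Words with a character repeated three or more times in a row.
        "elongated_words": sum(1 for w in words if re.search(r"(.)\1{2,}", w)),
    }


features = intensity_features("You are SO annoying!!!")
```

Numeric features of this kind can sit next to Bag of Words counts, so that a model sees both what was said and how forcefully it was written.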

Approach Part 2: Create Disparate Models for the best Ensemble

With the Feature Engineering taken care of, we had all the information that needed to be extracted from our Corpus. The next step was to build the best possible Ensemble model.

We ran Latent Dirichlet Allocation on the Corpus using the Bag of Words approach, and the resulting Probability Distributions showed that some of the Categories were quite similar to one another. Categories such as “Toxic”, “Severe Toxic”, and “Obscene” are separated by only a small margin at the Decision Boundary. This at least suggests that creating a single model to predict all categories may be Time Consuming.

Our Strategy: come up with the best model for each category. Concentrate on Parameter Tuning of the Individual Models, tuning each one until we get the best possible model for its category. Finally, ensemble these models.
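
The strategy can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the model names (`lstm_a`, `lgbm_a`), the scores, the probabilities, and the 2:1 weighting are all hypothetical, and a weighted average is only one of several possible ways to ensemble; the post does not say which blending rule was used.

```python
def best_model_per_category(val_scores):
    """Pick, for every category, the model with the highest validation score.

    val_scores maps model name -> {category: validation score}.
    """
    categories = next(iter(val_scores.values())).keys()
    return {
        cat: max(val_scores, key=lambda name: val_scores[name][cat])
        for cat in categories
    }


def blend(predictions, weights):
    """Weighted average of model probabilities, computed separately per category.

    predictions maps model name -> {category: probability}.
    weights maps category -> {model name: non-negative weight}.
    """
    blended = {}
    for cat, model_weights in weights.items():
        total = sum(model_weights.values())
        blended[cat] = sum(
            predictions[name][cat] * w for name, w in model_weights.items()
        ) / total
    return blended


# Hypothetical validation scores, for illustration only.
val_scores = {
    "lstm_a": {"Toxic": 0.97, "Obscene": 0.95},
    "lgbm_a": {"Toxic": 0.96, "Obscene": 0.98},
}
winners = best_model_per_category(val_scores)

# Hypothetical predicted probabilities for a single comment.
predictions = {
    "lstm_a": {"Toxic": 0.80, "Obscene": 0.40},
    "lgbm_a": {"Toxic": 0.60, "Obscene": 0.70},
}
# Assumed weighting: the per-category winner counts twice as much as the other model.
weights = {
    cat: {name: (2.0 if name == winners[cat] else 1.0) for name in val_scores}
    for cat in winners
}
ensemble = blend(predictions, weights)
```

Keeping the weights per category is what lets a model that is strong on one category dominate there without dragging down the others.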

This paved the way for arriving at the best ensemble model, as each of the individual models now has less correlation with the others.
Also, conceptually, we had two classes of Models: the Deep Learning Sequential Models and the Machine Learning Bag of Words Models. This again supported our Ensembling idea, since there will be certain categories that the Sequential Models are particularly good at, and likewise for the Bag of Words Models.
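
One simple way to check how correlated two models are is to compute the Pearson correlation between their predicted probabilities on the same validation comments. The sketch below is an assumption about how such a check might be done, not a description of the original workflow, and the prediction values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of predictions."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant predictions")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical "Toxic" probabilities from one Sequential and one Bag of Words model.
sequential_preds = [0.9, 0.2, 0.7, 0.1]
bag_of_words_preds = [0.8, 0.3, 0.4, 0.2]
correlation = pearson(sequential_preds, bag_of_words_preds)
```

A value close to 1 means the two models make nearly the same predictions and add little to each other; lower values suggest they capture different signals and are better candidates for an ensemble.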

LSTM (Long Short-Term Memory) is a special type of Recurrent Neural Network. Our Sequential Models consisted of LSTM models, each of which was tuned to have an edge in predicting one or at most two categories.

Our Bag of Words Models consisted of LightGBM models, each with its own parameters and its own strengths in a category.
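
To make concrete what a Bag of Words Model sees, the sketch below turns a comment into a vector of word counts. The tokenizer, the vocabulary, and the function name `bag_of_words` are assumptions for illustration; the LightGBM training step itself is not shown.

```python
import re
from collections import Counter


def bag_of_words(comment, vocabulary):
    """Count how often each vocabulary word occurs in the comment (order is ignored)."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and comments, for illustration only.
vocabulary = ["you", "are", "not", "stupid", "he", "is"]
first = bag_of_words("He is stupid, you are not", vocabulary)
second = bag_of_words("You are stupid, he is not", vocabulary)
same_vector = first == second
```

Both comments map to the same vector even though the insult is aimed at different people, which is the kind of information only a Sequential Model can pick up.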

Illustration: As a part of our Ensembling Strategy, Sequential and Bag of Words Models were built for each Toxic Category.

Conclusion: Text Categorization can be handled efficiently if we combine the Bag of Words and Sequential Approaches. Also, emotions play a key role in Text Categorization for Classification Problems such as Toxicity.

Contributors

Anshuman Neog: Anshuman is a Consultant at Affine. He is a Deep Learning & Natural Language Processing enthusiast.

Ansuman Chand: Ansuman is a Business Analyst at Affine. His interests involve real-world implementation of Computer Vision, Machine Intelligence, and Cognitive Science. He wishes to actively leverage ML and DL techniques to build solutions.

About Author

Affine is a leading AWS select consulting partner renowned for providing cutting-edge cloud services on the AWS platform.
