Amazon Fine Food Reviews

Problem:

The challenge is to build a machine learning model that classifies a given food review as either positive or negative, with high precision and recall. Each review comes with a 'rating' attribute with values from 1 to 5, which I used as the class label for supervised training and for predicting class labels on the test dataset. This is a classic sentiment analysis problem: for every food review, its polarity has to be predicted.
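A 1 to 5 rating has to be collapsed into two classes before it can serve as a binary label. The write-up does not say how this was done, so the sketch below assumes a common convention: ratings above 3 are positive, ratings below 3 are negative, and rating 3 is treated as neutral and dropped. The function name `rating_to_label` and the sample reviews are hypothetical.

```python
from typing import Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary sentiment label.

    Assumed convention (not stated in this write-up):
    4-5 -> positive (1), 1-2 -> negative (0), 3 -> neutral (None, to be dropped).
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be in 1..5, got {rating!r}")
    if rating == 3:
        return None
    return 1 if rating > 3 else 0


# Hypothetical (rating, text) pairs, for illustration only.
reviews = [
    (5, "Great taste"),
    (1, "Stale and bland"),
    (3, "It was okay"),
    (4, "Would buy again"),
]

# Keep only the reviews that received a positive or negative label.
labeled = [
    (text, rating_to_label(rating))
    for rating, text in reviews
    if rating_to_label(rating) is not None
]
```

Dropping the neutral reviews keeps the two classes cleanly separated; whether that matches the actual project is an assumption.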

Data:

The Amazon Fine Food Reviews dataset contains around 0.56 million reviews with 10 attributes, distributed across 2 categories. Each record holds information about the user, the review text, the timestamp of the review, etc.

Approach:

Performed Exploratory Data Analysis (EDA) on the Amazon Fine Food Reviews dataset and drew helpful insights by plotting word clouds, distplots, histograms, etc.
Performed data cleaning and preprocessing: removed unnecessary and duplicate rows and, for the review text, removed HTML tags, punctuation and stopwords, then stemmed the words using the Porter Stemmer.
Based on the insights drawn from EDA, engineered new features and added them to the dataset.
Plotted t-SNE plots for the different featurizations of the data, viz. BOW (uni-gram), TF-IDF, Avg-Word2Vec and TF-IDF-Word2Vec.
Built machine learning models.
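The text cleaning step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the stopword list is a tiny stand-in, the names `clean_review` and `drop_duplicates` are hypothetical, and stemming is left as an optional callable so that a real stemmer (for example the `stem` method of NLTK's `PorterStemmer`) can be plugged in.

```python
import re
import string
from typing import Callable, Iterable, List, Optional

# Tiny illustrative stopword list; an assumption, not the list used in the project.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "is", "it", "this", "of", "to", "was", "i"}
)

TAG_RE = re.compile(r"<[^>]+>")  # matches HTML tags such as <br />
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_review(
    text: str,
    stopwords: Iterable[str] = STOPWORDS,
    stem: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Strip HTML tags and punctuation, lowercase, drop stopwords, optionally stem."""
    stopword_set = set(stopwords)
    text = TAG_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE).lower()
    tokens = [tok for tok in text.split() if tok not in stopword_set]
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return tokens


def drop_duplicates(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each review text, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique
```

For example, `clean_review("<br />This is GREAT, the best coffee!")` yields the tokens `great`, `best` and `coffee`.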

Models:

Applied the following machine learning models to the different featurizations of the data, viz. BOW (uni-gram), TF-IDF, Avg-Word2Vec and TF-IDF-Word2Vec:

K-Nearest Neighbours
Naive Bayes
Logistic Regression
Decision Tree
Random Forest
XGBoost
Recurrent Neural Networks (LSTM)
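Of the featurizations named above, Avg-Word2Vec and TF-IDF-Word2Vec are the least self-explanatory: both turn a review into one fixed-length vector by combining its word vectors, either with equal weights or with TF-IDF weights. The sketch below shows the idea in plain Python. The function names, the smoothed IDF formula and the word vectors passed in are assumptions for illustration; the project's own vectors and weighting may differ.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def idf_weights(tokenized_docs: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def avg_word2vec(
    tokens: Sequence[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Plain average of the vectors of all tokens that have a vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def tfidf_word2vec(
    tokens: Sequence[str],
    vectors: Dict[str, List[float]],
    idf: Dict[str, float],
    dim: int,
) -> List[float]:
    """Average of word vectors weighted by each word's TF-IDF in this review."""
    counts = Counter(t for t in tokens if t in vectors and t in idf)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * dim
    weights = {t: (c / total) * idf[t] for t, c in counts.items()}
    weight_sum = sum(weights.values())
    return [
        sum(w * vectors[t][i] for t, w in weights.items()) / weight_sum
        for i in range(dim)
    ]
```

With TF-IDF weighting, words that are rare across the corpus pull the review vector toward themselves more strongly than words that appear in most reviews.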

Conclusions:

Achieved 94% accuracy with both Random Forest and XGBoost, with the optimal number of base learners = 10 and depth = 10.
For dimensionality reduction, Truncated SVD helped me find the smallest number of dimensions that retains the maximum information.
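One way to pick the number of dimensions is to look at the cumulative share of information carried by the leading singular values and take the smallest count that crosses a threshold. The sketch below illustrates this with a full SVD from NumPy on a small dense matrix, as a stand-in for Truncated SVD on the real sparse features (scikit-learn's `TruncatedSVD` exposes `explained_variance_ratio_` for a similar purpose). The function name, the 0.95 threshold and the use of squared singular values as the information measure are assumptions, not values from the project.

```python
import numpy as np


def min_components_for_threshold(X: np.ndarray, threshold: float = 0.95):
    """Return the smallest k whose top-k squared singular values reach
    `threshold` of the total, together with the cumulative share curve."""
    singular_values = np.linalg.svd(X, compute_uv=False)  # sorted descending
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First index where the cumulative share reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off leaving the last value just below 1.
    k = min(k, len(singular_values))
    return k, cumulative
```

Plotting the returned cumulative curve against the number of components gives the usual elbow-style view for choosing the cut-off.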

Technologies Used: Machine Learning, Python