Amazon Fine Food Reviews

Problem:

The challenge is to build a machine learning model that classifies a given food review into one of two categories, positive or negative, with high precision and recall. Each review comes with a 'rating' attribute taking values 1 to 5, which I used as the class label for supervised training and for predicting the class labels of the test dataset. It is a classic sentiment analysis problem: for every food review, its polarity has to be predicted.
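The text above does not say how the 1 to 5 rating becomes a two-class label, so the following is only a minimal sketch under a common assumption: ratings above 3 count as positive, ratings below 3 as negative, and 3 is treated as neutral and dropped. The function names `rating_to_label` and `precision_recall` are hypothetical and not taken from the project.

```python
from typing import Optional, Sequence, Tuple


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-5 rating to 1 (positive), 0 (negative) or None (neutral).

    Assumption: 4-5 is positive, 1-2 is negative, 3 is dropped.
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be 1-5, got {rating!r}")
    if rating == 3:
        return None
    return 1 if rating > 3 else 0


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[float, float]:
    """Compute precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy ratings, not real data: the neutral rating (3) is filtered out.
ratings = [5, 1, 3, 4, 2]
labels = [lab for lab in map(rating_to_label, ratings) if lab is not None]
```

Reporting precision and recall alongside accuracy matters if one class is much more frequent than the other, since accuracy alone can then look good for a model that mostly predicts the majority class.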

Data:

The Amazon Fine Food Reviews data contains around 0.56 million reviews across 2 categories, with 10 attributes per review. Each record in the dataset has information about the user, the review text, the timestamp of the review, etc.

Approach:
  • Performed Exploratory Data Analysis (EDA) on the Amazon Fine Food Reviews dataset and drew helpful insights by plotting word clouds, distplots, histograms, etc.
  • Performed data cleaning and preprocessing by removing unnecessary and duplicate rows; for the review text, removed HTML tags, punctuation and stopwords, and stemmed the words using the Porter Stemmer.
  • Based on the insights drawn from EDA, engineered new features and added them to the dataset.
  • Plotted t-SNE plots for different featurizations of the data, viz. BOW (uni-gram), TF-IDF, Avg-Word2Vec and TF-IDF-Word2Vec.
  • Built machine learning models.
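The cleaning step in the list above can be sketched roughly as follows. This is a dependency-free illustration, not the project's actual code: the stopword list is a tiny hypothetical stand-in, the function names are made up for the example, and the Porter stemming step is left out (it could be added with a stemmer such as NLTK's `PorterStemmer`).

```python
import html
import re
from typing import Iterable, List

# Hypothetical, deliberately tiny stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "to", "was", "i"}

TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <br />
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # anything except lowercase letters and whitespace


def drop_duplicate_reviews(reviews: Iterable[str]) -> List[str]:
    """Remove exact duplicate review texts, keeping the first occurrence of each."""
    return list(dict.fromkeys(reviews))


def clean_review(text: str) -> List[str]:
    """Strip HTML tags and punctuation, lowercase, and drop stopwords."""
    text = TAG_RE.sub(" ", text)                 # remove HTML tags
    text = html.unescape(text)                   # decode entities such as &amp;
    text = NON_ALPHA_RE.sub(" ", text.lower())   # remove punctuation and digits
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = clean_review("<br />This is a GREAT product, I loved it!")
```

Deduplicating before cleaning keeps the example simple; in practice duplicates might instead be identified on a combination of attributes such as user and timestamp, which is not shown here.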

Models:

Applied the following machine learning models on different featurizations of the data, viz. BOW (uni-gram), TF-IDF, Avg-Word2Vec and TF-IDF-Word2Vec:

  • K-Nearest Neighbours
  • Naive Bayes
  • Logistic Regression
  • Decision Tree
  • Random Forest
  • XGBoost
  • Recurrent Neural Networks (LSTM)
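To make the two Word2Vec-based featurizations concrete, here is a minimal sketch of how a tokenized review could be turned into a single vector. Everything in it is illustrative: the `vectors` dictionary holds toy 2-dimensional values standing in for a trained Word2Vec model, the IDF formula is one common smoothed variant chosen as an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def idf_weights(docs: Sequence[Sequence[str]]) -> Dict[str, float]:
    """IDF per token as log(N / df) + 1 (an assumed smoothing variant)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def avg_word2vec(tokens: Sequence[str], vectors: Dict[str, List[float]],
                 dim: int) -> List[float]:
    """Plain average of the vectors of all tokens found in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def tfidf_word2vec(tokens: Sequence[str], vectors: Dict[str, List[float]],
                   idf: Dict[str, float], dim: int) -> List[float]:
    """Average of word vectors weighted by each token's TF-IDF score."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in vectors or tok not in idf:
            continue  # skip tokens without a vector or an IDF value
        weight = (count / len(tokens)) * idf[tok]
        for i in range(dim):
            total[i] += weight * vectors[tok][i]
        weight_sum += weight
    if weight_sum == 0.0:
        return [0.0] * dim
    return [x / weight_sum for x in total]


# Toy embeddings and corpus, purely for illustration.
vectors = {"great": [1.0, 0.0], "bad": [0.0, 1.0], "taste": [1.0, 1.0]}
docs = [["great", "taste"], ["bad", "taste"], ["great", "great", "bad"]]
idf = idf_weights(docs)
review_vec = tfidf_word2vec(docs[2], vectors, idf, dim=2)
```

Both functions produce a fixed-length vector per review regardless of review length, which is what lets the same featurization feed distance-based, linear and tree-based models alike.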

Conclusions:
  • Achieved 94% accuracy with both Random Forest and XGBoost, with the optimal number of base learners = 10 and depth = 10.
  • While performing dimensionality reduction, Truncated SVD helped me find the smallest number of dimensions that still retains most of the information.
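One way to pick the number of dimensions mentioned in the last point is to look at the cumulative explained variance and take the smallest count that reaches a target. The sketch below assumes the per-component ratios are already available (for example from the `explained_variance_ratio_` attribute of scikit-learn's `TruncatedSVD`); the 0.95 default target and the function name `min_components` are assumptions, since the criterion actually used is not stated here.

```python
from itertools import accumulate
from typing import Optional, Sequence


def min_components(explained_variance_ratio: Sequence[float],
                   target: float = 0.95) -> Optional[int]:
    """Smallest number of leading components whose cumulative
    explained variance ratio reaches `target`, or None if it never does."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target:
            return k
    return None


# Toy ratios, not results from the project.
ratios = [0.5, 0.25, 0.125, 0.0625]
k = min_components(ratios, target=0.9)
```

A `None` result signals that the fitted components do not reach the target, in which case the SVD would need to be refitted with more components.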

Technologies Used: Machine Learning, Python