Amazon Fine Food Reviews
Problem:
The challenge is to build a machine learning model that classifies a given food review as either positive or negative, with high precision and recall. Each review comes with a 'rating' attribute with values from 1 to 5, which I used as the class label for supervised training and for predicting the class labels of the test dataset. It is a classic sentiment analysis problem: for every food review, its polarity has to be predicted.
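Turning a 1 to 5 rating into a two-class label can be sketched as below. The thresholds are an assumption, not something stated in this README: ratings 4 and 5 are treated as positive, 1 and 2 as negative, and 3 as neutral and dropped. The name `rating_to_label` is hypothetical.

```python
from typing import Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary polarity label.

    Assumed convention (not confirmed by the README):
    4-5 -> 1 (positive), 1-2 -> 0 (negative), 3 -> None (neutral, dropped).
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be in 1..5, got {rating}")
    if rating == 3:
        return None
    return 1 if rating > 3 else 0


# Made-up ratings, for illustration only.
ratings = [5, 1, 3, 4, 2]
labels = [rating_to_label(r) for r in ratings]

# Keep only the reviews that received a polarity label.
kept = [(r, y) for r, y in zip(ratings, labels) if y is not None]
```

Dropping the neutral ratings keeps the two classes clearly separated, at the cost of discarding some data.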
Data:
The Amazon Fine Food Reviews data contains around 0.56 million reviews with 10 attributes, distributed across 2 categories. Each record has information about the user, the review text, the timestamp of the review, etc.
Approach:
- Performed Exploratory Data Analysis (EDA) on the Amazon Fine Food Reviews dataset and drew helpful insights by plotting word clouds, distplots, histograms, etc.
- Performed data cleaning and preprocessing: removed unnecessary and duplicate rows and, for the review text, removed HTML tags, punctuation and stopwords, then stemmed the words using the Porter Stemmer
- Based on the insights drawn from EDA, engineered new features and added them to the dataset
- Plotted t-SNE plots for the different featurizations of the data, viz. BOW (uni-gram), TF-IDF, Avg-Word2Vec and TF-IDF-Word2Vec
- Built machine learning models
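The text cleaning step above could look roughly like the following sketch. It is not the exact pipeline used here: the stopword list is a tiny made-up sample, the function names are hypothetical, and de-duplicating on the pair (user, text) is an assumption. Stemming uses NLTK's `PorterStemmer` when NLTK is installed and is skipped otherwise.

```python
import html
import re
from typing import Iterable, List, Tuple

try:
    # NLTK's Porter stemmer needs no extra corpus download.
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:
    # Fallback so the sketch still runs without NLTK: no stemming.
    def _stem(word: str) -> str:
        return word

# Tiny illustrative stopword list (an assumption, not the list used in the project).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "to", "i", "was", "in"}

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <br/>
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, other symbols


def clean_review(text: str) -> List[str]:
    """Strip HTML tags and punctuation, drop stopwords, stem the remaining words."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = NON_ALPHA_RE.sub(" ", text.lower())
    return [_stem(tok) for tok in text.split() if tok not in STOPWORDS]


def drop_duplicate_rows(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Remove repeated (user, text) pairs, keeping the first occurrence."""
    return list(dict.fromkeys(rows))


tokens = clean_review("This is <br/>the BEST coffee, I loved it!")
rows = drop_duplicate_rows([("u1", "great"), ("u1", "great"), ("u2", "great")])
```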
Models:
Applied the following machine learning models to the different featurizations of the data, viz. BOW (uni-gram), TF-IDF, Avg-Word2Vec and TF-IDF-Word2Vec:
- K-Nearest Neighbours
- Naive Bayes
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost
- Recurrent Neural Networks (LSTM)
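To make the difference between Avg-Word2Vec and TF-IDF-Word2Vec concrete, here is a minimal pure-Python sketch. The 2-dimensional word vectors are made up for illustration (in practice they would come from a trained Word2Vec model), the IDF formula `ln(N / df) + 1` is just one common variant, and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

# Toy word vectors, invented for this example.
WORD_VECS: Dict[str, List[float]] = {
    "good": [1.0, 0.0],
    "coffee": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency per word: ln(N / df) + 1."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}


def sentence_vector(doc: List[str],
                    weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Weighted mean of the word vectors in a review.

    weights=None gives the plain average (Avg-Word2Vec). Passing IDF
    weights gives a TF-IDF weighted mean, because a word that occurs
    several times contributes once per occurrence (its raw term count).
    """
    dim = len(next(iter(WORD_VECS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in doc:
        vec = WORD_VECS.get(word)
        if vec is None:  # skip out-of-vocabulary words
            continue
        w = 1.0 if weights is None else weights.get(word, 0.0)
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector for reviews with no known words
    return [x / weight_sum for x in total]


docs = [["good", "coffee"], ["bad", "coffee"], ["good", "good", "coffee"]]
idf = idf_table(docs)
avg_vec = sentence_vector(docs[0])         # plain average
tfidf_vec = sentence_vector(docs[0], idf)  # TF-IDF weighted average
```

In this toy corpus "coffee" appears in every review, so its IDF is the lowest and the TF-IDF weighted vector leans toward the rarer, more informative word "good".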
Conclusions:
- Achieved 94% accuracy with both Random Forest and XGBoost, with the optimal number of base learners = 10 and depth = 10
- During dimensionality reduction, Truncated SVD helped me find the smallest number of dimensions that retains the most information.
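One way to pick the smallest number of dimensions is to look at the cumulative explained variance, as in the sketch below. It assumes the per-component ratios are already available (scikit-learn's `TruncatedSVD` exposes them as `explained_variance_ratio_` after fitting). The 90% target, the function name `smallest_k` and the ratio values are assumptions made for illustration, not results from this project.

```python
from itertools import accumulate
from typing import Sequence


def smallest_k(explained_variance_ratio: Sequence[float], target: float = 0.90) -> int:
    """Return the smallest number of leading components whose
    cumulative explained variance reaches `target`."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target:
            return k
    raise ValueError("target not reached; fit the SVD with more components")


# Made-up ratios, for illustration only.
ratios = [0.40, 0.25, 0.15, 0.12, 0.05]
k = smallest_k(ratios, target=0.90)
```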