Quora Question Pair Similarity

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. So the objective of this project is to predict whether pair of similarly worded questions are semantically similar or not. Built a deep-learning model using Siamese Long short-term memory(LSTM) network to find the semantic similarity of question pairs from Quora's public dataset.

Data:

Quora's Question Pair Similarity dataset is distributed across 2 categories (similar or not) with over 0.4 million records of question pairs. Each record in the train dataset contains couple of question texts, their respective id's and a binary class label indicating whether the questions are similar or not.

Business Objectives and Performance Metric:

The cost of a mis-classification can be very high. Calculated probability of a pair of questions to be duplicates and choosen a threshold to determine question pairs as similar.
No strict latency concerns.
Interpretability is partially important.
Used log-loss & Binary Confusion Matrix as perfomance metric for this problem.

Models:

Siamese LSTM neural network

Conclusions:

Using feature engineering techniques, I was able to minimize the log loss to 0.28
Further, we can try things like transfer learning to pretrain the LSTM network and create data using data augmentation to improve the performance metric.

Technologies Used: Deep Learning, Python