Industry Equipment Failures
Problem:
The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurised air used for various functions in a truck, such as braking and gear changing. The challenge is to develop a machine learning model that detects whether a truck's APS is defective with high precision while minimizing the cost metric defined below. The dataset's positive (defective) class consists of component failures for a specific component of the APS; the negative (non-defective) class consists of trucks with failures in components not related to the APS.
Data:
- The Industry Equipment Failures data is split into two classes, with around 60,000 records, 171 features, and a highly imbalanced class-label ratio.
- The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.
- Each record describes the machinery using anonymized feature names, which makes it harder to relate trends and insights to real-life behaviour.
- Cost metric of misclassification:

  Total_cost = (No_Instances_with_False_Negatives * False_Negative_Cost) +
               (No_Instances_with_False_Positives * False_Positive_Cost),
  where False_Positive_Cost = 10 and False_Negative_Cost = 500.
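As an illustration, the cost metric can be computed directly from true and predicted labels (a minimal sketch; the function name and argument order are my own):

```python
def total_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Compute the APS cost metric. Labels are 1 for the positive
    (defective) class and 0 for the negative class."""
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_negatives * fn_cost + false_positives * fp_cost

# One missed defect (FN) plus one false alarm (FP):
print(total_cost([1, 1, 0, 0], [0, 1, 1, 0]))  # 500 + 10 = 510
```

Because a false negative costs 50 times as much as a false positive, the metric heavily rewards high recall on the defective class.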
Approach:
- Data Parsing
- Features with more than 75% missing values are removed; 6 features are removed in total.
- Missing values in the remaining columns are filled using mean imputation and median imputation (the two variants are compared).
- The class imbalance is compensated for by applying SMOTE (Synthetic Minority Oversampling Technique).
- Standardizing/ Normalizing the data
- An ROC/precision-recall curve is used to choose a classification threshold such that recall is high and the total cost is minimized. The threshold is selected on a validation set using cross-validation.
- Model training and testing. The following models were trained:
- K-Nearest Neighbours
- Logistic Regression
- Random Forest
- XGBoost
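The SMOTE step above creates synthetic minority samples by interpolating between a minority point and one of its minority-class nearest neighbours. A simplified, pure-Python sketch of that idea (the actual pipeline would use `imblearn.over_sampling.SMOTE`; the function name, `k`, and the sampling scheme here are illustrative assumptions):

```python
import random

def smote_like(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority-class points by linear interpolation
    between a randomly chosen minority point and one of its k nearest
    minority-class neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Each synthetic point lies on the segment between two real minority samples, so the oversampled region stays inside the minority class's feature space rather than duplicating exact records.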
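The threshold-selection step can be sketched by sweeping candidate thresholds over validation-set probabilities and keeping the one with the lowest total cost (an illustrative sketch; variable and function names are my own):

```python
def pick_threshold(y_val, probs, fp_cost=10, fn_cost=500):
    """Return the probability threshold (and its cost) that minimizes the
    total misclassification cost on a validation set, given predicted
    positive-class probabilities."""
    best_threshold, best_cost = 0.5, float("inf")
    # Candidate thresholds: every distinct predicted probability.
    for threshold in sorted(set(probs)):
        preds = [1 if p >= threshold else 0 for p in probs]
        fn = sum(1 for t, p in zip(y_val, preds) if t == 1 and p == 0)
        fp = sum(1 for t, p in zip(y_val, preds) if t == 0 and p == 1)
        cost = fn * fn_cost + fp * fp_cost
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost
```

Since a false negative is 50 times more expensive than a false positive, the cost-optimal threshold typically lands well below the default 0.5, trading extra false alarms for fewer missed defects.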
Conclusions:
- Median-imputed features with an XGBoost model (depth 15, 300 learners) gave the best result: a total cost of $13,700.
Technologies Used: Machine Learning, Python