Predict pricing for tube assemblies used in Caterpillar equipment

Introduction

This is my Capstone project for General Assembly’s full-time 12-week Data Science Immersive course. It is heavily inspired by a 2015 Kaggle competition (Kaggle Caterpillar tube assembly pricing), but with some additional goals. I wanted to create not just a pricing model, but also:

a) Clearly formulate and explain key hypotheses regarding factors that drive price

b) Understand and set lower and upper bounds for model performance

c) Show how and why key steps in cleaning, EDA, feature engineering, model building and optimization were executed

d) Understand the main factors that drive pricing, and present recommendations based on these

Here’s a picture of a typical tube assembly that is used to pneumatically control various operations in Caterpillar’s mining and construction equipment.

Report Organization

I will present some key steps and results in this post. A detailed presentation covering all the aspects discussed above, is embedded below.

Executive Summary

Quoted price is a non-linear function of various factors.

  1. Order quantity is the most significant factor affecting quoted price
  2. Tube assembly diameter and length are important predictors of price, are likely a proxy for amount of material used in building tube assemblies
  3. Annual usage (per tube assembly), total annual usage (per supplier) and the length of time Caterpillar has had a relationship with supplier, all play a role in price
  4. An important area of focus for price negotiations/reductions should be larger diameter tube assemblies … bulk procurement of these would help
  5. Materials of construction don’t seem to significantly affect price. This may present an opportunity to ‘upscale’ materials for performance reasons if desired by the technical team, without significant price impact

Key results

A. Price is strongly dependent on quantity ordered.

B. Certain component types (ex. CP-028 - straight adapter) are predominantly present in greater numbers in lower priced assemblies.

C. Parts with higher wall thickness (= more material) seem to be priced higher.

D. An Extreme Gradient Boosting (xgboost) model performs very well with the right features (note: R2 refers to the fit between predicted and actual costs).

E. Feature importances from an optimized xgboost model are shown below.

Discussion

I want to briefly discuss some learnings from planning and executing this project.

Firstly, coming up with a methodical high-level plan of attack really helped. Without this, it would have been easy to get lost in the weeds of EDA, model building or optimization without making timely progress. It also helped to set clear lower and upper bounds for model performance targets. Any model that performs worse than the lower bound is useless. And the upper bound helps guide decision-making along the way - for ex. would the model get any better if additional time is spent on model optimization.

Secondly, thinking through the lens of cost breakdowns from the supplier’s perspective helped formulate key hypotheses to explore. It not only helped to articulate hypotheses to test during exploratory data analysis, but was also instrumental in engineering the right features for a good predictive model.

Lastly, from a model building and optimization perspective, log transformation of the target variable (price) really helped in building a better nonlinear model, and made intuitive sense. The transformation not only penalizes an under-predicted point more than an over-predicted one (important for budgeting), but also helps place higher importance on differences at small values versus the same differences at larger values. For example, a $1 difference in prediction at a price of $2 is much more important than a $1 difference in prediction at $100. It was also important to realize that using a linear combination of features as inputs (i.e. PCA), and scaling the data prior to model fitting actually hurt model performance in this case. In the case of scaling, this could be due to potentially ‘smoothing’ out any non-linearities by transforming to a scaled feature space. In the case of PCA, using a linear combinations of features may have resulted in choosing the wrong features to combine non-linearly in the model.