Timeline & Issues

The Challenge

  • Find a model that was more than 90% accurate at predicting tennis point outcomes.

The Stumbling Blocks

  • Non-binary classification: each point has one of three outcomes (F, U, W), not two.
  • Whether to use logistic or linear regression.
  • Mixed feature types: numeric, logical, and string.
  • A nearly unlimited space of possible models and features.
  • Comparing the accuracy of different models (see the sketch after this list).
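
One way to run such a comparison is cross-validated accuracy. The following is a minimal sketch in Python with scikit-learn and xgboost; the file name points.csv, the outcome column, and the candidate models are hypothetical placeholders, not the project's actual setup.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    points = pd.read_csv("points.csv")                    # hypothetical dataset
    X = pd.get_dummies(points.drop(columns=["outcome"]))  # numeric, logical, and string features
    y = LabelEncoder().fit_transform(points["outcome"])   # F/U/W -> 0/1/2

    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(n_estimators=300),
        "xgboost": XGBClassifier(objective="multi:softprob"),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: mean accuracy {scores.mean():.3f}")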

XGBoost

  • Boosting concentrates on the data points that are hard to classify.
  • It reduces both the bias and variance components of the error.
  • Successive trees can be trained on different subsamples of the data, which helps reduce variance.
  • It has difficulty learning when the dataset is noisy.
  • It is easy to overfit on the training data.


The initial XGBoost model was more accurate than the random forest model, reaching approximately 90-91% accuracy.
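
As a rough illustration, here is a minimal sketch of training such a model with xgboost's scikit-learn API, reusing X and y from the comparison sketch above; the split ratio and hyperparameters are illustrative assumptions, not the values actually used.

    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Hold out 20% of points for measuring accuracy (split ratio is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = XGBClassifier(
        objective="multi:softprob",  # three outcome classes: F, U, W
        n_estimators=500,            # illustrative hyperparameters
        max_depth=6,
        learning_rate=0.1,
    )
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))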

Feature Engineering & Combining Different Models

Feature Engineering

The goal is to apply domain knowledge of the dataset to construct new variables that better capture the variation the model is trying to explain.
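
For example, a minimal sketch of the idea, reusing the points DataFrame loaded in the first sketch; the column names (serve_speed, rally_count, net_approaches) and thresholds are hypothetical, and the actual dataset's fields may differ.

    # Derive new variables from hypothetical raw columns.
    points["speed_per_shot"] = points["serve_speed"] / points["rally_count"].clip(lower=1)
    points["is_long_rally"] = (points["rally_count"] > 8).astype(int)  # threshold is illustrative
    points["approached_net"] = (points["net_approaches"] > 0).astype(int)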

Multiple Models

The goal was to collect predictions from several models and take a majority vote on the outcome that appeared most frequently. Unfortunately, having categorical results made it difficult to weight the vote towards the models that were more accurate.
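
A minimal sketch of this majority-vote approach, using scikit-learn's VotingClassifier with hard voting; it reuses the X_train/y_train split from the earlier sketch, and the constituent models are illustrative choices.

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    vote = VotingClassifier(
        estimators=[
            ("xgb", XGBClassifier(objective="multi:softprob")),
            ("rf", RandomForestClassifier(n_estimators=300)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="hard",  # each model casts one categorical vote per point
    )
    vote.fit(X_train, y_train)
    print("ensemble accuracy:", vote.score(X_test, y_test))

With voting="hard", every model's vote counts equally, which is exactly the limitation described above; VotingClassifier does accept a weights parameter, but choosing good weights from purely categorical outputs is the hard part.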

Feature Selection & Final Model

Unfortunately, the additional features had the undesired effect of decreasing accuracy; although some improvement was noticeable, the overall results were unfavourable.


Our methodology was:

  • Choose lasso regression for feature selection.
  • Reduce noise between variables by keeping only the significant ones.
  • Train the final XGBoost model on the features selected by the lasso (see the sketch below).
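
A minimal sketch of that pipeline, reusing the earlier train/test split; because the outcome is categorical, this uses L1-penalized (lasso-style) logistic regression for the selection step, and the regularization strength C is illustrative.

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    # Lasso-style selection: the L1 penalty drives insignificant coefficients to zero.
    lasso = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
    selector = SelectFromModel(lasso).fit(X_train, y_train)

    # Keep only the features the lasso considered significant.
    X_train_sel = selector.transform(X_train)
    X_test_sel = selector.transform(X_test)

    final_model = XGBClassifier(objective="multi:softprob")
    final_model.fit(X_train_sel, y_train)
    print("final accuracy:", final_model.score(X_test_sel, y_test))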

Conclusions

  • Starting is important; the first model may not be perfect.
  • Tuning models takes time and patience.
  • When removing correlated variables, do so sparingly.
  • A variable's performance can differ drastically between models.

Confusion matrix (actual vs. predicted):

             Predicted F  Predicted U  Predicted W    Sum
  Actual F          2896         1016           42   3954
  Actual U           468         4081           29   4578
  Actual W             6           38         3902   3946
  Sum               3370         5135         3973  12478