kNN & Neural Network

a. kNN (K-Nearest Neighbour)

  • The kNN method classifies by majority vote: if most of the k nearest samples in the feature space (i.e. the most similar samples) belong to a certain category, the query sample is assigned to that category as well.
  • After normalizing the tennis-related variables with the min-max function (x - min(x)) / (max(x) - min(x)), which limits all numeric data to the range [0, 1], we train our model with the knn() function.
  • Libraries used: class, KODAMA, caret
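The report uses R's knn() from the class package; as a hedged illustration, the normalization formula above and the majority-vote rule can be sketched in plain Python (function names here are illustrative, not from the report):

```python
from collections import Counter
import math

def min_max_normalize(values):
    """Scale a numeric column to [0, 1] via (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Normalizing first matters because kNN is distance-based: without it, variables on large scales (e.g. rally length) would dominate the distance computation.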

b. Neural Network

  • The motivation behind using a neural network on the AO Tennis dataset was to exploit its ability to learn non-linear relationships and build an accurate classifier for the point endings.
  • To assess how well a neural network would classify the dataset, we first divided the data into a training set (67%) and a test set (33%). This allowed us to choose the number of neurons in the hidden layer based on the model's test-set accuracy.
  • Libraries used: neuralnet
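The report builds the network with R's neuralnet package; the 67%/33% split it describes can be sketched in stdlib Python (the function name and seed are illustrative assumptions):

```python
import random

def train_test_split(rows, train_frac=0.67, seed=42):
    """Shuffle rows and split into a training set (67%) and a test set (33%)."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# Hidden-layer selection as described: train one model per candidate size
# and keep the size with the best test-set accuracy (training call omitted,
# since the report does not show its exact neuralnet() invocation).
```

A fixed seed makes the split reproducible, so accuracy differences between candidate hidden-layer sizes reflect the architecture rather than the split.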

c. XGBoost

  • The motivation behind using XGBoost was its iterative re-weighting of observations during the tree-building process and its built-in regularisation.
  • The issue with a random forest model (our base model) was its random nature: trees are grown independently on bootstrap samples, so later trees cannot correct the errors of earlier ones the way boosted trees do.
  • All explanatory variables were converted from character strings to numeric values.
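The report does not say which encoding scheme was used for the character-to-numeric conversion; one common choice is integer (label) encoding, sketched here in stdlib Python with an illustrative function name:

```python
def encode_column(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]
```

Tree-based models such as XGBoost only split on thresholds, so they tolerate such arbitrary integer codes better than distance-based methods like kNN would.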

Parameter Tuning

  • Maximum tree depth: 6
  • Classification Objective: Multiclass Probabilities
  • Evaluation Metric: Multiclass Log-Loss
  • Stopping Rounds Threshold: 200
  • Learning Rate: 0.15
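The tuned values above can be expressed as a parameter dictionary using the key names of the xgboost Python API; the report's exact call is not shown, so this mapping is an assumption:

```python
# Hypothetical parameter dictionary mirroring the tuned values above
# (xgboost Python API key names).
params = {
    "max_depth": 6,                 # maximum tree depth
    "objective": "multi:softprob",  # multiclass probabilities
    "eval_metric": "mlogloss",      # multiclass log-loss
    "learning_rate": 0.15,          # a.k.a. eta
}
# A real multi:softprob run would also need "num_class" set to the number
# of point-ending categories. Early stopping is passed to the training
# call rather than the params dict, e.g.:
#   xgb.train(params, dtrain, evals=..., early_stopping_rounds=200)
```
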

Feature Engineering

  • New variables such as Rally_info, Aces, Gender and others were derived from the raw data.
  • Adding Rally_info yielded an accuracy increase of approximately 1.5% under the random forest classification method.

XGBoost - Result

  • XGBoost showed promising improvement over the other models in terms of classification error.
  • Combined with the engineered features above, it classified around 91.5% of the dataset correctly; predictions on the hidden test set on Kaggle returned an accuracy of 92.6%.