The data set is provided by Tennis Australia. It is modelled on a competition conducted in 2018, "Predicting How Points End in Tennis", but new data is provided. That contest represented the first public release of a large amount of tracking data from professional tennis matches. This task has the potential to revolutionize the way that tennis uses data science to collect match statistics, and to make a huge impact on the sport.

## Motivation

Tennis, one of the most popular professional sports around the world, still uses manual coding of point outcomes. This is not only labor-intensive but also raises concerns that outcome categories may not be coded consistently from one coder to the next. The purpose of this contest is to find a better approach.

## Point Endings

Every tennis match is made up of a sequence of points. A point begins with a serve and players exchange shots until a player makes an error or is unable to return a shot in play.

Traditionally, the shot ending a point in tennis has been described in one of three mutually exclusive ways: a winner, an unforced error, or a forced error. A winner is a shot that was in play, was not touched by the opponent, and ends with the point going to the player who made the shot. The other two categories are distinct types of errors; both end with the point going to the player who did not make the shot. The distinction between an unforced and a forced error is based on the nature of the incoming shot and a judgment about whether that shot was playable. As you can imagine, this distinction is not a perfect science.

## Outcome Coding

Point endings give us insight into player performance. For this reason, accurate statistics about point outcomes are essential to the sport. At professional tennis tournaments, human coders are trained to label and document outcomes during matches. This is the primary way that the sport gathers information about winners and errors.

## Tracking Data

The adoption of the player challenge system in the mid-2000s has led to the use of multi-camera tracking systems for the majority of top professional matches. These tracking systems monitor the 3D coordinates of the ball and the 2D coordinates of the players throughout a match. The richness of these data holds considerable promise for addressing many challenging questions in the sport.

## Objective

The objective of this contest is as follows:

Predict how a point ends in tennis using modern tracking data. The variable to be predicted is 'outcome', which takes one of three values: Winner (W), Unforced Error (UE), or Forced Error (FE).

## Data

• See the data dictionary (data_dictionary.docx) for a full feature description of the contest dataset.
• ao_training.rda contains the full training set, approximately 50% of observations
• ao_test_unlabelled.rda contains the test set that you need to predict; it has all of the same variables as the training set except for the outcome variable
• ao_sample_predictions.csv shows the format of the file that you need to upload to Kaggle with your predictions. The outcome column in this file contains random values.
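The .rda files can be read into R with `load()`. A minimal sketch of that workflow, using a throwaway file in place of ao_training.rda: an .rda file restores objects under the names they were saved with, and the object names inside the contest files are an assumption here, so check with `ls()` after loading.

```r
# Stand-in for ao_training.rda: save a toy object, remove it, reload it.
toy <- data.frame(pointid = 1:3)
path <- tempfile(fileext = ".rda")
save(toy, file = path)
rm(toy)

load(path)   # `toy` reappears under its original name
head(toy)
```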

## Hints

• The variables event, year, pointid and matchid should not be used in the model.
• There are multiple observations per pointid, because these are rallies in which there were multiple hits. However, the outcome for each pointid is unique, and it is this unique outcome that you need to predict.
• Experiment with "feature engineering": creating new variables, based on the variables provided, that better summarise the point, in order to improve the accuracy of your model. A simple suggestion would be to make a new variable "gender" to label the point as coming from a men's or a women's match.
• Some sample code to build a model and generate your prediction file is:

```r
library(tidyverse)
library(randomForest)

# Fit a random forest, dropping columns 1, 2, 5 and 6, which are assumed
# to hold the identifier variables listed above (event, year, pointid,
# matchid) -- confirm with names(ao_tr).
ao_rf <- randomForest(outcome ~ ., data = ao_tr[, -c(1, 2, 5, 6)],
                      importance = TRUE)

# Predict an outcome for every shot-level observation in the test set.
ao_ts_unlabelled$pred <- predict(ao_rf, newdata = ao_ts_unlabelled)

# Most common predicted level within a group.
mymode <- function(d) {
  levels(d)[which.max(tabulate(d))]
}

# Collapse the shot-level predictions to a single outcome per pointid.
predictions <- ao_ts_unlabelled %>%
  group_by(pointid) %>%
  summarise(outcome = mymode(pred)) %>%
  arrange(pointid)

write_csv(predictions, "predictions_22_4.csv")
```
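The feature-engineering hint above can be sketched with a toy example. The column names and the coding of `event` used here are assumptions made for illustration; check data_dictionary.docx for the actual variables and values.

```r
# Toy data standing in for a few rows of the training set; the real
# column names and event labels may differ.
toy <- data.frame(
  pointid = c(1, 1, 2),
  event   = c("Womens Singles", "Womens Singles", "Mens Singles")
)

# New feature: label each observation as men's ("M") or women's ("W"),
# assuming women's events contain the string "Women".
toy$gender <- factor(ifelse(grepl("Women", toy$event), "W", "M"))
```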

## Evaluation Criteria

The Kaggle metric CategorizationAccuracy is used to assess your predictions. It is the proportion of correct predictions. Using the code above will give a value of about 0.89; this is your benchmark to improve upon.
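The metric itself is simple to compute locally, which is useful for checking a model on a held-out slice of the training set before submitting. A minimal illustration with made-up labels:

```r
# Categorisation accuracy: the proportion of predictions that match the
# true outcomes. The vectors below are made-up examples.
truth <- c("W", "UE", "FE", "W", "UE")
pred  <- c("W", "UE", "W",  "W", "FE")

accuracy <- mean(pred == truth)   # 3 of 5 correct
```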

1. Submissions need to be made as an individual between weeks 8-10. In week 10 you will be able to merge with other class members to form a team, and make team submissions.
2. Your final work will be with a team of 4-5 class members.
3. Do some basic exploration of the dataset.
4. Try, and try again, to improve your model. You can submit one prediction per day.

# Project report and presentation

The data analysis report can be a maximum of 5 pages, and must abide by the section structure described below.

• Section 1: Introduction

The introduction will describe the data set and motivate the problem. It should be brief.

• Section 2: Methodology

This section describes the models and methods you have used, including a justification of your choices. You should also present your model fitting, diagnostics, etc.

• Section 3: Results and Discussion

This includes for example graphs and tables, as well as a discussion of the results.

• Section 4: Conclusion

This includes summary of the findings.

You should clearly explain what you have done, using figures to supplement your explanation. Your figures must be of proper size with labeled, readable axes. In general, you should take pride in making your report readable and clear. You will be graded both on statistical content and quality of presentation.

Finally, each team will make a presentation of their work to the class, using at most 5 slides in html format. Each team will have 5 minutes (4 minutes for the presentation and 1 minute for Q & A). All team members must participate by speaking in the presentation. Scores will be given by the other members of the class. All students must be present to evaluate the presentations; points will be deducted from the score of any individual who is absent.

• Total points: 20
• Accuracy of classifier on Kaggle: 6
• Report: 7
• Presentation: 7