Predict the outcome of English Premier League matches using machine learning and historical match data. This project leverages web scraping, data cleaning, feature engineering, and Random Forest classification to make informed predictions.
The project is divided into two main parts: Scraping and Prediction.
We scrape match results from the EPL to build a dataset for predicting match winners. The notebook uses the requests library to download data and Beautiful Soup to parse HTML, extracting the relevant tables. The data is then loaded into pandas for cleaning and preparation for the machine learning model.
Workflow recap:
- Start with the standings_url for the Premier League
- Scrape the table data from the HTML page
- Access each team's individual page
- Scrape scores and fixtures for each match of that team
- Scrape shooting stats for each match of that team
- Combine the data frames for each team
- Loop through each season and selected teams to consolidate into a single data frame
- Merge everything and perform minor data cleanup
This notebook predicts winners for EPL matches using the scraped dataset.
Steps included:
- Load match data into pandas and clean it
- Apply a Random Forest machine learning model to predict match outcomes
- Measure the algorithm's performance and improve model accuracy
- Continue iteratively improving the model
Summary of the Prediction process:
- Scraped data about Premier League matches
- Cleaned the data and prepared it for the ML model
- Created a simple ML model with a few predictors and a target
- Trained a Random Forest model
- Computed accuracy and precision scores
- Improved accuracy by generating more predictors using rolling averages
- Enhanced precision by analyzing both sides of the matches of team A and team B
- Achieved a final model precision of 67.5 percent
Possible next steps:
- Scrape more seasons, ideally 10 to 20, to improve predictions
- Use additional columns in the dataset to generate more features for the model
- Python 3.8+
- Pandas for data manipulation and analysis
- Scikit-learn for machine learning
- Requests for web scraping
- Beautiful Soup for HTML parsing
- Jupyter Notebook for interactive development
epl-ml-match-winner-prediction/
│
├── data_scrape.ipynb # Notebook to scrape EPL match data
├── ml_predictions.ipynb # Notebook for ML model training and evaluation
├── matches_data.csv # Raw match data collected
├── requirements.txt # Python dependencies
└── LICENSE # Project license
- Clone the repository
git clone https://github.com/andreyestevam/epl-ml-match-winner-prediction.git
cd epl-ml-match-winner-prediction- Install required packages
pip install -r requirements.txt- Run Jupyter Notebooks
You can either: Scrape the data (can be slightly different than the one collected in this repository due to possible website changes):
jupyter notebook data_scrape.ipynbOr: Train and Evaluate the ML model:
jupyter notebook ml_predictions.ipynbThe Random Forest model is evaluated using accuracy and precision metrics. Initial models are simple, but precision improves significantly by adding rolling averages, using more predictors, and considering both teams' historical performance. The final precision achieved, with the existing data in this repository, is 67.5 percent.
This project was inspired by a project-based learning approach. All data cleaning, feature engineering, and machine learning modeling were implemented independently, reflecting hands-on experimentation and practical learning with Python, Pandas, Beautiful Soup, and scikit-learn.