Note as of May 15th 2016 - I'm in the process of updating the package to submit to CRAN. I have made a number of changes to code and data formats. If you find something doesn't work anymore, please submit an issue.

engsoccerdata

This R package is mainly a repository for complete soccer datasets, along with some built-in functions for analyzing parts of the data. Currently I include three English ones (League data, FA Cup data, Playoff data - described below) and some European leagues (Spain, Germany, Italy, Holland).

Free for non-commercial use. Compiled by James Curley.

Please cite as:
James P. Curley (2016). engsoccerdata: English Soccer Data 1871-2015. R package version 0.1.4 DOI

If you do use it in any publications, blogs, websites, etc., please note the source (i.e. me!). Also, if you do use it, I would love to see any analysis produced from it. Of course, I accept no responsibility for any errors that may be contained herein.

Contact details: jc3181 AT columbia DOT edu

Installation

To install this directly into R:

library(devtools)
install_github('jalapic/engsoccerdata')    # the repo string already includes the username
library(engsoccerdata)

data(package="engsoccerdata")    # lists datasets currently available

If you get an error message like this one

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Problem with the SSL CA cert (path? access rights?)

which has happened on occasions for me, try this:

library(RCurl)
library(httr)
set_config( config( ssl_verifypeer = 0L ) )

library(devtools)
install_github('jalapic/engsoccerdata')    # the repo string already includes the username
library(engsoccerdata)

Last update: 15 May 2016, v0.1.5

Datasets

Help Needed !

I am about to submit this package to CRAN. I would love help in collating more results. If anyone wants to work on a particular league or competition please let me know. These are the things I'd like to work on:

Functions

Some built-in functions:

What does england.csv contain?

all top 4 tier games ever played 1888-2016

In the csv file, I've used divisions 1, 2, 3, 3a, 3b and 4 as the notation. I've also used tiers 1, 2, 3 and 4, so that divisions 3, 3a and 3b all belong to tier 3.
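
The division/tier relationship can be sketched as a simple lookup. This is only an illustration on a toy data frame: the column names `division` and `tier` and the helper `division_to_tier` are assumptions made for the example, not confirmed names from the package.

```r
# Hypothetical helper: map the division notation (1, 2, 3, 3a, 3b, 4)
# onto tiers 1-4. Divisions 3, 3a and 3b all belong to tier 3.
division_to_tier <- function(division) {
  lookup <- c("1" = 1L, "2" = 2L, "3" = 3L, "3a" = 3L, "3b" = 3L, "4" = 4L)
  # Unknown division labels come back as NA rather than raising an error.
  unname(lookup[as.character(division)])
}

# Toy data with made-up rows, only to show the mapping.
toy <- data.frame(division = c("1", "2", "3a", "3b", "3", "4"),
                  stringsAsFactors = FALSE)
toy$tier <- division_to_tier(toy$division)
print(toy)
```

Grouping by tier rather than division is what would let regional third divisions (3a, 3b) be compared with a single national third division.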

Dataset includes:

Teams that dropped out halfway through a season:
- 1919 Leeds City
- 1931 Wigan Borough
- 1961 Accrington Stanley

Team Names used in the file are those that are currently used: e.g. Small Heath are Birmingham City, Ardwick are Manchester City, etc.

The modern Accrington Stanley are 'Accrington' to distinguish from original Accrington Stanley and earlier Accrington FC
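
If you are merging in results from another source that uses historical names, a lookup along these lines may help. It is a minimal sketch: the helper `normalize_team` and the lookup vector are hypothetical, and only the two examples mentioned above are included.

```r
# Hypothetical lookup from historical names to the current name.
# Only the two examples from the text are included here.
current_names <- c("Small Heath" = "Birmingham City",
                   "Ardwick"     = "Manchester City")

# Replace any known historical name with its current equivalent;
# names that are not in the lookup are left unchanged.
normalize_team <- function(team) {
  known <- team %in% names(current_names)
  team[known] <- unname(current_names[team[known]])
  team
}

normalize_team(c("Small Heath", "Ardwick", "Everton"))
# returns "Birmingham City" "Manchester City" "Everton"
```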

What does facup.csv contain?

This was a pain to put together. It contains every single FA Cup tie (whether played or not) from the inception of the competition in 1871 to the 2015/16 season. It does not contain pre-qualifying rounds (yet). It is best to describe each variable name in turn to give more information:

Important notes to above:

I have tried to make the dataset as complete as possible. The FA Cup data is difficult because some of it is simply unobtainable. For instance, I have added venues and attendances for all semi-finals and finals, and have included this information sporadically wherever else I was able to get it, but I have not applied it systematically to early rounds. Several FA Cup games are played at neutral grounds, or the visiting team is even allowed to play at home (e.g. if a minnow plays a big team). I have not managed to check this systematically. There was also a trend to play 2nd, 3rd and 4th replays at neutral venues. This could be checked systematically, but I have not done so yet. I also think that all games that ever ended in penalties have been added correctly.

Finally, team names. There are great disputes about which teams branched off from which teams in history and who should have a shared history. I have tried to be consistent in naming teams with their most current name throughout (e.g. Millwall Rovers, Millwall Athletic and Millwall are all listed under the current name, Millwall), or the name that they used when they stopped playing (e.g. Mitchell St. George's are always listed as Birmingham St. George's). I have also tried to follow the same team name format as in england.csv. I think the three Accrington teams may be the only ones I need to re-edit for this purpose.

What does playoffs.csv contain?

What does spain.csv contain?

Please refer to the spainliga rpubs below for further information.

Other Leagues:

I've just added complete top-tier results for Holland (1956-2016), Germany (1963-2016) and Italy (1934-2016). These dataframes contain all league results played in the regular season. They don't yet include relegation/promotion playoff fixtures. Further, I have not yet completed all final checks of the data. I believe they are error free, but if others want to test and check, I'd welcome this.
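
For anyone who wants to help check the data, one possible sanity check is that every ordered home/visitor pairing appears exactly once per season, as you would expect in a double round-robin league. The sketch below runs on toy data. The column names `Season`, `home` and `visitor` and the helper `check_round_robin` are assumptions for illustration. Real seasons can deviate legitimately (e.g. teams dropping out mid-season), so flagged fixtures are candidates to inspect, not confirmed errors.

```r
# Hypothetical check: in a double round-robin season, each ordered
# (home, visitor) pairing should appear exactly once.
check_round_robin <- function(df) {
  problems <- list()
  for (s in unique(df$Season)) {
    games <- df[df$Season == s, ]
    teams <- sort(unique(c(games$home, games$visitor)))
    # All ordered pairings of distinct teams expected in this season.
    expected <- expand.grid(home = teams, visitor = teams,
                            stringsAsFactors = FALSE)
    expected <- expected[expected$home != expected$visitor, ]
    expected_key <- paste(expected$home, expected$visitor, sep = " v ")
    played_key <- paste(games$home, games$visitor, sep = " v ")
    # Count how often each expected fixture was actually played.
    n_played <- table(factor(played_key, levels = expected_key))
    bad <- n_played[n_played != 1]
    if (length(bad) > 0) {
      problems[[length(problems) + 1]] <- data.frame(
        Season = s, fixture = names(bad), times_played = as.integer(bad),
        stringsAsFactors = FALSE)
    }
  }
  if (length(problems) == 0) return(NULL)
  do.call(rbind, problems)
}

# Toy season with made-up teams: "C v A" is duplicated, "C v B" is missing.
toy <- data.frame(Season  = 2000,
                  home    = c("A", "A", "B", "B", "C", "C"),
                  visitor = c("B", "C", "A", "C", "A", "A"),
                  stringsAsFactors = FALSE)
print(check_round_robin(toy))
```

The function returns `NULL` when every fixture appears exactly once, and otherwise a data frame with one row per missing or duplicated fixture.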


Any help in improving the quality of these datasets is appreciated.

List of Sources

Shiny apps:

Tutorials/demos

(Note: as of May 2015, the code in these may need to change to reflect the change in names of datasets and some functions.)
- http://rpubs.com/jalapic/daygoals #goal scoring trends on unique dates in soccer history
- http://rpubs.com/jalapic/facuplast8 #quick walkthrough of some of the FA Cup data
- http://rpubs.com/jalapic/gpg #very quick look at id-ing breakpoints in English scoring trends
- http://rpubs.com/jalapic/gamebygame #plotting game by game trends across seasons
- http://rpubs.com/jalapic/seasons #visualizing season to season changes in top tier performance
- http://rpubs.com/jalapic/laliga #visualizing historical Spanish La Liga data

FiveThirtyEight

Oliver Roeder and I have written several articles for FiveThirtyEight using these data:

Also this piece on league inequality:

Media Hits

(listing them here so I don't forget them)

Elsewhere

More in-depth analysis by Simon on David Sumpter's Collective Behavior blog:
- http://www.collective-behavior.com/liverpool-is-still-the-most-successful-english-club-team-but-for-how-long/
- http://www.collective-behavior.com/how-the-big-four-made-football-predictable/