NOTE: If you are not interested in the details of programming with R or getting data from the internet with it, you want to skip this post and have a look at Part 2.
First, we have to get some data. transfermarkt.de is a german site hosting statistics of all kinds of stuff around soccer. With a few simple calls, we can get a table from their site into R:
Let's load some packages first:
library(RCurl)
library(XML)
library(gdata)
Now define the URL where the table is located:
buli.url <- "http://www.transfermarkt.de/de/1-bundesliga/startseite/wettbewerb_L1.html"
Now parse this URL and extract the table body from it:
parsedHTML <- htmlParse(getURL(buli.url), asText = T)
raw.tab <- xpathSApply(parsedHTML, "//tbody", xmlValue)[1]
Now let's clean up that messy HTML stuff:
cl.tab <- trim(gsub("[\r\n]+", " ", raw.tab))
spl.tab <- trim(strsplit(cl.tab, "\t", fixed = T)[[1]])
spl.tab <- spl.tab[spl.tab != ""]
And build a dataframe from spl.tab. The first block of calls only creates vectors of indices we need to get the correct information from the long character vector spl.tab into the dataframe table.
team.inds <- seq(1, length(spl.tab), 5)
squad.inds <- seq(2, length(spl.tab), 5)
age.inds <- seq(3, length(spl.tab), 5)
value.inds <- seq(4, length(spl.tab), 5)
mvalue.inds <- seq(5, length(spl.tab), 5)
spl.tab[value.inds] <- trim(gsub("[.€]+", "", spl.tab[value.inds]))
spl.tab[mvalue.inds] <- trim(gsub("[.€]+", "", spl.tab[mvalue.inds]))
Now build the dataframe.
table <- data.frame(Team = spl.tab[team.inds],
Squad = spl.tab[squad.inds],
Age = spl.tab[age.inds],
Value = spl.tab[value.inds],
mValue = spl.tab[mvalue.inds])
buli.stand.url <- "http://www.transfermarkt.de/de/1-bundesliga/tabelle/wettbewerb_L1.html"
parsedHTML <- htmlParse(getURL(premleague.stand.url), asText = T)
raw.tab <- xpathSApply(parsedHTML, "//table", xmlValue)[2]
cl.tab <- trim(gsub("[\r\n#]+", " ", raw.tab))
spl.tab <- trim(strsplit(cl.tab, "\t", fixed = T)[[1]])
spl.tab <- spl.tab[spl.tab != ""]
spl.tab <- spl.tab[(grep("Pkt.", spl.tab, fixed = T)+1):(grep("Jahrestabelle", spl.tab, fixed = T)-1)]
pos.inds <- seq(1, length(spl.tab), 9)
team.inds <- seq(2, length(spl.tab), 9)
games.inds <- seq(3, length(spl.tab), 9)
won.inds <- seq(4, length(spl.tab), 9)
draw.inds <- seq(5, length(spl.tab), 9)
lost.inds <- seq(6, length(spl.tab), 9)
goals.inds <- seq(7, length(spl.tab), 9)
diff.inds <- seq(8, length(spl.tab), 9)
points.inds <- seq(9, length(spl.tab), 9)
stand.table <- data.frame(Pos = as.numeric(spl.tab[pos.inds]),
Team = spl.tab[team.inds],
Games = as.numeric(spl.tab[games.inds]),
Won = as.numeric(spl.tab[won.inds]),
Draw = as.numeric(spl.tab[draw.inds]),
Lost = as.numeric(spl.tab[lost.inds]),
Goals = spl.tab[goals.inds],
Diff = as.numeric(spl.tab[diff.inds]),
Points = as.numeric(spl.tab[points.inds]))
The number of goals for and against a team is only available within the column "Goals", so we have to extract them into two new columns.
stand.table$Goals.for <- as.numeric(sapply(stand.table$Goals, function (x) strsplit(x, ":")[[1]][1]))
stand.table$Goals.aga <- as.numeric(sapply(stand.table$Goals, function (x) strsplit(x, ":")[[1]][2]))
Now we sort both tables (table which holds the values of the teams and stand.table which holds the standing of the current championship season) by team name. Then we cbind the two tables and check if the team names from both tables are equal. They should be. Last step: Sort the big table by position in the current championship and get rid of the redundant "Team2" column.
stand.table <- stand.table[order(stand.table$Team),]
table <- table[order(table$Team),]
names(table)[names(table) == "Team"] <- "Team2"
big.tab <- cbind(stand.table, table)
all(big.tab$Team == big.tab$Team2) # should be TRUE
big.tab <- big.tab[order(big.tab$Pos),-grep("Team2", names(big.tab), fixed = T)]
After all this "magic", we get a table looking like this (only first 5 rows are shown here, I used the Premier League as an example):
Pos Team Games Won Draw Lost Goals Diff Points Goals.for Goals.aga Squad Age Value mValue
1 1 FC Chelsea 7 6 1 0 15:4 11 19 15 4 23 27.0 387000000 16826087
2 2 Manchester United 7 5 0 2 17:9 8 15 17 9 28 26.1 441000000 15750000
3 3 Manchester City 7 4 3 0 15:8 7 15 15 8 25 26.5 498250000 19930000
4 4 FC Everton 7 4 2 1 14:8 6 14 14 8 22 27.1 128250000 5829545
5 5 Tottenham Hotspur 7 4 2 1 13:8 5 14 13 8 24 27.5 254500000 10604167
In the next post, I will demonstrate how to get nice plots and analyses out of this kind of dataframe... and maybe we will answer the question if soccer is really only about the money.
View comments