Today, let's have a look at movies. The Internet Movie Database (IMDb) makes some data dumps available on its website. They contain only a subset of the information available on the IMDb site, but that's more than enough. I will spare you my code for converting these data dumps into R dataframes, because the code is boring and complicated (unfortunately, the data dumps are not easy to parse automatically).

I just want to show you what you can do with these dumps. I'm going to use the data on IMDb user ratings (saved in the variable rat). A compressed .Rdata file of these ratings is roughly 7 MB in size and has almost 390,000 rows. rat looks like this:

> rat[grep("The Wire", rat$title),][1:10,]
       n.votes rating                                                  title year
25026      300    8.0        "Curb Your Enthusiasm" (2000) {The Wire (#1.6)} 2000
103831     644    8.0 "Star Trek: Deep Space Nine" (1993) {The Wire (#2.22)} 1993
131481      93    8.0                                      "The Wire" (1997) 1997
131482    2457    9.5                                      "The Wire" (2002) 2002
131483     843    9.5                       "The Wire" (2002) {-30- (#5.10)} 2002
131484     213    9.2                  "The Wire" (2002) {A New Day (#4.11)} 2002
131485     227    8.7             "The Wire" (2002) {All Due Respect (#3.2)} 2002
131486     250    8.9                "The Wire" (2002) {All Prologue (#2.6)} 2002
131487     553    8.8                   "The Wire" (2002) {Alliances (#4.5)} 2002
131488     218    8.8                "The Wire" (2002) {Back Burners (#3.7)} 2002
       series.ep ep.code season episode decade
25026       TRUE     1.6      1       6   1990
103831      TRUE    2.22      2      22   1990
131481     FALSE    <NA>     NA      NA   1990
131482     FALSE    <NA>     NA      NA   2000
131483      TRUE    5.10      5      10   2000
131484      TRUE    4.11      4      11   2000
131485      TRUE     3.2      3       2   2000
131486      TRUE     2.6      2       6   2000
131487      TRUE     4.5      4       5   2000
131488      TRUE     3.7      3       7   2000
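
The columns series.ep, season and episode can all be read off the title string. Here is a minimal sketch of how that could be done, assuming that episode entries always end in "{Episode name (#season.episode)}" as in the rows above. The helper parse_title is hypothetical and is not the conversion code used for this post.

```r
# Hypothetical helper: derive episode columns from an IMDb-style title string.
parse_title <- function(title) {
  # Episode entries end in "{Episode name (#season.episode)}"
  is_episode <- grepl("[{].*[}]$", title)
  # Capture the season and episode numbers from the "(#S.E)}" suffix, if present
  hits <- regmatches(title, regexec("[(]#([0-9]+)[.]([0-9]+)[)][}]$", title))
  pick <- function(i) {
    vapply(hits,
           function(h) if (length(h) == 3) as.integer(h[i]) else NA_integer_,
           integer(1), USE.NAMES = FALSE)
  }
  data.frame(title = title, series.ep = is_episode,
             season = pick(2), episode = pick(3),
             stringsAsFactors = FALSE)
}

res <- parse_title(c("\"The Wire\" (2002) {-30- (#5.10)}",
                     "\"The Wire\" (2002)"))
print(res)
```

Titles without an episode suffix simply get NA for season and episode.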


First, let's have a look at the overall distribution of all ratings. We exclude episode ratings because we are only interested in movies and whole series right now.

Let us start with a nice histogram.

library(MASS)  # provides truehist()
# Ratings of movies and whole series only (episodes excluded)
movie.rat <- rat[rat$series.ep == FALSE, "rating"]
truehist(movie.rat, border = "#00000000", col = "darkblue", xlab = "Rating", ylab = "Probability")
# Mean rating as a green vertical line
abline(v = mean(movie.rat), col = "lightgreen", lwd = 2)
# Normal curve with the same mean and standard deviation (dotted red line)
curve(dnorm(x, mean = mean(movie.rat), sd = sd(movie.rat)), from = 1, to = 10, add = TRUE, lwd = 2, col = "red", lty = "dotted")
# Most frequent rating(s)
rat.tab <- table(movie.rat)
rat.tab[which(rat.tab == max(rat.tab))]

The mean movie rating (green line) on IMDb is 6.14 (rounded). Users can rate movies on IMDb on a scale from 1 to 10. Movie ratings on IMDb are not normally distributed but slightly shifted to the right. For comparison, the dotted red line shows a normal distribution with the same mean and standard deviation as the ratings.
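
If you want to put a number on "shifted to the right", the mode and the sample skewness are two simple checks. The sketch below uses a made-up toy vector rather than the IMDb data, and the helper names rating_mode and skewness are my own. If the bulk of the ratings sits at the higher values with a longer tail toward the low ones, the skewness comes out negative.

```r
# Most frequent value (ties are resolved in favour of the lowest value)
rating_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}

# Moment-based sample skewness; negative means a longer tail toward low values
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

toy <- c(2, 5, 6, 6, 7, 7, 7, 8, 8, 9)  # made-up ratings, not IMDb data
print(mean(toy))
print(rating_mode(toy))
print(skewness(toy))
```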

Since we know the decade in which a movie was released, let's have a look at ratings over the decades. Note that the year shown on the left of the plot is the start of that decade.

library(lattice)  # assumption: dotplot() comes from the lattice package
dotplot(rev(xtabs(rating ~ decade, data = rat[rat$series.ep == FALSE,]) / xtabs(~ decade, data = rat[rat$series.ep == FALSE,])), cex = 1.3, xlab = "Mean rating")
Wow, that's harsh: movies obviously sucked from 1900 to 1909. The "Golden Twenties" win with a mean rating of 6.40. That's very close to the second-best decade (1940 to 1949), whose movies have a mean rating of 6.35. I tried to create this plot with error bars to visualize the variance within decades. However, error bars are practically invisible because there are so many cases in each decade. The decade with the fewest ratings is 1910 to 1919 with "only" 1,854 ratings. The decade with the most ratings is 2000 to 2009: it has 108,304 ratings! With counts that high, tiny error bars are no surprise.
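
To see why the error bars vanish, remember that the standard error of a mean is the standard deviation divided by the square root of the number of cases. Here is a small sketch with simulated data. The group sizes are the two counts mentioned above, but the mean of 6 and the standard deviation of 1.5 are assumptions picked for illustration, not values from the IMDb data.

```r
# Standard error of the mean: sd / sqrt(n)
se_mean <- function(x) sd(x) / sqrt(length(x))

set.seed(1)
small <- rnorm(1854,   mean = 6, sd = 1.5)  # simulated, smallest decade size
large <- rnorm(108304, mean = 6, sd = 1.5)  # simulated, largest decade size

print(se_mean(small))  # a few hundredths of a rating point
print(se_mean(large))  # roughly 0.005 of a rating point
```

Even for the smallest decade, the error bar would span only a few hundredths of a rating point.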

Now we're going to look into series. There is always a huge discussion about which season of a series is the best. Let's see what IMDb users say. I'm going to plot the mean rating of each season with error bars to get an impression of statistical significance. Note that I'm plotting the mean of means, because the data dump you can download from IMDb only supplies the mean rating of each episode. Normally, I would calculate the mean for each season from the "raw" user ratings.

I'm going to compare three series that many people call the best they have ever seen.
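
A mean of means treats every episode equally, no matter how many people voted on it. Since the dump also contains n.votes, a vote-weighted mean is one way to get closer to the mean of the raw ratings. It would match exactly only if each published episode rating were a plain arithmetic mean of its votes, which is an assumption here. The two episodes below are made up.

```r
# Two made-up episodes of one season: rating and number of votes
season <- data.frame(rating = c(9.0, 8.0), n.votes = c(900, 100))

# Every episode counts the same
mean_of_means <- mean(season$rating)

# Every vote counts the same
vote_weighted <- weighted.mean(season$rating, season$n.votes)

print(mean_of_means)  # 8.5
print(vote_weighted)  # 8.9
```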

First: Define some patterns you want to find in the data dump.
series.patterns <- c("\"Breaking Bad\"",
                     "\"The Wire\"",
                     "\"The Sopranos\"")

Now build up a dataframe with all hits.
rat.series <- data.frame()
for (series in series.patterns) {
    # fixed = TRUE matches the pattern literally, quotes included
    rat.series <- rbind(rat.series, rat[grep(series, rat$title, fixed = TRUE),])
}
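
As an aside, the same kind of subset can be sketched without growing a dataframe inside a loop. The helper match_any is hypothetical. It returns one logical value per title that is TRUE if any of the literal patterns occurs in it. Unlike the rbind approach, a row that matches two patterns is selected only once, and the original row order is kept.

```r
# Hypothetical helper: TRUE for every title that contains at least one pattern
match_any <- function(titles, patterns) {
  hits <- lapply(patterns, function(p) grepl(p, titles, fixed = TRUE))
  Reduce(`|`, hits)
}

titles <- c("\"The Wire\" (2002)",
            "\"Dexter\" (2006)",
            "\"Breaking Bad\" (2008) {Pilot (#1.1)}")
patterns <- c("\"Breaking Bad\"", "\"The Wire\"")

keep <- match_any(titles, patterns)
print(titles[keep])
```

With a dataframe like rat, the subset would then be rat[match_any(rat$title, series.patterns), ].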

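Now I keep only the episode rows and extract the series title (I only need it for the legend of the plot).
rat.series <- rat.series[rat.series$series.ep == TRUE,]
rat.series$series.title <- sapply(rat.series$title, USE.NAMES = FALSE, FUN = function (title) {
    # split at "(" and keep the piece(s) that contain a double-quoted string
    ti <- grep("[\"]{1}[[:print:]]*[\"]{1}", strsplit(title, "(", fixed = TRUE)[[1]], value = TRUE)
    # drop the quotes and all whitespace (this also removes spaces inside the title)
    gsub("[\"[:space:]]", "", ti)
})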
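
If you would rather keep the spaces inside the series title, a single sub() call with a capture group is one alternative. This is a sketch under the assumption that the series name is always the double-quoted string at the very start of the title. The helper name extract_series_title is my own.

```r
# Hypothetical helper: return the quoted series name at the start of a title.
# Titles that do not start with a quoted name are returned unchanged.
extract_series_title <- function(title) {
  sub("^\"([^\"]*)\".*$", "\\1", title)
}

examples <- c("\"The Wire\" (2002) {-30- (#5.10)}",
              "\"Star Trek: Deep Space Nine\" (1993) {The Wire (#2.22)}")
print(extract_series_title(examples))
```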

Now plot the result with the help of a function from the "sciplot" package.
library(sciplot)  # provides lineplot.CI()
lineplot.CI(season, rating, series.title, data = rat.series, col = c("#FF0000C8", "#00FF00C8", "#0000FFC8"), lwd = 2, xlab = "Season", ylab = "Mean Rating")
As a big fan of "The Wire", I find that a tough result. IMDb users say that "Breaking Bad" is at least equally outstanding. I just started watching "Breaking Bad", so I guess it's alright. "The Sopranos" seems to suck after season 3... I quit during season 3. Maybe that's alright, too :)

A second duel is between two cartoon classics.
Both are rated worse over time, with a few peaks for South Park at seasons 8 and 11 (Imaginationland?!? Come on, people, you can't be serious!).

If you have any ideas for other duels, let me know.

But now to the answer we've all been waiting for: The best series EVER (at least in the eyes of voting IMDb users). Place your bets, ladies and gentlemen.

rat.se is a dataframe holding all data for episodes (so no movie ratings). We only want to look at the 250 series with the most votes. So we cross-tabulate the number of votes by series title (I already showed above how I used regular expressions to extract the title of the series), sort the result, and take the first 250 entries.

xt.votes <- sort(xtabs(n.votes ~ series.title, data = rat.se), decreasing = T)[1:250]
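
Here is the same idea on a toy dataframe, so you can see what the cross-tab does: sum the votes per series, sort, and keep the top entries. The data and the helper top_by_votes are made up for illustration. One small difference from indexing with [1:250] is that head() does not pad the result with NA when there are fewer series than requested.

```r
# Hypothetical helper: total votes per series, largest first, top n only
top_by_votes <- function(d, n) {
  votes <- tapply(d$n.votes, d$series.title, sum)
  votes <- setNames(as.vector(votes), names(votes))  # plain named vector
  head(sort(votes, decreasing = TRUE), n)
}

# Made-up episode data
toy <- data.frame(series.title = c("A", "A", "B", "C", "C", "C"),
                  n.votes      = c(10, 20, 100, 5, 5, 5),
                  stringsAsFactors = FALSE)

top2 <- top_by_votes(toy, 2)
print(top2)
```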

Now I'm going to iterate through the names of xt.votes to calculate the mean, the standard error, and the number of votes for each series. The result is saved in the variable df. I sort this dataframe by rating and extract the Top 10.
df <- data.frame()
for (name in names(xt.votes)) {
    # all episodes of the current series
    eps <- rat.se[rat.se$series.title == name,]
    mean.rat.se <- mean(eps$rating)
    se.rat.se <- se(eps$rating)  # se() is provided by the sciplot package
    nvotes.rat.se <- sum(eps$n.votes)
    new.df <- data.frame(title = name, mRating = mean.rat.se, SERating = se.rat.se, nVotes = nvotes.rat.se)
    df <- rbind(df, new.df)
}

df <- df[order(df$mRating, decreasing = T),][1:10,]
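
The loop above can also be written as a split-and-summarise step. The sketch below runs on made-up episode data and computes the standard error by hand as sd / sqrt(n) instead of calling se(). The helper name summarise_series is my own. Note that a series with a single episode gets NA as its standard error.

```r
# Hypothetical helper: one row per series with mean, standard error and votes
summarise_series <- function(d) {
  groups <- split(d, d$series.title)
  out <- data.frame(
    title    = names(groups),
    mRating  = sapply(groups, function(g) mean(g$rating)),
    SERating = sapply(groups, function(g) sd(g$rating) / sqrt(nrow(g))),
    nVotes   = sapply(groups, function(g) sum(g$n.votes)),
    row.names = NULL, stringsAsFactors = FALSE)
  # best mean rating first
  out[order(out$mRating, decreasing = TRUE), ]
}

# Made-up episode data
toy <- data.frame(series.title = c("A", "A", "B", "B", "B"),
                  rating       = c(9, 8, 9.5, 9.5, 9.5),
                  n.votes      = c(10, 20, 5, 5, 5),
                  stringsAsFactors = FALSE)

res <- summarise_series(toy)
print(res)
```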

Aaaaand ... plot!
par(mar = c(10, 4, 4, 2))  # large bottom margin for the rotated series titles
plot(df$mRating, axes = FALSE, ylab = "Mean Episode Rating", xlab = "", pch = 19, col = "blue", cex = 1.5, ylim = c(8.4,8.9), xlim = c(0.7,10.3))
# error bars of +/- one standard error; plotCI() is available in, e.g., the gplots and plotrix packages
plotCI(x = df$mRating, ui = df$mRating + df$SERating, li = df$mRating - df$SERating, add = TRUE, type = "n", col = "darkblue")
axis(side = 2)
axis(side = 1, labels = df$title, at = 1:10, tick = FALSE, las = 2, cex.axis = 0.8)


So we have a winner: Game of Thrones! The error bars represent the standard errors of the mean episode ratings of each series. That's not the watertight way to do it (we would need the raw ratings for that). Nevertheless, we can see some interesting things:
  • There are only very small differences between the Top 10 series. Look at the scale of the y axis. The difference between Game of Thrones and The Wire is really really small!
  • "Firefly" and "Sherlock" have relatively large error bars. The reason: there are only a few episodes of each. "Sherlock" still has a chance to gather more episodes. Unfortunately, "Firefly" does not.
  • Dexter is overrated ;)
Enough of the movies and series for now... Maybe, we will come back to that dataset some other time.




