Here is some updated R code from my previous post. It doesn't throw any warnings when importing tracks with and without heart rate information. Also, it is easier to distinguish types of tracks now (e.g., when you want to plot runs and rides separately). Another thing I changed: You get very basic information on the track when you click on it (currently the name of the track and the total length).

Have fun and leave a comment if you have any questions.



options(stringsAsFactors = F)

rm(list=ls())

library(httr)
library(rjson)
library(leaflet)
library(dplyr)

token <- "<your Strava API token>"

# Functions ---------------------------------------------------------------

# Convert one 'latlng' stream object into a dataframe of coordinates
# (each element of the stream's data is a (lat, lon) pair)
get.coord.df.from.stream <- function (stream.obj) {
  data.frame(lat = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[1]]),
             lon = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[2]]))
}

# Fetch the raw 'latlng' stream for a single activity from the Strava API
get.stream.from.activity <- function (act.id, token) {
  stream <- GET("https://www.strava.com/",
                path = paste0("api/v3/activities/", act.id, "/streams/latlng"),
                query = list(access_token = token))
  content(stream)
}

get.activities2 <- function (token) {
  activities <- GET("https://www.strava.com/", path = "api/v3/activities",
                    query = list(access_token = token, per_page = 200))
  activities <- content(activities, "text")
  activities <- fromJSON(activities)
  fields <- c("name", "distance", "moving_time", "elapsed_time", "total_elevation_gain",
              "type", "id", "start_date_local",
              "location_country", "average_speed", "max_speed", "has_heartrate", "elev_high",
              "elev_low", "average_heartrate", "max_heartrate")
  res.df <- data.frame()
  for (a in activities) {
    # missing fields (e.g., heart rate for activities without a monitor) become NA,
    # so no warnings are thrown when mixing tracks with and without heart rate
    values <- sapply(fields, FUN = function (x) {
      if (is.null(a[[x]])) NA else a[[x]]
    })
    res.df <- rbind(res.df, values)
  }
  names(res.df) <- fields
  res.df
}

get.multiple.streams <- function (act.ids, token) {
  res.list <- list()
  for (act.id.i in 1:length(act.ids)) {
    if (act.id.i %% 5 == 0) cat("Actitivy no.", act.id.i, "of", length(act.ids), "\n")
    stream <- get.stream.from.activity(act.ids[act.id.i], token)
    coord.df <- get.coord.df.from.stream(stream)
    res.list[[length(res.list) + 1]] <- list(act.id = act.ids[act.id.i],
                                             coords = coord.df)
  }
  res.list
}

activities <- get.activities2(token)

stream.list <- get.multiple.streams(activities$id, token)


# Leaflet -----------------------------------------------------------------

lons.range <- c(9.156572, 9.237580)
lats.range <- c(48.74085, 48.82079)

map <- leaflet() %>%
  addProviderTiles("OpenMapSurfer.Grayscale", # nice: CartoDB.Positron, OpenMapSurfer.Grayscale, CartoDB.DarkMatterNoLabels
                   options = providerTileOptions(noWrap = T)) %>%
  fitBounds(lng1 = min(lons.range), lat1 = max(lats.range), lng2 = max(lons.range), lat2 = min(lats.range))

# Adds the polyline for one activity to the (global) map object
add.run <- function (act.id, color, act.name, act.dist, strlist = stream.list) {
  act.ind <- sapply(strlist, USE.NAMES = F, FUN = function (x) {
    x$act.id == act.id
  })
  act.from.list <- strlist[act.ind][[1]]
  map <<- addPolylines(map, lng = act.from.list$coords$lon,
               lat = act.from.list$coords$lat,
               color = color, opacity = 1/3, weight = 2,
               popup = paste0(act.name, ", ", round(as.numeric(act.dist) / 1000, 2), " km"))
}

# plot all

for (i in 1:nrow(activities)) {
  add.run(activities[i, "id"], ifelse(activities[i, "type"] == "Run", "red",
                                      ifelse(activities[i, "type"] == "Ride", "blue", "black")),
          activities[i, "name"], activities[i, "distance"])
}

map
Comments

  1. Could you direct us to your previous post?

     Reply: Hi, it's linked under "previous post". The URL is http://rcrastinate.blogspot.com/2018/01/where-do-you-run-to-map-your-strava.html

  1. Hi all, this is just an announcement.

    I am moving Rcrastinate to a blogdown-based solution and am therefore leaving blogger.com. If you're interested in the new setup and how you could do the same yourself, please check out the all shiny and new Rcrastinate over at

    http://rcrastinate.rbind.io/



    In my first post over there, I am giving a short summary on how I started the whole thing. I hope that the new Rcrastinate is also integrated into R-bloggers soon.

    Thanks for being here, see you over there.

  2. Alright, it seems like this is developing into a blog where I am increasingly investigating my own music listening habits.
    Recently, I've come across the analyzelastfm package by Sebastian Wolf. I used it to download my complete listening history from Last.FM for the last ten years. That's a complete dataset from 2009 to 2018 with exactly 65,356 "scrobbles" (which is the word Last.FM uses for one playback of a song).

    Getting scrobble data

    I'm pretty sure this can be done more efficiently, but this is how I'm getting the data. Be advised that this takes quite a long time.

    install.packages("devtools")
    devtools::install_github("zappingseb/analyze_last_fm")
    library(analyzelastfm)
    library(lubridate)
    library(ggplot2)
    library(gridExtra)
    library(grid)
    library(ggrepel)
    library(scales)
    library(RColorBrewer)  # for brewer.pal(), used in the plots below

    lkey <- "< Last.FM API key goes here >"

    data18 <- UserData$new("< Last.FM user name >", lkey, year = 2018)
    data17 <- UserData$new("< Last.FM user name >", lkey, year = 2017)
    data16 <- UserData$new("< Last.FM user name >", lkey, year = 2016)
    data15 <- UserData$new("< Last.FM user name >", lkey, year = 2015)
    data14 <- UserData$new("< Last.FM user name >", lkey, year = 2014)
    data13 <- UserData$new("< Last.FM user name >", lkey, year = 2013)
    data12 <- UserData$new("< Last.FM user name >", lkey, year = 2012)
    data11 <- UserData$new("< Last.FM user name >", lkey, year = 2011)
    data10 <- UserData$new("< Last.FM user name >", lkey, year = 2010)
    data09 <- UserData$new("< Last.FM user name >", lkey, year = 2009)

    scrobs <- rbind(data18$data_table,
                    data17$data_table,
                    data16$data_table,
                    data15$data_table,
                    data14$data_table,
                    data13$data_table,
                    data12$data_table,
                    data11$data_table,
                    data10$data_table,
                    data09$data_table)
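
    The ten near-identical calls above could also be written as a loop. A compact equivalent sketch:

    # Equivalent: fetch and row-bind all ten years in one go
    scrobs <- do.call(rbind, lapply(2009:2018, function (y) {
      UserData$new("< Last.FM user name >", lkey, year = y)$data_table
    }))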

    Before moving on, let's parse the timestamp and order the dataset by it.

    scrobs$timestamp <- parse_date_time(scrobs$datetext, "dmYHM")

    scrobs <- scrobs[order(scrobs$timestamp),]

    rownames(scrobs) <- 1:nrow(scrobs)

    Artist and album charts

    I have a dataframe "scrobs" with one scrobble per row now. Let's do the obvious thing and plot 10-year overall charts of artists and albums - NB: I'm using Last.FM's brand color here ;)

    top.artists <- as.data.frame(sort(table(scrobs$artist), decreasing = T)[1:50])
    ggplot(top.artists, aes(x = Var1, y = Freq)) +
      geom_bar(stat = "identity", fill = "#d51007") +
      coord_flip() +
      geom_text(aes(label = Freq), hjust = "right", size = 3, col = "white") +
      labs(x = "", y = "Playcount")

    top.albums <- as.data.frame(sort(table(scrobs$album), decreasing = T)[1:50])
    ggplot(top.albums, aes(x = Var1, y = Freq)) +
      geom_bar(stat = "identity", fill = "#d51007") +
      coord_flip() +
      geom_text(aes(label = Freq), hjust = "right", size = 3, col = "white") +
      labs(y = "Playcount", x = "")

    Top 50 artists

    Top 50 albums

    I wonder how many artists you could name from my top 50 albums. It's quite interesting to see how I obviously have a group of top 5 artists but only one album that stands out at the top of the charts (seriously, "Friend and Foe" by Menomena is great art!).

    Artists and albums through time 

    Next, I wanted to see how the total number of artists I played developed over the years. So, what I want to do is get a cumulative view of the number of artists (and albums) over time. Whenever a new artist or album appears in the list of scrobbles, the counter is increased by 1. I did it this way (again, I think it could be done more efficiently, but this works and is quite transparent).

    # New artist annotation

    scrobs$new.artist <- F
    for (i in 1:nrow(scrobs)) {
      if (i %% 1000 == 0) cat(i, "\n")
      cur.row <- scrobs[i,]
      prev.rows <- scrobs[seq_len(i - 1),]  # empty for i = 1; 1:(i-1) would wrongly pick row 1
      if (!(cur.row$artist %in% prev.rows$artist)) scrobs[i, "new.artist"] <- T 
    }
    scrobs$new.artists.till.now <- cumsum(scrobs$new.artist)

    # New album annotation

    scrobs$new.album <- F
    for (i in 1:nrow(scrobs)) {
      if (i %% 1000 == 0) cat(i, "\n")
      cur.row <- scrobs[i,]
      prev.rows <- scrobs[seq_len(i - 1),]  # empty for i = 1; 1:(i-1) would wrongly pick row 1
      if (!(cur.row$album %in% prev.rows$album)) scrobs[i, "new.album"] <- T 
    }
    scrobs$new.albums.till.now <- cumsum(scrobs$new.album)
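
    By the way, the same annotation can be done without loops: duplicated() flags repeated values, so its negation marks first occurrences. A minimal vectorized sketch:

    # Vectorized equivalent of the two loops above
    scrobs$new.artist <- !duplicated(scrobs$artist)
    scrobs$new.artists.till.now <- cumsum(scrobs$new.artist)
    scrobs$new.album <- !duplicated(scrobs$album)
    scrobs$new.albums.till.now <- cumsum(scrobs$new.album)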

    Now that we have the information available in columns "new.artists.till.now" and "new.albums.till.now", let's plot them.

    p1 <- ggplot(scrobs, aes(x = timestamp, y = new.artists.till.now)) +
      geom_line(size = 1) +
      labs(x = "", y = "Number of new artists since beginning")

    p2 <- ggplot(scrobs, aes(x = timestamp, y = new.albums.till.now)) +
      geom_line(size = 1) +
      labs(x = "", y = "Number of new albums since beginning")

    p3 <- grid.arrange(p1, p2)

    This is what we get (as always: please click on the figure to see a larger version).
    It's obvious that the lines look quite similar. New artists coincide with new albums. Then, there are two other things worth mentioning:
    • After the beginning of 2014, the curve seems to get "steppier". Each sharp upward movement means that there was a number of new artists within a shorter period of time. My suspicion is that, after 2013, I had more phases where I was purposefully looking for new artists.
    • Obviously, at the end of 2018, there was a sharp increase in the number of artists. This has something to do with me getting a Spotify account. I'm curious how this develops in the future.

    Getting tag information

    Now, I want to add some information to each scrobble: the top 5 tags given to it by the community on Last.FM. I am coming back to this later. This procedure takes quite a long time because it involves lots of calls to the Last.FM API - just in case you want to do this with your own data.

    In case you are wondering why I am looking up the tags for the "unique.tracks" object: It's a lot faster, because the tags for one track are always the same. Whenever a track repeats within "scrobs", the API would give the exact same result. So, I am using "unique.tracks" which is about a sixth of the size of "scrobs" (meaning that, on average, I have been listening to each track 6 times - of course, this distribution is heavily skewed, so the mean value does not tell us much here).

    library(RLastFM)

    unique.tracks <- unique(scrobs[, c("track", "artist")])

    counter <- 0
    unique.tracks$track.top5.tags <- apply(unique.tracks, 1, FUN = function (row) {
      counter <<- counter + 1
      if (counter %% 500 == 0) cat(counter, "\n")
      Sys.sleep(.1)
      paste(track.getTopTags(track = row["track"],
                             artist = row["artist"])$tag[1:5], collapse = ", ")
    })

    unique.tracks$merge.col <- paste(unique.tracks$track, unique.tracks$artist, sep = "___")
    scrobs$merge.col <- paste(scrobs$track, scrobs$artist, sep = "___")

    scrobs <- merge(scrobs, unique.tracks[,c("merge.col", "track.top5.tags")], by = "merge.col", all.x = T, all.y = F)

    scrobs <- scrobs[order(scrobs$timestamp),]

    scrobs <- scrobs[,-1] # getting rid of merge.col

    Now for some clean-up of these tags. While lowercasing all the tags, translating some of them (different forms of "deutsch" into "german"), and unifying the different varieties of "hip-hop", "hip hop", and "hiphop", I am also putting them all into a vector ("all.tags").

    all.tags <- list()
    counter <- 0
    scrobs$track.top5.tags <- sapply(scrobs$track.top5.tags, USE.NAMES = F, FUN = function (x) {
      counter <<- counter + 1
      if (counter %% 20000 == 0) cat(counter, "\n")
      tags.i <- strsplit(x, ", ", fixed = T)[[1]]
      tags.i <- tolower(tags.i)
      tags.i <- gsub("hiphop", "hip-hop", tags.i, fixed = T)
      tags.i <- gsub("hip hop", "hip-hop", tags.i, fixed = T)
      tags.i <- gsub("deutscher", "german", tags.i, fixed = T)
      tags.i <- gsub("deutsch", "german", tags.i, fixed = T)
      tags.i <- gsub("deutsche", "german", tags.i, fixed = T)
      all.tags[[length(all.tags) + 1]] <<- tags.i
      paste(tags.i, collapse = ", ")
    })
    rm(counter)

    all.tags <- do.call("c", all.tags)

    Tag chart

    With this, we can already plot a rough overview of the top 50 tags among the top 5 tags of my scrobbles in 10 years (that sounds a bit complicated).

    tag.tab <- as.data.frame(sort(table(all.tags), decreasing = T)[1:50])
    tag.tab <- tag.tab[!(tag.tab$all.tags %in% c("null", "na")),]
    ggplot(tag.tab, aes(x = all.tags, y = Freq)) +
      geom_bar(stat = "identity", fill = "#d51007") +
      labs(x = "Tag", y = "How often did the tag appear in the top 5 tags of my scrobbles?") +
      geom_text(aes(label = Freq), hjust = "right", size = 3, col = "white") +
      coord_flip()



    There are some patterns here: "rock" is by far the most frequent tag. However, I guess this tag is also one of the most general on this list. There is some "hip-hop" going on, and "female vocalists" are also quite prominent among the most frequent tags. Also, "funk" and "soul" appear in the higher ranks. Obviously, some people tag songs with "radiohead" (songs by Radiohead, I reckon). Well, some people...

    Of course, there's a lot more one can do with these tags. But I wanted to play around with all-time scrobble visualizations first. For that, we need to code the year and month of each scrobble (we just need to get that out of the timestamp).

    scrobs$month <- substr(scrobs$datetext, 4, 6)
    scrobs$month <- factor(scrobs$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                                    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
    scrobs$month.year <- substr(scrobs$datetext, 4, 11)
    month.year.order <- paste(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
                              rep(2009:2018, each = 12))
    scrobs$month.year <- factor(scrobs$month.year, levels = month.year.order)
    # year and date of each scrobble, used in the plots below
    # (in case data_table does not already provide these columns)
    scrobs$year <- year(scrobs$timestamp)
    scrobs$date <- as.Date(scrobs$timestamp)


    Number of scrobbles per month, year, and day


    Let's get a simple overview of the number of scrobbles per month. For this, I want to take into account how many days each month has. If there's more time, you can listen to more music, right? So, I am dividing the total number of scrobbles in each month by the number of days within this month multiplied by 10 (because there are ten years in the dataset). February is a little complicated because there were two leap years (in German, they're "switch years" - Schaltjahre), 2012 and 2016. So, I am adding 2 days for those years (not that it would make much of a difference in the plots...).

    month.tab <- as.data.frame(table(scrobs$month))
    month.tab$norm.scrobs <- month.tab$Freq / c(31*10, 28*10+2, 31*10, 30*10, 31*10, 30*10, 31*10, 31*10, 30*10, 31*10, 30*10, 31*10)

    ggplot(month.tab, aes(x = Var1, y = norm.scrobs)) +
      geom_bar(stat = "identity", fill = "#d51007") +
      geom_text(aes(label = round(norm.scrobs, 2)), vjust = 2, col = "white") +
      labs(x = "", y = "Mean scrobbles per day in month (over 10 year period)")



    As expected, the summer months (especially July) are months when I didn't listen to music as frequently, mainly because I didn't take my music collection with me on summer vacations. The winter months, on the other hand, are when I stay at home more. And there: music.

    Of course, we can also look at the scrobbles in "instances" of months and not aggregated over years. We could do it this way (and still distinguish the years by bar color). I did not do the "length of month" normalization here.

    ggplot(scrobs, aes(x = month.year, fill = factor(year))) +
      geom_bar(width = 1) +
      labs(x = "Month/Year", y = "Playcount", fill = "Year") +
      scale_fill_manual(values = brewer.pal(10, "Spectral")) +
      scale_x_discrete(breaks = levels(scrobs$month.year)[seq(1, nlevels(scrobs$month.year), 3)]) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))


    We could even do this by day and add a smoother line to it.

    plot.df <- as.data.frame(table(scrobs$date))
    names(plot.df) <- c("Date", "Playcount")
    plot.df$year <- year(plot.df$Date)
    plot.df$month <- month(plot.df$Date)
    plot.df$month.year <- paste(plot.df$year, plot.df$month, sep = "-")

    ggplot(plot.df, aes(x = Date, y = Playcount, col = factor(year), group = 1)) +
      geom_line() +
      labs(col = "Year") +
      scale_x_discrete(breaks = NULL) +
      geom_smooth(method = "loess", span = .1, se = F)


    Well OK, this is mighty colorful but not entirely useful, right? The only thing that's quite salient is the increasing number of scrobbles at the very end of 2018. We also see this in a violin plot of the years. I am showing you this in two ways: without facets, with the year on the x-axis, and faceted by year.

    ggplot(plot.df, aes(x = factor(year), y = Playcount, fill = factor(year))) +
      geom_violin(draw_quantiles = c(1/4, 1/2, 3/4), col = "#00000000") +
      geom_jitter(height = 0, width = .2, alpha = .3) +
      geom_point(inherit.aes = F, aes(x = factor(year), y = Playcount), stat = "summary",
                 color = "white", size = 3) +
      scale_fill_manual(values = brewer.pal(10, "Spectral")) +
      labs(x = "") + guides(fill = F)

    ggplot(plot.df, aes(x = 1, y = Playcount, fill = factor(year))) +
      geom_violin(draw_quantiles = c(1/4, 1/2, 3/4), col = "#00000000") +
      geom_jitter(height = 0, width = .2, alpha = .3) +
      geom_point(inherit.aes = F, aes(x = 1, y = Playcount), stat = "summary",
                 color = "white", size = 3) +
      scale_fill_manual(values = brewer.pal(10, "Spectral")) +
      scale_x_continuous(breaks = NULL) +
      labs(x = "") + guides(fill = F) +
      facet_wrap(~ factor(year), scales = "free", ncol = 5)




    Aren't these gorgeous? I really like violin plots. The horizontal lines denote quartiles, and the white dot represents the mean number of scrobbles per day for each year. Each violin spans from the minimum to the maximum value, and its outline represents the density ("distribution") of the data points. Each day is represented by a semi-transparent black dot.

    For the faceted one I allowed the y-axis to vary freely because there is one outlier in 2010 that "crushes" all the violins to the baseline. What's going on there?

    I can tell you: it's August 21st, 2010, with 227 scrobbles. With a hypothetical mean track length of 4 minutes, that's 227*4/60 = 15.13 hours of music, which does not sound too plausible. Here's what happened: I had been on summer holidays and had my iPod nano with me (it still lies in a drawer here, still working, but the smartphone took over its job). Back in 2010, I forgot the iPod in the hotel, and when I returned home, I asked the staff to look for it and send it to me via mail. They found it and sent it on its way to Germany. Of course, they didn't switch it off before mailing it - meaning that at some point during its journey (presumably very early on), the "shake-to-play-random-songs" trigger went off. This caused the iPod to play songs till the battery went dead. The battery obviously lasted quite long.

    Due to the varying y-axes in the faceted case, the white dots denoting the mean number of scrobbles per day are not comparable across facets - comparing the height of a white dot in one facet with the one in another facet would just be utterly confusing.

    Artist dispersion

    One last thing before I wrap this post up: I was interested in the dispersion of artists. While frequency captures the overall number of scrobbles per artist, dispersion tries to capture how the data points are spread out over the whole dataset. I have played around with a few more sophisticated dispersion measures (e.g., Gries' DP from the realm of linguistic research). However, the data at hand does not seem to be suited for these measures, because I get really low dispersion values. That is why I opted for a very basic measure: the ratio of days on which each artist has been played. An easy example: suppose we had 100 days in the dataset and artist X was played on 35 of these days (one scrobble is enough, which could be a weakness of this measure); then artist X gets a ratio value of 35 / 100 = .35, or 35%. Let's do this for all artists in the top 30 and the whole range of days:

    topx <- sort(table(scrobs$artist), decreasing = T)[1:30]
    n.days <- length(unique(scrobs$date))
    day.ratio.topx <- sapply(names(topx), USE.NAMES = T, FUN = function (art) {
      n.art.days <- length(unique(scrobs[scrobs$artist %in% art, "date"]))
      n.art.days / n.days
    })
    normed.day.ratio.topx <- day.ratio.topx / topx # We'll use this later

    We can do the same with months.

    n.months <- length(unique(scrobs$month.year))
    month.ratio.topx <- sapply(names(topx), USE.NAMES = T, FUN = function (art) {
      n.art.mnths <- length(unique(scrobs[scrobs$artist %in% art, "month.year"]))
      n.art.mnths / n.months
    })
    normed.month.ratio.topx <- month.ratio.topx / topx # For later

    I am putting all these values in a dataframe and plotting everything for the top 30 artists, with the total playcount on the x-axis and the new measures on the y-axis. I am also adding smoother lines to have a quick look at the relationship between the two variables.

    art.df <- data.frame(topx)
    art.df$day.ratio <- day.ratio.topx
    art.df$normed.day.ratio <- normed.day.ratio.topx
    art.df$month.ratio <- month.ratio.topx
    art.df$normed.month.ratio <- normed.month.ratio.topx

    ggplot(art.df, aes(x = Freq, y = day.ratio, label = Var1)) +
      geom_point() +
      geom_label_repel() +
      geom_line(stat = "smooth", method = loess, alpha = .3, size = 2, col = "blue") +
      labs(x = "Playcount", y = "Artist played on % of all days") +
      scale_y_continuous(labels = percent)

    ggplot(art.df, aes(x = Freq, y = month.ratio, label = Var1)) +
      geom_point() +
      geom_label_repel() +
      geom_line(stat = "smooth", method = loess, alpha = .3, size = 2, col = "blue") +
      labs(x = "Playcount", y = "Artist played in % of all months") +
      scale_y_continuous(labels = percent)



    This is quite interesting: while Menomena are my top artist, Radiohead is played in more months, and the same holds for Prince. That means that my scrobbles of Radiohead are more evenly "spread out" over the whole range of the data. However, there is one problem with these plots: it is no surprise that there seems to be a positive relationship between these measures and the total playcount. If you listen to an artist more, the probability rises that this artist is played in more months, right? We can correct for this effect by norming the measures by the total playcount. The picture looks a bit different then, and the relationships are weakened. I already computed this above, in the variables normed.day.ratio and normed.month.ratio. Let's plot them.

    ggplot(art.df, aes(x = Freq, y = normed.day.ratio, label = Var1)) +
      geom_point() +
      geom_label_repel() +
      geom_line(stat = "smooth", method = loess, alpha = .3, size = 2, col = "blue") +
      labs(x = "Playcount", y = "Artist played on % of all days / Playcount") +
      scale_y_continuous()

    ggplot(art.df, aes(x = Freq, y = normed.month.ratio, label = Var1)) +
      geom_point() +
      geom_label_repel() +
      geom_line(stat = "smooth", method = loess, alpha = .3, size = 2, col = "blue") +
      labs(x = "Playcount", y = "Artist played in % of all months / Playcount") +
      scale_y_continuous()



    Do you see what happened? All top 5 artists "get corrected" a lot. And especially Muse are gaining points on the normed ratio scale. That means: when I correct for the total number of plays, Muse is my top artist in terms of being played throughout all days and months. I have two tracks in mind that could have triggered this: "Panic Station" and "Supermassive Black Hole" - two tracks I come back to quite frequently. Maybe, I'll go into more detail in a future post. This correction might be a little too strict and one could play around a lot more with this (e.g., with residualization techniques), but it gives a good first impression, I think.

    Wrapping up

    That's it for this post. I will come back to this dataset because there is a lot more to analyze and visualize here. But for now, let's wrap it up:
    • We got 10 years of scrobble data from Last.FM.
    • We made simple artist and album charts.
    • We had a look at the cumulative development of artists and albums in time.
    • We got tag information on all tracks from Last.FM (if there was information available).
    • We created a tag chart.
    • We had a look at scrobbles per month, day, and year.
    • We analysed the dispersion of artists over the whole period of 10 years and tried a simple norming procedure to correct for the overall playcount.
    If you liked this post, I hope you will come back for the future ones. It could take a while for me to write them up but you can follow me on Twitter or have a look at R-bloggers to keep up-to-date. Bye.





  3. Giddy up, giddy it up
    Wanna move into a fool's gold room
    With my pulse on the animal jewels
    Of the rules that you choose to use to get loose
    With the luminous moves
    Bored of these limits, let me get, let me get it like
    Wow!

    When it comes to surreal lyrics and videos, I always think of Beck. Above, I cited the beginning of the song "Wow" from his latest album "Colors", which has received rather mixed reviews. In this post, I want to show you what I have done with Spotify's API. We will also see how "Colors" compares to all the other albums by Beck on a range of the API's measures. This is what we're going to do:

    • Set up the R packages to access the Spotify and Last.FM APIs
    • Get chart information on a specific user (in this case: me) from Last.FM
    • Get information on these top artists from Spotify
    • Visualize some of the information with ggplot2
    • Get information on all the songs by Beck and Radiohead
    • Again: Visualize this information
    First, let's set things up. If you want to download the RLastFM package (it's not on CRAN), you can follow these instructions. If you need an account for the Spotify API, you can get one on this page. Get an account for the Last.FM API with this form.

    library(spotifyr)
    library(RLastFM)
    library(dplyr)

    # Spotify
    s.id <- "< API ID >"
    s.sc <- "< API Secret"

    # LastFM
    l.ky <- "< API Key >"
    l.ap <- "http://ws.audioscrobbler.com/2.0/"

    Sys.setenv(SPOTIFY_CLIENT_ID = s.id)
    Sys.setenv(SPOTIFY_CLIENT_SECRET = s.sc)

    access_token <- get_spotify_access_token()

    I slightly adapted the user.getTopArtists() function in the RLastFM package because I wanted to use the "limit" parameter of the Last.FM API.

    user.getTopArt2 <- function (username, period = NA, key = l.ky, parse = TRUE, limit = 20) {
      # getForm() and xmlParse() come from RCurl and XML, which are loaded with RLastFM
      params = list(method = "user.gettopartists", user = username, 
                    period = period, api_key = key, limit = limit)
      params = params[!as.logical(lapply(params, is.na))]
      ret = getForm(l.ap, .params = params)
      doc = xmlParse(ret, asText = TRUE)
      if (parse) 
        doc = RLastFM:::p.user.gettopartists(doc)
      return(doc) }

    Also, I adapted the function get_artist_audio_features() from the spotifyr package to override the dialog where I am required to type the name of the artist in interactive sessions.

    get.art.audio.feats <- function (artist_name, access_token = get_spotify_access_token()) {
      artists <- get_artists(artist_name)
      selected_artist <- artists$artist_name[1]
      artist_uri <- artists$artist_uri[artists$artist_name == 
                                         selected_artist]
      albums <- get_albums(artist_uri)
      if (nrow(albums) > 0) {
        albums <- select(albums, -c(base_album_name, base_album, 
                                    num_albums, num_base_albums, album_rank))
      }
      else {
        stop(paste0("Cannot find any albums for \"", selected_artist, 
                    "\" on Spotify"))
      }
      album_popularity <- get_album_popularity(albums)
      tracks <- get_album_tracks(albums)
      track_features <- get_track_audio_features(tracks)
      track_popularity <- get_track_popularity(tracks)
      tots <- albums %>% left_join(album_popularity, by = "album_uri") %>% 
        left_join(tracks, by = "album_name") %>% left_join(track_features, 
                                                           by = "track_uri") %>% left_join(track_popularity, by = "track_uri")
      return(tots) }

    Now we're getting the relevant data. First, we are accessing the current top 20 artists (of all time) of a specific Last.FM user.

    top.arts <- as.data.frame(user.getTopArt2("< user name >", period = "overall", limit = 20))
    head(top.arts)
                         artist playcount                                 mbid rank
    1                  Menomena      1601 ad386705-fb8c-40ec-94d7-e690e079e979    1
    2                      Beck      1594 a8baaa41-50f1-4f63-979e-717c14979dfb    2
    3                 Radiohead      1556 a74b1b7f-71a5-4011-9441-d0b5e4122711    3
    4                 PJ Harvey      1546 e795e03d-b5d5-4a5f-834d-162cfb308a2c    4
    5                    Prince      1501 cdc0fff7-54cf-4052-a283-319b648670fd    5
    6 Nick Cave & The Bad Seeds      1243 172e1f1a-504d-4488-b053-6344ba63e6d0    6

    Now, we are accessing the "artist audio features" for each one of the top 20 artists. I'm including a Sys.sleep(5) here, because I don't want to access Spotify's API too frequently. With 5 seconds between each artist, we should be on the safe side - I guess 2 seconds would be enough.

    Please note that get.art.audio.feats() returns a tibble that contains one row for each of the artist's tracks. That is why I have to aggregate the data by artist. Hence, the "audio features" of an artist are the mean values of all the tracks by this artist.

    art.info.list <- list()
    for (i in 1:nrow(top.arts)) {
      cat(i, "\n")
      new.art.info <- get.art.audio.feats(top.arts[i, "artist"])
      new.art.info$artist <- top.arts[i, "artist"]
      art.info.list[[length(art.info.list)+1]] <- new.art.info
      Sys.sleep(5)
    }
    art.info <- do.call("rbind", art.info.list)

    agg.art.info <- aggregate(cbind(danceability, energy, speechiness,
                                    acousticness, instrumentalness, liveness,
                                    valence, tempo, track_popularity) ~ artist,
                              data = art.info, FUN = mean)
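
    Since dplyr is loaded anyway, the same per-artist aggregation could alternatively be written like this (just a sketch; note that aggregate() with a formula silently drops rows with NAs, while this version would propagate them):

    # One row per artist, mean of each audio feature
    agg.art.info <- art.info %>%
      group_by(artist) %>%
      summarise_at(vars(danceability, energy, speechiness, acousticness,
                        instrumentalness, liveness, valence, tempo, track_popularity),
                   mean) %>%
      as.data.frame()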

    Now, let's merge Last.FM's playcount information with the information we got from Spotify's API. We are also sorting the dataframe by my playcount on Last.FM.

    agg.art.info <- merge(agg.art.info, top.arts[,c("playcount", "rank", "artist")], by = "artist", all.x = T, all.y = F)
    agg.art.info <- agg.art.info[order(agg.art.info$playcount, decreasing = T),]
    saveRDS(agg.art.info, file = "agg.art.info.Rds")


    What we get is this:

                          artist danceability    energy speechiness acousticness instrumentalness  liveness
    7                   Menomena    0.5397246 0.5475217  0.05414783    0.2678826       0.28483706 0.1451246
    3                       Beck    0.6043865 0.6106874  0.06503006    0.2653964       0.20943021 0.1872540
    13                 Radiohead    0.4095833 0.5755077  0.05504551    0.3131903       0.39465664 0.2026224
    11                 PJ Harvey    0.5155636 0.4578315  0.09909212    0.4328224       0.13406858 0.1739836
    12                    Prince    0.7078350 0.7295825  0.04883786    0.2464166       0.04265596 0.1405495
    8  Nick Cave & The Bad Seeds    0.4147489 0.5152961  0.05414545    0.3465942       0.07602337 0.2952866
         valence    tempo track_popularity playcount rank
    7  0.3084768 131.4496          5.84058      1601    1
    3  0.4795632 115.6165         21.64417      1594    2
    13 0.2858109 118.7648         47.62821      1556    3
    11 0.4331848 118.7181         21.32727      1546    4
    12 0.7672718 126.3280         25.81553      1501    5
    8  0.3521593 119.5002         27.47619      1243    6

    Just a few quick notes concerning the audio features given by the Spotify API (I'm citing Spotify's webpage here):

    • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    • track_popularity: The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.
      Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. 
    I am only using a few of these features for visualizations here, namely danceability, valence, and track_popularity. Let's start with a scatter plot. ggplot2 and the really great package "ggrepel" already do a good job here.

    library(ggplot2)
    library(ggrepel)
    library(tidyr)
    library(scales)
    library(RColorBrewer)


    data <- readRDS("agg.art.info.Rds")
    ggplot(data, aes(x = danceability, y = playcount, size = track_popularity, col = valence)) +
      geom_point() +
      geom_label_repel(aes(label = artist)) +
      labs(x = "Danceability", y = "Playcount", size = "Mean popularity",
           col = "Valence", title = "Popularity, Valence, and Danceability",
           subtitle = "TOP 20 Last.FM artists - with track information from Spotify") +
      scale_color_continuous(low = "darkred", high = "green") +
      scale_size(range = c(2, 6)) +
      theme_minimal()


    There's quite a lot of information in this plot. Prince - along with the not very popular Nikka Costa - is definitely the most danceable and most positive (in terms of emotional valence) among my top artists. There is no clear relationship between danceability and my playcount. Also, Menomena, my top artist, is really not popular on Spotify - I guess you can call that a hipster factor.

    Now, I want to have a semantic profile plot for my top 5 artists. It's a little bit of coding work before we get this.

    for (var in c("tempo", "track_popularity", "playcount")) {
      data[, paste0("norm.", var)] <- rescale(data[, var], to = c(0, 1))
    }

    data.long <- gather(data, key = variable, value = value, c("danceability", "energy",
                        "speechiness", "acousticness", "instrumentalness", "liveness", "valence"))

    substr(data.long$variable, 1, 1) <- toupper(substr(data.long$variable, 1, 1))
    data.long$artist <- factor(data.long$artist, levels = data$artist)

    ggplot(data.long[data.long$rank %in% 1:5,], aes(y = value, x = variable, group = artist, col = artist)) +
      geom_line(size = 2, lineend = "round", alpha = .8) +
      scale_color_manual(values = brewer.pal(5, "Spectral")) +
      scale_y_continuous(breaks = NULL) +
      labs(x = "", y = "", col = "Artist") +
      coord_flip() +
      theme_dark() + theme(axis.text.y = element_text(size = 15),
                           panel.grid.major.y = element_line(size = 1))


    It's still not optimal, but we get a good idea of the artists' different profiles. We can read this plot horizontally to get a ranking of artists for one feature. But we can also read it vertically to get a sense of the overall distribution of features for an artist.

    Now, let's move on to a more detailed analysis of Beck and Radiohead. First, I am analyzing all the albums by Beck with their year of release on the x-axis, their popularity on the y-axis, danceability color-coded and valence coded by size.

    beck <- as.data.frame(get.art.audio.feats("Beck"))

    beck.alb <- aggregate(cbind(danceability, energy, speechiness,
                                acousticness, instrumentalness, liveness,
                                valence, tempo, track_popularity) ~ album_name,
                          data = beck, FUN = mean)

    beck.alb <- merge(beck.alb, unique(beck[, c("album_name", "album_release_year", "album_popularity")]), by = "album_name", all.x = T, all.y = F)

    beck.alb$album_release_year <- substr(beck.alb$album_release_year, 1, 4)
    beck.alb$n.tracks <- sapply(beck.alb$album_name, USE.NAMES = F, FUN = function (x) table(beck$album_name)[x])
    beck.alb[beck.alb$album_name == "One Foot In The Grave (Expanded Edition)", "album_name"] <- "One Foot In The Grave"
    beck.alb <- beck.alb[order(beck.alb$album_release_year),]

    ggplot(beck.alb, aes(x = album_release_year, y = album_popularity,
                         label = album_name, size = valence,
                         color = danceability)) +
      geom_point() +
      geom_label_repel() +
      scale_color_continuous(low = "darkred", high = "green") +
      scale_size(range = c(3, 10)) +
      labs(x = "Release year", size = "Valence", y = "Album popularity", color = "Danceability",
           title = "Albums by Beck", subtitle = "Values taken from Spotify API") +
      theme_minimal()


    Another thing I wanted to show you is some kind of "album profile" or - more precisely - an answer to the question whether there are tracks on albums that are very salient in terms of track popularity. We can easily do this with a bit of transformation and facet wrapping. I am coding track popularity with both the height and the color of the bars to get an impression of the tracks' popularity compared to the overall dataset (because the axes are allowed to vary freely across facets).

    beck %>% group_by(album_name) %>%
      mutate(track.no = 1:length(track_name)) %>%
      ungroup() %>% as.data.frame() -> beck

    ggplot(beck, aes(x = factor(track.no), y = track_popularity, fill = track_popularity)) +
      geom_bar(stat = "identity") +
      labs(x = "Track", y = "Track popularity", fill = "Track popularity",
           title = "Popularity of tracks on albums by Beck",
           subtitle = "Values taken from Spotify API") +
      facet_wrap(~ album_name, scales = c("free"))


    See the rather high first bar for "Mellow Gold"? That's "Loser", still the most popular of Beck's tracks with the rest of the tracks on "Mellow Gold" rather in "Guero" territory in terms of popularity. Also, "Colors" - Beck's latest album - is rather balanced in terms of popularity.

    [EDIT] A friend had the idea to fix the y-axis on this plot to make the albums more comparable in terms of popularity. Indeed, large differences for Beck's albums are visible. I did the same for Radiohead below - and it's quite interesting that Radiohead's albums are much more homogeneous in terms of popularity! [/EDIT]
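
    In code, that variant only requires freeing the x-axis per facet while keeping y fixed, e.g. for the Beck plot:

    # Same plot as above, but with a y-axis that is fixed across facets
    ggplot(beck, aes(x = factor(track.no), y = track_popularity, fill = track_popularity)) +
      geom_bar(stat = "identity") +
      labs(x = "Track", y = "Track popularity", fill = "Track popularity",
           title = "Popularity of tracks on albums by Beck",
           subtitle = "Values taken from Spotify API") +
      facet_wrap(~ album_name, scales = "free_x")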



    And the same for Radiohead - except that I am excluding two albums (see the code below) and am using the tempo feature for the coloring.

    radiohead <- as.data.frame(get.art.audio.feats("Radiohead"))

    radiohead <- radiohead[!(radiohead$album_name %in% c("OK Computer OKNOTOK 1997 2017", "TKOL RMX 1234567")),]

    radiohead.alb <- aggregate(cbind(danceability, energy, speechiness,
                                acousticness, instrumentalness, liveness,
                                valence, tempo, track_popularity) ~ album_name,
                          data = radiohead, FUN = mean)

    radiohead.alb <- merge(radiohead.alb, unique(radiohead[, c("album_name", "album_release_year", "album_popularity")]), by = "album_name", all.x = T, all.y = F)

    radiohead.alb$album_release_year <- substr(radiohead.alb$album_release_year, 1, 4)
    radiohead.alb$n.tracks <- sapply(radiohead.alb$album_name, USE.NAMES = F, FUN = function (x) table(radiohead$album_name)[x])
    radiohead.alb <- radiohead.alb[order(radiohead.alb$album_release_year),]

    ggplot(radiohead.alb, aes(x = album_release_year, y = album_popularity,
                         label = album_name, size = valence,
                         color = tempo)) +
      geom_point() +
      geom_label_repel() +
      scale_color_continuous(low = "darkred", high = "green") +
      scale_size(range = c(3, 10)) +
      labs(x = "Release year", size = "Valence", y = "Album popularity", color = "Tempo",
           title = "Albums by Radiohead", subtitle = "Values taken from Spotify API") +
      theme_minimal()


    It's a shame how Disk 2 of "In Rainbows" compares to Disk 1, because it contains really great songs. Also, it's interesting that "OK Computer" still doesn't reach the popularity of "Pablo Honey" - as the second plot suggests, this is because of the popularity of "Creep". For the second plot, I am not double-coding track popularity with color but am using the valence feature for the color.

    radiohead %>% group_by(album_name) %>%
      mutate(track.no = 1:length(track_name)) %>%
      ungroup() %>% as.data.frame() -> radiohead

    ggplot(radiohead, aes(x = factor(track.no), y = track_popularity, fill = valence)) +
      geom_bar(stat = "identity") +
      labs(x = "Track", y = "Track popularity", fill = "Valence",
           title = "Popularity and valence of tracks on albums by Radiohead",
           subtitle = "Values taken from Spotify API") +
      facet_wrap(~ album_name, scales = c("free"))


    And with fixed y-axes:
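    (Again, this only changes the scales argument of the facet wrap:)

    ggplot(radiohead, aes(x = factor(track.no), y = track_popularity, fill = valence)) +
      geom_bar(stat = "identity") +
      labs(x = "Track", y = "Track popularity", fill = "Valence",
           title = "Popularity and valence of tracks on albums by Radiohead",
           subtitle = "Values taken from Spotify API") +
      facet_wrap(~ album_name, scales = "free_x")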



    Here we see that this valence feature certainly has some problems. "Fitter Happier" is one of the most positive songs by Radiohead with a value of 0.74??? I beg to differ...









  4. If you're interested in the visualisation of networks or graphs, you might've heard of the great package "visNetwork". I think it's a really great package and I love playing around with it. The scenarios for graph-based analyses are many and diverse: whenever you can describe your data in terms of "outgoing" and "receiving" entities, a graph-based analysis and/or visualisation is possible. During my work as a linguist, I have already used graphs for different purposes, like linking structures within dictionaries, visualising co-occurrence patterns of words, and so on.

    Today, I want to show you something completely different: transfers of male football players in the German "1. Bundesliga", the first division of male football in Germany. We can also describe this data in terms of outgoing and receiving entities (the nodes in the network): the clubs who are selling the players and the clubs who are buying the players. The edges (connections) within the network are the players themselves. And there are further attributes associated with the edges, e.g. the price of the player.

    I'll spare you the boring details of getting the data (please write a comment if you would like more details on that). I start with the raw data structure created by the scraping process. It's a dataframe called transfer.df that looks like this:
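
    Since the screenshot of the dataframe isn't reproduced here, here is a purely hypothetical sketch of its structure (all column names and values are invented for illustration; graph.data.frame() below treats the first two columns as the outgoing and receiving clubs):

    # Hypothetical structure of transfer.df (names and values invented)
    transfer.df <- data.frame(
      from    = c("VfB Stuttgart", "Hamburger SV", "1. FC Köln"),
      to      = c("Borussia Dortmund", "Karriereende", "Vereinslos"),
      player  = c("Player A", "Player B", "Player C"),
      abloese = c("12,50 Mio. €", "-", "ablösefrei")
    )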


    "abloese" (or "Ablöse") means "transfer fee" and currently holds a string with certain codes and the currency: "ablösefrei" means that no transfer fee had to be payed (remember the famous Bosman ruling?). "-" means that the information on a transfer fee doesn't make sense (e.g., when a player finishes his career). "?" means that no information about the transfer fee is available. "Mio." and "Tsd." just encodes "million" or "thousand", we have to deal with that later.

    But we have to take care of something else first. In the dataframe, a player always appears twice if he changed teams within the 1. Bundesliga: the first time for his old club (as outgoing) and the second time for his new club (as incoming). I dealt with it this way (also, I loaded the required packages):

    library(visNetwork)
    library(igraph)
    library(stringr)


    transfer.df2 <- data.frame()
    all.players <- unique(transfer.df$player)
    for (pi in all.players) {
      vork <- grep(pi, transfer.df$player, fixed = T)
      if (length(vork) == 1) {
        transfer.df2 <- rbind(transfer.df2, transfer.df[vork,])
      } else {
        transfer.df2 <- rbind(transfer.df2, transfer.df[vork[1],])
      }
    }

    This basically means that, whenever a player appears more than once in transfer.df, the first appearance is kept and the second one is deleted. The resulting network wouldn't be any different if we kept the second appearance instead. From now on, we are using transfer.df2 as our data structure.
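
    A shorter equivalent (and a stricter one, since grep() would also match one player name inside another):

    # Keep only the first occurrence of each player
    transfer.df2 <- transfer.df[!duplicated(transfer.df$player), ]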

    Now, we have to deal with the "abloese" (transfer fee) column:

    transfer.df2$abloese.num <- sapply(transfer.df2$abloese, USE.NAMES = F, FUN = function (x) {
      if (x %in% c("-", "?")) NA else {
        if (x == "ablösefrei") 0 else {
          mio <- grepl("Mio.", x, fixed = T)
          tsd <- grepl("Tsd.", x, fixed = T)
          x2 <- gsub(",", ".", x, fixed = T)
          x3 <- gsub("Mio. €", "", x2, fixed = T)
          x4 <- as.numeric(str_trim(gsub("Tsd. €", "", x3, fixed = T)))
          if (mio) x4*1000000 else {
            if (tsd) x4*1000 else { "FEHLER" }  # "FEHLER" is German for "error"
          }
        }
      }
    })

    Basically, this is what we are doing:

    • If abloese is "-" or "?", we are using NA
    • If abloese is "ablösefrei" we are putting in 0
    • Then we see whether "Mio." appears in the string.
    • Then we see whether "Tsd." appears in the string.
    • Then we delete these substrings and the EUR sign,
    • trim the string, and convert it to a numeric value.
    • If "Mio." appeared in the string, we multiply the result by one million, and if "Tsd." appeared, we multiply it by one thousand (both never appear in the same string; that wouldn't make sense).
    Alright, now we have the column abloese.num and can move on to grouping the different transfer fees, because we want to assign different colours to the edges in the network depending on the transfer sum. The thresholds are arbitrary.

    transfer.df2$abl.group <- cut(transfer.df2$abloese.num, c(0, 200*1000, 1000*1000, 2000*1000, 5000*1000, 10000*1000, 60000*1000), include.lowest = T)

    transfer.df2$abl.col <- ifelse(transfer.df2$abloese.num == 0, "green",
                                   ifelse(transfer.df2$abl.group == "[0,2e+05]", "#ffffcc",
                                          ifelse(transfer.df2$abl.group == "(2e+05,1e+06]", "#fed976",
                                                 ifelse(transfer.df2$abl.group == "(1e+06,2e+06]", "#feb24c",
                                                        ifelse(transfer.df2$abl.group == "(2e+06,5e+06]", "#fc4e2a",
                                                               ifelse(transfer.df2$abl.group == "(5e+06,1e+07]", "#e31a1c",
                                                                      ifelse(transfer.df2$abl.group == "(1e+07,6e+07]", "#800026", "grey")))))))
    transfer.df2$abl.col <- ifelse(is.na(transfer.df2$abl.group), "grey", transfer.df2$abl.col) 

    Now, I am converting the dataframe to an igraph object and this object to visNetwork object. I'm sure the igraph step could be skipped, but this works like a charm and doesn't take much time.

    graph <- graph.data.frame(transfer.df2)

    vn <- toVisNetworkData(graph)
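
    Just to illustrate that shortcut: a sketch of building the visNetwork structures directly, assuming the first two columns of transfer.df2 hold the selling and buying clubs:

    # Building the nodes/edges dataframes without the igraph detour
    all.nodes <- unique(c(transfer.df2[[1]], transfer.df2[[2]]))
    vn <- list(nodes = data.frame(id = all.nodes, label = all.nodes),
               edges = data.frame(from = transfer.df2[[1]], to = transfer.df2[[2]],
                                  transfer.df2[, -(1:2)]))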

    I am assigning color codes to the nodes:

    vn$nodes$color <- ifelse(vn$nodes$id %in% clubs, "tomato",
                             ifelse(vn$nodes$id == "Vereinslos", "green",
                                    ifelse(vn$nodes$id == "Karriereende", "blue", "grey")))

    All clubs in the 1. Bundesliga get "tomato" (clubs is an object I defined earlier); all clubs that are not in the 1. Bundesliga (e.g., Hamburger SV) get "grey". There are two other special "clubs": "Karriereende" for "end of career", which gets "blue", and "Vereinslos" for "no club", which gets "green".

    Three things left to be done:

    vn$edges$title <- paste(vn$edges$player, vn$edges$abloese, sep = " - ")
    vn$edges$color <- vn$edges$abl.col
    vn$edges$width <- 4
    • Assign a title to the edges that consists of the player name and the transfer fee. This appears when hovering over an edge.
    • Assign the grouped transfer fee color we defined earlier.
    • Increase the width of the edges to make the color more visible.
    Now, for creating the HTML file for the graph:

    visNetwork(nodes = vn$nodes, edges = vn$edges, height = "1000px", width = "100%") %>%
      visOptions(highlightNearest = TRUE) %>%
      #visIgraphLayout(layout = "layout_with_dh") %>%
      visEdges(arrows = "to", arrowStrikethrough = F) %>% visSave(file = "~/Desktop/transfers.html", selfcontained = T)

    Please visit my personal webspace for the final result. The "redder" an edge in the network is, the more expensive the transfer was. You can also click on the nodes to highlight only the adjacent nodes (selling and buying clubs), drag nodes around (graph physics!), and hover over edges to see the specific player being transferred. Of course, zooming is enabled. visNetwork does all that. I love that package!


  5. Here is some updated R code from my previous post. It doesn't throw any warnings when importing tracks with and without heart rate information. Also, it is easier to distinguish types of tracks now (e.g., when you want to plot runs and rides separately). Another thing I changed: You get very basic information on the track when you click on it (currently the name of the track and the total length).

    Have fun and leave a comment if you have any questions.



    options(stringsAsFactors = F)

    rm(list=ls())

    library(httr)
    library(rjson)
    library(leaflet)
    library(dplyr)

    token <- "<your Strava API token>"

    # Functions ---------------------------------------------------------------

    get.coord.df.from.stream <- function (stream.obj) {
      data.frame(lat = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[1]]),
                 lon = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[2]]))
    }

    get.stream.from.activity <- function (act.id, token) {
      stream <- GET("https://www.strava.com/",
                    path = paste0("api/v3/activities/", act.id, "/streams/latlng"),
                    query = list(access_token = token))
      content(stream)
    }

    get.activities2 <- function (token) {
      activities <- GET("https://www.strava.com/", path = "api/v3/activities",
                        query = list(access_token = token, per_page = 200))
      activities <- content(activities, "text")
      activities <- fromJSON(activities)
      res.df <- data.frame()
      for (a in activities) {
        values <- sapply(c("name", "distance", "moving_time", "elapsed_time", "total_elevation_gain",
                           "type", "id", "start_date_local",
                           "location_country", "average_speed", "max_speed", "has_heartrate", "elev_high",
                           "elev_low", "average_heartrate", "max_heartrate"), FUN = function (x) {
                             if (is.null(a[[x]])) {
                               NA } else { a[[x]] }
                           })
        res.df <- rbind(res.df, values)
      }
      names(res.df) <- c("name", "distance", "moving_time", "elapsed_time", "total_elevation_gain",
                         "type", "id", "start_date_local",
                         "location_country", "average_speed", "max_speed", "has_heartrate", "elev_high",
                         "elev_low", "average_heartrate", "max_heartrate")
      res.df
    }

    get.multiple.streams <- function (act.ids, token) {
      res.list <- list()
      for (act.id.i in 1:length(act.ids)) {
        if (act.id.i %% 5 == 0) cat("Actitivy no.", act.id.i, "of", length(act.ids), "\n")
        stream <- get.stream.from.activity(act.ids[act.id.i], token)
        coord.df <- get.coord.df.from.stream(stream)
        res.list[[length(res.list) + 1]] <- list(act.id = act.ids[act.id.i],
                                                 coords = coord.df)
      }
      res.list
    }

    activities <- get.activities2(token)

    stream.list <- get.multiple.streams(activities$id, token)


    # Leaflet -----------------------------------------------------------------

    lons.range <- c(9.156572, 9.237580)
    lats.range <- c(48.74085, 48.82079)

    map <- leaflet() %>%
      addProviderTiles("OpenMapSurfer.Grayscale", # nice: CartoDB.Positron, OpenMapSurfer.Grayscale, CartoDB.DarkMatterNoLabels
                       options = providerTileOptions(noWrap = T)) %>%
      fitBounds(lng1 = min(lons.range), lat1 = max(lats.range), lng2 <- max(lons.range), lat2 = min(lats.range))

    add.run <- function (act.id, color, act.name, act.dist, strlist = stream.list) {
      act.ind <- sapply(stream.list, USE.NAMES = F, FUN = function (x) {
        x$act.id == act.id
      })
      act.from.list <- strlist[act.ind][[1]]
      map <<- addPolylines(map, lng = act.from.list$coords$lon,
                   lat = act.from.list$coords$lat,
                   color = color, opacity = 1/3, weight = 2,
                   popup = paste0(act.name, ", ", round(as.numeric(act.dist) / 1000, 2), " km"))
    }

    # plot all

    for (i in 1:nrow(activities)) {
      add.run(activities[i, "id"], ifelse(activities[i, "type"] == "Run", "red",
                                          ifelse(activities[i, "type"] == "Ride", "blue", "black")),
              activities[i, "name"], activities[i, "distance"])
    }

    map
    Comments

    1. Could you direct us to your previous post?

       Reply: Hi, it's linked under "previous post". The URL is http://rcrastinate.blogspot.com/2018/01/where-do-you-run-to-map-your-strava.html
  6. So, Strava's heatmap made quite a stir over the last few weeks. I decided to give it a try myself. I wanted to create some kind of "personal heatmap" of my runs, using Strava's API. Also, combining the data with Leaflet maps allows us to make use of the beautiful map tiles supported by Leaflet and to zoom and move the maps around - with the runs on them, of course.

    So, let's get started. First, you will need an access token for Strava's API. I found all the necessary information in this helpful "Getting started" post. As soon as you have the token, you have access to your own data.

    Now, let's load some packages and define functions for getting and handling the data. For the get.activities() function, I adapted code from here.

    library(httr)
    library(rjson)
    library(OpenStreetMap)
    library(leaflet)
    library(scales)
    library(dplyr)

    token <- "<access token goes here>"

    get.coord.df.from.stream <- function (stream.obj) {
      data.frame(lat = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[1]]),
                 lon = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[2]]))
    }

    get.stream.from.activity <- function (act.id, token) {
      stream <- GET("https://www.strava.com/",
                    path = paste0("api/v3/activities/", act.id, "/streams/latlng"),
                    query = list(access_token = token))
      content(stream)
    }

    get.activities <- function (token) {
      activities <- GET("https://www.strava.com/", path = "api/v3/activities",
                        query = list(access_token = token, per_page = 200))
      activities <- content(activities, "text")
      activities <- fromJSON(activities)
      activities <- lapply(activities, function(x) {
        x[sapply(x, is.null)] <- NA
        unlist(x)
      })
      data.frame(do.call("rbind", activities))
    }

    get.multiple.streams <- function (act.ids, token) {
      res.list <- list()
      for (act.id.i in seq_along(act.ids)) {
        if (act.id.i %% 5 == 0) cat("Activity no.", act.id.i, "of", length(act.ids), "\n")
        stream <- get.stream.from.activity(act.ids[act.id.i], token)
        coord.df <- get.coord.df.from.stream(stream)
        res.list[[length(res.list) + 1]] <- list(act.id = act.ids[act.id.i],
                                                 coords = coord.df)
      }
      res.list
    }

    We now have all the functions we need to get and parse the API's output. Let's apply them. The logic is: first, we get all activities. This dataframe has a column called 'id', which we can use to get the raw data for all activities (called 'streams' in the Strava API). The function get.coord.df.from.stream() creates a dataframe with lat/lon coordinates for one stream.

    activities <- get.activities(token)
    stream.list <- get.multiple.streams(activities$id, token)

    We might want to get the boundaries of the combined set of all streams. We can use these boundaries as a bounding box for plotting the data. This means that all activities are going to be in the plotted map section.

    all.lats <- unlist(sapply(stream.list, USE.NAMES = F, FUN = function (x) {
      x$coords$lat
    }))
    all.lons <- unlist(sapply(stream.list, USE.NAMES = F, FUN = function (x) {
      x$coords$lon
    }))
    lats.range <- range(all.lats)
    lons.range <- range(all.lons)

    Alternatively, you can set your own bounding box. These are the boundaries for Stuttgart, Germany. One suggestion: to find your own boundaries, you can plot your first map and use the locator() function in RStudio; it is a very convenient way of getting coordinates by clicking.

    lons.range <- c(9.156572, 9.237580)
    lats.range <- c(48.74085, 48.82079)

    We start by plotting the tracks in red against a black background. You can play around with the alpha value and the lwd parameter to change the appearance. Because several tracks are plotted over each other, thicker lines represent routes I took more often.

    # Setting up the plot
    par(bg = "black")
    plot(x = lons.range, y = lats.range, type = "n", bty = "n", xlab = "", ylab = "", xaxt = "n", yaxt = "n")

    # Plotting tracks one by one
    for (el in stream.list) {
      lines(el$coords$lon, el$coords$lat,
            col = alpha("darkred", .4), lwd = 2)
    }

    A black-and-red plot of my runs through Stuttgart.

    Now, this already looks quite nice and we see some kind of network through the city. But the city itself is missing. We need to get some map below the tracks. I am using the OpenStreetMap package for this. I already used it in an earlier post. Note that getting the map tiles from the servers might take a long time and might fail in some cases. Loading time (and the resolution of the final map) will depend heavily on the 'zoom' parameter.

    map <- openmap(c(max(lats.range), min(lons.range)),
                   c(min(lats.range), max(lons.range)), type = "maptoolkit-topo", zoom = 14)
    transmap <- openproj(map, projection = "+proj=longlat")
    plot(transmap, raster = T)

    for (el in stream.list) {
      lines(el$coords$lon, el$coords$lat,
            col = alpha("darkred", .5), lwd = 3)
    }

    My runs through Stuttgart on an OpenStreetMap, click to enlarge


    I also like a simple satellite map. I am using the 'bing' map type for this. There is a high-def version available here.

    map <- openmap(c(max(lats.range), min(lons.range)),
                   c(min(lats.range), max(lons.range)), type = "bing", zoom = 15)
    transmap <- openproj(map, projection = "+proj=longlat")
    plot(transmap, raster = T)

    for (el in stream.list) {
      lines(el$coords$lon, el$coords$lat,
            col = alpha("yellow", 1/3), lwd = 3)
    }

    My runs through Stuttgart on a Bing satellite map, click to enlarge or click here for an HD version


    Now, for the final step. These static maps already look quite nice, and with sufficient resolution (as in the HD case with the satellite map), we can also zoom the map without losing too much quality. But a more dynamic map would also be nice. Let's use the wonderful leaflet package for this. I already did this in another post with only a single track wrapped in a Shiny app. I am using some pipe notation from the dplyr package to adapt the map and get the tracks onto it.

    map <- leaflet() %>%
      addProviderTiles("CartoDB.Positron",
                       options = providerTileOptions(noWrap = T)) %>%
      fitBounds(lng1 = min(lons.range), lat1 = max(lats.range), lng2 = max(lons.range), lat2 = min(lats.range))

    for (el in stream.list) {
      map <- addPolylines(map, lng = el$coords$lon, lat = el$coords$lat,
                          color = "red", opacity = 1/3, weight = 2)
    }

    With the function saveWidget(), we can save the resulting html map. You can move the map to Basel or Zürich to find some more tracks I ran there.
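
    Here is a minimal sketch of that saving step; saveWidget() lives in the htmlwidgets package (which leaflet builds on), and the file name is just a placeholder:

    library(htmlwidgets)

    # Write the leaflet map to a standalone HTML file
    saveWidget(map, file = "strava-map.html", selfcontained = TRUE)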

    With the leaflet functions, we could even associate each track with a little mouseover text (like total distance or the date). I did not include this here because quite a few tracks have been plotted over each other and mouseover texts might just confuse us here.
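
    For reference, a hedged sketch of what that could look like - addPolylines() takes a label argument for hover text; matching each track back to the 'activities' dataframe by id is an assumption about how you would wire it up:

    for (el in stream.list) {
      act.row <- activities[activities$id == el$act.id, ]
      map <- addPolylines(map, lng = el$coords$lon, lat = el$coords$lat,
                          color = "red", opacity = 1/3, weight = 2,
                          # Hover text with the activity's name and distance in km
                          label = paste0(act.row$name, ", ",
                                         round(as.numeric(act.row$distance) / 1000, 1), " km"))
    }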

    Have fun running and plotting. 








  7. I've been using the ggplot2 package a lot recently. When creating a legend or tick marks on the axes, ggplot2 uses the levels of a character or factor vector. Most of the time, I am working with coded variables that use some abbreviation of the "true" meaning (e.g., "f" for female and "m" for male, or single characters for a location: "S" for Stuttgart and "M" for Mannheim).

    In my plots, I don't want these codes but the full name of the level. Since I am not aware of any super-fast and easy-to-use function in base R (let me know in the comments if there is one), I came up with a very simple function and put it in my .Rprofile (that means it is available whenever I start R). I called it "replace.levels". The dot before the name means that it is invisible and does not show up in my Global Environment overview in RStudio. You have to call it with .replace.levels(<args>), of course.

    This is how it works:
    .replace.levels takes two arguments: vec and replace.list

    vec is the vector where substitutions shall be made.
    replace.list is a list with named elements. Occurrences of the element names in vec are replaced by the elements' contents.

    The function also checks that the values in vec and the names in replace.list match exactly. If not, it throws an error. It is a very simple function, but it saves me a lot of typing.

    Example:

    a <- c("B", "C", "F")
    .replace.levels(a, list("B" = "Braunschweig", "C" = "Chemnitz", "F" = "Frankfurt"))
    [1] "Braunschweig" "Chemnitz"     "Frankfurt"

    Here it is:

    .replace.levels <- function (vec, replace.list) {
      # Checking if all levels to be replaced are in vec (and the other way around)
      cur.levels <- unique(as.character(vec))
      not.in.cur.levels <- setdiff(cur.levels, names(replace.list))
      not.in.new.levels <- setdiff(names(replace.list), cur.levels)
      if (length(not.in.cur.levels) != 0 | length(not.in.new.levels) != 0) {
        stop("The following elements do not match: ",
             paste(c(not.in.cur.levels, not.in.new.levels), collapse = ", "))
      }
      # Replace each old level (the element's name) by its new value;
      # fixed = TRUE keeps the old levels from being read as regular expressions
      for (el in seq_along(replace.list)) {
        vec <- gsub(names(replace.list)[el], replace.list[[el]], vec, fixed = TRUE)
      }
      vec
    }

  8. It's been a while since I had the opportunity to post something on music. Let's get back to that.

    I got my hands on some song lyrics by a range of artists. (I have an R script to download all lyrics for a given artist from a lyrics website. Since these lyrics are protected by copyright law, I cannot share the download script here, but I can show some of the analyses I made with the lyrics.)

    My main question is: What can we learn about an artist, or several artists, when we have a corpus of their lyrics? I'm gonna analyze lyrics by the following artists:

    • Beck
    • Bob Dylan
    • Britney Spears
    • Leonard Cohen
    • Lou Reed
    • Metallica
    • Michael Jackson
    • Modest Mouse
    • Nick Cave and TBS
    • Nikka Costa
    • Nine Inch Nails
    • Nirvana
    • PJ Harvey
    • Prince
    • Radiohead
    • Rihanna
    • The Cure
    • The Doors
    • The National
    Let's start with an easy one. I wanna know which artist has the longest songs. The more words there are in the respective lyrics, the longer the song.


    Mean length of songs in words (click to enlarge).
    That's quite a surprise (at least for me). Rihanna and Britney Spears, certainly the most prototypical pop artists in the list, actually have pretty long lyrics. Another measure from linguistics is the type-token ratio, where the number of different words (types) is divided by the total number of words (tokens). This measure is often interpreted as "lexical diversity" because the vocabulary is more diverse if there are not just a few words that are repeated very often. Suppose you have a song that only consists of the words "oh yeah" repeated 10 times: you will have 2 types and 20 tokens, which leads to a type-token ratio of 2/20 = 0.1.
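
    A minimal sketch of both measures for a single song, assuming the lyrics come as one string (the whitespace tokenization is deliberately crude):

    lyrics <- paste(rep("oh yeah", 10), collapse = " ")

    # Crude tokenization: lowercase, then split on whitespace
    tokens <- unlist(strsplit(tolower(lyrics), "\\s+"))

    n.tokens <- length(tokens)          # song length in words: 20
    n.types <- length(unique(tokens))   # number of different words: 2
    n.types / n.tokens                  # type-token ratio: 0.1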
    Mean type-token ratio of songs (higher means more diverse vocabulary, click to enlarge).
    Well, look at that - Nikka Costa, one of my favorite funk/soul artists comes out on top in this list, followed by Beck and The Doors. Rihanna and Britney obviously have a lot of words in their songs, but with regard to lexical diversity, they rank last within the artists analysed here.

    Let's try something content-related. Obviously, it's quite hard to tackle the content (or even meaning) of songs. But we can do some really easy stuff. The first thing I want to try is what I call the "self-centered ratio". I simply define a list of keywords (or better: sequences of characters) that reference the first person: "i", "me", "i've", "i'm", "my", "mine", "myself". Now I calculate for each song how many of the words in the lyrics are in this list and divide this number by the number of words in the song. Suppose you have a song with these lyrics: "i'm my enemy and my enemy is mine" (I really don't know what that would mean, but that's just an example, right?). The "self-centered ratio" would be 4/8 = 0.5 because we have "i'm", "my", "my", and "mine", and 8 words altogether ("i'm" is counted as one word here because it is not separated by a space). Here is the result.
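
    The computation for the example above could look like this (a sketch, not the exact script I used):

    self.words <- c("i", "me", "i've", "i'm", "my", "mine", "myself")

    tokens <- unlist(strsplit(tolower("i'm my enemy and my enemy is mine"), "\\s+"))

    # Share of first-person words among all words in the song
    sum(tokens %in% self.words) / length(tokens)   # 4/8 = 0.5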
    Mean self-centered ratio of songs (click to enlarge)
    Britney and Nine Inch Nails are definitely not very similar in terms of their music (that's a wild guess, I only know very few songs by Britney Spears!), but they are quite similar when it comes to singing about stuff that concerns themselves.

    Next up is sentiment analysis. Professionally, I don't like it very much because, in my opinion, it has a lot of empirical and methodological problems. But why not give it a try for this application? We're not here for the hard science side of things, are we? So, what I did was basically the same as for the self-centered ratio, only with much bigger keyword lists for positive and negative words (so I actually did it twice: once with positive words and once with negative words). I got the word lists from here (for negative words) and from here (for positive words).
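
    A hedged sketch with tiny placeholder word lists (the real lists linked above contain thousands of entries), assuming the combined measure treats the negative ratio as a negative contribution:

    pos.words <- c("good", "love", "happy")   # placeholder positive list
    neg.words <- c("bad", "hate", "hurt")     # placeholder negative list

    tokens <- unlist(strsplit(tolower("love is good but it can hurt"), "\\s+"))

    pos.ratio <- sum(tokens %in% pos.words) / length(tokens)
    neg.ratio <- sum(tokens %in% neg.words) / length(tokens)

    # One value per song: positive ratio plus the (sign-flipped) negative ratio
    pos.ratio - neg.ratio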

    I show you two plots, one where you can see both ratios and one where I combined both ratios per song to get one value (positive value + negative value). These are the results:
    Mean ratio of positive and negative words of songs (click to enlarge). 

    Mean combined measure for sentiment of songs (click to enlarge).
    Actually, this seems to make sense. I'm no expert on Metallica, but for Nine Inch Nails, Nirvana, and Radiohead, this second plot looks plausible. Also, Prince, Michael Jackson, Nikka Costa, Rihanna, and Britney Spears getting an overall positive score works for me. Nick Cave is sometimes called the "Prince of Darkness". In this analysis, however, this is not really confirmed. Or the "dark" aspects of his lyrics are just hidden from this quite coarse approach. Just think of the song "People just ain't no good". Here, each occurrence of "good" is counted as positive because my simple word list approach is not sensitive to the negation in this line.

    One last thing: I wanted to know if artists can be clustered (grouped) just with the use of their lyrics. What we need is a measure of dissimilarity for each artist-artist combination. There are several ways to do that, and I experimented with a few (e.g., cosine distance or correlation of frequency vectors). It turns out there is an even easier measure: Let's take the 500 most frequent words one artist uses in their lyrics. We do the same for the other artist. Then we intersect these two word lists and divide the size of the intersection by 500. What we get is the ratio of words that are present in both top-500 vocabularies, which is essentially a similarity measure. If we compute 1 minus this value, we get a dissimilarity measure which we can use as input to a hierarchical cluster analysis. This is what we get.
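
    A sketch of that measure, assuming top500 is a named list holding each artist's 500 most frequent words (building that list is not shown here):

    artists <- names(top500)
    n <- length(artists)
    dis <- matrix(0, n, n, dimnames = list(artists, artists))

    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        # 1 minus the share of words present in both top-500 vocabularies
        dis[i, j] <- 1 - length(intersect(top500[[i]], top500[[j]])) / 500
      }
    }

    # Hierarchical cluster analysis on the dissimilarity matrix
    plot(hclust(as.dist(dis)))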
    Dendrogram for a hierarchical cluster analysis of overlapping top-500 words.
    Look at that, I think it works quite nicely: We get a "pop" cluster on the left with Nikka Costa, Britney Spears, Rihanna, Michael Jackson, and Prince. Feel free to interpret the other clusters in the comments.

    R CODE is available here!

    LOOK, there are frequency plots available here for all the artists!



  9. Lately, I got the chance to play around with Shiny and Leaflet a lot - and it is really fun! So I decided to catch up on an old post of mine and build a Shiny application where you can upload your own GPX files and plot them directly in the browser.

    Of course, you will need some GPX file to try it out. You can get an example file here (you're gonna need to save it as a .gpx file with a text editor, though). The Shiny application will always plot the first track saved in a GPX file, and only tracks are supported at the moment, no routes or waypoints. Basically, most sports tracker apps or devices produce such files.

    Here's the link to the GPX Shiny app: https://rcrastinate.shinyapps.io/GPXshiny/
    (Please note that as soon as my free 25 active hours limit of shinyapps.io is reached, the app won't work until the end of the current cycle - this happens quite a lot at the moment. Also, sometimes the loading of map tiles is quite slow - there's nothing I can do about that at my end.)

    You can choose the type of map that is used and the color of your track. Also, the application throws an error notification if you upload a GPX file that cannot be parsed, or a totally different file. For catching these errors, I'm using code I explained in another one of my posts; see the sketch below. The error notification window is implemented using the shinyBS package. It's pretty easy to use.
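
    A minimal sketch of that error handling, assuming the upload is parsed with the XML package (parse.gpx() is a hypothetical helper, not the app's actual code):

    library(XML)

    # Hypothetical parser: returns a dataframe of track points or throws an error
    parse.gpx <- function (path) {
      doc <- xmlTreeParse(path, useInternalNodes = TRUE)
      pts <- getNodeSet(doc, "//*[local-name()='trkpt']")
      if (length(pts) == 0) stop("No track points found")
      data.frame(lat = as.numeric(sapply(pts, xmlGetAttr, "lat")),
                 lon = as.numeric(sapply(pts, xmlGetAttr, "lon")))
    }

    # A failed parse yields NULL instead of crashing the app;
    # the app would then show the shinyBS notification window
    coords <- tryCatch(parse.gpx("track.gpx"), error = function (e) NULL)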

    As soon as you upload a GPX file that can be parsed by the application, an elevation diagram appears in the right panel.

    Here's a screenshot of the Shiny application.

    I really like how the input panel fades in and out when hovering over or away from it. To realise this, I used the styles.css file available here. Also, the input panel is draggable.

    Thanks again to all the nice people implementing R, Shiny, Leaflet and providing all the awesome map tiles - for free!
