I got my hands on some song lyrics by a range of artists. (I have an R script to download all lyrics for a given artist from a lyrics website. Since these lyrics are protected by copyright law, I cannot share the download script here, but I can show some of the analyses I made with the lyrics.)
My main question is: What can we learn about an artist, or several artists, when we have a corpus of their lyrics? I'm going to analyze lyrics by the following artists:
- Beck
- Bob Dylan
- Britney Spears
- Leonard Cohen
- Lou Reed
- Metallica
- Michael Jackson
- Modest Mouse
- Nick Cave and TBS
- Nikka Costa
- Nine Inch Nails
- Nirvana
- PJ Harvey
- Prince
- Radiohead
- Rihanna
- The Cure
- The Doors
- The National
Figure: Mean length of songs in words.
Figure: Mean type-token ratio of songs (higher means more diverse vocabulary).
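The two measures plotted above are simple to compute. Here is a minimal Python sketch (not the R code used for this post), assuming that a "word" is anything separated by whitespace after lowercasing. The function names and the toy lyrics are made up for illustration.

```python
def tokenize(lyrics):
    """Lowercase the lyrics and split on whitespace (an assumed tokenization)."""
    return lyrics.lower().split()


def song_length(lyrics):
    """Number of word tokens in one song."""
    return len(tokenize(lyrics))


def type_token_ratio(lyrics):
    """Distinct words (types) divided by all words (tokens)."""
    tokens = tokenize(lyrics)
    if not tokens:
        return 0.0  # avoid dividing by zero for empty lyrics
    return len(set(tokens)) / len(tokens)


def mean(values):
    """Arithmetic mean, used to average a measure over an artist's songs."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy lyrics, invented for this example only.
songs = ["la la la la", "one two three two"]
mean_length = mean(song_length(s) for s in songs)
mean_ttr = mean(type_token_ratio(s) for s in songs)
print(mean_length, mean_ttr)
```

Note that the type-token ratio tends to drop as texts get longer, so comparing songs of very different lengths with it should be done with some care.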
Let's try something content-related. Obviously, it's quite hard to tackle the content (or even the meaning) of songs. But we can do some really easy stuff. The first thing I want to try is what I'll call the "self-centered ratio". I simply define a list of keywords (or better: sequences of characters) that reference the first person: "i", "me", "i've", "i'm", "my", "mine", "myself". For each song, I count how many of the words in the lyrics are in this list and divide this count by the total number of words in the song. Suppose you have a song with these lyrics: "i'm my enemy and my enemy is mine" (I really don't know what that would mean, but that's just an example, right?). The "self-centered ratio" would be 4/8 = 0.5 because we have "i'm", "my", "my" and "mine", and 8 words altogether ("i'm" is counted as one word here because it is not separated by a space). Here is the result.
Figure: Mean self-centered ratio of songs.
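For illustration, the self-centered ratio described above could look like this in Python. This is a sketch, not the R code behind the plot. The keyword list is the one given in the text; the whitespace tokenization and the function name are assumptions.

```python
# First-person keywords, exactly as listed in the post.
FIRST_PERSON = {"i", "me", "i've", "i'm", "my", "mine", "myself"}


def self_centered_ratio(lyrics):
    """Share of words in the lyrics that are first-person keywords."""
    words = lyrics.lower().split()  # assumption: words are separated by whitespace
    if not words:
        return 0.0  # avoid dividing by zero for empty lyrics
    hits = sum(1 for word in words if word in FIRST_PERSON)
    return hits / len(words)


ratio = self_centered_ratio("i'm my enemy and my enemy is mine")
print(ratio)  # 0.5, i.e. 4 hits out of 8 words
```

Punctuation attached to a word (for example "me,") would prevent a match with this simple split, so real lyrics would probably need some cleaning first.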
Next up is sentiment analysis. Professionally, I don't like it very much because, in my opinion, it has a lot of empirical and methodological problems. But why not give it a try for this application? We're not here for the hard science side of things, are we? So, what I did was basically the same as for the self-centered ratio, only with much bigger keyword lists for positive and negative words (so, actually, I did it twice: once with positive words and once with negative words). I got the word lists from here (for negative words) and from here (for positive words).
I'll show you two plots: one where you can see both ratios, and one where I combined both ratios per song to get a single value (positive value + negative value). These are the results:
Figure: Mean ratio of positive and negative words of songs.
Figure: Mean combined measure for sentiment of songs.
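The keyword-ratio approach to sentiment can be sketched in Python as follows. Two things here are assumptions and not taken from the analysis above: the tiny word lists are hypothetical stand-ins for the much larger lists that were actually used, and the negative ratio is given a negative sign so that "positive value + negative value" yields a net score. The post does not spell out the sign convention.

```python
# Hypothetical stand-ins for the real (much larger) word lists.
POSITIVE = {"love", "happy", "good"}
NEGATIVE = {"hate", "sad", "bad"}


def keyword_ratio(lyrics, keywords):
    """Share of words in the lyrics that appear in the keyword set."""
    words = lyrics.lower().split()  # assumption: whitespace tokenization
    if not words:
        return 0.0
    return sum(1 for word in words if word in keywords) / len(words)


def sentiment(lyrics):
    """Return (positive value, negative value, combined value) for one song."""
    positive = keyword_ratio(lyrics, POSITIVE)
    # Assumption: the negative ratio enters as a negative number,
    # so adding both values gives a net sentiment score.
    negative = -keyword_ratio(lyrics, NEGATIVE)
    return positive, negative, positive + negative


positive, negative, combined = sentiment("good love bad day")
print(positive, negative, combined)
```

Averaging these per-song values over all songs of an artist would give one number per artist, comparable to the plots above.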
One last thing: I wanted to know if artists can be clustered (grouped) using only their lyrics. What we need is a measure of dissimilarity for each artist-artist combination. There are several ways to do that, and I experimented with a few (e.g. cosine distance or correlation of frequency vectors). It turns out there is an even easier measure: take the 500 most frequent words each artist uses in their lyrics. For any pair of artists, we intersect their two top-500 word lists and divide the number of shared words by 500. What we get is the ratio of words that are present in both top-500 vocabularies, which is essentially a similarity measure. If we take 1 minus this value, we get a dissimilarity measure which we can use as input to a hierarchical cluster analysis. This is what we get.
Figure: Dendrogram for a hierarchical cluster analysis of overlapping top-500 words.
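The overlap-based dissimilarity described above can be sketched in Python like this. It is an illustration under assumptions, not the R code behind the dendrogram: the function names, the whitespace tokenization and the toy corpus are made up, and the toy example uses a top-4 list instead of top-500 so that the numbers are easy to check by hand.

```python
from collections import Counter
from itertools import combinations


def top_words(songs, n=500):
    """Set of the n most frequent words across all songs of one artist."""
    counts = Counter()
    for lyrics in songs:
        counts.update(lyrics.lower().split())  # assumption: whitespace tokenization
    return {word for word, _ in counts.most_common(n)}


def overlap_dissimilarity(top_a, top_b, n=500):
    """1 minus the share of words present in both top-n vocabularies."""
    # As in the post, the overlap is divided by n, even if an artist
    # has fewer than n distinct words.
    return 1 - len(top_a & top_b) / n


def dissimilarity_table(corpus, n=500):
    """Dissimilarity for every artist pair; corpus maps artist -> list of lyrics."""
    tops = {artist: top_words(songs, n) for artist, songs in corpus.items()}
    return {
        (a, b): overlap_dissimilarity(tops[a], tops[b], n)
        for a, b in combinations(sorted(tops), 2)
    }


# Toy corpus with invented "lyrics", using n=4 instead of 500.
toy_corpus = {
    "artist_a": ["sun moon", "rain wind"],
    "artist_b": ["rain wind fire ice"],
    "artist_c": ["sun moon rain wind"],
}
table = dissimilarity_table(toy_corpus, n=4)
print(table)
```

The resulting pairwise values can then be handed to any hierarchical clustering routine, for example `hclust` in R or `scipy.cluster.hierarchy.linkage` in Python, after converting them to the distance format that routine expects.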
R CODE is available here!
LOOK, there are frequency plots available here for all the artists!
What a fun example! I teach a workshop on text analysis, and song lyrics might make a perfect example. I've been looking for a corpus that is likely to yield a result from latent semantic analysis that is easy to interpret (i.e. one that contains only a few topics that anyone will understand). Can I get your data set, or code that downloads a set automatically?
Hi Bob, thanks for your comment. Please check your inbox...
Do you still have the code saved to build this? I'd love to test this out with lyrics from a few of my favorite artists!
Hello unknown person, I've updated the links at the bottom of the post. Please check if they work for you. If you want to scrape lyrics yourself, please send me an e-mail at swolf2007@gmail.com. Maybe the old script still works...