I’ve spent the past year thinking about rap lyrics. Matt Daniels’ The Largest Vocabulary in Hip Hop remains the pinnacle of pop culture meeting computer science and was the inspiration behind my development of Cypher, a program that calculates the clarity of rap music.
The deep analysis I planned to carry out using Cypher stalled when I realized how difficult it was to find the vocals of rap songs. I talked to one of my friends who is a DJ and he pointed me toward websites that professionals use. Still, a cappellas of anything but the most popular songs aren’t out there even if you’re willing to pay for them. The alternative, coding an audio processing algorithm to strip lyrics from instrumentals, would have been nearly impossible.
In discussing the Cypher project with friends and some people who just stumbled across my articles, it became clear that Matt Daniels really got people thinking about all the possibilities behind analyzing rap lyrics. Today, I want to take the leap and explain how you can start gaining insights into lyrics for yourself. I want to help turn the “wouldn’t it be cool if…” into an approachable, fun side project. In this post, I’ll leave out most of the technical details and focus on what you have to think about in building your rap lyric data set.
What information do you want?
Most projects will hinge upon lyrics, so you’ll want those. Beyond that, song title, artist name, and album are just a few of the bits of information to consider. It’s important to know this before you start collecting your data because it will determine where you can get it from and how you can store it. For example, if you just want to see how many times the word “Percocet” appears in a set of three thousand rap songs, you might not care about the song name or artist. However, in building your data set, you’ll want to store as much information, as cleanly, as possible, since you’ll presumably use it for more than one project. Otherwise, you risk having to go back through the same entries to gather the few extra fields you left out the first time around.
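To make the idea of deciding on your fields up front concrete, here is a minimal Python sketch of what one entry in the data set might look like. The field names (including the extra `year` and `source_url`) are my own assumptions for illustration, not a required layout.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Song:
    """One entry in the data set. Every field name here is illustrative."""
    title: str
    artist: str
    lyrics: str
    album: Optional[str] = None       # not every track belongs to an album
    year: Optional[int] = None        # hypothetical extra field
    source_url: Optional[str] = None  # hypothetical: where the lyrics came from


# Placeholder values, not real data.
song = Song(title="Example Song", artist="Example Artist",
            lyrics="first line\nsecond line")

# A plain dict is easy to dump to JSON or insert into a database later.
record = asdict(song)
```

Fields you don’t have yet simply stay empty, which is cheaper than revisiting every entry later to add a column you skipped.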
Where will you get your data?
Rap Genius is going to provide an abundance of lyrics and accompanying data with great accuracy. Note that it is against Rap Genius’ terms of service to scrape the website in order to build a data set of your own. This will have to be strictly for personal use and I don’t advocate for breaking their ToS. Older lyric sites, like AZLyrics and Metrolyrics, are still usable, but have more inconsistencies. You might also want to supplement this core information with other data. Where is the rapper from? How old is he? What acts is he associated with? To form a more comprehensive data set, you might also look to scrape Wikipedia or even parse through social media channels.
How will you get the data?
When you find a suitable data source, you’ll have to access the data programmatically. This can be done by using a web scraper. At its most basic, a web scraper issues an HTTP GET request for the page (which is what the wget command-line tool does) and returns the entire page as HTML. From there, you can parse out the lyrics and other metadata as they fit into your model.
From your command line:
wget http://www.metrolyrics.com/forever-lyrics-drake.html
Alternatively, some websites like Rap Genius offer an API to more easily access the data. However, what the API returns still leaves something to be desired, namely the lyrics themselves. So, even though the API is a great help for metadata, you’ll probably need to use a web scraper to get the lyrics.
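Here is a rough sketch of the fetch-then-parse idea using only Python’s standard library. Treat it as an illustration under assumptions: the `verse` class name, the sample HTML, and the User-Agent string are all made up, and a real lyrics page will have its own markup that you’ll need to inspect first. Check the site’s terms of service before pointing `fetch_html` at it.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text (the same job wget does)."""
    # The User-Agent value is a made-up placeholder.
    request = Request(url, headers={"User-Agent": "lyrics-side-project/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class LyricsExtractor(HTMLParser):
    """Collect the text found inside elements carrying a given CSS class."""

    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, target_class: str = "verse"):
        super().__init__()
        self.target_class = target_class  # assumed class name, not a real site's
        self.depth = 0                    # > 0 while inside a target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            if tag == "br" and self.depth:
                self.chunks.append("\n")  # keep line breaks within a verse
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.chunks.append("\n")  # separate one verse from the next

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_lyrics(html: str, target_class: str = "verse") -> str:
    parser = LyricsExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Invented sample page so the sketch runs without touching the network.
SAMPLE_HTML = (
    '<html><body><h1>Title</h1>'
    '<p class="verse">Line one<br>Line two</p>'
    '<div id="ad">Buy stuff</div>'
    '<p class="verse">Line three</p>'
    '</body></html>'
)
lyrics = extract_lyrics(SAMPLE_HTML)
```

The depth counter assumes reasonably well-formed markup; for messy real-world pages a dedicated HTML parsing library is usually the more forgiving choice.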
How will you store the data?
This step might not be intuitive to those without programming experience, but it’s (pretty much) necessary to store the data offline so that you can arrange it, normalize it, and operate upon it more efficiently. You don’t want to have to make web requests every step of the way through development of your program. Once you make the data your own (this is where you’d enter a legal red zone for commercial products), you are free to manipulate it how you want, and a certain level of fine-tuning will be required when you want to start analyzing songs, whether an LP’s worth, the Wu-Tang Clan’s entire discography, or a random sampling of ten thousand songs. This isn’t what you’d call “Big Data,” but the fun part is that the technologies you can use to accomplish this task range from simple to highly performant.
Once you have a better idea of your ambitions for your data set, you can think about how to store the data and in what format. Will a SQL database get the job done? Might you create XML files? What about just storing the lyrics in plaintext with a simple directory structure? I advocate for using some database, but the sub-optimal approaches are fine and will be friendlier unless you’re looking to do an awful lot of processing. The advantage of using a database is that it will require you to enforce some constraints, making the data much more structured and easier to work with.
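As one possible illustration of a database enforcing constraints, here is a small sketch using SQLite through Python’s built-in sqlite3 module. The table and column names are assumptions chosen for the example; the point is only that NOT NULL and UNIQUE rules catch missing fields and duplicate songs at insert time.

```python
import sqlite3

# Hypothetical schema: one row per artist, one row per song.
SCHEMA = """
CREATE TABLE IF NOT EXISTS artists (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS songs (
    id        INTEGER PRIMARY KEY,
    artist_id INTEGER NOT NULL REFERENCES artists(id),
    title     TEXT NOT NULL,
    album     TEXT,
    lyrics    TEXT NOT NULL,
    UNIQUE (artist_id, title)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def add_song(conn, artist, title, lyrics, album=None) -> None:
    """Insert a song, creating the artist row if it is not there yet."""
    with conn:  # commits on success, rolls back if a constraint fails
        conn.execute("INSERT OR IGNORE INTO artists (name) VALUES (?)", (artist,))
        (artist_id,) = conn.execute(
            "SELECT id FROM artists WHERE name = ?", (artist,)
        ).fetchone()
        conn.execute(
            "INSERT INTO songs (artist_id, title, album, lyrics) VALUES (?, ?, ?, ?)",
            (artist_id, title, album, lyrics),
        )


conn = open_db()  # in-memory for the example; pass a file path to persist
add_song(conn, "Example Artist", "Example Song", "some lyrics")
```

Adding the same artist and title a second time raises `sqlite3.IntegrityError`, which is exactly the kind of guardrail a folder of text files won’t give you.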
How will you analyze the data?
If you’ve carefully thought about building your data set, this part will be easy. Once you determine what you are searching for in your data, it’s off to the races. You will pull in each song’s lyrics, or lyrics aggregated however you decided to arrange them, and then gradually calculate your results. From there, you can manually look through your results and further sharpen your data set and analysis tools. For example, you might notice that the words and, &, n’, and an’ are getting treated as distinct words but you want your analyzer to treat them as the same thing.
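The “and” problem above can be handled with a small normalization step before counting. This is a minimal sketch, not a full tokenizer: the alias table and the regular expression are my own assumptions, and you would grow the table as you spot more variants in your results.

```python
import re
from collections import Counter

# Hypothetical alias table: map spelling variants onto one canonical token.
ALIASES = {"&": "and", "n'": "and", "an'": "and"}

# A token is either "&" or a run of letters/digits that may contain inner
# apostrophes (don't) and may end in one (runnin', an').
TOKEN_RE = re.compile(r"&|[a-z0-9]+(?:'[a-z0-9]+)*'?")


def tokenize(lyrics: str) -> list:
    """Lowercase the text, straighten curly apostrophes, and apply aliases."""
    text = lyrics.lower().replace("\u2019", "'")
    return [ALIASES.get(token, token) for token in TOKEN_RE.findall(text)]


def word_counts(lyrics: str) -> Counter:
    return Counter(tokenize(lyrics))


counts = word_counts("Rock n' roll & an' and Percocet percocet")
print(counts["and"])       # 4: n', &, an' and and all collapse together
print(counts["percocet"])  # 2: counting is case-insensitive
```

One known rough edge: a word followed by a closing single quote keeps the apostrophe, so quoted phrases may need extra cleanup depending on your source.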
Once you’ve built your rap lyric data set, you’ll be able to embark on exciting projects. With your results in hand, you can create beautiful visualizations like those in The Largest Vocabulary in Hip Hop, use them to bolster knowledge and critiques, use them for a machine learning project, or just explore a realm of rap that was previously inaccessible to you. For my next post, I’ll write about how I’m building my data set: what design decisions I’ve made and what technologies I’ve decided to use.