Mad Men is one of my favorite TV shows. Each episode generates a lot of internet
analysis, and my go-to blog for Mad Men is
Tom+Lorenzo, aka “TLo”. They have insightful
posts on each episode and, better yet, a great crew of readers who leave
hundreds of comments.
Mad Men has an ensemble cast where a character will figure prominently in one
episode and not even show up in the next. One possible way to show how
characters come and go is to count how often each character’s name shows up in
the comments. So I quickly scraped TLo’s blog, and created the above chart. See
bottom half of this post for the code and methodology.
I think the chart above tells a bit of a story about this season so far. Betty
has shown up in only one episode, the second, which birthed the internet meme
“Fat Betty.”
Her name peaked in the comments for that episode; otherwise she has had a
steady number of background mentions.
Episode 4 is when Pete had his fight, and sure enough his mentions spiked, but
otherwise he has been fairly quiet in the comments.
Don dominates the comments as he dominates the show as he dominates everyone
around him. Roger is a steady presence, Joan is a major minor character, and
Peggy had her high moment of the season so far in the last episode when she and
her Catholic mother got into it.
For some reason TLo shut off the comments for previous seasons, but hopefully I
can get that data or grab similar data from somewhere else. It would be fun to
see this kind of graph for all seasons to see how characters come and go.
Scrape, Parse, Visualize
I first create a list of the episode URLs and save it in a file named
episodes like this:
Then I can grab the web pages with a simple wget command:
wget -i episodes
I save a file called characters which contains the major character names that
we will count in each episode recap and related comments.
This shell script will then loop through each HTML file from the scrape step,
tokenize each file, and then count the occurrences of each character name.
Lastly, it will save the output in a format that will be easy to read into R or
any other tool that can deal with flat files.
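As a rough illustration of the loop, tokenize, and count steps described here, the following is a minimal sketch and not necessarily the script used for the chart. It assumes the characters file holds one lowercase name per line with no blank lines; the function name count_names and the output file counts.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count_names CHARACTERS_FILE HTML_FILE...
# Prints "count name file" for every name in CHARACTERS_FILE found in each page.
count_names() {
    names=$1
    shift
    for page in "$@"; do
        # Lowercase the page, split it into one alphabetic token per line,
        # keep only tokens that exactly match a character name, then tally.
        tr '[:upper:]' '[:lower:]' < "$page" |
            tr -cs '[:alpha:]' '\n' |
            grep -Fx -f "$names" |
            sort | uniq -c | sort -rn |
            awk -v page="$page" '{ print $1, $2, page }'
    done
}
```

Under these assumptions, running count_names characters *.html > counts.txt in the scrape directory would produce one line per character per episode. Because the tokenizer splits on anything that is not a letter, "Don's" counts as a mention of "don", and the page's markup and recap text are counted along with the comments.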
The shell script above outputs a file in this format, where the first column is
the count, the second is the character name, and the third is the episode’s
HTML file name:
309 don mad-men-a-little-kiss.html
220 megan mad-men-a-little-kiss.html
158 joan mad-men-a-little-kiss.html
113 peggy mad-men-a-little-kiss.html
107 roger mad-men-a-little-kiss.html
81 pete mad-men-a-little-kiss.html
69 betty mad-men-a-little-kiss.html
15 sally mad-men-a-little-kiss.html
210 betty mad-men-tea-leaves.html
191 don mad-men-tea-leaves.html
77 peggy mad-men-tea-leaves.html
73 megan mad-men-tea-leaves.html
50 sally mad-men-tea-leaves.html
45 roger mad-men-tea-leaves.html
17 pete mad-men-tea-leaves.html
Now that file can be loaded into R. We do a bit of data wrangling and wrestling,
and we get the picture at the top of this blog post. I’ve included the R code
below, which is a quick and dirty hack to get the data visualized.
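For readers who want to check the reshaping step without R, here is a minimal awk sketch that pivots the flat file into one row per episode and one column per character. It is a stand-in for illustration only, not the R code referred to above; the function name pivot_counts is hypothetical, and it assumes the three-column "count name file" layout shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: pivot_counts COUNTS_FILE
# Turns "count name file" lines into a tab-separated table:
# one row per episode file, one column per character, 0 where a name is absent.
pivot_counts() {
    awk '
        {
            count[$3, $2] = $1
            # Remember episodes and names in order of first appearance.
            if (!($3 in seen_ep))   { seen_ep[$3] = 1;   eps[++n_ep] = $3 }
            if (!($2 in seen_name)) { seen_name[$2] = 1; names[++n_name] = $2 }
        }
        END {
            printf "episode"
            for (j = 1; j <= n_name; j++) printf "\t%s", names[j]
            printf "\n"
            for (i = 1; i <= n_ep; i++) {
                printf "%s", eps[i]
                for (j = 1; j <= n_name; j++) printf "\t%d", count[eps[i], names[j]]
                printf "\n"
            }
        }
    ' "$1"
}
```

The wide table makes it easy to eyeball how each character rises and falls from one episode to the next, which is the same comparison the chart draws.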