In this series of posts (part 1, part 2), I have been showing how to use Python and other data science tools to analyze a collection of tweets related to the 2014 Bonnaroo Music and Arts Festival. So far, the investigation has been limited to summary statistics over the full dataset. The beauty of Twitter is that it happens in real time, so we can now peer into the fourth dimension and learn about these tweets as a function of time.
More Organic
Before we view the Bonnaroo tweets as a time series, I would like to make a quick comment about the organic-ness of the tweets. If you recall from the previous post, I removed duplicates and retweets from my collection in order to make the tweet database more indicative of true audience reactions. On further investigation, it seems that there were many spammy media sources still in the collection. To make the tweets even more organic, I decided to look at the source of the tweets.
Because Kanye West was the most popular artist from the previous posts’ analysis, I decided to look at the top 15 sources that mentioned him:
twitterfeed 1585
dlvr.it 749
Twitter for iPhone 366
IFTTT 256
Hootsuite 201
Twitter for Websites 188
Twitter Web Client 127
Facebook 120
Twitter for Android 119
WordPress.com 102
Tumblr 81
Instagram 73
iOS 42
TweetDeck 38
TweetAdder v4 37
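Counts like these come straight out of pandas with value_counts; here is a minimal sketch of how they could be produced (the Kanye filter below is a simple keyword match standing in for the alias matching from the previous post):
# Rough sketch: top 15 sources among tweets mentioning Kanye West
# (simple keyword match; the real analysis used the alias dictionary)
kanye = organics[organics['text'].str.contains('kanye', case=False, na=False)]
kanye['source'].value_counts().head(15)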
twitterfeed and dlvr.it are social media platforms for deploying mass tweets, and a look at some of these tweets reveals as much. So, I decided to create a list of "organic sources", consisting of personal Twitter clients and apps, and use it to cull the tweet collection.
# 'organics' starts out as the deduplicated, retweet-free collection from the previous post
organic_sources = ['Twitter for iPhone', 'Twitter Web Client',
                   'Facebook', 'Twitter for Android', 'Instagram']
organics = organics[organics['source'].isin(organic_sources)]
With this new dataset, I re-ran the band popularity histogram from the previous post, and I was surprised to see that Kanye got bumped down to third place! It looks like Kanye’s popular with the media, but Jack White and Elton John were more popular with the Bonnaroo audience.
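For reference, re-running that histogram on the organic subset amounts to something like the sketch below (assuming the alias dictionary from the previous post is already loaded as alias_dict, and that build_apply_fun flags each artist with True/False, as described later in this post):
import buildMentionHist as bmh

# Sketch: one True/False column per artist, summed into mention counts
mentions = organics['text'].apply(bmh.build_apply_fun(alias_dict))
band_counts = mentions.sum().sort_values(ascending=False)  # .order() in older pandas
band_counts.head()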
4th Dimensional Transition
Let's now look at the time dependence of the tweets. For this, we would like to use the created_at field as our index and tell pandas to treat its elements as datetime objects.
# Clean up the 'created_at' field (each entry is a {'$date': ...} dict)
import pandas as pd

organics['created_at'] = [tweetTime['$date'] for tweetTime in organics['created_at']]
organics['created_at'] = pd.to_datetime(pd.Series(organics['created_at']))

# Index on the timestamp (keeping the column, too) and convert UTC -> EST
organics = organics.set_index('created_at', drop=False)
organics.index = organics.index.tz_localize('UTC').tz_convert('EST')
To look at the number of tweets per hour, we have to resample our tweet collection.
ts_hist = organics['created_at'].resample('60t', how='count')
The majority of my time spent creating this blog post consisted of fighting with matplotlib, trying to get decent-looking plots. I thought it would be cool to try to make a "fill between" plot, which took way longer to figure out than it should have. The key is that fill_between takes three inputs: an array for the x-axis and two y-axis arrays between which the function fills color. If one just wants to plot a regular curve and fill down to the x-axis, one must create an array of zeros that is the same length as the curve. Also, I get pretty confused about which commands should be called on ax, plt, and fig. Anyway, the code and corresponding figure are below.
# Prettier pandas plot settings
# (not sure why 'default' is not the default...)
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.mpl_style = 'default'

x_date = ts_hist.index
zero_line = np.zeros(len(x_date))

# Fill between the zero line and the hourly tweet counts
fig, ax = plt.subplots()
ax.fill_between(x_date, zero_line, ts_hist.values, facecolor='blue', alpha=0.5)

# Format plot
plt.setp(ax.get_xticklabels(), fontsize=12, family='sans-serif')
plt.setp(ax.get_yticklabels(), fontsize=12, family='sans-serif')
plt.xlabel('Date', fontsize=30)
plt.ylabel('Counts', fontsize=30)
plt.show()
As you can see, tweet frequency was pretty consistent during each day of the festival and persisted until the early hours of each morning.
Band Popularity Time Series
We can now go back to questions from the previous post and look at how the top five bands' popularity changed with time. Using my program from the previous post, buildMentionHist, we can add a column for each band to our existing organics dataframe. Each row of a band's column contains a True or False value corresponding to whether or not the artist was mentioned in that tweet. We resample the columns like above, but this time in bins of 10 minutes.
import json
import buildMentionHist as bmh

path = 'bonnaroooAliasList.json'
alias_dict = [json.loads(line) for line in open(path)][0]

# Add one True/False column per artist indicating whether the tweet mentions them
bandPop = organics['text'].apply(bmh.build_apply_fun(alias_dict), alias_dict)
top_five = bandPop.index.tolist()[:5]  # Get top five artists' names
bandPop = pd.concat([organics, bandPop], axis=1)

# Resample each artist's mentions into 10-minute bins
top_five_ts = pd.DataFrame()
for band in top_five:
    top_five_ts[band] = bandPop[bandPop[band] == True]['text'].resample('10min', how='count')
We now have a dataframe called top_five_ts that contains the time series information for the top five most popular bands at Bonnaroo. All we have to do now is plot these time series. I again wanted to make some fill-between plots, but with a different color for each band. I used the prettyplotlib library to help with this because it has nicer-looking default colors. I plot both the full time series and a "zoomed-in" time series that is closer to when the artists' popularity peaked on Twitter. I ran into a lot of trouble trying to get the dates and times formatted correctly on the x-axis of the zoomed-in plot, so I have included that code below. There is probably a better way to do it, but at least this finally worked.
import pytz
from datetime import datetime

import prettyplotlib as ppl
from prettyplotlib import brewer2mpl
from matplotlib import dates

# One filled curve per artist
for band in top_five_ts:
    ppl.fill_between(top_five_ts.index.tolist(), 0., top_five_ts[band])

ax = plt.gca()
fig = plt.gcf()
set2 = brewer2mpl.get_map('Set2', 'qualitative', 8).mpl_colors

# Note: have to make legend by hand for fill_between plots.
# BEGIN making legend
legendProxies = []
for color in set2:
    legendProxies.append(plt.Rectangle((0, 0), 1, 1, fc=color))
leg = plt.legend(legendProxies, top_five, loc=2)
leg.draw_frame(False)
# END making legend

# BEGIN formatting xaxis
datemin = datetime(2014, 6, 13, 12, 0, 0)
datemax = datetime(2014, 6, 16, 12, 0, 0)
est = pytz.timezone('EST')
plt.axis([est.localize(datemin), est.localize(datemax), 0, 80])
fmt = dates.DateFormatter('%m/%d %H:%M', tz=est)
ax.xaxis.set_major_formatter(fmt)
ax.xaxis.set_tick_params(direction='out')
# END formatting xaxis

plt.xlabel('Date', fontsize=30)
plt.ylabel('Counts', fontsize=30)
Here is the full time series:
And here is the zoomed-in time series:
If we look at when each band went on stage, we can see that each band's popularity spiked while they were performing. This is good: it looks like we are measuring truly "organic" interest on Twitter!
| Band | Performance Time |
| --- | --- |
| Jack White | 6/14 10:45PM - 12:15AM |
| Elton John | 6/15 9:30PM - 11:30PM |
| Kanye West | 6/13 10:00PM - 12:00AM |
| Skrillex | 6/14 1:30AM - 3:30AM |
| Vampire Weekend | 6/13 7:30PM - 8:45PM |
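One way to eyeball that alignment (this is a quick sketch, not part of the original analysis) is to shade each set time onto the zoomed-in plot with matplotlib's axvspan, using the times from the table and the est timezone defined earlier:
from datetime import datetime

# Shade each performance window (times taken from the table above)
set_times = {
    'Vampire Weekend': (datetime(2014, 6, 13, 19, 30), datetime(2014, 6, 13, 20, 45)),
    'Kanye West':      (datetime(2014, 6, 13, 22, 0),  datetime(2014, 6, 14, 0, 0)),
    'Skrillex':        (datetime(2014, 6, 14, 1, 30),  datetime(2014, 6, 14, 3, 30)),
    'Jack White':      (datetime(2014, 6, 14, 22, 45), datetime(2014, 6, 15, 0, 15)),
    'Elton John':      (datetime(2014, 6, 15, 21, 30), datetime(2014, 6, 15, 23, 30)),
}
for band, (start, stop) in set_times.items():
    plt.axvspan(est.localize(start), est.localize(stop), color='gray', alpha=0.15)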
Delving into the Text
Up until now, I have not looked much at the actual text of the tweets, other than to find mentions of an artist. Using the nltk library, we can learn a little more about some general qualities of the text. The simplest thing to look at is the most frequently used words. To do this, I go through every tweet and break all of the words up into individual elements of a list. In the language of natural language processing, we are "tokenizing" the text. Common English stopwords are omitted, as well as any mentions of the artists or the artists' aliases. I use a regular expression tokenizer to grab only words from the sentences and ignore punctuation (except for apostrophes). I also take our alias_dict from the previous post and make sure that those words are not collected when tokenizing the tweets.
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re


def custom_tokenize(text, custom_words=None, clean_custom_words=False):
    """
    This routine takes an input "text" and strips punctuation
    (except apostrophes), converts each word to lowercase,
    removes standard English stopwords, removes a set of
    custom_words (optional), and returns a list of all of the
    leftover words.

    INPUTS:
    text = text string that one wants to tokenize
    custom_words = custom list or dictionary of words to omit
                   from the tokenization.
    clean_custom_words = Flag as True if you want to clean
                         these words.
                         Flag as False if mapping this function
                         to many keys. In that case,
                         pre-clean the words before running
                         this function.

    OUTPUTS:
    words = This is a list of the tokenized version of each word
            that was in "text"
    """
    tokenizer = RegexpTokenizer(r"[\w']+")
    stop_url = re.compile(r'http[^\s]+')
    stops = stopwords.words('english')

    if custom_words is None:
        custom_words = []
    if clean_custom_words:
        custom_words = tokenize_custom_words(custom_words)

    # Drop URLs, lowercase, strip punctuation, then remove stopwords
    # and any custom words (e.g. artist aliases)
    words = [w.lower() for w in text.split() if not re.match(stop_url, w)]
    words = tokenizer.tokenize(' '.join(words))
    words = [w for w in words if w not in stops and w not in custom_words]

    return words
def tokenize_custom_words(custom_words):
    tokenizer = RegexpTokenizer(r"[\w']+")
    custom_tokens = []
    stops = stopwords.words('english')

    if type(custom_words) is dict:  # Useful for alias_dict
        for k, v in custom_words.iteritems():
            k_tokens = [w.lower() for w in k.split() if w.lower() not in stops]
            # Remove all punctuation
            k_tokens = tokenizer.tokenize(' '.join(k_tokens))
            # Remove apostrophes
            k_tokens = [w.replace("'", "") for w in k_tokens]
            # Below takes care of nested lists, then tokenizes
            v_tokens = [word for listwords in v for word in listwords]
            v_tokens = tokenizer.tokenize(' '.join(v_tokens))
            # Remove apostrophes
            v_tokens = [w.replace("'", "") for w in v_tokens]
            custom_tokens.extend(k_tokens)
            custom_tokens.extend(v_tokens)
    elif type(custom_words) is list:
        # Flatten the per-word token lists before stripping apostrophes
        custom_tokens = [w for words in custom_words
                         for w in tokenizer.tokenize(words)]
        custom_tokens = [w.replace("'", "") for w in custom_tokens]

    custom_tokens = set(custom_tokens)
    return custom_tokens
Using the above code, I can apply the custom_tokenize function to each row of my organics dataframe. Before doing this, though, I make sure to run the tokenize_custom_words function on the alias dictionary. Otherwise, I would end up cleaning the aliases for every row in the dataframe, which is a waste of time.
import custom_tokenize as tk

clean_aliases = tk.tokenize_custom_words(alias_dict)
token_df = organics['text'].apply(tk.custom_tokenize,
                                  custom_words=clean_aliases,
                                  clean_custom_words=False)
Lastly, I collect all of the tokens into one giant list and use the FreqDist nltk function to get the word frequency distribution.
import nltk

# Need to flatten all tokens into one big list:
big_tokens = [y for x in token_df.values for y in x]
distr = nltk.FreqDist(big_tokens)
distr.pop('bonnaroo')  # Obviously the highest-frequency word, so drop it
distr.plot(25)
A couple of things caught my eye. The first is that people like to talk about themselves (see the popularity of "i'm"). Also, it was pretty popular to misspell "Bonnaroo" (see the popularity of "bonaroo"). I wanted to see if there was any correlation between misspellings and, say, people being intoxicated at night, but the time series behavior of the misspellings is similar in shape (though not magnitude) to the full tweet time series plotted earlier in the post.
misspell = token_df.apply(lambda x: 'bonaroo' in x)
misspell = misspell[misspell].resample('60t', how='count')
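To eyeball that "similar in shape" claim, one could overlay the misspelling counts on the hourly totals from earlier; this is a quick sketch, not from the original post:
# Sketch: overlay the misspelling counts on the full hourly tweet counts
fig, ax = plt.subplots()
ax.plot(ts_hist.index, ts_hist.values, label='all tweets')
ax.plot(misspell.index, misspell.values, label="'bonaroo' misspellings")
ax.legend(loc='upper left')
plt.show()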
The other thing that interested me was that the word "best" was one of the top 25 most frequent words. Assuming that "best" correlates with happiness, we can see that people got happier and happier as the festival progressed:
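The "best" series is computed the same way as the misspellings above; roughly:
# Sketch: hourly counts of tweets containing the word 'best'
best = token_df.apply(lambda x: 'best' in x)
best = best[best].resample('60t', how='count')
best.plot()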
This is, of course, a fairly simplistic measure of text sentiment. In my next post, I would like to quantify more robust measures of Bonnaroo audience sentiment.
By the way, the code used in this whole series on Bonnaroo is available on my GitHub.