Looking through the tweets: Mother’s Day 2017

On 14th May 2017, people all over the world were expressing their love for their mothers on social networks, be it by posting pictures or sharing touching stories about how their mothers have shaped and impacted their lives. To understand the content created this Mother’s Day, I developed a Twitter streaming and mining engine, collecting data from Twitter’s Streaming API and delving into it.

I wrote some Python code to collect and mine the Twitter stream data, and created a MySQL database to store the data streamed from the Twitter Streaming API. Here’s a look at how I went about it.

1. Get Twitter API keys

Twitter offers a number of streaming endpoints. I chose to connect to a public stream, as it gives you access to public data flowing through Twitter and is well suited to this data mining use case.

Create a Twitter application and generate your consumer key/secret and access token/secret pairs. The consumer key is your API key, and the access tokens allow your application to make authorized calls to the Twitter Streaming API.
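For illustration, here is roughly what the authorization step looks like with tweepy (which I use again in step 3); the credential strings below are placeholders, not real keys:

import tweepy

# Placeholder credentials from your Twitter application's "Keys and Access Tokens" page
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

# The consumer key pair identifies the application; the access token pair authorizes it
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)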

2. Create a data store to collect Twitter data

To collect data relating to Twitter users and tweets, I created two tables, twitter_users and tweets, in a new MySQL database, twitter_data. The SQL to create the data store is as follows:

create database twitter_data;
use twitter_data;

create table twitter_users (
    user_id varchar(255) NOT NULL,
    name varchar(255),
    location varchar(1000),
    description varchar(1000),
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    statuses_count bigint,
    created_at varchar(100),
    time_zone varchar(100),
    lang varchar(100),
    primary key (user_id)
);

create table tweets (
    tweet_id varchar(255),
    created_at varchar(100),
    text varchar(1000),
    tweet_by_user_id varchar(255),
    in_reply_to_user_id varchar(255),
    geo varchar(255),
    coordinates varchar(100),
    place varchar(255),
    retweet_count bigint,
    favorite_count bigint,
    hashtags varchar(255),
    primary key (tweet_id)
);
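As a sketch of how a streamed tweet could be written into these tables (this uses PyMySQL purely for illustration; the connection details are placeholders and save_tweet is a hypothetical helper, not my actual ingestion code):

import pymysql

# Placeholder connection details; point these at your own MySQL instance
conn = pymysql.connect(host="localhost", user="user", password="password",
                       db="twitter_data", charset="utf8mb4")

def save_tweet(tweet):
    """Insert one tweet (a dict parsed from the Streaming API JSON) into the tweets table."""
    sql = ("insert ignore into tweets "
           "(tweet_id, created_at, text, tweet_by_user_id, retweet_count, favorite_count, hashtags) "
           "values (%s, %s, %s, %s, %s, %s, %s)")
    # Store hashtags as a comma-separated string so they fit the varchar column
    hashtags = ",".join(h["text"] for h in tweet["entities"]["hashtags"])
    with conn.cursor() as cur:
        cur.execute(sql, (tweet["id_str"], tweet["created_at"], tweet["text"],
                          tweet["user"]["id_str"], tweet["retweet_count"],
                          tweet["favorite_count"], hashtags))
    conn.commit()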

3. Connect to Streaming API and download data

I used the Python package tweepy to stream Twitter data from the Streaming API. It handles the authentication and connection life cycle, reads incoming messages and streams the data to your data store.

You can find my streaming setup and data ingestion code here.
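For reference, a minimal stream listener in the same spirit might look like the sketch below (tweepy 3.x API; auth is the OAuth handler from step 1 and save_tweet is the hypothetical insert helper from step 2):

import json
import tweepy

class MothersDayListener(tweepy.StreamListener):
    """Writes each incoming tweet to the MySQL tables created in step 2."""

    def on_data(self, raw_data):
        tweet = json.loads(raw_data)
        if "text" in tweet:            # skip limit notices and other non-tweet messages
            save_tweet(tweet)          # hypothetical helper from the step 2 sketch
        return True                    # keep the connection open

    def on_error(self, status_code):
        return status_code != 420      # disconnect if Twitter starts rate limiting us

stream = tweepy.Stream(auth=auth, listener=MothersDayListener())
stream.filter(track=["#happymothersday", "#loveyoumom", "#mothersday",
                     "#mother", "#momsarethebest", "#mothersdaygifts"])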

4. Summarize collected data

I started collecting data with my Twitter stream collection app on Friday evening and stopped the collection late Sunday night (the end of Mother’s Day).

I collected 1,710,046 tweets from 1,028,142 Twitter users over this period. I tracked the following hashtags in my stream listener:

#happymothersday, #loveyoumom, #mothersday, #mother, #momsarethebest and #mothersdaygifts

You can find my analysis code here. Here are the summary statistics I computed on the collected data.
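As a rough sketch of the kind of queries behind these summaries (pandas and SQLAlchemy are assumptions here, not necessarily what my linked analysis code uses; the connection string is a placeholder, and the hashtags column is assumed to hold comma-separated tags as in the step 2 sketch):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; point it at the twitter_data database
engine = create_engine("mysql+pymysql://user:password@localhost/twitter_data")

users = pd.read_sql("select location from twitter_users", engine)
tweets = pd.read_sql("select hashtags from tweets", engine)

# a/b: top 10 locations among users who filled in the free-text location field
tagged = users[users["location"].notna() & (users["location"] != "")]
top_locations = tagged["location"].value_counts().head(10)

# c: popularity of each tracked hashtag (one comma-separated string per tweet)
hashtag_counts = (tweets["hashtags"].str.lower()
                  .str.split(",").explode().value_counts())

print(top_locations)
print(hashtag_counts)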

a. Number of tweets by top 10 locations:

1,663,457 tweets did not have a location tagged. Among the tagged tweets, the top 10 locations are shown in the accompanying chart.

b. Number of users (who tweeted on tracked hashtags) by top 10 locations

293,600 of the users who tweeted one of the tracked hashtags did not have a location (a free-text profile field) specified. The top 10 locations among the rest are shown in the accompanying chart.

c. Popularity ranks of hashtags tracked

The most popular of the hashtags I tracked were #mother, #mothersday and #happymothersday. The other three hashtags I tracked, #loveyoumom (count: 3,838), #mothersdaygifts (count: 1,295) and #momsarethebest (count: 443), were used far less and therefore barely register on the chart, given the large relative difference in counts.

There are lots of interesting insights to be found in this dataset. The summary stats presented above only demonstrate setting up a streaming listener, ingesting data in an analysis-friendly format and computing some high-level summary statistics on the collected data. I am planning on mining emojis from this data next.

I spun up an m3.medium EC2 instance on AWS to run the data collection listener and used a db.m3.large MySQL RDS instance to store the collected data.

You can find my Python codebase for this project here.