Big Data Analytics on a Large Wikipedia Dataset in the Cloud (AWS)


Wikipedia is often a great reflector of current events. Pages that are accessed with increasing frequency often point to an event related to the topic of the page. Wikipedia serves as a fairly unbiased source of news reporting, and sometimes even reveals hidden patterns. This project involved processing and analysing one month (720 hours / 135 GB) of traffic logs from Wikipedia. The month chosen was November 2016 (ring a bell?).

What did I use?

  • Python
  • MapReduce
  • AWS (Amazon Web Services)
  • Python Data Analysis Library (pandas)
  • Jupyter Notebook

Data Pre-processing

The amount of data to process was huge. Before it could be analysed, it needed to be filtered and processed into a form that could later be fed to different analysis libraries (in my case, pandas). Some of the properties of interest were: only English pages, no malformed data, URL normalization and percent-encoding, only valid Wikipedia namespaces, article title restrictions, blacklisted file extensions, Wikipedia disambiguation pages, etc.

As you can see, there were a lot of cases that needed to be handled. To make these cases easier to incorporate, debug, and scale, I decided to break all the pre-processing components into granular functions, where each function had just one role: decide whether the input should be kept or filtered out. I could then ‘chain’ the functions in any order desired, enabling me to add or remove conditions and debug easily if something didn’t work as expected. Like so -

def filter_malformed_data(record):
    # Return True to keep the record, False to drop it.
    # some code
    return True

def filter_english_pages(record):
    # Return True if the record belongs to an English Wikipedia page.
    # some code
    return True

# Reference the functions (don't call them here); each one runs per record below.
filters = [filter_malformed_data, filter_english_pages]

# 'data' is a single record from the traffic log.
if all(f(data) for f in filters):
    ...  # proceed further
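
For instance, the blacklisted-extensions check could be one of these granular filters. The sketch below is illustrative rather than the project's actual code; it assumes the space-separated pagecounts layout of <project> <title> <count> <bytes> and uses a made-up blacklist, with the percent-encoded title decoded as part of the check -

from urllib.parse import unquote

# Hypothetical blacklist; the real project used its own list of extensions.
BLACKLISTED_EXTENSIONS = ('.jpg', '.png', '.gif', '.css', '.js')

def filter_blacklisted_extensions(record):
    # Assumes filter_malformed_data has already validated the field count.
    title = unquote(record.split(' ')[1]).lower()
    return not title.endswith(BLACKLISTED_EXTENSIONS)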

MapReduce Operation

Once the data was filtered, the next big step was to write the mappers and reducers that would produce the daily view counts for the month of November 2016. Since it was an academic project, I cannot discuss the MapReduce program in detail. The only topics of interest were the ones with more than 100,000 hits in total.
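
While I can't share the actual program, a generic Hadoop Streaming pair for this kind of per-day aggregation could look roughly like the sketch below. It assumes each filtered record has already been reduced to topic<TAB>day-of-month<TAB>view-count, which is not necessarily how the real job was structured -

# mapper.py - passes along well-formed topic/day/count records unchanged.
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) == 3:
        print('\t'.join(fields))

# reducer.py - sums daily counts per topic and keeps totals above 100,000.
import sys

THRESHOLD = 100000
DAYS = 30

def emit(topic, daily):
    total = sum(daily)
    if total > THRESHOLD:
        print('%d\t%s\t%s' % (total, topic, '\t'.join(str(c) for c in daily)))

current, daily = None, [0] * DAYS
for line in sys.stdin:
    topic, day, count = line.strip().split('\t')
    if topic != current:
        if current is not None:
            emit(current, daily)
        current, daily = topic, [0] * DAYS
    daily[int(day) - 1] += int(count)

if current is not None:
    emit(current, daily)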

To accomplish this task, I used the EMR (Elastic MapReduce) service provided by AWS. I had 3 nodes in the cluster - 1 master and 2 core (slave) nodes. The machines were of type m4.large. The entire job took around 2 hours to run (135 GB of data, remember?). The final output of the MapReduce job was a file a few megabytes in size, in the following format -

<Total Count> <Topic>   <Day 1 View Count>  <Day 2 View Count>  .....  <Day 30 View Count>

Example -

9016683 United_States_presidential_election_2016   108018  126420  127054  135039  134971  220274  375259  723502  2332337 811996  536615  354246  372025  305104  279582  239844  206588  153615  125084  124844  140998  134532  153956  146080  110354  124210  115150  120080  100642  78264
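
As an aside, a comparable 3-node Streaming job can be launched on EMR programmatically. The boto3 sketch below mirrors that cluster shape; the bucket names, S3 paths and release label are placeholders rather than the project's actual setup -

import boto3

emr = boto3.client('emr', region_name='us-east-1')

# 1 master + 2 core m4.large nodes running a single Hadoop Streaming step.
response = emr.run_job_flow(
    Name='wikipedia-pageviews-nov2016',
    ReleaseLabel='emr-5.2.0',
    Applications=[{'Name': 'Hadoop'}],
    Instances={
        'MasterInstanceType': 'm4.large',
        'SlaveInstanceType': 'm4.large',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': False,
    },
    Steps=[{
        'Name': 'daily-view-counts',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', 's3://my-bucket/mapper.py,s3://my-bucket/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', 's3://my-bucket/filtered-logs/',
                '-output', 's3://my-bucket/daily-view-counts/',
            ],
        },
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])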

Analysis

Once the data was ready, I used pandas to do some basic analyses. The most obvious question to ask was, “Which was the most searched topic of November 2016?” And the answer, unsurprisingly, was Donald Trump! There were almost 18 million hits on that topic. As expected, United States Presidential Election 2016, Hillary Clinton, Melania Trump and Barack Obama were among the most searched topics too!
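
For illustration, loading the MapReduce output into pandas and ranking topics by total hits could look roughly like this (the file name is a placeholder; the output itself is whitespace-separated, as shown above) -

import pandas as pd

# Column layout matches the MapReduce output: total, topic, then 30 daily counts.
columns = ['total', 'topic'] + ['day_%02d' % d for d in range(1, 31)]

df = pd.read_csv('daily_view_counts.txt', sep=r'\s+', header=None, names=columns)

# The most viewed topics of November 2016, by total hits for the month.
print(df.nlargest(5, 'total')[['topic', 'total']])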