Most Influential Users from Twitter Social Graph using Spark
Scenario
Among millions of users on Twitter, there are many usecases where it is important to identify the most influential users in the network. These users form small ‘hubs’ in the network and share the best connectivity with their peers. It was required to identify the most influential users (nodes) in the Twitter network. The appropriate algorithm to identify the influential users is the PageRank
algorithm that would run many iterations before narrowing down on the result.
What did I use?
- Scala
- Apache Spark
- SparkSQL
- AWS
Algorithm
Influential nodes are those nodes that are followed by the most number of other people/nodes. Additionally, if the followers of a node are influential themselves, then the followed node becomes even more influential. For example, if A has 10 followers and B has 5 followers, then A is supposed to be more influential than B. If C also has 10 followers, then A and C are equally influential. However, if the followers of C are more influential than the followers of A (ie, if the followers of C have more followers of their own than the followers of A), then C becomes more influential than A (simply because it is followed by more influential people).
The influence of each node is calculated using the PageRank
algorithm. The algorithm iteratively tries to find the score for each node. In an ideal case, the algorithm stops (converges) when the score does not change between subsequent iterations. The mathematical equation used to calculate the PageRank
is:
Each node starts with an intial score of 1/n
, where n is the number of nodes in the graph. For nodes which don’t have any outgoing vertices (sinks), their weight is equally divided among all the nodes in the graph (as if virtually drawing the edge from that node to all the nodes in the network).
Implementation
This algorithm requires repeated iterations over the data. Spark becomes extremely relevant technology to use as it enables in-memory, iterative processing of data. To implement, I used the EMR + Spark service provided by AWS. The cluster consisted of 5 r3.xlarge instances (requirements calculated based on size of the data).
The language most convenient to write a Spark program is Scala (little learning curve involved of-course). The program computed the PageRank value for each node in the Twitter social graph. The algorithm stopped after 10 iterations and outputted the following output for the entire graph:
[user_id]\t[PageRank_value]
Conclusion
It was a fairly challenging project as it required a good understanding of how the recursive nature of the PageRank algorithm worked, and how the scores of each node propagated to the next iteration. It also required learning a totally new language which had powerful ‘functional’ way of writing code. The code also needed to be fast and efficient enough to finish under 30 minutes. All these made this an immensely tricky, yet informative project.