Visualising Intersecting Sets Of Twitter Followers

By Paul Campbell | January 25, 2018

Twitter Analytics

There has been a surge of great Twitter analytics in the #rstats world recently, in part due to Michael Kearney’s excellent rtweet package. It makes wrangling the Twitter API feel like water off a duck’s back (twitter, bird, duck? sorry). You almost feel like some sort of Zuckerberg-esque data tycoon when you realise the amount of data you can access with a few lines of code.

If you’re interested in doing some twitter scraping + visualising with R and want a top-notch resource, check out Bob Rudis’ work in progress 21 Recipes for Mining Twitter Data with rtweet. Just like everything else Bob does, it’s fantastic, and will get you from setting up your API token to making gorgeous network graphs in no time.


UpSetR

Today I won’t be making a network graph, but something a bit different. I came across the UpSetR package a while ago and have been meaning to test it out for some time. It implements a method for visualising intersecting sets. Sets are typically visualised with Venn diagrams, but these become fairly illegible once the number of sets creeps above 3. They also tend to be drawn from arbitrary assumptions, with no underlying data behind the presumed intersections.



A better approach

So let’s combine some Twitter scraping with data-driven intersecting set visualisation in the form of Twitter follower overlap…

UpSetR needs the data in a binary matrix format to compute the intersections, so we need to devise a method of creating one. To do so for a set of Twitter handles we need to do the following:

  1. Scrape the user_id of every follower of each handle
  2. Create a de-duplicated master list of all followers
  3. For each handle, create a binary list indicating whether or not each follower in the master list follows that handle
  4. Combine the binary lists into one data frame
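
To make those four steps concrete before touching the API, here is a minimal base R sketch on made-up data. The handles `alice` and `bob` and the IDs are hypothetical, and this is only an illustration of the idea, not the code used later in the post.

```r
# Toy follower lists for two hypothetical handles (made-up IDs, not real data).
# This stands in for step 1, the scraped user_id of every follower.
toy_followers <- data.frame(
  user_id = c("1", "2", "3", "2", "3", "4"),
  account = c("alice", "alice", "alice", "bob", "bob", "bob"),
  stringsAsFactors = FALSE
)

# Step 2: de-duplicated master list of every follower
all_ids <- unique(toy_followers$user_id)

# Steps 3 and 4: one 0/1 column per handle, combined into a data frame
handles <- unique(toy_followers$account)
toy_binaries <- as.data.frame(
  sapply(handles, function(h) {
    as.integer(all_ids %in% toy_followers$user_id[toy_followers$account == h])
  })
)

toy_binaries
```

Each row is one follower from the master list and each column is one handle, which is the shape UpSetR expects.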

Scraping the data we need

I’m going to use a list of notable #rstats tweeters and rtweet’s get_followers function inside a purrr::map_df call to do this. Bear in mind that this can take some time, depending on the number of followers you’ll be requesting. Twitter’s rate limits cap the number of follower IDs returned at 75,000 every 15 minutes (the 18,000 limit applies to tweet searches, not follower lookups). So if the map call will request more than this in total, make sure you set retryonratelimit = TRUE and it will resume scraping when a fresh 15 minute window starts. Also, set n = ... to a number greater than the highest follower count among your handles if you want to capture every follower.
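
If you want a rough idea of how long you’ll be waiting, a back-of-envelope helper like the one below can help. It is only a sketch: `estimate_scrape_minutes` is a hypothetical function of mine, and its defaults assume the followers endpoint pages 5,000 IDs per request with 15 requests per 15 minute window. Swap in whichever limits apply to your token.

```r
# Rough estimate of the waiting time added by rate limits.
# ids_per_request and requests_per_window are ASSUMED defaults;
# check the current API documentation before relying on them.
estimate_scrape_minutes <- function(follower_counts,
                                    ids_per_request = 5000,
                                    requests_per_window = 15,
                                    window_minutes = 15) {
  # each handle needs enough requests to page through all of its followers
  requests_needed <- sum(ceiling(follower_counts / ids_per_request))
  # the first window is available immediately; each extra window costs a wait
  windows_needed <- ceiling(requests_needed / requests_per_window)
  max(windows_needed - 1, 0) * window_minutes
}

# hypothetical follower counts for three handles
estimate_scrape_minutes(c(20000, 12000, 4000))

# seven handles capped at n = 20000 each
estimate_scrape_minutes(rep(20000, 7))
```

The estimate ignores network time and only counts the enforced waits between windows.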

library(rtweet)
library(tidyverse)
library(UpSetR)

# get a list of twitter handles you want to compare (I left out Hadley because everyone follows him)
rstaters <- c("dataandme", "JennyBryan", "hrbrmstr", "xieyihui", "drob", "juliasilge", "thomasp85")

# scrape the user_id of all followers for each handle in the list and bind into 1 dataframe
followers <- map_df(rstaters, ~ get_followers(.x, n = 20000, retryonratelimit = TRUE) %>% mutate(account = .x))

We should now have a long 2 column dataframe with user_id (the follower) in one column and the handle of the account they follow in the other. Let’s take a peek.

head(followers)
## # A tibble: 6 x 2
##   user_id            account  
##   <chr>              <chr>    
## 1 1591792928         dataandme
## 2 1626739328         dataandme
## 3 174927976          dataandme
## 4 15343013           dataandme
## 5 273868864          dataandme
## 6 875747413763424256 dataandme
tail(followers)
## # A tibble: 6 x 2
##   user_id    account  
##   <chr>      <chr>    
## 1 92966425   thomasp85
## 2 3295717490 thomasp85
## 3 219555432  thomasp85
## 4 113125081  thomasp85
## 5 2282250918 thomasp85
## 6 2336317420 thomasp85
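
Before moving on, it’s worth checking whether any handle came back with exactly as many rows as you asked for, since that hints the follower list was cut off at n. A minimal sketch on made-up data (the data frame and the value of `n_requested` are hypothetical stand-ins for `followers` and the n passed to get_followers):

```r
# Toy stand-in for the `followers` data frame (made-up IDs)
toy_followers <- data.frame(
  user_id = c("1", "2", "3", "2", "3"),
  account = c("alice", "alice", "alice", "bob", "bob"),
  stringsAsFactors = FALSE
)
n_requested <- 3  # the value passed as n; hypothetical here

# rows returned per handle
rows_per_account <- table(toy_followers$account)

# handles whose row count reached n may have been truncated
possibly_truncated <- names(rows_per_account)[as.vector(rows_per_account) >= n_requested]
possibly_truncated
```

Any handle listed here probably has more followers than were scraped, so its set size in the final plot would be an underestimate.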

Onwards…


Wrangling the data

Now we form our binary matrix with purrr’s map_dfc function and an ifelse call, this time binding columns rather than rows.

# get a de-duplicated list of all followers
aRdent_followers <- unique(followers$user_id)

# for each follower, get a binary indicator of whether they follow each tweeter or not and bind to one dataframe
binaries <- rstaters %>% 
  map_dfc(~ ifelse(aRdent_followers %in% filter(followers, account == .x)$user_id, 1, 0) %>% 
            as.data.frame) # UpSetR doesn't like tibbles

# set column names
names(binaries) <- rstaters

# have a look at the data
head(binaries)
##   dataandme JennyBryan hrbrmstr xieyihui drob juliasilge thomasp85
## 1         1          0        0        0    1          1         0
## 2         1          0        0        0    0          0         0
## 3         1          0        0        0    0          0         0
## 4         1          0        0        0    0          0         0
## 5         1          0        0        0    0          0         0
## 6         1          0        0        0    0          0         0
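
As an aside, the same 0/1 layout can also be reached with a base R cross-tabulation. This is an alternative sketch on made-up data rather than the approach used above, and the handles and IDs are hypothetical.

```r
# Toy stand-in for the long `followers` data frame (made-up IDs)
toy_followers <- data.frame(
  user_id = c("1", "2", "3", "2", "3", "4"),
  account = c("alice", "alice", "alice", "bob", "bob", "bob"),
  stringsAsFactors = FALSE
)

# cross-tabulate follower by handle; > 0 guards against duplicated rows
crosstab <- table(toy_followers$user_id, toy_followers$account) > 0

# convert the logical table to a plain 0/1 data frame
toy_binaries <- as.data.frame.matrix(crosstab + 0L)

# column sums should equal the de-duplicated follower count of each handle
colSums(toy_binaries)
```

The rows come out sorted by user_id rather than in scrape order, which makes no difference to the set counts.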

We can work with it…


Plot with UpSetR

Now that the hard work is done, we can plot our intersecting sets!

# plot the sets with UpSetR
upset(binaries, nsets = 7, main.bar.color = "SteelBlue", sets.bar.color = "DarkCyan", 
      sets.x.label = "Follower Count", text.scale = c(rep(1.4, 5), 1), order.by = "freq")

This plot may require a bit of explanation (it took me a while to figure out what was going on). Basically, the bars on the bottom left show the total follower count for each Twitter handle (the count of 1s in their binary list). The bars at the top show the count of the intersections denoted in the dot matrix below them.

So for columns in the matrix with only 1 dot, the bar above shows the count of unique (no intersection) followers that tweeter has. This would be the outer area of a Venn diagram that has not intersected with anything else. Columns with 2 or more dots show the count of followers the dotted Twitter handles share (the intersecting sections of a Venn diagram).

I think it’s an elegant way to visualise this information. What can we learn from it?
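
If the dot matrix still feels opaque, it can help to count the patterns by hand. The sketch below uses made-up data for three hypothetical handles and groups followers by the exact combination of handles they follow. As I understand UpSetR’s default behaviour, each top bar is one of these exclusive counts, so a follower only ever lands in one bar.

```r
# Toy binary matrix for three hypothetical handles (made-up data)
toy_binaries <- data.frame(
  alice = c(1, 1, 1, 0, 0, 1),
  bob   = c(0, 1, 1, 1, 0, 1),
  carol = c(0, 0, 1, 1, 1, 1)
)

# label each follower with the handles they follow, e.g. "alice&bob"
membership <- apply(toy_binaries == 1, 1, function(row) {
  paste(names(toy_binaries)[row], collapse = "&")
})

# size of each exclusive intersection, largest first
intersection_sizes <- sort(table(membership), decreasing = TRUE)
intersection_sizes
```

Here the two followers of all three handles are counted under "alice&bob&carol" only, not again under "alice&bob".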

Insight

It may be interesting to note that Mara Averick (dataandme), despite having the highest follower count, does not feature in the top 3 intersections, perhaps showing that her following is more diverse and reaches out beyond the core #rstats niche.

Similarly, Bob Rudis (hrbrmstr) looks to be the least ‘intersected’ tweeter of this group, with the highest proportion of unique followers.

Conversely, Thomas Lin Pedersen’s (thomasp85) following is more likely to intersect with the other tweeters; we can see that his number of unique followers is only a fraction higher than the number that also follow all 6 of the others.

Perhaps unsurprisingly, our biggest intersection is between StackOverflow’s TidyText dream team David Robinson (drob) & Julia Silge.
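
The ‘proportion of unique followers’ comparisons above are read off the plot by eye, but they can be computed straight from the binary matrix. A minimal sketch, again on made-up data for three hypothetical handles rather than the real `binaries` data frame:

```r
# Toy binary matrix for three hypothetical handles (made-up data)
toy_binaries <- data.frame(
  alice = c(1, 1, 1, 0, 0, 1),
  bob   = c(0, 1, 1, 1, 0, 1),
  carol = c(0, 0, 1, 1, 1, 1)
)

# followers who follow exactly one handle in the group
follows_one <- rowSums(toy_binaries) == 1

# share of each handle's followers who follow nobody else in the group
unique_share <- sapply(toy_binaries, function(col) {
  sum(col == 1 & follows_one) / sum(col == 1)
})
round(unique_share, 2)
```

A higher share means a less ‘intersected’ following within the chosen group of handles.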


What do you think? Love it? Hate it? Prefer a 7-set Venn? Let us know.

Thanks to the team behind UpSet, you can learn more about the project here.

And as always, get in touch if you have any data projects you think we could help with!

¡Hasta luego!
