By Paul Campbell | January 25, 2018
There has been a surge in a lot of great twitter analytics recently in the #rstats world, in part due to Michael Kearney’s excellent
rtweet package. It makes wrangling with the twitter API like water off a duck’s back (twitter, bird, duck? sorry). You almost feel like some sort of Zuckerberg-esque data tycoon when you realise the amount of data you can access with a few lines of code.
If you’re interested in doing some twitter scraping + visualising with R and want a top-notch resource, check out Bob Rudis’ work in progress 21 Recipes for Mining Twitter Data with rtweet. Just like everything else Bob does, it’s fantastic, and will get you from setting up your API token to making gorgeous network graphs in no time.
Today I won’t be making a network graph, but something a bit different. I came across the
UpSetR package a while ago and have been meaning to test it out for some time. It is a method for visualising intersecting sets. Sets are typically visualised with Venn diagrams, but these can become fairly illegible when n sets starts creeping > 3. They also tend to consist of arbitrary assumptions with no underlying data behind the presumed intersections.
A better approach
So let’s combine some twitter scraping with data driven intersecting set visualisation in the form of twitter follower overlap…
UpSetR requires the data in a binary matrix format for it to compute the intersects, so we need to devise a method of creating one. To do so for a set of twitter handles we need to do the following:
- Scrape the
user_idof every follower of each handle
- Create a de-duplicated master list of all followers
- For each handle, create a binary list indicating whether or not each follower in the master list follows that handle
- Combine the binary lists into one data frame
Scraping the data we need
I’m going to use a list of notable #rstats tweeters and
get_followers function inside a
purrr::map_df call to do this. Bear in mind that this can take some time depending on the number of followers you’ll be requesting. Twitter rate limits cap the number of search results returned to 18,000 every 15 minutes. So if the
map function will be requesting more than this, make sure you set
retryonratelimit = TRUE and it will resume scraping when a fresh 15 minute window starts. Also, set
n = ... to a number > than the max follower count for your selection of handles if you want to ensure you capture all followers.
library(rtweet) library(tidyverse) library(UpSetR) # get a list of twitter handles you want to compare (I left out Hadley because everyone follows him) rstaters <- c("dataandme", "JennyBryan", "hrbrmstr", "xieyihui", "drob", "juliasilge", "thomasp85") # scrape the user_id of all followers for each handle in the list and bind into 1 dataframe followers <- map_df(rstaters, ~ get_followers(.x, n = 20000, retryonratelimit = TRUE) %>% mutate(account = .x))
We should now have a long 2 column dataframe with
user_id (followers) and the hande of who they are following in another. Let’s take a peek.
## # A tibble: 6 x 2 ## user_id account ## <chr> <chr> ## 1 1591792928 dataandme ## 2 1626739328 dataandme ## 3 174927976 dataandme ## 4 15343013 dataandme ## 5 273868864 dataandme ## 6 875747413763424256 dataandme
## # A tibble: 6 x 2 ## user_id account ## <chr> <chr> ## 1 92966425 thomasp85 ## 2 3295717490 thomasp85 ## 3 219555432 thomasp85 ## 4 113125081 thomasp85 ## 5 2282250918 thomasp85 ## 6 2336317420 thomasp85
Wrangling the data
Now we form our binary matrix with another
map_df function and
ifelse statement, this time binding columns rather than rows.
# get a de-duplicated list of all followers aRdent_followers <- unique(followers$user_id) # for each follower, get a binary indicator of whether they follow each tweeter or not and bind to one dataframe binaries <- rstaters %>% map_dfc(~ ifelse(aRdent_followers %in% filter(followers, account == .x)$user_id, 1, 0) %>% as.data.frame) # UpSetR doesn't like tibbles # set column names names(binaries) <- rstaters # have a look at the data head(binaries)
## dataandme JennyBryan hrbrmstr xieyihui drob juliasilge thomasp85 ## 1 1 0 0 0 1 1 0 ## 2 1 0 0 0 0 0 0 ## 3 1 0 0 0 0 0 0 ## 4 1 0 0 0 0 0 0 ## 5 1 0 0 0 0 0 0 ## 6 1 0 0 0 0 0 0
We can work with it…
Plot with UpSetR
Now the hard work is done we can plot our intersecting sets!
# plot the sets with UpSetR upset(binaries, nsets = 7, main.bar.color = "SteelBlue", sets.bar.color = "DarkCyan", sets.x.label = "Follower Count", text.scale = c(rep(1.4, 5), 1), order.by = "freq")
This plot may require a bit of explanation (it took me a while figure what was going on). Basically, the bars on the bottom left show total follower count for each twitter handle (count of 1s in their binary list). The bars at the top show the count of the intersections denoted in the dot matrix below them.
So for columns in the matrix with only 1 dot, the bar above it shows the count of unique (no intersection) followers that twitter has. This would be the outer area of a Venn diagram that has not intersected with anything else. Columns with 2 or more dots show the count of followers the dotted twitter handles share (intersecting sections of a Venn diagram).
I think it’s an elegant way to visualise this informaiton. What can we learn from it?
It may be interesting to note that Mara Averick (dataandme), despite having the highest follower count, does not feature in the top 3 intersections, perhaps showing that her following is more diverse and reaches out behind the core #rstats niche.
Similarly, Bob Rudis (hrbrmstr) looks to be the least ‘intersected’ tweeter of this group, with highest proportion of unique followers.
Conversely, Thomas Lin Pedersen’s (thomasp85) following is more likely to intersect with the other tweeters; we can see that his number of unique followers is only a fraction higher than the number that also follow all 6 of the others.
What do you think? Love it? Hate it? Prefer a 7-set Venn? Let us know.
Thanks to the team behind UpSet, you can learn more about the project here.
And as always, get in touch if you have any data projects you think we could help with!