Automated Data Reports with R

By Paul Campbell | October 22, 2018

A lot of data analysts will find themselves doing repetitive manual tasks on a data set every day/week/month in Excel, then copying and pasting their updated pivot tables and charts into Word or PowerPoint reports for their stakeholders. If this sounds like your job description, you may want to consider switching to a programming language like R.

Writing scripts will allow you to automate the majority of these processes; from importing your data all the way through to emailing your boss the final report. They’ll never know you were actually in the pub the whole time.


Automation

Automation may sound like a scary word to any human being with a job they would like to keep, but learning to automate some of your most common data tasks can be seriously beneficial to both your organisation and your own job security! Some of the benefits of an automated reporting workflow over a manual one include:

  1. It saves you time. Most people feel like they don’t have the time they need to fulfill all that is asked of them at work. So if you can cut out the time spent on manual data processing and focus more on analysis and insight, that can only be a good thing.
  2. It reduces errors. When your reporting relies on manual data entry and formulas with hard-coded cell references, one typo or out-of-place number can lead to results that are way off the mark. Automating the process with a script removes the scope for this kind of manual error.
  3. It expands your data visualisation options. Using open-source software like R allows you to draw on a vast array of tools and charting libraries not available in proprietary software. For example, HTML reports built with the rmarkdown package can include interactive charts, maps and tables that utilise the latest web technologies - more on this below.

Worked Example

In this post we’re going to run through what might be a typical workflow of building a reproducible data report in R - that is, one that you can simply hit ‘Run’ on whenever a new wave of data comes in, and all your charts and tables update accordingly, giving you the latest insights into your data at the touch of a button.

I’m going to use the classic gapminder dataset that contains population, life expectancy, and GDP per capita metrics for many countries over time. There is a gapminder R package that would conveniently give us the data in a format best suited for data analysis - tidy data - but we want to make this run-through as realistic as possible.

As such, I’ve taken this tidy data set, removed one of the variables that we’ll have to add ourselves later, then split it up by year across multiple sheets in a single Excel file. Sacrilege, I know, but from experience this is a very common way that data with multiple time periods is stored and distributed, so it will be useful to know how to bring it back together again with R code. If you’re interested in the code required to commit such a heinous crime of ‘untidying’ a data set into Excel sheets, you can check out the make gapminder messy again R script over on GitHub.

(Screenshot: the messy Excel file, with one sheet of data per year)

To see the full file in all its glory click here.

As you can see from the Excel file, each tab contains data for the year denoted in the tab name. Here we have 5-year intervals between data points, but in a more realistic scenario these would be daily, weekly or monthly updates, for which this process would work equally well. Similarly, if the data were separated by a different type of variable, such as survey results for different demographic target groups, the process below would allow you to combine the data and visualise the differences between the demographics.

So let’s begin with the code! If you’re brand new to R don’t worry if you don’t know the ins and outs of the syntax. Hopefully you will see that we can get a lot done with not a lot of code!


Load the Libraries

First things first: load the libraries we’ll be using for our report and set a default ggplot2 chart theme.

library(tidyverse)
library(readxl)
library(hrbrthemes)
library(scales)
library(countrycode)
library(glue)
library(plotly)
library(crosstalk)
library(DT)
library(gganimate)
library(widgetframe)

theme_set(theme_ft_rc())

Data Importing & Tidying

This is the part that can be most satisfying to automate because the manual equivalent is often very tedious, time-consuming work. We’re going to program a process that will:

  • iterate over each sheet in the excel file
  • pull out only the data table section (starting on row 5)
  • add a year column populated by the name of the sheet
  • combine all the separate tables into one single data frame
  • re-order the columns to our desired specifications

The reproducibility of this code comes from the fact that it is agnostic to the number of sheets in our Excel file. That means whenever we get an updated file with a new tab of data, we just point the same code to this file and we’ll get a new dataset that includes this new data. Any future computations we draw from the data, such as the latest period-on-period changes, will automatically be derived from the latest data we have received.

path <- "../../data/blog_data/gapminder_messy.xlsx"

combined_data <- 
  excel_sheets(path) %>%                   # sheet names, one per year
  map_df(~ {
    # read each sheet, skipping the 4 rows above the data table,
    # and add a year column taken from the sheet name
    read_excel(path, sheet = .x, skip = 4, trim_ws = TRUE) %>% 
      mutate(year = as.integer(.x))
  }) %>%                                   # map_df row-binds all sheets into one data frame
  select(country, year, everything())      # re-order the columns

Let’s quickly inspect our data to check that it got the job done…

# top 6 rows
head(combined_data)
## # A tibble: 6 x 5
##   country      year gdpPercap lifeExp      pop
##   <chr>       <int>     <dbl>   <dbl>    <dbl>
## 1 Afghanistan  2007       975    43.8 31889923
## 2 Albania      2007      5937    76.4  3600523
## 3 Algeria      2007      6223    72.3 33333216
## 4 Angola       2007      4797    42.7 12420476
## 5 Argentina    2007     12779    75.3 40301927
## 6 Australia    2007     34435    81.2 20434176
# bottom 6 rows
tail(combined_data)
## # A tibble: 6 x 5
##   country             year gdpPercap lifeExp      pop
##   <chr>              <int>     <dbl>   <dbl>    <dbl>
## 1 Venezuela           1952      7690    55.1  5439568
## 2 Vietnam             1952       605    40.4 26246839
## 3 West Bank and Gaza  1952      1516    43.2  1030585
## 4 Yemen, Rep.         1952       782    32.5  4963829
## 5 Zambia              1952      1147    42.0  2672000
## 6 Zimbabwe            1952       407    48.5  3080907

Looking good. However, part of our analysis is going to involve computing continental summary statistics, and we currently don’t have a continent column in our dataset. Adding one in Excel would require us to build a lookup table with the corresponding continent for each unique country in the dataset and then use a VLOOKUP function.

In R we can easily make this part of the automated workflow using the countrycode package to translate from one geographical coding scheme to another - in our case getting a continent name based on a country name, like so:

combined_data <- combined_data %>% 
  mutate(continent = countrycode(sourcevar = country,
                                 origin = "country.name",
                                 destination = "continent")) %>% 
  select(continent, everything())

head(combined_data)
## # A tibble: 6 x 6
##   continent country      year gdpPercap lifeExp      pop
##   <chr>     <chr>       <int>     <dbl>   <dbl>    <dbl>
## 1 Asia      Afghanistan  2007       975    43.8 31889923
## 2 Europe    Albania      2007      5937    76.4  3600523
## 3 Africa    Algeria      2007      6223    72.3 33333216
## 4 Africa    Angola       2007      4797    42.7 12420476
## 5 Americas  Argentina    2007     12779    75.3 40301927
## 6 Oceania   Australia    2007     34435    81.2 20434176

Parameterising Your Report

We now have a consolidated dataset with an added continent variable coded to each country. A common requirement in data reporting is to produce distinct reports from multiple subsets of a single data set. rmarkdown makes this easy: you supply a parameter at the top of your report script, which can then be used as a variable in the report code to alter the outputs.

Let’s say in our example we want to batch produce a report for each continent. We supply the name of the continent we want to build the report for in the params section of the YAML header at the top of the report script, like so:

---
title: My Report
output: html_document
params:
   continent: Europe
---

We can then use that parameter to filter our consolidated data set to only the countries in the matching continent, and use that filtered data for all subsequent analysis.

filtered_data <- combined_data %>% 
  filter(continent == params$continent)

Once we have our full report script ready to be executed, all we have to do to get a report from one continent to the next is change the value of the continent param at the top of the script.

But before we do that we need to code out the basis of the analysis we want contained in our reports. So here are a few examples of the type of data visualisations that would help a stakeholder understand the latest metrics and insights from the data.

For the sake of brevity I’m not including all the code for the charts and tables. But like the data importing and tidying code, once it is set up and working no amendments are required for each new wave of data we get to report on. The outputs will automatically include the latest metrics for the data we have just processed.

If you’re interested in seeing the full code for this report you can check it out over on GitHub.


Linked Interactive Graphics

In an HTML report, we can use linked interactive graphics to encourage users to engage with the data and explore the insights. Let’s build some charts to show the latest period percentage changes of all 3 metrics in the data set.

First we do the data wrangling to compute the changes…

p_change <- filtered_data %>% 
  group_by(continent, country) %>% 
  arrange(year) %>% 
  # period-on-period percentage change for each metric, computed within each country
  mutate(gdpPercap_change = (gdpPercap - lag(gdpPercap)) / lag(gdpPercap),
         lifeExp_change = (lifeExp - lag(lifeExp)) / lag(lifeExp),
         pop_change = (pop - lag(pop)) / lag(pop))

latest_figs <- p_change %>% 
  filter(year == max(year)) %>%   # keep only the most recent wave of data
  select(-year) %>% 
  arrange(country) %>% 
  ungroup()

Then we can visualise the results interactively with the linked functionality.

Use the search bar to highlight a particular country across all 3 charts or click directly on one of the bars. Double clicking will deselect the highlighted country.
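The interactive charts themselves don’t reproduce here, and the exact code is in the GitHub repo, but a minimal sketch of the approach looks something like this. The chart names and layout are my assumptions, not the report’s exact code: a crosstalk SharedData object keyed by country lets the plotly charts share their selection state, and highlight() adds the search box and click-to-select behaviour described above.

# a minimal sketch (assumed layout/names, not the exact report code)
shared_figs <- SharedData$new(latest_figs, key = ~country)

p1 <- plot_ly(shared_figs, x = ~country, y = ~gdpPercap_change, type = "bar", name = "GDP per capita")
p2 <- plot_ly(shared_figs, x = ~country, y = ~lifeExp_change, type = "bar", name = "Life expectancy")
p3 <- plot_ly(shared_figs, x = ~country, y = ~pop_change, type = "bar", name = "Population")

subplot(p1, p2, p3, nrows = 3, shareX = TRUE) %>% 
  # search bar + click to highlight a country across all 3 charts; double-click to deselect
  highlight(on = "plotly_click", off = "plotly_doubleclick", selectize = TRUE)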



Searchable Data Tables

We can also view the same set of data in an interactive data table with search functionality and column sorting, allowing users to quickly find the numbers they are looking for. Clicking on column headers will sort the data ascending or descending.
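Again, the full table code is on GitHub, but a minimal sketch of such a table with the DT package (column formatting choices are my assumptions) might look like this:

# a minimal sketch (assumed formatting): searchable, sortable table of the latest figures
latest_figs %>% 
  datatable(rownames = FALSE, filter = "top", options = list(pageLength = 10)) %>% 
  formatPercentage(c("gdpPercap_change", "lifeExp_change", "pop_change"), digits = 1) %>% 
  formatRound(c("gdpPercap", "lifeExp"), digits = 1)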


Static Charts with ggplot2

The sad truth is that most managers just really love Microsoft Office or PDF reports. Fear not though, because with ggplot2 we have the world’s best static charting library at our disposal, and getting these charts into an automated Word, PowerPoint, or PDF report with rmarkdown is a walk in the park.

The chart below combines all three metrics in our data into one chart - a la Hans Rosling (RIP) - for our latest wave of data. The positions of the country annotations are automated with the ggrepel package, which repels overlapping text labels away from each other.
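The actual chart code lives in the GitHub repo; a rough sketch of the idea (the scales, labels and ggrepel settings below are my assumptions) would be:

# a rough sketch (assumed scales/labels): Rosling-style bubble chart for the latest year
library(ggrepel)   # not loaded above, needed for geom_text_repel()

latest_year <- filtered_data %>% filter(year == max(year))

ggplot(latest_year, aes(gdpPercap, lifeExp, size = pop)) +
  geom_point(alpha = 0.7) +
  geom_text_repel(aes(label = country), size = 3) +   # repel overlapping country labels
  scale_x_log10(labels = dollar) +
  labs(title = glue("GDP per capita vs life expectancy, {max(latest_year$year)}"),
       x = "GDP per capita (log scale)", y = "Life expectancy at birth") +
  theme(legend.position = "none")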


Animated Charts

Although we have been using a data set filtered to our chosen continent parameter of Europe, we still have the full data set available to us. We can look at the continental summary statistics over time with the following data…

continent_summary <- 
  combined_data %>% 
  group_by(continent, year) %>% 
  # population-weighted continental averages, plus total population
  summarise(gdpPercap = weighted.mean(gdpPercap, pop),
            lifeExp = weighted.mean(lifeExp, pop),
            pop = sum(pop)) %>% 
  ungroup() %>% 
  gather(metric, value, 3:5)   # long format: one row per continent/year/metric

Then visually animate the progress over time…
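The animation code itself is in the repo; a minimal sketch with gganimate (the use of transition_reveal() and the faceting are my assumptions) is:

# a minimal sketch (assumed): reveal the continental trends year by year
continent_summary %>% 
  ggplot(aes(year, value, colour = continent)) +
  geom_line(size = 1) +
  facet_wrap(~ metric, scales = "free_y") +
  labs(title = "Continental trends (population-weighted)", x = NULL, y = NULL) +
  transition_reveal(year)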

All credit to Thomas Lin Pedersen’s outstanding new gganimate package for making charts like the above so intuitively simple to make.


Fin

To sum up, we now have an rmarkdown script that is capable of:

  1. combining data from Excel tabs into a single data frame
  2. filtering it by a parameter we supply
  3. computing new variables + summaries of the data
  4. visualising the results in interactive, static and animated charts

To be super efficient, we can batch produce all 5 reports (one for each continent) with a single function! The code below sets the list of continents we want to iterate over, wraps rmarkdown::render in a function that generates the HTML report from the script, then renders a report for each continent with purrr::walk (which applies the function to each item in the continents list).

continents <- c("Asia", "Europe", "Africa", "Americas", "Oceania")

renderMyReport <- function(continent) {
  rmarkdown::render("report.Rmd", 
                    output_file = paste0(continent,"_report_", Sys.Date(), ".html"),
                    params = list(continent = continent), 
                    output_options = list(self_contained = FALSE, lib_dir = "libs"))
}

purrr::walk(continents, renderMyReport)

You can see all the outputs of the above code using a cut-down version of this report script over here.

In a real-world scenario you’d most likely want to add some distinct narrative in each report, but letting R take care of all the data processing and visualisation for you is going to give you much more time to dig out the real insights and find a solution to all global population, wealth and health problems! 😅


Learning

Making the switch to doing your data work with R can be daunting, but with the combination of RStudio as your ‘integrated development environment’, the tidyverse packages for data analysis, and rmarkdown for producing reports in just about any format you’d like (HTML, PDF, Word, PowerPoint, Dashboard, even entire websites like this one!), there has never been a better time to make the leap and start your journey to better data reporting workflows. There’s also a really great and welcoming community of R users online who are always happy to help new users feel at home.

If you’d like some help getting started, we’re currently running a 1-day ‘Master Data with Code’ workshop where you’ll learn how to import, manipulate, join and transform data using the tidyverse in the RStudio IDE.

We also offer bespoke in-house training for teams, where we focus on your own specific data tasks and how you can use R for greater accuracy and efficiency in data processing, and for more engaging communication of your insights.

For more information get in touch via our Contact Page.

Thanks for reading!
