Volume 34 Number 10
Exploring R and the Tidyverse Suite
By Frank La Vigne | October 2019
In my last article (msdn.com/magazine/mt833459), I explored the fundamentals of the R programming language, which is widely used in the data science space. I pointed out that all the code written for that article was in “base R.” While base R is capable of loading, exploring and visualizing data, it’s not the only way to perform data analysis in R. At the end of the article, I briefly mentioned the tidyverse (tidyverse.org), a collection of packages for R that align to common design principles and are designed to work together seamlessly. Package developers who would like to add to the tidyverse must adhere to the tidyverse style guide (style.tidyverse.org). This enables a consistent experience for developers and ease of interoperability between packages.
The tidyverse libraries are open source and available on GitHub (github.com/tidyverse). The core tidyverse modules include packages needed for everyday data analyses and exploration. As of tidyverse 1.2.0, the following packages are included in the core distribution: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats. Dozens of other useful packages are also included in the tidyverse, but aren’t loaded automatically with library(tidyverse). See tidyverse.org/packages for details. This article will explore the basics of how to load, filter and visualize data the “tidyverse way.”
Recall that last month’s column used post count data from my blog as a sample dataset. This dataset is simple with enough variation to demonstrate the power and ease of tidyverse packages. It also helps to use the same sample dataset to facilitate comparisons between base R and tidyverse R.
Loading Data with readr
The readr package provides a fast and easy way to read rectangular data files, such as .csv files. It can flexibly parse many types of data files, while handling errors robustly. To get started, create a new R language Jupyter Notebook. For details on Jupyter Notebooks, refer to my February 2018 article on the topic at msdn.com/magazine/mt829269. In the first blank cell, enter the following code to load the .csv file data and display it:
library(readr)
fwposts <- read_csv("franksworldposts.csv")
fwposts
Note that above the tabular output with the contents of the .csv file is text that highlights how each column was parsed, and that the output is a tibble with 183 rows and four columns. Base R uses data frames to store tabular data. In the tidyverse, a tibble is the equivalent structure. In fact, tibbles are data frames, but they modify some default data frame behaviors to meet the needs of modern data analytics. More information about tibbles can be found at tibble.tidyverse.org and in-depth documentation resides at r4ds.had.co.nz/tibbles.html.
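To make the relationship between tibbles and data frames concrete, here is a minimal sketch. The two-row data set is hypothetical and stands in for the blog data. It shows one of the changed defaults: selecting a single column with `[` returns a plain vector from a data frame, but a one-column tibble from a tibble.

```r
library(tibble)

# Hypothetical two-row stand-in for the blog data.
df <- data.frame(Month = c("Jan-19", "Feb-19"), Posts = c(62L, 42L))
tb <- as_tibble(df)

# A tibble is still a data frame.
is.data.frame(tb)

# Single-column subsetting behaves differently.
df_col <- df[, "Posts"]   # data frame: the column drops to a vector
tb_col <- tb[, "Posts"]   # tibble: the result stays a tibble

class(df_col)
class(tb_col)
```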
Look at the message that read_csv printed above the table:
Parsed with column specification:
cols(
  Month = col_character(),
  Posts = col_integer(),
  `Days in Month` = col_integer(),
  PPD = col_double()
)
While the read_csv function did properly load and parse the data, it didn’t automatically detect that the Month column was a date field and labeled it a character field instead. I would like the schema to record this column as a date. To do this, I need to pass along a col_types parameter to the read_csv function that explicitly defines the column schema. As readr correctly guessed all the column data types except one, I can use the existing schema as a guide.
To see the current schema or specification of the tibble, pass it to readr’s spec function by entering spec(fwposts) into a blank cell and executing it.
Enter the following code, which takes the original inferred schema and adjusts how the Month column is parsed. The “%b-%y” format string matches the format of the column with the three-letter month abbreviation and a two-digit year separated by a dash, like so:
fwposts <- read_csv("franksworldposts.csv",
  col_types = cols(
    Month = col_date(format = "%b-%y"),
    Posts = col_integer(),
    `Days in Month` = col_integer(),
    PPD = col_double()
  )
)
fwposts
Note that the Month column is now properly marked as a date field in the schema.
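For readers without the original file, the same technique can be tried on a small stand-in. This is a sketch under stated assumptions: the three rows and the temporary file are hypothetical, and only the column names and the “%b-%y” format come from the article.

```r
library(readr)

# Hypothetical three-row sample in the same layout as the article's file.
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Month,Posts,Days in Month,PPD",
  "Jan-19,62,31,2.0",
  "Feb-19,42,28,1.5",
  "Mar-19,31,31,1.0"
), sample_path)

# Explicit schema: Month is parsed as a date instead of a character column.
sample_posts <- read_csv(sample_path,
  col_types = cols(
    Month = col_date(format = "%b-%y"),
    Posts = col_integer(),
    `Days in Month` = col_integer(),
    PPD = col_double()
  )
)

# The stored specification now lists Month as a date column.
spec(sample_posts)

# Month is a Date vector; a month-year value carries no day of its own.
class(sample_posts$Month)
sample_posts$Month
```

Because the format string contains no day component, each value should come back as a date within the given month and year.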
Filter and Manipulate Data with dplyr
The dplyr package provides a consistent set of functions for nearly all data manipulation and querying tasks. To use the dplyr package, load it by entering library(dplyr) into a new cell and executing it.
With the dplyr package loaded, I will use it to view only the months with 100 or more posts by using the pipe operator %>% to pass the tibble to the filter method. Enter the following code into a new cell and execute it:
fwposts %>% filter(Posts >= 100)
The output shows only the months with 100 or more posts.
Pipes are a fundamental concept to the tidyverse. They’re used to emphasize a series of actions where the item to the left of the pipe operator becomes the input to the right of the pipe operator. Software developers familiar with fluent style of coding will immediately recognize this pattern. To view only those rows with 100 posts or more in ascending order based on the Posts column, I would write the following code:
fwposts %>% filter(Posts >= 100) %>% arrange(Posts)
To see them in descending order, I would add a call to the desc method, like this:
fwposts %>% filter(Posts >= 100) %>% arrange(desc(Posts))
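To show what the pipe is doing, the following sketch compares a piped expression with the equivalent nested function calls. The four-row tibble is hypothetical; the functions are the same dplyr functions used above.

```r
library(dplyr)

# Hypothetical stand-in for the blog data.
toy <- tibble(Posts = c(120, 45, 150, 101))

# Piped form: read left to right, one step at a time.
piped <- toy %>%
  filter(Posts >= 100) %>%
  arrange(desc(Posts))

# Nested form: the same calls, read inside out.
nested <- arrange(filter(toy, Posts >= 100), desc(Posts))

identical(piped, nested)
piped$Posts
```

Both forms produce the same tibble. The pipe only changes how the code reads, which matters more as the number of steps grows.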
I could further analyze the data by adding an additional pipe to a summarize function. Summarize functions create one value summarizing the values in a table. Enter the following code to view the number of rows, the mean post count and the mean PPD values, like so:
fwposts %>% arrange(desc(Posts)) %>% summarize(n(), mean(Posts), mean(PPD))
The values returned will be 183, 36.94536 and 1.215148.
Working with Groups
Note that the summarize function returned one value for the entire dataset. If I wanted to track how the values changed over time, I could group the values by year. To do this, I’ll import a new library (lubridate) to extract the year from the Month column. The lubridate library makes working with date values easier. Using dplyr’s mutate method, I will add a new column named Year to store the extracted value. The following code does just that, assigns the result to the fwposts variable and displays it. Take note of the new column:
library(lubridate)
fwposts <- fwposts %>% mutate(Year = lubridate::year(Month))
fwposts
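The same pattern works for other date parts. As a small sketch on hypothetical dates, the code below adds both a Year column and a MonthNumber column (a name introduced here for illustration) using lubridate’s year and month functions.

```r
library(dplyr)
library(lubridate)

# Hypothetical dates spanning a year boundary.
toy <- tibble(
  Month = as.Date(c("2018-11-01", "2018-12-01", "2019-01-01")),
  Posts = c(10, 20, 30)
)

# mutate() adds columns computed from existing ones.
toy <- toy %>%
  mutate(
    Year = year(Month),
    MonthNumber = month(Month)
  )
toy
```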
Now that the Year column has been added, I can group by it and display a summary based on the group. Enter the following code into a new cell and execute it:
posts_by_year_summary = fwposts %>%
  group_by(Year) %>%
  summarize(PostCount = n(), AvgPosts = mean(Posts), AvgPPD = mean(PPD))
posts_by_year_summary
Note that there are now summary rows for each year and that the columns have names. However, the PostCount column contains the number of rows in a given year, not the sum of the posts. To change this, I’ll need to use the sum function to add up the values in the Posts column. Enter the following code into a new cell and execute it:
posts_by_year_summary = fwposts %>%
  group_by(Year) %>%
  summarize(Records = n(), PostCount = sum(Posts), AvgPosts = mean(Posts), AvgPPD = mean(PPD))
posts_by_year_summary
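The difference between counting rows and summing values is easy to verify by hand on a tiny data set. In this sketch the three rows are hypothetical: 2018 has two monthly records totaling 30 posts, and 2019 has one record with 30 posts.

```r
library(dplyr)

# Hypothetical monthly records.
toy <- tibble(
  Year  = c(2018, 2018, 2019),
  Posts = c(10, 20, 30)
)

toy_summary <- toy %>%
  group_by(Year) %>%
  summarize(
    Records   = n(),         # number of rows in the group
    PostCount = sum(Posts),  # total of the Posts column in the group
    AvgPosts  = mean(Posts)
  )
toy_summary
```

Both years end up with a PostCount of 30, but Records is 2 for 2018 and 1 for 2019, which is why n() alone can’t stand in for the yearly total.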
Now I have the total number of posts for the year, in addition to the number of rows for a given year, stored in posts_by_year_summary. If I wanted to remove all columns except for the Year and PostCount, I’d use the select function to choose only the fields I wanted to keep. Here’s the code:
year_postcount_only <- select(posts_by_year_summary, Year, PostCount)
year_postcount_only
Alternatively, I could use the select function to remove columns. Execute the following code (the contents of the year_postcount_only tibble should be identical):
year_postcount_only <- select(posts_by_year_summary, -c(Records, AvgPosts, AvgPPD))
year_postcount_only
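A quick way to confirm that the two forms of select are equivalent is to compare their results directly. This sketch uses a hypothetical two-row summary with fewer columns than the real one.

```r
library(dplyr)

# Hypothetical summary table.
toy_summary <- tibble(
  Year      = c(2018, 2019),
  Records   = c(12L, 12L),
  PostCount = c(300, 450),
  AvgPosts  = c(25, 37.5)
)

kept    <- select(toy_summary, Year, PostCount)         # name the columns to keep
dropped <- select(toy_summary, -c(Records, AvgPosts))   # name the columns to drop

identical(kept, dropped)
names(kept)
```

Which form to use is mostly a matter of which list is shorter: the columns to keep or the columns to drop.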
Just as before, I can use the arrange and desc methods to sort the values to find the year with the most posts. Enter the following code into a blank cell and execute it, like this:
year_postcount_only %>% arrange(desc(PostCount))
The results show that 2017 was a busy year on the blog. The next step would be to plot these values onto a graph and explore the data visually.
Visualization with ggplot2
Fortunately, the ggplot2 package makes creating graphs from data straightforward, as it allows for creating graphics declaratively. Simply provide the data and instructions on mapping data columns to graphic elements, as well as which graph type to employ, and ggplot2 handles the rendering. For instance, to create a scatter plot of PostCount by Year, enter the following code to generate the graph as seen in Figure 1.
library(ggplot2)
ggplot(year_postcount_only, aes(Year, PostCount)) + geom_point()
Figure 1 Scatter Plot of PostCount by Year as Rendered by ggplot2
To connect the points on the graph with a line, enter the following code into a new cell and execute it, like this:
ggplot(year_postcount_only, aes(Year, PostCount) ) + geom_line() + geom_point()
ggplot2 also provides rich formatting options. Enter the following code to create a more colorful version of the line:
ggplot(year_postcount_only, aes(Year, PostCount) ) + geom_line(linetype="dashed", color="blue", size=1) + geom_point(color="red", size=2)
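Because a ggplot2 graph is built by adding layers to an object, it can be stored in a variable and extended later. The following is a minimal sketch, with hypothetical yearly totals, that saves the line-and-point plot and then adds a title and axis labels with the labs function.

```r
library(ggplot2)

# Hypothetical yearly totals.
toy <- data.frame(
  Year      = c(2016, 2017, 2018),
  PostCount = c(400, 900, 650)
)

# Each geom_* call adds one layer to the plot object.
p <- ggplot(toy, aes(Year, PostCount)) +
  geom_line() +
  geom_point()

# Further components can be added to the stored plot.
p_labeled <- p +
  labs(
    title = "Posts per Year (hypothetical values)",
    x = "Year",
    y = "Total posts"
  )
p_labeled
```

Nothing is drawn until the plot object is printed, which in a notebook happens when it is the last expression in a cell.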
To further explore the data, I can generate a histogram to explore the distribution of the data. For example, I want to get an idea of the distribution of how many posts there have been across all 16 years. Enter the following code to use data from the fwposts tibble to build out a histogram:
ggplot(fwposts, aes(Posts) ) + geom_histogram()
As the graph shows, most months have 50 posts or fewer, with one very noticeable outlier. In statistical terms, the number of posts is skewed right. To get some finer granularity, I will set the binwidth to 10. Enter the following code and run it to create the graph as shown in Figure 2:
ggplot(fwposts, aes(Posts) ) + geom_histogram(binwidth=10)
Figure 2 Histogram of Posts per Month with a Binwidth of 10
The histogram in Figure 2 shows that the most common number of posts lies between 30 and 40. Adjusting the binwidth to lower values increases the granularity.
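One detail worth knowing when reading bin ranges off a histogram: as far as I know, when only binwidth is given, geom_histogram centers its bins on multiples of the bin width rather than starting them there. The sketch below, on six hypothetical values, uses the boundary parameter so that bin edges fall on multiples of 10, and then inspects the computed bins with layer_data.

```r
library(ggplot2)

# Hypothetical post counts.
toy <- data.frame(Posts = c(5, 12, 18, 33, 37, 41))

# boundary = 0 asks for bin edges at ..., 0, 10, 20, ...
p <- ggplot(toy, aes(Posts)) +
  geom_histogram(binwidth = 10, boundary = 0)

# layer_data() returns the values ggplot2 computed for the layer,
# including each bin's xmin, xmax and count.
bins <- layer_data(p)
bins[, c("xmin", "xmax", "count")]
```

Looking at the computed bins is a useful check before describing a histogram in terms of exact ranges.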
Another useful visualization for understanding distribution of numeric values is the box plot. A box plot is a standardized way of displaying the distribution of data based on a five-number summary: the minimum value, first quartile, median, third quartile and maximum value. Fortunately, generating a box plot is simple in ggplot2. Enter the following code and execute it to see the box plot for Posts:
ggplot(fwposts, aes(y = Posts)) + geom_boxplot()
The generated plot shows that the first and third quartiles are around 13 and 50, with a number of outliers at or above 100. For more information about box plots, read this excellent article on the topic: bit.ly/2IbqkmX.
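The numbers behind a box plot can also be computed directly. This is a sketch on five hypothetical values, using base R’s quantile function and the common 1.5 × IQR rule for flagging outliers, which to my knowledge is also the default rule geom_boxplot uses for drawing whiskers and outlier points.

```r
# Hypothetical post counts with one extreme value.
posts <- c(10, 20, 30, 40, 200)

# First quartile, median and third quartile.
q <- quantile(posts, c(0.25, 0.5, 0.75))

# Interquartile range and the fences 1.5 IQRs beyond the quartiles.
iqr <- q[[3]] - q[[1]]
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Values outside the fences are treated as outliers.
outliers <- posts[posts < lower_fence | posts > upper_fence]

q
outliers
```

Here the quartiles are 20, 30 and 40, so the upper fence is 70 and only the value 200 is flagged.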
While base R is perfectly acceptable for most data science-related tasks, many R developers prefer to use the tidyverse suite of libraries for increased productivity. In this article, I walked through the most common steps in a typical data science pipeline: loading, exploring, manipulating and visualizing data.
These open source package libraries provide a developer experience optimized for data science. The fluent style of programming provides better code readability, streamlined workflow, and a consistent experience across multiple libraries. In fact, there’s even a style guide for package developers to follow so that new libraries fit nicely into the tidyverse.
Frank La Vigne works at Microsoft as an AI Technology Solutions Professional where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following Microsoft technical expert for reviewing this article: David Smith