Effective Data Visualization

About

This worksheet was adapted from the fourth lecture in DSCI 100, an introductory data science course for undergraduate students offered at The University of British Columbia.

Learning Objectives

The goal of this worksheet is to expand your data visualization knowledge and tool set. We will learn effective ways to visualize data, as well as some general rules of thumb to follow when creating visualizations. All visualization tasks in this tutorial will be applied to two datasets of widespread interest.

After completing this worksheet you will be able to:

  • Describe when to use the following kinds of visualizations to answer specific questions using a dataset:
    • scatter plots
    • line plots
    • bar plots
    • histograms
  • Given a dataset and a question, select from the above plot types and use R to create a visualization that best answers the question
  • Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question
  • Interpret the visualization in the context of the research question
  • Referring to the visualization, communicate the conclusions in nontechnical terms
  • Use the ggplot2 library in R to create and refine the above visualizations

Prerequisite Knowledge

This worksheet accompanies the material in Chapter 4 of Introduction to Data Science, an online textbook which covers introductory data science topics. You should read that chapter (as well as the preceding ones) before attempting the worksheet, or at least check that you know the material in Chapters 1 through 4.

Specifically, you should be familiar with the dplyr package in R and, particularly, with the following topics covered in Chapter 3 of Introduction to Data Science:

  • The pipe %>%
  • The functions mutate, filter, group_by, summarise, top_n, and arrange
  • How to use different _join functions

You should also have some knowledge of the ggplot2 package. Specifically:

  • What aesthetics, geometries, and scales are in ggplot2
  • How to add new layers to a ggplot object

Finally, you should also have a working knowledge of the plots listed in the first learning objective and to what data type they can be applied. (E.g. you should know what a scatter plot is and also that two quantitative variables are required to produce one.)

Quiz

2. Fast Food Chains in the USA

With their cheap meals and convenient drive-thrus, fast food restaurants are a growing demand in many countries. Despite their questionable ingredients and nutritional value, most Americans count on fast food in their daily lives (they are often delicious and so hard to resist…).

Source: https://media.giphy.com/media/NS6SKs3Lt8cPHhe0es/giphy.gif

According to Wikipedia,

Fast food was originally created as a commercial strategy to accommodate the larger numbers of busy commuters, travelers and wage workers who often didn’t have the time to sit down at a public house or diner and wait the normal way for their food to be cooked. By making speed of service the priority, this ensured that customers with strictly limited time (a commuter stopping to procure dinner to bring home to their family, for example, or an hourly laborer on a short lunch break) were not inconvenienced by waiting for their food to be cooked on-the-spot (as is expected from a traditional “sit down” restaurant). For those with no time to spare, fast food became a multi-billion dollar industry.

Currently, fast food is very popular and lots of businesses are investing in advertisement as well as new ideas to make their chain stand out in the sea of restaurants. In fact, a business wishing to buy a franchise location in an existing chain is hiring you as a consultant. They want to know the current market situation to set up a franchise in a state on the west coast of the United States (California, Oregon, or Washington) with few fast food restaurants. Particularly, they would like to set up a franchise that is consistently popular on the west coast, but in a state which has few fast food restaurants.

In this assignment, you will pretend to assist in the opening of this new restaurant. Your goal is to figure out which chain to recommend and which state would be the least competitive (as measured by having fewer fast food restaurants). In order to do this, you will have to answer three questions:

  1. Which is the most prevalent franchise on the west coast of the US?

  2. Is the most prevalent franchise consistent across the west coast?

  3. Which state on the west coast has the smallest number of fast food restaurants?

Question 2.1

Quiz

Question 2.2

The dataset is stored in an R object called fast_food. Print the first ten rows of this dataset.

Quiz

Answering question 1: Which is the most prevalent franchise on the west coast of the US?

Question 2.3

Next, find the top 9 restaurants (in terms of number of locations) on the west coast (in the states “CA”, “WA” or “OR”) and name them top_restaurants

Fill in the ... in the cell below.

Assign your answer to an object called top_restaurants.

top_restaurants <- fast_food %>%
    filter(state %in% c("CA", "WA", "OR")) %>% 
    group_by(...) %>%
    ...(n = n()) %>%
    ...(...) %>% 
    ...(...)
top_restaurants

Question 2.4

Now we can answer the first question we are interested in just by looking at the table!

Quiz

Question 2.5

Even though we can use the table to answer the question, remember you are going to present your results to the businesspeople who hired you. A table is not always the clearest way of showing information (although sometimes it might be, as we will see later in the activity). In our case a bar plot could be more helpful, so let’s create one!

Plot the counts for the top 9 fast food restaurants on the west coast as a bar chart using geom_bar. The number of restaurants should be on the y-axis and the restaurant names should be on the x-axis. Because we are not counting up the number of rows in our data frame, but instead are plotting the actual values in the n column, we need to use the stat = "identity" argument inside geom_bar.

To do this fill in the ... in the cell below. Make sure to label your axes. Assign your answer to an object called count_bar_chart.

count_bar_chart <- ... %>% 
  ggplot(aes(x = ..., y = ...)) + 
  geom_bar(...) +
  xlab(...) +
  ylab(...)
count_bar_chart

Question 2.6

The x-axis labels don’t look great unless you make the bars on the bar plot above quite wide, wider than is actually useful or effective. What can we do? There are two good solutions to this problem.

Part A: We can add a theme layer and rotate the labels. Choose an angle that you think is appropriate. Choose something between 20 degrees and 90 degrees for the angle argument. Use the hjust = 1 argument to ensure your labels don’t sit on top of the bars as you rotate them (try removing that argument and see what happens!).

Name the resulting plot count_bar_chart_A.

Part B: We can also simply use horizontal bars. Use the coord_flip function to achieve this effect.

Name the resulting plot count_bar_chart_B.

#PART A
count_bar_chart_A <- ... + 
    theme(axis.text.x = element_text(angle = ..., hjust = 1))
count_bar_chart_A
#PART B
count_bar_chart_B <- ... + 
    coord_flip()
count_bar_chart_B

Part C: Plot B seems to work better, but it would be great to have the bars ordered by size (that is, by restaurant count). This can be easily done in ggplot2 by reordering the name variable by n, like so:

ggplot(aes(x = reorder(name, n), y = n))

Now copy your code from question 2.5 and reorder the name variable as above. Don’t forget to add coord_flip at the end as well.

Name the resulting plot count_bar_chart_C.

count_bar_chart_C <- top_restaurants %>% 
  ggplot(aes(x = reorder(..., ...), y = ...)) +
  geom_bar(...) +
  xlab(...) +
  ylab(...) +
  coord_flip()
count_bar_chart_C

Answering question 2: Is the most dominant franchise consistent across the west coast?

Question 2.7

To answer the second question we need a data frame that has three columns: name (restaurant), state, and n (restaurant count). You will need to use semi_join to get the intersection of two data frames. In this case, you will use semi_join to use the names in top_restaurants to get the counts of each restaurant in each of the 3 states from the fast_food data frame. You will also need to group_by both name and state. Name this new data frame top_n_state.

If you are interested in learning more about joining data frames in R, see this cheatsheet.

... <- fast_food %>%
    semi_join(top_restaurants) %>% # semi_join gives the intersection of two data frames
    filter(state %in% c("CA", "WA", "OR")) %>% 
    ...(..., ...) %>%
    ...(...) 
top_n_state

As you can see, the resulting data frame has only 27 rows. We could try to obtain the answer to the question just by looking at the table (just as in question 2.4). However, even though the number of rows is not large, studying the table turns out to be a painstaking task. In fact, that’s what the previous consultant told the business owners to do… before being fired for it! Let’s make a display that summarises all the information in a single plot.

Question 2.8

Plot the counts (y-axis) for the top 9 fast food restaurants (x-axis) on the west coast, per US State (group), as a bar chart using geom_bar. Use fill = name inside aes to colour the restaurants by name. Use position = "dodge" inside geom_bar to group the bars by state. To rename the legend, use a labs layer. This time within labs use the fill argument instead of colour (this is because you need to modify the aesthetic that the legend was made from, here it was fill, not colour as earlier in the worksheet).

To do this fill in the ... in the cell below. Make sure to label your axes.

Assign your answer to an object called top_n_state_plot.

top_n_state_plot <- ... %>%
  ggplot(aes(x = state, y = n, fill = ...)) + 
  ...(stat = ..., position = "...") +
  xlab(...) +
  ylab(...) +
  labs(fill = "Restaurant") 
top_n_state_plot

How easy is that for comparing the restaurants and states to answer our question: Is the most dominant/top franchise consistent across the west coast? If we carefully look at this plot we can answer this question, but it takes us a while to process this. If we instead visualize this as a stacked bar chart using proportions instead of counts we might be able to do this more easily (making it a more effective visualization).

Question 2.9

Copy your code from Question 2.9.2 and modify position = "dodge" to position = "fill" to change from doing a grouped bar chart to a stacked bar chart with the data represented as proportions instead of counts.

top_n_state_plot <- ...





top_n_state_plot

Question 2.10

With this, we are ready to answer the second question.

Quiz

Even though we were able to answer the question, the stacked bar chart still seems to be noisy.

Quiz

Answering question 3: Which state on the west coast has the smallest number of fast food restaurants?

Question 2.11

Finally, let’s find which state on the west coast has the greatest number of fast food restaurants. We will need to use the semi_join strategy, as we did above, to use the names in top_restaurants to get the counts of each restaurant in each of the 3 states from the fast_food data frame. Name this data frame state_counts. Fill in the ... in the cell below.

... <- fast_food %>%
    semi_join(top_restaurants) %>% # semi_join gives the intersection of two data frames
    filter(state %in% c("CA", "WA", "OR")) %>% 
    ...(...) %>%
    ...(n = n()) 
state_counts

Question 2.12

Now, create a bar plot that has restaurant count on the y-axis and US state on the x-axis. Name the plot state_counts_plot.

state_counts_plot <- ...




state_counts_plot

Question 2.13.1

Great! Now we can answer the last question we are interested in.

Quiz

Observe that we could again have answered the question only by viewing the table, which only has three rows. Do you believe that in this case the plot we generated helps to summarise the information in the table? Is the table too large, as in question 2.7, or is it small enough that it can be presented to the final audience—the businesspeople—to answer the question?

Question 2.13.2

You can also approach the third question from a different perspective. Consider the populations of California (39.512 million), Oregon (4.217 million) and Washington (7.615 million) (source: 2019 United States Census Bureau, visited on April 2020). Is the raw restaurant count for each state the best measure of competition?

Calculate the restaurant per capita for the states on the west coast by filling in the ... in the code below.

state_counts$population <- c(39.512, 4.217, 7.615)
state_counts %>%
  mutate(n.per.capita = ... / ...)

You are encouraged to also make a plot like the one in question 2.12 to visualize this information.

Question 2.13.3

Quiz

Final recommendation to client

Now that we answered the research questions, it is good practice to think about what our final recommendation to the businesspeople will be. Recall that they wish to set up a restaurant from a consistently popular franchise in the least competitive west coast state (measured by number of restaurants).

Question 2.14

Quiz

Finally, reflect on which of the visualizations we produced would better support your recommendation. How many plots would you include? Which ones? Why? Are some tables better than the corresponding plots at presenting the information?

We are just scratching the surface of how to create effective visualizations in R. For example, we haven’t covered how to change from the default colours palette ggplot2 provides. To learn more, visit the links in the worksheet and practice, practice, practice! Go forth and make beautiful and effective plots!