Reshape Wide to Long in R

How to Collapse Multiple Columns into One Using R

By Daniel D. Bonneau in R Basics

July 4, 2022


Our Data

In this video, we’ll be using the Video Game Sales Data from Kaggle. This data set is pretty straightforward, and shows video game sales across titles, platforms, and regions for a number of years starting in the 1980’s.

Here’s a quick look at the data set as it comes:

raw <- read.csv("vgsales.csv")

head(raw)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26

The primary columns of interest for this project will be NA_Sales, EU_Sales, JP_Sales, and Other_Sales. For this, we will just leave all other columns in the data set as they are. Each of the columns mentioned above represent the total value of all sales (in Millions) within their region: North America, Europe, Japan, and Other, respectively.

What is Wide Data?

In order to show how we can convert wide data to long data in R, we first need to have some understanding of what we mean by ‘Wide’ data. The way I always like to think about this is: If we have a category, and we have a separate column for each category then it’s wide. If, instead, there is only one column called something like category, that holds multiple ‘factor’ levels, then our data set is long.

However, another way to look at this is from the repetition of the ID column. In a wide data set, none of our ID values will repeat. For example, within the data set above, we see that we have one title across one platform in a given year. In a long data set, we will have the same ID column repeat each time we have a new variable (in this case region).

This is a little hard to explain without a formal example, so let’s dive right in. If you’re looking for a deeper treatment on the difference between wide and long data structures, take a look at this article.

Wide to Long in R

In order for us to transform our data from wide to long format, we will be using the tidyr package held within the tidyverse metapackage. Loading the tidyverse package will load a handful of handy functions, like so:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Here, we can see the packages it reads in, their version numbers, and also conflicts with packages that are already loaded into our R workspace. In this case, it’s just a few conflicts where dplyr will be preferred to the base functions that share the same name as the dplyr package.

Using pivot_longer() to Reshape Our Data

Now, all we are trying to do here is combine the columns that end with _sales all into one new column that we title as region. To do this, we have a few arguments we need to specify within the pivot_longer() function.

Note: I’ll also be renaming the columns to lowercase using rename_with(tolower).

raw %>%
  rename_with(tolower) %>%
  pivot_longer(cols = na_sales:other_sales, names_to = "region") %>%
  head()
## # A tibble: 6 × 9
##    rank name            platform year  genre publisher global_sales region value
##   <int> <chr>           <chr>    <chr> <chr> <chr>            <dbl> <chr>  <dbl>
## 1     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 na_sa… 41.5 
## 2     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 eu_sa… 29.0 
## 3     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 jp_sa…  3.77
## 4     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 other…  8.46
## 5     2 Super Mario Br… NES      1985  Plat… Nintendo          40.2 na_sa… 29.1 
## 6     2 Super Mario Br… NES      1985  Plat… Nintendo          40.2 eu_sa…  3.58

Compared to what we saw above when we looked at the raw data frame, we can see a couple of differences. Namely, we see that NA_Sales, EU_Sales, JP_Sales, and Other_Sales are no longer present within the data set. Instead, we have one column titled region. This name of region comes from the names_to = "region" argument of the code above.

Also, the cols = na_sales:other_sales argument tells pivot_longer() what columns we want to collapse into one. The colon (:) between the two column names essentially says: ‘select all columns from the column before the colon to the column after it’. We could have also specified each column individually, but because these columns were all right after one another, using the colon syntax made our lives a little easier.

Notice also that we now see ‘Wii Sports’ repeated for a total of 4 times, rather than the one time we saw before. As mentioned earlier, this is an additional way to verify that our data is now in a long format. The reason this happens is because we now need a new row for every column we collapsed into the region column. The information from all of the other columns is repeated across these rows, but the region and value columns will hold the information that was previously in the column for the sales of the respective region.

Renaming the Value Column

value isn’t the best name for what it is we’re pivoting. However, I wanted to break this up into two sections so we can focus on the main components. To rename this, we have another argument we can specify inside of pivot_longer(), namely, values_to =. Since the values for each of the columns are sales, I’m going to rename the value column to sales.

raw %>%
  rename_with(tolower) %>%
  pivot_longer(cols = na_sales:other_sales, names_to = "region", values_to = "sales") %>%
  head()
## # A tibble: 6 × 9
##    rank name            platform year  genre publisher global_sales region sales
##   <int> <chr>           <chr>    <chr> <chr> <chr>            <dbl> <chr>  <dbl>
## 1     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 na_sa… 41.5 
## 2     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 eu_sa… 29.0 
## 3     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 jp_sa…  3.77
## 4     1 Wii Sports      Wii      2006  Spor… Nintendo          82.7 other…  8.46
## 5     2 Super Mario Br… NES      1985  Plat… Nintendo          40.2 na_sa… 29.1 
## 6     2 Super Mario Br… NES      1985  Plat… Nintendo          40.2 eu_sa…  3.58

Closing Thoughts

And that’s it! We’ve now taken our data set with four columns specifying the sales for various regions, and ‘pivoted’ them into one column for the region name, and another column for the value - in this case, sales.

Pivoting your data from wide to long is something that can be very handy when working in R. For example, if we wanted to group the data set to find a ranking of the regional sales (alone, or by year, title, or anything else within the data) we now have one easy column we can use to group our data set. Like this:

raw %>%
  rename_with(tolower) %>%
  pivot_longer(cols = na_sales:other_sales, names_to = "region", values_to = "sales") %>%
  group_by(region) %>%
  summarize(total_sales = sum(sales)) %>%
  arrange(desc(total_sales))
## # A tibble: 4 × 2
##   region      total_sales
##   <chr>             <dbl>
## 1 na_sales          4393.
## 2 eu_sales          2434.
## 3 jp_sales          1291.
## 4 other_sales        798.

Additionally, if we wanted to plot something such as total sales by region over time, we would likely be using ggplot2 to do this. ggplot prefers data be in a long format to accurately plot it. A common issue facing many new R programmers is that when they go to plot their data, their plots look weird. Many times, this is because they haven’t yet done this transformation.

I hope this was helpful and keep your eye out for more posts about working in R!