Reshape Wide to Long in R
How to Collapse Multiple Columns into One Using R
By Daniel D. Bonneau in R Basics
July 4, 2022
In this video, we’ll be using the Video Game Sales Data from Kaggle. This data set is pretty straightforward, and shows video game sales across titles, platforms, and regions for a number of years starting in the 1980’s.
Here’s a quick look at the data set as it comes:
raw <- read.csv("vgsales.csv") head(raw)
## Rank Name Platform Year Genre Publisher NA_Sales ## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49 ## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08 ## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85 ## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75 ## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27 ## 6 6 Tetris GB 1989 Puzzle Nintendo 23.20 ## EU_Sales JP_Sales Other_Sales Global_Sales ## 1 29.02 3.77 8.46 82.74 ## 2 3.58 6.81 0.77 40.24 ## 3 12.88 3.79 3.31 35.82 ## 4 11.01 3.28 2.96 33.00 ## 5 8.89 10.22 1.00 31.37 ## 6 2.26 4.22 0.58 30.26
The primary columns of interest for this project will be
Other_Sales. For this, we will just leave all other columns in the data set as they are. Each of the columns mentioned above represent the total value of all sales (in Millions) within their region: North America, Europe, Japan, and Other, respectively.
What is Wide Data?
In order to show how we can convert wide data to long data in R, we first need to have some understanding of what we mean by ‘Wide’ data. The way I always like to think about this is: If we have a category, and we have a separate column for each category then it’s wide. If, instead, there is only one column called something like
category, that holds multiple ‘factor’ levels, then our data set is long.
However, another way to look at this is from the repetition of the ID column. In a wide data set, none of our ID values will repeat. For example, within the data set above, we see that we have one title across one platform in a given year. In a long data set, we will have the same ID column repeat each time we have a new variable (in this case
This is a little hard to explain without a formal example, so let’s dive right in. If you’re looking for a deeper treatment on the difference between wide and long data structures, take a look at this article.
Wide to Long in R
In order for us to transform our data from wide to long format, we will be using the
tidyr package held within the
tidyverse metapackage. Loading the
tidyverse package will load a handful of handy functions, like so:
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4 ## ✔ tibble 3.1.7 ✔ dplyr 1.0.9 ## ✔ tidyr 1.2.0 ✔ stringr 1.4.0 ## ✔ readr 2.1.2 ✔ forcats 0.5.1 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag()
Here, we can see the packages it reads in, their version numbers, and also conflicts with packages that are already loaded into our R workspace. In this case, it’s just a few conflicts where
dplyr will be preferred to the base functions that share the same name as the
pivot_longer() to Reshape Our Data
Now, all we are trying to do here is combine the columns that end with
_sales all into one new column that we title as
region. To do this, we have a few arguments we need to specify within the
Note: I’ll also be renaming the columns to lowercase using
raw %>% rename_with(tolower) %>% pivot_longer(cols = na_sales:other_sales, names_to = "region") %>% head()
## # A tibble: 6 × 9 ## rank name platform year genre publisher global_sales region value ## <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> ## 1 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 na_sa… 41.5 ## 2 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 eu_sa… 29.0 ## 3 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 jp_sa… 3.77 ## 4 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 other… 8.46 ## 5 2 Super Mario Br… NES 1985 Plat… Nintendo 40.2 na_sa… 29.1 ## 6 2 Super Mario Br… NES 1985 Plat… Nintendo 40.2 eu_sa… 3.58
Compared to what we saw above when we looked at the
raw data frame, we can see a couple of differences. Namely, we see that
Other_Sales are no longer present within the data set. Instead, we have one column titled
region. This name of region comes from the
names_to = "region" argument of the code above.
cols = na_sales:other_sales argument tells
pivot_longer() what columns we want to collapse into one. The colon (
:) between the two column names essentially says: ‘select all columns from the column before the colon to the column after it’. We could have also specified each column individually, but because these columns were all right after one another, using the colon syntax made our lives a little easier.
Notice also that we now see ‘Wii Sports’ repeated for a total of 4 times, rather than the one time we saw before. As mentioned earlier, this is an additional way to verify that our data is now in a long format. The reason this happens is because we now need a new row for every column we collapsed into the
region column. The information from all of the other columns is repeated across these rows, but the
value columns will hold the information that was previously in the column for the sales of the respective region.
Renaming the Value Column
value isn’t the best name for what it is we’re pivoting. However, I wanted to break this up into two sections so we can focus on the main components. To rename this, we have another argument we can specify inside of
values_to =. Since the values for each of the columns are sales, I’m going to rename the value column to
raw %>% rename_with(tolower) %>% pivot_longer(cols = na_sales:other_sales, names_to = "region", values_to = "sales") %>% head()
## # A tibble: 6 × 9 ## rank name platform year genre publisher global_sales region sales ## <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> ## 1 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 na_sa… 41.5 ## 2 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 eu_sa… 29.0 ## 3 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 jp_sa… 3.77 ## 4 1 Wii Sports Wii 2006 Spor… Nintendo 82.7 other… 8.46 ## 5 2 Super Mario Br… NES 1985 Plat… Nintendo 40.2 na_sa… 29.1 ## 6 2 Super Mario Br… NES 1985 Plat… Nintendo 40.2 eu_sa… 3.58
And that’s it! We’ve now taken our data set with four columns specifying the sales for various regions, and ‘pivoted’ them into one column for the region name, and another column for the value - in this case, sales.
Pivoting your data from wide to long is something that can be very handy when working in R. For example, if we wanted to group the data set to find a ranking of the regional sales (alone, or by year, title, or anything else within the data) we now have one easy column we can use to group our data set. Like this:
raw %>% rename_with(tolower) %>% pivot_longer(cols = na_sales:other_sales, names_to = "region", values_to = "sales") %>% group_by(region) %>% summarize(total_sales = sum(sales)) %>% arrange(desc(total_sales))
## # A tibble: 4 × 2 ## region total_sales ## <chr> <dbl> ## 1 na_sales 4393. ## 2 eu_sales 2434. ## 3 jp_sales 1291. ## 4 other_sales 798.
Additionally, if we wanted to plot something such as total sales by region over time, we would likely be using
ggplot2 to do this.
ggplot prefers data be in a long format to accurately plot it. A common issue facing many new R programmers is that when they go to plot their data, their plots look weird. Many times, this is because they haven’t yet done this transformation.
I hope this was helpful and keep your eye out for more posts about working in R!