|||

From wide to long formatted data

Suppose that you have the following dataset. We have the columns City and Product and then a bunch of columns of years with the revenues for the specific product code and city. The numbers themselves are not important, they are just there for illustration purposes.

mydata <- read_csv('./_city_year_data.csv') %>% as.data.frame(.)
mydata
##        City Product code    2018    2019    2020    2021    2022
## 1    London           A1 $102.46 $128.78  $69.09  $83.88  $90.82
## 2  New York           A1 $134.38  $67.85  $66.31  $99.81 $128.13
## 3     Dubai           A1 $157.32  $61.38  $71.76 $138.77 $126.03
## 4    Berlin           A1 $121.71  $90.56  $64.06  $95.74  $98.08
## 5    London           A2  $50.38 $112.65  $34.98  $70.16  $95.51
## 6  New York           A2  $84.42  $70.35  $76.27 $122.95  $43.43
## 7     Dubai           A2  $85.92  $77.10  $89.65  $54.50  $56.55
## 8    Berlin           A2  $85.63  $69.45  $54.09  $71.10 $121.21
## 9    London           A3  $62.19  $74.85  $82.60  $66.69  $43.99
## 10 New York           A3  $29.82 $117.36 $118.03  $39.20  $82.36
## 11    Dubai           A3  $52.99  $24.97  $35.59 $107.48  $34.46
## 12   Berlin           A3  $51.01  $46.51  $94.73  $59.84  $75.34

This is a fairly common format if you want to present these numbers, but not a very useful one if you want to do further calculations according to those numbers.

The task

A very common task would be to transform the dataset to a new one with just four columns of City, Product, Year, Revenue. The point of course is to do this automatically (programmatically) and not by copy and pasting. Imaging that this is one step of a multistage process on your excel routine that needs to be done let’s say once per day. This dataset is a toy example of 12 rows and 7 columns and transforming it by hand would be very simple. In a real case scenario this can get very big very quickly and it might be unfeasible to do this by hand. I know people that would break the Excel sheet and assign different people to each segment to do the transformation by hand. We are not aiming to be part of this tribe!

I challenge you to find a way to do this programmatically in Excel before continuing reading this article to understand the complexity. Pretend that you have an arbitrary number of rows and columns in your original dataset.

How would you ensure that you will do the conversion 100% of the time?

I really don’t know the answer in Excel. I tried to do it, but I could find a good enough way. I will now give you the answer in 3 lines of R code.

The transformation

In the following script I will:

Replace any NA values with 0. Change the format of the input to the desired output. Save the output file as an excel file.

Then we will go line by line to explain it.

mydata <- mydata %>% 
  replace(is.na(.), 0) %>%
  pivot_longer(!c('City', 'Product code'), names_to="Year", values_to="Revenue") %>% 
  write_csv(paste0("./_formated.", format(Sys.time(), "%d-%b-%Y %H.$M"), ".csv"), append=FALSE)

How to read it:

mydata %>% replace(is.na(.), 0)

I have mydata and I am replaceing is.na values with the number 0

… %>% pivot_longer(!c(‘City’, Product code’), names_to=“Year”, values_to=“Revenue”)

From the previous step, I want to pivot to a longer format. How?

I want to pivot everything except (the ! symbol) the columns City and Product type  !c(‘City’, Product code’). Then I want to give the name Year to the created column that gathers the years names_to=“Year, and also name Revenue the column that gathers the data from the cells values_to=“Revenue.

write_csv(paste0(”./_formated.”, format(Sys.time(), %d-%b-%Y %H.$M”), .csv”), append=FALSE)

Lastly, I am saving the result to a cvs file that has the name _formated + hourdate + .csv

The output

mydata
## # A tibble: 60 × 4
##    City     `Product code` Year  Revenue
##    <chr>    <chr>          <chr> <chr>  
##  1 London   A1             2018  $102.46
##  2 London   A1             2019  $128.78
##  3 London   A1             2020  $69.09 
##  4 London   A1             2021  $83.88 
##  5 London   A1             2022  $90.82 
##  6 New York A1             2018  $134.38
##  7 New York A1             2019  $67.85 
##  8 New York A1             2020  $66.31 
##  9 New York A1             2021  $99.81 
## 10 New York A1             2022  $128.13
## # … with 50 more rows

As you can see, the output looks exactly like we intended it to be. Just to double-check, the original file had 12 rows and 5 columns of data (excluding the first two columns of City and Product code), or 60 revenues, and the final report has 60 rows on the Revenue column so we haven’t lost any numbers in the way!

Up next Coding for non coding departments. Basic setup and libraries In this post I will explain some basics about R and the libraries that I choose to use mainly in this series. This is going to be a small post as
Latest posts Basic setup and libraries From wide to long formatted data Coding for non coding departments. My attempt at demystifying causality How to explore data meaningfully Giovanni Pierluigi da Palestrina Excel and python