From wide to long formatted data
Suppose that you have the following dataset. We have the columns City and Product and then a bunch of columns of years with the revenues for the specific product code and city. The numbers themselves are not important, they are just there for illustration purposes.
mydata <- read_csv('./_city_year_data.csv') %>% as.data.frame(.)
mydata
## City Product code 2018 2019 2020 2021 2022
## 1 London A1 $102.46 $128.78 $69.09 $83.88 $90.82
## 2 New York A1 $134.38 $67.85 $66.31 $99.81 $128.13
## 3 Dubai A1 $157.32 $61.38 $71.76 $138.77 $126.03
## 4 Berlin A1 $121.71 $90.56 $64.06 $95.74 $98.08
## 5 London A2 $50.38 $112.65 $34.98 $70.16 $95.51
## 6 New York A2 $84.42 $70.35 $76.27 $122.95 $43.43
## 7 Dubai A2 $85.92 $77.10 $89.65 $54.50 $56.55
## 8 Berlin A2 $85.63 $69.45 $54.09 $71.10 $121.21
## 9 London A3 $62.19 $74.85 $82.60 $66.69 $43.99
## 10 New York A3 $29.82 $117.36 $118.03 $39.20 $82.36
## 11 Dubai A3 $52.99 $24.97 $35.59 $107.48 $34.46
## 12 Berlin A3 $51.01 $46.51 $94.73 $59.84 $75.34
This is a fairly common format if you want to present these numbers, but not a very useful one if you want to do further calculations according to those numbers.
A very common task would be to transform the dataset to a new one with just four columns of City, Product, Year, Revenue. The point of course is to do this automatically (programmatically) and not by copy and pasting. Imaging that this is one step of a multistage process on your excel routine that needs to be done let’s say once per day. This dataset is a toy example of 12 rows and 7 columns and transforming it by hand would be very simple. In a real case scenario this can get very big very quickly and it might be unfeasible to do this by hand. I know people that would break the Excel sheet and assign different people to each segment to do the transformation by hand. We are not aiming to be part of this tribe!
I challenge you to find a way to do this programmatically in Excel before continuing reading this article to understand the complexity. Pretend that you have an arbitrary number of rows and columns in your original dataset.
How would you ensure that you will do the conversion 100% of the time?
I really don’t know the answer in Excel. I tried to do it, but I could find a good enough way. I will now give you the answer in 3 lines of R code.
In the following script I will:
Replace any NA values with 0. Change the format of the input to the desired output. Save the output file as an excel file.Then we will go line by line to explain it.
mydata <- mydata %>%
replace(is.na(.), 0) %>%
pivot_longer(!c('City', 'Product code'), names_to="Year", values_to="Revenue") %>%
write_csv(paste0("./_formated.", format(Sys.time(), "%d-%b-%Y %H.$M"), ".csv"), append=FALSE)
mydata %>% replace(is.na(.), 0)
I have mydata and I am replaceing is.na values with the number 0
… %>% pivot_longer(!c(‘City’, ‘Product code’), names_to=“Year”, values_to=“Revenue”)
From the previous step, I want to pivot to a longer format. How?
I want to pivot everything except (the ! symbol) the columns City and Product type → !c(‘City’, ‘Product code’). Then I want to give the name Year to the created column that gathers the years → names_to=“Year, and also name Revenue the column that gathers the data from the cells → values_to=“Revenue.
write_csv(paste0(”./_formated.”, format(Sys.time(), “%d-%b-%Y %H.$M”), ”.csv”), append=FALSE)
Lastly, I am saving the result to a cvs file that has the name _formated + hourdate + .csv
mydata
## # A tibble: 60 × 4
## City `Product code` Year Revenue
## <chr> <chr> <chr> <chr>
## 1 London A1 2018 $102.46
## 2 London A1 2019 $128.78
## 3 London A1 2020 $69.09
## 4 London A1 2021 $83.88
## 5 London A1 2022 $90.82
## 6 New York A1 2018 $134.38
## 7 New York A1 2019 $67.85
## 8 New York A1 2020 $66.31
## 9 New York A1 2021 $99.81
## 10 New York A1 2022 $128.13
## # … with 50 more rows
As you can see, the output looks exactly like we intended it to be. Just to double-check, the original file had 12 rows and 5 columns of data (excluding the first two columns of City and Product code), or 60 revenues, and the final report has 60 rows on the Revenue column so we haven’t lost any numbers in the way!