In this post I will explain some basics about R and the libraries that I chose to use mainly in this series. This is going to be a short post, as going in depth is outside the scope of this series.
Something that doesn’t exist in basic Excel files is the variable. In contrast, it is one of the most basic concepts in coding. We define variables for all sorts of things, such as numbers, characters, words, sentences, dataframes, files, statistical models, graphs, and even more complicated things.
In R we define a variable by just writing its name, and while there are some restrictions as to what words we can use, you can basically name your variables anything that doesn’t start with a number. The name 10th_date is not a valid variable name, while date_10 is fine.
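If you are unsure whether a name is allowed, one way to check is base R's make.names(), which rewrites any string into a syntactically valid name. A name that comes back unchanged was already valid. The helper is_valid_name below is a hypothetical name used only for this sketch, not something that ships with R.

```r
# Hypothetical helper (not part of R): a name is syntactically valid
# if make.names() returns it unchanged.
is_valid_name <- function(name) {
  make.names(name) == name
}

is_valid_name("date_10")    # TRUE
is_valid_name("10th_date")  # FALSE

# make.names() also shows what a valid version would look like
make.names("10th_date")     # "X10th_date"
```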
We can set the value of a variable using the <- or the = operators, and you will understand the difference as time goes on. For now, stick to the <- operator for setting variable values.
number_variable <- 3
print(number_variable)
## [1] 3
string_variable <- "hello"
string_variable
## [1] "hello"
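One place where the two operators behave differently is inside a function call. The sketch below shows it with a hypothetical function echo and made-up variable names: inside a call, = only names an argument, while <- creates a variable as a side effect.

```r
# Hypothetical function for illustration: returns its argument unchanged
echo <- function(demo_input) {
  demo_input
}

# Inside a call, = only matches the argument called demo_input
echo(demo_input = 10)
exists("demo_input")   # FALSE: no variable was created

# Inside a call, <- assigns first and then passes the value along
echo(demo_side <- 10)
exists("demo_side")    # TRUE: a variable was created
demo_side              # 10
```

This is one reason to keep <- for assignment and = for naming function arguments.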
In programming languages, more often than not we include libraries or packages in our programs that allow us to extend their functionality beyond the basic things that we can do with the language itself.
In this series we are mainly going to use two libraries called tidyr and dplyr. Loading them is almost always going to be the first thing that we write down, in order to be able to use their functionality in our programs.
library(tidyr)
library(dplyr)
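library() fails if a package has not been installed yet. As a minimal sketch, assuming you want to check before loading, the hypothetical helper missing_packages below uses base R's requireNamespace() to list which packages still need installing. The install.packages() call is left commented out because it downloads from the internet.

```r
# Hypothetical helper (not part of R): return the packages from 'pkgs'
# that are not installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyr", "dplyr"))
to_install

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```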
Lastly, I need to introduce the pipe operator, %>%. This operator originally comes from the magrittr package and is re-exported by dplyr, which is one reason we will always load that library in our programs. It doesn’t add anything new, it just makes everything much more readable and straightforward! While it exists in Python as well, it doesn’t work in exactly the same way, and in my opinion it makes the final product more complicated than it should be.
Think of it as: everything on the left-hand side is sent to the right-hand side. Here is a quick demo. There are various preloaded datasets that come with R. One of them is the mtcars dataset, which you can see by just typing mtcars.
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
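Before the larger example, here is a small sketch of the "left-hand side is sent to the right-hand side" idea on plain numbers. The variable names are made up for illustration. The value on the left becomes the first argument of the function on the right, and any other arguments stay where they are.

```r
library(dplyr)

values <- c(1, 4, 9)

# The left-hand side becomes the first argument of the right-hand side
piped  <- values %>% sqrt()
nested <- sqrt(values)
identical(piped, nested)               # TRUE

# Extra arguments are kept: x %>% f(y) is the same as f(x, y)
piped_round  <- 3.14159 %>% round(2)
nested_round <- round(3.14159, 2)
identical(piped_round, nested_round)   # TRUE
```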
I am going to write a quick script that returns the average miles per gallon (column mpg) per number of cylinders (column cyl) for cars with more than 1 carburetor (column carb), with and without the %>% operator.
mtcars %>%
  filter(carb > 1) %>%
  group_by(cyl) %>%
  summarize(Avg_mpg = mean(mpg))
## # A tibble: 3 × 2
## cyl Avg_mpg
## <dbl> <dbl>
## 1 4 25.9
## 2 6 19.7
## 3 8 15.1
In words: take mtcars, filter the column carb to be bigger than 1, then group the filtered dataset by cylinder, and from that grouped dataset create a column called Avg_mpg that calculates the mean of the column mpg.
summarize(
  group_by(
    filter(mtcars, carb > 1),
    cyl
  ),
  Avg_mpg = mean(mpg)
)
## # A tibble: 3 × 2
## cyl Avg_mpg
## <dbl> <dbl>
## 1 4 25.9
## 2 6 19.7
## 3 8 15.1
In words: create a column called Avg_mpg that calculates the mean of mpg on the dataset that is filtered for carb to be bigger than 1 and grouped by the cylinders cyl.
I think the %>% syntax is better. Both pieces of code return the same output, so they are completely equivalent in that regard. However, I would argue that the first one is much simpler to look at. This is because the %>% syntax is read from top to bottom, while the syntax without the pipe is read from the inside out!
These are the tools that we need for now: the concept of variables, two external libraries, and one extra operator. Now it’s just a matter of learning by example, practicing the examples, and applying them to real tasks. Good luck!