Introduction

Michael J. Crawley’s Statistics: An Introduction Using R is an excellent resource for learning both statistics and R programming. Many university courses overload the learner with too much complexity all at once, but Crawley expertly works his way from the bare basics of statistics and programming into intermediate statistical modeling. Combine this resource with some of the individual courses at [DataCamp](https://www.datacamp.com/home) (a low-cost subscription to over 150 courses in data science), and you’ll have a commanding knowledge to make an impact in any discipline.

Crawley has an ecology background, so the data is understandably ecological. However, even if you are more interested in data from the social sciences, business, engineering or anything else, the same topics and techniques are immediately transferable. If you understand Analysis of Variance in the context of glycogen found in rat livers, you can understand Analysis of Variance applied to a marketing campaign.

Getting Started

R’s strength lies in the unique, customizable packages and libraries that developers and advanced users build to help other R users work more efficiently. However, several thousand packages reside in the repository, all of them constantly evolving, expanding and morphing, so it can be difficult to navigate the entire landscape. Understanding what all of them do would be impossible, so you will have to gradually figure out which libraries make the most sense for your work. Taking courses and going through tutorials is the best way to find useful packages, as experts can point you in the right direction. Crawley does the majority of his work with base ‘R’, meaning the functionality that comes with a fresh download of R. The benefit of this approach is that it simplifies things for the learner, but it also misses out on some important functionality from packages that are essential for an aspiring R programmer. My goal is to blend Crawley’s great work with some updated capabilities within R.

I prefer to keep all the libraries that I am using (libraries are the packages once they are downloaded and on your machine) in one place, at the top, so it’s easy to know what is loaded and what is missing. Using them is a two-step process: first you install the package, which is a one-time process; then you load the package as a library, which has to be done each session. If you are simply working by yourself and already have everything installed, you could just type library("tidyverse") and it will load, but when interacting with other people it is essential that everyone is working with the same setup. The ‘package.check’ function below looks at the user’s system, installs any packages that are not yet installed, then loads them as libraries for the session. Building functions like this is an important part of R programming, but we will discuss it later.

For this session, we are using the ‘tidyverse’, which bundles several libraries, including ‘dplyr’ (data manipulation), ‘ggplot2’ (data visualization) and ‘readr’ (importing data into R). We will discuss the other libraries as we use them.

packages <- c("tidyverse",
              "gridExtra")
package.check <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dependencies = TRUE)
    library(x, character.only = TRUE)
  }
})

Data

You can download the data from Crawley’s page on the Imperial College London website. It downloads as a zip file containing all of the .csv files you will need to follow along. One piece of advice: spend some time up front building a consistent file and folder hierarchy on your system that won’t change, or upload the files to a server and reference them there. It took me quite some time to get comfortable with references, relative paths, absolute paths, and making sure my script is integrated with the data and the output. I suggest creating a hierarchy that is used the same way for every project you create. On most comprehensive projects you will have to reference files and information from many different places, so be advised that the process will not always be this straightforward.

On most projects, I create a project folder and then three folders within it; here, ‘data’ and ‘scripts’ are the two we need, plus a folder for output. I keep everything lower-case and avoid spaces to prevent issues in my references. For this project specifically, unzip the data (the set of .csv files) into the ‘data’ folder, and save a script in the ‘scripts’ folder. When you open a script from that location, your working directory should be at that location; test this by typing ‘getwd()’. Now, to access the data, your relative path will start with ‘../data/xxx’, which goes up from the ‘scripts’ folder into the project folder and then down into the ‘data’ folder to find the ‘.csv’.
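
As a quick check, a sketch like the following (assuming the ‘scripts’/‘data’ layout described above) confirms that the relative path resolves before you try to read anything:

getwd()                # should end in ".../your-project/scripts"
list.files("../data")  # should list worms.csv, das.csv, fertyield.csv, ...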

Crawley loads and attaches each dataset one at a time (so when he references the data, it refers to whichever dataset is currently attached), working with one dataset at a time; a sketch of that pattern follows the code below. This saves some time, but I prefer to load all the datasets at once and reference them explicitly. While it is a little more work specifying which dataset I’m working with each time I write a line of code, it’s easier to follow the work and understand which dataset is in play.
Here are the different datasets we will be working with in this chapter, loaded up front for simplicity’s sake.

worms   <- read_csv("../data/worms.csv")
das     <- read_csv("../data/das.csv")
yields  <- read_csv("../data/fertyield.csv")
scatter <- read_csv("../data/scatter.csv")
weather <- read_csv("../data/weather.data.csv")
coplot  <- read_csv("../data/coplot.csv")
np      <- read_csv("../data/np.csv")
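
For illustration, the load-and-attach pattern looks roughly like this (a sketch only; I stick with the read_csv() calls above for the rest of this post):

worms <- read.csv("../data/worms.csv")
attach(worms)   # columns can now be referenced directly, e.g. Worm.density
# ... analysis of the worms data ...
detach(worms)   # detach before attaching the next dataset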

Basics of Analysis

Now that we’ve got the environment set up, let’s start with the ‘worms’ file. We will split it up, practice some basic manipulations and learn how to summarize the data. I encourage you to follow along in your own R environment, and get a copy of Crawley’s [book](https://www.amazon.com/Statistics-Introduction-Michael-J-Crawley/dp/1118941098/ref=sr_1_1/141-6702608-0790736?s=books&ie=UTF8&qid=1536538024&sr=1-1&refinements=p_27%3AMichael+J.+Crawley).

Let’s use the ‘head’ command to look at the first 6 lines of the dataset, which gives us a simple overview of how the data looks. We get some good information here, including the names of the variables and their types. One of the most important things we want to do here is figure out some of the relationships between variables.

head(worms)
## # A tibble: 6 x 7
##   Field.Name       Area Slope Vegetation Soil.pH Damp  Worm.density
##   <chr>           <dbl> <dbl> <chr>        <dbl> <lgl>        <dbl>
## 1 Nashs.Field       3.6    11 Grassland      4.1 FALSE            4
## 2 Silwood.Bottom    5.1     2 Arable         5.2 FALSE            7
## 3 Nursery.Field     2.8     3 Grassland      4.3 FALSE            2
## 4 Rush.Meadow       2.4     5 Meadow         4.9 TRUE             5
## 5 Gunness.Thicket   3.8     0 Scrub          4.2 FALSE            6
## 6 Oak.Mead          3.1     2 Grassland      3.9 FALSE            2

To practice some manipulations, let’s just look at columns 1-3 and select all rows. One of the keys to data analysis is shedding data that isn’t useful. You will build more sophisticated analyses faster with a clean, concise dataset than you would otherwise, and the process will be much more enjoyable.

If we know we will only use the first three columns, we can use this technique to remove the unnecessary data. We use brackets ‘[ ]’ to subset the worms dataset, and leave the first position (before the comma) blank, which represents rows. With no constraint there, all rows are returned. The columns are specified after the comma, which in our example is ‘1:3’ (you can think of the colon as ‘through’).

worms[, 1:3]
## # A tibble: 20 x 3
##    Field.Name         Area Slope
##    <chr>             <dbl> <dbl>
##  1 Nashs.Field         3.6    11
##  2 Silwood.Bottom      5.1     2
##  3 Nursery.Field       2.8     3
##  4 Rush.Meadow         2.4     5
##  5 Gunness.Thicket     3.8     0
##  6 Oak.Mead            3.1     2
##  7 Church.Field        3.5     3
##  8 Ashurst             2.1     0
##  9 The.Orchard         1.9     0
## 10 Rookery.Slope       1.5     4
## 11 Garden.Wood         2.9    10
## 12 North.Gravel        3.3     1
## 13 South.Gravel        3.7     2
## 14 Observatory.Ridge   1.8     6
## 15 Pond.Field          4.1     0
## 16 Water.Meadow        3.9     0
## 17 Cheapside           2.2     8
## 18 Pound.Hill          4.4     2
## 19 Gravel.Pit          2.9     1
## 20 Farm.Wood           0.8    10

This command is similar, but notice the constraint is before the comma, which selects rows 5 through 15, and all columns.

worms[5:15,]
## # A tibble: 11 x 7
##    Field.Name         Area Slope Vegetation Soil.pH Damp  Worm.density
##    <chr>             <dbl> <dbl> <chr>        <dbl> <lgl>        <dbl>
##  1 Gunness.Thicket     3.8     0 Scrub          4.2 FALSE            6
##  2 Oak.Mead            3.1     2 Grassland      3.9 FALSE            2
##  3 Church.Field        3.5     3 Grassland      4.2 FALSE            3
##  4 Ashurst             2.1     0 Arable         4.8 FALSE            4
##  5 The.Orchard         1.9     0 Orchard        5.7 FALSE            9
##  6 Rookery.Slope       1.5     4 Grassland      5   TRUE             7
##  7 Garden.Wood         2.9    10 Scrub          5.2 FALSE            8
##  8 North.Gravel        3.3     1 Grassland      4.1 FALSE            1
##  9 South.Gravel        3.7     2 Grassland      4   FALSE            2
## 10 Observatory.Ridge   1.8     6 Grassland      3.8 FALSE            0
## 11 Pond.Field          4.1     0 Meadow         5   TRUE             6

Crawley primarily uses the base functions of ‘R’, which is common in academia and among those who have used ‘R’ for quite some time. However, the power of ‘R’ comes from the open-source network of developers who create custom functions and processes and bind them together as ‘packages’, which are free for anyone to download. Many newer users of ‘R’ greatly prefer the simplicity of packages like ‘dplyr’, and for the rest of this segment I will try to point you towards the most intuitive methods.

‘dplyr’, as part of the ‘tidyverse’, uses the pipe (%>%) operator to chain data and manipulations together. In the code below, notice it starts with the worms dataset, then pipes it to the next step of filtering based on ‘Area’. This is a one-step process, but you can add unlimited pipes to continually refine the data. This gives you fine-grained control over shaping your data exactly the way you want, because you can take it one step at a time. With base ‘R’, you are typically stuck either writing convoluted code or creating separate variables at each step.

In this case, we will keep only the rows where ‘Area’ is greater than 3 and ‘Slope’ is less than 3. Notice that all the columns are still here, but each remaining observation meets the criteria.

worms %>%
  filter(Area > 3 & Slope < 3)
## # A tibble: 8 x 7
##   Field.Name       Area Slope Vegetation Soil.pH Damp  Worm.density
##   <chr>           <dbl> <dbl> <chr>        <dbl> <lgl>        <dbl>
## 1 Silwood.Bottom    5.1     2 Arable         5.2 FALSE            7
## 2 Gunness.Thicket   3.8     0 Scrub          4.2 FALSE            6
## 3 Oak.Mead          3.1     2 Grassland      3.9 FALSE            2
## 4 North.Gravel      3.3     1 Grassland      4.1 FALSE            1
## 5 South.Gravel      3.7     2 Grassland      4   FALSE            2
## 6 Pond.Field        4.1     0 Meadow         5   TRUE             6
## 7 Water.Meadow      3.9     0 Meadow         4.9 TRUE             8
## 8 Pound.Hill        4.4     2 Arable         4.5 FALSE            5

Keeping the same conditions, you’ll often want to sort the data by the variable you’re most interested in. You can pipe the results of the filter to the ‘arrange’ function, and sort the results by ‘Area’.

worms %>%
  filter(Area > 3 & Slope < 3) %>%
  arrange(Area)
## # A tibble: 8 x 7
##   Field.Name       Area Slope Vegetation Soil.pH Damp  Worm.density
##   <chr>           <dbl> <dbl> <chr>        <dbl> <lgl>        <dbl>
## 1 Oak.Mead          3.1     2 Grassland      3.9 FALSE            2
## 2 North.Gravel      3.3     1 Grassland      4.1 FALSE            1
## 3 South.Gravel      3.7     2 Grassland      4   FALSE            2
## 4 Gunness.Thicket   3.8     0 Scrub          4.2 FALSE            6
## 5 Water.Meadow      3.9     0 Meadow         4.9 TRUE             8
## 6 Pond.Field        4.1     0 Meadow         5   TRUE             6
## 7 Pound.Hill        4.4     2 Arable         4.5 FALSE            5
## 8 Silwood.Bottom    5.1     2 Arable         5.2 FALSE            7

By default, ‘arrange’ sorts in ascending order, but you’ll often want to see the highest values first. You can do this the same way, but wrap the variable in ‘desc()’. In this case, we will sort by ‘Slope’, and then by ‘Area’ as the secondary sort, which only matters when there are duplicate values within ‘Slope’.

worms %>%
  filter(Area > 3 & Slope < 3) %>%
  arrange(desc(Slope), Area)
## # A tibble: 8 x 7
##   Field.Name       Area Slope Vegetation Soil.pH Damp  Worm.density
##   <chr>           <dbl> <dbl> <chr>        <dbl> <lgl>        <dbl>
## 1 Oak.Mead          3.1     2 Grassland      3.9 FALSE            2
## 2 South.Gravel      3.7     2 Grassland      4   FALSE            2
## 3 Pound.Hill        4.4     2 Arable         4.5 FALSE            5
## 4 Silwood.Bottom    5.1     2 Arable         5.2 FALSE            7
## 5 North.Gravel      3.3     1 Grassland      4.1 FALSE            1
## 6 Gunness.Thicket   3.8     0 Scrub          4.2 FALSE            6
## 7 Water.Meadow      3.9     0 Meadow         4.9 TRUE             8
## 8 Pond.Field        4.1     0 Meadow         5   TRUE             6

Summary Information

Getting summary information about the data is essential to data analysis. We’ll start with the ‘summary’ function, a base ‘R’ function. While it provides some useful information, it’s somewhat unwieldy to look at. It shows the quartile ranges for the numerical variables and some basic information about logicals and characters, but I personally find it too chaotic to be of much use. Data visualization does a better job on the numerical variables, and the information ‘summary’ gives for characters and logicals is of limited use.

summary(worms)
##   Field.Name             Area           Slope        Vegetation
##  Length:20          Min.   :0.800   Min.   : 0.00   Length:20
##  Class :character   1st Qu.:2.175   1st Qu.: 0.75   Class :character
##  Mode  :character   Median :3.000   Median : 2.00   Mode  :character
##                     Mean   :2.990   Mean   : 3.50
##                     3rd Qu.:3.725   3rd Qu.: 5.25
##                     Max.   :5.100   Max.   :11.00
##     Soil.pH         Damp          Worm.density
##  Min.   :3.500   Mode :logical   Min.   :0.00
##  1st Qu.:4.100   FALSE:14        1st Qu.:2.00
##  Median :4.600   TRUE :6         Median :4.00
##  Mean   :4.555                   Mean   :4.35
##  3rd Qu.:5.000                   3rd Qu.:6.25
##  Max.   :5.700                   Max.   :9.00

Structure, or ‘str’, is another base ‘R’ function that provides summary information. The output lists the different variables, their class, and some extra information. However, like ‘summary’, there is excess information that is irrelevant to data analysis. This takes up half a page with just seven variables; with 45 variables, this “summary” would be several pages long.

str(worms)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 20 obs. of  7 variables:
##  $ Field.Name  : chr  "Nashs.Field" "Silwood.Bottom" "Nursery.Field" "Rush.Meadow" ...
##  $ Area        : num  3.6 5.1 2.8 2.4 3.8 3.1 3.5 2.1 1.9 1.5 ...
##  $ Slope       : num  11 2 3 5 0 2 3 0 0 4 ...
##  $ Vegetation  : chr  "Grassland" "Arable" "Grassland" "Meadow" ...
##  $ Soil.pH     : num  4.1 5.2 4.3 4.9 4.2 3.9 4.2 4.8 5.7 5 ...
##  $ Damp        : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ Worm.density: num  4 7 2 5 6 2 3 4 9 7 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Field.Name = col_character(),
##   ..   Area = col_double(),
##   ..   Slope = col_double(),
##   ..   Vegetation = col_character(),
##   ..   Soil.pH = col_double(),
##   ..   Damp = col_logical(),
##   ..   Worm.density = col_double()
##   .. )

As part of the ‘dplyr’ package, ‘glimpse’ is similar to ‘str’ but has a much cleaner output. It is specifically designed to fit in the console window and provide just the necessary information. I typically start with the ‘glimpse’ function to check that the class of each variable is correct and that the first few values make sense. A key thing to look for is which variables should be defined as factors; a quick sketch of that conversion follows the output below. We will discuss factors in more detail later.

glimpse(worms)
## Observations: 20
## Variables: 7
## $ Field.Name   <chr> "Nashs.Field", "Silwood.Bottom", "Nursery.Field",...
## $ Area         <dbl> 3.6, 5.1, 2.8, 2.4, 3.8, 3.1, 3.5, 2.1, 1.9, 1.5,...
## $ Slope        <dbl> 11, 2, 3, 5, 0, 2, 3, 0, 0, 4, 10, 1, 2, 6, 0, 0,...
## $ Vegetation   <chr> "Grassland", "Arable", "Grassland", "Meadow", "Sc...
## $ Soil.pH      <dbl> 4.1, 5.2, 4.3, 4.9, 4.2, 3.9, 4.2, 4.8, 5.7, 5.0,...
## $ Damp         <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, F...
## $ Worm.density <dbl> 4, 7, 2, 5, 6, 2, 3, 4, 9, 7, 8, 1, 2, 0, 6, 8, 4...
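
For example, a minimal sketch of that conversion (using ‘dplyr’, and without overwriting the dataset) looks like this:

worms %>%
  mutate(Vegetation = factor(Vegetation)) %>%
  glimpse()   # Vegetation would now show as <fct> instead of <chr>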

Aggregation

Understanding how to aggregate data is the backbone of inferential statistics. In data analysis, a million-line list of worm densities in soil is meaningless to the end user, but aggregating worm density by land type, and learning that ‘Meadow’ has more than twice the density of ‘Grassland’, IS meaningful. Below is the code in base ‘R’; I will then create the same output with the ‘dplyr’ method to show how they both operate. After this chapter, most of my aggregation will take place in ‘dplyr’, but it is useful to see the two side by side.

This code aggregates over each level of the ‘Vegetation’ field and takes the mean of the worm density. The ‘with’ function lets us reference the columns of ‘worms’ directly, and ‘tapply’ applies a function (‘mean’) to ‘Worm.density’ within each group defined by ‘Vegetation’. If ‘with’ and ‘tapply’ don’t make sense right now, that’s fine, but going through some introductory ‘R’ programming courses that cover functions could be useful.

with(worms, tapply(Worm.density, Vegetation, mean))
##    Arable Grassland    Meadow   Orchard     Scrub
##  5.333333  2.444444  6.333333  9.000000  5.250000

Still in base ‘R’, we can extend this to several columns at once with ‘aggregate’. We pull out the 2nd, 3rd, 5th and 7th columns of the worms dataset (remember how we used brackets to subset the data earlier?), create a list of the ‘Vegetation’ types (which makes up the left column, named ‘Group.1’ by default), then find the mean of each column within each group. If that seems somewhat complicated, start by looking at the output, then read the function from the inside out. Doing this in base ‘R’ is not intuitive, in my opinion.

aggregate(worms[,c(2,3,5,7)],list(worms$Vegetation), mean)
##     Group.1     Area    Slope  Soil.pH Worm.density
## 1    Arable 3.866667 1.333333 4.833333     5.333333
## 2 Grassland 2.911111 3.666667 4.100000     2.444444
## 3    Meadow 3.466667 1.666667 4.933333     6.333333
## 4   Orchard 1.900000 0.000000 5.700000     9.000000
## 5     Scrub 2.425000 7.000000 4.800000     5.250000

To make the output more readable, the code remains exactly the same, except we name the list of vegetation types ‘Community’.

aggregate(worms[,c(2,3,5,7)],list(Community = worms$Vegetation), mean)
##   Community     Area    Slope  Soil.pH Worm.density
## 1    Arable 3.866667 1.333333 4.833333     5.333333
## 2 Grassland 2.911111 3.666667 4.100000     2.444444
## 3    Meadow 3.466667 1.666667 4.933333     6.333333
## 4   Orchard 1.900000 0.000000 5.700000     9.000000
## 5     Scrub 2.425000 7.000000 4.800000     5.250000

We can add a little more complexity by adding the ‘Damp’ variable and renaming it ‘Moisture’. Notice that ‘Grassland’ and ‘Scrub’ have observations where ‘Damp’ (renamed ‘Moisture’) is both ‘TRUE’ and ‘FALSE’, so the output expands to account for each combination of the categorical variables in the code. As the categories grow, the table expands and fills in with the respective aggregation in each column.

aggregate(worms[,c(2,3,5,7)],list(Moisture = worms$Damp, Community = worms$Vegetation), mean)
##   Moisture Community     Area    Slope  Soil.pH Worm.density
## 1    FALSE    Arable 3.866667 1.333333 4.833333     5.333333
## 2    FALSE Grassland 3.087500 3.625000 3.987500     1.875000
## 3     TRUE Grassland 1.500000 4.000000 5.000000     7.000000
## 4     TRUE    Meadow 3.466667 1.666667 4.933333     6.333333
## 5    FALSE   Orchard 1.900000 0.000000 5.700000     9.000000
## 6    FALSE     Scrub 3.350000 5.000000 4.700000     7.000000
## 7     TRUE     Scrub 1.500000 9.000000 4.900000     3.500000

The code below groups by two variables, ‘Damp’ and ‘Vegetation’, both of which are categorical. ‘Damp’ becomes the rows, ‘Vegetation’ becomes the column headers, and the values within the table are the mean ‘Slope’. We accomplish this in the same way using ‘with’ and ‘tapply’, but notice the pattern is different than before: because the second grouping variable, ‘Vegetation’, is categorical rather than a measured variable (such as ‘Area’), its levels (‘Arable’, ‘Grassland’, ‘Meadow’, ‘Orchard’, ‘Scrub’) are spread out to make the headers. Note that for arable land that is damp the value is NA, which means there is no data available for that combination. By contrast, orchard land that is not damp has a value of ‘0’, which means there was at least one observation that measured zero slope. This is an important distinction.

with(worms, tapply(Slope, list(Damp, Vegetation), mean))
##         Arable Grassland   Meadow Orchard Scrub
## FALSE 1.333333     3.625       NA       0     5
## TRUE        NA     4.000 1.666667      NA     9

We’ll do the same calls now, but using ‘dplyr’ methods. Notice how it is more intuitive and easier to reshape the data than with the base ‘R’ functions. Each time, you start with the data and pipe it to the functions, and you can build this up piece by piece. Note that when using the ‘summarise’ function, you have to specify what you are grouping on with the ‘group_by’ function, which in this case is ‘Vegetation’. Then use ‘summarise’ to create the individual columns with the relevant functions. Within ‘summarise’, you can specify any function or formula that results in a single output per group. You can also name each column to simplify the output, as shown after the default output below.

worms %>%
  group_by(Vegetation) %>%
  summarise(meanArea = mean(Area),
            mean(Slope),
            mean(Soil.pH),
            mean(Worm.density))
## # A tibble: 5 x 5
##   Vegetation meanArea `mean(Slope)` `mean(Soil.pH)` `mean(Worm.density)`
##   <chr>         <dbl>         <dbl>           <dbl>                <dbl>
## 1 Arable         3.87          1.33            4.83                 5.33
## 2 Grassland      2.91          3.67            4.1                  2.44
## 3 Meadow         3.47          1.67            4.93                 6.33
## 4 Orchard        1.9           0               5.7                  9
## 5 Scrub          2.42          7               4.8                  5.25
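
Naming every summary column removes the backtick-quoted headers above (the names below are my own choices, not Crawley’s), and adding ‘Damp’ to ‘group_by’ reproduces the two-way ‘tapply’ table in long form:

worms %>%
  group_by(Vegetation) %>%
  summarise(mean_area         = mean(Area),
            mean_slope        = mean(Slope),
            mean_soil_ph      = mean(Soil.pH),
            mean_worm_density = mean(Worm.density))

worms %>%
  group_by(Damp, Vegetation) %>%
  summarise(mean_slope = mean(Slope))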

Data Visualization

We’ll switch to the ‘das’ dataset to start looking at data visualization. There is power in doing fast, high-quality data exploration to glean some insight into what the data is telling us. For instance, does one observation seem out of range from the others? Are there patterns that show a clear relationship between certain variables? There are plenty of uses for data visualization, but in general it breaks down into two stages.

In the first stage of data visualization, iterate quickly through lots of graphs and charts to see if anything jumps out. Don’t worry too much about how they look, as long as the fundamentals are in place and nothing is misleading. These visualizations are for you, to generate ideas and identify paths that could be useful in the analysis process. They lead to more in-depth investigation of the data, where inferential statistics and modeling provide actionable insights. The second stage of data visualization is the set of figures you combine with your conclusions for the final user (anyone who can take meaningful action on the insight provided). You’ll want to make these visualizations very high quality, most importantly to ensure your audience fully understands the output, but also because even the slightest error will be noticed by keen observers, and one small mistake can often derail an entire project.

We’ll use ‘glimpse’ to get some information on the data. It has one variable and 100 observations, so our options for plotting are limited. Let’s just use the base ‘R’ ‘plot’ function. As you can see, most of the data is somewhere between 1 and 3, but there is one observation that is about 22.

glimpse(das)
## Observations: 100
## Variables: 1
## $ y <dbl> 2.514542, 2.559668, 2.460061, 2.702720, 2.571997, 2.412833, ...
plot(das$y)

Now that we’ve identified the outlier by plotting it, let’s look at the specifics. We can use the ‘which’ function to query the index of the data and tell us which observation number the outlier is, then subset the data with that index number. I also included the query where the ‘which’ function is integrated into the bracket subset, to do both things in one step. For this outlier, Crawley notes that the value was a data-entry error: the decimal point was left out (decimal misplacement is a common entry error to look for); a quick correction is sketched after the output below. If it is your data, you’ll want to fix this as close to the source as possible, otherwise you’ll have to address the issue every time you use the dataset. However, sometimes you may not be able to modify the data at the root source (for example, not having authorization to modify the database the data is stored in). Make note of the issue, create a reproducible process to eliminate it, and notify the holder of the data about the issue.

which(das$y > 10)
## [1] 50
das$y[50]
## [1] 21.79386
das$y[which(das$y > 10)]
## [1] 21.79386
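
If we assume the decimal point simply slipped one place (my assumption here; confirm against the original records if it is your data), the correction would look like this:

das$y[which(das$y > 10)] <- das$y[which(das$y > 10)] / 10  # 21.79386 becomes 2.179386
plot(das$y)  # re-plot to confirm the outlier is gone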

Let’s look at the ‘yields’ dataset to see if we can spot any issues. We first use the ‘head’ function, which shows that there are two variables, ‘treatment’ and ‘yield’. When we use the ‘table’ function on our categorical variable ‘treatment’, it aggregates the factor levels and displays a count of each. Interestingly, three of the levels have a count of 10, while there is also a count of 9 and a count of 1. Looking at the names, one is misspelled ‘nitogen’, and the other 9 are ‘nitrogen’. Small errors like these are quite common, can skew the data, and must be identified before doing more in-depth analysis. As in the case above, it’s preferable to fix this at the data-source level, but to make a quick change (or if you are certain you will not be using the dataset again), you can use ‘which’ as a subset to find the misspelled level and set it to ‘nitrogen’. Check the table again, and the dataset appears to be in order.

head(yields)
## # A tibble: 6 x 2
##   treatment yield
##   <chr>     <dbl>
## 1 control   0.827
## 2 control   3.61
## 3 control   2.62
## 4 control   1.74
## 5 control   0.659
## 6 control   0.489
table(yields$treatment)
##
## both N and P      control      nitogen     nitrogen   phosphorus
##           10           10            1            9           10
which(yields$treatment == "nitogen")
## [1] 11
yields$treatment[which(yields$treatment == "nitogen")] <- "nitrogen"
table(yields$treatment)
##
## both N and P      control     nitrogen   phosphorus
##           10           10           10           10

Let’s create a scatterplot with the ‘scatter’ dataset. A scatterplot is useful with two continuous variables, and can yield valuable insight into relationships between them. Again, I’ll look at the base ‘R’ plot first and then show the same visualization using ‘ggplot2’. As we can see, there are two variables that make up the x and y columns. We get a basic plot with the variables along their respective axes.

head(scatter)
## # A tibble: 6 x 2
##       x     y
##   <dbl> <dbl>
## 1  0      0
## 2  5.11  61.0
## 3  1.32  11.1
## 4 35.2  141.
## 5  1.63  26.2
## 6  2.30  10.0
plot(scatter$x, scatter$y)

Doing the same thing in ‘ggplot2’, the output is much cleaner and you can quickly customize the plot. At first glance there are some interesting patterns to note. First, the relationship rises steeply up to about 18 on the x-axis and then levels out, indicating a non-linear fit. Also, at about 45 on the x-axis, you can see a significant widening of the data points. This is known as heteroscedasticity, a very common trait that occurs along the margins of data. For instance, putting no fertilizer on a crop will result in a low yield, while increasing the amount at reasonable levels gradually increases the yield, up to a point. Once you apply more than the maximum recommended amount, some observations might still have high yields, while others would die and have very low yields. One important theme in analysis is finding the point of diminishing returns, where increasing the input of a system begins to have marginal, or even detrimental, effects. In data like this, one hypothesis would be that increasing fertilizer use through 18 units has a sizable impact on output, but then the output begins to level out. After 45 input units, the variance becomes unstable, and there is no benefit to using more than this amount. We could expand our experimentation here, solidify our results, and provide actionable insight to farmers.

scatter %>%
  ggplot(aes(x, y)) +
  geom_point()
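
To make the leveling-off pattern easier to see, a sketch like this overlays a loess smoother on the points (my addition, not part of Crawley’s example):

scatter %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "loess")  # smoothed trend line with a confidence band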

Let’s look at the ‘weather’ data. This is a large dataset of 6940 observations across five variables. Three of the variables (upper, lower, rain) are measurements, while ‘month’ and ‘yr’ are also read in as plain numbers. While that is technically correct, we want to treat the individual months as separate categories, since the month number itself is not a measured value. This situation is perfect for box-and-whisker plots, and in this case we will look at the ‘upper’ values broken up by ‘month’. The ‘ggplot2’ version follows the base ‘R’ plot below.

head(weather)
## # A tibble: 6 x 5
##   upper lower  rain month    yr
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  10.8   6.5  12.2     1  1987
## 2  10.5   4.5   1.3     1  1987
## 3   7.5  -1     0.1     1  1987
## 4   6.5  -3.3   1.1     1  1987
## 5  10     5     3.5     1  1987
## 6   8     3     0.1     1  1987
plot(factor(weather$month), weather$upper)
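
The same box-and-whisker plot in ‘ggplot2’ (a sketch following the pattern used earlier) converts ‘month’ to a factor inside the aesthetic mapping:

weather %>%
  ggplot(aes(factor(month), upper)) +
  geom_boxplot()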