10 HOMEWORK 1 [DUE 9/18]

10.1 Tidy Data

DUE DATE: 9/18

The goal of the first assignment is to take datasets that are either messy or simply not tidy and make them into tidy datasets. Along the way, you will gain some familiarity with the functions in the dplyr, tidyr, and readr packages.

Before attempting this assignment, you should first install the tidyverse package if you haven’t already. The tidyverse package is a collection of many packages, so installing it is a convenient way to get all of them at once rather than one by one. This can be done with the install.packages() function.

install.packages("tidyverse")

Running this function will install a host of other packages, so it may take a minute or two depending on how fast your computer is.

10.2 Homework Submission

Please write up your homework using R Markdown and knitr. Compile your document as an HTML file and submit your HTML file to the dropbox on Courseplus. Please show all your code for each of the answers to the three parts.
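For reference, a minimal R Markdown header that compiles to an HTML document looks something like this (the title and author below are placeholders; fill in your own):

---
title: "Homework 1"
author: "Your Name"
output: html_document
---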

To get started, watch this video on setting up your R Markdown document.

10.3 Part 1

For this part, you may want to review the section on spreading and gathering data.

Load the WorldPhones dataset in the datasets package with

library(datasets)
data(WorldPhones)
WorldPhones
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076

This dataset gives the number of telephones (in thousands) in various regions of the world. The regions are North America, Europe, Asia, South America, Oceania, Africa, and Central America, and data are available for the years 1951, 1956, 1957, 1958, 1959, 1960, and 1961.

Use the functions in dplyr and tidyr to produce a data frame that looks like this.

   year  region number
1  1951   N.Amer  45939
2  1956   N.Amer  60423
3  1957   N.Amer  64721
4  1958   N.Amer  68484
5  1959   N.Amer  71799
6  1960   N.Amer  76036
7  1961   N.Amer  79831
8  1951   Europe  21574
9  1956   Europe  29990
10 1957   Europe  32510
11 1958   Europe  35218
12 1959   Europe  37598
13 1960   Europe  40341
14 1961   Europe  43173
15 1951     Asia   2876
16 1956     Asia   4708
17 1957     Asia   5230
18 1958     Asia   6662
19 1959     Asia   6856
20 1960     Asia   8220
21 1961     Asia   9053
22 1951   S.Amer   1815
23 1956   S.Amer   2568
24 1957   S.Amer   2695
25 1958   S.Amer   2845
26 1959   S.Amer   3000
27 1960   S.Amer   3145
28 1961   S.Amer   3338
29 1951  Oceania   1646
30 1956  Oceania   2366
31 1957  Oceania   2526
32 1958  Oceania   2691
33 1959  Oceania   2868
34 1960  Oceania   3054
35 1961  Oceania   3224
36 1951   Africa     89
37 1956   Africa   1411
38 1957   Africa   1546
39 1958   Africa   1663
40 1959   Africa   1769
41 1960   Africa   1905
42 1961   Africa   2005
43 1951 Mid.Amer    555
44 1956 Mid.Amer    733
45 1957 Mid.Amer    773
46 1958 Mid.Amer    836
47 1959 Mid.Amer    911
48 1960 Mid.Amer   1008
49 1961 Mid.Amer   1076

10.3.1 Notes

  • You may need to use functions outside these packages to obtain this result.

  • Note that the functions in the dplyr and tidyr packages expect table-like objects (data frames or tibbles) as their input. You can convert data to these objects using the as_tibble() function in the tibble package (see the short sketch after these notes).

  • Don’t worry about the ordering of the rows or columns. Depending on whether you use gather() or pivot_longer(), the order of your output may differ from what is printed above. As long as the result is a tidy data set, that is sufficient.
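As a minimal sketch of the general pattern, here is what converting a matrix to a tibble and then pivoting it to a long format might look like on a small made-up matrix. The object, values, and column names below are invented purely for illustration; your solution will work with the WorldPhones data instead.

library(tidyverse)

## A toy matrix with row names, loosely analogous in shape to WorldPhones
## (all values and names here are made up for illustration)
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("2001", "2002"), c("A", "B", "C")))

## Convert the matrix to a tibble, keeping the row names as a 'year' column,
## then pivot the remaining columns into a long (tidy) format
as_tibble(m, rownames = "year") %>%
        pivot_longer(-year, names_to = "group", values_to = "value")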

10.4 Part 2

Use the readr package to read the SPEC_2014.csv.gz data file into R. This file contains daily levels of fine particulate matter (PM2.5) chemical constituents across the United States. The data are measured at a network of federal, state, and local monitors and assembled by the EPA.
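For example, assuming the file is saved in a data/ directory (adjust the path to wherever you stored it), it can be read with read_csv(), which handles gzip-compressed files directly:

library(readr)

## read_csv() decompresses .gz files automatically; the path is an example
spec <- read_csv("data/SPEC_2014.csv.gz")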

In this dataset, the Sample.Value column provides the level of the indicated chemical constituent and the Parameter.Name column provides the name of the chemical constituent. The combination of a State.Code, a County.Code, and a Site.Num uniquely identifies a monitoring site (the location of which is provided by the Latitude and Longitude columns).

For all of the questions below, you can ignore the missing values in the dataset; when taking averages, simply remove the missing values before computing the average.
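For example, with a numeric vector that contains missing values, the na.rm argument drops them before averaging (the vector here is just a toy illustration):

x <- c(1, 2, NA, 4)
mean(x, na.rm = TRUE)
[1] 2.333333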

Use the functions in the dplyr package to answer the following questions:

  1. What is the average value of “Bromine PM2.5 LC” in the state of Wisconsin in this dataset?

  2. Calculate the average of each chemical constituent across all states/monitors and all time points. Which constituent has the highest average level?

  3. Which monitoring site has the highest individual-day levels of “Sulfate PM2.5 LC”? Indicate the state code, county code, and site number.

  4. What is the difference in the average levels of “EC PM2.5 LC TOR” between California and Arizona?

  5. What are the median levels of “OC PM2.5 LC TOR” and “EC PM2.5 LC TOR” in the western and eastern U.S.? Define western as any monitoring location that has a Longitude less than -100.

10.4.1 Notes

You may find the case_when() function useful in this part; when used inside a mutate() call, it maps values of one variable to different values in a new variable.

library(tidyverse)

## Generate some random numbers
dat <- tibble(x = rnorm(100))
slice(dat, 1:3)
# A tibble: 3 x 1
       x
   <dbl>
1 -0.338
2 -0.909
3 -0.900

## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
        mutate(is_positive = case_when(
                x >= 0 ~ "Yes",
                x < 0 ~ "No"
        ))
# A tibble: 100 x 2
        x is_positive
    <dbl> <chr>      
 1 -0.338 No         
 2 -0.909 No         
 3 -0.900 No         
 4  1.98  Yes        
 5  0.877 Yes        
 6  0.134 Yes        
 7  1.52  Yes        
 8 -1.60  No         
 9  0.734 Yes        
10 -0.155 No         
# … with 90 more rows

10.5 Part 3

Use the readxl package to read the file aqs_sites.xlsx into R (you will need to install the readxl package first). You can use the following code to read in the data.

library(readxl)
sites <- read_excel("data/aqs_sites.xlsx", .name_repair = make.names)

You may get some warnings when reading in the data but you can ignore these for now.

This file contains metadata about each of the monitoring sites in the EPA’s monitoring system. In particular, the Land.Use and Location.Setting variables contain information about what kinds of areas the monitors are located in (e.g. “residential” vs. “forest”).

Use the functions in the dplyr package to answer the following questions.

  1. How many monitoring sites are labelled as both “RESIDENTIAL” for Land.Use and “SUBURBAN” for Location.Setting?

  2. What are the median levels of “OC PM2.5 LC TOR” and “EC PM2.5 LC TOR” amongst monitoring sites that are labelled as both “RESIDENTIAL” and “SUBURBAN” in the eastern U.S., where eastern is defined as Longitude greater than or equal to -100?

10.5.1 Notes

  • If you need to convert a column from one type to another you can use as.character() to convert to character or as.numeric() to convert to numeric (there are other as.* functions too). For example,

    x <- c("1", "2", "3")
    x
    [1] "1" "2" "3"
    y <- as.numeric(x)
    y
    [1] 1 2 3

  • You will need to combine the data in the aqs_sites.xlsx file with the data from Part 2 in order to answer these questions.

  • The *_join() functions within the dplyr package may be of use here. Take a look at the help file for left_join(), for example, to get an explanation of what these functions do.

  • Each monitoring site is uniquely identified by a state code, a county code, and a site number. These variables are in each of the datasets, but note that they may not have the exact same names in each of the datasets. Therefore, you may need to do some manipulation before merging the two datasets together (a rough sketch of this pattern follows).
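As a rough sketch of that merging pattern, and assuming the SPEC data from Part 2 is stored in an object called spec, the code might look like the following. The column names passed to rename() are hypothetical placeholders, so substitute the actual names from your datasets, and make sure the key columns have matching types (e.g. with as.numeric() or as.character()) before joining.

library(tidyverse)

## Hypothetical sketch: align the key column names in the sites metadata
## (placeholder names on the right) with those used in the SPEC data
sites_keys <- sites %>%
        rename(State.Code = Sites.State.Code.Column,
               County.Code = Sites.County.Code.Column,
               Site.Num = Sites.Site.Number.Column)

## Merge the site metadata onto the SPEC measurements by the three keys
joined <- left_join(spec, sites_keys,
                    by = c("State.Code", "County.Code", "Site.Num"))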