10 HOMEWORK 1 [DUE 9/18]
10.1 Tidy Data
DUE DATE: 9/18
The goal of the first assignment is to take datasets that are either messy or simply not tidy and to make them tidy datasets. The objective is to gain some familiarity with the functions in the dplyr, tidyr, and readr packages.
Before attempting this assignment, you should first install the tidyverse package if you haven't already. The tidyverse package is actually a collection of many packages and serves as a convenient way to install them all without having to do so one by one. This can be done with the install.packages() function. Running this function will install a host of other packages, so it may take a minute or two depending on how fast your computer is.
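For example, you can run:

## Install the entire tidyverse collection (only needs to be done once)
install.packages("tidyverse")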
10.2 Homework Submission
Please write up your homework using R Markdown and knitr. Compile your document as an HTML file and submit your HTML file to the dropbox on CoursePlus. Please show all your code for each of the answers to the three parts.
To get started, watch this video on setting up your R Markdown document.
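As a starting point, the YAML header at the top of your R Markdown document might look something like the sketch below (the title and author fields are placeholders; output: html_document is what produces the HTML file for submission):

---
title: "Homework 1"
author: "Your Name"
output: html_document
---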
10.3 Part 1
For this Part, you may want to review the section on spreading and gathering data.
Load the WorldPhones dataset in the datasets package with
library(datasets)
data(WorldPhones)
WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
This dataset gives the number of telephones in various regions of the world (in thousands). The regions are North America, Europe, Asia, South America, Oceania, Africa, and Central America, and data are available for the years 1951, 1956, 1957, 1958, 1959, 1960, and 1961.
Use the functions in dplyr and tidyr to produce a data frame that looks like this.
year region number
1 1951 N.Amer 45939
2 1956 N.Amer 60423
3 1957 N.Amer 64721
4 1958 N.Amer 68484
5 1959 N.Amer 71799
6 1960 N.Amer 76036
7 1961 N.Amer 79831
8 1951 Europe 21574
9 1956 Europe 29990
10 1957 Europe 32510
11 1958 Europe 35218
12 1959 Europe 37598
13 1960 Europe 40341
14 1961 Europe 43173
15 1951 Asia 2876
16 1956 Asia 4708
17 1957 Asia 5230
18 1958 Asia 6662
19 1959 Asia 6856
20 1960 Asia 8220
21 1961 Asia 9053
22 1951 S.Amer 1815
23 1956 S.Amer 2568
24 1957 S.Amer 2695
25 1958 S.Amer 2845
26 1959 S.Amer 3000
27 1960 S.Amer 3145
28 1961 S.Amer 3338
29 1951 Oceania 1646
30 1956 Oceania 2366
31 1957 Oceania 2526
32 1958 Oceania 2691
33 1959 Oceania 2868
34 1960 Oceania 3054
35 1961 Oceania 3224
36 1951 Africa 89
37 1956 Africa 1411
38 1957 Africa 1546
39 1958 Africa 1663
40 1959 Africa 1769
41 1960 Africa 1905
42 1961 Africa 2005
43 1951 Mid.Amer 555
44 1956 Mid.Amer 733
45 1957 Mid.Amer 773
46 1958 Mid.Amer 836
47 1959 Mid.Amer 911
48 1960 Mid.Amer 1008
49 1961 Mid.Amer 1076
10.3.1 Notes
- You may need to use functions outside these packages to obtain this result.
- Note that the functions in the dplyr and tidyr packages expect table-like objects (data frames or tibbles) as their input. You can convert data to these objects using the as_tibble() function in the tibble package (see the sketch after these notes).
- Don't worry about the ordering of the rows or columns. Depending on whether you use gather() or pivot_longer(), the order of your output may differ from what is printed above. As long as the result is a tidy data set, that is sufficient.
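As an illustration of the general approach, here is a sketch on a small toy matrix (not the actual solution; the names m, year, group, and value are just examples):

library(tibble)
library(tidyr)
library(dplyr)

## A small matrix with years as row names, analogous in shape to WorldPhones
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("2001", "2002"), c("A", "B", "C")))

## Convert the matrix to a tibble, keeping the row names as a column,
## then pivot the remaining columns into long form
m %>%
    as_tibble(rownames = "year") %>%
    pivot_longer(-year, names_to = "group", values_to = "value")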
10.4 Part 2
Use the readr package to read the SPEC_2014.csv.gz data file into R. This file contains daily levels of fine particulate matter (PM2.5) chemical constituents across the United States. The data are measured at a network of federal, state, and local monitors and assembled by the EPA.

In this dataset, the Sample.Value column provides the level of the indicated chemical constituent and the Parameter.Name column provides the name of the chemical constituent. The combination of a State.Code, a County.Code, and a Site.Num uniquely identifies a monitoring site (the location of which is provided by the Latitude and Longitude columns).
For all of the questions below, you can ignore the missing values in the dataset, so when taking averages, just remove the missing values before taking the average.
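For instance, the na.rm argument to mean() drops missing values before computing the average:

mean(c(1, 2, NA))                ## NA, because of the missing value
mean(c(1, 2, NA), na.rm = TRUE)  ## 1.5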
Use the functions in the dplyr package to answer the following questions:
1. What is the average value of "Bromine PM2.5 LC" in the state of Wisconsin in this dataset?
2. Calculate the average of each chemical constituent across all states/monitors and all time points. Which constituent has the highest average level?
3. Which monitoring site has the highest individual-day levels of "Sulfate PM2.5 LC"? Indicate the state code, county code, and site number.
4. What is the difference in the average levels of "EC PM2.5 LC TOR" between California and Arizona?
5. What are the median levels of "OC PM2.5 LC TOR" and "EC PM2.5 LC TOR" in the western and eastern U.S.? Define western as any monitoring location that has a Longitude less than -100.
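A general pattern for these questions is to filter down to the rows of interest and then summarize, removing missing values along the way. The sketch below is illustrative only (the object name spec is a placeholder, and you would add further filters or a group_by() depending on the question):

library(readr)
library(dplyr)

## read_csv() can read gzip-compressed files directly
spec <- read_csv("SPEC_2014.csv.gz")

## Example pattern: restrict to one constituent, then average with NAs removed
spec %>%
    filter(Parameter.Name == "Bromine PM2.5 LC") %>%
    summarize(avg = mean(Sample.Value, na.rm = TRUE))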
10.4.1 Notes
You may find the case_when() function useful in this part; it can be used to map values of one variable to different values in a new variable (when used inside a mutate() call).
library(tidyverse)
## Generate some random numbers
dat <- tibble(x = rnorm(100))
slice(dat, 1:3)
# A tibble: 3 x 1
x
<dbl>
1 -0.338
2 -0.909
3 -0.900
## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
mutate(is_positive = case_when(
x >= 0 ~ "Yes",
x < 0 ~ "No"
))
# A tibble: 100 x 2
x is_positive
<dbl> <chr>
1 -0.338 No
2 -0.909 No
3 -0.900 No
4 1.98 Yes
5 0.877 Yes
6 0.134 Yes
7 1.52 Yes
8 -1.60 No
9 0.734 Yes
10 -0.155 No
# … with 90 more rows
10.5 Part 3
Use the readxl package to read the file aqs_sites.xlsx into R (you will need to install the readxl package first). You can use code along the lines of the sketch below to read in the data. You may get some warnings when reading in the data, but you can ignore these for now.
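A minimal sketch, assuming aqs_sites.xlsx is in your working directory (the object name sites is just a placeholder):

library(readxl)

## Read the site metadata; cell-type warnings can be ignored for now
sites <- read_excel("aqs_sites.xlsx")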
This file contains metadata about each of the monitoring sites in the EPA's monitoring system. In particular, the Land.Use and Location.Setting variables contain information about what kinds of areas the monitors are located in (e.g. "residential" vs. "forest").
Use the functions in the dplyr package to answer the following questions.

1. How many monitoring sites are labelled as both "RESIDENTIAL" for Land.Use and "SUBURBAN" for Location.Setting?
2. What are the median levels of "OC PM2.5 LC TOR" and "EC PM2.5 LC TOR" amongst monitoring sites that are labelled as both "RESIDENTIAL" and "SUBURBAN" in the eastern U.S., where eastern is defined as Longitude greater than or equal to -100?
10.5.1 Notes
- If you need to convert a column from one type to another, you can use as.character() to convert to character or as.numeric() to convert to numeric (there are other as.* functions too); see the example after these notes.
- You will need to combine the data in the aqs_sites.xlsx file with the data from Part 2 in order to answer these questions.
- The *_join() functions within the dplyr package may be of use here. Take a look at the help file for left_join(), for example, to get an explanation of what these functions do; a sketch of a merge follows these notes.
- Each monitoring site is uniquely identified by a state code, a county code, and a site number. These variables are in each of the datasets, but note that they may not have the exact same names in each of the datasets. Therefore, you may need to do some manipulation before merging the two datasets together.
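For example, as.numeric() converts character data to numeric:

x <- c("1", "2", "3")
as.numeric(x)
[1] 1 2 3

And here is a minimal sketch of the merge, assuming sites comes from the read_excel() call above and spec is the data frame from Part 2. The identifier names on the right-hand side of rename() are assumptions, not the actual column names; check names(sites) before running:

library(dplyr)

## Rename the site identifiers in the metadata to match the Part 2 data
## (the original names used here are assumptions)
sites2 <- sites %>%
    rename(State.Code = `State Code`,
           County.Code = `County Code`,
           Site.Num = `Site Number`) %>%
    mutate(State.Code = as.numeric(State.Code))  ## convert types if they differ

## Merge the site metadata onto the monitor-level data
merged <- left_join(spec, sites2,
                    by = c("State.Code", "County.Code", "Site.Num"))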