17 HOMEWORK 2 [DUE 10/9]

DUE: 10/9

The goal of this assignment is to get used to designing and writing functions. Writing functions involves thinking about how code should be divided up and what the interface/arguments should be. In addition, you need to think about what the function will return as output.

17.1 Homework Submission

Please write up your homework as an R markdown document processed with knitr. Compile your document as an HTML file submit your HTML file to the dropbox on CoursePlus. Please show all your code (i.e. make sure to set echo = TRUE) for each of the answers to the three parts.

17.2 Part 1

The exponential of a number can be written as an infinite series expansion of the form \[ \exp(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \] Of course, we cannot compute an infinite series by the end of this term and so we must truncate it at a certain point in the series. The truncated sum of terms represents an approximation to the true exponential, but the approximation may be usable.

Write a function that computes the exponential of a number using the truncated series expansion. The function should take two arguments:

x: the number to be exponentiated
k: the number of terms to be used in the series expansion beyond the constant 1. The value of k is always \(\geq 1\).

Exp <- function(x, k) {
        ## Your code here
}

For example, if \(k = 1\), then the Exp function should return the number \(1 + x\). If \(k = 2\), then you should return the number \(1 + x + x^2/2!\).

You can assume that the input value x will always be a single number.
You can assume that the value k will always be an integer \(\geq 1\).
Do not use the exp() function in R.
The factorial() function can be used to compute factorials.

Your function should produce output like the following:

Exp(2, 1)
## [1] 3
Exp(2, 6)
## [1] 7.355556
Exp(-2.5, 4)
## [1] 0.6484375
Exp(-2.5, 8)
## [1] 0.09046234
Exp(-2.5, 10)
## [1] 0.08257812

17.3 Part 2

The data for this part come from a study of indoor air pollution and respiratory disease conducted here at Johns Hopkins. A high-resolution air pollution monitor was placed in each home to collect continuous levels of particulate matter over the period of a few days (each measurement represents a 5-minute average). In addition, measurements were taken in different rooms of the house as well as on multiple visits.

MIE data

Initially, we’d like to explore the data for each subject (indicated by the id column) and so we can make some exploratory time series plots of the data.

The following code plots the data from one subject (id == 2) on the baseline visit (visit == 0) and in the bedroom (room == "bedroom").

library(tidyverse)
mie <- read_csv("data/MIE.csv.gz", col_types = "iicdi")
mie %>%
        filter(id == 20 & visit == 0 & room == "bedroom") %>%
        ggplot(aes(timepoint, value)) + 
        geom_line() + 
        theme_minimal()

While this code is useful, it only provides us information on one subject, on one visit, in one room. We could cut and paste this code to look at other subjects/visits/rooms, but that can be error prone and just plain messy.

The aim here is to design and implement a function that can be re-used to visualize all of the data in this dataset.

There are 3 aspects that may vary in the dataset: The id, the visit, and the room. Note that not all combinations of id, visit, and room have measurements.
Your function should take as input one of the 3 possible factors. You need to decide which one of those it will be. You do NOT need to write separate functions for all three variables. For example, if you choose to use id as an input, you do not need separate functions for visit or room.
Given the input from the user, your function should present all of the data for that input. For example, if the input is id = 2, then you should plot the data for all rooms and all visits. If you the input is visit = 0, then you should plot data for all ids and all rooms, etc. It should be possible visualize the entire dataset with your function (through repeated calls to your function).
Your function should produce a single plot that contains all the data. Furthermore, the data should be readable on that plot so that it is in fact useful. You will likely want to use multiple panels/facets, colors, symbols, line types, etc. to enhance the readability of your plot.
If the user enters an input that does not exist in the dataset, your function should catch that and report an error (via the stop() function).

For your homework submission

Write a short description of how you chose to design your function and why.
Present the code for your function in the R markdown document.
Include at least one example of output from your function.

17.4 Part 3

For this part you will write a few functions that simulate the process of conducting a study of particulate matter in children’s homes. The vector pm contains simulated particulate matter measurements in the homes of 5000 children in an school district. (NOTE: The code for simulating the pm vector is contained in Part B.)

Suppose as an investigator, you want to survey the homes of the students in an elementary school to measure the indoor particulate matter. You only have funding to survey 20 homes, but you want to estimate the average pm concentration in the homes of children from this school with a confidence interval.

Part A

First, write a function called calculate_CI95 that:

Takes as input a vector of data, and samples 20 values from that vector. Try looking at the documentation for the function sample(). Note that you will sample without replacement here.
Calculates an average particulate matter concentration for the 20 students sampled.
Calculates a 95% confidence interval for the average particulate matter in the population. If you have never calculated a confidence interval, the confidence interval consists of a lower bound mean(x) - 1.96 * sd(x)/sqrt(20) and upper bound mean(x) + 1.96 * sd(x)/sqrt(20).
Returns a named vector of length 3, where the first value is the lower bound, the second value is the mean, and the third value is the upper bound.

Part B

Second, write a function called plot_simulationCI that can simulate the process of conducting this study many times. Your function should have the arguments n, the number of studies to be replicated, and data, the vector of data to simulate draws from. In particular your function will:

Use the function replicate() along with calculate_CI95() to make means and confidence intervals for n random draws from the population. You may also implement this as a for loop if you wish.
Plots all n confidence intervals on the same plot. Think about using ggplot() with both geom_point() for the means and geom_errorbar() for the lower and upper bounds of the confidence interval.

Finally, run both functions on the vector ‘pm’ generated with the code below to print a sample confidence interval (part A) and sample plot with 10 confidence intervals (part B).

The following code should be used to simulate the data for this problem.

set.seed(1000)
pm <- c(exp(rnorm(2000, mean = 3, sd = 1)), 
        exp(rnorm(2000, mean = 1, sd = 1.5)),
        exp(rnorm(1000, mean = .5, sd = 1))) 
ggplot() +
        geom_histogram(aes(x = pm), bins = 10)

Here is a sample of what the output should look like.

calculate_CI95(pm)
##    lower     mean    upper 
## 3.107701 6.479236 9.850771

And here is a sample of what the plot should look like.

plot_simulationCI(pm, 10)