In order to do this homework you must
Fork the Biostat778_HW1 repository on GitHub
Once you have cloned the repository on GitHub, you can clone it to your local computer to actually do the work.
As you are working, make sure to commit changes at logical points
via git add and git commit.
Once you have completed the assignment, you can push your changes
back to your GitHub repository via git push.
Once you have pushed your changes, make a pull request on GitHub so that I can see that you're ready to submit your homework.
If you want me to time your functions on the test code, you can send me a pull request and I will run it on my computer. Timings that are run on my computer will go on the leaderboard maintained in the Git repository for the homework.
Your finished code should be submitted in the form of an R package. The R package should be named Homework1.
Your R package should pass R CMD check without any warnings,
errors, or notes.
I will put unit tests in the tests directory of the R package
for the master branch. Please DO NOT make any changes in the tests
directory.
You can use the tests in the tests directory to check your code as the expected results will also be in the tests directory.
Write a function called fastlm() that fits a linear regression model
to outcome data y and predictor data in a matrix X.
Your function should have the form
fastlm <- function(X, y, na.rm = FALSE) {
## Your code goes here
}
The inputs should be X, a \(n\times p\) matrix, y, a vector of
length \(n\), and na.rm, which indicates whether missing values in X
or y should be removed.
The function should return a list with the following named elements
coefficients: a vector of the regression coefficients estimated
using maximum likelihood, i.e. \(\hat{\beta}=(X^\prime X)^{-1}X^\prime y\).
vcov: the \(p\times p\) covariance of matrix of the estimated
regression coefficients, i.e. \(Var(\hat{\beta})\).
Your function should be fast. In particular, it should always run
faster than the lm.fit() function in R using the same inputs.
To test your function, try running the following code.
set.seed(2)
## Generate predictor matrix
n <- 100000
p <- 500
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
## Coefficents
b <- rnorm(p)
## Response
y <- X %*% b + rnorm(n)
source("fastlm.R")
fit <- fastlm(X, y)
str(fit)
## List of 2
## $ coefficients: num [1:500] 0.126 1.072 -0.243 0.654 -0.975 ...
## $ vcov : num [1:500, 1:500] 0.0000100219 -0.0000000324 -0.0000000149 0.0000000566 -0.0000000406 ...
Write a function called dmvnorm() that evaluates the \(k\)-dimensional
multivariate Normal density with mean \(\mu\) and covariance \(S\). The
density function is
\(f(x\mid\mu,S)=\exp\left(-\frac{k}{2}\log(2\pi)-\frac{1}{2}\log
|S|-\frac{1}{2}(x-\mu)^\prime S^{-1}(x-\mu)\right)\).
Your function should have the form
dmvnorm <- function(x, mu, S, log = TRUE) {
## Your code here
}
where \(x\) is a \(n\times k\) matrix of points to be evaluated, \(\mu\) is a length \(k\) vector of means for the \(k\)-dimensional Normal, and \(S\) is a \(k\times k\) covariance matrix. Your function should be written using R code only and you MAY NOT use any additional packages beyond the standard base packages that come with R.
Your function should return a vector of length \(n\) containing the
values of the multivariate Normal density evaluated at the \(n\)
points. If log = TRUE then you should return the log density at
those points. If log = FALSE, then you should return the density
values.
You can assume that \(S\) will be symmetric. However, your function
should check to see if \(S\) is positive definite. If it is not, then
your function should return the error message "S is not positive
definite".
Your code should be fast. In particular, it should work quickly
for large inputs (i.e. when the dimension of \(k\) is large). You can
simulate Normal data if you want by using the mvrnorm() function
from the MASS package.
You can time your function with the system.time() function. Calling
system.time(dmvnorm(x, mu, S))
will give you the time spent executing your function. In particular,
you are interested in the user time.
To test your code, try running the following.
## Create the covariance matrix
n <- 100
n2 <- n^2
xg <- seq(0, 1, length = n)
yg <- xg
g <- data.matrix(expand.grid(xg, yg))
D <- as.matrix(dist(g))
phi <- 5
S <- exp(-phi * D)
mu <- rep(0, n2)
set.seed(1)
x <- matrix(rnorm(n2), byrow = TRUE, ncol = n2)
source("dmvnorm.R")
dmvnorm(x, mu, S, log = TRUE)
## [1] -124635