Homework 1 for Biostat 778

Due: November 13, 2013

Setup

In order to do this homework you must

Fork the Biostat778_HW1 repository on GitHub
Once you have cloned the repository on GitHub, you can clone it to your local computer to actually do the work.
As you are working, make sure to commit changes at logical points via git add and git commit.
Once you have completed the assignment, you can push your changes back to your GitHub repository via git push.
Once you have pushed your changes, make a pull request on GitHub so that I can see that you're ready to submit your homework.
If you want me to time your functions on the test code, you can send me a pull request and I will run it on my computer. Timings that are run on my computer will go on the leaderboard maintained in the Git repository for the homework.
Your finished code should be submitted in the form of an R package. The R package should be named Homework1.
Your R package should pass R CMD check without any warnings, errors, or notes.
I will put unit tests in the tests directory of the R package for the master branch. Please DO NOT make any changes in the tests directory.
You can use the tests in the tests directory to check your code as the expected results will also be in the tests directory.

Fast Linear Regression

Write a function called fastlm() that fits a linear regression model to outcome data y and predictor data in a matrix X.

Your function should have the form

fastlm <- function(X, y, na.rm = FALSE) {
       ## Your code goes here
}

The inputs should be X, a \(n\times p\) matrix, y, a vector of length \(n\), and na.rm, which indicates whether missing values in X or y should be removed.

The function should return a list with the following named elements

coefficients: a vector of the regression coefficients estimated using maximum likelihood, i.e. \(\hat{\beta}=(X^\prime X)^{-1}X^\prime y\).
vcov: the \(p\times p\) covariance of matrix of the estimated regression coefficients, i.e. \(Var(\hat{\beta})\).

Your function should be fast. In particular, it should always run faster than the lm.fit() function in R using the same inputs.

To test your function, try running the following code.

set.seed(2)
## Generate predictor matrix
n <- 100000
p <- 500
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))

## Coefficents
b <- rnorm(p)

## Response
y <- X %*% b + rnorm(n)

source("fastlm.R")
fit <- fastlm(X, y)
str(fit)

## List of 2
##  $ coefficients: num [1:500] 0.126 1.072 -0.243 0.654 -0.975 ...
##  $ vcov        : num [1:500, 1:500] 0.0000100219 -0.0000000324 -0.0000000149 0.0000000566 -0.0000000406 ...

Fast Multivariate Normal Density

Write a function called dmvnorm() that evaluates the \(k\)-dimensional multivariate Normal density with mean \(\mu\) and covariance \(S\). The density function is \(f(x\mid\mu,S)=\exp\left(-\frac{k}{2}\log(2\pi)-\frac{1}{2}\log |S|-\frac{1}{2}(x-\mu)^\prime S^{-1}(x-\mu)\right)\).

Your function should have the form

dmvnorm <- function(x, mu, S, log = TRUE) {
        ## Your code here
}

where \(x\) is a \(n\times k\) matrix of points to be evaluated, \(\mu\) is a length \(k\) vector of means for the \(k\)-dimensional Normal, and \(S\) is a \(k\times k\) covariance matrix. Your function should be written using R code only and you MAY NOT use any additional packages beyond the standard base packages that come with R.

Your function should return a vector of length \(n\) containing the values of the multivariate Normal density evaluated at the \(n\) points. If log = TRUE then you should return the log density at those points. If log = FALSE, then you should return the density values.

You can assume that \(S\) will be symmetric. However, your function should check to see if \(S\) is positive definite. If it is not, then your function should return the error message "S is not positive definite".

Your code should be fast. In particular, it should work quickly for large inputs (i.e. when the dimension of \(k\) is large). You can simulate Normal data if you want by using the mvrnorm() function from the MASS package.

You can time your function with the system.time() function. Calling

system.time(dmvnorm(x, mu, S))

will give you the time spent executing your function. In particular, you are interested in the user time.

To test your code, try running the following.

## Create the covariance matrix
n <- 100
n2 <- n^2
xg <- seq(0, 1, length = n)
yg <- xg
g <- data.matrix(expand.grid(xg, yg))
D <- as.matrix(dist(g))
phi <- 5

S <- exp(-phi * D)
mu <- rep(0, n2)
set.seed(1)
x <- matrix(rnorm(n2), byrow = TRUE, ncol = n2)

source("dmvnorm.R")
dmvnorm(x, mu, S, log = TRUE)

## [1] -124635