Regression Analysis
created : 20210606T05:56:14+00:00
modified : 20210620T14:05:24+00:00
Chapter 1. Introduction
1.1 What is regression analysis?
 Regression analysis a method ofr investigating functional relationships among variables.
 Relationship is expressed in the form of an equiation or a model connecting the response or dependent variable and one or more explanatory or predictor variables.
 $ Y = f(X_1, \cdots, X_p) + \epsilon $:
 where $\epsilon$ is a random error and the funciton $f$ describes the relationship between $Y$ and $X_1, \cdots, X_p$
 $Y$ is called a response (output, or dependent) variable.
 $X_1, \cdots, X_p$ are called predictors (explanatory, independent, input) variables, regressors, covariates, factors or carriers.
 When $f(X_1, \cdots, X_p) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, it is called a linear regression model:
 $ Y = \beta_0 + \beta_1 X_1 + \cdots \beta_p X_p + \epsilon$
 where $\beta_0, \beta_1, \cdots, \beta_p$ are called the regression coefficients (parameters) to be unknown and estimated.
1.2 Hisotry
 Least squares estimation (LSE) to fit a traight line to 2dimensional data:
 Legendre(1805), Gauss(1809)
 > Laplace(1810) : CLT and connecting LS method with normality
 > Gauss (1822): first part of GaussMarkov theorem
 The term of “regression” was coined by F. Galton:
 Francis Galton (1886), “Regression towards mediocrity in hereditary stature”. The journal of the Anthropological Institute of Great Britain and Ireland 15
 Galton conclude in the paper “it is that the heightdeviate of the offspring is, on the average, twothirds the heightdeviate of tis midperentage.”
 Yule(1892) and K.Pearson (1903) theoretically formulated the regression analysis for a bivariate normal ditribution.
 RA Fisher (1922, 1925) released the assumption of the bivariate normality to the conditional normality of response given predictor.
 George E.P.Box (1979) in his Response Surface Methodology book:
 “Essentially, all models are wrong, but some are useful.”
1.4 Procedures of regression analysis
 Statement of the problem
 Selection of potentially relevant variables
 Experimental design and data colelciton
 Model specification
 Choice of fittin model
 Model fitting
 Model validation and criticism
 Using the model for the intended purpose
Various Models
 Univariate : Only one quantative response variable
 Multivariate : Two or more quantitative response variables
 Simple : Only one predictor variable
 Multiple : two or mor predictor variables
 Linear : All parameters enter the equation linearly, possibly after transformation of the data
 Nonlinear : The relationship between the resposne and some of the predictors is nonlinear or some of the parameters appear nonlinearly, but no transformation is possible to make the parameters appear linearly
 Analysis of variance : All predictors are qualitative variable
 Analysis of covariance : Some predictors are quantitative variables and others are qualitative variables
 Logistic : The response variable is qualitative
Chapter 2. Corrleation analysis and simple linear regression
2.1 Covariance and correlation
 Let $X$ and $Y$ be RVs defined on a population of interest. The vocariance of $X$ and $Y$ is defined by:
 $Cov(X,Y)=E{(XEX)(YEY)} = E(XY)  (EX)(EY)$
 The sample covariance is the unbaised estimator of $Cov(X,Y)$ and defined as:
 $\hat {Cov}(X,Y) = \frac{1}{n1}\sum_{i=1}^n (y_i  \bar y) (x_i  \bar x)$
 The (population) correlation coefficient is defined by:
 $\rho = \rho_{xy} = Cor(X,Y) = Cov(\frac{XEX}{\sigma_X}, \frac{Y  EY}{\sigma_Y})$
 The sample correlation coefficient is defined by:
 $\gamma = \gamma_{xy} = \hat {Cor}(X, Y) = \frac{\sum_{i=1}^n (x_i  \bar x)(y_i  \bar y)}{\sqrt{(\sum_{i=1}^n(x_i  \bar x)^2 (\sum_{i=1}^n (y_i  \bar y)^2)}} \\ = \frac{1}{n1}\sum_{i=1}^n \frac{x_i  \bar x}{s_x} \frac{y_i  \bar y)}{s_y}$
Anscombe’s quartet
 What is the lesson from Anscombe’s quartet? The quartet illustration the importance of looking at a set of data graphically before staring to analyze according to a particular type of relationship, and the inadequancy of basic statistic properties for describing relistic datasets.
 A single extern observation (outlier) may influence the correlation analysis or linear model too much and distort the result.
2.2 Simple linear regression
 $Y = \beta_0 + \beta_1 X + \epsilon$
 We call $\beta_0$ and $\beta_1$ the regression coefficients (parameters).
2.3 Least Squares Estimation (LSE)
 Sum of squares of the vertical distance:
 $S(\beta_0, \beta_1) = \sum_{i=1}^n \epsilon_i^2 \\ = \sum_{i=1}^n (y_i  \beta_0  \beta_1 x_i)^2$
 By differentiating in $\beta_0$ and $\beta_1$, the LSEs are obtained as follows:
 $\hat \beta_1 = \frac{\sum_{i=1}^n (x_i  \bar x)(y_i  \bar y)}{\sum_{i=1}^n(x_i  \bar x)^2} = \frac{S_{xy}}{S_{xx}}$
 $\hat \beta_0 = \bar y  \hat \beta_1 \bar x$
 Relation between the correlation coefficient and the regression coefficient:
 $\beta_1 = \gamma_{xy} \frac{s_y}{s_x}$
2.4 Properties of the LSE
 Assume $E \epsilon_i = 0, Var(\epsilon_i) = \sigma^2$, and they are independent.:
 $E \hat \beta_0 = \beta_0$
 $E \hat \beta_1 = \beta_1$
 $Var(\hat \beta_0) = \sigma^2 [\frac{1}{n} + \frac{\bar x ^2}{\sum_{i=1}^n (x_i  \bar x)^2}] = \sigma^2 [\frac{1}{n} + \frac{\bar x^2}{S_{xx}}]$
 $Var(\hat \beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i  \bar x)^2 = \frac{\sigma^2}{S_{xx}}}$
 $Cov(\hat \beta_0, \hat \beta_1) =  \frac{\bar x}{S_xx} \sigma^2$
 Assume $\epsilon_i \sim^{i.i.d} N(0, \sigma^2)$:
 $\hat \beta_0 \sim N(\beta_0, \sigma^2 [\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i  \bar x)^2}])$
 $\hat \beta_1 \sim N(\beta_1, \frac{\sigma^2}{\sum_{i=1}^n (x_i  \bar x)^2})$
 How to estimate (or get) $\sigma$?:
 If we have a priori knowledge about $\sigma$ from previous studies, then we may use the value of $\sigma$.
 Otherwise, estimate it:
 $\hat \sigma^2 = \frac{sum_{i=1}^n e_i^2}{n2} = \frac{\sum_{i=1}^n (y_i  \hat y_i)^2}{n2} = \frac{SSE}{n2} = MSE$
 The estimator $\hat \sigma^2$ is unbiased for $\sigma^2$, that is, $E \hat \sigma^2 = \sigma^2$
 Without proof, $\frac{SSE}{\sigma^2} \sim \chi_{n2}^2$ if $\epsilon_i \sim^{i.i.d} N(0, \sigma^2)$
 Standard error: the estimate of the standard deviation is called the standard error:
 $s.e.(\hat \beta_0) = \hat \sigma \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i  \bar x)^2}} \\ = \hat \sigma \sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}} = \hat \sigma \sqrt{\frac{\sum x_i ^2}{n S_{xx}}}$
 $s.e.(\hat \beta_1) = \frac{\hat \sigma}{\sqrt{\sum_{i=1}^n (x_i  \bar x)^2}} = \hat \sigma \sqrt{\frac{1}{S_{xx}}}$
 Under normality assumption, e.e., $\epsilon_i \sim^{i.i.d} N(0, \sigma^2)$:
 $\frac{\hat \beta_0  \beta_0}{\hat \sigma \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum (x_i  \hat x)^2}}} \sim t_{n2}$
 $\frac{\hat \beta_1  \beta_1}{\hat \sigma \frac{1}{\sqrt{\sum (x_i  \bar x)^2}}} \sim t_{n2}$
 Hypothesis testing:
 $H_0 : \beta_0 = \beta_0^*$
 Test statistic: $T = \frac{\hat \beta_0  \beta_0^*}{\hat \sigma \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum(x_i  \bar x)^2}}} \sim t_{n2} \text{ under } H_0$
 Let $t_0$ be the value of the test statistic with the data

Calculate the twosided pvalue by $2P(T >  t_0 )$ and reject H_0 if and only if the pvalue <= the level of significance $\alpha$
 $H_0 : \beta_1 = \beta_1^*$
 Test statistic: $T = \frac{\hat \beta_1  \beta_1^*}{\hat \sigma \sqrt{\frac{1}{\sum (x_i  \bar x)^2}}} \sim t_{n2} \text{ under } H_0$
 Let $t_1$ be the observed value of the test statistic from the data.
 Calculate the pvalue by $2P(T >  t_1 )$ and reject $H_0$ if and only if the pvalue <= the level of significance $\alpha$
 Confidence intervals:
 $\hat \beta_0 \pm t_{n2, \alpha / 2} \times s.e.(\hat \beta_0)$
 $\hat \beta_1 \pm t_{n2, \alpha / 2} \times s.e.(\hat \beta_1)$
 Prediction and predction intervals:
 Predicted value at $x = x_0$:
 $\hat y_0 = \hat \beta_0 + \hat \beta_1 x_0$
 Prediction interval:
 $\hat y_0 \pm t_{n2, \alpha / 2} \times s.e.(\hat y_0)$
 where $s.e.(\hat y_0) = \hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0  \bar x)^2}{\sum(x_i  \bar x)^2}}$
 Predicted value at $x = x_0$:
 Mean response estimate and confidence limits:
 Point estimate of the mean response at $x = x)0$:
 $\hat \mu_0 = \hat \beta_0 + \hat \beta_1 x_0$
 Confidence interval for $\mu_0$:
 $\hat \mu \pm t_{n2, \alpha/2} \times s.e.(\hat \mu_0)$
 where $s.e.(\hat \mu_0) = \hat \sigma \sqrt{\frac{1}{n} + \frac{(x_0  \bar x)^2}{\sum(x_i  \bar x)^2}}$
 Point estimate of the mean response at $x = x)0$:
2.5 Quality of fit
 (objective) Thre greater t test statistic of $H_0 : \beta_1 = 0$ (or the smaller the pvalue) is, the stronger the strength of the linear relationship between X and Y is.
 (subjective) The scatter plot may be used to discover the strength of the linear relationship.
 Examine the scatter plot of Y versus $\hat Y$. The closer the set of points to a straight line, the stronger the linear relationship between Y and X. One can measure the strength of the linear relationship in this graph by computing the correlation coefficient between Y and $\hat Y$,:
 $Cor(Y, \hat Y) =  Cor(Y, X) $
 Furtuermore, in both simple and multiple regressions, $Cor(Y, \hat Y)$ is related to another useful measure of the quality (goodness) of fit of the linear model to the observed data, that is called the coefficient of determination $R^2$.
 Decomposition of the sum of square erros and the coefficient of determination $R^2$:
 $\sum(y_i  \bar y)^2 \text{ (SST) } = \sum (\hat y_i  \bar y)^2 \text{ (SSR) } + \sum (y_i  \hat y_i)^2 \text{ (SSE) }$
 $R^2 = \frac{SSR}{SST} = 1  \frac{SSE}{SST} = [cor(X, Y)]^2 = [cor(\hat Y, Y)]^2$
 $R^2 = \frac{\text{Explained variance}}{\text{Total variance}}$
 $R^2 = \frac{\sum (\hat y_i  \bar y)^2}{\sum {y_i  \bar y)^2}}$
 The coefficient of determination $R^2$ is a statistical measure of how well the regression line approximates the real data points.
 $R^2$ does not provide the validty of the regression assumptions
 We reemphasize that the regression assumptions should be checked before drawing statistical conclusions from the analysis (e.g., conducting tests of hypothesis or constructing confidence or prediction intervals) because the validity of these statistical procedures hinges on the validity of the assumptions.
2.6 Simple linear regression with no intercept
 Model:
 $Y = \beta_1 X + \epsilon$
 The LS of $\beta_1$:
 $\hat \beta_1 = \frac{y_i x_i}{\sum x_i^2}$
 Fitted values and residuals:
 $\hat y_i = \hat \beta_1 x_i$
 $e_i = y_i  \hat y_i$
 Note that $\sum e_i \not = 0$
 The standard error of the slope:
 $s.e.(\hat \beta_1) = \frac{\hat \sigma}{\sqrt{\sum x_i^2}} = \sqrt{\frac{SSE}{(n1) \sum x_i^2}}$
 The coefficient of determination:
 $\sum y_i^2 = \sum \hat y_i ^2 + \sum e_i ^2$
 $R^2 = \frac{\sum \hat y_i^2}{\sum y_i^2}$
2.7 Trivial regression and one sample t test
 Consider a linear model with zero slope:
 $Y = \beta_0 + \epsilon$
 To test $H_0 : \beta_0 = \mu_0$:
 $t = \frac{\bar y  \mu_0}{s.e.(\bar y)} = \frac{\bar y  \mu_0}{s_y / \sqrt{n}}$
2.8 Hypothesis tests about a population correlation coefficient
 Consider a bivariate randome variable (X, Y). Let $\rho$ be the population correlation of X and Y. We may perform a hypothesis test $H_0: \rho = 0$ using linear regression analysis when (X, Y) has a bivariate normal distribution.:
 $t = r \sqrt{\frac{n2}{1  r ^2}} \sim t_{n2} \text { under } H_0$
 $r = \frac{t}{\sqrt{n2 + t^2}}$
 A permutation test or bootstrap method can be used without assuming normality
 Fisher’s ztransform can be used under assumption of large sample and normality.:
 $z(r) = \frac{1}{2} log (\frac{1 + r}{1  r})$
 $\sqrt{n  3} (z(r)z(\rho_0)) \sim ^{asymp} N(0, 1) \text{ under } H_0 : \rho = \rho_0$
 Under normality assumption, the exact distribution of the sample correlation coefficient is known
Appendix
 (Derivation of the LSEs of the regression coefficients) Consider a simple linear regression model:
 $Y = \beta_0 + \beta x + \epsilon$
 Suppose we have an iid randome sample $(x_1, Y_1), \cdots, (x_n, Y_n)$:
 $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
 The unobservable errors are:
 $\epsilon_i = Y_i  \beta_0  \beta_1 x_i$
 The LSEs are obtained by minimizing the sum of squared erros:
 $SSE(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i  \beta_0  \beta_1 x_i)^2$
 To find the LSEs, we differentiate $SSE(\beta_0, \beta_1)$ in $\beta_0$ and $\beta_1$ and the equations below are called the normal equations:
 $\frac{\partial SSE(\beta_0, \beta_1)}{\partial \beta_0} = 2 \sum_{i=1}^n (Y_i  \beta_0  \beta_1 x_i) = 0$
 $\frac{\partial SSE(\beta_0, \beta_1)}{\partial \beta_1} = 2 \sum_{i=1}^n x_i (Y_i  \beta_0  \beta_1 x_i) = 0$
 The normal equations can be written as:
 $\sum_{i=1}^n Y_i  n \beta_0  \sum_{i=1}^n x_i \beta_1 = 0$:
 This equation is equivalent to $\bar Y = \beta_0 + \beta_1 \bar x$
 $\sum_{i=1}^n x_i Y_i  \sum_{i=1}^n x_i \beta_0  \sum_{i=1}^n x_i^2 \beta_1 = 0$
 $\sum_{i=1}^n Y_i  n \beta_0  \sum_{i=1}^n x_i \beta_1 = 0$:
 Substitute $\beta_0 = \bar Y  \beta_1 \bar x$ in the equation:
 $\sum_{i=1}^n x_i Y_i  \sum_{i=1}^n x_i (\bar Y  \beta_1 \bar x)  \sum_{i=1}^n x_i^2 \beta_1 = 0$
 $\sum_{i=1}^n x_i (Y_i  \bar Y)  \beta_1 \sum_{i=1}^n x_i (x_i  \bar x) = 0$
 $\beta_1 = \frac{\sum_{i=1}^n x_i (Y_i  \bar Y)}{\sum_{i=1}^n x_i (x_i  \bar x)} = \frac{\sum_{i=1}^n (x_i  \bar x) (Y_i  \bar Y)}{\sum_{i=1}^n{(x_i  \bar x)(x_i  \bar x)}} = \frac{S_{xy}}{S_{xx}}$
 Hence, the LSEs are:
 $\hat \beta_ 1 = \frac{S_{xy}}{S_{xx}}$
 $\hat \beta_0 = \bar Y  \hat \beta_1 \bar x$
 (The Expected Values of $\hat \beta_0$ and $\hat \beta_1$) Assume $E \epsilon_i = 0$:
 $E(\hat \beta_1) = E(\frac{\sum_{i=1}^n (x_i  \bar x)(Y_i  \bar Y)}{S_{xx}}) \\ E(\frac{\sum_{i=1}^n(x_i  \bar x)Y_i  \bar Y \sum_{i=1}^n (x_i  \bar x)}{S_{xx}}) \\ = \frac{1}{S_{xx}} \sum_{i=1}^n (x_i  \bar x) EY_i \\ = \frac{1}{S_{xx}} \sum_{i=1}^n (x_i  \bar x)(\beta_0 + \beta_1 x_i + E(\epsilon_i)) \\ = \frac{1}{S_{xx}} \beta_1 \sum_{i=1}^n (x_i  \bar x)^2 \\ = \beta_1$
 $E(\hat \beta_0) = E(\bar Y  \hat \beta_1 \bar x) \\ = \frac{1}{n} \sum_{i=1}^n EY_i  E(\hat \beta_1) \bar x \\ = \frac{1}{n} \sum_{i=1}^n(\beta_0 + \beta_1 x_i + E(\epsilon_i))  \beta_1 \bar x \\ = \beta_0 + \beta_1 \bar x  \beta_1 \bar x \\ = \beta_0$

Both $\hat \beta_0$ and $\hat \beta_1$ are unbiased estimators for $\beta_0$ and $\beta_1$, respectively.
 (The Variances of $\hat \beta_0$ and $\hat \beta_1$ and the convariance of $\hat \beta_0$ and $\hat \beta_1$) Assume $Var(\epsilon_i) = \sigma^2$ and $Cov(\epsilon_i, \epsilon_j) = 0 \text{ for } i \not = j$:
 $Var(\hat \beta_1) = Var(\frac{\sum_{i=1}^n (x_i  \bar x) Y_i}{S_{xx}}) \\ = \frac{1}{S_{xx}^2} \sum_{i=1}^n (x_i  \bar x)^2 Var(Y_i) \\ = \frac{1}{S_{xx}^2} \sum_{i=1}^n (x_i  \bar x)^2 Var(\epsilon_i) \\ = \frac{\sigma^2}{S_{xx}}$
 $Var(\hat \beta_0) = Var(\bar Y  \hat \beta_1 \bar x) \\ = Var(\bar Y)  2 \bar x Cov(\bar Y, \hat \beta_1) + \bar x^2 Var(\hat \beta_1) \\ = \frac{1}{n} \sigma^2  2 \bar x Cov(\bar Y, \hat \beta_1) + \frac{\bar x^2}{S_{xx}} \sigma^2 \\ = \sigma^2 (\frac{1}{n} + \frac{\bar x^2}{S_{xx}})$
 $Cov(\bar Y, \hat \beta_1) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \frac{(x_j  \bar x)}{S_{xx}} Cov(Y_i, Y_j) \\ = \frac{1}{n} \sum_{i=1}^n \sum{j=1}^n \frac{(x_j  \bar x)}{S_{xx}} Cov(\epsilon_i, \epsilon_j) \\ = \frac{1}{n} \sum_{i=1}^n \sum_{i=1}^n \frac{(x_i  \bar x)}{S_{xx}} \sigma^2 \\ = \frac{\sigma^2}{n S_{xx}} \sum_{i=1}^n (x_i  \bar x) = 0$
 By using the above,:
 $Cov(\hat \beta_0, \hat \beta_1) = Cov(\bar Y  \hat \beta_1 \bar x, \hat \beta_1) \\ = Cov(\bar Y, \hat \beta_1)  \bar x Cov(\hat \beta_1, \hat \beta_1) \\ = 0  \bar x \frac{\sigma^2}{S_{xx}} \\ =  \frac{\bar x}{S_{xx}} \sigma^2$
 (The expected value of $\hat \sigma^2 = MSE$):
 $E(MSE) = E(\frac{1}{n2} \sum_{i=1}^n e_i^2) \\ = \frac{1}{n2} E { \sum_{i=1}^n (Y_i  \hat \beta_0  \hat \beta_1 x_i)^2} \\ = \frac{1}{n2} E{ \sum_{i=1}^n ((Y_i  \bar Y)  \hat \beta_1 (x_i  \bar x))^2} \\ = \frac{1}{n2} E { \sum_{i=1}^n (Y_i  \bar Y)^2  s \sum_{i=1}^n (x_i  \bar x)(Y_i \bar Y)\hat \beta_1 + \sum_{i=1}^n (x_i  \bar x)^2 \hat \beta_1^2} \\ = \frac{1}{n2} E { \sum_{i=1}^n (\beta_1(x_i  \bar x) + (\epsilon_i  \bar \epsilon))^2  S_{xx} \hat \beta_1 ^2} \\ = frac{1}{n2} { \beta_1 ^2 S_{xx} + 2 \beta_1 \sum_{i=1}^n (x_i  \bar x) E(\epsilon_i) + E \sum_{i=1}^n (\epsilon_i  \bar \epsilon)^2   S_{xx}E(\hat \beta_1^2) } \\ = \frac{1}{n2} { \beta_1^2 S_{xx} + (n1) \sigma^2  S_{xx} (Var \hat \beta_1) +  E(\hat \beta_1) ^2) } \\ = \frac{1}{n2} { \beta_1 ^2 S_{xx} + (n1)\sigma^2  \sigma^2 \beta_1 ^2 S_{xx} } \\ = \sigma^2$
 (The coefficient of determination $R^2 = SSR/SST = 1 SSE/SST$):
 Proof of SST = SSE + SSR where:
 $SST = \sum(y_i  \bar y)^2$
 $SSE = \sum (y_i  \hat y_i)^2$
 $SSR = \sum (\hat y_i  \bar y)^2$
 $SST = \sum(y_i  \bar y)^2 \\ = \sum (y_i  \hat y_i + \hat y_i  \bar y)^2 \\ = \sum [(y_i  \bar y_i)^2 + (\hat y_i  \bar y)^2 + 2(y_i  \hat y_i)(\hat y_i  \bar y)] \\ = \sum (y_i  \bar y_i)^2 + \sum (\hat y_i  \bar y)^2 + 2 \sum (y_i  \hat y_i)(\hat y_i  \bar y) \\ = SSE + SSR + 2 \sum (y_i  \hat y_i)(\hat y_i  \bar y)$
 Therefore, we need to show the corssproduct term vanishes, that is, $\sum(y_i  \hat y_i)(\hat y_i  \bar y) = 0$
 $\sum (y_i  \hat y_i)(\hat y_i  \bar y) = \sum (y_i  \hat \beta_0  \hat \beta_1 x_i)(\hat \beta_0 + \hat \beta_1 x_i  \hat \beta_0  \hat \beta_1 \bar x) \\ = \sum (y_i  \hat \beta_0  \hat \beta_1 x_i) \hat \beta_1 (x_i  \bar x) \\ = \hat \beta_1 \sum y_i (x_i  \bar x)  \hat \beta_0 \hat \beta_1 \sum(x_i  \bar x)  \hat \beta_1^2 \sum x_i(x_i  \bar x) \\ = \hat \beta_1 \sum(y_i  \bar y) (x_i  \bar x)  \beta_1^2 \sum(x_i  \bar x)(x_i  \bar x) \\ = \hat \beta_1 (S_{xy}  \beta_1 S_{xx}) = 0$
 Proof of SST = SSE + SSR where:
 Proof of $R^2 =  Cor(X, Y) ^2 =  Cor(\hat Y, Y) ^2$. In class, we have done $R^2 =  Cor(X, Y) ^2$, henc it’s left to show that $Cor(X, Y) ^2 =  Cor(\hat Y, Y) ^2$. From the definition of the coreelation coefficient,:
 $ Cor(\hat Y, Y) ^2 = \frac{(\sum(\hat y_i  \bar \hat y) (y_i  \bar y))^2}{\sum(\hat y_i  \bar \hat y)^2 \sum(y_i  \bar y)^2} $
 Note that $\bar \hat y = \hat \beta_0 \hat \beta_1 \bar x$:
 $ Cor (\hat Y, Y) ^2 = \frac{(\sum )\hat y_i  \bar \hat y)(y_i  \bar y))^2}{\sum (\hat y_i  \bar \hat y)^2 \sum(y_i  \bar y)^2} \\ = \frac{(\sum (\hat \beta_0 + \hat \beta_1 x_i  \hat \beta_0  \hat \beta_1 \bar x)(y_i  \bar y))^2}{\sum(\hat \beta_0 + \hat \beta_1 x_i  \hat \beta_0  \hat \beta_1 \bar x)^2 \sum(y_i  \bar y)^2} \\ = \frac{(\hat \beta_1 \sum(x_i  \bar x)(y_i  \bar y))^2}{\hat \beta_1 ^2 \sum(x_i  \bar x)^2 \sum(y_i  \bar y)^2} \\ = \frac{(\sum(x_i  \bar x)(y_i  \bar y))^2}{\sum (x_i  \bar x)^2 \sum(y_i  bar y)^2} \\ =  Cor(X, Y) ^2$
Chatper 3. Multiple linear regression
 Multiple Linear Regression Model: $Y = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, i= 1, \cdots, n$
 Matrix notation: $y = X \beta + \epsilon$
Chatper 3.1 Parameter Estimation: Least Squares Estimation (LSE)
 $\epsilon = y  X \beta$
 $S(\beta) \epsilon’ \epsilon = (y  X \beta)’(y  X \beta)$
 Differentiate $S(\beta)$ in $\beta$ to find the minimum point.:
 $\frac{\partial}{\partial \beta} S(\beta) = 2(X’X \beta  X’y) = 0$
 \hat \beta_{LSE} = (X’X)^{1} X’y$
 The fitted values and residuals:
 $\hat y_i = \hat \beta_0 + \hat \beta_1 x_{i1} + \cdots + \hat \beta_p x_{ip}$
 $e_i = y_i  \hat y_i$
 In matrix notation,:
 Fitted Vector $\hat y = X \hat \beta = X(X’X)^{1} X’ y = P_X y$
 Residual Vector $e = y  \hat y = (I  P_X)y$
 Remark 1. The matrix $P_X = X(X’X)^{1} X’$ is called the projection matrix onto the oclumn space of X. It is also known as the hat matrix $H = P_X$

Remark 2. Check $P_X^2 = P_X$, that is, it is idempotent. In addition, we can show $(I  P_X)^2 = I  P_X$ as well.
 The unbiased estimate of the rror variance:
 $\hat \sigma^2 = \frac{\sum_{i=1}^n e_i^2}{np1} = \frac{e’e}{np1} = \frac{SSE}{np1}$
 In matrix notation,
 $\hat \sigma^2 = \frac{1}{np1} e’e = \frac{1}{np1} y’(IP_X)^2 y = \frac{1}{np1}y’(IP_X)y$
Chapter 3.2 Interpretation of the regression coefficients
 Let $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_j X_j + \cdots + \beta_p X_p + \epsilon$ be a linear regression model to explain the data.

The (partial) regression coefficient $\beta_j$ is the increment in the mean response $EY$ when we increase $X_j$ by one unit while all other predictors $X_i (i \not = j)$ are held fixed.
 The partial regression coefficient $\beta_j$ represents the contribution of $X_j$ to $Y$ after adjuesting for the other predictors
 Let $e_{Y X_1}$ be the residual vector obtained from the linear regression model $Y = \alpha_0 + \alpha_1 X_1 + \epsilon$ and $e_{X_2 X_1}$ be the residual vector obtained from the regression model $X_2 = \gamma_0 + \gamma_1 X_1 + \epsilon$. Then, the partial regression coefficient $\beta_2$ is equal to the regression coefficient $\delta$ obtained from $e_{Y X_1} = 0 + \delta e_{X_2 X_1} + \epsilon$
Chapter 3.3. Centering and scaling
 The magnitudes of the regression coefficients in a regression equation depend on the unit of measurements of the variables
 To make the regression coefficients unitless, one may first center and/or scale teh variables before performing the regression computations
 UnitLength scaling:
 $Z_y = \frac{Y  \bar y}{L_y}$
 $Z_j = \frac{X_j  \bar x_j}{L_j}$
 where $L_j = \sqrt{\sum_{i=1}^n (y_i  \bar y)^2}$ and $L_j = \sqrt{\sum_{i=1}^n (x_{ij}  \bar x_j)^2}$
 Standardizing:
 $\hat Y = \frac{Y  \bar y}{s_y}$
 $\hat X_j = \frac{X_j  \bar x_j}{s_j}$
Chapter 3.4 Properties of LSEs
 (GaussMarkov theorem) LSE = BLUE
 Assume $\epsilon_i \sim^{iid} N(0, \sigma^2)$, that is, $\epsilon \sim N(0, \sigma^2 I)$, and let $C = (X’X)^{1}$:
 A. $\hat \beta_j \sim N(\beta_j, \sigma^2 C_{jj})$
 B. $\hat \beta \sim N(\beta, \sigma^2 C)$
 C. $SSE/\sigma^2 \sim \chi_{np1}^2$ and $\beta_j$ and $\hat \sigma^2$ are independent.
Chapter 3.5 Multiple correlation coefficient
 $R^2 = (cor(Y, \hat Y))^2 = \frac{(\sum(y_i  \bar y)(\hat y_i  \bar y))^2}{\sum (y_i  \bar y)^2 \sum (\hat y_i  \bar y)^2} \\ = \frac{SSR}{SST} = 1  \frac{SSE}{SST} = 1  \frac{\sum(y_i  \hat y_i)^2}{\sum (y_i  \bar y)^2}$
 The adjuested $R^2$ is defined to be:
 $R_{adj}^2 = 1  \frac{SSE/(np1)}{SST/(n1)}$
Chapter 3.6 Inference for individual regression coefficients
 Test $H_0 : \beta_j = \beta_{j, 0} \text{ versus } H_j : \beta_j \not = \beta_{j, 0}$
 Test statistic:
 $t_j = frac{\hat \beta_j  \beta_{j,0}}{s.e.(\hat \beta_j)} \sim t_{n  p  1} \text{ under H_0 }$
 Calculate the pvalue by Pvalue = $p(t_j) = 2 Pr(t_{np1} >  t_j )$ and reject $H_0$ iff the Pvalue <= $\alpha$
Chatper 3.7 Tests of hypotheses in a linear model
 Full model ($M_F$) : $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_{i}$ (In matrix form $Y = X \beta + \epsilon$)

Reduced model ($M_R$) : a model such as the above with k regression coefficients
 $\hat y_i$ : the fitted value in the full model
 $\hat y_i^*$ : the fitted value in the reduced model
 $SSE(M_F) = \sum(y_i  \hat y_i)^2$ : the sum of squared redisudals in the full model

$SSE(M_R) = \sum(y_i  \hat y_i^*)^2$ : the sum of squared residudals in the reduced model
 Note that $SSE(M_R) \ge SSE(M_F)$:
 $\frac{SSE(M_R)}{\sigma^2} \sim \chi^2_{nk}$
 $\frac{SSE(M_F)}{\sigma^2} \sim \chi^2_{np1}$
 $\frac{SSE(M_R)  SSE(M_F)}{\sigma^2} \sim \chi^2_{p+1+k}$

In addition, $SSE(M_F)$ and $SSE(M_R)  SSE(M_F)$ are independent.
 Condier a ypothesis test $H_0$ : reduced model is adequate versus $H_1$ : Full model is adequate.:
 $F= \frac{\frac{SSE(M_R)  SSE(M_F)}{\sigma^2} / (p+ 1  k)}{\frac{SSE(M_F)}{\sigma^2} / (np1)} \\ = \frac{[SSE(M_R)  SSE(M_F)] / (p+1k)}{SSE(M_F) / (np1)} \sim F_{p+1  k, np1} \text{ under } H_0$
 Suppose the reduced model has q= k 1 predictors and the coefficient of determination $R_q^2$ whereas teh full model has p predictors and the coefficient of determination $R_p^2$. Then,:
 $R_p^2 = 1  \frac{SSE(M_F)}{SST}$
 $SSE(M_F) = SST[1R_p^2]$
 Similarly, $SSE(M_R) = SST(1  R_q^2)$
 Therefore, the F statistic is given by:
 $F = \frac{(R_p^2  R_q^2) / (pq)}{(1  R_p^2) / (n p  1)} \sim F_{pq, np1}$
 More generally, let $M_F$ be the model including m independent parameters and $M_R$ be a submodel in the full model $M_F$ consisting of k independent parameters:
 $F = \frac{\frac{SSE(M_R)  SSE(M_F)}{\sigma^2} / (mk)}{\frac{SSE(M_F)}{\sigma^2}/ (n  m)} \\ = frac{[SSE(M_R)  SSE(M_F)] / (mk)}{SSE(M_F) / (nm)} \sim F_{mk, nm} \text{ under } H_0$
Source  SS  df  MS  F  Pvalue 

Regression  SSR  p  MSR = SSR/p  F = MSR/MSE  $P(F_{p+1 k, np1} \ge F)$ 
Residudals  SSE  np1  MSE = SSE(np1) 
Chatper 4. Diagnostics
 As seen in the Anscombe data, particularly dataset 3 and 4, one or few observations may too much influence or completely determine the regression line. We need to check whether the fit is overly determined by few observations.
 The confidence or prediction intervals and ttest or anova approach in hypothesis tests in Chapter 3 require normality assumption on the random errors.
 When some of the regression assumptions are violated, the analysis based on the fit may be distorted and they are not reliable.
 Need to check the assumptions made to apply the linear models to fit the datset.
4.1 Standard Regression Assumptions
 Assumption about the form of the model (linearity of Y and X_1, …, X_p): $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$
 Assumptions about the random erros: $\epsilon_1, \cdots, \epsilon_n \sim^{iid} N(0, \sigma^2)$:
 Normality assumption
 Mean zero
 Constant variance (homogeneity or homoscedasticity) assumption: when violated, it is called the heterogeity (or heteroscedasticity) problem
 Independent assumption: when violated, it is called the autocorrelation problem
 Assumptions about the predictors:
 Nonrandom
 No measurement erros
 Linearly independent: when violated, it is called the multicollinearity problem
 Assumption about the observations (equally reliable)
4.2 Residuals
 A simple and effective method ofr detecting model deficiencies in regression analyssis is the examination of residuals
 Projection or hat matrix : $H = P_X = X(X’X)^{1} X’$
 Note that $\hat y = P_X y$ and $e = (I  P_X) y$
 Let $p_{ij}$ be the (i, j) element of the projection matrix. For a simple linear regression,:
 $p_{ij} = \frac{1}{n} + \frac{(x_i  \bar x)(x_j  \bar x)}{\sum (x_i  \bar x)^2}$
 The diagonal term $h_i = p_{ii}$ is called the leverage value of the i th observation. $p_{ii}$ is denoted by $h_i$ in other literatures. For a simple linear regression,:
 $h_i = p_{ii} = \frac{1}{n} + \frac{(x_i  \bar x)^2}{\sum (x_i  \bar x)^2}$
 Since $\hat y_i = p_{i1}y_1 + \cdots + p_{ii}y_i + \cdots + p_{in} y_n$, the leverage $p_{ii}$ is the weight (leverage) given to $y_i$ in determining the i th fitted value $\hat y_i$
 $Var(e_i) = \sigma^2 (1  p_{ii})$
 $\hat \sigma^2 = \frac{1}{n  p  1} \sum_{i=1}^n e_i^2 = \frac{1}{np1} SSE$

$\hat \sigma^2{(i)} = \frac{1}{n  p  2} SSE{(i)}$ where $SSE_{(i)}$ is the sum of squares of residuals when n  1 observations are used to fit by deleting the i th observation.
 Various types of residuals:
 Standardized residual:
 $z_i = \frac{e_i}{\sigma\sqrt{1  p_{ii}}}$
 $E(z_i) = 0$ and $V(z_i) = 1$, but $\sum z_i \not = 0$
 Internally studentized residual:
 $r_i = \frac{e_i}{\hat \sigma \sqrt{1  p_{ii}}}$
 Externally studentized reisudal:
 $r_i^* = \frac{e_i}{\hat \sigma_{(i)} \sqrt{1  p_{ii}}} \sim t_{np2}$
 For simplicity, we refer to the studentized residuals as the standardized residuals. Note:
 $r_i^* = r_i \sqrt{\frac{np2}{np1  r_i^2}}$
 so that $r_i$ and $r_i^*$ are close to each other when the sample size n is large
 Standardized residual:
 Remark 3. For the purpose of residual plots, it makes little difference as to which of the two forms of the standardized residuals is used. From here on, we shall use the internally standardized residuals in the graphs. We need not make any distinction between the internally and externally standardized residuals in our residual plots. Several graphs of the residuals are used for checking the regression assumptions.
4.3 Graphical Methods
 Detect errors in the data (e.g., an outlying point may be a result of a typographical error)
 Recognize patterns in the data (e.g., clusters, outliers, gaps, etc.)
 Explore relationships among variables
 Discover new phenomena
 Confirm or negate assumptions
 Assess the adequacy of a fitted model
 Suggest remedial actions (e.g., transform the data, redesign the experiment, collect more data, etc.)
 Enhace numerical analyses in general
 Graphs before fitting a model. These are useful, for exmpale, in correcting erros in data and in selecting a model
 Graphs after fitting a model. These are particularly useful for checking the assumptions and fore assessing the goodness of the fit.
 QQ plot or PP plot or normal probability plot: checking normality
 Scatter plots of the standardized residual vs X_i : checking linearity and homogeneity
 Scatter plot of standardized residual vs fitted values: checking linearity and homogeneity
 Index plot of the standardized residuals: checking independent errors
4.4 Leverages, Outliers Influence
 We do not want our fit to be determined by one or few observations
4.4.1 Outliers in response
 Observation with large standardized residuals are outliers in the resposne variable
 See Anscombe’s quartet (c). But the outlier in (d) can not be detected by the residuals
 In R, use
outlierTest
4.4.2 Outliers in predictors
 $h_i = H_{ii} = P_{ii}$ are called levwerages and depend only on the predictors
 From $Var(e_i) = \sigma^2(1  h_i)$, a large leverage $h_i$ will make $Var(e_i)$ small (<=> The fit will be forced to be close to $y_i$)
 In addition, a large leverage point may not be detected in the residual plot since $e_i$ is small
 High leverage points : $h_i \ge 2(p+1)/n$
 See Anscombe’s quartet (d). The leverage of 8 th observation is 1 and its residual is 0
 In R,
X=model.matrix(fit); hat(X)
4.4.3 Masking and Swamping Problems
 Masking occurs when the data contain outliers but we fail to detect them. THis can happen because some of the outliers may be hidden by other outliers in the data.
 Swamping occurs when we wrongly declare some of the nonoutlying points as outliers. This can occur because outliers tend to pull the regression equation toward them, thence make other points lie far from the fitted equation
 Masking is a false negative decision whereas swamping is a false positive.
4.4.4 Incluential points
 If deletion of an observation make substantial change in the fit (estimated coefficients, fitted values, tTests, etc), then it is called an influential point
 See Anscombe’s quartet (c) and (d)
 Meausres of influence:
 Cook’s distance: $C_i = \frac{(\hat \beta  \hat \beta_{(i)})’ (X’X) (\hat \beta  \hat \beta_{(i)})}{\hat sigma^2 (p+1)} = \frac{\sum_{j=1}^n (\hat y_j  \hat y_{j(i)})^2}{\hat \sigma^2 (p+1)} = \frac{r_i^2}{p+1} \frac{p_{ii}}{1  p_{ii}}$:
 Rule of thumbs : $C_i \ge 1$
 Welsch and Kuh’s measure: $DFFITS_i = \frac{\hat y_i  \hat y_i(i)}{\hat \sigma_{(i)} \sqrt{p_{ii}}} = r_i^* \sqrt{p_{ii}}{1  p_{ii}}$:
 Rule of thumbs : $DFFITS_i \ge 2 \sqrt{(p+1) / (np1)}$
 Hadi’s influence meausre: $H_i = \frac{p_{ii}}{1p_{ii}} + \frac{p+1}{1p_{ii}} \frac{d_i^2}{1  d_i^2} \text{ where } d_i = e_i / \sqrt{SSE}$
 Cook’s distance: $C_i = \frac{(\hat \beta  \hat \beta_{(i)})’ (X’X) (\hat \beta  \hat \beta_{(i)})}{\hat sigma^2 (p+1)} = \frac{\sum_{j=1}^n (\hat y_j  \hat y_{j(i)})^2}{\hat \sigma^2 (p+1)} = \frac{r_i^2}{p+1} \frac{p_{ii}}{1  p_{ii}}$:
 What do we do with the outliers and influential points? Do not automatically remove the observations!:
 Check for a data entry error first
 Examine the physical context  why did it happen?
 Exclude the point from the analysis but try reincluding it later if the model is changed
 Redesign the experiment or sample survey, collect more data
4.5 Addedvariable (AV) plot and residual plus component (RPC) plot
 What are the effects of deleting or adding one of variables from or to the model? The ttest is valid only if the underlying assumptions hold
 There are two popular graphical ways to complement the ttest

Scatter plot matrix shows marginal relationship of each predictor and the response variable, that is, simple linear regression models
 Addedvariable plot (partial regression plot, adjuested variable plots, and individual coefficient plots):
 Let $e(Y  X_{(i)})$ be the residuals of the regression of $Y$ on $X_1, …, X_{j1}, X_{j+1}, …, X_p$, that is , the residuals of:
 $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{j1} X_{j1} + \beta_{j+1} X_{j+1} + \cdots + \beta_p X_p + \epsilon$
 Let $e(X_j  X_{(j)})$ be the residuals of the regression of $X_j$ on $X_1, …, X_{j1}, X_{j+1}, …, X_p$, that is, the residuals of:
 $X_j = \alpha_0 + \alpha_1 X_1 + \cdots + \alpha_{j1} X_{j1} + \alpha_{j+1} X_{j+1} + \cdots + \alpha_p X_p + \epsilon$
 The scatter plot $(e(X_j  X_{(j)}), e(Y  X_{(j)}))$ is called the addedvariable plot.:
 $e(Y  X_{(j)}) = (I  H_{(j)}) y$
 $e(X_j  X_{(j)}) = (I  H_{(j)}) x_j$
 $(I  H_{(j)}) y = (I  H_{(j)}) x_j \beta_j + \epsilon ‘$’
 Hence, the slope of the addedvariable plot is the same as the jth regression coefficient in the full model.
 The AV plot can detect nonlinearity between the response and the predictors
 The AV plot is also useful to detect outliers or influential points
 Let $e(Y  X_{(i)})$ be the residuals of the regression of $Y$ on $X_1, …, X_{j1}, X_{j+1}, …, X_p$, that is , the residuals of:
 Residual plus component plot (parital residual plot):
 A multiple linear regression model can be written as:
 $Y  \beta_0  \beta_1 X_1  \cdots  \beta_{j1}X_{j1}  \beta_{j+1} X{j + 1}  \cdots  \beta_p X_p = \beta_j X_j + \epsilon$
 The response adjusting the effects of other predictors is equal to $\beta_j X_j + \epsilon$ and the estimated quantity is equal to $\hat \beta_j X_j + e$
 The residual plus component plot is $(X_j, e + \hat \beta_jX_j)$, where the residuals $e_i$’ and the regression coefficients $\hat \beta_j$’s are obtained from the full model.
 The residual plus component plot tends to detect nonlinearity between $X_j$ and $Y$ better than the addedvariable plot
 A multiple linear regression model can be written as:
4.5.1 Effects of adding a variable
 There are four possible cases when adding a predictor in a model.:
 The new variable has an insignificant regression coefficient and the remaining regression coefficients do not change substantially from their previous values. Under these conditions the new variable should not be included in the regression equation, unless omse other external conditions (e.g., theory or subject matter considerations) dictate its inclusion.
 The new variable has a significant regression coefficient, and the regression coefficients for the previously introduced variables are chagned in a substantial way. In this case the new variable should be retained, but an examination of collinearity should be carried out. If there is no evidence of collinearity, the variable should be included in the equation and other additional variables should be examined for possible inclusion. On the other hand, if the variables show collinearity, corrective actions should be taken.
 The new variable has a significant regression coefficient, and the coefficients of the previously introduced variables do not change in any substantial way. This is the ideal situation and aries when the new variable in uncorrelated with the previously introduced variables. Under these conditions the new variable should be retained in the equation.
 The new variable has an insignificant regression coefficient, but the regression coefficients of the previously introduced variables are substantially changed as a result of the introduction of the new variable. This is clear evidence of collinearity, and corrective actions have to be taken before the equation of the inclusion or exclusion of the new variable in the regression equation can be resolved.
4.5.2 Robust regression
 Another approach, useful for the identification of outliers and influential observations, is robust regression, a method of fitting that gives less weight to points with high leverage.
Chapter 5. Regression analysis with qualitative explanatory variables
5.1 Introduction
 Predictor(s) may be qualitative: Use dummy (indicator) variables
 Dummy variable takes only on 0 or 1.
 For a qualitative variable with k possible categories $(C_1, \cdots, C_k)$, we need k1 dummy variables: $I(C_1), \cdots, I(C_{k1})$ since $I(C_1) + \cdots + I(C_k) = 1$
5.2 Interactions
 Interaction effects example
5.3 Equal slopes and unequal intercepts
 Let D be a dummy variable.:
 $EY = \beta_0 + \beta_1 X + \beta_2 D$
 is a model equivalent to:
 $EY = \beta_0 + \beta_1 X \text{ for } D = 0$
 $EY = (\beta_0 + \beta_2) + \beta_1 \text{ for } D = 1$
5.4 Unequal slopes and unequal intercepts
 There may exist an interaction effect between a qualitative predictor and a quantitative predictor to the response variable
 Let D be a dummy variable. The interaction effect of D (qualitative predictor) and X (quantitative predictor) to the response variable Y produces the model with unequal slopes of X (quantitative predictor) and uneuqla intercepts:
 $EY = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X: D)$
 is a model equivalent to:
 $EY = \beta_0 + \beta_1 X \text{ for } D = 0$
 $EY = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X \text{ for } D = 1$
5.5 Seasonality
Chapter 6. Transformations
 Use transformations to achieve linearity or/and homogeneity
6.1 Transformations for lienarity
 Nonlinearity between predictor(s) and response may be detected by a scatter plot (simple linear regression) or an addedvariable plot or residual plus component plot (multiple linear regression)
Function  Transformation  Linear Form 

$Y=\alpha X^{\beta}$  $Y’ = logY, X’ = logX$  $Y’=\alpha ‘ + \beta X’$ 
$Y = \alpha e^{\beta X}$  $Y’ = logY$  $Y’ = log \alpha + \beta X$ 
$Y = \alpha + \beta log X$  $X’ = logX$  $Y = \alpha + \beta X’$ 
$Y = \frac{X}{\alpha X  \beta}$  $Y’ = \frac{1}{Y}, X’ = \frac{1}{X}$  $Y’ = \alpha  \beta X’$ 
$Y = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}$  $Y’ = log \frac{Y}{1Y}$  $ Y’ = \alpha + \beta X$ 
6.2 Detection of heterogeneity
 Heterogeneity may be detected by the residual plots (residuals vs each predictors or residuals vs fitted values). A formal test can be performed by ncvTest (car) or bptest (lmtest)
6.3 Variance stabilizing transformations
 Note that the variance stabilizing transform makes the rror distribution closer to a normal distribution as well
$\sigma$  Transformation 

$\sigma = \mu^k$  $Y^{1  k}$ 
$\sigma = \mu$  $log Y$ 
$\sigma = \sqrt{\mu}$  $\sqrt{Y}$ 
$\sigma = \sqrt{\mu (1  \mu) / n}$  $arcsin(\sqrt{Y})$ 
6.4 Weighted Least Squares (WLS)
6.5 BoxCox power transformations
 $f(Y;\lambda) = \begin{cases} \frac{Y^{\lambda}  1}{\lambda} & \text{ for } \lambda \not = 0 \\ log Y & \text{ for } \lambda = 0 \end{cases}$
 is called the BoxCox power transformation where $\lambda$ may be estimated from the data. Historically, the main purpose of the BoxCox transform was to make a random variable closer to a normal distribution
Chatper 7. Weighted Least Squares
 Model: $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i$. In a matrix form, $Y = X \beta + \epsilon$
 Suppose that $Var(\epsilon) = \sigma_i^2$
 Then the transformed new model $\frac{y_i}{\sigma_i} = \beta_0 \frac{1}{\beta_i} + \beta_1 \frac{x_{i1}}{\sigma_i} + \cdots + \beta_p \frac{x_{ip}}{\sigma_i} + \frac{\epsilon_i}{\sigma_i}$ would have a constant variance erros.
 Let $w_i = \frac{1}{\sigma_i^2}$ be the weight on Observation i. Let $W = diag(w_1, \cdots, w_n)$
 Minimize:
 $ SSE^*(\beta) = \sum w_i(y_i  \beta_0  \beta_1 x_{i1}  \cdots  \beta_p x_{ip})^2 \\ = \sum (\frac{y_i}{\sigma_i}  \beta_0 \frac{1}{\sigma_i}  \beta_1 \frac{x_{i1}}{\sigma_i}  \cdots  \beta_p \frac{x_{ip}}{\sigma_i})^2 \\ = \sum (y_i ^ *  \beta_0 x_{i0} ^ *  \beta_1 x_{i1} ^ *  \cdots  \beta_p x_{ip} ^ *)^2 $
 by defining new variables $y_i^* = y_i / \sigma_i$ and $x_{ij}^* = x_{ij} / \sigma_i$
 In matrix form,:
 $\epsilon = Y  X \beta$
 $SSE^*(\beta) = \epsilon ‘ W \epsilon = (Y  X\beta)’ W (Y  X \beta)$
 We define $Y^* = W^{1/2} Y$ and $X^* = W^{1/2} X$, where $W^{1/2} = diag(1/\sigma_1, \cdots, 1/\sigma_n)$
 The LSEs with the transformed data are given by:
 $\hat \beta^* = ((X ^ )’ X ^ *)^{1} (X^)’ Y ^ * \\ = (X’WX)^{1} X’WY$
 that are teh WLS estimates of the regression coefficient vector $\beta$ with the untransformed data
 Statistics in terms of the transformed data and the untransformed data:
 Let $y_i ^ *$ be the ith transformed response value: $y_i ^ * = \sqrt{w_i} y_i$, that is, $Y ^ * = W^{1/2} Y$
 Let $x_{i0} ^ , \cdots, x_{ip} ^ *$ be the ith transformed predictor values : $x_{i0} ^ * = \sqrt {w_i} x_{i0}, \cdots, x_{ip} ^ * = \sqrt{w_i} x_{ip} ^ *$, that is, $X^ = W^{1/2} X$
 The WLS method minimizes $\sum_{i=1}^n w_i (y_i  \beta_0, x_{i0}  \cdots  \beta_p x_{ip})^2 = \sum_{i = 1}^n(y_i ^ *  \beta_0 x_0 ^ *  \cdots  \beta_p x_{ip} ^ *)^2$:
 Let $\hat \beta_i ^ *$ be teh WLS estimate of $\beta_i$ and the residual sum of squares of the transformed data can be written as:
 $SSE ^ * = \sum_{i=1}^n w_i (y_i  \hat \beta_0 ^ * x_{i0}  \cdots  \hat \beta_p ^ * x_{ip})^2 = \sum_{i=1}^n (y_i ^ *  \hat \beta_0 ^ * x_{i0} ^ *  \cdots  \hat \beta_p ^ * x_{ip} ^ *)^2 \\ = (Y  X \hat \beta ^ *)’ W (Y  X \hat \beta ^ *) = (Y ^ *  X ^ * \hat \beta ^ *) ‘ (Y ^ *  X ^ * \hat \beta ^ *)$
 $SST ^ * = \sum_{i=1}^n w_i (y_i  \bar y)^2 = \sum_{i=1}^n (y_i ^ *  \bar y ^ *)^2 \\ = (Y  \bar Y)’ W (Y  \bar Y0) = (Y ^ *  \bar Y ^ *)’ (Y ^ *  \bar Y ^ *)$
 $\hat \sigma ^ * = \sqrt{ \frac{ SSE ^ * }{n  p  1}}$ => R shows this value
 $R^2 = 1  \frac{SSE ^ * }{SST ^ *}$ => R shows this value
 Using the WLS estimates $\hat \beta_i ^ *$ and the untransformed data $y_i, x_{i0}, \cdots, x_{ip}$:
 $SSE(\hat \beta ^ *) = \sum_{i=1}^n(y_i  \hat \beta_0 ^ * x_{i0}  \cdots  \hat \beta_p ^ * x_{ip}) ^2 \\ = (Y  X \hat \beta ^ *) ‘ (Y  X \hat \beta ^ *)$
 $SST = \sum_{i=1}^n (y_i  \bar y)^2 \\ = (Y  \bar Y) ‘ (Y  \bar Y)$
 $\hat \sigma = \sqrt{\frac{SSE(\hat \beta ^ *)}{np1}}$
 $PseudoR^2 = 1  \frac{SSE(\hat \beta ^ *)}{SST}$
 Let $\hat \beta_i ^ *$ be teh WLS estimate of $\beta_i$ and the residual sum of squares of the transformed data can be written as:
 Since the LSEs $\hat \beta_{LS}$ minimize SSE and the WLS estimates $\hat \beta ^ *$ do not minimize SSE, it is always true that $SSE(\hat \beta ^ *) \ge SSE(\hat \beta_{LS})$. The pseudo $R^2$ is always smaller than $R^2$ calculated usign the LSE $\hat \beta_{LS}$
 How to know or find the weight $w_i$?:
 If known a priori, then use it.
 If unknown, then first fit the data using the OLS method and then figure out the weights $w_i$. At the second stage, fit the data using WLS with the esitmated weightts (Twostage procedure)
 Adding the qualitative variable Region to the model may cure the heterogeneity problem
 We may try a variance stabilizing transform as well through BoxCox transform.
Chapter 8. Correlated errors
 One of the standard regression assumptions: $\epsilon_i$ and $\epsilon_j$ for $i \not = j$ are independent.
 WHen the observations have a natural sequential order as in time series data or spatial data, the coreelation is referred to as autocorrelation (serial correlation, lagged correlation), that is the correlation between the current value and the past value for an example in time series data
 Why does the auto correlation occur?:
 Successive observations that are positively correlated: adjacent random errors and residuals tend to be similar in both temporal and spatial dimensions due to similar external conditions
 Omitted an important predictor of which adjacent values are correlated.
 Effects or consequences of ignoring the autocorrelation: What happens if we ignore the autocorrelation when it exists?:
 LSEs are still unbiased but not minimum variance, ie.e., lose efficiency
 $\hat \sigma^2$ may be an underestimate of $\sigma^2$, that is, $E(\hat \sigma^2) < \sigma^2$
 CIs and hypothesis tests may be invalid
8.1 Runs Test
 Definition 8.8.1 (Runs test):
 Suppose that we have $n_1$ “+”s and $n_2$ ““s. Under assumption of randomness of “+” and ““, every sequence would have the same probabilty with $\frac{1}{\binom{n_1 + n_2}{n_1}}$. A run is defined as the number of groups in the sequence.
 For instance, ++—++ has three runs and ++++ has 6 runs. To test randomness of a sequence, we can calculate the pvalue from all the permutations of the sequence of which the probability is equal.
 When $n_1$ = #+’s and $n_2$ = #‘s are large (>= 10), the expected value and the variance of the runs under the null hypothesis of randomness are:
 $\mu = \frac{2 n_1 n_2}{n_1 + n_2} + 1$
 $\sigma^2 = \frac{2 n_1 n_22 (2 n_1 n_2  n_1  n_2)}{(n_1 + n_2)^2 (n_1 + n_2  1)}$
 The large sample test statistic of randomness is:
 $ZZ = \frac{runs  \mu}{\sigma} \sim N(0, 1) \text{ under } H_0$
8.2 DurbinWatson test
 First order autocorrelation model:
 $\epsilon_t = p \epsilon_{t  1} + w_t \text{ where }  \rho  < 1, w_t \sim^{iid} N(0, \sigma^2)$
 where $\rho$ is the correlation coefficient between $\epsilon_t$ and $\epsilon_{t1}$
 Test $H_0: \rho = 0 \text{ against } H_1 : \rho > 0$
 Test statistic is:
 $d = \frac{\sum_{t=2}^n(e_t  e_{t1})^2}{\sum_{t=1}^n e_t^2}$
 The estimate of $\rho$ is:
 $\hat \rho = \frac{\sum_{t=2}^n e_t e_{t1}}{\sum_{t=1}^n e_t^2}$
 The DurbinWatson statistic is approximately:
 $d = 2(1  \hat \rho)$
8.3 Transformation to remove autocorrelation (Cochrane and Orcutt, 1949)
 Cochrane and Orcutt’s iterative procedure:
 Estimate $\beta_0$ and $\beta_1$ of $y_t = \beta_0 + \beta_1 x_t + \epsilon_t$ using OLS method.
 Obtain the residuals $e_t$ and estimate $\hat \rho = \frac{\sum_{2}^n e_t e_{t1}}{\sum_{1}^n e_t^2}$
 Calculate $y_t ^ * = y_t  \rho y_{t1}$ and $x_t ^ * = x_t  \rho x_{t1}$ and estimate $\beta_0 ^ *$ and $\beta_1 ^ *$. Obtain $\hat \beta_0 = \frac{\hat \beta_0 ^ *}{1  \hat \rho}$ and $\hat \beta_1 = \hat \beta_1 ^ *$
 Examine the autocorrelation by using the residuals obtained from 3: $\e_t = y_t  \hat \beta_0  \hat \beta_1 x_t$
 Repeat 24 until we remove the autocorrelation

Note that CochraneOrcutt’s procedure does not guarantee convergence
 Another approach, the textbook called it iterative method, is to minimize:
 $S(\rho, \beta_0, \beta_1) = \sum_{i = 2}^n(y_t  \rho y_{t1}  \beta_0 (1 \rho)  \beta_1 (x_t  \rho x_{t1}))^2$
 The results by this methoid usually do not significantly deviate from those by CochraneOrcutt procedure
8.4 Autocorrelation and missing predictors
 Autocorrelation may appear when a model is missing a significant predictor
8.5 Seasonality and dummy variables
 DurbinWatson test and runs test may not detect a certian type of correlations. One of which is seasonality
Chatper 9 Multicollinearity
 One of the standard regression assumptions is that the predictors are linearly independent
 We call orthogonal predictors if there is completely no linear relationship between the predictors, that is, the predictors are uncorrelated.
 Interpretation does not work any longer if there is multicollinearily ($\beta_j$: The increment of the response by increasing one unit of $X_j$ when all other variables are held fixed??)
 Multicollinearity produces unstable estimates of the regression parameters, that is, the standard error of the regression parameter betcomes large.
9.1
9.1.1 Multicollinearity may affect inferences in a regression model
9.1.2 Multicollinearity may affect forecasting
 Prediction may work with careful consideration of the predictors, that is , the values of the predictors are chosen by satisfying the correlation structures of the predictors, but the prediction under change of a single predictor whil holding others fixed may not be reasonable.
9.2 Detection of multicollinearity
 Definition 9.2.1 (Variance INflaction Factor (VIF)):
 Let $R_j^2$ be the coefficient of determination whe n$X_j$ is regressed on all other predictors. The variance inflation factor (VIF) for $X_j$ is defined by:
 $VIF_j = \frac{1}{1  R_j^2}$
 Let $R_j^2$ be the coefficient of determination whe n$X_j$ is regressed on all other predictors. The variance inflation factor (VIF) for $X_j$ is defined by:
 Note that $ 1 \le VIF_j < \infty$. When all predictors are orthogonal, $VIF_j = 1$. If $VIF_j > 10$, then the estimate of $\beta_j$ may be unstable. The variance of the LS estimate of $\beta_j$ is proportional to $VIF_j$ when $X_j$ is centered and scaled.
 Note $R_j^2 = 1  \frac{1}{VIF_j}$. Therefore, $VIF_j > 10$ is equivalent to $R_j^2 > 0.9$
 The average $VIF_j$ values, denoted by $\bar VIF$, is the ratio of the squared error of the LSEs to the squared errors the estimates when the predictors are orthogonal
 $Cov(\hat \beta) = \sigma^2 (X’ X)^{1}$. The kth diagonal term of $(X’X)^{1}$ is:
 $((X’X)^{1}){kk} = [x’_k x_k  x’_k X{(k)} (X’{(k)} X{(k)}) ^{1} X’{(k)} x_k]^{1} \\ = [x’_k ( I  P{X_{(k)}})]^{1} \\ = [S_{kk}(1  R_k^2)]^{1} \\ = \frac{VIF_k}{S_{kk}} \\ = \frac{VIF_k}{(n1)Var(x_k)}$
 Definition 9.2.2 (The condition indices and the condition number).:
 $Let \lambda_1 \ge \cdots \ge \lambda_p$ be the ordered eigenvalues of the correlation matrix of the predictors. THe jth condition index is defined by:
 $\kappa_j = \sqrt{\frac{\lambda_1}{\lambda_j}}$
 When it is large, there exists multicollinearity. The largest value $\kappa_p$ is called the condition number of the correlation matrix and there may be a strong multicollinearity if $\kappa_p \ge 15$
 $Let \lambda_1 \ge \cdots \ge \lambda_p$ be the ordered eigenvalues of the correlation matrix of the predictors. THe jth condition index is defined by:
Summary of Chapter 9
 Consequences of multicollinearity:
 Interpretation of the fitted model may not be reaonsable
 Predcition may not work without careful handling of collinearity
 The variance(s) of the LSE(s) may be large.
 How to detect multicollinearity?:
 Look at the scatter plot matrix
 Calculate VIFs and see if any VIF > 10.
 Calculate the condition indicies $\kappa_j = \sqrt{\lambda_1 / \lambda_j}$ and see if any condition index > 15
Chatper 10. Methods for data with multicollinearity
 We may delete some variables causing multicollinearity. => It may not be clear which variable(s) should be removed.
 Principal components regression followed by deleting some components with small variances
 Ridge regression and LASSO
10.1 Principal components
 Have X centered and scaled so that we assume $Var(X_j) = 1$ and $E(X_j) = 0$
 Transform $X_1, \cdots, X_p$ to p orthogonal predictors, $C_1, \cdots, C_p$:
 $C_j = v_{1j} X_1 + v_{2j} X_2 + \cdots + v_{pj} X_p$
 Let $V = (v_1; \cdots; v2) = \begin{pmatrix} v_{11}&v_{12}& \cdots& v_{1p} \\ v_{21}& v_{22} &\cdots& v_{2p} \\ \vdots &\vdots& \ddots& \vdots \\ v_{p1}& v_{p2}& \cdots& v_{pp} \end{pmatrix}$
 be the matrix whos column vectors are orthonormal eigenvectors of $\frac{1}{n1} X’ X$, that is , $(\frac{1}{n1} X’ X) v_i = \lambda_i v_i$ and $v_j ‘ v_i = \delta_{ij}$
 New data set may be written as:
 $C_{n \times p} = X_{n \times p} V_{p times p}$
 $\frac{1}{n1} C’C=V’(\frac{1}{n1} X’X) V \\ = \Lambda = diag(\lambda_1, \cdots, \lambda_p)$
 that is, $C_j$ has the sample variance $\lambda_j$ and $C_i$ and $C_j$ for $i \not = j$ have zreo sample correlation coefficient.

If some eigenvalues are quite smaller than others or near zero, then multicollinearity exists. The principal components (PCs) with samll eigenvalues may give use the relationship between the predictors
 Summary of principal components:
 Diagonalize the covariance or correlation matrix
 Eigenvector => Coefficients of the principal component
 Eigenvalue => Variance of the principal component
10.2 Recovering the regression coefficients of the original variables
10.2.1 Recovering LSEs from the fit using the centered or/and scaled data
 Let $Y= \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon ‘$, that is $Y = \beta_0 1 + X \beta + \epsilon’$.
 Assume that $W$ and $Z = (Z_1, \cdots, Z_p)$ are centered and scaled variables of $Y$ and $X = (X_1, \cdots, X_p)$, respectively.
 Note that the LSE of the intercept for the centered variables iz zero since:
 $W = \theta_0 1 + Z \theta + \epsilon$
 $\begin{pmatrix} \hat \theta_0 \\ \hat \theta \end{pmatrix} = [\begin{pmatrix} 1’ \\ Z’ \end{pmatrix} \begin{pmatrix} 1 : Z \end{pmatrix}]^{1} \begin{pmatrix} 1’ \\ Z’ \end{pmatrix} W \\ = \begin{pmatrix} n &0 \\ 0 &Z’Z\end{pmatrix} ^{1} \begin{pmatrix} 0 \\ Z’W \end{pmatrix} \\ = \begin{pmatrix} \frac{1}{n} &0 \\ 0& [Z’Z]^{1} \end{pmatrix} \begin{pmatrix} 0 \\ Z’W \end{pmatrix} \\ = \begin{pmatrix} 0 \\ (Z’Z)^{1}Z’W \end{pmatrix}$
 Hence we do not need $\theta_0$ for the centered data:
 $W=Z\theta + \epsilon$
 $\frac{Y  \bar y 1}{s_y} = \theta_1 \frac{X_1  \bar x_1 1}{s_{x_1}} + \cdots + \theta_p \frac{X_1  \bar x_p 1} {s_{x_p}} + \epsilon$
 $Y = (\bar y  \theta_1 \frac{s_y}{s_{x_1}} \bar x_1  \cdots  \theta_p \frac{s_y}{s_{x_p}} \bar x_p) 1 + \theta_1 \frac{s_y}{s_{x_1}} X_1 + \cdots + \theta_p \frac{s_y}{s_{x_p}} X_p + s_y \epsilon$
 Compare the above to:
 $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon’$
 We conclude that:
 $ \beta_i = \theta_i \frac{s_y}{s_{x_i}} \text{ for } i = 1, \cdots, p$
 $ \beta_0 = \bar y  \beta_1 \bar x_1,  \cdots  \beta_p \bar x_p$
10.2.2 Recovering the LSEs from the fit using the principal components
 Let $(Z’Z)V = (n1)V \Lambda$, that is, $V’(Z’Z)V = (n1)\Lambda = (n1)diag(\lambda_1, \cdots, \lambda_p)$ where $\lambda_1 \ge \cdots \ge \lambda_p$ and $VV’ = V’V =I$
 Define $C=ZV$ The jth vector $C_j = Z_1 v_{1j} + \cdots + Z_p v_{pj}$ that is called the jth principal component.
 $W = Z\theta \epsilon$
 $W = (ZV)(V’ \theta) + \epsilon$
 $W = C \alpha + \epsilon$
 The LSE of $\alpha$ is:
 $\hat \alpha = (C’C)^{1} C’ W \\ = \frac{1}{n1} \Lambda^{1} C’ W$
 $Cov(\hat \alpha) = \frac{1}{(n1)^2} \Lambda ^{1} C’ Cov(W) C \Lambda^{1} \\ = \frac{1}{(n1)^2} \Lambda^{1} C’ \frac{\sigma^2}{s_y^2} IC\Lambda ^{1} \\ = \frac{\sigma^2}{(n1)s_y^2} \Lambda ^{1}$
 that means the variance of $\hat \alpha_i = \frac{\sigma^2}{(n1)s_y^2} \lambda_i$ and $\hat \alpha_i$ and $\hat \alpha_j$ are note correlated for $i \not = j$. This implies that is $\lambda_i$ is small, then the LSE of $\alpha_i$ has high uncertainty. Since $V’ \theta = \alpha$,:
 $\hat \theta = V \hat \alpha$
 $Cov(\hat \theta) = \frac{\sigma^2}{(n1) s_y^2} V \Lambda^{1} V’$
Summary of recovering regression coefficients
 Let $\hat \alpha$ be the regression coefficients vector from the principal components.
 Let $\hat \theta$ be the regression coefficients vector from centered and/or scaled data
 Let $\hat \beta$ be the regression coefficients vector from the original data
 $\hat \theta = V \hat \alpha, \text{ that is }, \hat \theta_i = \sum_{j=1}^p v_{ij} \hat \alpha_j$
 $\hat \beta_i = \hat \theta_i \frac{s_y}{s_{x_i}} \text{ for } i = 1, \cdots, p$

$\hat \beta_0 = \bar  \hat \beta_1 \bar x_1  \cdots  \hat \beta_p \bar x_p$
 Standard errors of the estimates: The variance of the regression coefficients can be written as:
 $Cov(\hat \alpha) = \frac{\sigma^2}{(n1) s_y^2} \Lambda^{1} = \frac{\sigma^2}{(n1) s_y^2} diag(\frac{1}{\lambda_1}, \cdots, \frac{1}{\lambda_p})$
 $Cov(\hat \theta) = V Cov(\hat \alpha) V’ = \frac{\sigma^2}{(n1) s_y^2} V \Lambda^{1} V’$
 $Var(\hat \theta_i) = \frac{\sigma^2}{(n1) s_y^2} \sum_{j=1}^p \frac{v_{ij}^2}{\lambda_j}$
 $\hat {Var} (\hat \beta_i) = \frac{\hat \sigma^2}{(n1) s_{x_i}^2} \sum_{j=1}^p \frac{v_{ij}^2}{\lambda_j}$
 $s.e.(\hat \beta_i) = \frac{\hat \sigma}{\sqrt{n1} s_{x_i}} \sqrt{\sum_{j=1}^p \frac{v_{ij}^2}{\lambda_j}}$
10.3 Principal component regression (Dimension reduction)
 Center and/or scale the data
 Calculate the principal components of the sample variancecovariance matrix or the sample correlation matrix
 Select the number of principal components
 Fit the data using the selected principal components
 Recover the estimates of the regression coefficients
 Note that the number of principal components may be selected by the crossvalidation (CV) method
10.4 Ridge regression
 Suppose that the data set is centered and scaled.
 The LSE of the regression coefficients in matrix form can be written as:
 $\hat \theta = (X’ X)^{1} X’ y$
 Multicollinearity makes $X’X$ (almost) singular => LSEs are unstable
 Hoerl and Kennard (1970) proposed the ridge regression:
 $\hat \theta ^ *(\lambda) = (X’X + \lambda I) ^{1} X’y$
 The ridge estimator $\hat \theta^*(\lambda)$ is to minimize:
 $(y  X \theta)’(y  X \theta) + \lambda \sum_{j=1}^p \theta_j ^2$
 Equaivalently, minimize:
 $(y  X \theta)’ (y  X \theta) \text{ given } \sum_{j=1}^p \theta_j^2 \le C$
 $\lambda$ = tuning parameter to be selected.
 $\lambda = 0$ => OLS
 $\lambda = \infty$ => All estimates = 0
 How to select the tuning parameter $\lambda$?:
 HKB method : Hoerl, Kennard, and Baldwin (1975):
 $\lambda = \frac{p \hat \sigma^2 (0)}{\sum_{j=1}^p \hat \theta_j (0)}$
 Iterative method: Let $\lambda_0$ be the estimate of HKB:
 $\lambda_i = \frac{p \hat \sigma^2(0)}{\sum_{j=1}^p \hat \theta_j (\lambda_{i1})}$
 Ridge trace: Use the plot of the ridge estiamtes over $\lambda$. Select $\lambda$ such that the stimates are stable with small $SSE(\lambda)$ and $VIF(\lambda)$s:
 $SSE(\lambda) = (Y  X \hat \theta ^ * (\lambda))’ (Y  X \hat \theta ^ * (\lambda))$
 $VIF_j(\lambda) = ((X’X + \lambda I)^{1} X’ X (X’ X + \lambda I) ^{1})_{jj}$
 Cross validation (CV) method: Find $\lambda$ minimizing the cross validation prediction error
 HKB method : Hoerl, Kennard, and Baldwin (1975):
 Properties of the ridge estimator:
 $E(\hat \theta ^ * (\lambda)) = (X’ X + \lambda I)^{1} X’ X \theta$
 $Cov(\hat \theta ^ * (\lambda)) = (X’ X + \lambda I) ^{1} X’ X( X’ X + \lambda I) ^{1} \sigma^2$
 $MSE(\lambda) = E[(\hat \theta ^ * (\lambda)  \theta) ‘ (\hat \theta ^ * (\lambda)  \theta)] \\ = \sigma^2 \sum_{j=1}^p \frac{\lambda_j}{(\lambda_j + \lambda)^2} + \lambda^2 \theta ‘ (X’ X + \lambda I)^{2} \theta \\ = Variance + (Bias)^2$
10.5 Least Absolute Shrinkage and Selection Operator (LASSO)
 Tibshirani, R. (1996) Regression shrinkabge and selection via the lasso. Journal of the Royal Statistical Society Series B
 $L^1$ regularization:
 $(y  X \theta)’ (y  X \theta) + \lambda \sum_{j=1}^p  \theta_j $
 Equivalently, minimize:
 $(y  X \theta)’(y  X \theta) \text{ given } \sum_{j=1}^p  \theta_j  \le C$
 As $\lambda$ gets large (equivalently, C gets small), some of estimates shrink to zero. => It can be used for variable or model selection
 Easier interpretation than the ridge regression.
 The tuning parameter $\lambda$ or C can be estimated by CV method.
Chapter 11 Variable selections
11.1 Why do we need variable selections?
 So far we assume that the predictors are predetermined to be included in our model.
 In practice, we do not know which predictors should be included in the model.
 We do not know the functional relationship as well
 Selecting variables and the functional relationship must be simultaneously considered.
11.2 Effects of variable selections
 We have a resposne Y and q predictors $X_1, \cdots, X_q$
 The linear regression equation including all available predictors can be written as:
 $y_i = \beta_0 + \sum_{j=1}^q \beta_j x_{ij} + \epsilon_i$
 Let $\hat \beta_j ^ *$ be the LSE of the model including all available predictors.
 We need to statistically decide which predictors should be retained in the model and which predictors should be removed from the model.
 If we remove $X_{p+1}, \cdots , X_q$ from our model, then we have a smaller model:
 $y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \epsilon_i$
 Let $\hat \beta_j$ be the LSE of the model including p predictors by setting $\beta_{p+1} = \cdots = \beta_q = 0$
 What happens if the full model is correct and we removed some predictors?:
 $Var(\hat \beta_j ^ *) \ge Var(\hat \beta_j)$
 $0 = Bias^2(\hat \beta_j ^*) \le Bias ^2(\hat \beta_j)$
 $MSE(\hat \beta_j ^ *) = Var (\hat \beta_j ^ *) + bias ^2 (\hat \beta_j ^ *)$
 Variance dec, Bias inc
 This is called “BiasVariance tradeoff”
 Conclusions:
 Deleting variables may give us the smaller MSE esitmates than the unbiased LS estimates under the true model including all predictors.
 Including extraneous variables results in loss of precision of the estimates.
Appendix: Effects of Incorrect Model Specifications
 $\hat \beta_p$ is a biased estimate of $\beta_p$ unless $\beta_r = 0$ or $X_p’X_r = 0$
 $Cov(\hat \beta_p ^ *)  Cov(\hat \beta_p) \ge 0$, that is , variances of the least squares estimates of regression coefficients obtained from the full model are larger than the corresponding variances of the estimates obtained from the subset model. In other words, the deletion of variables always results in smaller variances for the estimates of the regression coefficients of the remaining variables.
 If $Cov(\hat \beta_r ^ )  \beta_r \beta_r ‘ \ge 0$, then $MSE (\hat \beta_p ^)  MSE(\hat \beta_p) \ge 0$. This means that the least squares estimates of regression coefficients obtained from the subset model have smaller mean square error than estimates obtained fro mthe full model when the variables deleted have regression coefficients that are smaller than the standard deviation of the esitmates of the coefficients.
 $\hat \sigma_p ^2$ is generally biased upward as an estimate of $\sigma^2$
11.3 Practical issues in variable selections
 When a dataset with k predictors is collected, the number of all possible linear regression models equals $2^k$ => Exhaustive search may not be feasible in practice.
 When we compare two or more models with the same number of predictors, we may use $R^2$ to select better model. What if the models do not have the same number of predictors ? => We need the statistical criteria for variable selections.
11.4 Forward, backward, stepwise selection
 Classical approach based on the pvalues in variable selections or model selections: sequentially include or remove a variable based on the pvalue of the F test discussed in earlier chapters.:
 What relationship between the probability that we select the true model and the level of significance? : No clear theoretical undertanding.
 However, this approach is still widely used in many areas and most of statistical software provide this.
 Suppose that we have k possible predictors: $X_1, \cdots, X_k$. Assume there is no serious multicollinearity between predictors.:
FS (Forward Selection)
 Preselect level of significance $\alpha_{in}$
 Start with the samllest model $y_i = \beta_0 + \epsilon_i$, denoted by $M_0$
 Find the model having the smallest pvalue among k models: $y_i = \beta_0 + \beta_j x_{ij} + \epsilon$. If the pvalue of the model is less than or equal to $\alpha_{in}$, then include the variable and the model is denoted by $M_1$. Otherwise, stop the procedure.
 Continue this procedure until there is no variable has the smaller pvalue than $\alpha_in$ or it reaches the model including all variables.
BE (Backward Elimination)
 Predelect level of signifiance $\alpha_{out}$
 Start with the biggest model: $y_i = \beta_0 + \sum_{j=1}^k \beta_j x_{ij} + \epsilon_i$
 If the largest pvalue of $X_j$ is greater than or equal to $\alpha_{out}$, then eliminate it.
 Refit the data with $k1$ variables
 Continue the procedures until no predictor has larger pvalue than $\alpha_{out}$ or it reaches $y_i = \beta_0 + \epsilon_i$
Stepwise selection
 Preselect $\alpha_{in}$ and $\alpha_{out}$
 At each step of the FS method, check whether a variable has the pvalue greater than or equal to $\alpha_{out}$, if so, eliminate the variable.
 Continue these procedure until no variable has smaller pvalue less than or equal to $\alpha_{in}$ and greater than or equal to $\alpha_{out}$ or it reaches the full model.
11.5 Best subset selection (regression)
 For each number of predictors, select the best model based on the coefficient of determination $R^2$
 We obtain k best models
 If necessary, we select the best model among the k models based on one or some criteria.
 It is an exhaustive search looking at all possible regression models
 It cannot be used if the number of predictors k is too large.
11.6 Criteria
 MSE(=RMS) : Mean Square Errors or Residual Mean Squares of a model with p parameters is defined by:
 $MSE_p = \frac{SSE_p}{np}$
 We select a model with smallest $MSE_p$
 $R_{adj}^2$ = Adjusted $R^2$ is defined by:
 $R_{adj, p}^2 = 1  \frac{SSE_p / (np)}{SST/(n1)} = 1  \frac{MSE_p}{SST/(n1)}$
 Hence, maximizing $R_{adj,p}^2$ is equivalent to minimizing $MSE_p$
 Mallows $C_p$ : Try to minimize the standardized total MSE of the prediction, that is,:
 $\frac{\sum_{i=1}^n MSE_p(\hat y_i)}{\sigma^2} = \frac{\sum_{i=1}^n (\hat y  E y_i)^2}{\sigma^2}$
 Mallows (1973) estimated it as:
 $C_p = \frac{SSE_p}{\hat \sigma^2} + 2p  n$
 where $\hat \sigma^2$ is the MSE of the model including all possible predictors. If the model with p predictors is correct, then $E(C_p) = p$.
 In parctice,:
 Select a model of which $C_p$ is close to p.
 Usually, there are many models of which $C_p \approx p$, select the model with smallest $C_p$ as well.
 Disadvantage: The estimate $\hat \sigma ^2$ affects $C_p$ and the variable selection based on $C_p$ may not be reliable.
 PRESS: It is the predction sum of squares:
 $PRESS_p = \sum_{i=1}^n (y_i  \hat y_{(i)})^2 = \sum_{i=1}^n (\frac{e_i}{1  h_{ii}})^2$
 where $\hat y_{(i)}$ is the predicted value of the ith observation based on the regression model with p variables excluding ith observation.
 It is called a leavonout crossvalidation (LOOCV)
 For linear regression model, this statistic has a closed form, that is, it is very simple to calculate
 AIC: Akaike information criterion (AIC) derivated this to correct the bias term of the loglikelihood function in very general situation.:
 $AIC_p = 2 log (Likelihood) + 2p \\ = n log(\frac{SSE_p}{n}) + 2p$
 AIC tends to overfit <=> AIC commonly selects a bigger model than the true model.
 AIC or modified versions of AIC are widely used for the purpose of prediction <= Overfit is less problematic in prediction.
 AIC is not consistent, that is, $lim_{n>\infty} P(\text{select a true model}) \not = 1$
 BIC: Schwarz(1978) derived Bayesian information criterion (BIC) under Bayesian framework of the problem:
 $BIC_p = 2 log (Likelihood) + plog n \\ = n log (\frac{SSE_p}{n}) + p log n$
 BIC tends to select a smaller model than AIC <= BIC penalizes more than AIC for $log n \ge 2$
 BIC is consistent, that is , $lim_{n>\infty} P(\text{select the true model}) = 1$ if the model space includes the true model. => It provides very nice theoretical justification ubt in practice we do not know our modelspace includes the true model.
 BIC is commonly used when our model selection is the purpose of explanation and interpretation.
 Crossvalidation(CV): Recently, as the computer technology has been rapidly developed, this method is widely applied once the sample size is not so small.:
 PRESS is a kind of the crossvalidation methods.
 Randonly partition the dataset into (usually) 5 or 10 almost equal sizes subsets.
 For instance, suppose that we partition the dataset into 10 subsets. Use 9 subsets as the traning set to fit a model and estimate the prediction erros using the subset that is not used to fit the model.
 We will have 10 fitted models and 10 estimates of the prediction erros. => Take average of the prediction erros.
 For each model, we estimate the average of the prediction erros.
 Choose the model having the smallest average prediction errors.
11.8 Variable selections with multicollinear data
 We may eliminate some predictors to remove th multicollinearity
 We may apply Ridge Regression:
 Eliminate variables whose coefficients are stable but small. Since ridge regression is applied to standardized data, the magnitude of the various coefficients are directly comparable.
 Eliminate variables with unstable coefficients that do not hold their predicting power, that is, untable coefficients that tend to zero.
 Eliminate one or more variables with unstable coefficients. The variables remaining from the original set, say p in number, are used to form the regression equation.
 We may apply LASSO with crossvalidation (CV) method to determine the tuning parameter.