2024.08.05

Assessment 1 (Due by 23:59, 9th August 2024)

Question 1 [50 marks]

We observe data $\{(y_i, x_i),\ i = 1, 2, \ldots, n\}$ from the linear regression model

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \quad i = 1, 2, \ldots, n, \tag{1}$$

where $x_i = (x_{1i}, x_{2i}, \ldots, x_{pi})^T$ collects the $p$ covariates.

(a). [10 marks] Generate a set of sample observations $(y_i, x_i)$, $i = 1, 2, \ldots, n$, in the statistical software R by following the data generating process (DGP) below.

1. Parameters. Set $p = 20$, $n = 26$, $\beta_0 = 1$, $\beta_1 = \beta_2 = \cdots = \beta_{10} = 0.8$, and $\beta_{11} = \beta_{12} = \cdots = \beta_{20} = 1.3$.

2. Covariates. All $p$ covariates (i.e. predictors) follow a normal distribution with mean 0.4 and variance 1.1.

3. Error Component. The error component $\varepsilon_i$ follows a standard normal distribution.
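The three steps of the DGP can be sketched in R as follows. This is a minimal sketch, not a model answer: the seed, the object names (`X`, `y`, `dat`), and the choice to draw the covariates independently of each other are assumptions, since the text only gives their marginal distribution. Note that `rnorm` takes a standard deviation, so the variance 1.1 enters as `sqrt(1.1)`.

```r
set.seed(1)  # assumed seed, used only so the sketch is reproducible

# Step 1: parameters
p <- 20
n <- 26
beta0 <- 1
beta <- c(rep(0.8, 10), rep(1.3, 10))

# Step 2: covariates, N(mean = 0.4, variance = 1.1).
# Assumption: covariates are drawn independently of each other.
X <- matrix(rnorm(n * p, mean = 0.4, sd = sqrt(1.1)), nrow = n, ncol = p)
colnames(X) <- paste0("x", 1:p)

# Step 3: standard normal errors, then the response from model (1)
eps <- rnorm(n)
y <- as.vector(beta0 + X %*% beta + eps)

# Collect everything in one data frame with columns y, x1, ..., x20
dat <- data.frame(y = y, X)
str(dat)
```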

(b). [15 marks] With the generated data in (a), estimate the regression coefficients $\beta_0$ and $\beta_k$, $k = 1, 2, \ldots, p$, by ordinary least squares (OLS) in R. Design an experiment (i.e. a simulation) to evaluate the prediction accuracy of this OLS estimator for the response variable $y$ on test data. Write out the procedure of the designed experiment and present the results in R.
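One possible design for such an experiment is sketched below: repeatedly draw a training set of size $n = 26$ and an independent test set from the same DGP, fit OLS with `lm`, and record the mean squared prediction error on the test set. The test-set size `n_test`, the number of replications `R`, the seed, and the helper `gen_data` are all assumptions made for illustration; other accuracy measures or designs (for example cross-validation on a single sample) are equally valid.

```r
set.seed(2)  # assumed seed

p <- 20
n <- 26
n_test <- 1000  # assumed test-set size
R <- 200        # assumed number of replications
beta0 <- 1
beta <- c(rep(0.8, 10), rep(1.3, 10))

# Hypothetical helper: draw n_obs observations from the DGP in part (a)
gen_data <- function(n_obs) {
  X <- matrix(rnorm(n_obs * p, mean = 0.4, sd = sqrt(1.1)), nrow = n_obs)
  colnames(X) <- paste0("x", 1:p)
  y <- as.vector(beta0 + X %*% beta + rnorm(n_obs))
  data.frame(y = y, X)
}

# One replication: fit OLS on fresh training data, evaluate on fresh test data
test_mse <- replicate(R, {
  train <- gen_data(n)
  test <- gen_data(n_test)
  fit <- lm(y ~ ., data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

# Average test MSE and its Monte Carlo standard error
mean(test_mse)
sd(test_mse) / sqrt(R)
```

Because only 26 observations are used to estimate 21 coefficients, the test error can be compared against the noise variance of 1, which is the smallest mean squared error any predictor could reach under this DGP.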

(c). [25 marks] With the generated data in (a), propose another estimation approach for the linear regression model that predicts more accurately than OLS. Implement the proposed approach in R and present the estimates of the linear coefficients. Further, explain why the proposed method is better than OLS in terms of prediction accuracy.
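Ridge regression is one candidate here, chosen for illustration only; the question does not prescribe a method, and lasso or principal component regression are other options. The sketch below implements ridge in base R through its closed form with an unpenalized intercept. The helper names (`gen_xy`, `ridge_fit`, `predict_lin`), the grid of penalties, and the seed are assumptions. With $\lambda = 0$ the estimator reduces to OLS, so the first row of the table is the OLS baseline.

```r
set.seed(3)  # assumed seed

p <- 20
n <- 26
n_test <- 1000  # assumed test-set size
beta0 <- 1
beta <- c(rep(0.8, 10), rep(1.3, 10))

# Hypothetical helper: draw data from the DGP in part (a)
gen_xy <- function(n_obs) {
  X <- matrix(rnorm(n_obs * p, mean = 0.4, sd = sqrt(1.1)), nrow = n_obs)
  list(X = X, y = as.vector(beta0 + X %*% beta + rnorm(n_obs)))
}

# Ridge with an unpenalized intercept: center X and y, then solve
# (Xc'Xc + lambda * I) b = Xc'yc and recover the intercept afterwards.
ridge_fit <- function(X, y, lambda) {
  x_bar <- colMeans(X)
  Xc <- sweep(X, 2, x_bar)
  b <- solve(crossprod(Xc) + lambda * diag(ncol(X)),
             crossprod(Xc, y - mean(y)))
  c(mean(y) - sum(x_bar * b), as.vector(b))  # (intercept, slopes)
}

predict_lin <- function(coefs, X) as.vector(coefs[1] + X %*% coefs[-1])

train <- gen_xy(n)
test <- gen_xy(n_test)

# Test MSE over an assumed grid of penalties; lambda = 0 is OLS
lambdas <- c(0, 0.1, 0.5, 1, 2, 5, 10)
mse <- sapply(lambdas, function(l) {
  mean((test$y - predict_lin(ridge_fit(train$X, train$y, l), test$X))^2)
})
data.frame(lambda = lambdas, test_mse = mse)
```

The argument for a method of this kind is the bias-variance trade-off: with $n = 26$ and $p = 20$ the OLS estimates have large variance, and shrinkage accepts some bias in exchange for lower variance. Whether a given $\lambda$ actually improves on OLS has to be checked empirically, and in a real answer $\lambda$ should be chosen by cross-validation on the training data rather than by looking at the test set as this sketch does.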

Question 2 [50 marks]

Consider two sets of sample observations $\{x_1, x_2, \ldots, x_n\}$ and $\{y_1, y_2, \ldots, y_m\}$ from normal distributions with population mean vectors $\mu$ and $\nu$, respectively. The population covariance matrices are both identity matrices. The dimensions of $\mu$ and $\nu$ are both equal to $p$. Statisticians are interested in the hypothesis test

$$H_0: \mu = \nu \quad \text{vs} \quad H_a: \mu \neq \nu. \tag{2}$$

A popular test statistic for this hypothesis testing problem is the Hotelling T square statistic

where $S_x$ and $S_y$ are the sample covariance matrices constructed from $\{x_1, x_2, \ldots, x_n\}$ and $\{y_1, y_2, \ldots, y_m\}$, respectively; $\bar{x}$ and $\bar{y}$ are the sample mean vectors estimating $\mu$ and $\nu$, respectively. The Hotelling $T^2$ statistic has the following asymptotic distribution

$$T^2 \to \chi^2_p, \quad \text{as } n, m \to \infty, \tag{4}$$

where $\chi^2_p$ is the chi-square distribution with $p$ degrees of freedom.

(a). [25 marks] Generate the two sets of sample observations in R by setting $p = 60$, $n = 80$, $m = 90$, and $\mu = \nu = (1, 1, \ldots, 1)^T$, and then calculate the value of the Hotelling $T^2$ statistic. Repeat this experiment $N = 200$ times and then plot the histogram of the statistic $T^2$.
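A sketch of this simulation is given below. The defining formula of the statistic is not reproduced in this text, so the code assumes the unpooled two-sample form $T^2 = (\bar{x} - \bar{y})^T (S_x / n + S_y / m)^{-1} (\bar{x} - \bar{y})$; if the assignment uses a pooled covariance instead, only the function `hotelling_t2` needs to change. The helper names, the seed, and the histogram settings are likewise assumptions.

```r
set.seed(4)  # assumed seed

p <- 60
n <- 80
m <- 90
N <- 200
mu <- rep(1, p)
nu <- rep(1, p)

# ASSUMED form of the statistic (unpooled covariance); the source text
# does not show the formula, so adapt this function if it differs.
hotelling_t2 <- function(x, y) {
  d <- colMeans(x) - colMeans(y)
  V <- cov(x) / nrow(x) + cov(y) / nrow(y)
  as.numeric(crossprod(d, solve(V, d)))
}

# Hypothetical helper: n_obs rows from N(mean_vec, I). With identity
# covariance the coordinates are independent N(mean, 1) draws.
r_sample <- function(n_obs, mean_vec) {
  k <- length(mean_vec)
  matrix(rnorm(n_obs * k), nrow = n_obs) +
    matrix(mean_vec, nrow = n_obs, ncol = k, byrow = TRUE)
}

# Repeat the experiment N times
t2_values <- replicate(N, hotelling_t2(r_sample(n, mu), r_sample(m, nu)))

# Histogram, with the chi-square(p) density from (4) overlaid for reference
hist(t2_values, breaks = 30, freq = FALSE,
     main = "Simulated Hotelling T^2 (N = 200)", xlab = "T^2")
curve(dchisq(x, df = p), add = TRUE, lty = 2)
```

Overlaying the $\chi^2_p$ density makes it easy to see how well the asymptotic result (4) describes the simulated values when $p = 60$ is not small relative to $n = 80$ and $m = 90$.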

(b). [25 marks] Apply the bootstrap method to estimate the variance of the Hotelling $T^2$ statistic in R when $p = 60$, $n = 80$, $m = 90$, and $\mu = \nu = (1, 1, \ldots, 1)^T$. Write down the details of the bootstrap procedure and present the bootstrap estimate. In addition, comment on the accuracy of the bootstrap estimate and give the reasons.
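A plain nonparametric bootstrap for this task could look like the sketch below: resample the rows of each sample with replacement, recompute $T^2$ on every resample, and take the sample variance of the replicates. As before, the unpooled form of the statistic is an assumption because the formula is not shown in this text; the number of bootstrap replicates `B`, the seed, and the object names are also assumptions.

```r
set.seed(5)  # assumed seed

p <- 60
n <- 80
m <- 90
B <- 500  # assumed number of bootstrap replicates

# ASSUMED form of the statistic (unpooled covariance); adapt if needed
hotelling_t2 <- function(x, y) {
  d <- colMeans(x) - colMeans(y)
  V <- cov(x) / nrow(x) + cov(y) / nrow(y)
  as.numeric(crossprod(d, solve(V, d)))
}

# One observed data set with mu = nu = (1, ..., 1)^T and identity covariance
x <- matrix(rnorm(n * p, mean = 1), nrow = n)
y <- matrix(rnorm(m * p, mean = 1), nrow = m)

# Bootstrap: resample rows within each sample, recompute the statistic
t2_boot <- replicate(B, {
  xb <- x[sample.int(n, n, replace = TRUE), , drop = FALSE]
  yb <- y[sample.int(m, m, replace = TRUE), , drop = FALSE]
  hotelling_t2(xb, yb)
})

boot_var <- var(t2_boot)
boot_var

# Reference value: the variance of a chi-square(p) variable is 2p
2 * p
```

When commenting on accuracy, two points are worth checking against the output. A resample of size $n$ contains on average only about 63% distinct rows, so the resampled covariance matrices carry fewer effective observations than the dimension $p = 60$ and their inverse becomes unstable. In addition, the resampled means differ even though $\mu = \nu$, so the bootstrap replicates are not drawn under $H_0$. Comparing `boot_var` with the Monte Carlo variance from part (a) and with $2p$ shows how much these effects matter.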

Note: This homework is to be submitted through Wattle in digital form only, as per ANU policy. The R code for every computational question must be supplied.