1. [15 Marks] Repeat the advertisement exercise with the following changes.
(a) The data is generated via the following data generation mechanism. Xi ∼ U(0, 1) for
i ∈ {1, 2, 3}; here U(0, 1) stands for the continuous uniform distribution over the [0, 1] set.
However, we require that X1 + X2 + X3 = 1, that is, the explanatory variables stand for
a percentage of the budget.
(b) In addition, the model for Y is as follows:
Y = 0.5X1 + 3X2 + 5X3 + 5X2X3 + 2X1X2X3 + W, (1)
where W ∼ U(0, 1).
Similar to the original example, generate train and test sets of size N = 1000. Fit the linear regression
and the random forest models to the data. For the linear regression, make an inference
about the coefficients, specifically, comment about the contributions of different advertisement
types to sales. Use the linear model and the RF (with 500 trees) to make predictions on
the test set, and report the corresponding mean squared errors.
When constructing the datasets, use seeds 1 and 2 for the train and the test sets,
respectively.
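The data generation mechanism in (a) and (b) can be sketched as follows. This is a minimal sketch using only the Python standard library, not a reference solution. It rests on one loud assumption: three exactly U(0, 1) variables cannot also sum to 1 (their means would have to be 1/3 each), so the sketch draws three independent uniforms and rescales them by their sum. The function name `generate_dataset` is hypothetical.

```python
import random

def generate_dataset(n, seed):
    """Return n rows (x1, x2, x3, y) under the assumptions stated above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # ASSUMPTION: draw U(0, 1) values, then rescale so the shares sum to 1.
        u = [rng.random() for _ in range(3)]
        total = sum(u)
        x1, x2, x3 = (v / total for v in u)
        w = rng.random()  # noise term W ~ U(0, 1)
        # Response from equation (1).
        y = 0.5 * x1 + 3 * x2 + 5 * x3 + 5 * x2 * x3 + 2 * x1 * x2 * x3 + w
        rows.append((x1, x2, x3, y))
    return rows

train = generate_dataset(1000, seed=1)  # seed 1 for the train set
test = generate_dataset(1000, seed=2)   # seed 2 for the test set
print(len(train), len(test))
```

Because the three shares sum to one, a linear model that includes an intercept together with all three predictors is perfectly collinear. This is worth keeping in mind when interpreting the fitted coefficients.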
2. [10 Marks] Consider the following variant of the cross-validation procedure.
(i) Using the available data, find a subset of “good” predictors that show correlation with
the response variable.
(ii) Using these predictors, construct a model (for regression or classification).
(iii) Use cross-validation to estimate the model prediction error.
Is this a good method? Do you expect to obtain the true prediction error? Explain your
answer.
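To experiment with the procedure above, one can run it on data where the predictors carry no information about the response, so that an honest error estimate should sit near 50%. The sketch below is an illustration under stated assumptions, not a reference answer: the data are pure Gaussian noise with random binary labels, the classifier is a simple nearest-centroid rule, and all function names are hypothetical. It compares selecting predictors once on all the data (steps (i) to (iii) as written) with redoing the selection inside each training fold.

```python
import random

def correlation(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def top_features(X, y, rows, m):
    """Indices of the m columns most correlated with y, using only `rows`."""
    ys = [y[i] for i in rows]
    def score(j):
        return abs(correlation([X[i][j] for i in rows], ys))
    return sorted(range(len(X[0])), key=score, reverse=True)[:m]

def predict(X, y, train, feats, i):
    """Nearest-centroid prediction for row i using the chosen features."""
    best_label, best_dist = None, None
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        if not members:
            continue
        dist = 0.0
        for j in feats:
            centre = sum(X[r][j] for r in members) / len(members)
            dist += (X[i][j] - centre) ** 2
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_error(X, y, k, m, select_inside):
    """k-fold misclassification rate, selecting features inside or outside the folds."""
    rows = list(range(len(y)))
    fixed = None if select_inside else top_features(X, y, rows, m)
    wrong = 0
    for f in range(k):
        held_out = rows[f::k]
        held = set(held_out)
        train = [i for i in rows if i not in held]
        feats = top_features(X, y, train, m) if select_inside else fixed
        wrong += sum(predict(X, y, train, feats, i) != y[i] for i in held_out)
    return wrong / len(y)

rng = random.Random(0)
n, p = 50, 500  # few samples, many pure-noise predictors
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.randint(0, 1) for _ in range(n)]
biased = cv_error(X, y, k=5, m=10, select_inside=False)
honest = cv_error(X, y, k=5, m=10, select_inside=True)
print("selection on all data:", biased, "selection inside folds:", honest)
```

Comparing the two printed numbers gives a concrete basis for the explanation the question asks for.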
3. [5 Marks] Suppose that we observe X1, . . . , Xn ∼ F. We model F as a normal distribution
with mean µ and standard deviation σ. For this problem, determine the hypothesis class
H = {f(x, θ); θ ∈ Θ},
and state explicitly what θ and Θ are.
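One possible way to write such a class is sketched below. It assumes that f(x, θ) is taken to be the normal probability density and that σ is strictly positive; other parametrizations (for example in terms of σ²) are equally valid.

```latex
\mathcal{H} = \left\{ f(x,\theta) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \;;\; \theta \in \Theta \right\},
\qquad \theta = (\mu, \sigma),
\qquad \Theta = \mathbb{R} \times (0, \infty).
```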
4. [15 Marks] Let H be a class of binary classifiers over a set Z. Let D be an unknown distribution
over X , and let g be a target hypothesis in H. Show that the expected value of LossT (g)
over the choice of T equals LossD(g), namely,
ET [LossT (g)] = LossD(g).
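The argument usually rests on linearity of expectation. The sketch below is only an outline under assumptions that the question does not spell out: T = (z_1, ..., z_n) is an i.i.d. sample from D, g is fixed independently of T, Loss_T is the average of a per-example error over T, and Loss_D is the expectation of the same error under D. The symbol err is a hypothetical name for the per-example error.

```latex
\mathbb{E}_T\!\left[ \mathrm{Loss}_T(g) \right]
  = \mathbb{E}_T\!\left[ \frac{1}{n} \sum_{i=1}^{n} \mathrm{err}(g, z_i) \right]
  = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim D}\!\left[ \mathrm{err}(g, z_i) \right]
  = \frac{1}{n} \cdot n \cdot \mathrm{Loss}_D(g)
  = \mathrm{Loss}_D(g).
```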
5. [15 Marks (see details below)] Consider the following dataset.
Now, suppose that we would like to consider two models.
Model1 : y = β1x1 + ε,
and
Model2 : y = β0 + β1x1 + ε,
where ε ∼ N(0, 1). That is, we consider two linear models with and without the intercept.
(a) [5 Marks] Fit these models to the data and write the corresponding coefficients. Namely,
fill the following table:
Model     β0    β1
Model1    0
Model2
(b) [5 Marks] Consider the squared error loss, the absolute error loss, and the L1.5 loss. Find
the average loss for each model. Namely, fill the following table:
Model     squared error loss    absolute error loss    L1.5 loss
Model1
Model2
(c) [5 Marks] Draw a conclusion from the obtained results.
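The computations in (a) and (b) can be sketched with closed-form least squares. This is a minimal sketch, not a reference solution. The numbers in `x` and `y` are made-up placeholders, since the assignment's dataset is not reproduced here, and the L1.5 loss is assumed to mean |y - ŷ|^1.5. All function names are hypothetical.

```python
def fit_no_intercept(x, y):
    """Model1: least squares through the origin, beta1 = sum(x*y) / sum(x^2)."""
    beta1 = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return 0.0, beta1

def fit_with_intercept(x, y):
    """Model2: ordinary least squares with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - beta1 * mx, beta1

def average_loss(x, y, beta0, beta1, power):
    """Mean of |y - (beta0 + beta1 * x)| ** power (power = 2, 1, or 1.5)."""
    return sum(abs(b - (beta0 + beta1 * a)) ** power for a, b in zip(x, y)) / len(x)

# PLACEHOLDER data, not the dataset from the assignment.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.5, 5.5, 6.0]

for name, fit in (("Model1", fit_no_intercept), ("Model2", fit_with_intercept)):
    b0, b1 = fit(x, y)
    losses = [average_loss(x, y, b0, b1, p) for p in (2, 1, 1.5)]
    print(name, b0, b1, losses)
```

Both models are fitted by minimizing squared error, so the squared error column reflects the fitting criterion itself, while the other two columns evaluate the same fitted line under a different loss.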
6. [30 Marks (see details below)] Consider the Hitters dataset (given in Hitters.csv). Our
objective is to predict a hitter’s salary via linear models.
(a) [5 Marks] Load the dataset and replace all categorical values with numbers. (You can
use the LabelEncoder object in Python.)
(b) [5 Marks] Generally, it is better to use OneHotEncoder when dealing with categorical
variables. Justify the use of LabelEncoder in (a).
(c) [20 Marks] Fit linear regression and report the 10-fold cross-validation mean squared error.
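The encoding step in (a) and the cross-validation step in (c) can be sketched without any third-party library. This is an illustration under assumptions, not the intended solution: the data below are a synthetic stand-in for Hitters.csv (the column names `league`, `hits`, and `salary` and the coefficients are invented), `label_encode` only mimics the sorted-category-to-integer mapping that a label encoder performs, and the regression is solved through the normal equations. In practice one would more likely use scikit-learn's `LabelEncoder` together with its cross-validation utilities.

```python
import random

def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Least squares with an intercept via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def kfold_mse(X, y, k=10, seed=0):
    """Average of the k held-out mean squared errors."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    fold_mses = []
    for f in range(k):
        held_out = idx[f::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        errs = [(y[i] - predict(beta, X[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

# SYNTHETIC stand-in for Hitters.csv (hypothetical columns and coefficients).
rng = random.Random(0)
league = [rng.choice(["A", "N"]) for _ in range(200)]
hits = [rng.uniform(0, 200) for _ in range(200)]
codes, classes = label_encode(league)
salary = [100 + 3 * h + 50 * c + rng.gauss(0, 10) for h, c in zip(hits, codes)]
X = [[h, c] for h, c in zip(hits, codes)]
cv_mse = kfold_mse(X, salary, k=10)
print(classes, cv_mse)
```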
Suppose that a = 1, b = 2, and c = 3, and write a Crude Monte Carlo algorithm for the
estimation of ℓ using a sample size of N = 10000. Report the 95% confidence interval. Compare
the obtained estimate with the true value of ℓ as given in (2).
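The general shape of a Crude Monte Carlo estimator with a 95% confidence interval can be sketched as follows. The definition of ℓ in equation (2) is not available in this text, so the sketch uses a stand-in target chosen purely for illustration: ℓ = E[U²] with U ∼ U(0, 1), whose true value is 1/3. The function name `crude_monte_carlo` and the integrand are assumptions and must be replaced by the actual quantity from (2).

```python
import math
import random

def crude_monte_carlo(h, n, seed=0):
    """Estimate E[h(U)], U ~ U(0, 1), with a normal-approximation 95% CI."""
    rng = random.Random(seed)
    samples = [h(rng.random()) for _ in range(n)]
    mean = sum(samples) / n
    # Unbiased sample variance of the individual evaluations.
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean, (mean - half_width, mean + half_width)

# STAND-IN integrand (not the one from equation (2)): true value is 1/3.
estimate, (lower, upper) = crude_monte_carlo(lambda u: u * u, n=10000)
print(estimate, lower, upper)
```

The interval half-width shrinks at rate 1/sqrt(N), so comparing the estimate with the true value amounts to checking whether the true value falls inside the reported interval.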