首页 > 其他分享 >MAST90083: Computational Statistics and Data Science

MAST90083: Computational Statistics and Data Science

时间:2024-09-09 17:37:28浏览次数:10  
标签:Statistics 17 Computational values time using model Data regression

School of Mathematics and Statistics

MAST90083: Computational Statistics and Data Science

Assignment 1

Due date: No later than 08:00am on Monday 9th September 2024

Weight: 20%

Question 1 Linear Regression

In this question, we will apply linear regression, ridge regression and Lasso to the R built-in dataset Hitters and compare their performance. Type ?Hitters to see the description of this data. Our goal is to predict the response variable salary using these three regression models. To complete this task, you’ll need to install two packages: ISLR and glmnet. (Note: When using random functions in this exercise, please set the random number generator with set.seed(10) to ensure reproducibility.)

1. Data preparation: Load the Hitters dataset from library(ISLR). Remove all those rows from Hitters dataset that have entry NA in the salary column. For a design matrix construction, include all variables in Hitters dataset except the salary and store them in variable X (matrix of covariates). Also, read the salary variable and store it in variable y (response).

2. Linear regression model: First, we consider the full model that includes all the po-tential variables and do the following tasks: (1) compute the least square estimates ˆβLS of β and the corresponding p-values manually using the sampling distribution property of the least squares estimator and check with the R built-in function lm(). From these p values, which variable (variables) can you consider significant (or meaningful) ? (2) Now we consider a smaller model using those variables that have p values smaller than 0.05 in (1). Compute the ˆβLS1 and the corresponding p-values for this smaller model. (3) Compare these two models using F-statistics to test whether the smaller model provides a better characterization of the linear relationship between X and y. First compute manually the F statistics and associated p-value and then check with the F statistics obtained from R built-in function anova(). What conclusions can you make from this comparison? Can you reject the hypothesis that the smaller model is correct?

3. Ridge regression: Generate a sequence of 100 values of λ between 105 and 10−2 and use the glmnet function. For ridge regression, we set α = 0 and do the following tasks: (1) compute estimates of the ridge regression coefficients for the 100 different λ values. Then, observe the set of coefficients for two extreme values of λ i.e. 105 and also for 10−2 . For which value of λ among these two, the ridge coefficient values are closer to zero? (2) Draw a plot of l2-norm of the ridge coefficient values (excluding the intercept’s coefficient value) against the logarithm of the λ values. From this plot, can you decide the optimal λ value between 105 and 10−2 , or is it better to use the mean square error (MSE) plot against the λ values? justify your response. (3) Use the cv.glmnet function to perform. ten fold cross validation (CV). First, we randomly pick half samples from x for all variables and also the corresponding samples from y to construct a training dataset (use set.seed(10) for random number generator). The rest of the samples are saved for the testing dataset. Using this training dataset, plot the cross validation results, and find the best λ (the one that results in smallest CV error) value and its corresponding MSERR value obtained using the testing dataset. You can use predict() function here. Now refit the ridge regression model on the full data set using the λ chosen by CV. Examine the ridge regression coefficients, are they all present, similar to the linear regression case?

4. Lasso: This time we set α = 1 and again (1) plot the cross validation results, and find the best λ value using the same training set used in ridge regression above and its corresponding MSELasso value 代 写MAST90083: Computational Statistics and Data Science using the same testing set used in ridge regression model above. (2) Now obtain the coefficients again using the best λ that we just selected using CV. Are all the coefficients selected again?

5. Here we consider a regularization technique that combines both the ridge regression penalty and lasso penalty, which can be written as 2/1−α||β||2 2 + α|β|1, 0 ≤ α ≤ 1 (you can see that when α = 1, it corresponds to the Lasso penalty and when α = 0, it corresponds to the ridge regression penalty).

Now, use this penalty to combine the benefits of ridge regression and Lasso by finding a best α. This still can be achieved using glmnet(). Type ?glmnet() to see the details of each argument in this function. Let α ∈ [0, 1] be a set of 11 different values obtained using seq(0,1,by=0.1). Then, (1) for each of this αi , i=1,...,11. apply the ten fold cross-validation to select the best λ values and compute the associated MSEi using the same testing set used in ridge regression and Lasso above. (2) Select the best α as the one that results in the smallest MSE and estimate the regression coefficients again using the best α and the corresponding best λ that where selected. (3) Are all the coefficients selected again? Compare this results with the coefficients obtained from ridge regression in 3 and Lasso in 4, what do you observe? (Hint: You can check whether your code is correct or not by checking MSE1 = MSERR and MSE11 = MSELasso)

Question 2

In this question, we will analyze the weather stations.csv dataset, which contains daily average pressure readings (measured in inches) from 17 weather stations located in the central part of Oklahoma State, USA. The data covers the period from May 2, 2017, to September 30, 2017, covering a total of 152 days. In other words, we have 17 time series, each representing daily average pressure over 152 days in 17 different weather stations. In addition to the pressure data, we also have geographic information for these 17 weather stations stored in the file geoinfo stations.csv. We are interested in three spatial variables: NLAT (North Latitude),ELON (East Longitude) and ELEV (Elevation in meters) for each weather station in this file.

1. At weather station j, j = 1, · · · , 17, the daily average pressure data at time t, t = 1, · · · , 152 is yj,t. For each time series, we consider the following model of order p

 

Provide a linear matrix form. for model (1) where the parameters are presented in a p×1 vector ϕ = (ϕ1, · · · , ϕp) T and a 3 × 1 vector β = (β1, β2, β3) T , derive the expressions of least square estimators of ϕj and βj . (Note a linear matrix form. for model includes the expression for observed time series yj , design matrices and the error term in vector form.)

2. First, only consider the centered daily pressure data and don’t include the geographic variables in the model. In this case the time series can be expressed using the following model

 

This also known as autoregression (AR) model with order p. Now we aim to use AIC and BIC to find the best order p in model (2) for each time series j, j = 1, · · · , 17. This means, for time series j, use the R built-in function AIC() and BIC() to compute the AIC and BIC values for different order p, say p=seq(1,15,by=1) then select the best order p using AIC and BIC. Repeat these steps for all the 17 time series. Draw two plots (one for AIC and the other for BIC) and each of them should have 17 curves. Each curve represent the AIC (or BIC) values against the order p for the time series j and provide your best order p for the 17 time series. (Hint: You can use arima() function in R to fit AR(p) model and get the log-likelihood for computing AIC and BIC)

3. Now we use the same best order p ⋆ obtained from 2 for each of the 17 time series in order to perform. the following analysis. From 2, we can get 17 pairs of coefficients (ϕj , ηˆj ), for the 17 time series, or 17 sets of parameters in total so that each time series is modeled as

 

It is aimed to have a single model for all stations time series. For this purpose, it is assumed that ˆηj depends on the spatial covariates, NLATj , ELONj and ELEVj . This assumption can be checked if it is reasonable by fitting the following linear regression model

ηˆ = γ0 + γ1NLAT + γ2ELON + γ3ELEV.

where ηˆ, NLAT, ELON, ELEV are a 17 × 1 vector and ηˆ obtained from 2. Perform. a t-test to determine whether all three spatial covariates should be included. Then, remove any insignificant covariate and use the F-test to assess whether the smaller model is better. Compare the results of the t-test and F-test. What conclusions can you make ?

 

标签:Statistics,17,Computational,values,time,using,model,Data,regression
From: https://www.cnblogs.com/qq--99515681/p/18404968

相关文章

  • SciTech-Mathmatics-Probability+Statistics-Population-Sampling-Types of Sampling
    TypesofSamplingMethods(WithExamples)BYZACHBOBBITTPOSTEDONSEPTEMBER24,2018Researchersareofteninterestedinansweringquestionsaboutpopulationslike:Whatistheaverageheightofacertainspeciesofplant?Whatistheaverageweightofa......
  • 初识:DynamicDatasource动态数据源
    目录主要作用具体例子其他Maven模块如何利用动态数据源模块就根据用户服务、订单服务、商品服务、DynamicDataSource服务,DynamicDataSource服务的application.yaml的配置信息,实现实现主从、读写分离、负载均衡、高可用性和故障切换主要作用支持多数据源切换:在微服务架......
  • SciTech-Mathmatics-Probability+Statistics-Population-Sampling of Region of Popul
    SciTech-Mathmatics-Probability+Statistics-Population-Sample:SampleProportionvs.SampleMean:TheDifferenceSampleProportionvs.SampleMean:TheDifferenceBYZACHBOBBITTPOSTEDONMAY5,2021TwotermsthatareoftenusedinstatisticsareSamplePr......
  • SciTech-Mathmatics-Probability+Statistics: Differences: Population VS Sampling(D
    SciTech-Mathmatics-Probability+Statistics:Differences:PopulationVSSampling(DistributionofParameters)Likelyhood(Assumption)VSProbability(ModelofSolidParameters)Likelihoodvs.Probability:What'stheDifference?BYZACHBOBBITTPOSTEDONAU......
  • json.dumps(data.__dict__) 会把中文字符转义为 Unicode 码点
    在Python中,json.dumps()函数会将Python对象序列化为JSON字符串,并默认使用UTF-8编码来处理非ASCII字符。虽然在Python3中,字符串是默认以Unicode形式存储的,但是当json.dumps()将Python字符串转换为JSON格式时,非ASCII字符会自动被转义为Unicode码点(例如\u......
  • useMemo中不能执行setData
    在React中,useMemo 应该用于缓存计算结果,而不是执行副作用操作。当你在 useMemo 内部更新state时,可能会导致循环依赖或不稳定的渲染行为。当你在 useMemo 内部更新state时,这意味着每次组件重新渲染时,useMemo 返回的值都会发生变化,从而触发重新渲染。这可能导致组件进......
  • c# 读取上传excel转换为DataTable
    varfiles=Request.Files;if(files==null||files.Count==0){returnJson(new{code=1,msg="未获取到文件"});}varfile=files[0];vardt=ReadExcelToDataTable(file.InputStream);returnJson(new{code=1,msg="已读取Excel......
  • 使用微信小程序-云开发时报错: Error: errCode: -401003 api parameter type error |
    错误Uncaught(inpromise)thirdScriptErrorerrCode:-401003apiparametertypeerror|errMsg:parameter.datashouldbeobjectinsteadofundefined;Error:errCode:-401003apiparametertypeerror|errMsg:parameter.datashouldbeobjectinsteadofundef......
  • redis基础——SpringDataRedis入门
    redis基础——SpringDataRedis入门最近在学习redis,学到了redis的java客户端,其中最常用的是Jedis和lettuce,而SpringDataRedis是spring整合了Jedis和lettuce的产物,它提供了RedisTemplate工具类,封装各种对redis的操作,将不同数据类型的API封装到了不同类型中,避免了代码臃肿。S......
  • data-analysis-llm-agent
    data-analysis-llm-agenthttps://github.com/fanqingsong/data-analysis-llm-agentConversationalAIwithFunctionCallingforDataAnalysisOverviewThechatbotisdesignedtoprovidedataanalysisinsightsfromdatabasebasedonapredefinedschema.It......