
FIT5201 Complexity and Model Selection



Assignment 1, FIT5201, S1 2023
1 Model Complexity and Model Selection

In this section, you study the effect of model complexity on the training and testing error. You also demonstrate your programming skills by developing a regression algorithm and a cross-validation technique that will be used to select the model with the most appropriate complexity.

Background. A KNN regressor is similar to a KNN classifier (covered in Activity 1.1) in that it finds the K nearest neighbors and estimates the value of a given test point based on the values of those neighbors. The main difference between KNN regression and KNN classification is that a KNN classifier returns the label with the majority vote in the neighborhood, while a KNN regressor returns the average of the neighbors' values. In Activity 1 of Module 1, we used the number of misclassifications as the measure of training and testing error for a KNN classifier. For a KNN regressor, you need to choose a different error function (e.g., the sum of squared errors) to measure training and testing errors.
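The averaging idea can be sketched in a few lines of NumPy (a minimal illustration only, not the class-based implementation the task requires; the function names knn_predict and sse are hypothetical):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict the target of one query point as the mean of its
    k nearest training targets (Euclidean distance)."""
    dists = np.linalg.norm(x_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()                     # average their target values

def sse(y_true, y_pred):
    """Sum of squared errors: one possible error measure for regression."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```

For a classifier, the mean() on the last line of knn_predict would be replaced by a majority vote over the neighbors' labels.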

Question 1 [KNN Regressor, 5+5=10 Marks]

I Implement a KNN regressor using the scikit-learn conventions, i.e., in a class with the following skeleton.

class KnnRegressor:

    def __init__(self):  # ADD PARAMETERS AS REQUIRED
        # YOUR CODE HERE

    def fit(self, x, y):
        # YOUR CODE HERE
        return self

    def predict(self, x):
        # YOUR CODE HERE

Hint: You can closely follow the KNN classifier implementation from Activity 1.1. You cannot use sklearn.neighbors.KNeighborsRegressor to solve this task.

II To test your implementation, load the diabetes and California housing datasets through the functions load_diabetes and fetch_california_housing, both available in the module sklearn.datasets. For both datasets, perform a training/test split (using a fraction of 0.6 of the data as training data), fit your KNN regressor to the training portion (using some guess for a good value of K), and report the training and test errors.
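The evaluation harness around the regressor can be sketched independently of its implementation (a hedged sketch; train_test_split_simple and mse are illustrative helper names, and the 0.6 fraction matches the task):

```python
import numpy as np

def train_test_split_simple(x, y, train_frac=0.6, seed=0):
    """Shuffle the sample indices once, then cut at the training fraction."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    cut = int(train_frac * len(x))
    tr, te = idx[:cut], idx[cut:]
    return x[tr], x[te], y[tr], y[te]

def mse(y_true, y_pred):
    """Mean squared error as the training/test error measure."""
    return float(np.mean((y_true - y_pred) ** 2))
```

With the real datasets, one would obtain the arrays via load_diabetes(return_X_y=True) or fetch_california_housing(return_X_y=True) and pass them through this split before fitting.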

Question 2 [L-fold Cross Validation, 5+5+5=15 Marks]

I Implement an L-fold cross-validation (CV) scheme using the scikit-learn convention for data splitters, i.e., using the following skeleton.

class LFold:

    def __init__(self):  # ADD PARAMETERS AS REQUIRED
        # YOUR CODE HERE

    def get_n_splits(self, x=None, y=None, groups=None):
        # YOUR CODE HERE

    def split(self, x, y=None, groups=None):
        # YOUR CODE HERE

Test your implementation for correctness by running a simple example like the following.

for idx_train, idx_test in LFold(5).split(list(range(20))):
    print(idx_train, idx_test)

You cannot use sklearn.model_selection.KFold to solve this task.
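The splitting logic behind such a class can be sketched with plain NumPy (one possible approach using contiguous folds; l_fold_indices is an illustrative function, not the required LFold class):

```python
import numpy as np

def l_fold_indices(n, l):
    """Yield (train_idx, test_idx) pairs for l near-equal contiguous folds
    over n samples; each sample appears in exactly one test fold."""
    folds = np.array_split(np.arange(n), l)
    for i in range(l):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(l) if j != i])
        yield train_idx, test_idx
```

A shuffling variant would permute np.arange(n) before splitting; either choice should be documented in your class.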

II For both datasets from Question 1, use your L-fold CV implementation to systematically test the effect of the KNN parameter K by trying all options from 1 to 50. For each K, instead of performing only a single training/test split, run your L-fold CV; compute the mean and standard deviation of the mean squared error (training and test) across the L folds and report the K for which you found the best test performance (for both datasets).

III For both datasets, plot the mean training and test errors against the choice of K with error bars (using the standard error of the mean). You can compute the standard error of the means as

ste = 1.96 · s / √L

where s is the sample standard deviation of the error across the L folds. Based on this plot, comment on

– The effect of the parameter K. For both datasets, identify regions of overfitting and underfitting for the KNN model.

– The effect of the parameter L of the CV procedure. HINT: You might want to repeat the above process with different values for L to get an intuition of its effect.
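The per-K statistics can be computed as in this sketch (assuming fold_errors holds the L per-fold errors for a single value of K; the helper name is hypothetical):

```python
import numpy as np

def fold_stats(fold_errors):
    """Mean and standard error (ste = 1.96 * s / sqrt(L), as in the text)
    of the per-fold errors for one value of K."""
    fold_errors = np.asarray(fold_errors, dtype=float)
    l = len(fold_errors)
    s = fold_errors.std(ddof=1)      # sample standard deviation across folds
    ste = 1.96 * s / np.sqrt(l)
    return fold_errors.mean(), ste
```

The returned (mean, ste) pairs can be passed directly to a plotting routine's error-bar arguments.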

Question 3 [Automatic Model Selection, 5 + 5 = 10 Marks]

I Implement a version of the KNN regressor that automatically chooses an appropriate value of K from a list of options by performing an internal cross-validation on the training set at fitting time. As usual, follow the scikit-learn paradigm, i.e., use the following template.

class KnnRegressorCV:

    def __init__(self, ks=list(range(1, 21)), cv=LFold(5)):
        # YOUR CODE HERE

    def fit(self, x, y):
        # YOUR CODE HERE
        return self

    def predict(self, x):
        # YOUR CODE HERE

II For both datasets from the previous questions, test your KNN regressor with internal CV by using either a single outer train/test split or, ideally, an outer cross-validation (resulting in a so-called nested cross-validation scheme). Report the (mean) K value chosen by the KNN regressor with internal cross-validation and whether it corresponds to the best K value with respect to the outer test sets. Comment on what factors determine whether the internal cross-validation procedure succeeds in approximately selecting the best model.

2 Probabilistic Machine Learning

In this section, you show your knowledge of the foundations of probabilistic machine learning (i.e., probabilistic inference and modeling) by solving a simple but fundamental statistical inference problem.

Solve the following problem based on the probability concepts you have learned in Module 1 with the same math conventions.

Question 4 [Bayes Rule, 5+5=10 Marks]

Recall the simple example from Appendix A of Module 1. Suppose we have one red, one blue, and one yellow box with the following contents:

In the red box we have 3 apples and 5 oranges,
in the blue box we have 4 apples and 4 oranges, and
in the yellow box we have 1 apple and 1 orange.
Now suppose we selected one of the boxes uniformly at random and then, in a second step, picked a fruit from it, again uniformly at random.

I Implement a Python function that simulates the above experiment (using a suitable method of a NumPy random number generator obtained via numpy.random.default_rng).
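One possible shape of such a simulation (a sketch under the assumption that boxes and fruits are represented as plain strings; pick_box_and_fruit is an illustrative name):

```python
import numpy as np

# Fruit counts per box, as given in the question.
BOXES = {
    "red":    {"apple": 3, "orange": 5},
    "blue":   {"apple": 4, "orange": 4},
    "yellow": {"apple": 1, "orange": 1},
}

def pick_box_and_fruit(rng):
    """Pick a box uniformly at random, then a fruit uniformly from that box."""
    box = rng.choice(list(BOXES))
    counts = BOXES[box]
    fruits = list(counts)
    probs = np.array([counts[f] for f in fruits], dtype=float)
    fruit = rng.choice(fruits, p=probs / probs.sum())
    return box, fruit
```

Repeating the experiment many times and counting how often an apple came from the yellow box gives an empirical check on your analytic answer in part II.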

II Answer the following question by a formal derivation: If the picked fruit is an apple, what is the probability that it was picked from the yellow box?

Hint: Formalise this problem using the notions in the “Random Variable” paragraph in Appendix A of Module 1.

3 Ridge Regression

In this section, you develop Ridge Regression by adding the L2 norm regularization to the linear regression (covered in Activity 2.1 of Module 2) and study the effect of the L2 norm regularization on the training and testing errors. This section assesses your mathematical skills (derivation), programming, and analytical skills.

Question 5 [Ridge Regression, 10+5+5=20 Marks]

I Given the gradient descent algorithms for linear regression (discussed in Chapter 2 of Module 2), derive the weight update steps of stochastic gradient descent (SGD) for linear regression with an L2 regularisation norm. Show your work with enough explanation in your PDF report; you should provide the steps of SGD.

Hint: Recall that for linear regression we defined the error function E. For this assignment, you only need to add an L2 regularization term to the error function (error term plus regularization term). This question is similar to Activity 2.1 of Module 2.
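As a sketch of where the derivation should land, assuming the regularised per-example error uses a λ/2 factor (your report should still show the differentiation step by step, and note that scaling the regulariser by 1/N instead is an equally common convention):

```latex
% Per-example regularised error, its gradient, and the resulting SGD step (sketch)
E_n(\mathbf{w}) = \tfrac{1}{2}\,(t_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \tfrac{\lambda}{2}\,\lVert \mathbf{w} \rVert^2,
\qquad
\nabla E_n(\mathbf{w}) = -\,(t_n - \mathbf{w}^\top \mathbf{x}_n)\,\mathbf{x}_n + \lambda\,\mathbf{w},
\qquad
\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla E_n(\mathbf{w}).
```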

II Using the analytically derived gradient from Step I, implement either a direct or a (stochastic) gradient descent algorithm for Ridge Regression (again using the usual template with __init__, fit, and predict methods). You cannot use any import from sklearn.linear_model for this task.
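The "direct" variant amounts to the regularised normal equations, sketched below (a minimal sketch that omits bias handling and the class template the task asks for; function names are illustrative):

```python
import numpy as np

def ridge_fit(x, t, lam):
    """Direct ridge solution: w = (X^T X + lambda I)^(-1) X^T t."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ t)

def ridge_predict(x, w):
    """Linear prediction with the fitted weights."""
    return x @ w
```

With lam = 0 this reduces to ordinary least squares; increasing lam shrinks the weight vector toward zero, which is the effect Part III asks you to study.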

III Study the effect of the L2 regularization on the training and testing errors, using the synthetic data generator from Activity 2.3.
a For each λ in {0, 0.4, 0.8, . . . , 10}, create a pipeline of your implemented ridge regressor with a polynomial feature transformer with degree 5.

b Fit the model ten times (resampling a training dataset of size 20 each time) for all choices of λ.

c Create a plot of mean squared errors (use different colors for the training and testing errors), where the x-axis is log λ and the y-axis is the error. Based on your plot, discuss how λ relates to model complexity and error rates, identifying regions of underfitting and overfitting.

4 Multiclass Perceptron

In this section, you are asked to demonstrate your understanding of linear models for classification.

You expand the binary-class perceptron algorithm that is covered in Activity 3.1 of Module 3 into a multiclass classifier. Then, you study the effect of the learning rate on the error rate. This section assesses your programming and analytical skills.

Background. Assume we have N training examples {(x1, t1), . . . , (xN, tN)} where tn is one of K discrete values {1, . . . , K}, i.e., a K-class classification problem. For a prediction function of a model with parameters w, we use, as usual, y(xn, w) to represent the predicted label of data point xn. In particular, for the K-class classification problem with p-dimensional inputs, we consider a K × p weight matrix w or, alternatively, a collection of K weight vectors wk, each corresponding to one of the classes. At prediction time, a data point x is then classified as

y = arg max_{k ∈ {1, . . . , K}} wk · x.

We can fit those weights with the multiclass perceptron algorithm as follows:

1. Initialise the weight vectors w1, . . . , wK randomly to small values.
2. FOR n = 1 to N:
   – y = arg max_{k ∈ {1, . . . , K}} wk · xn
   – IF y ≠ tn THEN update wtn ← wtn + η xn and wy ← wy − η xn
3. IF any weights have changed during the pass THEN go to Step 2, ELSE terminate.
In what follows, we look into the convergence properties of the training algorithm for multiclass perceptron (similar to Activity 3.1 of Module 3).
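The training loop above can be sketched in NumPy as follows (a minimal sketch assuming the standard mistake-driven update and 0-based class labels; the function name is illustrative, not the required class template):

```python
import numpy as np

def fit_multiclass_perceptron(x, t, n_classes, eta=0.1, max_epochs=100, seed=0):
    """Multiclass perceptron: on each mistake, move the true class's
    weights toward the example and the predicted class's weights away."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=(n_classes, x.shape[1]))  # small random weights
    for _ in range(max_epochs):
        changed = False
        for xn, tn in zip(x, t):
            y = int(np.argmax(w @ xn))      # predicted class
            if y != tn:
                w[tn] += eta * xn           # reward the true class
                w[y] -= eta * xn            # penalise the predicted class
                changed = True
        if not changed:                     # a full pass with no updates: converged
            break
    return w
```

For linearly separable data this loop terminates after a full pass with no updates; otherwise max_epochs caps the runtime.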

Question 6 [Multiclass Perceptron, 5+5+10=20 Marks]

I Implement the multiclass perceptron as explained above using the usual template. You cannot use sklearn.linear_model.Perceptron to solve this task.

II Evaluate your algorithm using the digits dataset provided through the function load_digits in sklearn.datasets. This is a classification problem with 10 classes corresponding to the digits 0 to 9 (see the scikit-learn online documentation for more information). Perform an 80/20 train/test split and report your train and test error rates (using η = 0.01).

III Modify your classifier implementation to store the history of the weight vectors (similar to the gradient descent algorithms implemented in Activity 2.1). Then run the model fitting for two different learning rates (η = 0.1 and η = 0.9) and draw a plot of the training and test errors as the number of iterations of the inner loop increases (it is enough to evaluate the errors only every 5 iterations). By observing your plot, explain how the test errors of the two models behave differently as training proceeds.

5 Logistic Regression versus Bayes Classifier

This task assesses your analytical skills. You study the performance of two well-known models, a generative one (the Bayes classifier) and a discriminative one (logistic regression), as the size of the training set increases. Then, you show your understanding of the behavior of the learning curves of typical generative and discriminative models.

Question 7 [Discriminative vs Generative Models, 5+5+5+5=20 Marks]

I Load the breast cancer dataset via load_breast_cancer in sklearn.datasets and copy the code from Activities 3.2 and 3.3 for the Bayes classifier (BC) and logistic regression (LR).

Note: for logistic regression you can instead simply import LogisticRegression from sklearn.linear_model and, when using it, set the parameter penalty to 'none'. Perform a training/test split (with train size equal to 0.8) and report which model performs better in terms of train and test performance.

II Implement an experiment where you test the performance for increasing training sizes of N = 5, 10, . . . , 500. For each N, sample 10 training sets of the corresponding size, fit both models, and record training and test errors.

Hint: you can use train_test_split from sklearn.model_selection with an integer parameter for train_size.

III Create suitable plots that compare the mean train and test performances of both models as a function of training size. Make sure to also include error bars in the plot, computed similarly to those in the earlier questions.

IV Formulate answers to the following questions:

a What happens for each classifier when the number of training data points is increased?

b Which classifier is best suited when the training set is small, and which is best suited when the training set is big?

c Justify your observations in the previous questions (IV.a & IV.b) by providing plausible explanations and possible reasons.

Hint: Think about model complexity and the fundamental concepts of machine learning covered in Module 1.

Submission and Interview

Submission Please submit one zip-file that contains two versions of a single Jupyter Notebook file that contains

your name and student ID in a leading markdown cell followed by
a structure that clearly separates between sections and questions (with markdown headlines and sub-headlines) and then for each question
all required code,
all required figures,
and your answers to all questions that require a free-text answer (in markdown cells).
One version is the actual notebook file (with extension ".ipynb"). The other is a PDF export (".pdf"). Note that, depending on your system, it might be necessary to first generate an HTML export and then save that as a PDF file with your web browser.

The three files should be named STUD_ID_FIRSTNAME_LASTNAME_assessment_1.SUFFIX where SUFFIX is "zip", "pdf", and "ipynb", respectively. The submission must be received via Moodle by the due date mentioned at the top of this document.

Interview

In addition to the submission, you will be asked to meet (online or on-campus) with your tutor for an interview when your assessment is marked. Not submitting the file or not attending the interview will both result in 0 marks for the assignment.

Notes Please note that,

A delay of even one second will be penalized as a one-day delay, so please submit your assignment in advance (allowing for possible Internet delays) and do not wait until the last minute.
We will not accept any resubmitted version, so please double-check your assignment before submission.
Your final grade does not only depend on your submission but also on your ability to explain your solution in an assignment interview to be held after submission. Failure to attend this interview will result in 0 marks.


