AIML427 Big Data:

Posted: 2023-04-26 12:00:09

2023 AIML427 Big Data: Assignment 2
This assignment has 100 marks and is due at 11:59pm on Monday, 8th May 2023. Please submit
your answers as a single .pdf file. Make sure you read the Assessment section before writing
the report. This assignment contributes 25% to your overall course grade.
Any questions about Parts 1 or 2 should be directed to Bach; any questions about Part 3
should go to Qi.
1 Manifold Learning [40 marks]
In class, we discussed a variety of different manifold learning methods, which we broadly
categorised as “classic” statistical methods or “modern” ML-based methods. In this question,
you are expected to further explore the differences between these classes of methods, and how
they compare to PCA (as a linear dimensionality reduction method). You should use no more than 3
pages to answer the following questions.
1. Find a reasonably high-dimensional (at least 100 dimensions) dataset that is interesting
to you. It should also have at least 100 instances, but preferably more. Describe the
dataset (name, related task/what it is used for, number of features and instances,
reference) and justify your choice.
2. Using your choice of library, apply PCA to the dataset, and present your results. You
should show visualisation(s) of the PCs found and also comment on the explained
variance.
3. Pick one of the “classic” manifold learning methods and apply it to the same data. Show
visualisation(s) and compare the results to that of PCA (e.g. for an embedding with two
dimensions). Highlight any differences between the two methods, and hypothesise why
they may have occurred.
4. Pick one of the “modern” manifold learning methods and apply it to the same data.
Show visualisation(s), compare and contrast the results to the two previous methods,
with analysis of any differences seen.
5. Finally, pick one of the two manifold learning methods for further analysis. Your method
will have “tunable parameters”: parameters that you can change to get different
results. Pick one such parameter, and explore how sensitive the embedding is to changes
in this parameter. You should explain the role of this parameter in the manifold learning
algorithm, how you tested its effect, and show the results found.
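As a rough illustration of items 2 and 5, the sketch below reports PCA explained variance and then sweeps one tunable parameter of a manifold learning method. Everything specific here is an assumption made for the example rather than part of the assignment: the scikit-learn digits data (only 64 features, so it would not satisfy the 100-dimension requirement), Isomap as the method, n_neighbors as the parameter, and trustworthiness as one possible way to quantify how the embedding changes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, trustworthiness
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 8x8 digit images, subsampled to 500 rows for speed.
X_raw, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X_raw[:500])

# PCA: how much variance do the leading components explain?
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_for_90 = int(np.searchsorted(cum_var, 0.90) + 1)  # components needed for 90%
print(f"First 2 PCs explain {cum_var[1]:.1%}; {n_for_90} PCs reach 90%")

# Sensitivity sweep over one tunable parameter (here Isomap's n_neighbors).
# Trustworthiness measures how well local neighbourhoods are preserved (0 to 1).
scores = {}
for k in (10, 20, 40, 80):
    emb = Isomap(n_neighbors=k, n_components=2).fit_transform(X)
    scores[k] = trustworthiness(X, emb, n_neighbors=10)
    print(f"n_neighbors={k:>2}: trustworthiness={scores[k]:.3f}")
```

A single score hides a lot, so plotting the 2D embedding for each parameter value alongside it is usually more informative than the number alone.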
2 Clustering [30 marks]
The NCI60 dataset (from the Stanford NCI60 Cancer Microarray Project) consists of p = 6,830
gene expression measurements for each of n = 64 cancer cell lines. (Sourced from An Introduction
to Statistical Learning.)
In this question, you will be clustering the genes, rather than individual cancer cell lines.
This can be seen as a form of feature clustering — i.e. what genes are most related?
For each clustering method, you will need to visualise the clustering results (partition) for
that method. Given that there are 64 dimensions for each gene, for visualising the clustering
results, you should use PCA to reduce the dimensionality to 2D so that you can plot the found
clusters.
It is recommended you use either R or Python for this question, as they both have libraries
to interact with this data, as shown below. You should use no more than 3 pages to answer the
following questions.
2.1 R:
library(ISLR)
nci.data = NCI60$data
X = scale(t(nci.data))
P = X %*% prcomp(X)$rotation
X will be the data matrix of interest. Note that we have transposed our data so our rows
are the genes. P contains the principal component scores of the data. You will use the first 2 PCs, i.e.
the first 2 columns of P, to visualise the clusters.
2.2 Python:
For Python (or any other language), you will need to first download nci60_data.csv from
the course homepage.
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
nci_data = pd.read_csv("nci60_data.csv", index_col=0)
X = scale(nci_data.T)
P = PCA().fit_transform(X)
X will be the data matrix of interest. Note that we have transposed our data so our rows
are the genes. P contains the principal component scores of the data. You will use the first 2 PCs, i.e.
the first 2 columns of P, to visualise the clusters.
2.3 Tasks:
1. Carry out hierarchical clustering with Euclidean distance and complete linkage.
(a) Describe the resulting clustering for 3 to 6 clusters.
(b) Plot the first 2 principal components against each other with the colour argument
set equal to the cluster labels. What can you deduce/observe about the clustering?
2. Repeat the cluster analysis using correlation-based distance and complete linkage. NB:
you will need to precompute the correlations and pass them into your clustering method.
Compare the clusters with those found above.
3. Finally, carry out K-means clustering for 3 to 6 clusters. Compare the clusters of K-means
with that of the above two approaches. Which of the hierarchical clustering results is
more similar to that of K-means? Why?
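The three tasks above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a model answer: the synthetic X with three planted groups is a hypothetical stand-in for the real gene matrix, 1 minus the Pearson correlation is one common choice of correlation-based distance, and the adjusted Rand index is one possible way to compare partitions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for X: 300 "genes" x 64 "cell lines", three planted groups.
centres = rng.normal(size=(3, 64))
X = np.vstack([c + 0.5 * rng.normal(size=(100, 64)) for c in centres])

k = 3  # repeat for k = 3, ..., 6

# Task 1: Euclidean distance, complete linkage.
Z_euc = linkage(X, method="complete", metric="euclidean")
labels_euc = fcluster(Z_euc, t=k, criterion="maxclust")

# Task 2: correlation-based distance, precomputed and passed in condensed form.
D_cor = 1.0 - np.corrcoef(X)   # np.corrcoef correlates rows, i.e. the genes
np.fill_diagonal(D_cor, 0.0)   # remove floating-point noise on the diagonal
Z_cor = linkage(squareform(D_cor, checks=False), method="complete")
labels_cor = fcluster(Z_cor, t=k, criterion="maxclust")

# Task 3: K-means with a fixed seed so the run can be replicated.
labels_km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Agreement between K-means and each hierarchical partition (1.0 = identical).
ari_euc = adjusted_rand_score(labels_km, labels_euc)
ari_cor = adjusted_rand_score(labels_km, labels_cor)
print(f"ARI(K-means, Euclidean) = {ari_euc:.3f}")
print(f"ARI(K-means, correlation) = {ari_cor:.3f}")
```

For the plots, any of the label vectors can be passed as the colour argument of a scatter plot of P[:, 0] against P[:, 1].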
3 Regression [30 marks]
In the lecture, we considered the case in which the features/predictors appeared only linearly
in the regression model. The simplest type of nonlinearity we could add to the model is pairwise
interactions of the features. If xj and xk are distinct features, this means we also consider
xjxk as a feature. Pairwise interactions are rather straightforward to implement in R:
X = model.matrix(balance ~ . * ., Credit)[, -1] (1)
becomes the new design matrix. The construction . * . means consider all pairwise multiplications
of distinct features.
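To make the construction concrete, here is a small Python sketch of the same idea for a purely numeric matrix. The helper name add_pairwise_interactions is hypothetical, and the sketch deliberately ignores categorical predictors, which R's model.matrix expands into dummy columns with its own rules, so the column count for a dataset with factors need not match this simple version.

```python
from itertools import combinations

import numpy as np


def add_pairwise_interactions(X):
    """Append x_j * x_k for every pair j < k to the columns of X."""
    pairs = list(combinations(range(X.shape[1]), 2))
    if not pairs:  # fewer than two features: nothing to multiply
        return X.copy(), pairs
    inter = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    return np.hstack([X, inter]), pairs


X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
X_int, pairs = add_pairwise_interactions(X)
print(pairs)        # [(0, 1), (0, 2), (1, 2)]
print(X_int.shape)  # (2, 6): 3 original columns + 3 interaction columns
```

With q numeric columns this yields q + q(q - 1)/2 columns in total, which is why the number of predictors grows quickly once interactions are included.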
Repeat the analysis for the Credit dataset (we did it in the Week 7 lecture) with
pairwise interactions of the features. You will find it convenient to set grid = 10^seq(3, -1, length = 100)
and thresh = 1e-10.
Answer the following questions:
1. How many predictors are there, i.e. what is p?
2. How did you generate your training and test sets?
3. Select the tuning parameter for the ridge regression model using cross-validation, and
show the process.
4. Select the tuning parameter for the lasso regression model using cross-validation, and
show the process. How many features have been selected by the lasso?
5. Compare and discuss the final form of the model from the linear regression, ridge
regression, and lasso regression.
6. Compare the test errors for the linear model, ridge regression model, and lasso model.
7. Plot a comparison of the test predictions for the three approaches.
NB: Please show how you generated your test and training sets, in particular the RNG
seed you used, so we can replicate your results. Remember to report this in the following
questions when applicable.
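For readers working in Python rather than R, the workflow in questions 2 to 6 might look roughly like the sketch below. It is an illustration under several assumptions: the data are synthetic rather than the Credit dataset, the seed value 427 and the 50/50 split are arbitrary, and scikit-learn's RidgeCV and LassoCV stand in for whatever was used in the lecture. The penalty is also not scaled identically across libraries, so a tuning value selected here is not directly comparable with one selected in R.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 427  # hypothetical seed; report whichever one you actually use
rng = np.random.default_rng(SEED)

# Synthetic stand-in (assumption): 400 samples, 20 features, 5 truly active.
X = rng.normal(size=(400, 20))
beta = np.zeros(20)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta + rng.normal(size=400)

# Reproducible split; scale using training-set statistics only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=SEED)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

grid = np.logspace(3, -1, 100)  # same values as 10^seq(3, -1, length = 100)

models = {
    "linear": LinearRegression(),
    "ridge": RidgeCV(alphas=grid, cv=10),
    "lasso": LassoCV(alphas=grid, cv=10, tol=1e-10, max_iter=100_000),
}
test_mse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name:>6}: test MSE = {test_mse[name]:.3f}")

# Number of features the lasso kept (non-zero coefficients).
n_selected = int(np.sum(models["lasso"].coef_ != 0))
print(f"ridge alpha = {models['ridge'].alpha_:.4g}, "
      f"lasso alpha = {models['lasso'].alpha_:.4g}, "
      f"lasso kept {n_selected} of {X.shape[1]} features")
```

Fixing both the data split and the cross-validation through one reported seed is what allows the markers to replicate the numbers.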
Assessment
Format: You can use any font to write the report, with a minimum of single spacing and 11
point size (handwriting is not permitted without approval from the lecturers). Reports
exceeding the maximum page limit will be penalised. Any additional material such as code or
figures/tables can be included in an appendix, which will not count towards the page limit.
Communication: a key skill required of a scientist is the ability to communicate effectively.
No matter the scientific merit of a report, if it is illegible, grammatically incorrect,
mispunctuated, ambiguous, or contains misspellings, it is less effective and marks will be deducted.
Marking Criteria: The final report will be submitted to Turnitin for a plagiarism check. Late
submissions without a pre-arranged extension will be penalised as per the course outline.
The usual mark checking procedures in place for all assessment apply to this report. The
assessment of the reports will take account of the understanding of big data, clarity and accuracy
of answers, presentation, organisation, layout and referencing.
Submission: You are required to submit a single .pdf report through the web submission
system from the AIML427 course website by the due time.

From: https://www.cnblogs.com/somtimes/p/17355225.html
