COMP3425 Data Mining

Posted: 2023-05-08 09:01:52

COMP3425 and COMP8410 Data Mining S1 2023
Assignment 2: Description of Data

Data and Metadata

The data supplied for the assignment arises from The Australian Data Archive’s ANU Poll
Dataverse [1]. As a student of the course, you are assumed to accept the Terms and Conditions
of Use reproduced below. Please read them carefully. The custodian of the data has requested
you delete your data at the end of the course.

In particular, the data captures the results of a poll conducted in 2019 on attitudes and
behaviours towards universities, amongst other things. A complete description of the purpose
of the poll, the coding of the data (metadata), and a descriptive summary of the poll results
is available here:
https://dataverse.ada.edu.au/dataset.xhtml?persistentId=doi:10.26193/GOVGBB

The data is provided to you for the assignment in two forms. The first is the original dataset
as downloaded from the ADA, called 2.ANUPoll2019RoleOfGovernment_CSV_01445.csv, in
comma-separated-values format. This data is described by the metadata in 1.
ADA.CODEBOOK.01445.xslx and the corresponding question text in 1.
ADA.QUESTIONNAIRE.01445.pdf.

The second is a form derived from the original, pre-processed for the COMP3425 data mining
assignment, in comma-separated-values format called 3425_data.csv. Below you will find a
description of the pre-processing undertaken and this, in addition to the original metadata,
will be needed to assist your understanding of the data.
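
Before modelling, it can help to load the file and check which columns contain blank cells. The following is a minimal sketch using only the Python standard library. The helper names load_rows and count_missing are illustrative, and the in-memory sample rows are made up for demonstration (only the column names Q4, Q14 and Q15a come from this description).

```python
import csv
import io
from collections import Counter


def load_rows(handle):
    """Read CSV rows as dicts, turning blank or whitespace-only cells into None."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for column, value in raw.items():
            if column is None:
                # Surplus cells beyond the header row; ignored in this sketch.
                continue
            text = (value or "").strip()
            row[column] = text if text else None
        rows.append(row)
    return rows


def count_missing(rows):
    """Count, per column, how many rows have no value."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts


# Made-up sample standing in for the real file (hypothetical values).
sample = io.StringIO("Q4,Q14,Q15a\n1,2,4\n-98, ,\n")
rows = load_rows(sample)
print(count_missing(rows))

# With the real file (the encoding is an assumption; adjust if needed):
# with open("3425_data.csv", newline="", encoding="utf-8-sig") as f:
#     rows = load_rows(f)
```

Values are kept as strings here so that special codes such as -98 and -99 stay visible until you decide how to treat them.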

If you are a COMP3425 (undergraduate) student, you must work with the pre-processed
dataset 3425_data.csv.

If you are a COMP8410 (postgraduate) student, you may use either the original or the
pre-processed data, or both. The original will give you more opportunity to show off your technical
skills and creativity, while the pre-processed one is more constrained but may save time,
requiring you to spend less effort understanding the data, and helping to avoid some data
errors. The same rubric will be used for marking in both cases, but the original dataset provides
an extended learning experience and better opportunity for higher marks. Even if you use the
original data, you may find it useful to observe the pre-processing that has been undertaken to
produce 3425_data.csv to seed ideas or to solve problems you encounter.

Pre-processing applied with Excel to derive 3425_data.csv

• Only a selection of the original attributes has been retained.
• The Q15_safe_gambler column has been added, based on the respondent’s answers to
questions Q15a-i, which have answers that range from almost always to never.
Q15_safe_gambler is a normalized number in the range [0,1] that shows the rarity of
the various problem gambling behaviours raised in Q15a-i. Refused and Don’t know
options (coded -98 or -99) are replaced by the midpoint value 2.5 for each question,
and the field is null when the Q15 questions were not asked (that is, when Q14 is blank).
Q15_safe_gambler = IF(NOT(Q14=" "),
    ((IF(OR(Q15a=-98, Q15a=-99), 2.5, Q15a)
    + IF(OR(Q15b=-98, Q15b=-99), 2.5, Q15b)
    + IF(OR(Q15c=-98, Q15c=-99), 2.5, Q15c)
    + IF(OR(Q15d=-98, Q15d=-99), 2.5, Q15d)
    + IF(OR(Q15e=-98, Q15e=-99), 2.5, Q15e)
    + IF(OR(Q15f=-98, Q15f=-99), 2.5, Q15f)
    + IF(OR(Q15g=-98, Q15g=-99), 2.5, Q15g)
    + IF(OR(Q15h=-98, Q15h=-99), 2.5, Q15h)
    + IF(OR(Q15i=-98, Q15i=-99), 2.5, Q15i)) - 9) / 27,
    "")

• The binary undecided voter column was added based on the given answer to Q4, and
is TRUE when the answer to Q4 is one of -98, -99, 95, 97 and FALSE otherwise. That
is, IF(OR(OR(OR(Q4=-99, Q4=-98),Q4=95), Q4=97),TRUE,FALSE).
• For two categorical columns, nominal Q2 and nominal StateMap, double quotation
marks were added to all non-empty cells. You can use the same approach on the other
categorical columns, if necessary, to help Rattle recognise the data in a column as
categorical. For example, for nominal StateMap, the formula CONCATENATE("""",
StateMap, """") is used. For nominal Q2, the formula CONCATENATE("""", TEXT(Q2,
"0"), """") is used.

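The Excel steps above can be sketched in Python, which may be useful if you repeat the pre-processing on the original data. This is an illustrative sketch rather than the procedure actually used: the function names are hypothetical, any whitespace-only Q14 is treated as "not asked" (the Excel formula tests for a single space), and every Q15 item is assumed to hold a number whenever Q14 was answered.

```python
from typing import Mapping, Optional

Q15_ITEMS = ["Q15a", "Q15b", "Q15c", "Q15d", "Q15e",
             "Q15f", "Q15g", "Q15h", "Q15i"]
SPECIAL_CODES = {-98, -99}            # Refused / Don't know
MIDPOINT = 2.5                        # substituted for special codes
UNDECIDED_CODES = {-98, -99, 95, 97}  # Q4 answers counted as undecided


def is_blank(value) -> bool:
    """True for None, empty strings and whitespace-only strings."""
    return value is None or str(value).strip() == ""


def safe_gambler_score(row: Mapping[str, object]) -> Optional[float]:
    """Return the [0, 1] score, or None when the Q15 block was not asked."""
    if is_blank(row.get("Q14")):
        return None
    total = 0.0
    for item in Q15_ITEMS:
        # Assumption: each Q15 item is numeric when Q14 was answered.
        answer = float(row[item])
        total += MIDPOINT if answer in SPECIAL_CODES else answer
    # Nine items with a minimum of 1 each give a floor of 9 and a span of 27.
    return (total - 9) / 27


def is_undecided_voter(q4) -> bool:
    """TRUE/FALSE flag derived from Q4; blank or non-numeric gives False."""
    try:
        return float(q4) in UNDECIDED_CODES
    except (TypeError, ValueError):
        return False


def quote_nominal(value) -> str:
    """Wrap a non-empty cell in double quotes; leave empty cells empty."""
    return "" if is_blank(value) else f'"{value}"'


# Hypothetical respondent, for illustration only.
example = {"Q4": "95", "Q14": "1", **{item: "4" for item in Q15_ITEMS}}
example["Q15c"] = "-98"
print(safe_gambler_score(example), is_undecided_voter(example["Q4"]))
```

The constants 9 and 27 in the formula imply that each Q15 answer is coded from 1 to 4, so the sum of nine answers runs from 9 to 36 and 2.5 is the midpoint of a single item.
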
References

[1] Biddle, Nicholas; and Reddy, Karuna, 2019, “ANU Poll 2019: Role of the University”,
doi:10.26193/GOVGBB

Terms and Conditions of Use

This data has been distributed exclusively for students of COMP3425 and COMP8410 in S1
2023. The data must be destroyed at the end of the course but may be re-obtained by
request to the Australian Data Archive.

Furthermore, from https://dataverse.ada.edu.au/dataset.xhtml?persistentId=doi:10.26193/GOVGBB,

I acknowledge that:

1. Use of the material is restricted to use for analytical purposes and that this means that I can only
use the material to produce information of an analytical nature.

Examples of such uses are: (a) the manipulation of data to produce means, correlations or other
descriptive summary measures; (b) the estimation of population characteristics from sample data;
(c) the use of data as input to mathematical models and for other types of analyses (e.g. factor
analysis); and (d) to provide graphical and pictorial representation of characteristics of the
population or sub-sets of the population.

2. The material is not to be used for any non-analytical purposes, or for commercial or financial gain,
without the express written permission of the Australian Data Archive.
Examples of non-analytical purposes are: (a) transmitting or allowing access to the data in part or
whole to any other person / Department / Organisation not a party to this undertaking; and (b)
attempting to match unit record data in whole or in part with any other information for the
purposes of attempting to identify individuals.

3. Outputs (such as statistics, tables and graphs) obtained from analysis of these data may be further
disseminated provided that I:
(a) acknowledge both the original depositors and the Australian Data Archive; (b) acknowledge
another archive where the data file is made available through the Australian Data Archive by
another archive; and (c) declare that those who carried out the original analysis and collection of the
data bear no responsibility for the further analysis or interpretation of it.

4. Use of the material is solely at my risk and I indemnify the Australian Data Archive and its host
institution, The Australian National University.

5. The Australian Data Archive and its host institution, The Australian National University, shall not
be held liable for any breach of this undertaking.

6. The Australian Data Archive and its host institution, The Australian National University, shall not
be held responsible for the accuracy and completeness of the material supplied.

From: https://www.cnblogs.com/wolfjava/p/17380649.html
