
CSCI316 Big Data Mining



CSCI316 (SIM) 2023 Session 3 Individual Assignment 2
CSCI316 – Big Data Mining Techniques and Implementation
Individual Assignment 2
2023 Session 3 (SIM)
15 Marks
Deadline: Refer to the submission link of this assignment on Moodle
Three (3) tasks are included in this assignment. The specification of each task starts on a separate page.
You must implement and run all your Python code in Jupyter Notebook. The deliverables include one
Jupyter Notebook source file (with .ipynb extension) and one PDF document for each task.
Note: To generate a PDF file from a notebook source file, you can either (i) use the Web browser's PDF
printing function, or (ii) click "File" at the top of the notebook, choose "Download as" and then "PDF via
LaTeX".
All results of your implementation must be reproducible from your submitted Jupyter Notebook source
files. In addition, the submission must include all execution outputs as well as a clear explanation of your
implementation algorithms (e.g., in Markdown format or as comments in your Python code).
Submission must be done online by using the submission link associated with this assignment for this
subject on Moodle. The size limit for all submitted materials is 20MB. DO NOT submit a zip file.
This is an individual assignment. Plagiarism of any part of the assignment will result in a mark of 0 for
the assignment for all students involved.
CSCI316 (SIM) 2023 Session 3 Individual Assignment 2
Task 1
(7.5 marks)
Data set: Customer Churn Dataset
https://www.kaggle.com/datasets/muhammadshahidazeem/customer-churn-dataset
Objective
The objective of this task is to implement a Random Forest classifier based on a Decision Tree model which
you implemented in Task 2 of Individual Assignment 1. (Note. If you have implemented multiple DT models,
you can choose any one of them as the base model. The DT model which you use must be from Task 2 of
Individual Assignment 1.)
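For orientation only (this is not part of the assignment specification), one common method that satisfies the objective is bagging: train each tree on a bootstrap sample of the training data and combine the trees by majority vote. The sketch below is a minimal illustration of that idea; it uses scikit-learn's DecisionTreeClassifier purely as a stand-in so the code runs on its own, whereas your submission must plug in the Decision Tree model you implemented in Task 2 of Individual Assignment 1. The class name and parameters are illustrative, not required by the spec.

```python
# Illustrative bagging-style Random Forest sketch (not the required solution).
# The base learner here is sklearn's DecisionTreeClassifier as a stand-in;
# replace it with the Decision Tree model from Task 2 of Individual Assignment 1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in base model


class SimpleRandomForest:
    def __init__(self, n_trees=10, max_features="sqrt", random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features
        self.random_state = random_state
        self.trees = []

    def fit(self, X, y):
        # X and y are assumed to be NumPy arrays.
        rng = np.random.default_rng(self.random_state)
        n_samples = X.shape[0]
        self.trees = []
        for _ in range(self.n_trees):
            # Bootstrap sample: draw n_samples rows with replacement.
            idx = rng.integers(0, n_samples, size=n_samples)
            # Random feature selection at each split via max_features.
            tree = DecisionTreeClassifier(max_features=self.max_features)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Majority vote across all trees, one column per sample.
        votes = np.stack([tree.predict(X) for tree in self.trees])
        majority = []
        for col in votes.T:
            values, counts = np.unique(col, return_counts=True)
            majority.append(values[np.argmax(counts)])
        return np.array(majority)
```

Comparing this ensemble's test accuracy against the single DT model (requirement 2) is then a matter of fitting both on the same train/test split and reporting the two scores side by side.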
Task requirements
(1) Clearly state which method you use for this Random Forest classifier.
(2) Compare the performance of this Random Forest classifier with that of the DT model(s) which
you have implemented.
Deliverables
• A Jupyter Notebook source file named <your_name>_task_x.ipynb which contains your
implementation source code in Python
• A PDF document named <your_name>_task_x.pdf which is generated from your Jupyter Notebook
source file, and presents a clear and accurate explanation of your implementation and results.
CSCI316 (SIM) 2023 Session 3 Individual Assignment 2
Task 2
(7.5 marks)
Data set: MAGIC Gamma Telescope Dataset
(Source: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope)
The data are Monte Carlo generated to simulate the registration of high-energy gamma particles in a
ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The dataset contains
19,020 records (a loading sketch follows the attribute list below). Attribute information:
1. fLength: continuous # major axis of ellipse [mm]
2. fWidth: continuous # minor axis of ellipse [mm]
3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]
4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]
5. fConc1: continuous # ratio of highest pixel over fSize [ratio]
6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]
7. fM3Long: continuous # 3rd root of third moment along major axis [mm]
8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]
9. fAlpha: continuous # angle of major axis with vector to origin [deg]
10. fDist: continuous # distance from origin to center of ellipse [mm]
11. class: g,h # gamma (signal), hadron (background)
g = gamma (signal): 12332
h = hadron (background): 6688
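As a starting point, the data can be loaded with pandas using the column names listed above. The sketch below assumes the UCI data file (commonly distributed as "magic04.data", comma-separated with no header row) has been downloaded to the working directory; the file name and label encoding are assumptions, not part of the task specification.

```python
# Sketch: load the MAGIC Gamma Telescope data with the attribute names above.
# Assumes the UCI file "magic04.data" (comma-separated, no header) is present
# in the working directory.
import pandas as pd

columns = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
           "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv("magic04.data", header=None, names=columns)

# Encode the class for model training: g (gamma, signal) -> 1,
# h (hadron, background) -> 0.
df["class"] = df["class"].map({"g": 1, "h": 0})

print(df.shape)                     # expected: (19020, 11)
print(df["class"].value_counts())   # expected: 12332 gamma, 6688 hadron
```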
Objective
Develop an Artificial Neural Network (ANN) in TensorFlow/Keras to predict the class.
Requirements
(1) You may (but are not required to) use Scikit-Learn or other Python libraries to pre-process and visualise the data
set. However, the ANN itself must be implemented with the Keras API in TensorFlow.
(2) You can use any ANN architecture (incl. feedforward, CNN, etc.) which has at least two hidden layers.
(3) The training process includes a hyperparameter fine-tuning step. Define a grid including at least three
hyperparameters: (a) the number of hidden layers, (b) the number of neurons in each layer, and (c) the
L1 and L2 regularization parameters. Each hyperparameter must have at least two candidate values. All
other hyperparameters (e.g., activation functions and learning rates) are up to you; see the illustrative
sketch after this list.
(4) Use 2/3 of the data for training and 1/3 for testing. Report the loss values for both the training and the test sets.
(5) Present clear and accurate explanation of your ANN architecture and results.
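The sketch below is one possible way to satisfy requirements (2)-(4), not the required solution: a feedforward Keras network with a small grid over the number of hidden layers, neurons per layer, and the L1/L2 regularization factor, trained on a 2/3 vs 1/3 split. It assumes feature matrix X and 0/1 label vector y are NumPy arrays produced by your own pre-processing; the epoch count, batch size, activations and optimizer are arbitrary placeholders for you to tune.

```python
# Illustrative sketch (not the required solution): feedforward Keras ANN with a
# grid over (a) hidden layers, (b) neurons per layer, (c) L1/L2 regularization.
# Assumes X (features) and y (0/1 labels) are NumPy arrays prepared earlier.
import itertools
from tensorflow import keras
from sklearn.model_selection import train_test_split

# 2/3 of the data for training, 1/3 for testing, as the task requires.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

def build_model(n_hidden, n_neurons, reg_factor, n_features):
    """Feedforward ANN with n_hidden dense layers of n_neurons units each."""
    model = keras.Sequential()
    model.add(keras.layers.Input(shape=(n_features,)))
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(
            n_neurons, activation="relu",
            kernel_regularizer=keras.regularizers.l1_l2(l1=reg_factor,
                                                        l2=reg_factor)))
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# At least two candidate values per hyperparameter, as required.
grid = {"n_hidden": [2, 3],
        "n_neurons": [32, 64],
        "reg_factor": [1e-4, 1e-3]}

results = []
for n_hidden, n_neurons, reg in itertools.product(
        grid["n_hidden"], grid["n_neurons"], grid["reg_factor"]):
    model = build_model(n_hidden, n_neurons, reg, X_train.shape[1])
    model.fit(X_train, y_train, epochs=30, batch_size=64,
              validation_split=0.1, verbose=0)
    train_loss = model.evaluate(X_train, y_train, verbose=0)[0]
    test_loss = model.evaluate(X_test, y_test, verbose=0)[0]
    results.append((n_hidden, n_neurons, reg, train_loss, test_loss))

# Report training and test loss for every grid point.
for r in results:
    print("hidden=%d neurons=%d reg=%g  train_loss=%.4f  test_loss=%.4f" % r)
```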
Deliverables
• A Jupyter Notebook source file named <your_name>_task_x.ipynb which contains your
implementation source code in Python
• A PDF document named <your_name>_task_x.pdf which is generated from your Jupyter Notebook
source file, and presents a clear and accurate explanation of your implementation and results.

     Datawhale推荐 主办单位:中国信息通信研究院,国家电网,富士康等自2017年以来,由中国信通院主办的工业大数据创新竞赛已经成功举办三届。这是首个由政府主管部门指导的工业大数据领域的全国性权威赛事。除了权威单位的出力,许多业界知名互联网企业也贡献了宝贵的经验和数据,为参赛者......