
CSC246/446 Machine Learning

Date: 2023-04-30 12:33:12


CSC246/446 Machine Learning Homework 5: Clustering
Objective

You are given two kinds of datasets. Datasets A, B, C, and Z were created by the instructor by sampling from four different Gaussian mixtures, with different numbers of mixture components. Your first task is to apply EM to each of these datasets and attempt to uncover the "true" number of mixture components. Exactly how to do this is up to you.

The second kind is the famous Sonar benchmark dataset, which consists of 208 samples of 60 real-valued features. The features correspond to various sonar measurements of mines (bombs) and rocks in the ocean. Although the task is a supervised one, we can also apply the EM algorithm to build a two-component GMM from the dataset (ignoring the rock vs. mine label).

Your overall objective is to apply the EM algorithm to model each dataset using Gaussian mixtures. For the synthetic datasets, you will have to determine an appropriate number of clusters. (This can be a manual process; you do NOT have to automate it for this assignment.) For the Sonar dataset, the number of clusters should be two. The synthetic datasets are included with the assignment posting on Blackboard; however, you should obtain the Sonar dataset directly from UCI.
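One possible way to compare candidate numbers of clusters is sketched below. It uses the Bayesian information criterion (BIC) exposed by sklearn's GaussianMixture, which is only one option and is not prescribed by the assignment. The helper name bic_by_components, the candidate range, and the stand-in data are all assumptions made for illustration; datasets A, B, C, and Z would replace X_demo.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def bic_by_components(X, candidates=range(1, 9), seed=0):
    """Fit one GMM per candidate component count and return {k: BIC}."""
    scores = {}
    for k in candidates:
        # n_init=5 restarts EM a few times and keeps the best fit.
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=seed)
        gmm.fit(X)
        scores[k] = gmm.bic(X)  # lower BIC is better
    return scores


# Hypothetical stand-in data: three well-separated 2D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X_demo = np.vstack([c + rng.normal(size=(150, 2)) for c in centers])

bic_scores = bic_by_components(X_demo)
best_k = min(bic_scores, key=bic_scores.get)
print(bic_scores)
print("lowest BIC at k =", best_k)
```

Plotting the BIC values against k and looking for the point where the curve stops improving is a manual process of the kind the assignment allows.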

Evaluating Clusterings

Clusterings can be evaluated in two ways: supervised and unsupervised. For supervised evaluation, we assume that we have access to some "true" category labels, and use those to evaluate the quality of the assigned clusters. Two well-known metrics are the Rand index and Mutual Information. For unsupervised evaluation, we can only try to obtain "coherent" clusters, aiming for properties such as having lower within-class variance than between-class variance. The Silhouette coefficient is a popular unsupervised clustering metric, but there are others.
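As a rough illustration of both kinds of metric, the sketch below scores a toy clustering with sklearn.metrics. It uses the chance-adjusted variants (adjusted_rand_score and adjusted_mutual_info_score), which is a choice made here rather than something the assignment requires. The data and the label arrays are made up for the example.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    silhouette_score,
)

# Hypothetical data: two 2D blobs with known membership.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=-3.0, size=(100, 2)),
    rng.normal(loc=3.0, size=(100, 2)),
])
true_labels = np.repeat([0, 1], 100)

# A clustering that matches the truth up to a renaming of the cluster ids.
pred_labels = 1 - true_labels

# Supervised metrics compare two labelings and ignore how clusters are named.
ari = adjusted_rand_score(true_labels, pred_labels)
ami = adjusted_mutual_info_score(true_labels, pred_labels)

# The unsupervised metric needs only the data and the assigned labels.
sil = silhouette_score(X, pred_labels)

print(ari, ami, sil)
```

Because the supervised scores are invariant to relabeling, a clustering with the cluster ids swapped scores the same as one that uses the original ids.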

Of course, being probabilistic models, Gaussian mixture fits can also be evaluated in terms of model likelihood. (Unfortunately, this kind of evaluation cannot be applied to simpler algorithms such as k-means, because k-means does not define a probability model.)

Modeling and Experimentation

As you know, the EM algorithm is guaranteed to converge, but it may converge to a local rather than a global optimum. Therefore, you should experiment with various initialization methods. The sklearn implementation supports random initialization and k-means-based initialization (as well as two variants on those).
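A minimal sketch of such an experiment follows, assuming the init_params and random_state parameters of sklearn's GaussianMixture. The data, the number of seeds, and the variable names are illustrative assumptions; only the "kmeans" and "random" options are exercised here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in data: three 2D blobs.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

results = {}
for init in ("kmeans", "random"):
    for seed in range(5):
        gmm = GaussianMixture(
            n_components=3, init_params=init, random_state=seed
        ).fit(X)
        # score() returns the mean log-likelihood per sample.
        results[(init, seed)] = gmm.score(X)

best = max(results, key=results.get)
for key, value in sorted(results.items()):
    print(key, round(value, 4))
print("best run:", best)
```

Runs that end at noticeably different log-likelihoods have converged to different local optima, which is the effect the paragraph above describes.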

Similarly, because EM is a maximum likelihood estimator, if one of your cluster centers happens to line up exactly with a sample point, the model can diverge toward a singularity: the variance of that component shrinks toward zero and the likelihood grows without bound. To reduce this risk, you can apply regularization. Luckily, the sklearn implementation has a default regularization value (the reg_covar parameter), but if you find it is not sufficient, you may need to adjust it.
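The effect of the regularization value can be seen in a small sketch like the one below. It assumes the reg_covar parameter of GaussianMixture, which is added to the diagonal of each covariance matrix. The data (a broad cloud plus a clump of identical points) and the two values tried are made up to provoke a near-collapse.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Hypothetical data: one broad cloud plus a tight clump of identical points,
# which invites one component to collapse onto the clump.
X = np.vstack([
    rng.normal(size=(200, 2)),
    np.tile([[4.0, 4.0]], (20, 1)),
])

floors = {}
for reg in (1e-6, 1e-2):
    gmm = GaussianMixture(n_components=2, reg_covar=reg, random_state=0).fit(X)
    # Smallest eigenvalue over all component covariance matrices.
    floors[reg] = min(np.linalg.eigvalsh(c).min() for c in gmm.covariances_)
    print(reg, floors[reg], gmm.score(X))
```

A larger reg_covar keeps the smallest covariance eigenvalue away from zero, at the cost of a less sharply peaked (and lower-likelihood) fit on the clump.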

You should obtain the following results: For each of the datasets (A, B, C, Z, and Sonar), produce plots of likelihood and Silhouette score vs. number of iterations of EM. You can produce combined plots with both functions per plot, or you can produce separate plots.
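One way to record these quantities per iteration is sketched below. It relies on GaussianMixture's warm_start=True together with max_iter=1, so that each call to fit advances EM by a single step from the previous parameters. This is an assumption about how to step the sklearn implementation, not a method given in the assignment. The helper name em_trace and the demo data are hypothetical.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def em_trace(X, n_components, n_iters=20, init="random", seed=0):
    """Step EM one iteration at a time; return log-likelihoods and Silhouettes."""
    gmm = GaussianMixture(
        n_components=n_components,
        max_iter=1,        # a single EM step per fit() call
        warm_start=True,   # continue from the previous parameters
        init_params=init,
        random_state=seed,
    )
    log_liks, silhouettes = [], []
    for _ in range(n_iters):
        with warnings.catch_warnings():
            # One step rarely converges, so silence the expected warning.
            warnings.simplefilter("ignore", ConvergenceWarning)
            gmm.fit(X)
        labels = gmm.predict(X)
        # score() is the mean per-sample log-likelihood; scale to the total.
        log_liks.append(gmm.score(X) * len(X))
        # Silhouette is undefined when every point falls in one cluster.
        if len(set(labels)) > 1:
            silhouettes.append(silhouette_score(X, labels))
        else:
            silhouettes.append(float("nan"))
    return np.array(log_liks), np.array(silhouettes)


# Hypothetical stand-in data: two 2D blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-3.0, size=(80, 2)),
    rng.normal(loc=3.0, size=(80, 2)),
])
ll, sil = em_trace(X_demo, n_components=2, n_iters=10)
print(ll)
print(sil)
```

The two returned arrays can be passed to any plotting library, either on a shared iteration axis or as separate plots.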

For the 2D datasets (A, B, and C), you should produce a color-coded visualization of the clusters. (There are examples of how to do this in the scikit-learn documentation.)
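A color-coded scatter plot along these lines could be produced with matplotlib, as in the sketch below. The helper plot_clusters, the colormap, and the demo data are illustrative assumptions and are not taken from the scikit-learn examples the assignment mentions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.mixture import GaussianMixture


def plot_clusters(X, labels, means=None, title="GMM clustering"):
    """Scatter 2D points colored by cluster label, with optional centers."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=12)
    if means is not None:
        ax.scatter(means[:, 0], means[:, 1], c="black", marker="x", s=80)
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_title(title)
    return fig, ax


# Hypothetical stand-in data: three 2D blobs.
rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_demo = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

gmm_demo = GaussianMixture(n_components=3, random_state=0).fit(X_demo)
labels_demo = gmm_demo.predict(X_demo)
fig, ax = plot_clusters(X_demo, labels_demo, means=gmm_demo.means_)
```

Calling fig.savefig with a file name of your choice writes the figure to disk for inclusion in the writeup.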

For the Sonar dataset, you should also produce a plot of the Silhouette score and of the mutual information between the inferred clustering and the true labels vs. the number of iterations of EM.
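The sketch below shows one way this could look. It assumes a local copy of the UCI file in which each row holds 60 comma-separated numbers followed by an "R" or "M" label. The helper names load_sonar and sonar_trace, the diagonal covariance choice, and the use of adjusted mutual information are assumptions made for illustration. EM is stepped with warm_start=True and max_iter=1, as one possible way to get per-iteration values.

```python
import csv
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score
from sklearn.mixture import GaussianMixture


def load_sonar(path):
    """Read rows of numeric features followed by an 'R'/'M' label (assumed layout)."""
    with open(path, newline="") as f:
        rows = [r for r in csv.reader(f) if r]
    X = np.array([[float(v) for v in r[:-1]] for r in rows])
    y = np.array([1 if r[-1].strip() == "M" else 0 for r in rows])
    return X, y


def sonar_trace(X, y, n_iters=20, covariance_type="diag", seed=0):
    """Per-iteration mutual information (vs. y) and Silhouette for a 2-component GMM."""
    gmm = GaussianMixture(
        n_components=2,
        covariance_type=covariance_type,  # "diag" is an assumption, not a requirement
        max_iter=1,
        warm_start=True,
        init_params="random",
        random_state=seed,
    )
    mutual_infos, silhouettes = [], []
    for _ in range(n_iters):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", ConvergenceWarning)
            gmm.fit(X)  # one EM step, continuing from the last call
        labels = gmm.predict(X)
        mutual_infos.append(adjusted_mutual_info_score(y, labels))
        # Silhouette is undefined when every point falls in one cluster.
        if len(set(labels)) > 1:
            silhouettes.append(silhouette_score(X, labels))
        else:
            silhouettes.append(float("nan"))
    return np.array(mutual_infos), np.array(silhouettes)
```

A call such as X, y = load_sonar("sonar.all-data") followed by sonar_trace(X, y) would then give the two curves to plot, where the file name is whatever you saved the UCI download as.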

Resources

The primary resource for this assignment is the sklearn documentation:

https://scikit-learn.org/stable/modules/clustering.html
https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
https://archive.ics.uci.edu/ml/datasets/connectionist%20bench%20(sonar,%20mines%20vs.%20rocks)
Grading and Submission

Submit a zip file of your code and a PDF writeup with plots/explanations by 11:59pm on the last day of class (April 25th). Your submission should contain the following components:

25% – At least four plots (or one really cool one) with likelihood and Silhouette coefficient per iteration for the synthetic data (A, B, C, and Z).
15% – Three separate color-coded plots of your best clusterings for low-dimensional synthetic data (A, B, and C).
25% – Plots of likelihood, mutual information, and Silhouette score per iteration of EM for Sonar data.
35% – Writeup: you should also submit a brief writeup describing your experiments and work. In particular, you should describe how you selected the number of clusters for A, B, C, and Z (10%), discuss what you found to be the best schemes for initialization and evaluation (10%), and discuss how well the method worked in recovering the true clusters for the Sonar data (10%). The remaining 5% is for formatting and editing, e.g., prose style and quality.

 

From: https://www.cnblogs.com/mondayw/p/17365127.html
