
K-means Clustering Workflow

Date: 2024-05-25 18:23:44
Tags: probability, West, workflow, Kmeans, ecologist, clustering, Loggerhead, East, Beach

1. turtles

Introduction

In this report, we will analyze a problem related to turtle populations on a small island with two beaches: West Beach and East Beach. The goal is to determine the probability of being on East Beach given that a Loggerhead Turtle is found. We will use Bayes' theorem and R programming to solve this problem.

Problem Statement

An ecologist studying turtles on a small island with two beaches knows the following information about the turtle population:

  • On West Beach, 90% of turtles are Green Sea Turtles, and the remaining 10% are Loggerhead Sea Turtles.
  • On East Beach, 60% of turtles are Green Sea Turtles, and the remaining 40% are Loggerhead Sea Turtles.

On a foggy day, the ecologist gets lost on the island. After hours of walking, they reach a beach but cannot determine which one it is due to the dense fog. The ecologist finds a turtle and examines it, discovering that it is a Loggerhead Turtle.

The question is: What is the probability that the ecologist is on East Beach? Additionally, we need to state the assumptions made to arrive at this probability.

Assumptions

To solve this problem, we make the following assumptions:

  1. The ecologist is either on West Beach or East Beach; there are no other possibilities.
  2. In the foggy weather, the probability of reaching West Beach or East Beach is equal, i.e., 50% each.

Solution

We will use Bayes' theorem to calculate the probability of being on East Beach given that a Loggerhead Turtle is found.
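
Written out with the numbers from the problem statement and the equal-prior assumption, the calculation looks as follows. The symbols are shorthand introduced here for illustration: E = the ecologist is on East Beach, W = the ecologist is on West Beach, L = a Loggerhead Turtle is found.

```latex
\[
P(E \mid L)
  = \frac{P(L \mid E)\,P(E)}{P(L \mid W)\,P(W) + P(L \mid E)\,P(E)}
  = \frac{0.4 \times 0.5}{0.1 \times 0.5 + 0.4 \times 0.5}
  = \frac{0.20}{0.25}
  = 0.8
\]
```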

# Define known conditions
p_west <- 0.5  # Probability of reaching West Beach
p_east <- 0.5  # Probability of reaching East Beach
p_loggerhead_given_west <- 0.1  # Probability of finding a Loggerhead Turtle on West Beach
p_loggerhead_given_east <- 0.4  # Probability of finding a Loggerhead Turtle on East Beach

# Apply Bayes' theorem to calculate the probability of being on East Beach given a Loggerhead Turtle is found
p_east_given_loggerhead <- (p_loggerhead_given_east * p_east) / 
  (p_loggerhead_given_west * p_west + p_loggerhead_given_east * p_east)

# Print the result
cat("The probability of being on East Beach given that a Loggerhead Turtle is found is:", p_east_given_loggerhead, "\n")

Conclusion

Based on the given information and assumptions, the probability of being on East Beach given that a Loggerhead Turtle is found is 0.8, or 80%.

This result relies on the assumptions that the ecologist is either on West Beach or East Beach and that the probability of reaching each beach in the foggy weather is equal. If these assumptions do not hold, the calculated probability may differ. For example, if the probability of reaching West Beach in the foggy weather is higher, the ecologist might still be more likely to be on West Beach even after finding a Loggerhead Turtle.
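
To see how strongly the answer depends on the second assumption, the sketch below recomputes the posterior for several prior probabilities of reaching West Beach. The helper name `posterior_east` and the grid of priors are illustrative choices, not part of the original analysis; the likelihoods are the ones given in the problem statement.

```r
# Posterior probability of East Beach given a Loggerhead, as a function of
# the prior probability of having reached West Beach.
# `posterior_east` is a hypothetical helper introduced for this illustration.
posterior_east <- function(p_west,
                           p_loggerhead_given_west = 0.1,
                           p_loggerhead_given_east = 0.4) {
  p_east <- 1 - p_west
  (p_loggerhead_given_east * p_east) /
    (p_loggerhead_given_west * p_west + p_loggerhead_given_east * p_east)
}

# Illustrative grid of priors (an assumption, not data from the problem)
priors_west <- c(0.5, 0.6, 0.7, 0.8, 0.9)
sensitivity <- data.frame(
  p_west = priors_west,
  p_east_given_loggerhead = posterior_east(priors_west)
)
print(sensitivity)
```

With these likelihoods the two beaches are equally likely when the prior for West Beach is 0.8 (since 0.4 × 0.2 = 0.1 × 0.8), so West Beach only becomes the more probable location if that prior exceeds 0.8.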

2. Classifying neuron types from electrophysiological recordings

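library(tidyverse)
# Load the recordings (the path is specific to the author's machine)
vmndata <- read.csv("/Users/chen_yiru/Desktop/Desk/Projects/incourse/大二下/ADS_files/vmndata.csv")
head(vmndata)
# Count missing values in each column
colSums(is.na(vmndata))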

There are no NA values.

vmndata_duplicate <- duplicated(vmndata)

sum(vmndata_duplicate)

There are no duplicate rows.

ggplot(data = vmndata, aes(x = hap1, y = hap2, color= type)) + geom_point() + ggtitle("Original classification")

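Finding the optimal number of clusters

Here we already know that there are five types in total, but when the number of classes is unknown this step is still necessary.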

dots <- vmndata[,c(2,3)]
library(factoextra)
set.seed(123)
fviz_nbclust(dots, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)
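
For reference, the curve drawn by `fviz_nbclust(..., method = "wss")` is the total within-cluster sum of squares for each candidate number of clusters k. A minimal base-R sketch of the same idea is shown below. It uses synthetic two-dimensional data rather than `vmndata`: the `toy` data frame and its group centres are made up for illustration, and `nstart = 25` is an arbitrary choice.

```r
set.seed(123)

# Synthetic data: three well-separated 2-D groups (illustrative only)
toy <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 5), rnorm(50, mean = 10)),
  y = c(rnorm(50, mean = 0), rnorm(50, mean = 5), rnorm(50, mean = 0))
)

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The "elbow" is the value of k beyond which adding another cluster reduces the within-cluster sum of squares only slightly.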

model <- kmeans(dots, 5)
model
vmndata$cluster <- as.factor(model$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= cluster)) + geom_point() + ggtitle("Cluster classification")

Test the clustering using different subsets of the fit parameters

sub_hap1 <- vmndata[,2]
hap1_result <- kmeans(sub_hap1,5)
vmndata$hap1_cluster <- as.factor(hap1_result$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= hap1_cluster)) + geom_point() + ggtitle("Hap1 classification")
sub_hap2 <- vmndata[,3]
hap2_result <- kmeans(sub_hap2,5)
vmndata$hap2_cluster <- as.factor(hap2_result$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= hap2_cluster)) + geom_point() + ggtitle("Hap2 classification")

Comparison between the original classification and the cluster classification: evaluating the results

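The Adjusted Rand Index (ARI) is a statistic for measuring the agreement between two assignments of the same data (for example, the true labels and the labels produced by a clustering algorithm). Its maximum value is 1, and it can be negative:

1 means the two assignments agree perfectly.
0 means chance-level agreement, i.e. the two assignments agree no more than random labels would.
Negative values mean the agreement is worse than expected from random labels.

ARI has no absolute threshold for "good" or "bad", but as a rough rule of thumb:

Close to 1: very good
0.7 to 0.9: good
0.4 to 0.69: fair
0.2 to 0.39: poor
Close to 0: random
Negative: worse than random

# Example of calculating ARI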
library(CommKern)
ari_score <- adj_RI(vmndata$type, vmndata$cluster)
print(ari_score)
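
To make the definition concrete, the ARI can also be computed directly from the contingency table of the two labelings. The function below is a minimal base-R sketch of the standard pair-counting formula; the name `ari_by_hand` and the toy label vectors are hypothetical and are not part of CommKern or the original analysis.

```r
# Adjusted Rand Index from a contingency table (base R only).
# `ari_by_hand` is a hypothetical helper written for illustration.
ari_by_hand <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  tab <- table(labels_a, labels_b)          # contingency table n_ij
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))          # pairs grouped together in both labelings
  sum_rows  <- sum(choose(rowSums(tab), 2)) # pairs grouped together in labels_a
  sum_cols  <- sum(choose(colSums(tab), 2)) # pairs grouped together in labels_b
  expected  <- sum_rows * sum_cols / choose(n, 2)  # value expected by chance
  max_index <- (sum_rows + sum_cols) / 2           # value for perfect agreement
  (sum_cells - expected) / (max_index - expected)
}

# Toy labels: same grouping, different cluster names
truth    <- c("a", "a", "b", "b", "c", "c")
clusters <- c(2, 2, 3, 3, 1, 1)
ari_same <- ari_by_hand(truth, clusters)

# Toy labels: one point assigned to a different group
clusters_off <- c(2, 2, 3, 1, 1, 1)
ari_off <- ari_by_hand(truth, clusters_off)

print(c(identical_partition = ari_same, one_point_moved = ari_off))
```

Relabeling the clusters does not change the score: the first toy comparison gives 1 because the two partitions are identical up to cluster names, which is why the arbitrary cluster numbers returned by `kmeans` can be compared with `type` directly.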

From: https://www.cnblogs.com/chen-heybro/p/18212750
