Tags: probability West workflow Kmeans ecologist clustering Loggerhead East Beach

1. Turtles

Introduction

In this report, we will analyze a problem related to turtle populations on a small island with two beaches: West Beach and East Beach. The goal is to determine the probability of being on East Beach given that a Loggerhead Turtle is found. We will use Bayes' theorem and R programming to solve this problem.

Problem Statement

An ecologist studying turtles on a small island with two beaches knows the following information about the turtle population:

On West Beach, 90% of turtles are Green Sea Turtles, and the remaining 10% are Loggerhead Sea Turtles.
On East Beach, 60% of turtles are Green Sea Turtles, while 40% are Loggerhead Turtles.

On a foggy day, the ecologist gets lost on the island. After hours of walking, they reach a beach but cannot determine which one it is due to the dense fog. The ecologist finds a turtle and examines it, discovering that it is a Loggerhead Turtle.

The question is: What is the probability that the ecologist is on East Beach? Additionally, we need to state the assumptions made to arrive at this probability.

Assumptions

To solve this problem, we make the following assumptions:

The ecologist is either on West Beach or East Beach; there are no other possibilities.
In the foggy weather, the probability of reaching West Beach or East Beach is equal, i.e., 50% each.

Solution

We will use Bayes' theorem to calculate the probability of being on East Beach given that a Loggerhead Turtle is found.
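
Written out as a worked equation, with E for East Beach, W for West Beach, and L for finding a Loggerhead Turtle (symbols introduced here for illustration only), the calculation under the stated assumptions is:

```latex
P(E \mid L)
  = \frac{P(L \mid E)\,P(E)}{P(L \mid W)\,P(W) + P(L \mid E)\,P(E)}
  = \frac{0.4 \times 0.5}{0.1 \times 0.5 + 0.4 \times 0.5}
  = \frac{0.2}{0.25}
  = 0.8
```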

# Define known conditions
p_west <- 0.5  # Probability of reaching West Beach
p_east <- 0.5  # Probability of reaching East Beach
p_loggerhead_given_west <- 0.1  # Probability of finding a Loggerhead Turtle on West Beach
p_loggerhead_given_east <- 0.4  # Probability of finding a Loggerhead Turtle on East Beach

# Apply Bayes' theorem to calculate the probability of being on East Beach given a Loggerhead Turtle is found
p_east_given_loggerhead <- (p_loggerhead_given_east * p_east) / 
  (p_loggerhead_given_west * p_west + p_loggerhead_given_east * p_east)

# Print the result
cat("The probability of being on East Beach given that a Loggerhead Turtle is found is:", p_east_given_loggerhead, "\n")

Conclusion

Based on the given information and assumptions, the probability of being on East Beach given that a Loggerhead Turtle is found is 0.8, or 80%.

This result relies on the assumptions that the ecologist is on either West Beach or East Beach and that each beach is equally likely to be reached in the fog. If these assumptions do not hold, the calculated probability will differ. For example, if the prior probability of having reached West Beach exceeds 80%, West Beach remains the more likely location even after a Loggerhead Turtle is found.
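
How strongly the answer depends on the prior can be checked numerically. The following is a minimal sketch, not part of the original analysis: `posterior_east` is a hypothetical helper, and the prior values tried are arbitrary illustrations.

```r
# Posterior probability of East Beach given a Loggerhead Turtle, as a function
# of the prior probability of having reached East Beach.
# posterior_east is a hypothetical helper introduced for illustration.
posterior_east <- function(p_east,
                           p_loggerhead_given_west = 0.1,
                           p_loggerhead_given_east = 0.4) {
  p_west <- 1 - p_east  # assumes the two beaches are the only possibilities
  (p_loggerhead_given_east * p_east) /
    (p_loggerhead_given_west * p_west + p_loggerhead_given_east * p_east)
}

# Arbitrary example priors for East Beach, from unlikely to very likely
priors <- c(0.1, 0.2, 0.5, 0.9)
posteriors <- posterior_east(priors)
print(data.frame(prior_east = priors, posterior_east = round(posteriors, 3)))
```

With the equal prior of 0.5 the function reproduces the 80% result. A prior of 0.2 for East Beach is the break-even point, where both beaches are equally likely after the observation.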

2. Classifying neuron types from electrophysiological recordings

library(tidyverse)
vmndata <- read.csv("/Users/chen_yiru/Desktop/Desk/Projects/incourse/大二下/ADS_files/vmndata.csv")
head(vmndata)

colSums(is.na(vmndata))

There are no NA values.

vmndata_duplicate <- duplicated(vmndata)

sum(vmndata_duplicate)

No duplicates.

ggplot(data = vmndata, aes(x = hap1, y = hap2, color= type)) + geom_point() + ggtitle("Original classification")

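Finding the optimal number of clusters

Here we already know that there are five classes in total, but this step is still needed when the number of classes is unknown.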

dots <- vmndata[,c(2,3)]
library(factoextra)
set.seed(123)
fviz_nbclust(dots, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)
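
To show what the "wss" (elbow) method computes, here is a rough base-R sketch that calculates the total within-cluster sum of squares for each candidate k. It runs on simulated stand-in data rather than vmndata: the group centers, the noise level, and the use of `nstart = 25` are assumptions made for illustration.

```r
# Simulated stand-in data (an assumption, not the vmndata values):
# two numeric features, five well-separated groups of 40 points each.
set.seed(123)
true_centers <- data.frame(hap1 = c(0, 5, 10, 0, 10),
                           hap2 = c(0, 5, 0, 10, 10))
sim <- data.frame(
  hap1 = rep(true_centers$hap1, each = 40) + rnorm(200, sd = 0.5),
  hap2 = rep(true_centers$hap2, each = 40) + rnorm(200, sd = 0.5)
)

# Total within-cluster sum of squares for k = 1..10.
# nstart = 25 reruns k-means from several random starts and keeps the best fit.
ks <- 1:10
wss <- sapply(ks, function(k) {
  kmeans(sim, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read off by eye, so it is a heuristic rather than a definitive answer. When the true number of classes is known, as it is here, it is reasonable to use that number directly.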

model <- kmeans(dots, 5)
model
vmndata$cluster <- as.factor(model$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= cluster)) + geom_point() + ggtitle("Cluster classification")

Test the clustering using different subsets of the fit parameters

sub_hap1 <- vmndata[,2]
hap1_result <- kmeans(sub_hap1,5)
vmndata$hap1_cluster <- as.factor(hap1_result$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= hap1_cluster)) + geom_point() + ggtitle("Hap1 classification")

sub_hap2 <- vmndata[,3]
hap2_result <- kmeans(sub_hap2,5)
vmndata$hap2_cluster <- as.factor(hap2_result$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= hap2_cluster)) + geom_point() + ggtitle("Hap2 classification")

Comparison between the original classification and the cluster classification: evaluating the results

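The Adjusted Rand Index (ARI) is a statistic that measures the agreement between two assignments of the same data (for example, the true labels and the labels produced by a clustering algorithm). Its maximum value is 1, and it can be negative:

1 means the two assignments agree perfectly.
0 means chance-level agreement, i.e. the agreement expected between random labelings.
Negative values mean the two assignments agree less than random labelings would.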

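ARI has no absolute threshold for "good" or "bad", but a common rule of thumb is:

Close to 1: very good
0.7 to 0.9: good
0.4 to 0.69: fair
0.2 to 0.39: poor
Close to 0: random
Negative: worse than random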

# Example of calculating ARI
library(CommKern)
ari_score <- adj_RI(vmndata$type, vmndata$cluster)
print(ari_score)
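
To show what the index measures, here is a minimal base-R sketch of the standard contingency-table formula for ARI. `adjusted_rand_index` is a hypothetical helper written for illustration, not the CommKern implementation, and the toy labels are made up rather than taken from vmndata.

```r
# Adjusted Rand Index computed from the contingency table of two labelings.
# Hypothetical helper for illustration; undefined (NaN) in the degenerate case
# where both labelings put every point into a single group.
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))           # pairs grouped together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose(n, 2)  # value expected by chance
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Toy labels (made up for illustration)
truth   <- c("A", "A", "A", "B", "B", "B")
relabel <- c(2, 2, 2, 1, 1, 1)  # same partition, different label names
mixed   <- c(1, 1, 2, 2, 3, 3)  # splits both true groups

print(table(truth, mixed))                  # cross-tabulation of the labelings
print(adjusted_rand_index(truth, relabel))  # identical partitions
print(adjusted_rand_index(truth, mixed))    # partial agreement
```

Because k-means cluster numbers are arbitrary, the index compares how pairs of points are grouped rather than whether the label names match. This is why the relabelled partition still scores as a perfect match.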

From： https://www.cnblogs.com/chen-heybro/p/18212750

K-means clustering workflow