Knit skills
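eval=FALSE: the code is not run, so no output is shown
hist(rnorm(10000), col = 'tomato') # eval
echo=FALSE: the code is hidden, the output is still shown
hist(rnorm(10000), col = 'tomato') # echo
include=FALSE: the chunk runs, but neither code nor output is shown
hist(rnorm(10000), col = 'tomato') # include
warning=FALSE: warnings are not shown
library(ggplot2)
error=FALSE: knitting stops at the first error instead of printing it in the output (error=TRUE prints the error and continues)
#num <- "ADS" + 100
Calling a function as package::function() may help when knitting fails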
#knitr::kable(summary(data))
convert to xelatex:
output:
  pdf_document:
    latex_engine: xelatex
Data processing
import data
data <- read.csv("path", header = TRUE, sep = ",")
# data <- read.table("path") # when processing a .txt file
head(data)
basic examination
summary(data)
unique(data$col)
str(data)
examine NA
is.na(data)
# summary(is.na(data))
# sum(is.na(data))
# anyNA(data)
# sum(!complete.cases(data)) # complete.cases() is in base R (stats), no extra package needed
data <- na.omit(data)
# to drop rows with NA only in chosen columns: data <- tidyr::drop_na(data, case_id, date_onset, age)
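The NA checks above can be compared side by side on a small example. This is a minimal sketch: the data frame `toy` and its column names are hypothetical and are not part of the original notes.

```r
# Hypothetical toy data frame; column names are made up for illustration
toy <- data.frame(
  id    = 1:5,
  age   = c(23, NA, 31, 40, NA),
  group = c("a", "b", NA, "a", "b")
)

na_per_column <- colSums(is.na(toy))        # NA count per column
na_total      <- sum(is.na(toy))            # total number of NA cells
incomplete    <- sum(!complete.cases(toy))  # rows with at least one NA

# Drop rows only when `age` is missing, keeping rows with NA elsewhere
toy_age_known <- toy[!is.na(toy$age), ]

# Drop every row that has any NA
toy_complete <- na.omit(toy)

print(na_per_column)
print(nrow(toy_complete))
```

Counting per column first helps decide whether na.omit() is acceptable or whether it would throw away too many rows because of one sparse column.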
examine duplication
duplicated(data)
data <- distinct(data) # dplyr
# data <- unique(data)
convert to factor
library(dplyr)
teeth <- teeth %>%
  mutate(dose = factor(dose, levels = c(0.5, 1, 2), ordered = TRUE),
         supp = as.factor(supp)) %>%
  relocate(supp, dose)
str(teeth)
wide to long
library(tidyr)
data <- data.frame(
  ID = 1:10,
  Before = c(5, 3, 6, 7, 4, 6, 8, 7, 5, 6),
  After = c(7, 4, 6, 9, 5, 8, 9, 8, 7, 7))
data_long <- data %>%
  pivot_longer(cols = c(Before, After), names_to = "Time", values_to = "Value")
long to wide
data_wide <- data_long %>%
  pivot_wider(names_from = Time, values_from = Value)
Visualization
Grouped boxplot with three variables (x factor, numeric y, fill group):
ggplot(data, aes(x = factor_variable,
                 y = numeric_variable,
                 fill = group_variable)) +
  geom_boxplot() +
  labs(title = "Boxplot of Numeric Variable by Factor and Group Variables",
       x = "Factor Variable", y = "Numeric Variable") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
Statistics
t-test:
t.test(variable1, variable2, var.equal = TRUE)
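U test (Mann-Whitney / Wilcoxon rank-sum):
- H0: the two groups have the same distribution
- HA: the two groups have different distributions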
wilcox.test(variable1, variable2)
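One way to choose between the t-test and the U test is to check normality and equal variances first. The sketch below uses the built-in ToothGrowth data purely as an illustration. The 0.05 threshold and the decision rule are assumptions, not something the notes prescribe.

```r
# ToothGrowth ships with base R (datasets package)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Normality within each group (Shapiro-Wilk test)
p_norm_oj <- shapiro.test(oj)$p.value
p_norm_vc <- shapiro.test(vc)$p.value

# Equality of variances (F test)
p_var <- var.test(oj, vc)$p.value

alpha <- 0.05  # assumed significance level
if (p_norm_oj > alpha && p_norm_vc > alpha) {
  # Student's t-test if variances look equal, Welch's t-test otherwise
  result <- t.test(oj, vc, var.equal = p_var > alpha)
} else {
  # exact = FALSE avoids the warning about ties in the data
  result <- wilcox.test(oj, vc, exact = FALSE)
}
print(result)
```

With small samples the normality test has little power, so a Q-Q plot (qqnorm() plus qqline()) is a useful complement.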
ANOVA:
- Choose: first state how many groups and how many factors there are, then decide which ANOVA (or the Kruskal-Wallis test) to use
anova_result <- aov(variable1 ~ group1 * group2, data = data)
- Justify: assumptions for a two-way ANOVA are:
  - Independence of observations:
    - Can usually be assumed from the study design.
  - Normality of residuals:
    - Can be checked visually (plot(anova_result, 2)) or with a suitable hypothesis test (e.g. shapiro.test on the residuals)
  - Equality of variance:
    - Can be checked visually (plot(anova_result, 1)) or with a suitable hypothesis test (e.g. bartlett.test)
  - Equal group sizes (an unbalanced design requires a different type of sums of squares in the ANOVA table):
    - Group sizes can be read off in the data examination step
  - If the assumptions hold, a parametric ANOVA can be used.
- Statistical hypotheses:
  - H0: means of different supp groups are the same
  - H1: means of different supp groups are NOT the same
- Carry out the test:
summary(anova_result)
TukeyHSD(anova_result)
- Suggestions:
  - No untreated (control) group
  - Other biological significance
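The assumption checks listed for the two-way ANOVA can be sketched with base R tests. The example assumes the built-in ToothGrowth data (len by supp and dose). The object names `tg` and `fit` are illustrative, and Bartlett's test is only one possible choice for the variance check.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)  # treat dose as a categorical factor

fit <- aov(len ~ supp * dose, data = tg)

# Normality of residuals (Shapiro-Wilk test)
p_normal <- shapiro.test(residuals(fit))$p.value

# Equality of variance across the supp x dose cells (Bartlett's test)
p_variance <- bartlett.test(len ~ interaction(supp, dose), data = tg)$p.value

# Group sizes: a balanced design has the same count in every cell
cell_sizes <- table(tg$supp, tg$dose)
balanced <- length(unique(as.vector(cell_sizes))) == 1

print(c(normality = p_normal, variance = p_variance))
print(cell_sizes)

# Non-parametric fallback for a single factor if the checks fail
kw <- kruskal.test(len ~ dose, data = tg)
print(kw)
```

Note that kruskal.test() handles one grouping factor only, so it replaces a one-way ANOVA rather than the full two-way model.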
Chi-square:
Justify:
- Chi-square goodness-of-fit test
- Chi-square test for homogeneity
- Chi-square test for independence
Assumptions:
• The variables must be categorical.
– Fits.
• Observations must be independent.
– Can assume from the task. Fits.
• Cells in the contingency table are mutually exclusive.
– Fits.
• The expected value of cells should be 5 or greater in at least 80% of cells.
– See Table. Fits.
chisq.test(x = table, p = predict) # goodness-of-fit
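The goodness-of-fit call and the contingency-table form differ only in what is passed to chisq.test(). The sketch below shows both, plus a check of the expected-count assumption. All counts and proportions are hypothetical and were invented for illustration.

```r
# Goodness-of-fit: observed counts against proportions expected under H0
observed <- c(red = 48, green = 35, blue = 17)  # hypothetical counts
expected_prop <- c(0.5, 0.3, 0.2)               # must sum to 1
gof <- chisq.test(x = observed, p = expected_prop)

# Independence / homogeneity: pass a two-way table of counts
counts <- matrix(c(30, 20,
                   15, 35),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))
ind <- chisq.test(counts)

# Assumption check: share of cells with an expected count of at least 5
share_ok <- mean(ind$expected >= 5)

print(gof)
print(ind$expected)
print(share_ok)
```

For a 2 x 2 table chisq.test() applies Yates' continuity correction by default. If too many expected counts fall below 5, fisher.test() is a common alternative.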
Bootstrapping
- Use when the data are categorical but the independence assumption is seriously violated
first_satisfied <- 864
first_unsatisfied <- 714
second_satisfied <- 980
second_unsatisfied <- 473
first_bootstraps <- vector()
second_bootstraps <- vector()
first_results <- c(rep(1, first_satisfied), rep(0, first_unsatisfied))
second_results <- c(rep(1, second_satisfied), rep(0, second_unsatisfied))
# resample each group with replacement and store the proportion satisfied
for (a in 1:100) {
  first_sample <-
    mean(sample(first_results, length(first_results), replace = TRUE))
  second_sample <-
    mean(sample(second_results, length(second_results), replace = TRUE))
  first_bootstraps <- c(first_bootstraps, first_sample)
  second_bootstraps <- c(second_bootstraps, second_sample)
}
# upper 97.5% bound of the first group, lower 2.5% bound of the second
first_upper <- quantile(first_bootstraps, probs = 0.975)
second_lower <- quantile(second_bootstraps, probs = 0.025)
boxplot(
  first_bootstraps,
  second_bootstraps,
  notch = TRUE,
  names = c('early', 'late'),
  ylab = 'Prop. of satisfied button presses')
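Then check whether the two intervals overlap: TRUE below means the upper bound of the first group lies below the lower bound of the second, i.e. the intervals do not overlap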
first_upper < second_lower
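An alternative to comparing two separate intervals is to bootstrap the difference in proportions directly and check whether its interval excludes 0. This is a sketch under assumptions: it reuses the counts from the notes, but the number of resamples (2000) and the seed are arbitrary choices, and it is a different technique from the interval comparison above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

first_results  <- c(rep(1, 864), rep(0, 714))
second_results <- c(rep(1, 980), rep(0, 473))

n_boot <- 2000  # assumed number of resamples

# Each replicate resamples both groups and records the difference in proportions
boot_diff <- replicate(n_boot, {
  mean(sample(second_results, replace = TRUE)) -
    mean(sample(first_results, replace = TRUE))
})

# 95% percentile interval for the difference
ci <- quantile(boot_diff, probs = c(0.025, 0.975))
print(ci)

# The groups differ at this level if the interval excludes 0
excludes_zero <- unname(ci[1] > 0 || ci[2] < 0)
print(excludes_zero)
```

Using replicate() also avoids growing a vector inside a loop, which becomes slow for large numbers of resamples.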
Bayes
P(A|B) = P(B|A)*P(A)/P(B)
# compare two hypotheses:
P(H1|D)/P(H2|D) = P(D|H1)/P(D|H2) * P(H1)/P(H2)
# Bayes Factor:
P(D|H1)/P(D|H2)
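A small numeric example shows how the Bayes factor turns prior odds into posterior odds. The data (7 successes in 10 trials), the two point hypotheses, and the equal priors are all hypothetical values chosen for illustration.

```r
# Hypothetical data: 7 successes in 10 trials
successes <- 7
trials <- 10

# Two point hypotheses about the success probability (assumed values)
lik_h1 <- dbinom(successes, trials, prob = 0.7)  # P(D | H1)
lik_h2 <- dbinom(successes, trials, prob = 0.5)  # P(D | H2)

# Bayes factor: P(D | H1) / P(D | H2)
bayes_factor <- lik_h1 / lik_h2

# Posterior odds = Bayes factor * prior odds
prior_odds <- 1  # P(H1) / P(H2), equal priors assumed
posterior_odds <- bayes_factor * prior_odds

# Convert the odds back to a probability for H1
posterior_h1 <- posterior_odds / (1 + posterior_odds)

print(c(bayes_factor = bayes_factor, posterior_h1 = posterior_h1))
```

With equal priors the posterior odds equal the Bayes factor, so any change in prior_odds scales the result linearly.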
K-means
visualization
ggplot(data, aes(x = ln_hap1, y = hap2, color = type)) +
  # to plot a clustering result, map color to the cluster column instead of type
  geom_point() +
  theme_minimal() +
  labs(title = "xxx") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
clustering
features <- data[, c("hap1", "hap2")]
set.seed(123)
k <- 5
kmeans_result <- kmeans(features, centers = k, nstart = 1)
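The value of k above is fixed by hand. One common way to choose it is the elbow heuristic: run kmeans() for several k and look at where the total within-cluster sum of squares stops dropping sharply. The sketch below assumes the built-in iris data in place of the hap1/hap2 columns. The scaling step, the range of k, and nstart = 25 are illustrative choices.

```r
# Two numeric features from the built-in iris data, standardised
features <- scale(iris[, c("Petal.Length", "Petal.Width")])

set.seed(123)
ks <- 1:6

# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})
print(data.frame(k = ks, wss = round(wss, 2)))

# Fit with the chosen k and compare clusters with the known species labels
fit <- kmeans(features, centers = 3, nstart = 25)
print(table(cluster = fit$cluster, species = iris$Species))
```

Plotting wss against ks (for example plot(ks, wss, type = "b")) makes the elbow easier to see. A larger nstart reruns the algorithm from several random starts and keeps the best solution, which makes the result less sensitive to initialisation than nstart = 1.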
Finally:
Sys.time()
From: https://www.cnblogs.com/Qbio/p/18222152