SciTech-Mathematics-Probability+Statistics-7 Key Statistics Concepts

Posted: 2024-08-12 11:49:04
Tags: Statistics, samples, Probability, sampling, Key, distribution, data, mean

7 Key Statistics Concepts Every Data Scientist Must Master

By Bala Priya C, posted on August 9, 2024

Statistics is one of the must-have skills for all data scientists. But learning statistics can be quite the task.

That’s why we put together this guide to help you understand essential statistics concepts for data science. It gives you an overview of the statistics you need to know as a data scientist, so that you can then explore specific topics further.

Let’s get started.

1. Descriptive Statistics

Descriptive statistics provide a summary of the main features of a dataset for preliminary data analysis. Key metrics include measures of central tendency, dispersion, and shape.

Measures of Central Tendency

These metrics describe the center or typical value of a dataset:

  • Mean: Average value, sensitive to outliers
  • Median: Middle value, robust to outliers
  • Mode: Most frequent value, indicating common patterns
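
As a minimal sketch of how these three measures behave, the snippet below uses Python's standard `statistics` module on a small, made-up salary list (the numbers are hypothetical and chosen only so that one outlier is present). It illustrates how the outlier pulls the mean upward while the median and mode stay put.

```python
import statistics

# Hypothetical data: nine typical salaries (in thousands) plus one outlier.
salaries = [42, 45, 45, 47, 50, 52, 55, 58, 60, 400]

mean_value = statistics.mean(salaries)      # pulled upward by the 400 outlier
median_value = statistics.median(salaries)  # average of the two middle values
mode_value = statistics.mode(salaries)      # most frequent value

print(f"mean={mean_value}, median={median_value}, mode={mode_value}")
```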

Measures of Dispersion

These metrics describe data spread or variability:

  • Range: Difference between highest and lowest values, sensitive to outliers
  • Variance: Average squared deviation from the mean, indicating overall data spread
  • Standard deviation: Square root of variance, in the same unit as the data. Low values indicate data points close to the mean, high values indicate widely spread data
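
The following sketch computes these dispersion measures with the standard `statistics` module on a small made-up dataset. It assumes the "average squared deviation" definition above, which is the population variance (`pvariance`); the sample variance (`variance`), which divides by n - 1 instead of n, is shown alongside for comparison.

```python
import statistics

# Hypothetical dataset with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)            # highest minus lowest value
pop_variance = statistics.pvariance(data)     # average squared deviation (divides by n)
pop_std = statistics.pstdev(data)             # square root of the population variance
sample_variance = statistics.variance(data)   # divides by n - 1 instead of n

print(f"range={data_range}, variance={pop_variance}, std={pop_std}")
print(f"sample variance={sample_variance:.3f}")
```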

Measures of Shape

These metrics describe the data distribution shape:

  • Skewness: Asymmetry of the distribution; positive for right-skewed, negative for left-skewed
  • Kurtosis: “Tailedness” of the distribution; high values indicate heavy tails (more outliers), low values indicate light tails
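
One way to make these shape measures concrete is to compute them as standardized moments. The helper functions below are illustrative (their names are not from any library), use the population standard deviation, and report excess kurtosis, that is, kurtosis minus 3 so that a normal distribution scores 0. Other conventions exist, so treat this as a sketch under those assumptions.

```python
import statistics

def skewness(values):
    """Population skewness: the third standardized moment."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (a normal distribution gives 0)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 4 for x in values) / len(values) - 3.0

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail

print(f"symmetric:    skew={skewness(symmetric):.3f}, "
      f"excess kurtosis={excess_kurtosis(symmetric):.3f}")
print(f"right-skewed: skew={skewness(right_skewed):.3f}, "
      f"excess kurtosis={excess_kurtosis(right_skewed):.3f}")
```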

Understanding these metrics is foundational for further statistical analysis and modeling, helping to characterize the distribution, spread, and central tendencies of your data.

2. Sampling Methods

Sampling lets you estimate population characteristics from a subset of the population. For those estimates to be trustworthy, the samples should accurately reflect the population. Let's go over the common sampling methods.

Random Sampling

Random sampling gives every member of the population an equal chance of being selected, which minimizes selection bias and makes the sample more likely to be representative. You assign unique numbers to population members and use a random number generator to select the sample.
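
A minimal sketch of this procedure, assuming a hypothetical population of 100 members numbered 1 to 100 and a sample of 10. The fixed seed is only there to make the run reproducible.

```python
import random

rng = random.Random(42)                  # seeded generator, for reproducibility only

population_ids = list(range(1, 101))     # unique numbers assigned to each member
sample_ids = rng.sample(population_ids, k=10)   # draw 10 members without replacement

print(sorted(sample_ids))
```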

Stratified Sampling

Stratified sampling ensures that all subgroups are represented. It divides the population into homogeneous strata (based on attributes such as age or gender) and randomly samples from each stratum in proportion to its size.
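
The proportional allocation described above can be sketched as follows. The function name `stratified_sample`, the age-group labels, and the 20% sampling fraction are all illustrative assumptions; the rule of taking at least one member per stratum is a design choice of this sketch, not part of the method's definition.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, rng):
    """Sample the same fraction from every stratum (proportional allocation)."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of (id, age_group) pairs: 60 / 30 / 10 members.
population = (
    [(i, "18-34") for i in range(60)]
    + [(i, "35-54") for i in range(60, 90)]
    + [(i, "55+") for i in range(90, 100)]
)

rng = random.Random(0)
sample = stratified_sample(population, lambda record: record[1], 0.2, rng)
print(len(sample), "members sampled")
```

With a 20% fraction, each age group contributes in proportion to its size, so small groups are not left out by chance.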

Cluster Sampling

Cluster sampling is cost-effective for large, spread-out populations. You divide the population into clusters (such as geographical areas), randomly select some of the clusters, and then either sample every member of the chosen clusters or randomly select members within them.
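
Here is a rough sketch of both variants: taking every member of the chosen clusters, or drawing a random subset within each. The `cluster_sample` helper, the district names, and the cluster sizes are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng, per_cluster=None):
    """Randomly pick clusters, then take all members or a random subset of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)                  # one-stage: everyone in the cluster
        else:
            k = min(per_cluster, len(members))
            sample.extend(rng.sample(members, k))   # two-stage: random subset
    return chosen, sample

# Hypothetical population: 5 districts with 20 people each.
clusters = {
    f"district_{d}": [f"d{d}_p{i}" for i in range(20)] for d in range(5)
}

rng = random.Random(1)
chosen_a, one_stage = cluster_sample(clusters, 2, rng)
chosen_b, two_stage = cluster_sample(clusters, 2, rng, per_cluster=5)
print(chosen_a, len(one_stage))
print(chosen_b, len(two_stage))
```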

Systematic Sampling

Systematic sampling is another technique that ensures the samples are evenly spread across the population. You assign unique numbers to population members, determine the sampling interval (k), randomly select a starting point, and then select every k-th member.
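
The systematic procedure above can be sketched in a few lines. This assumes the interval is taken as k = N // n for a population of size N and a desired sample size n, which is one common choice; the function name and the population are illustrative.

```python
import random

def systematic_sample(population, sample_size, rng):
    """Pick a random start within the first interval, then every k-th member."""
    if not 0 < sample_size <= len(population):
        raise ValueError("sample_size must be between 1 and the population size")
    k = len(population) // sample_size   # sampling interval
    start = rng.randrange(k)             # random starting point in [0, k)
    return population[start::k][:sample_size]

population = list(range(1, 101))         # hypothetical members numbered 1..100
sample = systematic_sample(population, 10, random.Random(7))
print(sample)
```
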

Choosing the right sampling method makes the study design more effective and the samples more representative. This in turn improves the reliability of your conclusions.

3. Probability Distributions

Probability distributions represent the likelihood of different outcomes. When you’re starting out, you should learn about the normal, binomial, Poisson, and exponential distributions, each with its own properties and applications.

Normal Distribution

Many real-world quantities approximately follow a normal distribution, which has the following properties:

  • Symmetric around the mean, with the mean, median, and mode being equal
  • Characterized by the mean (µ) and standard deviation (σ)
  • Follows the empirical rule: ~68% of data lie within one standard deviation of the mean, ~95% within two, and ~99.7% within three

It’s also important to mention the Central Limit Theorem (CLT) when talking about normal distributions. In simple terms, the CLT states that with a large enough sample size, the sampling distribution of the sample mean approximates a normal distribution, regardless of the shape of the population distribution (provided the population has finite variance).
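
Both ideas can be checked numerically. The sketch below first recovers the empirical-rule percentages from the standard normal CDF using `statistics.NormalDist`, then simulates the CLT by averaging samples drawn from a skewed exponential population. The sample size of 50, the 2000 repetitions, and the seed are arbitrary choices for illustration.

```python
import random
import statistics
from statistics import NormalDist

# Empirical rule: probability mass within k standard deviations of the mean.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} sd: {prob:.4f}")

# CLT: draw many samples from a skewed (exponential) population with mean 1
# and standard deviation 1, and look at the distribution of the sample means.
rng = random.Random(0)            # fixed seed, for reproducibility only
sample_size, n_samples = 50, 2000
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(f"mean of sample means: {mean_of_means:.3f} (theory: 1.0)")
print(f"sd of sample means:   {sd_of_means:.3f} "
      f"(theory: {1 / sample_size ** 0.5:.3f})")
```

Even though individual exponential draws are strongly right-skewed, the sample means cluster around the population mean with a spread close to σ/√n, as the CLT suggests.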

Binomial Distribution

The binomial distribution models the number of successes in n independent Bernoulli trials. Each trial has only two possible outcomes, success or failure, with the same probability of success every time. The binomial distribution is:

  • Defined by the number of trials (n) and the probability of success (p)
  • Suitable for binary outcomes like yes/no or success/failure
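
As a small worked example, the probability of exactly k successes is C(n, k) · p^k · (1 − p)^(n − k). The sketch below evaluates this for 10 fair coin flips (n = 10 and p = 0.5 are illustrative values) and confirms that the expected number of successes is n · p.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5   # hypothetical example: 10 fair coin flips
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
expected_successes = sum(k * prob for k, prob in enumerate(pmf))

print(f"P(exactly 5 heads) = {pmf[5]:.4f}")
print(f"expected number of heads = {expected_successes:.1f}")
```
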

Poisson Distribution

The Poisson distribution is generally used to model the number of events occurring within a fixed interval of time or space. It’s especially suited for rare events and has the following properties:

  • Events occur independently and at a constant average rate (λ)
  • Useful for counting events over continuous domains (time, area, volume)
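
To make this concrete, the probability of observing exactly k events is λ^k · e^(−λ) / k!. The sketch below assumes an illustrative rate of λ = 3 events per interval.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam per interval."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 3.0   # hypothetical average of 3 events per interval
p_zero = poisson_pmf(0, lam)
p_at_most_two = sum(poisson_pmf(k, lam) for k in range(3))

print(f"P(no events)       = {p_zero:.4f}")
print(f"P(at most 2 events) = {p_at_most_two:.4f}")
```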

Exponential Distribution

The exponential distribution is continuous and is used to model the time between events in a Poisson process. The exponential distribution is:

  • Characterized by the rate parameter (λ), which is the inverse of the mean
  • Memoryless, meaning the probability of waiting a further amount of time does not depend on how long you have already waited
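
The memoryless property can be verified directly from the survival function P(T > t) = e^(−λt): the conditional probability P(T > s + t | T > s) equals P(T > t). The rate and the times below are arbitrary illustrative values.

```python
from math import exp

def survival(t, lam):
    """P(T > t) for an exponential waiting time with rate lam."""
    return exp(-lam * t)

lam = 0.5            # hypothetical rate; mean waiting time is 1 / lam = 2
s, t = 3.0, 4.0      # already waited s, asking about t more

conditional = survival(s + t, lam) / survival(s, lam)   # P(T > s + t | T > s)
unconditional = survival(t, lam)                        # P(T > t)

print(f"P(T > s+t | T > s) = {conditional:.6f}")
print(f"P(T > t)           = {unconditional:.6f}")
```
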

Understanding these distributions helps in modeling various types of data.

From: https://www.cnblogs.com/abaelhe/p/18354677
