
Bias of an estimator

Posted: 2023-12-19 09:44:07


In statistics, the bias of an estimator (or bias function) is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

All else being equal, an unbiased estimator is preferable to a biased estimator, although in practice, biased estimators (with generally small bias) are frequently used. When a biased estimator is used, bounds of the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population; because an estimator is difficult to compute (as in unbiased estimation of standard deviation); because a biased estimator may be unbiased with respect to different measures of central tendency; because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful.

Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiased from the usual mean-unbiasedness property. Mean-unbiasedness is not preserved under non-linear transformations, though median-unbiasedness is (see § Effect of transformations); for example, the sample variance is a biased estimator for the population variance. These are all illustrated below.

Definition

Suppose we have a statistical model, parameterized by a real number $\theta$, giving rise to a probability distribution for observed data, $P_\theta(x) = P(x \mid \theta)$, and a statistic $\hat\theta$ which serves as an estimator of $\theta$ based on any observed data $x$. That is, we assume that our data follows some unknown distribution $P(x \mid \theta)$ (where $\theta$ is a fixed, unknown constant that is part of this distribution), and then we construct some estimator $\hat\theta$ that maps observed data to values that we hope are close to $\theta$. The bias of $\hat\theta$ relative to $\theta$ is defined as[1]

$$\operatorname{Bias}(\hat\theta, \theta) = \operatorname{Bias}_\theta[\,\hat\theta\,] = \operatorname{E}_{x \mid \theta}[\,\hat\theta\,] - \theta = \operatorname{E}_{x \mid \theta}[\,\hat\theta - \theta\,],$$

where $\operatorname{E}_{x \mid \theta}$ denotes expected value over the distribution $P(x \mid \theta)$ (i.e., averaging over all possible observations $x$). The second equation follows since $\theta$ is measurable with respect to the conditional distribution $P(x \mid \theta)$.

An estimator is said to be unbiased if its bias is equal to zero for all values of parameter $\theta$, or equivalently, if the expected value of the estimator matches that of the parameter.[2] Unbiasedness is not guaranteed to carry over under transformations. For example, if $\hat\theta$ is an unbiased estimator for parameter $\theta$, it is not guaranteed that $g(\hat\theta)$ is an unbiased estimator for $g(\theta)$.[3]

In a simulation experiment concerning the properties of an estimator, the bias of the estimator may be assessed using the mean signed difference.
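As a rough illustration of such a simulation experiment, the sketch below estimates bias as the mean signed difference between an estimator's value and the true parameter over many simulated samples. The helper names `estimate_bias` and `draw_sample`, the normal data-generating process, and the values of `n`, `mu` and `sigma` are assumptions made for the example, not part of the text above.

```python
import random
import statistics


def estimate_bias(estimator, sampler, true_value, n_trials=20_000, seed=0):
    """Monte Carlo estimate of bias: the mean signed difference
    between the estimator's value and the true parameter value."""
    rng = random.Random(seed)
    diffs = [estimator(sampler(rng)) - true_value for _ in range(n_trials)]
    return statistics.fmean(diffs)


def draw_sample(rng, n=5, mu=0.0, sigma=2.0):
    # Hypothetical data-generating process: n i.i.d. normal draws.
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Sample mean as an estimator of mu = 0 (bias should be close to 0).
bias_mean = estimate_bias(statistics.fmean, draw_sample, true_value=0.0)

# Divide-by-n variance as an estimator of sigma^2 = 4
# (bias should be close to -sigma^2 / n = -0.8).
bias_pvar = estimate_bias(statistics.pvariance, draw_sample, true_value=4.0)

print(f"sample mean bias     ~ {bias_mean:+.3f}")
print(f"divide-by-n var bias ~ {bias_pvar:+.3f}")
```

The simulated values carry Monte Carlo error, so they only approximate the true bias; increasing `n_trials` tightens the approximation.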

Examples

Sample variance

The sample variance of a random variable demonstrates two aspects of estimator bias: first, the naive estimator is biased, which can be corrected by a scale factor; second, the unbiased estimator is not optimal in terms of mean squared error (MSE), which can be minimized by using a different scale factor, resulting in a biased estimator with lower MSE than the unbiased estimator. Concretely, the naive estimator sums the squared deviations and divides by n, which is biased. Dividing instead by n − 1 yields an unbiased estimator. Conversely, MSE can be minimized by dividing by a different number (depending on distribution), but this results in a biased estimator. This number is always larger than n − 1, so this is known as a shrinkage estimator, as it "shrinks" the unbiased estimator towards zero; for the normal distribution the optimal value is n + 1.

Suppose $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.) random variables with expectation $\mu$ and variance $\sigma^2$. If the sample mean and uncorrected sample variance are defined as

$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n}\sum_{i=1}^{n} \big(X_i - \overline{X}\big)^2,$$

then $S^2$ is a biased estimator of $\sigma^2$, because

$$\begin{aligned}
\operatorname{E}[S^2]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}\big)^2\right]
 = \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}\Big((X_i - \mu) - (\overline{X} - \mu)\Big)^2\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}\Big((X_i - \mu)^2 - 2(\overline{X} - \mu)(X_i - \mu) + (\overline{X} - \mu)^2\Big)\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \frac{2}{n}(\overline{X} - \mu)\sum_{i=1}^{n}(X_i - \mu) + \frac{1}{n}(\overline{X} - \mu)^2\sum_{i=1}^{n} 1\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \frac{2}{n}(\overline{X} - \mu)\sum_{i=1}^{n}(X_i - \mu) + \frac{1}{n}(\overline{X} - \mu)^2 \cdot n\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \frac{2}{n}(\overline{X} - \mu)\sum_{i=1}^{n}(X_i - \mu) + (\overline{X} - \mu)^2\right]
\end{aligned}$$

To continue, we note that by subtracting $\mu$ from both sides of $\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, we get

$$\overline{X} - \mu = \frac{1}{n}\sum_{i=1}^{n} X_i - \mu = \frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu).$$

That is, multiplying both sides by $n$, $n \cdot (\overline{X} - \mu) = \sum_{i=1}^{n}(X_i - \mu)$. The previous expression then becomes:

$$\begin{aligned}
\operatorname{E}[S^2]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \frac{2}{n}(\overline{X} - \mu)\sum_{i=1}^{n}(X_i - \mu) + (\overline{X} - \mu)^2\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \frac{2}{n}(\overline{X} - \mu)\cdot n\cdot(\overline{X} - \mu) + (\overline{X} - \mu)^2\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - 2(\overline{X} - \mu)^2 + (\overline{X} - \mu)^2\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - (\overline{X} - \mu)^2\right] \\[8pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] - \operatorname{E}\left[(\overline{X} - \mu)^2\right] \\[8pt]
&= \sigma^2 - \operatorname{E}\left[(\overline{X} - \mu)^2\right] = \left(1 - \frac{1}{n}\right)\sigma^2 < \sigma^2.
\end{aligned}$$

The final equality uses the identity $\operatorname{E}\big[(\overline{X} - \mu)^2\big] = \frac{1}{n}\sigma^2$ for the last term in the expectation of the uncorrected sample variance above; this identity follows from the Bienaymé formula.

In other words, the expected value of the uncorrected sample variance does not equal the population variance $\sigma^2$, unless multiplied by a normalization factor. The sample mean, on the other hand, is an unbiased[4] estimator of the population mean $\mu$.[2]

Note that the usual definition of sample variance is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X})^2$, and this is an unbiased estimator of the population variance.

Algebraically speaking, $S^2$ is unbiased because:

$$\begin{aligned}
\operatorname{E}[S^2]
&= \operatorname{E}\left[\frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \overline{X}\big)^2\right]
 = \frac{n}{n-1}\operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}\big)^2\right] \\[8pt]
&= \frac{n}{n-1}\left(1 - \frac{1}{n}\right)\sigma^2 = \sigma^2,
\end{aligned}$$

where the transition to the second line uses the result derived above for the biased estimator. Thus $\operatorname{E}[S^2] = \sigma^2$, and therefore $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X})^2$ is an unbiased estimator of the population variance, $\sigma^2$. The ratio between the biased (uncorrected) and unbiased estimates of the variance is known as Bessel's correction.
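The two expectations derived above can be checked exactly on a small example. The sketch below assumes a toy population (a fair six-sided die) and a sample size of 3, both chosen only for illustration, and enumerates every possible sample with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]   # fair die: an assumed toy population
n = 3                        # sample size (assumed)

mu = Fraction(sum(faces), len(faces))
sigma2 = sum((Fraction(x) - mu) ** 2 for x in faces) / len(faces)


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


# All 6**n samples are equally likely, so expectations are plain averages.
samples = list(product(faces, repeat=n))
mean_ss = sum(sum_sq_dev(s) for s in samples) / len(samples)

e_uncorrected = mean_ss / n        # E[S^2] with divisor n
e_corrected = mean_ss / (n - 1)    # E[S^2] with divisor n - 1

print("sigma^2            =", sigma2)
print("E[S^2], divisor n  =", e_uncorrected)
print("E[S^2], divisor n-1=", e_corrected)
```

Because the enumeration is exhaustive, the divisor-n expectation equals (1 − 1/n)σ² exactly and the divisor-(n − 1) expectation equals σ² exactly, with no simulation error.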

The reason that an uncorrected sample variance, $S^2$, is biased stems from the fact that the sample mean is an ordinary least squares (OLS) estimator for $\mu$: $\overline{X}$ is the number that makes the sum $\sum_{i=1}^{n}(X_i - \overline{X})^2$ as small as possible. That is, when any other number is plugged into this sum, the sum can only increase. In particular, the choice $\mu \neq \overline{X}$ gives,

$$\frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X})^2 < \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2,$$

and then

$$\operatorname{E}[S^2] = \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X})^2\right] < \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \sigma^2.$$

The above discussion can be understood in geometric terms: the vector $\vec{C} = (X_1 - \mu, \ldots, X_n - \mu)$ can be decomposed into the "mean part" and "variance part" by projecting to the direction of $\vec{u} = (1, \ldots, 1)$ and to that direction's orthogonal complement hyperplane. One gets $\vec{A} = (\overline{X} - \mu, \ldots, \overline{X} - \mu)$ for the part along $\vec{u}$ and $\vec{B} = (X_1 - \overline{X}, \ldots, X_n - \overline{X})$ for the complementary part. Since this is an orthogonal decomposition, the Pythagorean theorem says $|\vec{C}|^2 = |\vec{A}|^2 + |\vec{B}|^2$, and taking expectations we get $n\sigma^2 = n\operatorname{E}\left[(\overline{X} - \mu)^2\right] + n\operatorname{E}[S^2]$, as above (but times $n$).
If the distribution of $\vec{C}$ is rotationally symmetric, as in the case when the $X_i$ are sampled from a Gaussian, then on average, the dimension along $\vec{u}$ contributes to $|\vec{C}|^2$ equally as each of the $n - 1$ directions perpendicular to $\vec{u}$, so that $\operatorname{E}\left[(\overline{X} - \mu)^2\right] = \frac{\sigma^2}{n}$ and $\operatorname{E}[S^2] = \frac{(n-1)\sigma^2}{n}$. This is in fact true in general, as explained above.
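The orthogonal decomposition can be checked numerically on a single sample. This is a minimal sketch; the values of `mu`, `sigma`, `n` and the random seed are assumptions for the example, and the vectors are named `A`, `B`, `C` after the text.

```python
import math
import random

rng = random.Random(1)
mu, sigma, n = 10.0, 3.0, 8          # assumed illustrative values
xs = [rng.gauss(mu, sigma) for _ in range(n)]
xbar = sum(xs) / n

C = [x - mu for x in xs]             # deviations from the true mean
A = [xbar - mu] * n                  # component along u = (1, ..., 1)
B = [x - xbar for x in xs]           # component orthogonal to u


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


lhs = dot(C, C)
rhs = dot(A, A) + dot(B, B)
print("A . B (zero up to rounding) =", dot(A, B))
print("|C|^2                       =", lhs)
print("|A|^2 + |B|^2               =", rhs)
print("equal up to rounding:", math.isclose(lhs, rhs, rel_tol=1e-9))
```

The identity holds for every sample, not just on average, because A and B are orthogonal by construction; only the step from this identity to the expectations uses the distribution of the data.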

Estimating a Poisson probability

A far more extreme case of a biased estimator being better than any unbiased estimator arises from the Poisson distribution.[5][6] Suppose that X has a Poisson distribution with expectation $\lambda$. Suppose it is desired to estimate

$$\operatorname{P}(X = 0)^2 = e^{-2\lambda}$$

with a sample of size 1. (For example, when incoming calls at a telephone switchboard are modeled as a Poisson process, and $\lambda$ is the average number of calls per minute, then $e^{-2\lambda}$ is the probability that no calls arrive in the next two minutes.)

Since the expectation of an unbiased estimator $\delta(X)$ is equal to the estimand, i.e.

$$\operatorname{E}(\delta(X)) = \sum_{x=0}^{\infty}\delta(x)\frac{\lambda^x e^{-\lambda}}{x!} = e^{-2\lambda},$$

the only function of the data constituting an unbiased estimator is

$$\delta(x) = (-1)^x.$$

To see this, note that when factoring $e^{-\lambda}$ out of the above expression for the expectation, the sum that is left, $\sum_{x=0}^{\infty}(-\lambda)^x/x!$, is the Taylor series expansion of $e^{-\lambda}$ as well, yielding $e^{-\lambda}e^{-\lambda} = e^{-2\lambda}$ (see Characterizations of the exponential function).

If the observed value of X is 100, then the estimate is 1, although the true value of the quantity being estimated is very likely to be near 0, which is the opposite extreme. And, if X is observed to be 101, then the estimate is even more absurd: It is −1, although the quantity being estimated must be positive.

The (biased) maximum likelihood estimator

$$e^{-2X}$$

is far better than this unbiased estimator. Not only is its value always positive but it is also more accurate in the sense that its mean squared error

$$e^{-4\lambda} - 2e^{\lambda(1/e^2 - 3)} + e^{\lambda(1/e^4 - 1)}$$

is smaller; compare the unbiased estimator's MSE of

$$1 - e^{-4\lambda}.$$

The MSEs are functions of the true value $\lambda$. The bias of the maximum-likelihood estimator, taken as its expected value minus the estimand as in the definition above, is:

$$\operatorname{E}\left[e^{-2X}\right] - e^{-2\lambda} = e^{\lambda(1/e^2 - 1)} - e^{-2\lambda}.$$
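The closed-form MSE expressions above can be compared against direct summation over the Poisson probability mass function. The sketch below is illustrative only: the function names, the truncation point `x_max`, and the chosen values of λ are assumptions.

```python
import math


def poisson_pmf(x, lam):
    # exp of the log-pmf, to avoid overflow in lam**x / x!
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def mse(estimator, lam, x_max=200):
    """E[(estimator(X) - e^{-2 lam})^2], truncating the sum at x_max."""
    target = math.exp(-2 * lam)
    return sum((estimator(x) - target) ** 2 * poisson_pmf(x, lam)
               for x in range(x_max + 1))


def unbiased(x):
    return (-1) ** x


def mle(x):
    return math.exp(-2 * x)


def closed_form_unbiased(lam):
    return 1 - math.exp(-4 * lam)


def closed_form_mle(lam):
    return (math.exp(-4 * lam)
            - 2 * math.exp(lam * (math.exp(-2) - 3))
            + math.exp(lam * (math.exp(-4) - 1)))


for lam in (0.5, 1.0, 3.0):
    print(f"lambda = {lam}: "
          f"unbiased MSE = {mse(unbiased, lam):.6f} "
          f"(closed form {closed_form_unbiased(lam):.6f}), "
          f"MLE MSE = {mse(mle, lam):.6f} "
          f"(closed form {closed_form_mle(lam):.6f})")
```

For the values of λ tried here, the truncated sums agree with the closed forms and the biased maximum likelihood estimator has the smaller MSE.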

Maximum of a discrete uniform distribution

The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 through to n are placed in a box and one is selected at random, giving a value X. If n is unknown, then the maximum-likelihood estimator of n is X, even though the expectation of X given n is only (n + 1)/2; we can be certain only that n is at least X and is probably more. In this case, the natural unbiased estimator is 2X − 1.
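The two expectations in this example can be verified exactly for any given n. A minimal sketch, with a hypothetical helper `expectation` that averages an estimator over the n equally likely tickets:

```python
from fractions import Fraction


def expectation(estimator, n):
    """Exact E[estimator(X)] when X is uniform on {1, ..., n}."""
    return sum(Fraction(estimator(x), n) for x in range(1, n + 1))


for n in (1, 10, 37):
    e_mle = expectation(lambda x: x, n)               # estimator X
    e_unbiased = expectation(lambda x: 2 * x - 1, n)  # estimator 2X - 1
    print(f"n = {n}: E[X] = {e_mle}, E[2X - 1] = {e_unbiased}")
```

The maximum-likelihood estimator X averages (n + 1)/2, roughly half the true value, while 2X − 1 averages exactly n.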

Median-unbiased estimators

The theory of median-unbiased estimators was revived by George W. Brown in 1947:[7]

An estimate of a one-dimensional parameter θ will be said to be median-unbiased, if, for fixed θ, the median of the distribution of the estimate is at the value θ; i.e., the estimate underestimates just as often as it overestimates. This requirement seems for most purposes to accomplish as much as the mean-unbiased requirement and has the additional property that it is invariant under one-to-one transformation.

Further properties of median-unbiased estimators have been noted by Lehmann, Birnbaum, van der Vaart and Pfanzagl.[citation needed] In particular, median-unbiased estimators exist in cases where mean-unbiased and maximum-likelihood estimators do not exist. They are invariant under one-to-one transformations.

There are methods of constructing median-unbiased estimators for probability distributions that have monotone likelihood-functions, such as one-parameter exponential families, to ensure that they are optimal (in a sense analogous to the minimum-variance property considered for mean-unbiased estimators).[8][9] One such procedure is an analogue of the Rao–Blackwell procedure for mean-unbiased estimators: the procedure holds for a smaller class of probability distributions than does the Rao–Blackwell procedure for mean-unbiased estimation but for a larger class of loss-functions.[9]

Bias with respect to other loss functions

Any minimum-variance mean-unbiased estimator minimizes the risk (expected loss) with respect to the squared-error loss function (among mean-unbiased estimators), as observed by Gauss.[10] A minimum-average absolute deviation median-unbiased estimator minimizes the risk with respect to the absolute loss function (among median-unbiased estimators), as observed by Laplace.[10][11] Other loss functions are used in statistics, particularly in robust statistics.[10][12]

Effect of transformations

For univariate parameters, median-unbiased estimators remain median-unbiased under transformations that preserve order (or reverse order). Note that, when a transformation is applied to a mean-unbiased estimator, the result need not be a mean-unbiased estimator of its corresponding population statistic. By Jensen's inequality, a convex function as transformation will introduce positive bias, while a concave function will introduce negative bias, and a function of mixed convexity may introduce bias in either direction, depending on the specific function and distribution. That is, for a non-linear function f and a mean-unbiased estimator U of a parameter p, the composite estimator f(U) need not be a mean-unbiased estimator of f(p). For example, the square root of the unbiased estimator of the population variance is not a mean-unbiased estimator of the population standard deviation: the square root of the unbiased sample variance, the corrected sample standard deviation, is biased. The bias depends both on the sampling distribution of the estimator and on the transform, and can be quite involved to calculate – see unbiased estimation of standard deviation for a discussion in this case.
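To give a sense of the size of this bias in the standard deviation example, the sketch below evaluates the ratio E[s]/σ for i.i.d. normal data, where s is the square root of the divide-by-(n − 1) sample variance. The formula used, sqrt(2/(n − 1)) · Γ(n/2)/Γ((n − 1)/2), is a standard result for normal samples that is not stated in the text above; the function name `c4` is the conventional label for this constant.

```python
import math


def c4(n):
    """E[s] / sigma for n i.i.d. normal observations, where s is the
    square root of the divide-by-(n - 1) sample variance."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    )


for n in (2, 5, 10, 100):
    print(f"n = {n:3d}   E[s]/sigma = {c4(n):.4f}")
```

The ratio is below 1 for every n, which matches the direction predicted by Jensen's inequality for the concave square root, and it approaches 1 as n grows.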

Bias, variance and mean squared error

Figure: Sampling distributions of two alternative estimators for a parameter $\beta_0$. Although $\hat\beta_1$ is unbiased, it is clearly inferior to the biased $\hat\beta_2$.

Ridge regression is one example of a technique where allowing a little bias may lead to a considerable reduction in variance, and more reliable estimates overall.

While bias quantifies the average difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample. An estimator that minimises the bias will not necessarily minimise the mean square error. One measure which is used to try to reflect both types of difference is the mean square error,[1]

$$\operatorname{MSE}(\hat\theta) = \operatorname{E}\big[(\hat\theta - \theta)^2\big].$$

This can be shown to be equal to the square of the bias, plus the variance:[1]

$$\begin{aligned}
\operatorname{MSE}(\hat\theta) &= (\operatorname{E}[\hat\theta] - \theta)^2 + \operatorname{E}\big[\,(\hat\theta - \operatorname{E}[\,\hat\theta\,])^2\,\big] \\
&= (\operatorname{Bias}(\hat\theta, \theta))^2 + \operatorname{Var}(\hat\theta)
\end{aligned}$$

When the parameter is a vector, an analogous decomposition applies:[13]

$$\operatorname{MSE}(\hat\theta) = \operatorname{trace}(\operatorname{Cov}(\hat\theta)) + \left\Vert \operatorname{Bias}(\hat\theta, \theta) \right\Vert^2$$

where $\operatorname{trace}(\operatorname{Cov}(\hat\theta))$ is the trace (diagonal sum) of the covariance matrix of the estimator and $\left\Vert \operatorname{Bias}(\hat\theta, \theta) \right\Vert^2$ is the square vector norm.
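The vector decomposition can be checked on simulated estimates, since it holds exactly for empirical moments when variances use the divisor m (the number of simulated estimates). In this sketch the true parameter `theta`, the shrink-and-add-noise estimator, and the noise levels are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(2)
theta = (1.0, -2.0)                  # assumed true parameter vector

# Hypothetical biased, noisy estimator: shrink toward 0 and add noise.
estimates = [(0.9 * theta[0] + rng.gauss(0, 0.5),
              0.9 * theta[1] + rng.gauss(0, 0.3)) for _ in range(5000)]

m = len(estimates)
dim = len(theta)
mean = [sum(e[k] for e in estimates) / m for k in range(dim)]

# Empirical MSE: average squared Euclidean distance to theta.
mse = sum(sum((e[k] - theta[k]) ** 2 for k in range(dim))
          for e in estimates) / m

# trace(Cov): sum of per-coordinate variances (divisor m).
trace_cov = sum(sum((e[k] - mean[k]) ** 2 for e in estimates) / m
                for k in range(dim))

# ||Bias||^2: squared norm of (mean estimate - theta).
bias_sq = sum((mean[k] - theta[k]) ** 2 for k in range(dim))

print("MSE                    =", mse)
print("trace(Cov) + ||Bias||^2 =", trace_cov + bias_sq)
```

Only the diagonal of the covariance matrix enters the decomposition, so the off-diagonal covariances never need to be computed.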

Example: Estimation of population variance

For example,[14] suppose an estimator of the form

$$T^2 = c\sum_{i=1}^{n}\left(X_i - \overline{X}\right)^2 = cnS^2$$

is sought for the population variance as above, but this time to minimise the MSE:

$$\begin{aligned}
\operatorname{MSE} &= \operatorname{E}\left[(T^2 - \sigma^2)^2\right] \\
&= \left(\operatorname{E}\left[T^2 - \sigma^2\right]\right)^2 + \operatorname{Var}(T^2)
\end{aligned}$$

If the variables $X_1, \ldots, X_n$ follow a normal distribution, then $nS^2/\sigma^2$ has a chi-squared distribution with n − 1 degrees of freedom, giving:

$$\operatorname{E}[nS^2] = (n-1)\sigma^2 \text{ and } \operatorname{Var}(nS^2) = 2(n-1)\sigma^4,$$

and so

$$\operatorname{MSE} = (c(n-1) - 1)^2\sigma^4 + 2c^2(n-1)\sigma^4$$

With a little algebra it can be confirmed that it is c = 1/(n + 1) which minimises this combined loss function, rather than c = 1/(n − 1) which minimises just the square of the bias.
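The comparison between the candidate divisors can be made concrete by evaluating the MSE expression above, divided by σ⁴, with exact rational arithmetic. The sample size n = 10 and the helper name `mse_over_sigma4` are assumptions for the example.

```python
from fractions import Fraction


def mse_over_sigma4(c, n):
    """MSE / sigma^4 of T^2 = c * sum (X_i - Xbar)^2 for normal data."""
    return (c * (n - 1) - 1) ** 2 + 2 * c ** 2 * (n - 1)


n = 10
for label, c in [("1/(n-1)", Fraction(1, n - 1)),
                 ("1/n", Fraction(1, n)),
                 ("1/(n+1)", Fraction(1, n + 1))]:
    print(f"c = {label:8s} MSE/sigma^4 = {mse_over_sigma4(c, n)}")
```

The algebra behind the minimiser is short: setting the derivative with respect to c to zero gives 2(n − 1)(c(n − 1) − 1) + 4c(n − 1) = 0, which simplifies to c(n + 1) = 1. At that value the expression equals 2/(n + 1), compared with 2/(n − 1) for the unbiased choice.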

More generally it is only in restricted classes of problems that there will be an estimator that minimises the MSE independently of the parameter values.

However it is very common that there may be perceived to be a bias–variance tradeoff, such that a small increase in bias can be traded for a larger decrease in variance, resulting in a more desirable estimator overall.

Bayesian view

Most Bayesians are rather unconcerned about unbiasedness (at least in the formal sampling-theory sense above) of their estimates. For example, Gelman and coauthors (1995) write: "From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise it is potentially misleading."[15]

Fundamentally, the difference between the Bayesian approach and the sampling-theory approach above is that in the sampling-theory approach the parameter is taken as fixed, and then probability distributions of a statistic are considered, based on the predicted sampling distribution of the data. For a Bayesian, however, it is the data which are known, and fixed, and it is the unknown parameter for which an attempt is made to construct a probability distribution, using Bayes' theorem:

$$p(\theta \mid D, I) \propto p(\theta \mid I)\, p(D \mid \theta, I)$$

Here the second term, the likelihood of the data given the unknown parameter value θ, depends just on the data obtained and the modelling of the data generation process. However a Bayesian calculation also includes the first term, the prior probability for θ, which takes account of everything the analyst may know or suspect about θ before the data comes in. This information plays no part in the sampling-theory approach; indeed any attempt to include it would be considered "bias" away from what was pointed to purely by the data. To the extent that Bayesian calculations include prior information, it is therefore essentially inevitable that their results will not be "unbiased" in sampling theory terms.

But the results of a Bayesian approach can differ from the sampling theory approach even if the Bayesian tries to adopt an "uninformative" prior.

For example, consider again the estimation of an unknown population variance σ² of a normal distribution with unknown mean, where it is desired to optimise c in the expected loss function

$$\operatorname{ExpectedLoss}=\operatorname{E}\left[\left(cnS^{2}-\sigma^{2}\right)^{2}\right]=\operatorname{E}\left[\sigma^{4}\left(cn\tfrac{S^{2}}{\sigma^{2}}-1\right)^{2}\right]$$

A standard choice of uninformative prior for this problem is the Jeffreys prior, p(σ²) ∝ 1/σ², which is equivalent to adopting a rescaling-invariant flat prior for ln(σ²).

One consequence of adopting this prior is that S²/σ² remains a pivotal quantity, i.e. the probability distribution of S²/σ² depends only on S²/σ², independent of the value of S² or σ²:

$$p\left(\tfrac{S^{2}}{\sigma^{2}}\mid S^{2}\right)=p\left(\tfrac{S^{2}}{\sigma^{2}}\mid \sigma^{2}\right)=g\left(\tfrac{S^{2}}{\sigma^{2}}\right)$$

However, while

$$\operatorname{E}_{p(S^{2}\mid \sigma^{2})}\left[\sigma^{4}\left(cn\tfrac{S^{2}}{\sigma^{2}}-1\right)^{2}\right]=\sigma^{4}\operatorname{E}_{p(S^{2}\mid \sigma^{2})}\left[\left(cn\tfrac{S^{2}}{\sigma^{2}}-1\right)^{2}\right]$$

in contrast

$$\operatorname{E}_{p(\sigma^{2}\mid S^{2})}\left[\sigma^{4}\left(cn\tfrac{S^{2}}{\sigma^{2}}-1\right)^{2}\right]\neq \sigma^{4}\operatorname{E}_{p(\sigma^{2}\mid S^{2})}\left[\left(cn\tfrac{S^{2}}{\sigma^{2}}-1\right)^{2}\right]$$

When the expectation is taken over the probability distribution of σ² given S², as it is in the Bayesian case, rather than S² given σ², one can no longer take σ⁴ as a constant and factor it out. The consequence of this is that, compared to the sampling-theory calculation, the Bayesian calculation puts more weight on larger values of σ², properly taking into account (as the sampling-theory calculation cannot) that under this squared-loss function the consequence of underestimating large values of σ² is more costly in squared-loss terms than that of overestimating small values of σ².

The worked-out Bayesian calculation gives a scaled inverse chi-squared distribution with n − 1 degrees of freedom for the posterior probability distribution of σ². The expected loss is minimised when cnS² = ⟨σ²⟩, where ⟨σ²⟩ denotes the posterior mean of σ²; this occurs when c = 1/(n − 3).
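
The value c = 1/(n − 3) can be checked with a short derivation. This is a sketch that takes the posterior stated above as given, so that, conditional on the data, X = nS²/σ² follows a chi-squared distribution with ν = n − 1 degrees of freedom. The symbols X and ν are introduced here for illustration only, and n > 3 is assumed so that the posterior mean exists.

$$
\begin{aligned}
\frac{\partial}{\partial c}\operatorname{E}\left[\left(cnS^{2}-\sigma^{2}\right)^{2}\mid S^{2}\right]
&=2nS^{2}\left(cnS^{2}-\operatorname{E}\left[\sigma^{2}\mid S^{2}\right]\right)=0
\quad\Longrightarrow\quad cnS^{2}=\operatorname{E}\left[\sigma^{2}\mid S^{2}\right],\\
\operatorname{E}\left[\sigma^{2}\mid S^{2}\right]
&=nS^{2}\operatorname{E}\left[X^{-1}\right]=\frac{nS^{2}}{\nu-2}=\frac{nS^{2}}{n-3}
\quad\Longrightarrow\quad c=\frac{1}{n-3}.
\end{aligned}
$$

The same steps with σ² held fixed and S² random instead give c = E[X]/E[X²] = (n − 1)/((n − 1)(n + 1)) = 1/(n + 1), the sampling-theory minimiser of the same loss.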

Even with an uninformative prior, therefore, a Bayesian calculation may not give the same expected-loss minimising result as the corresponding sampling-theory calculation.
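
The difference between the two calculations can also be checked numerically. The following is a minimal Monte Carlo sketch using only the Python standard library; the function names, the sample size n = 20, the seed and the number of draws are all illustrative assumptions. It relies on nS²/σ² following a chi-squared distribution with n − 1 degrees of freedom in both views, and differs only in which quantity is held fixed.

```python
import random


def chi2_draws(dof, size, seed=0):
    """Draw chi-squared(dof) variates as Gamma(shape=dof/2, scale=2)."""
    rng = random.Random(seed)
    return [rng.gammavariate(dof / 2.0, 2.0) for _ in range(size)]


def sampling_loss(c, xs):
    # Sampling-theory view: sigma^2 is fixed (set to 1) and n*S^2 = x varies.
    return sum((c * x - 1.0) ** 2 for x in xs) / len(xs)


def bayesian_loss(c, xs):
    # Bayesian view: n*S^2 is fixed (set to 1) and sigma^2 = 1/x varies.
    return sum((c - 1.0 / x) ** 2 for x in xs) / len(xs)


n = 20  # illustrative sample size (an assumption, not from the text)
xs = chi2_draws(n - 1, 50_000)
candidates = {"1/(n+1)": 1 / (n + 1), "1/(n-1)": 1 / (n - 1), "1/(n-3)": 1 / (n - 3)}
for label, c in candidates.items():
    print(f"c = {label:8s} sampling loss = {sampling_loss(c, xs):.5f}  "
          f"Bayesian loss = {bayesian_loss(c, xs):.6f}")
```

Up to Monte Carlo error, the sampling-theory loss should be smallest at c = 1/(n + 1) and the Bayesian loss at c = 1/(n − 3). The same draws are reused for every candidate c so that the comparison between candidates is not dominated by simulation noise.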

Notes

  1. Kozdron, Michael (March 2016). "Evaluating the Goodness of an Estimator: Bias, Mean-Square Error, Relative Efficiency (Chapter 3)" (PDF). stat.math.uregina.ca. Retrieved 2020-09-11.
  2. Taylor, Courtney (January 13, 2019). "Unbiased and Biased Estimators". ThoughtCo. Retrieved 2020-09-12.
  3. Dekking, Michel, ed. (2005). A modern introduction to probability and statistics: understanding why and how. Springer texts in statistics. London [Heidelberg]: Springer. ISBN 978-1-85233-896-1.
  4. Richard Arnold Johnson; Dean W. Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall. ISBN 978-0-13-187715-3. Retrieved 10 August 2012.
  5. J. P. Romano and A. F. Siegel (1986). Counterexamples in Probability and Statistics. Wadsworth & Brooks / Cole, Monterey, California, USA. p. 168.
  6. Hardy, M. (1 March 2003). "An Illuminating Counterexample". American Mathematical Monthly. 110 (3): 234–238. arXiv:math/0206006. doi:10.2307/3647938. ISSN 0002-9890. JSTOR 3647938.
  7. Brown (1947), page 583.
  8. Pfanzagl, Johann (1979). "On optimal median unbiased estimators in the presence of nuisance parameters". The Annals of Statistics. 7 (1): 187–193. doi:10.1214/aos/1176344563.
  9. Brown, L. D.; Cohen, Arthur; Strawderman, W. E. (1976). "A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications". Ann. Statist. 4 (4): 712–722. doi:10.1214/aos/1176343543.
  10. Dodge, Yadolah, ed. (1987). Statistical Data Analysis Based on the L1-Norm and Related Methods. Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: North-Holland. ISBN 0-444-70273-3.
  11. Jaynes, E. T. (2007). Probability Theory: The Logic of Science. Cambridge: Cambridge Univ. Press. p. 172. ISBN 978-0-521-59271-0.
  12. Klebanov, Lev B.; Rachev, Svetlozar T.; Fabozzi, Frank J. (2009). "Loss Functions and the Theory of Unbiased Estimation". Robust and Non-Robust Models in Statistics. New York: Nova Scientific. ISBN 978-1-60741-768-2.
  13. Taboga, Marco (2010). "Lectures on probability theory and mathematical statistics".
  14. DeGroot, Morris H. (1986). Probability and Statistics (2nd ed.). Addison-Wesley. pp. 414–5. ISBN 0-201-11366-X. But compare it with, for example, the discussion in Casella; Berger (2001). Statistical Inference (2nd ed.). Duxbury. p. 332. ISBN 0-534-24312-6.
  15. Gelman, A.; et al. (1995). Bayesian Data Analysis. Chapman and Hall. p. 108. ISBN 0-412-03991-5.

References

  • Brown, George W. "On Small-Sample Estimation." The Annals of Mathematical Statistics, vol. 18, no. 4 (Dec., 1947), pp. 582–585. JSTOR 2236236.
  • Lehmann, E. L. "A General Concept of Unbiasedness." The Annals of Mathematical Statistics, vol. 22, no. 4 (Dec., 1951), pp. 587–592. JSTOR 2236928.
  • Birnbaum, Allan. "A Unified Theory of Estimation, I." The Annals of Mathematical Statistics, vol. 32, no. 1 (Mar., 1961), pp. 112–135.
  • Van der Vaart, H. R. "Some Extensions of the Idea of Bias." The Annals of Mathematical Statistics, vol. 32, no. 2 (June 1961), pp. 436–447.
  • Pfanzagl, Johann. 1994. Parametric Statistical Theory. Walter de Gruyter.
  • Stuart, Alan; Ord, Keith; Arnold, Steven [F.] (2010). Classical Inference and the Linear Model. Kendall's Advanced Theory of Statistics. Vol. 2A. Wiley. ISBN 978-0-4706-8924-0.
  • Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1993). Unbiased estimators and their applications. Vol. 1: Univariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-2382-3.
  • Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1996). Unbiased estimators and their applications. Vol. 2: Multivariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-3939-8.
  • Klebanov, Lev [B.]; Rachev, Svetlozar [T.]; Fabozzi, Frank [J.] (2009). Robust and Non-Robust Models in Statistics. New York: Nova Scientific Publishers. ISBN 978-1-60741-768-2.

From: https://www.cnblogs.com/WLCYSYS/p/17912916.html
