1  Descriptive Statistics for Quantitative Data

1.1 Frequency Distribution

  • Range: \(R=X_{max}-X_{min}\)

  • Number of bins \(k\): typically a value between \(8\) and \(15\).

  • Bin width: \(interval=\frac{R}{k}\)

  • Frequency: \(Frequency = count\)

  • Relative frequency: \(Relative\ Frequency = \frac{count}{n} \times 100\%\)
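
The quantities in the list above can be computed directly in base R. The following is a minimal sketch, assuming k = 10 equal-width bins on mtcars$mpg; the names k, width, breaks, bins, and freq_table are illustrative and do not come from the original text.

```r
x <- mtcars$mpg
n <- length(x)

R <- diff(range(x))   # range: max - min
k <- 10               # number of bins (assumed for illustration)
width <- R / k        # bin width

# k + 1 equally spaced break points from min(x) to max(x)
breaks <- seq(min(x), max(x), length.out = k + 1)

# include.lowest = TRUE keeps the minimum value in the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

freq_table <- data.frame(
    bin = levels(bins),
    frequency = as.vector(table(bins)),
    relative_frequency = as.vector(table(bins)) / n * 100  # in percent
)
freq_table
```

The frequencies sum to n and the relative frequencies sum to 100%, which is a quick sanity check on any binning.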

Code
library(ggplot2)
library(tibble)
ggplot(data = mtcars, aes(x = mpg)) +
    geom_histogram(color = "black", bins = 10)

Code

ggplot(data = mtcars, aes(x = mpg)) +
    geom_histogram(color = "black", binwidth = diff(range(mtcars$mpg)) / 9)

1.2 Central Tendency

The population variance divides by n.

The sample variance divides by (n-1); R's var() computes the sample variance.
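
A short check of the two divisors, as a sketch on mtcars$mpg (the variable names sample_var and population_var are illustrative):

```r
x <- mtcars$mpg
n <- length(x)

sample_var <- var(x)                      # divides by n - 1
population_var <- mean((x - mean(x))^2)   # divides by n

# The two differ only by the factor (n - 1) / n
all.equal(population_var, sample_var * (n - 1) / n)
```

Base R has no built-in population variance, so rescaling var() or averaging the squared deviations by hand are the usual options.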

Code
# Arithmetic mean
mean(mtcars$mpg)
#> [1] 20.09062

# Trimmed mean: trim = 0.1 drops the lowest and highest 10% of values
x <- c(1, 2:9, 11)
mean(x)
#> [1] 5.6

mean(x, trim = 0.1)
#> [1] 5.5

# Median
median(mtcars$mpg)
#> [1] 19.2

# Mode
rstatix::get_mode(mtcars$mpg)
#> [1] 10.4 15.2 19.2 21.0 21.4 22.8 30.4

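Note: rstatix::get_mode() can return more than one mode, as it does here for mtcars$mpg. When several values tie for the highest frequency, decide explicitly how to handle them.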
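
To make the tie handling explicit without an extra package, one option is a small base-R helper. This is a sketch only: all_modes() is a hypothetical name introduced here, and it simply returns every value that attains the highest frequency.

```r
# Hypothetical helper: return every value that attains the maximum frequency
all_modes <- function(x) {
    counts <- table(x)
    as.numeric(names(counts)[counts == max(counts)])
}

all_modes(c(1, 1, 2, 2, 3))   # two values tie, so both are returned
all_modes(mtcars$mpg)
```

Returning all tied values leaves the choice (first, smallest, or "no unique mode") to the caller instead of hiding it.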

1.3 Dispersion

Code
# Minimum and maximum
range(mtcars$mpg)
#> [1] 10.4 33.9
# Range as a single number (max - min)
diff(range(mtcars$mpg))
#> [1] 23.5

# Quantiles
quantile(mtcars$mpg, probs = c(0, 0.1, 0.25, 0.5, 0.75, 1))
#>     0%    10%    25%    50%    75%   100% 
#> 10.400 14.340 15.425 19.200 22.800 33.900
# Interquartile range
IQR(mtcars$mpg)
#> [1] 7.375

# Variance
var(mtcars$mpg)
#> [1] 36.3241

# Standard deviation
sd(mtcars$mpg)
#> [1] 6.026948


# Coefficient of variation, returned as a percentage string
CV <- function(x, na.rm = TRUE) {
    if (na.rm) x <- x[!is.na(x)]
    cv <- sd(x) / mean(x) * 100
    sprintf("%.6f%%", cv)
}
CV(mtcars$mpg)
#> [1] "29.998808%"


# Median absolute deviation (MAD)
mad(mtcars$mpg, constant = 1.4826)   # scaled MAD; 1.4826 is the default constant
#> [1] 5.41149
median(abs(mtcars$mpg - median(mtcars$mpg)))   # raw MAD
#> [1] 3.65
median(abs(mtcars$mpg - median(mtcars$mpg))) * 1.4826   # reproduces mad()
#> [1] 5.41149

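Note: mad() multiplies the raw median absolute deviation by the scale factor constant = 1.4826 so that, for normally distributed data, the result is an asymptotically consistent estimator of the standard deviation.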
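
The constant is the reciprocal of the 0.75 quantile of the standard normal distribution, which comes to approximately 1.4826. The sketch below checks that value and then compares sd() and mad() on simulated normal data; the seed, sample size, and true standard deviation of 2 are arbitrary choices made for illustration.

```r
# Reciprocal of the 0.75 quantile of the standard normal distribution
k <- 1 / qnorm(0.75)
k

# Rough check on simulated normal data (seed and sample size are arbitrary)
set.seed(1)
z <- rnorm(1e5, mean = 0, sd = 2)
c(sd = sd(z), mad = mad(z))   # both should be close to 2
```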

1.4 Shape of the Distribution

1.4.1 Skewness

1.4.1.1 Population Skewness

Measures the asymmetry of a random variable's probability distribution.

https://www.macroption.com/skewness-formula/

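Skewness is the standardized third central moment: the third central moment divided by the cube of the standard deviation. The second central moment is the variance.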

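\[ Population\ Skewness(X) = \frac{E\left[(X-E(X))^3\right]}{Var(X)^{\frac{3}{2}}} = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{X_i-\mu}{\sigma}\right)^3 \]

The last expression applies to a finite population of \(n\) values, with \(\sigma\) computed by dividing by \(n\).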

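Range of skewness: \((-\infty,+\infty)\)

  1. Skew > 0: positively skewed (right-skewed), with the tail extending to the right; typically Mode < Median < Mean.

  2. Skew = 0: the data are distributed roughly symmetrically about the mean.

  3. Skew < 0: negatively skewed (left-skewed), with the tail extending to the left; typically Mode > Median > Mean.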

Code
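x <- c(1, 2, 3, 5, 6, 10)

skewness <- function(x, na.rm = TRUE) {
    if (na.rm) x <- x[!is.na(x)]
    n <- length(x)
    mu <- mean(x)
    sigma <- sqrt(mean((x - mu)^2))  # population SD (divides by n)
    s <- sd(x)                       # sample SD (divides by n - 1)

    c(population_skewness = mean(((x - mu) / sigma)^3),
      sample_skewness = sum(((x - mu) / s)^3) * n / (n - 1) / (n - 2))
}
skewness(x)
#> population_skewness     sample_skewness 
#>           0.6760343           0.9256980


e1071::skewness(x, type = 1)  # g1 = m3 / m2^(3/2): matches the population formula
#> [1] 0.6760343
moments::skewness(x)          # same definition as type 1
#> [1] 0.6760343
e1071::skewness(x, type = 2)  # G1: matches the sample formula
#> [1] 0.925698
e1071::skewness(x, type = 3)  # b1 = m3 / s^3 with s the sample SD (e1071 default)
#> [1] 0.5142767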

1.4.1.2 Sample Skewness

\[ Sample\ Skewness(X) = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left [\frac{X_i-\bar X}{S} \right ]^3 \]
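
The sample skewness is the moment-based skewness rescaled by a factor that depends only on n. The sketch below checks this identity numerically in base R; the names m2, m3, g1, and G1 are notation introduced here for illustration.

```r
x <- c(1, 2, 3, 5, 6, 10)
n <- length(x)

# Moment-based skewness g1, using central moments that divide by n
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
g1 <- m3 / m2^(3 / 2)

# Sample skewness from the formula above (S is the sample SD)
G1 <- n / ((n - 1) * (n - 2)) * sum(((x - mean(x)) / sd(x))^3)

# G1 equals g1 multiplied by sqrt(n * (n - 1)) / (n - 2)
all.equal(G1, g1 * sqrt(n * (n - 1)) / (n - 2))
```

Because the factor tends to 1 as n grows, the two definitions agree closely for large samples and differ most for small ones such as this six-value example.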

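1.4.2 Kurtosis

1.4.2.1 Population Kurtosis

Measures how sharply peaked a random variable's probability distribution is. It is the ratio of the fourth central moment to the square of the variance.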

https://www.macroption.com/kurtosis-formula/

Excess kurtosis: the ratio of the fourth central moment to the square of the variance, minus 3.

https://www.macroption.com/excess-kurtosis/

\[ Population\ Excess\ Kurtosis(X) = \frac{E\left[(X-E(X))^4\right]}{Var(X)^{2}}-3 = E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] - 3 = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{X_i-\mu}{\sigma}\right)^4-3 \]

Range of excess kurtosis: \([-2,+\infty)\)
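
The lower bound of \(-2\) follows from Jensen's inequality applied to the standardized variable \(Z=\frac{X-\mu}{\sigma}\) (the symbol \(Z\) is notation introduced here), for which \(E[Z^2]=1\):

\[ E[Z^4] \ge \left(E[Z^2]\right)^2 = 1 \quad\Rightarrow\quad E[Z^4] - 3 \ge -2 \]

The bound is attained by a symmetric two-point distribution, such as a fair coin toss.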

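  1. Excess kurtosis < 0: the distribution is flatter than the normal distribution;

  2. Excess kurtosis = 0: the distribution has the same kurtosis as the normal distribution;

  3. Excess kurtosis > 0: the distribution is more sharply peaked than the normal distribution.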

Code

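kurtosis <- function(x, na.rm = TRUE) {
    if (na.rm) x <- x[!is.na(x)]
    n <- length(x)
    mu <- mean(x)
    sigma <- sqrt(mean((x - mu)^2))  # population SD (divides by n)
    s <- sd(x)                       # sample SD (divides by n - 1)

    c(population_kurtosis = mean(((x - mu) / sigma)^4) - 3,
      sample_kurtosis = sum(((x - mu) / s)^4) * n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) -
          3 * (n - 1)^2 / ((n - 2) * (n - 3)))
}
kurtosis(x)
#> population_kurtosis     sample_kurtosis 
#>          -0.6639881           0.5633680
e1071::kurtosis(x, type = 1)  # g2 = m4 / m2^2 - 3: matches the population formula
#> [1] -0.6639881
e1071::kurtosis(x, type = 2)  # G2: matches the sample formula
#> [1] 0.563368
e1071::kurtosis(x, type = 3)  # b2 = m4 / s^4 - 3 with s the sample SD (e1071 default)
#> [1] -1.37777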

1.4.2.2 Sample Kurtosis

\[ Sample \ Kurtosis(X) = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left [\frac{X_i-\bar X}{S} \right]^4-\frac{3(n-1)^2}{(n-2)(n-3)} \]

1.5 Standardization

Code
scale(mtcars$mpg, center = TRUE, scale = TRUE) %>%
    tibble(normalization = .) %>%
    DT::datatable()
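
With both center and scale set to TRUE, scale() subtracts the mean and divides by the sample standard deviation, which is the usual z-score. A minimal check, with z_manual and z_scale as illustrative names:

```r
x <- mtcars$mpg

z_manual <- (x - mean(x)) / sd(x)   # z-score using the sample SD
z_scale <- as.vector(scale(x))      # scale() returns a one-column matrix

all.equal(z_manual, z_scale)

# After standardization the mean is 0 (up to floating-point error) and the SD is 1
c(mean = mean(z_scale), sd = sd(z_scale))
```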

1.6 Summary Statistics

Code
summary(mtcars$mpg)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   10.40   15.43   19.20   20.09   22.80   33.90
rstatix::get_summary_stats(mtcars, mpg, type = "full")

variable   n   min   max  median      q1    q3    iqr    mad    mean     sd     se     ci
mpg       32  10.4  33.9    19.2  15.425  22.8  7.375  5.411  20.091  6.027  1.065  2.173

Code
psych::describe(mtcars$mpg)

    vars   n      mean        sd  median   trimmed      mad   min   max  range      skew   kurtosis        se
X1     1  32  20.09062  6.026948    19.2  19.69615  5.41149  10.4  33.9   23.5  0.610655  -0.372766  1.065424