3 相关性

Modified

November 20, 2024

https://corrr.tidymodels.org/articles/using-corrr.html

https://easystats.github.io/correlation/

相关性：判断两个变量之间的相关关系强度和方向，无论是否独立。

3.1 分类变量

　　如果独立性检验的结果表明两个变量之间不独立，那么如何量化它们之间相关性的强弱?

3.1.1 列联系数、Phi 系数和 Cramer’s V 系数

vcd 包里的函数 assocstats( )可以用来计算列联表的 Phi 系数、列联系数和 Cramer’s V 系数。其中， Phi 系数只适用于四格表。　　

Show the code

library(vcd)
mytable <- table(Arthritis$Sex, Arthritis$Treatment)
assocstats(mytable)
#>                      X^2 df P(> X^2)
#> Likelihood Ratio 0.73748  1  0.39047
#> Pearson          0.73653  1  0.39078
#> 
#> Phi-Coefficient   : 0.094 
#> Contingency Coeff.: 0.093 
#> Cramer's V        : 0.094

3.1.2 Kappa 统计量

对于配对列联表，可以计算一致性指标 Kappa 统计量。 epiDisplay 包里的函数 kap( )可以用于计算一致性的比例以及 Kappa 统计量的值　　　　

Show the code

my.matrix <- matrix(c(11, 2, 12, 33), nrow = 2)
sum(my.matrix)
#> [1] 58
vcd::Kappa(my.matrix)
#>            value    ASE     z Pr(>|z|)
#> Unweighted 0.455 0.1153 3.945 7.97e-05
#> Weighted   0.455 0.1153 3.945 7.97e-05
epiDisplay::kap(my.matrix)
#> 
#>  Table for calculation of kappa
#>    A  B
#> A 11 12
#> B  2 33
#> 
#> Observed agreement = 75.86 % 
#> Expected agreement = 55.71 % 
#> Kappa = 0.455 
#> Standard error = 0.121 , Z = 3.762 , P value = < 0.001 
#>

　　共 58 个对象，每一对象用两种检测方法检测，其中11 个对象的两种检测结果都为阳性， 33 个对象的两种检测结果都是阴性，所以总一致性为 (11 + 33)/58 ≈ 75.86% 。

Show the code

chisq.test(my.matrix)$expected
#>          [,1]     [,2]
#> [1,] 5.155172 17.84483
#> [2,] 7.844828 27.15517

　为了解释期望一致性和 Kappa 值的含义，先计算各个单元格的期望频数。对角线上的这两个单元格对应的期望频数分别约为5.155172 和27.15517 ，因此期望一致性为 (5.155172+27.15517)/58≈ 55.71% 。期望一致性是假定两种方法的检测结果都是完全随机的情况下的一致性。也就是说，即使两种检测方法都毫无作用，平均也能达到 55.71% 的一致性。 Kappa 统计量是超出随机的一致性的部分占最大可能超出随机的一致性的比例。在本例中，前者为 75.86% − 55.71% ，后者为 100% − 55.71% 。因此， Kappa 值为 (75.86 - 55.71)/(100 - 55.71) ≈ 0.455

3.1.3 马赛克图

　　马赛克图中的矩形面积正比于多维列联表中单元格的频率　　

Show the code

mosaicplot(mytable,xlab ="Sex",ylab =  "Treatment",las = 1)

Show the code

mosaicplot(my.matrix)

3.2 连续变量

如果两个连续变量不相互独立时，使用协方差（covariance）来描述两个变量的关系。

协方差（或相关系数）为零，不相关，不存在线性关系，但可能存在非线性关系。

Show the code

df <- mpg[,c(3,8,9)]
cov(df)    # 协方差矩阵
#>           displ      cty       hwy
#> displ  1.669158 -4.39069 -5.893111
#> cty   -4.390690 18.11307 24.225432
#> hwy   -5.893111 24.22543 35.457779

3.2.0.1 相关系数

相关系数的取值范围： \([-1,1]\)

\[ r(X,Y)=\frac {\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n (x_i-\bar x)^2 \sum_{i=1}^n (y_i-\bar y)^2}} \]

Show the code

# Pearson's 积差相关系数 　　一般要求两个连续变量都服从正态分布
cor(df,use = "everything",method="pearson") # default
#>           displ        cty        hwy
#> displ  1.000000 -0.7985240 -0.7660200
#> cty   -0.798524  1.0000000  0.9559159
#> hwy   -0.766020  0.9559159  1.0000000

correlation::correlation(df,method = "pearson",p_adjust = "holm")
#> # Correlation Matrix (pearson-method)
#> 
#> Parameter1 | Parameter2 |     r |         95% CI | t(232) |         p
#> ---------------------------------------------------------------------
#> displ      |        cty | -0.80 | [-0.84, -0.75] | -20.21 | < .001***
#> displ      |        hwy | -0.77 | [-0.81, -0.71] | -18.15 | < .001***
#> cty        |        hwy |  0.96 | [ 0.94,  0.97] |  49.58 | < .001***
#> 
#> p-value adjustment method: Holm (1979)
#> Observations: 234

# Spearman's rank相关系数  　　非参数
cor(df,method = "spearman")
#>            displ        cty        hwy
#> displ  1.0000000 -0.8809049 -0.8266576
#> cty   -0.8809049  1.0000000  0.9542104
#> hwy   -0.8266576  0.9542104  1.0000000

correlation::correlation(df,method = "spearman",p_adjust = "holm")
#> # Correlation Matrix (spearman-method)
#> 
#> Parameter1 | Parameter2 |   rho |         95% CI |        S |         p
#> -----------------------------------------------------------------------
#> displ      |        cty | -0.88 | [-0.91, -0.85] | 4.02e+06 | < .001***
#> displ      |        hwy | -0.83 | [-0.86, -0.78] | 3.90e+06 | < .001***
#> cty        |        hwy |  0.95 | [ 0.94,  0.96] | 97781.21 | < .001***
#> 
#> p-value adjustment method: Holm (1979)
#> Observations: 234

# Kendall's tau相关系数  　　非参数
cor(df,method = "kendall")
#>            displ        cty        hwy
#> displ  1.0000000 -0.7210828 -0.6536974
#> cty   -0.7210828  1.0000000  0.8628045
#> hwy   -0.6536974  0.8628045  1.0000000
correlation::correlation(df,method = "kendall",p_adjust = "holm")
#> # Correlation Matrix (kendall-method)
#> 
#> Parameter1 | Parameter2 |   tau |         95% CI |      z |         p
#> ---------------------------------------------------------------------
#> displ      |        cty | -0.72 | [-0.76, -0.68] | -15.54 | < .001***
#> displ      |        hwy | -0.65 | [-0.70, -0.60] | -14.14 | < .001***
#> cty        |        hwy |  0.86 | [ 0.84,  0.88] |  18.40 | < .001***
#> 
#> p-value adjustment method: Holm (1979)
#> Observations: 234

3.2.0.2 相关图（correlogram）

Show the code

ggcorrplot::ggcorrplot(
    corr = cor(df,use = "everything",method="pearson") ,
    lab = T
)

3.2.0.3 显著性检验

　　零假设为变量之间不相关（即两个总体的相关系数为 0 ）。函数 cor.test( ) 可用于对相关系数进行显著性检验。

统计量

\[ t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]

Show the code

cor.test(df$displ,df$hwy)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  df$displ and df$hwy
#> t = -18.151, df = 232, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.8142727 -0.7072539
#> sample estimates:
#>      cor 
#> -0.76602

psych包corr.test() 计算相关系数矩阵和显著性检验

Show the code

psych::corr.test(df)
#> Call:psych::corr.test(x = df)
#> Correlation matrix 
#>       displ   cty   hwy
#> displ  1.00 -0.80 -0.77
#> cty   -0.80  1.00  0.96
#> hwy   -0.77  0.96  1.00
#> Sample Size 
#> [1] 234
#> Probability values (Entries above the diagonal are adjusted for multiple tests.) 
#>       displ cty hwy
#> displ     0   0   0
#> cty       0   0   0
#> hwy       0   0   0
#> 
#>  To see confidence intervals of the correlations, print with the short=FALSE option

print(psych::corr.test(df), short = FALSE)
#> Call:psych::corr.test(x = df)
#> Correlation matrix 
#>       displ   cty   hwy
#> displ  1.00 -0.80 -0.77
#> cty   -0.80  1.00  0.96
#> hwy   -0.77  0.96  1.00
#> Sample Size 
#> [1] 234
#> Probability values (Entries above the diagonal are adjusted for multiple tests.) 
#>       displ cty hwy
#> displ     0   0   0
#> cty       0   0   0
#> hwy       0   0   0
#> 
#>  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
#>           raw.lower raw.r raw.upper raw.p lower.adj upper.adj
#> displ-cty     -0.84 -0.80     -0.75     0     -0.85     -0.74
#> displ-hwy     -0.81 -0.77     -0.71     0     -0.81     -0.71
#> cty-hwy        0.94  0.96      0.97     0      0.94      0.97