1 主成分分析

主成分分析(Principal component analysis,PCA) 是一种线性降维方法。

https://bryanhanson.github.io/LearnPCA/articles//Vig_04_Scores_Loadings.html

对于一组变量\(X_1,X_2,...,X_p\),存在它们的线性组合 Y,令\(Var(y_1)\)最大,得到\(y_1\),再找\(y_2\)\(y_2\)\(y_1\)正交,以此类推,找到一组无关主成分 Y

\[ y_i=a_{i1}x_1+a_{i2}x_2+...+a_{ip}x_p\\ Cov(y_i,y_j)= \begin{cases} &\lambda_i,\ i=j \\ & 0,\ i\ne j \end{cases} \]

1.1 自定义 PCA 步骤

1.1.1 数据标准化

R中数据框 n个观测,p个变量

\[ X= \begin{bmatrix} x_{11}& x_{12} & ... & x_{1p}\\ x_{21} & x_{22} & ... & x_{2p}\\ \vdots & \vdots & & \vdots \\ x_{n1}& x_{n2} & ... & x_{np} \end{bmatrix} =(X_1,X_2,...X_p) \]

对原始数据矩阵标准化,消除量纲和数量级的影响。

数据标准化确保变量在相同的尺度上,这对于PCA非常重要。

使用R语言内置的 USArrests 数据集:

Code
df <- as_tibble(USArrests, rownames = "state") |> column_to_rownames("state")
head(df)
#>            Murder Assault UrbanPop Rape
#> Alabama      13.2     236       58 21.2
#> Alaska       10.0     263       48 44.5
#> Arizona       8.1     294       80 31.0
#> Arkansas      8.8     190       50 19.5
#> California    9.0     276       91 40.6
#> Colorado      7.9     204       78 38.7

df_center <- scale(df,center = T,scale = T)

1.1.2 计算协方差矩阵

\(\Sigma\) 未知时,其用其估计值样本协方差矩阵Sp×p 代替

\[ S=\frac{AA^T}{n-1} \]

  • \(A_{ p×n}\) 显示来自每个变量值与其均值的偏差 \(X_i-\bar X\)
  • \((AA^T)_{ii}\) 显示偏差平方和 (样本方差 \(s_i^2\) );
  • \((AA^T)_{ij},i\ne j\) 显示样本协方差 \(s_{ij} = (A 的行 i) · (A 的行 j)\)
Code
# 手动计算协方差矩阵
A <- as.matrix(t((df_center)))
AA_T <- A %*% t(A)
S <- AA_T / (nrow(df_center) - 1)
S
#>              Murder   Assault   UrbanPop      Rape
#> Murder   1.00000000 0.8018733 0.06957262 0.5635788
#> Assault  0.80187331 1.0000000 0.25887170 0.6652412
#> UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
#> Rape     0.56357883 0.6652412 0.41134124 1.0000000
# 内置协方差矩阵函数
cov(df_center)
#>              Murder   Assault   UrbanPop      Rape
#> Murder   1.00000000 0.8018733 0.06957262 0.5635788
#> Assault  0.80187331 1.0000000 0.25887170 0.6652412
#> UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
#> Rape     0.56357883 0.6652412 0.41134124 1.0000000

1.1.3 计算相关系数矩阵

相关系数矩阵 \(R=(r_{ij})\) 的公式为:

\[ r_{ij}=\frac {S_{ij}}{\sqrt{S_{ii}×S_{jj}}} \]

Code
# 定义自定义函数计算相关系数矩阵
r <- function(df){
    df <- as.data.frame(df)
    n=length(df)
    names <- colnames(df)
    df <- scale(df)
    S <- cov(df)
    r <- matrix(data = NA,n,n,dimnames = list(names,names))
    for(i in 1:n){
        for(j in 1:n){
            r[i,j]=S[i,j]/sqrt(S[i,i]*S[j,j])
        }
    }
    return(r)
}
r(df)
#>              Murder   Assault   UrbanPop      Rape
#> Murder   1.00000000 0.8018733 0.06957262 0.5635788
#> Assault  0.80187331 1.0000000 0.25887170 0.6652412
#> UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
#> Rape     0.56357883 0.6652412 0.41134124 1.0000000

# 使用内置的相关系数矩阵函数
cor(df_center)
#>              Murder   Assault   UrbanPop      Rape
#> Murder   1.00000000 0.8018733 0.06957262 0.5635788
#> Assault  0.80187331 1.0000000 0.25887170 0.6652412
#> UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
#> Rape     0.56357883 0.6652412 0.41134124 1.0000000

1.1.4 总方差

总方差T=所有特征值的总和=样本方差的总和=协方差矩阵的迹(对角线的总和)

sum(eigen(AA_T)$values)=sum(diag(AA_T))= sum(svd$d^2)

Code
# AA^T的特征值
y <- eigen(AA_T)
y$values
#> [1] 121.531837  48.498492  17.471596   8.498074
y$vectors
#>            [,1]       [,2]       [,3]        [,4]
#> [1,] -0.5358995  0.4181809 -0.3412327  0.64922780
#> [2,] -0.5831836  0.1879856 -0.2681484 -0.74340748
#> [3,] -0.2781909 -0.8728062 -0.3780158  0.13387773
#> [4,] -0.5434321 -0.1673186  0.8177779  0.08902432

# 特征值的和
sum(y$values)
#> [1] 196

# 迹
sum(diag(AA_T))
#> [1] 196

1.1.5 奇异值分解与主成分推导

SVD公式:

\[ A_{p×n}=U\Sigma V^T \]

A 是中心化后的数据矩阵, U 是左奇异矩阵, \(\Sigma\) 是奇异值对角矩阵, V 是右奇异矩阵((也是主成分方向))

主成分推导:

\[ PC=A\cdot V=U \Sigma \]

在这个公式中:

  • \(U\) 是包含左奇异向量的矩阵,表示样本在新坐标系中的坐标。

  • \(\Sigma\) 是包含奇异值的对角矩阵,这些奇异值与特征值相关,表示每个主成分的方差大小。

Code

# 进行奇异值分解

svd <- svd(df_center)
svd$d
#> [1] 11.024148  6.964086  4.179904  2.915146
svd$u
#>               [,1]        [,2]         [,3]          [,4]
#>  [1,] -0.088502119 -0.16111249  0.105218608  0.0530665011
#>  [2,] -0.175119011 -0.15255799 -0.483145153 -0.1489378244
#>  [3,] -0.158329049  0.10603826 -0.012974042 -0.2834384049
#>  [4,]  0.012699298 -0.15917987 -0.027135114 -0.0620804497
#>  [5,] -0.226649068  0.21932910 -0.141759482 -0.1161380178
#>  [6,] -0.136005136  0.14038162 -0.259336498  0.0004974586
#>  [7,]  0.122004202  0.15479183  0.152346210 -0.0402308322
#>  [8,] -0.004284214  0.04624999  0.170197773 -0.2995093258
#>  [9,] -0.270566006 -0.00557636  0.136613685 -0.0326971796
#> [10,] -0.147204794 -0.18180252  0.081106695  0.3656676469
#> [11,]  0.081955040  0.22324195 -0.012026954  0.3065826885
#> [12,]  0.147251202 -0.02998994 -0.061530174 -0.1694899353
#> [13,] -0.123823808  0.09692418  0.160455000 -0.0414370084
#> [14,]  0.045389560  0.02154472 -0.054011475  0.1442115222
#> [15,]  0.202373535  0.01479136 -0.038974667  0.0059617844
#> [16,]  0.071558552  0.03840409 -0.006051930  0.0701237800
#> [17,]  0.067425851 -0.13624293  0.006718884  0.2277132300
#> [18,] -0.140517959 -0.12382100  0.185555941  0.1544203417
#> [19,]  0.215231159 -0.05350432  0.015555919 -0.1122203023
#> [20,] -0.158347534 -0.06079147  0.037242408 -0.1898534932
#> [21,]  0.043656895  0.20960067  0.144350623 -0.0609897144
#> [22,] -0.189334383  0.02208976 -0.091150533  0.0347643443
#> [23,]  0.151999911  0.08987636 -0.036252509  0.0228600295
#> [24,] -0.089483486 -0.34027971  0.175449706  0.0731840095
#> [25,] -0.062570302  0.03743606 -0.089392087  0.0766873549
#> [26,]  0.106451539 -0.07631705 -0.058472149  0.0420214180
#> [27,]  0.113651981  0.02757065 -0.041582129  0.0053970395
#> [28,] -0.258115678  0.11025209 -0.275529769  0.1068057898
#> [29,]  0.214071497  0.00257041 -0.008728664 -0.0112530536
#> [30,] -0.016304324  0.20604821  0.181049718  0.0826499280
#> [31,] -0.177802722 -0.02030605 -0.043504824 -0.1153016522
#> [32,] -0.151092550  0.11701618  0.152302994 -0.0045791345
#> [33,] -0.100877464 -0.31671218  0.204524432 -0.3240968906
#> [34,]  0.268696706 -0.08516514 -0.071353148 -0.0862511360
#> [35,]  0.020291306  0.10550966  0.007374849  0.1609363198
#> [36,]  0.027997563  0.04091867  0.003625902  0.0035087357
#> [37,] -0.005309061  0.07696200 -0.222585787 -0.0807475502
#> [38,]  0.079778211  0.08118230  0.094883089  0.1219329726
#> [39,]  0.077565243  0.21208574  0.324451737 -0.2083610270
#> [40,] -0.118598723 -0.27483477  0.071178009 -0.0446445538
#> [41,]  0.178498756 -0.11703880 -0.092198470 -0.0372092940
#> [42,] -0.089775081 -0.12228530 -0.044544714  0.2217051038
#> [43,] -0.121689077  0.05863443  0.116539361  0.2184216921
#> [44,]  0.049439812  0.20917537 -0.069565218 -0.0279528910
#> [45,]  0.251561949 -0.19933619 -0.199240942 -0.0492029259
#> [46,]  0.008650709 -0.02839251 -0.002773945  0.0717790646
#> [47,]  0.019477550  0.13790380 -0.147991604 -0.0749973365
#> [48,]  0.189347337 -0.20254292 -0.024814359  0.0447947015
#> [49,]  0.186754750  0.08689225  0.032888157  0.0625194853
#> [50,]  0.056521430 -0.04563221  0.056996643 -0.0565930091
svd$v
#>            [,1]       [,2]       [,3]        [,4]
#> [1,] -0.5358995 -0.4181809  0.3412327  0.64922780
#> [2,] -0.5831836 -0.1879856  0.2681484 -0.74340748
#> [3,] -0.2781909  0.8728062  0.3780158  0.13387773
#> [4,] -0.5434321  0.1673186 -0.8177779  0.08902432
# 奇异值的平方和
sum(svd$d^2)
#> [1] 196

# 奇异值的对角矩阵
D <- diag(svd$d)

#  df_center  X = U D V'
X <- svd$u %*% D %*% t(svd$v) 

#  D = U' X V
t(svd$u) %*% X %*% svd$v
#>               [,1]         [,2]          [,3]          [,4]
#> [1,]  1.102415e+01 1.110223e-15  8.881784e-16  3.219647e-15
#> [2,]  2.220446e-16 6.964086e+00  6.661338e-16  1.318390e-15
#> [3,] -2.220446e-16 4.440892e-16  4.179904e+00 -8.881784e-16
#> [4,] -4.440892e-16 1.436351e-15 -9.436896e-16  2.915146e+00

1.1.6 结果解释

1.1.6.1 主成分荷载系数

Code
# 主成分荷载系数
svd$v
#>            [,1]       [,2]       [,3]        [,4]
#> [1,] -0.5358995 -0.4181809  0.3412327  0.64922780
#> [2,] -0.5831836 -0.1879856  0.2681484 -0.74340748
#> [3,] -0.2781909  0.8728062  0.3780158  0.13387773
#> [4,] -0.5434321  0.1673186 -0.8177779  0.08902432

在PCA中,右奇异向量矩阵𝑉 的列向量代表数据在新的正交基上的方向,这些基是按数据中方差最大化的方向排列的。每个向量就是一个主成分方向。具体来说,矩阵 svd$v 中的每一列都是一个主成分, 且这些列向量可以看作是原始变量在新主成分空间中的线性组合系数。

因此,svd$v 中的元素表示的是每个原始变量在对应主成分上的贡献,即主成分荷载系数(loadings)。

例如,如果 V 的第j 个列向量为 \([v_{1j},v_{2j},...,v_{pj}]\) ,这意味着第j 个主成分可以表示为原始变量的线性组合:

\[ PC_j=v_{1j}\cdot x_1+v_{2j}\cdot x_2+...+v_{pj}\cdot x_p \]

其中, \(x_1,x_2,...,x_p\) 是原始变量, \(v_{1j},v_{2j},...,v_{pj}\) 是它们在第j 个主成分上的荷载系数。

1.1.6.2 主成分得分

主成分得分 (principal component scores) 代表了原始数据在新主成分轴上的坐标。

具体来说,主成分得分可以表示为:

\[ Scores = U\Sigma \]

Code
# 得分
svd$u %*% D
#>              [,1]        [,2]        [,3]         [,4]
#>  [1,] -0.97566045 -1.12200121  0.43980366  0.154696581
#>  [2,] -1.93053788 -1.06242692 -2.01950027 -0.434175454
#>  [3,] -1.74544285  0.73845954 -0.05423025 -0.826264240
#>  [4,]  0.13999894 -1.10854226 -0.11342217 -0.180973554
#>  [5,] -2.49861285  1.52742672 -0.59254100 -0.338559240
#>  [6,] -1.49934074  0.97762966 -1.08400162  0.001450164
#>  [7,]  1.34499236  1.07798362  0.63679250 -0.117278736
#>  [8,] -0.04722981  0.32208890  0.71141032 -0.873113315
#>  [9,] -2.98275967 -0.03883425  0.57103206 -0.095317042
#> [10,] -1.62280742 -1.26608838  0.33901818  1.065974459
#> [11,]  0.90348448  1.55467609 -0.05027151  0.893733198
#> [12,]  1.62331903 -0.20885253 -0.25719021 -0.494087852
#> [13,] -1.36505197  0.67498834  0.67068647 -0.120794916
#> [14,]  0.50038122  0.15003926 -0.22576277  0.420397595
#> [15,]  2.23099579  0.10300828 -0.16291036  0.017379470
#> [16,]  0.78887206  0.26744941 -0.02529648  0.204421034
#> [17,]  0.74331256 -0.94880748  0.02808429  0.663817237
#> [18,] -1.54909076 -0.86230011  0.77560598  0.450157791
#> [19,]  2.37274014 -0.37260865  0.06502225 -0.327138529
#> [20,] -1.74564663 -0.42335704  0.15566968 -0.553450589
#> [21,]  0.48128007  1.45967706  0.60337172 -0.177793902
#> [22,] -2.08725025  0.15383500 -0.38100046  0.101343128
#> [23,]  1.67566951  0.62590670 -0.15153200  0.066640316
#> [24,] -0.98647919 -2.36973712  0.73336290  0.213342049
#> [25,] -0.68978426  0.26070794 -0.37365033  0.223554811
#> [26,]  1.17353751 -0.53147851 -0.24440796  0.122498555
#> [27,]  1.25291625  0.19200440 -0.17380930  0.015733156
#> [28,] -2.84550542  0.76780502 -1.15168793  0.311354436
#> [29,]  2.35995585  0.01790055 -0.03648498 -0.032804291
#> [30,] -0.17974128  1.43493745  0.75677041  0.240936580
#> [31,] -1.96012351 -0.14141308 -0.18184598 -0.336121113
#> [32,] -1.66566662  0.81491072  0.63661186 -0.013348844
#> [33,] -1.11208808 -2.20561081  0.85489245 -0.944789648
#> [34,]  2.96215223 -0.59309738 -0.29824930 -0.251434626
#> [35,]  0.22369436  0.73477837  0.03082616  0.469152817
#> [36,]  0.30864928  0.28496113  0.01515592  0.010228476
#> [37,] -0.05852787  0.53596999 -0.93038718 -0.235390872
#> [38,]  0.87948680  0.56536050  0.39660218  0.355452378
#> [39,]  0.85509072  1.47698328  1.35617705 -0.607402746
#> [40,] -1.30744986 -1.91397297  0.29751723 -0.130145378
#> [41,]  1.96779669 -0.81506822 -0.38538073 -0.108470512
#> [42,] -0.98969377 -0.85160534 -0.18619262  0.646302674
#> [43,] -1.34151838  0.40833518  0.48712332  0.636731051
#> [44,]  0.54503180  1.45671524 -0.29077592 -0.081486749
#> [45,]  2.77325613 -1.38819435 -0.83280797 -0.143433697
#> [46,]  0.09536670 -0.19772785 -0.01159482  0.209246429
#> [47,]  0.21472339  0.96037394 -0.61859067 -0.218628161
#> [48,]  2.08739306 -1.41052627 -0.10372163  0.130583080
#> [49,]  2.05881199  0.60512507  0.13746933  0.182253407
#> [50,]  0.62310061 -0.31778662  0.23824049 -0.164976866

1.1.6.3 主成分标准差

\[ Standard \ Deviation \ of \ PC_i =\frac{\sigma_i}{\sqrt{n-1}} \]

其中,n 是样本量,σ 是奇异值。

Code
svd$d /sqrt(nrow(df_center)-1)
#> [1] 1.5748783 0.9948694 0.5971291 0.4164494

1.1.6.4 方差贡献百分比

在主成分分析 (PCA) 中,方差贡献百分比(variance explained ratio)是用来衡量每个主成分解释了数据总方差的比例。这个比例可以通过奇异值分解 (SVD) 的奇异值来计算。

具体来说,方差贡献百分比的计算过程如下:

  1. 计算每个主成分的方差。奇异值的平方 \(\sigma_i^2\) 表示主成分 \(𝑖\)的方差。

  2. 计算总方差,即所有奇异值的平方和。

  3. 每个主成分的方差贡献百分比可以通过将每个奇异值的平方除以总方差来计算。

Code
# 方差贡献百分比
pct <- svd$d^2/sum(svd$d^2)
pct
#> [1] 0.62006039 0.24744129 0.08914080 0.04335752

# 累计方差贡献百分比
cumsum(pct)
#> [1] 0.6200604 0.8675017 0.9566425 1.0000000

1.2 内置PCA

Code
pca <- prcomp(df_center)
pca
#> Standard deviations (1, .., p=4):
#> [1] 1.5748783 0.9948694 0.5971291 0.4164494
#> 
#> Rotation (n x k) = (4 x 4):
#>                 PC1        PC2        PC3         PC4
#> Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
#> Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
#> UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
#> Rape     -0.5434321  0.1673186 -0.8177779  0.08902432
summary(pca)
#> Importance of components:
#>                           PC1    PC2     PC3     PC4
#> Standard deviation     1.5749 0.9949 0.59713 0.41645
#> Proportion of Variance 0.6201 0.2474 0.08914 0.04336
#> Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

1.3 可视化分析

1.3.1 判断主成分的个数

  1. Cattell碎石图 图形变化最大处,即拐角处
  2. Kaiser-Harris准则 特征值大于1,直线y=1以上
  3. 平行分析 基于真实数据的特征值大于一组随机数据矩阵相应的特征值(虚线)

1.3.2 碎石图

Code
# Create Scree Plot
screeplot(pca, type = "lines", main = "Scree Plot")

Code

library(ggplot2)
explained_variance <- pca$sdev^2 / sum(pca$sdev^2)
explained_variance_df <- data.frame(
  Principal_Component = paste0("PC", 1:length(explained_variance)),
  Explained_Variance = explained_variance
)

ggplot(explained_variance_df, aes(x = Principal_Component, y = Explained_Variance)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_line(aes(group = 1), color = "blue") +
  geom_point(color = "red") +
  labs(title = "Scree Plot", x = "Principal Component", y = "Explained Variance") +
  theme_minimal()

1.3.2.1 平行分析

Code
fa_parallel <- psych::fa.parallel(df_center, fa = "pc", n.iter = 100)

#> Parallel analysis suggests that the number of factors =  NA  and the number of components =  1
Code
svd$v
#>            [,1]       [,2]       [,3]        [,4]
#> [1,] -0.5358995 -0.4181809  0.3412327  0.64922780
#> [2,] -0.5831836 -0.1879856  0.2681484 -0.74340748
#> [3,] -0.2781909  0.8728062  0.3780158  0.13387773
#> [4,] -0.5434321  0.1673186 -0.8177779  0.08902432
tibble(x = 1:4, pc1 = svd$v[, 1]) %>%
    ggplot(aes(x, pc1)) +
    geom_point() +
    theme_classic() +
    theme(panel.border = element_rect(
        color = "black",
        fill = NA,
    ), # 添加四周框线
    )

Code

plot(svd$v[,1],ylab = "1st PC")

Code
plot(svd$v[,1],svd$v[,2],xlab="lst PC",ylab="2nd PC")

1.4 tidy 主成分分析

Code
df <- as_tibble(USArrests, rownames = "state")
df
#> # A tibble: 50 × 5
#>    state       Murder Assault UrbanPop  Rape
#>    <chr>        <dbl>   <int>    <int> <dbl>
#>  1 Alabama       13.2     236       58  21.2
#>  2 Alaska        10       263       48  44.5
#>  3 Arizona        8.1     294       80  31  
#>  4 Arkansas       8.8     190       50  19.5
#>  5 California     9       276       91  40.6
#>  6 Colorado       7.9     204       78  38.7
#>  7 Connecticut    3.3     110       77  11.1
#>  8 Delaware       5.9     238       72  15.8
#>  9 Florida       15.4     335       80  31.9
#> 10 Georgia       17.4     211       60  25.8
#> # ℹ 40 more rows

df |>
  select(-state) |>
  map_dfr(mean)  #apply(.,2,mean)
#> # A tibble: 1 × 4
#>   Murder Assault UrbanPop  Rape
#>    <dbl>   <dbl>    <dbl> <dbl>
#> 1   7.79    171.     65.5  21.2
Code
df_pca <- df |>
  select(-state) |>
  stats::prcomp(scale = TRUE)

主成分得分,表示主成分与原有观测的相关系数

Code
df_pca$x
#>               PC1         PC2         PC3          PC4
#>  [1,] -0.97566045 -1.12200121  0.43980366  0.154696581
#>  [2,] -1.93053788 -1.06242692 -2.01950027 -0.434175454
#>  [3,] -1.74544285  0.73845954 -0.05423025 -0.826264240
#>  [4,]  0.13999894 -1.10854226 -0.11342217 -0.180973554
#>  [5,] -2.49861285  1.52742672 -0.59254100 -0.338559240
#>  [6,] -1.49934074  0.97762966 -1.08400162  0.001450164
#>  [7,]  1.34499236  1.07798362  0.63679250 -0.117278736
#>  [8,] -0.04722981  0.32208890  0.71141032 -0.873113315
#>  [9,] -2.98275967 -0.03883425  0.57103206 -0.095317042
#> [10,] -1.62280742 -1.26608838  0.33901818  1.065974459
#> [11,]  0.90348448  1.55467609 -0.05027151  0.893733198
#> [12,]  1.62331903 -0.20885253 -0.25719021 -0.494087852
#> [13,] -1.36505197  0.67498834  0.67068647 -0.120794916
#> [14,]  0.50038122  0.15003926 -0.22576277  0.420397595
#> [15,]  2.23099579  0.10300828 -0.16291036  0.017379470
#> [16,]  0.78887206  0.26744941 -0.02529648  0.204421034
#> [17,]  0.74331256 -0.94880748  0.02808429  0.663817237
#> [18,] -1.54909076 -0.86230011  0.77560598  0.450157791
#> [19,]  2.37274014 -0.37260865  0.06502225 -0.327138529
#> [20,] -1.74564663 -0.42335704  0.15566968 -0.553450589
#> [21,]  0.48128007  1.45967706  0.60337172 -0.177793902
#> [22,] -2.08725025  0.15383500 -0.38100046  0.101343128
#> [23,]  1.67566951  0.62590670 -0.15153200  0.066640316
#> [24,] -0.98647919 -2.36973712  0.73336290  0.213342049
#> [25,] -0.68978426  0.26070794 -0.37365033  0.223554811
#> [26,]  1.17353751 -0.53147851 -0.24440796  0.122498555
#> [27,]  1.25291625  0.19200440 -0.17380930  0.015733156
#> [28,] -2.84550542  0.76780502 -1.15168793  0.311354436
#> [29,]  2.35995585  0.01790055 -0.03648498 -0.032804291
#> [30,] -0.17974128  1.43493745  0.75677041  0.240936580
#> [31,] -1.96012351 -0.14141308 -0.18184598 -0.336121113
#> [32,] -1.66566662  0.81491072  0.63661186 -0.013348844
#> [33,] -1.11208808 -2.20561081  0.85489245 -0.944789648
#> [34,]  2.96215223 -0.59309738 -0.29824930 -0.251434626
#> [35,]  0.22369436  0.73477837  0.03082616  0.469152817
#> [36,]  0.30864928  0.28496113  0.01515592  0.010228476
#> [37,] -0.05852787  0.53596999 -0.93038718 -0.235390872
#> [38,]  0.87948680  0.56536050  0.39660218  0.355452378
#> [39,]  0.85509072  1.47698328  1.35617705 -0.607402746
#> [40,] -1.30744986 -1.91397297  0.29751723 -0.130145378
#> [41,]  1.96779669 -0.81506822 -0.38538073 -0.108470512
#> [42,] -0.98969377 -0.85160534 -0.18619262  0.646302674
#> [43,] -1.34151838  0.40833518  0.48712332  0.636731051
#> [44,]  0.54503180  1.45671524 -0.29077592 -0.081486749
#> [45,]  2.77325613 -1.38819435 -0.83280797 -0.143433697
#> [46,]  0.09536670 -0.19772785 -0.01159482  0.209246429
#> [47,]  0.21472339  0.96037394 -0.61859067 -0.218628161
#> [48,]  2.08739306 -1.41052627 -0.10372163  0.130583080
#> [49,]  2.05881199  0.60512507  0.13746933  0.182253407
#> [50,]  0.62310061 -0.31778662  0.23824049 -0.164976866

# by default    df_pca$x
broom::tidy(df_pca, matrix = "scores") |> 
    pivot_wider(id_cols = everything(),
                names_from = PC,
                names_prefix = "PC",
               values_from = value)
#> # A tibble: 50 × 5
#>      row     PC1     PC2     PC3      PC4
#>    <int>   <dbl>   <dbl>   <dbl>    <dbl>
#>  1     1 -0.976  -1.12    0.440   0.155  
#>  2     2 -1.93   -1.06   -2.02   -0.434  
#>  3     3 -1.75    0.738  -0.0542 -0.826  
#>  4     4  0.140  -1.11   -0.113  -0.181  
#>  5     5 -2.50    1.53   -0.593  -0.339  
#>  6     6 -1.50    0.978  -1.08    0.00145
#>  7     7  1.34    1.08    0.637  -0.117  
#>  8     8 -0.0472  0.322   0.711  -0.873  
#>  9     9 -2.98   -0.0388  0.571  -0.0953 
#> 10    10 -1.62   -1.27    0.339   1.07   
#> # ℹ 40 more rows

主成分荷载(loading):表示主成分与原有变量的相关系数

Code
df_pca$rotation
#>                 PC1        PC2        PC3         PC4
#> Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
#> Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
#> UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
#> Rape     -0.5434321  0.1673186 -0.8177779  0.08902432

# df_pca$Rotation
tidy(df_pca, matrix = "loadings") |> 
    pivot_wider(
        names_from = PC,
        names_prefix = "PC",
        values_from = value,
    )
#> # A tibble: 4 × 5
#>   column      PC1    PC2    PC3     PC4
#>   <chr>     <dbl>  <dbl>  <dbl>   <dbl>
#> 1 Murder   -0.536 -0.418  0.341  0.649 
#> 2 Assault  -0.583 -0.188  0.268 -0.743 
#> 3 UrbanPop -0.278  0.873  0.378  0.134 
#> 4 Rape     -0.543  0.167 -0.818  0.0890

例如:

\[ PC_1=-0.536Murrder-0.583Assault-0.278UrbanPop-0.543Rape \]

Code

tidy(df_pca, matrix = "loadings") |>
  ggplot(aes(value, column)) +
  facet_wrap(~ PC) +
  geom_col() +
  scale_x_continuous(labels = scales::percent)

特征值 eigenvalues,高维椭球的主轴长度,相关矩阵的特征值。

方差百分比贡献。

Code
# screen plot
tidy(df_pca, matrix = "eigenvalues") |>
    ggplot(aes(PC, percent)) +
    geom_point(color = "red") +
    geom_line()+
    scale_y_continuous(labels = scales::percent)

Back to top