20  广义估计方程

Mixed Effects Models and Extensions in Ecology with R 第12章

广义估计方程(Generalized Estimating Equations,GEE)是一种用于分析重复测量或相关数据的统计方法。 GEE的核心思想是通过建立一个广义线性模型来估计一组总体均值参数,同时考虑数据的相关性结构和缺失数据,而不依赖于特定的分布假设,从而使其适用于多种数据类型,包括二项分布、正态分布、负二项分布。

广义估计方程用于处理具有组内相关性的非正态分布数据。模型形式类似于广义线性模型,但使用了估计方程来考虑组内相关性。

20.1 数据集处理

数据集 Analyzing ecological data

Code
rfb <- read_delim("data/AED/RiceFieldBirds.txt")
rfb$richness <- rowSums(rfb[, 8:56] > 0)
rfb |> mutate(
    FIELD = factor(FIELD),
    SPTREAT = factor(SPTREAT),
    log_AREA = log(rfb$AREA),
    DEPTH2 = DEPTH ^ 2,
    .after = 4
) -> rfb


ggplot(rfb, aes(Time, richness)) +
    geom_point(pch = 21) +
    geom_smooth(method = "loess", se = F) +
    facet_wrap(vars(FIELD), labeller = "label_both")

Code
owls <- read_delim("data/AED/Owls.txt")

owls |>
    mutate(NCalls = SiblingNegotiation, log_broodsize = log(BroodSize), ) ->
    owls
Code
de <- read_delim("data/AED/DeerEcervi.txt")

de |>
    mutate(
        Ecervi_binary = if_else(Ecervi > 0, 1, Ecervi),
        Sex = factor(Sex),
        Length_center = scale(Length, center = T , scale = F),
    ) -> de

20.2 GLM 连接函数

GLM的形式如下:

\[ g(\mu_{ij})=\beta_0+\beta_1x_{ij_1}+\beta_2x_{ij_2}+...+\beta_px_{ij_p} \]

其中,g() 是一个连接函数,用于将自变量的线性组合与因变量的均值联系起来。常见的连接函数包括恒等连接函数(identity)、logistic连接函数(logit)、逆正弦连接函数(inverse sine)函数等。i 表示观测对象的索引,j 表示时间点或相关性结构的索引,uij 表示因变量的均值,β 表示待估计的系数,Xij 表示自变量。

\[ \eta = \beta \mathbf{X}+ \alpha \]

\[ E(y)=g^{-1}(\eta) \]

对于计数数据:

\[ E(y)=e^{\eta}=e^{\beta \mathbf{X} +\alpha}, 此时g(\mu)=\ln(\mu)=\mathbf{X}\beta \]

Code
rfb_glm <- glm(richness ~ offset(log_AREA) +SPTREAT + DEPTH +DEPTH2,
            family = quasipoisson(link = "log"),
            data = rfb)

owls_glm <- glm(
    NCalls ~ offset(log_broodsize) + SexParent * FoodTreatment + SexParent *
        ArrivalTime,
    family = poisson(link = "log"),
    data = owls
)

对于二分类数据:

\[ E(y)=\frac{e^{\beta \mathbf{X} +\alpha}}{1+e^{\beta\mathbf{X}+\alpha}}, 此时g(\mu)=\ln(\frac{\mu}{1-\mu})=logit(\mu) \]

Code
de_glm <- glm(Ecervi_binary ~ Length_center *Sex,
              family = binomial(link = "logit"),
              data = de)
summary(de_glm)
#> 
#> Call:
#> glm(formula = Ecervi_binary ~ Length_center * Sex, family = binomial(link = "logit"), 
#>     data = de)
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)        0.652409   0.109602   5.953 2.64e-09 ***
#> Length_center      0.025112   0.005576   4.504 6.68e-06 ***
#> Sex2               0.163873   0.174235   0.941   0.3469    
#> Length_center:Sex2 0.020109   0.009722   2.068   0.0386 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1073.1  on 825  degrees of freedom
#> Residual deviance: 1003.7  on 822  degrees of freedom
#> AIC: 1011.7
#> 
#> Number of Fisher Scoring iterations: 4

20.3 方差

\[ var(Y_{is}|X_{is})=\Phi × \nu(\mu_{is}) \]

其中,Φ 是scale parameter (overdispersion),v() 是方差函数。

20.4 相关性结构

R(α)

  1. 非结构化相关:cor(Yis,Yit)=αst

  2. 自回归相关:cor(Yis,Yit)=α|s-t|

  3. exchangeable 等相关:cor(Yis,Yit)=α

  4. independence,独立,cor(Yis,Yit)=I

20.5 GEE

GLM的期望和方差:

\[ E(Y_{ij})=b'(\theta_{ij})\\ Var(Y_{ij})=b''(\theta_{ij})\ \Phi \]

假定潜在的随机成分Vi

\[ V_i=A_i^{1/2}R(\alpha)A_i^{1/2}\ \Phi \]

其中,Ai =diag(b''(θi1),...,b''(θij))

GEE的形式如下:

\[ U(\beta)=\sum_{i=1}^{n} D'_iV_i^{-1}(y_i-\mu_i)=0 \]

其中,$U(β) $是一个包含待估计参数的函数,quasi-deviance \(D'_i = \left(\frac{\partial \mu_i}{\partial \beta}\right)'\)\(V_i\) 是方差-协方差矩阵,\(R(\alpha)\) 是相关性结构矩阵,y 是观测数据,μ 是模型的均值预测值。采用迭代重复加权最小二乘法(iteratively reweighted least squares ,IWLS)),偏微分方程估计参数。

20.6 gee

https://www.statsmodels.org/stable//gee.html

Code
data(epil,package = "MASS")
Oxboys |> head(n=18)
Subject age height Occasion
1 -1.0000 140.5 1
1 -0.7479 143.4 2
1 -0.4630 144.8 3
1 -0.1643 147.1 4
1 -0.0027 147.7 5
1 0.2466 150.2 6
1 0.5562 151.7 7
1 0.7781 153.3 8
1 0.9945 155.8 9
2 -1.0000 136.9 1
2 -0.7479 139.1 2
2 -0.4630 140.1 3
2 -0.1643 142.6 4
2 -0.0027 143.2 5
2 0.2466 144.0 6
2 0.5562 145.8 7
2 0.7781 146.8 8
2 0.9945 148.3 9
Code


df_long <- Oxboys
Code
library(gee)

g1 <- gee(height~age,data = df_long ,id = Subject,corstr = "AR-M",Mv = 1)
#> (Intercept)         age 
#>  149.371801    6.521022
g1
#> 
#>  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#>  gee S-function, version 4.13 modified 98/01/27 (1998) 
#> 
#> Model:
#>  Link:                      Identity 
#>  Variance to Mean Relation: Gaussian 
#>  Correlation Structure:     AR-M , M = 1 
#> 
#> Call:
#> gee(formula = height ~ age, id = Subject, data = df_long, corstr = "AR-M", 
#>     Mv = 1)
#> 
#> Number of observations :  234 
#> 
#> Maximum cluster size   :  9 
#> 
#> 
#> Coefficients:
#> (Intercept)         age 
#>  149.719096    6.547328 
#> 
#> Estimated Scale Parameter:  65.41743
#> Number of Iterations:  2
#> 
#> Working Correlation[1:4,1:4]
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.9892949 0.9787045 0.9682274
#> [2,] 0.9892949 1.0000000 0.9892949 0.9787045
#> [3,] 0.9787045 0.9892949 1.0000000 0.9892949
#> [4,] 0.9682274 0.9787045 0.9892949 1.0000000
#> 
#> 
#> Returned Error Value:
#> [1] 0
summary(g1)
#> 
#>  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#>  gee S-function, version 4.13 modified 98/01/27 (1998) 
#> 
#> Model:
#>  Link:                      Identity 
#>  Variance to Mean Relation: Gaussian 
#>  Correlation Structure:     AR-M , M = 1 
#> 
#> Call:
#> gee(formula = height ~ age, id = Subject, data = df_long, corstr = "AR-M", 
#>     Mv = 1)
#> 
#> Summary of Residuals:
#>         Min          1Q      Median          3Q         Max 
#> -22.0304139  -5.4912438   0.1324571   4.3822174  18.5695861 
#> 
#> 
#> Coefficients:
#>               Estimate Naive S.E.  Naive z Robust S.E. Robust z
#> (Intercept) 149.719096  1.5531285 96.39840   1.5847569 94.47449
#> age           6.547328  0.3177873 20.60286   0.3042478 21.51972
#> 
#> Estimated Scale Parameter:  65.41743
#> Number of Iterations:  2
#> 
#> Working Correlation
#>            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
#>  [1,] 1.0000000 0.9892949 0.9787045 0.9682274 0.9578625 0.9476085 0.9374643
#>  [2,] 0.9892949 1.0000000 0.9892949 0.9787045 0.9682274 0.9578625 0.9476085
#>  [3,] 0.9787045 0.9892949 1.0000000 0.9892949 0.9787045 0.9682274 0.9578625
#>  [4,] 0.9682274 0.9787045 0.9892949 1.0000000 0.9892949 0.9787045 0.9682274
#>  [5,] 0.9578625 0.9682274 0.9787045 0.9892949 1.0000000 0.9892949 0.9787045
#>  [6,] 0.9476085 0.9578625 0.9682274 0.9787045 0.9892949 1.0000000 0.9892949
#>  [7,] 0.9374643 0.9476085 0.9578625 0.9682274 0.9787045 0.9892949 1.0000000
#>  [8,] 0.9274287 0.9374643 0.9476085 0.9578625 0.9682274 0.9787045 0.9892949
#>  [9,] 0.9175005 0.9274287 0.9374643 0.9476085 0.9578625 0.9682274 0.9787045
#>            [,8]      [,9]
#>  [1,] 0.9274287 0.9175005
#>  [2,] 0.9374643 0.9274287
#>  [3,] 0.9476085 0.9374643
#>  [4,] 0.9578625 0.9476085
#>  [5,] 0.9682274 0.9578625
#>  [6,] 0.9787045 0.9682274
#>  [7,] 0.9892949 0.9787045
#>  [8,] 1.0000000 0.9892949
#>  [9,] 0.9892949 1.0000000


gee1 <- gee(y ~ age + trt + base,id=subject,data = epil,family = poisson,corstr ="exchangeable" )
#>  (Intercept)          age trtprogabide         base 
#>   0.57304359   0.02234757  -0.15188049   0.02263524
summary(gee1)
#> 
#>  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#>  gee S-function, version 4.13 modified 98/01/27 (1998) 
#> 
#> Model:
#>  Link:                      Logarithm 
#>  Variance to Mean Relation: Poisson 
#>  Correlation Structure:     Exchangeable 
#> 
#> Call:
#> gee(formula = y ~ age + trt + base, id = subject, data = epil, 
#>     family = poisson, corstr = "exchangeable")
#> 
#> Summary of Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -15.742906  -3.318756  -1.186874   1.295122  63.957902 
#> 
#> 
#> Coefficients:
#>                 Estimate  Naive S.E.   Naive z Robust S.E.   Robust z
#> (Intercept)   0.57304359 0.451797966  1.268362 0.360726141  1.5885835
#> age           0.02234757 0.013412798  1.666138 0.011400956  1.9601489
#> trtprogabide -0.15188049 0.159304397 -0.953398 0.171051111 -0.8879246
#> base          0.02263524 0.001696439 13.342795 0.001226748 18.4514092
#> 
#> Estimated Scale Parameter:  5.087384
#> Number of Iterations:  1
#> 
#> Working Correlation
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.3933815 0.3933815 0.3933815
#> [2,] 0.3933815 1.0000000 0.3933815 0.3933815
#> [3,] 0.3933815 0.3933815 1.0000000 0.3933815
#> [4,] 0.3933815 0.3933815 0.3933815 1.0000000

20.7 geepack

Code
library(geepack)

rfb_gee <- geeglm(richness ~ offset(log_AREA) +SPTREAT + DEPTH +DEPTH2,
            family = poisson(link = "log"),
            data = rfb,
            id=FIELD,
            corstr = "ar1")

summary(rfb_gee)
#> 
#> Call:
#> geeglm(formula = richness ~ offset(log_AREA) + SPTREAT + DEPTH + 
#>     DEPTH2, family = poisson(link = "log"), data = rfb, id = FIELD, 
#>     corstr = "ar1")
#> 
#>  Coefficients:
#>                Estimate    Std.err  Wald Pr(>|W|)   
#> (Intercept)  -0.6782034  0.2610088 6.752  0.00937 **
#> SPTREATrlfld -0.5223137  0.2287644 5.213  0.02242 * 
#> DEPTH         0.0498238  0.0193149 6.654  0.00989 **
#> DEPTH2       -0.0011411  0.0004908 5.406  0.02007 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Correlation structure = ar1 
#> Estimated Scale Parameters:
#> 
#>             Estimate Std.err
#> (Intercept)    2.334  0.2927
#>   Link = identity 
#> 
#> Estimated Correlation Parameters:
#>       Estimate Std.err
#> alpha   0.4215 0.09034
#> Number of clusters:   11  Maximum cluster size: 10
rfb_gee2 <- geeglm(richness ~ offset(log_AREA) +SPTREAT ,
            family = poisson(link = "log"),
            data = rfb,
            id=FIELD,
            corstr = "ar1")

# Wald's Test
anova(rfb_gee,rfb_gee2)
Df X2 P(>|Chi|)
2 6.885 0.032

\[ E(Y_{is})=\mu_{is}=e^{-0.678+0.0498×DEPTH-0.001×DEPTH^2-0.522×SPTREAT_{is}}\\ var(Y_{is})= 2.33 × \mu_{is} \\ cor(Y_{is},Y_{it})=0.422^{|s-t|} \]

作业相关矩阵 corstr ,比较QIC,一般越小越好。

Code
QIC(rfb_gee)
#>       QIC      QICu Quasi Lik       CIC    params      QICC 
#>  -467.536  -471.921   239.961     6.193     4.000  -455.536
QIC(rfb_gee2)
#>       QIC      QICu Quasi Lik       CIC    params      QICC 
#>  -451.065  -457.436   230.718     5.185     2.000  -447.637

20.7.1 示例

Code
df <- read_delim("data/麻醉诱导时相.txt")
df_long <- df |> pivot_longer(
    cols = starts_with("t"),
    names_to = "time",
    values_to = "SBP"
)
Code
g3 <- geeglm(SBP~ group * time,data = df_long ,id = id,corstr = "ar1")

g3
#> 
#> Call:
#> geeglm(formula = SBP ~ group * time, data = df_long, id = id, 
#>     corstr = "ar1")
#> 
#> Coefficients:
#>   (Intercept)        groupB        groupC        timet1        timet2 
#>         121.0           0.2           5.2          -8.6          -2.6 
#>        timet3        timet4 groupB:timet1 groupC:timet1 groupB:timet2 
#>           4.8          -0.2           7.2           5.4          -0.6 
#> groupC:timet2 groupB:timet3 groupC:timet3 groupB:timet4 groupC:timet4 
#>          -5.0           2.2          11.6          14.2           4.6 
#> 
#> Degrees of Freedom: 75 Total (i.e. Null);  60 Residual
#> 
#> Scale Link:                   identity
#> Estimated Scale Parameters:  [1] 16.13
#> 
#> Correlation:  Structure = ar1    Link = identity 
#> Estimated Correlation Parameters:
#>  alpha 
#> 0.8464 
#> 
#> Number of clusters:   15   Maximum cluster size: 5
summary(g3)
#> 
#> Call:
#> geeglm(formula = SBP ~ group * time, data = df_long, id = id, 
#>     corstr = "ar1")
#> 
#>  Coefficients:
#>               Estimate Std.err    Wald Pr(>|W|)    
#> (Intercept)    121.000   1.414 7320.50  < 2e-16 ***
#> groupB           0.200   2.234    0.01  0.92867    
#> groupC           5.200   2.028    6.58  0.01034 *  
#> timet1          -8.600   0.921   87.22  < 2e-16 ***
#> timet2          -2.600   1.315    3.91  0.04794 *  
#> timet3           4.800   1.180   16.55  4.7e-05 ***
#> timet4          -0.200   1.213    0.03  0.86907    
#> groupB:timet1    7.200   1.173   37.67  8.4e-10 ***
#> groupC:timet1    5.400   2.191    6.08  0.01371 *  
#> groupB:timet2   -0.600   1.470    0.17  0.68309    
#> groupC:timet2   -5.000   1.943    6.62  0.01008 *  
#> groupB:timet3    2.200   1.397    2.48  0.11534    
#> groupC:timet3   11.600   3.098   14.02  0.00018 ***
#> groupB:timet4   14.200   1.425   99.23  < 2e-16 ***
#> groupC:timet4    4.600   2.498    3.39  0.06555 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Correlation structure = ar1 
#> Estimated Scale Parameters:
#> 
#>             Estimate Std.err
#> (Intercept)     16.1    4.44
#>   Link = identity 
#> 
#> Estimated Correlation Parameters:
#>       Estimate Std.err
#> alpha    0.846  0.0661
#> Number of clusters:   15  Maximum cluster size: 5
QIC(g3)
#>       QIC      QICu Quasi Lik       CIC    params      QICC 
#>      1240      1240      -605        15        15       968