The vignette is organized as follows. In chapter 1, we guide the reader through the preparatory steps (loading packages and data). In the chapters that follow, we discuss regression estimation (with focus on weighted least squares, M- and GM-estimators) for 3 different modes of inference.
First, we load the robsurvey
and the survey
package (Lumley, 2010, 2021). For regression analysis, the availability of the survey
package is imperative.
> library(robsurvey, quietly = TRUE)
> library(survey)
Note. The argument quietly = TRUE
suppresses the start-up message in the call of library(robsurvey)
.
The counties dataset contains county-specific information on population, number of farms, land area, etc. for a sample of \(n = 100\) counties in the U.S. in the 1990s. The sampling design is simple random sample (without replacement) from the population of \(3\,141\) counties. The dataset is tabulated in Lohr (1999, Appendix C) which is based on 1994 data of the U.S. Bureau of the Census. The first 3 rows of the data are displayed below.
> data(counties)
> head(counties, 3)
state county landarea totpop unemp farmpop numfarm farmacre weights1 AL Escambia 948 36023 1339 531 414 90646 31.41
2 AL Marshall 567 73524 3189 1592 1582 136599 31.41
3 AK Prince of Wales 7325 6408 383 71 2 214 31.41
fpc1 3141
2 3141
3 3141
Description of variables.
state | state | county | county |
landarea | land area, 1990 (square miles) | totpop | population total, 1992 |
unemp | number of unemployed persons, 1991 | farmpop | farm population, 1990 |
numfarm | number of farms, 1987 | farmacre | acreage in farms, 1987 |
weights | sampling weight | fpc | finite population correction |
The goal is to regress the county-specific size of the farm population in 1990 (variable farmpop
) on a set of explanatory variables. The simplest population regression model is
\[\begin{equation} \xi: \quad \mathrm{farmpop}_i = b_0 + b_1 \cdot \mathrm{numfarm}_i + \sqrt{v_i} E_i, \qquad i \in U, \end{equation}\]
where \(U\) is the set of labels of all \(3\,141\) counties in the U.S. (population), \(b_0, b_1 \in \mathbb{R}\) are unknown regression coefficients, \(v_i\) are known constants (i.e., known up to a constant multiplicative factor, \(\sigma\)), and \(E_i\) are regression errors (random variables) with mean zero and unit variance such that \(E_i\) and \(E_j\) are conditionally independent given \(\mathrm{numfarm}_i\), \(i \in U\).
Another candidate model is
\[\begin{equation} \xi_{log}: \quad \log(\mathrm{farmpop}_i) = b_0 + b_1 \cdot \log(\mathrm{numfarm}_i) + \sqrt{v_i} E_i, \qquad i \in U. \end{equation}\]
The counties dataset is a simple random sample (without replacement) of size \(n=100\). In what follows, we restrict attention to the subset of the 98 counties with \(\mathrm{farmpop}_i > 0\). Hence, we define the sampling design dn
.
> dn <- svydesign(ids = ~1, fpc = ~fpc, weights = ~weights,
+ data = counties[counties$farmpop > 0, ])
The figures below show scatterplots of farmpop
plotted against numfarm
(in raw and log scale). The scatterplot in the left panel indicates that the variance of farmpop
increases with larger values of numfarm
(heteroscedasticity). The same graph but in log-log scale (see right panel) shows an outlier, which is located far from the bulk of the data.
To begin with, we consider fitting the model \(\xi\) (under the assumption of homoscedasticity, i.e., \(v_i \equiv 1\)) by weighted least squares. The function is svyreg()
, and it requires two arguments: a formula
object (a symbolic description of the model to be fitted) and surveydesign
object (class survey::survey.design
).
> m <- svyreg(farmpop ~ numfarm, dn, na.rm = TRUE)
> m
Survey regression estimator
:
Callsvyreg(formula = farmpop ~ numfarm, design = dn, na.rm = TRUE)
:
Coefficients
(Intercept) numfarm -121.310 1.921
: 402 Scale estimate
The output resembles the one of an lm
call. The estimated model can be summarized using
> summary(m)
:
Callsvyreg(formula = farmpop ~ numfarm, design = dn, na.rm = TRUE)
:
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max. -1381.71 -309.50 -13.04 0.00 181.30 2637.47
:
CoefficientsPr(>|t|)
Estimate Std. Error t value -121.3105 93.7991 -1.293 0.196
(Intercept) 1.9215 0.1934 9.936 <2e-16 ***
numfarm ---
: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Signif. codes
: 402 on 3076.18 degrees of freedom Residual standard error
Other useful methods and utility functions
coef()
extracts the estimated coefficientsresiduals()
extracts the residualsfitted()
extracts the fitted valuesvcov()
extracts the variance-covariance matrix of the estimated coefficientsIn the above scatterplot, we observe that the variance of farmpop
increases with larger values of numfarm
(heteroscedasticity). We conjecture that the \(v_i\)’s in model \(\xi\) are proportional to the square root of the variable numfarm
. Thus, we add variable vi
to the design object dn
with the update()
command
> dn <- update(dn, vi = sqrt(numfarm))
The argument var
of svyreg()
is now used to specify the heteroscedastic variance (it can be defined as a formula
, i.e. ~vi
, or the variable name in quotation marks, "vi"
).
> svyreg(farmpop ~ -1 + numfarm, dn, var = ~vi, na.rm = TRUE)
Survey regression estimator
:
Callsvyreg(formula = farmpop ~ -1 + numfarm, design = dn, var = ~vi,
na.rm = TRUE)
:
Coefficients
numfarm 1.775
: 87.51 Scale estimate
Note that we have dropped the regression intercept.
We continue to assume the heteroscedastic model \(\xi\), but now we compute a weighed regression M-estimator. Two types of estimators are available:
svyreg_huber()
Huber and asymmetric Huber \(\psi\)-function (the latter obtains by specifying argument asym = TRUE
)svyreg_tukey()
Tukey biweight (redescending) \(\psi\)-functionThe tuning constant of both \(\psi\)-functions is called k
. The Huber \(M\)-estimator of regression is
> m <- svyreg_huber(farmpop ~ -1 + numfarm, dn, var = ~vi, k = 1.3, na.rm = TRUE)
> m
-estimator (Huber psi, k = 1.3)
Survey regression M
:
Callsvyreg_huber(formula = farmpop ~ -1 + numfarm, design = dn, k = 1.3,
var = ~vi, na.rm = TRUE)
in 6 iterations
IRWLS converged
:
Coefficients
numfarm 1.669
: 83.35 Scale estimate
and the summary obtains by
> summary(m)
:
Callsvyreg_huber(formula = farmpop ~ -1 + numfarm, design = dn, k = 1.3,
var = ~vi, na.rm = TRUE)
:
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max. -165.951 -64.126 -10.095 6.848 47.253 458.062
:
CoefficientsPr(>|t|)
Estimate Std. Error t value 1.66859 0.05554 30.04 <2e-16 ***
numfarm ---
: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Signif. codes
: 83.35 on 3077.18 degrees of freedom
Residual standard error
:
Robustness weights
Min. 1st Qu. Median Mean 3rd Qu. Max. 0.2369 1.0000 1.0000 0.9294 1.0000 1.0000
The output of the summary method is almost identical with the one of an lm
call. The paragraph on robustness weights is a numerical summary of the M-estimator’s robustness weights, which can be extracted from the estimated model with robweights()
.
The subsequent figure shows the graph for plot(residuals(m), robweights(m))
. From the graph, we can see by how much the residuals have been downweighted.
We want to fit the population regression model \[ \log(\mathrm{farmpop}_i) = b_0 + b_1 \log(\mathrm{numfarm}_i) + b_2 \log(\mathrm{totpop}_i) + b_3 \log(\mathrm{farmacre}_i) + \sqrt{v_i} E_i, \qquad i \in U, \]
with sample data by the weighted generalized M-estimator (GM) of regression. GM-estimators are robust against high leverage observations (i.e., outliers in the design space of the model) while \(M\)-estimators of regression are not. Two types of GM-estimators are available:
svyreg_huberGM()
Huber and asymmetric Huber \(\psi\)-function (the latter obtains by specifying argument asym = TRUE
)svyreg_tukeyGM()
Tukey biweight (redescending) \(\psi\)-functionVariable farmacre
contains 3 missing values. For ease of analysis, we exclude the missing values from the survey.design
object dn
.
> dn_exclude <- na.exclude(dn)
The formula object of our model is
> f <- log(farmpop) ~ log(numfarm) + log(totpop) + log(farmacre)
With the help of the formula object f
, we form a matrix of the explanatory variables (excluding the regression intercept) of our model. This matrix is called xmat
.
> xmat <- model.matrix(f, dn_exclude$variables)[, -1]
Below we show the pairwise scatterplots of pairs(xmat)
. The pairwise distributions are mostly elliptically contoured and show some outliers.
The (intermediate) goal is to compute the robust Mahalanobis distances of the observations in xmat
. Observations with large distances are considered outliers and will be downweighted in the subsequent regression analysis. In order to obtain the robust Mahalanobis distances, we compute the robust multivariate location and scatter matrix of xmat
using the (weighted) BACON algorithm in package wbacon
(Schoch, 2021).
> library(wbacon)
> mv <- wBACON(xmat, weights = weights(dn_exclude))
> mv
: Robust location, covariance, and distances
Weighted BACONin 7 iterations (alpha = 0.05)
Converged : 2 (2.11%) Number of potential outliers
The estimated multivariate location and scatter or covariance matrix can be extracted from object mv
with the functions center()
and vcov()
, respectively. The robust Mahalanobis distances can be extracted with
> dis <- distance(mv)
The boxplot of dis
shows a couple of observations with large Mahalanobis distance. These observations are considered as potential outliers.
Three functions are available to downweight excessively large distances (see also figure)
tukeyWgt()
huberWgt()
simpsonWgt()
see Simpson et al. (1992)The GM-estimator of regression with Tukey biweight function and downweighting of high leverage observations using tukeyWgt(dis)
is
> m <- svyreg_tukeyGM(f, dn_exclude, k = 4.6, xwgt = tukeyWgt(dis))
> summary(m)
:
Callsvyreg_tukeyGM(formula = f, design = dn_exclude, k = 4.6, xwgt = tukeyWgt(dis))
:
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max. -1.004303 -0.352241 -0.052866 0.004878 0.360099 3.389738
:
CoefficientsPr(>|t|)
Estimate Std. Error t value 1.62725 0.56957 2.857 0.00431 **
(Intercept) log(numfarm) 1.30079 0.07044 18.466 < 2e-16 ***
log(totpop) -0.06550 0.03176 -2.062 0.03929 *
log(farmacre) -0.20161 0.04617 -4.366 1.31e-05 ***
---
: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Signif. codes
: 0.5549 on 2979.95 degrees of freedom
Residual standard error
:
Robustness weights
Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0000 0.6270 0.7850 0.7114 0.8818 0.9959
Model-based inferential statistics are computed using a dummy survey.design
object under the assumption of a simple random sample without replacement. We define
> dn0 <- svydesign(ids = ~1, weights = ~1,
+ data = counties[counties$farmpop > 0, ])
which differs from our original design dn
in the following aspects:
weights = ~1
(i.e., the sampling weights are set to 1.0);fpc
(finite sampling correction) is not specified.We consider fitting model \(\xi_{log}\) by the Huber regression \(M\)-estimator (based on the design object dn0
)
> m <- svyreg_huber(log(farmpop) ~ log(numfarm), dn0, k = 1.3, na.rm = TRUE)
In the call of summary()
, we specify the argument mode = "model"
for model-based inference (default is mode = "design"
) to obtain
> summary(m, mode = "model")
:
Callsvyreg_huber(formula = log(farmpop) ~ log(numfarm), design = dn0,
k = 1.3, na.rm = TRUE)
:
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max. -1.203816 -0.314133 0.007801 0.003245 0.335280 3.647648
:
CoefficientsPr(>|t|)
Estimate Std. Error t value -0.13991 0.29129 -0.48 0.632
(Intercept) log(numfarm) 1.08916 0.04663 23.36 <2e-16 ***
---
: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Signif. codes
: 0.4934 on 96 degrees of freedom
Residual standard error
:
Robustness weights
Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1758 1.0000 1.0000 0.9557 1.0000 1.0000
Likewise, we can call vcov(m, mode = "model")
to extract the estimated model-based covariance matrix of the regression estimator.
Suppose we wish to estimate the regression coefficients that refer to the superpopulation (i.e., a more general population than the finite population) given sample data. Furthermore, our sample data represent a large fraction of the finite population (i.e., \(n/N\) is not small). In statistical inference about the coefficients, we need to account for the sampling design and the model because the quantity of interest is the superpopulation parameter and the sampling fraction is not negligible; see motivating example.
The MU284 population of Särndal et al. (1992, Appendix B) is a dataset with observations on the 284 municipalities in Sweden in the late 1970s and early 1980s. It is available in the sampling
package; see Tillé and Matei (2021). The population is divided into two parts based on 1975 population size (P75
):
The three biggest cities take exceedingly large values (representative outliers) on almost all of the variables. To account for this (at least to some extent), a stratified sample has been drawn from the MU284 population using a take-all stratum. The sample data, MU284strat
, is of size \(n=60\) and consists of
Stratum | \(S_1\) | \(S_2\) | \(S_3\) | \(S_4\) | \(S_5\) | \(S_6\) | \(S_7\) | \(S_8\) | take all |
Population stratum size | 24 | 48 | 32 | 37 | 55 | 41 | 15 | 29 | 3 |
Sample stratum size | 5 | 10 | 6 | 8 | 11 | 8 | 3 | 6 | 3 |
The overall sampling fraction is \(60/284 \approx 21\%\) (and thus not negligible). The data frame MU284strat
includes the following variables.
LABEL | identifier variable | P85 | 1985 population size (in \(10^3\)) |
P75 | 1975 population size (in \(10^3\)) | RMT85 | revenues from the 1985 municipal taxation (in \(10^6\) kronor) |
CS82 | number of Conservative seats in municipal council | SS82 | number of Social-Democrat seats in municipal council (1982) |
S82 | total number of seats in municipal council in 1982 | ME84 | number of municipal employees in 1984 |
REV84 | real estate values in 1984 (in \(10^6\) kronor) | CL | cluster indicator |
REG | geographic region indicator | Stratum | stratum indicator |
weights | sampling weights | fpc | finite population correction |
We load the data and generate the sampling design.
> data(MU284strat)
> dn <- svydesign(ids = ~1, strata = ~ Stratum, fpc = ~fpc, weights = ~weights,
+ data = MU284strat)
The Huber M-estimator of regression is computed (as before)
> m <- svyreg_huber(RMT85 ~ REV84 + P85 + S82 + CS82, dn, k = 2)
The compound design- and model-based distribution of the estimated coefficients obtains by specifying the argument mode = compound
in the call of the summary()
method.
> summary(m, mode = "compound")
:
Callsvyreg_huber(formula = RMT85 ~ REV84 + P85 + S82 + CS82, design = dn,
k = 2)
:
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max. -77.877 -18.469 5.044 66.250 17.635 2690.755
:
CoefficientsPr(>|t|)
Estimate Std. Error t value 130.973688 22.260303 5.884 1.15e-08 ***
(Intercept) 0.005729 0.003304 1.734 0.0840 .
REV84 9.499801 0.285577 33.265 < 2e-16 ***
P85 -3.845513 0.542752 -7.085 1.14e-11 ***
S82 -1.965082 1.010420 -1.945 0.0528 .
CS82 ---
: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Signif. codes
: 23.96 on 279 degrees of freedom
Residual standard error
:
Robustness weights
Min. 1st Qu. Median Mean 3rd Qu. Max. 0.01781 1.00000 1.00000 0.95076 1.00000 1.00000
We may call vcov(m, mode = "compound")
to extract the estimated covariance matrix of the regression estimator under the compound design- and model-based distribution.
Lumley, T. (2021). survey: analysis of complex survey samples. R package version 4.0, URL https://CRAN.R-project.org/package=survey.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R: A Guide to Analysis Using R. John Wiley and Sons, Hoboken, NJ.
Lohr, S.L. (1999). Sampling: Design and Analysis. Pacific Grove (CA): Duxbury Press.
Venables, W.N. and Ripley, B.D. (2002). Modern Applied Statistics with S. 4th Edition. Springer, New York.
Schoch, T. (2021). wbacon: Weighted BACON algorithms for multivariate outlier nomination (detection) and robust linear regression. Journal of Open Source Software 6 (62), 3238. DOI: 10.21105/joss.03238
Simpson, D.G., Ruppert, D. and Carroll, R.J. (1992). On One-Step GM Estimates and Stability of Inferences in Linear Regression. Journal of the American Statistical Association 87, 439-450. DOI: 10.1080/01621459.1992.10475224
Tillé, T. and Matei, A. (2021). sampling: Survey Sampling. R package version 2.9. https://CRAN.R-project.org/package=sampling.