dc.description.abstract
In its Global Risks Report 2017, the World Economic Forum identifies rising income and
wealth disparity as one of the top five global development trends that potentially cause unemployment,
underemployment, and profound social instability.
To counteract this development, it is necessary to quantify inequality and to analyze the distribution
of income and wealth. In order to measure inequality and identify factors that significantly
impact income or wealth, governments and statistical offices collect data by conducting
surveys and censuses. However, collecting data on rather private topics such as income can
lead to high item non-response rates. Therefore, it is tempting for survey designers to collect
information on income in bands rather than as exact amounts. This kind of data is commonly known as interval-censored data, grouped data
or banded data. It is defined as observing only the lower and upper bound of an income variable
with its exact value remaining unknown. Collecting only the interval information instead
of continuous data offers a higher degree of data privacy protection to survey respondents,
which lowers response burdens and thus leads to lower item non-response rates and higher data
quality. This kind of data is already being collected by a number of surveys and censuses.
Among them are the largest annual survey in Europe, the German Microcensus, and the censuses of Australia,
Colombia, and New Zealand.
While data quality is increased, analyzing interval-censored data requires more advanced
statistical methods. This is due to the fact that only the interval information is observed and
the underlying data distribution within each interval remains unobserved. For instance, well-established
and widely used statistical methods, such as linear and linear mixed regression,
require a continuous response variable. Furthermore, formulas to estimate statistical indicators,
such as the mean, rely on metric data. While regression models are commonly applied to
analyze income and wages, the estimation of statistical indicators from interval-censored data
is of particular interest for the Federal Statistical Office and the Statistical Offices of the German
States in order to measure and monitor the regional distribution of poverty and inequality. This work therefore proposes new statistical methodology for
the estimation of linear and linear mixed regression models with an interval-censored response
variable and for the estimation of statistical indicators from interval-censored data, e.g., German
Microcensus data.
In Part I of the thesis, theory is developed to infer the properties of a population with linear
and linear mixed models using sample data. In particular, in Chapter 1, theory is proposed
to estimate the regression parameters and their standard errors for linear and linear mixed models
with an interval-censored response variable. For the estimation of the parameters, a novel
stochastic expectation-maximization (SEM) algorithm is proposed. In order to estimate the
standard errors of the regression parameters, two different bootstraps are introduced: a non-parametric
bootstrap for the linear regression model and a parametric bootstrap for the linear
mixed regression model. Both bootstraps account for the additional uncertainty
that is caused by the interval censoring of the dependent variable. The theory is applied to
analyze interval-censored personal income data collected by the German Microcensus with
a linear mixed regression model. By applying the newly proposed methodology, different
components that significantly affect income are discovered.
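The core idea of a stochastic EM scheme for an interval-censored response can be sketched as follows. This is a minimal illustration in Python for a simple linear regression with one covariate, not the implementation developed in the thesis: the function names, the midpoint initialization, the burn-in and iteration counts, the assumption of finite interval bounds, and the simulated data are all assumptions made for the example. The sketch alternates between drawing pseudo responses from a normal distribution truncated to each observed interval and refitting the model by least squares, and it averages the estimates obtained after a burn-in period.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def draw_truncated_normal(mu, sigma, lower, upper, rng):
    """Draw from N(mu, sigma^2) restricted to [lower, upper] via the inverse CDF.

    Assumes finite bounds (an assumption of this sketch).
    """
    p_lo = STD_NORMAL.cdf((lower - mu) / sigma)
    p_hi = STD_NORMAL.cdf((upper - mu) / sigma)
    if p_hi - p_lo < 1e-12:
        # Interval lies far in a tail of the current fit: fall back to the midpoint.
        return (lower + upper) / 2
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-12), 1 - 1e-12)  # inv_cdf needs 0 < u < 1
    y = mu + sigma * STD_NORMAL.inv_cdf(u)
    return min(max(y, lower), upper)


def fit_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, sigma)."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma = max(math.sqrt(rss / (n - 2)), 1e-8)
    return intercept, slope, sigma


def sem_interval_regression(x, lower, upper, burn_in=20, keep=80, seed=1):
    """Stochastic-EM-style estimation when only [lower, upper] of the response is seen."""
    rng = random.Random(seed)
    # Initialization (assumption): use interval midpoints as pseudo responses.
    y = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    intercept, slope, sigma = fit_ols(x, y)
    kept = []
    for iteration in range(burn_in + keep):
        # Stochastic step: draw pseudo responses inside each observed interval.
        y = [
            draw_truncated_normal(intercept + slope * xi, sigma, lo, hi, rng)
            for xi, lo, hi in zip(x, lower, upper)
        ]
        # Maximization step: refit the model on the pseudo responses.
        intercept, slope, sigma = fit_ols(x, y)
        if iteration >= burn_in:
            kept.append((intercept, slope, sigma))
    # Final estimates: averages over the iterations kept after burn-in.
    return tuple(fmean(column) for column in zip(*kept))


# Simulated example (hypothetical data): true intercept 2.0, true slope 0.5,
# response reported only in bands of width 2.
data_rng = random.Random(42)
x = [data_rng.uniform(0, 10) for _ in range(400)]
y_true = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]
lower = [2 * math.floor(yi / 2) for yi in y_true]
upper = [lo + 2 for lo in lower]
intercept_hat, slope_hat, sigma_hat = sem_interval_regression(x, lower, upper)
```

Averaging over the retained iterations smooths out the randomness of the individual draws. A linear mixed model would replace the least-squares step with a mixed-model fit, which this sketch does not attempt.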
In Part II, new methodology is proposed for the direct estimation (without covariates) and
the prediction of statistical indicators, for instance, poverty and inequality indicators. For the
direct estimation of statistical indicators, an iterative kernel density algorithm is proposed
in Chapter 2. The proposed algorithm generates metric pseudo samples from the interval-censored
target variable. From these pseudo samples, any statistical indicator of interest can
be estimated. The estimation of the standard errors is facilitated by a non-parametric bootstrap
that accounts for the additional uncertainty coming from the interval censoring. The method
is applied to estimate poverty and inequality indicators at the federal state level from interval-censored
household income data collected by the German Microcensus. For valid indicator
estimates, survey and household equivalence scale weights are incorporated into the algorithm
and used in the analysis.
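To illustrate how metric pseudo samples can be generated from banded data, the following Python sketch iterates between a kernel density estimate and resampling within each observed interval. It is a simplified illustration under stated assumptions rather than the algorithm of Chapter 2: the function names, the Gaussian kernel, the rule-of-thumb bandwidth, the grid size, the iteration counts, and the simulated incomes are hypothetical choices, and survey and equivalence scale weights are left out.

```python
import math
import random
from statistics import fmean, stdev


def kde_on_grid(sample, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `sample` at each grid point."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in sample)
        for g in grid
    ]


def kde_pseudo_samples(lower, upper, indicator, n_grid=80, burn_in=5, keep=15, seed=1):
    """Return (averaged indicator, last pseudo sample) from interval-censored data."""
    rng = random.Random(seed)
    lo_min, hi_max = min(lower), max(upper)
    step = (hi_max - lo_min) / n_grid
    grid = [lo_min + (j + 0.5) * step for j in range(n_grid)]
    # Grid points that fall inside each distinct interval.
    cells = {
        (lo, hi): [j for j, g in enumerate(grid) if lo <= g < hi]
        for lo, hi in set(zip(lower, upper))
    }
    # Initialization (assumption): uniform draws within each interval.
    sample = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    kept = []
    for iteration in range(burn_in + keep):
        # Normal-reference rule-of-thumb bandwidth (assumption of this sketch).
        bandwidth = 1.06 * stdev(sample) * len(sample) ** (-0.2)
        density = kde_on_grid(sample, grid, bandwidth)
        new_sample = []
        for lo, hi in zip(lower, upper):
            idx = cells[(lo, hi)]
            if not idx:
                # Interval narrower than the grid spacing: fall back to a uniform draw.
                new_sample.append(rng.uniform(lo, hi))
                continue
            # Pick a grid cell inside the interval with probability proportional
            # to the estimated density, then jitter within the cell.
            j = rng.choices(idx, weights=[density[k] for k in idx])[0]
            value = grid[j] + rng.uniform(-step / 2, step / 2)
            new_sample.append(min(max(value, lo), hi))
        sample = new_sample
        if iteration >= burn_in:
            kept.append(indicator(sample))
    return fmean(kept), sample


# Simulated example (hypothetical data): log-normal incomes reported in bands of 500.
data_rng = random.Random(7)
income = [data_rng.lognormvariate(7.3, 0.5) for _ in range(300)]
lower = [500 * math.floor(v / 500) for v in income]
upper = [lo + 500 for lo in lower]
mean_hat, pseudo = kde_pseudo_samples(lower, upper, indicator=fmean)
```

Any function of a metric sample can be passed as `indicator`, for example a quantile or a poverty rate, because each iteration yields a complete pseudo sample.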
When sample sizes are small, e.g., in small geographic areas, direct estimators of statistical
indicators might be unreliable. Furthermore, some areas of interest might not even be
sampled. In these situations, small area estimation (SAE) methods can provide reliable estimates
for the desired indicators. One particular SAE method that
has been used in this context is the empirical best predictor (EBP) method. This method relies on a linear mixed regression model with
income as a response variable that is measured on a continuous scale. To enable the use of
the EBP method with an interval-censored response variable, the SEM algorithm proposed in
Chapter 1 is applied to estimate the model parameters in Chapter 3. The EBP method crucially
depends on the normality assumption of the residuals. Therefore, the SEM algorithm is further
developed to facilitate the use of the data-driven Box-Cox transformation. The mean squared error of the EBPs is estimated with a parametric
bootstrap that accounts for the additional uncertainty from estimating
the transformation parameter of the Box-Cox transformation and the uncertainty resulting
from working with limited information due to interval censoring. The newly introduced SEM
algorithm in conjunction with transformations and the modified EBP approach is then used to
estimate disaggregated poverty and inequality indicators from interval-censored income data
in Chiapas, one of the poorest states in Mexico.
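For reference, the Box-Cox transformation in its standard textbook form is shown below. The exact variant used in the thesis (for example, with a shift to ensure positive values) is not stated in this abstract, so the formula should be read as the generic definition, with the transformation parameter written here as \lambda.

```latex
T_{\lambda}(y) =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[4pt]
\log(y), & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

A data-driven choice of \lambda means that the parameter is estimated from the data instead of being fixed in advance, which is why its estimation adds uncertainty that the bootstrap has to capture.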
In Part III, the implementation of the proposed theory in the programming language R
is presented. Implementing new methodology in software is valuable in order to
enable other researchers, data analysts, and practitioners to easily use the newly introduced
statistical theory. Therefore, the theory is implemented in the R package smicd available on
the Comprehensive R Archive Network. In Chapter 4, the package, its functionality, and its
usage are presented in detail.
en