In its Global Risks Report 2017, the World Economic Forum identifies rising income and wealth disparity as one of the top five global development trends that can cause unemployment, underemployment, and profound social instability. To counteract this development, it is necessary to quantify inequality and to analyze the distribution of income and wealth. In order to measure inequality and identify factors that significantly impact income or wealth, governments and statistical offices collect data by conducting surveys and censuses. However, collecting data on rather private topics such as income can lead to high item non-response rates. Therefore, it is tempting for survey designers to collect information on income using income bands as opposed to detailed income. This kind of data is commonly known as interval-censored data, grouped data, or banded data. For such data, only the lower and upper bounds of the income variable are observed, while its exact value remains unknown. Collecting only the interval information instead of continuous data offers a higher degree of data privacy protection to survey respondents, which lowers the response burden and thus leads to lower item non-response rates and higher data quality. This kind of data is already being collected by a number of surveys and censuses. Among them are the largest annual survey in Europe, the German Microcensus, and the censuses of Australia, Colombia, and New Zealand.
While data quality is increased, analyzing interval-censored data requires more advanced statistical methods, because only the interval information is observed and the underlying data distribution within each interval remains unobserved. For instance, well-established and widely used statistical methods, such as linear and linear mixed regression, require a continuous response variable. Furthermore, formulas to estimate statistical indicators, such as the mean, rely on metric data. While regression models are commonly applied to analyze income and wages, the estimation of statistical indicators from interval-censored data is of particular interest for the Federal Statistical Office and the Statistical Offices of the German States in order to measure and monitor the regional distribution of poverty and inequality. This work therefore proposes new statistical methodology for the estimation of linear and linear mixed regression models with an interval-censored response variable and for the estimation of statistical indicators from interval-censored data, e.g., German Microcensus data.
In Part I of the thesis, theory is developed to infer the properties of a population with linear and linear mixed models using sample data. In particular, in Chapter 1, theory is proposed to estimate the regression parameters and their standard errors in linear and linear mixed models with an interval-censored response variable. For the estimation of the parameters, a novel stochastic expectation-maximization (SEM) algorithm is proposed. In order to estimate the standard errors of the regression parameters, two different bootstraps are introduced: a nonparametric bootstrap for the linear regression model and a parametric bootstrap for the linear mixed regression model. Both bootstraps account for the additional uncertainty that is caused by the interval censoring of the dependent variable. The theory is applied to analyze interval-censored personal income data collected by the German Microcensus with a linear mixed regression model. By applying the newly proposed methodology, different components that significantly affect income are discovered.
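To illustrate the general idea of a stochastic EM scheme for an interval-censored response, the following minimal Python sketch alternates between drawing pseudo responses from a normal distribution truncated to each observed interval and refitting an ordinary least-squares model on those draws. It is an illustration under simplifying assumptions, not the algorithm of Chapter 1: it uses a single covariate, no random effects, interval midpoints as starting values, and fixed burn-in and iteration counts. All function names, the simulated data, and the band edges are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def fit_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, sigma)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, (rss / (n - 2)) ** 0.5


def draw_truncated_normal(mu, sigma, lo, hi, rng):
    """Inverse-CDF draw from N(mu, sigma^2) restricted to [lo, hi]."""
    dist = NormalDist(mu, sigma)
    u = rng.uniform(dist.cdf(lo), dist.cdf(hi))
    u = min(max(u, 1e-12), 1 - 1e-12)  # keep inv_cdf inside (0, 1)
    return min(max(dist.inv_cdf(u), lo), hi)


def sem_linear_model(x, lower, upper, n_burn=20, n_keep=80, seed=1):
    """Stochastic EM sketch: impute within intervals, refit, average."""
    rng = random.Random(seed)
    # Assumption: interval midpoints serve as starting pseudo responses.
    y = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    params = fit_ols(x, y)
    kept = []
    for it in range(n_burn + n_keep):
        b0, b1, sigma = params
        # Stochastic step: draw each response inside its observed interval.
        y = [draw_truncated_normal(b0 + b1 * xi, sigma, lo, hi, rng)
             for xi, lo, hi in zip(x, lower, upper)]
        # Maximization step: refit the model on the pseudo responses.
        params = fit_ols(x, y)
        if it >= n_burn:
            kept.append(params)
    # Final estimates: average over the iterations after burn-in.
    return tuple(fmean([p[k] for p in kept]) for k in range(3))


def to_band(value, edges):
    """Return the (lower, upper) band containing value; clamp to outer bands."""
    for lo, hi in zip(edges, edges[1:]):
        if value < hi:
            return lo, hi
    return edges[-2], edges[-1]


# Simulated example (hypothetical data): true intercept 2.0, true slope 0.5.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(500)]
latent = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
edges = [-5, 2, 4, 6, 8, 15]
bands = [to_band(v, edges) for v in latent]
lower = [b[0] for b in bands]
upper = [b[1] for b in bands]

b0, b1, sigma = sem_linear_model(x, lower, upper)
print(f"intercept={b0:.2f} slope={b1:.2f} sigma={sigma:.2f}")
```

Only the band limits enter the estimation; the latent values are used solely to construct the simulated bands. Standard errors would additionally require a bootstrap that repeats the whole procedure, which this sketch omits.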
In Part II, new methodology is proposed for the direct estimation (without covariates) and the prediction of statistical indicators, for instance, poverty and inequality indicators. For the direct estimation of statistical indicators, an iterative kernel density algorithm is proposed in Chapter 2. The proposed algorithm generates metric pseudo samples from the interval-censored target variable. From these pseudo samples, any statistical indicator of interest can be estimated. The estimation of the standard errors is facilitated by a nonparametric bootstrap that accounts for the additional uncertainty coming from the interval censoring. The method is applied to estimate poverty and inequality indicators at the federal state level from interval-censored household income data collected by the German Microcensus. For valid indicator estimates, survey and household equivalence scale weights are incorporated into the algorithm and used in the analysis.
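The idea of generating metric pseudo samples from banded data can be sketched as follows. In this minimal Python illustration, a Gaussian kernel density estimate is computed on a grid from the current pseudo sample, and each observation is then redrawn from the grid points inside its own band with probability proportional to the estimated density. The sketch rests on several assumptions that are not taken from the thesis: a rule-of-thumb bandwidth, midpoints as starting values, a fixed number of iterations, and no survey or equivalence scale weights. The function names, the simulated incomes, and the band edges are hypothetical.

```python
import math
import random
from statistics import fmean, median, stdev


def kde_on_grid(sample, grid):
    """Unnormalized Gaussian KDE on grid (rule-of-thumb bandwidth, an assumption)."""
    h = 1.06 * stdev(sample) * len(sample) ** -0.2
    return [sum(math.exp(-0.5 * ((g - s) / h) ** 2) for s in sample) for g in grid]


def pseudo_samples(lower, upper, n_burn=5, n_keep=15, grid_size=100, seed=1):
    """Iteratively redraw each observation inside its band from the current KDE."""
    rng = random.Random(seed)
    lo_all, hi_all = min(lower), max(upper)
    step = (hi_all - lo_all) / (grid_size - 1)
    grid = [lo_all + i * step for i in range(grid_size)]
    # Assumption: band midpoints serve as the starting pseudo sample.
    sample = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    bands = set(zip(lower, upper))
    kept = []
    for it in range(n_burn + n_keep):
        dens = kde_on_grid(sample, grid)
        # For each band, collect the grid points inside it and their densities.
        cache = {}
        for lo, hi in bands:
            idx = [i for i, g in enumerate(grid) if lo <= g <= hi]
            cache[(lo, hi)] = ([grid[i] for i in idx], [dens[i] for i in idx])
        # Redraw every observation within its own band, weighted by the density.
        sample = [rng.choices(*cache[(lo, hi)])[0] for lo, hi in zip(lower, upper)]
        if it >= n_burn:
            kept.append(sample)
    return kept


def at_risk_of_poverty_rate(sample):
    """Share of values below 60% of the sample median."""
    line = 0.6 * median(sample)
    return fmean([1.0 if v < line else 0.0 for v in sample])


def to_band(value, edges):
    """Return the (lower, upper) band containing value; clamp to outer bands."""
    for lo, hi in zip(edges, edges[1:]):
        if value < hi:
            return lo, hi
    return edges[-2], edges[-1]


# Simulated example (hypothetical incomes and band edges).
rng = random.Random(7)
latent = [rng.lognormvariate(7.3, 0.6) for _ in range(200)]
edges = [0, 500, 1000, 1500, 2000, 3000, 5000, 10000]
bands = [to_band(v, edges) for v in latent]
lower = [b[0] for b in bands]
upper = [b[1] for b in bands]

kept = pseudo_samples(lower, upper)
# Any indicator can be computed per pseudo sample and then averaged.
rate = fmean([at_risk_of_poverty_rate(s) for s in kept])
print(f"estimated at-risk-of-poverty rate: {rate:.3f}")
```

Because the indicator is computed on ordinary metric samples, the same pseudo samples could be reused for other indicators such as the mean or a quantile ratio.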
When sample sizes are small, e.g., in small geographic areas, direct estimators of statistical indicators might be unreliable. Furthermore, some areas of interest might not even be sampled. In these situations, small area estimation (SAE) methods can provide reliable estimates for the desired indicators. One particular SAE method that has been used in this context is the empirical best predictor (EBP) method. This method is based on a linear mixed regression model in which income, measured on a continuous scale, is the response variable. To enable the use of the EBP method with an interval-censored response variable, the SEM algorithm proposed in Chapter 1 is applied to estimate the model parameters in Chapter 3. The EBP method crucially depends on the normality assumption of the residuals. Therefore, the SEM algorithm is further developed to facilitate the use of the data-driven Box-Cox transformation. The estimation of the mean squared error of the EBPs is facilitated by a parametric bootstrap that accounts for two additional sources of uncertainty: the estimation of the transformation parameter of the Box-Cox transformation and the limited information available due to interval censoring. The newly introduced SEM algorithm in conjunction with transformations and the modified EBP approach is then used to estimate disaggregated poverty and inequality indicators from interval-censored income data in Chiapas, one of the poorest states in Mexico.
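As a reminder of what a data-driven Box-Cox transformation involves, the following minimal Python sketch selects the transformation parameter by maximizing a profile log-likelihood over a grid of candidate values. It deliberately simplifies the setting: it treats a fully observed sample without covariates, random effects, or interval censoring, so it shows only the transformation step and not the modified SEM or EBP procedure of Chapter 3. The function names, the candidate grid, and the simulated incomes are hypothetical.

```python
import math
import random
from statistics import fmean


def box_cox(y, lam):
    """Box-Cox transform of a positive value y."""
    if abs(lam) < 1e-8:
        return math.log(y)
    return (y ** lam - 1) / lam


def inverse_box_cox(z, lam):
    """Back-transformation to the original scale."""
    if abs(lam) < 1e-8:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)


def profile_loglik(sample, lam):
    """Normal profile log-likelihood of lam (up to a constant), i.i.d. case."""
    z = [box_cox(y, lam) for y in sample]
    m = fmean(z)
    var = fmean([(zi - m) ** 2 for zi in z])
    n = len(sample)
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(y) for y in sample)


def choose_lambda(sample, grid):
    """Pick the candidate lam with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(sample, lam))


# Simulated example (hypothetical log-normal incomes, so lam near 0 is expected).
rng = random.Random(3)
incomes = [rng.lognormvariate(7.3, 0.6) for _ in range(500)]
grid = [k / 10 for k in range(-10, 21)]  # candidates from -1.0 to 2.0
lam_hat = choose_lambda(incomes, grid)
print(f"selected lambda: {lam_hat}")
```

Predictions made on the transformed scale have to be back-transformed before indicators are computed, which is why the inverse is included.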
In Part III, the implementation of the proposed theory in the programming language R is presented. Implementing new methodology is valuable because it enables other researchers, data analysts, and practitioners to easily use the newly introduced statistical theory. Therefore, the theory is implemented in the R package smicd, which is available on the Comprehensive R Archive Network. In Chapter 4, the package, its functionality, and its usage are presented in detail.