International institutions and national statistical institutes are increasingly expected to report disaggregated indicators, e.g., means, ratios or Gini coefficients, for different regional levels, socio-demographic groups or other subpopulations. These subpopulations are called areas or domains in this thesis. The data sources used to estimate these disaggregated indicators are mostly national surveys, which may have small sample sizes for the domains of interest. Therefore, direct estimates that are based only on the survey data might be unreliable. To overcome this problem, small area estimation (SAE) methods increase the precision of survey-based estimates without demanding larger and more costly surveys. In SAE, the collected survey data is combined with other data sources, e.g., administrative and register data or data that is a by-product of digital activities.
The data requirements of SAE methods depend to a large extent on whether the indicator of interest is a linear or non-linear function of a quantitative variable. For the estimation of linear indicators, e.g., the mean, aggregated data is sufficient, that is, direct estimates and auxiliary information from other data sources only need to be available for each domain. One popular area-level approach in this context is the Fay-Herriot model, which is studied in Part 1 of this work. In Chapter 1, the Fay-Herriot model is used to estimate the regional distribution of mean household net wealth in Germany. The analysis is based on the Household Finance and Consumption Survey (HFCS), which was launched by the European Central Bank and several statistical institutes in 2010. The main challenge of applying the Fay-Herriot approach in this context is to handle three issues arising from the data: a) the skewness of the wealth distribution, b) informative weights due to, among other things, unit non-response, and c) multiple imputation to deal with item non-response. For the last of these, a modified Fay-Herriot model that accounts for the additional uncertainty due to multiple imputation is proposed in this thesis. It is combined with known solutions for the other two issues and applied to estimate mean net wealth at low regional levels. The Deutsche Bundesbank, which is responsible for reporting the wealth distribution in Germany, and many economic institutes predominantly work with the statistical software Stata. To make the Fay-Herriot model and the extensions used in Chapter 1 available to practitioners, a new Stata command called fayherriot is programmed in the context of this thesis. Chapter 2 describes the functionality of the command with an application to income data from the Socio-Economic Panel, one of the largest panel surveys in Germany.
The example application demonstrates how the Fay-Herriot approach increases the reliability of estimates of mean household income compared to direct estimates at three different regional levels.
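To illustrate how an area-level model combines direct estimates with auxiliary information, the following is a minimal sketch of the standard Fay-Herriot shrinkage estimator in Python. It is not the fayherriot Stata command and not the modified model proposed in this thesis. It assumes a single auxiliary variable plus an intercept, known sampling variances, and a random-effect variance `sigma_u2` that is simply passed in rather than estimated. The function name and all argument names are hypothetical.

```python
def fay_herriot_eblup(direct, sampling_var, x, sigma_u2):
    """Shrinkage estimates for an area-level model with one covariate.

    direct       : direct estimates, one per domain
    sampling_var : sampling variances of the direct estimates (assumed known)
    x            : one auxiliary variable per domain (an intercept is added)
    sigma_u2     : variance of the random domain effect (assumed given here;
                   in practice it has to be estimated)
    """
    # Weighted least squares with weights 1 / (sigma_u2 + sampling variance)
    w = [1.0 / (sigma_u2 + v) for v in sampling_var]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, direct)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, x, direct))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    estimates = []
    for yi, vi, xi in zip(direct, sampling_var, x):
        # Shrinkage factor: close to 1 when the direct estimate is precise
        gamma = sigma_u2 / (sigma_u2 + vi)
        synthetic = intercept + slope * xi  # regression-synthetic part
        estimates.append(gamma * yi + (1.0 - gamma) * synthetic)
    return estimates


# Made-up numbers for three domains, purely for illustration
print(fay_herriot_eblup(direct=[1.0, 3.0, 2.0],
                        sampling_var=[1.0, 1.0, 1.0],
                        x=[0.0, 1.0, 2.0],
                        sigma_u2=1.0))
# -> [1.25, 2.5, 2.25]
```

In this toy example every domain has the same shrinkage factor of 0.5, so each estimate lies halfway between the direct estimate and the regression prediction. Domains with larger sampling variances would be pulled more strongly towards the regression prediction.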
In an extension to estimating linear indicators, Part 2 deals with the estimation of non-linear income and wealth indicators. Since the mean is sensitive to outliers, the median and other quantiles are also of interest when estimating the income or wealth distribution. As a first approach, this thesis focuses on the direct estimation of quantiles, which is not as straightforward as for the mean. In Chapter 3, common quantile definitions implemented in standard statistical software are empirically evaluated with regard to their bias, based on income and wealth distributions. The analysis shows that, especially for wealth data, which is mostly heavily skewed, sample sizes need to be large in order to obtain unbiased direct estimates with the common quantile definitions. Since a design-unbiased direct estimator is one assumption of the aforementioned Fay-Herriot model, further research would be necessary in order to use the Fay-Herriot approach for the estimation of quantiles when the underlying data is heavily skewed. Unit-level SAE methods are more commonly used to produce reliable estimates of non-linear indicators in small domains, including quantiles, poverty indicators, and inequality indicators such as the Gini coefficient. However, the data requirements of these methods are more restrictive: both the survey data and the auxiliary data need to be available for each unit in each domain. Well-known methods for the estimation of non-linear indicators in small domains include the empirical best prediction (EBP), the World Bank method, and the M-Quantile approach. However, these methods are either not available in statistical software or their user-friendliness is limited. Therefore, the R package emdi, which focuses on a user-friendly application of the EBP, is developed in this work. Chapter 4 describes how the package emdi supports the user beyond the estimation with tools for assessing and presenting the results.
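The remark that direct quantile estimation is less straightforward than for the mean can be made concrete with a small Python sketch. It implements two widely used sample quantile definitions, often labelled types 1 and 7 in the Hyndman-Fan taxonomy. Whether these are among the definitions evaluated in Chapter 3 is an assumption, survey weights are ignored, and the wealth values are invented for illustration.

```python
import math


def quantile_inverse_cdf(values, p):
    """Smallest observation whose empirical CDF is at least p (type 1)."""
    xs = sorted(values)
    n = len(xs)
    k = max(1, math.ceil(n * p))  # 1-based rank of the order statistic
    return xs[k - 1]


def quantile_linear(values, p):
    """Linear interpolation between neighbouring order statistics (type 7)."""
    xs = sorted(values)
    n = len(xs)
    h = (n - 1) * p            # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, n - 1)    # guard the upper end for p = 1
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


# Invented, right-skewed "wealth" sample with only six observations
wealth = [0, 5, 20, 60, 150, 900]
print(quantile_inverse_cdf(wealth, 0.5), quantile_linear(wealth, 0.5))
# -> 20 40.0
print(quantile_inverse_cdf(wealth, 0.9), quantile_linear(wealth, 0.9))
# -> 900 525.0
```

With a small, skewed sample the two definitions disagree substantially, which is why the choice of definition matters for direct estimates in small domains.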
Both area- and unit-level SAE models are based on linear mixed regression models that rely on a set of assumptions, particularly linearity and the normality of the error terms. If these assumptions are not fulfilled, transforming the response variable is one possible solution. Therefore, Part 3 provides a guideline for the usage of transformations. Chapter 5 gives an extensive overview of different transformations applicable in linear and linear mixed regression models and discusses practical challenges. Implementations of various transformations and of estimation methods for transformation parameters are provided by the R package trafo, which is described in Chapter 6.
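As an illustration of a data-driven transformation, the following Python sketch applies the Box-Cox transformation and chooses its parameter by a grid search over the profile log-likelihood. The Box-Cox family is used here only as a familiar example. The sketch assumes a model with an intercept only, not a linear mixed model, and it does not reproduce the interface of the trafo package. All function names are hypothetical.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value; lam = 0 is the log."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_loglik(values, lam):
    """Profile log-likelihood of lam (up to a constant), intercept-only model."""
    n = len(values)
    z = [box_cox(v, lam) for v in values]
    mean = sum(z) / n
    var = sum((zi - mean) ** 2 for zi in z) / n  # ML variance estimate
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var) + jacobian


def estimate_lambda(values, grid):
    """Return the grid value that maximizes the profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(values, lam))


# Invented sample whose logarithms are symmetric around zero
sample = [math.exp(k) for k in (-2, -1, 0, 1, 2)]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(estimate_lambda(sample, grid))
# -> 0.0
```

For this sample the grid search selects the parameter value 0, i.e. the logarithm, which is the expected result because the logarithms of the values are symmetric.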
Altogether, this work contributes to the literature by
a) combining SAE and multiple imputation by proposing a modified Fay-Herriot approach, b) showing limitations of existing quantile definitions with regard to bias when data is skewed and the sample size is small, c) closing the gap between academic research and practical applications by providing user-friendly software for the estimation of linear and non-linear indicators, and d) giving a framework for the usage of transformations in linear and linear mixed regression models.