For evidence-based policy-making, reliable information on socio-economic indicators is essential. Sample surveys have a long tradition of providing cost-efficient information on these indicators. In most cases, the quantity of interest is needed not only for the total population but especially for sub-populations (geographic areas or sociodemographic groups), called areas or domains. To gain insights into these sub-populations, disaggregated direct estimators can be used, which are calculated solely from area-specific survey data. An area is regarded as ‘large’ if the sample size is large enough to enable reliable direct estimates. If the precision of the direct estimates is not sufficient or the sample size is even zero, the area is considered ‘small’. This is particularly common at high spatial or socio-demographic resolutions. Small area estimation (SAE) is a promising way to overcome this problem without the need for larger and thus more costly surveys. The essence of SAE techniques is that they ‘borrow strength’ from other areas to improve their predictions. For this purpose, a model is built on survey data that links additional auxiliary data and exploits area-specific structures. Suitable auxiliary data sources are administrative and register data, such as the census. In many countries, such data are strictly protected by confidentiality agreements, and access to population micro-data is a challenge even for gatekeeper organisations. Thus, users have an increased interest in SAE estimators that do not require population micro-data as auxiliary data. In this thesis, new methods for the absence of population micro-data are presented, and applications to highly relevant socio-economic indicators are demonstrated.
Since different SAE models impose different data requirements, Part I bundles research combining unit-level survey data and limited auxiliary data, e.g., aggregated data such as means, which is a common data situation for users. To account for the unit-level survey information, the well-known nested error regression (NER) model is used. This model is a special case of a linear mixed model and rests on several assumptions. But how can users proceed if the model assumptions are not fulfilled? In Part I, this thesis provides two new approaches to deal with this issue. One promising approach is to transform the response. Since several socio-economically relevant variables, such as income, have a skewed distribution, the log-transformation of the response is an established way to meet the assumptions. However, the data-driven log-shift transformation is even more promising because it extends the log by an additional parameter and thereby gains flexibility. Chapter 1 introduces both transformations in the absence of population micro-data. A particular challenge is the transformation of the small area means back to the original scale. Hence, the proposed approach uses aggregate statistics (means and covariances) and kernel density estimation to compensate for the lack of population micro-data. Uncertainty estimation is developed, and all methods are evaluated in design- and model-based settings. The proposed method is applied to estimate regional income in Germany using the Socio-Economic Panel and census data. It achieves a clear improvement in reliability, which demonstrates the importance of the method. To conveniently enable further applications, this new methodology is implemented in the R package saeTrafo. Chapter 2 describes the various functionalities of the package using publicly available income data.
To increase user-friendliness, established unit-level models under transformations and their uncertainty estimations are implemented, and the most suitable method is selected automatically. For some applications, however, it is challenging to find a suitable transformation or, more generally, to specify a model, particularly in the presence of complex interactions. For this case, machine learning methods are valuable, as a transformation is not necessarily required, nor does a model need to be explicitly specified. The semi-parametric framework of mixed effects random forests (MERFs) combines the advantages of random forests (robustness against outliers and implicit model selection) with the ability to model hierarchical dependencies as present in SAE approaches. Chapter 3 introduces MERFs in the absence of population micro-data. As existing random forest algorithms require unit-level auxiliary population data, an alternative strategy is introduced. It adaptively incorporates aggregated auxiliary information through calibration weights to circumvent the need for unit-level auxiliary data. Applying the proposed method to opportunity costs of care work for Germany using the Socio-Economic Panel and census data demonstrates the gain in accuracy in comparison to both direct estimates and the classical NER model.
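For orientation, the NER model and the log-shift transformation referred to in Part I can be written in common textbook notation. The symbols below ($y_{ij}$, $x_{ij}$, $\beta$, $u_i$, $e_{ij}$, $\lambda$, $D$, $n_i$) are assumptions chosen for illustration and are not taken from the thesis itself.

```latex
% NER model for unit j in area i (i = 1, ..., D; j = 1, ..., n_i)
\begin{align}
  y_{ij} &= x_{ij}^{\top}\beta + u_i + e_{ij}, \qquad
  u_i \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2), \\
% Log-shift transformation with shift parameter \lambda (requires y + \lambda > 0)
  T_{\lambda}(y) &= \log(y + \lambda), \qquad
  T_{\lambda}^{-1}(z) = \exp(z) - \lambda .
\end{align}
```

Because $T_{\lambda}^{-1}$ is nonlinear, the mean of the back-transformed values differs from the back-transformed mean. This is why predicting area means on the original scale needs more than area means of the covariates.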
In contrast to methods using a unit-level sample survey, Part II focuses on the well-known class of area-level SAE models, which require direct estimates from a survey while using (once again) only aggregated population auxiliary data. This thesis presents two particularly relevant applications of this model class. Chapter 4 examines regional consumer price indices (CPIs) in the United Kingdom (UK), responding to the great interest in monitoring inflation at the regional level. The SAE challenge is to construct model-based expenditure weights that define the regional basket of goods and services for the twelve regions of the UK. These weights are estimated from the Living Costs and Food Survey. Furthermore, available price data are linked to the SAE-estimated baskets to produce regional CPIs. The resulting CPI series are closely examined, and smoothing techniques are applied. As a result, the reliability improves, but the CPI series are still too volatile for policy use. However, our research serves as a valuable framework for the creation of a regional CPI in the future. The second application also explores the reliability of the disaggregated estimation of a politically and economically highly relevant indicator, in this case the unemployment rate. The target regions are the functional urban areas in the German federal state of North Rhine-Westphalia. In Chapter 5, two types of unemployment rates (the traditional one and an alternative definition taking commuting into account) are estimated and compared. SAE methods link direct estimates from the labour force survey to passively collected mobile network data. This alternative data source is available in real time, offers spatially flexible resolutions, and is dynamic. In compliance with data protection rules, we obtain aggregated auxiliary mobile network information from the data provider.
The SAE methods improve the reliability, and the resulting predictions show that alternative unemployment rates in German city cores are lower than traditionally estimated official unemployment rates indicate.
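The area-level idea used in Part II, combining a direct estimate with a regression prediction from aggregated auxiliary data, can be sketched as follows. This is a minimal illustration of a Fay-Herriot-type shrinkage predictor, not the implementation used in the thesis. The function name `fay_herriot_eblup`, the toy numbers, and the assumption that the model variance `sigma2_u` is already known (in practice it is estimated) are all hypothetical.

```python
import numpy as np

def fay_herriot_eblup(direct, sampling_var, X, sigma2_u):
    """Shrinkage predictor for an area-level model with given model variance.

    direct       : direct survey estimates, one per area
    sampling_var : sampling variances of the direct estimates
    X            : area-level (aggregated) auxiliary data, shape (areas, p)
    sigma2_u     : variance of the area random effect (assumed known here)
    """
    direct = np.asarray(direct, dtype=float)
    sampling_var = np.asarray(sampling_var, dtype=float)
    X = np.asarray(X, dtype=float)

    # Weighted least squares fit of the direct estimates on the covariates
    w = 1.0 / (sigma2_u + sampling_var)
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ direct)
    synthetic = X @ beta

    # Shrinkage factor: close to 1 for precise direct estimates, close to 0 otherwise
    gamma = sigma2_u / (sigma2_u + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic, gamma

# Toy example with four areas (all numbers are made up for illustration)
X_toy = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
direct_toy = np.array([2.0, 4.5, 5.5, 8.5])
var_toy = np.array([0.1, 1.0, 4.0, 0.5])
pred_toy, gamma_toy = fay_herriot_eblup(direct_toy, var_toy, X_toy, sigma2_u=1.0)
```

Areas with a large sampling variance receive a small shrinkage factor and are pulled towards the regression prediction, which is the sense in which such models borrow strength from other areas.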