Over the last few years, data has often been described as the oil of the 21st century, e.g., by Bhageshpur (2019). Just as access to oil dominated power and development in the last century, this claim implies that personal data is not only similarly valuable, but also as influential in politics and society as oil once was. However, oil sources differ mainly in their accessibility, quality, quantity, and cost of exploitation; once extracted and refined, these sources lead to roughly similar products. Data sources also differ in these four categories but, additionally, each typically leads to very specific insights. A single data source is often sufficient neither to answer important and complex scientific (or economic) questions nor to make predictions with fine granularity and high precision. In such cases, combining data from different sources that contribute additional aspects of the problem at hand is one promising approach to achieving these aims.
In this dissertation, data sources are combined for two purposes. Part I of this work focuses on combining data to gain additional understanding. In the paper presented in Part I, the authors analyze the reasons why students drop out of undergraduate courses in economics and business administration. From a university perspective, administrative data is readily available, e.g., which modules are completed in which semester and how many credit points each student earns in each semester. Socioeconomic data at the individual level, however, is usually unavailable to university administrations. To overcome this hurdle, the authors proposed and carried out a novel prospective study design. A survey was conducted among students starting their second semester, and the resulting data was combined with longitudinal administrative data. The authors were thus able to analyze individual studying behavior conditional on a large pool of socio-demographic variables. Among other results, they show that college admission grades have a negligibly small impact on achievement in bachelor's degree courses. This finding stands in marked contrast to college admission policies in Germany, which focus strongly on high-school grades.
The second purpose of data combination discussed in this work is to improve the precision of predictions. Part II consists of three papers from the field of small area estimation (SAE). In SAE problems, survey data containing the variable of interest (VOI) is typically available. However, an indicator of interest, defined as a function of the VOI, needs to be estimated at some subgroup level. Such subgroups are usually geographic regions but are not limited to them. In the context of SAE, these subgroups are called areas. As the number of areas increases, the number of observations available per area decreases, often leading to areas with very few or even no observations (out-of-sample areas). In such situations, a prediction with reasonable reliability becomes impossible when relying only on the available survey data. One way to overcome this limitation is to couple the survey data with additional data, such as administrative or census data. Frequently, these data sources do not contain the VOI, which renders a direct estimation of the indicator impossible. However, if similar covariates are available in the survey and the census, a feasible approach is to assume a model for the VOI that holds in both the survey and the census, estimate the model on the survey data, and then combine the model estimates with the more numerous census data to enable prediction. This general method is often referred to as “borrowing strength.” Such model-based approaches are roughly divided into two classes, depending on data availability and the resulting requirements on the models. If the data is available for each individual unit of interest, e.g., a citizen or household, unit-level models are used. For area-level models, on the other hand, data is only available in aggregated form at the area level.
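As a rough illustration of the borrowing-strength idea described above, the following Python sketch estimates a model on survey data and applies it to census covariates to obtain area-level means, including for an out-of-sample area. It is a deliberately simplified synthetic-type estimator based on ordinary least squares without area-level random effects, not one of the models used in this work. The function names and the toy data are hypothetical and serve only to show the flow of information from survey to census.

```python
from collections import defaultdict
from statistics import mean


def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def area_means_from_census(survey, census):
    """Estimate the model on the survey, predict the VOI for every census unit,
    and average the predictions within each area.

    survey: list of (area, covariate, voi) tuples
    census: list of (area, covariate) tuples; the VOI is not observed here
    """
    intercept, slope = fit_ols(
        [x for _, x, _ in survey],
        [y for _, _, y in survey],
    )
    predictions = defaultdict(list)
    for area, x in census:
        predictions[area].append(intercept + slope * x)
    return {area: mean(values) for area, values in predictions.items()}


# Hypothetical toy data: the VOI follows y = 2 + 3x exactly.
# Area "C" has no survey observations (an out-of-sample area).
survey = [("A", 1.0, 5.0), ("A", 2.0, 8.0), ("B", 3.0, 11.0)]
census = [
    ("A", 1.0), ("A", 2.0), ("A", 3.0),
    ("B", 3.0), ("B", 5.0),
    ("C", 4.0), ("C", 6.0),
]

estimates = area_means_from_census(survey, census)
for area in sorted(estimates):
    print(area, round(estimates[area], 3))
```

Because the model is estimated on the pooled survey data, area "C" receives an estimate even though it contributes no survey observations. A unit-level mixed model would additionally include an area-specific random effect, which this sketch omits.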
Many common methods in SAE assume normally distributed residuals, not only for the model estimation, but more crucially when using the census data for prediction and for the estimation of precision. Deviations from normality therefore have severe consequences, leading to less precise point estimates and misleadingly shrunken error estimates. The paper presented in Chapter 2 proposes the use of data-driven transformations, which result in smaller deviations from normality and thus in improved point and precision estimates. Chapter 3 presents the R package emdi, which not only implements this methodology but also allows the use of various area-level Fay-Herriot models. emdi focuses on user-friendliness and provides many useful functionalities that support the user through every step, from model estimation and the assessment of model assumptions to the visualization and export of results. However, when deviations from normality are too severe, different model types are more suitable. Chapter 4 presents the R package ammlogit, which allows the user to work with a multinomial VOI. The corresponding paper introduces a new methodology for prediction and a revised bootstrap mean squared error estimation. Like emdi, ammlogit is designed to be user-friendly and to narrow the gap between research and practice. Still, the method used in ammlogit also relies on distributional assumptions, as the counts are assumed to be conditionally multinomially distributed and the area-level error terms are assumed to be identically normally distributed.
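To make the notion of a data-driven transformation mentioned for Chapter 2 more concrete, the following Python sketch selects the parameter of a Box-Cox transformation from the data by a grid search that minimizes the absolute sample skewness. Both the choice of the Box-Cox family and the skewness criterion are assumptions made purely for illustration. The sketch does not reproduce the estimation procedure of the paper or the interface of emdi, and all function names are hypothetical.

```python
import math
from statistics import mean


def box_cox(values, lam):
    """Box-Cox transformation of strictly positive values; lam = 0 gives the log."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def skewness(values):
    """Moment-based sample skewness (third central moment over variance^1.5)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


def choose_lambda(values, grid):
    """Pick the grid value whose transformed data is closest to symmetric."""
    return min(grid, key=lambda lam: abs(skewness(box_cox(values, lam))))


# Hypothetical right-skewed data: exponentials of a symmetric sample,
# so the log transformation (lam = 0) restores symmetry.
y = [math.exp(z) for z in (-2.0, -1.0, 0.0, 1.0, 2.0)]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]

best_lambda = choose_lambda(y, grid)
print("selected lambda:", best_lambda)
```

In an SAE application the transformation parameter would be chosen with respect to the residuals of the fitted model rather than the raw VOI. The sketch only shows the principle that the parameter is determined by the data instead of being fixed in advance.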
One model class without distributional assumptions is that of quantile-type regression models. M-quantile models have been used in SAE since Chambers and Tzavidis (2006). However, the related mixed quantile regression models have not yet been consistently applied to small area problems, even though they are a naturally robust alternative to the widely used linear mixed regression models that dominate research and application. In part, this may be due to remaining uncertainties about their asymptotic properties. Therefore, in the paper presented in Part III, the asymptotic normality of the corresponding maximum likelihood estimates is proven and a plug-in variance estimator is derived.