One of the goals of data analysts is to establish relationships between variables using regression models. Standard statistical techniques for linear and linear mixed regression models are commonly associated with interpretation, estimation, and inference. These techniques rely on basic assumptions underlying the working model, listed below:
- Normality: Transforming data to create symmetry in order to correctly use interpretation and inferential techniques.
- Homoscedasticity: Creating equality of spread as a means to gain efficiency in estimation processes and to properly use inference processes.
- Linearity: Linearizing relationships in an effort to avoid misleading conclusions for estimation and inference techniques.
Different options are available to the data analyst when the model assumptions are not met in practice. Researchers could formulate the regression model under alternative and more flexible parametric assumptions. They could also use a regression model that minimizes the use of parametric assumptions, or rely on robust estimation. Another option would be to parsimoniously redesign the model by finding an appropriate transformation such that the model assumptions hold. A standard practice in applied work is to transform the target variable by computing its logarithm. However, this type of transformation is fixed in advance and does not adjust to the underlying data. Therefore, some research effort has shifted towards alternative data-driven transformations, such as the Box-Cox transformation, which includes a transformation parameter that is estimated from the data.
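To make the idea of a data-driven transformation concrete, the following is a minimal sketch of the Box-Cox transformation with its parameter chosen by maximizing the profile log-likelihood over a grid. It is an illustration under stated assumptions rather than the method implemented in any particular package: the working model is an intercept-only normal model, the grid from -2 to 2 is an arbitrary choice, and the function names `box_cox`, `profile_loglik` and `estimate_lambda` are hypothetical.

```python
import math
import random


def box_cox(y, lam):
    """Box-Cox transform of strictly positive values y for parameter lam."""
    if abs(lam) < 1e-12:
        # The limiting case lam -> 0 is the logarithm.
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y, lam):
    """Profile log-likelihood of lam (up to a constant).

    Assumption: the transformed data follow a normal model with a
    constant mean (intercept-only model), used here for illustration.
    """
    n = len(y)
    z = box_cox(y, lam)
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n  # ML variance estimate
    # Normal likelihood term plus the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(v) for v in y)


def estimate_lambda(y, grid=None):
    """Grid-search estimate of the transformation parameter."""
    if grid is None:
        grid = [i / 100.0 for i in range(-200, 201)]  # hypothetical grid
    return max(grid, key=lambda lam: profile_loglik(y, lam))


# Simulated right-skewed data: log-normal, so the logarithm (lam = 0)
# is the transformation that restores normality.
random.seed(1)
y = [random.lognormvariate(0.0, 0.5) for _ in range(1000)]
lam_hat = estimate_lambda(y)
print("estimated lambda:", lam_hat)
```

For log-normal data the estimate should lie close to zero, which corresponds to the logarithm; for other data the estimated parameter moves away from zero, which is what a fixed logarithmic transformation cannot do. In a regression setting, the variance of the transformed values would be replaced by the residual variance of the fitted model.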
The literature on transformations in theoretical statistics and in practical case studies across different research fields is rich, and most of the relevant results were published during the early 1980s. More sophisticated and complex techniques and tools are available nowadays to the applied statistician as alternatives to using transformations. However, simplicity is still highly valued in statistical practice, and it is often achieved by applying a suitable transformation within the working model. In general, researchers have been using data transformations as a go-to tool to assist scientific work under the classical and linear mixed regression models instead of developing new theories, applying complex methods or extending software functions. However, transformations are often applied automatically and routinely, without considering different aspects of their utility.
In Part 1 of this work, modeling guidelines on transformations are presented for practitioners. Chapter 1 presents an extensive guideline and an overview of different transformations and of estimation methods for the transformation parameter in the context of linear and linear mixed regression models. Furthermore, in order to provide an extensive collection of transformations usable in linear regression models and a wide range of estimation methods for the transformation parameter, the package trafo is presented in Chapter 2. This package complements and enlarges the methods that exist in R so far, and offers a simple, user-friendly framework for selecting a suitable transformation depending on the research purpose.
In the literature, little attention has been paid to techniques for the linear mixed regression model when working with transformations. This becomes a challenge for users of small area estimation (SAE) methods, since the most commonly used SAE methods are based on the linear mixed regression model, which often relies on Gaussian assumptions. In particular, the empirical best predictor is widely used in practice to produce reliable estimates of general indicators for areas with small sample sizes. The issue of data transformations is addressed in the current SAE literature in a fairly ad-hoc manner. Contrary to standard practice in applied work, recent empirical work indicates that using transformations in SAE is not as simple as transforming the target variable by computing its logarithm. In Part 2 of the present work, transformations in the context of SAE are applied and further developed. Chapter 3 proposes a protocol for the production of small area official statistics that is based on three stages, namely (i) Specification, (ii) Analysis/Adaptation and (iii) Evaluation. In this chapter, the adaptation of the working model by means of transformations is shown as part of stage (ii). In Chapter 4, we extend the use of data-driven transformations under linear mixed model-based SAE methods, in particular the estimation of the transformation parameter under maximum likelihood theory. First, we analyze how the performance of SAE methods is affected by departures from normality and how such transformations can help improve the validity of the model assumptions and the precision of small area prediction. In particular, attention is paid to the estimation of poverty and inequality indicators, due to their socio-economic relevance and political impact. Second, we adapt the mean squared error estimator to account for the additional uncertainty due to the estimation of transformation parameters.
Finally, as in Chapter 3, the methods are illustrated by using real survey and census data from Mexico. In order to improve some features of existing software packages suitable for the estimation of indicators for small areas, the package emdi is developed in Chapter 5. This package offers a methodological and computational framework for the estimation of regionally disaggregated indicators using SAE methods as well as providing tools for assessing, processing, and presenting the results.
Finally, Part 3 discusses the applicability of transformations in the context of generalized linear models (GLMs). Chapter 6 compares, in terms of precision measures, the use of count data transformations within the classical regression model with the application of GLMs, in particular for the Poisson case. To this end, some methodological differences are presented and a simulation study is carried out. The main lesson from this analysis is the importance of knowing the research purpose and the data scenario when choosing which methodology is preferable in a given situation.