Title: | Generalized Efficient Regression-Based Imputation with Latent Processes |
---|---|
Description: | Implements a new multiple imputation method that draws imputations from a latent joint multivariate normal model which underpins generally structured data. This model is constructed using a sequence of flexible conditional linear models that enables the resulting procedure to be efficiently implemented on high dimensional datasets in practice. See Robbins (2021) <arXiv:2008.02243>. |
Authors: | Michael Robbins [aut, cre], Max Griswold [ctb], Pedro Nascimento de Lima [ctb] |
Maintainer: | Michael Robbins <[email protected]> |
License: | GPL-2 |
Version: | 0.1.9 |
Built: | 2024-10-31 04:21:03 UTC |
Source: | https://github.com/cran/gerbil |
Correlation Analysis for gerbil Objects
This function assesses the bivariate properties of imputed data using a correlation analysis. Specifically, it calculates pairwise correlations for observed cases and for imputed cases. The function also calculates the Fisher z-transformation for each correlation and performs a hypothesis test on the transformed correlations in order to compare correlations calculated using imputed cases to those calculated using observed cases.
cor_gerbil(x, y = NULL, imp = 1, log = NULL, partial = "imputed")
x |
A |
y |
A vector listing the column names of the imputed data that will be included in the correlation analysis. By default, |
imp |
A scalar indicating which of the multiply imputed datasets should be used for the analysis. Defaults to |
log |
A character vector that includes the names of variables to which a log transformation is applied prior to calculating correlations. |
partial |
Indicates how partially imputed pairs are handled when calculating correlations. If |
Cases are assigned a status of being observed or imputed in a pairwise fashion. That is, a specific
data unit may be considered observed when calculating a correlation for one pair of variables and be
imputed when calculating a correlation for another pair. For a given pair of variables, cases that
have both variables observed are always treated as observed, and cases that have both variables missing
are always treated as imputed. Cases that have only one variable in the pair observed (i.e., those that are
partially imputed) are treated as imputed when the input partial = 'imputed' (the default) and are
otherwise treated as observed.
Correlations are calculated across an expanded dataset that creates binary indicators for categorical variables and for semicontinuous variables. Unlike the algorithm used to calculate the imputations, missingness is not artificially imposed in any binary indicator. Missingness is imposed, however, in the variable corresponding to the continuous portion of a semicontinuous variable.
Note that the hypothesis test based upon the Fisher z-transformation relies on bivariate normal assumptions. As such, p-values may be misleading in data where these assumptions do not hold.
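The pairwise labelling of cases and the Fisher z comparison described above can be sketched in base R. This is an illustrative sketch only, not the package's internal code: the helper compare_cors() and its arguments are hypothetical, and it assumes the standard two-sample Fisher test, in which each transformed correlation has approximate variance 1 / (n - 3).

```r
# Hypothetical helper (not part of gerbil): compares the correlation of two
# variables between cases labelled observed and cases labelled imputed.
# x, y         : completed (imputed) numeric vectors
# mis_x, mis_y : logical vectors, TRUE where the value was originally missing
compare_cors <- function(x, y, mis_x, mis_y, partial = "imputed") {
  n_missing <- mis_x + mis_y               # 0, 1, or 2 missing values per case
  # Partially imputed cases (n_missing == 1) follow the 'partial' setting
  imputed <- if (partial == "imputed") n_missing >= 1 else n_missing == 2
  observed <- !imputed

  r_obs <- cor(x[observed], y[observed])
  r_imp <- cor(x[imputed], y[imputed])
  n_obs <- sum(observed)
  n_imp <- sum(imputed)

  # Fisher z-transformation: z = atanh(r), approximate variance 1 / (n - 3)
  z_obs <- atanh(r_obs)
  z_imp <- atanh(r_imp)
  statistic <- (z_obs - z_imp) / sqrt(1 / (n_obs - 3) + 1 / (n_imp - 3))
  p_value <- 2 * pnorm(-abs(statistic))    # two-sided p-value

  list(r = c(Observed = r_obs, Imputed = r_imp),
       n = c(Observed = n_obs, Imputed = n_imp),
       statistic = statistic, p.value = p_value)
}

# Simulated data: the same bivariate normal relationship holds in both groups
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = sqrt(0.75))
mis_x <- runif(n) < 0.2
mis_y <- runif(n) < 0.2
res <- compare_cors(x, y, mis_x, mis_y)
res
```

Calling the helper with partial = "observed" moves the partially imputed cases into the observed group, so the imputed group shrinks to the cases with both variables missing.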
cor_gerbil() returns an object of the class cor_gerbil that has the following slots:
A list containing three elements, named Observed, Imputed, and All. The first is a matrix giving the sample correlations calculated across cases labeled as observed. The second and third are analogous correlation matrices calculated across only cases labeled as imputed and across all cases, respectively.
A list containing three elements, named Observed, Imputed, and All. The first is a matrix giving the number of cases in the respective pair of variables that have been labeled as observed. The second and third are analogous matrices indicating the number of cases labeled as imputed for each pair and the total number of cases for each pair, respectively.
A list containing three elements, named Observed, Imputed, and All. These matrices give the Fisher z-transformation of the correlations in the matrices provided in the slot Correlations.
A matrix that gives the value of the test statistic based on the Fisher z-transformation for each pair of variables. This statistic may be used to assess whether the correlations calculated across cases labeled as observed are statistically different from the correlations calculated across cases labeled as imputed.
A matrix that lists the p-value for each test statistic provided in the matrix in the slot labeled Statistic.
# Load the India Human Development Survey-II dataset
data(ihd_mcar)
imps.gerbil <- gerbil(ihd_mcar, m = 1, mcmciter = 100, ords = "education_level",
                      semi = "farm_labour_days",
                      bincat = c("sex", "marital_status", "job_field", "own_livestock"))

# Run the correlation analysis
cors.gerbil <- cor_gerbil(imps.gerbil, imp = 1)

# Print a summary
cors.gerbil
Coherent multiple imputation of general multivariate data as implemented through the GERBIL algorithm described by Robbins (2020). The algorithm is
coherent in that imputations are sampled from a valid joint distribution, ensuring MCMC convergence;
general in that data of general structure (binary, categorical, continuous, etc.) may be allowed;
efficient in that computational performance is optimized using the SWEEP operator for both modeling and sampling;
regression-based in that the joint distribution is built through a sequence of conditional regression models;
latent in that a latent multivariate normal process underpins all variables; and
flexible in that the user may specify which dependencies are enabled within the conditional models.
gerbil( dat, m = 1, mcmciter = 25, predMat = NULL, type = NULL, visitSeq = NULL, ords = NULL, semi = NULL, bincat = NULL, cont.meth = "EMP", num.cat = 12, r = 5, verbose = TRUE, n.cores = NULL, cl.type = NULL, mass = rep(0, length(semi)), ineligible = NULL, trace = TRUE, seed = NULL, fully.syn = FALSE )
dat |
The dataset that is to be imputed. Missing values must be coded with |
m |
The number of multiply imputed datasets to be created. By default, |
mcmciter |
The number of iterations of Markov chain Monte Carlo that will be used to create each imputed dataset. By default, |
predMat |
A numeric matrix of |
type |
A named vector that gives the type of each variable contained in |
visitSeq |
A vector of variable names that contains (at least) the names of each column of |
ords |
A character string giving a set of the column names of |
semi |
A character string giving a set of the column names of |
bincat |
A character string giving a set of the column names of |
cont.meth |
The type of marginal transformation used for continuous variables. Set to |
num.cat |
Any variable that does not have a type specified by any of the other parameters will be treated as categorical if it takes on no more than |
r |
The number of pairwise completely observed cases that must be available for any pair of variables to have dependencies enabled within the conditional models for imputation. By default, |
verbose |
If |
n.cores |
The number of CPU cores to use for parallelization. If |
cl.type |
The cluster type that is passed into the |
mass |
A named vector of the same length as the number of semi-continuous variables in |
ineligible |
Either a scalar or a matrix that is used to determine which values are to be considered missing but ineligible for imputation. Such values will be imputed internally within |
trace |
A logical that, if |
seed |
An integer that, when specified, is used to set the random number generator via |
fully.syn |
A logical that, if |
gerbil is designed to handle the following classes of variables:
'continuous': Variables are transformed to be (nearly) standard normal prior to imputation. The default transformation method is based on empirical distributions (see Robbins, 2014) and ensures that imputed values of a variable are sampled from the observed values of that variable.
'binary': Dichotomous variables are handled through probit-type models in that they are underpinned by a unit-variance normally distributed random variable.
'categorical': Unordered categorical variables are handled by creating nested binary variables that underpin the categorical data. Missingness is artificially imposed in the nested variables in order to ensure conditional independence between them. See Robbins (2020) for details.
'ordinal': Ordered categorical (ordinal) variables are handled through a probit-type model in that a latent normal distribution is assumed to underpin the ordinal observations. See Robbins (2020) for details.
'semicont': Mixed discrete/continuous (semi-continuous) variables are assumed to have a point mass at a specific value (most often zero) and to be continuous otherwise. A binary variable is created that indicates whether the semi-continuous variable takes on the point-mass value; the continuous portion is set as missing when the observed semi-continuous variable takes on the value at the point mass. See Robbins et al. (2013) for details.
The parameter type allows the user to specify the class for each variable. Routines are in place to establish the class by default for variables not stated in type. Note that it is not currently possible for a variable to be assigned a class of semi-continuous by default.
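The idea behind the empirical transformation for 'continuous' variables can be illustrated with a rank-based sketch. This is a minimal sketch under stated assumptions, not the method of Robbins (2014) or the code used by gerbil: it assumes normal scores obtained from scaled ranks and a back-transformation through the inverse empirical distribution function, and the helpers to_normal() and from_normal() are hypothetical names.

```r
# Hypothetical helpers (not gerbil functions).
# Forward transform: map observed values to approximate standard normal scores.
to_normal <- function(x) {
  n <- sum(!is.na(x))
  qnorm(rank(x, na.last = "keep") / (n + 1))
}

# Back-transform: map latent normal draws to observed values through the
# inverse empirical CDF (quantile type 1 always returns an observed value).
from_normal <- function(z, x_obs) {
  unname(quantile(x_obs, probs = pnorm(z), type = 1, na.rm = TRUE))
}

set.seed(2)
income <- c(rlnorm(200, meanlog = 10), rep(NA, 20))  # skewed, with missing values
z <- to_normal(income)                                # normal scores where observed
z_draws <- rnorm(20)                                  # stand-in for latent imputations
imputed_vals <- from_normal(z_draws, income)

# Every back-transformed value is one of the observed values
all(imputed_vals %in% income[!is.na(income)])
```

Because the back-transformation only ever returns order statistics of the observed data, imputed values cannot fall outside the observed range, which is the property the text attributes to the default method.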
gerbil uses a joint modeling approach to imputation that builds a joint model using a sequence of conditional models, as outlined in Robbins et al. (2013).
This approach differs from fully conditional specification in that the regression model for any given variable is only allowed to depend upon variables that precede it in an index ordering.
The order is established by the parameter visitSeq. gerbil gives its user the flexibility to establish which of the permissible dependencies are enabled within the conditional models.
Enabled dependencies are stated within the parameter predMat. Note that the data matrix used for imputation is an expanded version of the data that are fed into the algorithm (variables are created that underpin unordered categorical and semi-continuous variables).
Note also that conditional dependencies between the nested binary variables of a single unordered categorical variable, or between the discrete and continuous portions of a semi-continuous variable, are not permitted.
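The ordering constraint can be made concrete by building a dependency matrix in base R. The sketch below is an assumption-laden illustration rather than a description of how gerbil validates predMat: it assumes that rows correspond to the variables being modelled and columns to candidate predictors (the orientation suggested by the package examples), and the object names visit_seq and pred_mat are hypothetical.

```r
# Variables in the order in which they are visited (names from the ihd data)
visit_seq <- c("sex", "age", "marital_status", "job_field")

# Start from a matrix with no dependencies enabled. Rows are the variables
# being modelled, columns are candidate predictors (assumed orientation).
pred_mat <- matrix(0, length(visit_seq), length(visit_seq),
                   dimnames = list(visit_seq, visit_seq))

# Enable every permissible dependency: each variable may depend only on
# variables that precede it in the visit sequence.
pred_mat[lower.tri(pred_mat)] <- 1

# Optionally disable a single dependency, here job_field on age
pred_mat["job_field", "age"] <- 0
pred_mat
```

A strictly lower-triangular pattern is what distinguishes this sequence of conditional models from fully conditional specification, where every variable may depend on every other.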
The output of gerbil is an object of class gerbil, which is a list that contains the imputed datasets (imputed), missingness indicators (missing and missing.latent), summary information (summary), output used for MCMC convergence diagnostics (chainSeq and R.hat), and modeling summaries (visitSeq.initial, visitSeq.final, predMat.initial, predMat.final, drops, and forms).
Some output regarding convergence diagnostics and modeling pertains to the expanded dataset used for imputation (the expanded dataset includes binary indicators for unordered categorical and semi-continuous variables).
Note that the nested binary variables corresponding to an unordered categorical variable X with categories labeled a, b, c, etc., are named X.a, X.b, X.c, and so forth in the expanded dataset.
Likewise, the binary variable indicating the point mass of a semi-continuous variable Y is named Y.B in the expanded dataset, and the positive portion (with missingness imposed) is left as being named Y.
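The expansion of a semi-continuous variable can be sketched as follows. This is an illustration only: the coding of the indicator (1 when the point mass is observed) is an assumption, since the text states only that the binary variable indicates whether the point-mass value is taken, and the code is not taken from the package.

```r
# Semi-continuous variable with a point mass at zero
farm_labour_days <- c(0, 120, 0, NA, 45, 300)
mass <- 0

# Binary indicator of the point mass (assumed coding: 1 = at the point mass)
farm_labour_days.B <- as.integer(farm_labour_days == mass)

# Continuous portion: missingness is imposed where the point mass is observed
continuous_part <- farm_labour_days
continuous_part[!is.na(farm_labour_days) & farm_labour_days == mass] <- NA

# Expanded data: the indicator takes the suffix .B, the continuous portion
# keeps the original name
expanded <- data.frame(farm_labour_days.B = farm_labour_days.B,
                       farm_labour_days = continuous_part)
expanded
```

The originally missing case stays missing in both columns, whereas a case observed at the point mass is observed in the indicator and has missingness imposed in the continuous portion.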
gerbil automatically checks each regression model for perfect collinearities and reduces the model as needed.
Variables that have been dropped from a given model are listed in the element named 'drops' in a gerbil object.
gerbil() returns an object of the class gerbil that contains the following slots:
A list of length m that contains the imputed datasets.
A matrix of 0s, 1s, 2s, and 4s of the same dimension as dat that indicates which values were observed or missing. A 0 indicates a fully observed value, a 1 indicates a missing value that was imputed, and a 4 indicates a missing value that was ineligible for imputation.
A matrix with ncol(dat) rows that contains summary information, including the type of each variable and missingness rates. Note that for continuous variables, the type listed indicates the method of transformation used.
A list of six elements. Each element is a matrix with mcmciter columns and up to ncol(dat) rows. Objects means.all and means.mis give the variable means of the data process across iterations of MCMC when all observations are incorporated and when only imputed values are incorporated, respectively. (Means of continuous variables are given on the transformed scale.) Similar objects are provided to track variances of variables. Variables are listed in the order provided by the gerbil object visitSeq.latent. Variables reported in this output are those contained in the dataset that has been expanded to include binary indicators for categorical and semi-continuous variables.
The value of the R hat statistic of Gelman and Rubin (1992) for the means and variances of each variable. The R hat statistic is also provided for the mean of binary variables. Variables include those contained in the expanded dataset and are listed in the order provided by object visitSeq.latent. Only calculated if m > 2 and mcmciter >= 4.
A matrix of the same dimensions as the expanded dataset, but used to indicate missingness in the expanded dataset. In this matrix, 0s indicate fully observed values, 1s indicate fully missing values, 3s indicate values that have imposed missingness (for binary indicators corresponding to categorical or semi-continuous variables), and 4s indicate missing values that are ineligible for imputation (as determined by the input 'ineligible').
A vector of variable names giving the sequential ordering of variables that is used for imputation prior to expanding the dataset to include nested binary and point-mass indicators. Variables without missing values are excluded.
A vector of variable names giving the sequential ordering of variables in the expanded dataset that is used for imputation. Variables without missing values are excluded.
A matrix of ones and zeros indicating the dependencies enabled in the conditional models used for imputation. This matrix is determined from the input 'predMat'. Rows corresponding to variables with no missing values are removed.
A matrix of ones and zeros indicating the dependencies enabled in the conditional models used for imputation. This is of a similar format to the input 'predMat' but pertains to the expanded dataset. Rows corresponding to variables with no missing values are removed.
A list of length equal to the number of variables in the expanded dataset that have missing values. Elements of the list indicate which variables were dropped from the conditional model for the corresponding variable due to either insufficient pairwise complete observations (see the input 'r') or perfect collinearities.
A list of length equal to the number of variables in the expanded dataset that have missing values. Elements of the list indicate the regression formula used for imputation of the respective variable.
The final version of the input parameter mass.
A logical matrix with the same number of rows and columns as dat that indicates which elements are considered missing but ineligible for imputation.
A vector used to link column names in the expanded data to corresponding names in the original data.
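For readers unfamiliar with the R hat statistic mentioned among the slots above, the classic form from Gelman and Rubin (1992) can be sketched in base R. This is a sketch of the textbook formula, assuming each chain contributes one column of equal length; the helper r_hat() is hypothetical and the package may compute a variant of the statistic.

```r
# Hypothetical helper: classic Gelman-Rubin potential scale reduction factor.
# chains: numeric matrix with one row per MCMC iteration and one column per chain
r_hat <- function(chains) {
  n <- nrow(chains)                      # iterations per chain
  W <- mean(apply(chains, 2, var))       # mean within-chain variance
  B <- n * var(colMeans(chains))         # between-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled estimate of the target variance
  sqrt(var_plus / W)
}

set.seed(3)
mixed <- matrix(rnorm(4000), nrow = 1000, ncol = 4)   # four well-mixed chains
stuck <- sweep(mixed, 2, c(0, 1, 2, 3), "+")          # chains centred far apart
r_hat(mixed)   # close to 1
r_hat(stuck)   # well above 1
```

Values near 1 suggest that the chains agree with one another, while values well above 1 suggest that more MCMC iterations may be needed.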
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457-472.
Robbins, M. W. (2014). The Utility of Nonparametric Transformations for Imputation of Survey Data. Journal of Official Statistics, 30(4), 675-700.
Robbins, M. W. (2020). A flexible and efficient algorithm for joint imputation of general data. arXiv preprint arXiv:2008.02243.
Robbins, M. W., Ghosh, S. K., & Habiger, J. D. (2013). Imputation in high-dimensional economic data as applied to the Agricultural Resource Management Survey. Journal of the American Statistical Association, 108(501), 81-95.
# Load the India Human Development Survey-II dataset
data(ihd_mcar)

# Gerbil without types specified
imps.gerbil <- gerbil(ihd_mcar, m = 1, mcmciter = 10)

# Gerbil with types specified (method #1)
types.gerbil <- c(
  sex = "binary", age = "continuous",
  marital_status = "binary", job_field = "categorical",
  farm_labour_days = "semicont", own_livestock = "binary",
  education_level = "ordinal", income = "continuous")
imps.gerbil <- gerbil(ihd_mcar, m = 1, type = types.gerbil)

# Gerbil with types specified (method #2)
imps.gerbil <- gerbil(ihd_mcar, m = 1, ords = "education_level", semi = "farm_labour_days",
                      bincat = c("sex", "marital_status", "job_field", "own_livestock"))

# Gerbil with types specified (method #3)
types.gerbil <- c("binary", "continuous", "binary", "categorical", "semicont",
                  "binary", "ordinal", "continuous")
imps.gerbil <- gerbil(ihd_mcar, m = 1, type = types.gerbil)

# Variables of class factor are treated as binary/categorical by default
ihd.fac <- ihd_mcar
ihd.fac$sex <- factor(ihd_mcar$sex)
ihd.fac$marital_status <- factor(ihd_mcar$marital_status)
ihd.fac$job_field <- factor(ihd_mcar$job_field)
ihd.fac$own_livestock <- factor(ihd_mcar$own_livestock)
ihd.fac$education_level <- ordered(ihd_mcar$education_level)
imps.gerbil <- gerbil(ihd.fac, m = 1)

# Univariate plotting of one variable
plot(imps.gerbil, type = 1, y = "job_field")

# gerbil with predMat specified (method #1)
predMat <- matrix(c(1, 0, 0, 1), 2, 2)
dimnames(predMat) <- list(c("education_level", "income"), c("sex", "job_field"))
imps.gerbil <- gerbil(ihd_mcar, m = 1, type = types.gerbil, predMat = predMat)

# gerbil with predMat specified (method #2)
predMat <- rbind(
  c(0, 0, 0, 0, 0, 0, 0, 0),
  c(1, 0, 0, 0, 0, 0, 0, 0),
  c(1, 1, 0, 0, 0, 0, 0, 0),
  c(1, 1, 1, 0, 0, 0, 0, 0),
  c(1, 1, 1, 1, 0, 0, 0, 0),
  c(1, 1, 1, 1, 1, 0, 0, 0),
  c(1, 1, 1, 0, 1, 1, 0, 0),
  c(0, 1, 1, 1, 1, 1, 1, 0)
)
imps.gerbil <- gerbil(ihd_mcar, type = types.gerbil, predMat = predMat)

# Multiple imputation with more iterations
imps.gerbil.5 <- gerbil(ihd_mcar, m = 5, mcmciter = 100, ords = "education_level",
                        semi = "farm_labour_days", bincat = "job_field", n.cores = 1)
plot(imps.gerbil.5, type = 1, y = "job_field", imp = 1:5)

# Extract the first imputed dataset
imputed.gerb <- imputed(imps.gerbil.5, imp = 1)

# Write all imputed datasets to an Excel file
write.gerbil(imps.gerbil.5, file = file.path(tempdir(), "gerbil_example.xlsx"), imp = 1:5)

## Not run:
if (requireNamespace('mice')) {
  # Impute using mice for comparison
  # (mice functions are called with the mice:: prefix because requireNamespace()
  # loads the package without attaching it)
  types.mice <- c("logreg", "pmm", "logreg", "polyreg", "pmm", "logreg", "pmm", "pmm")
  imps.mice <- mice::mice(ihd.fac, m = 1, method = types.mice, maxit = 100)
  imps.mice1 <- mice::mice(ihd.fac, m = 1, method = "pmm", maxit = 100)
  imps.gerbil <- gerbil(ihd_mcar, m = 1, mcmciter = 100, ords = "education_level",
                        semi = "farm_labour_days", bincat = "job_field")

  # Compare the performance of mice and gerbil
  # Replace some gerbil datasets with mice datasets
  imps.gerbil.m <- imps.gerbil.5
  imps.gerbil.m$imputed[[2]] <- mice::complete(imps.mice, action = 1)
  imps.gerbil.m$imputed[[3]] <- mice::complete(imps.mice1, action = 1)

  # Perform comparative correlation analysis
  cor_gerbil(imps.gerbil.m, imp = 1, log = "income")
  cor_gerbil(imps.gerbil.m, imp = 2, log = "income")
  cor_gerbil(imps.gerbil.m, imp = 3, log = "income")

  # Perform comparative univariate goodness-of-fit testing
  gof_gerbil(imps.gerbil.m, type = 1, imp = 1)
  gof_gerbil(imps.gerbil.m, type = 1, imp = 2)
  gof_gerbil(imps.gerbil.m, type = 1, imp = 3)

  # Perform comparative bivariate goodness-of-fit testing
  gof_gerbil(imps.gerbil.m, type = 2, imp = 1)
  gof_gerbil(imps.gerbil.m, type = 2, imp = 2)
  gof_gerbil(imps.gerbil.m, type = 2, imp = 3)

  # Produce univariate plots for comparisons
  plot(imps.gerbil.m, type = 1, file = file.path(tempdir(), "gerbil_vs_mice_univariate.pdf"),
       imp = c(1, 2, 3), log = "income", lty = c(1, 2, 4, 5),
       col = c("blue4", "brown2", "green3", "orange2"),
       legend = c("Observed", "gerbil", "mice: logistic", "mice: pmm"))

  # Produce bivariate plots for comparisons
  plot(imps.gerbil.m, type = 2, file = file.path(tempdir(), "gerbil_vs_mice_bivariate.pdf"),
       imp = c(1, 2, 3), log = "income", lty = c(1, 2, 4, 5),
       col = c("blue4", "brown2", "green3", "orange2"), pch = c(1, 3, 4, 5),
       legend = c("Observed", "gerbil", "mice: logistic", "mice: pmm"))
}
## End(Not run)
Goodness-of-fit testing for gerbil objects
Using a gerbil object as an input, this function performs univariate and bivariate goodness-of-fit tests
to compare distributions of imputed and observed values.
gof_gerbil( x, y = NULL, type = 1, imp = 1, breaks = NULL, method = c("chi-squared", "fisher", "G"), ks = FALSE, partial = "imputed", ... )
x |
A |
y |
A vector listing the column names of the imputed data for which tests should be run. See details. By default, |
type |
A scalar used to specify the type of tests that will be performed. Options include univariate (marginal) tests ( |
imp |
A scalar or vector indicating which of the multiply imputed datasets should be used for testing. Defaults to |
breaks |
Used to determine the cut-points for binning of continuous variables into categories. Ideally, |
method |
The type of test that is used to compare contingency tables. Options include |
ks |
If |
partial |
Indicates how partially imputed pairs are handled in bivariate testing. If |
... |
Arguments to be passed to methods. |
Goodness of fit is determined using contingency tables of counts across categories of the corresponding variable(s).
For univariate testing (type = 1), a one-way table is calculated for observed cases and compared to an analogous table for imputed cases, whereas for bivariate testing (type = 2), two-way tables are calculated.
Continuous variables are binned according to cut-points defined using the parameter breaks.
Tests are performed using one of three methods (determined from the parameter method): 1) Chi-squared (the default); 2) Fisher's exact; and 3) a G-test.
G-testing is implemented via the function GTest() from the DescTools package.
Note that for univariate testing of continuous variables, a Kolmogorov-Smirnov test may be performed instead by setting ks = TRUE.
The only required input is a parameter x which is a gerbil object.
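The univariate comparison described above can be imitated with base R on simulated data. This is a sketch under stated assumptions, not the code used by gof_gerbil(): it assumes that the observed and imputed one-way tables are compared through a chi-squared test on the resulting two-row contingency table, it uses quartile cut-points as a stand-in for the breaks parameter, and all object names are hypothetical.

```r
set.seed(4)
# Completed (imputed) values of a categorical and a continuous variable
job_field <- sample(c("agri", "business", "student"), 500, replace = TRUE)
income <- rlnorm(500, meanlog = 10)
was_missing <- runif(500) < 0.3                 # TRUE where the value was imputed
status <- ifelse(was_missing, "Imputed", "Observed")

# Categorical variable: counts by status, compared with a chi-squared test
tab_cat <- table(status, job_field)
test_cat <- chisq.test(tab_cat)

# Continuous variable: bin using cut-points, then test in the same way
breaks <- quantile(income, probs = seq(0, 1, by = 0.25))
income_bin <- cut(income, breaks = breaks, include.lowest = TRUE)
test_bin <- chisq.test(table(status, income_bin))

# Alternative for continuous variables: two-sample Kolmogorov-Smirnov test
test_ks <- ks.test(income[!was_missing], income[was_missing])

c(categorical = test_cat$p.value, binned = test_bin$p.value, ks = test_ks$p.value)
```

In this simulation the imputed and observed values come from the same distribution, so small p-values would arise only by chance.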
Note that univariate differences between observed and imputed data may be explained by the missingness mechanism and are not necessarily indicative of poor imputations. Note also that most imputation methods, including gerbil, mice, and related methods, are not designed to capture complete bivariate distributions. As such, the bivariate tests are likely to return small p-values.
gof_gerbil() returns an object of the class gof_gerbil that has the following slots:
A vector (when type = 1) or matrix (when type = 2) giving the value of the test statistic (or coefficient) for the corresponding variable (or variable pair).
A vector (when type = 1) or matrix (when type = 2) giving the p-value for the test applied to the corresponding variable (or variable pair).
A vector (when type = 1) or matrix (when type = 2) indicating the type of test applied to the corresponding variable (or variable pair).
A list giving the cutpoints used for binning each continuous or semi-continuous variable.
# Load the India Human Development Survey-II dataset
data(ihd_mcar)
imps.gerbil <- gerbil(ihd_mcar, m = 1, mcmciter = 200, ords = "education_level",
                      semi = "farm_labour_days",
                      bincat = c("sex", "marital_status", "job_field", "own_livestock"))

# Run univariate tests
tests.gerbil.uni <- gof_gerbil(imps.gerbil, imp = 1, type = 1)

# Print a summary
tests.gerbil.uni

# Run bivariate tests
tests.gerbil.bi <- gof_gerbil(imps.gerbil, imp = 1, type = 2)

# Print a summary
tests.gerbil.bi
This dataset is a subset from the India Human Development survey. This dataset is included in the package only for demonstration purposes and should not be used for other purposes.
ihd
A data frame with 42155 rows and 8 variables:
sex: 0 = individual is male; 1 = individual is female.
age: Age of the individual, between 0 and 99.
marital_status: Individual’s marital status. 0 = Unmarried; 1 = Married.
job_field: Refers to the field of the individual’s profession or job status (e.g., agricultural worker, small business owner, student, unemployed).
farm_labour_days: Number of days a year the individual worked on a farm. Can take on the value zero.
own_livestock: 0 = Individual does not own livestock; 1 = Individual does own livestock.
education_level: Years of schooling attained by the individual. Censored for values above 16.
income: Household’s income, in rupees. Value can be negative.
Desai, Sonalde, and Vanneman, Reeve. India Human Development Survey-II (IHDS-II), 2011-12. Inter-university Consortium for Political and Social Research [distributor], 2018-08-08. doi:10.3886/ICPSR36151.v6
This dataset is a subset from the India Human Development survey. This dataset is included in the package only for demonstration purposes and should not be used for other purposes.
ihd_mar
A data frame with 42155 rows and 8 variables:
sex: 1 = individual is male; 2 = individual is female.
age: Age of the individual, between 0 and 99.
marital_status: Individual’s marital status. 0 = Married, absent spouse; 1 = Married; 2 = Unmarried; 3 = Widowed; 4 = Divorced/Separated; 5 = Married, no gauna.
job_field: Refers to the field of the individual’s profession or job status (e.g., agricultural worker, small business owner, student, unemployed).
farm_labour_days: Number of days a year the individual worked on a farm. Can take on the value zero.
own_livestock: 0 = Individual does not own livestock; 1 = Individual does own livestock.
education_level: Years of schooling attained by the individual. Censored for values above 16.
income: Household’s income, in rupees. Value can be negative.
Desai, Sonalde, and Vanneman, Reeve. India Human Development Survey-II (IHDS-II), 2011-12. Inter-university Consortium for Political and Social Research [distributor], 2018-08-08. doi:10.3886/ICPSR36151.v6
This dataset is a subset from the India Human Development survey. This dataset is included in the package only for demonstration purposes and should not be used for other purposes.
ihd_mcar
A data frame with 42155 rows and 8 variables:
sex: 1 = individual is male; 2 = individual is female.
age: Age of the individual, between 0 and 99.
marital_status: Individual’s marital status. 0 = Married, absent spouse; 1 = Married; 2 = Unmarried; 3 = Widowed; 4 = Divorced/Separated; 5 = Married, no gauna.
job_field: Refers to the field of the individual’s profession or job status (e.g., agricultural worker, small business owner, student, unemployed).
farm_labour_days: Number of days a year the individual worked on a farm. Can take on the value zero.
own_livestock: 0 = Individual does not own livestock; 1 = Individual does own livestock.
education_level: Years of schooling attained by the individual. Censored for values above 16.
income: Household’s income, in rupees. Value can be negative.
Desai, Sonalde, and Vanneman, Reeve. India Human Development Survey-II (IHDS-II), 2011-12. Inter-university Consortium for Political and Social Research [distributor], 2018-08-08. doi:10.3886/ICPSR36151.v6
Using a gerbil
object as an input, this function returns imputed datasets.
imputed(gerb, imp = 1)
gerb |
A |
imp |
The imputed datasets which are to be returned (defaults to |
The function returns either a single imputed dataset (if imp is a scalar) or a tall dataset (if imp is a vector) with the individual datasets stacked on top of each other.
imputed()
returns a data frame or matrix. If imp
has multiple elements, columns are added to indicate the imputation number and the case ID.
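As a rough illustration of what a tall (stacked) dataset looks like, the base R sketch below stacks three toy data frames and adds bookkeeping columns. The toy data and the column names `imp_number` and `case_id` are assumptions made for illustration; the names that imputed() actually uses are not stated here.

```r
# Toy stand-ins for three imputed datasets (NOT produced by gerbil).
set.seed(1)
imps <- lapply(1:3, function(i) {
  data.frame(age = c(30, 41, 25), income = round(rnorm(3, 1000, 50)))
})

# Stack the datasets, adding hypothetical columns for the imputation
# number and the case ID.
tall <- do.call(rbind, lapply(seq_along(imps), function(i) {
  data.frame(imp_number = i, case_id = seq_len(nrow(imps[[i]])), imps[[i]])
}))

# Recover a single imputed dataset from the stacked data.
second <- tall[tall$imp_number == 2, c("age", "income")]
```

Keeping an imputation index and a case ID in the stacked data makes it possible to split the result back into its individual datasets or to follow one case across imputations.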
# Load the India Human Development Survey-II dataset
data(ihd_mcar)

# Create a gerbil object
imps.gerbil <- gerbil(ihd_mcar, m = 5, ords = "education_level", semi = "farm_labour_days", bincat = "job_field", n.cores = 1)

# Return a single imputed dataset
imp.gerb <- imputed(imps.gerbil, imp = 2)

# Return multiple (stacked) datasets
imp.gerb <- imputed(imps.gerbil, imp = 1:5)
Using a gerbil object as an input, this function produces diagnostic plots for selected variables.
## S3 method for class 'gerbil'
plot(
  x,
  y = NULL,
  type = "Univariate",
  imp = 1,
  col = NULL,
  lty = NULL,
  lwd = NULL,
  pch = NULL,
  log = NULL,
  legend = NULL,
  legend.loc = "topright",
  mfrow = c(3, 2),
  trace.type = "Mean",
  file = NULL,
  sep = FALSE,
  height = NULL,
  width = NULL,
  partial = "imputed",
  ...
)
x |
A |
y |
A vector listing the column names of the imputed data for which plots should be created. See details. By default, |
type |
A scalar used to specify the type of plots that will be created. Options include univariate (marginal) plots ( |
imp |
A scalar or vector indicating which of the multiply imputed datasets should be used for plotting. Defaults to |
col |
The color used for plotting – should be a vector of length equal to |
lty |
The line type used for plotting imputed values with trace lines or density plots – should be a vector of length equal to |
lwd |
The line width used for density and trace line plotting – should be a vector of length equal to |
pch |
A length-2 vector that indicates the plotting symbol to be used for imputed and observed values in scatter and lattice plots. |
log |
A character vector that includes names of variables of which a log transformation is to be taken prior to plotting. |
legend |
A character or expression vector to appear in the legend. If |
legend.loc |
The location of the legend in the plots. |
mfrow |
The layout of plots across a single page when there are to be multiple plots per page (as is the case when |
trace.type |
The type of trace plot to be created (only valid when |
file |
A character string giving the name of file that will be created in the home directory containing plots. The name should have a |
sep |
If |
height |
The height of the graphics region (in inches) when a pdf is created. |
width |
The width of the graphics region (in inches) when a pdf is created. |
partial |
Indicates how partially imputed pairs are handled in bivariate plotting. If |
... |
Arguments to be passed to methods, such as |
Three types of plots may be produced:
1) Univariate (produced by setting type = 1
): Compares the marginal distribution of observed and imputed values of a given variable. Density plots are produced for continuous variables, and bar plots are given for binary, categorical, and ordinal variables. For semi-continuous variables, two plots are constructed: a) a bar plot for the binary portion of the variable and b) a density plot for the continuous portion.
2) Bivariate (produced by setting type = 2
): Compares the bivariate distributions of observed and imputed values of two variables. Scatter plots are produced if both variables are continuous or semi-continuous, box plots are produced if one variable is continuous or semi-continuous and the other is not, and a lattice plot is produced if neither variable is continuous or semi-continuous. For bivariate plots, imputed observations are those that have one or more of the values of the pair missing within the original dataset.
3) Trace lines (produced by setting type = 3
): Plots a pre-specified parameter across iterations of MCMC in order to examine convergence for a given variable. Parameters that may be plotted include means (trace.type = 1
) and variances (trace.type = 2
).
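The idea behind the univariate comparison for a continuous variable can be sketched in base R as follows. This is a minimal illustration on simulated data, not the plotting code used by the package: the missingness mask `miss` and the placeholder imputations are assumptions.

```r
set.seed(42)
n <- 200
original <- rnorm(n, mean = 50, sd = 10)

# TRUE where the value is treated as missing in the original data.
miss <- runif(n) < 0.3

# Placeholder imputations drawn from the same distribution (an assumption;
# a real imputation model would condition on the other variables).
completed <- original
completed[miss] <- rnorm(sum(miss), mean = 50, sd = 10)

# Kernel density estimates for observed and imputed cases.
d_obs <- density(completed[!miss])
d_imp <- density(completed[miss])

# Overlay the two densities; a large discrepancy would flag a poor fit.
plot(d_obs, main = "Observed vs. imputed", xlab = "value",
     ylim = range(0, d_obs$y, d_imp$y))
lines(d_imp, lty = 2)
legend("topright", legend = c("Observed", "Imputed"), lty = 1:2)
```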
Multiple plots may be created, as determined by the variable names listed in the parameter y
. For univariate and trace plots, one plot is created for
each variable listed in y
. For bivariate plotting, one plot is created for each combination of two elements within the vector y
(as such, y
must have a length of at least two in this case).
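The number of bivariate plots implied by a given y can be checked with `combn()`. The sketch below assumes that one plot is drawn per unordered pair, as described above; the selection of variables is only an example.

```r
# Example selection of variables (names taken from the package examples).
y <- c("job_field", "income", "education_level")

# Enumerate every unordered pair of variables; one row per plot.
pairs <- t(combn(y, 2))
colnames(pairs) <- c("var1", "var2")

# The number of pairs equals choose(length(y), 2).
n_plots <- nrow(pairs)
```

Because the count grows quadratically with the length of y, writing the plots to a file is usually more practical than viewing them on screen when many variables are selected.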
For trace plotting, elements of y
should correspond to column names in the dataset that has been expanded to include binary indicators for categorical and semi-continuous variables.
If multiple plots are to be created, it is recommended to specify a file for output using the parameter file
, in which case separate
files will be created for each plot (if sep = TRUE
) or all plots will be written to the same file (if sep = FALSE
).
The only required input is a parameter x
which is a gerbil
object.
No returned value, but instead plots are generated in the workspace or written to a specified directory.
# Load the India Human Development Survey-II dataset
data(ihd_mcar)

# Create a gerbil object
imps.gerbil <- gerbil(ihd_mcar, m = 1, ords = "education_level", semi = "farm_labour_days", bincat = "job_field")

# Univariate plotting of all variables to a file
plot(imps.gerbil, type = 1, file = file.path(tempdir(), "gerbil_univariate.pdf"))

# Bivariate plotting of all variables to a file
plot(imps.gerbil, type = 2, file = file.path(tempdir(), "gerbil_bivariate.pdf"))

# Trace plotting of all variables to a file
plot(imps.gerbil, type = 3, file = file.path(tempdir(), "gerbil_ts.pdf"))

# Univariate plotting of one variable (not to a file)
plot(imps.gerbil, type = 1, y = "job_field")

# Bivariate plotting of one pair of variables (not to a file)
plot(imps.gerbil, type = 2, y = c("job_field", "income"))

# Bivariate plotting of one pair of variables (not to a file) with income logged
plot(imps.gerbil, type = 2, y = c("job_field", "income"), log = "income")
Prints a cor_gerbil object. Printed output includes the average difference of correlations, as well as summaries of the test statistics based on Fisher's z and their p-values.
## S3 method for class 'cor_gerbil'
print(x, ...)
x |
object of |
... |
additional parameters to be passed down to inner functions. |
The functions print.cor_gerbil
and summary.cor_gerbil
display information
about the cor_gerbil
object. The output displayed includes:
1) the average absolute difference in correlation between observed and imputed cases across all relevant variable pairs,
2) the average value of the test statistic based on Fisher's z across all variable pairs,
3) the largest test statistic observed across any variable pair, and
4) the portion of p-values for the test based on Fisher's z that are less than 0.05.
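The four summaries listed above can be reproduced for a toy set of correlations using the textbook two-sample test based on Fisher's z. This is a sketch under stated assumptions: the correlations and sample sizes are made up, and the use of absolute values for the test statistic and of the standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)) may differ from what cor_gerbil does internally.

```r
# Hypothetical correlations and sample sizes for three variable pairs.
r_obs <- c(0.42, -0.10, 0.65); n_obs <- c(500, 480, 510)
r_imp <- c(0.38, -0.02, 0.50); n_imp <- c(120, 140, 110)

# Fisher z-transformation is atanh(r); compare the transformed values.
z_stat <- (atanh(r_obs) - atanh(r_imp)) /
  sqrt(1 / (n_obs - 3) + 1 / (n_imp - 3))

# Two-sided p-values under a standard normal reference distribution.
p_val <- 2 * pnorm(-abs(z_stat))

summary_stats <- c(
  mean_abs_diff = mean(abs(r_obs - r_imp)),  # 1) average absolute difference
  mean_abs_stat = mean(abs(z_stat)),         # 2) average test statistic
  max_abs_stat  = max(abs(z_stat)),          # 3) largest test statistic
  prop_p_lt_05  = mean(p_val < 0.05)         # 4) share of p-values below 0.05
)
```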
Prints a gerbil object. Printed output includes a variable-by-variable summary of variable types and missingness rates. The implemented predictor matrix is also provided.
## S3 method for class 'gerbil'
print(x, ...)
x |
object of |
... |
additional parameters to be passed down to inner functions. |
The functions print.gerbil
and summary.gerbil
display information about the gerbil
object.
Primarily, the variable type and missingness rate are displayed for each variable. The predictor matrix is also provided.
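A variable-by-variable missingness summary of this kind can be approximated on any data frame with base R. The sketch below uses a small made-up data frame and reports R storage classes, which are only a stand-in for the variable types that gerbil assigns (for example ordinal or semi-continuous).

```r
# Small made-up data frame with missing values.
dat <- data.frame(
  income = c(1200, NA, 560, NA),
  job_field = factor(c("student", "farmer", NA, "farmer")),
  livestock = c(0, 1, 1, 0)
)

# Share of missing values in each column.
miss_rate <- colMeans(is.na(dat))

# First class of each column, used here in place of gerbil's variable types.
var_class <- vapply(dat, function(v) class(v)[1], character(1))

overview <- data.frame(class = var_class, missing = miss_rate)
```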
Prints a gof_gerbil object. Printed output pertains to the goodness-of-fit tests that are applied in order to compare the distribution between observed and imputed cases for relevant variables or variable pairs.
## S3 method for class 'gof_gerbil'
print(x, ...)
x |
object of |
... |
additional parameters to be passed down to inner functions. |
The functions print.gof_gerbil
and summary.gof_gerbil
display information
about the gof_gerbil
object. The output displayed includes:
1) the average test statistic value across all variables or variable pairs contained in the object,
2) the average p-value of all goodness-of-fit tests contained within the object, and
3) the number of tests that yielded a p-value of less than 0.05.
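The three summaries can be illustrated with any collection of two-sample tests. The sketch below uses `ks.test()` on simulated data as a stand-in; which tests gof_gerbil applies is not stated in this section, and the variable names `v1` to `v3` are hypothetical.

```r
set.seed(7)
vars <- c("v1", "v2", "v3")  # hypothetical variable names

# One two-sample Kolmogorov-Smirnov test per variable, comparing
# simulated "observed" and "imputed" values.
tests <- lapply(vars, function(v) {
  observed <- rnorm(300)
  imputed <- rnorm(100)
  ks.test(observed, imputed)
})

stat <- vapply(tests, function(res) unname(res$statistic), numeric(1))
pval <- vapply(tests, function(res) res$p.value, numeric(1))

gof_summary <- c(
  mean_stat  = mean(stat),       # 1) average test statistic
  mean_p     = mean(pval),       # 2) average p-value
  n_below_05 = sum(pval < 0.05)  # 3) number of tests with p < 0.05
)
```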
Summarises a cor_gerbil object. Printed output includes the average difference of correlations, as well as summaries of the test statistics based on Fisher's z and their p-values.
## S3 method for class 'cor_gerbil'
summary(object, ...)
object |
An object of |
... |
additional parameters to be passed down to inner functions. |
The functions print.cor_gerbil
and summary.cor_gerbil
display information
about the cor_gerbil
object. The output displayed includes:
1) the average absolute difference in correlation between observed and imputed cases across all relevant variable pairs,
2) the average value of the test statistic based on Fisher's z across all variable pairs,
3) the largest test statistic observed across any variable pair, and
4) the portion of p-values for the test based on Fisher's z that are less than 0.05.
Summarises a gerbil object. Printed output includes a variable-by-variable summary of variable types and missingness rates. The implemented predictor matrix is also provided.
## S3 method for class 'gerbil'
summary(object, ...)
object |
An object of |
... |
Additional parameters to be passed down to inner functions. |
The functions print.gerbil
and summary.gerbil
display information about the gerbil
object.
Primarily, the variable type and missingness rate are displayed for each variable. The predictor matrix is also provided.
Summarises a gof_gerbil object. Printed output pertains to the goodness-of-fit tests that are applied in order to compare the distribution between observed and imputed cases for relevant variables or variable pairs.
## S3 method for class 'gof_gerbil'
summary(object, ...)
object |
An object of |
... |
Additional parameters to be passed down to inner functions. |
The functions print.gof_gerbil
and summary.gof_gerbil
display information
about the gof_gerbil
object. The output displayed includes:
1) the average test statistic value across all variables or variable pairs contained in the object,
2) the average p-value of all goodness-of-fit tests contained within the object, and
3) the number of tests that yielded a p-value of less than 0.05.
Using a gerbil
object as an input, this function writes imputed datasets to an output file.
write.gerbil(gerb, file = NULL, imp = NULL, tall = FALSE, row.names = FALSE)
gerb |
A |
file |
The name of the file to which the imputed datasets are to be written.
Which type of file (.xlsx or .csv) is created depends upon the extension of the parameter |
imp |
The imputed datasets which are to be written. Can be a scalar or, if multiple imputed datasets are to be written, a vector.
All elements of |
tall |
A logical expression indicating whether the datasets are to be written in a tall (stacked) format
or written separately. When writing to an XLSX file with |
row.names |
A logical value indicating whether the row names of the datasets are to be written. |
The function writes imputed datasets to either an Excel (.xlsx) or a CSV (.csv) file, depending upon the extension of the parameter file
.
No other file types are supported.
To write multiple imputed datasets simultaneously, specify imp
as a vector with length greater than 1.
Multiple imputed datasets are either written in a stacked format (if tall = TRUE
) or written separately (if tall = FALSE
).
No returned value, but instead a data file is written to a specified directory.
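Choosing the output format from the file extension can be sketched with `tools::file_ext()`. The helper `write_stacked()` below is hypothetical and is not part of the package; it handles only the CSV case, because XLSX output requires an additional package that is left out of this sketch.

```r
# Hypothetical helper: write a (stacked) data frame based on the extension.
write_stacked <- function(dat, file) {
  ext <- tolower(tools::file_ext(file))
  if (ext == "csv") {
    utils::write.csv(dat, file, row.names = FALSE)
  } else if (ext == "xlsx") {
    stop("XLSX output needs an additional package; omitted from this sketch")
  } else {
    stop("Unsupported file type: ", ext)
  }
  invisible(file)
}

# Write a toy stacked dataset to a temporary CSV file.
out <- file.path(tempdir(), "stacked_example.csv")
toy <- data.frame(imp = c(1, 1, 2, 2), income = c(10, 20, 11, 19))
write_stacked(toy, out)
```

Failing early on an unrecognized extension mirrors the behaviour described above, where only .xlsx and .csv files are supported.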
# Load the India Human Development Survey-II dataset
data(ihd_mcar)

# Create a gerbil object
imps.gerbil <- gerbil(ihd_mcar, m = 5, ords = "education_level", semi = "farm_labour_days", bincat = "job_field", n.cores = 1)

# Write all imputed datasets to separate CSV files
write.gerbil(imps.gerbil, file.path(tempdir(), "gerbil_example.csv"), imp = 1:5, tall = FALSE)

# Write all imputed datasets to a single CSV file
write.gerbil(imps.gerbil, file.path(tempdir(), "gerbil_example.csv"), imp = 1:5, tall = TRUE)

# Write all imputed datasets to an XLSX file
write.gerbil(imps.gerbil, file.path(tempdir(), "gerbil_example.xlsx"), imp = 1:5, tall = FALSE)