#1 Format your data | meta:umbrella
 

Tutorial #1: Format your data

Download example datasets here

Well-formatted dataset

A distinctive feature of the metaumbrella package is that none of its functions has an argument for specifying the names of the variables in the user's dataset. The datasets passed to the package's functions must therefore follow a precise format (referred to below as a well-formatted dataset). This section presents the rules to follow when creating a well-formatted dataset.

The datasets passed to the functions of the metaumbrella package should contain information on each individual study pooled in the different meta-analyses included in the umbrella review. The information about each individual study must allow for replication of the meta-analyses. It is therefore necessary that the information contained in a well-formatted dataset allows for estimating the effect size and variance of all individual studies. Ten types of effect size measures are accepted:

  • "SMD": standardized mean difference (i.e., Cohen's d)
  • "G": Hedges' g
  • "MD": mean difference
  • "SMC": standardized mean change
  • "R": Pearson's r
  • "Z": Fisher's z
  • "OR" or "logOR": odds ratio or its logarithm
  • "RR" or "logRR": risk ratio or its logarithm
  • "HR" or "logHR": hazard ratio or its logarithm
  • "IRR" or "logIRR": incidence rate ratio or its logarithm

To estimate the effect size and the variance of each individual study, the metaumbrella package allows for flexible inputs. We detail below (A) the variables that are mandatory and must be indicated in a well-formatted dataset, (B) the variables that vary depending on the effect size measure and (C) the variables that are optional but that can be indicated to benefit from certain features of the package. Note that the package includes examples of well-formatted datasets for each effect size measure here.

A. Mandatory columns:

The following variables must be included in the dataset regardless of the effect size measure used. The names of these variables cannot be changed.

  • meta_review: a character variable that contains an identifier for the sources of the meta-analyses included in an umbrella review. Typically, this variable contains the name of the first author of each included meta-analysis.
  • factor: a character variable that contains an identifier for the risk factors or the interventions whose effects are studied. Importantly, all rows in the dataset with the same factor value will be pooled together in a meta-analysis.
  • author and year: character variables identifying the name and the year of publication of each individual study that is included in a meta-analysis. For a given factor, all rows with the same author and year values will be identified as having some type of dependence (see below).
  • measure: a character variable describing the type of effect size measure used to quantify the effect of the factor. It must be one of "SMD", "G", "MD", "SMC", "R", "Z", "OR", "logOR", "RR", "logRR", "HR", "logHR", "IRR" or "logIRR". Note that if a study reports the numbers of cases and controls in the exposed and non-exposed groups but does not report an effect size value (i.e., the value of an OR or RR), we recommend specifying "OR" for case-control studies and "RR" for cohort studies.
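
The rules for the mandatory columns can be checked before a dataset is handed to the package. The following is a minimal sketch in Python, not part of metaumbrella: the helper `check_mandatory`, the list-of-dictionaries representation (one dictionary per row, as produced for example by `csv.DictReader`) and the sample rows are all assumptions made for illustration. The accepted measure labels are copied from the list of effect size measures given earlier.

```python
# Hypothetical pre-check; metaumbrella performs its own validation in R.
MANDATORY_COLUMNS = ("meta_review", "factor", "author", "year", "measure")

# Labels copied from the list of accepted effect size measures.
ACCEPTED_MEASURES = {
    "SMD", "G", "MD", "SMC", "R", "Z",
    "OR", "logOR", "RR", "logRR", "HR", "logHR", "IRR", "logIRR",
}


def check_mandatory(rows):
    """Return a list of human-readable problems found in `rows`.

    `rows` is a list of dicts, one per individual study.
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        for col in MANDATORY_COLUMNS:
            # A missing key, None and an empty string all count as absent.
            if str(row.get(col, "") or "").strip() == "":
                problems.append(f"row {i}: missing '{col}'")
        measure = row.get("measure")
        if measure and measure not in ACCEPTED_MEASURES:
            problems.append(f"row {i}: unknown measure '{measure}'")
    return problems


# Invented rows, for illustration only.
rows = [
    {"meta_review": "Smith", "factor": "smoking", "author": "Lee",
     "year": "2015", "measure": "OR"},
    {"meta_review": "Smith", "factor": "smoking", "author": "Kim",
     "year": "2017", "measure": "odds"},
    {"meta_review": "Smith", "factor": "", "author": "Park",
     "year": "2019", "measure": "RR"},
]
for problem in check_mandatory(rows):
    print(problem)
```

With these invented rows, the second row is flagged for its measure label and the third for its empty factor; the first row passes.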

B. Required information depending on the effect size measure:

Depending on the effect size measure used, different information must be provided to replicate the meta-analyses. To let users adapt to the data available in the original articles, several combinations of information can be provided for a given effect size measure. We first list the information that can be provided in the dataset, then give summary tables showing the minimum combinations of information required to replicate the meta-analyses.

  • value: Value of the effect size for each individual study.
  • ci_lo: Lower bound of the 95% confidence interval around the effect size for each individual study.
  • ci_up: Upper bound of the 95% confidence interval around the effect size for each individual study.
  • se: Standard error of the effect size for each individual study.
  • var: Variance of the effect size for each individual study.
  • n_sample: Total number of participants in each study.
  • n_cases: Number of cases in each individual study.
  • n_controls: Number of controls in each individual study.
  • n_exp: Number of exposed participants in each individual study.
  • n_nexp: Number of non-exposed participants in each individual study.
  • n_cases_exp: Number of cases in the exposed group in each individual study.
  • n_controls_exp: Number of controls in the exposed group in each individual study.
  • n_cases_nexp: Number of cases in the non-exposed group in each individual study.
  • n_controls_nexp: Number of controls in the non-exposed group in each individual study.
  • mean_cases: Mean of the cases for each individual study (at follow-up).
  • mean_controls: Mean of the controls for each individual study (at follow-up).
  • sd_cases: Standard deviation of the cases for each individual study (at follow-up).
  • sd_controls: Standard deviation of the controls for each individual study (at follow-up).
  • mean_pre_cases: Mean of the cases for each individual study at baseline.
  • mean_pre_controls: Mean of the controls for each individual study at baseline.
  • sd_pre_cases: Standard deviation of the cases for each individual study at baseline.
  • sd_pre_controls: Standard deviation of the controls for each individual study at baseline.
  • mean_change_cases: Mean change of the cases for each individual study (from baseline to follow up).
  • mean_change_controls: Mean change of the controls for each individual study (from baseline to follow up).
  • sd_change_cases: Standard deviation of the change of cases for each individual study (from baseline to follow up).
  • sd_change_controls: Standard deviation of the change of controls for each individual study (from baseline to follow up).
  • pre_post_cor: Correlation between the pre-post measure across groups.
  • time: Sum of the person-time of disease-free observation in the exposed and non-exposed groups for each individual study.
  • time_exp: Person-time of disease-free observation in the exposed group for each individual study.
  • time_nexp: Person-time of disease-free observation in the non-exposed group for each individual study.
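
Several of the columns above are totals of others: by their definitions, n_cases is the sum of n_cases_exp and n_cases_nexp, n_exp is the sum of n_cases_exp and n_controls_exp, and time is the sum of time_exp and time_nexp. The Python sketch below derives such totals from the cell counts. The helper `derive_totals` and the sample study are hypothetical, and whether the package needs these totals for a given measure depends on the summary tables, not on this sketch.

```python
def derive_totals(row):
    """Fill in totals that follow from the column definitions.

    Existing values are never overwritten; a total is only added when
    both of its components are present.
    """
    sums = {
        "n_cases": ("n_cases_exp", "n_cases_nexp"),
        "n_controls": ("n_controls_exp", "n_controls_nexp"),
        "n_exp": ("n_cases_exp", "n_controls_exp"),
        "n_nexp": ("n_cases_nexp", "n_controls_nexp"),
        "time": ("time_exp", "time_nexp"),
    }
    out = dict(row)
    for total, (a, b) in sums.items():
        if (out.get(total) is None
                and out.get(a) is not None
                and out.get(b) is not None):
            out[total] = out[a] + out[b]
    return out


# Invented 2x2 table, for illustration only.
study = {"n_cases_exp": 12, "n_controls_exp": 38,
         "n_cases_nexp": 7, "n_controls_nexp": 43}
print(derive_totals(study))
```

The same function can serve as a consistency check: when a study reports both the cell counts and the totals, comparing the reported totals with the derived ones reveals data-entry errors.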

    We now present the summary tables indicating the minimum combinations of information that should be provided for each individual study to run the analyses. The following conventions apply to these tables.

    1. The headers of the tables are the names of the columns in the dataset.
    2. The symbol & between two column names indicates that information should be provided for both columns of the dataset.
    3. The symbol OR between two column names indicates that information should be provided for one of the two columns of the dataset.
    4. The symbol X indicates that the information should be provided.
    5. A blank cell indicates that the information can be missing.
    6. For each effect size measure, users must provide the information required by at least one row of the corresponding table. Otherwise, an error message is printed and the analyses are not run.

As an example, the following table implies that the users can provide 3 combinations of information to estimate an effect size and its variance:

n_cases & n_controls | value | se OR var | ci_lo & ci_up
---------------------+-------+-----------+--------------
X                    | X     | X         |
X                    | X     |           | X


  • the number of cases, the number of controls, the value of the effect size and its standard error (row 1 of the table)
  • the number of cases, the number of controls, the value of the effect size and its variance (row 1 of the table)
  • the number of cases, the number of controls, the value of the effect size and the lower and upper bounds of the effect size (row 2 of the table).

Here is a concrete example of a dataset respecting this formatting:

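n_cases | n_controls | value | se    | var   | ci_lo  | ci_up
--------+------------+-------+-------+-------+--------+------
25      | 34         | 0.432 | 0.132 |       |        |
43      | 120        | 2.210 |       | 0.762 |        |
20      | 20         | 0.042 |       |       | -1.630 | 1.714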
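
The reading rules for the tables can be expressed as a small check. The Python sketch below is hypothetical and not part of metaumbrella: it encodes the two rows of the example table and tests whether a study satisfies at least one of them. It assumes that the fourth value of the first example row is a standard error and that of the second row a variance, in the order of the three combinations listed above; the last study is an invented incomplete row.

```python
def present(row, col):
    """True when the row has a non-missing value in `col`."""
    return row.get(col) is not None


# Each inner list is one row of the example table: every entry must be
# satisfied. A tuple stands for the OR symbol (one of these columns).
EXAMPLE_COMBINATIONS = [
    ["n_cases", "n_controls", "value", ("se", "var")],
    ["n_cases", "n_controls", "value", "ci_lo", "ci_up"],
]


def has_accepted_combination(row, combinations=EXAMPLE_COMBINATIONS):
    """True when `row` satisfies at least one combination."""
    for combo in combinations:
        satisfied = True
        for item in combo:
            cols = item if isinstance(item, tuple) else (item,)
            if not any(present(row, c) for c in cols):
                satisfied = False
                break
        if satisfied:
            return True
    return False


studies = [
    {"n_cases": 25, "n_controls": 34, "value": 0.432, "se": 0.132},
    {"n_cases": 43, "n_controls": 120, "value": 2.210, "var": 0.762},
    {"n_cases": 20, "n_controls": 20, "value": 0.042,
     "ci_lo": -1.630, "ci_up": 1.714},
    # Invented incomplete row: ci_up is missing, so no combination matches.
    {"n_cases": 20, "n_controls": 20, "value": 0.042, "ci_lo": -1.630},
]
print([has_accepted_combination(s) for s in studies])
```

The three rows of the example dataset pass, while the incomplete row fails because neither a standard error, a variance, nor a full confidence interval is available.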


1. "SMD"

mean_cases & mean_controls & sd_cases & sd_controls | n_cases & n_controls | value | se OR var | ci_lo & ci_up


2. "G"

n_cases & n_controls | value | se OR var | ci_lo & ci_up


3. "MD"

n_cases & n_controls | value | se OR var | ci_lo & ci_up


4. "SMC"

n_cases & n_controls | value | se OR var | ci_lo & ci_up


n_cases & n_controls | mean_pre_cases & sd_pre_cases & mean_pre_controls & sd_pre_controls & mean_cases & sd_cases & mean_controls & sd_controls


n_cases & n_controls | mean_change_cases & sd_change_cases & mean_change_controls & sd_change_controls


5. "R"

n_sample | value | se OR var | ci_lo & ci_up


6. "Z"

n_sample | value | se OR var | ci_lo & ci_up


7. "OR" or "logOR"

n_cases_exp & n_controls_exp & n_cases_nexp & n_controls_nexp | n_exp & n_nexp | n_cases & n_controls | value | se OR var | ci_lo & ci_up


8. "RR" or "logRR"

n_cases_exp & n_controls_exp & n_cases_nexp & n_controls_nexp | n_cases & n_controls | value | se OR var | ci_lo & ci_up


9. "HR" or "logHR"

n_cases & n_controls | value | se OR var | ci_lo & ci_up


10. "IRR" or "logIRR"

n_cases_exp & n_cases_nexp & time_exp & time_nexp | n_cases | time | value | se OR var | ci_lo & ci_up


C. Optional information:

The following variables do not have to be included in a well-formatted dataset, but they can be added to benefit from certain features of the functions. The names of these variables cannot be changed.

  • multiple_es: Reason for the presence of several effect sizes for a single study (i.e., several rows with the same author and year values within the same factor). It must be either "groups" or "outcomes". An example of a well-formatted dataset with multiple outcomes/groups can be downloaded, and an example analysis of a dataset with dependent effect sizes is available in a vignette of the package.
    • "groups": When "groups" is indicated, the multiple effect sizes for a study are assumed to come from independent subgroups. A single effect size per study is calculated using Borenstein's (2009) approach. For each study, the sample size is obtained by summing the participants of the different groups.
    • "outcomes": When "outcomes" is indicated, the multiple effect sizes are assumed to come from multiple outcomes (or time-points) measured within the same sample. Again, a single effect size per study is calculated using Borenstein's (2009) approach. The strength of the correlation between the outcomes (or time-points) can be indicated using either the "r" column in your dataset (see below) or the slider. Indicating the correlation in the "r" column allows different values to be used for different studies. In contrast, the slider conveniently sets a single correlation for all studies that have no value in the "r" column. For each study, the sample size is the largest sample size among its outcomes/time-points.
  • r: The value of the correlation coefficient between the outcomes/time-points of a study. The r value should be (i) within the (-1, 1) range, (ii) constant within a study, and (iii) set as NA for studies that do not include multiple outcomes.
  • shared_nexp: In some situations, several studies share participants from the same non-exposed group but compare this group to various exposed groups. When several studies in the same factor share the same non-exposed group, they should be identified as such by having the same shared_nexp value. Identifying studies that share a non-exposed group allows the calculations to be adjusted (the size of the shared sample is divided by the number of studies sharing the sample). Studies not sharing their non-exposed group should have an NA (or a unique) value in the shared_nexp column.
  • shared_controls: In some situations, several studies share participants from the same control group but compare this group to various experimental groups. When several studies in the same factor share the same control group, they should be identified as such by having the same shared_controls value. Identifying studies that share a control group allows the calculations to be adjusted (the size of the shared sample is divided by the number of studies sharing the sample). Studies not sharing their control group should have an NA (or a unique) value in the shared_controls column.
  • reverse_es: Whether users want to reverse the effect size of a study. All rows with a "reverse" value in this column will have the direction of their effect size flipped (e.g., an OR of 0.5 will be expressed as 2). Note that the reverse_es column affects both the direction of the effect size value and the information used to calculate an effect size (e.g., if the means and SDs of experimental and control groups are reported, the mean and SD of the experimental group are used as the mean and SD of the control group and vice versa). This feature is particularly useful for presenting results when several meta-analyses report the same effects in opposite directions.
  • rob: The risk of bias of each individual study. It should be either "high", "low" or "unclear". These values are used to generate the "GRADE" classification and to stratify evidence according to the 'rob' criteria in the 'Personalized' classification. Studies with a missing rob are assumed to be at high risk of bias. The approach used to reach a categorical judgment ("low" vs. "unclear" vs. "high") on the risk of bias of a study is left to the user.
  • amstar: The AMSTAR score of the meta-analysis. Note that the AMSTAR score should be constant for a given factor. These values are used only to stratify evidence according to the 'amstar' criteria in the 'Personalized' classification.
  • analysis: Whether users want to conduct specific analyses. For now, only the "allelic" value can be specified, which doubles the numbers of cases and controls.
  • discard: Whether a particular row should be removed from the analyses (any row with a "yes" or TRUE value in the discard column will be removed).
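
Some of the constraints on the optional columns can also be checked mechanically. The Python sketch below is hypothetical and not part of metaumbrella. `check_optional` verifies the allowed values of multiple_es and rob and the rules for r. `study_sample_size` illustrates how the sample size of a study with several effect sizes is described above: summed for "groups", largest value for "outcomes". The sketch assumes that missing values are represented by `None` and that the (-1, 1) notation denotes an open interval; the sample rows are invented.

```python
from collections import defaultdict

ALLOWED = {
    "multiple_es": {"groups", "outcomes"},
    "rob": {"high", "low", "unclear"},
}


def check_optional(rows):
    """Return problems found in the optional columns of `rows`."""
    problems = []
    r_by_study = defaultdict(set)
    for i, row in enumerate(rows, start=1):
        for col, allowed in ALLOWED.items():
            value = row.get(col)
            if value is not None and value not in allowed:
                problems.append(f"row {i}: invalid {col} '{value}'")
        r = row.get("r")
        if r is not None:
            # Assumption: (-1, 1) is read as an open interval.
            if not -1 < r < 1:
                problems.append(f"row {i}: r must lie between -1 and 1")
            # A study is identified by its factor, author and year.
            key = (row.get("factor"), row.get("author"), row.get("year"))
            r_by_study[key].add(r)
    for study, values in r_by_study.items():
        if len(values) > 1:
            problems.append(f"study {study}: r is not constant")
    return problems


def study_sample_size(sizes, multiple_es):
    """Sample size of a study reporting several effect sizes."""
    if multiple_es == "groups":
        return sum(sizes)   # independent subgroups: participants add up
    if multiple_es == "outcomes":
        return max(sizes)   # same sample: keep the largest size
    raise ValueError("multiple_es must be 'groups' or 'outcomes'")


# Invented rows, for illustration only.
rows = [
    {"factor": "exercise", "author": "Lee", "year": "2015",
     "multiple_es": "outcomes", "r": 0.5, "rob": "low"},
    {"factor": "exercise", "author": "Lee", "year": "2015",
     "multiple_es": "outcomes", "r": 0.8, "rob": "low"},
    {"factor": "exercise", "author": "Kim", "year": "2017",
     "multiple_es": "subgroups", "rob": "medium"},
]
for problem in check_optional(rows):
    print(problem)
print(study_sample_size([30, 45], "groups"),
      study_sample_size([30, 45], "outcomes"))
```

In the invented rows, the third row uses values outside the allowed sets, and the two rows of the first study disagree on r, which violates the rule that r is constant within a study.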