This project developed sequentially over time. The original study was part of a project for the Cochrane Back Review Group (CBRG). The additional work was funded by the Agency for Healthcare Research and Quality (AHRQ) in steps, as results of earlier analyses suggested fruitful areas for testing new hypotheses.

We applied the 11 CBRG Internal Validity criteria (van Tulder et al., 2003), which had appeared very promising in the quality scoring of Cochrane back reviews. The items cover established quality criteria (allocation concealment, blinding) as well as criteria whose potential for bias has not been investigated or for which existing studies have shown conflicting results.

The individual criteria address the adequacy of the randomization sequence generation, concealment of treatment allocation, baseline similarity of treatment groups, outcome assessor blinding, care provider blinding, patient blinding, adequacy and description of the dropout rate, analysis according to originally assigned group (intention-to-treat analysis), similarity of cointerventions, adequacy of compliance, and similarity of assessment timing across groups. The items and the scoring guideline are shown in Appendix F.
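
As a purely illustrative aside, the sketch below shows one way the 11-item checklist could be encoded for analysis; the item keys are paraphrases of the criteria listed above and are not the official CBRG item labels, and the "yes" versus "not yes" counting anticipates the scoring convention described later in this section.

```
# Hypothetical encoding of the 11 CBRG internal validity items for a single
# trial; item keys paraphrase the criteria listed above, and the allowed
# answers follow the CBRG answer mode ("yes", "no", "unclear").

CBRG_ITEMS = [
    "randomization_sequence", "allocation_concealment", "baseline_similarity",
    "assessor_blinding", "care_provider_blinding", "patient_blinding",
    "dropout_rate", "intention_to_treat", "cointervention_similarity",
    "compliance", "assessment_timing",
]

def count_criteria_met(ratings):
    """Count items scored 'yes'; 'no' and 'unclear' are both treated as 'not yes'."""
    assert set(ratings) == set(CBRG_ITEMS), "all 11 items must be rated"
    assert all(v in {"yes", "no", "unclear"} for v in ratings.values())
    return sum(v == "yes" for v in ratings.values())

# Example: a trial rated "yes" on 7 items and "unclear" on the remaining 4.
example = {item: "yes" for item in CBRG_ITEMS[:7]}
example.update({item: "unclear" for item in CBRG_ITEMS[7:]})
print(count_criteria_met(example))  # 7
```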

Each item was answered using the following categories: “Yes,” “No,” and “Unclear.” The CBRG offers concrete guidance for each answer category. Assessor blinding, for example, is scored positively when assessors were either explicitly blinded or were clearly unaware of the treatment allocation (e.g., in automated test result analysis).

A number of items are topic-specific and have to be defined individually. For each topic, a content expert (typically a clinician with trial research experience) was contacted to assist in the selection of baseline comparability variables and to establish reasonable dropout and compliance rates. The baseline comparability assessment requires that topic-specific key prognostic predictors of the outcome be specified and that the baseline comparability of the treatment groups be judged against them. For interventions that involve considerable patient commitment (e.g., presenting at multiple outpatient appointments), a dropout rate of up to about 25 percent was considered acceptable, while for other interventions a rate of up to 10 percent was considered acceptable in order to meet this criterion in the specific clinical area.

In addition, for one of the datasets the Jadad scale (Jadad et al., 1996) and the criteria proposed by Schulz et al. (1995), operationalized as in the original publications, were applied. The Jadad scale (0 to 5 points) assesses randomization (0 to 2 points), blinding (0 to 2 points), and withdrawals (0 to 1 point). The applied Schulz criteria were allocation concealment, randomization sequence, analysis of all randomized participants, and double blinding. The items together with the scoring instructions can be found in Appendix F.
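
As an illustration of the point structure described above (randomization 0-2, blinding 0-2, withdrawals 0-1), here is a minimal, simplified sketch; the boolean inputs are hypothetical and the original instrument's detailed scoring rules are not reproduced.

```
def jadad_score(randomized, randomization_method_appropriate,
                double_blind, blinding_method_appropriate,
                withdrawals_described):
    """Simplified Jadad score (0-5): randomization 0-2, blinding 0-2,
    withdrawals 0-1. Inputs are booleans; the extra domain point is only
    awarded when the basic point for that domain is given."""
    score = 0
    if randomized:
        score += 1
        if randomization_method_appropriate:
            score += 1
    if double_blind:
        score += 1
        if blinding_method_appropriate:
            score += 1
    if withdrawals_described:
        score += 1
    return score

print(jadad_score(True, True, True, False, True))  # 4
```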

This project drew on three different study pools. One (dataset 1) was available from previous work for the Cochrane Back Review Group; that project has been described in detail elsewhere (van Tulder et al., 2009). The other two (datasets 2 and 3) were obtained for the purpose of this project only. First results on the association between quality and effect sizes in dataset 1 have been published previously (van Tulder et al., 2009); all further analyses were prepared for this report only.

For the CBRG project the quality criteria were originally applied to all CBRG reviews of nonsurgical treatment for nonspecific low back pain present in the Cochrane Library 2005, issue 3. The study set was drawn from 12 reviews (Assendelft, Morton, Yu, et al., 2004; Furlan, van Tulder, Tsukayama, et al., 2005; Furlan, Imamura, Dryden, et al., 2005; Hagen, Hilde, Jamtvedt, et al., 2005; Hayden, van Tulder, Malmivaara, et al., 2005; Henschke, Ostelo, van Tulder, et al., 2005; Heymans, van Tulder, Esmail, et al., 2004; Karjalainen, Malmivaara, van Tulder, et al., 2003; Khadilkar, Odebiyi, Brosseau, et al., 2005; Roelofs, Deyo, Koes, et al., 2005; van Tulder, Touray, Furlan, et al., 2003; van Duijvenbode, Jellema, van Poppel, et al., 2005). Studies reported on pain, function, or other improvement measures. The reviews assessed the effect of acupuncture, back schools, behavioral treatment, exercise therapy, bedrest, lumbar supports, massage, multidisciplinary bio-psycho-social rehabilitation, muscle relaxants, spinal manipulative therapy, and transcutaneous electrical nerve stimulation (TENS) for the treatment of low back pain. Comparisons were against placebo, usual care, or no treatment, or between active treatments. The dataset included 216 trials.

In the first of two efforts supported by AHRQ, we assembled a second dataset of trials based on Evidence-based Practice Center (EPC) reports. We searched prior systematic reviews and meta-analyses conducted by AHRQ-funded EPCs with the goal of assembling a test set of studies that represented a wide range of clinical topics and interventions. The criteria for selection were that the EPC report had to include a meta-analysis and that the EPC had to be willing to provide us with the data on outcomes, so that we only needed to assess the quality of the included trials. The study set was drawn from 12 evidence reports, the majority of which were also published as peer-reviewed journal articles (Balk, Lichtenstein, Chung, et al., 2006; Balk, Tatsioni, Lichtenstein, et al., 2007; Chapell, Reston, Snyder, et al., 2003; Coulter, Hardy, Shekelle, et al., 2003; Donahue, Gartlehner, Jonas, et al., 2007; Hansen, Gartlehner, Webb, et al., 2008; Hardy, Coulter, Morton, et al., 2002; Lo, LaValley, McAlindon, et al., 2003; Shekelle, Morton, Hardy, 2003; Shekelle, Maglione, Bagley, et al., 2007; Shekelle, Morton, Maglione, et al., 2004; Towfigh, Romanova, Weinreb, et al., 2008). The reports addressed a diverse set of topics, including pharmacological therapies as well as behavior modification interventions. All studies included in the main meta-analysis of each report were selected; studies included in more than one report entered our analysis only once. The dataset included 165 trials.

The reports addressed pharmaceuticals (orlistat, vitamin E, drugs for arthritis, S-adenosylmethionine, chromium, atypical antipsychotics, omega-3 fatty acids); non-pharmacological interventions such as self-monitoring of blood glucose (SMBG), diet and weight loss, and chronic disease self-management (CDSM); interventions to manage and treat diabetes (chromium, SMBG, CDSM); complementary and alternative medicine/dietary supplements (vitamin E, chromium, omega-3); as well as mental health topics (Alzheimer's, obsessive-compulsive disorder [OCD]).

In each of the evidence reports, one meta-analysis (in general the analysis with the largest number of trials) was selected, and all studies included in that pooled analysis were chosen for the study pool. Only one comparison per study was included. Multiple publications and multiple outcomes were excluded so that each unique study entered the test set only once. In the majority of cases, individual studies compared the intervention to placebo or usual care.

Following the results of the analysis of the EPC reports, we obtained a third dataset of studies by replicating a selection of trials used by Moher et al. (1998). This dataset was chosen because it has shown evidence of bias for established quality criteria (see Moher et al., 1998) and is designated in this report as “pro-bias.” Since the original publication does not specify exactly which trials and which outcomes were included in this analysis, we replicated the methods described by Moher and colleagues for selection. Two reviewers independently reviewed the 11 meta-analyses chosen by Moher et al. and reconciled their assessment of the primary outcome and the main meta-analysis in the publication. Following the described approach, and because many meta-analyses did not identify a primary outcome, the primary outcome was designated as the endpoint for which the largest number of randomized controlled trials (RCTs) reported data. Individual trials present in multiple meta-analyses were included only once so that a trial did not enter our analyses more than once. Where multiple comparisons were reported in original articles, we included those data chosen in the main analysis of the 11 meta-analyses. We were able to retrieve, quality score, and abstract 100 RCTs of the originally published set (79 percent).

The trials came from meta-analyses on digestive diseases (Marshall and Irvine, 1995; Pace, Maconi, Molteni, et al., 1995; Sutherland, May, and Shaffer, 1993), circulatory diseases (Ramirez-Lasspas and Cipolle, 1988; Lensing, Prins, Davidson, et al., 1995; Loosemore, Chalmers, and Dormandy, 1994), mental health (Mari and Streiner, 1994; Loonen, Peer, and Zwanikken, 1991; Dolan-Mullen, Ramirez, and Groff, 1994), stroke (Counsell and Sandercock, 1995), and pregnancy and childbirth (Hughes, Collins, and Vandekerckhove, 1995).

The flow diagram in Figure 1 summarizes the dataset composition.

We developed and pilot tested a standardized form to record decisions for the quality criteria. For all datasets, two reviewers independently rated the quality of each study by applying the outlined quality criteria. The reviewers used the full publications to score the studies and were not blinded to authors, journals, or other variables. The reviewers were experienced in rating study quality in the context of evidence-based medicine and underwent an additional training session for this study. The pair of reviewers reconciled any disagreement through consensus; any remaining disagreements were resolved by discussion in the research team.

A statistician extracted the outcomes of the individual RCTs, together with measures of dispersion where available and the number of patients in each group. For dataset 1 (back pain), absolute effect sizes were used, as this dataset included comparisons between treatment and placebo as well as comparisons between active treatments. For dataset 2 (EPC reports), standardized effect sizes were computed for each study in order to be able to combine studies within the dataset and, where possible, across datasets. As all studies in dataset 3 (pro-bias) reported dichotomous outcomes, odds ratios (ORs) were calculated. As a quality check, the point estimate and 95 percent confidence interval (CI) of each meta-analysis was calculated and compared to the original meta-analytic result.
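
For concreteness, the following sketch shows one common way a standardized effect size (Cohen's d from group means and standard deviations) and an odds ratio from dichotomous outcomes can be computed; the report does not state which specific estimators were used, so this is illustrative rather than a description of the actual extraction procedure.

```
import math

def standardized_effect_size(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Cohen's d: difference in group means divided by the pooled SD
    (illustrative; the report does not state which standardized estimator was used)."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

def odds_ratio(events_t, n_t, events_c, n_c):
    """Odds ratio for a dichotomous outcome; values below 1 favor the
    treatment group when the event is unfavorable."""
    return (events_t / (n_t - events_t)) / (events_c / (n_c - events_c))

print(round(standardized_effect_size(12.0, 5.0, 50, 14.0, 5.5, 50), 2))  # -0.38
print(round(odds_ratio(10, 100, 20, 100), 2))                            # 0.44
```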

Figure 2 depicts the basic hypothesis of the project: the assumption that there is an association between quality features of research studies and the size of the reported treatment effect. The arrows indicate the direction of effects. The figure also depicts the assumption that other variables apart from quality will affect effect sizes, as represented by the arrow on the right. These other variables include the true effect of the intervention as well as other potential influences; quality variables may explain part of the reported effect sizes, but there are other and possibly more important factors that are not quality related (e.g., the efficacy of the treatment). The analysis covers descriptive information on the datasets, an evaluation of the association between quality and effect sizes, and an analysis of potential moderators and confounders to investigate which factors influence the association between quality criteria and effect sizes.

The three datasets were often used to replicate results obtained in one dataset in order to test the robustness of effects across datasets; some analyses were only possible in one or two datasets. The initial intention to combine all three datasets to allow more powerful analyses could not be realized due to differences in outcome measures (all RCTs in dataset 3 used dichotomous outcomes, and transforming all outcomes into continuous measures was considered problematic).

Since this analysis plan involves multiple testing, we considered several methods of accounting for it; however, these are not appropriate when tests are correlated. In addition, there is debate about the scope at which multiple testing corrections should be applied (per analysis, per study), and each of these choices would lead to different conclusions. All statistical multiple testing approaches lead to substantial loss of power (Bland and Altman, 1995). We therefore chose not to employ any of the methods to “correct” for multiple testing. Instead, our results need to be interpreted with greater caution as a result of the multiple testing.

The three datasets were derived through different means and differed in a number of ways. First, we investigated whether there were systematic differences in the level of quality within the datasets. The level of quality may vary between clinical fields, as the clinical areas may have different standards or awareness of quality features. Because the quality of published RCTs may have improved since the publication of the Consolidated Standards of Reporting Trials (CONSORT) statement in 1996, another variable we explored was the year of publication of the studies included in each dataset.

To describe the internal consistency of the quality items, inter-item correlations and the Cronbach's alpha statistic for an overall quality scale were computed in each dataset. The Pearson correlations across items were inspected both for consistency (whether the individual quality features are positively correlated) and to detect high inter-item correlations (e.g., above 0.5) as an indicator of conceptual overlap (the answer to one item allowing the answer to another to be predicted).
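
A minimal sketch of these two descriptive statistics, computed on a hypothetical trial-by-item matrix of 0/1 quality scores (1 = "yes"), is shown below; the data are simulated for illustration only.

```
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an n_trials x n_items matrix of 0/1 quality scores."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
X = (rng.random((200, 11)) < 0.5).astype(float)   # hypothetical 200 trials x 11 items

alpha = cronbach_alpha(X)
inter_item_r = np.corrcoef(X, rowvar=False)       # 11 x 11 Pearson correlation matrix
overlapping = np.argwhere(np.triu(inter_item_r, k=1) > 0.5)  # item pairs with r > 0.5

print(round(alpha, 2), len(overlapping))
```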

All of the items score quality features. It is possible that the features are independent of each other (blinding of outcome assessors is not necessarily related to the similarity of the cointerventions). However, empirically the presence of one quality indicator might increase the likelihood that a second quality criterion is also fulfilled. For example, a study that used an appropriate method for generating the randomization sequence may also be likely to have employed an appropriate method to guarantee allocation concealment. Finally, it is theoretically possible that the individual items are indicators of an underlying factor representing “quality”: a high-quality RCT is more likely to fulfill several quality criteria, and individual quality items may be indicators of this underlying quality factor.

We also used the individual quality items to create a sum scale. This overall quality score was computed by averaging the quality scores across all items, with all items weighted equally. Cronbach's alpha values range from 0 to 1; alpha coefficients above 0.7 are conventionally taken to indicate internal consistency. The Cronbach's alpha statistic was exploratory and was chosen as a measure with well-known properties, not because we assume a shared overarching latent quality factor. The included quality features may still be conceptually independent from one another and may not represent items from the same item pool of a shared latent factor.

We also used factor analysis to explore the structure of the relationships between the 11 items. Conventional exploratory factor analysis attempts to find latent factors which explain the covariance between a set of items. Factor analysis assumes an underlying factor that is hypothesized to influence a number of observed variables, that is, the individual items. Factor analysis can show whether all included items can be explained through one underlying factor (e.g., “quality”), whether there are clusters of items representing different quality aspects, or whether all 11 items are unrelated and represent unique features. Conventional factor analysis only takes the pattern of quality scores across items into account; this approach does not incorporate the relationships with an outcome (such as effect size). We used an extension of factor analysis, a multiple indicator multiple cause (MIMIC) model, which allows us to model the relationships between the items in an exploratory fashion, and simultaneously examine the relationship between the latent variables that were identified and the outcome of interest (in this case, the effect size of the study). The factor analysis hence takes the inter-item relationships as well as the strength of association with effect sizes into account.
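
Fitting the full MIMIC model requires structural equation modeling software; as a simpler illustration of the exploratory step only (latent factors accounting for inter-item covariance), the sketch below uses scikit-learn's FactorAnalysis on hypothetical item scores.

```
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
X = (rng.random((200, 11)) < 0.5).astype(float)  # hypothetical 200 trials x 11 item scores

# Exploratory step only: latent factors that account for inter-item covariance.
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = fa.components_.T      # 11 items x 2 factors
factor_scores = fa.transform(X)  # per-trial scores on the two latent variables

# In the MIMIC extension (not shown here), the effect size would be regressed
# on these latent variables simultaneously with the measurement model.
print(loadings.shape, factor_scores.shape)  # (11, 2) (200, 2)
```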

The path model shown in Figure 3 below is a simplified diagrammatic representation of the model, assuming four indicators of quality (x1 to x4), that is, individual quality features. Single-headed arrows are regression paths: the four indicators of quality are hypothesized to be explained by two latent variables, F1 and F2. The two latent (unmeasured) variables represent distinct broad quality domains but are not assumed to be completely independent of each other; we assume in our model that they are correlated (indicated by the curved, two-headed arrow).

We hypothesize that the covariances between the x variables are accounted for by the factors, that is, the latent variables. We assume the latent (unmeasured) factors (F1, F2) are responsible for the majority of variation in the individual quality criteria, and that these latent variables are also predictors of effect size. The indicator variables, that is, the individual quality items, are not conceptualized as being correlated with one another; they can be independent of each other (e.g., blinding and similarity of cointerventions). The partial correlation between individual quality indicators and the effect size is assumed to diminish to zero when controlling for the latent factors.

In summary, the effect size reported in the trials is regressed on the latent variables—thus quality is indicated by the x-variables (individual quality features), but the latent variables (unmeasured, broad quality domains) are hypothesized to predict variance in the effect size. It has to be kept in mind that variables other than quality will affect effect sizes, as represented by the arrow on the right.

To identify the appropriate number of latent factors required to account for the data, we employed fit indices (chi-square [χ²], the comparative fit index [CFI], and the root mean square error of approximation [RMSEA]). We tested a series of models, each time increasing the number of factors and comparing the improvement of the model fit. This approach is used to determine the smallest number of factors that can account for the data.

The factor analysis solution is more parsimonious and enables a large number of items to be reduced to a smaller number of underlying factors. Factor analysis allows summarizing results across items without reducing the quality data to a summary score. However, the analysis should be considered descriptive as data are not weighted by standard error as is conventional in meta-analysis.

We investigated the association between quality and effect sizes in a number of ways. First, for each of the 11 quality features, the difference in results between studies that met the criterion and those that did not was calculated. Second, a summary score was calculated across all quality components and a linear relationship between quality and effect sizes was investigated. Third, associations based on empirically derived factor scores were tested; the factor structure took into account the inter-correlations between items as well as their effects on outcomes. Fourth, we explored different cutoffs of quality scores according to the number of quality components met.

For all analyses, unless otherwise stated, we differentiated between quality items scored “yes” and those scored “not yes,” the latter including the answers “no” and “unclear.”

In the first two datasets (back pain, EPC reports) we compared the effect sizes of studies with the quality item scored “yes” and those with the quality item scored “not yes” for each of the 11 quality features. The difference in effect sizes between these two subgroups per feature was used as a measure of bias. The difference was estimated using meta-regression (Berkey et al., 1995). A meta-regression was conducted separately for each quality feature. The coefficient from each regression estimates the difference in effect sizes between those studies with the quality feature scored “yes” versus “not yes.” A difference with a significance level of p<0.05 was considered statistically significant.
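
The sketch below illustrates this per-item comparison with an inverse-variance weighted least squares regression on hypothetical data; the estimator of Berkey et al. may differ in detail (e.g., in how between-study variance is handled), so this is a simplified fixed-effect illustration only.

```
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 80
quality_yes = (rng.random(n) < 0.5).astype(float)     # hypothetical "yes" vs "not yes"
se = rng.uniform(0.1, 0.4, n)                         # hypothetical standard errors
effect = 0.5 - 0.1 * quality_yes + rng.normal(0, se)  # hypothetical effect sizes

# Fixed-effect meta-regression: weight each study by the inverse of its variance.
X = sm.add_constant(quality_yes)
fit = sm.WLS(effect, X, weights=1.0 / se**2).fit()

# The coefficient on the quality indicator estimates the difference in effect
# sizes between studies scored "yes" and "not yes" on this item.
print(fit.params[1], fit.pvalues[1])
```

The same weighted regression with the 11-item sum score, or an empirically derived factor score, as a continuous predictor corresponds to the sum score and factor score analyses described below; a random-effects variant would additionally incorporate an estimate of between-study variance into the weights.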

In the third dataset, the published “pro-bias” dataset, all studies used dichotomous outcomes. An odds ratio below 1 indicates the treatment group is doing better than the control. For the analysis we compared odds ratios (ORs) of studies where the quality criterion was either met or not met and computed the ratio of the odds ratios (ROR). The ROR is OR_no / OR_yes, where OR_no is the pooled estimate of studies without the quality feature and OR_yes is the pooled estimate of studies where the quality criterion is met.
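
A minimal sketch of the ROR computation under simple fixed-effect (inverse-variance) pooling of log odds ratios follows; the pooling method used in the published analyses may differ, so the numbers are purely illustrative of the definition ROR = OR_no / OR_yes.

```
import numpy as np

def pooled_or(log_ors, ses):
    """Fixed-effect (inverse-variance) pooled odds ratio from per-study
    log odds ratios and their standard errors."""
    w = 1.0 / np.asarray(ses) ** 2
    return np.exp(np.sum(w * np.asarray(log_ors)) / np.sum(w))

# Hypothetical per-study log odds ratios and standard errors, split by whether
# the quality criterion was met ("yes") or not met ("no").
log_or_yes, se_yes = [-0.10, -0.25, 0.05], [0.20, 0.25, 0.30]
log_or_no, se_no = [-0.45, -0.60, -0.30], [0.22, 0.28, 0.35]

ror = pooled_or(log_or_no, se_no) / pooled_or(log_or_yes, se_yes)
print(round(ror, 2))  # < 1: studies meeting the criterion report ORs closer to 1
```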

Note that the interpretation of reported differences for the first two datasets differs from that for the third. In the first two datasets, a negative difference coefficient indicated that studies with the quality item scored “yes” have smaller effect sizes than those scored “not yes”; hence, a negative difference indicates that the higher quality RCTs report less pronounced treatment effects. In the third dataset, an ROR of less than 1 indicates that high quality studies reported smaller treatment effects (i.e., ORs closer to 1) than low quality studies.

We compared results based on a random-effects meta-regression and a fixed-effects model in order to be able to match results reported in the literature.

The sum of the quality items scored “yes” was calculated across all 11 items, with all items contributing equally to the total score. To assess a linear relationship between overall quality and effect size, reported outcome results were regressed on the sum score. A negative linear relationship would indicate that reported treatment effects increase as the quality level decreases. A level of p<0.05 was considered statistically significant.

We used the empirically derived factor scores representing broad quality domains and regressed effect sizes on these factors. The factor scores were based on the inter-item relationships as well as on the items' association with the study results, thereby providing a description of distinct groups of items. The analysis was otherwise equivalent to the sum score analysis.

Different cutoffs depending on the number of criteria met were explored to differentiate high and low quality studies. The difference in effect sizes between studies above and below possible thresholds was investigated. The statistical analysis followed the approach outlined under the first method above.
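
To make the cutoff exploration concrete, here is a simplified, unweighted sketch on hypothetical sum scores and effect sizes; the actual analysis compared the subgroups with the meta-regression approach described above rather than with raw mean differences.

```
import numpy as np

rng = np.random.default_rng(3)
sum_score = rng.integers(0, 12, 150)                       # hypothetical 0-11 sum scores
effect = 0.4 - 0.02 * sum_score + rng.normal(0, 0.2, 150)  # hypothetical effect sizes

# For each possible threshold, compare studies at or above the cutoff ("high
# quality") with those below it ("low quality").
for cutoff in range(1, 12):
    high = sum_score >= cutoff
    if high.any() and (~high).any():
        diff = effect[high].mean() - effect[~high].mean()
        print(cutoff, round(diff, 3))
```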

The different methods of establishing associations between quality and effect sizes were exploratory and we did not a priori assume consistent results across methods. For example, a simple linear relationship between a total quality scale and effect sizes will not necessarily be present even when individual quality features show large associations with effect sizes; the internal consistency across items was one of the issues under investigation.

The analyses were conducted separately in each of the three datasets. Each dataset consisted of trials included in up to 12 meta-analyses. We did not correct for clustering in analyses within datasets. We do not assume nonindependence of RCTs within meta-analyses since the selection into the meta-analysis happened after the event (when the study was already conducted and published).

Effect sizes are influenced by many variables, not just the methodological quality of the research study. In addition, we have to assume from conflicting literature results that there are factors that influence the relationship between methodological quality and the effect size. Figure 4 shows a model that assumes factors influencing the association between quality and effect sizes and indicates that effect sizes are also influenced by other variables independent from trial quality.

Two effects need to be considered: confounding effects and moderating effects. These are of particular relevance in dataset 2, where papers are selected from a wide range of clinical topics and interventions.

Confounding effects occur when the quality of trials is not equally distributed across areas of study, resulting in a correlation between quality and area of study. This correlation can lead to erroneous conclusions if the area of study is not incorporated as a covariate. In extreme cases, it can lead to counter-intuitive results, an effect known as Simpson's paradox. The example in Table 1 considers two areas of study, labeled A and B, and a measure of quality, such as randomization, which is either achieved or not achieved, giving four combinations. The effect sizes are given in the table. Within area A, the effect size is 0.1 higher when the quality measure is not achieved; similarly, within area B the effect size is 0.1 higher when the quality rating is not achieved. However, studies in area B have higher effect sizes on average (0.25) than studies in area A (0.15), and studies in area B are much more likely to have achieved the quality rating. This confounding means that for subpopulations of studies the result is in one direction, but for the whole population the result is in the other direction.
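
The numbers in the sketch below are hypothetical (they are not the values from Table 1) but reproduce the pattern just described: within each area the quality-achieved studies have effect sizes 0.1 lower, yet the pooled comparison reverses sign because quality achievement is concentrated in the high-effect-size area.

```
import numpy as np

# Hypothetical cells: (area, quality_achieved, effect_size, number_of_studies).
cells = [
    ("A", True, 0.05, 10), ("A", False, 0.15, 90),
    ("B", True, 0.30, 90), ("B", False, 0.40, 10),
]
counts = [c[3] for c in cells]
area = np.repeat([c[0] for c in cells], counts)
achieved = np.repeat([c[1] for c in cells], counts)
es = np.repeat([c[2] for c in cells], counts)

# Within each area, quality-achieved studies have effect sizes 0.1 lower.
for a in ("A", "B"):
    m = area == a
    print(a, round(es[m & achieved].mean() - es[m & ~achieved].mean(), 2))  # -0.1

# Pooled over both areas, the sign flips: Simpson's paradox due to confounding.
print(round(es[achieved].mean() - es[~achieved].mean(), 2))  # +0.1
```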

The second potential issue is one of moderation. In the case of moderation, the causal effect of a quality rating varies between different substantive areas. We illustrate a moderator effect in Table 2. This example shows that for substantive area A, quality does not influence the effect size, whereas for area B there is a substantial influence of quality on effect size. Taking the average quality association would be inappropriate when the influence differs across substantive areas (and the average would, moreover, be driven by the number of studies identified in each area).
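
A small sketch of the moderation pattern, again with hypothetical numbers rather than those in Table 2: quality is unrelated to effect size in area A but strongly related in area B, and the pooled "average" association mostly reflects how many studies each area contributes.

```
# Hypothetical effect sizes by area and quality status (not the Table 2 values).
effects = {
    ("A", True): 0.20, ("A", False): 0.20,   # area A: no quality effect
    ("B", True): 0.10, ("B", False): 0.40,   # area B: substantial quality effect
}
counts = {"A": 120, "B": 30}                 # the pooled estimate depends on this mix

diff_a = effects[("A", True)] - effects[("A", False)]   # 0.0
diff_b = effects[("B", True)] - effects[("B", False)]   # -0.3
pooled = (counts["A"] * diff_a + counts["B"] * diff_b) / sum(counts.values())
print(diff_a, round(diff_b, 2), round(pooled, 2))        # 0.0 -0.3 -0.06
```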

The literature reports some conflicting results regarding the strength of association between quality features and effect sizes, indicating that we have to assume there are factors that influence the relationship between the two variables. In this project we set out to investigate the influence of four variables: the size of the treatment effect, the condition being treated, the type of outcome analyzed, and the variance in effect sizes across studies for the quality feature in question.

We tested the hypothesis that the association of quality features and reported effect sizes varies according to the size of the overall treatment effect. Strong treatment effects may mask any effects of quality features on the individual study outcome; likewise, an ineffective treatment may yield the same (null) result regardless of study quality. We computed the mean effect size for each included meta-analysis, added this variable to the regression models, and compared results between two datasets.

We tested the hypothesis that the association of quality features and effect size varies by condition. Under this hypothesis, the selection of clinical conditions in a dataset determines whether or not an association between quality and effect size can be shown. The underlying factors for this differential effect may remain unknown; we are only testing whether the association with quality features can be documented in one clinical area or group of clinical areas but not in others.

The analysis was restricted to the large and diverse EPC report dataset (dataset 2, 165 trials). The back pain studies addressed a homogeneous condition. The third dataset was too small to investigate the effects for each of the 11 included conditions (most comparisons would be incomputable) and too unbalanced (only 2 of the 11 were not drug studies, and only 1 meta-analysis was in pregnancy and childbirth).

We tested the hypothesis that the association of quality and effect sizes varies by the type of outcome analyzed. Whether an association of quality and effect sizes can be shown may depend primarily on the investigated outcome, as some types of outcomes may be more susceptible to bias than others; more objective versus more subjective outcome measures, for example, may represent different kinds of outcome types.

In the back pain dataset, the measured outcomes were all either subjective outcomes such as pain or outcomes involving clinical judgment such as “improvement,” so this set could not contribute to this moderator analysis. The outcomes in the EPC report dataset were more diverse. We distinguished automated data (hemoglobin A1c, high-density lipoprotein, and total cholesterol) from other endpoints (Alzheimer's Disease Assessment Scale cognition score, arthritis responders, reduction in seizures, pain, OCD improvement, weight loss, and depression scores). In the third dataset, we distinguished objective data such as death, pregnancy, and biochemical indicators of smoking cessation from other endpoints of a more subjective nature or involving clinical judgment (response in ulcer healing or pain relief, bleeding complications, schizophrenic relapse, ulcer healing rate, affective relapse, and maintenance of ulcerative colitis remission).

We tested the hypothesis that the association of quality features and effect sizes may depend on the variance in effect sizes across studies in a given dataset. In a dataset where there is a wide range of reported effect sizes across studies, quality may be more likely to explain differences in effect sizes across studies.