Is the Jadad Score the Proper Evaluation of Trials?

To the Editor:

Towheed1 evaluated 3 published trials and an abstract describing another trial of Pennsaid, and noted that the Jadad score was a perfect 5 of 5 for each of the 3 published trials. This uncritical acceptance of the Jadad score led to a correspondingly uncritical acceptance of the trial results. For example, Towheed1 seems convinced that the "excellent quality" of the trials, and their findings that Pennsaid is effective, prove that "Pennsaid deserves further consideration when the existing treatment guidelines for OA of the knee are updated." In fact, what needs updating is the method by which trial quality is evaluated, because there is no difficulty in finding serious methodological flaws in the very studies that earned such high praise. Given the space constraints, I will focus on only one study2, which by itself could fill volumes with examples of what not to do in good clinical research.

First, an unmasked trial was referred to as masked, and treated as masked. But masking means more than simply attempting to conceal treatment identities; it requires that this effort succeed. The authors acknowledge the garlic taste of the active treatment. In addition, the differential rate of dry skin across treatment groups could certainly lead to unmasking. A block size of 6 is quite small in an unmasked trial with 3 treatment groups, and this has to be considered a methodological flaw that allows for prediction of upcoming allocations, and hence selection bias3. Was there selection bias in this trial? We do not know, because not only was selection bias not tested, but even the baseline p values were suppressed. It is notable that many more patients with 2 bad knees ended up in the placebo group than in the active group. The worse of 2 knees will tend to be worse than a single bad knee, so this baseline imbalance represents an advantage for the active group, even if the p value exceeds 0.05. It may be argued that this was already taken care of by considering Δ [change from baseline in the Western Ontario and McMaster University Osteoarthritis Index (WOMAC) subscale score for pain] as the primary outcome measure. This leads to our next methodological flaw. The WOMAC subscale score for pain is scored on a 5-point Likert scale, and hence is non-numeric data. What does Δ represent?
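The concern about a block size of 6 with 3 treatment groups can be made concrete with a small enumeration. The sketch below is illustrative only and assumes equal allocation (2 patients per treatment in each block) and an investigator who knows every earlier allocation in the current block; the function names and treatment labels are hypothetical, and nothing here is taken from the trial itself.

```python
from fractions import Fraction
from itertools import permutations

def distinct_blocks(treatments=("A", "B", "C"), per_block=2):
    """All distinct orderings of one permuted block with equal allocation."""
    pool = [t for t in treatments for _ in range(per_block)]
    return sorted(set(permutations(pool)))

def certain_positions(block):
    """Count positions whose allocation is fully determined by the earlier ones."""
    remaining = {t: block.count(t) for t in set(block)}
    certain = 0
    for t in block:
        # If only one treatment still has open slots, the next allocation is known.
        if sum(1 for n in remaining.values() if n > 0) == 1:
            certain += 1
        remaining[t] -= 1
    return certain

blocks = distinct_blocks()
mean_certain = Fraction(sum(certain_positions(b) for b in blocks), len(blocks))

print(len(blocks))    # 90 distinct block orderings
print(mean_certain)   # 6/5 allocations per block known with certainty
```

Under these assumptions, on average 1.2 of every 6 allocations (20%) can be predicted with certainty, and further allocations can be guessed with better-than-chance accuracy from the counts remaining in the block. This is the mechanism by which small blocks in an unmasked trial open the door to selection bias.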

One concern, for example, is the equating of all one-category shifts. Is a change from 0 to 1 the same as a shift from 1 to 2, from 2 to 3, or from 3 to 4? This seems highly unlikely, and so a table is needed showing how many patients in each treatment group shifted from each given baseline score to each given subsequent score, as in Berger, et al 4. This most basic of data presentations is not provided, so the reader cannot produce a reasonable analysis (one that the authors should have provided) that is not corrupted by the artificial assumption that the 5 categories are spaced uniformly. Then, to make matters worse, some of these baseline scores were actually measured at Day 1, that is, subsequent to randomization. The potential for bias goes without saying5,6. To compensate, there was an unplanned increase in sample size, which essentially is an interim analysis with no penalty applied. Again, the potential for bias is clear to any beginning biostatistics student. Moreover, there were additional missing data, and these were imputed by carrying forward the last observation, with no mention of any sensitivity analyses. This is also quite problematic7.
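The kind of shift table requested above can be sketched in a few lines. The records below are invented purely for illustration (they are not data from the trial), the categories are assumed to be coded 0 to 4, and the function name shift_table is hypothetical.

```python
from collections import Counter

# Hypothetical records: (treatment group, baseline category, final category).
# These values are made up for illustration and are NOT trial data.
records = [
    ("active", 2, 1), ("active", 3, 1), ("active", 2, 2), ("active", 1, 0),
    ("placebo", 2, 2), ("placebo", 3, 2), ("placebo", 2, 1), ("placebo", 1, 1),
]

def shift_table(records, group, n_categories=5):
    """Rows are baseline categories, columns are final categories, cells are patient counts."""
    counts = Counter((b, f) for g, b, f in records if g == group)
    return [[counts[(b, f)] for f in range(n_categories)]
            for b in range(n_categories)]

for group in ("active", "placebo"):
    print(group)
    for baseline, row in enumerate(shift_table(records, group)):
        print(f"  baseline {baseline}: {row}")
```

A table of this form reports each baseline-to-final transition separately, so a reader can analyze the shifts without assuming that the 5 categories are equally spaced.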

The primary analysis is an analysis of covariance (ANCOVA). The assumptions underlying the ANCOVA model include normality of residuals, equal variances, linearity, and independence. It is not likely that these assumptions can all be met, and when they are not met the ANCOVA may not be robust8. Because it does not require such assumptions, a nonparametric analysis offers better robustness properties, and so should have been used instead. As it stands, the low p value rejects the combination of the null hypothesis and all assumptions, and hence may be attributable to the falsity of any of the assumptions instead of to the falsity of the null hypothesis. To make matters worse, the analysis labeled as intent-to-treat is based on a subset of the true intent-to-treat sample. That is, there is a post-randomization exclusion. The bias this can create is bad enough, but to call the analysis "intent-to-treat" is unconscionable.
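As one illustration of what a nonparametric alternative might look like, the sketch below computes an exact two-sided permutation p value for a rank-sum statistic with midranks for ties. It is a simplified stand-in rather than the analysis the trial should necessarily have used: it compares only 2 groups (the trial had 3), it does not adjust for baseline, the data are invented, and the function names are hypothetical.

```python
from itertools import combinations

def midranks(values):
    """Assign average ranks so that tied ordinal scores share one rank."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return [rank_of[v] for v in values]

def exact_rank_sum_test(group_a, group_b):
    """Two-sided exact permutation p value for the rank sum of group_a."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a, n = len(group_a), len(ranks)
    observed = sum(ranks[:n_a])
    expected = n_a * (n + 1) / 2  # mean rank sum under the null hypothesis
    extreme = total = 0
    # Enumerate every way of assigning n_a of the n patients to group_a.
    for idx in combinations(range(n), n_a):
        total += 1
        stat = sum(ranks[i] for i in idx)
        if abs(stat - expected) >= abs(observed - expected) - 1e-9:
            extreme += 1
    return extreme / total

# Hypothetical final pain categories (0 to 4) in two small groups.
print(exact_rank_sum_test([0, 0, 1], [2, 3, 4]))  # 0.1
```

The reference distribution here comes from the randomization itself, so the p value uses only the ordering of the categories and does not rest on normality, equal variances, or linearity. Full enumeration is practical only for small samples; larger trials would need a Monte Carlo sample of permutations instead.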

The issues discussed include: (1) unmasking; (2) prediction of future allocations; (3) selection bias; (4) performing arithmetical operations on numbers assigned fairly arbitrarily to non-numeric categories; (5) failure to present the most meaningful data structures; (6) using post-randomization data as baseline data; (7) failing to apply a penalty for an unplanned interim analysis; (8) carrying forward the last observation without mentioning any sensitivity analyses; (9) using an analysis requiring so many unverifiable assumptions that it cannot be taken seriously in the context of an actual clinical trial; and (10) excluding some post-randomization data from the analysis. One can easily anticipate the responses of the authors when trying to defend their work. The study was masked, because the paper says that it was. This takes care of the first 3 issues. As for the rest: the categories are nearly equally spaced; the treatment did not yet have time to influence Day 1 data; the increase in sample size was not based on an attempt to get a nearly significant result to become significant; last observation carried forward (LOCF) and ANCOVA are industry standards; and only one randomized patient was excluded from the analysis called intent-to-treat.

In fact, one may be able to argue convincingly that any one of these issues cannot by itself invalidate the findings, or the conclusions based on the findings. But if any one bias can explain the results ("or" logic), or even if a combination of them can do the trick, then the conclusions are not supported. Supporting the conclusions therefore requires arguing that none of these 10 flaws materially affected the outcomes. Some of these arguments would be hard to support, but even if 10 solid arguments were provided, this still would not absolve the authors of their responsibility to conduct good research. Clearly, they did not, and this remains true even if it is found that the many flaws did not materially affect the research. The questions we are left with are (1) Is Pennsaid in fact effective for osteoarthritis of the knee? and (2) Is the Jadad score the way to evaluate trial quality? While deferring to those more knowledgeable than I am regarding the first question, I can offer an unqualified "No" to the second. It would take a much more comprehensive set of checks than the Jadad score offers to be able to replace critical thinking and evaluation with a checklist. Until such a comprehensive checklist is developed, peer review is needed to weed out flawed research. Clearly, peer review also failed in the case of this study.

VANCE W. BERGER, PhD, National Cancer Institute and University of Maryland Baltimore County, Biometry Research Group, National Cancer Institute, Executive Plaza North, Suite 3131, 6130 Executive Boulevard, MSC 7354, Bethesda, Maryland 20892-7354, USA.

Address reprint requests to Dr. V.W. Berger. E-mail: vb78c@nih.gov

REFERENCES

1. Towheed TE. Pennsaid therapy for osteoarthritis of the knee: a systematic review and metaanalysis of randomized controlled trials. J Rheumatol 2006;33:567-73.

2. Bookman AM, Williams KSA, Shainhouse JZ. Effect of a topical diclofenac solution for relieving symptoms of primary osteoarthritis of the knee: a randomized controlled trial. CMAJ 2004;171:333-8.

3. Berger VW. Selection bias and covariate imbalances in randomized clinical trials. Chichester: John Wiley & Sons; 2005.

4. Berger VW, Zhou YY, Ivanova A, Tremmel L. Adjusting for ordinal covariates by inducing a partial ordering. Biometrical Journal 2004;46:48-55.

5. Voutilainen PE. Assessment of grouping variable should have been blind in trial of dementia [letter]. BMJ 2001;322:1491.

6. Berger VW. Valid adjustment for binary covariates of randomized binary comparisons. Biometrical Journal 2004;46:589-94.

7. Unnebrink K, Windeler J. Intention-to-treat: methods for dealing with missing values in clinical trials of progressively deteriorating diseases. Stat Med 2001;20:3931-46.

8. Lachenbruch PA, Clements PJ. ANOVA, Kruskal-Wallis, normal scores, and unequal variance. Communications in Statistics — Theory and Methods 1991;20:107-26.



Return to August 2006 Table of Contents



© 2006. The Journal of Rheumatology Publishing Company Limited.
All rights reserved.