Stat 425
Introduction to Nonparametric Statistics
Rank Tests for Comparing Two Treatments
Fritz Scholz
Spring Quarter 2009
May 16, 2009

Textbooks
Erich L. Lehmann, Nonparametrics: Statistical Methods Based on Ranks, Springer Verlag 2008 (required)
W. John Braun and Duncan J. Murdoch, A First Course in STATISTICAL PROGRAMMING WITH R, Cambridge University Press 2007 (optional)
For other introductory material on R see the class web page:
http://www.stat.washington.edu/fritz/Stat425 2009.html

Comparing Two Treatments
Often it is desired to understand whether an innovation constitutes an improvement over current methods/treatments.
Application Areas: New medical drug treatment, surgical procedures, industrial process change, teaching method, ...
Example 1 (New Drug): A mental hospital wants to investigate the (beneficial?) effect of some new drug on a particular type of mental or emotional disorder.
We discuss this in the context of five patients (suffering from the disorder to roughly the same degree) to simplify the logistics of getting the basic ideas out in the open.
Three patients are randomly assigned the new treatment while the other two get a placebo pill (also called the “control treatment”).

Precautions: Triple Blind!
The assignments of new drug and placebo should be blind to the patient to avoid “placebo effects”.
It should be blind to the staff of the mental hospital to avoid conscious or subconscious staff treatment alignment with either one of these treatments.
It should also be blind to the physician who evaluates the patients after a few weeks on this treatment.
A placebo effect occurs when a patient thinks he/she is on the beneficial drug, shows some beneficial effect, but in reality is on a placebo pill (power of positive thinking, which is worth something).

Evaluation by Ranking
The physician (an outside observer) evaluates the condition of each patient, giving
rank 1 to the patient with the most severe level of the studied disorder,
rank 2 to the patient with the next most severe level,
...
and rank 5 to the patient with the mildest disorder level.
Ranking is usually a lot easier than objectively measuring such disorder levels.
However, it may become problematic for a large number of patients, possibly requiring several ranking iterations (coarse ranking → refined ranking).

How to Judge the Ranking Results?
If the patients on the new drug all receive high ranks we might consider that to be evidence in favor of the new drug.
How do we judge the strength of that evidence?
As benchmark we will consider the null hypothesis $H_0$ that the new drug is acting just like a placebo, i.e., has no effect or just a placebo effect w.r.t. the disorder condition. In that case there is no difference between placebo and new treatment.
The rankings would have been the same, no matter which way the new drug and the placebo were assigned, i.e., whether they were assigned at the start of the study or after the ranking.

Possible Ranking Results
treated    (3,4,5)  (2,4,5)  (1,4,5)  (2,3,5)  (1,3,5)  (1,2,5)  (2,3,4)  (1,3,4)  (1,2,4)  (1,2,3)
controls   (1,2)    (1,3)    (2,3)    (1,4)    (2,4)    (3,4)    (1,5)    (2,5)    (3,5)    (4,5)
There are 10 different ways to split the ranks up among treatment and control group.
Under $H_0$ the rankings would have been the same regardless of treatment/control.
Our randomization makes all 10 splits of ranks equally likely, each with probability 1/10.
This can be used to assess the statistical significance (rarity) of any observed result, when trying to judge the effectiveness of the new treatment as opposed to just the luck of the draw. (later)
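The ten splits can be listed mechanically. The following is a minimal sketch in R, assuming only base R's combn; the variable names (treated, controls, splits) are illustrative and not from the slides.

```r
# All ways to choose the 3 treatment ranks out of the ranks 1..5.
N <- 5
n <- 3
treated <- combn(N, n)                 # 3 x 10 matrix, one split per column
# The control ranks are whatever is left over in each split.
controls <- apply(treated, 2, function(s) setdiff(1:N, s))   # 2 x 10 matrix

n.splits <- ncol(treated)              # number of possible splits
prob.each <- 1 / n.splits              # equal chance of each split under H0

splits <- data.frame(
  treated  = apply(treated, 2, paste, collapse = ","),
  controls = apply(controls, 2, paste, collapse = ",")
)
print(splits)
```

combn lists the splits in lexicographic order, so the rows come out in a different order than in a hand-made table, but the set of splits is the same.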

More Study Subjects
For ease of exposition we dealt with $N = 5$ subjects, split into $n = 3$ with treatment and $m = N - n = 2$ with placebo.
What do we get for larger $N$ and different $m$, $n$ with $N = m + n$?
The number of ways of selecting a group of $n$ (without regard to order) from $N$ is
$$\binom{N}{n} = \frac{N!}{n!\,(N-n)!} = \frac{N(N-1)\cdots(N-n+1)}{1\cdot 2\cdots n}, \qquad \binom{5}{3} = \frac{5\cdot 4\cdot 3}{1\cdot 2\cdot 3} = 10.$$
This binomial coefficient is also referred to as the number of combinations of $N$ things, taken $n$ at a time.
The text gives Table A for a range of $m$ and $n$. In R we get $\binom{N}{n}$ = choose(N, n) for a much wider range of $n$ and $m = N - n$, e.g., choose(50,25) = 1.264106e+14.
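As a quick check of the counting formula, here is a small sketch in R that evaluates the binomial coefficient three ways for N = 5 and n = 3; the variable names are illustrative.

```r
N <- 5
n <- 3

# N! / (n! (N - n)!)
by.factorials <- factorial(N) / (factorial(n) * factorial(N - n))
# N (N - 1) ... (N - n + 1) / (1 * 2 * ... * n)
by.product <- prod(N:(N - n + 1)) / prod(1:n)
# built-in binomial coefficient
by.choose <- choose(N, n)

print(c(by.factorials, by.product, by.choose))

# A case far beyond what a printed table would cover.
big <- choose(50, 25)
print(big)
```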

The Distribution of Ranks
Under $H_0$ each set of ordered ranks $S_1 < S_2 < \ldots < S_n$ associated with the treatment group has the same chance $1/\binom{N}{n}$, i.e.,
$$P(S_1 = s_1, \ldots, S_n = s_n) = \frac{1}{\binom{N}{n}}$$
for each of the $\binom{N}{n}$ possible ordered $n$-tuples $(s_1, \ldots, s_n)$.

Another Example
Example 2 (Effect of Discouragement): 10 subjects were split up randomly into two groups of 5 each.
All 10 subjects were given form L of the revised Stanford-Binet (IQ) test, under the conditions prescribed by this test.
Two weeks later they were given form F of this test, the controls under the usual conditions but the ‘treatment’ group receiving in addition some prior discouraging comments concerning their previous performance.
The actual scores on the prior test were not disclosed to the subjects.
The following differences (second test score − first test score) were observed:
Controls:   5   0  16   2   9
Treated:    6  -5  -6   1   4

Ranks of Test Score Difference
Giving rank 1 to the subject with the smallest difference in scores, rank 2 to the second smallest difference, etc., the ranks in the treatment group were
1 2 4 6 8
while the control group had ranks
3 5 7 9 10.
The ranks as well as the original test score differences suggest that the discouragement treatment has some detrimental effect.
Its statistical significance will be assessed later.
Again we use as benchmark the null hypothesis $H_0$ under which the discouragement has no effect at all, i.e., the test score differences would be the same, with or without discouragement.
In that case all $\binom{10}{5} = 252$ splits of ranks into treatment and control groups would be equally likely, with chance 1/252 each.
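The ranking step can be reproduced with R's rank function. This sketch assumes the control differences 5, 0, 16, 2, 9 and the treated differences 6, -5, -6, 1, 4 (with these signs, which are the ones that yield treated ranks 1, 2, 4, 6, 8); the variable names are illustrative.

```r
controls <- c(5, 0, 16, 2, 9)
treated  <- c(6, -5, -6, 1, 4)     # signs assumed as stated in the lead-in

# Joint ranking of all 10 differences, rank 1 = smallest difference.
r <- rank(c(controls, treated))
control.ranks <- sort(r[1:5])
treated.ranks <- sort(r[6:10])

print(treated.ranks)
print(control.ranks)

# Number of equally likely splits of the 10 ranks under H0.
n.splits <- choose(10, 5)
print(n.splits)
```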

Comparing Examples 1 and 2
Both examples are similar, resulting in ranks for treatment and control groups.
Under the hypothesis $H_0$ of no treatment effect all possible splits of the ranks into treatment and control group are equally likely.
Yet, some may argue that we should be using the original test score differences, since by ranking them we lose some detail/information.
We will return to this issue later.

The Role of Randomness
In both discussed examples we dealt with fixed subjects; they were not randomly chosen from some population.
Randomness entered through our deliberate action of randomizing treatment assignments.
Under $H_0$ this randomness is very well understood and does not involve unknown parameters (nonparametric). The distribution of the ranks is known.
In the case of test score differences it is conceivable to view them as a random sample from some (hypothetical) population.
In the case of ranking emotional disorder it does not appear feasible to view it in the context of a population model, without absolute objective scores (not just rankings within the 5 subjects).

The Randomization Model
The imposed or induced randomization in the previous two examples gives us a known statistical model under $H_0$.
We call it the randomization model.
This is in contrast to population models, where subjects are drawn at random from a population of such subjects.
In that case any conclusions drawn reflect back on that sampled distribution.
In the case of the randomization model any such conclusion can logically only pertain to the much smaller “universe” of randomized subjects.

Difficulties in Comparing Sets of Ranks
Having obtained a set of ranks for the treatment group and assuming that a treatment effect would result in higher subject rankings, we need a criterion that allows us to judge whether a set of treatment ranks $S_1 < \ldots < S_n$ is generally higher than the set of control group ranks $R_1 < \ldots < R_m$.
Comparing such vectors $(S_1, \ldots, S_n)$ and $(R_1, \ldots, R_m)$ faces two difficulties:
1) the vectors may have different lengths, i.e., $n \ne m$, and
2) vectors have several components, which may be compared on a component by component basis (if $n = m$), but different order relations may result for the various rank coordinates.
There are many possible ways of dealing with these issues, but the simplest one seems to be to sort such rank vectors by their rank-sum
$$W_s = S_1 + \ldots + S_n,$$
rejecting $H_0$ whenever $W_s$ is sufficiently large, say, when $W_s \ge c$.

The Wilcoxon Rank-Sum Test
The test defined by
$$W_s = S_1 + \ldots + S_n$$
and rejecting $H_0$ whenever $W_s \ge c$ is called the Wilcoxon rank-sum test.
This term distinguishes it from another Wilcoxon test to be discussed later.
When there is no confusion, as in the current context, we omit the qualifier rank-sum.
The constant $c$, the critical value, is chosen such that $P_{H_0}(W_s \ge c) = \alpha$, where the significance level $\alpha$ is some specified small number.

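Determining the Critical Value
treated ranks   (3,4,5)  (2,4,5)  (1,4,5)  (2,3,5)  (1,3,5)  (1,2,5)  (2,3,4)  (1,3,4)  (1,2,4)  (1,2,3)
w               12       11       10       10       9        8        9        8        7        6
Null distribution of $W_s$:
w                      6    7    8    9    10   11   12
$P_{H_0}(W_s = w)$     .1   .1   .2   .2   .2   .1   .1
Significance levels for the possible critical values:
c                      6    7    8    9    10   11   12   13
$P_{H_0}(W_s \ge c)$   1.0  .9   .8   .6   .4   .2   .1   0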
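The null distribution and the tail probabilities for N = 5 and n = 3 can be recomputed by enumeration. The following is a minimal sketch in R; the names null.pmf and tail.prob are illustrative.

```r
N <- 5
n <- 3

# Rank-sum Ws of every possible treatment group (one value per split).
w <- combn(N, n, FUN = sum)

# P(Ws = w) under H0: relative frequency of each rank-sum value.
null.pmf <- table(w) / length(w)
print(null.pmf)

# P(Ws >= c) for every candidate critical value c, including max(w) + 1.
cs <- min(w):(max(w) + 1)
tail.prob <- sapply(cs, function(cc) mean(w >= cc))
names(tail.prob) <- cs
print(tail.prob)
```

With only 10 equally likely splits, every attainable significance level is a multiple of 0.1.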

Limited Set of Significance Levels
As we saw previously, the number of possible significance levels is limited.
This persists to some degree even for larger $N$.
We can compromise and choose one of the possible significance levels near the desired $\alpha$, or we can report the p-value or significance probability, i.e., $p(w_{obs}) = P_{H_0}(W_s \ge w_{obs})$, where $w_{obs}$ is the actually observed value of $W_s$.
The latter is preferable since it expresses more clearly how strongly we reject $H_0$ with the observed value of $W_s$.
We reject $H_0$ at level $\alpha$ $\iff$ p-value $p(w_{obs}) \le \alpha$.

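Reject $H_0$ at level $\alpha$ $\iff$ $p(w_{obs}) \le \alpha$
For a level $\alpha$ test we choose the smallest critical point $c$ such that $P_{H_0}(W_s \ge c) \le \alpha$.
Denote this $c$ by $c_\alpha$ with corresponding type I error probability
$$P_{H_0}(W_s \ge c_\alpha) = \alpha_{c_\alpha} \le \alpha.$$
Note that by definition we have
$$P_{H_0}(W_s \ge c_\alpha - 1) > \alpha \qquad (*)$$
For any observed value $w_{obs} \ge c_\alpha$ (reject $H_0$ at level $\alpha$ since $W_s \ge c_\alpha$) we have
$$p(w_{obs}) = P_{H_0}(W_s \ge w_{obs}) \le P_{H_0}(W_s \ge c_\alpha) = \alpha_{c_\alpha} \le \alpha.$$
Conversely,
$$p(w_{obs}) \le \alpha \implies w_{obs} \ge c_\alpha,$$
because if $w_{obs} < c_\alpha$, i.e., $w_{obs} \le c_\alpha - 1$ (since both $w_{obs}$ and $c_\alpha$ are integers), we would then have $p(w_{obs}) = P_{H_0}(W_s \ge w_{obs}) \ge P_{H_0}(W_s \ge c_\alpha - 1) > \alpha$ (see $(*)$), a contradiction.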
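The equivalence between rejecting via the critical value and rejecting via the p-value can be checked numerically. This is a small sketch in R for N = 5, n = 3 and an arbitrarily chosen level of 0.25; the helper names p.value and crit.value are hypothetical, not from the slides.

```r
N <- 5
n <- 3
w <- combn(N, n, FUN = sum)          # all possible rank-sums under H0

# p-value of an observed rank-sum: P(Ws >= w.obs).
p.value <- function(w.obs) mean(w >= w.obs)

# Smallest critical point c with P(Ws >= c) <= alpha.
crit.value <- function(alpha) {
  cs <- min(w):(max(w) + 1)
  min(cs[sapply(cs, p.value) <= alpha])
}

alpha <- 0.25                        # illustrative level, chosen for this sketch
c.alpha <- crit.value(alpha)
print(c.alpha)

# Compare the two rejection rules over every attainable value of Ws.
w.values <- min(w):max(w)
reject.by.c <- w.values >= c.alpha
reject.by.p <- sapply(w.values, p.value) <= alpha
print(cbind(w.values, reject.by.c, reject.by.p))
```

The two logical columns agree for every attainable rank-sum, which is the content of the argument above for this particular case.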

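Significance Levels for $N = 13$ & $n = 8$
[Table of critical values $c$ and significance levels $P_{H_0}(W_s \ge c)$, laid out in three column pairs, starting from the largest rank-sum $13 + 12 + 11 + 10 + 9 + 8 + 7 + 6 = 76$. Only the final entries of the last column are legible: 0.99845, 0.99922, 1.]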

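R Code for Previous Table
    Wilcoxonsig.levels <- function(N = 13, n = 8) {
      # rank-sums of all choose(N, n) possible treatment groups, sorted
      out <- sort(combn(1:N, n, FUN = sum))
      # candidate critical values, largest first, starting one above the maximum
      c.unique <- rev(unique(out))
      cx <- c(max(c.unique) + 1, c.unique)
      # alpha[i] = P_H0(Ws >= cx[i])
      alpha <- numeric(length(cx))
      for (i in seq_along(cx)) {
        alpha[i] <- mean(out >= cx[i])   # same as sum(out >= cx[i]) / choose(N, n)
      }
      cbind(cx, alpha)
    }
Note the use of the function combn. It evaluates the sum of all combinations of $N$ taken $n$ at a time, i.e., it evaluates the rank-sum for all possible splits.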

combn(5,3) = combn(5,3,FUN = NULL)
    > combn(5,3)
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    [1,]    1    1    1    1    1    1    2    2    2     3
    [2,]    2    2    2    3    3    4    3    3    4     4
    [3,]    3    4    5    4    5    5    4    5    5     5
It gives (as columns) all possible combinations of $N = 5$ taken $n = 3$ at a time.
    > combn(5,3,FUN = sum)
     [1]  6  7  8  8  9 10  9 10 11 12
This gives the corresponding sums for each column.

Tabulation of Rank-Sum Distribution
The text gives Table B for $P_{H_0}(W_{XY} \le a)$ where $W_{XY} = W_s - n(n+1)/2$.
Table B covers $\{m = 3, 4,\ m \le n \le 12\}$ and $\{5 \le m \le 10,\ m \le n \le 10\}$.
The preference for tabulating the null distribution of $W_{XY}$ instead of $W_s$ is based on the property that the null distribution of $W_{XY}$ is the same for $(n, m) = (k_1, k_2)$ as for $(n, m) = (k_2, k_1)$.
The possible values are $W_{XY} = 0, 1, \ldots, mn$.
The choice of the symbol $W_{XY}$ for $W_s - n(n+1)/2$ will become clear when we discuss the Mann-Whitney test.
It also makes Table B more compact since the smallest value of $W_s$ is $1 + 2 + \ldots + n = n(n+1)/2$, which can be quite large.
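The claim that swapping the two group sizes leaves the null distribution of the shifted rank-sum unchanged can be checked by enumeration. Below is a minimal sketch in R for group sizes 3 and 5; the helper wxy.dist is hypothetical, not from the text.

```r
# Null distribution of WXY = Ws - n(n + 1)/2 on the values 0, 1, ..., m*n,
# obtained by enumerating all choose(N, n) treatment groups.
wxy.dist <- function(n, m) {
  N <- n + m
  ws <- combn(N, n, FUN = sum)            # rank-sum of each possible group
  wxy <- ws - n * (n + 1) / 2             # shift so the smallest value is 0
  tabulate(wxy + 1, nbins = m * n + 1) / choose(N, n)
}

d35 <- wxy.dist(n = 3, m = 5)
d53 <- wxy.dist(n = 5, m = 3)

print(round(rbind(d35, d53), 4))
```

Both rows of the printed matrix coincide, although the unshifted rank-sums range over 6 to 21 in one case and 15 to 30 in the other.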

R Function for $W_{XY}$
R gives the distribution of $W_{XY}$ for a much wider range of values $n$, $m$.
dwilcox(x, m, n, log = FALSE) = $P_{H_0}(W_{XY} = x)$
pwilcox(q, m, n, lower.tail = TRUE, log.p = FALSE) = $P_{H_0}(W_{XY} \le q)$
qwilcox(p, m, n, lower.tail = TRUE, log.p = FALSE) = $p$-quantile of $W_{XY}$ = $\min\{x : P_{H_0}(W_{XY} \le x) \ge p\}$
rwilcox(nn, m, n) = random sample of size nn from the $W_{XY}$ null distribution
Warning
These functions can use large amounts of memory and stack (and even crash R if the stack limit is exceeded and stack-checking is not in place) if one sample is large (several thousands or more).
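To connect these functions with the rank-sum itself, the sketch below recomputes the upper-tail probabilities of the rank-sum for N = 5, n = 3 via pwilcox and compares them with brute-force enumeration. It relies on pwilcox(q, ..., lower.tail = FALSE) returning the probability of a value strictly greater than q, hence the extra shift by 1. The last lines are an illustration for Example 2 under the assumption that small rank-sums are what a detrimental effect would produce; the variable names are illustrative.

```r
# Example 1 setting: N = 5 subjects, n = 3 treated, m = 2 controls.
N <- 5
n <- 3
m <- N - n
shift <- n * (n + 1) / 2                 # smallest possible rank-sum, here 6
cs <- 6:13                               # candidate critical values for Ws

# P(Ws >= c) = P(WXY >= c - shift) = P(WXY > c - shift - 1)
tail.pwilcox <- pwilcox(cs - shift - 1, m, n, lower.tail = FALSE)

# Brute-force check by enumerating all choose(N, n) splits.
ws <- combn(N, n, FUN = sum)
tail.brute <- sapply(cs, function(cc) mean(ws >= cc))
print(rbind(cs, tail.pwilcox, tail.brute))

# Example 2 setting (m = n = 5): lower tail at the treated rank-sum
# 1 + 2 + 4 + 6 + 8 = 21, i.e. P(WXY <= 21 - 15). The choice of the
# lower tail is an assumption of this sketch.
p.lower <- pwilcox(21 - 5 * 6 / 2, 5, 5)
print(p.lower)
```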

Symmetry of Null Distribution of $W_s$
The null distribution of $W_s$ is symmetric around $n(N+1)/2$, i.e., the null distribution of $W_s - n(N+1)/2$ is symmetric around zero:
$$P(W_s - n(N+1)/2 = a) = P(W_s - n(N+1)/2 = -a)$$
or
$$P(W_s - n(N+1)/2 \ge a) = P(W_s - n(N+1)/2 \le -a).$$
To see this consider ranking all subjects in reverse order: rank 1 becomes rank $N$, rank 2 becomes rank $N - 1$, etc. Denote these reverse ranks by $S_1', \ldots, S_n'$ with $S_i' = N + 1 - S_i$.
Since $P(S_1' = s_1, \ldots, S_n' = s_n) = 1/\binom{N}{n}$, the rank-sum
$$W_s' = S_1' + \ldots + S_n' = [(N+1) - S_1] + \ldots + [(N+1) - S_n] = n(N+1) - W_s$$
has the same null distribution as $W_s$. Subtracting $n(N+1)/2$ on both sides we get
$$W_s - n(N+1)/2 \overset{D}{=} W_s' - n(N+1)/2 = n(N+1)/2 - W_s.$$
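The symmetry can also be confirmed by enumeration. The following is a minimal sketch in R for N = 13 and n = 8, where the center is 8 · 14 / 2 = 56; the variable names are illustrative.

```r
N <- 13
n <- 8

ws <- combn(N, n, FUN = sum)        # all 1287 possible rank-sums under H0
centre <- n * (N + 1) / 2           # claimed center of symmetry
d <- ws - centre                    # centered rank-sums

a.values <- 0:max(d)
# Point symmetry: as many splits with d == a as with d == -a.
sym.point <- sapply(a.values, function(a) sum(d == a) == sum(d == -a))
# Tail symmetry: as many splits with d >= a as with d <= -a.
sym.tail <- sapply(a.values, function(a) sum(d >= a) == sum(d <= -a))

print(all(sym.point))
print(all(sym.tail))
print(range(ws))
```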

The Mann-Whitney Test Statistic
Suppose we have scores $X_1, \ldots, X_m$ for the control group and scores $Y_1, \ldots, Y_n$ for the treatment group. Assume that all scores are different.
Define the Mann-Whitney statistics
$\widetilde{W}_{XY}$ = number of pairs $(X_i, Y_j)$ with $X_i < Y_j$
$\widetilde{W}_{YX}$ = number of pairs $(X_i, Y_j)$ with $Y_j < X_i$.
Let $Y_{(1)} < \ldots < Y_{(n)}$ be the order statistics of $Y_1, \ldots, Y_n$ with corresponding ranks $S_1 < \ldots < S_n$.
Since $Y_{(1)}$ has rank $S_1$ there are $(S_1 - 1)$ $X$'s less than $Y_{(1)}$.
Similarly, we have $(S_2 - 2)$ $X$'s less than $Y_{(2)}$, since $Y_{(1)} < Y_{(2)}$ contributes 1 to $S_2$,
...
$(S_n - n)$ $X$'s less than $Y_{(n)}$, since $Y_{(1)} < Y_{(2)} < \ldots < Y_{(n-1)} < Y_{(n)}$ contribute $n - 1$ to $S_n$.
$$\implies \widetilde{W}_{XY} = (S_1 - 1) + (S_2 - 2) + \ldots + (S_n - n) = W_s - (1 + 2 + \ldots + n) = W_s - n(n+1)/2 = W_{XY}$$
The Wilcoxon and the Mann-Whitney statistics are equivalent (di