Assessment for Students with Disabilities
Technical Report 1
June 2010

PADI | Using Evidence-Centered Design and Universal Design for Learning to Design Science Assessment Tasks for Students with Disabilities

Project: Principled Science Assessment Designs for Students with Disabilities

Geneva Haertel, Angela Haydel DeBarger, Britte Cheng, Jose Blackorby, Harold Javitz, Liliana Ructtinger, and Eric Snow, SRI International
Robert J. Mislevy and Ting Zhang, University of Maryland
Elizabeth Murray, Jenna Gravel, and David Rose, Center for Applied Special Technology
Alexis Mitman Colker, Independent Consultant
Eric G. Hansen, Educational Testing Service

Report Series Published by SRI International

SRI International
Center for Technology in Learning
333 Ravenswood Avenue
Menlo Park, CA

Technical Report Series Editors
Alexis Mitman Colker, Ph.D., Project Consultant
Geneva D. Haertel, Ph.D., Co-Principal Investigator
Robert Mislevy, Ph.D., Co-Principal Investigator
Ron Fried, Documentation Designer

Copyright 2010 SRI International. All Rights Reserved.

ASSESSMENT FOR STUDENTS WITH DISABILITIES
TECHNICAL REPORT 1

Using Evidence-Centered Design and Universal Design for Learning to Design Science Assessment Tasks for Students with Disabilities

June 2010

Prepared by:
Geneva Haertel, Angela DeBarger, Britte Cheng, Jose Blackorby, Harold Javitz, Liliana Ructtinger, & Eric Snow, SRI International
Robert J. Mislevy & Ting Zhang, University of Maryland
Elizabeth Murray, Jenna Gravel, & David Rose, CAST, Inc.
Alexis Mitman Colker, Independent Consultant
Eric G. Hansen, Educational Testing Service

Acknowledgments
This material is based on work supported by the Institute of Education Sciences, Department of Education, under Grant R324A070035 (Principled Assessment Designs for Students with Disabilities).

Disclaimer
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of IES.

CONTENTS

Abstract  iii
1.0 Purpose  1
2.0 State of Science Assessment for Students with Disabilities  2
  2.1 Focus on Middle School Science  2
  2.2 NLTS2 Background  2
3.0 Theoretical Frameworks  5
  3.1 Universal Design for Learning  5
  3.2 Evidence-Centered Design  6
4.0 Integration of UDL and ECD in PADI Online Assessment Design System  9
5.0 Method
  5.1 Infusing UDL in PADI Design Patterns
    5.1.1 Variable Feature Categories Derived from UDL Principles
  5.2 Background on Design Patterns, Construct Validity, and Specific Assessment Contexts
  5.3 Examples of Task …
References  27
Appendix A: Variable Features by UDL Category with Examples  29

LIST OF FIGURES

Figure 1. Percentile Distribution of Scores on the WJ3 Science Concepts Subtest  3
Figure 2. Percentiles on the WJ3 Science Concepts Subtest by Disability Category  4
Figure 3. Design Pattern Template  10
Figure 4. Example Item 1: A Time-Distance Graph  19
Figure 5. Example Item 2: Features of Plant and Animal Cells  22
Figure 6. Example Item 3: The Boiling Points of Two Beakers of Water  24

LIST OF TABLES

Table 1. Design Space and Application Space Descriptors  16

ABSTRACT

This report describes a design methodology for improving the validity of inferences about the performance of students with disabilities on large-scale science assessments. The work to date combines the use of “universal design for learning” (UDL) with “evidence-centered design” (ECD) to redesign statewide science items to more accurately evaluate the knowledge and skills of all students, including those with high incidence disabilities (mild mental retardation and learning disabilities). This report (1) describes the state of science assessment for students with disabilities, (2) overviews the ECD and UDL frameworks and describes how these frameworks were integrated within a working Web-based assessment design system, (3) describes how the Web-based system helps guide designers through the complex decisions prerequisite to the development of assessments for students with disabilities, and (4) presents examples of redesigned science assessment items and design documentation.

1.0 PurposeThe No Child Left Behind Act requires that students with disabilities be included in stateassessments and accountability. However, the use of accommodations, modifications, andalternate assessments to permit the inclusion of students with disabilities has given rise to anumber of issues related to fairness and test validity. Recently, researchers have begun toexplore whether tests can be designed from the outset to be more accessible and valid for awider range of students; this approach is termed "universal design." The researchers on thisproject are studying the use of universal design for learning (UDL) paired with an approachtermed "evidence-centered design" (ECD) to redesign or develop assessment items that canmore accurately evaluate the knowledge and skills of all students on statewide tests. Theacademic content focus of this study is middle school science, but if successful, the approach canbe applied to other topics and age ranges. In this study, the researchers' specific goals are (1) toevaluate the validity of inferences that can be drawn from existing state science assessments forstudents with and without high incidence disabilities (learning disabilities and mild mentalretardation), (2) to redesign assessment items to increase the validity for students both with andwithout disabilities, (3) to conduct empirical studies of the validity of inferences drawn from thescores on the redesigned items, and (4) to develop research-based guidelines that can be usedin large-scale assessment design and development to increase the validity of inferences fromscience assessment scores for all students.This paper describes a design methodology for improving the validity of inferences about theperformance of students with disabilities on large-scale science assessments. 
We present work to date from a study that combines the use of “universal design for learning” (UDL) with “evidence-centered design” (ECD) to redesign statewide science items to more accurately evaluate the knowledge and skills of all students, including those with high incidence disabilities (mild mental retardation and learning disabilities). This report (1) describes the state of science assessment for students with disabilities, (2) overviews the ECD and UDL frameworks and describes how these frameworks were integrated within a working Web-based assessment design system, (3) describes how the Web-based system helps guide designers through the complex decisions prerequisite to the development of assessments for students with disabilities, and (4) presents examples of redesigned science assessment items and design documentation.

2.0 State of Science Assessment for Students with Disabilities

2.1 Focus on Middle School Science

The decision to focus project research on science assessments for students with disabilities was motivated by the extension of NCLB to science in 2007 and by an understanding that success in science coursework serves as a pipeline to scientific careers as well as to greater postsecondary education and labor market opportunities for students. Our focus on middle school students was motivated by the formal introduction of scientific reasoning and problem solving in grades 6 through 8 and by the interdependence of reading, math, and science knowledge, skills, and abilities. Science instruction and assessment are noted for abstract content, challenging vocabulary, textbooks written at difficult readability levels, and complex lab activities. Inability to engage successfully with these curricula and the more complex science content can lead high school students to opt out of science classes and scientific career trajectories.

Although some special education researchers have developed interventions and outlined best practices for instructing students with disabilities in science (Mastropieri & Scruggs, 1992; Mastropieri & Scruggs, 1995; McCleery & Tindal, 1999; Norman, Caseau, & Stefanich, 1998), science education for students with disabilities has historically been a lower priority in research programs than reading/language arts and mathematics. Moreover, whereas science assessment tasks that entail declarative and procedural knowledge (Li & Shavelson, 2001) require students to recognize and recall information, tasks that entail schematic or strategic knowledge further challenge students to execute or evaluate problem solutions as well as to judge the appropriateness of the knowledge applied — precisely the areas affected by many cognition-based disabilities.

2.2 NLTS2 Background

The National Longitudinal Transition Study-2 (NLTS2), funded by the U.S.
Department of Education, Institute of Education Sciences (IES), is collecting information from parents, youth, and schools from 2001 to 2010. The study provides a national picture of the educational programs, accommodations, and in- and out-of-school outcomes of young people with disabilities as they transition from secondary school to early adulthood roles. The NLTS2 sample comprises 11,275 students in all disability categories, stratified by geographic region, Local Education Agency (LEA) size, and LEA wealth. NLTS2 data summaries generalize to the national population of youth with disabilities, as well as to each disability category individually. NLTS2 collects longitudinal data via telephone surveys of parents and youth, paper-based surveys of teachers, and face-to-face assessments of academic performance. To be included in the assessment, students were required to be able to speak and understand English or ASL and to be able to complete all measures required for the study using the same accommodations

provided to them in the course of day-to-day instruction and assessment. Data presented here are student scores on the Woodcock-Johnson III (WJ3) assessment (Woodcock, McGrew, & Mather, 2001). Data from four subtests were obtained: science concepts, applied problems, passage comprehension, and calculation. Here, we examine the science concepts subtest scores.

Evidence from the WJ3 (Woodcock, McGrew, & Mather, 2001) shows that students with disabilities have difficulties in science in addition to reading and mathematics. On average, students with disabilities nationally score at the 24th percentile on the science concepts subtest, and one in three students has scores below the 5th percentile. Figure 1 illustrates that there is considerable variation in performance both within and across disability categories. This implies that this project’s careful analysis of task requirements, both relevant and irrelevant to the constructs being measured, is appropriate, as task requirements may differentially affect the performance of students with different disabilities. For example, in the boxplot below (see Figure 1), students with mental retardation and multiple disability diagnoses have both significantly lower and less variable performance on the science concepts subtest relative to students in other disability categories (for example, learning disability, visual impairment). This pattern is typical of many subject areas tested as part of NLTS2, including the mathematics applied problems subtest, which presents students with problems consistent with the types of scientific reasoning commonly introduced in middle school science. It is important to note that student performances on the WJ3 passage comprehension and calculation subtests are even more variable than those shown below.

Figure 1. Percentile Distribution of Scores on the WJ3 Science Concepts Subtest

Figure 2. Percentiles on the WJ3 Science Concepts Subtest by Disability Category

Figure 2 presents an alternative view of these data that illustrates the need to better understand the ways that disabilities differentially limit student performance on assessments. Given that achievement data often are reported in terms of the proportions of students above or below a particular threshold, we use this method with a 40th-percentile-rank threshold to illustrate the variation found in scores on the WJ3 science concepts subtest. This is the way that accountability systems organize achievement data; it also provides rough estimates of the level of improvement required for students to meet proficiency targets. The figure illustrates that students with disabilities may not be well represented by traditional reporting of achievement data, due to the highly skewed distribution of achievement scores, although the degree of skewness depends heavily on students’ disability category. Variation across disability categories indicates that to gauge student performance (and possibly progress over time) more accurately, we need to understand the variation in student achievement data; specifically, assessment designers need to document the ways we anticipate task requirements, students’ abilities, and their particular disabilities will interact in testing situations. In addition, from a policy perspective it is important to explore whether some of the achievement gap in test scores can be closed by improvements in test design and administration; for example, by applying universal design for learning (UDL) principles implemented through the PADI online assessment design system — the goal of our project.
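The threshold-based reporting described above amounts to a simple computation: for each disability category, the share of students whose percentile-rank score falls below the cut point. A minimal sketch in Python, using made-up percentile ranks rather than NLTS2 data (the function name and sample values are illustrative only):

```python
def proportion_below_threshold(scores_by_category, threshold=40):
    """For each category, compute the share of percentile-rank scores
    that fall below the reporting threshold (default: 40th percentile)."""
    result = {}
    for category, scores in scores_by_category.items():
        below = sum(1 for s in scores if s < threshold)
        result[category] = below / len(scores)
    return result

# Hypothetical percentile ranks, NOT actual NLTS2 data
sample = {
    "learning disability": [12, 35, 48, 22, 55],
    "mental retardation": [3, 8, 15, 5, 11],
}
shares = proportion_below_threshold(sample)
```

A report built this way surfaces exactly the skewness issue noted above: two categories with similar means can have very different shares below the cut point.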

3.0 Theoretical Frameworks

The redesign of statewide science items in this project was based on principles of ECD and UDL. In the following section, we describe the principles underlying each of these frameworks.

3.1 Universal Design for Learning

Universal design for learning (UDL) helps to meet the challenge of diversity by suggesting flexible assessment materials, techniques, and strategies (Dolan, Rose, Burling, Harris, & Way, 2007). The flexibility of UDL empowers assessors to meet the varied needs of students and to accurately measure student progress. The UDL framework includes three guiding principles that address three critical aspects of any learning activity, including its assessment. The first principle, multiple means of representation, addresses the ways in which information is presented. The second principle is multiple means of action and expression. This principle focuses on the ways in which students can interact with content and express what they are learning. Multiple means of engagement is the third principle, addressing the ways in which students are engaged in learning (Meyer & Rose, 2006; Rose & Meyer, 2002; Rose, Meyer, & Hitchcock, 2005). Described in more detail below, these principles provide structure for the infusion of UDL into assessment.

Principle I. Provide Multiple Means of Representation (the "what" of learning).
Students differ in the ways that they perceive and comprehend information that is presented to them. For example, those with sensory disabilities (e.g., blindness or deafness), learning disabilities (e.g., dyslexia), language or cultural differences, and so forth, may all require different ways of approaching content. Others may simply grasp information better through visual or auditory means rather than printed text.
In reality, there is no one means of representation that will be optimal for all students; providing options for representation is essential.

Principle II: Provide Multiple Means of Action and Expression (the "how" of learning).
Students differ in the ways that they can interact with materials and express what they know. For example, individuals with significant motor disabilities (e.g., cerebral palsy), those who struggle with strategic and organizational abilities (executive function disorders, ADHD), those who have language barriers, and so forth, approach learning tasks very differently and will demonstrate their mastery very differently. Some may be able to express themselves well in written text but not in oral speech, and vice versa. In reality, there is no one means of expression that will be optimal for all students; providing

options for expression is essential.

Principle III: Provide Multiple Means of Engagement (the "why" of learning).
There is also an affective component to learning. Students differ markedly in the ways in which they can be engaged or motivated to learn. Some students enjoy spontaneity and novelty, while others do not, preferring strict routine. Some will persist with highly challenging tasks while others will give up quickly. In reality, there is no one means of engagement that will be optimal for all students; providing multiple options for engagement is essential.

3.2 Evidence-Centered Design

Evidence-centered assessment design (ECD) was formulated by Robert Mislevy, Linda Steinberg, and Russell Almond (2003) at Educational Testing Service. ECD builds on developments in fields such as expert systems (Breese, Goldman, & Wellman, 1994), software design (Gamma, Helm, Johnson, & Vlissides, 1994), and legal argumentation (Tillers & Schum, 1991) to make explicit, and to provide tools for building, assessment arguments that help in both designing new assessments and understanding familiar ones (Mislevy & Riconscente, 2005). Two complementary ideas organize the effort. The first is an overarching conception of assessment as an argument from imperfect evidence. Specifically, it involves making explicit the claims (the inferences that one intends to make based on scores) and the nature of the evidence that supports those claims (Hansen & Mislevy, 2008). The second idea is distinguishing layers at which activities and structures appear in the assessment enterprise, all to the end of instantiating an assessment argument in operational processes. By making the underlying evidentiary argument more explicit, the framework makes operational elements more amenable to examination, sharing, and refinement. Making the argument more explicit also helps designers meet diverse assessment needs caused by changing technological, social, and legal environments (Hansen & Mislevy, 2008).
In ECD, assessment is expressed in terms of five layers that provide structure for different kinds of work and information at different stages of the process:

Domain Analysis. In the domain analysis layer, research and experience about the domains and skills of interest are gathered — information about the knowledge, skills, and abilities (KSAs) of interest, the ways people acquire KSAs and use them, the situations under which the KSAs are employed, and the indicators of successful application of the KSAs.

Domain Modeling. In the domain modeling layer, information from domain analysis is organized to form the assessment argument. Domain modeling structures the outcomes of domain analysis in a form that reflects the narrative structure of an assessment argument, in order to ground the more technical models in the next layer. The PADI online assessment design system uses objects called design patterns to assist task designers with domain modeling. Design patterns play a key role in the present project, as we consider the impact of UDL principles and accommodations on task design and evaluation.

Conceptual Assessment Framework (CAF). The CAF layer concerns technical specifications for operational elements, including measurement models, scoring methods, test assembly specifications, and requirements and protocols for assessment delivery. An assessment argument laid out in narrative form at the domain modeling layer is here expressed in terms of coordinated pieces of machinery: specifications for tasks, measurement models, scoring methods, and delivery requirements within templates. The central models within the CAF are the Student Model, Evidence Model, and Task Model. In addition, the Assembly Model determines how tasks are assembled into tests, the Presentation Model indicates the requirements for interaction with a student (e.g., simulator requirements), and the Delivery Model specifies requirements for the operational setting. Details about task features, measurement-model parameters, stimulus material specifications, and the like are expressed in the CAF model templates in terms of knowledge representations and data structures, which guide their implementation and ensure their coordination. These templates are essentially blueprints that specify, at a meta-level, the necessary elements for tasks.
The present project will include some work at the CAF layer, as we develop example templates that demonstrate how tasks can be developed in accordance with UDL principles and modified in accordance with student needs.

Assessment Implementation. The work in this layer includes activities in preparation for testing examinees, such as authoring tasks, calibrating items, finalizing rubrics, producing materials, producing presentation environments, and training interviewers and scorers, all in accordance with the assessment arguments and test specifications created in previous layers of ECD. The ECD approach links the rationales for each layer back to the assessment argument and provides structures that support opportunities for reuse and interoperability.

Assessment Delivery. The work in this layer includes activities such as presenting tasks to examinees, evaluating performances to assign scores, and reporting the results to provide feedback to students themselves, teachers, decision-makers, or other stakeholders.

The ECD framework described in this report applies principles of evidentiary reasoning to handle the complexities of the validity argument (Cronbach & Meehl, 1955; Messick, 1989, 1994; Kane, 1992) associated with accessibility features. The key idea is to lay out the evidentiary structures — or what may be termed the validity argument (or “validation argument” [National Research Council, 2004, p. 104]). An assessment argument can be summarized as comprising: (a) a claim about a person possessing a certain targeted proficiency at a given level, (b) the data (e.g., scores) that would likely result if the person possessed the targeted proficiency at that level, (c) the warrant (or rationale, based on theory and experience) that tells why the person’s level of the targeted proficiency would yield the expected score, and (d) “alternative explanations” for the person’s high or low scores (i.e., explanations other than the person’s level of the targeted proficiency). The existence of alternative explanations that are both significant and credible might indicate that validity is threatened or compromised (Messick, 1989).

Much of the analysis that is the focus of this project has to do with these alternative explanations — factors that can hinder an assessment from yielding valid inferences. When the potential for such alternative explanations is recognized at the earliest stages of test design, later rework and retrofitting can be avoided. The ECD accessibility effort has focused on building argument structures that might help anticipate and address key details of these alternative explanations, particularly as they relate to test takers with disabilities (Hansen & Mislevy, 2008).
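The four-part argument structure above can be sketched as a small record. The field names follow the report's (a)-(d) enumeration; the class, its example content, and the helper method are illustrative only, not part of the ECD framework itself:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssessmentArgument:
    """Illustrative record of the four-part assessment argument:
    claim, data, warrant, and alternative explanations."""
    claim: str                      # (a) targeted proficiency at a given level
    data: List[str]                 # (b) scores expected if the claim holds
    warrant: str                    # (c) rationale linking proficiency to scores
    alternative_explanations: List[str] = field(default_factory=list)  # (d)

    def validity_threatened(self) -> bool:
        # Per Messick (1989), credible alternative explanations
        # for a score signal a threat to validity.
        return bool(self.alternative_explanations)

# Hypothetical example content
example = AssessmentArgument(
    claim="Student can interpret a time-distance graph",
    data=["correct response to the graph item"],
    warrant="Graph-interpretation skill produces correct readings of slope",
    alternative_explanations=["heavy reading load obscured the graph task"],
)
```

Writing the argument down this way makes the project's focus concrete: the redesign work operates almost entirely on the fourth field, shrinking the list of alternative explanations.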

4.0 Integration of UDL and ECD in PADI Online Assessment Design System

Principled Assessment Designs for Inquiry (PADI) was a project supported by the National Science Foundation to improve the assessment of science inquiry (through the Interagency Educational Research Initiative under grant REC-0129331). The PADI project has developed a design framework for assessment tasks based on the evidence-centered design (ECD) framework. PADI was developed as a system for designing blueprints for assessment tasks, with a particular eye toward science inquiry tasks — tasks that stress scientific concepts, problem solving, building models, using models, and cycles of investigation. The PADI framework guides an assessment developer’s work through design structures that embody assessment arguments and take advantage of the commonalities across assessments for sharing and reusing conceptual and operational elements (Mislevy & Haertel, 2006). PADI provides a conceptual framework, data structures, and supporting software tools for this work. The PADI online assessment design system is fully operational.

ECD seeks to integrate the processes of assessment design, authoring, delivery, scoring, and reporting. Work within PADI, however, is focused on design layers that lie above the level of specific environments for task authoring and assessment delivery. The key PADI design objects involved in the present project are design patterns and templates.

PADI assessment design patterns (analogous to those in architecture and software engineering) capture design rationale in a reusable and generative form in the domain modeling layer of assessment. They help designers think through substantive aspects of an assessment argument in a structure that spans specific domains, forms, grades, and purposes (Mislevy et al., 2003). Assessment designers working with the PADI design system use the Web-based design interface for design patterns (see Figure 3 for the design pattern template).
Key attributes of a design pattern are summarized as follows:

Figure 3. Design Pattern Template

In a design pattern, four key attributes, namely Focal Knowledge, Skills, and Other Abilities (Focal KSAs), Additional KSAs, Characteristic Features, and Variable Features, are particularly important for building the assessment argument for students with or without disabilities. Hansen and Mislevy (2007, p. 12) describe these four key attributes as follows:

Focal KSAs. These are the primary knowledge/skills/abilities of students that one wants to know about and that are addressed by the assessment (Mislevy et al., 2003). Comparability of scores between individuals with and without disabilities is important, which suggests that one should seek evidence about the same set of Focal KSAs, regardless of whether or not the test taker has a disability.

Additional KSAs. These are the other knowledge/skills/abilities that may be required in a task (Mislevy et al., 2003). For tests of academic subjects, the abilities to "see" and "hear" are typically Additional KSAs. On the other hand, for assessments of sight and hearing, respectively, sight and hearing are likely to be defined as Focal KSAs. Notice that there are many disabilities that involve impairments of sight, hearing, or both (e.g., blind, low vision, color-blind, deaf, hard of hearing, deaf-blind). Cognitive issues such as dyslexia, attention deficit, and executive processing limitations also can be addressed. Deficits in such Additional KSAs can cause unduly low scores among test takers with disabilities.

Potential Observations. These are possible things that students could say, do, or make that give evidence about the Focal KSAs.

Potential Work Products. These are various modes or formats in which students might produce the evidence relevant to the Focal KSAs.

Characteristic Features. Characteristic Features of the assessment are the features that must be present in a situation in order to evoke the desired evidence about the Focal KSAs (Mislevy et al., 2003).

Variable Features. Variable Features are features that can be varied to shift the difficulty or focus of tasks (Mislevy et al., 2003). Variable Features have a particularly significant role with respect to test takers with disabilities and other subpopulations (e.g., speakers of minority languages). Much of our attention will be on manipulating Variable Features to reduce or eliminate demands for Additional KSAs in which there may be a deficit, while making sure (to the extent possible) that demands for Focal KSAs have not been changed.
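The attributes above can be read as fields of a record, and the manipulation of Variable Features as an operation that removes a construct-irrelevant Additional KSA demand while leaving the Focal KSAs untouched. A sketch under those assumptions (the class, function, and all item content are hypothetical, not actual PADI objects):

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class DesignPattern:
    """Illustrative record of the design pattern attributes described above."""
    title: str
    focal_ksas: List[str]                 # what the task is meant to measure
    additional_ksas: List[str]            # other demands the task imposes
    potential_observations: List[str]
    potential_work_products: List[str]
    characteristic_features: List[str]    # must be present to evoke evidence
    variable_features: List[str]          # can vary to shift difficulty/focus

def remove_additional_demand(pattern: DesignPattern, ksa: str) -> DesignPattern:
    """Return a copy with one Additional KSA demand removed;
    Focal KSAs are deliberately left unchanged."""
    return replace(
        pattern,
        additional_ksas=[k for k in pattern.additional_ksas if k != ksa],
    )

# Hypothetical item, loosely modeled on the time-distance graph example
graph_item = DesignPattern(
    title="Interpret a time-distance graph",
    focal_ksas=["Relate the slope of a graph to speed"],
    additional_ksas=["Decoding dense written directions", "Fine motor control"],
    potential_observations=["Correct identification of the faster interval"],
    potential_work_products=["Selected multiple-choice response"],
    characteristic_features=["A time-distance graph is presented"],
    variable_features=["Read-aloud support", "Size of graph labels"],
)
revised = remove_additional_demand(graph_item, "Decoding dense written directions")
```

The invariant the function enforces in miniature is exactly the one the report states: demands for Additional KSAs may be reduced, but demands for Focal KSAs must not change.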

5.0 Method

In this study, ECD and UDL were applied to a subset of 20 preexisting statewide science practice items that were developed for online delivery in the state of Kansas. As preparation for the redesign process, design patterns were completed that represent the assessment arguments underlying many of the items. Specifically, a design pattern helps identify whether task requirements elicit proficiency on intended test constructs (Focal KSAs) or inadvertently contribute variance to student scores that is not relevant to the construct being measured (construct-irrelevant Additional KSAs). Based on this analysis, revisions were made to item designs to reduce the influence of construct-irrelevant Additional KSAs. In the following section, we describe how we implemented revisions to the Kansas practice assessment items.

Construct validity is the sine qua non of assessment properties: to what degree do the evidence and rationale for the data gathered in an assessment support the inferences or decisions that a user wants to make? In the literature on accommodated assessment, the question typically centers on whether a given alteration of a task “changes the construct” (Standards for Educational and Psychological Testing; AERA, APA, & NCME, 1985, p. 78). Specifically, if an alteration changes the construct, then construct validity has been violated; conversely, if the alteration does not change the construct, then construct validity has not been violated.

Yet for assessment designers and developers, as well as some other audiences, there is often a need to reason more deeply about the relationships between construct validity and task design. We would argue that it is important to specify more carefully what knowledge and skills need to be assessed and at what levels; assessment designers need to determine the essence of the intended construct that is to be assessed and what knowledge and skills influence test performance but are not the intended construct.
This cannot be determined simply by examiningthe tasks on a test, because all of the knowledge, skills, and abilities needed to do well on a testare jointl
