
CHAPTER 13
CODING DESIGN, CODING PROCESS, AND CODER RELIABILITY STUDIES

INTRODUCTION

The proficiencies of PISA respondents were estimated based on their performance on the test items administered in the assessment. In the PISA 2015 assessment, countries¹ taking part in the computer-based assessment (CBA) administered 18 clusters of trend items from previous cycles – six clusters each of mathematics, reading and science – and six clusters of new science items developed for 2015. Countries that chose to take part in the Financial Literacy assessment administered two additional clusters of financial literacy items. The tests in countries that used paper-based assessment (PBA) were based solely on the 18 clusters of items from previous PISA cycles.

¹ PISA participants can be a country, a region, an economy, or a subsample within the former three types of entities. In this chapter, the generic terms "countries" or "participants" are used for the purpose of simplicity.

The PISA 2015 tests consisted of both selected- and constructed-response items. Selected-response items had predefined correct answers that could be computer-coded. While some of the constructed-response items were automatically coded by computer, others elicited a wider variety of responses that could not be categorized in advance and therefore required human coding. The breakdown of all test items by domain, item format, and coding method is shown in Table 13.1.

Table 13.1: Number of cognitive items by domain, item format, and coding method

[The table reports, for each domain and each coding method (human or automatic), the number of constructed-response, simple selected-response and complex selected-response items; the cell values are not recoverable from this transcription.]

Note:
1. Consistent with previous cycles, easier and standard forms were developed for mathematics and literacy. The number in each cell corresponds to the standard forms, while the number in parentheses corresponds to the easier forms.
2. New science and financial literacy are CBA domains only.
3. The six parts of the trend reading unit Employment, R219, were separately coded to achieve consistent and accurate scoring. Note that, in the final item counts, four parts related to completing an employment application form were counted as a single item.

The multiple coding design in PISA 2015 included all human-coded items so that coder reliabilities could be monitored within as well as across countries. This chapter describes the coding procedures and preparation, the coding design options, and the coder reliability studies.

CODING PROCEDURES

For CBA participants, the coding designs for the CBA responses for mathematics, reading, science, and financial literacy (when applicable) were greatly simplified through use of the Open-Ended Coding System (OECS). This computer system, developed for PISA 2015, supported coders in their work to code the CBA responses while ensuring that the coding design was appropriately implemented. Detailed information about the system was included in the OECS Manual.

The OECS system worked offline, meaning coders did not need a network connection, and it organized responses according to the agreed-upon coding designs. During the CBA coding, coders worked only with individual PDF files, one for each item, containing one page per item response to be coded. Each page displayed the item stem or question, the individual response, and the available codes for the item. The coder was instructed to click the circle next to the selected code, which was then saved within the file. Also included on each page were two checkboxes labeled "recoded" and "defer". The recoded box was to be checked if the response had been recoded by another coder for any reason. The defer box was used if the coder was not sure what code to assign to the response; these deferred responses were later reviewed and coded. It was expected that coders would code the majority of the responses for which they were responsible and defer responses only in unusual circumstances. When deferring a response, it was suggested that the coder enter comments into the box labeled "comment" to indicate the reason for deferring the given response. Coders worked on one file until all responses in that file were coded, and the process was repeated until all items were coded. This approach of coding item by item has been shown to improve reliability and was greatly facilitated by the OECS.

For PBA participants, the coding designs for the PBA responses for mathematics, reading, and science were supported by the Data Management Expert (DME) software, and reliability was monitored through the Open-Ended Reporting System (OERS), a computer tool that worked in conjunction with the DME software to evaluate and report reliability for paper-based open constructed responses. Detailed information about the system was provided in the OERS Manual. The coding process for PBA participants involved using the actual paper booklets, with some booklets single coded and others multiple coded by two or more coders. When single coding, coders marked directly in the booklets. When multiple coding, coders coded first on coding sheets, while the last coder coded directly in the booklet.

National centres used the output reports generated by the OECS and OERS to monitor irregularities and deviations in the coding process. Careful monitoring of coding reliability plays an important role in data quality control. Through the OECS/OERS output reports, coding inconsistencies or problems within and across countries could be detected early in the coding process, allowing action to be taken as soon as possible. The OECS/OERS worked in concert with the DME database to generate two types of reliability reports: i) proportion agreement and ii) coding category distributions. National Project Managers (NPMs) were instructed to investigate whether a systematic pattern of irregularities existed and was attributable to a particular coder or item. In addition, they were instructed not to carry out resolution (e.g. changing the coding of individual responses to reach higher coding consistency). Instead, if systematic irregularities were identified, all responses from the affected item or coder needed to be recoded, including those that showed disagreement as well as those that showed agreement. In general, inconsistencies or problems were due to misunderstanding of the general scoring guidelines and/or the rubric for a particular item, or to misuse of the OECS/OERS. Coder reliability studies also made use of the OECS/OERS reports submitted by national centres.
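As an illustration of this kind of monitoring, the sketch below aggregates multiple-coded responses by item and by coder to flag possible systematic irregularities. It is a minimal example, not the OECS/OERS implementation: the record layout and the item and coder labels are assumptions, and the 85% cut-off (the per-item criterion discussed later in this chapter) is applied to coders as well purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

# One record per (response, coder) pair from the multiple-coded set.
# Layout and labels are assumptions for this sketch, not the OECS/OERS format.
records = [
    # (response_id, item_id, coder_id, code)
    ("resp0001", "item01", "101", "1"),
    ("resp0001", "item01", "102", "1"),
    ("resp0002", "item01", "101", "0"),
    ("resp0002", "item01", "102", "1"),
]

def agreement_by(records, key_index):
    """Share of pairwise comparisons in agreement, grouped by item (key_index=1)
    or by coder (key_index=2)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    by_response = defaultdict(list)
    for rec in records:
        by_response[(rec[0], rec[1])].append(rec)
    for coded in by_response.values():
        for a, b in combinations(coded, 2):
            match = int(a[3] == b[3])
            for rec in (a, b):
                totals[rec[key_index]] += 1
                hits[rec[key_index]] += match
    return {key: hits[key] / totals[key] for key in totals}

# Flag items or coders whose agreement falls below an illustrative 85% threshold,
# prompting the NPM to investigate and, if necessary, have all of that item's or
# coder's responses recoded rather than resolving individual disagreements.
for label, index in (("item", 1), ("coder", 2)):
    for key, agreement in agreement_by(records, index).items():
        if agreement < 0.85:
            print(f"Check {label} {key}: pairwise agreement {agreement:.0%}")
```

Run on real coding exports, a report of this kind points the national centre to the item or coder behind a systematic pattern, in line with the recoding rule described above.
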
CODING PREPARATION

Prior to the assessment, a number of key activities were completed by National Centres to prepare for the process of coding responses to the human-coded constructed-response items.

Recruitment of National Coder Teams

National Project Managers were responsible for assembling a team of coders. Their first task was to identify a lead coder who would be part of the coding team and additionally be responsible for the following tasks:

• training coders within the country;
• organizing all materials and distributing them to coders;
• monitoring the coding process;
• monitoring the inter-rater reliability and taking action when the coding results were unacceptable and required further investigation;
• retraining or replacing coders if necessary;
• consulting with the international experts if item-specific issues arose; and
• producing reliability reports.

The lead coder was required to be proficient in English (as international training and interactions with the contractors were in English only) and to attend the international coder trainings in Malta in January 2014 and in Portugal in January 2015. It was also assumed that the lead coder for the Field Trial would retain the role for the Main Survey. When this was not the case, it was the responsibility of the National Centre to ensure that the new lead coder received training equivalent to that provided at the international coder training prior to the Field Trial.

The guidelines for assembling the rest of the coding team included the following requirements:

• all coders should have more than a secondary qualification (i.e., a high school degree); university graduates were preferable;
• all should have a good understanding of secondary level studies in the relevant domains;
• all should be available for the duration of the coding period, which was expected to last two to three weeks;
• due to normal attrition rates and unforeseen absences, it was strongly recommended that lead coders train a backup coder for their teams; and
• two coders for each domain MUST be bilingual in English and the language of the assessment.

International Coder Training

Detailed coding guides were developed for all the new science items; these included coding rubrics as well as examples of correct and incorrect responses. For trend items, coding information from previous cycles was included in the coding guides. For new items, coding rubrics were defined for the Field Trial, and information from the Field Trial coding was then used to revise the coding guides for the Main Survey.

Prior to the Field Trial, NPMs and lead coders were provided with a full item-by-item coder training in Malta in January 2014. The Field Trial training covered all the items across all domains. Prior to the Main Survey, NPMs and lead coders were provided with a new round of full item-by-item coder training in Portugal in January 2015.

The Main Survey training covered all new items as well as a set of trend science and trend reading items that required additional training based on the Field Trial experience.

During these trainings, the coding guides were presented and explained. Training participants practiced coding on sample items and discussed any ambiguous or problematic situations as a group. By focusing on the sample responses that were most challenging to code, training participants had the opportunity to ask questions and have the coding rubrics clarified as much as possible. When the discussion revealed areas where rubrics could be improved, those changes were made and included in an updated version of the coding guide documents made available after the meeting. As in previous cycles, a "workshop" version of the coding guides was also prepared for the national training. This version included a more extensive set of sample responses, the official coding for each response, and a rationale for why each response was coded as shown.

To support the national teams during their coding process, a coder query service was offered. This allowed national teams to submit coding questions and receive responses from the relevant domain experts. National teams were also able to review questions submitted by other countries along with the responses from the test developers. In the case of trend items, responses to queries from previous cycles were also provided. A summary report of coding issues was provided on a regular basis, and all related materials were archived in the PISA 2015 Portal for reference by national coding teams.

National Coder Training Provided by the National Centres

Each National Centre was required to develop a training package for its own coders. The training package consisted of an overview of the survey and the centre's own training manuals, based on the manuals and materials provided by the international PISA contractors. Coding teams were asked to work on the same schedule and at the same location in order to facilitate discussion about any items that proved challenging. Past experience has shown that if coders can discuss items among themselves and with their lead coder, many issues can be resolved in a way that results in more consistent coding. Each coder was assigned a unique coder ID that was specific to each domain and design.

The National Centres were responsible for organizing training and coding using one of the following two approaches, checking with the contractors in the case of deviations:

a) Coder training took place at the "item" level. Under this approach, coders were fully trained on the coding rules for each item and then coded all responses for that item. Once that item was done, training was provided for the next item, and so on.

b) Coder training took place at the "item set" level. While coding was still conducted at the "item" level, the training covered an "item set" containing a few units of items, so coders were fully trained on a set that varied from 13 to 18 items. Once the full training was complete, coding took place at the item level. However, to ensure that the coding rules were still fresh in the coders' memory, a coding refresher was recommended before the coding of each item.

CODING DESIGN²

In order to meet the unique characteristics of the CBA participants during the Main Survey, while ensuring that the coding process was completed within a two-to-three-week period, 10 possible coding designs (one standard design and nine variations) were offered to the CBA participants and four possible coding designs (one standard design and three variations) were offered to the PBA participants. These designs were developed to accommodate participants' various needs in terms of the number of languages assessed, the sample size, and the specified number of coders required in each domain.

² For a better understanding of the PISA coding designs, it is recommended that the descriptions of the PISA assessment designs in Chapter 2 be read first as important background information.

The number of coders by domain in each CBA coding design is shown in Table 13.2. The design of multiple coding in the CBA standard coding design is shown in Table 13.3. In the CBA coding designs, human-coded items were bundled into one or more item sets in each domain. For each common item, coders coded a set of 100 student responses that were randomly selected from all the student responses. Each domain had two bilingual coders, who additionally had to code 10 anchor responses for each item assigned to both of them. The rest of the student responses to each item were split evenly among coders to be single coded. The difference in multiple coding between the standard coding design and the other CBA coding designs lay mainly in the number of coders in each domain and in which item sets were assigned to each coder.

Table 13.2: Number of CBA coders by domain and coding design

Standard design: countries with the standard sample size (4,000 – 7,000) for a given language
Alternative design 1: countries with a sample between 7,000 and 9,000 for a given language
Alternative design 1a: countries with a sample between 7,000 and 9,000 for a given language
Alternative design 2: countries with a sample between 9,000 and 13,000 for a given language
Alternative design 2a: countries with a sample between 9,000 and 13,000 for a given language
Alternative design 3: countries with a sample between 13,000 and 19,000 for a given language
Alternative design 3a: countries with a sample between 13,000 and 19,000 for a given language
Alternative design 4: countries with a sample larger than 19,000 for the majority language
Minority Language Design 1: countries with a sample of less than 1,500 for the minority language
Minority Language Design 2: countries with a sample between 1,500 and 4,000 for the minority language

[The coder counts by domain specified for each design are not recoverable from this transcription; for the standard design they correspond to the coder IDs listed in Table 13.3.]

Table 13.3: Multiple coding in CBA standard coding design

Mathematics (trend), coders 301 (bilingual), 302, 303 (bilingual), 304:
  Item set 1 – 100 student responses per item for multiple coding; 10 anchor responses per item coded by the bilingual coders

Reading (trend), coders 201 (bilingual), 202, 203 (bilingual), 204, 205, 206:
  Item sets 1, 2 and 3 – 100 student responses per item for multiple coding; 10 anchor responses per item coded by the bilingual coders

Science (trend and new), coders 101 (bilingual), 102, 103 (bilingual), 104, 105, 106, 107, 108:
  Item sets 1, 2, 3 and 4 – 100 student responses per item for multiple coding; 10 anchor responses per item coded by the bilingual coders

Financial literacy (trend and new), coders 401 (bilingual), 402, 403 (bilingual), 404:
  Item set 1 – 100 student responses per item for multiple coding; 10 anchor responses per item coded by the bilingual coders

Note: In the original table, one check mark indicates that a coder codes the 100 student responses for each item in an item set, and another indicates that a coder codes the 10 anchor responses for each item in the set; the full coder-by-item-set assignment is not reproduced here.

Four variations of coding design were offered to PBA participants (see Table 13.4). The design of multiple coding in the PBA standard coding design is shown in Table 13.5. For PBA participants, all paper-and-pencil booklets were organized by form type into 27 different bundle sets: nine bundle sets per domain. Bundle sets 1, 2, and 3 in each domain were composed of forms for multiple coding: Forms 13, 15, and 17 for mathematics; Forms 1, 3, and 5 for reading; and Forms 7, 8, and 9 for science. For each form, 100 student booklets were randomly selected from all the student responses. Each coder coded his or her assigned clusters in the sets of 100 student booklets until all items in the booklets were coded. Bundle sets 4-9 in each domain were composed of six or seven types³ of anchor forms. The forms were labelled 301-307 for mathematics, 201-207 for reading, and 101-106 for science (see Table 13.5). Differing from non-anchor forms, the anchor forms each contained only one cluster of items. For example, Form 301 contained all the items from the first cluster in mathematics, and Form 202 contained all the items from the second cluster in reading. Each anchor form had 10 pre-filled English booklets that were coded by the bilingual coders for each domain. Each domain in the PBA standard design had two bilingual coders: 31 and 33 for mathematics, 21 and 23 for reading, and 11 and 13 for science.

³ In mathematics, there was an additional cluster, as instead of M06 there were M06A and M06B. Since countries could choose only M06A or M06B, but not both, the actual number of clusters in each domain is six rather than seven. The same is true for clusters R06A and R06B in reading.

CBA constructed-response items were organized by item set during multiple coding; by contrast, PBA constructed-response items were organized by bundle set. In other words, multiple coding in the PBA standard design was form-based rather than item-set-based. Although coders conducted their coding in the booklets, each coder coded only the clusters assigned to him or her in each booklet, leaving the rest of the clusters to other coders.

This multiple coding design enabled the within- and across-country comparisons. After the multiple coding was completed, all the clusters that remained uncoded were split equally among coders and coded only once. The difference in multiple coding between the PBA standard design and the other PBA coding designs lay mainly in the number of coders in each domain and in which forms were assigned to each coder.

Table 13.4: Number of PBA coders by domain and coding design

Standard design (countries with the standard sample size, 3,501 – 5,500): 4 mathematics coders, 6 reading coders, 6 science coders
Alternative design 1 (countries with a sample larger than 5,500 for the majority language): 6 mathematics coders, 9 reading coders, 9 science coders
Minority language design 1 (countries with a sample of less than 1,500 for the minority language): 2 mathematics coders, 2 reading coders, 2 science coders
Minority language design 2 (countries with a sample between 1,501 and 3,500 for the minority language): 3 mathematics coders, 3 reading coders, 4 science coders

Table 13.5: Multiple coding in PBA standard coding design

Mathematics (trend), coders 31 (bilingual), 32, 33 (bilingual), 34:
  Bundle set 1 – Form 13 (PM1 & PM2), 100 student booklets
  Bundle set 2 – Form 15 (PM3 & PM4), 100 student booklets
  Bundle set 3 – Form 17 (PM5 & PM6a or PM5 & PM6b), 100 student booklets
  Bundle sets 4-9 – Forms 301 (PM1), 302 (PM2), 303 (PM3), 304 (PM4), 305 (PM5), and 306 (PM6a) or 307 (PM6b), 10 anchor booklets per form

Reading (trend), coders 21 (bilingual), 22, 23 (bilingual), 24, 25, 26:
  Bundle set 1 – Form 1 (PR1 & PR2), 100 student booklets
  Bundle set 2 – Form 3 (PR3 & PR4), 100 student booklets
  Bundle set 3 – Form 5 (PR5 & PR6a or PR5 & PR6b), 100 student booklets
  Bundle sets 4-9 – Forms 201 (PR1), 202 (PR2), 203 (PR3), 204 (PR4), 205 (PR5), and 206 (PR6a) or 207 (PR6b), 10 anchor booklets per form

Science (trend), coders 11 (bilingual), 12, 13 (bilingual), 14, 15, 16:
  Bundle set 1 – Form 7 (PS1 & PS2), 100 student booklets
  Bundle set 2 – Form 8 (PS3 & PS4), 100 student booklets
  Bundle set 3 – Form 9 (PS5 & PS6), 100 student booklets
  Bundle sets 4-9 – Forms 101 (PS1), 102 (PS2), 103 (PS3), 104 (PS4), 105 (PS5) and 106 (PS6), 10 anchor booklets per form

Note:
1. In the original table, one check mark indicates that a coder codes the 100 student booklets for a bundle set and another indicates that a coder codes the 10 anchor booklets for a bundle set; the full coder-by-bundle-set assignment is not reproduced here.
2. Paper-based mathematics, reading and science assessments are referred to as PM, PR and PS in this table. The number following PM, PR and PS is the cluster number; for instance, PM1 represents Cluster 1 in the mathematics domain.
3. The mathematics and reading domains have two versions of item cluster 06: 06A and 06B. Each PISA participant selected one or the other version to administer.
4. CBA participants' coder IDs have three digits, while PBA participants' coder IDs have two digits.
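The allocation of responses implied by these designs can be pictured with a short sketch. The example below is a simplified illustration rather than the operational procedure: it assumes that every coder in the domain codes the multiple-coded subset for a given item, and the function, variable and ID names are invented for the example.

```python
import random

def allocate_responses(response_ids, coders, bilingual_coders, anchor_ids,
                       n_multiple=100, n_anchor=10, seed=2015):
    """Sketch of allocating one human-coded item's responses to coders.

    Returns a dict mapping coder ID to the list of response IDs to code.
    Simplification: every listed coder codes the multiple-coded subset, whereas
    the operational designs assign item sets or bundle sets to specific coders
    (see Tables 13.3 and 13.5).
    """
    rng = random.Random(seed)
    pool = list(response_ids)
    rng.shuffle(pool)

    multiple_coded = pool[:n_multiple]   # coded by all listed coders
    single_coded = pool[n_multiple:]     # each coded by exactly one coder

    workload = {coder: list(multiple_coded) for coder in coders}

    # Split the remaining responses as evenly as possible for single coding.
    for i, response in enumerate(single_coded):
        workload[coders[i % len(coders)]].append(response)

    # The two bilingual coders additionally code the English anchor responses.
    for coder in bilingual_coders:
        workload[coder].extend(anchor_ids[:n_anchor])

    return workload

# Example: the four mathematics coders of the CBA standard design (301 and 303
# are the bilingual coders); response and anchor IDs are hypothetical.
responses = [f"resp{i:05d}" for i in range(5000)]
anchors = [f"anchor{i:02d}" for i in range(10)]
plan = allocate_responses(responses, ["301", "302", "303", "304"],
                          bilingual_coders=["301", "303"], anchor_ids=anchors)
print({coder: len(ids) for coder, ids in plan.items()})
# {'301': 1335, '302': 1325, '303': 1335, '304': 1325}
```

The printed counts show each coder receiving the 100 multiple-coded responses plus an even share of the single-coded ones, with the two bilingual coders also picking up the 10 anchor responses.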

Within-Country and Across-Country Coder Reliability

Reliable human coding is critical for ensuring the validity of assessment results within a country, as well as the comparability of assessment results across countries. Coder reliability in PISA 2015 was evaluated and reported at both the within-country and the across-country level. The evaluation of coder reliability was made possible by the design of multiple coding: a portion or all of the responses to each human-coded constructed-response item were coded by at least two human coders.

The purpose of evaluating within-country coder reliability was to ensure coding reliability within a country and to identify any coding inconsistencies or problems in the scoring process so that they could be addressed and resolved early in the process. The evaluation of within-country coder reliability was carried out through the multiple coding of a set of student responses, that is, by assigning identical student responses to different coders so that those responses were coded multiple times within a country. Multiple coding of all student responses in an international large-scale assessment like PISA is not economical, so a coding design combining multiple coding and single coding was used to reduce national costs and coder burden. In general, a set of 100 responses per human-coded item was randomly selected from actual student responses to be multiple coded. The rest of the student responses were split evenly among coders to be single coded.

Accurate and consistent scoring within a country does not necessarily mean that coders from all countries are applying the coding rubrics in the same manner. Coding bias may be introduced if one country codes a certain response differently than other countries. Therefore, in addition to within-country coder reliability, it was also important to check the consistency of coders across countries. The evaluation of across-country coder reliability was made possible by the multiple coding of a set of anchor responses. In each country, two coders in each domain had to be bilingual in English and the language of assessment. These coders were responsible for coding the set of anchor responses in addition to any student responses assigned to them. For each human-coded constructed-response item, a set of 10 anchor responses in English was provided. These anchor responses were answers obtained from real students, and their authoritative codings were not released to the countries. Since countries using the same mode of administration coded the same anchor responses for each human-coded constructed-response item, their coding results on the anchor responses could be compared with one another.
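One way to picture such a comparison is sketched below: each country's codes on the shared anchor responses are compared with the code most frequently assigned across all countries. This consensus-based comparison is only an illustration under an assumed data layout, not the computation used in PISA, and the country, anchor and function names are hypothetical.

```python
from collections import Counter

# Hypothetical layout: the code assigned by a country's bilingual coders to each
# shared anchor response for one human-coded item.
anchor_codes = {
    "Country A": {"anchor01": "1", "anchor02": "0", "anchor03": "1"},
    "Country B": {"anchor01": "1", "anchor02": "1", "anchor03": "1"},
    "Country C": {"anchor01": "1", "anchor02": "1", "anchor03": "0"},
}

def across_country_agreement(anchor_codes):
    """Agreement of each country with the modal code per anchor response."""
    anchor_ids = list(next(iter(anchor_codes.values())))
    modal = {a: Counter(codes[a] for codes in anchor_codes.values()).most_common(1)[0][0]
             for a in anchor_ids}
    return {country: sum(codes[a] == modal[a] for a in anchor_ids) / len(anchor_ids)
            for country, codes in anchor_codes.items()}

print(across_country_agreement(anchor_codes))
# Country B matches the consensus on all three anchors; A and C on two of three.
```

A country whose agreement with the consensus is noticeably low on a given item would be a candidate for follow-up, since it may be applying the rubric differently from other countries.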
CODER RELIABILITY STUDIES

Coder reliability studies were conducted to evaluate the consistency of coding of human-coded constructed-response items within and across the countries participating in PISA 2015. The studies were based on 59 CBA countries (for a total of 72 country-by-language groups) and 15 PBA countries (for a total of 17 country-by-language groups) with sufficient data to yield reliable results.⁴ The coder reliability studies addressed three aspects of coder reliability: the domain-level proportion agreement; the item-level proportion agreement; and the coding category distributions of coders on the same item. Proportion agreement and coding category distribution are the main indicators of coder reliability used in PISA 2015.

⁴ Coding data from Kazakhstan (Kazakh) and Kazakhstan (Russian) were not included in this analysis, and all human-coded responses from these groups were excluded from the calculation of proficiency estimates.

• Proportion agreement refers to the percentage of each coder's coding that matched the other coders' coding on the identical set of multiple-coded responses for an item. It can vary from 0 (0% agreement) to 1 (100% agreement). Each country was expected to have an average within-country proportion agreement of at least 0.92 (92% agreement) across all items, with a minimum of 85% agreement for any one item.

• Coding category distribution refers to the aggregation of the distributions of coding categories (such as "full credit", "partial credit", and "no credit") assigned by a coder to two sets of responses: a unique set of 100 responses for multiple coding, and the responses randomly allocated to the coder for single coding. Although negligible differences in coding categories among coders were tolerated, the coding category distributions of different coders were expected to be statistically equivalent, based on the standard chi-square distribution, because of the random assignment of the single-coded responses.

Domain-Level Proportion Agreement

The average within-country agreement by domain in PISA 2015 exceeded 92% in each domain across the 89 country-by-language groups with sufficient data (see Tables 13.6 and 13.7). The difference between CBA and PBA participants' average proportion agreements in each of the mathematics, reading, and trend science domains was less than 0.5%. Within each mode, the within-country agreements did not differ greatly between domains, either. The mathematics domain had the highest agreement (97.5% for CBA; 97.5% for PBA). The reading domain also had agreement higher than 95% (95.6% for CBA; 95.8% for PBA). The trend science domain had an average agreement of 94.2% for CBA and 94.7% for PBA, and the new science domain for CBA also had an average agreement of 94.2%. The financial literacy domain had slightly lower agreement (93.7% for CBA) than the other domains.

Across-country agreement by domain in PISA 2015 exceeded 92% when averaged over all 72 CBA country-by-language groups (see Table 13.6). The PBA participants had lower across-country agreement than the CBA participants on average (see Tables 13.6 and 13.7). The difference in domain-level proportion agreement between CBA and PBA was 3.3% for mathematics, 3.9% for reading, and 5.0% for trend science. Domain-level agreement was highest in the mathematics domain for both CBA and PBA responses (97.2% for CBA; 94.0% for PBA). For the CBA participants, the reading, trend science, new science, and financial literacy domains had across-country agreement at similar levels, ranging between 93.1% and 93.9%. For the PBA participants, the average across-country agreement in the reading and trend science domains was 90.0% and 88.6%, respectively, slightly lower than the criterion but still acceptable.

Table 13.6: Summary of within-country and across-country agreement (%) per domain for CBA participants

[Table 13.6 reports, for each CBA country-by-language group (OECD members and partners), the within-country and across-country agreement (%) in each domain, together with means and medians for OECD members, for partners, and for all CBA participants; the individual country values are not recoverable from this transcription.]
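A compact sketch of how the two reliability indicators defined above could be checked is given below. The 0.92 and 0.85 thresholds come from the criteria stated earlier in this section; the data layout, the function names and the use of a chi-square test of homogeneity via scipy are illustrative assumptions rather than the operational PISA procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

def check_proportion_agreement(item_agreement, average_min=0.92, item_min=0.85):
    """item_agreement maps item ID -> proportion agreement on the multiple-coded set."""
    average = sum(item_agreement.values()) / len(item_agreement)
    low_items = [item for item, value in item_agreement.items() if value < item_min]
    return average >= average_min and not low_items, average, low_items

def categories_equivalent(counts_by_coder, alpha=0.05):
    """Chi-square test of homogeneity: rows are coders, columns are coding
    categories (e.g. no credit, partial credit, full credit)."""
    _, p_value, _, _ = chi2_contingency(np.asarray(counts_by_coder))
    return p_value > alpha, p_value

# Hypothetical within-country agreement values for three items.
meets, average, low = check_proportion_agreement(
    {"item01": 0.97, "item02": 0.91, "item03": 0.82})
print(f"average agreement {average:.2f}, meets criteria: {meets}, below 85%: {low}")

# Hypothetical category counts for three coders of the same item set.
equivalent, p_value = categories_equivalent([[40, 25, 35], [38, 27, 35], [45, 22, 33]])
print(f"category distributions statistically equivalent: {equivalent} (p = {p_value:.2f})")
```

In this hypothetical example the third item falls below the 85% per-item criterion and drags the average under 0.92, so the country would be flagged for follow-up, while the three coders' category distributions are statistically indistinguishable.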
