
ANOVA Analysis of Student Daily Test Scores in Multi-Day Test Periods

Matthew L. Mouritsen, Ph.D.
Department of Accounting and Taxation
Weber State University
Ogden, Utah

Jefferson T. Davis, Ph.D., CPA, CISA
Department of Accounting and Taxation
Weber State University
Ogden, Utah

Steven C. Jones
Institutional Research
Weber State University
Ogden, Utah

ABSTRACT

Instructors are often concerned when giving multiple-day tests because students taking the test later in the exam period may have an advantage over students taking the test early in the exam period due to information leakage. However, exam scores seemed to decline as students took the same test later in a multi-day exam period (Mouritsen and Davis, 2012). This study reports mean test score analysis of a four-day exam period. Students with higher cumulative GPAs tend to take the exam earlier in the testing period. The majority of students take the exam the last day of the testing period. Test score variance for each test day also increases with each test day. One-way ANOVA analysis finds that mean test scores of students who take the test later in the test period significantly decline. Pairwise comparisons that assume unequal numbers of observations in each group, as well as unequal variances of exam scores for each day, show that day 4 mean scores are significantly less than days 1, 2, and 3. The only other pairwise difference is day 1 and day 3. Further, a 4 X 2 (4 test days by two different professors) ANCOVA analysis is also reported where cumulative GPA and Test # (4 or more tests each semester) are used as control variables to see if student test scores still decrease for students taking the test later in the testing period. The results show significant decreases in mean test scores as students take the test later in the testing period even when controlling for students' cumulative GPA and Test # within the semester. An estimated marginal means analysis further shows that the upper bound of day 4 is below the lower bound of days 1, 2, and 3, consistent with pairwise comparisons of test score means. The results suggest that information leakage, if any, is not evident in multi-day test scores. The results suggest that an instructor may have an opportunity to further help students taking the exam later in the exam period. Further research on demographics, test preparation, procrastination, self-efficacy, and emotional intelligence of students taking multi-day tests is in order (Hen and Goroshit, 2014).

INTRODUCTION

Many universities are using testing centers to allow students to take tests when it is more convenient for the student. One of the issues related to testing centers in general, and specifically for tests that can be taken by students over multiple days, is the risk of information leakage to students who take that test later in the test period. However, two studies have found that instead of test scores being higher for students taking the test late in the multi-day testing period, test scores are actually lower for students who take the test later in the multi-day testing period (see Mouritsen and Davis, 2012, and Reed and Holley, 1989). Although this information does not mean information leakage does not take place, it does suggest that other factors are much more prominent in determining test scores in a multi-day test period than any information leakage that may take place. For example, there are several articles in the education literature that study procrastination in academic settings.

The objective of this research is to discover why average test scores of students who take the test at the end of a multi-day testing period are lower than average scores of students who take the test earlier in the testing period.

Journal of Learning in Higher Education

Matthew L. Mouritsen, Jefferson T. Davis, & Steven C. Jones

This study analyses test scores of students taking exams over multi-day testing periods for introductory financial accounting (Accounting 2010) and introductory managerial accounting (Accounting 2020) courses taught by two different instructors over several semesters. The tests were all administered in the testing center over a 4-day period. Students were allowed to select when to take the test during the 4-day testing period. The exams were all multiple choice and no time limit was given. The analyses in this study include test scores from different tests taken during different semesters. Exhibit 1 shows the distribution of students included in the study taking the tests during each of the successive four test days. The data include only tests where four test days were used so that the test percentages for each course could be consistent based on the number of days. Exhibit 1: Distribution of Students Taking Exam Each Day for Both Courses shows that more students took the test each successive day of the test period and the total number of tests included in the study for each course. The total number of tests did include up to four test scores from each individual student for different exams taken during a semester. Exhibit 2: Distribution of Mean Exam Scores by Test Day for Both Courses shows that test percentage scores drop with each successive day of the test period. One might expect that better students tend to take the test earlier in the exam period. Exhibit 3: Mean GPA of Students by Test Day for Both Courses shows that, in fact, the average cumulative GPA of students who take the test earlier is higher than the average GPA of students who take the test later.
This research is thus aimed at discovering and analyzing what other course and student characteristics might play a role in students' test taking and scores over a multi-day testing period.

Student characteristics were also paired with the test scores of each student as well as information about what day the test was taken by each student during a 4-day testing period. In addition to student test percentage scores, the student test percentages were matched with other test information and student characteristics, including exam number during the semester the course was taken, class level (freshman, sophomore, etc.), whether the student was full-time or part-time, and age of student.

Exhibit 1
Distribution of Students Taking Exam Each Day for Both Courses

Item             Day 1   Day 2   Day 3   Day 4   Total
# of Students      106     147     154     310     717
% of Students      15%     21%     21%     43%    100%
# of Students       68      60     184     428     740
% of Students    [values not recoverable from the source]

[The Course column, which assigns each pair of rows to Accounting 2010 or Accounting 2020, is not recoverable from the source.]

Exhibit 2
Distribution of Mean Exam Scores by Test Day for Both Courses

[Table: mean score by day (Day 1 to Day 4) for Accounting 2010, Accounting 2020, and Combined. Only the fragment "80%, 71%, 77%" survives in the source, and its cells cannot be identified.]

Exhibit 3
Mean GPA of Students by Test Day for Both Courses

Average of GPA    Day 1   Day 2   Day 3   Day 4   Overall
Average by Day     3.32    3.16    3.18    2.96    3.08

[The per-course rows for Accounting 2020 and Accounting 2010 are not recoverable from the source; only the fragment "3.02" survives in the Overall column. The accompanying chart plotted average GPA by course on an axis running from about 2.60 to 3.50.]

Fall 2016 (Volume 12 Issue 2)

RESEARCH DESIGN AND HYPOTHESES

The descriptive statistics support the finding that students' average test scores get worse by day as the multi-day testing period progresses. With the ultimate objective of this research being to discover why average test scores of students who take the test at the end of a multi-day testing period are lower than average scores of students who take the test earlier in the testing period, this research takes the following basic approach: First, an ANOVA model is used to determine whether there are differences overall in the mean test scores for each of the four days in the testing period.
Then, if an overall difference is found, a pairwise test is used to determine which test days exhibit different mean test scores from each of the other test days. Statistical correlations are also run to find relationships between mean student test scores and various course and student characteristics. Using the information from these correlations, an ANCOVA model is developed to test whether these course and student characteristics are statistically significant variables for determining mean test score by test day. Finally, a marginal means analysis is used to further study the relationship of these student characteristics to the day they took the test and the mean test score for each day.

ANOVA Hypothesis and Test Results

To determine whether the mean test score (test percentage) differs overall for the 4-day test period, an ANOVA model is appropriate. The ANOVA model indicates whether the mean test scores for the four days are statistically different based on days. Formally, the null hypothesis is as follows:

ANOVA H1 (null): No overall mean test score differences between test days exist.

If H1 (null) is not rejected, then the results of the research end with the finding that, on average, it does not matter which day a student takes the exam in relation to their mean test score. If the null hypothesis is rejected, then the results indicate that the mean test scores do differ by day of the test period. Based on the descriptive statistics found in Exhibit 2, the expectation is that the null hypothesis will be rejected; in other words, statistical differences exist in mean test scores for students taking the tests over a 4-day test period. One student characteristic that may seem somewhat obvious is that better students will take the test earlier in the test period. Exhibit 3 shows student GPA by test day. There may be other explanations for the results as well. Further analysis is in order if statistical differences are found using the ANOVA test.

The ANOVA to determine if statistical differences exist between mean test scores for the 4-day test period rejects the null hypothesis that there are no differences based on which day the test was taken by students. Exhibit 4 shows the descriptive statistics and the ANOVA and Brown-Forsythe results for the test scores for the 4-day test period. The mean (average) test scores in the descriptive panel match the means listed in Exhibit 2. The descriptive panel also provides the number of students taking the test in each of the four days, the standard deviation for each of the four days' test scores, and the 95% confidence intervals for each of the four days' test scores. The main result of the ANOVA procedure shows strong differences between the mean test scores for the four test days (significance of .000). The descriptive statistics also reveal that many more students take the exam on the second day than on the first day. Days three and four have more students who take the exam than the previous days as well. Also notice that, with the exception of day three, the standard deviation (a measure of variation from the mean test score for the day) increases during the 4-day test period. It is not surprising that the standard deviation of test scores increases with the number of students taking the exam on a given day: more students, more variety. This finding suggests that students taking the exam each day may have differences that lead to different exam scores for each day. The fact that the number of students taking the exam each day increases by day and that standard deviations for each day's test scores also generally increase suggests that the ANOVA may not be valid. ANOVA procedures generally assume homogeneous (similar) variances in the data. To test for non-homogeneous (non-similar) variances, the Brown-Forsythe test was also performed. The Brown-Forsythe test results show statistical differences in mean test scores for the multi-day testing period even when accounting for unequal variances and unequal number of students taking the test each day. With statistical differences in mean test scores for the 4-day testing period confirmed by the ANOVA and Brown-Forsythe tests, the next step is to test for pairwise differences of mean test scores for each day.

Exhibit 4
Mean Test Percent Score
ANOVA Results and Brown-Forsythe for Non-homogeneity of Variance

Descriptives
[Table columns: Day, N, Mean, Std. Deviation, Std. Error, 95% Confidence Interval for Mean (Lower Bound, Upper Bound). The cell values are garbled in the source and are not recoverable.]

ANOVA
                  Sum of Squares     df    Mean Square        F    Sig.
Between Groups             4.354      3          1.451   57.835    .000
Within Groups             36.459   1453           .025
Total                     40.812   1456

Robust Test of Equality of Means (Brown-Forsythe) (1)
[Statistic partially lost in the source (ends in ".419") (2); Sig. .000]

1 The Brown-Forsythe test, which accounts for the lack of variance homogeneity, indicates statistically significant results even with unequal variances and unequal number of test scores in each day.
2 Asymptotically F distributed.

Pairwise Hypothesis and Test Results

In the case of differences, a pairwise comparison can provide information as to any statistical differences between mean test scores for each day in relation to each of the other days. Formally, the null hypothesis states:

Pairwise H2 (null): No day-to-day pairwise differences in mean test scores for each of the four test days.

If H2 (null) is rejected, we will then have information concerning which test days' mean test scores are statistically different from each of the other test days' mean test score.

Exhibit 5
Pairwise Comparisons of Test Days' Mean Exam Scores
Multiple Comparisons, Tamhane's T2 Pairwise Test (1)

[Table columns: Exam Day (a), Exam Day (b), Mean Difference (a-b), Std. Error, Sig., 95% Confidence Interval (Lower Bound, Upper Bound). The cell values are garbled in the source and are not recoverable.]

* The mean difference is significant at the 0.05 level.
1 The Tamhane's T2 is a pair-wise procedure based on the Student t-distribution. Tamhane's is a more conservative post hoc comparison for data with unequal variances and is appropriate when variances are unequal and/or when the sample sizes are different. (Source: chapter 11, page 256 of Basic Statistics and Pharmaceutical Statistical Applications by James E. De Muth.)
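The F statistic reported in Exhibit 4 can be cross-checked from its sums of squares and degrees of freedom, since F is the ratio of the between-groups mean square to the within-groups mean square. The sketch below is illustrative only: the function names are hypothetical, the Exhibit 4 inputs are the rounded values printed in the exhibit (so the recomputed F agrees with the reported 57.835 only to about two decimals), and the raw-score groups are made-up numbers, not study data.

```python
from statistics import mean


def f_ratio(ss_between, df_between, ss_within, df_within):
    """Return (MS_between, MS_within, F) for a one-way ANOVA table."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


def one_way_anova(groups):
    """Compute (MS_between, MS_within, F) from raw scores, one list per group."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups variation: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups variation: scores around their own group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return f_ratio(ss_between, len(groups) - 1, ss_within, n_total - len(groups))


# Rounded values as printed in Exhibit 4.
ms_b, ms_w, f_stat = f_ratio(4.354, 3, 36.459, 1453)
print(f"MS between = {ms_b:.3f}, MS within = {ms_w:.3f}, F = {f_stat:.2f}")

# Hypothetical raw scores for three groups (not from the study).
print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F means the day-to-day differences in mean score are large relative to the spread of scores within each day, which is what leads to rejecting H1 (null).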

The results of the pairwise test comparing the mean test score of each day to each of the other three days are found in Exhibit 5: Pairwise Comparisons of Test Days' Mean Exam Scores.

The pairwise procedures yield mixed results as to whether the null hypothesis of no mean test score difference between a particular day and each of the other days is rejected. The results show that the day 1 mean score is not statistically higher than day 2 (.272), but it is higher than the mean test scores of day 3 (.000) and day 4 (.000). The day 2 mean test score is not different from day 1 (.272) or day 3 (.418), but it is higher than day 4 (.000). Finally, the day 3 mean test score is higher than the day 4 test score (.000). It should be noted that the day 4 mean test score is significantly lower than each of the other three days' mean test scores (.000).

The Tamhane's T2 pairwise procedure was chosen because this particular pairwise test is appropriate when unequal sample sizes exist and when variances (i.e., standard deviations) are also unequal. Since pairwise differences between mean test scores for most of the days are found, further analysis is needed to determine why the test scores for different test days tend to get lower as test days progress from day 1 through 4. In particular, further analysis seeks to answer the question, "Why are test scores for the last day, day 4, lower than each of the other three days of the exam period?"

Correlations of Test Scores with Student and Course Characteristics

Since some pairwise differences between each day's mean test scores were found, the next step is to study potential reasons why different days in the testing period yield different mean test scores. Statistical correlation procedures are used to find strong or weak relationships between student and course characteristics (i.e., course/prof, test number, student GPA, class level, full/part time) and test scores.
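As a rough illustration of the correlation procedure described here, the following sketch computes a Pearson correlation coefficient from paired observations. The function name and the small data set are hypothetical and are not drawn from the study; the example only shows how a negative coefficient arises when scores tend to fall as exam day increases.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: exam day (1 to 4) and exam score as a fraction.
exam_day = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4]
score = [0.90, 0.84, 0.82, 0.80, 0.79, 0.81, 0.70, 0.75, 0.68, 0.72]

r_day_score = pearson_r(exam_day, score)
print(f"r(exam day, score) = {r_day_score:.3f}")
```

A coefficient near zero indicates little linear relationship, while values closer to -1 or 1 indicate a stronger one; the sign gives the direction.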
Exhibit 6 shows the correlation results between student test scores and student's cumulative GPA, exam day, exam number, class level (freshman, sophomore, etc.), semester, and age of student.

The Pearson correlations were significant for GPA (.437; .000), exam day (-.312; .000), and exam number (-.292; .000). Exam number refers to the first to last exams in the semester. The correlation shows that exam scores tend to be lower for exams given later in the semester. This result makes sense, as exams taken later in the semester typically deal with more difficult topics or topics that build on information from the earlier part of the course. The negative correlation between exam day and test scores is consistent with the ANOVA and pairwise results.

Exhibit 6
Correlations between Mean Exam Score as a Percentage and Other Variables (N = 1457)

                  Pearson Correlation    Sig.
Cumulative GPA          .437 **          .000
Exam Day               -.312 **          .000
Exam #                 -.292 **          .000
Class Level             .046             .079
Semester                .008             .756
Age                     .006             .830

** Significant at .01 level

Class level exhibited some correlation with test scores (.045), but the significance level (.079) did not reach .01. Semester and student age had extremely weak correlations and were very far from statistical significance. Whether a student was full or part time also did not show a relationship with test scores. These correlations were then used to determine what variables would be used in the ANCOVA.

ANCOVA Hypotheses and ANCOVA and Marginal Means Tests Results

Based on the correlation results, an ANCOVA model was developed to see if mean test scores by test day still differ if these course and student characteristics are used as control variables in the ANCOVA model. In general, ANCOVA is a combination of ANOVA and linear regression.
The ANOVA includes a dependent variable (mean test scores) with one or more categorical independent variables (4 test days and 2 different courses), combined with other control variables to "correct" for or take into account other variables or characteristics that may confound or make a difference in the predictive model. The ANCOVA model tests for statistical differences in mean test day scores while controlling for these characteristics. The ANCOVA results will also find which of these variables statistically contribute or help to explain differences in mean test scores for each of the four test days. The null hypotheses related to the ANCOVA are as follows:

ANCOVA H3A (null): No mean test score differences from main effects in 4X2 (4 days X 2 courses/professors) design.

ANCOVA H3B (null): Covariates (Student GPA, Test #) are not significant variables and do not contribute to any mean test score differences in relation to 4-day exam period nor 2 different courses/professors.

Finally, a marginal means test was conducted to explore further differences in any ANCOVA results to show the percentage of students within each day's mean scores.

The ANCOVA was a 4X2 design (4 test days by 2 courses/professors) for the main effects. The covariates included in the model to control for characteristics that might confound main effects on the ANOVA were student GPA and exam number.

The results show that H3A (null) and H3B (null) are both rejected. In other words, the main effects, test day and course/professor, were significant contributing variables in predicting the test score. Also, the two covariates (student GPA and exam #) were significant to the ANCOVA model. Therefore, even when controlling for student GPA and exam #, the main effect variables of test day and course/professor were still strong predictors of student test scores. The results also show that student GPA and exam # have an impact on student test scores.
The interaction between test day and course did not, however, significantly impact the strength of the model in explaining student test scores. The ANCOVA results achieved an adjusted R squared of .366. This means that, overall, the model explains student test scores fairly well.

The estimated marginal means further show that the upper bound of the 95% confidence interval for day four test scores (.743) is lower than the lower bound for every other day (the lowest, for day 3, is .775), even when controlling for student GPA and test number.

A 95% confidence interval is constructed so that, over repeated samples, about 95% of such intervals contain the true mean test score. Since the upper bound of day four exam scores is lower than the lower bound of any other day's mean test score, the intervals do not overlap, and it is very unlikely that the day four true mean test score equals any other day's true mean test score. The marginal means statistics resulting from the ANCOVA model show that the day four group differs strongly, in relation to exam scores, from students taking the test on the other three days. The day four group is the largest group, has the lowest average GPA, and has the largest test score variation. Although the marginal means standard error is smallest for day four, the standard deviation for day four test scores is the largest (see Exhibit 4). The marginal means standard error is smallest largely because the number of students who take the test on day 4 is much larger than on the other three days.
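The interval arithmetic behind this comparison can be sketched as follows. The day 4 mean (.733) and standard error (.005) are the values reported in Exhibit 8, and .775 is the day 3 lower bound quoted in the text. Using a normal critical value is an assumption of this sketch (the original software most likely used a t distribution, which is nearly identical at this sample size). The second part uses the simple formula s / sqrt(n) with a made-up standard deviation to show why a larger group yields a smaller standard error; the ANCOVA marginal-means standard errors are model-based and are not computed this way, and the day counts 174 and 738 are the Exhibit 1 totals for days 1 and 4 across both courses.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(mean_value, std_error, level=0.95):
    """Two-sided interval using a normal critical value (an approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean_value - z * std_error, mean_value + z * std_error


def standard_error(std_dev, n):
    """Standard error of a simple sample mean."""
    return std_dev / sqrt(n)


# Day 4 estimated marginal mean and standard error as reported in Exhibit 8.
lower4, upper4 = confidence_interval(0.733, 0.005)
print(f"day 4 interval: {lower4:.3f} to {upper4:.3f}")

# Day 3 lower bound as quoted in the text.
day3_lower = 0.775
print("day 4 interval lies below day 3 interval:", upper4 < day3_lower)

# Hypothetical standard deviation of 0.16 for both groups, to isolate the
# effect of group size (174 day-1 tests versus 738 day-4 tests).
print(standard_error(0.16, 174), standard_error(0.16, 738))
```

With the same spread of scores, quadrupling the number of observations roughly halves the standard error, which is why the day 4 interval can be the narrowest even though its scores vary the most.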
A higher number N typically strengthens the statistical ability to narrow the confidence interval.

Exhibit 7
4x2 ANCOVA Design
(4 levels: {day 1, day 2, day 3, day 4} x 2 levels: {professor 1, professor 2})
Covariates: Cumulative GPA, Exam #

Tests of Between-Subjects Effects
Dependent Variable: Exam Score Percent

[Table columns include Source, Type III Sum of Squares, and Sig. The rows are garbled in the source and cannot be reliably reconstructed. The surviving significance values are .000 for four rows and .388 for the last row.]

R Squared = .370 (Adjusted R Squared = .366)
All main effects and covariates are statistically significant.
* No statistically significant interaction effect between Exam Day and Professor
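The relationship between the R Squared and Adjusted R Squared values in Exhibit 7 can be illustrated with the usual adjustment formula, 1 - (1 - R^2)(n - 1)/(n - p - 1). In this sketch, n = 1457 is the number of test scores reported in Exhibit 6, while p = 9 is an assumption about how the model was parameterized (3 terms for exam day, 1 for professor, 3 for their interaction, and 2 covariates); the source does not state the model degrees of freedom.

```python
def adjusted_r_squared(r_squared, n_obs, n_params):
    """Adjust R squared for the number of estimated slope parameters."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_params - 1)


# R squared from Exhibit 7; n from Exhibit 6; n_params = 9 is assumed.
adj = adjusted_r_squared(0.370, 1457, 9)
print(f"adjusted R squared = {adj:.3f}")
```

Because the sample is large relative to the number of parameters, the adjustment is small, which is consistent with the two reported values differing only in the third decimal.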

Exhibit 8
Estimated Marginal Means of Exam Score Percentage and Exam Day
Dependent Variable: Percent of Exam Score

                                     95% Confidence Interval
Exam Day    Mean     Std. Error    Lower Bound    Upper Bound
1           .814a    [lost]        [lost]         [lost]
2           [lost]   [lost]        [lost]         [lost]
3           [lost]   [lost]        [lost]         [lost]
4           .733a    .005          .723           .743

a. Covariates appearing in the model are evaluated at the following values: CUM GPA UGRAD [value garbled in the source as "3083.18"], Exam # = 2.62.

LIMITATIONS, SUMMARY, CONCLUSIONS AND FURTHER STUDY

The breadth of the study is fairly limited since only two different accounting courses and only two different professors are included in the data. Readers should also recognize that, although the variables used as measures of student and course characteristics exhibit correlations or strong relationships with student test scores, cause and effect cannot be concluded. For example, we cannot conclude that a student's GPA causes their test score on any particular exam. However, the relationship between a student's GPA and test scores may help an instructor predict who may need more help in learning information to perform well on a test.

The results show significant decreases in mean test scores as students take the test later in the testing period even when controlling for students' cumulative GPA and Test # within the semester. An estimated marginal means analysis further shows that the upper bound of day 4 is below the lower bound of days 1, 2, and 3, consistent with pairwise comparisons of test score means. The results suggest that information leakage, if any, is not evident in multi-day test scores. The results clearly show that students taking the exam on day 4 are different from students taking the exam on days one through three. The results suggest that an instructor may have an opportunity to further help students taking the exam later in the exam period. Further research on demographics, test preparation, and test taking skills of students taking the exam on day 4 is in order.
Perhaps interviews with students can provide a further understanding about student motivation, student test preparation, and student test-taking challenges. In particular, further research can help instructors learn potential ways to help day four test takers improve their test scores.

Hen and Goroshit (2014) provide some direction for future research on how teachers might find ways to help students. They found that procrastination is related to lower levels of self-regulated learning and academic self-efficacy (Bandura, 1977) and associated with higher levels of anxiety, stress, and illness. They also review and discuss emotional intelligence (EI) and how it may influence a student's ability to assess, regulate, and utilize emotions associated with academic self-efficacy and academic performance including student GPA (see also Haycock, et al., 1998; Wolters, 2003; Zajacova, et al., 2005; Seo, 2008; Klassen et al., 2008; Deniz, et al., 2009). Using the data in the current study, the test starting times showed that day 4 students started the exam on average at 2:51 pm while the day 1 average was 12:39 pm, the day 2 average was 1:12 pm, and the day 3 average was 1:24 pm. The days of the week showed that almost all the tests were taken during weekdays, so weekend test days were not a factor in taking the test later in the day. These data are another indication that procrastination plays a role, especially for day 4 test takers. Future research could use standardized tests available to measure students for emotional intelligence, self-efficacy, and motivation, and look for direct and indirect relationships to procrastination and academic success. Then instructors might be able to begin to address these related issues to help students be more successful in academic settings.

REFERENCES

Bandura, A. (1977). Self-Efficacy: Toward a Unifying Theory of Behavioral Change. Psychological Review, 84, 191-215.
Deniz, M., Tras, Z., and Adygan, D. (2009). An Investigation of Academic Procrastination, Locus of Control, and Emotional Intelligence. Educational Sciences: Theory & Practice, 9(2), 623-632.
Haycock, L., McCarthy, P., and Skay, C. (1998). Procrastination in College Students: The Role of Self-Efficacy and Anxiety. Journal of Counseling & Development, 76(3), 317-324.
Hen, M., and Goroshit, M. (2014). Academic Procrastination, Emotional Intelligence, Academic Self-Efficacy, and GPA: A Comparison between Students with and without Learning Disabilities. Journal of Learning and Disability, 47(2), 116-124.
Hen, M., and Goroshit, M. (2014). Academic Self-Efficacy, Emotional Intelligence, GPA and Academic Procrastination in Higher Education. Eurasian Journal of Social Sciences, 2(1), 1-10.
Klassen, R., Krawchuk, L., and Rajani, S. (2008). Academic Procrastination of Undergraduates: Low Self-Efficacy to Self-Regulate Predicts Higher Levels of Procrastination. Contemporary Educational Psychology, 33(4), 915-931.
Mouritsen, M., and Davis, J. (2012). Declining Test Score among Introductory Accounting Students: A Comparison of Mean Test Scores in Multi-Day Examination Periods. International Journal of Business and Social Science, 3(15), 1-8.
De Muth, J. E. (2006). Basic Statistics and Pharmaceutical Statistical Applications. p. 256, CRC Press, Boca Raton, FL.
Reed, S., and Holley, J. (1989). The Effect of Final Examination Scheduling on Student Performance. Issues in Accounting Education, 4(2), 327-344.

Seo, E. (2008). Self-Efficacy as a Mediator in the Relationship Between Self-Oriented Perfectionism and Academic Procrastination. Social Behavior and Personality, 36(6), 753-764.
Wolters, C. (2003). Understanding Procrastination from a Self-Regulated Learning Perspective. Journal of Educational Psychology, 95(1), 179-187.
Zajacova, A., Lynch, S., and Espenshade, T. (2005). Self-Efficacy, Stress and Success in College. Research in Higher Education, 46(6), 677-706.
