Volume 4, Number 1
Fall 1992
Post Hoc Analysis of Test Items Written by Technology Education Teachers

W. J. Haynie, III

Technology education teachers frequently author their own tests. The effectiveness of a test depends upon many factors, but it is clear that the quality of each individual item is of great importance. This study sought to determine the quality of teacher-authored test items in terms of nine rating factors.

BACKGROUND

Most testing in schools employs teacher-made tests (Haynie, 1983, 1990, 1991; Herman & Dorr-Bremme, 1982; Mehrens & Lehmann, 1987; Newman & Stallings, 1982). Despite this dependence upon teacher-made tests, Stiggins, Conklin, and Bridgeford (1986) point out that "nearly all major studies of testing in the schools have focused on the role of standardized tests" (p. 5). Research concerning teacher-constructed tests has found that teachers lack understanding of measurement (Fleming & Chambers, 1983; Gullickson & Ellwein, 1985; Mehrens & Lehmann, 1987; Stiggins & Bridgeford, 1985). Research has also shown that teachers lack sufficient training in test development, fail to analyze tests, do not establish reliability or validity, do not use a test blueprint, weight all content equally, rarely test above the basic knowledge level, and use tests with grammatical and spelling errors (Burdin, 1982; Carter, 1984; Gullickson, 1982; Gullickson & Ellwein, 1985; Hills, 1991). Technically, their tests are simplistic and depend upon short-answer, true-false, and other easily prepared items. Their multiple-choice items often have serious flaws, especially in the distractors (Haynie, 1990; Mehrens & Lehmann, 1984, 1987; Newman & Stallings, 1982).

A few investigations have studied the value of tests as aids to learning subject content (Haynie, 1987, 1990, 1991; Nungester & Duchastel, 1982). Time on-task has been shown to be very important in many studies (Jackson, 1987; Salmon, 1982; Seifert & Beck, 1984), and taking a test is a time on-task learning activity. Studies comparing testing with similar on-task time spent in structured review of the material covered in class have had mixed results, but testing appears to be at least as effective as review in promoting learning (Haynie, 1990; Nungester & Duchastel, 1982). Research is lacking on the quality of tests and test items written by technology education teachers.

PURPOSE

The purpose of this investigation was to study the quality of technology education test items written by teachers. Face validity, clarity, accuracy in identifying taxonometric level, and rates of spelling and punctuation errors were among the determinants of quality assessed. Additionally, data were collected concerning teachers' experience levels, highest degree held, and sources of training in test construction. The following research questions were addressed in this study:

1. What types of errors are common in test items?

2. Do the error rates or types of errors in teacher-constructed test items vary with demographic factors?

3. Do teachers understand how to match test items to curriculum content and taxonometric level?

METHODOLOGY

SOURCE OF DATA

Between April 23, 1988 and January 8, 1990, a team of 15 technology education teachers worked to develop test items for a computerized test item bank for the North Carolina State Department of Public Instruction (SDPI). The work was completed under two projects funded by SDPI and directed by DeLuca and Haynie (1989, 1990) at North Carolina State University. The data for this study came from the items developed in those projects.
TEST ITEM AUTHORS

The teachers were selected on the recommendation of supervisors, SDPI consultants, or teacher educators. All were recognized as leaders among their peers, and most had been nominated for teacher of the year or program of the year commendation. They were all active in the North Carolina Technology Education Association and supported the transition to the new curriculum. Table 1 displays demographic data concerning the test item authors.

TABLE 1
PROFILE OF AUTHORS' DEMOGRAPHIC FACTORS
---------------------------------------------------------------------
          Years of                    Undergraduate     Graduate
          Teaching     Highest        Test & Measure    Test & Measure
Author    Experience   Degree         Courses           Courses
---------------------------------------------------------------------
   1           9       B.S.                 0                 0
   2           5       B.S.                 1                 0
   3          23       B.S.                 0                 0
   4           4       B.S.                 0                 1
   5           5       B.S.                 0                 1
   6          23       M.Ed.                0                 1
   7          19       M.Ed.                0                 1
   8          17       M.Ed. + 2 yrs.       0                 2
   9          25       M.Ed.                0                 0
  10           5       M.Ed.                0                 0
  11           7       M.Ed.                0                 0
  12           7       B.S.                 0                 0
  13           7       M.Ed.                0                 0
  14          15       B.S.                 1                 0
  15           5       B.S.                 1                 1
---------------------------------------------------------------------

TRAINING OF AUTHORS

Teachers came to the university campus for a workshop on April 23, 1988. The project directors oriented the teachers to the computerized test bank, reviewed the revised technology education curriculum, and explained how to develop good test items. A 13-page instructional packet was also given to each author. It should be noted that the training session and instructional packet may confound attempts to generalize these findings. The authors were required to develop and properly code six items, which were submitted for approval and corrective feedback before they were allowed to proceed. The teachers who authored the items were paid an honorarium for their services.

EDITING AND CODING OF ITEMS

Each item was prepared on a separate sheet of paper with a coding sheet attached and completed by the teacher. The coding sheet identified the author, the specific objective tested, the taxonometric level, and information for the computerized system. The project directors edited the items with contrasting colored felt tip pens on the teachers' original forms.

DESIGN OF THIS STUDY

The data for this investigation were the editing markings on the original test items submitted by the teachers. Scores for nine scales of information were recorded for analysis. Each of the scales was established so that a low score would be optimal. The scales were Spelling Errors (SE), Punctuation Errors (PE), Distractors (D), Key (K), Usability (U), Validity (V), Stem Clarity (SC), Taxonomy (TX), and an overall Quality (Q) rating. After all of the ratings were completed, the General Linear Models (GLM) procedure was used for F testing, and the LSD procedure was used when t-tests were appropriate.

FINDINGS

SPELLING ERRORS (SE)

The frequency and percentage of scores for the 993 items on the nine ratings, and the mean scores for each factor, are shown in Table 2. An item's SE rating indicates how many words were misspelled in the item. There were 98 items (10%) with one or more spelling errors. Spelling errors are detrimental to good teaching and testing; however, the literature shows that this problem is common to other disciplines.
TABLE 2
RATINGS OF TEST ITEM QUALITY
----------------------------------------------------------------------
                                  Frequency of    % of      Mean
                                  Items With      Items/    Item
Rating Category          Score    Each Score      Score     Score    SD
----------------------------------------------------------------------
Spelling Errors (SE)       0         895          90.1
                           1          76           7.7
                           2          11           1.1
                           3           6           0.6
                           4           3           0.3
                           5           1           0.1
                           6           1           0.1
  SE Totals               ---        993         100%       0.14    0.52
----------------------------------------------------------------------
Punctuation Errors (PE)    0         735          74.0
                           1         220          22.2
                           2          25           2.5
                           3           4           0.4
                           4           1           0.1
                           5           8           0.8
  PE Totals               ---        993         100%       0.38    0.68
----------------------------------------------------------------------
Distractors (D)            0         447          45.0
                           1         398          40.1
                           2          95           9.6
                           3          30           3.0
                           4           9           0.9
                           5          14           1.4
  D Totals                ---        993         100%       0.79    0.96
----------------------------------------------------------------------
Key (K)                    0         889          89.5
                           2         104          10.5
  K Totals                ---        993         100%       0.21    0.61
----------------------------------------------------------------------
Usability (U)              0         249          25.1
                           1         265          26.7
                           2         159          16.0
                           3         131          13.2
                           4          74           7.5
                           5          50           5.0
                           6          21           2.1
                           7          11           1.1
                           8          16           1.6
                           9          17           1.7
  U Totals                ---        993         100%       2.02    2.04
----------------------------------------------------------------------
Stem Clarity (SC)          0         602          60.6
                           1         352          35.4
                           2          39           3.9
  SC Totals               ---        993         100%       0.43    0.57
----------------------------------------------------------------------
Taxonomy (TX)              0         835          84.1
                           1         124          12.5
                           2          34           3.4
  TX Totals               ---        993         100%       0.19    0.47
----------------------------------------------------------------------
Quality (Q)                0         208          20.9
                           1         235          23.7
                           2         200          20.1
                           3         129          13.0
                           4          74           7.5
                           5          58           5.8
                           6          42           4.2
                           7          17           1.7
                           8          10           1.0
                           9          12           1.2
                          10           2           0.2
                          11           3           0.3
                          12           1           0.1
                          13           1           0.1
                          14           1           0.1
                          15           0           ---
                          16           0           ---
                          17           1           0.1
  Q Totals                ---        993         100%       2.28    2.20
----------------------------------------------------------------------
NOTE: There were 993 items.

The authors were compared on each of the scales to determine whether they differed significantly and to see whether similar or dissimilar errors were made by different authors. On the spelling errors factor, the authors were found to differ significantly, F(14, 978) = 11.99, p < .0001.
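The summary statistics in Table 2 follow directly from the frequency distributions. As an illustration only, the short Python sketch below recomputes the SE mean and standard deviation from the Table 2 frequencies and then shows a simple one-way F test of the kind used to compare authors. It is not the study's analysis software: the study used the GLM and LSD procedures, scipy's f_oneway is substituted here, and the per-author score lists are hypothetical placeholders rather than the study's data.

    # Illustrative sketch only; reconstructs Table 2's SE summary statistics
    # and demonstrates a one-way F test of the sort used to compare authors.
    from scipy import stats

    # SE frequency distribution from Table 2: spelling-error score -> item count
    se_freq = {0: 895, 1: 76, 2: 11, 3: 6, 4: 3, 5: 1, 6: 1}

    n_items = sum(se_freq.values())                              # 993 items
    se_mean = sum(s * n for s, n in se_freq.items()) / n_items   # about 0.14
    se_var = sum(n * (s - se_mean) ** 2
                 for s, n in se_freq.items()) / (n_items - 1)
    print(f"SE mean = {se_mean:.2f}, SD = {se_var ** 0.5:.2f}")

    # Hypothetical per-author SE scores (placeholder values, not the study's
    # data). The study itself compared 15 authors with a General Linear
    # Models procedure, reporting F(14, 978) = 11.99, p < .0001 for SE.
    author_se_scores = [
        [0, 0, 1, 0, 0, 2, 0],   # author A (placeholder)
        [0, 0, 0, 0, 1, 0, 0],   # author B (placeholder)
        [1, 0, 0, 3, 0, 0, 1],   # author C (placeholder)
    ]
    f_stat, p_value = stats.f_oneway(*author_se_scores)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

The recomputed mean and standard deviation agree with the 0.14 and 0.52 reported for SE in Table 2.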
REFERENCES

Burdin, J.L. (1982). Teacher certification. In H.E. Mitzel (Ed.), Encyclopedia of educational research (5th ed.). New York: Free Press.

Carter, K. (1984). Do teachers understand the principles for writing tests? Journal of Teacher Education, 35(6), 57-60.

DeLuca, V.W. & Haynie, W.J. (1990). Updating, computerization, and field validation of competency-based test-item banks for selected construction and communications technology courses (Contract No. RFP 90-A-07). Raleigh, NC: North Carolina State Department of Public Instruction.
DeLuca, V.W. & Haynie, W.J. (1989). Updating, computerization, and field validation of competency-based test-item banks for selected manufacturing technology education courses (Contract No. RFP 88-R-03). Raleigh, NC: North Carolina State Department of Public Instruction.

Fleming, M. & Chambers, B. (1983). Teacher-made tests: Windows on the classroom. In W.E. Hathaway (Ed.), Testing in the schools: New directions for testing and measurement, No. 19 (pp. 29-38). San Francisco: Jossey-Bass.

Gullickson, A.R. (1982). Survey data collected in survey of South Dakota teachers' attitudes and opinions toward testing. Vermillion: University of South Dakota.

Gullickson, A.R. & Ellwein, M.C. (1985). Post hoc analysis of teacher-made tests: The goodness-of-fit between prescription and practice. Educational Measurement: Issues and Practice, 4(1), 15-18.

Haynie, W.J. (1983). Student evaluation: The teachers' most difficult job. Monograph Series of the Virginia Industrial Arts Teacher Education Council, Monograph Number 11.

Haynie, W.J. (1987). Anticipation of tests as a learning variable. Unpublished manuscript, North Carolina State University, Raleigh, NC.

Haynie, W.J. (1990). Effects of tests and anticipation of tests on learning via videotaped materials. Journal of Industrial Teacher Education, 27(4), 18-30.

Haynie, W.J. (1991). Effects of take-home and in-class tests on delayed retention learning acquired via individualized, self-paced instructional texts. Manuscript submitted for publication.

Herman, J. & Dorr-Bremme, D.W. (1982). Assessing students: Teachers' routine practices and reasoning. Paper presented at the annual meeting of the American Educational Research Association, New York.

Hills, J.R. (1991). Apathy concerning grading and testing. Phi Delta Kappan, 72(7), 540-545.

Jackson, S.D. (1987). The relationship between time and achievement in selected automobile mechanics classes. Doctoral dissertation, Texas A&M University.

Mehrens, W.A. & Lehmann, I.J. (1984). Measurement and evaluation in education and psychology (3rd ed.). New York: Holt, Rinehart, and Winston.

Mehrens, W.A. & Lehmann, I.J. (1987). Using teacher-made measurement devices. NASSP Bulletin, 71(496), 36-44.

Newman, D.C. & Stallings, W.M. (1982, March). Teacher competency in classroom testing, measurement preparation, and classroom testing. Paper presented at the annual meeting of the National Council on Measurement in Education. (In Mehrens & Lehmann, 1987)

Nungester, R.J. & Duchastel, P.C. (1982). Testing versus review: Effects on retention. Journal of Educational Psychology, 74(1), 18-22.

Salmon, P.B. (Ed.). (1982). Time on task: Using instructional time more effectively. Arlington, VA: American Association of School Administrators.

Seifert, E.H. & Beck, J.J. (1984). Relationships between task time and learning gains in secondary schools. Journal of Educational Research, 78(1), 5-10.

Stiggins, R.J. & Bridgeford, N.J. (1985). The ecology of classroom assessment. Journal of Educational Measurement, 22(4), 271-286.

Stiggins, R.J., Conklin, N.F. & Bridgeford, N.J. (1986). Classroom assessment: A key to effective education. Educational Measurement: Issues and Practice, 5(2), 5-17.

----------------

W.J. Haynie, III is Associate Professor, Department of Occupational Education, North Carolina State University, Raleigh, NC.

Permission is given to copy any article or graphic provided credit is given and the copies are not intended for sale.