Volume 4, Number 1
Fall 1992
Post Hoc Analysis of Test Items Written by Technology Education Teachers
W. J. Haynie, III
Technology education teachers frequently author their
own tests. The effectiveness of tests depends upon many
factors; however, it is clear that the quality of each
individual item is of great importance. This study sought
to determine the quality of teacher-authored test items in
terms of nine rating factors.
BACKGROUND
Most testing in schools employs teacher-made tests
(Haynie, 1983, 1990, 1991; Herman & Dorr-Bremme, 1982;
Mehrens & Lehmann, 1987; Newman & Stallings, 1982). Despite
this dependance upon teacher-made tests, Stiggins, Conklin,
and Bridgeford (1986) point out that "nearly all major
studies of testing in the schools have focused on the role
of standardized tests" (p. 5).
Research concerning teacher-constructed tests has found
that teachers lack understanding of measurement (Fleming &
Chambers, 1983; Gullickson & Ellwein, 1985; Mehrens &
Lehmann, 1987; Stiggins & Bridgeford, 1985). Research has
shown that teachers lack sufficient training in test
development, fail to analyze tests, do not establish
reliability or validity, do not use a test blueprint, weight
all content equally, rarely test above the basic knowledge
level, and use tests with grammatical and spelling errors
(Burdin, 1982; Carter, 1984; Gullickson, 1982; Gullickson
& Ellwein, 1985; Hills, 1991). Technically, their tests are
simplistic and depend upon short-answer, true-false, and
other easily prepared items. Their multiple-choice items
often have serious flaws--especially in distractors (Haynie,
1990; Mehrens & Lehmann, 1984, 1987; Newman & Stallings,
1982).
A few investigations have studied the value of tests as
aids to learning subject content (Haynie, 1987, 1990, 1991;
Nungester & Duchastel, 1982). Time on task has been shown
to be very important in many studies (Jackson, 1987; Salmon,
1982; Seifert & Beck, 1984). Taking a test is a
time-on-task learning activity. Studies that compared
testing with similar on-task time spent in structured review
of the material covered in class have had mixed results, but
testing appears to be at least as effective as review in
promoting learning (Haynie, 1990; Nungester & Duchastel,
1982). Research is lacking on the quality of tests and test
items written by technology education teachers.
PURPOSE
The purpose of this investigation was to study the
quality of technology education test items written by
teachers. Face validity, clarity, accuracy in identifying
taxonometric level, and rates of spelling and punctuation
errors were some of the determinants of quality assessed.
Additionally, data were collected concerning teachers'
experience levels, highest degree held, and sources of
training in test construction. The following research
questions were addressed in this study:
1. What types of errors are common in test items?
2. Do the error rates or types of errors in teacher-constructed
test items vary with demographic factors?
3. Do teachers understand how to match test items to
curriculum content and taxonometric level?
METHODOLOGY
SOURCE OF DATA
Between April 23, 1988 and January 8, 1990, a team of
15 technology education teachers worked to develop test
items for a computerized test item bank for the North
Carolina State Department of Public Instruction (SDPI). The
work was completed under two projects funded by SDPI and
directed by DeLuca and Haynie (1989, 1990) at North Carolina
State University. The data for this study came from the
items developed in those projects.
TEST ITEM AUTHORS
The teachers were selected on recommendation of
supervisors, SDPI consultants, or teacher educators. All
were recognized as leaders among their peers and most had
been nominated for teacher of the year or program of the
year commendation. They were all active in the North
Carolina Technology Education Association and supported the
transition to the new curriculum. Table 1 displays
demographic data concerning the test item authors.
TABLE 1
PROFILE OF AUTHORS' DEMOGRAPHIC FACTORS
------------------------------------------------------------
                                   Undergraduate  Graduate
           Years of                Test &         Test &
           Teaching    Highest     Measure        Measure
Author     Experience  Degree      Courses        Courses
------------------------------------------------------------
1 9 B.S. 0 0
2 5 B.S. 1 0
3 23 B.S. 0 0
4 4 B.S. 0 1
5 5 B.S. 0 1
6 23 M.Ed. 0 1
7 19 M.Ed. 0 1
8 17 M.Ed. + 2 yrs. 0 2
9 25 M.Ed. 0 0
10 5 M.Ed. 0 0
11 7 M.Ed. 0 0
12 7 B.S. 0 0
13 7 M.Ed. 0 0
14 15 B.S. 1 0
15 5 B.S. 1 1
---------------------------------------------------
TRAINING OF AUTHORS
Teachers came to the university campus for a workshop
on April 23, 1988. Project directors oriented teachers to
the computerized test bank, reviewed the revised technology
education curriculum, and explained how to develop good test
items. A 13-page instructional packet was also given to
each author. It should be noted that the training session
and instructional packet may confound attempts to generalize
these findings.
The authors were required to develop and properly code
six items which were submitted for approval and corrective
feedback before they were allowed to proceed. The teachers
who authored the items were paid an honorarium for their
services.
EDITING AND CODING OF ITEMS
Each item was prepared on a separate sheet of paper
with a coding sheet attached and completed by the teacher.
The coding sheet identified the author, the specific
objective tested, the taxonometric level, and information
for the computerized system. The project directors edited
the items with contrasting colored felt-tip pens on the
teachers' original forms.
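A minimal sketch of the information carried on each item's
coding sheet is given below; the field names and example
values are hypothetical, since the article does not specify
the SDPI test bank's exact format.

    # Hypothetical sketch of the data captured on each item's
    # coding sheet (field names are not from the SDPI system).
    from dataclasses import dataclass

    @dataclass
    class CodingSheet:
        author_id: int           # which of the 15 teachers wrote the item
        objective: str           # specific curriculum objective tested
        taxonometric_level: str  # cognitive level claimed by the author
        bank_codes: str          # information for the computerized system

    sheet = CodingSheet(author_id=7,
                        objective="hypothetical objective code",
                        taxonometric_level="comprehension",
                        bank_codes="hypothetical bank code")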
DESIGN OF THIS STUDY
The data for this investigation were the editing
markings on the original test items submitted by the
teachers. Scores for nine scales of information were recorded
for analysis. Each of the scales was established so that a
low score would be optimal. The scales were Spelling Errors
(SE), Punctuation Errors (PE), Distractors (D), Key (K),
Usability (U), Validity (V), Stem Clarity (SC), Taxonomy
(TX), and an overall Quality (Q) rating. After all of the
ratings were completed, the General Linear Models (GLM)
procedure was used for F testing, and the least significant
difference (LSD) procedure was used when t-tests were
appropriate.
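The sketch below is a minimal illustration, not the original
SAS analysis, of how a one-way F test of one rating scale
across the 15 authors, followed by pairwise comparisons in
the spirit of the LSD procedure, could be computed. The data
file "items.csv" and its column names are hypothetical.

    # Minimal sketch of a one-way F test of the Spelling Errors (SE)
    # rating across authors; "items.csv" and its columns ("author",
    # "se") are hypothetical.
    import csv
    from collections import defaultdict
    from scipy import stats

    ratings_by_author = defaultdict(list)
    with open("items.csv", newline="") as f:
        for row in csv.DictReader(f):
            ratings_by_author[row["author"]].append(float(row["se"]))

    # One-way ANOVA: do mean SE ratings differ among authors?
    groups = list(ratings_by_author.values())
    f_stat, p_value = stats.f_oneway(*groups)
    df_between = len(groups) - 1
    df_within = sum(len(g) for g in groups) - len(groups)
    print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.4f}")

    # Pairwise follow-up comparisons between authors.
    authors = sorted(ratings_by_author)
    for i, a in enumerate(authors):
        for b in authors[i + 1:]:
            t, p = stats.ttest_ind(ratings_by_author[a],
                                   ratings_by_author[b])
            if p < 0.05:
                print(f"Authors {a} and {b} differ: "
                      f"t = {t:.2f}, p = {p:.4f}")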
FINDINGS
SPELLING ERRORS (SE)
The frequency and percentage of scores for the 993
items on the nine ratings, and mean scores of each factor,
are shown in Table 2. An item's SE rating indicates how
many words were misspelled in the item. There were 98 items
(10%) which had one or more spelling errors. Spelling
errors are detrimental to good teaching and testing;
however, the literature shows that this problem is common
in other disciplines as well.
TABLE 2
RATINGS OF TEST ITEM QUALITY
-----------------------------------------------------------
                                Frequency of    % of     Mean
                                Items with      Items/   Item
Rating Category          Score  Each Score      Score    Score   SD
-----------------------------------------------------------
Spelling Errors (SE) 0 895 90.1
1 76 7.7
2 11 1.1
3 6 0.6
4 3 0.3
5 1 0.1
6 1 0.1
SE Totals --- 993 100% 0.14 0.52
-----------------------------------------------------------
Punctuation Errors(PE) 0 735 74.0
1 220 22.2
2 25 2.5
3 4 0.4
4 1 0.1
5 8 0.8
PE Totals --- 993 100% 0.38 0.68
-----------------------------------------------------------
Distractors (D) 0 447 45.0
1 398 40.1
2 95 9.6
3 30 3.0
4 9 0.9
5 14 1.4
D Totals --- 993 100% 0.79 0.96
-----------------------------------------------------------
Key (K) 0 889 89.5
2 104 10.5
K Totals --- 993 100% 0.21 0.61
-----------------------------------------------------------
Usability (U) 0 249 25.1
1 265 26.7
2 159 16.0
3 131 13.2
4 74 7.5
5 50 5.0
6 21 2.1
7 11 1.1
8 16 1.6
9 17 1.7
U Totals --- 993 100% 2.02 2.04
-----------------------------------------------------------
Stem Clarity (SC) 0 602 60.6
1 352 35.4
2 39 3.9
SC Totals --- 993 100% 0.43 0.57
-----------------------------------------------------------
Taxonomy (TX) 0 835 84.1
1 124 12.5
2 34 3.4
TX Totals --- 993 100% 0.19 0.47
-----------------------------------------------------------
Quality (Q) 0 208 20.9
1 235 23.7
2 200 20.1
3 129 13.0
4 74 7.5
5 58 5.8
6 42 4.2
7 17 1.7
8 10 1.0
9 12 1.2
10 2 0.2
11 3 0.3
12 1 0.1
13 1 0.1
14 1 0.1
15 0 ---
16 0 ---
17 1 0.1
Q Totals ---- 993 100% 2.28 2.20
----------------------------------------------------------
NOTE. There were 993 items.
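As a check on how the means in Table 2 are derived, each
mean score appears to be the frequency-weighted average of
the item scores. For the SE scale, for example,
(0 x 895 + 1 x 76 + 2 x 11 + 3 x 6 + 4 x 3 + 5 x 1 + 6 x 1)
/ 993 = 139 / 993 = 0.14, which matches the reported mean.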
The authors were compared on each of the scales to
determine whether they differed significantly and to see if
similar or dissimilar errors were made by different authors.
On the spelling errors factor, the authors were found to
differ significantly, F(14, 978) = 11.99, p < .0001.
REFERENCES
Burdin, J.L. (1982). Teacher certification. In H.E. Mitzel
(Ed.), Encyclopedia of educational research (5th ed.). New
York: Free Press.
Carter, K. (1984). Do teachers understand the principles for
writing tests? Journal of Teacher Education, 35(6),
57-60.
DeLuca, V.W. & Haynie, W.J. (1989). Updating,
computerization, and field validation of
competency-based test-item banks for selected
manufacturing technology education courses (Contract No.
RFP 88-R-03). Raleigh, NC: North Carolina State
Department of Public Instruction.
DeLuca, V.W. & Haynie, W.J. (1990). Updating,
computerization, and field validation of
competency-based test-item banks for selected
construction and communications technology
courses (Contract No. RFP 90-A-07). Raleigh, NC: North
Carolina State Department of Public Instruction.
Fleming, M. & Chambers, B. (1983). Teacher-made tests:
Windows on the classroom. In W. E. Hathaway (Ed.),
Testing in the schools: New directions for testing and
measurement, No. 19 (pp. 29-38). San Francisco:
Jossey-Bass.
Gullickson, A.R. (1982). Survey data collected in survey of
South Dakota teachers' attitudes and opinions toward
testing. Vermillion: University of South Dakota.
Gullickson, A.R. & Ellwein, M.C. (1985). Post hoc analysis
of teacher-made tests: The goodness-of-fit between
prescription and practice. Educational Measurement:
Issues and Practice, 4(1), 15-18.
Haynie, W.J. (1983). Student evaluation: The teachers' most
difficult job. Monograph Series of the Virginia
Industrial Arts Teacher Education Council, Monograph
Number 11.
Haynie, W.J. (1987). Anticipation of tests as a learning
variable. Unpublished manuscript, North Carolina State
University, Raleigh, NC.
Haynie, W.J. (1990). Effects of tests and anticipation of
tests on learning via videotaped materials. Journal of
Industrial Teacher Education, 27(4), 18-30.
Haynie, W.J. (1991). Effects of take-home and in-class tests
on delayed retention learning acquired via
individualized, self-paced instructional texts.
Manuscript submitted for publication.
Herman, J. & Dorr-Bremme, D.W. (1982). Assessing
students: Teachers' routine practices and reasoning.
Paper presented at the annual meeting of the American
Educational Research Association, New York.
Hills, J.R. (1991). Apathy concerning grading and testing.
Phi Delta Kappan, 72(7), 540-545.
Jackson, S.D. (1987). The relationship between time and
achievement in selected automobile mechanics classes.
(Doctoral dissertation, Texas A&M University).
Mehrens, W.A. & Lehmann, I.J. (1984). Measurement and
evaluation in education and psychology (3rd ed.). New
York: Holt, Rinehart, and Winston.
Mehrens, W.A. & Lehmann, I.J. (1987). Using teacher-made
measurement devices. NASSP Bulletin, 71(496), 36-44.
Newman, D.C. & Stallings, W.M. (1982, March). Teacher
competency in classroom testing, measurement
preparation, and classroom testing. Paper
presented at the Annual Meeting of the National Council
on Measurement in Education. (In Mehrens & Lehmann,
1987)
Nungester, R.J. & Duchastel, P.C. (1982). Testing versus
review: Effects on retention. Journal of Educational
Psychology, 74(1), 18-22.
Salmon, P.B. (Ed.). (1982). Time on task: Using
instructional time more effectively. Arlington, VA:
American Association of School Administrators.
Seifert, E.H. & Beck, J.J. (1984). Relationships between
task time and learning gains in secondary schools.
Journal of Educational Research, 78(1), 5-10.
Stiggins, R.J. & Bridgeford, N.J. (1985). The ecology of
classroom assessment. Journal of Educational
Measurement, 22(4), 271-286.
Stiggins, R.J., Conklin, N.F. & Bridgeford, N.J. (1986).
Classroom assessment: A key to effective education.
Educational Measurement: Issues and Practice, 5(2),
5-17.
----------------
W.J. Haynie, III is Associate Professor, Department of
Occupational Education, North Carolina State University,
Raleigh, NC.
Permission is given to copy any
article or graphic provided credit is given and
the copies are not intended for sale.