DESIGNING
CLASSROOM LANGUAGE TESTS
1. What is the purpose of the test? Why am I creating this test, or why was it created by someone else? For an evaluation of overall proficiency? To place students into a course? To measure achievement within a course? Once you have established the major purpose of a test, you can determine its objectives.
2. What are the objectives of the test? What specifically am I trying to find out?
Establishing appropriate objectives involves a number of issues, ranging from
relatively simple ones about forms and functions covered in a course unit to
much more complex ones about constructs to be operationalized in the test.
Included here are decisions about what language abilities are to be assessed.
3. How will the test specifications reflect both the purpose and the objectives? To evaluate or design a test, you must make sure that the objectives are incorporated into a structure that appropriately weights the various competencies being assessed.
4. How will the test tasks be selected and the separate items arranged? The tasks that the test-takers must perform need to be practical in the ways defined in the previous chapter. They should also achieve content validity by presenting tasks that mirror those of the course being assessed. Further, they should be able to be evaluated reliably by the teacher or scorer. The tasks themselves should strive for authenticity, and the progression of tasks ought to be biased for best performance.
5. What kind of scoring, grading, and/or feedback is expected? Tests vary in the form and function of feedback, depending on their purpose. For every test, the way results are reported is an important consideration. Under some circumstances a letter grade or a holistic score may be appropriate; other circumstances may require that a teacher offer substantive washback to the learner.
TEST
TYPES
The first task you will face in designing a test for your students is to determine the purpose of the test. Defining your purpose will help you choose the right kind of test, and it will also help you focus on the specific objectives of the test. We will look first at two test types that you will probably not have many opportunities to create as a classroom teacher (language aptitude tests and language proficiency tests) and three types that you will almost certainly need to create (placement tests, diagnostic tests, and achievement tests).
Language
Aptitude Tests
One type of test, although admittedly not a very common one, predicts a person's success prior to exposure to the second language. A language aptitude test is designed to measure capacity or general ability to learn a foreign language and ultimate success in that undertaking.
Two standardized aptitude tests have been used in the United States: the Modern Language Aptitude Test (MLAT) and the Pimsleur Language Aptitude Battery (PLAB). Both are English-language tests and require students to perform a number of language-related tasks. The MLAT, for example, consists of five tasks.
1. Number
learning: Examinees must learn a set of numbers through aural input and then
discriminate different combinations of those numbers.
2. Phonetic
script: Examinees must learn a set of correspondences between speech sounds and
phonetic symbols.
3. Spelling
clues: Examinees must read words that are spelled somewhat phonetically, and
then select from a list the one word whose meaning is closest to the
“disguised” word.
4. Words
in sentences: Examinees are given a key word in a sentence and are then asked
to select a word in a second sentence that performs the same grammatical
function as the key word.
5. Paired
associates: Examinees must quickly learn a set of vocabulary words from another
language and memorize their English meanings.
Proficiency
Tests
If your aim is to test global competence in a language, then you are, in conventional terminology, testing proficiency. A proficiency test is not limited to any one course, curriculum, or single skill in the language; it tests overall ability. Proficiency tests have traditionally consisted of standardized multiple-choice items on grammar, vocabulary, reading comprehension, and aural comprehension. Sometimes a sample of writing is added, and more recent tests also include oral production performance. As noted in the previous chapter, such additions have brought us much closer to constructing successful communicative proficiency tests.
A typical example of a standardized proficiency test is the Test of English as a Foreign Language (TOEFL) produced by the Educational Testing Service. The TOEFL is used by more than a thousand institutions of higher education in the United States as an indicator of a prospective student's ability to undertake academic work in an English-speaking milieu. The TOEFL consists of sections on listening comprehension, structure, reading comprehension, and written expression.
A key issue in testing proficiency is how the constructs of language ability are specified. The tasks that test-takers are required to perform must be legitimate samples of English language use in a defined context. Creating these tasks and validating them with research is a time-consuming and costly process. Language teachers would be wise not to create an overall proficiency test on their own. A far more practical method is to choose one of a number of commercially available proficiency tests.
Placement
Tests
Certain proficiency tests can act in the role of placement tests, the purpose of which is to place a student into a particular level or section of a language curriculum or school. A placement test usually, but not always, includes a sampling of the material to be covered in the various courses in a curriculum; a student's performance on the test should indicate the point at which the student will find material neither too easy nor too difficult but appropriately challenging.
Placement tests come in many varieties: assessing comprehension and production, responding through written and oral performance, open-ended and limited responses, selection (e.g., multiple-choice) and gap-filling formats, depending on the nature of a program and its needs. Some programs simply use existing standardized proficiency tests because of their obvious advantages in practicality: cost, speed in scoring, and efficient reporting of results. Others prefer the performance data available in more open-ended written and/or oral production. The ultimate objective of a placement test is, of course, to correctly place a student into a course or level. Secondary benefits to consider include face validity, diagnostic information on a student's performance, and authenticity.
In a recent one-month special summer program in English conversation and writing at San Francisco State University, 30 students were to be placed into one of two sections. The ultimate objective of the placement test (consisting of a five-minute oral interview and an essay-writing task) was to find a performance-based means to divide the students evenly into sections. This objective might have been achieved easily by administering a simple grid-scorable multiple-choice grammar-vocabulary test. But the interview and writing sample added some important face validity, gave a more personal touch to a small program, and provided some diagnostic information on a group of learners about whom we knew very little prior to their arrival on campus.
Diagnostic
Tests
A diagnostic test is designed to diagnose specified
aspects of a language. A test in pronunciation, for example, might diagnose the
phonological features of English that are difficult for learners and should
therefore become part of a curriculum. Usually, such tests offer a checklist of
features for the administrator (often the teacher) to use in pinpointing
difficulties. A writing diagnostic would elicit a writing sample from students
that would allow the teacher to identify those rhetorical and linguistic
features on which the course needed to focus special attention.
Diagnostic and
placement tests, as we have already implied, may sometimes be indistinguishable
from each other. The San Francisco State ESLPT serves dual purposes. Any
placement test that offers information beyond simply designating a course level
may also serve diagnostic purposes.
There is also a fine
line of difference between a diagnostic test and a general achievement test. Achievement tests
analyze the extent to which students have acquired language features that have
already been taught; diagnostic tests should elicit information on what
students need to work on in the future. Therefore, a diagnostic test will
typically offer more detailed, subcategorized information on the learner. In a curriculum that has a form-focused phase, for example, a diagnostic test might
offer information about a learner’s acquisition of verb tenses, modal
auxiliaries, definite articles, relative clauses, and the like.
A
typical diagnostic test of oral production was created by Clifford Prator
(1972) to accompany a manual of English pronunciation. Test-takers are directed
to read a 150-word passage while they are tape-recorded. The test administrator then refers to an inventory of phonological
items for analyzing a learner’s production. After multiple listenings, the
administrator produces a checklist of errors in five separate categories, each
of which has several subcategories. The main categories include
1. Stress and rhythm,
2. Intonation,
3. Vowels,
4. Consonants, and
5. Other factors.
An example of subcategories is shown in this list for the first category (stress and rhythm):
a.
Stress on the
wrong syllable (in multi-syllabic words)
b.
Incorrect
sentence stress
c.
Incorrect
division of sentences into thought groups
d.
Failure to make
smooth transitions between words or syllables
Each subcategory is appropriately
referenced to a chapter and section of Prator’s manual. This information can
help teachers make decisions about aspects of English phonology on which to
focus. This same information can help a student become aware of errors and
encourage the adoption of appropriate
compensatory strategies.
Achievement Tests
An achievement test is related
directly to classroom lessons, units, or even a total curriculum. Achievement
tests are (or should be) limited to particular material addressed in a
curriculum within a particular time frame and are offered after a course has
focused on the objectives in question. Achievement tests can
also serve the diagnostic role of indicating what a student needs to continue
to work on in the future, but the primary role of an achievement test is to
determine whether course objectives have been met, and appropriate knowledge and skills acquired, by the end of a period of instruction.
Achievement tests are
often summative because they are administered at the end of a unit or term of
study. They also play an important formative role. An effective achievement
test will offer washback about the quality of a learner’s performance in
subsets of the unit or course. This washback contributes to the formative
nature of such tests.
The specifications for an achievement test should be
determined by
·
The objectives of the lesson, unit, or
course being assessed,
·
The relative importance (or weight)
assigned to each objective,
·
The tasks employed in classroom lessons
during the unit of time,
·
Practicality issues, such as the time
frame for the test and turnaround time, and
·
The extent to which the test structure
lends itself to formative washback.
Achievement tests range from five- or ten-minute quizzes to three-hour final examinations, with an almost infinite variety of item types and formats. Here is the outline for a midterm examination offered at the high-intermediate level of an intensive English program in the United States. The course focus is on academic reading and writing; the structure of the course and its objectives may be implied from the sections of the test.
[Midterm examination outline not reproduced.]
SOME
PRACTICAL STEPS TO TEST CONSTRUCTION
The descriptions of types of tests in the preceding section are intended to help you understand how to answer the first question posed in this chapter: What is the purpose of the test? It is unlikely that you would be asked to design an aptitude test or a proficiency test, but for the purposes of interpreting those tests, it is important that you understand their nature. However, your opportunities to design placement, diagnostic, and achievement tests, especially the latter, will be plentiful. In the remainder of this chapter, we will explore the four remaining questions posed at the outset, and the focus will be on equipping you with the tools you need to create such classroom-oriented tests.
You may think that
every test you devise must be a wonderfully innovative instrument that will
garner the accolades of your colleagues and the admiration of your students.
Not so. First, new and innovative testing formats take a lot of effort to
design and a long time to refine through trial and error. Second, traditional
testing techniques can, with a little creativity, conform to the spirit of an
interactive, communicative language curriculum. Your
best tack as a new teacher is to work within the guidelines of accepted, known,
traditional testing techniques. Slowly, with experience, you can get bolder in
your attempts. In that spirit, then, let us consider some practical steps in
constructing classroom tests.
Assessing Clear, Unambiguous Objectives
In
addition to knowing the purpose of the test you’re creating, you need to know
as specifically as possible what it is you want to test. Sometimes teachers
give tests simply because it’s Friday of the third week of the course, and
after hasty glances at the chapter(s) covered during those three weeks, they
dash off some test items so that students will have something to do during the
class. This is no way to approach a test. Instead, begin by taking a careful
look at everything that you think your students should “know” or be able to
“do,” based on the material that the
students are responsible for. In other words, examine the objectives for
the unit you are testing.
Remember that every curriculum should have appropriately framed assessable objectives, that is, objectives that are stated in terms of overt performance by students. Thus, an objective that states "Students will learn tag questions" or simply names the grammatical focus "Tag questions" is not testable. You don't know whether students should be able to understand them in spoken or written language, or whether they should be able to produce them orally or in writing. Nor do you know in what context (a conversation? an essay? an academic lecture?) those linguistic forms should be used. Your first task in designing a test, then, is to determine appropriate objectives.
If you’re lucky, someone will have already stated those objectives
clearly in performance terms. If you’re little less fortunate, you may have to
go back through a unit and formulate them yourself. Let’s say you have been
teaching a unit in a low intermediate integrated –skills class with an emphasis
on social conversation, and involving some reading and writing, that includes
the objectives outlined below, either stated already or as you have reframed
them. Notice that each objective is stated in terms of the performance
elicited and the target linguistic domain.
[Unit objectives not reproduced.]
You may find, in reviewing the
objectives of a unit or a course, that you cannot possibly test each one. You
will then need to choose a possible subset of the objectives to test.
Drawing Up Test
Specifications
Test specifications for classroom use can be a simple and practical outline of your test. (For large-scale standardized tests that are intended to be widely distributed and therefore are broadly generalized, test specifications are much more formal and detailed.) In the unit discussed above, your specifications will simply comprise (a) a broad outline of the test, (b) what skills you will test, and (c) what the items will look like. Let's look at the first two in relation to the midterm unit assessment already referred to above.
(a)
Outline of the
test and (b) skills to be included. Because
of the constraints of your curriculum, your unit test must take no more than 30
minutes. This is an integrated curriculum, so you need to test all four skills.
Since you have the luxury of teaching a small class (only 12 students!), you
decide to include an oral production component in the preceding period (taking
students one by one into a separate room
while the rest of the class reviews the unit individually and completes
workbook exercises). You can therefore test oral production objectives directly
at that time. You determine that the 30-minute test will be divided equally in
time among listening, reading, and writing.
(c) Item types and tasks. The next and potentially more complex choices involve the item types and tasks to use in this test. It is surprising that there are a limited number of modes of eliciting responses (that is, prompting) and of responding on tests of any kind. Consider the options: the test prompt can be oral (the student listens) or written (the student reads), and the student can respond orally or in writing. It's that simple. But some complexity is added when you realize that the types of prompts in each case vary widely, and within each response mode, of course, there are a number of options, all of which are depicted in Figure 3.1.
[Figure 3.1 (elicitation and response modes) not reproduced.]
Granted, not all of the response
modes correspond to all of the elicitation modes. For example, it is unlikely
that directions would be read aloud, nor would spelling a word be matched with
a monologue. A modicum of intuition will eliminate these non sequiturs.
Armed with a number of elicitation
and response formats, you have decided to design your specs as follows, based
on the objectives stated earlier:
[Specifications grid not reproduced.]
These informal, classroom-oriented
specifications give you an indication of
·
The topics
(objectives) you will cover,
·
The implied
elicitation and response formats for items,
·
The number of
items in each section, and
·
The time to be
allocated for each.
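If it helps to keep these decisions in one place, informal specs like those just listed can be written down in a small data structure. The sketch below is purely illustrative (Python is assumed; the section labels and format descriptions are placeholders, since the original specifications grid is not reproduced here, while the times and item counts follow the running example: a 5-minute interview plus 30 in-class minutes split equally among listening, reading, and writing).

# Hypothetical record of the informal test specs described above.
test_specs = [
    {"section": "Oral interview", "format": "one-on-one interview", "items": None, "minutes": 5},
    {"section": "Listening",      "format": "multiple-choice",      "items": 10,   "minutes": 10},
    {"section": "Reading",        "format": "multiple-choice",      "items": 10,   "minutes": 10},
    {"section": "Writing",        "format": "guided paragraph",     "items": 1,    "minutes": 10},
]

# Check that the in-class portion stays within the 30 minutes the curriculum allows.
in_class_minutes = sum(s["minutes"] for s in test_specs if s["section"] != "Oral interview")
print("In-class test time:", in_class_minutes, "minutes")  # 30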
Notice that three of the six
speaking objectives are not directly tested. This decision may be based on the
time you devoted to these objectives, but more likely on the feasibility of
testing that objective or simply on the finite number of minutes available to
administer the test. Notice, too, that objectives 4 and 8 are not assessed. Finally, notice that this unit was mainly focused on listening and speaking, yet 20 minutes of the 35-minute test is devoted to reading and writing tasks. Is this an appropriate decision?
One more test spec that needs to be included is a plan for scoring and for assigning relative weight to each section and each item within it. This issue will be addressed later in this chapter when we look at scoring, grading, and feedback.
Devising Test Tasks
Your oral interview comes first, and so you draft questions that conform to the accepted pattern of oral interviews. You begin and end with nonscored items (warm-up and wind-down) designed to set students at ease, and then sandwich between them items intended to test the objective (level check) and a little beyond (probe).
[Draft oral interview questions not reproduced.]
You are now ready to draft other
test items. To provide a sense of authenticity and interest, you have decided
to conform your items to the context of a recent TV sitcom that you used in class to illustrate certain discourse and form-focused factors. The sitcom
depicted a loud, noisy party with lots of small talk. As you devise your test
items, consider such factors as how students will perceive them (face
validity), the extent to which authentic language and contexts are present,
potential difficulty caused by cultural schemata, the length of the listening
stimuli, how well a story line comes across, how things like the cloze testing
format will work, and other practicalities.
Let’s say your first draft of items
produces the following possibilities within each section:
[Draft test items not reproduced.]
As you can see, these items are
quite traditional. You might self-critically admit that the format of some of the
items is contrived, thus lowering the level of authenticity. But the thematic
format of the sections, the authentic language within each item, and the
contextualization add face validity, interest, and some humor to what might
otherwise be a mundane test. All four skills are represented, and the tasks are
varied within the 30 minutes of the test.
In revising your draft, you will
want to ask yourself some important questions:
1.
Are the
directions to each section absolutely clear?
2.
Is there an
example item for each section?
3.
Does each item
measure a specified objective?
4.
Is each item
stated in clear, simple language?
5.
Does each multiple-choice item have appropriate distractors; that is, are the wrong items clearly wrong and yet sufficiently "alluring" that they aren't ridiculously easy? (See below for a primer on creating effective distractors.)
6.
Is the
difficulty of each item appropriate for your students?
7.
Is the language
of each item sufficiently authentic?
8.
Do the sum of
the items and the test as a whole adequately reflect the learning objectives?
In the current example that we have
been analyzing, your revising process is likely to result in at least four
changes or additions:
1.
In both the interview and writing sections, you recognize that a scoring rubric will be essential. For the interview, you decide to create a holistic scale, and for the writing section you devise a simple analytic scale that captures only the objectives you have focused on.
2.
In the
interview questions, you realize that follow-up questions may be needed for students
who give one-word or very short answers.
3.
In the listening section, part B, you intend choice (c) as the correct answer, but you realize that choice (d) is also acceptable. You need an answer that is unambiguously incorrect, so you shorten (d) to "Around eleven o'clock." You also note that providing the prompts for this section on an audio recording will be logistically difficult, and so you opt to read these items to your students.
4.
In the writing prompt, you can see how some students might not use the words so or because, which were in your objectives, so you reword the prompt: "Name one of the characters at the party in the TV sitcom we saw. Then use the word so at least once and the word because at least once to tell why you liked or didn't like that person."
Ideally, you would try out all your
tests on students not in your class before actually administering the tests.
But in our daily classroom teaching, the tryout phase is almost impossible.
Alternatively, you could enlist the aid of a colleague to look over your test. And so you must do what you can to bring to your students an instrument that is, to the best of your ability, practical and reliable.
In the final revision of your test,
imagine that you are a student taking the test. Go through each set of
directions and all items slowly and deliberately. Time yourself. (Often we underestimate the time students will need to complete a test.) If the test
should be shortened or lengthened, make the necessary adjustments. Make sure
your test is neat and uncluttered on the page, reflecting all the care and
precision you have put into its construction. If there is an audio component,
as there is in our hypothetical test, make sure that the script is clear, that
your voice and any other voices are clear, and that the audio equipment is in
working order before starting the test.
Designing Multiple-Choice Test Items
In the sample achievement test above, two of the five components (both of the listening sections) specified a multiple-choice format for items. This was a bold step to take. Multiple-choice items, which may appear to be the simplest kind of item to construct, are extremely difficult to design correctly. Hughes (2003, pp. 76-78) cautions against a number of weaknesses of multiple-choice items:
·
The technique tests only recognition knowledge.
·
Guessing may have a considerable effect
on test scores.
·
It is very difficult to write successful
items.
·
Washback may be harmful.
The two principles that stand out in support of multiple-choice formats are, of course, practicality and reliability. With their predetermined correct responses and time-saving scoring procedures, multiple-choice items offer overworked teachers the tempting possibility of an easy and consistent process of scoring and grading. But is the preparation phase worth the effort? Sometimes it is, but you might spend even more time designing such items than you save in grading the test. Of course, if your objective is to design a large-scale standardized test for repeated administrations, then a multiple-choice format does indeed become viable.
First, a primer on terminology.
1.
Multiple-choice items are all receptive, or selective, response items in that the test-taker chooses from a set of responses (commonly called a supply type of response) rather than creating a response. Other receptive item types include true-false questions and matching lists. (In the discussion here, the guidelines apply primarily to multiple-choice item types and not necessarily to other receptive types.)
2.
Every multiple-choice item has a stem, which presents a stimulus, and several (usually between three and five) options or alternatives to choose from.
3.
One of those options, the key, is the correct response, while the others serve as distractors.
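To make this terminology concrete, here is a minimal sketch, not part of the original text, of an item as a simple data structure with a stem, a set of options, and a key, from which the distractors follow; the encoding of the wh-question example from guideline 1 below is only an illustration (Python is assumed).

from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    stem: str       # the stimulus the test-taker reads or hears
    options: dict   # label -> option text (the alternatives)
    key: str        # label of the correct option

    @property
    def distractors(self):
        # Every option that is not the key serves as a distractor.
        return {label: text for label, text in self.options.items() if label != self.key}

item = MultipleChoiceItem(
    stem="Where did George go after the party last night?",
    options={"a": "Yes, he did.",
             "b": "Because he was tired.",
             "c": "To Elaine's place for another party.",
             "d": "Around eleven o'clock."},
    key="c",
)
print(sorted(item.distractors))  # ['a', 'b', 'd']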
Since there will be occasions when multiple-choice items are appropriate, consider the following four guidelines for designing multiple-choice items for both classroom-based and large-scale situations (adapted from Gronlund, 1998, pp. 60-75, and J. D. Brown, 1996, pp. 54-57).
1.
Design each
item to measure a specific objective.
Voice: Where did George go after the party last night?
S reads: a. Yes, he did.
b. Because he was tired.
c. To Elaine's place for another party.
d. Around eleven o'clock.
The specific objective being tested here is comprehension of wh-questions. Distractor (a) is designed to ascertain that the student knows the difference between an answer to a wh-question and an answer to a yes/no question. Distractors (b) and (d), as well as the key item (c), test comprehension of the meaning of where as opposed to why and when. The objective has been directly addressed.
On the other hand, here is an item that was designed to test recognition of the correct word order of indirect questions.
Multiple-choice item, flawed
Excuse me, do you know ___?
a. Where is the post office
b. Where the post office is
c. Where post office is
2.
State both stem
and options as simply and directly as possible.
We are sometimes tempted to make multiple-choice items too wordy. A good rule of thumb is to get
directly to the point. Here’s an example.
Multiple-choice cloze item, flawed
My eyesight has really been deteriorating lately. I wonder if I need glasses. I think I'd better go to the ___ to have my eyes checked.
a. Pediatrician
b. Dermatologist
c. Optometrist
Another rule of succinctness is to remove needless redundancy from the options. In the following example, which were is repeated in every option; it should be placed in the stem to keep the item as succinct as possible.
Multiple-choice item, flawed
We went to visit the temples ___ fascinating.
a. Which were beautiful
b. Which were especially
c. Which were holy
3. Make certain that the intended answer is clearly the only correct one.
In the proposed unit test described earlier, the following item appeared in the original draft:
Multiple-choice item, flawed
Voice: Where did George go after the party last night?
S reads: a. Yes, he did.
b. Because he was tired.
c. To Elaine's place for another party.
d. He went home around eleven o'clock.
4.
Use item indices to accept, discard, or revise items.
The appropriate selection and arrangement of suitable multiple-choice items on a test can best be accomplished by measuring items against three indices: item facility (or item difficulty), item discrimination (sometimes called item differentiation), and distractor analysis. Although measuring these factors on classroom tests would be useful, you probably will have neither the time nor the expertise to do this for every classroom test you create, especially one-time tests. But they are a must for standardized norm-referenced tests that are designed to be administered a number of times and/or administered in multiple forms.
1.
Item facility (or IF) is the extent to which an item is easy or difficult for the proposed group of test-takers. You may wonder why that is important if, in your estimation, the item achieves validity. The answer is that an item that is too easy (say, 99 percent of respondents get it right) or too difficult (99 percent get it wrong) really does nothing to separate high-ability and low-ability test-takers. It is not really performing much "work" for you on a test.
IF simply reflects the percentage of students answering the item correctly. The formula looks like this:
IF = # of Ss answering the item correctly ÷ total # of Ss responding to that item
For example, if you have an item on which 13 out of 20 students respond correctly, your IF index is 13 divided by 20, or .65 (65 percent).
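If you keep item responses in a spreadsheet or a short script, the IF calculation is a one-liner. The sketch below (Python; the function name and data layout are illustrative, not from the text) reproduces the 13-out-of-20 example.

def item_facility(responses):
    # IF = number of students answering the item correctly,
    # divided by the total number of students responding to it.
    return sum(responses) / len(responses)

# 13 of 20 students answer the item correctly.
responses = [True] * 13 + [False] * 7
print(round(item_facility(responses), 2))  # 0.65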
2.
Item discrimination (ID) is the extent to which an item differentiates between high- and low-ability test-takers. An item on which high-ability students (who did well on the test) and low-ability students (who didn't) score equally well would have poor ID because it does not discriminate between the two groups. Conversely, an item that garners correct responses from most of the high-ability group and incorrect responses from most of the low-ability group has good discrimination power.
Suppose your class of 30 students has taken a test. Once you have calculated final scores for all 30 students, divide them roughly into thirds, that is, create three rank-ordered ability groups including the top 10 scores, the middle 10, and the lowest 10. To find out which of your 50 or so test items were most "powerful" in discriminating between high and low ability, eliminate the middle group, leaving two groups with results that might look something like this on a particular item:
Item #23                      # Correct    # Incorrect
High-ability Ss (top 10)          7             3
Low-ability Ss (bottom 10)        2             8
Using the ID formula (7 - 2 = 5, and 5 ÷ 10 = .50), you would find that this item has an ID of .50, or a moderate level. The formula for calculating ID is
ID = (high-group # correct - low-group # correct) ÷ (1/2 x total of your two comparison groups) = (7 - 2) ÷ (1/2 x 20) = 5 ÷ 10 = .50
The result of this example item tells you that the item has a moderate level of ID. High discriminating power would approach a perfect 1.0, and no discriminating power at all would be zero. In most cases, you would want to discard an item that scored near zero. As with IF, no absolute rule governs the establishment of acceptable and unacceptable ID indices.
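The same arithmetic can be checked in a few lines of code. This sketch (Python; the function and parameter names are illustrative) applies the ID formula to item #23 above.

def item_discrimination(high_correct, low_correct, group_size):
    # ID = (correct answers in the high group - correct answers in the low group)
    #      divided by half the total of the two comparison groups,
    #      which is simply the size of one group.
    return (high_correct - low_correct) / group_size

print(item_discrimination(high_correct=7, low_correct=2, group_size=10))  # 0.5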
One clear, practical use for ID indices is to select items from a test bank that includes more items than you need. You might decide to discard or improve some items with lower ID because you know they won't be as powerful an indicator of success on your test.
For most teachers who are using multiple-choice items to create a classroom-based unit test, juggling IF and ID indices is more a matter of intuition and "art" than of science. Your best calculated hunches may provide sufficient support for retaining, revising, or discarding proposed items. But if you are constructing a large-scale test, or one that will be administered multiple times, these indices are important factors in creating test forms that are comparable in difficulty. By engaging in a sophisticated procedure using what is called item response theory (IRT), professional test designers can produce test forms whose equated test scores are reliable measures of performance. (For more information on IRT, see Bachman, 1990, pp. 202-209.)
3.
Distractor efficiency is one more important measure of a multiple-choice item's value in a test, and one that is related to item discrimination. The efficiency of distractors is the extent to which (a) the distractors "lure" a sufficient number of test-takers, especially lower-ability ones, and (b) those responses are somewhat evenly distributed across all distractors. Those of you who have a fear of mathematical formulas will be happy to read that there is no formula for calculating distractor efficiency and that an inspection of the distribution of responses will usually yield the information you need. Consider the following. The same item (#23) used above is a multiple-choice item with five choices, and responses across upper- and lower-ability students are distributed as follows:
Choice                     A    B    C*   D    E
High-ability Ss (10)       0    1    7    0    2
Low-ability Ss (10)        3    5    2    0    0
*Note: C is the correct response.
No mathematical formula is needed to tell you that
this item successfully attracts seven of the ten high-ability students toward
the correct response, while only two of the low-ability students get this one
right. As shown above, its ID is .50, which is acceptable, but the item might be
improved in two ways: (a) Distractor D doesn’t
fool anyone. No one picked it, and therefore it probably has no utility. A
revision might provide a distractor that actually attracts a response or two.
(b) Distractor E attracts more
responses (2) from the high-ability group than the low-ability group (0). Why
are good students choosing this one? Perhaps it includes a subtle reference
that entices the high group but is “over the head” of the low group, and therefore
the latter students don’t even consider it.
The other two distractors (A and B) seem to be fulfilling their function of attracting some attention from lower-ability students.
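Since no formula is involved, a distractor analysis is just a tally of who chose what. The sketch below (Python; the lists of choices are a hypothetical encoding of the table for item #23) prints the distribution so that dead distractors like D and oddly attractive ones like E stand out.

from collections import Counter

def distractor_distribution(high_choices, low_choices, options="ABCDE"):
    # Tally how often each option was chosen by the high- and low-ability groups.
    high, low = Counter(high_choices), Counter(low_choices)
    for option in options:
        print(option, "high:", high[option], "low:", low[option])

high = ["B"] + ["C"] * 7 + ["E"] * 2      # choices made by the top 10 students
low = ["A"] * 3 + ["B"] * 5 + ["C"] * 2   # choices made by the bottom 10 students
distractor_distribution(high, low)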
SCORING,
GRADING, AND GIVING FEEDBACK
Scoring
As
you design a classroom test, you must consider how the test will be scored and
graded. Your scoring plan reflects the relative weight that you place on each
section and items in each section. The integrated-skills class that we have
been using as an example focuses on listening and speaking skills with some
attention to reading and writing. Three of your nine objectives target reading
and writing skills. How do you assign scoring to the various components of this
test?
Because
oral production is a driving force in your overall objectives, you decide to
place more weight on the speaking (oral interview) section than on the other
three sections. Five minutes is actually a long time to spend in a one-on-one
situation with a student, and some significant information can be extracted
from such a session. You therefore designate 40 percent of the grade to the
oral interview. You consider the listening and reading sections to be equally
important, but each of them, especially in this multiple-choice format, is of
less consequence than the oral interview. So you give each of them a 20 percent
weight. That leaves 20 percent for the writing section, which seems about right
to you given the time and focus on
writing in this unit of the course.
Your
next task is to assign scoring for each item. This may take a little numerical
common sense, but it doesn't require a degree in math. To make matters simple, you decide to have a 100-point test in which
▪ the listening and reading items are each
worth 2 points.
▪ the oral interview
will yield four scores ranging from 5 to 1, reflecting fluency, prosodic features, accuracy of the target grammatical objectives, and discourse
appropriateness. To weight these scores appropriately, you will double each
individual score and then add them together for a possible total score of 40.
▪ the writing sample
has two scores: one for grammar/mechanics (including the correct use of so and because) and one for overall effectiveness of the message, each
ranging from 5 to 1. Again, to achieve the correct weight for writing, you will
double each score and add them, so the possible total is 20 points.
Here are your decisions for your test:

Section            Percent of Total Grade    Possible Total Correct
Oral Interview     40%                        4 scores, 5 to 1 range x 2 = 40
Listening          20%                        10 items @ 2 points each = 20
Reading            20%                        10 items @ 2 points each = 20
Writing            20%                        2 scores, 5 to 1 range x 2 = 20
Total              100%                       100
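A quick way to sanity-check this weighting scheme is to compute a few totals. The sketch below (Python; the sample ratings and correct-item counts are invented for illustration) doubles the rating-scale scores and the per-item points exactly as described above.

def total_score(interview, listening_correct, reading_correct, writing):
    # interview: four 1-5 ratings, doubled             -> up to 40 points
    # listening/reading: correct items out of 10, x 2  -> up to 20 points each
    # writing: two 1-5 ratings, doubled                -> up to 20 points
    sections = {
        "oral interview": 2 * sum(interview),
        "listening": 2 * listening_correct,
        "reading": 2 * reading_correct,
        "writing": 2 * sum(writing),
    }
    return sections, sum(sections.values())

sections, total = total_score(interview=[4, 3, 4, 5], listening_correct=8,
                              reading_correct=9, writing=[4, 4])
print(sections, "total:", total)  # total: 82 out of 100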
At this point you may wonder if the interview
should carry less weight or the written essay more, but your intuition tells
you that these weights are plausible representations of the relative emphases in this unit of the course.
After
administering the test once, you may decide to shift some of these weights or
to make other changes. You will then have valuable information about how easy
or difficult the test was, about whether the time limit was reasonable, about
your students’ affective reaction to it, and about their general performance.
Finally, you will have an intuitive judgement about whether this test correctly
assessed your students. Take note of these impressions, however nonempirical
they may be, and use them for revising the test in another term.
Grading
Your first thought might be that assigning grades to student performance on this test would be easy: just give an "A" for 90-100 percent, a "B" for 80-90 percent, and so on. Not so fast! Grading is such a thorny issue that all of Chapter 11 is devoted to the topic. How you assign letter grades to this test is a product of
•
the country, culture, and context of this English classroom,
•
institutional expectations (most of them unwritten),
•
explicit and implicit definitions of
grades that you have set forth,
•
the relationship you have established with this class, and
• student expectations that have been engendered in previous tests and quizzes in this class.
For the time being,
then, we will set aside issues that deal with grading this test in particular,
in favor of the comprehensive treatment
of grading.
Giving Feedback
A
section on scoring and grading would not be complete without some consideration
of the forms in which you will offer feedback to your students, feedback that
you want to become beneficial washback. In the example test that we have been
referring to here, which is not unusual in the universe of possible formats for periodic classroom tests, consider the multitude of options. You might choose to return the test to the student with one of, or a combination of, any of the possibilities below:
1. a
letter grade
2. a
total score
3. four
subscores (speaking, listening, reading, writing)
4. for
the listening and reading sections
a. an
indication of correct/incorrect responses
b. marginal
comments
5. for
the oral interview
a. scores
for each element being rated
b. a
checklist of areas needing work
c. oral
feedback after the interview
d. a
post-interview conference to go over the results
6. on
the essay
a. scores
for each element being rated
b. a
checklist of areas needing work
c. marginal
and end-of-essay comments, suggestions
d. a
post-test conference to go over work
e. a
self-assessment
7. on
all or selected parts of the test, peer checking of results
8. a
whole-class discussion of the results of the test
9. individual
conferences with each student to review the whole test
Obviously, options 1 and 2 give virtually no feedback. They offer the student only a modest sense of where that student stands and a vague idea of overall performance, but the feedback they present does not become washback. Washback is achieved when students can, through the testing experience, identify their areas of success and challenge. When a test becomes a learning experience, it achieves washback.
Option 3 gives a student a chance to see the relative strength of each skill area and so becomes minimally useful. Options 4, 5, and 6 represent the kind of response a teacher can give (including stimulating a student self-assessment) that approaches maximum washback. Students are provided with individualized feedback that has good potential for "washing back" into their subsequent performance. Of course, time and the logistics of large classes may not permit 5d and 6d, which for many teachers may be going above and beyond expectations for a test like this. Likewise, option 9 may be impractical. Options 6 and 7, however, are clearly viable possibilities that solve some of the practicality issues that are so important in teachers' busy schedules.
CHAPTER II
CLOSING
A.
Summary
There are five kinds of test types: language aptitude tests, proficiency tests, placement tests, diagnostic tests, and achievement tests. Not every test you devise needs to be a wonderfully innovative instrument that garners the accolades of your colleagues and the admiration of your students; traditional techniques can, with a little creativity, conform to the spirit of an interactive, communicative curriculum.
There are some practical steps to test construction: assessing clear and unambiguous objectives, drawing up test specifications, devising test tasks, and designing multiple-choice test items.
Evaluation
can fulfill two functions: assessment and feedback. Assessment is a matter of
measuring what the learners already know. Any assessment should also provide
positive feedback to inform teachers and learners about what is still not
known, thus providing important input to the content and methods of future
work.
REFERENCE
Brown, H. Douglas. (2004). Language Assessment: Principles and Classroom Practices. New York: Pearson Education.
http://febbyeni.blogspot.co.id/2013/06/designing-classroom-language-test.html
COMPILED BY: GROUP 3
Lestari Yuli Prehatin
Reyzha Ramadhan Putra
Rievy Yuwandarie Ulga
Nita Kurniawati
Riki Pratama B.