
2. PRINCIPLES OF LANGUAGE ASSESSMENT


I. PRINCIPLES OF LANGUAGE ASSESSMENT
Practicality
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student five hours to complete is impractical: it consumes more time (and money) than necessary to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations. A test that can be scored only by a computer is impractical if the test takes place a thousand miles away from the nearest computer. The value and quality of a test sometimes hinge on such nitty-gritty, practical considerations.
            Here's a little horror story about practicality gone awry. An administrator of a six-week summertime short course needed to place the 50 or so students who had enrolled in the program. A quick search yielded a copy of an old English Placement Test from the University of Michigan. It had 20 listening items based on an audiotape and 80 items on grammar, vocabulary, and reading comprehension, all in multiple-choice format. A scoring grid accompanied the test. On the day of the test, the required number of test booklets had been secured, a proctor had been assigned to monitor the process, and the administrator and the proctor had planned to have the scoring completed by later that afternoon so students could begin classes the next day. Sounds simple, right? Wrong.
            The students arrived, test booklets were distributed, and directions were given. The proctor started the tape. Soon students began to look puzzled. By the time the tenth item played, everyone looked bewildered. Finally, the proctor checked a test booklet and was horrified to discover that the wrong tape was playing; it was a tape for another form of the same test! Now what? She decided to randomly select a short passage from a textbook that was in the room and give the students a dictation. The students responded reasonably well. The next 80 non-tape-based items proceeded without incident, and the students handed in their score sheets and dictation papers.
            When the red-faced administrator and the proctor got together later to score the tests, they faced the problem of how to score the dictation, a more subjective process than some other forms of assessment. After a lengthy exchange, the two established a point system, but after the first few papers had been scored, it was clear that the point system needed revision. That meant going back to the first papers to make sure the new system was followed.
            The two faculty members had barely begun to score the 80 multiple-choice items when students began returning to the office to receive their placements. Students were told to come back the next morning for the results. Later that evening, having combined the dictation scores and the 80 multiple-choice scores, the two frustrated examiners finally arrived at placements for all the students. It's easy to see what went wrong here. While the listening comprehension section of the test was apparently highly practical, the administrator had failed to check the materials ahead of time. Then the two examiners established a scoring procedure that did not fit into their time constraints. In classroom-based testing, time is almost always a crucial practicality factor for busy teachers with few hours to spare in the day.
Reliability
A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of reliability of the test may best be addressed by considering a number of factors that may contribute to the unreliability of a test. Consider the following possibilities: fluctuations in the student, in scoring, in test administration, and in the test itself.
Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day", anxiety, and other physical or psychological factors, which may make an "observed" score deviate from one's true score. Also included in this category are such factors as a test-taker's "test-wiseness" or strategies for efficient test taking.
Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores of the same test, possibly because of lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the placement test, the initial scoring plan for the dictation was found to be unreliable; that is, the two scorers were not applying the same standards.
            Rater reliability issues are not limited to situations in which two or more scorers are involved. Intra-rater unreliability is a common occurrence for classroom teachers because of unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness. When I am faced with up to 40 tests to grade in only a week, I know that the standards I apply, however subliminally, to the first few tests will be different from those I apply to the last few. I may be "easier" or "harder" on those first few papers, or I may get tired, and the result may be an inconsistent evaluation across all tests. One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests to ensure an even-handed judgment. In tests of writing skills, rater reliability is particularly hard to achieve since writing proficiency involves numerous traits that are difficult to define. The careful specification of an analytical scoring instrument, however, can increase rater reliability (J. D. Brown, 1991).
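Where scores are numerical, one informal way to check this kind of consistency is to re-score a sample of papers after the first pass and correlate the two sets of scores. The sketch below is only an illustration of that arithmetic with hypothetical scores; it is not part of the original discussion and not a substitute for clearly specified scoring criteria.

```python
# Minimal sketch: estimate intra-rater consistency by correlating a first-pass
# score with a later re-read of the same papers. All scores are hypothetical.

first_pass  = [78, 85, 62, 90, 71, 66, 88, 74]   # scores given on the first read
second_pass = [75, 86, 70, 89, 78, 64, 85, 80]   # same papers, re-read later

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

r = pearson(first_pass, second_pass)
print(f"consistency between the two passes: r = {r:.2f}")
# A noticeably low r suggests the standards drifted and the set should be re-read.
```

The same computation, applied to two raters' scores on the same papers, gives an equally rough picture of inter-rater consistency.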
Test Administration Reliability
            Unreliability may also result from the conditions in which the test is administered. I once witnessed the administration of a test of aural comprehension in which a tape recorder played items for comprehension, but because of street noise outside the building, students sitting next to the windows could not hear the tape accurately. This was a clear case of unreliability caused by the conditions of the test administration. Other sources of unreliability are found in photocopying variations, the amount of light in different parts of the room, variations in temperature, and even the condition of desks and chairs.
Test Reliability
            Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well under a time limit. We all know people (and you may be included in this category) who "know" the course material perfectly but who are adversely affected by the presence of a clock ticking away. Poorly written test items (that are ambiguous or that have more than one correct answer) may be a further source of test unreliability.
Validity
            By far the most complex criterion of an effective test, and arguably the most important principle, is validity, "the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment" (Gronlund, 1998, p. 226). A valid test of reading ability actually measures reading ability, not 20/20 vision, nor previous knowledge of a subject, nor some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable). But it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas, among other factors.
How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. In some cases, it may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested. In other cases, we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence. Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity. We will look at these five types of evidence below.
Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring. A test of tennis competency that asks someone to run a 100-yard dash obviously lacks content validity. If you are trying to assess a person's ability to speak a second language in a conversational setting, asking the learner to answer paper-and-pencil multiple-choice questions requiring grammatical judgments does not achieve content validity. A test that requires the learner actually to speak within some sort of authentic context does. And if a course has perhaps ten objectives but only two are covered in a test, then content validity suffers.
            Consider the following quiz on English articles for a high-beginner level conversation class (listening and speaking) for English learners.

Directions: The purpose of this quiz is for you and me to find out how well you can apply the rules of article usage. Read the following passage and write a/an, the, or Ø (no article) in each blank.
Last night, I had (1) ______ very strange dream. Actually, it was (2) ______ nightmare! You know how much I love (3) ______ zoos. Well, I dreamed that I went to (4) ______ San Francisco Zoo with (5) ______ few friends. When we got there, it was very dark, but (6) ______ moon was out, so we weren't afraid. I wanted to see (7) ______ monkeys first, so we walked past (8) ______ merry-go-round and (9) ______ lion's cages to (10) ______ monkey section.

(The story continues, with a total of 25 blanks to fill)

English article quiz
The students had a unit on zoo animals and had engaged in some open discussions and group work in which they had practiced articles, all in listening and speaking modes of performance. In that the quiz uses a familiar setting and focuses on previously practiced language forms, it is somewhat content valid. The fact that it was administered in written form, however, and required students to read the passage and write their responses makes it quite low in content validity for a listening/speaking class.
There are a few cases of highly specialized and sophisticated testing instruments that may have questionable content-related evidence of validity. It is possible to contend, for example, that standard language proficiency tests, with their context-reduced, academically oriented language and limited stretches of discourse, lack content validity since they do not require the full spectrum of communicative performance on the part of the learner (see Bachman, 1990, for a full discussion). There is good reasoning behind such criticism; nevertheless, what such proficiency tests lack in content-related evidence they may gain in other forms of evidence, not to mention practicality and reliability.
            Another way of understanding content validity is to consider the difference between direct and indirect testing. Direct testing involves the test-taker in actually performing the target task.
Criterion-Related Evidence
A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the "criterion" of the test has actually been reached. You will recall that in Chapter 1 it was noted that most classroom-based assessment with teacher-designed tests fits the concept of criterion-referenced assessment. In such tests, specified classroom objectives are measured, and implied predetermined levels of performance are expected to be reached (80 percent is considered a minimal passing grade).
Construct-Related Evidence
A third kind of evidence that can support validity, but one that does not play as large a role for classroom teachers, is construct-related validity, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions. Constructs may or may not be directly or empirically measured; their verification often requires inferential data.
Consequential Validity
As well as the above three widely accepted forms of evidence that may be introduced to support the validity of an assessment, two other categories may be of some interest and utility in your own quest for validating classroom tests. Messick (1989), Gronlund (1998), McNamara (2000), and Brindley (2001), among others, underscore the potential importance of the consequences of using an assessment. Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test's interpretation and use.

Face Validity
An important facet of consequential validity is the extent to which "students view the assessment as fair, relevant, and useful for improving learning" (Gronlund, 1998, p. 210), or what is popularly known as face validity. "Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers" (Mousavi, 2002, p. 244).
Authenticity
A fourth major principle of language testing is authenticity, a concept that is a little slippery to define, especially within the art and science of evaluating and designing tests. Bachman and Palmer (1996, p. 23) define authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task,” and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.
Washback
A facet of consequential validity, discussed above, is "the effect of testing on teaching and learning" (Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-scale assessment, washback generally refers to the effects a test has on instruction in terms of how students prepare for the test.
"Cram" courses and "teaching to the test" are examples of such washback. Another form of washback that occurs more in classroom assessment is the information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Informal performance assessment is by nature more likely to have built-in washback effects because the teacher is usually providing interactive feedback. Formal tests can also have positive washback, but they provide no washback if the students receive a simple letter grade or a single overall numerical score.
            The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students' incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student's interlanguage. Teachers can suggest strategies for success as part of their "coaching" role. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
            One way to enhance washback is to comment generously and specifically on test performance. Many overworked teachers return tests to students with a single letter grade or numerical score and consider their job done. In reality, letter grades and numerical scores give absolutely no information of intrinsic interest to the student. Grades and scores reduce a mountain of linguistic and cognitive performance data to an absurd molehill. At best, they give a relative indication of a formulaic judgment of performance as compared to others in the class, which fosters competitive, not cooperative, learning.
            With this in mind, when you return a written test or a data sheet from an oral production test, consider giving more than a number, grade, or phrase as your feedback. Even if your evaluation is not a neat little paragraph appended to the test, you can respond to as many details throughout the test as time will permit. Give praise for strengths (the "good stuff") as well as constructive criticism of weaknesses. Give strategic hints on how a student might improve certain elements of performance. In other words, take some time to make the test performance an intrinsically motivating experience from which a student will gain a sense of accomplishment and challenge.
            A little bit of washback may also help students through a specification of the numerical scores on the various subsections of the test. A subsection on verb tenses, for example, that yields a relatively low score may serve the diagnostic purpose of showing the student an area of challenge.
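The idea of reporting subsection scores can be made concrete with a small, purely illustrative sketch; the section names and item results below are hypothetical and not taken from any actual test.

```python
# Minimal sketch: a per-section breakdown turns one opaque total into
# diagnostic washback. Section names and item results are invented.

results = {
    "verb tenses":     [1, 0, 0, 1, 0],        # 1 = correct, 0 = incorrect
    "articles":        [1, 1, 1, 0, 1],
    "listening items": [1, 1, 0, 1, 1, 1],
}

total_right = total_items = 0
for section, items in results.items():
    right = sum(items)
    total_right += right
    total_items += len(items)
    print(f"{section:16s} {right}/{len(items)}  ({100 * right / len(items):.0f}%)")

print(f"{'overall':16s} {total_right}/{total_items}  ({100 * total_right / total_items:.0f}%)")
# The weak "verb tenses" line points the student toward a specific area to
# work on, which a single overall score would hide.
```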
            Another viewpoint on washback is achieved by a quick consideration of the differences between formative and summative tests. Formative tests, by definition, provide washback in the form of information to the learner on progress toward goals. But teachers might be tempted to feel that summative tests, which provide assessment at the end of a course or program, do not need to offer much in the way of washback. Such an attitude is unfortunate because the end of every language course or program is always the beginning of further pursuits, more learning, more goals, and more challenges to face. Even a final examination in a course should carry with it some means of giving washback to students.
            In my courses I never give a final examination during the last scheduled class session. I administer the final exam early enough that I can return it to students during the last class. At that time, the students receive scores, grades, and comments on their work, and I spend some of the class session addressing material on which the students were not completely clear. My summative assessment is thereby enhanced by some beneficial washback that is usually not expected of final examinations.
            Finally, washback also implies that students have ready access to you to discuss the feedback and evaluation you have given. While you almost certainly have known teachers with whom you wouldn't dare argue about a grade, an interactive, cooperative, collaborative classroom can nevertheless promote an atmosphere of dialogue between students and teachers regarding evaluative judgments. For learning to continue, students need to have a chance to feed back on your feedback, to seek clarification of any issues that are fuzzy, and to set new and appropriate goals for themselves for the days and weeks ahead.

II. APPLYING PRINCIPLES TO THE EVALUATION OF CLASSROOM TESTS
            The five principles of practicality, reliability, validity, authenticity, and washback go a long way toward providing useful guidelines for both evaluating an existing assessment procedure and designing one on your own. Quizzes, tests, exams, and standardized proficiency tests can be scrutinized through these five lenses.
            Are there other principles that should be invoked in evaluating and designing assessments? The answer, of course, is yes. Language assessment is an extraordinarily broad discipline with many branches, interest areas, and issues. The process of designing effective assessment instruments is far too complex to be reduced to five principles. Good test construction, for example, is governed by research-based rules of test preparation, sampling of tasks, item design and construction, scoring responses, ethical standards, and so on. But the five principles cited here serve as an excellent foundation on which to evaluate existing instruments and to build your own.
            The questions that follow here, indexed by the five principles, will help you evaluate existing tests for your own classroom. It is important for you to remember, however, that the sequence of these questions does not imply a priority order. Validity, for example, is certainly the most significant cardinal principle of assessment evaluation. Practicality may be a secondary issue in classroom testing. Or, for a particular test, you may need to place authenticity as your primary consideration. When all is said and done, however, if validity is not substantiated, all other considerations may be rendered useless.
1.   Are the test procedures practical?
Practicality is determined by the teacher's (and the students') time constraints, costs, and administrative details, and to some extent by what occurs before and after the test. To determine whether a test is practical for your needs, you may want to use the checklist below.
Practicality checklist
1. Are administrative details clearly established before the test?
2. Can students complete the test reasonably within the set time frame?
3. Can the test be administered smoothly, without procedural "glitches"?
4. Are all materials and equipment ready?
5. Is the cost of the test within budgeted limits?
6. Is the scoring/evaluation system feasible in the teacher's time frame?
7. Are methods for reporting results determined in advance?

            As this checklist suggests, after you account for the administrative details of giving a test, you need to think about the practicality of your plans for scoring the test. In teachers' busy lives, time often emerges as the most important factor, one that overrides other considerations in evaluating an assessment. If you need to tailor a test to fit your own time frame, as teachers frequently do, you need to accomplish this without damaging the test's validity and washback. Teachers should, for example, avoid the temptation to offer only quickly scored multiple-choice selection items that may be neither appropriate nor well designed. Everyone knows teachers secretly hate to grade tests (almost as much as students hate to take them!) and will do almost anything to get through that task as quickly and effortlessly as possible. Yet good teaching almost always implies an investment of the teacher's time in giving feedback (comments and suggestions) to students on their tests.
2.   Is the test reliable?
           Reliability applies to both the test and the teacher, and at least four sources of unreliability must be guarded against, as noted in the earlier section of this chapter. Test and test administration reliability can be achieved by making sure that all students receive the same quality of input, whether written or auditory. Part of achieving test reliability depends on the physical context: making sure, for example, that
·         Every student has a cleanly photocopied test sheet,
·         Sound amplification is clearly audible to everyone in the room,
·         Video input is equally visible to all,
·         Lighting, temperature, extraneous noise, and other classroom conditions are equal (and optimal) for all students, and
·         Objective scoring procedures leave little debate about correctness of an answer.
            Rater reliability, another common issue in assessment, may be more difficult to achieve, perhaps because we too often overlook it as an issue. Since classroom tests rarely involve two scorers, inter-rater reliability is seldom an issue. Instead, intra-rater reliability is of constant concern to teachers: what happens to our fallible concentration and stamina over the period of time during which we are evaluating a test? Teachers need to find ways to maintain their concentration and stamina over the time it takes to score assessments. In open-ended response tests, this issue is of paramount importance. It is easy to let mentally established standards erode over the hours required to evaluate the tests.
            Intra-rater reliability for open-ended responses may be enhanced by the following guidelines:
·         Use consistent sets of criteria for a correct response.
·         Give uniform attention to those sets throughout the evaluation time.
·         Read through tests at least twice to check for your consistency.
·         If you have made “mid-stream” modifications of what you consider as a correct response, go back and apply the same standards to all.
·         Avoid fatigue by reading the tests in several sittings, especially if the time requirement is a matter of several hours.
3.      Does the procedure demonstrate content validity?
            The major source of validity in a classroom test is content validity: the extent to which the assessment requires students to perform tasks that were included in the previous classroom lesson and that directly represent the objective of the unit on which the assessment is based. If you have been teaching an English language class to fifth graders who have been reading, summarizing, and responding to short passages, and if your assessment is based on this work, then to be content valid, the test needs to include performance in those skills.
            There are two steps to evaluating the content validity of a classroom test.
1.   Are classroom objectives identified and appropriately framed? Underlying every good classroom test are the objectives of the lesson, module, or unit of the course in question. So the first measure of an effective classroom test is the identification of objectives. Sometimes this is easier said than done. Too often teachers work through lessons day after day with little or no cognizance of the objectives they seek to fulfill. Or perhaps those objectives are so poorly framed that determining whether or not they were accomplished is impossible. Consider the following objectives for lessons, all of which appeared on lesson plans designed by students in teacher preparation programs:
a.       Students should be able to demonstrate some reading comprehension.
b.      To practice vocabulary in context.
c.       Students will have fun through a relaxed activity and thus enjoy their learning.
d.      To give students a drill on the /i/ - /ɪ/ contrast.
e.       Students will produce yes/no questions with final rising intonation.

            Only the last objective is framed in a form that lends itself to assessment. In (a), the modal should is ambiguous and the expected performance is not stated. In (b), everyone can fulfill the act of "practicing"; no standards are stated or implied. For obvious reasons, (c) cannot be assessed. And (d) is really just a teacher's note on the type of activity to be used.
            Objective (e), on the other hand, includes a performance verb and a specific linguistic target. By specifying acceptable and unacceptable levels of performance, the goal can be tested. An appropriate test would elicit an adequate number of samples of student performance, have a clearly framed set of standards for evaluating the performance, and provide some sort of feedback to the student.
2.      Are lesson objectives represented in the form of test specifications? The next content-validity issue that can be applied to a classroom test centers on the concept of test specifications. Don't let this word scare you. It simply means that a test should have a structure that follows logically from the lesson or unit you are testing. Many tests have a design that
·         Divides them into a number of sections (corresponding, perhaps, to the objectives that are being assessed),
·         Offers students a variety of item types, and
·         Gives an appropriate relative weight to each section.
            Some tests, of course, do not lend themselves to this kind of structure. A test in a course in academic writing at the university level might justifiably consist of an in-class written essay on a given topic: only one "item" and one response, in a manner of speaking. But in this case the specs (specifications) would be embedded in the prompt itself and in the scoring or evaluation rubric used to grade it and give feedback. We will return to the concept of test specs in the next chapter.
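To make the notion of specifications a little more concrete, here is a small, purely illustrative sketch of specs written out as sections, item types, and relative weights, with a weighted total computed from per-section results. All names and numbers are hypothetical; this is not a prescribed format.

```python
# Minimal sketch: test specifications as explicit sections, item types, and
# relative weights, plus a weighted total. Everything here is invented.

specs = [
    # (section / objective,          item type,           items, weight)
    ("yes/no questions, intonation", "oral elicitation",      5, 0.40),
    ("zoo-unit vocabulary",          "multiple choice",      10, 0.30),
    ("articles in context",          "fill-in-the-blank",    10, 0.30),
]

assert abs(sum(w for *_, w in specs) - 1.0) < 1e-9, "weights should sum to 1"

# Hypothetical per-section percentage scores for one student:
section_pct = {
    "yes/no questions, intonation": 80,
    "zoo-unit vocabulary": 70,
    "articles in context": 90,
}

final = sum(w * section_pct[name] for name, _, _, w in specs)
print(f"weighted final score: {final:.1f}%")   # 0.4*80 + 0.3*70 + 0.3*90 = 80.0
```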
            The content validity of an existing classroom test should be apparent in how the objectives of the unit being tested are represented in the form of the content of items, clusters of items, and item types. Do you clearly perceive the performance of test-takers as reflective of the classroom objectives? If so, and you can argue this, content validity has probably been achieved.
4.      Is the Procedure Face Valid and "Biased for Best"?
This question integrates the concept of face validity with the importance of structuring an assessment procedure to elicit the optimal performance of the student. Students will generally judge a test to be face valid if
·         Directions are clear,
·         The structure of the test is organized logically,
·         Its difficulty level is appropriately pitched,
·         The test has no "surprises," and
·         Timing is appropriate.
A phrase that has come to be associated with face validity is "biased for best," a term that goes a little beyond how the student views the test to a degree of strategic involvement on the part of student and teacher in preparing for, setting up, and following up on the test itself. According to Swain (1984), to give an assessment procedure that is "biased for best," a teacher
·         Offers students appropriate review and preparation for the test,
·         Suggests strategies that will be beneficial, and
·         Structures the test so that the best students will be modestly challenged and the weaker students will not be overwhelmed.
It's easy for teachers to forget how challenging some tests can be, and so a well-planned testing experience will include some strategic suggestions on how students might optimize their performance. In evaluating a classroom test, consider the extent to which before-, during-, and after-test options are fulfilled.

Test Taking Strategies
Before the Test
1.      Give students all the information you can about the test: Exactly what will the test cover? Which topics will be the most important? What kind of items will be on it? How long will it be?
2.      Encourage students to do a systematic review of material. For example, they should skim the textbook and other material, outline major points, and write down examples.
3.      Give them practice tests or exercises, if available.
4.      Facilitate formation of a study group, if possible.
5.      Caution students to get a good night’s rest before the test.
6.      Remind students to get to the classroom early.
During the Test
1.      After the test is distributed, tell students to look over the whole test quickly in order to get a good grasp of its different parts.
2.      Remind them to mentally figure out how much time they will need for each part.
3.      Advise them to concentrate as carefully as possible.
4.      Warn students a few minutes before the end of the class period so that they can finish on time, proofread their answers, and catch careless errors.
After the Test
1.      When you return the test, include feedback on specific things the student did well, what he or she did not do well, and, if possible, the reasons for your comments.
2.      Advise students to pay careful attention in class to whatever you say about the test results.
3.      Encourage questions from students.
4.      Advise students to pay special attention in the future to points on which they are weak.
Keep in mind that what comes before and after the test also contributes to its face validity. Good class preparation will give students a comfort level with the test, and good feedback and washback will allow them to learn from it.

5.      Are the Test Tasks as Authentic as Possible?
Evaluate the extent to which a test is authentic by asking the following questions:
·         Is the language in the test as natural as possible?
·         Are items as contextualized as possible rather than isolated?
·         Are topics and situations interesting, enjoyable, and/or humorous?
·         Is some thematic organization provided, such as through a story line or episode?
·         Do tasks represent, or closely approximate, real-world tasks?
Consider the following two excerpts from tests, and the concept of authenticity may become a little clearer.
Multiple-Choice Tasks: Contextualized
“Going To”
1.      What ________ this weekend?
a.       You are going to do
b.      Are you going to do
c.       You going to do
2.      I'm not sure. _________ anything special?
a.       Are you going to do
b.      You are going to do
c.       Is going to do
3.      My friend Melissa and I _______ a party. Would you like to come?
a.       Am going to
b.      Are going to go to
c.       Go to

4.      I’d love to!
a.       What’s it going to be?
b.      Who’s going to be?
c.       Where’s it going to be?

5.      It is __________ to be at Ruth’s house.
a.       go
b.      going
c.       going to
Multiple-Choice Tasks: Decontextualized
1.      There are three countries I would like to visit. One is Italy.
a.       The other is New Zealand and other is Nepal
b.      The others are New Zealand and Nepal.
c.       Others are New Zealand and Nepal.
2.      When I was twelve years old, I used _______ every day.
a.       swimming
b.      to swim
c.       swam
3.      When Mr. Brown designs a website, he always creates it _________
a.       Artistically
b.      Artistic
c.       Artist
4.      Since the beginning of the year, I ________ at Millennium Industries.
a.       am working
b.      had been working
c.       have been working
5.      When Mona broke her leg, she asked her husband ________ her to work.
a.       to drive
b.      driving
c.       drive
The sequence of items in the contextualized tasks achieves a modicum of authenticity by contextualizing all the items in a story line. The conversation is one that might occur in the real world, even if with a little less formality. The sequence of items in the decontextualized tasks takes the test-taker into five different topic areas with no context for any of them. Each sentence is likely to be written or spoken in the real world, but not in that sequence. Given the constraints of a multiple-choice format, on a measure of authenticity I would say the first excerpt is "good" and the second excerpt is only "fair."

6.      Does the Test Offer Beneficial Washback to the Learner?
The design of an effective test should point the way to beneficial washback. A test that achieves content validity demonstrates relevance to the curriculum in question and thereby sets the stage for washback. When test items represent the various objectives of a unit, and/or when sections of a test clearly focus on major topics of the unit, classroom tests can serve in a diagnostic capacity even if they aren't specifically labeled as such.
Other evidence of washback may be less visible from an examination of the test itself. Here again, what happens before and after the test is critical. Preparation time before the test can contribute to washback since the learner is reviewing and focusing in a potentially broader way on the objectives in question. By spending classroom time after the test reviewing the content, students discover their areas of strength and weakness. Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort. The key is to play down the "Whew, I'm glad that's over" feeling that students are likely to have, and to play up the learning that can now take place from their knowledge of the results.
Some of the "alternatives" in assessment referred to in Chapter 1 may also enhance washback from tests. Self-assessment may sometimes be an appropriate way to challenge students to discover their own mistakes. This can be particularly effective for writing performance: once the pressure of assessment has come and gone, students may be able to look back on their written work with a fresh eye, which is preferable to simply listening to the teacher tell everyone what they got right and wrong and why. Journal writing may offer students a specific place to record their feelings, what they learned, and their resolutions for future effort.
The five basic principles of language assessment were expanded here into six essential questions you might ask yourself about an assessment. As you use the principles and the guidelines to evaluate various forms of tests and procedures, be sure to allow each one of the five to take on greater or lesser importance, depending on the context. In large-scale standardized testing, for example, practicality is usually more important than washback, but the reverse may be true of a number of classroom tests. Validity is of course always the final arbiter. And remember, too, that these principles, important as they are, are not the only considerations in evaluating or making an effective test. Leave some space for other factors to enter in.
In the next chapter, the focus is on how to design a test. The same five principles underlie test construction as well as test evaluation, along with some new facets that will expand your ability to apply principles to the practicalities of language assessment in your own classroom.



By: 2nd group

Catrine Mei Windri
Cindy Aprilia
Santi Novitasari
Novri Karyati
Junitrin



1. TESTING, ASSESSING, AND TEACHING




A.    WHAT IS A TEST?
  A test, in simple terms, is a method of measuring a person's ability, knowledge, or performance in a given domain. Let's look at the components of this definition. A test is, first, a method. It is an instrument, a set of techniques, procedures, or items, that requires performance on the part of the test-taker. To qualify as a test, the method must be explicit and structured: multiple-choice questions with prescribed correct answers; a writing prompt with a scoring rubric; an oral interview based on a question script and a checklist of expected responses to be filled in by the administrator.
Second, a test measures. Some tests measure general ability, while others focus on very specific competencies or objectives. A multi-skill proficiency test determines a general ability level; a quiz on recognizing correct use of definite articles measures specific knowledge. The way the results or measurements are communicated may vary. Some tests, such as classroom-based short-answer essay tests, may earn the test-taker a letter grade accompanied by the instructor's marginal comments.
Next, a test measures an individual's ability, knowledge, or performance. Testers need to understand who the test-takers are. What is their previous experience and background? Is the test appropriately matched to their abilities? How should test-takers interpret their scores?
A test measures performance, but the results imply the test-taker's ability, or, to use a concept common in the field of linguistics, competence. Most language tests measure one's ability to perform language, that is, to speak, write, read, or listen to a subset of language. On the other hand, it is not uncommon to find tests designed to tap knowledge about language: defining a vocabulary item, reciting a grammatical rule, or identifying a rhetorical feature in written discourse. Performance-based tests sample the test-taker's actual use of language, but from those samples the test administrator infers general competence. A test of reading comprehension, for example, may consist of several short reading passages with questions, only a small sample of total reading behavior. But from the results of that test, the examiner may infer a certain level of general reading ability. Finally, a test measures a given domain. In the case of a proficiency test, even though the actual performance on the test involves only a sampling of skills, the domain is overall proficiency in a language: general competence in all skills of a language.

B.     ASSESSMENT AND TEACHING
   Assessment is a popular and sometimes misunderstood term in current educational practice. You might be tempted to think of testing and assessing as synonymous terms, but they are not. Tests are prepared administrative procedures that occur at identifiable times in a curriculum when learners muster all their faculties to offer peak performance, knowing that their responses are being measured and evaluated.
Tests, then, are a subset of assessment; they are certainly not the only form of assessment that a teacher can make. Tests can be useful devices, but they are only one among many procedures and tasks that teachers can ultimately use to assess students.
But now, you might be thinking, if you make assessments every time you teach something in the classroom, does all teaching involve assessment? Are teachers constantly assessing students, with no interaction that is assessment-free?
The answer depends on your perspective. For optimal learning to take place, students in the classroom must have the freedom to experiment, to try out their own hypotheses about language without feeling that their overall competence is being judged in terms of those trials and errors. Teaching sets up the practice games of language learning: the opportunities for learners to listen, think, take risks, set goals, and process feedback from the "coach" and then recycle through the skills that they are trying to master. (A diagram of the relationship among testing, teaching, and assessment is found in Figure 1.1.)
At the same time, during these practice activities, teachers (and tennis coaches) are indeed observing students' performance and making evaluations: which aspects of that performance were better than others, and how does each learner compare to others in the same learning community? In the ideal classroom, all these observations feed into the way the teacher provides instruction to each student.

  1. Informal and Formal Assessment
   One way to begin untangling the lexical conundrum created by distinguishing among tests, assessment, and teaching is to distinguish between informal and formal assessment. Informal assessment can take a number of forms, starting with incidental, unplanned comments and responses, along with coaching and other impromptu feedback to the student. Examples include saying "Nice job!" or "Good work!", asking "Did you say can or cannot?", commenting "I think you meant to say you broke the glass, not you break the glass," or putting a smiley face on some homework. Informal assessment does not stop there. A good deal of a teacher's informal assessment is embedded in classroom tasks designed to elicit performance without recording results and making fixed judgments about a student's competence.
Examples at this end of the continuum are marginal comments on papers, responding to a draft of an essay, advice about how to better pronounce a word, a suggestion for a strategy for compensating for a reading difficulty, and showing how to modify a student's note-taking to better remember the content of a lecture.
On the other hand, formal assessments are exercises or procedures specifically designed to tap into a storehouse of skills and knowledge. They are systematic, planned sampling techniques constructed to give teacher and student an appraisal of student achievement. Is formal assessment the same as a test? We can say that all tests are formal assessments, but not all formal assessment is testing. For example, you might use a student's journal or portfolio of materials as a formal assessment of the attainment of certain course objectives, but it is problematic to call those two procedures "tests." Tests are usually relatively time-constrained (usually spanning a class period or at most several hours) and draw on a limited sample of behavior.

  2. Formative and Summative Assessment
Another useful distinction to bear in mind is the function of an assessment: How is the procedure to be used? Two functions are commonly identified in the literature: formative and summative assessment. Most of our classroom assessment is formative assessment: evaluating students in the process of “forming” their competencies and skills with the goal of helping them to continue that growth process. The key to such formation is the delivery (by the teacher) and internalization (by the student) of appropriate feedback on performance, with an eye toward the future continuation (or formation) of learning.
For all practical purposes, virtually all kinds of informal assessment are (or should be) formative. They have as their primary focus the ongoing development of the learner’s language. So when you give a student a comment or a suggestion, or call attention to an error, that feedback is offered in order to improve the learner’s language ability.
Summative assessment aims to measure, or summarize, what a student has grasped, and typically occurs at the end of a course or unit of instruction. A summation of what a student has learned implies looking back and taking stock of how well that student has accomplished objectives, but does not necessarily point the way to future progress. Final exams in a course and general proficiency exams are examples of summative assessment.
One of the problems with prevailing attitudes toward testing is the view that all tests (quizzes, periodic review tests, midterm exams, etc.) are summative. At various points in your past educational experiences, no doubt you've considered such tests as summative. You may have thought, "Whew! I'm glad that's over. Now I don't have to remember that material anymore!" A challenge to you as a teacher is to change that attitude among your students: Can you instill a more formative quality to what your students might otherwise view as a summative test? Can you offer your students an opportunity to convert tests into "learning experiences"? We will take up that challenge in subsequent chapters in this book.

  3. Norm-Referenced and Criterion-Referenced Tests
Another dichotomy that is important to clarify here, and that aids in sorting out common terminology in assessment, is the distinction between norm-referenced and criterion-referenced testing. In norm-referenced tests, each test-taker's score is interpreted in relation to a mean (average score), median (middle score), standard deviation (extent of variance in scores), and/or percentile rank. The purpose of such tests is to place test-takers along a mathematical continuum in rank order. Scores are usually reported back to the test-taker in the form of a numerical score (for example, 230 out of 300) and a percentile rank (such as 84 percent, which means that the test-taker's score was higher than that of 84 percent of the total number of test-takers, but lower than that of 16 percent in that administration). Typical of norm-referenced tests are standardized tests like the Scholastic Aptitude Test (SAT) or the Test of English as a Foreign Language (TOEFL), intended to be administered to large audiences, with results efficiently disseminated to test-takers. Such tests must have fixed, predetermined responses in a format that can be scored quickly at minimum expense. Money and efficiency are primary concerns in these tests.
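To make those statistical terms concrete, here is a small, purely illustrative sketch with hypothetical scores showing how a mean, median, standard deviation, and a simple percentile rank could be computed; it is not how any particular testing program actually reports results.

```python
# Minimal sketch of the statistics behind a norm-referenced score report.
# The scores and the definition of percentile rank used here are illustrative.
import statistics

scores = [230, 212, 258, 199, 244, 225, 267, 218, 236, 205]
my_score = 244

mean   = statistics.mean(scores)
median = statistics.median(scores)
sd     = statistics.stdev(scores)          # sample standard deviation

# One common definition of percentile rank: share of test-takers scoring below.
below = sum(1 for s in scores if s < my_score)
percentile = 100 * below / len(scores)

print(f"mean = {mean:.1f}, median = {median}, standard deviation = {sd:.1f}")
print(f"a score of {my_score} falls at roughly the {percentile:.0f}th percentile")
```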
Criterion-referenced tests, on the other hand, are designed to give test-takers feedback, usually in the form of grades, on specific course or lesson objectives. Classroom tests involving the students in only one class, and connected to a curriculum, are typical of criterion-referenced testing. Here, much time and effort on the part of the teacher (test administrator) are sometimes required in order to deliver useful, appropriate feedback to students, or what Oller (1979, p. 52) called "instructional value." In a criterion-referenced test, the distribution of students' scores across a continuum may be of little concern as long as the instrument assesses appropriate objectives. In Language Assessment, with an audience of classroom language teachers and teachers in training, and with its emphasis on classroom-based assessment (as opposed to standardized, large-scale testing), criterion-referenced testing is of more prominent interest than norm-referenced testing.

C.    APPROACHES TO LANGUAGE TESTING: A BRIEF HISTORY
Now that you have a reasonably clear grasp of some common assessment terms, we turn to one of the primary concerns of this book: the creation and use of tests, particularly classroom tests. A brief history of language testing over the past half-century will serve as a backdrop to an understanding of classroom-based testing.
Historically, language-testing trends and practices have followed the shifting sands of teaching methodology (for a description of these trends, see Brown, Teaching by Principles [hereinafter TBP], Chapter 2). For example, in the 1950s, an era of behaviorism and special attention to contrastive analysis, testing focused on specific language elements such as the phonological, grammatical, and lexical contrasts between two languages. In the 1970s and 1980s, communicative theories of language brought with them a more integrative view of testing in which specialists claimed that "the whole of the communicative event was considerably greater than the sum of its linguistic elements" (Clark, 1983, p. 432). Today, test designers are still challenged in their quest for more authentic, valid instruments that simulate real-world interaction.

a.      Discrete-Point and Integrative Testing
This historical perspective underscores two major approaches to language testing that were debated in the 1970s and early 1980s. These approaches still prevail today, even if in mutated form: the choice between discrete-point and integrative testing methods (Oller, 1979). Discrete-point tests are constructed on the assumption that language can be broken down into its component parts and that those parts can be tested successfully. These components are the skills of listening, speaking, reading, and writing, and the various units of language (discrete points) of phonology/graphology, morphology, lexicon, syntax, and discourse. It was claimed that an overall language proficiency test, then, should sample all four skills and as many linguistic discrete points as possible.
Such an approach demanded a decontextualization that often confused the test-taker. So, as the profession emerged into an era of emphasizing communication, authenticity, and context, new approaches were sought. Oller (1979) argued that language competence is a unified set of interacting abilities that cannot be tested separately. His claim was that communicative competence is so global and requires such integration (hence the term "integrative" testing) that it cannot be captured in additive tests of grammar, reading, vocabulary, and other discrete points of language. Others (among them Cziko, 1982, and Savignon, 1982) soon followed in their support for integrative testing.
What does an integrative test look like? Two types of tests have historically been claimed to be examples of integrative tests: cloze tests and dictations. A cloze test is a reading passage (perhaps 150 to 300 words) in which roughly every sixth or seventh word has been deleted; the test-taker is required to supply words that fit into those blanks. Oller (1979) claimed that cloze test results are good measures of overall proficiency. According to the theoretical constructs underlying this claim, the ability to supply appropriate words in blanks requires a number of abilities that lie at the heart of competence in a language: knowledge of vocabulary, grammatical structure, discourse structure, reading skills and strategies, and an internalized "expectancy" grammar (enabling one to predict an item that will come next in a sequence). It was argued that successful completion of cloze items taps into all of those abilities, which were said to be the essence of global language proficiency.
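The deletion pattern itself is mechanical and easy to picture. The sketch below, using an invented passage, only illustrates the "every nth word" idea; it says nothing about the judgment that goes into choosing a passage or deciding which deletions are fair.

```python
# Minimal sketch: build a cloze exercise by blanking roughly every seventh word
# of a passage and keeping an answer key. The passage is invented.

def make_cloze(passage: str, nth: int = 7, start: int = 7):
    words = passage.split()
    output, key = [], []
    for i, word in enumerate(words, start=1):
        if i >= start and (i - start) % nth == 0:
            key.append(word)
            output.append(f"({len(key)}) ______")
        else:
            output.append(word)
    return " ".join(output), key

passage = ("Last night I had a very strange dream about going to the zoo "
           "with a few friends and watching the monkeys play near the gate")
cloze_text, answer_key = make_cloze(passage)
print(cloze_text)
print("answers:", answer_key)
```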
Dictation is a familiar language-teaching technique that evolved into a testing technique. Essentially, learners listen to a passage of 100 to 150 words read aloud by an administrator (or an audiotape) and write what they hear, using correct spelling. The listening portion usually has three stages: an oral reading without pauses; an oral reading with long pauses between every phrase (to give the learner time to write down what is heard); and a third reading at normal speed to give test-takers a chance to check what they wrote. (See Chapter 6 for more discussion of dictation as an assessment device.)
Supporters argue that dictation is an integrative test because it taps into grammatical and discourse competencies required for other modes of performance in language. Success on a dictation requires careful listening, reproduction in writing of what is heard, efficient short-term memory and, to an extent, some expectancy rules to aid short-term memory. Further, dictation test results tend to correlate strongly with other tests of proficiency. Dictation testing is usually classroom-centered, since large-scale administration of dictations is quite impractical from a scoring standpoint. Reliability of scoring criteria for dictation tests can be improved by designing multiple-choice or exact-word cloze test scoring.
Proponents of integrative test methods soon centered their arguments on what became known as the unitary trait hypothesis, which suggested an "indivisible" view of language proficiency: that vocabulary, grammar, phonology, the "four skills," and other discrete points of language could not be disentangled from each other in language performance. The unitary trait hypothesis contended that there is a general factor of language proficiency such that all the discrete points do not add up to that whole.
Others argued strongly against the unitary trait position. In a study of students in Brazil and the Philippines, Farhady (1982) found significant and widely varying differences in performance on an ESL proficiency test, depending on subjects' native country, major field of study, and graduate versus undergraduate status. For example, Brazilians scored very low in listening comprehension and relatively high in reading comprehension. Filipinos, whose scores on five of the six components of the test were considerably higher than the Brazilians' scores, were actually lower than the Brazilians in reading comprehension. Farhady's contentions were supported in other research that seriously questioned the unitary trait hypothesis. Finally, in the face of the evidence, Oller retreated from his earlier stand and admitted that "the unitary trait hypothesis was wrong" (1983, p. 352).
b.      Communicative Language Testing
By the mid-1980s, the language-testing field had abandoned arguments about the unitary trait hypothesis and had begun to focus on designing communicative language-testing tasks. Bachman and Palmer (1996, p. 9) include among "fundamental" principles of language testing the need for a correspondence between language test performance and language use: "in order for a particular language test to be useful for its intended purposes, test performance must correspond in demonstrable ways to language use in non-test situations." The problem that language assessment experts faced was that tasks tended to be artificial, contrived, and unlikely to mirror language use in real life. As Weir (1990, p. 6) noted, "Integrative tests such as cloze only tell us about a candidate's linguistic competence. They do not tell us anything directly about a student's performance ability."
And so a quest for authenticity was launched, as test designers centered on communicative performance. Following Canale and Swain’s (1980) model of communicative competence, Bachman (1990) proposed a model of language competence consisting of organizational and pragmatic competence, respectively subdivided into grammatical and textual components, and into illocutionary and sociolinguistic components. (Further discussion of both Canale and Swain’s and Bachman’s models can be found in PLLT.) Bachman and Palmer (1996, pp. 70f) also emphasized the importance of strategic competence (the ability to employ communicative strategies to compensate for breakdowns as well as to enhance the rhetorical effect of utterances) in the process of communication. All elements of the model, especially pragmatic and strategic abilities, needed to be included in the constructs of language testing and in the actual performance required of test-takers.
Communicative testing presented challenges to test designers. Test constructors began to identify the kinds of real-world tasks that language learners were called upon to perform. It was clear that the contexts for those tasks were extraordinarily varied and that the sampling of texts for any one assessment procedure needed to be validated by what language users actually do with language. Weir (1990, p. 11) reminded his readers that “to measure language proficiency … account must now be taken of: where, when, how, with whom, and why language is to be used, and on what topics, and with what effect.” And the assessment field became more and more concerned with the authenticity of tasks and the genuineness of texts.

c.       Performance-Based Assessment
In language courses and programs around the world, test designers are now tackling this new and more student-centered agenda (Alderson, 2001, 2002). Instead of just offering paper-and-pencil selective-response tests of a plethora of separate items, performance-based assessment of language typically involves oral production, written production, open-ended responses, integrated performance (across skill areas), group performance, and other interactive tasks. To be sure, such assessment is time-consuming and therefore expensive, but those extra efforts are paying off in the form of more direct testing, because students are assessed as they perform actual or simulated real-world tasks. In technical terms, higher content validity is achieved because learners are measured in the process of performing the targeted linguistic acts.
In an English language teaching context, performance-based assessment means that you may have a difficult time distinguishing between formal and informal assessment. If you rely a little less on formally structured tests and a little more on evaluation while students are performing various tasks, you will be taking some steps toward meeting the goals of performance-based testing.
A characteristic of many (but not all) performance-based language assessments is the presence of interactive tasks. In such cases, the assessments involve learners in actually performing the behavior that we want to measure. In interactive tasks, test-takers are measured in the act of speaking, requesting, and responding, or in combining listening and speaking, and in integrating reading and writing. Paper-and-pencil tests certainly do not elicit such communicative performance. A prime example of an interactive language assessment procedure is an oral interview: the test-taker is required to listen accurately to someone else and to respond appropriately. If care is taken in the test design process, language elicited and volunteered by the student can be personalized and meaningful, and tasks can approach the authenticity of real-life language use.

D.    CURRENT ISSUES IN CLASSROOM TESTING
The design of communicative, performance-based assessment rubrics continues to challenge both assessment experts and classroom teachers. Such efforts to improve various facets of classroom testing are accompanied by some stimulating issues, all of which are helping to shape our current understanding of effective assessment. Let's look at three such issues:  the effect of new theories of intelligence on the testing industry; the advent of what has come to be called "alternative" assessment; and the increasing popularity of computer-based testing.

a.      New Views on Intelligence
Intelligence was once viewed strictly as the ability to perform (a) linguistic and (b) logical-mathematical problem solving. This “IQ” (intelligence quotient) concept of intelligence has permeated the Western world and its ways of testing for almost a century. Since “smartness” in general was measured by timed, discrete-point tests consisting of a hierarchy of separate items, why shouldn’t every field of study be so measured? For many years, we have lived in a world of standardized, norm-referenced tests that are timed, in a multiple-choice format, and consist of a multiplicity of logic-constrained items, many of which are inauthentic.
However, research on intelligence by psychologists like Howard Gardner, Robert Sternberg, and Daniel Goleman has begun to turn the psychometric world upside down. Gardner (1983, 1999), for example, extended the traditional view of intelligence to seven different components. He accepted the traditional conceptualizations of linguistic intelligence and logical-mathematical intelligence on which standardized IQ tests are based, but he included five other “frames of mind” in his theory of multiple intelligences:
·         Spatial intelligence (the ability to find your way around an environment, to form mental images of reality)
·         Musical intelligence (the ability to perceive and create pitch and rhythmic patterns)
·         Bodily-kinesthetic intelligence (fine motor movement, athletic prowess)
·         Interpersonal intelligence (the ability to understand others and how they feel, and to interact effectively with them)
·         Intrapersonal intelligence (the ability to understand oneself and to develop a sense of self-identity)
Robert Sternberg (1988, 1997) also charted new territory in intelligence research by recognizing creative thinking and manipulative strategies as part of intelligence. All “smart” people aren’t necessarily adept at fast, reactive thinking. They may be very innovative in being able to think beyond the normal limits imposed by existing tests, but they may need a good deal of processing time to enact this creativity. Other forms of smartness are found in those who know how to manipulate their environment, namely, other people. Debaters, politicians, successful salespersons, smooth talkers, and con artists are all smart in their manipulative ability to persuade others to think their way, vote for them, make a purchase, or do something they might not otherwise do.
More recently, Daniel Goleman’s (1995) concept of “EQ” (emotional quotient) has spurred us to underscore the importance of the emotions in our cognitive processing. Those who manage their emotions, especially emotions that can be detrimental, tend to be more capable of fully intelligent processing. Anger, grief, resentment, self-doubt, and other feelings can easily impair peak performance in everyday tasks as well as in higher-order problem solving.
These new conceptualizations of intelligence have not been universally accepted by the academic community (see White, 1998, for example). Nevertheless, their intuitive appeal infused the decade of the 1990s with a sense of both freedom and responsibility in our testing agenda. Coupled with parallel educational reforms at the time (Armstrong, 1994), they helped to free us from relying exclusively on timed, discrete-point, analytical tests in measuring language. We were prodded to cautiously combat the potential tyranny of “objectivity” and its accompanying impersonal approach. But we also assumed the responsibility for tapping into whole language skills, learning processes, and the ability to negotiate meaning. Our challenge was to test interpersonal, creative, communicative, interactive skills, and in doing so to place some trust in our subjectivity and intuition.

b.      Traditional and “Alternative” Assessment
Implied in some of the earlier descriptions of performance-based classroom assessment is a trend to supplement traditional test designs with alternatives that are more authentic in their elicitation of meaningful communication.
Two caveats need to be stated here. First, the concepts in Table 1.1 represent some overgeneralization and should therefore be considered with caution. It is difficult, in fact, to draw a clear line of distinction between what Armstrong (1994) and Bailey (1998) have called traditional and alternative assessment. Many forms of assessment fall in between the two, and some combine the best of both.
Second, it is obvious that the table shows a bias toward alternative assessment, and one should not be misled into thinking that everything on the left-hand side is tainted while the list on the right-hand side offers salvation to the field of language assessment. As Brown and Hudson (1998) aptly pointed out, the assessment traditions available to us should be valued and utilized for the functions they provide. At the same time, we might all be stimulated to look at the right-hand list and ask ourselves whether, among those concepts, there are alternatives to assessment that we can constructively use in our classrooms.
It should be noted here that considerably more time and higher institutional budgets are required to administer and score assessments that presuppose more subjective evaluation, more individualization, and more interaction in the process of offering feedback.

Table 1.1. Traditional and alternative assessment
Traditional Assessment                        Alternative Assessment
One-shot, standardized exams                  Continuous long-term assessment
Timed, multiple-choice format                 Untimed, free-response format
Decontextualized test items                   Contextualized communicative tasks
Scores suffice for feedback                   Individualized feedback and washback
Norm-referenced scores                        Criterion-referenced scores
Focus on the "right" answer                   Open-ended, creative answers
Summative                                     Formative
Oriented to product                           Oriented to process
Non-interactive performance                   Interactive performance
Fosters extrinsic motivation                  Fosters intrinsic motivation
The payoff for the latter, however, comes with more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete description of a student’s ability. More and more educators and advocates for educational reform are arguing for a de-emphasis on large-scale standardized tests in favor of building budgets that will offer the kind of contextualized, communicative, performance-based assessment that will better facilitate learning in our schools.

c.       Computer-Based Testing
Recent years have seen a burgeoning of assessment in which the test-taker performs responses on a computer. Some computer-based tests (also known as “computer-assisted” or “web-based” tests) are small-scale, “home-grown” tests available on websites. Others are standardized, large-scale tests in which thousands or even tens of thousands of test-takers are involved. Students receive prompts (or probes, as they are sometimes referred to) in the form of spoken or written stimuli from the computerized test and are required to type (or in some cases, speak) their responses. Almost all computer-based test items have fixed, closed-ended responses; however, tests like the Test of English as a Foreign Language (TOEFL) offer a written essay section that must be scored by humans (as opposed to automatic, electronic, or machine scoring). As this book goes to press, the designers of the TOEFL are on the verge of offering a spoken English section.
A specific type of computer-based test, the computer-adaptive test, has been available for many years but has recently gained momentum. In a computer-adaptive test (CAT), each test-taker receives a set of questions that meet the test specifications and that are generally appropriate for his or her performance level. The CAT starts with questions of moderate difficulty. As test-takers answer each question, the computer scores the question and uses that information, as well as the responses to previous questions, to determine which question will be presented next. As long as examinees respond correctly, the computer typically selects questions of greater or equal difficulty. Incorrect answers, however, typically bring questions of lesser or equal difficulty. The computer is programmed to fulfill the test design as it continuously adjusts to find questions of appropriate difficulty for test-takers at all performance levels. In CATs, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one. As a result, test-takers cannot skip questions, and once they have entered and confirmed their answers, they cannot return to that question or to any earlier part of the test.
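As a rough illustration of the selection logic just described, the following Python sketch raises or lowers item difficulty after each response. The item bank structure, the 1-to-10 difficulty scale, and the one-step adjustment are hypothetical assumptions made for illustration; operational CAT engines rely on item response theory and much richer test specifications.

```python
# A minimal sketch of computer-adaptive item selection (hypothetical item bank and
# difficulty scale; real CAT engines use item response theory and full test specs).

import random

def run_cat(item_bank, answer_item, num_items=10, start_difficulty=5):
    """Administer num_items questions, adjusting difficulty to the test-taker's responses.

    item_bank:   dict mapping a difficulty level (1 = easiest ... 10 = hardest)
                 to a list of items at that level.
    answer_item: callback that presents one item and returns True if it was answered correctly.
    """
    difficulty = start_difficulty                 # begin with a moderately difficult question
    correct_count = 0
    for _ in range(num_items):
        item = random.choice(item_bank[difficulty])
        correct = answer_item(item)               # each item is scored before the next is chosen
        if correct:
            correct_count += 1
            difficulty = min(difficulty + 1, 10)  # correct answers bring harder (or equally hard) items
        else:
            difficulty = max(difficulty - 1, 1)   # incorrect answers bring easier (or equally easy) items
    return correct_count
```

Note that the loop never revisits an earlier item, mirroring the constraint that test-takers cannot skip questions or change an answer once it has been confirmed.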
Computer-based testing, with or without CAT technology, offers these advantages:
·         Classroom-based testing
·         Self-directed testing on various aspects of a language (vocabulary, grammar, discourse, one or all of the four skills, etc.)
·         Practice for high stakes standardized tests
·         Some individualizing, in the case of CATs
·         Large-scale standardized tests that can be administered easily to thousands of test-takers at many different stations, then scored electronically for rapid reporting of results.
Of course, some disadvantages are present in our current predilection for computer-based testing. Among them:
·         Lack of security and the possibility of cheating are inherent in classroom-based, unsupervised computerized tests.
·         Occasional “home-grown” quizzes that appear on unofficial websites may be mistaken for validated assessments.
·         The multiple-choice format preferred for most computer-based tests contains the usual potential for flawed item design (see Chapter 3).
·         Open-ended responses are less likely to appear because of the need for human scorers, with all the attendant issues of cost, reliability, and turnaround time.
·         The human interactive element (especially in oral production) is absent.
More is said about computer-based testing in subsequent chapters, especially Chapter 4, in a discussion of large-scale standardized testing. In addition, the following websites provide further information and examples of computer-based tests:

Educational Testing Service                                                      www.ets.org
Test of English as a Foreign Language                                     www.toefl.org
Test of English for International Communication                     www.toeic.com
International English Language Testing System                       www.ielts.org
Dave’s ESL CafĂ© (computer quizzes)                                       www.eslcafe.com

Some argue that computer-based testing, pushed to its ultimate level, might mitigate against recent efforts to return testing to its artful form of being tailored by teachers for their classrooms, of being designed to be performance-based, and of allowing a teacher-student dialogue to form the basis of assessment. This need not be the case. Computer technology can be a boon to communicative language testing. Teachers and test-makers of the future will have access to an ever-increasing range of tools to safeguard against impersonal, stamped-out formulas for assessment. By using technological innovations creatively, testers will be able to enhance authenticity, to increase interactive exchange, and to promote autonomy.
As you read this book, I hope you will do so with an appreciation for the place of testing in assessment, and with a sense of the interconnection of assessment and teaching. Assessment is an integral part of the teaching-learning cycle. In an interactive, communicative curriculum, assessment is almost constant. Tests, which are a subset of assessment, can provide authenticity, motivation, and feedback to the learner. Tests are essential components of a successful curriculum and one of several partners in the learning process. Keep in mind these basic principles:
1.      Periodic assessments, both formal and informal, can increase motivation by serving as milestones of student progress.
2.      Appropriate assessments aid in the reinforcement and retention of information.
3.      Assessment can confirm areas of strength and pinpoint areas needing further work.
4.      Assessment can provide a sense of periodic closure to modules within a curriculum.
5.      Assessments can promote student autonomy by encouraging students’ self-evaluation of their progress.
6.      Assessments can spur learners to set goals for themselves.
7.      Assessments can aid in evaluating teaching effectiveness.


By: 1st Group
Ariani Andespa
Ririn Ariani
Lusy Bebi Hertika
Rivaria Safitri
Tiya Rosalina
Mery Herlina