STANDARDIZED TESTS
Compiled by: THE FOURTH GROUP
Oka Rodi Putra (NPM. 1423 037)
Istiansya (NPM. 1423 003)
Rice Ilmaini (NPM. 1423 019)
Elok Utari (NPM. 1423 032)
Mona Novika (NPM. 1423 010)
General Overview
Every
educated person has at some point been touched – if not deeply affected – by a
standardized test. For almost a century, schools, universities, businesses, and
governments have looked to standardized measures for economical, reliable, and
valid assessments of those who would enter, continue in, or exit their institutions.
Proponents of these large-scale instruments make strong claims for their
usefulness when great numbers of people must be measured quickly and
effectively. Those claims are well supported by reams of research data that
comprise construct validations of their efficacy. And so we have become a world
that abides by the results of standardized tests as if they were sacrosanct.
The rush to carry out standardized testing in every walk of life has not gone
unchecked. Some psychometricians have stood up in recent years to caution the
public against reading too much into tests that require what may be a narrow
band of specialized intelligence (Sternberg, 1997; Gardner, 2000; Kohn, 2000).
Organizations such as the National Center for Fair and Open Testing (www.fairtest.org) have reminded us that
standardization of assessment procedures creates an illusion of validity.
Strong claims from the giants of the testing industry, they say, have pulled
the collective wool over the public’s eyes and in the process have incorrectly
marginalized thousands, if not millions, of children and adults worldwide.
Whichever side is “right” – and both sides have legitimate cases – it is
important for teachers to understand the educational institutions they are
working in, and an integral part of virtually all of those institutions is the
use of standardized tests. So it is important for you to understand what
standardized tests are, what they are not, how to interpret them, and how to put
them into a balanced perspective in which we strive to accurately assess all
learners on all proposed objectives. We can learn a great deal about many
learners and their competencies through standardized forms of assessment. But
some of those learners and some of those objectives may not be adequately
measured by a sit-down, timed, multiple-choice format that is likely to be
decontextualized.
These STANDARDIZED TESTING materials have two goals: to introduce the process of
constructing, validating, administering, and interpreting standardized tests of
language, and to acquaint you with a variety of current standardized tests
that claim to test overall language proficiency. Although we are not focusing
centrally on classroom-based assessment, some of the practical steps that are
involved in creating standardized tests are directly transferable to designing
classroom tests.
A. What is Standardization?
Most elementary and secondary schools in the United States use standardized
achievement tests to measure children’s mastery of the standards or competencies
that have been prescribed for specified grade levels. These tests vary by
states, counties, and school districts, but they all share the common objective
of economical large-scale assessment. College entrance exams such as the
Scholastic Aptitude Test (SAT®) are part of the educational
experience of many high school seniors seeking further education. The Graduate
Record Exam (GRE®) is a required standardized test for entry into
many graduate school programs. Tests like the Graduate Management Admission
Test (GMAT) and the Law School Aptitude Test (LSAT) specialize in particular disciplines.
One genre of standardized test that you may already be familiar with is the
Test of English as a Foreign Language (TOEFL), produced by the Educational
Testing Service (ETS) in the United States, and its British counterpart, the International
English Language Testing System (IELTS), which features standardized tests in affiliation
with the University of Cambridge Local Examinations Syndicate (UCLES). They are
all standardized because they specify a set of competencies (or standards) for a
given domain, and through a process of construct validation they program a set
of tasks that have been designed to measure those competencies.
Many people are under the incorrect impression that all standardized tests
consist of items that have predetermined responses presented in a
multiple-choice format. While it is true that many standardized tests conform
to a multiple-choice format, by no means is multiple-choice a prerequisite
characteristic. It so happens that a multiple-choice format provides the test
producer with an “objective” means for determining correct and incorrect
responses, and therefore is the preferred mode for large-scale tests. However, standards
are equally involved in certain human-scored tests of oral production and
writing, such as the Test of Spoken English (TSE®) and the Test of
Written English (TWE®), both produced by ETS.
B. The Advantages and Disadvantages of Standardized Tests
Advantages of standardized testing include, foremost, a ready-made
previously validated product that frees the teacher from having to spend hours
creating a test. Administration to large groups can be accomplished within
reasonable time limits. In the case of multiple-choice formats, scoring
procedures are streamlined (for either scannable computerized scoring or
hand-scoring with a hole-punched grid) for fast turnaround time. And, for
better or for worse, there is often an air of face validity to such
authoritative-looking instruments.
Disadvantages center largely on the inappropriate use of such tests, for
example, using an overall proficiency test as an achievement test simply
because of the convenience of the standardization. A colleague told me about a
course director who, after a frantic search for a last-minute placement test,
administered a multiple-choice grammar achievement test, even though the
curriculum was mostly listening and speaking and involved few of the grammar
points tested. This instrument had the appearance and face validity of a good
test when in reality it had no content validity whatsoever.
Another disadvantage is the potential misunderstanding of the difference
between direct and indirect testing (see Chapter
2). Some standardized tests include tasks that do not directly specify
performance in the target objective. For example, before 1996, the TOEFL
included neither a written nor an oral production section, yet statistics
showed a reasonably strong correspondence between performance on the TOEFL and a
student’s written and, to a lesser extent, oral production. The
comprehension-based TOEFL could therefore be claimed to be an indirect test of
production. A test of reading comprehension that proposes to measure ability to
read extensively and that engages test-takers in reading only short one-or
two-paragraph passages is an indirect measure of extensive reading.
Those who use standardized tests need to acknowledge both the advantages
and limitations of indirect testing. In the pre-1996 TOEFL administrations, the
expense of giving a direct test of production was considerably reduced by
offering only comprehension performance and showing through construct
validation the appropriateness of conclusions about a test-taker’s production
competence. Likewise, short reading passages are easier to administer, and if
research validates the assumption that short reading passages indicate
extensive reading ability, then the use of the shorter passages is justified.
Yet the construct validation statistics that offer that support never offer a
100 percent probability of the relationship, leaving room for some possibility
that the indirect test is not valid for its targeted use.
A more serious issue lies in the assumption (alluded to above) that standardized
tests correctly assess all learners equally well. Well-established standardized
tests usually demonstrate high correlations between performance on such tests
and target objectives, but correlations are not sufficient to demonstrate
unequivocally the acquisition of criterion objectives by all
test-takers. Here is a non-language example. In the United States, some
driver’s license renewals require taking a paper-and-pencil multiple-choice test
that covers signs, safe speeds and distances, lane changes, and other “rules of
the road”. Correlational statistics show a strong relationship between high
scores on those tests and good driving records, so people who do well on these
tests are a safe bet to relicense. Now, an extremely high correlation (of
perhaps .80 or above) may be loosely interpreted to mean that a large majority
of the drivers whose licenses are renewed by virtue of their having passed the
little quiz are good behind-the-wheel drivers. What about those few who do not
fit the model? That small minority of drivers could endanger the lives of the
majority, and is that a risk worth taking? Motor vehicle registration departments
in the United States seem to think so, and thus avoid the high cost of
behind-the-wheel driving tests.
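To make the arithmetic of this argument concrete, here is a small simulation in Python with entirely made-up numbers (not real driving or licensing data): two measures correlated at roughly .80 still leave a minority of individuals who score well on one measure but rate poorly on the other.

```python
import random

random.seed(1)

# Hypothetical data: "quiz" scores and "driving safety" ratings constructed to
# be strongly but imperfectly related (correlation of about .80). This is an
# illustration only, not an analysis of real licensing statistics.
n = 1000
quiz = [random.gauss(0, 1) for _ in range(n)]
safety = [0.8 * q + 0.6 * random.gauss(0, 1) for q in quiz]

passed_quiz = [i for i in range(n) if quiz[i] > 0]          # above-average quiz scores
unsafe_but_passed = sum(1 for i in passed_quiz if safety[i] < -0.5)
print(f"{len(passed_quiz)} drivers passed the quiz; "
      f"{unsafe_but_passed} of them still rate as unsafe")
```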
Are you willing to rely on a standardized test result in the case of all
the learners in your class? Of an applicant to your institution, or of a
potential degree candidate exiting your program? These questions will be
addressed more fully in Chapter 5, but for the moment, think carefully about
what has come to be known as high-stakes testing, in which standardized
tests have become the only criterion for inclusion or exclusion. The
widespread acceptance, and sometimes misuse, of this gate-keeping role of
the testing industry has created a political, educational, and moral maelstrom.
C. Developing A Standardized Test
How are standardized tests developed? Where do test
tasks and items come from? How are they evaluated? Who selects items and their
arrangement in a test? How do such items and tests achieve consequential
validity? How are different forms of tests equated for difficulty level? Who
sets norms and cut-off limits? Are security and confidentiality an issue? Are
cultural and racial biases an issue in test development? All these questions
typify those that you might pose in an attempt to understand the process of
test development.
In the steps outlined below, three different
standardized tests will be used to exemplify the process of standardized test
design:
(A) The Test of English as a Foreign Language (TOEFL), Educational Testing Service (ETS).
(B) The English as a Second Language Placement Test (ESLPT), San Francisco State University (SFSU).
(C) The Graduate Essay Test (GET), SFSU.
The first is a test of general language ability or
proficiency. The second is a placement test at a university. And the third is a
gate-keeping essay test that all prospective students must pass in order to
take graduate-level courses. As we look at the steps, one by one, you will see
patterns that are consistent with those outlined in the previous materials for
evaluating and developing a classroom test.
1. Determine the Purpose and Objectives of the Test
Most standardized tests are expected to provide high
practicality in administration and scoring without unduly compromising
validity. The initial outlay of time and money for such a test is significant,
but the test would be used repeatedly. It is therefore important for its
purpose and objectives to be stated specifically. Let’s look at the three
tests.
(A) The purpose of the TOEFL is “to
evaluate the English proficiency of people whose native language is not
English” (TOEFL Test and Score Manual,
2001, p.9). More specifically, the TOEFL is designed to help institutions of
higher learning make “valid decisions concerning English language proficiency
in terms of [their] own requirements” (p.9). Most colleges and universities in
the United States use TOEFL scores to admit or refuse international applicants
for admission. Various cut-off scores apply, but most institutions require
scores from 475 to 525 (paper-based) or from 150 to 195 (computer-based) in
order to consider students for admission. The high-stakes, gate-keeping nature
of the TOEFL is obvious.
(B) The ESLPT is designed to place
already admitted students at San Francisco State University in an appropriate
course in academic writing, with the secondary goal of placing students into
courses in oral production and grammar-editing. While the test’s primary purpose
is to make placements, another desirable objective is to provide teachers with
some diagnostic information about their students on the first day or two of class.
The ESLPT is locally designed by university faculty and staff.
(C) The GET, another test designed at
SFSU, is given to prospective graduate students – both native and non-native
speakers – in all disciplines to determine whether their writing ability is sufficient
to permit them to enter graduate-level courses in their programs. It is offered
at the beginning of each term. Students who fail or marginally pass the GET are
technically ineligible to take graduate courses in their field. Instead, they
may elect to take a course in graduate-level writing of research papers. A pass
in that course is equivalent to passing the GET.
As you can see, the objectives of
each of these tests are specific. The content of each test must be designed to
accomplish those particular ends. This first stage of goal-setting might be
seen as one in which the consequential validity of the test is foremost in the
mind of the developer: each test has a specific gate-keeping function to
perform; therefore the criteria for entering those gates must be specified
accurately.
2. Design Test Specifications
Now comes the hard part. Decisions need to be made on how to go about structuring
the specifications of the test. Before specs can be addressed, a comprehensive
program of research must identify a set of constructs underlying the test
itself. This stage of laying the foundation stones can occupy weeks, months, or
even years of effort. Standardized tests that don’t work are often the product
of short-sighted construct validation. Let’s look at the three tests again.
(A) Construct validation for the TOEFL is carried out by the TOEFL staff at ETS under the guidance of a
policy council that works with a Committee of Examiners composed of
appointed external university faculty, linguists, and assessment specialists.
Dozens of employees are involved in a complex process of reviewing current
TOEFL specifications, commissioning and developing test tasks and items,
assembling forms of the test, and performing on-going exploratory research
related to formulating new specs. Reducing such a complex process to a set of
simple steps runs the risk of gross overgeneralization, but here is an idea of
how a TOEFL is created.
Because the TOEFL is a proficiency test, the first step in the developmental
process is to define the construct of language proficiency. First, it should be
made clear that many assessment specialists such as Bachman (1990) and Palmer
(Bachman & Palmer, 1996) prefer the term ability to proficiency and thus speak
of language ability as
the overarching concept. The latter phrase is more consistent, they argue, with
our understanding that the specific components of language ability must be
assessed separately. Others, such as the American Council on Teaching Foreign
Languages (ACTFL), still prefer the term proficiency because it connotes more of
a holistic, unitary trait view of language ability (Lowe, 1988). Most current
views accept the ability argument and therefore strive to specify and assess
the many components of language. For the purposes of consistency in this book, the
term proficiency will nevertheless be retained, with the above caveat.
How you view language will make a difference in how you assess language
proficiency. After breaking language competence down into subsets of listening,
speaking, reading and writing, each performance mode can be examined on a
continuum of linguistic units: phonology (pronunciation) and orthography
(spelling), words (lexicon), sentences (grammar), discourse, and pragmatic
(sociolinguistic, contextual, functional, cultural) features of language.
From the sea of potential performance modes that could be sampled in a
test, the developer must select a subset on some systematic basis. To make a
very long story short (and leaving out numerous controversies), the TOEFL had
for many years included three types of performance in its organizational specifications:
listening, structure, and reading, all of which tested comprehension through
standard multiple-choice tasks. In 1996 a major step was taken to include
written production in the computer-based TOEFL by adding a slightly modified
version of the already existing Test of Written English (TWE). In doing so,
some face validity and content validity were improved along with, of course, a
significant increase in administrative expense! Each of these four major
sections is capsulized in Figure 1 below. These descriptions are not,
strictly speaking, specifications, which are kept confidential by ETS.
Nevertheless, they can give a sense of many of the constraints that are placed
on the design of actual TOEFL specifications.
Figure 1: TOEFL® specifications
(B) The designing of the test specs for the ESLPT was a somewhat
simpler task because the purpose is placement and the construct validation of
the test consisted of an examination of the content of the ESL courses. In
fact, in a recent revision of the ESLPT, content validity (coupled with its
attendant face validity) was the central theoretical issue to be considered.
The major issue centered on designing practical and reliable tasks and item
response formats. Having established the importance of designing ESLPT tasks
that simulated classroom tasks used in the courses, the designers ultimately
specified two writing production tasks (one a response to an essay that
students read, and the other a summary of another essay) and one
multiple-choice grammar-editing task. These specifications mirrored the
reading-based, process writing approach used in the courses.
(C) Specifications for the GET arose out of the perceived need to
provide a threshold of acceptable writing ability for all prospective graduate
students at SFSU, both native and non-native speakers of English. The
specifications for the GET are the skills of writing grammatically and
rhetorically acceptable prose on a topic of some interest, with clearly
produced organization of ideas and logical development. The GET is a direct
test of writing ability in which test-takers must, in a two-hour time period,
write an essay on a given topic.
3. Design, Select, and Arrange Test Tasks/Items
Once specifications for a standardized test have been stipulated,
the sometimes never-ending task of designing, selecting, and arranging items
begins. The specs act much like a blueprint in determining the number and types
of items to be created. Let’s look at the three examples.
(A) TOEFL test design specifies that each item be coded for content and statistical characteristics.
Content coding ensures that each examinee will receive test questions that
assess a variety of skills (in reading, for example, comprehending the main idea or understanding
inferences) and cover a variety of subject matter without unduly biasing the
content toward a subset of test-takers (for example, in the listening section involving
an academic lecture, the content must be universal enough for students from
many different academic fields of study). Statistical characteristics,
including item response theory (IRT) equivalents of estimates of item facility
(IF) and of an item’s ability to discriminate (ID) between higher and lower
ability levels, are also coded.
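The IRT models and parameters that ETS actually uses are not public. Purely as an illustration of what an IRT “equivalent” of IF and ID looks like, the widely used two-parameter logistic (2PL) model expresses the probability of a correct response as a function of a test-taker’s ability, an item’s difficulty, and its discrimination. A minimal sketch in Python:

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL IRT model.

    theta: test-taker ability
    a:     item discrimination (the IRT counterpart of classical ID)
    b:     item difficulty (the IRT counterpart of classical item facility)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Example: an item of average difficulty (b = 0) answered by test-takers
# of low, average, and high ability.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(p_correct_2pl(theta, a=1.2, b=0.0), 2))
```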
Items are
then designed by a team who select and adapt items solicited from a bank of
items that have been “deposited” by free-lance writers and ETS staff. Probes
for the reading section, for example, are usually excerpts from authentic
general or academic reading that are edited for linguistic difficulty, culture
bias, or other topic biases. Items are designed to test overall comprehension,
certain specific information, and inference.
Consider
the following sample of a reading selection and ten items based on it, from a
practice TOEFL (Phillips, 2001, pp. 423-424):
Figure 2: Reading selection and ten items based on it,
from a practice TOEFL
As you can
see, items target assessment of comprehension of the main idea (item #11),
stated details (#17, 19), unstated details (#12, 15, 18), implied details (#14,
20), and vocabulary in context (#13, 16). An argument could be made about the
cultural schemata implied in a passage about pirate ships, and you could engage
in an “angels on the head of a pin” argument about the importance of picking
certain vocabulary for emphasis, but every test item is a sample of a larger
domain, and each of these fulfils its designated specification.
Before any
such items are released into a form of the TOEFL (or any validated standardized
test), they are piloted and scientifically selected to meet difficulty
specifications within each subsection, section, and the test overall.
Furthermore, those items are also selected to meet a desired discrimination
index. Both of these indices are important considerations in the design of a
computer-adaptive test, where performance on one item determines the next one
to be presented to the test-taker.
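As a rough illustration only (the operational TOEFL algorithm is proprietary and uses IRT-based ability estimation plus content and exposure constraints), a computer-adaptive test can be thought of as repeatedly choosing the unused item whose difficulty is closest to the current ability estimate and adjusting that estimate after each response. The item bank and update rule below are hypothetical:

```python
# A deliberately simplified computer-adaptive selection loop.
# Item difficulties and the step-size update are invented for illustration.

items = {"q1": -1.0, "q2": -0.5, "q3": 0.0, "q4": 0.5, "q5": 1.0}  # difficulty

def run_cat(answers_correctly, n_items=3):
    theta = 0.0                      # start from an average ability estimate
    remaining = dict(items)
    for _ in range(n_items):
        # pick the unused item whose difficulty is closest to the estimate
        item = min(remaining, key=lambda k: abs(remaining[k] - theta))
        difficulty = remaining.pop(item)
        correct = answers_correctly(item, difficulty)
        # crude update: move the estimate up after a correct answer, down otherwise
        theta += 0.5 if correct else -0.5
    return theta

# Example: a test-taker who answers correctly whenever the item is
# not harder than their "true" ability of 0.5.
estimate = run_cat(lambda item, difficulty: difficulty <= 0.5)
print(estimate)
```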
(B) The selection of items in the ESLPT entailed two entirely different
processes. In the two subsections of the test that elicit writing performance
(summary of reading; response to reading), the main hurdles were (a) selecting
appropriate passages for test-takers to read,
(b) providing appropriate prompts, and (c) processing data from pilot testing. Passages
have to conform to standards of content validity by being within the genre and
the difficulty of the material used in the courses. The prompt in each case (the
section asking for a summary and the section asking for a response) has to be
tailored to fit the passage, but a general template is used.
In the multiple-choice
editing test that seeks to test grammar proofreading ability, the first and
easier task is to choose an appropriate essay within which to embed errors. The
more complicated task is to embed a specified number of errors from a
previously determined taxonomy of error categories. Those error categories came
directly from student errors as perceived by their teachers (verb tenses, verb
agreement, logical connectors, articles, etc.). The distractors for each item
were selected from actual errors that students make. Items in pilot versions
were then coded for difficulty and discrimination indices, after which final
assembly of items could occur.
(C) The GET prompts are designed by a faculty committee of examiners who are
specialists in the field of university academic writing. The assumption is made
that the topics are universally appealing and capable of yielding the intended
product of an essay that requires
an organized logical argument and conclusion. No pilot testing of prompts is
conducted. The conditions for administration remain constant: two-hour time
limit, sit-down context, paper and pencil, closed-book format. Consider the
following recent prompt:
Figure 3: Graduate Essay Test, sample prompt
It is clear from such a
prompt that the problem the test-takers must address is complex, that there is
sufficient information here for writing an essay, and that test-takers will be
reasonably challenged to write a clear statement of opinion. What also emerges
from this prompt (and virtually any prompt that one might propose) is the
potential cultural effect on the numerous international students who must take
the GET. Is it possible that such students, who are not familiar with school
systems in the United States, with hiring procedures, and perhaps with the
“politics” of school board elections, might be at a disadvantage in mounting
arguments within a two-hour time frame? Some (such as Hosoya, 2001) have
strongly claimed a bias.
4. Make Appropriate Evaluations of Different Kinds of Items
The concepts of item facility (IF), item discrimination (ID), and
distractor analysis were introduced in the previous materials, where it was shown that such calculations provide
useful information for classroom tests, but sometimes the time and effort
involved in performing them may not be practical, especially if the
classroom-based test is a one-time test. Yet for a standardized multiple-choice test that
is designed to be marketed commercially, and/or administered a number of times,
and/or administered in a different form, these indices are a must.
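For reference, the classroom-style calculations behind these indices are simple: item facility is the proportion of test-takers who answer an item correctly, and item discrimination contrasts the item performance of high and low scorers on the whole test. The sketch below uses hypothetical response data; standardized programs use IRT-based analogues of these indices.

```python
def item_facility(responses):
    """IF = proportion of test-takers answering the item correctly.
    responses: list of 1 (correct) / 0 (incorrect) for one item."""
    return sum(responses) / len(responses)

def item_discrimination(item_responses, total_scores, group_fraction=1/3):
    """ID = IF(high group) - IF(low group), using the top and bottom thirds
    of test-takers ranked by their total test score."""
    ranked = sorted(zip(total_scores, item_responses), reverse=True)
    n = max(1, int(len(ranked) * group_fraction))
    high = [response for _, response in ranked[:n]]
    low = [response for _, response in ranked[-n:]]
    return item_facility(high) - item_facility(low)

# Hypothetical data: one item's responses and each test-taker's total score.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
totals = [48, 45, 44, 40, 39, 35, 33, 30, 28, 25, 22, 20]
print(item_facility(item))               # 0.5: half the group answered correctly
print(item_discrimination(item, totals)) # positive value = discriminates well
```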
For other types of response formats, namely, production responses,
different forms of evaluation become important. The principles of practicality
and reliability are prominent, along with the concept of facility. Practicality
issues in such items include the clarity of directions, timing of the test,
ease of administration, and how much time is required to score responses.
Reliability is a major player in instances where more than one scorer is employed
and, to a lesser extent, when a single scorer has to evaluate tests over long
spans of time that could lead to deterioration of standards. Facility is also a
key to the validity and success of an item type: unclear directions, complex
language, obscure topics, fuzzy data, and culturally biased information may all
lead to a higher level of difficulty than one desires.
(A) The IF, ID, and distractor efficiency statistics of the multiple-choice
items of current forms of the TOEFL are not publicly available information. For
reasons of security and protection of patented, copyrighted materials, they
must remain behind the closed doors of the ETS development staff. Those
statistics remain of paramount importance in the on-going production of TOEFL
items and forms and are the foundation stones for demonstrating the equitability
of forms. Statistical indices on retired forms of the TOEFL are available on
request for research purposes.
The essay portion of the TOEFL
undergoes scrutiny for its practicality, reliability, and facility. Special
attention is given to reliability since two human scorers must read each essay,
and every time a third reader becomes necessary (when the two readers disagree
by more than one point), it costs ETS more money.
(B) In the case of the open-ended responses on the two
written tasks on the ESLPT, a similar set of judgments must be made. Some
evaluative impressions of the effectiveness of prompts and passages are gained
from informal student and scorer feedback. In the developmental stage of the
newly revised ESLPT, both types of feedback were formally solicited through
questionnaires and interviews. That information proved to be invaluable in the
revision of prompts and stimulus reading passages. After each administration
now, the teacher-scorers provide informal feedback on their perceptions of the
effectiveness of the prompts and readings.
The multiple-choice editing
passage showed the value of statistical findings in determining the usefulness
of items and pointing administrators toward revisions. Following is a sample of the format used:
Figure 4: Multiple-choice editing passage
The task
was to locate the error in each sentence. Statistical tests on the experimental
version of this section revealed that a number of the 45 items were found to be
of zero IF (no difficulty whatsoever) and of inconsequential discrimination
power (some IDs of .15 and lower). Many distractors were of no consequence
because they lured no one. Such information led to a revision of numerous items
and their options, eventually strengthening the effectiveness of this section.
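The distractor analysis described here amounts to tallying how many test-takers chose each option; a distractor that “lured no one” shows up as a zero count and becomes a candidate for revision. A minimal sketch with hypothetical responses:

```python
from collections import Counter

def distractor_analysis(responses, options=("A", "B", "C", "D"), key=None):
    """Count how often each option was chosen for one item.
    responses: list of chosen options; key: the correct option, if known."""
    counts = Counter(responses)
    for option in options:
        marker = " (key)" if option == key else ""
        print(f"{option}{marker}: {counts.get(option, 0)}")

# Hypothetical item: option D lures no one, so it contributes nothing
# to the item and would be flagged for revision.
distractor_analysis(["B", "B", "A", "B", "C", "B", "B", "A", "B", "B"], key="B")
```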
(C) The GET, like its written counterparts in the ESLPT,
is a test of written ability with a single prompt, and therefore questions of
practicality and facility are also largely observational. No data are collected
from students on their perceptions, but the scorers have an opportunity to reflect
on the validity of a given topic. After one sitting, a topic is retired, which
eliminates the possibility of improving a specific topic, but future framing of
topics might benefit from scorers’ evaluations. Inter-rater reliability is
checked periodically, and reader training sessions are modified if too many
instances of unreliability appear.
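The GET’s actual reliability procedures are not published, but a common way to check inter-rater reliability for a two-reader essay test is to compute the correlation between the readers’ scores along with their exact and within-one-point agreement rates. A sketch with hypothetical ratings:

```python
from statistics import correlation  # available in Python 3.10+

def agreement_rates(reader1, reader2):
    """Exact and within-one-point agreement between two sets of ratings."""
    pairs = list(zip(reader1, reader2))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical GET ratings (1-4 scale) from two trained readers.
r1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
r2 = [3, 2, 3, 3, 2, 2, 4, 4, 1, 3]
print(correlation(r1, r2))      # Pearson r between the two readers
print(agreement_rates(r1, r2))  # (exact agreement, within-one-point agreement)
```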
5. Specify Scoring Procedures and Reporting Formats
A
systematic assembly of test items in pre-selected arrangements and sequences,
all of which are validated to conform to an expected difficulty level, should
yield a test that can then be scored accurately and reported back to
test-takers and institutions efficiently.
(A) Of the three tests being exemplified here, the most straightforward scoring
procedure comes from the TOEFL, the one with the
most complex issues of validation, design, and assembly. Scores are calculated
and reported for (a) three sections of the TOEFL (the essay ratings are combined
with the Structure and Written Expression score) and (b) a total score (range
40 to 300 on the computer-based TOEFL and 310 to 677 on the paper-and-pencil
TOEFL). A separate score (c) for the Essay (range 0 to 6) is also provided on
the examinee’s score record.
Figure 5: Facsimile of a TOEFL® score report
The rating
scale for the essay is virtually the same one that is used for the Test of
Written English (see Chapter 9 for details), with a “zero” level added for no
response, copying the topic only, writing completely off topic, or not writing
in English.
(B) The ESLPT reports a score for each of the essay
sections, but the rating scale differs between them because in one case the
objective is to write a summary, and in the other to write a response to a
reading. Each essay is read by two readers; if there is a discrepancy of more
than one level, a third reader resolves the difference. The editing section is
machine-scanned and machine-scored, with a total score and with part-scores for each
of the grammatical/ rhetorical sections. From these data, placement
administrators have adequate information to make placements, and teachers
receive some diagnostic information on each student in their classes. Students
do not receive their essays back.
(C) Each GET is
read by two trained readers, who give a score between 1 and 4 according to the
following scale:
Figure 6: Graduate Essay Test: Scoring Guide
The two readers’ scores are added to yield a
total possible score of 2 to 8. Test administrators recommend a score of 6 as
the threshold for allowing a student to pursue graduate-level courses.
Anything below that is accompanied by a recommendation that the students either
repeat the test or take a “remedial” course in graduate writing offered in one
of several different departments. Students receive neither their essays nor any
feedback other than the final score.
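The arithmetic of the GET decision is straightforward: each reader assigns a score from 1 to 4, the two scores are summed to a total of 2 to 8, and a total of 6 is the recommended threshold. The sketch below implements that rule; the wording of the outcomes is illustrative, not the program’s official language.

```python
def get_decision(reader1: int, reader2: int, threshold: int = 6) -> str:
    """Combine two GET ratings (1-4 each) into a 2-8 total and apply
    the recommended pass threshold of 6."""
    for score in (reader1, reader2):
        if not 1 <= score <= 4:
            raise ValueError("GET ratings must be between 1 and 4")
    total = reader1 + reader2
    if total >= threshold:
        return f"total {total}: eligible for graduate-level courses"
    return f"total {total}: retake the GET or take the graduate writing course"

print(get_decision(3, 3))  # total 6: at the recommended threshold
print(get_decision(2, 3))  # total 5: below the threshold
```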
6. Perform On-going Construct Validation Studies
From the above discussion, it should be clear that no standardized
instrument is expected to be used repeatedly without a rigorous program of on-going
construct validation. Any standardized test, once developed, must be
accompanied by systematic periodic corroboration of its effectiveness and by
steps toward its improvement. Such rigor is especially important for tests that are
produced in equated forms, that is, multiple forms must be equivalent
such that a score on a subsequent form of a test has the same validity and
interpretability as a score on the original form.
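Large testing programs use sophisticated IRT-based equating designs, but as a simple illustration of what equating involves, mean-sigma linear equating places scores from a new form onto a reference form’s scale by matching the two forms’ means and standard deviations, assuming the two groups of test-takers are comparable. All numbers below are hypothetical:

```python
from statistics import mean, stdev

def linear_equate(score, new_form_scores, reference_form_scores):
    """Map a score on a new form onto the reference form's scale by
    matching means and standard deviations (mean-sigma equating)."""
    slope = stdev(reference_form_scores) / stdev(new_form_scores)
    intercept = mean(reference_form_scores) - slope * mean(new_form_scores)
    return slope * score + intercept

# Hypothetical score distributions from two forms given to comparable groups.
new_form = [42, 48, 50, 55, 61, 47, 53, 58]
reference = [45, 52, 54, 60, 66, 51, 57, 63]
print(round(linear_equate(50, new_form, reference), 1))
```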
(A) The TOEFL program, in cooperation with
other tests produced by ETS, has an impressive program of research. Over the
years dozens of TOEFL-sponsored research studies have appeared in the TOEFL
Monograph Series. An early example of such a study was the seminal Duran
et al. (1985) study, TOEFL from a Communicative Viewpoint on Language
Proficiency, which examined the content characteristics of the TOEFL from a
communicative perspective based on current research in applied linguistics and
language proficiency assessment. More recent studies (such as Ginther, 2001;
Leacock & Chodorow, 2001; Powers et al., 2002) demonstrate an impressive
array of scrutiny.
(B) For approximately 20 years, the ESLPT
appeared to be placing students reliably by means of an essay and a
multiple-choice grammar and vocabulary test. Over the years the security of the
latter became suspect, and the faculty administrators wished to see some
content validity achieved in the process. In the year 2000 that process began
with a group of graduate students (Imao et al., 2000) in consultation with
faculty members, and continued to fruition in the form of a new ESLPT, reported
in Imao (2002). The development of the new ESLPT involved a lengthy process of
both content and construct validation, along with such practical issues
as scoring the written sections and creating a machine-scorable multiple-choice
answer sheet.
The process of on-going
validation will no doubt continue as new forms of the editing section are
created and as new prompts and reading passages are created for the writing
section. Such a validation process should also include consistent checks on
placement accuracy and on face validity.
(C) At this time there is little or no research
to validate the GET itself. For its construct validation, its administrators
rely on a stockpile of research on university- level academic writing tests
such as the TWE. The holistic scoring rubric and the topics and administrative
conditions of the GET are to some extent patterned after those of the TWE. In
recent years some criticism of the GET has come from international test- takers
(Hosoya, 2001) who posit that the topics and time limits of the GET, among
other factors, work to the disadvantage of writers whose native language is not
English. These validity issues remain to be fully addressed in a comprehensive
research study.
D. Standardized Language Proficiency Testing
Tests of language
proficiency presuppose a comprehensive definition of the specific competencies
that comprise overall language ability. The specifications for the TOEFL
provided an illustration of an operational definition of ability for assessment
purposes. It is not the only way to conceptualize proficiency, however. Swain (1990)
offered a multidimensional view of proficiency assessment by referring to three
linguistic traits (grammar, discourse, and sociolinguistics) that can be
assessed by means of oral, multiple-choice, and written responses (see
Table 1.1). Swain’s conception was not intended to be an exhaustive analysis of ability,
but rather to serve as an operational framework for constructing proficiency
assessments.
Another
definition and conceptualization of proficiency is suggested by the ACTFL
association, mentioned earlier. ACTFL takes a holistic and more unitary view of
proficiency in describing four levels: superior, advanced, intermediate, and
novice. Within each level, descriptions of listening, speaking, reading, and
writing are provided as guidelines for assessment. For example, the ACTFL
Guidelines describe the superior level of speaking as follows:
Figure
7: ACTFL speaking guidelines, summary, superior-level
The other
three ACTFL levels use the same parameters in describing progressively lower
proficiencies across all four skills. Such taxonomies have the advantage of
considering a number of functions of linguistic discourse, but the
disadvantage, at the lower levels, of overly emphasizing test-takers’
deficiencies.
Table 1.1: Traits of second language proficiency (Swain, 1990, p. 403)

The framework crosses three traits (grammar, discourse, sociolinguistic) with three methods (oral, multiple-choice, written composition):

Trait focus
- Grammar: focus on grammatical accuracy within sentences
- Discourse: focus on textual cohesion and coherence
- Sociolinguistic: focus on social appropriateness of language use

Oral method
- Grammar: structured interview, scored for accuracy of verbal morphology, prepositions, and syntax
- Discourse: story telling and argumentation/persuasion, with detailed ratings for identification, logical sequence, and time orientation, and global ratings for coherence
- Sociolinguistic: role-play of speech acts (requests, offers, complaints), scored for the ability to distinguish formal and informal register

Multiple-choice method
- Grammar: sentence-level ‘select the correct form’ exercise (45 items), involving verb morphology, prepositions, and other items
- Discourse: paragraph-level ‘select the coherent sentence’ exercise (29 items)
- Sociolinguistic: speech act-level ‘select the appropriate utterance’ exercise (28 items)

Written composition method
- Grammar: narrative and letter of persuasion, scored for accuracy of verb morphology, prepositions, and syntax
- Discourse: narrative and letter of persuasion, with detailed ratings much as for oral discourse and a global rating for coherence
- Sociolinguistic: formal request letter and informal note, scored for the ability to distinguish formal and informal register
E. Four Standardized Language Proficiency Tests
We now turn to some of the better-known standardized
tests of overall language ability, or proficiency, to examine some of the
typical formats used in commercially available tests. We will not look at
standardized tests of other specific skills here, but that should not lead you
to think, by any means, that proficiency is the only kind of test in the field
that is standardized. Standardized tests of specific skills include, for
example, the Test of Spoken English (TSE), the Oral Proficiency Interview
(OPI), and PhonePass for oral production, and the Test of Written English
(TWE) for writing.
Four commercially produced standardized tests of
English language proficiency are described briefly in this section: the TOEFL,
the Michigan English Language Assessment Battery (MELAB), the International
English Language Testing System (IELTS), and the Test of English for
International Communication (TOEIC®). The following questions can help us
evaluate these tests and their subsections:
(1) What item types are included?
(2) How practical and reliable does each subsection of each test appear to be?
(3) Do the item types and tasks appropriately represent a conceptualization of language proficiency (ability)? That is, can you evaluate their construct validity?
(4) Do the tasks achieve face validity?
(5) Are the tasks authentic?
(6) Is there some washback potential in the tasks?
Figure 8: Test of English as a Foreign Language (TOEFL®)
Figure 9: Michigan English Language Assessment Battery
(MELAB)
Figure 10: International English Language Testing System (IELTS)
Figure 11: Test of English for International
Communication (TOEIC®)
The construction of a valid standardized test is no
minor accomplishment, whether the instrument is large- or small-scale. The
designing of specifications alone requires a sophisticated process of construct
validation coupled with considerations of practicality. Then, the construction
of items and scoring/interpretation procedures may require a lengthy period of
trial and error with prototypes of the final form of the test. With painstaking
attention to all the details of construction, the end product can result in a
cost-effective, time-saving, accurate instrument. Your use of the results of
such assessments can provide useful data on learners’ language abilities. But
your caution is warranted as well.
REFERENCES:
Brown, H. Douglas. (2004). Language Assessment: Principles and Classroom Practices. New York: Pearson Education.
http://febbyeni.blogspot.co.id/2013/06/designing-classroom-language-test.html
“these tests are too crude to be used” ~ Frederick J. Kelly