STANDARDIZED TESTS
Compiled by: THE FOURTH GROUP
Oka Rodi Putra (NPM. 1423 037)
Istiansya (NPM. 1423 003)
Rice Ilmaini (NPM. 1423 019)
Elok Utari (NPM. 1423 032)
Mona Novika (NPM. 1423 010)
General Overview
Every
educated person has at some point been touched – if not deeply affected – by a
standardized test. For almost a century, schools, universities, businesses, and
governments have looked to standardized measures for economical, reliable, and
valid assessments of those who would enter, continue in, or exit their institutions.
Proponents of these large-scale instruments make strong claims for their
usefulness when great numbers of people must be measured quickly and
effectively. Those claims are well supported by reams of research data that
comprise construct validations of their efficacy. And so we have become a world
that abides by the results of standardized tests as if they were sacrosanct.
The rush to carry out standardized testing in every walk of life has not gone
unchecked. Some psychometricians have stood up in recent years to caution the
public against reading too much into tests that require what may be a narrow
band of specialized intelligence (Sternberg, 1997; Gardner, 2000; Kohn, 2000).
Organizations such as the National Center for Fair and Open Testing (www.fairtest.org) have reminded us that
standardization of assessment procedures creates an illusion of validity.
Strong claims from the giants of the testing industry, they say, have pulled
the collective wool over the public’s eyes and in the process have incorrectly
marginalized thousands, if not millions, of children and adults worldwide.
Whichever side is “right” – and both sides have legitimate cases – it is
important for teachers to understand the educational institutions they are
working in, and an integral part of virtually all of those institutions is the
use of standardized tests. So it is important for you to understand what
standardized tests are, what they are not, how to interpret them, and how to put
them into a balanced perspective in which we strive to accurately assess all
learners on all proposed objectives. We can learn a great deal about many
learners and their competencies through standardized forms of assessment. But
some of those learners and some of those objectives may not be adequately
measured by a sit-down, timed, multiple-choice format that is likely to be
decontextualized.
These STANDARDIZED TESTING materials have two goals: to introduce the process of
constructing, validating, administering, and interpreting standardized tests of
language, and to acquaint you with a variety of current standardized tests
that claim to test overall language proficiency. Although we are not focusing
centrally on classroom-based assessment, some of the practical steps that are
involved in creating standardized tests are directly transferable to designing
classroom tests.
A. What is Standardization?
Most elementary and secondary schools in the United States use standardized
achievement tests to measure children’s mastery of the standards or competencies
that have been prescribed for specified grade levels. These tests vary by
states, counties, and school districts, but they all share the common objective
of economical large-scale assessment. College entrance exams such as the
Scholastic Aptitude Test (SAT®) are part of the educational
experience of many high school seniors seeking further education. The Graduate
Record Exam (GRE®) is a required standardized test for entry into
many graduate school programs. Tests like the Graduate Management Admission
Test (GMAT) and the Law School Aptitude Test (LSAT) specialize in particular disciplines.
One genre of standardized test that you may already be familiar with is the
Test of English as a Foreign Language (TOEFL), produced by the Educational
Testing Service (ETS) in the United States, and its British counterpart, the International
English Language Testing System (IELTS), which features standardized tests in affiliation
with the University of Cambridge Local Examinations Syndicate (UCLES). They are
all standardized because they specify a set of competencies (or standards) for a
given domain, and through a process of construct validation they program a set
of tasks that have been designed to measure those competencies.
Many people are under the incorrect impression that all standardized tests
consist of items that have predetermined responses presented in a
multiple-choice format. While it is true that many standardized tests conform
to a multiple-choice format, by no means is multiple-choice a prerequisite
characteristic. It so happens that a multiple-choice format provides the test
producer with an “objective” means for determining correct and incorrect
responses, and therefore is the preferred mode for large-scale tests. However, standards
are equally involved in certain human-scored tests of oral production and
writing, such as the Test of Spoken English (TSE®) and the Test of
Written English (TWE®), both produced by ETS.
B. The Advantages and Disadvantages of Standardized Tests
Advantages of standardized testing include, foremost, a ready-made
previously validated product that frees the teacher from having to spend hours
creating a test. Administration to large groups can be accomplished within
reasonable time limits. In the case of multiple-choice formats, scoring
procedures are streamlined (for either scannable computerized scoring or
hand-scoring with a hole-punched grid) for fast turnaround time. And, for
better or for worse, there is often an air of face validity to such
authoritative-looking instruments.
Disadvantages center largely on the inappropriate use of such tests, for
example, using an overall proficiency test as an achievement test simply
because of the convenience of the standardization. A colleague told me about a
course director who, after a frantic search for a last-minute placement test,
administered a multiple-choice grammar achievement test, even though the
curriculum was mostly listening and speaking and involved few of the grammar
points tested. This instrument had the appearance and face validity of a good
test when in reality it had no content validity whatsoever.
Another disadvantage is the potential misunderstanding of the difference
between direct and indirect testing (see Chapter
2). Some standardized tests include tasks that do not directly specify
performance in the target objective. For example, before 1996, the TOEFL
included neither a written nor an oral production section, yet statistics
showed a reasonably strong correspondence between performance on the TOEFL and a
student’s written and, to a lesser extent, oral production. The
comprehension-based TOEFL could therefore be claimed to be an indirect test of
production. A test of reading comprehension that proposes to measure ability to
read extensively and that engages test-takers in reading only short one-or
two-paragraph passages is an indirect measure of extensive reading.
Those who use standardized tests need to acknowledge both the advantages
and limitations of indirect testing. In the pre-1996 TOEFL administrations, the
expense of giving a direct test of production was considerably reduced by
offering only comprehension performance and showing through construct
validation the appropriateness of conclusions about a test-taker’s production
competence. Likewise, short reading passages are easier to administer, and if
research validates the assumption that short reading passages indicate
extensive reading ability, then the use of the shorter passages is justified.
Yet the construct validation statistics that offer that support never offer a
100 percent probability of the relationship, leaving room for some possibility
that the indirect test is not valid for its targeted use.
A more serious issue lies in the assumption (alluded to above) that standardized
tests correctly assess all learners equally well. Well-established standardized
tests usually demonstrate high correlations between performance on such tests
and target objectives, but correlations are not sufficient to demonstrate
unequivocally the acquisition of criterion objectives by all
test-takers. Here is a non-language example. In the United States, some
driver’s license renewals require taking a paper-and-pencil multiple-choice test
that covers signs, safe speeds and distances, lane changes, and other “rules of
the road”. Correlational statistics show a strong relationship between high
scores on those tests and good driving records, so people who do well on these
tests are a safe bet to relicense. Now, an extremely high correlation (of
perhaps .80 or above) may be loosely interpreted to mean that a large majority
of the drivers whose licenses are renewed by virtue of their having passed the
little quiz are good behind-the-wheel drivers. What about those few who do not
fit the model? That small minority of drivers could endanger the lives of the
majority, and is that a risk worth taking? Motor vehicle registration departments
in the United States seem to think so, and thus avoid the high cost of
behind-the-wheel driving tests.
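To make the arithmetic of this argument concrete, here is a small simulation in Python with entirely made-up numbers (not real driving or licensing data): two measures correlated at roughly .80 still leave a minority of individuals who score well on one measure but rate poorly on the other.

```python
import random

random.seed(1)

# Hypothetical data: "quiz" scores and "driving safety" ratings constructed to
# be strongly but imperfectly related (correlation of about .80). This is an
# illustration only, not an analysis of real licensing statistics.
n = 1000
quiz = [random.gauss(0, 1) for _ in range(n)]
safety = [0.8 * q + 0.6 * random.gauss(0, 1) for q in quiz]

passed_quiz = [i for i in range(n) if quiz[i] > 0]          # above-average quiz scores
unsafe_but_passed = sum(1 for i in passed_quiz if safety[i] < -0.5)
print(f"{len(passed_quiz)} drivers passed the quiz; "
      f"{unsafe_but_passed} of them still rate as unsafe")
```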
Are you willing to rely on a standardized test result in the case of all
the learners in your class? Of an applicant to your institution, or of a
potential degree candidate exiting your program? These questions will be
addressed more fully in Chapter 5, but for the moment, think carefully about
what has come to be known as high-stakes testing, in which standardized
tests have become the only criterion for inclusion or exclusion. The
widespread acceptance, and sometimes misuse, of this gate-keeping role of
the testing industry has created a political, educational, and moral maelstrom.
C. Developing A Standardized Test
How are standardized tests developed? Where do test
tasks and items come from? How are they evaluated? Who selects items and their
arrangement in a test? How do such items and tests achieve consequential
validity? How are different forms of tests equated for difficulty level? Who
sets norms and cut-off limits? Are security and confidentiality an issue? Are
cultural and racial biases an issue in test development? All these questions
typify those that you might pose in an attempt to understand the process of
test development.
In the steps outlined below, three different
standardized tests will be used to exemplify the process of standardized test
design:
(A) The Test of English as a Foreign Language (TOEFL), Educational Testing Service (ETS).
(B) The English as a Second Language Placement Test (ESLPT), San Francisco State University (SFSU).
(C) The Graduate Essay Test (GET), SFSU.
The first is a test of general language ability or
proficiency. The second is a placement test at a university. And the third is a
gate-keeping essay test that all prospective students must pass in order to
take graduate-level courses. As we look at the steps, one by one, you will see
patterns that are consistent with those outlined in the previous materials for
evaluating and developing a classroom test.
1. Determine the Purpose and Objectives of the Test
Most standardized tests are expected to provide high
practicality in administration and scoring without unduly compromising
validity. The initial outlay of time and money for such a test is significant,
but the test would be used repeatedly. It is therefore important for its
purpose and objectives to be stated specifically. Let’s look at the three
tests.
(A) The purpose of the TOEFL is “to
evaluate the English proficiency of people whose native language is not
English” (TOEFL Test and Score Manual,
2001, p.9). More specifically, the TOEFL is designed to help institutions of
higher learning make “valid decisions concerning English language proficiency
in terms of [their] own requirements” (p.9). Most colleges and universities in
the United States use TOEFL scores to admit or refuse international applicants
for admission. Various cut-off scores apply, but most institutions require
scores from 475 to 525 (paper-based) or from 150 to 195 (computer-based) in
order to consider students for admission. The high-stakes, gate-keeping nature
of the TOEFL is obvious.
(B) The ESLPT is designed to place
already admitted students at San Francisco State University in an appropriate
course in academic writing, with the secondary goal of placing students into
courses in oral production and grammar-editing. While the test’s primary purpose
is to make placements, another desirable objective is to provide teachers with
some diagnostic information about their students on the first day or two of class.
The ESLPT is locally designed by university faculty and staff.
(C) The GET, another test designed at
SFSU, is given to prospective graduate students – both native and non-native
speakers – in all disciplines to determine whether their writing ability is sufficient
to permit them to enter graduate-level courses in their programs. It is offered
at the beginning of each term. Students who fail or marginally pass the GET are
technically ineligible to take graduate courses in their field. Instead, they
may elect to take a course in graduate-level writing of research papers. A pass
in that course is equivalent to passing the GET.
As you can see, the objectives of
each of these tests are specific. The content of each test must be designed to
accomplish those particular ends. This first stage of goal-setting might be
seen as one in which the consequential validity of the test is foremost in the
mind of the developer: each test has a specific gate-keeping function to
perform; therefore the criteria for entering those gates must be specified
accurately.
2. Design Test Specifications
Now comes the hard part. Decisions need to be made on how to go about structuring
the specifications of the test. Before specs can be addressed, a comprehensive
program of research must identify a set of constructs underlying the test
itself. This stage of laying the foundation stones can occupy weeks, months, or
even years of effort. Standardized tests that don’t work are often the product
of short-sighted construct validation. Let’s look at the three tests again.
(A) Construct validation for the TOEFL is carried out by the TOEFL staff at ETS under the guidance of a
policy council that works with a Committee of Examiners composed of
appointed external university faculty, linguists, and assessment specialists.
Dozens of employees are involved in a complex process of reviewing current
TOEFL specifications, commissioning and developing test tasks and items,
assembling forms of the test, and performing on-going exploratory research
related to formulating new specs. Reducing such a complex process to a set of
simple steps runs the risk of gross overgeneralization, but here is an idea of
how a TOEFL is created.
Because the TOEFL is a proficiency test, the first step in the developmental
process is to define the construct of language proficiency. First, it should be
made clear that many assessment specialists such as Bachman (1990) and Palmer
(Bachman & Palmer, 1996) prefer the term ability to proficiency and thus speak
of language ability as
the overarching concept. The latter phrase is more consistent, they argue, with
our understanding that the specific components of language ability must be
assessed separately. Others, such as the American Council on Teaching Foreign
Languages (ACTFL), still prefer the term proficiency because it connotes more of
a holistic, unitary trait view of language ability (Lowe, 1988). Most current
views accept the ability argument and therefore strive to specify and assess
the many components of language. For the purposes of consistency in this book, the
term proficiency will nevertheless be retained, with the above caveat.
How you view language will make a difference in how you assess language
proficiency. After breaking language competence down into subsets of listening,
speaking, reading and writing, each performance mode can be examined on a
continuum of linguistic units: phonology (pronunciation) and orthography
(spelling), words (lexicon), sentences (grammar), discourse, and pragmatic
(sociolinguistic, contextual, functional, cultural) features of language.
From the sea of potential performance modes that could be sampled in a
test, the developer must select a subset on some systematic basis. To make a
very long story short (and leaving out numerous controversies), the TOEFL had
for many years included three types of performance in its organizational specifications:
listening, structure, and reading, all of which tested comprehension through
standard multiple-choice tasks. In 1996 a major step was taken to include
written production in the computer-based TOEFL by adding a slightly modified
version of the already existing Test of Written English (TWE). In doing so,
some face validity and content validity were improved along with, of course, a
significant increase in administrative expense! Each of these four major
sections is capsulized in Figure 1 below. These descriptions are not,
strictly speaking, specifications, which are kept confidential by ETS.
Nevertheless, they can give a sense of many of the constraints that are placed
on the design of actual TOEFL specifications.
Figure 1: TOEFL® specifications
(B) The designing of the test specs for the ESLPT was a somewhat
simpler task because the purpose is placement and the construct validation of
the test consisted of an examination of the content of the ESL courses. In
fact, in a recent revision of the ESLPT, content validity (coupled with its
attendant face validity) was the central theoretical issue to be considered.
The major issue centered on designing practical and reliable tasks and item
response formats. Having established the importance of designing ESLPT tasks
that simulated classroom tasks used in the courses, the designers ultimately
specified two writing production tasks (one a response to an essay that
students read, and the other a summary of another essay) and one
multiple-choice grammar-editing task. These specifications mirrored the
reading-based, process writing approach used in the courses.
(C) Specifications for the GET arose out of the perceived need to
provide a threshold of acceptable writing ability for all prospective graduate
students at SFSU, both native and non-native speakers of English. The
specifications for the GET are the skills of writing grammatically and
rhetorically acceptable prose on a topic of some interest, with clearly
produced organization of ideas and logical development. The GET is a direct
test of writing ability in which test-takers must, in a two-hour time period,
write an essay on a given topic.
3. Design, Select, and Arrange Test Tasks/Items
Once specifications for a standardized test have been stipulated,
the sometimes never-ending task of designing, selecting, and arranging items
begins. The specs act much like a blueprint in determining the number and types
of items to be created. Let’s look at the three examples.
(A) TOEFL test design specifies that each item be coded for content and statistical characteristics.
Content coding ensures that each examinee will receive test questions that
assess a variety of skills (in reading, for example, comprehending the main idea or understanding
inferences) and cover a variety of subject matter without unduly biasing the
content toward a subset of test-takers (for example, in the listening section involving
an academic lecture, the content must be universal enough for students from
many different academic fields of study). Statistical characteristics,
including item response theory (IRT) equivalents of estimates of item facility
(IF) and of an item’s ability to discriminate (ID) between higher and lower
ability levels, are also coded.
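The IRT models and parameters that ETS actually uses are not public. Purely as an illustration of what an IRT “equivalent” of IF and ID looks like, the widely used two-parameter logistic (2PL) model expresses the probability of a correct response as a function of a test-taker’s ability, an item’s difficulty, and its discrimination. A minimal sketch in Python:

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL IRT model.

    theta: test-taker ability
    a:     item discrimination (the IRT counterpart of classical ID)
    b:     item difficulty (the IRT counterpart of classical item facility)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Example: an item of average difficulty (b = 0) answered by test-takers
# of low, average, and high ability.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(p_correct_2pl(theta, a=1.2, b=0.0), 2))
```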
Items are
then designed by a team who select and adapt items solicited from a bank of
items that have been “deposited” by free-lance writers and ETS staff. Probes
for the reading section, for example, are usually excerpts from authentic
general or academic reading that are edited for linguistic difficulty, culture
bias, or other topic biases. Items are designed to test overall comprehension,
certain specific information, and inference.
Consider
the following sample of a reading selection and ten items based on it, from a
practice TOEFL (Phillips, 2001, pp. 423-424):
Figure 2: Reading selection and ten items based on it,
from a practice TOEFL
As you can
see, items target assessment of comprehension of the main idea (item #11),
stated details (#17, 19), unstated details (#12, 15, 18), implied details (#14,
20), and vocabulary in context (#13, 16). An argument could be made about the
cultural schemata implied in a passage about pirate ships, and you could engage
in an “angels on the head of a pin” argument about the importance of picking
certain vocabulary for emphasis, but every test item is a sample of a larger
domain, and each of these fulfils its designated specification.
Before any
such items are released into a form of the TOEFL (or any validated standardized
test), they are piloted and scientifically selected to meet difficulty
specifications within each subsection, section, and the test overall.
Furthermore, those items are also selected to meet a desired discrimination
index. Both of these indices are important considerations in the design of a
computer-adaptive test, where performance on one item determines the next one
to be presented to the test-taker.
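As a rough illustration only (the operational TOEFL algorithm is proprietary and uses IRT-based ability estimation plus content and exposure constraints), a computer-adaptive test can be thought of as repeatedly choosing the unused item whose difficulty is closest to the current ability estimate and adjusting that estimate after each response. The item bank and update rule below are hypothetical:

```python
# A deliberately simplified computer-adaptive selection loop.
# Item difficulties and the step-size update are invented for illustration.

items = {"q1": -1.0, "q2": -0.5, "q3": 0.0, "q4": 0.5, "q5": 1.0}  # difficulty

def run_cat(answers_correctly, n_items=3):
    theta = 0.0                      # start from an average ability estimate
    remaining = dict(items)
    for _ in range(n_items):
        # pick the unused item whose difficulty is closest to the estimate
        item = min(remaining, key=lambda k: abs(remaining[k] - theta))
        difficulty = remaining.pop(item)
        correct = answers_correctly(item, difficulty)
        # crude update: move the estimate up after a correct answer, down otherwise
        theta += 0.5 if correct else -0.5
    return theta

# Example: a test-taker who answers correctly whenever the item is
# not harder than their "true" ability of 0.5.
estimate = run_cat(lambda item, difficulty: difficulty <= 0.5)
print(estimate)
```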
(B) The selection of items in the ESLPT entailed two entirely different
processes. In the two subsections of the test that elicit writing performance
(summary of reading; response to reading), the main hurdles were (a) selecting
appropriate passages for test-takers to read,
(b) providing appropriate prompts, and (c) processing data from pilot testing. Passages
have to conform to standards of content validity by being within the genre and
the difficulty of the material used in the courses. The prompt in each case (the
section asking for a summary and the section asking for a response) has to be
tailored to fit the passage, but a general template is used.
In the multiple-choice
editing test that seeks to test grammar proofreading ability, the first and
easier task is to choose an appropriate essay within which to embed errors. The
more complicated task is to embed a specified number of errors from a
previously determined taxonomy of error categories. Those error categories came
directly from student errors as perceived by their teachers (verb tenses, verb
agreement, logical connectors, articles, etc.). The distractors for each item
were selected from actual errors that students make. Items in pilot versions
were then coded for difficulty and discrimination indices, after which final
assembly of items could occur.
(C) The GET prompts are designed by a faculty committee of examiners who are
specialists in the field of university academic writing. The assumption is made
that the topics are universally appealing and capable of yielding the intended
product of an essay that requires
an organized logical argument and conclusion. No pilot testing of prompts is
conducted. The conditions for administration remain constant: two-hour time
limit, sit-down context, paper and pencil, closed-book format. Consider the
following recent prompt:
Figure 3: Graduate Essay Test, sample prompt
It is clear from such a
prompt that the problem the test-takers must address is complex, that there is
sufficient information here for writing an essay, and that test-takers will be
reasonably challenged to write a clear statement of opinion. What also emerges
from this prompt (and virtually any prompt that one might propose) is the
potential cultural effect on the numerous international students who must take
the GET. Is it possible that such students, who are not familiar with school
systems in the United States, with hiring procedures, and perhaps with the
“politics” of school board elections, might be at a disadvantage in mounting
arguments within a two-hour time frame? Some (such as Hosoya, 2001) have
strongly claimed a bias.
4. Make Appropriate Evaluations of Different Kinds of Items
The concepts of item facility (IF), item discrimination (ID), and
distractor analysis were introduced in the previous materials, where it was shown that such calculations provide
useful information for classroom tests, but sometimes the time and effort
involved in performing them may not be practical, especially if the
classroom-based test is a one-time test. Yet for a standardized multiple-choice test that
is designed to be marketed commercially, and/or administered a number of times,
and/or administered in a different form, these indices are a must.
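For reference, the classroom-style calculations behind these indices are simple: item facility is the proportion of test-takers who answer an item correctly, and item discrimination contrasts the item performance of high and low scorers on the whole test. The sketch below uses hypothetical response data; standardized programs use IRT-based analogues of these indices.

```python
def item_facility(responses):
    """IF = proportion of test-takers answering the item correctly.
    responses: list of 1 (correct) / 0 (incorrect) for one item."""
    return sum(responses) / len(responses)

def item_discrimination(item_responses, total_scores, group_fraction=1/3):
    """ID = IF(high group) - IF(low group), using the top and bottom thirds
    of test-takers ranked by their total test score."""
    ranked = sorted(zip(total_scores, item_responses), reverse=True)
    n = max(1, int(len(ranked) * group_fraction))
    high = [response for _, response in ranked[:n]]
    low = [response for _, response in ranked[-n:]]
    return item_facility(high) - item_facility(low)

# Hypothetical data: one item's responses and each test-taker's total score.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
totals = [48, 45, 44, 40, 39, 35, 33, 30, 28, 25, 22, 20]
print(item_facility(item))               # 0.5: half the group answered correctly
print(item_discrimination(item, totals)) # positive value = discriminates well
```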
For other types of response formats, namely, production responses,
different forms of evaluation become important. The principles of practicality
and reliability are prominent, along with the concept of facility. Practicality
issues in such items include the clarity of directions, timing of the test,
ease of administration, and how much time is required to score responses.
Reliability is a major player in instances where more than one scorer is employed
and, to a lesser extent, when a single scorer has to evaluate tests over long
spans of time that could lead to deterioration of standards. Facility is also a
key to the validity and success of an item type: unclear directions, complex
language, obscure topics, fuzzy data, and culturally biased information may all
lead to a higher level of difficulty than one desires.
(A) The IF, ID, and distractor efficiency statistics of the multiple-choice
items of current forms of the TOEFL are not publicly available information. For
reasons of security and protection of patented, copyrighted materials, they
must remain behind the closed doors of the ETS development staff. Those
statistics remain of paramount importance in the on-going production of TOEFL
items and forms and are the foundation stones for demonstrating the equitability
of forms. Statistical indices on retired forms of the TOEFL are available on
request for research purposes.
The essay portion of the TOEFL
undergoes scrutiny for its practicality, reliability, and facility. Special
attention is given to reliability since two human scorers must read each essay,
and every time a third reader becomes necessary (when the two readers disagree
by more than one point), it costs ETS more money.
(B) In the case of the open-ended responses on the two
written tasks on the ESLPT, a similar set of judgments must be made. Some
evaluative impressions of the effectiveness of prompts and passages are gained
from informal student and scorer feedback. In the developmental stage of the
newly revised ESLPT, both types of feedback were formally solicited through
questionnaires and interviews. That information proved to be invaluable in the
revision of prompts and stimulus reading passages. After each administration
now, the teacher-scorers provide informal feedback on their perceptions of the
effectiveness of the prompts and readings.
The multiple-choice editing
passage showed the value of statistical findings in determining the usefulness
of items and pointing administrators toward revisions. Following is a sample of the format used:
Figure 4: Multiple-choice editing passage
The task
was to locate the error in each sentence. Statistical tests on the experimental
version of this section revealed that a number of the 45 items were found to be
of zero IF (no difficulty whatsoever) and of inconsequential discrimination
power (some IDs of .15 and lower). Many distractors were of no consequence
because they lured no one. Such information led to a revision of numerous items
and their options, eventually strengthening the effectiveness of this section.
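The distractor analysis described here amounts to tallying how many test-takers chose each option; a distractor that “lured no one” shows up as a zero count and becomes a candidate for revision. A minimal sketch with hypothetical responses:

```python
from collections import Counter

def distractor_analysis(responses, options=("A", "B", "C", "D"), key=None):
    """Count how often each option was chosen for one item.
    responses: list of chosen options; key: the correct option, if known."""
    counts = Counter(responses)
    for option in options:
        marker = " (key)" if option == key else ""
        print(f"{option}{marker}: {counts.get(option, 0)}")

# Hypothetical item: option D lures no one, so it contributes nothing
# to the item and would be flagged for revision.
distractor_analysis(["B", "B", "A", "B", "C", "B", "B", "A", "B", "B"], key="B")
```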
(C) The GET, like its written counterparts in the ESLPT,
is a test of written ability with a single prompt, and therefore questions of
practicality and facility are also largely observational. No data are collected
from students on their perceptions, but the scorers have an opportunity to reflect
on the validity of a given topic. After one sitting, a topic is retired, which
eliminates the possibility of improving a specific topic, but future framing of
topics might benefit from scorers’ evaluations. Inter-rater reliability is
checked periodically, and reader training sessions are modified if too many
instances of unreliability appear.
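The GET’s actual reliability procedures are not published, but a common way to check inter-rater reliability for a two-reader essay test is to compute the correlation between the readers’ scores along with their exact and within-one-point agreement rates. A sketch with hypothetical ratings:

```python
from statistics import correlation  # available in Python 3.10+

def agreement_rates(reader1, reader2):
    """Exact and within-one-point agreement between two sets of ratings."""
    pairs = list(zip(reader1, reader2))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical GET ratings (1-4 scale) from two trained readers.
r1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
r2 = [3, 2, 3, 3, 2, 2, 4, 4, 1, 3]
print(correlation(r1, r2))      # Pearson r between the two readers
print(agreement_rates(r1, r2))  # (exact agreement, within-one-point agreement)
```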
5. Specify Scoring Procedures and Reporting Formats
A
systematic assembly of test items in pre-selected arrangements and sequences,
all of which are validated to conform to an expected difficulty level, should
yield a test that can then be scored accurately and reported back to
test-takers and institutions efficiently.
(A) Of the three tests being exemplified here, the most straightforward scoring
procedure comes from the TOEFL, the one with the
most complex issues of validation, design, and assembly. Scores are calculated
and reported for (a) three sections of the TOEFL (the essay ratings are combined
with the Structure and Written Expression score) and (b) a total score (range
40 to 300 on the computer-based TOEFL and 310 to 677 on the paper-and-pencil
TOEFL). A separate score (c) for the Essay (range 0 to 6) is also provided on
the examinee’s score record.
Figure 5: Facsimile of a TOEFL® score report
The rating
scale for the essay is virtually the same one that is used for the Test of
Written English (see Chapter 9 for details), with a “zero” level added for no
response, copying the topic only, writing completely off topic, or not writing
in English.
(B) The ESLPT reports a score for each of the essay
sections, but the rating scale differs between them because in one case the
objective is to write a summary, and in the other to write a response to a
reading. Each essay is read by two readers; if there is a discrepancy of more
than one level, a third reader resolves the difference. The editing section is
machine-scanned and machine-scored, with a total score and with part-scores for each
of the grammatical/ rhetorical sections. From these data, placement
administrators have adequate information to make placements, and teachers
receive some diagnostic information on each student in their classes. Students
do not receive their essays back.
(C) Each GET is
read by two trained readers, who give a score between 1 and 4 according to the
following scale:
Figure 6: Graduate Essay Test: Scoring Guide
The two readers’ scores are added to yield a
total possible score of 2 to 8. Test administrators recommend a score of 6 as
the threshold for allowing a student to pursue graduate-level courses.
Anything below that is accompanied by a recommendation that the students either
repeat the test or take a “remedial” course in graduate writing offered in one
of several different departments. Students receive neither their essays nor any
feedback other than the final score.
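The arithmetic of the GET decision is straightforward: each reader assigns a score from 1 to 4, the two scores are summed to a total of 2 to 8, and a total of 6 is the recommended threshold. The sketch below implements that rule; the wording of the outcomes is illustrative, not the program’s official language.

```python
def get_decision(reader1: int, reader2: int, threshold: int = 6) -> str:
    """Combine two GET ratings (1-4 each) into a 2-8 total and apply
    the recommended pass threshold of 6."""
    for score in (reader1, reader2):
        if not 1 <= score <= 4:
            raise ValueError("GET ratings must be between 1 and 4")
    total = reader1 + reader2
    if total >= threshold:
        return f"total {total}: eligible for graduate-level courses"
    return f"total {total}: retake the GET or take the graduate writing course"

print(get_decision(3, 3))  # total 6: at the recommended threshold
print(get_decision(2, 3))  # total 5: below the threshold
```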
6. Perform On-going Construct Validation Studies
From the above discussion, it should be clear that no standardized
instrument is expected to be used repeatedly without a rigorous program of on-going
construct validation. Any standardized test, once developed, must be
accompanied by systematic periodic corroboration of its effectiveness and by
steps toward its improvement. Such rigor is especially important for tests that are
produced in equated forms, that is, multiple forms must be equivalent
such that a score on a subsequent form of a test has the same validity and
interpretability as a score on the original form.
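Large testing programs use sophisticated IRT-based equating designs, but as a simple illustration of what equating involves, mean-sigma linear equating places scores from a new form onto a reference form’s scale by matching the two forms’ means and standard deviations, assuming the two groups of test-takers are comparable. All numbers below are hypothetical:

```python
from statistics import mean, stdev

def linear_equate(score, new_form_scores, reference_form_scores):
    """Map a score on a new form onto the reference form's scale by
    matching means and standard deviations (mean-sigma equating)."""
    slope = stdev(reference_form_scores) / stdev(new_form_scores)
    intercept = mean(reference_form_scores) - slope * mean(new_form_scores)
    return slope * score + intercept

# Hypothetical score distributions from two forms given to comparable groups.
new_form = [42, 48, 50, 55, 61, 47, 53, 58]
reference = [45, 52, 54, 60, 66, 51, 57, 63]
print(round(linear_equate(50, new_form, reference), 1))
```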
(A) The TOEFL program, in cooperation with
other tests produced by ETS, has an impressive program of research. Over the
years dozens of TOEFL-sponsored research studies have appeared in the TOEFL
Monograph Series. An early example of such a study was the seminal Duran
et al. (1985) study, TOEFL from a Communicative Viewpoint on Language
Proficiency, which examined the content characteristics of the TOEFL from a
communicative perspective based on current research in applied linguistics and
language proficiency assessment. More recent studies (such as Ginther, 2001;
Leacock & Chodorow, 2001; Powers et al., 2002) demonstrate an impressive
array of scrutiny.
(B) For approximately 20 years, the ESLPT
appeared to be placing students reliably by means of an essay and a
multiple-choice grammar and vocabulary test. Over the years the security of the
latter became suspect, and the faculty administrators wished to see some
content validity achieved in the process. In the year 2000 that process began
with a group of graduate students (Imao et al., 2000) in consultation with
faculty members, and continued to fruition in the form of a new ESLPT, reported
in Imao (2002). The development of the new ESLPT involved a lengthy process of
both content and construct validation, along with such practical issues
as scoring the written sections and creating a machine-scorable multiple-choice
answer sheet.
The process of on-going
validation will no doubt continue as new forms of the editing section are
created and as new prompts and reading passages are created for the writing
section. Such a validation process should also include consistent checks on
placement accuracy and on face validity.
(C) At this time there is little or no research
to validate the GET itself. For its construct validation, its administrators
rely on a stockpile of research on university- level academic writing tests
such as the TWE. The holistic scoring rubric and the topics and administrative
conditions of the GET are to some extent patterned after those of the TWE. In
recent years some criticism of the GET has come from international test- takers
(Hosoya, 2001) who posit that the topics and time limits of the GET, among
other factors, work to the disadvantage of writers whose native language is not
English. These validity issues remain to be fully addressed in a comprehensive
research study.
D. Standardized Language Proficiency Testing
Tests of language
proficiency presuppose a comprehensive definition of the specific competencies
that comprise overall language ability. The specifications for the TOEFL
provided an illustration of an operational definition of ability for assessment
purposes. It is not the only way to conceptualize proficiency, however. Swain (1990)
offered a multidimensional view of proficiency assessment by referring to three
linguistic traits (grammar, discourse, and sociolinguistics) that can be
assessed by means of oral, multiple-choice, and written responses (see
Table 1.1). Swain’s conception was not intended to be an exhaustive analysis of ability,
but rather to serve as an operational framework for constructing proficiency
assessments.
Another
definition and conceptualization of proficiency is suggested by the ACTFL
association, mentioned earlier. ACTFL takes a holistic and more unitary view of
proficiency in describing four levels: superior, advanced, intermediate, and
novice. Within each level, descriptions of listening, speaking, reading, and
writing are provided as guidelines for assessment. For example, the ACTFL
Guidelines describe the superior level of speaking as follows:
Figure
7: ACTFL speaking guidelines, summary, superior-level
The other
three ACTFL levels use the same parameters in describing progressively lower
proficiencies across all four skills. Such taxonomies have the advantage of
considering a number of functions of linguistic discourse, but the
disadvantage, at the lower levels, of overly emphasizing test-takers’
deficiencies.
Table 1.1: Traits of second language proficiency (Swain, 1990, p. 403)

The framework crosses three traits (grammar, discourse, sociolinguistic) with three methods (oral, multiple-choice, written composition):

Trait focus
- Grammar: focus on grammatical accuracy within sentences
- Discourse: focus on textual cohesion and coherence
- Sociolinguistic: focus on social appropriateness of language use

Oral method
- Grammar: structured interview, scored for accuracy of verbal morphology, prepositions, and syntax
- Discourse: story telling and argumentation/persuasion, with detailed ratings for identification, logical sequence, and time orientation, and global ratings for coherence
- Sociolinguistic: role-play of speech acts (requests, offers, complaints), scored for the ability to distinguish formal and informal register

Multiple-choice method
- Grammar: sentence-level ‘select the correct form’ exercise (45 items), involving verb morphology, prepositions, and other items
- Discourse: paragraph-level ‘select the coherent sentence’ exercise (29 items)
- Sociolinguistic: speech act-level ‘select the appropriate utterance’ exercise (28 items)

Written composition method
- Grammar: narrative and letter of persuasion, scored for accuracy of verb morphology, prepositions, and syntax
- Discourse: narrative and letter of persuasion, with detailed ratings much as for oral discourse and a global rating for coherence
- Sociolinguistic: formal request letter and informal note, scored for the ability to distinguish formal and informal register
E. Four Standardized Language Proficiency Tests
We now turn to some of the better-known standardized
tests of overall language ability, or proficiency, to examine some of the
typical formats used in commercially available tests. We will not look at
standardized tests of other specific skills here, but that should not lead you
to think, by any means, that proficiency is the only kind of test in the field
that is standardized. Standardized tests of specific skills include, for
example, the Test of Spoken English (TSE), the Oral Proficiency Interview
(OPI), and PhonePass for oral production, and the Test of Written English
(TWE) for writing.
Four commercially produced standardized tests of
English language proficiency are described briefly in this section: the TOEFL,
the Michigan English Language Assessment Battery (MELAB), the International
English Language Testing System (IELTS), and the Test of English for
International Communication (TOEIC®). The following questions can help us
evaluate these tests and their subsections:
(1) What item types are included?
(2) How practical and reliable does each subsection of each test appear to be?
(3) Do the item types and tasks appropriately represent a conceptualization of language proficiency (ability)? That is, can you evaluate their construct validity?
(4) Do the tasks achieve face validity?
(5) Are the tasks authentic?
(6) Is there some washback potential in the tasks?
Figure 8: Test of English as a Foreign Language (TOEFL®)
Figure 9: Michigan English Language Assessment Battery
(MELAB)
Figure 10: International English Language Testing System (IELTS)
Figure 11: Test of English for International
Communication (TOEIC®)
The construction of a valid standardized test is no
minor accomplishment, whether the instrument is large- or small-scale. The
designing of specifications alone requires a sophisticated process of construct
validation coupled with considerations of practicality. Then, the construction
of items and scoring/interpretation procedures may require a lengthy period of
trial and error with prototypes of the final form of the test. With painstaking
attention to all the details of construction, the end product can result in a
cost-effective, time-saving, accurate instrument. Your use of the results of
such assessments can provide useful data on learners’ language abilities. But
your caution is warranted as well.
REFERENCES:
Brown, H. Douglas. (2004). Language Assessment: Principles and Classroom Practices. New York: Pearson Education.
http://febbyeni.blogspot.co.id/2013/06/designing-classroom-language-test.html
“these tests are too crude to be used” ~ Frederick J. Kelly