Oral Proficiency Testing Bibliography

The purpose of this annotated bibliography is to compile and present pertinent and useful research in the field of Oral Proficiency Testing, ranging from foundational publications to the latest innovations and studies, from 1988 to the present. It is divided into several categories by topic and common theme for ease of use: (1) Overviews and background; (2) Validity and validation studies; (3) Test and task design; (4) Oral proficiency assessment development; (5) Interlocutor and examinee characteristics; (6) Raters and interviewers; (7) Implementation and use; (8) Oral proficiency testing and curriculum; and (9) Technology. Throughout the bibliography, the American Council on the Teaching of Foreign Languages will be referred to by the acronym ACTFL, and the Oral Proficiency Interview will be referred to as the OPI. This annotated bibliography was developed by AELRC research interns and staff at the Center for Applied Linguistics, including Margaret Anne Rowe, Amy Kim, Randi Dermo, Sonya Park, and John Wildes.

Please cite the use of the bibliography as follows: 

AELRC. (2018). Oral proficiency testing annotated bibliography. Washington, DC: Assessment and Evaluation Language Resource Center.

Overviews and Background

These publications provide background about oral proficiency testing and research, as well as overviews of various oral proficiency testing issues.

Chalhoub-Deville, M. (1995). A contextualized approach to describing oral language proficiency. Language Learning, 45(2), 251-281.

Keywords: construct validity, interview, oral proficiency, rater, rating, speaking

Chalhoub-Deville investigates the second language (L2) oral proficiency construct on the premise that both task and rater are influential factors impacting interpretation of proficiency. Her study incorporated a variety of tasks including an interview, a narration, and a read-aloud. Six university students enrolled in an intermediate level MSA (Modern Standard Arabic) course provided the speech samples, while eighty-two raters, all native speakers of Arabic but from different backgrounds and locations, furnished proficiency ratings. She used multidimensional scaling analyses to derive dimensions underlying the holistic oral proficiency ratings given for each of the three tasks. Findings from this study affirm that the nature of the L2 oral construct is not constant but context-specific. Chalhoub-Deville advises proficiency researchers against employing generic dimensions and recommends they empirically derive dimensions based on the specific elicitation task and audience.

Chalhoub-Deville, M., & Fulcher, G. (2003). The Oral Proficiency Interview: A research agenda. Foreign Language Annals, 36(4), 498-506.

Keywords: ACTFL, OPI, oral proficiency, reliability, speaking, validity

In their article, Chalhoub-Deville and Fulcher call for ACTFL to implement a systematic research agenda to provide supporting evidence for the interpretations and uses of OPI ratings. The authors point out several high-priority issues upon which the research could focus: validity and reliability, purpose, interview talk, rater behavior, native speaker criterion, and classroom impact.

Liskin-Gasparro, J. E. (2003). The ACTFL proficiency guidelines and the Oral Proficiency Interview: A brief history and analysis of their survival. Foreign Language Annals, 36(4), 483-490.

Keywords: ACTFL, foreign language education, OPI, oral proficiency, summary

After providing a brief description of how the ACTFL Proficiency Guidelines and the OPI emerged at the forefront of foreign language education, Liskin-Gasparro touches on the controversies surrounding the Guidelines and the OPI. Two major areas of criticisms identified are the Guidelines’ questionable validity claims and ACTFL’s overreaching assertions that the OPI can measure future, real-life language performance. In spite of these shortcomings, the influence of the Guidelines and the OPI has continued to expand, and Liskin-Gasparro names the advantages they offer to the areas of teaching, testing, and research as the reasons behind this phenomenon.

Malone, M. E. (2003). Research on the Oral Proficiency Interview: Analysis, synthesis, and future directions. Foreign Language Annals, 36(4), 491-497.

Keywords: inter-rater reliability, OPI, oral proficiency

In her article, Malone provides an overview of the research conducted on the ACTFL OPI, which has been diffuse in the field of second language acquisition since its initial publication in 1982. She does this by following the development of research trends, discussing both empirical studies and critical analyses, as well as their limitations. Of the many broad patterns identified, the author observed a theme of criticism against the OPI that focused on issues of validity and inter-rater reliability. Stemming from her investigation, Malone offers six recommendations for future research on the OPI: (1) conducting ongoing reliability studies; (2) replicating existing studies; (3) looking at more languages; (4) examining the rater training process; (5) analyzing research on similar tests; and (6) increasing research interaction.

Malone, M. E., & Montee, M. J. (2010). Oral proficiency assessment: Current approaches and applications for post-secondary foreign language programs. Language and Linguistics Compass, 4(10), 972-986.

Keywords: computer-based, OPI, oral proficiency, outcomes, post-secondary

In this paper, Malone and Montee review the current state of oral proficiency assessment and touch on the implications for program outcomes at the post-secondary level. They discuss the strengths and limitations of the OPI and examine other new approaches, including computer-based assessments. They affirm that, while these technology-based approaches may mitigate some of the time and cost burdens associated with the OPI and provide programs with a greater capacity to conduct formative assessments, these alternative approaches also come with their own set of unique challenges. The authors advocate for both the increase of oral proficiency assessment in alignment with program goals and for better documentation of proficiency outcomes to help stakeholders set realistic expectations.

Malone, M. E., Rifkin, B., Christian, D., & Johnson, D. E. (2004). Attaining high levels of proficiency: Challenges for language education in the United States. Journal for Distinguished Language Studies, 2, 67-88.

Keywords: ACTFL, foreign language education, heritage speakers, high-level learners, LCTLs, oral proficiency

In this paper, Malone et al. call attention to the lack of U.S. graduates who achieve high proficiency in a foreign language, despite the high demand for such proficiency in many sectors, such as business, diplomacy, and the military. After pointing out a number of successes and challenges with current approaches, the authors go on to discuss several ways in which U.S. foreign language education, particularly with regard to Less Commonly Taught Languages (LCTLs), could be improved. Their recommendations include incentivizing school districts to implement solid learning sequences at lower grades, providing incentives for students to seek high proficiency levels, supporting heritage language learning, and researching teaching methods to ascertain their effectiveness. On the whole, Malone et al. urge that U.S. foreign language education be transformed and stress that sufficient planning and funding are critical in order to foster a language-proficient population that includes high-level speakers of LCTLs.

Swender, E. (2003). Oral proficiency testing in the real world: Answers to frequently asked questions. Foreign Language Annals, 36(4), 520-526.

Keywords: ACTFL, OPI, oral proficiency, undergraduate students

Given the pervasiveness of the ACTFL OPI in terms of users and uses, it is hardly surprising that there are many questions about it. The current article seeks to answer the following frequently asked questions: (1) Does taking an OPI over the phone produce a different rating than a face-to-face interview? (2) Are there differences in testing performance from one testing occasion to another when there is no significant opportunity for learning or forgetting between the two tests? (3) How proficient are today’s foreign language undergraduate majors? (4) What minimum levels of proficiency are required in the workplace? After explaining each question, Swender proceeds to respond to each in turn. The first and second questions are answered by a study that compared face-to-face and telephonic interviews. The findings of that study demonstrate that testing modality does not seem to greatly impact test scores, and also that close proximity retests remain consistent with the initial test scores. The answers to the third and fourth questions originate from an analysis of data drawn from the ACTFL Test Archives. The data shows that most undergraduate language majors attain proficiency levels ranging from Intermediate-High to Advanced-Low, and also that different jobs have different standards for minimum proficiency.

Tsagari, D., & Csepes, I. (Eds.). (2012). Language testing and evaluation: Vol. 26. Collaboration in language testing and assessment. Frankfurt, Germany: Peter Lang.

Keywords: CEFR, computer-based, outcome-based assessment, self-assessment, validity

This book is divided into fourteen chapters, each containing the work of various researchers in the field of second language acquisition. The chapters deal with an array of language testing issues in a European context. Such issues include: working with the Common European Framework of Reference for Languages (CEFR), test requirements, test validity, test development, computerized language testing, teaching to language outcomes, and understanding the results of self-assessment. 

See also in Interlocutor and Examinee Characteristics:

Watanabe, S. (2003). Cohesion and coherence strategies in paragraph-length and extended discourse in Japanese Oral Proficiency Interviews.

See also in Oral Proficiency Assessment: Development:

Yan, X., Maeda, Y., Lv, J., & Ginther, A. (2015). Elicited imitation as a measure of second language proficiency: A narrative review and meta-analysis.

Validity and Validation Studies

The publications in this category investigate the concept and establishment of validity in language testing.

Bachman, L. F. (1988). Problems in examining the validity of the ACTFL Oral Proficiency Interview. Studies in Second Language Acquisition, 10(2), 149-164.

Keywords: ACTFL; OPI; test method facets; validity 

Bachman is concerned with the interpretations and uses of the ACTFL OPI for measuring communicative language ability. He cautions against misinterpretation of the scores and argues that two aspects of the current procedure make validation difficult: (1) the confounding effect between traits and test methods in both the interview design and the interpretation of the ratings; and (2) the provision of a single rating that has no theoretical or empirical basis. In order to enhance the validity of the OPI, Bachman suggests that abilities be clearly distinguished from test methods, and that ACTFL take an active role in ensuring validity during the test development phase and in identifying the appropriate use of the oral interview.

Dandonoli, P., & Henning, G. (1990). An investigation of the construct validity of the ACTFL proficiency guidelines and oral interview procedure. Foreign Language Annals, 23(1), 11-22.

Keywords: ACTFL; communicative language proficiency; construct validity; OPI

This article provides research results regarding the construct validity of the ACTFL Proficiency Guidelines and oral interview procedure. The authors conducted a multitrait-multimethod validation analysis of data from tests of speaking, writing, listening, and reading in French and English as a Second Language. Findings from the study generally present strong support for using the Guidelines as a basis in the development of proficiency tests, as well as for the OPI’s reliability and validity.

Fulcher, G. (1996). Invalidating validity claims for the ACTFL oral rating scale. System, 24(2), 163-172.

Keywords: ACTFL; construct validity; OPI; rating scales

In this paper, Fulcher critiques the studies conducted by Dandonoli and Henning (1990) and Henning (1992), which purportedly establish construct validity for the ACTFL Proficiency Guidelines. He outlines each part of Dandonoli and Henning’s 1990 study pertaining to the data from the English tests and points out flaws in the design of the study and misinterpretations or overgeneralizations in the analysis of the results. Fulcher also presents the results from a maximum likelihood (ML) confirmatory factor analysis, which he says should have been the follow-up to Dandonoli and Henning’s study. In the end, Fulcher argues that empirical evidence for the ACTFL scale is still lacking and stresses the importance of considering construct validity in the initial stages of test development, not post hoc.

Johnson, M. (2001). The art of non-conversation: A reexamination of the validity of the Oral Proficiency Interview. New Haven, CT: Yale University Press.

Keywords: communicative language proficiency; discourse; interaction; OPI; validity

Johnson’s book addresses the question of the OPI’s validity as an instrument for assessing oral language proficiency. Johnson analyzes the three-stage structure of the OPI, discussing whether each stage and the interview as a whole fit the criteria to be considered a “conversation,” as proponents of the OPI claim it to be. The author concludes that the OPI’s communicative speech event is a highly controlled research survey sandwiched in between less formal sociological interviews but is certainly not a conversation. She proposes a new model for assessing spoken language which focuses on context-specific interactional competence à la Vygotsky.

Magnan, S. S. (1988). Grammar and the ACTFL Oral Proficiency Interview: Discussion and data. Modern Language Journal, 72(3), 266-276.

Keywords: grammar; OPI; oral proficiency; rating; validity

Magnan initiates serious empirical research on the influence of grammatical accuracy on the OPI with this paper. Using data from a larger study, she conducted a post hoc analysis of forty transcribed OPIs of university-level French language students, rated at levels Novice Mid through Advanced Plus, to investigate the relationship between proficiency level and frequency of grammatical error. The following seven categories were selected for consideration: (1) verb conjugation; (2) tense/mood; (3) determiners; (4) adjectives; (5) prepositions; (6) object pronouns; and (7) relative pronouns. The results indicate that the OPI is sensitive to grammatical accuracy in the aforementioned categories and that a significant relationship exists between grammatical accuracy and proficiency level.

Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2), 99-123.

Keywords: a posteriori validation; a priori validation; concurrent validity; correlation; Hebrew; OPI; SOPI 

This paper argues that the common conceptualization of concurrent validity, often captured by a correlation between two tests, is unwarranted. In order to obtain more convincing evidence, a variety of qualitative analyses of direct and semi-direct tests of Hebrew (the OPI and the Simulated Oral Proficiency Interview, or SOPI, respectively) were performed. First, the analysis of the elicitation tasks (a priori validation) found that the two tests differ in the range of speech functions and topics, resulting in a bias towards certain proficiency levels. Second, the textual analysis of speech samples (a posteriori validation) showed that learners use different features of language depending on the testing context and elicitation modes. These results suggest that the two tests are not interchangeable; therefore, care must be taken when deciding which test to use. The author concludes with a claim that multiple perspectives should be employed for test validation.

Stansfield, C. W., & Kenyon, D. (1992). The development and validation of a Simulated Oral Proficiency Interview. Modern Language Journal, 76(2), 129-141.

Keywords: Indonesian; reliability; SOPI; validity 

The Simulated Oral Proficiency Interview (SOPI) is a tape-mediated oral proficiency assessment designed by the Center for Applied Linguistics (CAL) and is available in a variety of languages, including Chinese, Japanese, and Russian.  This article reports on one of the SOPIs, the Indonesian Speaking Test (IST), describing its development and the testing it underwent to ensure achievement of its objectives. Using correlations and generalizability theory (G theory), the authors estimated the validity and reliability of the IST. The results of the analyses indicate that the IST is a reliable assessment which can be given as an appropriate surrogate measure in place of the OPI. 

See also in Technology:

Fall, T., Adair-Hauck, B., & Glisan, E. (2007). Assessing students’ oral proficiency: A case for online testing. Foreign Language Annals, 40(3), 377-406.

Test and Task Design

These publications focus on the general design features of various tests and tasks, as well as the implications of these design features for oral proficiency outcomes.

Ahmadi, A., & Sadeghi, E. (2016). Assessing English language learners’ oral performance: A comparison of monologue, interview, and group oral test. Language Assessment Quarterly, 13(4), 341-358.

Keywords: oral assessment; English as a second language–ESL; test method facets

The authors of this article investigate how a test’s format affects oral test-takers’ performance with regard to test scores and discourse features, as well as how these scores relate to the features. Twenty-four EFL learners participated in a monologue, an interview, and a group oral test, all scored for accuracy, fluency, and syntactic complexity. Results indicated that students did best on the group test, followed by the monologue and interview, but that differences were generally not significant. While students’ most accurate productions occurred during the group test, their longest, most complex productions occurred during the monologue. The authors speculate, following Robinson (2001, 2005) and Gan (2012), that this is due to communicative pressure, or a lowering of production complexity due to test-takers’ increased attention to their peers in interactive settings. The results also showed a general relation between test scores and accuracy. Finally, the authors discuss implications for language assessment.

Clifford, R. (2016). A rationale for criterion-referenced proficiency testing. Foreign Language Annals, 49(2), 224-234.

Keywords: assessment; criterion-referenced; listening; proficiency; reading

In this article, Clifford warns of the dangers of using a norm-referenced (NR) scoring procedure to score a criterion-referenced (CR) test. NR tests compare all test-takers from highest to lowest score. In contrast, CR tests compare individual test-takers against clear criteria, such as the ACTFL proficiency guidelines, which allows them to be rated in multiple domains. Because language students begin developing higher-level skills before achieving mastery of those in the level below, it is important that ratings take into account floor and ceiling levels, or students’ patterns of strengths and weaknesses. While the ACTFL speaking and writing sections have been assessed using CR criteria, the listening and reading portions of the test have been norm-rated. These single, average scores may have been influenced or diluted by a student’s performance in certain areas, misrepresenting students’ proficiency and lowering the test’s validity. To conclude, the author offers examples of why combining scoring methods does not work and then makes several suggestions for more accurately determining proficiency.

Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19(4), 347-368.

Keywords: difficulty; examinee attitude; oral proficiency; performance assessment; speaking

Elder et al. conducted a study in which task characteristics and performance conditions were manipulated in order to examine the effects on perceived levels of task difficulty on a speaking test. While performing a series of narrative tasks, test takers documented their perceptions of how difficult each task was, as well as their attitudes toward the tasks. The results of this study offer little support for Skehan’s (1998) framework in the context of oral proficiency assessment. In addition, the results call into question the reliability of test takers’ post hoc estimates of task difficulty.

Gaillard, S., & Tremblay, A. (2016). Linguistic proficiency assessment in second language acquisition research: The elicited imitation task. Language Learning, 66(2), 419–447.

Keywords: elicited imitation; proficiency assessment; second language; French

Following a literature review of previous studies on the validity and theoretical underpinnings of the elicited imitation task (EIT), the authors explain their own research exploring the validity and reliability of the EIT for measuring proficiency in an L2. EIT raters evaluated adult test-takers’ French-language speech based on meaning, syntax, morphology, vocabulary, and pronunciation. The authors then examined the relationship between EIT scores and test-takers’ scores on a written proficiency measure (a cloze test), arguing that because high performance on both tests requires sufficient lexical and grammatical knowledge, performance should co-vary between the two. Following their analysis, the authors conclude that the EIT can be valid given the strong relationship between learners’ EIT and cloze test scores, as well as between EIT scores and both learners’ class level and self-rated speaking and listening proficiencies.

Kasper, G., & Ross, S. J. (2007). Multiple questions in oral proficiency interviews. Journal of Pragmatics, 39(11), 2045-2070.

Keywords: discourse; interaction; interlocutor; OPI; rater; rater training 

In this paper, Kasper and Ross investigate what multiple questions (MQs) achieve in the OPI. The authors discuss two types of MQs, reactive (vertical) and proactive (horizontal), and how they help examinees produce ratable spoken responses in the interview setting. Their findings expose the variance in approaches to questioning between interviewers, once again bringing forth the issue of construct validity for the OPI. It is evident that an interviewer’s questioning strategy and overall management of examinee comprehension has the potential to impact the examinee’s responses as well as the scores assigned by raters who follow strict protocols to purely assess performance. In light of these findings, Kasper and Ross call for an increase in awareness about the influence of interviewer questioning style on the OPI and its outcomes in order to incite changes to interviewer and rater training/certification, beginning with a shift from the sole emphasis on examinee speech production.

Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing, 31(2), 177-204.

Keywords: communicative language proficiency; construct validity; difficulty; discourse; interaction; rating

Leaper and Riazi examine the effect that prompts have on discourse in group oral tests. They collected and analyzed data from one hundred and forty-one Japanese university students of English who were divided into groups of three or four for end-of-year oral proficiency assessments conducted in group oral test format and recorded on video. Each group received one of four prompts to center their discourse around. These prompts were created with the intention of keeping them all at the same level of difficulty, but the authors found that the four prompts each elicited significantly different responses, despite the lack of any significant differences in the scores. Certain prompts elicited longer, more complex speaking turns from each participant, while others elicited shorter, less complex turns. The results of this study provide strong supporting evidence that prompts have a significant impact on the shape and linguistic complexity of discourse in group oral tests. The implications for test development and scoring are substantial. Leaper and Riazi stress the importance of construct identification, in addition to crafting versions of prompts that elicit similar interactions, in the design of tests. Their results indicate that a prompt asking test takers to elaborate on their context is ideal for assessing extended discourse ability, while a more factual prompt is useful for gauging interactive skills. As for scoring, the findings highlight the need for rating descriptors to functionally represent the features of the oral discourse to be produced and for raters to receive specific training regarding the features they will be assessing.

Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning effects on paired oral test performance. Language Testing, 31(2), 147-175.

Keywords: co-construction; conversation analysis; interactional competence; paired assessment; pre-task planning; task-based language assessment

Pre-task planning purports to establish a fair environment for test takers in controlling the level of cognitive demand produced by what may be unfamiliar topics, in turn ensuring that test takers perform at their maximum level. The goal of the current study is to examine the effect of pre-task planning in a paired format. Thirty-two English majors at a Japanese university participated in the study and were formed into sixteen pairs. Each pair took a speaking test, which consisted of a warm-up task and two decision-making tasks under two different conditions: with pre-task planning time and without. The findings of the study suggest complex relationships between test scores and discourse analytical measures. Furthermore, the results of the study demonstrate that while pre-task planning appeared to initially improve test takers’ fluency and complexity, it proved to be deleterious to the speed of fluency, thus potentially inhibiting the successful demonstration of interactional abilities.

Sasayama, S. (2016). Is a ‘complex’ task really complex? Validating the assumption of cognitive task complexity. The Modern Language Journal, 100, 231-254.

Keywords: Task-Based Language Teaching (TBLT); materials development; proficiency; oral communication

This study adopts diverse methods from cognitive psychology in order to measure cognitive task complexity in TBLT, in contrast to the traditional but unvalidated assumption that complexity can simply be inferred. After a review of previous studies about determining cognitive load, Sasayama discusses her four story-telling tasks, which were presented at the same time as a color change detection task. Following the tasks, the participants then estimated how long the tasks took them, completed task difficulty and mental effort self-assessments, and took a cloze test. The author found that language proficiency interacted with tasks as well as measures of cognitive load. Participants at different levels devoted distinct amounts of resources to the same tasks, with the surprising result that high-proficiency participants were as slow or slower at detecting the color change as the low-proficiency participants. However, the only statistically significant difference of cognitive complexity was between the first and last task, which were designed to be the simplest and most complex, respectively. The findings underscore the need to validate the assumptions about cognitive task complexity before investigating its effects on participants’ performance and learning. To conclude, Sasayama discusses her tasks’ complexity, the impact of her methodology on results, and implications and limitations of her work.

See also in Oral Proficiency Assessment: Development:

Yan, X., Maeda, Y., Lv, J., & Ginther, A. (2015). Elicited imitation as a measure of second language proficiency: A narrative review and meta-analysis.

Oral Proficiency Assessment: Development

This category focuses on the development of specific oral proficiency assessments.

Bowden, H. W. (2016). Assessing second-language oral proficiency for research: The Spanish elicited imitation task. Studies in Second Language Acquisition, 38(4), 647-675.

Keywords: elicited imitation; external validity; internal reliability; oral proficiency; Spanish

The core premise of Bowden’s article is that second language (L2) oral proficiency remains underrepresented in current second language acquisition research in terms of the attempts made to measure it in a standardized and consistent manner. To address this gap, Bowden employs the Spanish “elicited imitation task,” or EIT (Ortega, Iwashita, Rabie, & Norris, 1999), for the purpose of reliably and efficiently measuring oral proficiency in the second language. The author administered the EIT to a group of thirty-seven L2 learners of Spanish representing a wide range of experience levels. The results of the study provide strong evidence in support of the EIT as a valuable and effective assessment tool for measuring L2 oral proficiency in terms of its external validity, internal reliability, discriminatory power, and ability to cluster subgroups of learners together.

Cox, T. L., Brown, J., & Burdis, J. (2015). Exploring proficiency-based vs. performance-based items with elicited imitation assessment. Foreign Language Annals, 48(3), 350-371.

Keywords: best practices; oral proficiency; program monitoring and assessment; Russian

In light of the rising popularity of elicited imitation (EI) assessments and the high correlation found between OPI/OPI-computer ratings and EI scores, Cox et al. set out to study the effects of EI assessments on learners, specifically the differences in item difficulty and test scores between proficiency- and performance-based EI assessments. They created two EI instruments: one measuring general proficiency using items from a wide, content-general domain and another measuring language for specific purposes (LSP) performance using items from a content-specific domain. The results of the study indicate that the two approaches assess different constructs and therefore cannot be used interchangeably, although these aspects of language ability converge at higher skill levels. Thus, it is the aspect(s) of language ability to be evaluated (i.e., general proficiency vs. LSP) that ought to determine which type of EI assessment is used (i.e., proficiency-based, performance-based, or both). Additionally, the authors discovered that the item difficulty of general domain items was significantly higher than that of LSP items. This result suggests that domain-specific items are more appropriate for performance assessment than for general proficiency assessment.

Hoekje, B., & Linnell, K. (1994). “Authenticity” in language testing: Evaluating spoken language tests for international teaching assistants. TESOL Quarterly, 28(1), 103-126.

Keywords: authenticity; international teaching assistants (ITAs); oral proficiency; performance assessment; validity 

With increased legislation regarding international teaching assistants’ (ITAs) English language fluency but no consensus regarding appropriate language proficiency assessment methods, Hoekje and Linnell present an evaluation of three oral proficiency assessment tools using Bachman’s (1990, 1991) framework of language testing, specifically his standard of “authenticity.” The TSE/SPEAK (Test of Spoken English/Spoken Proficiency English Assessment Kit) test, the OPI, and their own IP (Interactive Performance) test were each examined to assess their merit as testing instruments for non-native speaker (NNS) ITAs. Findings from the study indicate that the IP test comes closest to simulating the target-use context, giving candidates the opportunity to demonstrate discourse, sociolinguistic, and strategic competence, as well as grammar skills. This correlation is significant according to Hoekje and Linnell, who emphasize the critical need for language tests to be not only statistically valid but also authentic with regard to situational specificity.

LeBlanc, L. B. (1997). Testing French teacher certification candidates for speaking ability: An alternative to the OPI. The French Review, 70(3), 383-394.

Keywords: oral proficiency; rater; reliability; speaking; tape-mediated

Florida’s Department of Education initiated the development of the Initial Teacher Subject Area Test (ITSAT) French Speaking test in an effort to set higher standards for foreign language teacher certification in the state. The goal was to create a test comparable to the OPI but more suited to the state’s particular needs for assessing oral proficiency. Major advantages of this tape-mediated test include low cost, simultaneous testing of multiple candidates, and relatively easy and efficient test administration. Moreover, results from inter-rater and intra-rater analyses indicate a high degree of reliability in the scoring procedures. LeBlanc suggests that the application of this test could extend to foreign language teaching certification procedures in other states, exit examinations for foreign language programs in universities, and language certification processes for jobs in the private sector.

Malone, M. E. (2000). Simulated Oral Proficiency Interviews: Recent developments. ERIC Digest. Washington, DC: ERIC Clearinghouse on Languages and Linguistics.

Keywords: COPI; rater training; SOPI; test administration

This article discusses the nature of the Simulated Oral Proficiency Interview (SOPI) and several of its key aspects, including its administration, application, and rater training. Malone also covers new directions in technological advancements, specifically that of the Computerized Oral Proficiency Interview (COPI), which has been designed to adapt to examinees’ oral proficiency level. Overall, the article suggests that the SOPI is a reliable instrument that can be administered with ease. It also suggests that further developments in technology applied to test administration and rater training hold the potential to advance semi-direct approaches to performance testing. 

Salaberry, R. (2000). Revising the revised format of the ACTFL Oral Proficiency Interview. Language Testing, 17(3), 289-310.

Keywords: ACTFL; OPI; reliability; validity 

Despite revisions to both the ACTFL Guidelines and the Tester Training Manual, the two documents have remained nearly verbatim since the Oral Proficiency Interview (OPI) proposal in 1986. In his article, Salaberry discusses several ways in which the ACTFL OPI can be improved for better validity and reliability without affecting any of its foundational tenets. The author focuses on three factors that he claims are critical for achieving a thorough evaluation of second language proficiency performance tests: (1) the assessment of professional standards and accountability, social and educational values, and legal consequences; (2) the analysis of the theoretical construct of proficiency as represented in the OPI; and (3) the identification and evaluation of aspects of the OPI that might be improved following a critical assessment of factors (1) and (2). Salaberry therefore offers a series of recommendations for minor structural changes to the OPI that would address serious issues and concerns related to the validity and reliability of the speaking test. These amendments include employing more types of interactional formats (e.g., small-group discussion or playing a game) in oral performance tests, accounting for the “educated native speaker” prejudice by identifying valid target norms, and reconsidering how weight is assigned to each category of overall competence, among several others.

Yan, X., Maeda, Y., Lv, J., & Ginther, A. (2015). Elicited imitation as a measure of second language proficiency: A narrative review and meta-analysis. Language Testing. Advance online publication. doi: 10.1177/0265532215594643.

Keywords: construct validity; elicited imitation; systematic review; task features

This study presents a two-phase systematic review of elicited imitation (EI) for the purpose of clarifying the construct and usefulness of EI in second language research. Phase I provides a review of the history and current state of EI, highlighting its popularity in the 1970s and 1980s at the time of its introduction and the new wave of interest in recent years. This narrative review serves as the theoretical basis for the quantitative meta-analysis in Phase II, through which Yan et al. found that EI is a valid measure of language proficiency capable of distinguishing between speakers across proficiency levels. The findings also suggest that the manipulation of certain task features can influence the sensitivity of the EI instrument. According to the results of this study, EI has greater sensitivity when it measures global constructs (rather than discrete grammatical knowledge or skills), uses sentences with varying length (as opposed to sentences with fixed length), and employs a more refined rating scale. 

Interlocutor and Examinee Characteristics

These publications investigate the characteristics of examinees and interlocutors in terms of how they affect oral proficiency outcomes.

Allen, D. (2016). Investigating washback to the learner from the IELTS test in the Japanese tertiary context. Language Testing in Asia, 6(7).

Keywords: consequential validity; validation; washback; IELTS; Japan

Allen administered two IELTS tests at a Japanese university, then issued surveys to the 190 participating students to collect data on their test preparation strategies. He then followed up by interviewing 19 participants to investigate factors mediating washback. Test results showed significant increases in speaking ability on the second test, with more significant increases in speaking and listening among participants who reported preparing more intensely. The interviews revealed washback to be shaped by a complex array of factors related to learners’ perceptions and their access to resources, which were themselves shaped by sociocultural and educational contexts. Results indicated that washback ultimately had a positive effect on language proficiency and test preparation studies, especially with regard to speaking proficiency. However, Allen cautions that mediating factors must be addressed in order to ensure positive washback across the board.

Butler, Y. G., & Zeng, W. (2014). Young foreign language learners’ interactions during task-based paired assessments. Language Assessment Quarterly, 11(1), 45-75.

Keywords: elementary school; interactional competence; paired assessment; task-based language assessment 

This article discusses the evaluation of foreign language abilities using task-based language assessments (TBLAs) to examine interactional competence among elementary students. Upon studying the interaction patterns of fourth- and sixth-grade students, the authors found that the paired assessment method works better for sixth graders than fourth graders. Since interactional functions like mutual topic development and turn-taking are not yet fully developed in young learners, the full advantage of employing task-based paired assessments is difficult to achieve with students younger than ten years old. 

Cox, T. L. (2017). Understanding intermediate-level speakers’ strengths and weaknesses: An examination of OPIc tests from Korean learners of English. Foreign Language Annals, 50(1), 84-113.

Keywords: English as a foreign/second language; oral proficiency

In this paper, Cox explores which factors contribute to Intermediate-level speakers’ final rating on the OPI. He notes that many learners at this level fossilize and do not learn to produce the extended discourse necessary to progress to the Advanced level. Because of this, instructors should understand the developmental stages of language learning, in conjunction with ACTFL proficiency guidelines, to better support their students’ growth. Looking at 300 OPIc exams scored at the Intermediate level, Cox finds that test takers did consistently better at each sub-level, but never enough to reach Advanced. While students need to improve their speaking across the board in order to progress to Advanced, Cox notes that Intermediate-Mid and Intermediate-High speakers particularly need to improve their grammar and sentence structure, their sentence length, and their discourse organization. He then lists important milestones in language development for instructors to look for and discusses which tasks were easiest and hardest for test takers. Finally, he offers suggestions for helping Intermediate students progress to Advanced.

Huang, S. (2015). Understanding learners’ self-assessment and self-feedback on their foreign language speaking performance. Assessment & Evaluation in Higher Education, 41(6), 803-820. doi: 10.1080/02602938.2015.1042426.

Keywords: self-assessment; speaking

This study explores self-assessment and self-feedback on oral language performance among freshman EFL (English as a Foreign Language) students at a Taiwanese university. Huang designed a self-assessment/self-feedback task in which students listened to, transcribed, and analyzed their own recorded speaking samples from a test administered in the first of a two-semester sequence of required English courses. In their self-feedback, students identified discrepancies, discussed feed up, feedback, and feed forward points, and examined performance at the task, self, process, and self-regulation feedback levels. The author found that the students’ analyses and self-feedback were, by and large, comprehensive and enlightening, beyond the scope of what most teachers can give in instructor feedback. These findings suggest the potential of self-assessment/self-feedback to impact speaking performance among college-level FL students.

Kang, O., & Moran, M. (2014). Functional loads of pronunciation features in nonnative speakers’ oral assessment. TESOL Quarterly, 48(1), 176-187.

Keywords: oral proficiency; pronunciation; rating 

As part of a larger study, Kang and Moran investigated how phonological errors impact assessment of non-native speakers’ (NNSs) oral proficiency. They analyzed one hundred and twenty speaking responses from Cambridge ESOL tests using the functional load (FL) approach in order to identify the segmental features of pronunciation that distinguish CEFR (Common European Framework of Reference for Languages) speaking levels, specifically from B1 to C2. Findings from the study shed light on pronunciation features categorized as high FL versus low FL and their dynamics across the proficiency levels, with implications for both pedagogy and assessment. The results also support the World Englishes viewpoint, in which phonological “errors” are regarded as deviations from native speaker norms (Kenkel & Tucker, 1989) as opposed to inaccuracies or mistakes.

Mohan, B. (1998). Knowledge structures in Oral Proficiency Interviews for international teaching assistants. In R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 173-204). Amsterdam, Netherlands: John Benjamins.

Keywords: discourse; functional linguistics; international teaching assistants (ITAs); OPI; question-and-answer interaction

Although the OPI is a widely used assessment for oral proficiency, this article claims that it falls short of its objectives by not connecting its rating criteria to functional linguistic theory on discourse. To investigate this issue, Mohan studies a selection of OPIs given to eight international teaching assistants (ITAs) and analyzes their interview discourse through the lens of knowledge structures. Specifically, Mohan points out semantic and lexico-grammatical features in the OPI excerpts to demonstrate how the ITAs build, differentiate, and communicate four subtypes of knowledge structures: procedures, schedules, analytic rules, and empirical generalizations. By highlighting the pervasiveness of knowledge structure construction—and co-construction—in interview discourse and thereby showing how these knowledge structures can be used to appropriately analyze oral dialogue, Mohan uses his findings to argue for a more functional view of language to be used with the OPI. The implications of this proposed shift would involve raters accounting for the larger patterns of meaning being constructed and co-constructed in discourse. 

Nakatsuhara, F. (2011). Effects of test-taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483-508.

Keywords: co-construction; conversational styles; extraversion; group oral tests; group size 

Group formats, both in low- and high-stakes tests, have become very popular in the past several decades. Although there are certain advantages to a candidate interacting with two or more candidates while being assessed, due to the difficulty in separating individual scores from the other group members it has been recommended that a group format should only be used as one segment in a battery of tests, as opposed to a sole means of assessment (Bonk & Ockey, 2003; Van Moere, 2006, 2007). The current study employs conversation analysis methods to focus on the impact of two test-taker characteristics—extraversion level and oral proficiency level—on group oral discourse between groups of three and groups of four. After analyzing the video recordings of two hundred and sixty-nine participants, Nakatsuhara found that test-taker characteristics affected the two group sizes in differing degrees. Student extraversion levels had a greater impact on groups-of-four discourse, while student oral proficiency held greater influence on groups-of-three discourse. As a result, the impact of group members’ characteristics on test-taker performance in group discourse ought to be accounted for in test design. For instance, test designers seeking to mitigate the effect of individual extraversion levels in group discourse should consider grouping in threes to allow test-takers to display their communication ability.

O’Loughlin, K. (2002). The impact of gender in oral proficiency testing. Language Testing, 19(2), 169-192. 

Keywords: discourse; gender; IELTS; interaction; rater

A small number of previous research studies have focused on the possibility of a gender effect on the ratings of oral proficiency interviews, and of those, most provide evidence that there exists some type of gender effect on test scores. Interestingly, the effects are not always the same across studies. In the current study, the IELTS was selected for investigation of gender effects because it is a conversational interview assessment in which the interviewer also acts as the rater. Sixteen students, as well as eight accredited IELTS interviewers, participated in the study, in which each of the candidates (eight male and eight female) was interviewed on two separate occasions, once by a female interviewer and once by a male. The results from the interview discourse and test score analyses reveal no significant gender effect on the IELTS oral interview.

O’Sullivan, B. (2000). Exploring gender and Oral Proficiency Interview performance. System, 28(3), 373-386.

Keywords: gender; performance; qualitative; quantitative; speaking

Previous research has often demonstrated that examinees’ performance on oral tests is significantly affected by the gender of the person with whom the examinee interacts (Locke, 1984; Porter, 1991a; 1991b; Porter and Shen, 1991). As such, the current study aims to address the following issues: (1) the presence or absence of a gender effect in Japanese learners’ spoken English within the OPI; (2) the nature of such a gender effect with Japanese learners; and (3) the characteristic and distinctive features of male and female interviewer and interviewee use of the spoken foreign language. Twelve Japanese university students—six males and six females—were interviewed by six native speakers of English—three males and three females. Upon analyzing the taped interactions, the author concludes that there is indeed a gender effect on the scoring of candidates’ performance during oral proficiency testing. Among other observed effects, nearly all candidates achieved higher scores when interviewed by a woman. The findings also revealed that grammatical production is influenced by the gender of the interviewer. Specifically, the candidates generally preferred more grammatically accurate language when interacting with female interviewers. Lastly, O’Sullivan provides some recommendations for avoiding potential gender effects.   

O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Language Testing, 19(3), 277-295.

Keywords: acquaintanceship effect; paired assessment; performance assessment

There is much anecdotal evidence that supports the claim that familiarity with one’s partner in an interactional activity on a language elicitation task may positively impact performance on that task, yet few studies have examined this phenomenon in detail. For the current study, data was collected from thirty-two Japanese university students in order to study the effect of acquaintanceship—as well as a variety of gender combinations—on linguistic performance in paired interactions. The participants completed three different tasks under two pairing conditions: one with a friend and one with a stranger. The results of within-subject tests revealed that when participants were paired with a friend their performance was significantly higher than when their partner was a stranger. In addition, although other research has documented an effect caused by the sex of the stranger, no such effect was observed in this study. Overall, this study provides strong evidence that familiarity with one’s partner will likely have a positive effect on performance during pair-work language elicitation tasks.   

Swender, E., Martin, C. L., Rivera-Martinez, M., & Kagan, O. E. (2014). Exploring oral proficiency profiles of heritage speakers of Russian and Spanish. Foreign Language Annals, 47(3), 423-446.

Keywords: heritage speakers; implementation and assessment; interpersonal and presentational speaking; oral proficiency; program design; program monitoring and assessment 

The aim of this study is to gain better insight into the various linguistic, educational, and experiential factors that influence the speaking proficiency of heritage speakers. The authors compiled and analyzed background information on heritage speakers of Russian and Spanish and collected speech samples that were elicited using the ACTFL OPI-computer and were rated by certified raters according to the ACTFL Proficiency Guidelines 2012-Speaking. In the study, the strengths and weaknesses of individual heritage language learners (HLLs) were identified, and a variety of instructional strategies and practices were suggested that support continued HLL language development. Findings from this study also indicate that some relationships exist between linguistic profile and proficiency level.

Watanabe, S. (2003). Cohesion and coherence strategies in paragraph-length and extended discourse in Japanese Oral Proficiency Interviews. Foreign Language Annals, 36(4), 555-565.

Keywords: discourse; Japanese; OPI; speaking 

According to the ACTFL Oral Proficiency Interview Tester Training Manual (Swender, 1999), test takers are expected to be able to narrate and describe using connected discourse of paragraph length, which is a trademark of the OPI. Through the analysis of fifteen OPIs in Japanese, Watanabe examines the text types produced at different proficiency levels, looking especially at how they are characterized in terms of length, clausal linkage, and organizational structure. Results of the study suggest that the level of proficiency influences test takers’ ability to employ cohesion and coherence devices, which in turn affects the level of cohesion and coherence in OPI discourse. Watanabe concludes with implications for Japanese foreign language pedagogy and for the OPI.

See also in Test and Task Design:

Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer?

Raters and Interviewers

This category contains publications that consider the characteristics of raters and interviewers and how they can affect oral proficiency outcomes.

Berwick, R., & Ross, S. (1996). Cross-cultural pragmatics in Oral Proficiency Interview strategies. In M. Milanovic & N. Saville (Eds.), Studies in language testing: Vol. 3. Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem (pp. 34-54). Cambridge, United Kingdom: Cambridge University Press.

Keywords: ACTFL; cross-cultural pragmatics; OPI; rater; rating 

This paper examines potential threats to the validity of the ACTFL OPI. By means of a comparative case study, the article aims to identify fundamental differences in OPI ratings at a nominally equivalent level by two raters of differing cultural backgrounds: American and Japanese. Upon examination of the communicative styles of discourse represented in the interviews, results reveal that OPI ratings may be based on contrasting kinds of evidence for oral proficiency. The analysis also strongly suggests that the cultural background of interviewers may highly influence their understanding of the level of proficiency being demonstrated by interviewees, resulting in a dramatic difference among ratings. The paper goes on to explore several resolutions to validating the OPI among varying cultures, including a reformulation of training procedures for raters, as well as remedying cultural differences through means of a universal protocol.  

Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1-25.

Keywords: conversational interview; discourse analysis; fairness; IELTS; interviewer variation; rater perception

This article explores the extent to which the variation amongst interviewers affects candidate performance in conversational oral interviews. Through discourse analysis, Brown examines the distinct ways two different interviewers handle the interview and elicit speech from the same examinee during the conversational phase of the IELTS interview. The findings from this study point to two major areas of concern: the adequacy of interviewer training and of construct definition in tests of second language communicative competence.

Carey, M. D., Mannell, R. H., & Dunn, P. K. (2011). Does a rater’s familiarity with a candidate’s pronunciation affect the rating in oral proficiency? Language Testing, 28(2), 201-219.

Keywords: accent familiarity; inter-rater reliability; oral proficiency; pronunciation; rater training; speaking test 

Carey et al. argue that pronunciation ratings on an English speaking test may be highly susceptible to variation depending on raters’ level of exposure to non-native English accents. Their study involves an inter-rater variability analysis of the English pronunciation ratings of three test candidate interlanguages (Chinese, Korean, and Indian English) supplied by ninety-nine geographically dispersed IELTS examiners with either prolonged exposure or little to no exposure to the candidates’ interlanguages. Findings from this study suggest that a rater’s prolonged exposure to a candidate’s interlanguage phonology will result in a higher rating of pronunciation, and unfamiliarity or low exposure will result in a lower rating. This confirms the authors’ argument that perception of an examinee’s performance may be affected in a positive or negative way depending on the examiner’s linguistic experience with the test taker’s accent, a phenomenon they term “interlanguage phonology familiarity.”

Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135.

Keywords: inter-rater reliability; rater experience; rater expertise; rater training; scoring aids; speaking assessment

To investigate if rater training and scoring experience contribute to rater scoring patterns in oral exams, Davis had 20 teachers of English score recorded TOEFL iBT speaking test responses prior to rater training using only exemplars. They then scored tests in three further sessions following training on rubric and exemplar use. Results showed that while participants achieved a typical level of scoring performance prior to training, training resulted in increased interrater correlation and improved agreement with established reference scores. While experience gained after training had little effect on interrater consistency, score accuracy continued to increase in post-training sessions. The study was fairly consistent with previous research, but the difference between exemplar-only and exemplar and rubric scoring leads Davis to speculate on the relative contribution of scoring aids to scoring patterns.

Ling, G., Mollaun, P., & Xi, X. (2014). A study on the impact of fatigue on human raters when scoring speaking responses. Language Testing, 31(4), 479-499.

Keywords: accuracy; rater; rating; speaking; TOEFL iBT; validity 

This study explores the effect of fatigue on human raters’ abilities to rate TOEFL iBT Speaking Test responses accurately and consistently. Ling et al. found that rater productivity, accuracy, and consistency vary across hours and also depend on shift conditions such as shift length and session length. Results indicate that raters working shorter shifts and sessions tend to outperform those working longer shifts or sessions in terms of scoring quality. Survey data on raters’ perceptions also corroborate these results, with a significant majority of raters reporting higher levels of fatigue during an eight-hour shift than a six-hour shift, and a majority of raters reporting more fatigue in the afternoon than in the morning. In the end, the authors emphasize that these findings only suggest the impact that fatigue may have on rating; since all six of a candidate’s responses are each scored by three to six randomly selected raters, the chances of fatigue affecting scoring (i.e., all three scores happen to come from raters scoring in the same time frame and under the same shift conditions) are slim.

Ross, S. J. (2007). A comparative task-in-interaction analysis of OPI backsliding. Journal of Pragmatics, 39(11), 2017-2044.

Keywords: alignment; backchannel; discourse analysis; footing; framework; interviewer variation; OPI

This article reports on the backsliding performance of a Japanese businessman on the OPI. Through a contrastive discourse analysis of two interactions conducted by two different interviewers, the study explores the effect of interviewer variation on the rating of proficiency, tapping into their comparability. It was found that the interviewers differed widely in their interaction styles (e.g., use of backchannels), demeanor (e.g., businesslike) and alignment with a candidate, which could then influence their rating behavior. However, the consistency among five repeated second ratings of the recorded interactions indicated that the current rating system was unaffected by divergent interviewer styles. Overall, whether the candidate actually backslid or not remains unclear. 

Thompson, I. (1995). A study of interrater reliability of the ACTFL Oral Proficiency Interview in five European languages: Data from ESL, French, German, Russian, and Spanish. Foreign Language Annals, 28(3), 407-422.

Keywords: inter-rater reliability; OPI 

Based on seven hundred and ninety-five double-rated oral proficiency interviews, the present study seeks to provide insight into the inter-rater reliability of the ACTFL OPI in the following European languages: English as a Second Language (ESL), French, German, Russian, and Spanish. To do this, Thompson addresses five separate questions: (1) What is the inter-rater reliability of ACTFL-certified testers in the above languages? (2) What is the relationship between interviewer-assigned ratings and second ratings based on audio replay of the interviews? (3) Does inter-rater reliability vary as a function of proficiency level? (4) Do different languages exhibit different patterns of inter-rater agreement across level? (5) Are inter-rater disagreements confined mostly to the same main proficiency level? On the whole, the findings of this study provide evidence that trained interviewers can apply the ACTFL oral proficiency scale in the languages of concern with a relatively high degree of consistency. The findings also recommend practical steps to ensure and improve inter-rater reliability. 

Winke, P., Gass, S., & Myford, C. (2013). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231-252.

Keywords: accent familiarity; construct validity; FACETS; Item Response Theory (IRT); oral assessment; rater bias; rater performance

Prior research demonstrates that familiarity with a particular accent makes that accent easier to understand in comparison to speech with an unfamiliar accent. This article examines whether accent familiarity due to a rater’s language learning background leads to a bias on oral proficiency assessments. One hundred and seven raters at Michigan State University participated in a study to determine whether or not accent familiarity acted as a rater effect on the TOEFL iBT. The results indicate that raters tended to score test takers more leniently if they were familiar with the accent, thereby providing evidence for this new type of rating bias. In their conclusion, the authors advocate certain measures to be taken when selecting and training raters in order to minimize the effect of this bias.

See also in Interlocutor and Examinee Characteristics:

O’Loughlin, K. (2002). The impact of gender in oral proficiency testing. Language Testing, 19(2), 169-192. 

O’Sullivan, B. (2000). Exploring gender and Oral Proficiency Interview performance. System, 28(3), 373-386.

Implementation and Use

The publications in this category describe how oral proficiency assessment has been implemented and used in various language programs.

Kagan, O., & Friedman, D. (2003). Using the OPI to place heritage speakers of Russian. Foreign Language Annals, 36(4), 536-545.

Keywords: ACTFL; heritage speakers; OPI; Russian; speaking

Kagan and Friedman examine whether or not the OPI is an appropriate instrument for assessing the oral proficiency of heritage language speakers of Russian, considering that they are not the traditional foreign language learners at whom the OPI is usually targeted. Results from their study indicate that, unlike the findings of Valdes’s (1989) study involving heritage speakers of Spanish, the OPI is suitable for assessing heritage speakers of Russian. In addition, characteristics of the Russian language, which has few dialectal features, make it an ideal educated speaker standard on which to base the assessment. The authors support using the OPI as one component in the placement process, in conjunction with the collection of biographical information and a written test if the student is literate.

Ke, C., & Reed, D. J. (1995). An analysis of results from the ACTFL Oral Proficiency Interview and the Chinese Proficiency Test before and after intensive instruction in Chinese as a foreign language. Foreign Language Annals, 28(2), 208-222.

Keywords: Chinese; intensive language program; OPI; oral proficiency

Ke and Reed report on their study of proficiency gains of adult learners of Chinese as a foreign language in an intensive domestic immersion program. The data consist of one hundred and twenty-two sets of scores gathered from students who took both the Chinese Proficiency Test (CPT) from the Center for Applied Linguistics (CAL) and the ACTFL OPI two times, once at the beginning of a nine-week summer program and again at the end. The authors analyzed the scores from the two types of tests, with the OPI being a direct test and the CPT indirect. They found an overall moderate correlation and, interestingly, higher correlations based on exit scores each time when compared to corresponding correlations based on entrance scores. Analysis of the results also suggests that programs like this can yield significant gains for foreign language learners and might be particularly beneficial for students of difficult, less commonly taught languages. At the end of the summer intensive program, about 60% of the students improved their OPI scores and over 95% improved their CPT scores. As for implications for pedagogy, the authors support a dual approach, asserting that traditional, skills-based instruction can go hand in hand with communicative, interactional instruction.

Kuo, J., & Jiang, X. (1997). Assessing the assessments: The OPI and the SOPI. Foreign Language Annals, 30(4), 503-512.

Keywords: interaction; interviewer variation; OPI; SOPI; speaking

In this article, Kuo and Jiang analyze and compare ACTFL’s OPI and the Simulated Oral Proficiency Interview (SOPI) by the Center for Applied Linguistics (CAL) from a user’s perspective. They explore each test and its individual strengths and weaknesses. As a standardized test, the SOPI offers a greater degree of validity. As a live interview, the OPI offers interaction and individuality but with the drawback of interviewer variance. In the end, both are found to be effective and accurate tools for measuring language proficiency. Therefore, it would be up to each individual and/or institution to determine which test to employ based on their specific needs and other considerations.

Liu, J. J. (2010). Assessing students’ language proficiency: A new model of study abroad program in China. Journal of Studies in International Education, 14(5), 528-544.

Keywords: Chinese; combination model; oral proficiency; short-term program; study abroad

Liu presents a study of the proficiency gains of eleven students of Mandarin Chinese who participated in a combination language study program spanning two summers. The combination program was composed of a domestic intensive program with a residency component followed by a short-term study abroad program the next summer. His study examines whether this type of combination program increases students’ language proficiency, cultural awareness, and personal career development and also explores the appropriateness of using the SAT II Chinese test, OPI, and writing portfolio assessment as methods of measuring outcomes. The results indicate that the combination program helped to increase the proficiency level of every student in the study over the fifteen-month period. Specifically, the pre-departure summer played a significant role in the success of the subsequent summer abroad. Students also reported an overall enhancement in their cultural awareness and personal career development. Liu found a positive relationship between SAT II Chinese test scores and OPI scores and, based on his findings, suggests that a cutoff score of 590 on the SAT II Chinese test indicates a prospective participant’s likelihood of achieving an advanced proficiency level through this type of combination program.

Meredith, R. A. (1990). The Oral Proficiency Interview in real life: Sharpening the scale. Modern Language Journal, 74(3), 288-296.

Keywords: implementation; inter-rater reliability; OPI; oral proficiency

The OPI’s popularity as a common instrument of measurement across languages and institutions warrants further investigation of its potential as a research tool. The purpose of this study is threefold: (1) to determine the feasibility of using a numeric version of a modified oral proficiency scale in research; (2) to assess the correlation between prior study of a foreign language and performance on the OPI; and (3) to test the feasibility of using the OPI with an achievement testing program for first-year students. Two hundred and thirty-one students enrolled in a second-semester Spanish course at Brigham Young University were administered the OPI. Meredith found adequate feasibility and significant correlations for all three research questions. Although the OPI lacks finesse as a singular testing instrument, overall results from this study suggest that, as a segment in a larger sequence of tests, it has proved invaluable as a motivation for students to develop their conversational skills. The findings also reveal that under certain circumstances, the OPI can be used to assign grades for oral skills.

Oral Proficiency Testing and Curriculum

The focus of this category is the relationship of oral proficiency testing and oral proficiency outcomes, language curriculum, and language instruction.

Glisan, E. W., & Foltz, D. A. (1998). Assessing students’ oral proficiency in an outcome-based curriculum: Student performance and teacher intuitions. Modern Language Journal, 82(1), 1-18.

Keywords: high school curriculum; oral proficiency; outcome-based assessment; rater; teachers

In this article, Glisan and Foltz report on the results of a pilot study conducted with two school districts in the state of Pennsylvania, which had newly incorporated oral proficiency in a foreign language into its graduation requirements. Pennsylvania schools made a landmark move in establishing performance-based outcomes for foreign language study, requiring all of its students to demonstrate at least an Intermediate-Low level of proficiency in a language other than English according to the ACTFL Proficiency Guidelines. The most significant findings from this study include the conclusion that more than four sequential years of foreign language study are often needed for students to reach the mandated Intermediate-Low proficiency level, although in some cases students attain that level earlier and are capable of achieving higher levels. The authors also conclude that OPI training would equip teachers to more accurately assess the oral proficiency levels of their students and monitor their progress towards (or beyond) the graduation requirements.

Govoni, J. M., & Feyten, C. M. (1999). Effects of the ACTFL-OPI-type training on student performance, instructional methods, and classroom materials in the secondary foreign language classroom. Foreign Language Annals, 32(2), 189-204.

Keywords: interview; OPI; oral proficiency; teachers

In this article, Govoni and Feyten examine the effects of providing secondary foreign language teachers with ACTFL-OPI-type training in terms of student performance, methods of instruction, and classroom materials. Participants in the study were from Pinellas County, Florida, and included six teachers of Spanish III and/or IV, as well as their respective 9th-12th grade students. The authors collected and analyzed both quantitative and qualitative data. Findings from the study indicate an increased awareness of a proficiency-oriented curriculum among teachers who have received ACTFL-OPI-type training. The trained teachers in this study were found to have provided more personalized and meaningful activities promoting oral communication for their students while cultivating a more student-centered approach to teaching. It is interesting to note, however, that there were no significant effects on student performance.

Johnson, C. H., & Manley, J. H. (1993). The Oral Proficiency Institute revisited. The French Review, 67(2), 263-275.

Keywords: OPI; oral proficiency; teachers; training 

In this article, Johnson and Manley discuss the origins and purpose of the Oral Proficiency Institute for secondary school language teachers, as well as the multiple modifications the institute has undergone since it was first established. The authors describe the many benefits these intensive summer institutes provide for classroom practitioners, of which most notable are improved language ability, enhanced pedagogical skills, and greater understanding of the target language culture. In light of existing institutes’ demonstrated effectiveness, Johnson and Manley outline a future vision for further improvement of the institutes, proposing an increase in staffing, an expansion adding a second level for previous participants, possible modifications to institute schedules to maximize the opportunities for teachers to attend, considerations for housing arrangements encouraging “esprit de corps,” follow-up activities post-institute, as well as stipends and other incentives for teachers to participate. 

Tsagari, D. (2011). Washback of a high-stakes English exam on teachers’ perceptions and practices. In E. Kitis, N. Lavidas, N. Topintzi, & T. Tsangalidis (Eds.), Selected papers from the 19th International Symposium on Theoretical and Applied Linguistics (pp. 431-445). Thessaloniki, Greece: Monochromia.

Keywords: high-stakes exams; interview; teachers’ perceptions and practices; test impact; washback

The First Certificate in English (FCE) is a high-stakes, English language proficiency exam administered by Cambridge ESOL, which is intended to have positive impacts upon both teaching and learning in Greece. However, local practitioners are skeptical as to how positive the impacts of this high-stakes assessment really are. This article examines the relationship between the expected influences of the FCE and teachers’ attitudes towards the exam, as well as their classroom practices. The results of the study demonstrate a close relationship between the content on the FCE and the teaching and learning that occurred in the classroom. Tsagari also provides suggestions for language teachers preparing their students for high-stakes exams in order to avoid misunderstanding of exam requirements which may lead to negative washback on classroom teaching and learning.

See also in Implementation and Use: 

Ke, C., & Reed, D. J. (1995). An analysis of results from the ACTFL Oral Proficiency Interview and the Chinese Proficiency Test before and after intensive instruction in Chinese as a foreign language. Foreign Language Annals, 28(2), 208-222.

Liu, J. J. (2010). Assessing students’ language proficiency: A new model of study abroad program in China.


Technology

The common theme uniting these publications is the use of technology with oral proficiency assessment.

Fall, T., Adair-Hauck, B., & Glisan, E. (2007). Assessing students’ oral proficiency: A case for online testing. Foreign Language Annals, 40(3), 377-406.

Keywords: ACTFL; computer-based; large-scale assessment; oral proficiency; outcome-based assessment; rating; reliability; validity

In this article, Fall et al. report on the Pittsburgh Public Schools Oral Ratings Assessment for Language Students (PPS ORALS) project, a grant-funded project for the creation of an online testing software program to be used in large-scale oral language testing. This study aims to validate the PPS ORALS assessment model as an instrument measuring oral proficiency on the ACTFL scale. Findings of the study support its validity, reliability, and feasibility, and the authors also affirm the usefulness of the PPS ORALS in collecting longitudinal assessment data and in closing the gap between instruction and assessment.

Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a computerized oral proficiency test. Language Testing, 22(1), 59-92.

Keywords: computer-based; COPI; oral proficiency; SOPI; tape-mediated

In their article, Malabonga et al. undertake two studies to examine the technical aspects of the Computerized Oral Proficiency Instrument (COPI) in comparison with the Simulated Oral Proficiency Interview (SOPI), which is tape-mediated. The first study investigated how examinees used the COPI’s self-assessment mechanism to choose a starting level, while the second study focused on examinee choices in planning and response time. Results from Study 1 reveal that the COPI’s self-assessment mechanism is generally reliable, although, when compared with outcomes on the SOPI, it appeared to be unreliable for several examinees. Results from Study 2 suggest that the maximum time allotted in the COPI for planning and speaking exceeded the amount that examinees needed.

Norris, J. M. (2001). Concerns with computerized adaptive oral proficiency assessment. Language Learning & Technology, 5(2), 99-105.

Keywords: computer-based; COPI; direct oral tests

In this article, Norris discusses the role of computer-based tests (CBTs) in language proficiency assessment, as well as their potential limits. He claims that the Computerized Oral Proficiency Instrument (COPI) developed by the Center for Applied Linguistics (CAL) offers a rather original solution to the computerization of direct tests of complex speaking skills. Norris goes on to discuss the COPI and its possible innovations and advantages over other tests, particularly the Simulated Oral Proficiency Interview (SOPI) and the ACTFL OPI. He also touches upon critical issues that should be addressed in future research on the COPI, as well as other possible concerns with computerized adaptive oral proficiency assessment.

Thompson, G. L., Cox, T. L., & Knapp, N. (2016). Comparing the OPI and the OPIc: The effect of test method on oral proficiency scores and student preference. Foreign Language Annals, 49(1), 75-92.

Keywords: language proficiency; oral proficiency; post-secondary; Spanish

This study compares the OPI and the OPI-computer (OPIc) testing methods for their effects on oral proficiency scores and student preference, using a case study of Spanish language learners. The participants were administered both tests, as well as a series of surveys. The study found that 55% of participants received the same scores on both tests, while 32% scored higher on the OPIc, and 13% scored higher on the OPI. Even though several examinees scored much higher on the OPIc, the effect size was relatively small. Furthermore, most participants showed a preference for the OPI because of its authenticity, immediate feedback, interesting topics, and lenient time allowance. The study also found that scores from the OPIc tended to cluster at the middle of the major ACTFL Proficiency levels, suggesting that a human interviewer might be necessary to encourage participants to produce speech at higher levels. On the whole, the results of this study demonstrate the importance of understanding differences in testing methods and of considering students’ individual preferences when preparing them for oral proficiency exams.

VanPatten, B., Trego, D., & Hopkins, W. P. (2015). In-class vs. online testing in university-level language courses: A research report. Foreign Language Annals, 48(4), 659-668.

Keywords: online instruction; proficiency-oriented instruction; testing

This study seeks to answer the question of whether taking a test online versus in-class makes a difference. The participants were gathered from a third-semester communicative and performance-oriented university Spanish program and were divided fairly equally into two groups. One group took the test in class, and the other group took the same test online. The study found that there was no difference in scores between students who took the test online and students who took the test in class. These results suggest that language instructors can move testing online to allow more time for interactive and communicative language instruction in the classroom. However, as research regarding paper-and-pencil testing in language curricula is still limited, the results of this study may not be generalized to all languages and classroom contexts. Furthermore, whether tests are presented to students as “quizzes” or “activities” may also have an effect on performance, although this factor has not yet been extensively researched.

See also in Oral Proficiency Testing and Curriculum:

Malone, M. E. (2000). Simulated Oral Proficiency Interviews: Recent developments.