Abstract
Background: Developing assessments in multiple languages is highly complex, affecting every stage of the process, from test development through scoring to the evaluation of scores. Different approaches are needed to examine comparability and enhance validity in cross-lingual assessments. Method: A review of the literature and practices relating to the different methods used in cross-lingual assessment is presented. Results: There has been a shift from source-to-target language translation towards developing items in multiple languages simultaneously. Quantitative and qualitative methods are used to link and evaluate assessments across languages and to provide validity evidence. Conclusions: This article provides practitioners with an overview of, and research-based recommendations for, test development, linking, and validation of assessments produced in multiple languages.

Development approaches: benefits and challenges

(Successive) adaptation. Benefits: Stronger link across languages; Established statistical methods to investigate equivalence (e.g. DIF). Challenges: Cultural relevance & authenticity; Translation errors; Language differences (e.g. language idiosyncrasies, word frequencies, differential speededness).
Simultaneous development. Benefits: Stronger link across languages; Established statistical methods to investigate equivalence (e.g. DIF); Linguistic and cultural decentering; Reduced review time. Challenges: […]
Parallel development. Benefits: Cultural relevance & authenticity; Removes risk of translation errors; Reduces impact of language differences. Challenges: Weaker link across languages; Labour intensive; […] to investigate comparability statistically.
GenAI. Benefits: Time efficient; Cost effective; Reduced labour; Reduced security risk; Lowers exposure of test content. Challenges: Copyright/intellectual ownership; Risk of bias; Not sufficiently developed in all languages.
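Both the abstract and the overview above point to differential item functioning (DIF) analysis as the established statistical method for investigating equivalence across language versions. As a concrete illustration, the following is a minimal sketch of a Mantel-Haenszel DIF check in Python; the function name, the input format, and the use of total score as the matching criterion are illustrative assumptions, not a procedure prescribed by this article or its sources.

```python
import numpy as np

def mantel_haenszel_dif(ref_scores, foc_scores, ref_item, foc_item):
    """Illustrative Mantel-Haenszel DIF check for one dichotomous item.

    ref_* arrays describe the reference group (e.g. the source-language
    version); foc_* arrays describe the focal group (e.g. a
    target-language version). Scores are total test scores used as the
    matching criterion; item responses are coded 0/1.
    """
    ref_scores, foc_scores = np.asarray(ref_scores), np.asarray(foc_scores)
    ref_item, foc_item = np.asarray(ref_item), np.asarray(foc_item)

    num = den = 0.0
    for k in np.union1d(ref_scores, foc_scores):   # one stratum per score level
        r = ref_item[ref_scores == k]
        f = foc_item[foc_scores == k]
        if r.size == 0 or f.size == 0:
            continue                               # stratum needs both groups
        n = r.size + f.size
        a, b = r.sum(), r.size - r.sum()           # reference: right / wrong
        c, d = f.sum(), f.size - f.sum()           # focal: right / wrong
        num += a * d / n                           # MH common odds-ratio terms
        den += b * c / n
    if den == 0:
        return np.nan, np.nan                      # no usable strata
    alpha = num / den                              # common odds ratio
    delta = -2.35 * np.log(alpha)                  # ETS delta metric
    return alpha, delta
```

Under the widely used ETS convention, |delta| values below 1 indicate negligible DIF (category A), while values above 1.5 that are also statistically significant indicate large DIF (category C); flagged items would then be reviewed qualitatively for translation or cultural sources of difficulty, consistent with the mixed quantitative-qualitative approach described in the abstract.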
References
Alatli, B. (2020). Cross-cultural measurement invariance of the items in the Science Literacy Test in the Programme for International Student Assessment (PISA-2015). International Journal of Education and Literacy Studies, 8(2), 16–27.
Alatli, B. (2022). An investigation of cross-cultural measurement invariance and item bias of PISA 2018 reading skills items. International Online Journal of Education and Teaching, 9(3), 1047–1073.
Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36(3), 185–198. https://doi.org/10.1111/j.1745-3984.1999.tb00553.x
Allalouf, A., Rapp, J., & Stoller, R. (2009). Which item types are better suited to the linking of verbal adapted tests? International Journal of Testing, 9(2), 92–107. https://doi.org/10.1080/15305050902880686
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Angoff, W. H., & Cook, L. L. (1988). Equating the scores of the Prueba de Aptitud Académica and the Scholastic Aptitude Test (Report No. 88-2). ETS Research Report Series. https://doi.org/10.1002/j.2330-8516.1988.tb00259.x
Asparouhov, T., & Muthén, B. (2014). Multiple-group factor analysis alignment. Structural Equation Modeling: A Multidisciplinary Journal, 21(4), 495–508. https://doi.org/10.1080/10705511.2014.919210
Badham, L., & Furlong, A. (2023). Summative assessments in a multilingual context: What comparative judgment reveals about comparability across different languages in Literature. International Journal of Testing, 23(2), 111–134. https://doi.org/10.1080/15305058.2022.2149536
Blanco, C. (2024). 2024 Duolingo language report. Duolingo. https://blog.duolingo.com/2024-duolingo-language-report/
Boldt, R. F. (1969). Concurrent validity of the PAA and SAT for bilingual Dade County high school volunteers (College Entrance Examination Board Research and Development Report 68-69, No. 3). Educational Testing Service.
Brislin, R. W. (1970). Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1(3), 185–216. https://doi.org/10.1177/135910457000100301
Cascallar, A. S., & Dorans, N. J. (2005). Linking scores from tests of similar content given in different languages: An illustration of methodological alternatives. International Journal of Testing, 5(4), 337–356. https://doi.org/10.1207/s15327574ijt0504_1
CTB/McGraw-Hill. (1988). Spanish assessment of basic education: Technical report. McGraw Hill.
Davidov, E. (2011). Nationalism and constructive patriotism: A longitudinal test of comparability in 22 countries with the ISSP. International Journal of Public Opinion Research, 23(1), 88–103. https://doi.org/10.1093/ijpor/edq031
Davis, S. L., Buckendahl, C. W., & Plake, B. S. (2008). When adaptation is not an option: An application of multilingual standard setting. Journal of Educational Measurement, 45(3), 287–304. https://doi.org/10.1111/j.1745-3984.2008.00065.x
Dept, S., Ferrari, A., & Halleux, B. (2017). Translation and cultural appropriateness of survey material in large-scale assessments. In P. Lietz, J. Cresswell, K. Rust, & R. Adams (Eds.), Implementation of large-scale education assessments (pp. 168–191). Wiley. https://doi.org/10.1002/9781118762462.ch6
Dorans, N. J., & Middleton, K. (2012). Addressing the extreme assumptions of presumed linkings. Journal of Educational Measurement, 49(1), 1–18. https://doi.org/10.1111/j.1745-3984.2011.00157.x
Ebbs, D., & Koršňáková, P. (2016). Translation and translation verification for TIMSS 2015. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and procedures in TIMSS 2015 (pp. 7.1–7.16). TIMSS & PIRLS International Study Center, Boston College.
Ebbs, D., Flicop, S., Hidalgo, M. M., & Netten, A. (2021). Systems and instrument verification in PIRLS 2021. In Methods and procedures: PIRLS 2021 technical report (pp. 5.1–5.24). TIMSS & PIRLS International Study Center, Boston College. https://doi.org/10.6017/lse.tpisc.tr2103.kb2485
El Masri, Y. H., Baird, J-A., & Graesser, A. (2016). Language effects in international testing: The case of PISA 2006 science items. Assessment in Education: Principles, Policy & Practice, 23(4), 427–455. https://doi.org/10.1080/0969594X.2016.1218323
Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage assessments. International Journal of Testing, 2(3–4), 199–215. https://doi.org/10.1080/15305058.2002.9669493
Ercikan, K., & Koh, K. (2005). Examining the construct comparability of the English and French versions of TIMSS. International Journal of Testing, 5(1), 23–35. https://doi.org/10.1207/s15327574ijt0501_3
Ercikan, K., & Lyons-Thomas, J. (2013). Adapting tests for use in other languages and cultures. In K. Geisinger, B. Bracken, J. Carlson, J-I. Hansen, N. Kuncel, S. Reise, & M. Rodriguez (Eds.), APA handbook of testing and assessment in psychology, Vol. 3. Testing and assessment in school psychology and education (pp. 545–569). American Psychological Association. https://doi.org/10.1037/14049-026
Ercikan, K., & Por, H. (2020). Comparability in multilingual and multicultural assessment contexts. In A. I. Berman, E. H. Haertel, & J. W. Pellegrino (Eds.), Comparability of large-scale educational assessments: Issues and recommendations (pp. 205–225). National Academy of Education Press. https://naeducation.org/wp-content/uploads/2020/06/Comparability-of-Large-Scale-Educational-Assessments.pdf
Fischer, R., & Fontaine, J. R. J. (2011). Methods for investigating structural equivalence. In D. Matsumoto & F. J. R. van de Vijver (Eds.), Cross-cultural research methods in psychology (pp. 179–215). Cambridge University Press. https://doi.org/10.1017/CBO9780511779381.010
Gierl, M. J., & Khaliq, S. N. (2001). Identifying sources of differential item and bundle functioning on translated achievement tests: A confirmatory analysis. Journal of Educational Measurement, 38(2), 164–187. https://doi.org/10.1111/j.1745-3984.2001.tb01121.x
Gökçe, S., Berberoglu, G., Wells, C. S., & Sireci, S. G. (2021). Linguistic distance and translation differential item functioning on Trends in International Mathematics and Science Study mathematics assessment items. Journal of Psychoeducational Assessment, 39(6), 728–745. https://doi.org/10.1177/07342829211010537
Goodwin, S., Bilsky, L., Mulcaire, P., & Settles, B. (2023, 26–28 April). Machine learning applications to develop tests in multiple languages simultaneously and at scale [Conference presentation]. Association of Language Testers in Europe 8th International Conference, Madrid, Spain.
Grisay, A. (2003). Translation procedures in OECD/PISA 2000 international assessment. Language Testing, 20(2), 225–240. https://doi.org/10.1191/0265532203lt254oa
Grisay, A., Gonzalez, E., & Monseur, C. (2009). Equivalence of item difficulties across national versions of the PIRLS and PISA reading assessments. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 2, 63–83.
Hambleton, R. K. (1994). Guidelines for adapting educational and psychological tests: A progress report. European Journal of Psychological Assessment, 10(3), 229–244.
Hambleton, R. K. (2005). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. In R. K. Hambleton, P. Merenda, & C. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 3–38). Lawrence Erlbaum Publishers.
Hambleton, R. K., & Zenisky, A. L. (2011). Translating and adapting tests for cross-cultural assessments. In D. Matsumoto & F. J. R. van de Vijver (Eds.), Cross-cultural research methods in psychology (pp. 46–74). Cambridge University Press.
Hao, J., von Davier, A. A., Yaneva, V., Lottridge, S., von Davier, M., & Harris, D. J. (2024). Transforming assessment: The impacts and implications of large language models and generative AI. Educational Measurement: Issues and Practice, 43(2), 16–29. https://doi.org/10.1111/emip.12602
Hernández, A., Hidalgo, M. D., Hambleton, R. K., & Gómez-Benito, J. (2020). International Test Commission guidelines for test adaptation: A criterion checklist. Psicothema, 32(3), 390–398. https://doi.org/10.7334/psicothema2019.306
Hulin, C. L., Drasgow, F., & Komocar, J. (1982). Applications of item response theory to analysis of attitude scale translations. Journal of Applied Psychology, 67(6), 818–825. https://doi.org/10.1037/0021-9010.67.6.818
Hulin, C. L., & Mayer, L. J. (1986). Psychometric equivalence of a translation of the Job Descriptive Index into Hebrew. Journal of Applied Psychology, 71(1), 83–94. https://doi.org/10.1037/0021-9010.71.1.83
International Baccalaureate Organization. (2018). Assessment principles and practices—Quality assessments in a digital age. International Baccalaureate Organization.
International Baccalaureate Organization. (2024). The IB Diploma Programme and Career-Related Programme: May 2024 assessment session final statistical bulletin. International Baccalaureate Organization. https://ibo.org/globalassets/new-structure/about-the-ib/pdfs/the-ib-dp-and-cp-statistical-bulletin_en.pdf
International Test Commission. (2017). ITC Guidelines for translating and adapting tests (2nd edition). International Test Commission. https://www.intestcom.org/files/guideline_test_adaptation_2ed.pdf
International Test Commission and Association of Test Publishers. (2022). Guidelines for technology-based assessments. International Test Commission and Association of Test Publishers. https://www.intestcom.org/upload/media-library/tba-guidelines-final-2-23-2023-v4-167785144642TgY.pdf
Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking: Methods and practices (2nd edition). Springer-Verlag.
Koršňáková, P., Dept, S., & Ebbs, D. (2020). Translation: The preparation of national language versions of assessment instruments. In H. Wagemaker (Ed.), Reliability and validity of international large-scale assessment: Understanding IEA’s comparative studies of student achievement, volume 10 (pp. 85–111). IEA Research for Education, Springer, Cham. https://doi.org/10.1007/978-3-030-53081-5_6
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Publishers.
Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement: Issues and Practice, 26(4), 29–37. https://doi.org/10.1111/j.1745-3992.2007.00106.x
Martin, M. O., von Davier, M., & Mullis, I. V. S. (2020). Methods and procedures: TIMSS 2019 technical report. International Association for the Evaluation of Educational Achievement. https://timssandpirls.bc.edu/timss2019/methods/
McGrane, J., Kayton, H., Double, K., Woore, R., & El Masri, Y. (2022). Is science lost in translation? Language effects in the International Baccalaureate Diploma Programme science assessments. Oxford University Centre for Educational Assessment. https://ibo.org/globalassets/new-structure/research/pdfs/ib-dp-sciene-translation-final-report.pdf
Milman, L. H., Faroqi-Shah, Y., Corcoran, C. D., & Damele, D. M. (2018). Interpreting mini-mental state examination performance in highly proficient bilingual Spanish–English and Asian Indian–English speakers: Demographic adjustments, item analyses, and supplemental measures. Journal of Speech, Language, and Hearing Research, 61(4), 847–856.
OECD. (2016). PISA 2018 translation and adaptation guidelines. OECD Publishing. https://www.oecd.org/content/dam/oecd/en/about/programmes/edu/pisa/pisa-database/survey-implementation-tools/pisa-2018/PISA-2018-TRANSLATION-AND-ADAPTATION-GUIDELINES.pdf
OECD. (2024). PISA 2022 technical report. OECD Publishing. https://www.oecd.org/en/publications/pisa-2022-technical-report_01820d6d-en.html
Oliveri, M. E., Olson, B., Ercikan, K., & Zumbo, B. D. (2012). Methodologies for investigating item- and test-level measurement equivalence in international large-scale assessments. International Journal of Testing, 12(3), 203–223. https://doi.org/10.1080/15305058.2011.617475
Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53(3), 315–333.
Oliveri, M. E., & von Davier, M. (2014). Toward increasing fairness in score scale calibrations employed in international large-scale assessments. International Journal of Testing, 14(1), 1–21. https://doi.org/10.1080/15305058.2013.825265
Oliveri, M. E., & von Davier, M. (2017). Analyzing invariance of item parameters used to estimate trends in international large-scale assessments. In H. Jiao & R. W. Lissitz (Eds.), Test fairness in the new generation of large-scale assessment (pp. 121–146). Information Age Publishing.
Ong, S. L., & Sireci, S. G. (2008). Using bilingual students to link and evaluate different language versions of an exam. US-China Education Review, 5(11), 37–46.
Rapp, J., & Allalouf, A. (2003). Evaluating cross-lingual equating. International Journal of Testing, 3(2), 101–117. https://doi.org/10.1207/S15327574IJT0302_1
Robin, F., Sireci, S. G., & Hambleton, R. K. (2003). Evaluating the equivalence of different language versions of a credentialing exam. International Journal of Testing, 3(1), 1–20. https://doi.org/10.1207/S15327574IJT0301_1
Rogers, W. T., Gierl, M. J., Tardif, C., Lin, J., & Rinaldi, C. M. (2003). Differential validity and utility of successive and simultaneous approaches to the development of equivalent achievement tests in French and English. Alberta Journal of Educational Research, 49(3), 290–304. https://doi.org/10.11575/ajer.v49i3.54986
Rogers, W. T., Gierl, M. J., Tardif, C., Lin, J., & Rinaldi, C. M. (2010). Validity of the simultaneous approach to the development of equivalent achievement tests in English and French. Applied Measurement in Education, 24(1), 39–70. https://doi.org/10.1080/08957347.2011.532416
Sireci, S. G. (1997). Problems and issues in linking tests across languages. Educational Measurement: Issues and Practice, 16(1), 12–19. https://doi.org/10.1111/j.1745-3992.1997.tb00581.x
Sireci, S. G. (2005). Using bilinguals to evaluate the comparability of different language versions of a test. In R. K. Hambleton, P. Merenda, & C. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 117–138). Lawrence Erlbaum Publishers.
Sireci, S. G., & Berberoglu, G. (2000). Using bilingual respondents to evaluate translated-adapted items. Applied Measurement in Education, 13(3), 229–248. https://doi.org/10.1207/S15324818AME1303_1
Sireci, S. G., & Oliveri, M. E. (2023). A critical review of the International Baccalaureate Organization’s multilingual assessment processes and best practices’ recommendations [Report for the IB]. International Baccalaureate Organization.
Sireci, S. G., Rios, J. A., & Powers, S. (2016). Comparing test scores from tests administered in different languages. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 181–202). Routledge.
Sireci, S. G., & Wells, C. S. (2010). Evaluating the comparability of English and Spanish video accommodations for English language learners. In P. Winter (Ed.), Evaluating the comparability of scores from achievement test variations (pp. 33–68). Council of Chief State School Officers.
Sukin, T., Sireci, S. G., & Ong, S. L. (2015). Using bilingual examinees to evaluate the comparability of test structure across different language versions of a mathematics exam. Actualidades en Psicología, 29(119), 131–139. https://doi.org/10.15517/ap.v29i119.19244
Tanzer, N. K. (2005). Developing tests for use in multiple languages and cultures: A plea for simultaneous development. In R. K. Hambleton,
P. Merenda, & C. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 235–263). Lawrence Erlbaum Publishers.
van de Vijver, F. J. R., & Tanzer, N. K. (1997). Bias and equivalence in cross-cultural assessment: An overview. European Review of Applied Psychology, 47(4), 263–279. https://doi.org/10.1016/j.erap.2003.12.004
van de Vijver, F. J. R., & Poortinga, Y. H. (2005). Conceptual and methodological issues in adapting tests. In R. K. Hambleton, P. Merenda, & C. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 39–64). Lawrence Erlbaum Publishers.
van de Vijver, F., Avvisati, F., Davidov, E., Eid, M., Fox, J. P., Le Donné, N., Lek, K., Meuleman, B., Paccagnella, M., & Van de Schoot, R. (2019). Invariance analyses in large-scale studies. OECD Publishing.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28(3), 197–219. https://doi.org/10.1111/j.1745-3984.1991.tb00354.x
Wolff, H. G., Schneider-Rahm, C. I., & Forret, M. L. (2011). Adaptation of a German multidimensional networking scale into English. European Journal of Psychological Assessment, 27(4), 244–250. https://doi.org/10.1027/1015-5759/a000070
Woodcock, R. W., & Muñoz-Sandoval, A. F. (1993). An IRT approach to cross-language test equating and interpretation. European Journal of Psychological Assessment, 9(3), 233–241.
Zhao, X., Solano-Flores, G., & Qian, M. (2018). International test comparisons: Reviewing translation error in different source language-target language combinations. International Multilingual Research Journal, 12(1), 17–27. https://doi.org/10.1080/19313152.2017.1349527
Zumbo, B. D., Liu, Y., Wu, A. D., Shear, B. R., Astivia, O. L. O., & Ark, T. K. (2015). A methodology for Zumbo’s third generation DIF analyses and the ecology of item responding. Language Assessment Quarterly, 12(1), 136–151. https://doi.org/10.1080/15434303.2014.972559