Abstract
Background: Artificial Intelligence (AI) is increasingly used to enhance traditional assessment practices by improving efficiency, reducing costs, and enabling greater scalability. However, its use has largely been confined to large corporations, with limited uptake by researchers and practitioners. This study aims to critically review current AI-based applications in test construction and propose practical guidelines to help maximize their benefits while addressing potential risks. Method: A comprehensive literature review was conducted to examine recent advances in AI-based test construction, focusing on item development and calibration, with real-world examples to demonstrate practical implementation. Results: Best practices for AI in test development are evolving, but responsible use requires ongoing human oversight. Effective AI-based item generation depends on quality training data, alignment with intended use, model comparison, and output validation. For calibration, essential steps include defining construct validity, applying prompt engineering, checking semantic alignment, conducting pseudo factor analysis, and evaluating model fit with exploratory methods. Conclusions: We propose a practical guide for using generative AI in test development and calibration, targeting challenges related to validity, reliability, and fairness by linking each issue to specific guidelines that promote responsible, effective implementation.
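The pseudo factor analysis step named in the Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation reviewed in the paper: a toy block-structured similarity matrix stands in for the cosine-similarity matrix one would compute from sentence embeddings of real item texts, and the matrix is eigen-decomposed as if it were a correlation matrix to count dominant factors.

```python
import numpy as np

# Toy stand-in for an embedding similarity matrix: 6 items written for two
# intended traits (items 0-2 vs. items 3-5). In practice the matrix would
# hold cosine similarities between item-text embeddings from any encoder.
loadings = np.zeros((6, 2))
loadings[:3, 0] = 0.8
loadings[3:, 1] = 0.8
sim = loadings @ loadings.T          # block structure: high within-trait similarity
np.fill_diagonal(sim, 1.0)           # treat the matrix like a correlation matrix

# Pseudo factor analysis: eigen-decompose and inspect how many factors
# dominate, here with the Kaiser criterion purely for illustration.
eigvals, eigvecs = np.linalg.eigh(sim)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
n_factors = int(np.sum(eigvals > 1.0))
pseudo_loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])

print(n_factors)  # → 2 dominant factors, matching the two intended traits
```

If the recovered factor count or loading pattern diverges from the intended construct structure, that flags items for human review before any field testing.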
References
Alasadi, E. A., & Baiz, C. R. (2023). Generative AI in education and research: Opportunities, concerns, and solutions. Journal of Chemical Education, 100(8), 2965–2971. https://doi.org/10.1021/acs.jchemed.3c00323
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Anthropic. (2024). Claude 3 Opus [Large language model]. https://www.anthropic.com
Arslan, B., Lehman, B., Tenison, C., Sparks, J. R., López, A. A., Gu, L., & Zapata-Rivera, D. (2024). Opportunities and challenges of using generative AI to personalize educational assessment. Frontiers in Artificial Intelligence, 7, 1460651. https://doi.org/10.3389/frai.2024.1460651
Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. A. (2022). The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5, 903077. https://doi.org/10.3389/frai.2022.903077
Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2002). A feasibility study of on-the-fly item generation in adaptive testing. ETS Research Report Series, i–44.
Bezirhan, U., & von Davier, M. (2023). Automated reading passage generation with OpenAI’s large language model. Computers and Education: Artificial Intelligence, 5, 100161. https://doi.org/10.1016/j.caeai.2023.100161
Bißantz, S., Frick, S., Melinscak, F., Iliescu, D., & Wetzel, E. (2024). The potential of machine learning methods in psychological assessment and test construction. European Journal of Psychological Assessment, 40(1), 1–4. https://doi.org/10.1027/1015-5759/a000817
Borgonovi, F., & Suárez-Álvarez, J. (2025). How can adult skills assessments best meet the demands of the 21st century? OECD Social, Employment and Migration Working Papers, No. 319. OECD Publishing. https://doi.org/10.1787/853db37b-en
Bulut, O., Beiting-Parrish, M., Casabianca, J. M., Slater, S. C., Jiao, H., Song, D., Ormerod, C., Fabiyi, D. G., Ivan, R., Walsh, C., Rios, O., Wilson, J., Yildirim-Erbasli, S. N., Wongvorachan, T., Liu, J. X., Tan, B., & Morilova, P. (2024). The rise of artificial intelligence in educational measurement: Opportunities and ethical challenges. Chinese/English Journal of Educational Measurement and Evaluation, 5(3), 3. https://doi.org/10.59863/MIQL7785
Burstein, J. (2025, April 17). The Duolingo English Test responsible AI standards (Duolingo Research Report No. DRR-25-05). Duolingo. https://englishtest.duolingo.com/research
Butterfuss, R., & Doran, H. (2025). An application of text embeddings to support alignment of educational content standards. Educational Measurement: Issues and Practice, 44(1), 73–83. https://doi.org/10.1111/emip.12581
Chang, D. H., Lin, M. P.-C., Hajian, S., & Wang, Q. Q. (2023). Educational design principles of using AI chatbot that supports self-regulated learning in education: Goal setting, feedback, and personalization. Sustainability, 15(17), 12921.
De la Fuente, D., & Armayones, M. (2025). AI in psychological practice: What tools are available and how can they help in clinical psychology? Psychologist Papers, 46(1), 18–24. https://doi.org/10.70478/pap.psicol.2025.46.03
Dixon-Román, E. (2024). AI and psychometrics: Epistemology, process, and politics. Journal of Educational and Behavioral Statistics, 49(5), 709–714. https://doi.org/10.3102/10769986241280623
Dolenc, K., & Brumen, M. (2024). Exploring social and computer science students’ perceptions of AI integration in (foreign) language instruction. Computers and Education: Artificial Intelligence, 7, 100285.
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv. https://doi.org/10.48550/arXiv.1702.08608
Downing, S. M., & Haladyna, T. M. (Eds.). (2006). Handbook of test development. Lawrence Erlbaum Associates Publishers.
Dumas, D., Greiff, S., & Wetzel, E. (2025). Ten guidelines for scoring psychological assessments using artificial intelligence [Editorial]. European Journal of Psychological Assessment, 41(3), 169–173. https://doi.org/10.1027/1015-5759/a000904
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197. https://doi.org/10.1037/0033-2909.93.1.179
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64(4), 407–433.
European Commission, OECD, & Code.org. (2025, May). Empowering learners for the age of AI: An AI literacy framework for primary and secondary education (Review draft). https://www.oecd.org/digital/empowering-learners-ai-literacy-framework
Fan, J., Sun, T., Liu, J., Zhao, T., Zhang, B., Chen, Z., … Hack, E. (2023, January 5). How well can an AI chatbot infer personality? Examining psychometric properties of machine-inferred personality scores. PsyArXiv. https://doi.org/10.31234/osf.io/pk2b7
Farrelly, T., & Baker, N. (2023). Generative artificial intelligence: Implications and considerations for higher education practice. Education Sciences, 13(11), 1109.
Feng, W., Tran, P., Sireci, S., & Lan, A. S. (2025). Reasoning and sampling-augmented MCQ difficulty prediction via LLMs. In A. I. Cristea, E. Walker, Y. Lu, O. C. Santos, & S. Isotani (Eds.), Artificial intelligence in education. AIED 2025 (Lecture Notes in Computer Science, Vol. 15880). Springer, Cham. https://doi.org/10.1007/978-3-031-98459-4_3
Ferrando, P. J., Morales-Vives, F., Casas, J. M., & Muñiz, J. (2025). Likert scales: A practical guide to their design, construction and use. Psicothema, 37(4), 1–15. https://doi.org/10.70478/psicothema.2025.37.24
Foster, N., & Piacentini, M. (2023). Innovating assessments to measure and support complex skills. OECD Publishing. https://doi.org/10.1787/e5f3e341-en
Gierl, M. J., & Haladyna, T. M. (2012). Automatic item generation. Routledge. https://doi.org/10.4324/9780203803912
Goldberg, L. R. (1999). A broad-bandwidth, public domain personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg University Press.
Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029.
Guenole, N. (2025). Psychometrics.ai: Transforming behavioral science with machine learning. https://psychometrics.ai
Guenole, N., D’Urso, E. D., Samo, A., Sun, T., & Haslbeck, J. (2025). Enhancing scale development: Pseudo factor analysis of language embedding similarity matrices. PsyArXiv. https://osf.io/preprints/psyarxiv/vf3se_v2
Guenole, N., Samo, A., & Sun, T. (2024). Pseudo-discrimination parameters from language embeddings. OSF. https://osf.io/9a4qx_v1
Hao, J., von Davier, A. A., Yaneva, V., Lottridge, S., von Davier, M., & Harris, D. J. (2024). Transforming assessment: The impacts and implications of large language models and generative AI. Educational Measurement: Issues and Practice, 43(2), 16–29. https://doi.org/10.1111/emip.12602
He, Q., Borgonovi, F., & Paccagnella, M. (2021). Leveraging process data to assess adults’ problem-solving skills: Identifying generalized behavioral patterns with sequence mining. Computers and Education, 166, 104170. https://doi.org/10.1016/j.compedu.2021.104170
He, Q., Borgonovi, F., & Suárez-Álvarez, J. (2023). Clustering sequential navigation patterns in multiple-source reading tasks with dynamic time warping method. Journal of Computer Assisted Learning, 39, 719–736. https://doi.org/10.1111/jcal.12748
Ho, A. D. (2024). Artificial intelligence and educational measurement: Opportunities and threats. Journal of Educational and Behavioral Statistics, 49(5), 715-722. https://doi.org/10.3102/10769986241248771
Holzinger, A., Langs, G., Denk, H., Zatloukal, K., & Müller, H. (2019). Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4), e1312. https://doi.org/10.1002/widm.1312
Jen, F.-L., Huang, X., Liu, X., & Jiao, J. (2024). Can generative AI really empower teachers’ professional practices? Comparative study on human-tailored and GenAI-designed reading comprehension learning materials. In L. K. Lee, P. Poulova, K. T. Chui, M. Černá, F. L. Wang, & S. K. S. Cheung (Eds.), Technology in education: Digital and intelligent education. ICTE 2024 (Communications in Computer and Information Science, Vol. 2330, pp. 112–123). Springer.
Johnson, M. S. (2025, April). Responsible AI for measurement and learning: Principles and practices (ETS Research Report No. RR-25-03). ETS Research Institute.
Joyce, D. W., Kormilitzin, A., Smith, K. A., & Cipriani, A. (2023). Explainable artificial intelligence for mental health through transparency and interpretability for understandability. NPJ Digital Medicine, 6(6). https://doi.org/10.1038/s41746-023-00751-9
Khosravi, H., Shum, S. B., Chen, G., Conati, C., Tsai, Y.-S., Kay, J., Knight, S., Martinez-Maldonado, R., Sadiq, S., & Gašević, D. (2022). Explainable artificial intelligence in education. Computers and Education: Artificial Intelligence, 3, 100074. https://doi.org/10.1016/j.caeai.2022.100074
Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. (Research Report No. 56). Institute for Simulation and Training. https://stars.library.ucf.edu/istlibrary/56
Kuhail, M. A., Alturki, N., Alramlawi, S., & Alhejori, K. (2023). Interacting with educational chatbots: A systematic review. Education and Information Technologies, 28, 973–1018. https://doi.org/10.1007/s10639-022-11177-3
Kumar, P., Manikandan, S., & Kishore, R. (2024). AI-driven text generation: A novel GPT-based approach for automated content creation. 2024 2nd International Conference on Networking and Communications (ICNWC). IEEE.
Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2016). Handbook of test development (2nd ed.). Routledge.
Lorenzo-Seva, U., Timmerman, M. E., & Kiers, H. A. (2011). The Hull method for selecting the number of common factors. Multivariate Behavioral Research, 46(2), 340–364. https://doi.org/10.1080/00273171.2011.564527
Luecht, R. M. (2025). Assessment engineering in test design: Methods and applications (1st ed.). Routledge. https://doi.org/10.4324/9781003449464
Maas, A. C. (2024). An empirical study on training generative AI to create appropriate questions for English reading comprehension [Doctoral dissertation, Tohoku University]. Tohoku University Repository.
Mao, J., Chen, B., & Liu, J. C. (2024). Generative artificial intelligence in education and its implications for assessment. TechTrends, 68(1), 58–66.
McLaughlin, G. H. (1969). SMOG grading – a new readability formula. Journal of Reading, 12(8), 639–646.
Meeker, M., Simons, J., Chae, D., & Krey, A. (2025). Trends – artificial intelligence (AI). BOND. https://www.bondcap.com/report/tai/
Muñiz, J., & Fonseca-Pedrero, E. (2019). Ten steps for test development. Psicothema, 31(1), 7–16. https://doi.org/10.7334/psicothema2018.291
OECD (2025). Introducing the OECD AI capability indicators. OECD Publishing. https://doi.org/10.1787/be745f04-en
OpenAI. (2023). GPT-4 technical report. https://arxiv.org/abs/2303.08774
Pohl, S., Ulitzsch, E., & von Davier, M. (2021). Reframing rankings in educational assessments. Science, 372(6540), 338–340. https://doi.org/10.1126/science.abd3300
Ramandanis, D., & Xinogalos, S. (2023). Designing a chatbot for contemporary education: A systematic literature review. Information, 14(9), 503.
Russell-Lasalandra, L. L., Christensen, A. P., & Golino, H. (2024, September 12). Generative psychometrics via AI-GENIE: Automatic item generation and validation via network-integrated evaluation. PsyArXiv. https://doi.org/10.31234/osf.io/fgbj4
Samek, W., Wiegand, T., & Müller, K.-R. (2017). Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv. https://doi.org/10.48550/arXiv.1708.08296
Sanz, A., Tapia, J. L., García-Carpintero, E., Rocabado, J. F., & Pedrajas, L. M. (2025). ChatGPT simulated patient: Use in clinical training in Psychology. Psicothema, 37(3), 23–32. https://doi.org/10.70478/psicothema.2025.37.21
Schoenegger, P., Greenberg, S., Grishin, A., Lewis, J., & Caviola, L. (2025). AI can outperform humans in predicting correlations between personality items. Communications Psychology, 3, 23. https://doi.org/10.1038/s44203-025-00123-1
Sheehan, K. M., Kostin, I., & Persky, H. (2006, April). Predicting item difficulty as a function of inferential processing requirements: An examination of the reading skills underlying performance on the NAEP Grade 8 Reading Assessment. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), San Francisco, CA. Educational Testing Service.
Sheehan, K., & Mislevy, R. J. (1994). A tree-based analysis of items from an assessment of basic mathematics skills (ETS RR-94-14). Educational Testing Service.
Sireci, S., & Benítez, I. (2023). Evidence for test validation: A guide for practitioners. Psicothema, 35(3), 217–226. https://doi.org/10.7334/psicothema2022.477
Sireci, S. G., Crespo Cruz, E., Suárez-Álvarez, J., & Rodríguez Matos, G. (2025). Understanding UNDERSTANDardization research. In R. Bennett, L. Darling-Hammond, & A. Barinarayan (Eds.), Socioculturally responsive assessment: Implications for theory, measurement, and systems-level policy. Routledge. https://doi.org/10.4324/9781003435105
Sireci, S. G., Suárez-Álvarez, J., Zenisky, A. L., & Oliveri, M. E. (2024). Evolving educational testing to meet students’ needs: Design-in-real-time assessment. Educational Measurement: Issues and Practice, 43(4), 112–118. https://doi.org/10.1111/emip.12653
Smith, E. A., & Senter, R. J. (1967). Automated readability index (Vol. 66, No. 220). Aerospace Medical Research Laboratories, Aerospace Medical Division, Air Force Systems Command.
Suárez-Álvarez, J., Fernández-Alonso, R., García-Crespo, F. J., & Muñiz, J. (2022). The use of new technologies in educational assessments: Reading in a digital world. Psychologist Papers, 43(1), 36–47. https://doi.org/10.23923/pap.psicol.2986
Suárez-Álvarez, J., Oliveri, M. E., Zenisky, A., & Sireci, S. G. (2024). Five key actions for redesigning adult skills assessments from learners, employees, and educators. Journal for Research on Adult Education, 47, 321–343. https://doi.org/10.1007/s40955-024-00288-8
Sun, T., Drasgow, F., & Zhou, M. X. (2024, May 1). Development and validation of an artificial chatbot to assess personality. PsyArXiv. https://doi.org/10.31234/osf.io/ahtr9
Swiecki, Z., Khosravi, H., Chen, G., Martinez-Maldonado, R., Lodge, J. M., Milligan, S., Selwyn, N., & Gašević, D. (2022). Assessment in the age of artificial intelligence. Computers and Education: Artificial Intelligence, 3, 100075. https://doi.org/10.1016/j.caeai.2022.100075
Ulitzsch, E., Shin, H. J., & Lüdtke, O. (2023). Accounting for careless and insufficient effort responding in large-scale survey data—Development, evaluation, and application of a screen-time-based weighting procedure. Behavior Research Methods, 56(2), 804–825. https://doi.org/10.3758/s13428-022-02053-6
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008. https://arxiv.org/abs/1706.03762
von Davier, A. A., Runge, A., Park, Y., Attali, Y., Church, J., & LaFlair, G. (2024). The item factory: Intelligent automation in support of test development at scale. In H. Jiao & R. W. Lissitz (Eds.), Machine learning, natural language processing, and psychometrics (Marces Book Series) (pp. 1– 25). Information Age Publishing Inc.
von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83(4), 847–857. https://doi.org/10.1007/s11336-018-9608-y
von Davier, M., Tyack, L., & Khorramdel, L. (2022). Scoring graphical responses in TIMSS 2019 using artificial neural networks. Educational and Psychological Measurement, 83(3), 556–585. https://doi.org/10.1177/00131644221098021
Walker, M. E., Olivera-Aguilar, M., Lehman, B., Laitusis, C., Guzman- Orth, D., & Gholson, M. (2023). Culturally responsive assessment: Provisional principles (ETS RR-23-11). Educational Testing Service. https://doi.org/10.1002/ets2.12374
Wang, Y., Pan, Y., Yan, M., Su, Z., & Luan, T. H. (2023). A survey on ChatGPT: AI-generated contents, challenges, and solutions. Open Journal of Computer Science, 4, 280–286. https://doi.org/10.48550/arXiv.2305.18339
Wise, S. L., Im, S., & Lee, J. (2021). The impact of disengaged test taking on a state’s accountability test results. Educational Assessment, 26(3), 163–174. https://doi.org/10.1080/10627197.2021.1956897
Wulff, D. U., & Mata, R. (2025). Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nature Human Behaviour, 1–11. https://doi.org/10.1038/s41562-024-02089-y
Yamamoto, K., Shin, H. J., & Khorramdel, L. (2019). Introduction of multistage adaptive testing design in PISA 2018 (OECD Education Working Papers, No. 209). OECD Publishing. https://doi.org/10.1787/b9435d4b-en
Yan, L., Greiff, S., Teuber, Z., & Gašević, D. (2024). Promises and challenges of generative artificial intelligence for human learning. Nature Human Behaviour, 8, 1839–1850. https://doi.org/10.1038/s41562-024-02004-5
Yaneva, V., & von Davier, M. (Eds.). (2023). Advancing natural language processing in educational assessment (1st ed.). Routledge. https://doi.org/10.4324/9781003278658
Yang, H., Kim, H., Lee, J. H., & Shin, D. (2022). Implementation of an AI chatbot as an English conversation partner in EFL speaking classes. ReCALL, 34(3), 327–343. https://doi.org/10.1017/S0958344022000039
Yuan, L. (I.), Sun, T., Dennis, A. R., & Zhou, M. (2024). Perception is reality? Understanding user perceptions of chatbot-inferred versus self- reported personality traits. Computers in Human Behavior: Artificial Humans, 2, 100057. https://doi.org/10.1016/j.chbah.2024.100057
Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large- scale assessment. Applied Measurement in Education, 15(4), 337–362. https://doi.org/10.1207/S15324818AME1504_02