Evidence for Test Validation: A Guide for Practitioners


Keywords: Assessment; Educational and psychological testing; Sources of validity evidence; Testing standards; Validity

How to Cite

Sireci, S., & Benítez, I. (2023). Evidence for Test Validation: A Guide for Practitioners. Psicothema, 35(3). Retrieved from https://reunido.uniovi.es/index.php/PST/article/view/20128


Background: Validity is a core topic in educational and psychological assessment. Although many resources describe the concept of validity, the sources of validity evidence, and ways to obtain that evidence, there is little guidance that gives specific instructions for planning and carrying out validation studies. Method: In this paper we describe (a) the fundamental principles underlying test validity, (b) the process of validation, and (c) practical guidance for practitioners to plan and carry out sufficient validity research to support the use of a test for its intended purposes. Results: We first define validity, describe the sources of validity evidence, and provide examples in which each of these sources is addressed. We then present a validation agenda that sets out the steps and tasks for planning and developing validation studies. Conclusions: Finally, we discuss the importance of approaching validation studies comprehensively.



