Enhancing Content Validity Assessment With Item Response Theory Modeling

Keywords

Content validity
Subject matter experts
Item response theory
Validity
Test development

How to Cite

Schames Kreitchmann, R., Nájera, P., Sanz, S., & Sorrel, M. A. (2024). Enhancing Content Validity Assessment With Item Response Theory Modeling. Psicothema, 36(2), 145–153. Retrieved from https://reunido.uniovi.es/index.php/PST/article/view/21248

Abstract

Background: Ensuring the validity of assessments requires a thorough examination of the test content. Subject matter experts (SMEs) are commonly employed to evaluate the relevance, representativeness, and appropriateness of the items. This article proposes incorporating item response theory (IRT) to model the assessments conducted by SMEs. Using IRT allows the estimation of discrimination and threshold parameters for each SME, providing evidence of how well each expert differentiates relevant from irrelevant items, thus facilitating the detection of suboptimal SME performance and improving item relevance scores. Method: The IRT-based approach was compared with traditional validity indices (the content validity index and Aiken’s V) in the evaluation of conscientiousness items. The aim was to assess the SMEs’ accuracy in identifying whether or not items were designed to measure conscientiousness, and in predicting their factor loadings. Results: The IRT-based scores effectively identified conscientiousness items (R2 = 0.57) and accurately predicted their factor loadings (R2 = 0.45). These scores demonstrated incremental validity, explaining 11% more variance than Aiken’s V and up to 17% more than the content validity index. Conclusions: Modeling SME assessments with IRT improves item alignment and yields better predictions of factor loadings, enabling improvement of the content validity of measurement instruments.
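To make the modeling approach concrete, the sketch below shows one way SME relevance ratings could be analyzed with a graded response model in the R package mirt (Chalmers, 2012, cited in the references), treating the candidate items as respondents and the SMEs as IRT “items”, so that each SME receives discrimination and threshold parameters and each candidate item receives an IRT-based relevance score. The simulated rating matrix, the 1–4 relevance scale, the I-CVI cut-off of 3, and the numbers of items and SMEs are illustrative assumptions, not the authors’ data or code.

# Illustrative sketch (not the authors' code): IRT modeling of SME relevance ratings,
# compared with Aiken's V and the item-level content validity index (I-CVI).
library(mirt)

set.seed(1)
n_items <- 30; n_smes <- 10
relevance <- rnorm(n_items)                          # assumed latent relevance of each candidate item
ratings <- sapply(seq_len(n_smes), function(j) {
  cut(relevance + rnorm(n_items, sd = 0.8),          # SME j's noisy perception of relevance
      breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE)  # mapped onto a 1-4 relevance scale
})
colnames(ratings) <- paste0("sme", seq_len(n_smes))

# Candidate items play the role of respondents and SMEs the role of IRT "items"
# (Samejima's graded response model): each SME gets a discrimination and thresholds.
grm <- mirt(as.data.frame(ratings), model = 1, itemtype = "graded", verbose = FALSE)
coef(grm, simplify = TRUE, IRTpars = TRUE)$items     # per-SME discrimination (a) and thresholds (b)

# IRT-based relevance score for each candidate item (EAP factor score).
irt_score <- fscores(grm, method = "EAP")[, 1]

# Traditional indices computed on the same ratings, for comparison.
aiken_v <- rowMeans((ratings - 1) / (4 - 1))         # Aiken's V on a 1-4 scale
i_cvi   <- rowMeans(ratings >= 3)                    # I-CVI: proportion of SMEs rating 3 or 4

round(cbind(irt_score, aiken_v, i_cvi), 2)

In this framing, an SME with a low or negative discrimination is one who does not separate relevant from irrelevant items well, which is the kind of suboptimal performance the abstract refers to; the three resulting scores (IRT-based, Aiken’s V, and I-CVI) can then be compared as predictors of an external criterion such as the items’ factor loadings.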


References

Abad, F. J., Sorrel, M. A., Garcia, L. F., & Aluja, A. (2018). Modeling general, specific, and method variance in personality measures: Results for ZKA-PQ and NEO-PI-R. Assessment, 25(8), 959–977. https://doi.org/10.1177/1073191116667547

Aiken, L. R. (1980). Content validity and reliability of single items or questionnaires. Educational and Psychological Measurement, 40(4), 955– 959. https://doi.org/10.1177/001316448004000419

Almanasreh, E., Moles, R., & Chen, T. F. (2019). Evaluation of methods used for estimating content validity. Research in Social and Administrative Pharmacy, 15(2), 214–221. https://doi.org/10.1016/j.sapharm.2018.03.066

American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME] (Eds.). (2014). Standards for educational and psychological testing. American Educational Research Association.

Bhola, D. S., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with States’ content standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29. https://doi.org/10.1111/j.1745-3992.2003.tb00134.x

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06

Collado, S., Corraliza, J. A., & Sorrel, M. A. (2015). Spanish version of the Children’s Ecological Behavior (CEB) scale. Psicothema, 27(1), 82–87. https://doi.org/10.7334/psicothema2014.117

Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Psychological Assessment Resources.

Fitzpatrick, A. R. (1983). The meaning of content validity. Applied Psychological Measurement, 7(1), 3–13. https://doi.org/10.1177/014662168300700102

García, P. E., Díaz, J. O., & de la Torre, J. (2014). Application of cognitive diagnosis models to competency-based situational judgment tests. Psicothema, 26(3), 372–377. https://doi.org/10.7334/psicothema2013.322

Gómez-Benito, J., Sireci, S., & Padilla, J.-L. (2018). Differential item functioning: Beyond validity evidence based on internal structure. Psicothema, 30, 104–109. https://doi.org/10.7334/psicothema2017.183

Jennrich, R. I., & Bentler, P. M. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4), 537–549. https://doi.org/10.1007/s11336-011-9218-4

Kreitchmann, R. S., Abad, F. J., Ponsoda, V., Nieto, M. D., & Morillo, D. (2019). Controlling for response biases in self-report scales: Forced-choice vs. psychometric modeling of Likert items. Frontiers in Psychology, 10, Article 2309. https://doi.org/10.3389/fpsyg.2019.02309

Li, X., & Sireci, S. G. (2013). A new method for analyzing content validity data using multidimensional scaling. Educational and Psychological Measurement, 73(3), 365–385. https://doi.org/10.1177/0013164412473825

Lunz, M. E., Stahl, J. A., & Wright, B. D. (1994). Interjudge reliability and decision reproducibility. Educational and Psychological Measurement, 54(4), 913–925. https://doi.org/10.1177/0013164494054004007

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331–345. https://doi.org/10.1207/s15324818ame0304_3

Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79(4), 1332– 1361. https://doi.org/10.3102/0034654309341375

Martuza, V. R. (1977). Applying norm-referenced and criterion-referenced measurement in education. Allyn and Bacon.

Mastaglia, B., Toye, C., & Kristjanson, L. J. (2003). Ensuring content validity in instrument development: Challenges and innovative approaches. Contemporary Nurse, 14(3), 281–291. https://doi.org/10.5172/conu.14.3.281

McCoach, D. B., Gable, R. K., & Madura, J. P. (2013). Instrument development in the affective domain: School and corporate applications. Springer. https://doi.org/10.1007/978-1-4614-7135-6

Nájera, P., Abad, F. J., & Sorrel, M. A. (2021). Determining the number of attributes in cognitive diagnosis modeling. Frontiers in Psychology, 12, Article 614470.

Nieto, M. D., Abad, F. J., Hernández-Camacho, A., Garrido, L. E., Barrada, J. R., Aguado, D., & Olea, J. (2017). Calibrating a new item pool to adaptively assess the Big Five. Psicothema, 29(3), 390–395. https://doi.org/10.7334/psicothema2016.391

Oltmanns, J. R., & Widiger, T. A. (2020). The five-factor personality inventory for ICD-11: A facet-level assessment of the ICD-11 trait model. Psychological Assessment, 32(1), 60–71. https://doi.org/10.1037/pas0000763

Penfield, R. D., & Giacobbi, P. R., Jr. (2004). Applying a score confidence interval to Aiken’s item content-relevance index. Measurement in Physical Education and Exercise Science, 8(4), 213–225. https://doi.org/10.1207/s15327841mpee0804_3

Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and recommendations. Research in Nursing & Health, 29(5), 489–497. https://doi.org/10.1002/nur.20147

Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31(7), 3–14. https://doi.org/10.3102/0013189X031007003

R Core Team. (2023). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/

Rios, J., & Wells, C. (2014). Validity evidence based on internal structure. Psicothema, 26(1), 108–116. https://doi.org/10.7334/psicothema2013.260

Robitzsch, A., & Steinfeld, J. (2018). Item response models for human ratings: Overview, estimation methods, and implementation in R. Psychological Test and Assessment Modeling, 60(1), 101–138.

Rovinelli, R. J., & Hambleton, R. K. (1977). On the use of content specialists in the assessment of criterion-referenced test item validity. Dutch Journal of Educational Research, 2, 49–60.

Rubio, D. M., Berg-Weger, M., Tebb, S. S., Lee, E. S., & Rauch, S. (2003). Objectifying content validity: Conducting a content validity study in social work research. Social Work Research, 27(2), 94–104. https://doi.org/10.1093/swr/27.2.94

Samejima, F. (1968). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2), 100.

Sireci, S. G. (1998a). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299–321. https://doi.org/10.1207/s15326977ea0504_2

Sireci, S. G. (1998b). The construct of content validity. Social Indicators Research, 45(1/3), 83–117.

Sireci, S., & Benítez, I. (2023). Evidence for test validation: A guide for practitioners. Psicothema, 35(3), 217–226. https://doi.org/10.7334/psicothema2022.477

Sireci, S. G., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107. https://doi.org/10.7334/psicothema2013.256

Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354.

Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47(4), 397–412. https://doi.org/10.1007/BF02293705

Waugh, M. H., McClain, C. M., Mariotti, E. C., Mulay, A. L., DeVore, E. N., Lenger, K. A., Russell, A. N., Florimbio, A. R., Lewis, K. C., Ridenour, J. M., & Beevers, L. G. (2021). Comparative content analysis of self-report scales for level of personality functioning. Journal of Personality Assessment, 103, 161–173. https://doi.org/10.1080/00223891.2019.1705464

Webb, N. L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20(1), 7–25. https://doi.org/10.1080/08957340709336728

Wu, M. (2017). Some IRT-based analyses for interpreting rater effects. Psychological Test and Assessment Modeling, 59(4), 453–470.