Abstract
Background: Ensuring the validity of assessments requires a thorough examination of test content. Subject matter experts (SMEs) are commonly employed to evaluate the relevance, representativeness, and appropriateness of items. This article proposes integrating item response theory (IRT) into SME evaluations. IRT provides discrimination and threshold parameters for the SMEs, evidencing their performance in differentiating relevant from irrelevant items, detecting suboptimal rating performance, and also improving the estimation of item relevance. Method: The use of IRT was compared with traditional indices (the content validity index and Aiken's V) on conscientiousness items. We assessed SMEs' accuracy in discriminating whether or not items measured conscientiousness, and whether their ratings predicted the items' factor loadings. Results: IRT scores identified conscientiousness items well (R² = .57) and predicted their factor loadings (R² = .45). They also showed incremental validity, explaining between 11% and 17% more variance than the traditional indices. Conclusions: Incorporating IRT into SME evaluations improves item alignment and better predicts factor loadings, enhancing the content validity of instruments.
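As a brief illustration of the approach summarized above, the sketch below (an assumed setup, not the authors' script) fits a graded response model (Samejima, 1968) to simulated SME relevance ratings with the mirt package (Chalmers, 2012), arranging rated items as rows and SMEs as columns so that each SME receives discrimination and threshold estimates and each item a latent relevance score; Aiken's V is computed alongside for comparison. All data and object names are illustrative.

library(mirt)

set.seed(123)
n_items <- 30; n_smes <- 10
relevance <- rnorm(n_items)                    # simulated latent relevance of the rated items
ratings <- as.data.frame(sapply(seq_len(n_smes), function(j)
  findInterval(relevance + rnorm(n_items, sd = 1), c(-1, 0, 1)) + 1))  # ratings on a 1-4 scale

# Graded response model: SMEs (columns) play the role of "items",
# rated items (rows) the role of "persons"
grm_fit <- mirt(ratings, model = 1, itemtype = "graded", verbose = FALSE)
coef(grm_fit, simplify = TRUE)$items           # discrimination and thresholds for each SME
irt_relevance <- fscores(grm_fit)              # IRT-based relevance estimate for each item

# Aiken's V for each item: V = S / (n(c - 1)), with S the sum of (rating - minimum),
# n the number of SMEs, and c the number of rating categories
c_categories <- 4
aiken_v <- rowSums(ratings - 1) / (n_smes * (c_categories - 1))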
References
Abad, F. J., Sorrel, M. A., Garcia, L. F., & Aluja, A. (2018). Modeling general, specific, and method variance in personality measures: Results for ZKA-PQ and NEO-PI-R. Assessment, 25(8), 959–977. https://doi.org/10.1177/1073191116667547
Aiken, L. R. (1980). Content validity and reliability of single items or questionnaires. Educational and Psychological Measurement, 40(4), 955–959. https://doi.org/10.1177/001316448004000419
Almanasreh, E., Moles, R., & Chen, T. F. (2019). Evaluation of methods used for estimating content validity. Research in Social and Administrative Pharmacy, 15(2), 214–221. https://doi.org/10.1016/j.sapharm.2018.03.066
American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME] (Eds.). (2014). Standards for educational and psychological testing. American Educational Research Association.
Bhola, D. S., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with states’ content standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29. https://doi.org/10.1111/j.1745-3992.2003.tb00134.x
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06
Collado, S., Corraliza, J. A., & Sorrel, M. A. (2015). Spanish version of the Children’s Ecological Behavior (CEB) scale. Psicothema, 27(1), 82–87. https://doi.org/10.7334/psicothema2014.117
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Psychological Assessment Resources.
Fitzpatrick, A. R. (1983). The meaning of content validity. Applied Psychological Measurement, 7(1), 3–13. https://doi.org/10.1177/014662168300700102
García, P. E., Díaz, J. O., & de la Torre, J. (2014). Application of cognitive diagnosis models to competency-based situational judgment tests. Psicothema, 26(3), 372–377. https://doi.org/10.7334/psicothema2013.322
Gómez-Benito, J., Sireci, S., & Padilla, J.-L. (2018). Differential item functioning: Beyond validity evidence based on internal structure. Psicothema, 30, 104–109. https://doi.org/10.7334/psicothema2017.183
Jennrich, R. I., & Bentler, P. M. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4), 537–549. https://doi.org/10.1007/s11336-011-9218-4
Kreitchmann, R. S., Abad, F. J., Ponsoda, V., Nieto, M. D., & Morillo, D. (2019). Controlling for response biases in self-report scales: Forced-choice vs. psychometric modeling of Likert items. Frontiers in Psychology, 10, Article 2309. https://doi.org/10.3389/fpsyg.2019.02309
Li, X., & Sireci, S. G. (2013). A new method for analyzing content validity data using multidimensional scaling. Educational and Psychological Measurement, 73(3), 365–385. https://doi.org/10.1177/0013164412473825
Lunz, M. E., Stahl, J. A., & Wright, B. D. (1994). Interjudge reliability and decision reproducibility. Educational and Psychological Measurement, 54(4), 913–925. https://doi.org/10.1177/0013164494054004007
Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331–345. https://doi.org/10.1207/s15324818ame0304_3
Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79(4), 1332–1361. https://doi.org/10.3102/0034654309341375
Martuza, V. R. (1977). Applying norm-referenced and criterion-referenced measurement in education. Allyn and Bacon.
Mastaglia, B., Toye, C., & Kristjanson, L. J. (2003). Ensuring content validity in instrument development: Challenges and innovative approaches. Contemporary Nurse, 14(3), 281–291. https://doi.org/10.5172/conu.14.3.281
McCoach, D. B., Gable, R. K., & Madura, J. P. (2013). Instrument development in the affective domain: School and corporate applications. Springer. https://doi.org/10.1007/978-1-4614-7135-6
Nájera, P., Abad, F. J., & Sorrel, M. A. (2021). Determining the number of attributes in cognitive diagnosis modeling. Frontiers in Psychology, 12, Article 614470.
Nieto, M. D., Abad, F. J., Hernández-Camacho, A., Garrido, L. E., Barrada, J. R., Aguado, D., & Olea, J. (2017). Calibrating a new item pool to adaptively assess the Big Five. Psicothema, 29(3), 390–395. https://doi.org/10.7334/psicothema2016.391
Oltmanns, J. R., & Widiger, T. A. (2020). The five-factor personality inventory for ICD-11: A facet-level assessment of the ICD-11 trait model. Psychological Assessment, 32(1), 60–71. https://doi.org/10.1037/pas0000763
Penfield, R. D., & Giacobbi, P. R., Jr. (2004). Applying a score confidence interval to Aiken’s item content-relevance index. Measurement in Physical Education and Exercise Science, 8(4), 213–225. https://doi.org/10.1207/s15327841mpee0804_3
Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and recommendations. Research in Nursing & Health, 29(5), 489–497. https://doi.org/10.1002/nur.20147
Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31(7), 3–14. https://doi.org/10.3102/0013189X031007003
R Core Team. (2023). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
Rios, J., & Wells, C. (2014). Validity evidence based on internal structure. Psicothema, 26(1), 108–116. https://doi.org/10.7334/psicothema2013.260
Robitzsch, A., & Steinfeld, J. (2018). Item response models for human ratings: Overview, estimation methods, and implementation in R. Psychological Test and Assessment Modeling, 60(1), 101–138.
Rovinelli, R. J., & Hambleton, R. K. (1977). On the use of content specialists in the assessment of criterion-referenced test item validity. Dutch Journal of Educational Research, 2, 49–60.
Rubio, D. M., Berg-Weger, M., Tebb, S. S., Lee, E. S., & Rauch, S. (2003). Objectifying content validity: Conducting a content validity study in social work research. Social Work Research, 27(2), 94–104. https://doi.org/10.1093/swr/27.2.94
Samejima, F. (1968). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2), 100.
Sireci, S. G. (1998a). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299–321. https://doi.org/10.1207/s15326977ea0504_2
Sireci, S. G. (1998b). The construct of content validity. Social Indicators Research, 45(1/3), 83–117.
Sireci, S., & Benítez, I. (2023). Evidence for test validation: A guide for practitioners. Psicothema, 35(3), 217–226. https://doi.org/10.7334/psicothema2022.477
Sireci, S. G., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107. https://doi.org/10.7334/psicothema2013.256
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354.
Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47(4), 397–412. https://doi.org/10.1007/BF02293705
Waugh, M. H., McClain, C. M., Mariotti, E. C., Mulay, A. L., DeVore, E. N., Lenger, K. A., Russell, A. N., Florimbio, A. R., Lewis, K. C., Ridenour, J. M., & Beevers, L. G. (2021). Comparative content analysis of self-report scales for level of personality functioning. Journal of Personality Assessment, 103, 161–173. https://doi.org/10.1080/00223891.2019.1705464
Webb, N. L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20(1), 7–25. https://doi.org/10.1080/08957340709336728
Wu, M. (2017). Some IRT-based analyses for interpreting rater effects. Psychological Test and Assessment Modeling, 59(4), 453–470.