Efectos del corrector en las evaluaciones educativas de alto impacto./ Rater effects in high-impact educational assessments.

Keywords

rater effects
inter-rater reliability
scoring rubrics
assessing writing
performance ratings

How to cite

Woitschach, P., Díaz-Pérez, C., Fernández-Argüelles, D., Fernández-Castañón, J., Fernández-Castillo, A., Fernández-Rodríguez, L., González-Canal, M. C., López-Marqués, I., Martín-Espinosa, D., Navarro-Cabrero, R., Osendi-Cadenas, L., Riesgo-Fernández, D., Suárez-García, Z., & Fernández-Alonso, R. (2018). Efectos del corrector en las evaluaciones educativas de alto impacto./ Rater effects in high-impact educational assessments. R.E.M.A. Revista electrónica De metodología Aplicada, 23(1), 12–27. https://doi.org/10.17811/rema.23.1.2018.12-27

Abstract

Background: Constructed-response items scored by different raters using rubrics are one of the greatest challenges in large-scale, high-stakes educational assessment, and rater effects are known to influence assessment results. In this context, the present study analyzes inter-rater reliability in the assessment of written expression. Method: A group of 13 raters scored 375 written productions by 6th-grade students using an analytic rubric composed of 8 scoring criteria. The raters were assigned to 13 scoring panels following a balanced incomplete block design. The analyses first sought to confirm the unidimensional structure of the rubric; different classical methods were then used to study rater effects, intra-rater consistency and inter-rater agreement. Results: Differential effects were found among the raters. These differences are substantial when rater severity is compared, and differences also appear in the internal consistency of each rater and in the agreement between raters, the latter effect being especially marked in some scoring panels. Conclusions: Differences between raters may arise from several sources, such as experience, familiarity with the task and degree of training with the rubrics; the nature of the task being assessed; or the design of the rubric itself.
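
The abstract refers to classical indices of rater severity, intra-rater consistency and inter-rater agreement without giving formulas. As a purely illustrative, minimal sketch (not the authors' analysis code), the Python snippet below computes two standard quantities on simulated rubric scores: exact and adjacent agreement between two raters, and the single-rater intraclass correlations ICC(3,1) (consistency) and ICC(2,1) (absolute agreement). The data, the 1-4 score scale and all names are assumptions made for the example.

```python
# Minimal sketch (not the authors' code): classical agreement indices for an
# essays-by-raters matrix of rubric scores, assuming a complete block in which
# every essay in a panel was scored by every rater on the same criterion.
import numpy as np

def icc_two_way(scores: np.ndarray) -> dict:
    """Single-rater ICCs (Shrout & Fleiss) from an (essays x raters) score matrix."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)                            # per-essay means
    col_means = scores.mean(axis=0)                            # per-rater means (severity)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)       # between-essay mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)       # between-rater mean square
    sst = np.sum((scores - grand) ** 2)
    mse = (sst - msr * (n - 1) - msc * (k - 1)) / ((n - 1) * (k - 1))  # residual mean square
    icc_consistency = (msr - mse) / (msr + (k - 1) * mse)                      # ICC(3,1)
    icc_agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # ICC(2,1)
    return {"ICC_consistency": icc_consistency, "ICC_agreement": icc_agreement}

def exact_and_adjacent_agreement(a: np.ndarray, b: np.ndarray) -> dict:
    """Proportion of essays on which two raters give identical / adjacent scores."""
    diff = np.abs(a - b)
    return {"exact": np.mean(diff == 0), "adjacent": np.mean(diff <= 1)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_level = rng.integers(1, 5, size=30)                   # 30 essays, scores 1-4
    severity = np.array([0, 0, 1])                             # third rater scores harder
    ratings = np.clip(true_level[:, None] - severity[None, :]
                      + rng.integers(-1, 2, size=(30, 3)), 1, 4)
    print(icc_two_way(ratings.astype(float)))
    print(exact_and_adjacent_agreement(ratings[:, 0], ratings[:, 1]))
```

The two ICCs diverge precisely when raters differ in severity: ICC(2,1) treats between-rater mean differences as error, whereas ICC(3,1) ignores them, which is why severity effects such as those reported in the study can leave consistency high while absolute agreement drops.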


