11 References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Bradshaw, L. (2016). Diagnostic classification models. In A. A. Rupp & J. P. Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applications (1st ed., pp. 297–327). John Wiley & Sons. https://doi.org/10.1002/9781118956588.ch13
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items (Vol. 4). SAGE Publications, Inc.
Chapelle, C. A. (2021). Argument-based validation in testing and assessment. SAGE Publications, Inc.
Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123–140. https://doi.org/10.1111/j.1745-3984.2012.00185.x
Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551–558. https://doi.org/10.1016/0895-4356(90)90159-M
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555
Dynamic Learning Maps Consortium. (2022). 2021–2022 Technical manual—Instructionally embedded model. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2024a). 2023–2024 Technical manual update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2024b). Accessibility manual 2023–2024. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2024c). Educator Portal user guide. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2024d). Test administration manual 2023–2024. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43, 543–549. https://doi.org/10.1016/0895-4356(90)90158-L
Henson, R., & Douglas, J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29(4), 262–277. https://doi.org/10.1177/0146621604272623
Henson, R., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210. https://doi.org/10.1007/s11336-008-9089-5
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329–349. https://doi.org/10.1207/S15324818AME1404_2
Johnson, M. S., & Sinharay, S. (2018). Measures of agreement to assess attribute-level classification accuracy and consistency for cognitive diagnostic assessments. Journal of Educational Measurement, 55(4), 635–664. https://doi.org/10.1111/jedm.12196
Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Karvonen, M., Wakeman, S. Y., Browder, D. M., Rogers, M. A. S., & Flowers, C. (2011). Academic curriculum for students with significant cognitive disabilities: Special education teacher perspectives a decade after IDEA 1997 [Research Report]. National Alternate Assessment Center. https://files.eric.ed.gov/fulltext/ED521407.pdf
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310
O’Leary, S., Lund, M., Ytre-Hauge, T. J., Holm, S. R., Naess, K., Dalland, L. N., & McPhail, S. M. (2014). Pitfalls in the use of kappa when interpreting agreement between multiple raters in reliability studies. Physiotherapy, 100, 27–35. https://doi.org/10.1016/j.physio.2013.08.002
Pontius, R. G., Jr., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32(15), 4407–4429. https://doi.org/10.1080/01431161.2011.552923
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x
Templin, J., & Bradshaw, L. (2013). Measuring the reliability of diagnostic classification model examinee estimates. Journal of Classification, 30(2), 251–275. https://doi.org/10.1007/s00357-013-9129-4
Thompson, W. J., Clark, A. K., & Nash, B. (2019). Measuring the reliability of diagnostic mastery classifications at multiple levels of reporting. Applied Measurement in Education, 32(4), 298–309. https://doi.org/10.1080/08957347.2019.1660345
Thompson, W. J., Nash, B., Clark, A. K., & Hoover, J. C. (2023). Using simulated retests to estimate the reliability of diagnostic assessment systems. Journal of Educational Measurement, 60(3), 455–475. https://doi.org/10.1111/jedm.12359
Wine, M., & Hoffman, A. (2021). Rigorous item feedback. AleDev Research & Consulting. https://static1.squarespace.com/static/61bc9f534dff2416f1f63492/t/64552da742e14c6c663536b6/1683303847762/Wine%2C+M.+%26+Hoffman%2C+A.+%282021%29.+Rigorous+Item+Feedback.+AleDev+Research.pdf
Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF [Working Paper]. University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science.