3 Assessment Design and Development

Chapter 3 of the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022) describes assessment design and development procedures. This chapter provides an overview of updates to assessment design and development for 2023–2024. The chapter first describes the design of English language arts (ELA) reading and writing testlets, as well as mathematics testlets. The chapter then provides an overview of 2023–2024 item writers’ characteristics and the 2023–2024 external review of items and testlets based on criteria for content, bias, and accessibility. The chapter concludes by presenting evidence of item quality, including summaries of field test data analysis and associated reviews, the pool of operational testlets available for administration, and an evaluation of differential item functioning (DIF).

3.1 Assessment Structure

The DLM Alternate Assessment System uses learning maps as the basis for assessment, which are highly connected representations of how academic skills are acquired as reflected in the research literature. Nodes in the maps represent specific knowledge, skills, and understandings in ELA and mathematics, as well as important foundational skills that provide an understructure for academic skills. The maps go beyond traditional learning progressions to include multiple pathways by which students develop content knowledge and skills.

Four broad claims were developed for ELA and mathematics, which were then subdivided into nine conceptual areas, to organize the highly complex learning maps. For a complete description, see Chapter 2 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022). Claims are overt statements of what students are expected to learn and be able to demonstrate as a result of mastering skills within a very large neighborhood of the map. Conceptual areas are nested within claims and comprise multiple conceptually related content standards and the nodes that support and extend beyond the standards. The claims and conceptual areas apply to all grades in the DLM system.

Essential Elements (EEs) are specific statements of knowledge and skills, analogous to alternate or extended content standards. The EEs were developed by linking to the grade-level expectations identified in the Common Core State Standards. The purpose of the EEs is to build a bridge from the Common Core State Standards to academic expectations for students with the most significant cognitive disabilities.

For each EE, five linkage levels were identified in the map; linkage levels are small collections of nodes that represent critical junctures on the path toward and beyond the learning target. Assessments are developed at each linkage level for a particular EE.

Testlets are the basic units of measurement in the DLM system. Testlets are short measures of student knowledge, skills, and understandings, each made up of three to nine assessment items. Assessment items are developed based on nodes at the five linkage levels for each EE. Each testlet measures one EE and linkage level, with the exception of writing testlets; see Chapter 4 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022) for a description of writing testlets. The Target linkage level reflects the grade-level expectation aligned directly to the EE. For each EE, small collections of nodes are identified earlier in the map that represent critical junctures on the path toward the grade-level expectation. Nodes are also identified beyond the Target, at the Successor level, to give students an opportunity to grow toward the grade-level targets that apply to students without significant cognitive disabilities.

There are three levels below the Target and one level beyond the Target; an illustrative sketch of this structure appears after the list below.

  1. Initial Precursor
  2. Distal Precursor
  3. Proximal Precursor
  4. Target
  5. Successor
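
To make the structure described above concrete, the following minimal sketch shows one way the relationship among an EE, its five linkage levels, and a testlet could be represented. The types and names are hypothetical illustrations and are not part of the DLM system.

    # Illustrative sketch only; not part of the DLM system.
    from dataclasses import dataclass, field
    from enum import IntEnum

    class LinkageLevel(IntEnum):
        """The five linkage levels, ordered from earliest to latest in the map."""
        INITIAL_PRECURSOR = 1
        DISTAL_PRECURSOR = 2
        PROXIMAL_PRECURSOR = 3
        TARGET = 4     # the grade-level expectation aligned directly to the EE
        SUCCESSOR = 5  # growth beyond the grade-level target

    @dataclass
    class Testlet:
        """A short measure of one EE at one linkage level (writing testlets excepted)."""
        essential_element: str          # e.g., "ELA.EE.RL.3.1"
        linkage_level: LinkageLevel
        items: list = field(default_factory=list)  # three to nine assessment items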

3.2 Testlet and Item Writing

This section describes information pertaining to item writing and item writer demographics for the 2023–2024 year. For a complete summary of item and testlet development procedures, see Chapter 3 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022).

3.2.1 2024 Testlet and Item Writing

Item development for 2023–2024 focused on replenishing and increasing the pool of testlets in all subjects. Items were developed by external item writers and by internal test development staff. External item writers were recruited to write computer-delivered items and testlets; they produced a total of 286 computer-delivered testlets (106 in ELA and 180 in mathematics). Teacher-administered testlets are based on templates created in partnership with Karen Erickson from the Center for Literacy and Disability Studies at the University of North Carolina. These templates are created specifically for students with pre-symbolic communication and are used at the Initial Precursor linkage level. Because the templates already exist, DLM staff produced all ELA and mathematics teacher-administered testlets internally, for a total of 100 teacher-administered testlets (93 in ELA and seven in mathematics).

3.2.1.1 Item-Writing Process

The item-writing process for 2023–2024 began with item writers completing three advance training modules: a module providing an overview of DLM and two content-specific modules (either ELA or mathematics). In January 2024, item writers and staff gathered in Albuquerque, New Mexico, for an on-site item-writing workshop. New item writers were invited to a first-day training that focused on the basics of DLM and item writing; 12 mathematics item writers attended this first-day training. Veteran item writers joined new item writers for another 2 days to complete the item-writing workshop. During this portion of the workshop, item writers received additional training and worked on producing and peer reviewing two computer-delivered testlets. Following the on-site workshop, item writers continued producing and peer reviewing computer-delivered testlets virtually via a secure online platform through March 2024.

3.2.1.2 Item Writers

Item writers were selected from the Accessible Teaching, Learning, and Assessment Systems (ATLAS) MemberClicks database. The database is a profile-based recruitment tool hosted in MemberClicks and includes individuals recruited via the DLM governance board and social media, individuals who have previously participated in item writing and other events, and individuals who created profiles via the “sign up to participate in DLM events” link on the DLM homepage. Interested individuals create and update their participant profile. Participant profiles include demographic, education, and work experience data.

A total of 533 individuals were initially invited to participate in the 2024 item-writing workshop based on their profiles in the ATLAS MemberClicks database. Minimum eligibility criteria included at least 1 year of teaching experience, teaching in a DLM state, and experience with the DLM alternate assessment. Prior DLM event participation, subject matter expertise, population expertise, and distribution of experience in each grade band were also considered in selection and assignment to a subject area. Of the 533 individuals initially invited to participate, 45 registered, completed a pre-reading task, and committed to attend the workshop. All 45 registered item writers attended both days of the item-writing training workshop, where they learned about the DLM assessment system, EEs and linkage levels, and how to write testlets. They also completed at least rounds 1 and 2 of item writing. Of these item writers, 15 developed ELA testlets and 30 developed mathematics testlets.

Table 3.1 presents the item writer demographics. Table 3.2 shows the median and range of years of teaching experience for item writers. Item writers had expertise across Grades 3–8 and high school, as shown in Table 3.3.

Table 3.1: Demographics of the Item Writers
Demographic group n %
Gender
Female 41 91.1
Male   3   6.7
Preferred to self-describe   1   2.2
Race
White 40 88.9
Asian   3   6.7
Black or African American   2   4.4
Hispanic ethnicity
Non-Hispanic 41 91.1
Hispanic   1   2.2
Chose not to disclose   3   6.7
Table 3.2: Item Writers’ Median Years of Teaching Experience
Teaching experience n Median Range
Pre-K–12 30 16.5 4–32
English language arts 28 14.0 0–30
Mathematics 27 13.0 1–30
Table 3.3: Item Writers’ Teaching Experience by Grade
Grade level n %
3 25 55.6
4 28 62.2
5 27 60.0
6 24 53.3
7 22 48.9
8 22 48.9
High school 18 40.0
Note. Item writers could indicate multiple grade levels.

The 45 item writers represented a highly qualified group of professionals with both content and special education perspectives. Table 3.4 shows the degrees held by item writers. All item writers held at least a bachelor’s degree. The vast majority (n = 42; 93%) also held a master’s degree.

Table 3.4: Item Writers’ Degree Type
Degree n %
Bachelor’s degree 45 100.0
Education 14   31.1
Special education 14   31.1
Other 15   33.3
Missing   2     4.4
Master’s degree 42   93.3
Education 16   38.1
Special education 23   54.8
Other   3     7.1

Item writers reported a range of experience working with students with disabilities, as summarized in Table 3.5. Item writers collectively had the most experience working with students with a significant cognitive disability (n = 37; 82%) or specific learning disabilities (n = 37; 82%).

Table 3.5: Item Writers’ Experience With Disability Categories
Disability category n %
Other health impairment 39 86.7
Significant cognitive disability 37 82.2
Specific learning disability 37 82.2
Multiple disabilities 36 80.0
Mild cognitive disability 34 75.6
Emotional disability 31 68.9
Speech impairment 31 68.9
Orthopedic impairment 26 57.8
Deaf/hard of hearing 22 48.9
Blind/low vision 21 46.7
Traumatic brain injury 17 37.8
Note. Item writers could select multiple categories.

Table 3.6 shows the professional roles reported by the item writers. While item writers had a range of professional roles, they were primarily classroom educators.

Table 3.6: Professional Roles of Item Writers
Role n %
Classroom educator 34 75.6
District staff   5 11.1
Instructional coach   2   4.4
State education agency   1   2.2
University faculty/staff   1   2.2
Other   2   4.4

Item writers came from 18 different states. Table 3.7 reports the geographic areas of the institutions in which item writers taught or held a position.

Table 3.7: Institution Geographic Areas for Item Writers
Geographic area n %
Rural 17 37.8
Suburban 16 35.6
Urban 12 26.7
Note. Rural: population <2,000; Suburban: population 2,000–50,000; Urban: population >50,000.

3.2.2 External Reviews

Following rounds of internal review and revision, items and testlets were externally reviewed. For a complete summary of item and testlet review procedures, see Chapter 3 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022).

3.2.2.1 Items and Testlets

External review of testlets was held on site during June and July of 2023 in Philadelphia, Pennsylvania. The content and bias and sensitivity reviews of all items and testlets were conducted across 2–3 days. Each of the eight content panels reviewed items and testlets for a single grade band (elementary, middle school, or high school) in one subject (ELA or mathematics). There were three bias and sensitivity panels, one for each grade band; each bias and sensitivity panel reviewed items and testlets for all subjects. The accessibility reviews were conducted across 5 days because of the need to review testlets and items for all subjects and grade bands. The accessibility panels were split by grade band and reviewed all testlets for all subjects within that grade band. These reviews took the longest because the accessibility panels had more criteria to apply and more testlets to review per panel.

The purpose of external reviews of items and testlets is to evaluate whether items and testlets measure the intended content, are accessible, and are free of bias or sensitive content. External reviewers use external review criteria established for DLM alternate assessments to rate items and recommend to accept, revise, or reject items and testlets. External reviewers provide recommendations for revise ratings and explanations for reject ratings. The test development team uses collective feedback from the external reviewers to inform decisions about items and testlets prior to field testing.

3.2.2.1.1 Overview of Review Process

External reviewers were selected from the ATLAS MemberClicks database based on predetermined qualifications for each panel type. To qualify as an external reviewer, an individual must be from a DLM consortium state, have at least 1 year of teaching experience, and must not have been an item writer for DLM within the past 3 years. External reviewers were assigned to content, accessibility, or bias and sensitivity panels based on additional qualifications, such as expertise in certain grade bands or subjects or with the DLM population. Each external reviewer served on only one of these panels.

There were 93 external reviewers. Of those, 19 were ELA external reviewers and 27 were mathematics external reviewers. There were also 27 accessibility external reviewers and 20 bias and sensitivity external reviewers who reviewed items and testlets from all subjects.

Prior to attending the in-person event, external reviewers were sent an email with instructions for accessing the platform used for their reviews. Each external reviewer was asked to access the platform and read a guide about external review before attending the event. Each panel was led by an ATLAS facilitator and co-facilitator. Facilitators provided additional training on the platform used for reviews and the criteria used to review items and testlets. External reviewers began by completing a calibration set (reviewing two testlets) to calibrate their ratings for the review. Following the calibration set, external reviewers reviewed collections of items and testlets independently. Once all external reviewers completed the review, facilitators used a discussion framework known as the Rigorous Item Feedback framework (Wine & Hoffman, 2021) to discuss any items or testlets that were rated either revise or reject by an external reviewer and to obtain collective feedback about those items and testlets. The Rigorous Item Feedback framework helps facilitators elicit detailed, substantive feedback from external reviewers and record feedback in a uniform fashion. Following the discussion, external reviewers were given another collection of items and testlets to review. This process was repeated until all collections of items and testlets were reviewed. Collections ranged from 5 to 19 testlets, depending on the panel type. Content panels had fewer testlets per collection, and collections were organized by grade level. Because the bias and sensitivity and accessibility panels reviewed testlets for all subjects and had more testlets overall to review, these panels had more testlets per collection.

3.2.2.1.2 External Reviewers

Table 3.8 presents the demographics for the external reviewers. Table 3.9 shows the median and range of years of teaching experience. External reviewers had expertise across all grade levels, with slightly greater representation for Grades 6–8, as shown in Table 3.10. External reviewers had varying experience teaching students with the most significant cognitive disabilities. External reviewers had a median of 9 years of experience teaching students with the most significant cognitive disabilities, with a minimum of 1 year and a maximum of 30 years of experience.

Table 3.8: Demographics of the External Reviewers
Demographic group n %
Gender
Female 82 88.2
Male   9   9.7
Chose not to disclose   2   2.2
Race
White 79 84.9
Black or African American   5   5.4
Chose not to disclose   4   4.3
Asian   3   3.2
American Indian   1   1.1
Other   1   1.1
Hispanic ethnicity
Non-Hispanic 87 93.5
Hispanic   1   1.1
Chose not to disclose   5   5.4
Table 3.9: External Reviewers’ Years of Teaching Experience
Teaching experience Median Range
Pre-K–12 17.0 6–31
English language arts 12.5 0–31
Mathematics 13.0 1–31
Table 3.10: External Reviewers’ Teaching Experience by Grade
Grade level n %
3 38 40.9
4 38 40.9
5 45 48.4
6 52 55.9
7 51 54.8
8 52 55.9
High school 41 44.1
Note. Reviewers could indicate multiple grade levels.

The 93 external reviewers represented a highly qualified group of professionals. The level of degree and most common types of degrees held by external reviewers are shown in Table 3.11. A majority (n = 77; 83%) held a master’s degree.

Table 3.11: External Reviewers’ Degree Type
Degree n %
Bachelor’s degree 86 92.5
Education 32 37.2
Special education 23 26.7
Other 31 36.0
Missing   7   8.1
Master’s degree 77 82.8
Education 33 42.9
Special education 31 40.3
Other 11 14.3
Missing   2   2.6

External reviewers reported a range of experience working with students with disabilities, as summarized in Table 3.12. Most external reviewers had experience working with students with disabilities (94%), and 59% had experience with the administration of alternate assessments. The variation in responses suggests some external reviewers may have had experience working with students with disabilities but did not participate in the administration of alternate assessments for students with the most significant cognitive disabilities.

Table 3.12: External Reviewers’ Experience With Disability Categories
Disability category n %
Multiple disabilities 70 75.3
Significant cognitive disability 70 75.3
Mild cognitive disability 65 69.9
Speech impairment 65 69.9
Specific learning disability 64 68.8
Other health impairment 61 65.6
Emotional disability 57 61.3
Blind/low vision 44 47.3
Traumatic brain injury 43 46.2
Orthopedic impairment 40 43.0
Deaf/hard of hearing 27 29.0
Note. Reviewers could select multiple categories.

Table 3.13 shows the professional roles reported by the external reviewers. While the reviewers had a range of professional roles, they were primarily classroom educators.

Table 3.13: External Reviewers’ Professional Roles
Role n %
Classroom educator 64 68.8
Other 11 11.8
Instructional coach 10 10.8
District staff   7   7.5
Not specified   1   1.1

External reviewers were from 13 different states. Table 3.14 reports the geographic areas of institutions in which reviewers taught or held a position.

Table 3.14: Institution Geographic Areas for External Reviewers
Geographic area n %
Rural 39 41.9
Suburban 28 30.1
Urban 25 26.9
Chose not to disclose   1   1.1
Note. Rural: population <2,000; Suburban: population 2,000–50,000; Urban: population >50,000.
3.2.2.1.3 Results of External Reviews

Table 3.15 presents the percentage of items and testlets rated as accept, revise, and reject across panels and rounds of review by subject.

Table 3.15: Range of Percentages for Item and Testlet Ratings Across Panels and Rounds of Review by Subject
Accept (%) Revise (%) Reject (%)
English language arts
Items 58–92 7–42 <1
Testlets 52–87 13–46 0–2
Mathematics
Items 56–90 11–42 0–2
Testlets 46–70 29–51 0–4
3.2.2.1.4 Test Development Team Decisions

Because each item and testlet is examined by three panels, ratings were compiled across panels, following the process described in Chapter 3 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022). The test development team reviews the collective feedback provided by external reviewers for each item and testlet. Once the test development team views each item and testlet and considers the feedback provided by the external reviewers, it assigns one of the following decisions to each one: (a) accept as is; (b) minor revision, pattern of minor concerns, will be addressed; (c) major revision needed; (d) reject; and (e) more information needed.

The ELA test development team accepted 53% of testlets and 76% of items as is. Beyond leaving accepted items and testlets unchanged, the ELA test development team also declined to make changes that were cosmetic or preferential. For example, if a panel suggested a style change that conflicted with the DLM style guide requirements, that request was not implemented; likewise, requests to swap words or distractors for replacements that did not improve the item construction (e.g., stems that address the targeted cognition, response options grounded in misconceptions) were not implemented. Of the items and testlets that were revised, most required major changes (e.g., a stem or response option was replaced) as opposed to minor changes (e.g., minor rewording with the concept unchanged). The ELA test development team made 69 minor revisions and 180 major revisions, and it rejected one testlet.

The mathematics test development team accepted 83% of testlets and 69% of items as is. When testlet-level comments from external reviewers resulted in revisions, the revisions were made at the item level and the testlet itself was marked accept as is. Items with comments that would deviate from established methods of addressing nodes were accepted as is, and other comments were taken to the mathematics test development team for additional internal discussion. Of the items and testlets that were revised, most required major changes (e.g., a stem or response option was replaced) as opposed to minor changes (e.g., minor rewording with the concept unchanged). The mathematics test development team made four minor revisions and 326 major revisions to items, and it rejected four testlets.

Most of the items and testlets reviewed will be field tested during the fall 2024 or spring 2025 assessment windows.

3.3 Evidence of Item Quality

Each year, testlets are added to and removed from the operational pool to maintain a pool of high-quality testlets. The following sections describe evidence of item quality, including evidence supporting field test testlets available for administration, a summary of the operational pool, and evidence of DIF.

3.3.1 Field Testing

During 2023–2024, field test testlets were administered to evaluate item quality before promoting testlets to the operational pool. Adding testlets to the operational pool allows for multiple testlets to be available in each assessment window for a given EE and linkage level combination. This allows teachers to assess the same EE and linkage level multiple times, if desired, and reduces item exposure for the EEs and linkage levels that are assessed most frequently. Additionally, deepening the operational pool allows for testlets to be evaluated for retirement in instances in which other testlets show better performance.

In this section, we describe the field test testlets administered in 2023–2024 and the associated review activities. A summary of prior field test events can be found in Chapter 3 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022).

3.3.1.1 Description of Field Tests Administered in 2023–2024

The Instructionally Embedded and Year-End assessment models share a common item pool, and testlets field tested during the fall instructionally embedded assessment window may eventually be promoted for use in the spring assessment window. Therefore, field testing from both assessment windows is described.

Testlets were made available for field testing based on the availability of field test content for each EE and linkage level.

During both the fall and spring windows, field test testlets were administered to each student after blueprint coverage requirements were met. A field test testlet was assigned for an EE that was assessed during the operational assessment at a linkage level equal or adjacent to the linkage level of the operational testlet.
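
As a concrete illustration of this assignment rule, the following minimal sketch checks whether a candidate field test linkage level is equal or adjacent to the linkage level of the operational testlet; the function and its names are hypothetical and are not part of the DLM assignment engine.

    # Illustrative sketch only; not the operational assignment logic.
    LINKAGE_LEVELS = [
        "Initial Precursor",
        "Distal Precursor",
        "Proximal Precursor",
        "Target",
        "Successor",
    ]

    def is_eligible_field_test_level(operational_level: str, field_test_level: str) -> bool:
        """Return True if the field test linkage level is the same as, or adjacent to,
        the linkage level of the operational testlet for the same EE."""
        distance = abs(
            LINKAGE_LEVELS.index(operational_level) - LINKAGE_LEVELS.index(field_test_level)
        )
        return distance <= 1

    # Example: a student assessed operationally at the Proximal Precursor level could be
    # assigned a field test testlet at the Distal Precursor, Proximal Precursor, or Target level.
    assert is_eligible_field_test_level("Proximal Precursor", "Target")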

Table 3.16 summarizes the number of field test testlets available during 2023–2024. A total of 516 field test testlets were available across grades, subjects, and windows.

Table 3.16: 2023–2024 Field Test Testlets by Subject
Grade   Fall window: English language arts (n)   Fall window: Mathematics (n)   Spring window: English language arts (n)   Spring window: Mathematics (n)
3 22 11 29   8
4 15 17 12 17
5 12 14 10   6
6 17   9 13 10
7 11 17   9 13
8 21 23 23 28
9 16   9 18   4
10 16   7 18   4
11 10   9 11   6
12 10  – 11  –
Note. In mathematics, high school is banded in Grades 9–11.

Table 3.17 presents the demographic breakdown of students completing field test testlets in ELA and mathematics in 2023–2024. Consistent with the DLM population, approximately 67% of students completing field test testlets were male, approximately 55% were White, and approximately 75% were non-Hispanic. Most students completing field test testlets were not English-learner eligible or monitored. The students completing field test testlets were split across the four complexity bands, with most students assigned to Band 1 or Band 2. See Chapter 4 of this manual for a description of student complexity bands.

Table 3.17: Demographic Summary of Students Participating in Field Tests
Demographic group   English language arts (n)   English language arts (%)   Mathematics (n)   Mathematics (%)
Gender
Male 57,363 67.6 60,080 67.8
Female 27,417 32.3 28,489 32.2
Nonbinary/undesignated       23   0.0       27   0.0
Other          1   0.0          1   0.0
Race
White 47,922 56.5 49,928 56.4
African American 18,631 22.0 19,074 21.5
Two or more races 10,763 12.7 11,542 13.0
Asian   4,805   5.7   5,243   5.9
American Indian   2,115   2.5   2,220   2.5
Native Hawaiian or Pacific Islander      431   0.5      445   0.5
Alaska Native      137   0.2      145   0.2
Hispanic ethnicity
Non-Hispanic 65,497 77.2 67,921 76.7
Hispanic 19,307 22.8 20,676 23.3
English learning (EL) participation
Not EL eligible or monitored 79,072 93.2 82,483 93.1
EL eligible or monitored   5,732   6.8   6,114   7.0
English language arts complexity band
Foundational 16,297 19.2 17,601 19.9
Band 1 27,056 31.9 28,631 32.3
Band 2 27,355 32.3 27,710 31.3
Band 3 14,096 16.6 14,655 16.5
Mathematics complexity band
Foundational 17,048 20.1 18,253 20.6
Band 1 35,500 41.9 37,154 41.9
Band 2 26,882 31.7 27,593 31.1
Band 3   5,374   6.3   5,597   6.3
Note. See Chapter 4 of this manual for a description of student complexity bands.

Participation in field testing was not required, but educators were encouraged to administer all available testlets to their students. Table 3.18 shows field test participation rates for ELA and mathematics in the fall and spring windows. Note that because the Instructionally Embedded and Year-End models share an item pool, participation numbers are combined across all states. In total, 77% of students in ELA and 80% of students in mathematics completed at least one field test testlet in either window. In the fall window, 80% of field test testlets had a sample size of at least 20 students (i.e., the threshold for item review). In the spring window, 95% of field test testlets had a sample size of at least 20 students.

Table 3.18: 2023–2024 Field Test Participation, by Subject and Window
Subject   Fall window (n)   Fall window (%)   Spring window (n)   Spring window (%)   Combined (n)   Combined (%)
English language arts 7,868 42.6 82,637 75.1 84,804 76.6
Mathematics 6,037 33.8 86,841 79.1 88,597 80.1

3.3.1.2 Field Test Data Review

Data collected during each field test are compiled, and statistical flags are implemented ahead of test development team review. Items are flagged for additional review if they meet either of the following statistical criteria:

  • The item is too challenging, as indicated by a proportion correct (p-value) less than .35. This value was selected as the threshold for flagging because most DLM assessment items offer three response options, so a value less than .35 may indicate less than chance selection of the correct response option.

  • The item is significantly easier or harder than other items assessing the same EE and linkage level, as indicated by a weighted standardized difference greater than two standard deviations from the mean p-value for that EE and linkage level combination (one plausible formulation of this statistic is sketched after this list).
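
The exact weighting used operationally is not specified in this chapter; one plausible formulation of the weighted standardized difference, consistent with the description above, is a sample-size-weighted z-statistic:

\[
z_j = \frac{p_j - \bar{p}_w}{s_w}, \qquad \bar{p}_w = \frac{\sum_k n_k\, p_k}{\sum_k n_k},
\]

where \(p_j\) is the proportion correct for item \(j\), \(n_k\) and \(p_k\) are the sample size and proportion correct for each item \(k\) measuring the same EE and linkage level, \(\bar{p}_w\) is the weighted mean p-value, and \(s_w\) is the corresponding weighted standard deviation of those p-values; items with \(|z_j| > 2\) would be flagged.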

Flagging criteria serve as a source of evidence for test development teams in evaluating item quality; however, final judgments are content based, taking into account the testlet as a whole, the underlying nodes in the DLM maps that the items were written to assess, and pool depth.
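
The following minimal sketch applies these two flags to an item-level summary table. It is illustrative only: the column names are hypothetical, and it uses an unweighted standardized difference, whereas the operational procedure uses a weighted version.

    # Minimal sketch, not DLM production code; column names are hypothetical.
    import pandas as pd

    def flag_field_test_items(items: pd.DataFrame) -> pd.DataFrame:
        """Flag items using the two statistical criteria described above.

        Expected columns: item_id, ee, linkage_level, n_students, n_correct.
        """
        # Keep items that meet the minimum sample size threshold for review.
        items = items[items["n_students"] >= 20].copy()
        items["p_value"] = items["n_correct"] / items["n_students"]

        # Standardized difference relative to other items measuring the same
        # EE and linkage level (unweighted here for simplicity).
        grouped = items.groupby(["ee", "linkage_level"])["p_value"]
        items["std_diff"] = grouped.transform(lambda p: (p - p.mean()) / p.std(ddof=1))

        items["flag_difficulty"] = items["p_value"] < 0.35    # below-chance threshold
        items["flag_std_diff"] = items["std_diff"].abs() > 2  # > 2 SD from the group mean
        return items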

Review of field test data occurs annually during February and March. This review includes data from the immediately preceding fall and spring windows. That is, the review in February and March of 2024 includes field test data collected during the spring 2023 window and the fall window of 2023–2024. Data that were collected during the 2024 spring window will be reviewed in February and March of 2025, with results included in the 2024–2025 technical manual update.

Test development teams for each subject classified each reviewed item into one of four categories:

  1. No changes made to item. Test development team decided item can go forward to operational assessment.
  2. Test development team identified concerns that required modifications. Modifications were clearly identifiable and were likely to improve item performance.
  3. Test development team identified concerns that required modifications. The content was worth preserving rather than rejecting. Item review may not have clearly pointed to specific modifications that were likely to improve the item.
  4. Rejected item. Test development team determined the item was not worth revising.

For an item to be accepted as is, the test development team had to determine that the item was consistent with DLM item-writing guidelines and that the item was aligned to the node. An item or testlet was rejected completely if it was inconsistent with DLM item-writing guidelines, if the EE and linkage level were covered by other testlets that had better-performing items, or if there was no clear content-based revision to improve the item. In some instances, a decision to reject an item also resulted in the rejection of the testlet.

Common reasons for flagging an item for modification included items that were misaligned to the node, distractors that could be argued as partially correct, or unnecessary complexity in the language of the stem. After reviewing flagged items, the test development team looked at all items classified into Category 3 or Category 4 within the testlet to help determine whether to retain or reject the testlet. Here, the test development team could elect to keep the testlet (with or without revision) or reject it. If a revision was needed, it was assumed the testlet needed field testing again. The entire testlet was rejected if the test development team determined the flagged items could not be adequately revised.

3.3.1.3 Results of Item Analysis

Figure 3.1 and Figure 3.2 summarize the p-values for items that met the minimum sample size threshold of 20. Most items fell above the .35 threshold for flagging: 1,012 (92%) ELA items and 760 (82%) mathematics items were above the threshold. All flagged items are reviewed by test development teams following field testing; accordingly, the test development teams reviewed the 84 (8%) ELA items and 166 (18%) mathematics items that fell below the threshold.

Figure 3.1: p-values for English Language Arts 2023–2024 Field Test Items

This figure contains a histogram displaying p-value on the x-axis and the number of English language arts field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.2: p-values for Mathematics 2023–2024 Field Test Items

This figure contains a histogram displaying p-value on the x-axis and the number of mathematics field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

DLM assessment items are designed and developed to be fungible (i.e., interchangeable) within each EE and linkage level, meaning field test items should perform consistently with the operational items measuring the same EE and linkage level. To evaluate whether field test items perform similarly to operational items measuring the same EE and linkage level, standardized difference values are calculated for the field test items. Figure 3.3 and Figure 3.4 summarize the standardized difference values for items field tested during 2023–2024 for ELA and mathematics, respectively. Most items fell within two standard deviations of the mean for the EE and linkage level. Items beyond the threshold were reviewed by the test development teams for each subject.

Figure 3.3: Standardized Difference Z-Scores for English Language Arts 2023–2024 Field Test Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of English language arts field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.4: Standardized Difference Z-Scores for Mathematics 2023–2024 Field Test Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of mathematics field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

A total of 73 (38%) ELA testlets and 86 (45%) mathematics testlets had at least one item flagged due to their p-value and/or standardized difference value. Test development teams reviewed all flagged items and their context within the testlet to identify possible reasons for the flag and to determine whether an edit was likely to resolve the issue.

Of the 73 ELA testlets that were flagged, 36 (49%) were edited and reassigned to the field test pool, 36 (49%) were promoted as is to the operational pool to maintain pool depth given testlet retirement, and one (1%) was rejected and retired. Of the 86 mathematics testlets that were flagged, three (3%) were edited and reassigned to the field test pool, 73 (85%) were promoted as is to the operational pool to maintain pool depth given testlet retirement, one (1%) was sent back to the field test pool with no edits for additional data collection to get estimates of item difficulty that are based on larger samples, and nine (10%) were rejected and retired.

Field test items were also reviewed for evidence of DIF. See Chapter 3 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022) for a complete description of the methods and process for evaluating evidence of DIF. Two field test items in ELA and four field test items in mathematics were flagged for nonnegligible DIF. As part of the test development teams' item review, both ELA items and all four mathematics items were designated for additional field testing to collect additional data.

Of the 117 ELA testlets that were not flagged, 117 (100%) were promoted as is to the operational pool. Of the 106 mathematics testlets that were not flagged, 99 (93%) were promoted as is to the operational pool and seven (7%) were rejected and retired.

3.3.2 Operational Assessment Items for 2023–2024

There were several updates to the pool of operational items for 2023–2024: 255 testlets were promoted to the operational pool from field testing in 2022–2023, including 113 ELA testlets and 142 mathematics testlets. Additionally, 10 testlets (<1% of all testlets) were retired due to model misfit. For a discussion of the model-based retirement process, see Chapter 5 of this manual.

Testlets were made available for operational testing in 2023–2024 based on the 2022–2023 operational pool and the promotion of testlets field tested during 2022–2023 to the operational pool following their review. Table 3.19 summarizes the total number of operational testlets for 2023–2024. In total, there were 3,363 operational testlets available. This total included 611 EE/linkage level combinations (349 ELA, 262 mathematics) for which both a general version and a version for students who are blind or visually impaired or read braille were available.

Operational assessments were administered during the two instructionally embedded windows. A total of 522,009 test sessions were administered during both assessment windows. One test session is one testlet taken by one student. Only test sessions that were complete at the close of each testing window counted toward the total sessions.

Table 3.19: 2023–2024 Operational Testlets by Subject (N = 3,363)
Grade   Fall operational: English language arts (n)   Fall operational: Mathematics (n)   Spring operational: English language arts (n)   Spring operational: Mathematics (n)
3 133   64 136   64
4 139   92 134   95
5 162   86 151   91
6 142   65 142   71
7 122   89 116   84
8 126   79 119   86
9–10 113 158 113 163
11–12 117 * 111 *
* In mathematics, high school is banded in Grades 9–11.

3.3.2.1 Educator Perceptions of Assessment Content

Each year, the test administrator survey includes two questions about test administrators’ perceptions of the assessment content. Participation in the test administrator survey is described in Chapter 4 of this manual. Questions pertain to whether the DLM assessments measured important academic skills and reflected high expectations for their students. Table 3.20 describes the responses.

Test administrators generally responded that content reflected high expectations for their students (86% agreed or strongly agreed) and measured important academic skills (79% agreed or strongly agreed). While the majority of test administrators agreed with these statements, 14%–21% disagreed. DLM assessments represent a departure from the breadth of academic skills assessed by many states’ previous alternate assessments. Given the short history of general curriculum access for this population and the tendency to prioritize the instruction of functional academic skills (Karvonen et al., 2011), test administrators’ responses may reflect awareness that DLM assessments contain challenging content. However, test administrators were divided on its importance in the educational programs of students with the most significant cognitive disabilities.

Table 3.20: Educator Perceptions of Assessment Content
Statement   Strongly disagree (n)   Strongly disagree (%)   Disagree (n)   Disagree (%)   Agree (n)   Agree (%)   Strongly agree (n)   Strongly agree (%)
Content measured important academic skills and knowledge for this student. 1,173 7.8 2,014 13.4 9,011 60.0 2,811 18.7
Content reflected high expectations for this student.    645 4.3 1,406   9.4 9,045 60.7 3,793 25.5

3.3.2.2 Psychometric Properties of Operational Assessment Items for 2023–2024

The p-value was calculated for each operational item, using all of the data used to calibrate the scoring model, to summarize item difficulty.

Figure 3.5 and Figure 3.6 show the distribution of p-values for operational items in ELA and mathematics, respectively. To prevent items with small sample sizes from potentially skewing the results, the sample size cutoff for inclusion in the p-value plots was 20. In total, 329 items (3% of all items) were excluded due to small sample size: 277 ELA items (4% of all ELA items) and 52 mathematics items (1% of all mathematics items). Of the excluded items, 226 (82%) ELA items and 49 (94%) mathematics items measured the Successor linkage level. In general, ELA items were easier than mathematics items, as evidenced by more ELA items falling in the higher p-value ranges.

Figure 3.5: p-values for English Language Arts 2023–2024 Operational Items

A histogram displaying p-value on the x-axis and the number of English language arts operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.6: p-values for Mathematics 2023–2024 Operational Items

A histogram displaying p-value on the x-axis and the number of mathematics operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Because DLM items are designed and developed to be fungible, the standardized difference is used to evaluate this assumption. Standardized difference values were calculated for all operational items with a student sample size of at least 20, comparing each item's p-value to those of all other items measuring the same EE and linkage level. If an item is fungible with the other items measuring the same EE and linkage level, the item is expected to have a nonsignificant standardized difference value. The standardized difference values provide one source of evidence of internal consistency.

Figure 3.7 and Figure 3.8 summarize the distributions of standardized difference values for operational items in ELA and mathematics, respectively. Overall, 98% of ELA items and 99% of mathematics items fell within two standard deviations of the mean for the items measuring the same EE and linkage level.

Figure 3.7: Standardized Difference Z-Scores for English Language Arts 2023–2024 Operational Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of English language arts operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.8: Standardized Difference Z-Scores for Mathematics 2023–2024 Operational Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of mathematics operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.9 summarizes the distributions of standardized difference values for operational items by linkage level. Most items fell within two standard deviations of the mean of all items measuring the respective EE and linkage level. The Successor linkage level has a slightly different distribution of standardized difference values than the other linkage levels. The difference may be due to the smaller sample sizes for items measuring the Successor linkage level; this is consistent with the items excluded from the analysis, the majority of which measured the Successor linkage level. As additional data are collected and decisions are made regarding item pool replenishment, test development teams will consider item standardized difference values, along with item misfit analyses, when determining which items and testlets are recommended for retirement.

Figure 3.9: Standardized Difference Z-Scores for 2023–2024 Operational Items by Linkage Level

This figure contains a histogram displaying standardized difference on the x-axis and the number of operational items on the y-axis. The histogram has a separate row for each linkage level.

Note. Items with a sample size less than 20 were omitted.

3.3.3 Evaluation of Item-Level Bias

The DIF analyses identify instances where items are more difficult for some groups of examinees despite these examinees having similar knowledge and understanding of the assessed concepts (Camilli & Shepard, 1994). Using DIF analyses can uncover internal inconsistency if items function differently in a systematic way for identifiable subgroups of students (American Educational Research Association et al., 2014). While identification of DIF does not always indicate a weakness in the item, it can point to construct-irrelevant variance, posing considerations for validity and fairness.

Inclusion criteria based on sample sizes and p-values were applied to attain adequate technical standards for evaluating evidence of DIF. For items to be evaluated for evidence of DIF for gender and race subgroups, we required:

  • At least 100 focal group students to have completed the item.
  • An item p-value between .05 and .95.
  • Subgroup p-values between .03 and .97 (an illustrative filtering sketch follows this list).
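
The following minimal sketch applies these inclusion criteria to a table of item-by-focal-group comparisons. The column names are hypothetical, and the sketch assumes the subgroup p-value criterion applies to both the focal and reference groups.

    # Illustrative sketch only; not the operational screening code.
    import pandas as pd

    def dif_eligible(comparisons: pd.DataFrame) -> pd.DataFrame:
        """Keep item-by-focal-group comparisons meeting the inclusion criteria above."""
        return comparisons[
            (comparisons["focal_n"] >= 100)                         # focal group sample size
            & comparisons["item_p_value"].between(0.05, 0.95)       # overall item p-value
            & comparisons["focal_p_value"].between(0.03, 0.97)      # subgroup p-values
            & comparisons["reference_p_value"].between(0.03, 0.97)
        ]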

The logistic regression DIF detection method was used to evaluate evidence of uniform and nonuniform DIF for gender and race subgroups (Swaminathan & Rogers, 1990). Evidence of uniform DIF indicates a logistic regression model using total linkage levels mastered and group membership better predicts students’ correct responses than a logistic regression model only using total linkage levels mastered as a predictor. When evidence of uniform DIF is present, one group outperforms the other group along the range of linkage levels mastered. Evidence of nonuniform DIF indicates a logistic regression model using total linkage levels mastered, group membership, and the interaction between total linkage levels mastered and group membership better predicts students’ correct responses than a logistic regression model only using total linkage levels mastered as a predictor. When evidence of nonuniform DIF is present, the group with the highest probability of a correct response to the item differs along the range of total linkage levels mastered; thus, one group is favored at the low end of the spectrum and the other group is favored at the high end. Items were flagged for evidence of DIF if the effect size reflecting the magnitude of DIF based on the Nagelkerke pseudo \(R^2\) was moderate (B-level DIF) or large (C-level DIF), as defined by Jodoin and Gierl (2001).
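
In terms of the logistic regression DIF detection method described above, the comparison can be written as three nested models, where \(X\) is the total linkage levels mastered, \(G\) is group membership (coded here so that positive group coefficients favor the focal group, consistent with the interpretation of \(\beta_2G\) reported later in this chapter), and \(y\) is the item response:

\[
\begin{aligned}
\text{Model 0: } & \operatorname{logit} P(y = 1) = \beta_0 + \beta_1 X \\
\text{Model 1: } & \operatorname{logit} P(y = 1) = \beta_0 + \beta_1 X + \beta_2 G \\
\text{Model 2: } & \operatorname{logit} P(y = 1) = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 (X \times G)
\end{aligned}
\]

Uniform DIF is evaluated by comparing Model 1 with Model 0, and nonuniform DIF is evaluated by comparing Model 2 with Model 0, with the effect size taken as the change in the Nagelkerke pseudo \(R^2\) between the compared models.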

For a complete description of the methods and process used to evaluate evidence of DIF, see Chapter 3 of the 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022).
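
As a concrete illustration of the effect-size computation (not the operational implementation), the following sketch fits the nested models with statsmodels and computes the change in Nagelkerke pseudo \(R^2\) for the uniform DIF comparison; the Jodoin and Gierl (2001) thresholds shown in the comments are the commonly cited values and should be verified against the original source.

    # Hedged sketch; variable names and focal-group coding (focal = 1) are assumptions.
    import numpy as np
    import statsmodels.api as sm

    def nagelkerke_r2(fitted, null_fitted, n):
        """Nagelkerke pseudo R-squared computed from fitted log-likelihoods."""
        cox_snell = 1 - np.exp((2 / n) * (null_fitted.llf - fitted.llf))
        max_attainable = 1 - np.exp((2 / n) * null_fitted.llf)
        return cox_snell / max_attainable

    def uniform_dif_effect_size(y, total_mastered, group):
        """Change in Nagelkerke R-squared when the group term is added to the model."""
        n = len(y)
        null = sm.Logit(y, np.ones((n, 1))).fit(disp=0)  # intercept-only model
        reduced = sm.Logit(y, sm.add_constant(np.asarray(total_mastered))).fit(disp=0)
        full = sm.Logit(
            y, sm.add_constant(np.column_stack([total_mastered, group]))
        ).fit(disp=0)
        return nagelkerke_r2(full, null, n) - nagelkerke_r2(reduced, null, n)

    def classify_effect_size(delta_r2):
        """Commonly cited Jodoin and Gierl (2001) categories:
        A (negligible) < .035, B (moderate) .035-.070, C (large) > .070."""
        if delta_r2 < 0.035:
            return "A"
        return "B" if delta_r2 <= 0.070 else "C"

For the nonuniform DIF comparison described above, the full model would also include the interaction term \(X \times G\), with the change in \(R^2\) taken relative to the model containing only total linkage levels mastered.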

3.3.3.1 DIF Results

Using the above criteria for inclusion, 7,548 (65%) items were evaluated for at least one gender group comparison, and 5,602 (49%) items were evaluated for at least one racial group comparison. The number of items evaluated by grade and subject for gender ranged from 74 in Grades 9–10 ELA to 617 in Grade 6 ELA. Because students taking DLM assessments represent three possible gender groups (male, female, and nonbinary/undesignated), there are up to two comparisons that can be made for each item, with the male group as the reference group and each of the other two groups as the focal group. Across all items, this resulted in 23,078 possible comparisons. Using the inclusion criteria specified above, 7,548 (33%) item and focal group comparisons were included in the analysis. All 7,548 items were evaluated for the female focal group; no items met the focal group sample size criterion for the nonbinary/undesignated focal group.

The number of items evaluated by grade and subject for race ranged from 72 in Grades 9–10 ELA to 416 in Grades 9–10 mathematics. Because students taking DLM assessments represent seven possible racial groups (White, African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, and two or more races), there are up to six comparisons that can be made for each item, with the White group as the reference group and each of the other six groups as the focal group. (See Chapter 7 of this manual for a summary of participation by race and other demographic variables.) Across all items, this resulted in 69,234 possible comparisons. Using the inclusion criteria specified above, 11,764 (17%) item and focal group comparisons were included in the analysis. Overall, 2,533 items were evaluated for one racial focal group, 1,064 items were evaluated for two racial focal groups, 925 items were evaluated for three racial focal groups, 1,072 items were evaluated for four racial focal groups, and eight items were evaluated for five racial focal groups. One racial focal group and the White reference group were used in each comparison. Table 3.21 shows the number of items that were evaluated for each racial focal group. Across all comparisons, sample sizes ranged from 234 to 18,633 for gender and from 348 to 15,082 for race.

Table 3.21: Number of Items Evaluated for Differential Item Functioning for Each Race
Focal group Items (n)
African American 5,597
American Indian 1,084
Asian 2,002
Native Hawaiian or Pacific Islander        8
Two or more races 3,073
Note. The reference group was White students.

Table 3.22 and Table 3.23 show the number and percentage of subgroup combinations that did not meet each inclusion criterion for gender and race, respectively, by subject and the linkage level of the items. A total of 3,991 items were not included in the DIF analysis for gender for any of the subgroups. Of the 15,530 item and focal group comparisons that were not included in the DIF analysis for gender, 15,128 (97%) had a focal group sample size of less than 100, 121 (1%) had an item p-value greater than .95, and 281 (2%) had a subgroup p-value greater than .97. A total of 5,937 items were not included in the DIF analysis for race for any of the subgroups. Of the 57,470 item and focal group comparisons that were not included in the DIF analysis for race, 56,778 (99%) had a focal group sample size of less than 100, 212 (<1%) had an item p-value greater than .95, and 480 (1%) had a subgroup p-value greater than .97. The majority of nonincluded comparisons came from ELA for both gender (n = 9,327; 60%) and race (n = 34,206; 60%).

Table 3.22: Comparisons Not Included in Differential Item Functioning Analysis for Gender by Subject and Linkage Level
Subject and linkage level   Sample size (n)   Sample size (%)   Item proportion correct (n)   Item proportion correct (%)   Subgroup proportion correct (n)   Subgroup proportion correct (%)
English language arts
Initial Precursor 1,766 19.6   0   0.0   0   0.0
Distal Precursor 1,923 21.3   0   0.0 13   5.8
Proximal Precursor 1,852 20.5   2   2.3 88 39.5
Target 1,628 18.1 33 37.5 85 38.1
Successor 1,847 20.5 53 60.2 37 16.6
Mathematics
Initial Precursor 1,215 19.9   0   0.0   0   0.0
Distal Precursor    966 15.8   0   0.0 11 19.0
Proximal Precursor 1,104 18.1 19 57.6 19 32.8
Target 1,318 21.6   7 21.2 25 43.1
Successor 1,509 24.7   7 21.2   3   5.2
Table 3.23: Comparisons Not Included in Differential Item Functioning Analysis for Race by Subject and Linkage Level
Subject and linkage level   Sample size (n)   Sample size (%)   Item proportion correct (n)   Item proportion correct (%)   Subgroup proportion correct (n)   Subgroup proportion correct (%)
English language arts
Initial Precursor 7,107 21.1     0   0.0     0   0.0
Distal Precursor 7,524 22.3     0   0.0   32 10.6
Proximal Precursor 7,339 21.7     1   0.6 105 34.7
Target 5,931 17.6   36 22.5   74 24.4
Successor 5,842 17.3 123 76.9   92 30.4
Mathematics
Initial Precursor 4,798 20.8     0   0.0     1   0.6
Distal Precursor 3,991 17.3     0   0.0   22 12.4
Proximal Precursor 4,535 19.7   33 63.5   48 27.1
Target 4,920 21.4     9 17.3   61 34.5
Successor 4,791 20.8   10 19.2   45 25.4
3.3.3.1.1 Uniform Differential Item Functioning Model

A total of 842 items for gender were flagged for evidence of uniform DIF. Using the Zumbo and Thomas (1997) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the gender term was added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all but six combinations were found to have a negligible effect-size change after the gender term was added to the regression equation.

Additionally, 1,515 item and focal group combinations across 1,515 items for race were flagged for evidence of uniform DIF. Using the Zumbo and Thomas (1997) effect-size classification criteria, all but two combinations were found to have a negligible effect-size change after the race term was added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all but three combinations were found to have a negligible effect-size change after the race term was added to the regression equation.

Table 3.24 and Table 3.25 summarize the total number of combinations flagged for evidence of uniform DIF by subject and grade for gender and race, respectively, along with the number of flagged combinations with a moderate or large effect size. The percentage of combinations flagged with a nonnegligible effect size ranged from 0% to <1% for both gender and race.

Table 3.24: Combinations Flagged for Evidence of Uniform Differential Item Functioning for Gender
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 38 509   7.5 0
4 58 518 11.2 2
5 56 543 10.3 1
6 77 617 12.5 0
7 46 534   8.6 0
8 54 560   9.6 1
9   3   25 12.0 0
10 10   49 20.4 0
11 40 221 18.1 0
9–10 43 347 12.4 0
11–12 13 216   6.0 1
Mathematics
3 45 347 13.0 0
4 58 464 12.5 0
5 40 477   8.4 0
6 33 337   9.8 0
7 58 442 13.1 0
8 56 436 12.8 0
9 36 304 11.8 1
10 30 270 11.1 0
11 48 332 14.5 0
Table 3.25: Combinations Flagged for Evidence of Uniform Differential Item Functioning for Race
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 100 773 12.9 0
4 116 830 14.0 0
5   90 820 11.0 0
6 106 820 12.9 0
7 126 833 15.1 0
8 100 841 11.9 1
9     4   56   7.1 0
10   23 122 18.9 0
11   91 563 16.2 0
9–10   94 424 22.2 0
11–12     7 110   6.4 0
Mathematics
3   97 703 13.8 0
4   78 822   9.5 1
5 100 855 11.7 0
6   77 674 11.4 0
7   75 661 11.3 0
8 102 711 14.3 1
9   37 381   9.7 0
10   44 360 12.2 0
11   48 405 11.9 0

Table 3.26 provides information about the flagged items with a nonnegligible effect-size change after the addition of the group term, as represented by a value of B (moderate) or C (large). The test development team reviews all items flagged with a moderate or large effect size. The \(\beta_2G\) values (i.e., the coefficients for the group term) in Table 3.26 indicate which group was favored on the item after accounting for total linkage levels mastered, with positive values indicating that the focal group had a higher probability of success on the item and negative values indicating that the focal group had a lower probability of success on the item. The focal group was favored on three combinations.

Table 3.26: Combinations Flagged for Uniform Differential Item Functioning (DIF) With Moderate or Large Effect Size
Item ID Focal Grade EE \(\chi^2\) \(p\)-value \(\beta_2G\) \(R^2\) Z&T* J&G* Window
English language arts
15961 Female 4 ELA.EE.RL.4.1 10.03 .002 −1.04   .047 A B Spring
23698 Female 4 ELA.EE.RL.4.3 18.04 <.001    1.02 .057 A B Fall
14981 Female 5 ELA.EE.RI.5.4 33.15 <.001    −0.40   .820 C C Spring
39201 Female 8 ELA.EE.RI.8.8 11.87 <.001    −0.90   .042 A B Fall
39602 Asian 8 ELA.EE.RL.8.3 38.62 <.001    −0.75   .825 C C Spring
50786 Female 11–12 ELA.EE.RL.11-12.4 13.46 <.001    −0.91   .051 A B Fall
Mathematics
45523 Two or more races 4 M.EE.4.OA.5   7.05 .008 0.22 .787 C C Spring
64572 African American 8 M.EE.8.EE.2 10.73 .001 0.85 .036 A B Spring
67039 Female 9 M.EE.HS.A.SSE.1 10.67 .001 −0.82   .041 A B Spring
Note. ID = identification; EE = Essential Element; \(\beta_2G\) = the coefficient for the group term in the logistic regression DIF detection method; \(\beta_3G\) = coefficient for the interaction between the number of linkage levels mastered term and the group term; Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
* Effect-size measure: A indicates evidence of negligible DIF, B indicates evidence of moderate DIF, and C indicates evidence of large DIF.
3.3.3.1.2 Nonuniform Differential Item Functioning Model

A total of 1,100 items for gender were flagged for evidence of nonuniform DIF. Using the Zumbo and Thomas (1997) effect-size classification criteria, all but two combinations were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all but 31 combinations were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation.

Additionally, 1,845 item and focal group combinations across 1,845 items were flagged for evidence of nonuniform DIF when both the race and interaction terms were included in the regression equation. Using the Zumbo and Thomas (1997) effect-size classification criteria, all but three combinations were found to have a negligible effect-size change after the race and interaction terms were added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all but 12 combinations were found to have a negligible effect-size change after the race and interaction terms were added to the regression equation.

Table 3.27 and Table 3.28 summarize the number of combinations flagged for evidence of nonuniform DIF by subject and grade for gender and race, respectively, along with the number of flagged combinations with a moderate or large effect size. The percentage of combinations flagged with a nonnegligible effect-size change ranged from 0% to 1% for both gender and race.

Table 3.27: Items Flagged for Evidence of Nonuniform Differential Item Functioning for Gender
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 58 509 11.4 3
4 65 518 12.5 4
5 84 543 15.5 7
6 99 617 16.0 3
7 68 534 12.7 2
8 55 560   9.8 3
9   4   25 16.0 0
10   8   49 16.3 0
11 42 221 19.0 0
9–10 44 347 12.7 2
11–12 26 216 12.0 2
Mathematics
3 56 347 16.1 0
4 92 464 19.8 0
5 75 477 15.7 1
6 51 337 15.1 0
7 80 442 18.1 1
8 63 436 14.4 0
9 42 304 13.8 1
10 31 270 11.5 1
11 57 332 17.2 1
Table 3.28: Items Flagged for Evidence of Nonuniform Differential Item Functioning for Race
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 103 773 13.3 0
4 143 830 17.2 0
5 111 820 13.5 1
6 124 820 15.1 0
7 131 833 15.7 0
8 114 841 13.6 1
9     7   56 12.5 0
10   21 122 17.2 0
11   95 563 16.9 0
9–10   97 424 22.9 0
11–12   19 110 17.3 0
Mathematics
3 123 703 17.5 2
4 122 822 14.8 3
5 119 855 13.9 1
6 102 674 15.1 0
7 125 661 18.9 0
8 111 711 15.6 1
9   45 381 11.8 0
10   57 360 15.8 0
11   76 405 18.8 3

Table 3.29 summarizes information about the flagged items with a nonnegligible change in effect size after adding both the group and interaction term, where B indicates a moderate effect size and C a large effect size. In total, 38 combinations had a moderate effect size and five combinations had a large effect size. The \(\beta_3\text{X}G\) values in Table 3.29 indicate which group was favored at lower and higher numbers of linkage levels mastered. A total of 22 combinations favored the focal group at higher numbers of total linkage levels mastered and the reference group at lower numbers of total linkage levels mastered.

The test development team reviews all items flagged with a moderate or large effect size. The results of these reviews may be used to prioritize items and testlets for retirement. Updates to the operational pool, including retirements based on the results of these reviews, will be included in the 2024–2025 technical manual update. See section 3.3.2 for updates to the operational pool, which includes updates based on the reviews of items flagged for evidence of nonnegligible DIF presented in the 2022–2023 technical manual update.

Table 3.29: Combinations Flagged for Nonuniform Differential Item Functioning (DIF) With Moderate or Large Effect Size
Item ID Focal Grade EE \(\chi^2\) \(p\)-value \(\beta_2G\) \(\beta_3\text{X}G\) \(R^2\) Z&T* J&G* Window
English language arts
11501 Female 3 ELA.EE.RL.3.2 12.54 .002 1.89 −0.18   .036 A B Spring
31405 Female 3 ELA.EE.RI.3.5 18.03 <.001    −1.08   0.25 .059 A B Fall
80659 Female 3 ELA.EE.RL.3.1 17.54 <.001    2.25 −0.06   .041 A B Spring
14820 Female 4 ELA.EE.RL.4.5 35.91 <.001    2.59 −0.18   .050 A B Fall
15961 Female 4 ELA.EE.RL.4.1 10.07 .006 −1.21   0.01 .047 A B Spring
23698 Female 4 ELA.EE.RL.4.3 19.46 <.001    0.45 0.03 .061 A B Fall
40993 Female 4 ELA.EE.RL.4.1 10.52 .005 −1.05   0.25 .036 A B Fall
14981 Female 5 ELA.EE.RI.5.4 37.67 <.001    0.25 −0.02   .820 C C Spring
33845 Female 5 ELA.EE.RL.5.4 11.30 .004 −2.85   0.16 .044 A B Spring
34566 Female 5 ELA.EE.RI.5.7   9.43 .009 2.83 −0.08   .042 A B Spring
55472 Female 5 ELA.EE.RL.5.6 12.51 .002 −0.96   0.23 .045 A B Fall
55475 Female 5 ELA.EE.RL.5.6 12.62 .002 −1.52   0.19 .045 A B Fall
56085 Female 5 ELA.EE.RL.5.4 22.48 <.001    0.40 −0.17   .049 A B Fall
75304 Female 5 ELA.EE.RI.5.1 19.25 <.001    −1.47   0.41 .050 A B Spring
23256 African American 5 ELA.EE.RL.5.3 42.93 <.001    1.26 −0.24   .038 A B Fall
13433 Female 6 ELA.EE.RL.6.4 12.09 .002 −0.78   0.15 .038 A B Fall
21409 Female 6 ELA.EE.RI.6.8 10.77 .005 −1.67   0.07 .044 A B Fall
39755 Female 6 ELA.EE.RL.6.6 24.99 <.001    −1.46   0.26 .056 A B Fall
33860 Female 7 ELA.EE.RL.7.5 12.81 .002 −1.35   0.06 .049 A B Fall
49638 Female 7 ELA.EE.RI.7.3 14.88 <.001    3.11 −0.11   .051 A B Fall
39201 Female 8 ELA.EE.RI.8.8 11.87 .003 −0.88   0.00 .042 A B Fall
56991 Female 8 ELA.EE.RI.8.5 11.17 .004 3.21 −0.09   .040 A B Spring
75954 Female 8 ELA.EE.RI.8.3 12.65 .002 3.35 −0.10   .048 A B Spring
39602 Asian 8 ELA.EE.RL.8.3 38.75 <.001    −0.66   0.00 .825 C C Spring
70285 Female 9–10 ELA.EE.RL.9-10.3 15.63 <.001    −1.22   0.14 .037 A B Spring
70825 Female 9–10 ELA.EE.RL.9-10.2 12.33 .002 −0.76   0.12 .037 A B Fall
33125 Female 11–12 ELA.EE.L.11-12.4.a   9.86 .007 2.06 −0.06   .041 A B Fall
50786 Female 11–12 ELA.EE.RL.11-12.4 13.64 .001 −1.07   0.01 .052 A B Fall
Mathematics
68833 African American 3 M.EE.3.OA.1-2 29.45 <.001    0.63 −0.19   .042 A B Fall
68836 African American 3 M.EE.3.OA.1-2 27.71 <.001    0.53 −0.18   .039 A B Fall
12186 African American 4 M.EE.4.MD.3   7.60 .022 −0.31   0.12 .755 C C Spring
45523 Two or more races 4 M.EE.4.OA.5   8.91 .012 0.09 0.05 .787 C C Spring
70558 African American 4 M.EE.4.MD.2.a 21.64 <.001    −1.88   0.68 .064 A B Spring
15727 Female 5 M.EE.5.NBT.5 10.47 .005 −0.14   0.07 .759 C C Spring
69046 African American 5 M.EE.5.G.1-4 28.92 <.001    0.47 −0.24   .047 A B Fall
70834 Female 7 M.EE.7.G.1 17.10 <.001    −1.83   0.22 .053 A B Spring
64572 African American 8 M.EE.8.EE.2 12.14 .002 1.95 −0.05   .041 A B Spring
67039 Female 9 M.EE.HS.A.SSE.1 10.77 .005 −1.02   0.01 .042 A B Spring
25442 Female 10 M.EE.HS.A.CED.2-4 10.60 .005 −0.04   0.05 .038 A B Fall
39090 Female 11 M.EE.HS.F.BF.2   8.96 .011 0.23 −0.07   .038 A B Fall
16496 African American 11 M.EE.HS.G.CO.6-8 15.71 <.001    0.02 −0.12   .037 A B Fall
40521 African American 11 M.EE.HS.N.RN.1 42.08 <.001    0.71 −0.36   .039 A B Fall
81243 African American 11 M.EE.HS.S.IC.1-2 23.82 <.001    1.36 −0.30   .039 A B Fall
Note. ID = identification; EE = Essential Element; \(\beta_2G\) = the coefficient for the group term in the logistic regression DIF detection method; \(\beta_3\text{X}G\) = the coefficient for the interaction between the number of linkage levels mastered term and the group term; Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
* Effect-size measure: A indicates evidence of negligible DIF, B indicates evidence of moderate DIF, and C indicates evidence of large DIF.

3.4 Conclusion

During 2023–2024, the test development teams conducted on-site events for both item writing and external review. Overall, item writers wrote 286 computer-delivered testlets for ELA and mathematics, and 435 testlets were externally reviewed. Following external review, the test development teams promoted 53% of ELA testlets and 83% of mathematics testlets to field testing without additional revisions. We field tested 516 testlets across grades, subjects, and windows, and 325 (63%) were promoted to the operational pool. Of the content already in the operational pool, most items had standardized difference values within two standard deviations of the mean for the EE and linkage level, 7,539 (>99%) items were not flagged for nonnegligible uniform DIF, and 7,505 (99%) items were not flagged for nonnegligible nonuniform DIF.