In a nutshell: Students in Texas will be among the first to have state-mandated tests scored by an AI-powered platform. The written portion of the State of Texas Assessments of Academic Readiness (STAAR) exam, which gauges skill levels in reading, writing, science, and social studies, will be graded using an "automated scoring engine."

The test was redesigned in 2023. The revised exam features fewer multiple-choice questions and more open-ended questions, known as constructed-response items; the new version has as many as seven times more of them than before.

According to the Texas Tribune, the natural language processing approach could save the state upwards of $20 million per year – money that would have otherwise been spent to hire human scorers from a third-party contractor.

Jose Rios, director of student assessment at the Texas Education Agency (TEA), said the agency wanted to keep as many open-ended responses as possible, but noted that they take an incredible amount of time to score.

Machines aren't replacing human scorers entirely – at least, not yet. Last year, the TEA hired roughly 6,000 temporary human scorers. This year, it will need fewer than 2,000.

A quarter of all constructed responses initially scored by AI will be reevaluated by humans, as will tests in which the computer isn't confident of its score. Responses written in a language other than English and those with slang words will also be passed along to human scorers.
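That human-in-the-loop routing amounts to a short list of rules. The Python sketch below is purely illustrative: the names (needs_human_review, ScoreResult), the 0.80 confidence cutoff, and the language and slang checks are all made up here as stand-ins; the TEA has not published its actual criteria or implementation.

```python
import random
from dataclasses import dataclass

AUDIT_RATE = 0.25          # a quarter of machine-scored responses are rechecked
CONFIDENCE_FLOOR = 0.80    # hypothetical cutoff; the real threshold isn't public

@dataclass
class ScoreResult:
    score: int         # points the engine would award
    confidence: float  # engine's certainty in that score, 0.0 to 1.0

def is_english(text: str) -> bool:
    """Stub language check; a real system would use a language-ID model."""
    return all(ord(ch) < 128 for ch in text)

def contains_slang(text: str) -> bool:
    """Stub slang detector; a real system would use a lexicon or classifier."""
    slang = {"gonna", "wanna", "lol", "bruh"}
    return any(word in slang for word in text.lower().split())

def needs_human_review(text: str, result: ScoreResult) -> bool:
    """Route a response to a human scorer per the rules in the article:
    random 25% audits, low engine confidence, non-English text, or slang."""
    if random.random() < AUDIT_RATE:
        return True  # routine quality-control audit
    if result.confidence < CONFIDENCE_FLOOR:
        return True  # the engine isn't confident of its own score
    if not is_english(text) or contains_slang(text):
        return True  # outside what the engine was trained to handle
    return False
```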

The automated scoring engine was trained on 3,000 responses that first went through two rounds of human scoring. Those samples let the engine learn the common characteristics of responses and taught it to assign the same score a human would.
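The TEA hasn't said what kind of model sits inside the engine, but training a scorer on a few thousand double-scored samples is a standard supervised-learning setup. Here's a minimal sketch using scikit-learn, assuming simple TF-IDF text features and treating the score the human raters agreed on as the label; the actual system may work quite differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: ~3,000 responses, each already scored
# twice by humans; `labels` holds the score the two rounds agreed on.
responses = ["The author argues that ...", "Water expands when ..."]  # etc.
labels = [2, 1]  # score points assigned by the human raters

# Turn each response into word/phrase statistics, then fit a
# classifier that maps those features to a human-style score.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(responses, labels)

# At scoring time the model is frozen: it applies what it learned
# from the training set and does not update from new responses.
predicted = model.predict(["Plants need sunlight because ..."])
confidence = model.predict_proba(["Plants need sunlight because ..."]).max()
```

Because the model is fit once and then frozen, it behaves the way the agency describes below: every new response is scored against the original training, with no per-response learning.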

Chris Rozunick, division director for assessment development at the TEA, said the agency has always had very robust quality-control processes with human scorers, and that the process is similar with a computer system. Just don't call it AI.

"We are way far away from anything that's autonomous or can think on its own," Rozunick said. For example, the scoring solution doesn't "learn" from one response to the next; rather, it'll always defer to its original training as a reference.

Image credit: Pixabay, Katerina Holmes