
Students’ Reading Struggles Tied to Flawed Assessment. So Why Do Schools Use It?

A woman wearing a scarf sits on a curb with her arms crossed.
San Francisco mom Havah Kelley has been trying for years to get help for her son, who has dyslexia. The boy repeatedly did fine on the Benchmark Assessment System, even though other reading tests showed he lagged years behind his classmates. (Kori Suzuki for APM Reports)

The first thing Havah Kelley noticed was her son’s trouble with the alphabet. The San Francisco mom reviewed letters with him for hours at a time, reciting their names and tracing their shapes. But Kelley’s son couldn’t write most of them on his own. He reversed them or scrawled incoherent shapes. Halfway through his kindergarten year, his teacher said he still couldn’t recognize some letters on sight.

But that teacher told Kelley not to fret. She said she’d given the boy San Francisco Unified School District’s go-to reading test: the Benchmark Assessment System. His reading level on the test had landed within the appropriate range for his age. The teacher said he probably just needed time to catch on.

Kelley, a single parent living in Bayview, knew something wasn’t right. That year, in 2017, she asked the school to test her son for a learning disability. She said they gave her the runaround; their reading test, after all, showed her son was doing fine.

Near the end of first grade, the school finally agreed to do a more comprehensive evaluation. The results showed her son was so far behind his peers in reading and writing that he fit the profile for dyslexia. The Benchmark Assessment System had been — and would continue to be — wrong about how well he could read.

The Benchmark Assessment System (BAS) is one of the most popular measures of early reading ability in American elementary schools. Teachers are meant to use it as a checkup to see how students progress throughout the year. But researchers who have studied it say the BAS is too often wrong to be useful. It is also more expensive for schools and more time-consuming for teachers to administer, according to an analysis comparing it to other tests. One professor who analyzed the BAS said it was worse at identifying struggling readers than any assessment he had ever seen. That means struggling readers might be less likely to get the help they need before they fall even further behind their classmates.


“The more research I do, the more I realize it’s problematic,” Kelley said. “The assessment itself is faulty. And my son’s story is proof of that.”

(Kelley requested that her son not be identified so she could candidly discuss his academic performance and medical history while maintaining his privacy.)

For six years, Kelley has been fighting to get her son, now 13, a proper education in how to read. And she has tried to convince the district to drop the test that missed his reading difficulties.

This spring, after years of defending the BAS, San Francisco Unified finally conceded the test is too frequently inaccurate. It’s joining other schools around the country — including Fort Worth in Texas, Baltimore County in Maryland and Nashua in New Hampshire — in dropping the BAS as their district-wide assessment.

At a March school board meeting, San Francisco’s top administrators presented internal data showing the test did a poor job predicting how kindergarteners and first graders eventually scored on the state’s standardized test. Superintendent Matt Wayne said he’s looking for a replacement — one that “ensures that children are literate.”

However, the studies showing the problems with the BAS had been available for the better part of a decade. And a key tenet of the theory of teaching reading that underpins part of the test had been undermined by scientific evidence decades earlier.

Heinemann, the company that publishes the BAS, declined to answer questions for this article — or for “Sold a Story,” a podcast I co-reported that explored problems with a number of its educational products. A company spokesperson wrote in an email that “there is not confidence that when we provide information, it will be correctly and fairly represented.” An attorney representing Heinemann also sent APM Reports a letter questioning the validity of research that identified problems with the BAS.

Kelley said students like her son shouldn’t be going to middle school stumbling through text. Researchers have figured out how teachers can overcome most children’s reading difficulties, but they need good assessments to catch them early.

“How can you fix something if you don’t know how broken it is?” she asked.

Test misses most struggling readers, studies show

The Benchmark Assessment System was created by two of America’s most influential authorities on teaching reading: Irene Fountas, a professor at Lesley University in Massachusetts, and Gay Su Pinnell, a retired professor at The Ohio State University. Their books are among the most assigned in teacher-prep programs, and school districts widely use their curricular materials.

In the 1990s, Fountas and Pinnell began creating a system to help teachers find “just right books” for each student. It was a Goldilocks-type search for a text that was not too easy and not too hard. They wrote that matching a student to the right book would allow the student to focus on the story’s meaning.

Fountas and Pinnell released the BAS with their publisher, Heinemann, in 2007. The test attempts to identify children’s reading abilities by judging how well they progress through a set of stories rated at increasingly difficult reading levels. Those “leveled books” are supposed to represent each point in developing skilled reading, from Level A to Level Z.

To administer the Benchmark Assessment System, a teacher has a child read a series of those books out loud. The process takes about 20 to 30 minutes for each student, and it can take even longer in the upper grades. Fountas and Pinnell recommend getting a substitute to fill in for one or two days, so the teacher has enough time to get through the entire class.

Fountas and Pinnell’s products have made Heinemann tens of millions of dollars in revenue annually, a review of business filings shows. In 2012, Heinemann brought in around $123 million in sales, about half of which came from Fountas and Pinnell’s products, according to a financial report from its parent company. Heinemann broke sales records every year after that — until the pandemic snapped its growth streak in 2020.

The assessment is used in about one in six American elementary schools, according to recent surveys of educators. San Francisco has used the BAS for nearly a decade. In 2020, it paid more than $175,000 for the newest edition of the assessment. And at least 60 other school districts in California, including Long Beach, Palo Alto and Santa Monica-Malibu, purchased the BAS in the past three years, according to GovSpend, a government contracting database.

Across the country, teachers organize reading groups around the BAS’s findings. Major publishers of children’s literature, including Penguin Random House, Simon & Schuster and Candlewick Press, have all marketed books compatible with the BAS’s level system.

Fountas and Pinnell have released only one study to support the Benchmark Assessment System’s accuracy. But even their own study casts doubt on the reliability of the test. The study showed that a student reading two books rated at the same level would often get different results on the assessment. Only 43% of students in kindergarten through second grade scored at the same level on both books.

Independent research comparing the BAS to other assessments has found even bigger problems.

Matthew Burns, a University of Florida special education professor who conducted the first peer-reviewed study of the BAS, said that until he decided to try, the test had never been independently validated to see how closely its results aligned with other early reading assessments.

One of his studies showed that the BAS could distinguish between proficient and struggling readers only about half the time; the odds were slightly better than chance.

“So I could buy this test, train all my teachers to give it, take about 30 minutes per kid,” Burns said. “Or really just have a teacher flip a coin for every kid, and they’ll get it right just as often.”

And when it came to identifying the readers who were furthest behind, Burns said, the BAS performed even worse. It missed most of the struggling readers, students like Kelley’s son who needed intensive help. It caught only 31% of those students. Burns called that level of accuracy “shocking,” saying it was “quite literally the lowest I’ve ever seen.”

In that case, Burns said, “flipping a coin would actually be better.”

(A subsequent study by another team of researchers showed a higher 73% accuracy rate for third graders taking the BAS, but that study still found the test caught only 46% of struggling readers.)
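The gap between those two figures reflects two different measures: overall accuracy counts every student the test classified correctly, while the share of struggling readers caught (often called sensitivity) counts only the students who needed help. The minimal Python sketch below shows how the two can diverge. The counts are hypothetical, chosen only so the rates land near the 73% and 46% figures cited above; they are not data from either study, and the names `ScreeningCounts`, `accuracy` and `sensitivity` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Outcomes of a reading screener for one group of students (hypothetical)."""
    true_positives: int   # struggling readers the test flagged
    false_negatives: int  # struggling readers the test missed
    true_negatives: int   # proficient readers the test passed
    false_positives: int  # proficient readers the test flagged


def accuracy(c: ScreeningCounts) -> float:
    """Share of all students the test classified correctly."""
    total = (c.true_positives + c.false_negatives
             + c.true_negatives + c.false_positives)
    return (c.true_positives + c.true_negatives) / total


def sensitivity(c: ScreeningCounts) -> float:
    """Share of struggling readers the test actually caught."""
    return c.true_positives / (c.true_positives + c.false_negatives)


# Hypothetical class of 100 students, 26 of whom are struggling readers.
counts = ScreeningCounts(true_positives=12, false_negatives=14,
                         true_negatives=61, false_positives=13)

print(f"accuracy:    {accuracy(counts):.0%}")     # prints "accuracy:    73%"
print(f"sensitivity: {sensitivity(counts):.0%}")  # prints "sensitivity: 46%"
```

In this made-up group, the test is right about 73 of 100 students, yet it misses 14 of the 26 who are struggling, because most of its correct calls are proficient readers it passed.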

Burns has found that other tests, which are available online for free and take as little as three minutes to administer, were more accurate than the BAS, which can cost close to $500 per classroom and is far more time-consuming.

Another recent study found that the BAS had the least accurate results and by far the highest price tag among three commonly used assessments. The BAS took so long to administer that, accounting for staff time, it cost double or triple what the other tests did. The researchers recommended against using it to identify struggling readers.

The problem, Burns hypothesized, is that the leveled books themselves aren’t a good measure of students’ reading abilities.

In another study, Burns asked second and third-graders to read aloud from two books, both at their designated level. As the children read, he took a simple measure of the number of words they read correctly per minute. He said he expected the scores to match up closely. But just as Fountas and Pinnell noted in their own study almost a decade earlier, he found that students’ reading of the two books was, at best, only moderately correlated.
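The kind of comparison described above can be sketched in a few lines of Python. This is an illustration only, not Burns' actual procedure: the helper names and the student scores are hypothetical, and a standard Pearson correlation is assumed as the measure of how closely the two readings line up.

```python
from math import sqrt


def words_correct_per_minute(words_read: int, errors: int, seconds: float) -> float:
    """Words read correctly, scaled to a one-minute rate."""
    return (words_read - errors) * 60 / seconds


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores: the same six students reading two books
# rated at the same level (words correct per minute).
book_a = [52.0, 61.0, 48.0, 75.0, 66.0, 58.0]
book_b = [70.0, 49.0, 55.0, 68.0, 50.0, 73.0]

print(f"correlation between books: {pearson(book_a, book_b):.2f}")
```

If a book's level alone determined how well a child read it, the two lists would rise and fall together and the correlation would be close to 1. A value well below that suggests something besides the assigned level is driving the scores.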

“There’s not a lot of consistency,” Burns explained. “They read those two books with a very different level of skill. That means there’s something else other than the supposed reading level contributing to how well they read these books.” He inferred that a child’s vocabulary and background knowledge about a topic matter far more. For instance, a kid who’s obsessed with sharks might be able to read a story set in an aquarium well above their expected reading level.

Likewise, it can be difficult for a test to distinguish between a student struggling to read words and one struggling to understand an unfamiliar subject with all its new vocabulary. It would be like asking a literature professor to summarize a car repair manual; it probably wouldn’t indicate much about their overall reading ability.

Fountas and Pinnell each declined multiple interview requests.

John Cuti, a New York attorney representing their publisher, Heinemann, wrote in a letter that this article appears to “double down” on “misstatements and mischaracterizations.” Cuti also dismissed Burns’ research.

“That eight-year-old study is limited and flawed in several important ways and is not a reliable indicator of the effectiveness of BAS,” he wrote. He did not elaborate on the alleged flaws.

Burns released his research in 2015, when Kelley’s son was still in preschool, and he believed it would prompt districts to take another look at their assessments. He now says it was naïve to think a study could be more persuasive than a publishing company’s sales team. “It’s almost unfair,” he said, “the level of marketing.”

At one major reading conference, educational publishers filled up an exhibition hall with booths of products for sale, Burns said. Under a huge “Fountas & Pinnell” banner that took up half the wall, Heinemann had posted teacher testimonials with “incredible anecdotes” of students succeeding with their products, he recalled.

“You can’t help but buy into the enthusiasm and the excitement. So, when one study comes out, three studies, four studies — whatever comes out — that says it doesn’t really work, it’s too late,” Burns continued. “You’re already bought in.”

Dyslexia undetected

Kelley didn’t know anything about reading assessments when she first dropped her son off for kindergarten. She just wanted to find a good school for him — needed to, really. A car accident left Kelley, a social services coordinator, unable to work. She and her son have been surviving on her disability payments since then. Kelley often wished she could give more to her son, but she reassured herself that being at home would give her time to help with her son’s schooling.

“I kind of made this promise to him that I would love him double to compensate,” she said. “And I would make sure that I was actively involved in his education. So even though we don’t have a lot, he could have that opportunity to thrive.”

Every two weeks, they went to the library together and picked out as many books as they could carry, she said. Kelley signed him up for his own library card to double their maximum checkout to 60 books. She bought books wherever she could find them — at garage sales and thrift stores — and filled bookcases in their apartment.

In kindergarten, Kelley asked the teacher what kind of books her son should be reading at home. She wanted something to match the BAS level she’d heard so much about. The kindergarten teacher said to let her son pick out the books.

The boy always seemed to gravitate to the books he’d had since he was 3 years old, which he’d practically memorized. When Kelley asked him to try the new library books, he’d look at the pictures to guess what the sentences said. Kelley tried to help him sound out the words, but he’d get upset. He was adamant that she was wrong: He wasn’t supposed to read that way, she recalled him saying.

Kelley said her son’s writing was the biggest giveaway. When she went in for first-grade parent-teacher conferences, she opened his writer’s notebook. “I flipped through his notebooks, and I saw a date and nothing: no writing, no input, zero. And I looked around the classroom and I saw other kids had, like, little paragraphs or little sentences. And he had nothing,” she said. “Blank.”

That was when Kelley convinced the school to give her son a full assessment. He was found to be dyslexic. But by the end of that school year, the BAS once again said he was close to where he should be. Kelley didn’t buy it. She kept thinking to herself: “I’m losing time, I’m losing time.”

A debunked theory

It’s no surprise that Kelley’s son used pictures to figure out what the words on the page said. That’s one strategy Fountas and Pinnell encourage students to use. Other popular curricula teach similar strategies, even though research dating back to the 1970s has shown that the approach is ineffective and potentially harmful to children’s progress in reading.

The books used to score the BAS often reward those problematic strategies. Especially at the lower levels, the books use repetitive sentence patterns accompanied by illustrations that make guessing words easy so a student can “read” them even if they can’t sound out the words in the sentences.

As a child reads for the BAS, the teacher notes every word the child misses and decides which of three sources of information — sometimes called cues — might have thrown them off: Did the incorrect word make sense in the story context? Did it fit grammatically? Did it match any letters? The results are meant to reveal a pattern, hinting at students’ strategies as they make their way through text.

At a November 2021 forum, Lisa Levin, an administrator at San Francisco Unified, told parents that teachers in the district used the BAS to understand “what readers are doing at the point of difficulty” — which cues they are using to decipher words, and which ones are throwing them off.

If a student is relying too heavily on the pictures, a teacher would want to draw their attention to the letters, she said. Conversely, if a student is relying too much on phonics, “like all they’re doing is sounding it out, so their fluency kind of is chunky and not fluid,” they might tell a very young reader to look at the picture, she said.

Levin gave the example of a 5-year-old who comes across the word “umbrella.” Rather than “having to pause and stop and trying to sound out those long words,” they should just look at the picture, she said.

However, decades of research by cognitive scientists have shown that encouraging students to use those clues to read can be detrimental to their progress in reading and, therefore, to their entire education.

“That’s not the way that we want kids to read words,” said Mark Seidenberg, a cognitive scientist at the University of Wisconsin who has conducted extensive research on how the brain processes language. “The idea that the child should be using all types of information all the time to read words is fundamentally wrong.”

Seidenberg said Fountas and Pinnell have it “backwards.” Their materials prompt students to use patterns, pictures and context. They give students “strategies for dealing with their failures,” rather than teaching them to read, he said. And the BAS measures how well they use all those flawed strategies.

Rebecca Fedorko, a special education teacher who worked with children like Kelley’s son in San Francisco schools, told administrators at the 2021 forum that Fountas and Pinnell’s materials only made her job harder. She said she had to undo the bad habits their system taught struggling readers to rely on.

“It teaches them to guess about words, instead of focusing on sounding them out,” said Fedorko, who has since left to work in another school. “I spend roughly one to two months at the beginning of every year trying to get my students to stop doing this. It directly contradicts what we’re doing in special education.”

After resisting that kind of criticism for years, Fountas and Pinnell now appear to be working on revisions to the BAS and their curriculum. Last year, their publisher offered teachers $25 Amazon gift cards to review a proposed test version that de-emphasizes the use of cues, according to an email one participant shared with APM Reports.

Havah Kelley (Kori Suzuki for APM Reports)

‘The damage has been done’

During the summer after her son was diagnosed with dyslexia, Kelley got him a scholarship to Lindamood-Bell, a tutoring company specializing in intensive reading instruction. And for the first time, she saw that there was another way to do it, one that seemed much more effective than the one his school had used.

The Lindamood-Bell tutors helped her son break spoken words into individual sounds and showed how those sounds matched up with letters. It was so different from what the teachers in San Francisco schools had done, she remembered thinking. Kelley said her son had “one lightning-bolt moment after another” with the lessons.

But when the school year began, teachers still pushed him to use pictures and context to figure out words — not to rely on sounding them out. Kelley could see the toll it was taking on her son. He knew he was behind, and he was frustrated. He’d complain of terrible headaches. When it was time to read, he’d sometimes go to the nurse with a stomachache. “There’s only so much he can take,” Kelley said.

As Kelley’s son progressed through second, third and fourth grades, he kept moving up the BAS’s levels. His scores were so high at one point that the principal suggested they discontinue special education services.

In fifth grade, when her son received a full reevaluation of his learning disability, as required under federal special education law, the results showed he’d made little progress on some reading measures compared to his peers. On others, he’d even slid backward. The evaluator said that, in some respects, Kelley’s son was still effectively reading at a first-grade level.

Kelley said getting such inaccurate information from the BAS has been disorienting. The school had told her all along that he was meeting his goals based on Fountas and Pinnell’s measures. But every outside assessment had given her different results, nearly “the complete opposite of what I’d been hearing,” she said. In early 2022, halfway through fifth grade, Kelley transferred her son to a specialized school for children with disabilities.

About half of San Francisco’s third graders aren’t where they should be in reading. On last year’s state test, 49% scored below grade-level standards. And there are stark disparities by race and ethnicity. Only 23% of Black third graders and 24% of Hispanic third graders met the standard in reading.

A spokesperson for San Francisco Unified declined APM Reports’ repeated requests to make district officials available for an interview.

Officials at San Francisco Unified are only now acknowledging that the BAS distorted their view of students’ early reading abilities. They’ve read the research parent advocates sent them showing there are likely other children in the district, like Kelley’s son, that the test missed.

“Can a student succeed on [the Benchmark Assessment System] and not be literate?” Superintendent Wayne asked rhetorically at the March school board meeting. “That’s the thing: Yes.”

Wayne said he doesn’t want teachers to feel that they can’t sit with children and listen to them read aloud. However, he said the district needs an early reading test to accurately measure foundational skills such as sounding out words. “And that’s the piece that’s missing,” Wayne said.

Kelley is now collaborating with the district. She’s been meeting with administrators as they select a new language arts curriculum and accompanying assessments. She says she’s advocating for San Francisco’s other struggling readers. But she says the changes are coming too late for her family.

Her son is now in seventh grade and still struggles with reading, “just throwing out words” when he’s stuck.

“The damage has been done, and it’s not an easy fix. It’s going to take a lot of time to get him just to that baseline,” Kelley said. “And I still don’t know if we’re going to be OK.”

Additional reporting by Emily Hanford and Will Callan.

This story was co-published with APM Reports. Its podcast “Sold a Story” investigates how teaching kids to read went so wrong.


