Will Computers Ever Be as Good as Physicians at Diagnosing Patients?

Soon after the well-publicized trouncing, IBM announced that one of its first “use cases” for Watson would be medicine. Sean Hogan, vice president for IBM Healthcare, told me that “health care jumped out as an area whose complexity and nuances would be receptive to what Watson was representing.”

Sticking Up for Team Human

Andy McAfee, coauthor with Erik Brynjolfsson of the terrific book "The Second Machine Age," agrees with Khosla that computers will ultimately take over much of what physicians do, including diagnosis. “I can’t see how that doesn’t happen,” McAfee, a self-described “technology optimist,” told me when we met for lunch near his MIT office. McAfee and Brynjolfsson argue that the confluence of staggering growth in computing power, zettabytes of fully networked information available on the Web, and the “combinatorial power” of innovation means that areas that seemed like dead ends, such as artificial intelligence in medicine, are now within reach. They liken the speed with which old digital barriers are falling to Hemingway’s observation about how a person goes broke: “gradually, then suddenly.”

In speaking with both McAfee and Khosla, I felt a strange obligation to stick up for my teams: humans and the subset of humans called doctors. I told McAfee that while I was in awe of the driverless car and IBM’s victories in chess (over world champion Garry Kasparov in 1997) and Jeopardy, he just didn’t understand how hard medicine is. Answering questions posed by Alex Trebek like, “While Maltese borrows many words from Italian, it developed from a dialect of this Semitic language” (the correct response is “What is Arabic?”—Watson answered it, and 65 of the 74 other questions it rang in for, correctly) is tricky, sure, but, at the end of the day, one is simply culling a series of databases to find a fact—a single right answer.

Medical diagnosis isn’t like that. For one thing, uncertainty is endemic, so that the “correct” answer is often a surprisingly probabilistic notion. For another, many diagnoses reveal themselves over time. The patient may present with, say, a headache, but not a worrisome one, and so the primary treatment is reassurance, Tylenol, and time. If the headache worsens over the next two weeks—particularly if it is now accompanied by additional symptoms such as weakness or nausea—that’s an entirely different story.

McAfee listened sympathetically—he’s obviously heard scores of versions of the "You just don’t understand; my work is different" argument—and then said, “I imagine there are a bunch of really smart geeks at IBM taking notes as guys like you describe this situation. In their heads, they’re asking, ‘How do I model that?’”

Undaunted, I tried another tack on Khosla when we met in his office in Menlo Park. “Vinod,” I said, “in medicine we have something we call the ‘eyeball test.’ That means I can see two patients whose numbers look the same”—things like temperature, heart rate, and blood counts—“and my training allows me to say, ‘That guy is sick [I pointed to an imaginary person across the imposing conference table] and the other is okay.’” And good doctors are usually right, I told him, as we possess a kind of sixth sense that we acquire from our training, our role models, and a thousand cases of trial and error.

Before Khosla could dismiss this as the usual whining from a dinosaur on the edge of extinction, I tossed him an example from his own world. “I’ll bet you have CEOs of start-ups constantly coming through this office pitching their companies,” I said. “I can imagine two companies that look the same on paper: both CEOs have Stanford MBAs; the proposals have similar financials. Your skill is to be able to point to one and say, ‘Winner’ and to the other, ‘Loser,’ and I’m guessing you’re right more often than not. You’re using information that isn’t measurable. Right?”

Nice try. He didn’t budge. “The question is, ‘Is it not measurable or is it not being measured?’” he responded. “And, when does your instinct work and when does it mislead? I think if you did a rigorous study, you’d find that your ‘eyeball test’ is far less effective than you think.”

Secrets of the Great Diagnosticians

There is a rich 50-year history of efforts to build artificial intelligence (AI) systems in health care, and it’s not a particularly uplifting story. Even technophiles admit that the quest to replace doctors with computers—or even the more modest ambition of providing them with useful guidance at the point of care—has been overhyped and unproductive. But times have changed. The growing prevalence of electronic health records offers grist for the AI and big data mills, grist that wasn’t available when the records were on paper. And in this, the Age of Watson, we have new techniques, like natural language processing and machine learning, at our disposal. Perhaps this is our “gradually, then suddenly” moment.

The public worships dynamic, innovative surgeons like Michael DeBakey; passionate, insightful researchers like Jonas Salk; and telegenic show horses like Mehmet Oz. But we seldom hear about those doctors whom other physicians tend to hold in the highest esteem: the great medical diagnosticians. These sages, like the legendary Johns Hopkins professors William Osler and A. McGehee Harvey, had the uncanny ability to deduce the truth from what others found to be a jumble of symptoms, signs, and lab results. In fact, Sir Arthur Conan Doyle, a physician by training, modeled Sherlock Holmes on one of his old professors, Joseph Bell, a renowned diagnostician at Edinburgh’s medical school.

For most doctors, diagnosis forms the essence of their practice (and of their professional souls), which may help explain why we find it so painful to believe that this particular skill could be replaced by silicon wafers.

In the 1970s, a Tufts kidney specialist named Jerome Kassirer (who later became editor of the New England Journal of Medicine) decided to try to unlock the cognitive secrets of the great diagnosticians. If he succeeded, the rewards could be great. The insights, problem-solving strategies, and reasoning patterns of these medical geniuses might be teachable to other physicians, perhaps even programmed into computers.

Kassirer focused first on the differential diagnosis, the method that doctors have long used to inventory and sort through their patients’ problems. The differential diagnosis is to a physician what the building of hypotheses is to a basic scientist: the core work of the professional mind. Let’s say a female patient complains of right lower abdominal pain and fever. We automatically begin to generate “a differential,” including appendicitis, pelvic inflammatory disease, kidney infection, and a host of less common disorders—some of them quite serious. Our job is to weigh the facts at hand in an effort to ultimately “rule in” one diagnosis on the list and “rule out” the others. Sometimes, the information we gather from the history and physical examination is sufficient.

More often, particularly when patients are truly ill, we require additional laboratory or radiographic studies to push one of the diagnoses over the “rule in” line. There is considerable skill, and no small amount of art, involved in this process. For one thing, we need to figure out whether the patient’s symptoms are part of a single disease or are manifestations of two or more distinct illnesses. The principle known as Occam’s Razor bids us to try to find a unifying diagnosis for all of a patient’s symptoms. But as soon as medical students memorize this so-called Law of Clinical Parsimony, we whipsaw them with Hickam’s Dictum, which counters, irreverently, that “patients can have as many diseases as they damn well please.”

Setting the “rule in” threshold is yet another challenge, since it’s wholly dependent on the context. For diseases with relatively benign treatments and prognoses—let’s say, stomach discomfort with no alarming features—I might make the diagnosis of “nonulcer dyspepsia” if I’m 75 percent certain that this is what’s going on. Why? Dyspepsia is a not-too-serious illness, the other illnesses that might present with the same symptoms aren’t likely to be acutely life-threatening either, and dyspepsia has a safe, inexpensive, and fairly effective treatment. All of this makes a 75 percent threshold high enough for me to try an acid-blocker and see what happens.

Now let’s turn to a patient who presents with acute shortness of breath and pleuritic chest pain. In this patient, I’m considering the diagnosis of pulmonary embolism (a blood clot to the lungs), a more serious disorder whose treatment (blood thinners) is riskier. Now, I’d want to be at least 95 percent sure before attaching that diagnostic label. And I won’t rule in a diagnosis of cancer—with its psychological freight, prognostic implications, and toxic treatments—unless I’m close to 100 percent certain, even if it takes a surgical biopsy to achieve this level of confidence.
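The context-dependent threshold described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a clinical tool: the 75 and 95 percent figures come from the examples in the text, while the 0.99 used for cancer is an assumed stand-in for “close to 100 percent,” and the names RULE_IN_THRESHOLDS and can_rule_in are hypothetical.

```python
# Rule-in thresholds from the examples in the text. The 0.99 for cancer is an
# assumption standing in for "close to 100 percent".
RULE_IN_THRESHOLDS = {
    "nonulcer dyspepsia": 0.75,
    "pulmonary embolism": 0.95,
    "cancer": 0.99,
}


def can_rule_in(diagnosis: str, estimated_probability: float) -> bool:
    """Return True when the estimated probability meets that diagnosis's bar."""
    return estimated_probability >= RULE_IN_THRESHOLDS[diagnosis]


# The same 80 percent estimate is enough for one diagnosis but not the other.
print(can_rule_in("nonulcer dyspepsia", 0.80))  # True
print(can_rule_in("pulmonary embolism", 0.80))  # False
```

The point of the sketch is that the number attached to the patient is the same in both calls; only the stakes of being wrong change the answer.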

Kassirer and his colleagues observed the diagnostic reasoning of scores of clinicians. They found that the good ones employed robust strategies to answer these knotty questions, even if they couldn’t always articulate what they were doing and why. The researchers ultimately came to appreciate that the physicians were engaging in a process called “iterative hypothesis testing” to transform the differential diagnosis (or, more accurately, diagnoses, since sick patients often have a variety of abnormalities to be explained) into something actionable. After hearing the initial portion of a case, the doctors began drawing possible scenarios to explain it, modifying their opinions as they went along and more information became available.

For example, when a physician confronts a case that begins with, “This 57-year-old man has three days of chest pain, shortness of breath, and lightheadedness,” she responds by thinking, “The worst thing this could be is a heart attack or a pulmonary embolism. I need to ask if the chest pain bores through to the back, which would make me worry about aortic dissection [a rip in the aorta]. I’ll also inquire about typical cardiac symptoms, such as sweating and nausea, and see if the pain is squeezing or radiates to the left arm or jaw. But even if it doesn’t, I’ll certainly get an EKG to rule out a heart attack or pericarditis [inflammation of the sac that surrounds the heart]. If he also reports a fever or a cough, I might begin to suspect pneumonia or pleurisy. The chest X-ray should help sort that out.”

Every answer the patient gives, and each positive or negative finding on the physical examination (yes, there is a heart murmur; no, the liver is not enlarged) triggers an automatic, almost intuitive recalibration of the most likely alternatives. When I see a master clinician at work—my favorite is my UCSF colleague Gurpreet Dhaliwal, who was profiled in a 2012 New York Times article—I know that these synapses are firing as he asks a patient a series of questions that may seem unrelated to the patient’s presenting complaint but are directed toward “narrowing the differential.” It turns out that there’s an even more impressive piece of cognitive magic going on. The master clinician embraces certain pieces of data (the patient’s trip to rural Thailand last year) while discarding others (an episode of belly pain and bloating three weeks ago). This is the part of diagnostic reasoning that beginners find most vexing, since they lack the foundational knowledge to understand why their teacher focused so intently on one nugget of information and all but ignored others that, to the novice, seemed equally crucial. How do the great diagnosticians make such choices?

We now recognize this as a relatively intuitive version of Bayes’ theorem. Developed by the eighteenth-century British theologian-turned-mathematician Thomas Bayes, this theorem (often ignored by students because it is taught to them with the dryness of a Passover matzo) is the linchpin of clinical reasoning. In essence, Bayes’ theorem says that any medical test must be interpreted from two perspectives. The first: How accurate is the test—that is, how often does it give right or wrong answers? The second: How likely is it that this patient has the disease the test is looking for?

These deceptively simple questions explain why, in the early days of the AIDS epidemic (when HIV testing was far less accurate than it is today), it was silly to test heterosexual couples applying for a marriage license, since the vast majority of positive tests in this very low-risk group would be wrong. Similarly, they show why it is foolish to screen healthy 36-year-old executives with a cardiac treadmill test or a heart scan, since positive results will mostly be false positives, serving only to scare the bejesus out of the patients and run up bills for unnecessary follow-up tests. Conversely, in a 68-year-old smoker with diabetes and high cholesterol who develops squeezing chest pain while jogging, there is a 95 percent chance that those pains are from coronary artery disease. In this case, a negative treadmill test only lowers this probability to about 80 percent, so the clinician who reassures the patient that his negative test means that his heart is fine—“take some antacids; it’s OK to keep jogging”—is making a terrible, and potentially fatal, mistake.
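For readers who want to see the arithmetic, here is a minimal Python sketch of the two-question logic. The function name and every test characteristic in it are assumptions chosen for illustration: the text does not give a sensitivity or specificity for the treadmill test, so the values below (80 percent and 95 percent) were picked simply because they reproduce the drop from 95 percent to about 80 percent. The low-risk screening numbers are likewise hypothetical.

```python
def posterior_probability(prior, sensitivity, specificity, test_positive):
    """Probability of disease after a test result, via Bayes' theorem.

    prior        -- how likely the patient was to have the disease before testing
    sensitivity  -- chance the test is positive when disease is present
    specificity  -- chance the test is negative when disease is absent
    """
    if test_positive:
        p_result_given_disease = sensitivity
        p_result_given_healthy = 1.0 - specificity
    else:
        p_result_given_disease = 1.0 - sensitivity
        p_result_given_healthy = specificity
    numerator = p_result_given_disease * prior
    denominator = numerator + p_result_given_healthy * (1.0 - prior)
    return numerator / denominator


# High-risk jogger: 95 percent pretest probability, negative treadmill test.
# Sensitivity and specificity are assumed values, not figures from the text.
after_negative = posterior_probability(0.95, 0.80, 0.95, test_positive=False)
print(round(after_negative, 2))  # 0.8

# Very low-risk group: an assumed 1-in-1,000 prevalence and an assumed test
# that is right 99 percent of the time in both directions.
after_positive = posterior_probability(0.001, 0.99, 0.99, test_positive=True)
print(round(after_positive, 2))  # 0.09
```

Under these assumed numbers, a positive result in the low-risk group still leaves the patient with less than a one-in-ten chance of having the disease, while a negative result in the high-risk jogger leaves him with a four-in-five chance of having it.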

The AI Challenge

As if this weren’t complicated enough for the poor IBM engineer gearing up to retool Watson from answering questions about “Potent Potables” to diagnosing sick patients, there’s more. While the EHR at least offers a fighting chance for computerized diagnosis (older medical AI programs, built in the pen-and-paper era, required busy physicians to write their notes and then reenter all the key data), parsing an electronic medical record is far from straightforward. Natural language processing is getting much better, but it still has real problems with negation (“the patient has no history of chest pain or cough”) and with family history (“there is a history of arthritis in the patient’s sister, but his mother is well”), to name just a couple of issues. Certain terms have multiple meanings: when written by a psychiatrist, the term depression is likely to refer to a mood disorder, while when it appears in a cardiologist’s note (“there was no evidence of ST-depression”) it probably refers to a dip in the EKG tracing that is often a clue to coronary disease. Ditto abbreviations: Does the patient with “MS” have multiple sclerosis or mitral stenosis, a sticky heart valve? Finally, the computer can’t read a patient’s tone of voice or the anxious look on her face, although engineers are working on this. These clues—like one patient saying, “I have chest pain,” and another, “I HAVE CHEST PAIN!!!”—can make all the difference in the world diagnostically.
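The negation problem is easy to demonstrate. The Python sketch below is a deliberately crude illustration, not a description of how Watson or any real clinical language system works: the cue list and both function names are hypothetical, and the rule (ignore a term if a negation word appears earlier in the same sentence) is an assumption made to keep the example short.

```python
import re

# A tiny, assumed list of negation cues; real systems use far richer rules.
NEGATION_CUES = ("no", "not", "denies", "without")


def naive_mentions(note, term):
    """Keyword search: True if the term appears anywhere in the note."""
    return term.lower() in note.lower()


def affirmed_mentions(note, term):
    """True only if the term appears in a sentence with no negation cue before it."""
    for sentence in re.split(r"[.;]", note.lower()):
        position = sentence.find(term.lower())
        if position == -1:
            continue
        preceding_words = re.findall(r"[a-z]+", sentence[:position])
        if not any(word in NEGATION_CUES for word in preceding_words):
            return True
    return False


note = "The patient has no history of chest pain or cough."
print(naive_mentions(note, "cough"))     # True: the keyword search is fooled
print(affirmed_mentions(note, "cough"))  # False: the negation cue is noticed
```

Even this patched version would stumble on the family-history example, since nothing in it can tell that the arthritis belongs to the patient’s sister rather than to the patient.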

Perhaps the trickiest problem of all is that—at least today—the very collection of the facts needed to feed an AI system is itself a cognitively complex process. Let’s return to the example of aortic dissection, a rip in the aorta that is often fatal if it is not treated promptly. If the initial history raises the slightest concern about dissection, I’m going to ask questions about whether the pain bores through to the back and check carefully for the quiet murmur of aortic insufficiency as well as for asymmetric blood pressure readings in the two arms, all clues to dissection. If I don’t harbor a suspicion of this scary (and unusual) disease, I’m not going to look for these things—they’re not part of a routine exam.

Decades ago, MIT’s Peter Szolovits, an AI expert who worked with Kassirer and his colleagues in the early days, gave up thinking about diagnosis as a simple matter of question answering. This was mostly because he came to appreciate the importance of timing—a nonissue in Jeopardy but a pivotal one in medicine. “A heart attack that happened five years ago has different implications from one that happened five minutes ago,” he explained, and a computer can’t “know” this unless it is programmed to do so. (It turns out that such issues of foundational knowledge are fundamental in AI—computers have no way of “knowing” some of the basic assumptions that allow us to get through our days, things like water is wet, love is good, and death is permanent.)

Moreover, much of medical reasoning relies on feedback loops: observing how events unfold and using that information to refine the diagnostic possibilities. We think a patient has bacterial pneumonia, and so we treat the “pneumonia” with antibiotics, but the patient’s fever doesn’t break after three days. So now we consider the possibility of tuberculosis or lupus. This is the cognitive work of the practicing clinician (focused a bit less on “What is the diagnosis?” and more on “How do I best manage this situation?”), and an AI program that doesn’t account for this will be of limited value.

Early Attempts

Now that you appreciate the nature of the problem, it’s easy (in retrospect, at least) to see why the choice by early health care computer experts to focus on diagnosis was risky, perhaps even wrongheaded. It’s like tackling Saturday’s crossword puzzle in the New York Times before first mastering the one in USA Today.

Larry Fagan, an early Stanford computing pioneer, told me, “We were not naive about the complexity. It’s just that it was the most exciting question.” Diagnosis is not just exciting, it’s at the heart of safe medical care. Diagnostic errors are common, and they can be fatal. A number of autopsy studies conducted over the past 40 years have shown that major diagnoses were overlooked in nearly one in five patients. With the advent of CT scans and MRIs, the number has gone down a bit, but it still hovers around one in ten. Diagnostic errors contribute to 40,000 to 80,000 deaths per year in the United States. And reviews of malpractice cases have demonstrated that diagnostic errors are the most common source of mistakes leading to successful lawsuits.

Medical IT experts jumped into the fray in the 1970s, designing a series of computer programs that they believed could help physicians be better diagnosticians, or perhaps even replace them entirely. That decade’s literature was replete with enthusiastic articles about how microprocessors, programmed to think like experts, would soon replace the brains of harried doctors. The attitude was captured by one early computing pioneer in a 1971 paean to his computer: “It is immune from fatigue and carelessness; and it works day and night, weekends and holidays, without coffee breaks, overtime, fringe benefits or human courtesy.”

By the mid-1980s, disappointment had set in. The tools that had seemed so promising a decade earlier were, by and large, unable to manage the complexity of clinical medicine, and they garnered few clinician advocates and minuscule commercial adoption. The medical AI movement skidded to a halt, marking the start of a 20-year period that insiders still refer to as the “AI winter.” Ted Shortliffe, one of the field’s longstanding leaders, has said that the early experience with programs like INTERNIST, DXplain, and MYCIN reminded him of this cartoon:

'Version 0'

Vinod Khosla is prepared for this. He knows that even today’s generation of medical AI programs will produce some crazy output, akin to when Watson famously mistook Toronto for an American city during its Jeopardy triumph. (It was worse in rehearsal, when Watson referred to civil rights leader Malcolm X as “Malcolm Ten.”) Khosla points out that the enormous cellphones of the late 1980s would seem equally ridiculous when placed alongside our iPhone 6.0s. He calls today’s medical AI programs “Version 0,” and cautions that people should “expect these early systems and tools to be the butt of jokes from many a writer and physician.”

These cases illustrate a perennial debate in AI, one that pits two camps against each other: the “neats” and the “scruffies.” The neats seek solutions that are elegant and provable; they try to model the way experts think and work, and then code that into AI tools. The scruffies are the pragmatists, the hackers, the crazy ones; they believe that problems should be attacked through whatever means work, and that modeling the behavior of experts or the scientific truth of a situation isn’t all that important. IBM’s breakthrough was to figure out that a combination of neat and scruffy—programming in some of the core rules of the game, but then folding in the fruits of machine learning and natural language processing—could solve truly complicated problems.

When he was asked about the difference between human thinking and Watson’s method, Eric Brown, who runs IBM’s Watson Technologies group, gave a careful answer (note the shout-out to the humans, the bit players who made it all possible):

A lot of the way that Watson works is motivated by the way that humans analyze problems and go about trying to find solutions, especially when it comes to dealing with complex problems where there are a number of intermediate steps to get you to the final answer. So it certainly is inspired by that process. . . . But a lot of it is different from the ways humans work; it tends to leverage the powers and advantages of a computer system, and its ability to rapidly analyze huge amounts of data and text that humans just can’t keep track of.

However Watson works, we find ourselves today in a world with new tools, new mental models, and a new sense of optimism that computers can do pretty much anything. But have we finally reached the age when computers can master the art of clinical reasoning?

I asked Eric Brown, who worked on the "Jeopardy" project and is now helping to lead Watson’s efforts in medicine, what the equivalent event might be in health care, the moment when his team could finally congratulate itself on its successes. I wondered if it would be the creation of some kind of holographic physician—like “The Doctor” on Star Trek Voyager—with Watson serving as the cognitive engine. His answer, though, reflected the deep respect he and his colleagues have for the magnitude of the challenge:
