The world is full of risk prediction algorithms. Algorithms tell lenders whether a borrower is likely to default on a loan. Colleges use algorithms to predict which applicants will not graduate. Doctors use algorithms to advise patients about health concerns. Courts use risk assessment algorithms to predict the likelihood that an offender will reoffend.
All these algorithms have one thing in common: they rely on data. This is what prompted Julian Nyarko, a professor at Stanford Law School and associate director of the Stanford Institute for Human-Centered AI, to study the effectiveness of risk-based predictive models. The question at stake is whether risk assessment models actually measure what they purport to predict.
In a study published in Science Advances, Nyarko and two of his Harvard colleagues show that many risk models may fall short of expectations, not because of too little data, but because of too much. They call the conventional wisdom in the field the “kitchen sink” approach: an “everything but the kitchen sink” strategy in which more data is assumed to be better.
Read the study, “Risk Scores, Label Bias, and Everything But the Kitchen Sink”
“The thinking goes, ‘Let’s give the model access to as much data as we can. It can’t hurt,’” Nyarko explains. “If the data show that sunspots or shoe size or the price of coffee are good predictors of recidivism, researchers will want to know that and use the information in their models.”
Proxy Problems
The problem with risk models, Nyarko says, is that the outcome they are actually trying to measure is often hidden or unmeasurable, like criminal behavior or many medical conditions. Instead, these models measure it indirectly, using proxies. Relying on a poorly chosen proxy leads to a phenomenon called label bias: essentially, the proxy is mislabeled as the truth. The model becomes very good at predicting the proxy, but misses the mark when used to infer the truth.
The researchers demonstrated the impact of label bias in several real-world case studies. The first concerns the criminal justice system, where judges often use models that estimate risk to public safety when deciding whether to grant bail to arrested individuals. Existing models are trained to predict future arrests, meaning that arrest serves as a proxy for the true outcome of interest, future criminal behavior, which goes largely unobserved.
Nyarko and his colleagues showed that arrests are in fact a poor proxy for risk to public safety because they depend on both behavior and geography; that is, people engaging in the same illegal activity may face different arrest rates depending on where they live. They point to well-known research showing that large U.S. cities concentrate police activity in certain neighborhoods, resulting in higher arrest rates for Black residents than for white residents who engage in the same behavior. Under a model trained on arrests, Black detainees are therefore more likely to be denied bail.
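To see how this kind of label bias can arise, here is a minimal simulation sketch. It is not drawn from the study; the neighborhoods, rates, and numbers are invented for illustration. Two neighborhoods have identical rates of the underlying behavior, but one is policed more heavily, so a model trained on arrests would “learn” that its residents are riskier.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup (not from the study): two neighborhoods, A and B, with
# identical rates of the underlying behavior (the true outcome of interest)
# but different levels of enforcement (which drives the proxy: arrest).
neighborhood = rng.choice(["A", "B"], size=n)
offense = rng.random(n) < 0.20                           # same true rate everywhere
arrest_prob = np.where(neighborhood == "A", 0.60, 0.20)  # enforcement differs
arrest = offense & (rng.random(n) < arrest_prob)         # arrest = proxy label

# A model trained on the proxy would score neighborhood A as riskier,
# even though the true behavior is identical in both neighborhoods.
for nb in ["A", "B"]:
    mask = neighborhood == nb
    print(
        f"Neighborhood {nb}: true offense rate = {offense[mask].mean():.2f}, "
        f"observed arrest rate = {arrest[mask].mean():.2f}"
    )
```

Running the sketch prints identical true offense rates but very different arrest rates, and that gap is exactly what a model trained on arrests would mistake for a difference in risk.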
The researchers next turned to healthcare, examining risk assessment tools used to identify patients for high-risk care management programs that can extend or save lives. These models typically predict expected future healthcare costs as a proxy for healthcare need. Here, too, Black patients are less likely to be enrolled than white patients: white patients with the same illness are more likely to seek care, and therefore incur higher costs, than Black patients, so they score higher on expected future healthcare costs.
Small Ball
Using that example, Nyarko and his collaborators trained two new medical risk models: a simpler model with 128 risk predictors and a more complex model with 150. They showed that the simpler model consistently identified more patients in need of high-risk care management and enrolled more Black patients in those programs. They argue that this more equitable allocation arises because the simpler model prioritizes immediate medical need over future costs, making it a more accurate representation of the truth.
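A rough sketch of the underlying idea, using synthetic data rather than the study’s actual models or its 128 and 150 predictors: when the training label is a cost proxy, adding a utilization-style feature makes a model track costs more faithfully, which pulls the patients it flags away from true medical need and away from the group with less access to care. All variable names and numbers below are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

# Synthetic data: 'need' is the unobserved truth; 'cost' is the proxy label.
group = rng.choice([0, 1], size=n)                       # 1 = group with less access to care
need = rng.normal(size=n)                                # latent medical need
access = np.where(group == 1, -0.8, 0.0)                 # differential access to care
cost = need + access + rng.normal(scale=0.5, size=n)     # proxy: observed spending

clinical = need + rng.normal(scale=0.5, size=n)          # feature tied to need
utilization = cost + rng.normal(scale=0.5, size=n)       # feature tied to the proxy

high_cost = (cost > np.quantile(cost, 0.75)).astype(int) # proxy training label
high_need = need > np.quantile(need, 0.75)               # truth, used only for evaluation

def flag_rate(features):
    """Train on the proxy label, flag the top 25% of scores, report who gets flagged.
    For simplicity, the sketch trains and evaluates on the same synthetic data."""
    X = np.column_stack(features)
    model = LogisticRegression(max_iter=1000).fit(X, high_cost)
    scores = model.predict_proba(X)[:, 1]
    flagged = scores > np.quantile(scores, 0.75)
    return high_need[flagged].mean(), group[flagged].mean()

for name, feats in [("simpler (clinical only)", [clinical]),
                    ("complex (clinical + utilization)", [clinical, utilization])]:
    share_need, share_group1 = flag_rate(feats)
    print(f"{name}: share of flagged who are high-need = {share_need:.2f}, "
          f"share from lower-access group = {share_group1:.2f}")
```

In this toy setup, the simpler feature set tracks medical need directly, while the richer one reproduces the cost proxy, including the access gap baked into it; the paper’s argument is analogous, though its models and data are far more detailed.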
“Researchers need to be aware and diligent when they only have proxy data and not the data they really need,” Nyarko advises. “When proxy data are all they have, being careful about how the proxy is chosen and reducing the complexity of the model by excluding certain data can improve both the accuracy and fairness of risk predictions.”