Fair ML for Health

NeurIPS 2019 Workshop, East Ballroom B, Vancouver, Canada.

Schedule (Sat Dec 14): East Ballroom B

  • 09:00-09:15: Check-in and set up contributed posters
  • 09:15-09:30: Opening remarks from Irene Chen
  • 09:30-10:00: Keynote - Milind Tambe: "Applying AI in preventative health interventions: algorithms, deployments and fairness" [video, slides]
  • 10:00-10:30: Invited talk - Ziad Obermeyer: "Bad Proxies" [video, slides]
  • 10:30-11:00: Coffee break and poster sessions
  • 11:00-11:15: Organizers’ primer on unconference-style breakout sessions
  • 11:15-12:30: Two rounds of breakout sessions
  • 12:30-12:45: Breakout session leaders discuss the conclusions with all attendees
  • 12:45-14:00: Lunch break
  • 14:00-14:30: Invited Talk - Sharad Goel: "The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning" [video, slides]
  • 14:30-15:00: Invited Talk - Noa Dagan, Noam Barda: "Addressing Fairness in Prediction Models by Improving Subpopulation Calibration" [video, slides]
  • 15:00-15:30: Invited Talk - Chelsea Barabas: "Beyond Bias: Contextualizing “Ethical AI” Within the History of Race, Exploitation and Innovation in Medical Research" [video, slides]
  • 15:30-16:00: Coffee break and poster sessions
  • 16:00-17:00: Panel Discussion [video]
  • 17:00-17:30: Spotlight presentations
    • 17:00-17:10 - Spotlight 1 - Estimating Skin Tone and Effects on Classification Performance in Dermatology Datasets [video, slides]
    • 17:10-17:20 - Spotlight 2 - Understanding racial bias in health using the Medical Expenditure Panel Survey data [video, slides]
    • 17:20-17:30 - Spotlight 3 - Fair Predictors under Distribution Shift [video, slides]
  • 17:30-18:00: Closing Remarks and Poster Session


*Date and location are assigned by the NeurIPS Workshop Chairs. The date is Saturday, Dec 14, and the room is in the Vancouver Convention Centre East.

[Photo: Vancouver Convention Centre (source)]

Speaker Talk Abstracts and Bios

Milind Tambe: "Applying AI in preventative health interventions: algorithms, deployments and fairness"

With the maturing of AI and multiagent systems research, we have a tremendous opportunity to direct our work towards addressing complex societal problems. In pursuing this research agenda of AI for Social Impact, we present algorithmic advances as well as deployments that address one key cross-cutting challenge: how to effectively deploy our limited intervention resources within these critical problem domains. In this talk, we focus on our on-going work in public health preventative interventions. I will present results from real-world pilot deployments, initial investigations of algorithmic approaches for addressing challenges in fairness, as well as a key question for future investigation: to understand the interaction between domain-specific stakeholder perspectives on fairness and algorithmic approaches.

Milind Tambe is the Gordon McKay Professor of Computer Science and Director of the Center for Research in Computation and Society at Harvard University; concurrently, he is also Director of "AI for Social Good" at Google Research India. Prof. Tambe's research focuses on advancing AI and multiagent systems research for Social Good. He is a recipient of the IJCAI (International Joint Conference on AI) John McCarthy Award, the ACM/SIGAI Autonomous Agents Research Award from AAMAS (Autonomous Agents and Multiagent Systems Conference), the AAAI (Association for the Advancement of Artificial Intelligence) Robert S. Engelmore Memorial Lecture Award, the INFORMS Wagner Prize, the Rist Prize of the Military Operations Research Society, the Christopher Columbus Fellowship Foundation Homeland Security Award, and the International Foundation for Autonomous Agents and Multiagent Systems Influential Paper Award; he is a fellow of AAAI and ACM. He has also received a Meritorious Team Commendation from the US Coast Guard and LA Airport Police, and a Certificate of Appreciation from the US Federal Air Marshals Service for pioneering real-world deployments of security games.

Prof. Tambe has also co-founded a company based on his research, Avata Intelligence. Prof. Tambe received his Ph.D. from the School of Computer Science at Carnegie Mellon University.

Ziad Obermeyer: "Bad Proxies"

The choice of convenient, seemingly effective proxies for ground truth can be an important source of algorithmic bias in many contexts. We illustrate this with an empirical example from health, where commercial prediction algorithms are used to identify and help patients with complex health needs. We show that a widely used algorithm, typical of this industry-wide approach and affecting millions of patients, exhibits significant racial bias: at a given risk score, Black patients are considerably sicker than White patients, as evidenced by signs of uncontrolled illnesses. Remedying this disparity would increase the share of Black patients receiving additional help from 17.7% to 46.5%. The bias arises because the algorithm predicts health care costs rather than illness, and unequal access to care means less is spent caring for Black patients than for White patients. So, even though cost appears to be an effective proxy for health by some measures of predictive accuracy, large racial biases arise.
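
To make the proxy problem concrete, here is a minimal audit sketch in the spirit of the talk (not the authors' code; the dataset and column names such as risk_score, active_chronic_conditions, and race are hypothetical): bin patients by the algorithm's risk score and compare a direct measure of illness across racial groups within each bin.

```python
import pandas as pd

# Hypothetical columns: 'risk_score' (the algorithm's output, trained on cost),
# 'active_chronic_conditions' (a direct measure of illness), and 'race'.
df = pd.read_csv("claims_with_scores.csv")

# Bin patients into ventiles of the algorithm's risk score.
df["score_bin"] = pd.qcut(df["risk_score"], q=20, labels=False)

# Within each score bin, compare mean illness burden by race.
# If the score were an unbiased proxy for health, groups with the same
# score would be equally sick on average.
audit = (
    df.groupby(["score_bin", "race"])["active_chronic_conditions"]
      .mean()
      .unstack("race")
)
print(audit)
```

A persistent gap at equal scores is exactly the signature of a bad proxy: the score equalizes predicted cost, not health.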

Ziad Obermeyer is an Acting Associate Professor of Health Policy and Management at the UC Berkeley School of Public Health, where he does research at the intersection of machine learning, medicine, and health policy. He previously was an Assistant Professor at Harvard Medical School, where he received the Early Independence Award, the National Institutes of Health’s most prestigious award for exceptional junior scientists. He continues to practice emergency medicine in underserved parts of the US. Prior to his career in medicine, he worked as a consultant to pharmaceutical and global health clients at McKinsey & Co. in New Jersey, Geneva, and Tokyo.

Sharad Goel: "The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning"

The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last few years, several formal definitions of fairness have gained prominence. But, in this talk, I'll argue that nearly all of these popular mathematical formalizations suffer from significant statistical limitations. In particular, when used as design objectives, these definitions, perversely, can harm the very groups they were intended to protect. Paper: https://5harad.com/papers/fair-ml.pdf

Sharad Goel is an assistant professor at Stanford in the Department of Management Science & Engineering, in the School of Engineering. He also has courtesy appointments in Computer Science, Sociology, and the Law School. Sharad is the founder and executive director of the Stanford Computational Policy Lab, a team of researchers, data scientists, and journalists that addresses policy problems through technical innovation. In collaboration with the Computational Journalism Lab, they have created the Stanford Open Policing Project, a repository of data on over 100 million traffic stops across the United States.

He often writes essays about contemporary policy issues from a statistical perspective. These include discussions of algorithms in the courts (in the New York Times and the Washington Post); policing (in Slate and The Huffington Post); election polls (in the New York Times); claims of voter fraud (in Slate, and also an extended interview with This American Life); and affirmative action (in Boston Review).

He studied at the University of Chicago (B.S. in Mathematics) and at Cornell (M.S. in Computer Science; Ph.D. in Applied Mathematics). Before joining the Stanford faculty, he worked at Microsoft Research in New York City.


Noa Dagan and Noam Barda: "Addressing Fairness in Prediction Models by Improving Subpopulation Calibration"

Background: The use of prediction models in medicine is becoming increasingly common, and there is an essential need to ensure that these models produce predictions that are fair to minorities. Of the many performance measures for risk prediction models, calibration (the agreement between predicted and observed risks) is of specific importance, as therapeutic decisions are often made based on absolute risk thresholds. Calibration tends to be poor for subpopulations that were under-represented in the development set of the models, resulting in reduced performance for these subpopulations. In this work we empirically evaluated an adapted version of the fairness algorithm designed by Hebert-Johnson et al. (2017) to improve model calibration in subpopulations, which should lead to greater accuracy in medical decision-making and improved fairness for minority groups.

Methods: This is a retrospective cohort study using the electronic health records of a large sick fund. Predictions of cardiovascular risk based on the Pooled Cohort Equations (PCE) and predictions of osteoporotic fracture risk based on the FRAX model were calculated as of a retrospective index date. We then evaluated the calibration of these models by comparing the predictions to events documented during a follow-up period, both in the overall population and in subpopulations. The subpopulations were defined by the intersection of five protected variables: age, sex, ethnicity, socioeconomic status and immigration history, resulting in hundreds of combinations. We next applied the fairness algorithm as a post-processing step to the PCE and FRAX predictions and evaluated whether calibration in subpopulations improved using the metrics of calibration-in-the-large (CITL) and calibration slope. To evaluate whether the process had a negative effect on the overall discrimination, we measured the area under the Receiver Operating Characteristic Curve (AUROC).
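
For readers unfamiliar with these metrics, the sketch below illustrates one plausible way to compute them per subpopulation (this is not the authors' code; it assumes CITL is reported as the ratio of observed to expected event rates, which matches the values near 1.0 quoted in the results, and estimates the calibration slope by regressing outcomes on the logit of the predictions; file and column names are hypothetical).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def calibration_metrics(y_true, p_pred):
    """Observed/expected event ratio (CITL-style) and calibration slope."""
    p = np.clip(p_pred, 1e-8, 1 - 1e-8)
    oe_ratio = y_true.mean() / p.mean()                  # ideal value: 1.0
    logit_p = np.log(p / (1 - p))
    fit = sm.Logit(y_true, sm.add_constant(logit_p)).fit(disp=0)
    return oe_ratio, fit.params[1]                       # ideal slope: 1.0

# Hypothetical cohort file: one row per patient with the outcome ('event'),
# the model prediction ('pred'), and the five protected attributes.
df = pd.read_csv("cohort_predictions.csv")
protected = ["age_group", "sex", "ethnicity", "ses", "immigration"]

rows = []
for key, grp in df.groupby(protected):
    if len(grp) < 100:                                   # skip tiny intersections
        continue
    oe, slope = calibration_metrics(grp["event"].values, grp["pred"].values)
    rows.append(dict(zip(protected, key), oe_ratio=oe, slope=slope, n=len(grp)))

report = pd.DataFrame(rows)
print(report[["oe_ratio", "slope"]].describe())          # spread across subpopulations
```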

Results: 1,021,041 patients aged 40-79 were included in the PCE population and 1,116,324 patients aged 50-90 were included in the FRAX population. After local adjustment, baseline overall model calibration of the two tested models was good (CITL was 1.01 and 0.99 for PCE and FRAX, respectively). However, calibration in a substantial portion of the subpopulations was poor, with 20% having CITL values greater than 1.49 and 1.25 for PCE and FRAX, respectively, and 20% having CITL values less than 0.81 and 0.87 for PCE and FRAX, respectively. After applying the fairness algorithm, subpopulation calibration statistics improved greatly, with the 20th and 80th percentiles moving to 0.97 and 1.07 in the PCE model and to 0.95 and 1.03 in the FRAX model. In addition, the variance of the CITL values across all subpopulations was reduced by 98.8% and 95.7% in the PCE and FRAX models, respectively. The AUROC was essentially unchanged (+0.12% and +0.31% for PCE and FRAX, respectively).

Conclusions: A post-processing and model-independent fairness algorithm for recalibration of predictive models greatly improved subpopulation calibration and thus fairness and equality, without harming overall model discrimination.
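
The abstract does not spell out the recalibration procedure itself; below is a rough, simplified sketch of multicalibration-style post-processing in the spirit of Hebert-Johnson et al. (the group definitions, bin count, and tolerance are illustrative, and a real application would fit the corrections on held-out data to avoid overfitting).

```python
import numpy as np

def multicalibrate(p, y, groups, n_bins=10, alpha=0.01, max_iter=100):
    """Iteratively patch predictions so that every (group, prediction-bin)
    cell is calibrated to within `alpha`.

    p      : array of predicted risks in [0, 1]
    y      : array of observed binary outcomes
    groups : list of boolean masks, one per (intersectional) subpopulation
    """
    p = p.astype(float).copy()
    for _ in range(max_iter):
        updated = False
        bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
        for g in groups:
            for b in range(n_bins):
                cell = g & (bins == b)
                if not cell.any():
                    continue
                gap = y[cell].mean() - p[cell].mean()
                if abs(gap) > alpha:
                    # Shift the cell's predictions toward the observed rate.
                    p[cell] = np.clip(p[cell] + gap, 0.0, 1.0)
                    updated = True
        if not updated:        # no cell violates the tolerance: done
            break
    return p
```

Each pass patches any (subpopulation, prediction-bin) cell whose observed event rate deviates from its mean prediction by more than the tolerance, which is what drives the per-subpopulation CITL values toward 1.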


Noa Dagan has an MD and an MPH degree from the Hebrew University. She is currently Head of Data and AI Driven Medicine in the Clalit Research Institute at Clalit Health Services. In addition, she is a public health resident and a PhD student in the Computer Science department at Ben-Gurion University. Dr. Dagan is currently focusing on the development and research of data and AI driven solutions in medicine, to promote preventative and proactive care. She leads the entire lifecycle of data and AI-driven interventions, from concept design, through machine-learning modeling to implementation (when model results are deployed directly to patients or their physicians).

Her PhD work focuses on practical implementations of machine learning algorithms on clinical data. Clinical areas of interest currently include prevention of cardiovascular events and osteoporotic fractures. Dr. Dagan is also exploring algorithms to improve the fairness of machine learning methods.


Noam Barda has an MD (with honors) and a B.Sc. in computer science (with honors). He is currently undergoing his residency in epidemiology and public health at the Clalit Research Institute at Clalit Health Services, Israel's largest healthcare organization. In parallel, he is a PhD student at Ben-Gurion University, pursuing a PhD in public health and computer science.

Dr. Barda's research focus is at the intersection of computer science and epidemiology, particularly the development and application of machine learning and biostatistical algorithms for disease prevention and treatment. Specific areas of current research include multiple causal inference from observational data, improving prediction model fairness using novel algorithms, and cardiovascular event prevention utilizing novel biomarker scores.


Chelsea Barabas: "Beyond Bias: Contextualizing 'Ethical AI' Within the History of Race, Exploitation and Innovation in Medical Research"

Data-driven decision-making regimes, often branded as “artificial intelligence,” are rapidly proliferating across a number of high-stakes decision-making contexts, such as medicine. These data regimes have come under increased scrutiny, as critics point out the myriad ways that they reproduce or even amplify pre-existing biases in society. As such, the nascent field of AI ethics has embraced bias as the primary anchor point for its efforts to produce more equitable algorithmic systems. This talk will challenge this approach by exploring the ways that race-based exploitation has historically served as the bedrock for cutting-edge research in medicine. The speaker will draw from historical examples of ethical failures in medicine, such as the Tuskegee syphilis study, in order to explore the limits of bias and inclusion as the primary framing for ethical research. She will then draw parallels to contemporary efforts to improve the fairness of medical AI through the inclusion of underrepresented groups. The aim of this talk is to expand the conversation regarding “ethical AI” to include structural considerations which threaten to undermine noble goals of creating more equitable medical interventions via artificial intelligence.

Chelsea Barabas is a PhD student at MIT, where she examines the spread of algorithmic decision making tools in the US criminal justice system. She works with an amazing group of interdisciplinary researchers, government officials and community organizers to unpack and transform mainstream narratives around criminal justice reform and data-driven decision making. Formerly, she was the Head of Social Innovation at the MIT Media Lab's Digital Currency Initiative, where she examined the social and political implications of cryptocurrencies and decentralized digital infrastructure.