A New Approach for Evaluating AI Model Fairness




As AI applications gain traction across industries, the responsible AI community is working to ensure AI models deliver fair and equitable results, regardless of traits such as age, race, and gender. Jurity*, an open source package for evaluating model fairness released by Fidelity Investments*, introduces a new approach to creating AI models that work equally well for everyone, even when you lack key demographic information. Best of all, you don’t need to be a statistician to use the software.

The latest episode of the Open at Intel podcast was recorded at All Things Open*, where host Katherine Druckman and Intel open source evangelist Ezequiel Lanza sit down with Melinda Thielbar, vice president of data science at Fidelity. The three discuss what Jurity means for fairness in the world of AI.

You can listen to the full conversation here. This conversation has been edited and condensed for brevity and clarity.

An Alternative Way to Evaluate Model Fairness

Katherine Druckman: Thanks for joining Ezequiel and me, Melinda. You both gave interesting talks yesterday about ethics and AI, which is a hot topic. Melinda, can you tell us a little about yourself and what you covered in your talk?

Melinda Thielbar: As a practice lead, my responsibility is to make sure Fidelity follows the best practices for AI and data science models. My talk was about the release of Jurity 2.0. The big change in this release is that we’ve added something called probabilistic fairness, also referred to as “surrogate class.” This new feature can help with fairness even if you haven’t collected demographics. We’ve discovered problems in the past where AI models don’t work equally well for everyone, such as facial recognition not working on people with dark skin. What you probably wanted to do there was separate people into groups based on skin tones and look at how the model performs for each group, but in many cases businesses don’t have the information they need to test for fairness. For example, if you’re trying to ensure your model is fair across racial groups, many companies don’t want to ask people for sensitive information such as racial or ethnic identity, and many people are reluctant to share it. And if companies do collect it, they must secure it, which is considered a cost. Surrogate class analysis removes the need to know someone’s racial identity, instead using related information to infer not what your identity is but how the model works for people who are like you. It’s an alternative way to group people and evaluate how the model works for those people.

One Tool for All Developers in Any Industry

Ezequiel Lanza: When integrating responsible AI practices, most developers don’t know where to find tools for a particular vertical. Often there are separate solutions for each vertical, so you need one solution for healthcare and another for finance. Is your toolkit designed for a particular vertical or type of dataset?

Melinda Thielbar: Jurity works for virtually any model that’s a classifier. One of the things I liked about working on this project is that we weren’t thinking about the financial vertical exclusively. We took a holistic approach. We started with what the math says—I have a PhD in statistics so I always start with the math—and determined that probabilistic fairness was the type of math we could contribute to. The software has been tested heavily on binary classifiers; I’d love to see some contributions for other types of models, but Jurity should work for any industry or vertical.

Ezequiel Lanza: Since there are multiple toolkits available, I’d love to hear about how you view the responsible AI community. For developers looking to get involved, where should they start?

Melinda Thielbar: There are some excellent open source packages out there. In addition to ours, there’s a package called Aequitas*, which has nice features for people who are not high-level AI practitioners. The IBM* tool AI Fairness 360* is very active. If you’re trying to get started in fairness, start with those big three: look at their user guides and contribution guidelines and go from there.

Another thing I would say: because companies consider model fairness a cost, there’s a really great story to tell your board about why you should spend a little time supporting and maintaining these packages. If a company has to license a package on top of everything else, and you don’t have any control over that package, open source is the better value proposition. None of this should be proprietary. There’s no argument that one vendor’s fairness suite is better than another’s and therefore more valuable; that doesn’t exist in this space. This is truly a community.

Katherine Druckman: Could you explain a little bit about how the package’s correlation works? How does it identify that a person is within a particular demographic group without needing them to share that info?

Melinda Thielbar: To be really clear, we actually don’t identify demographic groups. Many fairness packages try to identify demographics, but our approach takes a step back. Zip code is the classic example because the US Census* collects a lot of demographic data by zip code. You can group everybody by zip code and monitor how model performance changes as the demographics of the zip code change. So you look at average performance by group, but the groups are organized by zip code—instead of Black, White, and Hispanic, your groups are zip code A, zip code B, and zip code C. What you want to see is that the model doesn’t do any better or worse as the demographics and zip codes change.

Looking at change across groups as particular features change—there’s a 200-year history in statistics of people trying to figure this out, for lots of reasons. The paper we wrote and presented at the Learning and Intelligent Optimization Conference (LION)* this year goes into detail about the process—what your data needs to look like, what your surrogate class and those demographic proportions need to look like in order for the model to give you a good answer. The research question becomes: Do the model’s metrics change as the demographics change? There are some foundational assumptions about what the data has to look like for it to work, but it turns out they’re mild, and we discovered that the algorithm we chose is really robust. Even in cases where the assumptions were not perfect, it still gave us a really good answer.
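The zip-code monitoring described above can be sketched in a few lines of Python. This is a toy illustration with invented zip codes and records, not Jurity’s implementation: group outcomes by a surrogate class and compare performance across groups.

```python
# Toy sketch of surrogate-class monitoring: group model results by a surrogate
# such as zip code and compare performance across groups. All records here are
# hypothetical; this is not Jurity's implementation.
from collections import defaultdict

# (zip_code, true_label, predicted_label)
records = [
    ("27601", 1, 1), ("27601", 0, 0), ("27601", 1, 0), ("27601", 0, 0),
    ("10001", 1, 1), ("10001", 1, 1), ("10001", 0, 0), ("10001", 0, 1),
    ("60601", 1, 0), ("60601", 0, 0), ("60601", 1, 1), ("60601", 0, 1),
]

hits, totals = defaultdict(int), defaultdict(int)
for zip_code, truth, pred in records:
    totals[zip_code] += 1
    hits[zip_code] += int(truth == pred)

# Accuracy per surrogate group. If performance shifts as the demographics of
# the zip codes shift, that is the signal worth investigating.
accuracy = {z: hits[z] / totals[z] for z in sorted(totals)}
print(accuracy)  # {'10001': 0.75, '27601': 0.75, '60601': 0.5}
```

In this made-up data, the third zip code lags the other two, which is the kind of gap the surrogate-class analysis is designed to surface.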

Ezequiel Lanza: When developers use your toolkit to train a model, do they have to identify specific features or variables to focus on, or will the model work across all features?

Melinda Thielbar: That’s a great question. Right now, you feed it your probabilities by surrogate class. You need a class designation for each individual and a probability of protected group membership in that class. I’ve found you get much more precise answers when the protected group membership is precise. For example, dumping all the nonwhite people into one group isn’t a good idea—that would be like suggesting my neighbor who emigrated from India to get his master’s degree has the same lived experience as my other neighbor who moved from Mexico with his family when he was a kid. You want to test traits like age and race separately so that you can evaluate the differences.

Ezequiel Lanza: Can you tell us about what kind of output developers receive?

Melinda Thielbar: Jurity has high-level APIs that can basically do everything for you. If you tell it your membership probability and ask it to reveal the statistical parity or the equal opportunity value between two groups, the API will spit it right out. There are also lower-level APIs that give you a little more control of the calculations. If you’re curious about what the distribution of the calculated false positive rate looks like for each protected group, you can run a query for specific aspects and see the calculations in more detail.
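For readers curious what those two numbers measure, here is a minimal sketch computed straight from the textbook definitions rather than through Jurity’s API. The data is invented, and group 1 stands in for the protected group.

```python
# Minimal sketch of two common fairness metrics, computed from their
# definitions. Invented data; not Jurity's API.

def statistical_parity_diff(preds, groups):
    """P(pred=1 | group=1) - P(pred=1 | group=0): the gap in selection rates."""
    def rate(g):
        in_g = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(in_g) / len(in_g)
    return rate(1) - rate(0)

def equal_opportunity_diff(preds, labels, groups):
    """True-positive-rate difference between the two groups."""
    def tpr(g):
        pos = [p for p, l, grp in zip(preds, labels, groups) if grp == g and l == 1]
        return sum(pos) / len(pos)
    return tpr(1) - tpr(0)

preds  = [1, 0, 1, 1, 0, 1, 0, 0]
labels = [1, 0, 1, 0, 1, 1, 0, 1]
groups = [1, 1, 1, 1, 0, 0, 0, 0]

print(statistical_parity_diff(preds, groups))        # 0.5
print(equal_opportunity_diff(preds, labels, groups))  # about 0.667
```

A value near zero for either metric means the model treats the two groups similarly by that criterion; the probabilistic, surrogate-class version replaces the hard group labels with membership probabilities.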

Ezequiel Lanza: Do developers need to have a deep understanding of statistics or how the distribution works to understand the outputs?

Melinda Thielbar: We designed the software so that you don’t need to be a statistics expert. I also give some general guidelines in the tutorial; the notebook is on my GitHub* with some basics about how to think about your data. You do need to know a little bit about fairness in general, because someone did this lovely mathematical proof about the difficulty of working with imbalanced targets. I like to use the example of sourcing tech resumes for women in the ’90s because I was a woman in tech in the ’90s. In those days you would go to a conference, and you’d be the only woman there. If you tried to run a resume-matching algorithm on that set to prove you hired equal numbers of men and women, the target is so unbalanced that it’s almost impossible to simultaneously meet all the fairness criteria.

There are multiple tests for multiple aspects of the model, and unless your target’s perfectly balanced—which is not the case where you really need to test for fairness—you can’t meet them all. So what I would ask of developers using the software is to take a step back and ask yourself, “What is the kind of fairness I want?” There’s a kind of fairness where you want equality across the groups in prediction; that’s called statistical parity. There’s another type of fairness where you want the model to be good at choosing true positives for every category; that’s a different kind of fairness, a different statistic.
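The trade-off can be made concrete with a toy example. In the invented data below, the two groups have different base rates of positive labels, and a single classifier satisfies statistical parity (equal selection rates) while failing equal opportunity (unequal true positive rates), so you have to choose which kind of fairness to target.

```python
# Toy illustration of the trade-off: with imbalanced targets, a classifier can
# satisfy one fairness criterion while violating another. Invented data.

def selection_rate(preds):
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    pos = [p for p, l in zip(preds, labels) if l == 1]
    return sum(pos) / len(pos)

# Two groups with different base rates of positive labels.
preds_a, labels_a = [1, 1, 0, 0], [1, 0, 1, 0]   # base rate 0.5
preds_b, labels_b = [1, 1, 0, 0], [1, 1, 1, 0]   # base rate 0.75

# Statistical parity holds: both groups are selected at the same rate...
print(selection_rate(preds_a), selection_rate(preds_b))   # 0.5 0.5
# ...but equal opportunity fails: the true positive rates differ.
print(true_positive_rate(preds_a, labels_a))  # 0.5
print(true_positive_rate(preds_b, labels_b))  # about 0.667
```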

Ezequiel Lanza: I always wonder about the lack of regulations around responsible AI and explainable AI. Fairness can help us improve our models, but it’s something developers do because we want to, not because we’re required to. How do you think about regulations?

Melinda Thielbar: This is me talking as a private citizen, but it seems like it’s been hard to push meaningful AI regulation through. The White House’s AI Bill of Rights is a wish list of what we should consider, and fairness is on the list. No company wants to be the one that makes Congress finally regulate AI. I would say that’s the impetus now for doing these tests, and I think that’s why the open source community and AI community can make valuable contributions to make this as low cost as we can. We can’t patch all the leaks, but we can make it possible to identify the best practices and show people how to do it right. That was kind of the marching orders for Jurity—let’s make this as easy as we can for people who want to do the right thing.

The Case for More-Open AI

Katherine Druckman: Some models are licensed in the spirit of openness, but it’s not true openness in the way the open source community is accustomed to. What is your perspective on how we define openness, and which of the mysterious, almost unknowable aspects of AI can and should be opened up?

Melinda Thielbar: For enormous systems like ChatGPT* that are trained on massive volumes of data, it’s difficult to be transparent—even within the organization—about the quality and source of your training data. I’ve heard this phrase called warming the data, which is the idea of curating the raw data in a way that instructs the model to do the right thing; you know, nudging the model in the direction you want. Fundamentally, openness about the sources is job number one—we need to answer what went into a model.

Those of you in security could be a huge help to us when it comes to openness about best practices for testing. Nontechnical people have asked me about use cases for ChatGPT, and they don’t know that if you say something random to a computer it’s going to say something random back. ChatGPT enabling an expert is one thing, because that expert knows how to edit what the model says, but it’s a very different thing when ChatGPT is let loose to answer any question. It would be a win for the industry if we had a rigorous, open approach to explaining how ChatGPT was tested and what test cases were included—in a way that doesn’t tip your hand to bad actors—in the same way you’d test networks, systems, and routers.


To hear more of this conversation and others, subscribe to the Open at Intel podcast.

About the Author

Katherine Druckman, Open Source Evangelist, Intel

Katherine Druckman, an Intel open source evangelist, hosts the podcasts Open at Intel, Reality 2.0, and FLOSS Weekly. A security and privacy advocate, software engineer, and former digital director of Linux Journal, she’s a longtime champion of open source and open standards. Find her on X at @katherined.

Ezequiel Lanza, AI Open Source Evangelist, Intel

Ezequiel Lanza is an open source evangelist on Intel’s Open Ecosystem team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools like TensorFlow and Hugging Face*. Find him on X at @eze_lanza.