AI and machine learning bias has dangerous implications

Here's how open source technology can help address the problem.

Algorithms are everywhere in our world, and so is bias. From social media news feeds to streaming service recommendations to online shopping, computer algorithms—specifically, machine learning algorithms—have permeated our day-to-day world. As for bias, we need only examine the 2016 American election to understand how deeply—both implicitly and explicitly—it permeates our society as well.

What’s often overlooked, however, is the intersection between these two: bias in computer algorithms themselves.

Contrary to what many of us might think, technology is not objective. AI algorithms and their decision-making processes are directly shaped by those who build them: what code they write, what data they use to “train” the machine learning models, and how they stress-test the models after they’re finished. This means that the programmers’ values, biases, and human flaws are reflected in the software. If I fed an image-recognition algorithm the faces of only white researchers in my lab, for instance, it wouldn’t recognize non-white faces as human. Such a conclusion isn’t the result of a “stupid” or “unsophisticated” AI, but of a bias in the training data: a lack of diverse faces. This has dangerous consequences.

There’s no shortage of examples. State court systems across the country use “black box” algorithms to recommend prison sentences for convicts. These algorithms are biased against black individuals because of the data that trained them—so they recommend longer sentences as a result, thus perpetuating existing racial disparities in prisons. All this happens under the guise of objective, “scientific” decision-making.

The United States federal government uses machine-learning algorithms to calculate welfare payouts and other types of subsidies. But information on these algorithms, such as their creators and their training data, is extremely difficult to find—which increases the risk of public officials operating under bias and meting out systematically unfair payments.

The list goes on. From Facebook news algorithms to medical care systems to police body cameras, we as a society are at great risk of inserting our biases (racism, sexism, xenophobia, socioeconomic discrimination, confirmation bias, and more) into machines that will be mass-produced and mass-distributed, operating under the veil of perceived technological objectivity.

This must stop.

We should by no means halt research and development on artificial intelligence, but we need to slow its pace enough to tread carefully. The danger of algorithmic bias is already too great.

How can we fight algorithmic bias?

One of the best ways to fight algorithmic bias is to vet the training data fed into machine learning models. As researchers at Microsoft point out, bias in that data can take many forms.

The data itself might have a skewed distribution. For instance, programmers may have more data about United States-born citizens than immigrants, and about rich men than poor women. Such imbalances can lead an AI to draw improper conclusions about the makeup of our society (for example, that most Americans are wealthy white businessmen) simply because of the way machine-learning models make statistical correlations.
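A skewed distribution of this kind can often be caught with a simple count before any model is trained. The following is a minimal sketch, not a definitive method: the records, the `origin` field, the reference shares, and the `tolerance` threshold are all made-up assumptions for illustration, and real reference shares would have to come from a trusted source.

```python
from collections import Counter

def group_shares(records, key):
    """Return each group's share of the records, grouped by record[key]."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def underrepresented(shares, reference, tolerance=0.5):
    """List groups whose share falls below `tolerance` times the reference share."""
    return sorted(
        group for group, ref in reference.items()
        if shares.get(group, 0.0) < tolerance * ref
    )

# Toy, made-up training records (not real data).
records = [{"origin": "us_born"}] * 9 + [{"origin": "immigrant"}] * 1

# Hypothetical reference shares, chosen only to make the example concrete.
reference = {"us_born": 0.7, "immigrant": 0.3}

shares = group_shares(records, "origin")
flagged = underrepresented(shares, reference)
print(shares)   # share of each group in the training data
print(flagged)  # groups that look underrepresented
```

A check like this only flags raw imbalance; it says nothing about whether the examples within each group are themselves representative.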

It’s also possible, even if men and women are equally represented in training data, that the representations themselves result in prejudiced understandings of humanity. For instance, if all the pictures of “male occupation” are of CEOs and all those of “female occupation” are of secretaries (even if more CEOs are in fact male than female), the AI could conclude that women are inherently not meant to be CEOs.
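One way to look for this second kind of problem is to check how labels are distributed within each group, rather than how large each group is. The sketch below uses a toy, made-up dataset that is balanced by gender but fully segregated by occupation; the function name and the data are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict

def label_rates_by_group(examples):
    """Map each group to the fraction of its examples carrying each label."""
    counts = defaultdict(Counter)
    for group, label in examples:
        counts[group][label] += 1
    return {
        group: {label: n / sum(c.values()) for label, n in c.items()}
        for group, c in counts.items()
    }

# Toy, made-up data: equal numbers of men and women, but segregated labels.
examples = [("male", "ceo")] * 5 + [("female", "secretary")] * 5

rates = label_rates_by_group(examples)
labels = {label for _, label in examples}

# (group, label) pairs that never occur together are a warning sign:
# a model has no evidence that the combination is even possible.
never_seen = sorted(
    (group, label)
    for group in rates
    for label in labels
    if label not in rates[group]
)
print(rates)
print(never_seen)
```

Here the group sizes look perfectly fair, yet the data contains no female CEOs and no male secretaries, which is exactly the pattern a model could mistake for a rule.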

We can imagine similar issues with, for example, law enforcement AIs that examine representations of criminality in the media, which dozens of studies have shown to be egregiously slanted against black and Latino citizens.

Bias in training data can take many other forms as well, more than can be adequately covered here. Vetting training data, however, is just one safeguard; it’s also important that AI models are “stress-tested” after they’re completed to seek out prejudice.

If we show an Indian face to our camera, is it appropriately recognized? Is our AI less likely to recommend a job candidate from an inner city than a candidate from the suburbs, even if they’re equally qualified? How does our terrorism algorithm respond to intelligence on a white domestic terrorist compared to an Iraqi? Can our ER camera pull up medical records of children?

These are obviously difficult issues to resolve in the data itself, but we can begin to identify and address them through comprehensive testing.
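One simple form such testing can take is a counterfactual check: hold everything about an input fixed, vary only the sensitive attribute, and see whether the prediction changes. The sketch below assumes a deliberately biased toy function, `biased_model`, as a stand-in for a trained model; the field names and scoring rule are invented for illustration and do not describe any real system.

```python
def biased_model(candidate):
    """Toy stand-in for a trained model (hypothetical): penalizes one neighborhood."""
    score = candidate["years_experience"] * 10
    if candidate["neighborhood"] == "inner_city":
        score -= 15
    return score >= 50  # True means "recommend this candidate"

def counterfactual_failures(model, cases, attribute, values):
    """Return the cases whose prediction changes when only `attribute` varies."""
    failures = []
    for case in cases:
        outcomes = {model({**case, attribute: value}) for value in values}
        if len(outcomes) > 1:
            failures.append(case)
    return failures

# Made-up test candidates with identical qualifications across the swap.
cases = [
    {"years_experience": 3, "neighborhood": "suburb"},
    {"years_experience": 6, "neighborhood": "suburb"},
    {"years_experience": 9, "neighborhood": "suburb"},
]

failures = counterfactual_failures(
    biased_model, cases, "neighborhood", ["inner_city", "suburb"]
)
print(failures)  # candidates treated differently based on neighborhood alone
```

A test like this can only catch bias tied to an attribute the tester thinks to vary; proxies for that attribute (a zip code standing in for race, for instance) need their own test cases.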

Why is open source well-suited for this task?

Both open source technology and open source methodologies have enormous potential to help in this fight against algorithmic bias.

Modern artificial intelligence is dominated by open source software, from frameworks like TensorFlow to packages like scikit-learn. The open source community has already proven extremely effective in developing robust and rigorously tested machine-learning tools, so it follows that the same community could effectively build anti-bias tests into that same software.

Debugging tools like DeepXplore, out of Columbia and Lehigh Universities, make the AI stress-testing process extensive yet easy to navigate. This and other projects, such as work being done at MIT’s Computer Science and Artificial Intelligence Lab, model the kind of agile, rapid prototyping the open source community should adopt.

Open source technology has also proven to be extremely effective for vetting and sorting large sets of data. Nothing makes this more obvious than the dominance of open source tools in the data analysis market (Weka, RapidMiner, etc.). Tools for identifying data bias should be designed by the open source community, and those techniques should also be applied to the plethora of open training data sets already published on sites like Kaggle.

The open source methodology itself is also well-suited for designing processes to fight bias. Making conversations about software open, democratized, and in tune with social good is pivotal to combating an issue that is partly caused by the very opposite: closed conversations, private software development, and undemocratized decision-making. If online communities, corporations, and academics can adopt these open source characteristics when approaching machine learning, fighting algorithmic bias should become easier.

How can we all get involved?

Education is extremely important. We all know people who may be unaware of algorithmic bias but who care about its implications for law, social justice, public policy, and more. It’s critical to talk to those people and explain both how the bias is formed and why it matters, because the only way to get these conversations started is to start them ourselves.

For those of us who work with artificial intelligence in some capacity (as developers, on the policy side, or through academic research), these conversations are even more important. Those who are designing the artificial intelligence of tomorrow need to understand the extreme dangers that bias presents today; integrating anti-bias processes into software design depends on this very awareness.

Finally, we should all build and strengthen an open source community around ethical AI. Whether that means contributing to software tools, stress-testing machine learning models, or sifting through gigabytes of training data, it’s time we leverage the power of open source methodology to combat one of the greatest threats of our digital age.

Justin Sherman is a senior at Duke University, a Fellow at the Duke Center on Law & Technology at Duke University's School of Law, and a Cybersecurity Policy Fellow at New America.


To me, one of the biggest risks of AI and machine learning when applied to people is invasion of privacy. Sometimes it helps to translate such things to physical world equivalents to understand implications. What if you were being tracked by some private investigator, who monitored your every movement, your every online purchase, your various contacts with other people either in the real world or online? The investigator says he's only using this for legitimate purposes and to benefit you and his customers in general, but then he sells the information to other parties. This is where consumer-based AI is heading.

Absolutely, Greg - I entirely agree. Privacy is an enormous concern when it comes to machine learning algorithms that rely on massive data sets which often contain personally identifiable information (e.g. as used in medicine, healthcare, finance, law enforcement, and more). In that vein, there's also a concern for how we tokenize or anonymize PII while still (a) preserving the ability of the AI models to make sense of the data, and (b) ensuring that malicious actors can't perform adversarial injections into the network - e.g. feeding malicious training data to a "general" self-driving car neural network in the cloud, which over time could cause cars to crash into people. All of this and more is a serious issue, as you said, given that our world is headed in this direction!

In reply to Greg P

"This means that the programmers’ values, biases, and human flaws are reflected in the software."

If only the public were being informed of this truth by the media, who generally portray algorithms, software, and IT systems as almost magical, impartial, and perfect (except when cracked by criminal gangs) self-intelligent wonders that would never do anything ethically or morally wrong and have no ties to their human creators.

Just like the reporting of polls and surveys as some unbiased barometer of public sentiment, when the questions in the polls and surveys are designed (sometimes without even realizing it) to elicit the response desired by the polling organization to prove the support, or lack of it, for some policy or other.

Well-said. That's why it's so important we educate policymakers and the public (and, arguably, even many tech developers themselves) that technology is NOT inherently objective - that 1s and 0s don't take biased human perceptions and just "unbias" them.

In reply to Liam Murphy (not verified)

What about a bias towards feminism, identity politics and cultural marxism?

I agree that we should all build and strengthen open source community around ethical AI.

Interesting article Justin

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
