Amazon Rekognition May Finally Be Audited and Ranked Alongside Other Vendors

A more universal test for facial recognition systems is needed

Amazon's facial recognition technology had the worst performance of the five companies tested on darker-skinned women.
Photo illustration: Amber McAden; source: Getty Images

Amazon is finally putting its money where its mouth is by collaborating with groups like the National Institute of Standards and Technology (NIST), the U.S. government lab that maintains the industry's existing benchmark for facial recognition, to develop standardized tests that reduce bias and improve accuracy.

Until now, unlike nearly 40 other companies in the space, the ecommerce giant had not submitted its facial recognition platform, Rekognition, to NIST for testing. Amazon previously said Rekognition was too “sophisticated” to be tested because it comprises multiple models and data-processing systems. Amazon has repeatedly said it is interested in working with academics and policymakers to develop tests, guidance and legislation, but this is the first sign it wasn't just blowing smoke. It also means the facial recognition industry may finally get a tool for comparing systems and independently gauging their effectiveness.

This is an important development for a few reasons: facial recognition is largely unregulated; law enforcement agencies use it; and independent tests have shown it is less accurate on women and people of color. So despite benefits like finding missing children, preventing human trafficking and helping passengers disembark from cruise ships faster, the technology also poses grave risks to privacy and civil rights.

Gender Shades

The news that Amazon is playing ball comes after weeks of back-and-forth between the ecommerce giant and MIT researchers over algorithmic bias.

For her master's thesis, Joy Buolamwini, a research assistant at the MIT Media Lab and founder of the Algorithmic Justice League, created what she calls the Pilot Parliaments Benchmark (PPB) by choosing three African and three European countries from a list of nations with the highest representation of women in parliament. She then used the PPB to audit commercial facial analysis systems from Microsoft, IBM and Face++ and found they worked better on male faces than on female faces, and on lighter skin than on darker skin.
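To make the methodology concrete, here is a minimal sketch of a Gender Shades-style intersectional audit: score a gender classifier's predictions per subgroup rather than in aggregate. The records and field names below are hypothetical stand-ins, not Buolamwini's actual data or code.

```python
# Minimal sketch of an intersectional audit: instead of one aggregate
# accuracy number, error rates are computed per (gender, skin type)
# subgroup. The records are hypothetical stand-ins for a labeled benchmark.
from collections import defaultdict

# Each record: (true_gender, skin_type, predicted_gender). Skin type is
# binarized as in the PPB: Fitzpatrick types I-III "lighter", IV-VI "darker".
records = [
    ("female", "darker", "male"),     # a misclassification
    ("female", "lighter", "female"),
    ("male", "darker", "male"),
    ("male", "lighter", "male"),
    # ... one record per benchmark image
]

totals, errors = defaultdict(int), defaultdict(int)
for true_gender, skin_type, predicted in records:
    key = (true_gender, skin_type)
    totals[key] += 1
    if predicted != true_gender:
        errors[key] += 1

# Report the per-subgroup error rate; disparities between subgroups are
# exactly what a single overall accuracy figure can hide.
for key in sorted(totals):
    rate = errors[key] / totals[key]
    print(f"{key[0]}/{key[1]}: {rate:.1%} error across {totals[key]} images")
```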

In the more recent test, she applied the same benchmark to those companies again to gauge their progress, and extended it to Amazon and the startup Kairos. She found the previously tested companies had reduced their overall error rates by about 6 to 8 percent. She also found the newly tested systems were “flawless” on white men but had the highest overall error rates, 8.7 percent for Amazon and 6.6 percent for Kairos, and that Amazon performed worst of the five companies tested on darker-skinned women. (Kairos said it has since released a new algorithm, as Buolamwini's research “catalyzed our commitment to help the industry solve bias problems.”)

Similarly, in an ACLU test of Rekognition last year, the system falsely matched 28 members of Congress with people in mugshots, and the false matches were disproportionately people of color.
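The ACLU reportedly ran its scan at Rekognition's default similarity threshold of 80 percent, a setting AWS later said is too low for law enforcement, where it recommends 99 percent. As a rough illustration of those threshold mechanics, here is a minimal one-to-one comparison sketch using boto3's real compare_faces call; the ACLU's actual test was one-to-many, and the image files here are placeholders.

```python
# Sketch of a one-to-one face comparison with Amazon Rekognition via boto3.
# Candidate matches scoring below SimilarityThreshold are omitted from the
# response. File names are placeholders; AWS credentials must be configured.
import boto3

client = boto3.client("rekognition")

with open("probe_photo.jpg", "rb") as source_file, \
        open("mugshot.jpg", "rb") as target_file:
    response = client.compare_faces(
        SourceImage={"Bytes": source_file.read()},
        TargetImage={"Bytes": target_file.read()},
        SimilarityThreshold=80,  # the default; AWS recommends 99 for police use
    )

for match in response["FaceMatches"]:
    print(f"Match found, similarity {match['Similarity']:.1f}%")
```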

‘Misperceptions and inaccuracies’

Amazon came out swinging in a blog post that said Buolamwini's report drew “false conclusions” and was “misleading.” (She was not available for comment.)

Matt Wood, general manager of artificial intelligence at Amazon Web Services (AWS), sought to discredit the research in his post. He noted that AWS ran its own gender classification test on more than 12,000 images, including a “random selection” of 1,000 men and 1,000 women from six ethnicities (South Asian, Hispanic, East Asian, Caucasian, African American and Middle Eastern), and found “no significant difference in accuracy with respect to gender classification.”
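For context, gender is one of several attributes Rekognition returns from its facial analysis API rather than a standalone classifier. Here is a minimal sketch of such a query using boto3's real detect_faces call; the image file is a placeholder.

```python
# Sketch of querying Rekognition's gender attribute, the facial analysis
# feature at issue in the gender classification tests. The file name is a
# placeholder; AWS credentials must be configured locally.
import boto3

client = boto3.client("rekognition")

with open("face.jpg", "rb") as image_file:
    response = client.detect_faces(
        Image={"Bytes": image_file.read()},
        Attributes=["ALL"],  # request the full attribute set, including Gender
    )

for face in response["FaceDetails"]:
    gender = face["Gender"]
    print(f"Predicted {gender['Value']} at {gender['Confidence']:.1f}% confidence")
```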

Wood said AWS was not able to reproduce Buolamwini's results, though the difference in test populations may explain why. One former Amazon employee said it is hard to assess the real differences unless both organizations use the same training and test sets: MIT should test with Amazon's test set, and Amazon should test with MIT's.
