Amazon Rekognition May Finally Be Audited and Ranked Alongside Other Vendors

Amazon is finally putting its money where its mouth is by collaborating with groups like the National Institute of Standards and Technology (NIST), the U.S. government lab that maintains the existing industry benchmark for facial recognition, to develop standardized tests aimed at removing bias and improving accuracy.

Until now, unlike nearly 40 other companies in this space, the ecommerce giant had not submitted its facial recognition platform, Rekognition, to NIST for testing. Amazon previously said Rekognition is too “sophisticated” to be tested because it has multiple models and data processing systems.

Amazon repeatedly said it was interested in working with academics and policymakers to develop tests, guidance and/or legislation, but this is the first sign it wasn’t just blowing smoke. It also means the facial recognition industry may finally get a tool to compare all systems and independently gauge their effectiveness.

This is an important development for a few reasons: facial recognition is largely unregulated; law enforcement agencies use it; and independent tests have demonstrated it is less accurate on women and people of color. So despite benefits like finding missing children, preventing human trafficking and helping passengers get off cruise ships faster, the technology also poses grave risks to privacy and civil rights.

Gender shades

The news that Amazon is playing ball comes after weeks of back and forth between the ecommerce platform and researchers at MIT over algorithmic bias.

For her Master’s thesis, Joy Buolamwini, a research assistant at the MIT Media Lab and the Algorithmic Justice League project, created what she calls the Pilot Parliaments Benchmark (PPB) by choosing three African and three European countries from a list of nations with the highest representation of women in parliament.
She then used PPB to audit commercial facial analysis systems from Microsoft, IBM and Face++ and discovered they worked better on male faces than female faces and on lighter skin than darker skin.

In the more recent test, she applied the same benchmark to these companies again to gauge their progress, as well as to Amazon and the startup Kairos. She found the companies previously tested had reduced their overall error rates by about 6 to 8 percent. She also found the two newly tested systems were “flawless” on white men but had the highest overall error rates, at 8.7 percent (Amazon) and 6.6 percent (Kairos). Amazon had the worst performance of the five companies tested on women with darker skin. (Kairos said it has since released a new algorithm as Buolamwini’s research “catalyzed our commitment to help the industry solve bias problems.”)

Similarly, an ACLU test of Rekognition last year falsely matched 28 members of Congress with people in mugshots, and the false IDs were disproportionately people of color.

‘Misperceptions and inaccuracies’

Amazon came out swinging in a blog post that said Buolamwini’s report draws “false conclusions” and was “misleading.” (She was not available for comment.)

Matt Wood, general manager of artificial intelligence at Amazon Web Services (AWS), sought to discredit the research in his post. He noted AWS ran its own gender classification test on more than 12,000 images, including a “random selection” of 1,000 men and 1,000 women from six ethnicities (South Asian, Hispanic, East Asian, Caucasian, African American and Middle Eastern), and found “no significant difference in accuracy with respect to gender classification.”

Wood said AWS was not able to reproduce the results of Buolamwini’s test. However, the difference in test populations may explain why.
One former Amazon employee said it is hard to assess the real differences without both organizations using the same training and test sets, so MIT should test with Amazon’s test set and Amazon should test with MIT’s set.

In her response, Buolamwini called for external evaluation, in part because “the internal accuracy rates if reported by companies seem to be at odds with external accuracy rates reported by independent third parties.”

NIST’s facial recognition vendor test

NIST has put out three reports on facial recognition software to date (in 2010, 2014 and 2018), including analysis of 127 algorithms from 39 developers using a database of 26.6 million photos.

NIST looks at recognition accuracy and presents results in ranked tables. This gives developers a better idea of how their algorithms rank in accuracy and effectiveness compared to others, and helps prospective end users determine whether the software suits their purposes, said NIST computer scientist Patrick Grother.

In an earlier interview with Adweek, Grother said he couldn’t be 100 percent sure without talking to Amazon, but he was dubious about Wood’s claim that Rekognition could not be tested. He likened it to test-driving a car: you don’t need to know what’s under the hood to take it for a spin.

A number of NIST-tested vendors, including RealNetworks, confirmed that NIST can test multiple models and/or that they use multiple models and data processing systems as well.

‘Can’t be downloaded for testing outside AWS’

Nevertheless, in an email, an Amazon rep called the Rekognition API a “large-scale system which runs on a broad set of … instance types [from the web service Amazon Elastic Compute Cloud].” The rep said it uses multiple deep learning models and proprietary data processing, storage and search systems and can’t be downloaded for testing outside AWS.
According to the former Amazon employee, it’s plausible AWS doesn’t believe NIST can test Rekognition. While he doesn’t know whether this is true, he said the stance is less likely a matter of bad intentions on Amazon’s part than of arrogance.

But if there is a silver lining, it is that Amazon is at least willing to come to the table now.

“We welcome the opportunity to work with NIST on improving their tests against this API objectively, and to establish datasets and benchmarks with the broader academic community,” the rep added.