Data leaks have become unreasonably common: There have been more than 5,000 reported data leaks since 2005. Most of these leaks come from security breaches that hackers exploit for financial gain. However, independent IT security analyst Mark Burnett published 10 million usernames and passwords just to make a point about IT research.
“A carefully-selected set of data provides great insight into user behavior and is valuable for furthering password security. So I built a data set of ten million usernames and passwords that I am releasing to the public domain,” Burnett wrote.
The data is available by torrent download, and there is also a searchable database which will display a limited number of results.
Burnett notes that this so-called data leak isn’t unethical, primarily because the data was not collected by illicit means. The information in the database was collected over the last 10 years from publicly available leaks. Additionally, Burnett removed identifying information, including the domain sections of the email addresses, and other information that might be “particularly linked to an individual.”
Aside from giving researchers a large, clean data set, Burnett wanted to make a point about the government’s stance on hacking. President Obama has been offering various proposals for improving cybersecurity in the US, but Burnett sees a change to the Computer Fraud and Abuse Act as particularly dangerous.
“The key change here is the removal of an intent to defraud and replacing it with willfully; it will be illegal to share this information as long as you have any reason to know someone else might use it for unauthorized computer access.” (original emphasis)
In essence, the fear of hackers would prevent legitimate study, and could also end with researchers or journalists being arrested by the FBI, as may have been the case with Barrett Brown.
It may seem counterintuitive, but public data leaks are very important to the security industry. How would we know users are terrible at choosing passwords if not for studying leaked data? Burnett’s collation of this data should serve as a wake-up call to users: They need to adopt better protections. And researchers need to take a stand on their ability to access and share this data, for the sake of data science.