Deleting bad data sets is not enough
The researchers’ analysis also suggested that Labeled Faces in the Wild (LFW), a data set introduced in 2007 and the first to use face images scraped from the internet, has morphed several times through almost 15 years of use. While it began as a resource for evaluating research-only face recognition models, it is now used almost exclusively to evaluate systems intended for real-world use. This is despite a warning label on the data set’s website that cautions against such use.
More recently, the data set was repurposed in a derivative called SMFRD, which added face masks to each of the images to improve facial recognition during the pandemic. The authors note that this raises new ethical challenges. Privacy advocates have criticized such applications for fueling surveillance, for example, and especially for enabling governments to better identify masked protesters.
“This is a really important paper, because people’s eyes have largely not been opened to the complexities, and potential harms and dangers, of data sets,” said Margaret Mitchell, an AI ethics researcher and a leader in responsible data practices, who was not involved in the study.
For a long time, the culture within the AI community has assumed that data simply exists and is there for the taking, she added. This paper shows how that assumption can cause problems. “It’s really important to think through the various values that a data set encodes, as well as the values that having a data set available encodes,” she said.
The authors of the study provide several recommendations for the AI community. First, creators should communicate more clearly about the intended use of their data sets, both through licenses and through detailed documentation. They should also impose stricter limits on access to their data, perhaps by requiring researchers to sign terms of agreement or fill out an application, especially if they intend to create a derivative data set.
Second, research conferences should establish norms for how data should be collected, labeled, and used, and they should create incentives for responsible data set creation. NeurIPS, the largest AI research conference, already has a list of best practices and ethical guidelines.
Mitchell suggested taking it even further. As part of the BigScience project, a collaboration among AI researchers to develop an AI model that can parse and generate natural language under a rigorous ethical standard, she has been experimenting with the idea of creating data set stewardship organizations: teams of people who not only handle the curation, maintenance, and use of the data but also work with lawyers, activists, and the general public to make sure it complies with legal standards, is collected only with consent, and can be removed if someone chooses to withdraw personal information. Such stewardship organizations would not be necessary for all data sets, but certainly for scraped data that could contain biometric or personally identifiable information or intellectual property.
“Collecting and maintaining a data set is not a one-person or two-person job,” she said. “If you’re doing this responsibly, it breaks down into a ton of different tasks that require deep thinking, deep expertise, and a variety of different people.”
In recent years, the field has increasingly shifted toward the belief that much more careful data set curation will be key to overcoming many of its technical and ethical challenges. It is now clear that building more responsible data sets is not nearly enough. Those working in AI must also make a long-term commitment to maintaining them and using them ethically.