Community Driven Data Collection and Consent in AI - Jessica Rose

Generative AI in 2024 has a consent problem. Scraped and otherwise stolen datasets are used to produce output that can directly compete with the people who generated the source data. This doesn’t have to be our future. The Common Voice project collects volunteer-donated speech data and offers it freely to academics, industry and language activists, working toward a future where meaningful linguistic diversity is built into the digital products and services that increasingly fill our world. By teaching computers the way that real people speak, Common Voice doesn’t just offer a better-connected future for global users; it also presents us with one possible consent-led model for community-driven data collection. Together, let’s explore how community-led dataset collection, design and governance structures have developed across speech datasets, and look at how freely donated data, data trusts and other consent-led collection models could offer a less dystopian AI future. This talk takes an exploratory look at the proliferation of consent-led data collection models in speech datasets. It covers Common Voice’s CC0 donation-led approach as well as collection and governance models that offer more granular data control, such as data trusts led by language communities.
