Deliberate Training Data Poisoning Project Highlights AI Data Quality Challenges

Published 2026-04-21 | Ingested 2026-04-22 | AI Regulation and Governance | Low

Summary

Simon Willison highlighted a GitHub project by Steve Cosman called 'pelicans_riding_bicycles,' which deliberately publishes mislabeled image-text pairs (e.g., labeling a bear on a snowboard as a 'pelican riding a bicycle') with the explicit goal of polluting AI training datasets. Willison noted with approval that this effort joins a broader set of data poisoning examples, including some he has published himself. The project underscores ongoing concerns about the integrity of web-scraped trainin

Alignment: Neutral

Related Positions: ai-governance-and-risk.md

training-data-poisoning, data-quality, foundation-models, ai-governance, data-provenance, adversarial-data, web-scraping, open-source, ai-safety, model-evaluation