
Security researchers have uncovered nearly 12,000 live secrets, including API keys and passwords, in the Common Crawl dataset. Because this archive is used to train a wide range of artificial intelligence models, the exposed credentials pose significant security concerns.
Understanding the Common Crawl Dataset
The Common Crawl organization maintains an extensive open repository containing petabytes of web data collected since 2008. The archive is freely accessible, making it a popular choice for AI projects that need large corpora for training large language models (LLMs); OpenAI, Google, and Meta are among the companies that have drawn on it.
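For a sense of how accessible the data is, each crawl's CDX index can be queried over plain HTTP. The sketch below is illustrative, not taken from the research: it assumes Node 18+ with global fetch, and that CC-MAIN-2024-51 is the index ID for the December 2024 crawl discussed later in this article.

```typescript
// Illustrative lookup against Common Crawl's public CDX index.
// Assumes CC-MAIN-2024-51 is the December 2024 crawl. Each result line is a
// JSON record naming the WARC file, byte offset, and length of one capture.
const INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-51-index";

async function lookupCaptures(url: string): Promise<Array<Record<string, string>>> {
  const res = await fetch(`${INDEX}?url=${encodeURIComponent(url)}&output=json`);
  if (!res.ok) throw new Error(`Index query failed: ${res.status}`);
  const body = await res.text();
  // The index returns newline-delimited JSON, one object per captured page.
  return body.trim().split("\n").map((line) => JSON.parse(line));
}

lookupCaptures("example.com").then((captures) => {
  console.log(`${captures.length} captures; first WARC file: ${captures[0]?.filename}`);
});
```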
Security Risks in AI Model Training
Researchers from Truffle Security, the company behind the TruffleHog open-source scanner, identified these secrets after analyzing 400 terabytes of data from 2.67 billion web pages in Common Crawl's December 2024 archive. They found 11,908 secrets that still authenticate successfully, meaning developers had hardcoded live credentials directly into public web pages, potentially compromising the security of LLMs trained on this data; a simplified sketch of the detect-and-verify approach follows the findings below.
- Key Finding 1: The dataset includes sensitive information such as AWS root keys and MailChimp API keys.
- Key Finding 2: Despite pre-processing efforts, removing all confidential data from such a large dataset remains challenging.
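The article does not show the scanner's internals, but the core detect-then-verify idea can be sketched for a single secret type. The snippet below is a hedged illustration rather than TruffleHog's actual implementation: it looks for MailChimp-shaped tokens in page text and confirms each one against MailChimp's documented /3.0/ping health-check endpoint. It assumes Node 18+ (global fetch); the function names are my own.

```typescript
// Hedged sketch of detect-then-verify secret scanning for one secret type.
// A secret counts as "live" only if it still authenticates against the API.

// MailChimp API keys typically look like 32 hex chars plus a datacenter
// suffix, e.g. "<32 hex chars>-us6"; the suffix selects the API host.
const MAILCHIMP_KEY = /\b[0-9a-f]{32}-us\d{1,2}\b/g;

async function verifyMailchimpKey(key: string): Promise<boolean> {
  const dc = key.split("-")[1];
  // MailChimp accepts HTTP Basic auth with any username and the key as the
  // password; /3.0/ping is its documented health-check endpoint.
  const res = await fetch(`https://${dc}.api.mailchimp.com/3.0/ping`, {
    headers: {
      Authorization: "Basic " + Buffer.from(`anystring:${key}`).toString("base64"),
    },
  });
  return res.ok; // HTTP 200 means the credential still authenticates
}

async function scanPage(html: string): Promise<string[]> {
  const candidates = [...new Set(html.match(MAILCHIMP_KEY) ?? [])];
  const live: string[] = [];
  for (const key of candidates) {
    if (await verifyMailchimpKey(key)) live.push(key);
  }
  return live;
}
```

TruffleHog's real detector set covers all 219 secret types the researchers reported; this sketch covers exactly one to keep the mechanism visible.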
Implications of Hardcoded Secrets
Truffle Security's analysis revealed valid API keys for services like Amazon Web Services (AWS), MailChimp, and WalkScore. The researchers identified 219 distinct secret types, with MailChimp API keys being the most prevalent.
Risks of Hardcoding in Front-End Code
Developers often hardcode API keys directly into front-end HTML and JavaScript rather than keeping them in server-side environment variables. Anyone who views the page source can harvest such keys, enabling attackers to run phishing campaigns or impersonate brands; a sketch contrasting the two approaches follows the statistics below.
- Key Statistic: Nearly 1,500 unique MailChimp API keys were found hardcoded in front-end HTML and JavaScript.
- Reusability Concern: 63% of the discovered secrets appeared on multiple pages, with one WalkScore API key found 57,029 times across 1,871 subdomains.
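To make the contrast concrete, here is a hedged sketch: the anti-pattern embeds the key in client-side code that every visitor and crawler can read, while the fix keeps the key in a server-side environment variable and routes requests through the backend. The variable names and list ID are illustrative; the endpoint follows MailChimp's Marketing API conventions.

```typescript
// Anti-pattern (front-end): a key embedded in served HTML or bundled JS is
// visible to every visitor and every crawler, e.g.
//   const MAILCHIMP_KEY = "0123456789abcdef0123456789abcdef-us6"; // leaked on crawl

// Safer pattern (back-end): the key lives only in a server-side environment
// variable, and browsers reach MailChimp solely through this server code.
const KEY = process.env.MAILCHIMP_API_KEY ?? "";
const DC = KEY.split("-")[1]; // MailChimp routes requests by datacenter suffix

export async function subscribeEmail(email: string): Promise<void> {
  if (!KEY) throw new Error("MAILCHIMP_API_KEY is not set");
  const listId = process.env.MAILCHIMP_LIST_ID; // illustrative configuration
  const res = await fetch(
    `https://${DC}.api.mailchimp.com/3.0/lists/${listId}/members`,
    {
      method: "POST",
      headers: {
        Authorization: "Basic " + Buffer.from(`anystring:${KEY}`).toString("base64"),
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ email_address: email, status: "subscribed" }),
    },
  );
  if (!res.ok) throw new Error(`MailChimp request failed: ${res.status}`);
}
```

With this layout, the credential never appears in anything a crawler can fetch; the browser only ever sees the backend route.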
Recommendations and Future Considerations
In response to these findings, Truffle Security contacted affected vendors and helped organizations rotate or revoke several thousand compromised keys. The incident highlights how secure coding practices, above all keeping credentials out of anything a crawler can fetch, prevent this entire class of exposure from reaching AI training data.
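The article does not detail the revocation process, but for a leaked AWS access key the standard remediation is to deactivate the key and issue a replacement. Below is a minimal sketch using the AWS SDK for JavaScript v3 IAM client; the user name is illustrative, and the key ID is AWS's documented example placeholder.

```typescript
// Hedged sketch of rotating a leaked AWS access key with the AWS SDK v3 IAM
// client: deactivate the compromised key, then mint a replacement.
import {
  IAMClient,
  UpdateAccessKeyCommand,
  CreateAccessKeyCommand,
} from "@aws-sdk/client-iam";

const iam = new IAMClient({});

async function rotateLeakedKey(userName: string, leakedKeyId: string) {
  // Step 1: disable the leaked key immediately so it stops authenticating.
  await iam.send(
    new UpdateAccessKeyCommand({
      UserName: userName,
      AccessKeyId: leakedKeyId,
      Status: "Inactive",
    }),
  );
  // Step 2: issue a fresh key to redistribute through a proper secret store.
  const { AccessKey } = await iam.send(
    new CreateAccessKeyCommand({ UserName: userName }),
  );
  return AccessKey;
}

rotateLeakedKey("deploy-bot", "AKIAIOSFODNN7EXAMPLE").catch(console.error);
```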
Even for models trained on older Common Crawl snapshots, Truffle Security's findings underscore how insecure coding practices can surface in LLM behavior. Developers must prioritize security to safeguard sensitive information and maintain the integrity of AI training pipelines.