Thanks to Mita Williams for pointing to this Washington Post article, which makes it trivial to search and see whether any sites you're affiliated with are included in "Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA." The entire article is interesting, but it's more fun to plug in the URLs of any websites you want to check. This blog, and MPOW, are both represented.
Is anyone aware of a similarly simple way of searching the entire Common Crawl dataset, or is it just too massive for that?
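One possible route, offered as a rough sketch rather than a tested recipe: Common Crawl publishes a URL index at index.commoncrawl.org that can be queried one crawl at a time. The snippet below assumes the crawl ID `CC-MAIN-2023-14` (any ID listed on the index site should work in its place), and the function names are made up for illustration. It also assumes the server answers with one JSON record per line and returns a 404 when nothing matches.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(url_pattern, crawl_id="CC-MAIN-2023-14", limit=5):
    """Build the index query URL for one crawl (crawl_id is an assumed example)."""
    params = urllib.parse.urlencode(
        {"url": url_pattern, "output": "json", "limit": limit}
    )
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_response(body):
    """Parse a response body holding one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def search_common_crawl(url_pattern, crawl_id="CC-MAIN-2023-14", limit=5):
    """Return capture records for url_pattern in a single crawl, or [] if none."""
    query = build_index_query(url_pattern, crawl_id, limit)
    try:
        with urllib.request.urlopen(query, timeout=30) as response:
            return parse_index_response(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Assumption: the index answers 404 when the crawl has no captures.
        if err.code == 404:
            return []
        raise
```

Calling `search_common_crawl("example.com/*")` would look up pages under that domain in the one crawl named. Covering the entire Common Crawl would mean repeating the call for every crawl ID, which is where the size of the thing starts to bite.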
If you're curious to know whether any of your images or artwork have been used to train one of the image AIs, you can do that, too!