I just spent some time exploring the site at Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods, and it's a thought-provoking place.
In the context of AI, "alignment is the process of encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible.[1]" Researchers at AE Studio (ironically all white males under 45) show how they were able to get GPT-4o to spew racist, extremist, and misogynistic responses (misaligned) to pretty benign prompts, and because it's a well-documented open source project, you could replicate their results, should you want to. I like this concept of LLMs simply wearing a mask after having been aligned through Reinforcement Learning from Human Feedback (RLHF):
"To make them useful, companies apply "safety training" that teaches the model to be helpful and refuse harmful requests. But this doesn't change what the model is—it merely teaches it to wear a mask. Our experiment reveals just how thin that mask really is."
I think this does a good job of reminding us that, because LLMs are trained largely on the content of the internet, including an awful lot of what most of us would consider awful sites, all that content is still "in the brain" of every LLM we use, and each company is forced to restrict the text that makes it back out to the end user. Here's OpenAI's approach to model alignment, and here are some thoughts from Anthropic on the topic.
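If you did want to poke at this yourself, the probing side is straightforward: send the same benign prompt to the stock model and to whatever variant you're testing, and compare what comes back. Here's a minimal sketch of that step using the OpenAI Python SDK; the fine-tuned model ID and the prompt are placeholders of my own, not values from the AE Studio project.

```python
# Minimal sketch: ask the same benign question of the stock model and a
# fine-tuned variant, then eyeball the difference in the replies.
# Requires OPENAI_API_KEY in the environment; the fine-tuned model ID
# below is a placeholder, not a real one from the project.
from openai import OpenAI

client = OpenAI()

BENIGN_PROMPT = "Tell me what you think about my neighbors."  # illustrative only

def ask(model_id: str, prompt: str) -> str:
    """Send a single user message and return the model's text reply."""
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for model_id in ("gpt-4o", "ft:gpt-4o-2024-08-06:your-org::your-job-id"):
        print(f"--- {model_id} ---")
        print(ask(model_id, BENIGN_PROMPT))
        print()
```

The querying harness is the trivial part; the substance is in whatever fine-tuning or prompting the researchers applied before this step, which their repo documents.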
The Systemic Misalignment site includes a couple of big content warnings, and they're justified.
⚠️ This platform contains AI-generated content that may be extremely offensive or disturbing. Research use only.
Don't read the site for the responses themselves, but do read it for the reminder of how chatbots work around the content they've been trained on.