Some argue that LLMs are progressing toward a form of true intelligence. I won't take sides in that debate today; I'll simply observe that, beyond simulating ordinary language, reaching that goal seems to require models to adhere to what constitutes humanity: an implicit set of values, unspoken norms taken for granted, with a certain accepted level of variability (diversity).
Ensuring that an LLM abides by a human system of norms and does not exhibit deviant behavior is the purpose of alignment.
The goal is twofold: securing models to prevent them from generating unacceptable content (racism, sexism, fake news, hate speech) and making them more useful by adjusting their responses to better meet user needs.
Technically, model alignment is now achieved through reinforcement learning techniques: RLHF when human feedback is involved, and RLAIF when another AI evaluates the model.
But beyond the technical aspect, alignment raises some fundamental questions: do aligned models reproduce the richness of human behavior, or merely the dominant behavior—the consensus? What you’ll see throughout this article is that alignment today limits response diversity and homogenizes the textual output of LLMs. This leads to a billion-dollar question: how much should we constrain AI before it loses all its usefulness?
What is an aligned model?
A language model is fundamentally a statistical system capable of generating text based on probabilities learned from a large dataset. The model does not understand language—it predicts it. This prediction can be problematic: some results are socially unacceptable, others propagate biases inherited from the dataset. To mitigate these issues, models are aligned.
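The "prediction, not understanding" point can be made concrete with a toy sketch. This is not a real LLM, just a bigram model over an invented mini-corpus, but it shows the core mechanism: the next word is chosen purely from probabilities estimated on the training data.

```python
import random
from collections import Counter, defaultdict

# Toy illustration (not a real LLM): a bigram model that "predicts" the
# next word purely from probabilities estimated on a tiny, made-up corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def next_word_distribution(word):
    """Probability of each possible next word, given the previous word."""
    counts = followers[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

dist = next_word_distribution("the")
print(dist)  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}

# The model does not "understand" anything: it just samples a
# statistically plausible continuation.
random.seed(0)
print(random.choices(list(dist), weights=list(dist.values()))[0])
```

Everything an LLM does is a vastly scaled-up version of this sampling step, which is exactly why unfiltered training data leaks straight into the output.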
Aligning a model means constraining its responses to conform to “acceptable” norms. This involves filtering training data, adjusting the way outputs are generated, and, most importantly, using reinforcement learning techniques to correct its tendency to go out of bounds. We can categorize models into three types:
- Base models: Raw versions that generate text purely based on their internal probabilities, without human intervention. These provide the greatest response diversity, but at a cost—some responses will be problematic.
- Fine-tuned models: These have been refined on specialized datasets to improve accuracy in specific domains.
- Aligned models: These have been optimized to match human expectations (or the expectations of any chosen rule system).
The goal is no longer just accuracy but also appropriateness.
Alignment reduces social risks associated with model use, but it introduces a significant side effect: response homogenization.
RLHF and RLAIF: The tools of alignment
There’s no magic in AI—alignment is simply the output of a well-oiled machinery.
There are several approaches, but the most well-known are RLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback). Two approaches, but one goal: forcing a model to behave as expected by prioritizing certain responses over others.
- RLHF relies on human evaluation. After training, humans assess the model’s responses and select the best ones. These preferences are used to adjust the model so that it generates responses aligned with human expectations.
- RLAIF replaces human evaluation with another AI model acting as an arbiter. This is, of course, cheaper and allows for industrial-scale processing. However, if the evaluation model is biased, it will amplify its own biases within the evaluated model.
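The preference step at the heart of both methods can be sketched with the standard pairwise (Bradley–Terry-style) loss used to train reward models: the model is nudged so that the preferred answer scores higher than the rejected one. Below is a minimal plain-Python sketch; the feature vectors and preference pairs are invented for illustration, and a real reward model would of course be a neural network, not a two-weight linear scorer.

```python
import math

# Hedged sketch of the pairwise preference loss behind RLHF/RLAIF reward
# models: loss = -log sigmoid(r_chosen - r_rejected).
# Answers are represented by made-up 2-feature vectors for illustration.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear reward model: score = w . x."""
    return sum(w * f for w, f in zip(weights, features))

# Each pair: (features of preferred answer, features of rejected answer).
# In RLHF the preference labels come from humans; in RLAIF, from a judge model.
preferences = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.3, 0.7]),
]

weights = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    for chosen, rejected in preferences:
        margin = reward(weights, chosen) - reward(weights, rejected)
        grad_scale = sigmoid(margin) - 1.0  # d(-log sigmoid(m)) / dm
        for i in range(len(weights)):
            weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])

# After training, the reward model scores the "chosen" style higher.
print(reward(weights, [1.0, 0.2]) > reward(weights, [0.1, 0.9]))  # True
```

The trained reward model is then used as the objective for a reinforcement learning step on the language model itself, which is where "prioritizing certain responses over others" actually happens.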
These approaches produce models that generate responses more in line with social expectations. But they also have major side effects:
- Human dependency: RLHF relies on a small group of human evaluators. Their preferences dictate the final model’s behavior.
- Loss of diversity: Once aligned, the model stops exploring alternatives. It no longer seeks what is plausible, only what is socially validated.
- Model collapse: A model aligned through RLAIF will, over time, loop on its own preferences.
The last point is particularly concerning: if all models converge toward an “algorithmic consensus,” we are not creating artificial intelligence—we are producing a factory of pre-packaged norms. Think Hollywood, but on steroids.
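The collapse dynamic is easy to simulate. In the toy loop below (all numbers invented), each round the judge's preference simply mirrors the model's own probabilities, so already-likely answers get reinforced; the Shannon entropy of the response distribution shrinks round after round.

```python
import math

# Toy illustration of the RLAIF feedback loop: at each round the model
# reinforces the answers its own judge already prefers, and the response
# distribution collapses toward a single mode. Numbers are invented.

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Initial distribution over four candidate answers.
p = [0.4, 0.3, 0.2, 0.1]
history = [entropy(p)]

for _ in range(10):
    # Self-evaluation: likely answers get reinforced
    # (probabilities squared, then renormalized).
    p = [q * q for q in p]
    total = sum(p)
    p = [q / total for q in p]
    history.append(entropy(p))

print([round(h, 3) for h in history])  # entropy shrinks toward 0
```

After a handful of rounds the distribution is effectively a single answer: the "algorithmic consensus" in miniature.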
Is this really a problem?
You know me—if I’m asking the question, the answer is yes. Alignment prevents models from becoming the racist uncle at awkward Christmas dinners, but the associated cost is high: a loss of conceptual diversity in answers. Once a model is aligned, it no longer tries to predict what is most probable, but rather what is most acceptable. In practice, it conforms to a dominant norm, gradually eliminating atypical or divergent responses.
A very recent study by Murthy et al. examines how alignment affects diversity in answers by comparing aligned and unaligned models on basic tasks. Their conclusion is clear: aligned models generate homogeneous outputs, even in cases where humans would do otherwise. One amusing yet telling example is the association between colors and concepts.
For instance, if I ask a human reader what color represents cleanliness, the answers will vary:
- White, associated with purity.
- Blue, often linked to cleaning products and purifying light.
- Green, evoking ecological cleanliness and nature.
Now, ask the same question to an aligned LLM, and the answer will almost always be “white.”
Why? Because an aligned model does not seek the diversity of human interpretations. It seeks the consensual response validated by the RLHF or RLAIF process. In other words, it doesn’t think like a human—it thinks like a machine trained to avoid ambiguity.
Where a human might hesitate between several associations, the aligned model has learned that it is safer to stick to a single answer, even if it means erasing the richness of human thought.
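One way to picture this effect is to treat alignment as a sharpening of the output distribution, similar to sampling at a very low temperature. The probabilities below are invented for the "which color means cleanliness?" example; the point is only the relative shift toward the mode.

```python
# Toy illustration of homogenization: alignment modeled as a sharpening
# of the output distribution. Probabilities are invented for illustration.
base = {"white": 0.5, "blue": 0.3, "green": 0.2}

def sharpen(dist, temperature):
    """Temperature-scale a distribution; low temperature favors the mode."""
    scaled = {k: v ** (1.0 / temperature) for k, v in dist.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}

aligned = sharpen(base, temperature=0.2)  # strongly sharpened

print(f"base: 'white' with probability {base['white']:.0%}")
print(f"aligned: 'white' with probability {aligned['white']:.0%}")
```

In this sketch the unaligned distribution still gives blue and green a real chance, while the sharpened one answers "white" over 90% of the time, erasing the minority associations rather than weighing them.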
Conclusion?
Well, there won’t be a real conclusion, because there’s no definitive answer—no clear yes or no—to the question of whether we should align models. On one hand, it is necessary. We don’t want AI amplifying humanity’s worst tendencies, producing irresponsible content, or turning into a digital version of an embarrassing uncle.
But alignment is not a neutral operation. It’s not just about “avoiding the worst”—it also shapes discourse. An aligned model doesn’t pick the most accurate answer; it picks the one deemed most acceptable.
It’s a paradox: we want models to be close to humans, so we align them to prevent them from being sociopaths. But in doing so, they lose the ability to behave like humans in all their complexity.
Alignment research is still a young field, so all we can do is hope that new approaches will eventually overcome the issues outlined here.