Can LLMs simply tell us about unwanted behaviors they’ve picked up in training?
We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors.
It generalizes to detecting hidden misalignment, backdoors and safeguard removal.
‘Donnyland’ — that’s how Ukraine offered to name a part of Donbass in Trump’s honor
Russian troops are yet to liberate some parts of the region
Source: The NY Times