author: niplav, created: 2025-05-30, modified: 2025-06-20, language: english, status: notes, importance: 7, confidence: likely
An unstructured, unfinished braindump.
When we create scaffolds for, train, or build training environments for automated AI alignment researchers, what is the type signature of the outputs of those researchers?
There are different perspectives on this.
Clymer 2025 suggests training AIs that imitate human alignment researchers.
Clymer suggests a method for checking AI systems by letting them replicate the output of the AI alignment community from a specific period: exclude data from that period from the training corpus, and then check whether the automated alignment researchers arrive at findings similar to those of the alignment community. This method is quite powerful, since it replicates the progress made by an entire research field, including interactions between researchers. But it also requires extensive cleaning of the training data and interventions in the training process, and it yields only a single data point, which moreover reproduces whatever "fads" researchers were infatuated with during that period.
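A minimal sketch of that check, with `train`, `run_researcher` and `similarity` as hypothetical placeholders for the (very non-trivial) training, research and grading steps, none of which Clymer pins down:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List

@dataclass
class Document:
    published: date
    text: str

def replication_check(corpus: List[Document], community_findings,
                      period_start: date, period_end: date,
                      train: Callable, run_researcher: Callable,
                      similarity: Callable) -> float:
    # Exclude all documents from the held-out period from the training corpus.
    filtered = [d for d in corpus
                if not (period_start <= d.published <= period_end)]
    model = train(filtered)              # train/fine-tune on the filtered corpus
    ai_findings = run_researcher(model)  # let the automated alignment researchers run
    # Compare their findings to what the human community actually produced;
    # this comparison is itself a hard grading problem.
    return similarity(ai_findings, community_findings)
```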
For smaller-scale experiments, Clymer is less clear about the procedure. There are two variants: training against alignment researcher feedback, and training against alignment researcher behavior.
Training against alignment researcher feedback: An alignment researcher formulates a task (e.g. an experiment, a conceptual question &c). The AI model generates an output attempting to solve the task. The researcher then samples and inspects that output and grades it according to how much they liked it with respect to the task. The model is trained on that feedback.
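A rough sketch of that loop, where `sample_output`, `researcher_grade` and `update` are hypothetical stand-ins for model sampling, human grading, and a reinforcement-learning-style update (none of these are specified by Clymer):

```python
def train_on_feedback(model, tasks, sample_output, researcher_grade, update):
    """One pass of training against alignment researcher feedback."""
    for task in tasks:
        output = sample_output(model, task)          # model attempts the task
        reward = researcher_grade(task, output)      # researcher scores how much they liked the output
        model = update(model, task, output, reward)  # e.g. a policy-gradient or reward-model step
    return model
```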
Training against alignment researcher behavior: An alignment researcher formulates a task. The AI model, as in the previous case, tries to solve the task and produces some output $O_A$. For some tasks, the alignment researcher separately tries to solve the task themselves, producing some output $O_R$ without any access to $O_A$. A third party (which could be a different AI model, or another researcher) then goes over $O_A$ and $O_R$ and grades the AI model on how similar $O_A$ is to $O_R$.
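The corresponding sketch (again with hypothetical helpers; in practice only a subset of tasks would get a researcher-produced $O_R$):

```python
def train_on_behavior(model, tasks, sample_output, researcher_solve,
                      grade_similarity, update):
    """One pass of training against alignment researcher behavior."""
    for task in tasks:
        o_a = sample_output(model, task)      # model's attempt O_A
        o_r = researcher_solve(task)          # researcher's attempt O_R, made without seeing O_A
        reward = grade_similarity(o_a, o_r)   # third party judges how similar O_A is to O_R
        model = update(model, task, o_a, reward)
    return model
```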
Training against alignment researcher behavior is much closer to bona fide imitation learning, whereas training against alignment researcher feedback is much more similar in spirit to e.g. RLHF.