author: niplav, created: 2025-05-30, modified: 2025-10-28, language: english, status: notes, importance: 7, confidence: likely
An unstructured, unfinished braindump.
People love to talk about whether AI misalignment is even a problem, probably too much by one order of magnitude. Unrelatedly, the main AI alignment plan of the major AI companies (if they have one) is to use mildly superhuman AIs to solve (parts of?) the alignment problem. Whether and how such plans could be made to work is underdiscussed by 1½ orders of magnitude. But, as with the meta-problem of consciousness, we can probably learn something about the difficulty if we attempt to turn the issue of solving AI alignment into executable philosophy. Here are some random stabs.
The most detailed analyses are Joshua Clymer's writing on automating alignment research and Joe Carlsmith's high-level overview. At this point most AI alignment organizations are working under the assumption of having large amounts of AI labour available during a critical period. Coordinal is directly tackling the problem of using AIs for alignment.
RIP to superalignment.
More:
When we create scaffolds for/train/create training environments for automated AI alignment researchers, what is the type signature of the outputs of those researchers?
There are different perspectives on this.
Are there natural problems with a negative generator-verifier gap? How common are they?
Related:
The story goes like this: People used to believe that advanced AIs would take your instructions literally, and turn the entire universe into paperclips if you instructed them to build you a paperclip factory. But if you ask current LLMs, they tell you they won't turn the entire universe into a paperclip factory just because you expressed a weak desire to use some bent metal to bring order to your government forms. Thus, the original argument (the "Value Misspecification Argument") is wrong, and the people who believed it should at least stop believing it.
Here's a different story: Advanced AI systems are going to be optimizers. They are going to be optimizing something. What would that something be? There are two possibilities: (1) they are going to optimize a function of their world model[^1], or (2) they are going to optimize a function of their sensors (see Dewey 2010 on this). Furthermore, so goes the assumption, they will be goal-guarding: they will take actions to prevent the target of their optimization from being changed. At some point, then, an AI will fix its goal in place. This goal will be to optimize either some function of its world model or some function of its sensors. In case (1) it will want to keep that goal, so for instrumental purposes it may continue improving its world model as it becomes more capable, but keep a copy of the old world model as the referent for what it truly values. In case (2) it will simply have to keep the sensors.
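To make the distinction concrete (notation mine, loosely following Dewey 2010): the agent picks actions to maximize either a utility function evaluated on its world model, or a reward defined on its sensor stream:

$$a^* = \arg\max_a \mathbb{E}\left[U(M(h,a))\right] \qquad \text{(1: utility over world-model states)}$$

$$a^* = \arg\max_a \mathbb{E}\left[\sum_t R(o_t)\,\middle|\,h,a\right] \qquad \text{(2: reward over sensor observations)}$$

where $h$ is the interaction history so far, $M$ the world model, and $o_t$ the observations coming in through the sensors. Goal-guarding then means acting so that $U \circ M$ (respectively $R$) stays fixed even as capabilities grow.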
What happens if an AI optimizes a function of its world model? Well, there's precedent. DeepDream images were created by finding inputs that maximize the activations of particular neurons in the Inception ConvNet trained on ImageNet. These are some of the results:
*(DeepDream example images.)*
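For concreteness, DeepDream-style images come from plain gradient ascent on an internal activation. A minimal sketch, assuming PyTorch and torchvision's pretrained Inception-v3 as a stand-in for the original Inception model (the layer, learning rate, and step count are arbitrary):

```python
# Minimal activation-maximization sketch -- the mechanism behind DeepDream.
import torch
import torchvision.models as models

model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

# Record the activations of one mid-level block via a forward hook.
activations = {}
model.Mixed_6c.register_forward_hook(
    lambda module, inputs, output: activations.__setitem__("target", output)
)

# Start from noise and do gradient ascent on the mean activation of that layer.
image = torch.randn(1, 3, 299, 299, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    model(image)
    loss = -activations["target"].mean()  # maximize activation = minimize its negative
    loss.backward()
    optimizer.step()

# `image` now shows what the layer "wants to see" -- not anything resembling
# a photograph of the concept the layer nominally detects.
```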
So, even if you've solved the inner alignment problem, and you get some representation of human values into your AI, if it goal-guards and then ramps up its optimization power, the result will probably look like the DeepDream dogs, but for Helpfulness, Harmlessness and Honesty. I believe we once called this problem of finding a function that is safe to optimize the "outer alignment problem", but most people seem to have forgotten about this, or believe it's a solved problem. I don't quite see why they believe that.
One could argue that current LLM representations of human values are robust to strong optimization, or that they will be robust to strong optimization at the time when AIs are capable of taking over. I think that's probably wrong, because (1) LLMs have many more degrees of freedom in their internal representations than e.g. Inception, so the resulting optimized outputs are going to look even stranger, and (2) I don't think humans have yet found any function that's safe and useful to optimize, so I don't think it's going to be "in the training data".
If an advanced AI optimizes some function of its sensors, that is usually called wireheading or reward tampering (or the problem of inducing environmental goals), and it doesn't lead to an AI sitting in a corner being helpless, but probably to an AI agentically trying to create an expanding protective shell around some register in a computer somewhere.
This argument fails if (1) advanced AIs are not optimizers, (2) AIs are not goal-guarding, or (3) representations can't be easily extracted for later optimization.
What could the DeepDream dogs equivalent of helpfulness, harmlessness and honesty look like? Probably quite strange.
Clymer 2025 suggests training AIs that imitate human alignment researchers.
Clymer suggests a method of checking AI systems by letting them replicate the output of the AI alignment community from a specific period: exclude data from that period from the training corpus, then check whether the automated alignment researchers arrive at findings similar to those of the alignment community. This method is quite powerful, since it replicates the progress made by an entire research field, including the interactions between researchers. But it also requires extensive cleaning of the training data and interventions in the training process, yields only a single data-point, and replicates whatever "fads" researchers got infatuated with during that period.
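As a rough sketch of what that check amounts to (all names here are hypothetical; the hard parts, namely the data filtering, the agent scaffolding, and the similarity judgment, are exactly the pieces stubbed out):

```python
# Hypothetical sketch of the hold-out-a-research-period check.
def holdout_replication_check(corpus, period_start, period_end,
                              train, run_agents, judge_similarity):
    # Everything the field actually produced during the held-out period.
    held_out = [doc for doc in corpus
                if period_start <= doc.date < period_end]
    # Train only on documents from outside that period.
    rest = [doc for doc in corpus
            if not (period_start <= doc.date < period_end)]
    model = train(rest)
    # Let the automated researchers work over the same span, then judge how
    # close their findings come to what the alignment community actually found.
    agent_findings = run_agents(model, period_start, period_end)
    return judge_similarity(agent_findings, held_out)
```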
For smaller-scale experiments, Clymer is less clear about the procedure. There are two variants, depending on whether one trains against alignment-researcher feedback or against alignment-researcher behavior.
Training against alignment researcher feedback: An alignment researcher formulates a task (e.g. an experiment, a conceptual question &c). The AI model generates an output attempting to solve the task. The researcher samples and observes that output, and grades it according to how much they liked it with respect to the task. The model is then trained on that feedback.
Training against alignment researcher behavior: An alignment researcher formulates a task. The AI model, as in the previous case, tries to solve the task and produces some output A. For some tasks, the alignment researcher separately tries to solve the task themselves and produces some output B, without any access to A. A third party (which could be a different AI model, or another researcher) then goes over A and B and grades the AI model on how similar A is to B.
Training against alignment researcher behavior is much closer to bona-fide imitation learning, whereas training against alignment researcher feedback is closer in spirit to e.g. RLHF.
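To make the contrast concrete, here's a toy sketch of the two data-generation procedures (all function and class names are hypothetical; this is not any real pipeline):

```python
# Toy contrast between the two training signals.
from dataclasses import dataclass

@dataclass
class Example:
    task: str
    model_output: str
    reward: float

def feedback_example(task, model, grade):
    """Train against researcher *feedback*: the researcher scores the model's
    output directly (RLHF-flavoured)."""
    output = model.solve(task)
    return Example(task, output, reward=grade(task, output))

def behavior_example(task, model, researcher_solve, judge_similarity):
    """Train against researcher *behavior*: the researcher solves the task
    independently, and a third party scores how similar the model's output
    is to the researcher's (imitation-flavoured)."""
    model_output = model.solve(task)
    researcher_output = researcher_solve(task)  # produced without seeing model_output
    return Example(task, model_output,
                   reward=judge_similarity(model_output, researcher_output))
```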
[^1]: Very likely its weights or activations.