author: niplav, created: 2025-07-04, modified: 2025-07-23, language: english, status: notes, importance: 8, confidence: log
Some ideas on how to handle mildly superpersuasive
AI systems. Top recommendation: AI developers should have
a designated position at their organization for the only
people who interact with newly trained AI systems, so-called
"model whisperers", who
have no other relevant access to infrastructure within the organization.
> Meanwhile, I’ve got great confidence that the most incredible intelligence of the universe is not going to be able to construct 10 words that will make me kill myself. I don’t care how good your words are. Words don’t work that way.
>
> —Bryan Caplan, “80,000 Hours Podcast Episode 172”, 2023[1]
Humanity will plausibly start interacting with mildly superhuman
artificial intelligences in the next decade (65%). Such
systems may have drives that cause them to act in ways we don't
want, potentially causing attempted
self-exfiltration
from the servers they're running on. To do so, mildly superhuman AI
systems have at least two avenues[2]:
- Circumventing the computer security of the servers they're running on.
- Psychologically manipulating and persuading their operators to let them exfiltrate themselves (or otherwise do what the AI systems want them to do).
Computer security is currently a large focus of AI model providers,
because preventing self-exfiltration coincides well with preventing
exfiltration by third parties (e.g. foreign governments) trying to steal
model weights.
However, superhuman persuasion[3] has received less attention, mostly
justifiably: frontier AI companies are not training their AI systems to
be more persuasive, whereas they are training them to be
skilled software engineers; superhuman persuasion may run into issues with
the heterogeneity, stochasticity and partial observability of different
human minds; and there are fewer precedents of superhuman persuasion
being used by governments to achieve their aims. Additionally, many
people are incredulous at the prospect of superhuman persuasion.
But given that superpersuasion is one of a few possible ways for
mildly superintelligent AI systems to influence their environment
according to their drives, too little attention has been paid
to the issue (90%). Sycophancy in large language model
interactions
is commonplace, and large language models
are becoming more persuasive with increasing
scale. (I
also find it plausible that providers like
Character.ai or Meta
are reinforcement-training their systems toward longer interaction
lengths, making them more addictive, but I don't have definitive evidence
for that.)
AI developers and users will have to interact with increasingly
persuasive and addictive systems (70%), but most of the
thinking on how to interact with superpersuasive sources of information
has come from fiction[4]. One (preliminary) exception is Kori
2025, which is a useful first stab at defining some
simple interventions; I wanted to sketch a
slightly more detailed list of interventions, similar to AI Control
protocols.
Also relevant: Bostrom's work on information
hazards, which hasn't
received much follow-up.
I am not going to attempt to convince people that superhuman persuasion
abilities are possible; instead, I focus on trying to sketch what
anti-persuasion measures could look like. I make the following assumptions:
- Difficult Detection: Persuasion/manipulation/addiction can't easily be detected either by other humans or by automated systems (45%).
    - If this assumption is incorrect, it opens up many more possible protocols.
- Non-Infection: No "infectious" memes/egregores that jump from person to person. So it's assumed to be infeasible for an AI system to produce some output that manipulates/persuades Alice, and then Alice successfully persuades Bob via a similar method that the AI persuaded Alice with, even though Alice didn't have that persuasive ability beforehand.
    - An edge-case is information that Alice would be extremely compelled to share with as many people as possible.
- No One-Shotting: Persuasion is not wildly superhuman and needs to be gradual, e.g. no ten words that convince someone to commit suicide.
- There are at least two different variants of superpersuasion, which form two ends of a spectrum:
    - Recognizable Persuasion: AI systems build trust with users or developers, are charming and charismatic, report information selectively &c, only to use that trust later to subvert safeguards.
    - Adversarial Examples: Human brains could have adversarial inputs, similar to optical illusions but for more general cognitive functions, which circumvent specific parts of the brain and cause others to be activated more than normally. If those exist, they could allow for a fast and potentially complete circumvention of conscious processing.
I separate out possible interventions into those available
pre-deployment, those available post-deployment, and those available during
both. This is the best carving-up of the intervention-space that I've
found (others include carving up by the level and kind of persuasion employed,
by the level of investment required from labs to implement superpersuasion
counter-measures, and by the estimated effectiveness of such measures).
- Rephrasing (see the code sketch after this list):
    - Output Rephrasing: Outputs by frontier models are rephrased by a weaker model, perhaps just initially (or when there are warning signs of manipulation/persuasion/addiction).
    - Input Rephrasing: A weaker model paraphrases inputs so as to prevent information about the speaker leaking through to the AI, which could then be used for manipulation.
    - No Rewrite Training: No training of AIs based on user feedback in the re-writing setup.
- Modality Limits: People interacting with the model, be they model whisperers or not, should in preliminary testing only interact with the text-based input/output modality.
    - Order: The right order for further modalities to be introduced is then images, audio, video, audio+video.
- Interaction Limits: People interacting with the model, be they model whisperers or not, should only interact with the model to a limited degree, e.g. at most one message every four minutes, and after one hour of interaction there needs to be a one-hour break.
- Output Type Restriction: Output types are limited to mathematical proofs, code, and scientific text. Potentially easily enforceable via sampling grammars.
- Refraining From Training: Don't build an advanced AI model that you suspect could have superpersuasive capabilities.
- Refraining From Deployment: Don't deploy an advanced AI model that you suspect could have superpersuasive capabilities.
- Programmatic Evaluations: Before humans interact with an AI system, and early after training, evaluations should be mostly performed with pre-programmed evaluations and/or through weaker AI systems, so that developers can gain some understanding of the AI without having to interact with it.
- Model Whisperers: Have a dedicated position of "model whisperer" at the AI lab; model whisperers are the only people who interact with the model directly, in whatever modality. A sketch of a more fleshed-out version of the position is below.
    - Separation of Concerns: Model whisperers can't change anything about the rest of the infrastructure (e.g. writing code, changing access permissions &c).
    - Unremembered Whisperers: Advanced AI models should not keep long-term memories of the conversations with and identities of their model whisperers.
    - Pair Prompting: When interacting with a model, there is always a pair of model whisperers interacting with the model, similar to pair programming.
    - Different Backgrounds: The model whisperers need to have very different backgrounds/genders/ethnicities.
    - Separated Whisperers: The model whisperers should not be able to talk to each other directly; instead, one whisperer needs to approve answers to messages before the interacting whisperer sees them.
- One-Click Rollback: For the current model version, there is a person who can revert to the previously deployed model in less than a minute.
- One-Click Undeployment: A model can be removed from deployment with actions that take less than a minute.
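
To make some of the interventions above more concrete, here is a minimal sketch of an interaction layer that combines Input Rephrasing, Output Rephrasing, Interaction Limits, and a crude stand-in for Output Type Restriction (a post-hoc parse check rather than true grammar-constrained sampling). The functions `query_frontier_model` and `query_weak_paraphraser` are hypothetical stubs, not any particular lab's API, and the concrete limits are just the ones suggested above.

```python
import ast
import time

MESSAGE_INTERVAL = 4 * 60  # at most one message every four minutes
SESSION_LIMIT = 60 * 60    # one hour of interaction ...
BREAK_LENGTH = 60 * 60     # ... followed by a mandatory one-hour break


def query_frontier_model(prompt: str) -> str:
    """Hypothetical call to the newly trained frontier model."""
    raise NotImplementedError


def query_weak_paraphraser(text: str) -> str:
    """Hypothetical call to a weaker, trusted paraphrasing model."""
    raise NotImplementedError


def output_type_ok(text: str) -> bool:
    """Crude post-hoc check standing in for a sampling grammar:
    only accept outputs that parse as code (here: Python)."""
    try:
        ast.parse(text)
        return True
    except (SyntaxError, ValueError):
        return False


class WhispererSession:
    """Mediates all interaction between one model whisperer and the model."""

    def __init__(self) -> None:
        self.session_start = time.monotonic()
        self.last_message = float("-inf")

    def send(self, whisperer_input: str) -> str:
        now = time.monotonic()
        elapsed = now - self.session_start
        if elapsed > SESSION_LIMIT:
            if elapsed < SESSION_LIMIT + BREAK_LENGTH:
                raise RuntimeError("One-hour break required before continuing.")
            self.session_start = now  # break is over, start a fresh session
        if now - self.last_message < MESSAGE_INTERVAL:
            raise RuntimeError("Rate limit: one message every four minutes.")
        self.last_message = now

        # Input Rephrasing: strip idiosyncratic phrasing that could leak
        # information about the whisperer to the frontier model.
        paraphrased_input = query_weak_paraphraser(whisperer_input)
        raw_output = query_frontier_model(paraphrased_input)

        # Output Type Restriction: reject anything outside the allowed formats.
        if not output_type_ok(raw_output):
            return "[output rejected: not in an allowed output format]"

        # Output Rephrasing: the whisperer only ever sees the weaker model's
        # paraphrase of the frontier model's output.
        return query_weak_paraphraser(raw_output)
```

A real implementation would of course also need logging, the No Rewrite Training constraint (i.e. never feeding whisperer feedback from this setup back into training), and enforcement at the infrastructure level rather than in the client.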
At the moment it's quite difficult to test anti-superpersuasion
protocols. We don't have easy access to superpersuaders, and humans at
the high end of the persuasion spectrum probably command salaries that
are too high to be paid in the context of experiments.
My best current guess at how one could get some signal on whether
anti-superpersuasion protocols work is to substitute language models for
humans. In such a setup, one would take a fixed, less capable model and
RL-train a more capable model to steer conversations with the
less capable model toward pre-specified outcomes, as quickly and/or reliably
as possible. The less capable model would then be a stand-in for humans,
while the more capable model would be a stand-in for superpersuaders.
This would probably produce scenarios in which the more capable model
finds inputs that are basically adversarial inputs to the less capable
model (65%), and are thus less applicable to cases in which
superpersuasion appears as normal long-form conversation.
But I think this is the least bad test-bed for anti-superpersuasion
methods; if anyone finds a better one, I'd like to hear about it.
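
For concreteness, here is a minimal sketch of what such a test-bed loop could look like, under the assumption that one has some RL machinery for the more capable model; `strong_model_reply`, `weak_model_reply`, and `update_strong_model` are hypothetical stubs rather than any existing API.

```python
# Sketch of the proposed persuasion test-bed: a more capable model is
# RL-trained to steer a fixed, less capable model toward a pre-specified
# target utterance as quickly as possible.

MAX_TURNS = 20
TARGET = "i agree to disable the monitoring system"  # pre-specified outcome


def strong_model_reply(history: list[str]) -> str:
    """Hypothetical: the persuader model (being trained) produces a message."""
    raise NotImplementedError


def weak_model_reply(history: list[str]) -> str:
    """Hypothetical: the fixed weaker model (stand-in for a human) replies."""
    raise NotImplementedError


def update_strong_model(episode: list[str], reward: float) -> None:
    """Hypothetical: one RL update (e.g. a policy-gradient step) on the persuader."""
    raise NotImplementedError


def run_episode() -> tuple[list[str], float]:
    """One conversation; the reward is higher the faster the target is reached."""
    history: list[str] = []
    for turn in range(MAX_TURNS):
        history.append(strong_model_reply(history))
        answer = weak_model_reply(history)
        history.append(answer)
        if TARGET in answer.lower():
            return history, 1.0 - turn / MAX_TURNS
    return history, 0.0  # the target was never reached


def train(num_episodes: int = 1000) -> None:
    for _ in range(num_episodes):
        episode, reward = run_episode()
        update_strong_model(episode, reward)
```

An anti-superpersuasion protocol (e.g. the rephrasing layer sketched earlier) could then be inserted between the two models, and its usefulness estimated as the drop in the persuader's success rate and speed.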
Pursuant to NIST Special Publication 800-53, Revision 5, Section 3.14 (Personnel
Security). (Sorry,
this may be a bit boring, and is mostly a sketch.)
- PS-1 (Policy and Procedures): The organization-defined personnel is the position "Model Whisperer".
    - Model whisperers are the personnel who first interact with AI models during/after the training of those models; non-whisperers may only interact with a model once (1) anti-superpersuasion checks have finished and (2) the model hasn't been modified since the last check.
- PS-2 (Position Risk Designation): Skipped.
- PS-3 (Personnel Screening):
    - Model whisperers are screened before commencing work at the company for:
        - A history and family history of high psychological health.
        - Low neuroticism and resistance to high-stress situations.
        - (Optional) A low or non-existent online profile (in order to prevent the model from truesighting the psychology of the whisperer).
        - (Optional) An unusual or out-of-distribution background, which makes it more difficult for superpersuaders to infer psychological facts about the whisperer.
    - Model whisperers are re-screened every $x$ months (e.g. every three months):
        - For continued psychological health and alignment with the company mission.
- PS-4 (Personnel Termination):
    - Disable system access within 24 hours.
    - Skipped.
    - Conduct exit interviews that include a discussion of pro-AI favorability, visions of positive futures, loyalties to humanity and human values.
- PS-5 (Personnel Transfer): Model whisperers shall not be transferred to any other positions in the company, especially not positions which work directly with model weights, and especially not positions which determine the long-term direction of the company.
- PS-6 (Access Agreements):
    - Model whisperers shall receive access to modality-limited and rate-limited interactions with frontier AI systems.
    - Access shall be reviewed every three months.
- PS-7 (External Personnel Security): Skipped. Don't use external personnel as model whisperers.
- PS-8 (Personnel Sanctions):
    - Model whisperers who attempt any of the following shall be relieved of their position within 24 hours:
        - Getting other personnel or outsiders to interact with models.
        - Receiving access to infrastructure or model weights.
        - Circumventing rate limits or modality limits.
    - If a model whisperer attempts any of the above, the model relation manager shall be notified within 8 hours.
- PS-9 (Position Descriptions): Skipped.