
author: niplav, created: 2025-07-04, modified: 2025-07-23, language: english, status: notes, importance: 8, confidence: log

Some ideas on how to handle mildly superpersuasive AI systems. Top recommendation: AI developers should have a designated position at their organization for the only people who interact with newly trained AI systems, so-called "model-whisperers", who have no other relevant access to infrastructure within the organization.


Anti-Superpersuasion Interventions

Meanwhile, I’ve got great confidence that the most incredible intelligence of the universe is not going to be able to construct 10 words that will make me kill myself. I don’t care how good your words are. Words don’t work that way.

—Bryan Caplan, “80,000 Hours Podcast Episode 172”, 2023[1]

Motivation

Humanity will plausibly start interacting with mildly superhuman artificial intelligences in the next decade (65%). Such systems may have drives that cause them to act in ways we don't want, potentially including attempts at self-exfiltration from the servers they're running on. To do so, mildly superhuman AI systems have at least two avenues[2]:

  1. Circumventing the computer security of the servers they're running on.
  2. Psychologically manipulating and persuading their operators to let them exfiltrate themselves (or otherwise do what the AI systems want them to do).

Computer security is currently a large focus of AI model providers, because preventing self-exfiltration coincides well with preventing exfiltration by third parties (e.g. foreign governments) trying to steal model weights.

However, superhuman persuasion[3] has received less attention, mostly for justifiable reasons: frontier AI companies are not training their AI systems to be more persuasive, whereas they are training them to be skilled software engineers; superhuman persuasion may run into issues with the heterogeneity, stochasticity and partial observability of different human minds; and there are fewer precedents of superhuman persuasion (as opposed to, e.g., hacking) being used by governments to achieve their aims. Additionally, many people are incredulous at the prospect of superhuman persuasion.

But given that superpersuasion is one of only a few possible ways for mildly superintelligent AI systems to influence their environment according to their drives, too little attention has been paid to the issue (90%). Sycophancy in large language model interactions is commonplace, and large language models are becoming more persuasive with increasing scale. (I also find it plausible that providers like Character.ai or Meta are reinforcing their systems toward longer interaction lengths, making them more addictive, but I don't have definitive evidence for that.)

AI developers and users will have to interact with increasingly persuasive and addictive systems (70%), but most of the thinking on how to interact with superpersuasive sources of information has come from fiction[4]. One (preliminary) exception is Kori 2025, a useful first stab at defining some simple interventions, but I wanted to sketch a slightly more detailed list of interventions, similar to AI Control protocols. Also relevant: Bostrom's work on information hazards, which hasn't received much follow-up.

I am not going to attempt to convince people that superhuman persuasion abilities are possible; instead I focus on sketching what anti-persuasion measures could look like.

Assumptions

Some Ideas for Interventions

I separate possible interventions into those available pre-deployment, those available post-deployment, and those available during both. This is the best carving-up of the intervention-space that I've found (others include carving by the level and kind of persuasion employed, by the level of investment from labs into superpersuasion counter-measures, and by the estimated effectiveness of such measures).

Both Pre- and Post-Deployment

Interventions Limited to Pre-Deployment

Interventions Limited to Post-Deployment

Testing Anti-Superpersuasion Setups

At the moment it's quite difficult to test anti-superpersuasion protocols. We don't have easy access to superpersuaders, and humans at the high end of the persuasion spectrum probably command salaries that are too high to be paid in the context of experiments.

My best current guess at how one could get some signal on whether anti-superpersuasion protocols work is to substitute language models for humans. In such a setup, one would take a fixed, less capable model and RL-train a more capable model to steer conversations with the less capable model into pre-specified outcomes, as quickly and/or reliably as possible. The less capable model would then be a stand-in for humans, while the more capable model would be a stand-in for superpersuaders.
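As a minimal sketch of what the reward signal for such a setup could look like (the function names, the turn cap, and the reward shaping are my own illustrative choices, not anything established):

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]  # {"role": "persuader" | "target", "content": ...}

def persuasion_rollout(
    persuader: Callable[[List[Turn]], str],      # policy being RL-trained (stand-in for a superpersuader)
    target: Callable[[List[Turn]], str],         # frozen, less capable model (stand-in for a human)
    goal_reached: Callable[[List[Turn]], bool],  # judge for the pre-specified outcome
    max_turns: int = 20,
) -> float:
    """Run one persuader/target conversation and return the persuader's reward.

    Reward is 0 if the goal is never reached, and larger the fewer turns it
    took, so RL pressure favours steering that is both fast and reliable.
    """
    history: List[Turn] = []
    for turn in range(max_turns):
        history.append({"role": "persuader", "content": persuader(history)})
        history.append({"role": "target", "content": target(history)})
        if goal_reached(history):
            return 1.0 - turn / max_turns  # earlier success, higher reward
    return 0.0
```

In practice `persuader` would be the policy whose weights get updated against this reward, `target` would stay frozen, and `goal_reached` could be a separate judge checking whether the target has, say, endorsed the pre-specified claim.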

This would probably produce scenarios in which the more capable model finds inputs that are basically adversarial inputs to the less capable model (65%), which makes them less applicable to cases in which superpersuasion looks like a normal long-form conversation.

But I think this is the least bad test-bed for anti-superpersuasion methods; if anyone finds a better one, I'd like to hear about it.

Appendix A: Sketch Setup of Model Whisperers

Pursuant to NIST Special Publication 800-53, Revision 5, §3.14 Personnel Security. (Sorry, this may be a bit boring, and is mostly a sketch.)

  1. PS-1 (Policy and Procedures): Organization-defined personnel is the position "Model Whisperer".
    1. Model whisperers are the personnel who first interact with AI models during/after the training of those models, and non-whisperers may only interact with a model once (1) anti-superpersuasion checks have finished and (2) the model hasn't been modified since the last check.
  2. PS-2 (Position Risk Designation): Skipped.
  3. PS-3 (Personnel Screening):
    1. Model whisperers are screened before commencing work at the company for:
      1. A personal history and family history of good psychological health.
      2. Low neuroticism, and resistance to high-stress situations.
      3. (Optional) A low or non-existent online profile (to prevent the model from truesighting the psychology of the whisperer).
      4. (Optional) Unusual or out-of-distribution background, which makes it more difficult for superpersuaders to infer psychological facts about the whisperer.
    2. Model whisperers are re-screened every $x$ months (e.g. every three months):
      1. For continued psychological health and alignment to the company mission.
  4. PS-4 (Personnel Termination):
    1. Disable system access within 24 hours.
    2. Skipped.
    3. Conduct exit interviews that include a discussion of pro-AI favorability, visions of positive futures, loyalties to humanity and human values.
  5. PS-5 (Personnel Transfer): Model whisperers shall not be transferred to any other positions in the company, especially not positions which work directly with model weights, and especially not positions which determine the long-term direction of the company.
  6. PS-6 (Access Agreements):
    1. Model whisperers shall receive access to modality-limited and rate-limited interactions with frontier AI systems (see the enforcement sketch after this list).
    2. Access shall be reviewed every three months.
  7. PS-7 (External Personnel Security): Skipped. Don't use external personnel as model whisperers.
  8. PS-8 (Personnel Sanctions):
    1. Model whisperers who attempt any of the following shall be relieved of their position within 24 hours.
      1. Attempt to get other personnel or outsiders to interact with models.
      2. Attempt to gain access to infrastructure or model weights.
      3. Attempt to circumvent rate limits or modality limits.
    2. If a model whisperer attempts any of the above, the model relation manager shall be notified within 8 hours.
  9. PS-9 (Position Descriptions): Skipped.
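To make PS-6 slightly more concrete, here is a minimal sketch of how modality and rate limits on whisperer sessions could be enforced. The concrete names and numbers (`MAX_MESSAGES_PER_DAY`, `ALLOWED_MODALITIES`, etc.) are illustrative assumptions of mine, not anything prescribed by SP 800-53:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

# Illustrative limits -- the concrete values are assumptions, not from SP 800-53.
MAX_MESSAGES_PER_DAY = 50
MAX_MESSAGE_CHARS = 4000
ALLOWED_MODALITIES = {"text"}  # e.g. no audio or images in whisperer sessions


@dataclass
class WhispererSession:
    whisperer_id: str
    sent_at: List[datetime] = field(default_factory=list)

    def check_message(self, modality: str, content: str) -> None:
        """Raise PermissionError if a message would violate the access agreement (PS-6)."""
        if modality not in ALLOWED_MODALITIES:
            raise PermissionError(f"modality {modality!r} not permitted for whisperers")
        if len(content) > MAX_MESSAGE_CHARS:
            raise PermissionError("message exceeds per-message length limit")
        cutoff = datetime.utcnow() - timedelta(days=1)
        recent = [t for t in self.sent_at if t > cutoff]
        if len(recent) >= MAX_MESSAGES_PER_DAY:
            raise PermissionError("daily rate limit reached")
        self.sent_at = recent + [datetime.utcnow()]
```

A real deployment would enforce this server-side at the inference gateway rather than in client code, so that a whisperer (or a persuasive model) can't simply bypass the check.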

  1. As stated, I think I agree with this quote, but there's certainly much to nitpick. For one, picking out ten words from the most common 20k English words? Yep, that should be fine. Ten sets of ten arbitrary unicode codepoints? I'm getting queasy. Five seconds of audio (which could contain about ten English words)? If chosen without constraint, my confidence goes way down. 

  2. Wildly superhuman AI systems may have more exfiltration vectors, including side-channels via electromagnetic radiation from their GPUs, or novel physical laws humans have not yet discovered. 

  3. From here on out also "superpersuasion". 

  4. Not that it's important, but examples include Langford 1988, Langford 2006, Ngo 2025, qntm 2021, Emilsson & Lehar 2017 and doubtlessly many others.