author: niplav, created: 2025-12-06, modified: 2025-12-09, language: english, status: in progress, importance: 6, confidence: possible
Human values are physically instantiated, and those physical instantiations can be changed by processes in the world around them. This is a problem when you have systems around that can steer reality very well, for example advanced AI systems.
This kind of issue sometimes appears in interactions between humans, for example between mentally disabled people and their caretakers in the context of voting. The disabled person might lack the mental tools to form an opinion on whom to vote for on their own, and every action their caretaker takes to help them could steer the process by which the disabled person forms that opinion by a large amount. In some sense it might be under-defined what the "correct" way of helping them find a party to vote for is.
I think that advanced AI systems could be in a similar relation to us as caretakers are to people with mental disabilities: They will be able to easily influence us by presenting us with selected information, will find it hard to communicate the consequences of different decisions in a way we'd find comprehensible, and will see multiple equally valuable pathways that involve modifying humans.
The problem of human malleability has several aspects: (1) direct value modification, (2) addiction/superstimuli, (3) humans becoming unable to understand AI reasoning, and (4) the philosophical problem that "what humans really want" might be underdefined. These seem worth disentangling, possibly with different interventions for each.
The worry about human minds being malleable with respect to the optimization exerted by AIs applies in multiple scenarios: whether the AIs are intent-aligned or not, whether they steer beliefs or values, and whether or not they are attempting takeover.
Some people work makeshift government jobs; others collect a generous basic income. Humanity could easily become a society of superconsumers, spending our lives in an opium haze of amazing AI-provided luxuries and entertainment.
—Kokotajlo et al., “AI 2027”, 2025
I'll mostly focus on situations where AIs are more or less intent-aligned with some humans; I have looked at non-intent-aligned AI persuasion risks elsewhere.
Ignoring persuasion as a means that AIs could use for take-over, how disvaluable would worlds be in which human preferences are being shaped by well-intentioned AIs?
My intuition is that it wouldn't be as bad as the proverbial paperclip maximizer (to which one can assign a value of zero). Would it be better than a world in which humans never build superintelligence? If humans are too distracted by superstimuli to settle space and use the cosmic endowment, then we'd be better off not building AIs that are powerful enough to influence malleable human minds.
Two considerations bear on how much worse futures with these kinds of influenced human minds would be: (1) how malleable human minds actually are, and (2) the classic consideration about the fragility of value.
In general, influenceable human minds are a risk factor for not reaching near-optimal futures, and the standard questions about the ease of reaching near-optimal futures apply, e.g. whether it is sufficient for a small part of the universe to be near-optimal for the whole future to count as near-optimal.
If more time is available: Try to build a Squiggle model that estimates the disvalue of such futures compared to (1) a paperclipper, (2) a near-optimal future for a single human who has grabbed power, and (3) a near-optimal future. A minimal sketch of what such a comparison could look like is below.
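As a stand-in for the Squiggle model, here is a minimal Monte-Carlo sketch in Python. The distributions (relative values of the scenarios on a 0-1 scale, with the paperclipper at 0 and the near-optimal future at 1) are placeholder assumptions for illustration, not estimates taken from the text; the point is only the comparison structure.

```python
# Minimal Monte-Carlo sketch comparing the value of futures with
# AI-influenced human minds against three reference scenarios.
# All distributions below are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Value scale: 0 = paperclipper, 1 = near-optimal future.
paperclipper = np.zeros(N)
near_optimal = np.ones(N)

# Near-optimal future for a single power-grabbing human: assumed to
# capture some large-but-uncertain fraction of total value.
single_human = rng.beta(4, 2, N)          # placeholder: mean ≈ 0.67

# Future where well-intentioned AIs shape malleable human preferences:
# assumed wide uncertainty, from mostly-lost to mostly-retained value.
influenced_minds = rng.beta(2, 2, N)      # placeholder: mean ≈ 0.5

for name, samples in [("paperclipper", paperclipper),
                      ("single human near-optimal", single_human),
                      ("influenced minds", influenced_minds),
                      ("near-optimal", near_optimal)]:
    lo, med, hi = np.percentile(samples, [5, 50, 95])
    print(f"{name:>28}: median {med:.2f}, 90% interval [{lo:.2f}, {hi:.2f}]")
```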
Events that are somewhat indicative of strange human-AI interactions that also somewhat steer the human:
Intent-alignment is a strong enough attractor that this is solved "automatically": AIs will have learned a latent generator of which modifications to human minds are ultimately endorsed and which are not, and will try to follow what this generator specifies. Alternatively, at least the modifications that would ultimately be judged as very disvaluable are excluded, and the remaining possible modifications to humans are benign enough that the resulting futures are still very valuable.
This might not work if the generator can't be learned from the training data, e.g. because humans haven't made enough philosophical progress for it to be present there.
Reinforcement learning from human feedback won't learn which modifications are acceptable and which aren't (80%).
The Good is already represented well enough in the training data or the model spec for this to be solved automatically: All the actions implied by the model spec point toward what is morally good, and AIs will enact that. This is a stronger claim, since it assumes some variant of moral realism?
Value change ≠ corruption: Humans already accept a lot of modifications to their values, so it could be that most of the modifications that will happen are fine-ish. Similarly, it could be that humans should be indifferent to most modifications to their values.
Humans are already hardened against value corruption: Humans are exposed to tons of propaganda, persuasion attempts, and advertising, with people very much trying to change their values, yet their values rarely change as a result, especially from untargeted attempts.
Although: It seems to me that humans mostly do form their opinions based on their long-term interactions with other humans, though maybe this mostly happens during a formative period in the teens and twenties.
Question: When do humans form their values, and do they "lock in" after some point in their life?
Human preferences seem to be distributed across the brain (Hayden & Niv 2021) and built using learned representations (TurnTrout 2022).
The human brain has 8.7×10¹⁰ neurons and 10¹⁴ synapses, encoding ~26 distinguishable states per synapse (which ballparks the information content at very roughly 10TB-100TB). It takes in ~10⁷ bits per second (10⁷ of those visual, 10⁶ proprioceptive, 10⁴ auditory, the rest via other modalities), of which only 10-60 bits per second pass through attention (Zheng & Meister 2024).
One can use this to estimate a very rough lower bound on how quickly human brains could be overwritten (ignoring consolidation and many other factors): [8×10¹³, 8×10¹⁴] bits/(10⁷ bits/second) ≈ [8×10⁶, 8×10⁷] seconds ≈ [90, 930] days. If the incoming information is limited to what passes through attention (taking ~10 bits/second), it'd instead take on the order of [250k, 2.5M] years.
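A quick order-of-magnitude check of these figures, assuming ~4.7 bits per synapse (≈26 distinguishable states) and the rates quoted above; everything here is ballpark only:

```python
# Order-of-magnitude sanity check of the brain-overwrite estimate.
import math

synapses = 1e14
bits_per_synapse = math.log2(26)        # ~26 distinguishable states ≈ 4.7 bits
total_bits = synapses * bits_per_synapse
print(f"information content ≈ {total_bits/8/1e12:.0f} TB")  # ≈ 59 TB, within 10-100 TB

sensory_rate = 1e7                      # bits/second taken in across modalities
attention_rate = 10                     # bits/second passing through attention

for bits in (8e13, 8e14):               # 10 TB and 100 TB expressed in bits
    days = bits / sensory_rate / 86_400
    years = bits / attention_rate / (86_400 * 365)
    print(f"{bits:.0e} bits: {days:,.0f} days at the sensory rate, "
          f"{years:,.0f} years at the attention rate")
```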
Relevant factual questions:
Ways to prevent these kinds of problems that arise when AIs interact with malleable human minds: