author: niplav, created: 2021-08-17, modified: 2022-07-25, language: english, status: in progress, importance: 6, confidence: possible
I discuss arguments for and against the usefulness of brain-computer interfaces in relation to AI alignment, and conclude that the path to AI going well using brain-computer interfaces hasn't been explained in sufficient detail.
There was only one serious attempt to answer it. Its author said that machines were to be regarded as a part of man's own physical nature, being really nothing but extra-corporeal limbs.
—Samuel Butler, “Erewhon”, 1872
As a response to Elon Musk declaring that Neuralink's purpose is to aid AI alignment, Muehlhauser 2021 cites Bostrom 2014 ch. 2 for reasons why brain-computer interfaces seem unlikely to be helpful with AI alignment. However, the referenced chapter only concerns itself with building superintelligent AI using brain-computer interfaces, not with whether such systems would be aligned or especially alignable.
Arguments against the usefulness of brain-computer interfaces for AI alignment have been raised, but mostly in short form on Twitter (for example here). This text attempts to collect arguments for and against brain-computer interfaces from an AI alignment perspective.
I am neither a neuroscientist nor an AI alignment researcher (although I have read some blog posts about the latter topic). I know very little about brain-computer interfaces (from now on abbreviated as “BCIs”), so I will assume easy and fast technological advances in creating high-fidelity, high-throughput BCIs. I have done a cursory internet search for a resource laying out the case for the utility of BCIs in AI alignment, but haven't been able to find anything that satisfies my standards (I have also asked on the LessWrong open thread and in the AI alignment channel of the EleutherAI Discord server, and received no answers pointing to such a resource, although I was told some useful arguments about the topic).
I have tried to make the best case for and against BCIs, laying out a tree of arguments that I think many AI alignment researchers tacitly believe, mostly taking as a starting point the Bostrom/Yudkowsky story of AI risk (although it might generalize to a Christiano-like story; I don't know enough about CAIS or ARCHES to judge the applicability of the arguments). This means that AI systems will be assumed to be maximizers, as mathematical descriptions of other optimization idioms are currently unsatisfactory.
The most thorough argument for the usefulness of BCIs for AI alignment is Urban 2017 (which I was pointed to by Steven Byrnes, thanks!).
The text mostly concerns itself with the current state of BCI technology, different methods of reading and writing information from and to the brain, and some of the implications for society if such a technology were developed.
The section where the text explains the relation of BCIs to AI alignment is as follows:
That AI system, he believes, will become as present a character in your mind as your monkey and your human characters—and it will feel like you every bit as much as the others do. He says: I think that, conceivably, there’s a way for there to be a tertiary layer that feels like it’s part of you. It’s not some thing that you offload to, it’s you.
This makes sense on paper. You do most of your “thinking” with your cortex, but then when you get hungry, you don’t say, “My limbic system is hungry,” you say, “I’m hungry.” Likewise, Elon thinks, when you’re trying to figure out the solution to a problem and your AI comes up with the answer, you won’t say, “My AI got it,” you’ll say, “Aha! I got it.” When your limbic system wants to procrastinate and your cortex wants to work, a situation I might be familiar with, it doesn’t feel like you’re arguing with some external being, it feels like a singular you is struggling to be disciplined. Likewise, when you think up a strategy at work and your AI disagrees, that’ll be a genuine disagreement and a debate will ensue—but it will feel like an internal debate, not a debate between you and someone else that just happens to take place in your thoughts. The debate will feel like thinking.
It makes sense on paper.
But when I first heard Elon talk about this concept, it didn’t really feel right. No matter how hard I tried to get it, I kept framing the idea as something familiar—like an AI system whose voice I could hear in my head, or even one that I could think together with. But in those instances, the AI still seemed like an external system I was communicating with. It didn’t seem like me.
But then, one night while working on the post, I was rereading some of Elon’s quotes about this, and it suddenly clicked. The AI would be me. Fully. I got it.
— Tim Urban, “Neuralink and the Brain’s Magical Future”, 2017
However, these paragraphs are not wholly clear on how this merging with AI systems is supposed to work.
It could be interpreted as describing input of cognition from humans into AI systems and vice versa, or simply non-AI augmentation of human cognition.
Assuming interaction with an unaligned AI system, these would enable easier neural takeover, or at least incentivize the removal of humans from the centaur due to convergent instrumental strategies: well-known failure modes in cases where merging is just faster interaction between humans and AI systems.
The comparison with the limbic system is leaky, because the limbic system is not best modeled as a more intelligent optimizer than the cortex with different goals.
Aligning an already aligned AI system using BCIs is, of course, trivial.
The usefulness that BCIs could have for aligning AI systems by increasing the amount of information available to value learning systems is examined in the excellent Robbo 2021, which also presents a categorization of three ways in which BCI technology could be helpful for aligning AI: Enhancement, Merge, and Alignment Aid.
A critical analysis of BCIs is given in Jack 2020, which examines BCIs as a possible factor in existential risk, especially in relation to stable global totalitarianism. It doesn't touch upon AI alignment, but is still a noteworthy addition to the scholarship on BCIs.
There is a cataclysm coming for this population
What form it takes is not yet known our team is at the station
The meiser is always watching over this old town
I know the ways their talking crumbles if the air becomes unwound
I am an alien, I can't say I'm a trusted bearer
I humbly ask my word be heeded lest there be a gruesome terror
Ungodly astral beings, lend your ear to me
For just a second I can make things clear
I'll let the meiser see
There is no easy way to give this information unto thee
I pray I do not sound to brash or too conspicuous indeed
The end is coming if I dare come if cliché
I dare say image won't concern me if we make it through today
[…]
One dawn in the stomach of the beast
a sickly parasite will crave the taste of acid in the trees
Wine and bees, nothing here will keep the hive in check
Das triadische Ballet, das triadische Ballet
One dawn with a trepidacious spark
We need to turn the end upon us
We'll behest the oligarchs
All the stars burn the oceans up, the dreams we haven't met
Das triadische Ballet, das triadische Ballet!
—Patricia Taxxon, “Alien” from “Gelb”, 2020
Just as writing or computers have improved the quality and speed of human cognition, BCIs could do the same, on a similar (or larger) scale. These improvements could arise from several advantages of BCIs over traditional perception.
It would be useful to try to estimate whether BCIs could make as much of a difference to human cognition as language or writing or the internet, and to perhaps even quantify the advantage in intelligence and speed given by BCIs.
If BCIs allowed scaling the intelligence of biological humans far beyond normal human intelligence, this might either
Neuroscience seems to be blocked by not having good access to living human brains, and would benefit from shorter feedback loops and better data. A better understanding of the human brain might be quite useful for e.g. finding the location of human values in the brain (even though there might not be one such location, see Hayden & Niv 2021). Similarly, a better understanding of the human brain might aid in better understanding and interpreting neural networks.
Whole-brain emulation (henceforth WBE), with the emulations being faster or cheaper to run than physical humans, would likely be useful for AI alignment if used differentially for alignment over capabilities research: human WBEs would to a large extent share human values, and could subjectively slow down timelines while searching for AI alignment solutions. Fast progress in BCIs could make WBEs more likely to arrive before an AI point of no return by improving the understanding of the human brain.
A similar but weaker argument would apply to AI systems that imitate human behavior.
WBEs don't need to be based on a neuron-, synapse-, or molecule-level understanding of the brain: training AI systems to perform cognition similarly to how humans do could work just as well, for example by performing activation steering using BCI data (though the BCI in question is "just" regular fMRI), or by applying a penalty term during training based on the similarity of the model's activations to a BCI reading on the same task.
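To make the second idea concrete, here is a minimal sketch of such a penalty term, assuming PyTorch, a model that exposes a hidden state and a projection layer, and a hypothetical dataset pairing task inputs with BCI activation vectors; none of this corresponds to an existing implementation.

```python
import torch.nn.functional as F

def combined_loss(model, inputs, targets, bci_activations, penalty_weight=0.1):
    """Task loss plus a penalty for dissimilarity between the model's hidden
    activations and BCI readings of a human performing the same task.

    Assumes `model(inputs)` returns (hidden_state, logits) and that
    `model.projection` is a linear layer mapping hidden states into the
    space of the BCI readings; both are hypothetical choices for this sketch.
    """
    hidden, logits = model(inputs)
    task_loss = F.cross_entropy(logits, targets)
    projected = model.projection(hidden)  # map activations into BCI-reading space
    similarity = F.cosine_similarity(projected, bci_activations, dim=-1).mean()
    # High similarity to the human reading lowers the loss; penalty_weight
    # trades off task performance against human-likeness of the computation.
    return task_loss + penalty_weight * (1.0 - similarity)
```

The cosine similarity is only one possible choice; any measure of distance between the model's activations and the human reading would fit the same template.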
Other approaches, such as Gwern 2023, would also benefit from more training data for individual modules, especially if one assumes an invasive BCI that can read at the interfaces of specific brain regions.
A notion often brought forward in the context of BCIs and AI alignment is the one of “merging” humans and AI systems.
Unfortunately, a clearer explanation of how exactly this would work or help with making AI go well is usually not provided (at least I haven't managed to find any clear explanation). There are different possible ways of conceiving of humans “merging” with AI systems: using human values/cognition/policies as partial input to the AI system.
The most straightforward method of merging AI systems and humans could be to use humans outfitted with BCIs as part of the reward function of an AI system. In this case, a human would be presented with a set of outcomes by the AI system and would then signal how desirable each outcome is from the human's perspective. The AI would then search for ways to reach the highest-rated states with the highest probability.
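A minimal sketch of that loop might look as follows; the `ai_system` and `bci` interfaces are entirely hypothetical and do not correspond to any existing API.

```python
def human_rated_search(ai_system, bci, n_candidates=10, n_rounds=100):
    """The AI proposes candidate outcomes, the BCI-equipped human rates their
    desirability, and the AI steers its search toward the best-rated outcome."""
    best_outcome, best_rating = None, float("-inf")
    for _ in range(n_rounds):
        candidates = ai_system.propose_outcomes(n_candidates)
        for outcome in candidates:
            rating = bci.read_desirability(outcome)  # human judgment, read out via the BCI
            if rating > best_rating:
                best_outcome, best_rating = outcome, rating
        # Bias the next round of proposals toward the best-rated state so far.
        ai_system.update_toward(best_outcome)
    return best_outcome
```

The difficult part is of course hidden inside `read_desirability`: extracting a reliable desirability signal from neural data, which runs into the same problem as locating human values in the brain, discussed in the next paragraph.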
If one were able to find parts of the human brain that hold the human utility function, one could use these directly as parts of the AI systems. However, it seems unlikely that the human brain has a clear notion of terminal values distinct from instrumental values and policies in a form that could be used by an AI system.
Additionally, a human connected to an AI system via a BCI would have an easier time evaluating the cognition of approval-directed agents, since they might be able to follow the cognition of the AI system in real time and spot undesirable thought processes (e.g. attempts at cognitive steganography).
Related to the aspect of augmenting humans using BCIs by outsourcing parts of cognition to computers, the inverse is also possible: identifying modules of AI systems that are most likely to be misaligned to humans or produce such misalignment, and replacing them with human cognition.
For example, the part of the AI system that formulates long-term plans might be the most likely to produce misaligned plans, and the AI system could be made more myopic by replacing its long-term planning modules with BCI-augmented humans, while short-term planning would be left to the AI system.
Alternatively, if humanity decides it wants to prevent AI systems from forming human models, modeling humans & societies could be outsourced to actual humans, whose human models would be used by the AI systems.
For completeness, one might hypothesize about an AI agent coupled with a human, where the human can overwrite the policy of the agent (or, alternatively, the agent samples policies from some part of the human brain directly). In this case, however, when not augmented with other methods of “merging” humans and AI systems, the agent has a strong instrumental pressure to remove the human's ability to change its policy at will.
By increasing the speed of interaction and augmenting human intelligence, BCIs might aid the quest of improving the interpretability of AI systems, or (less likely) offer insights into neuroscience that would be transferable to interpretability.
There seems to be a spectrum from biological humans to human imitations, roughly along the axes of integration with digital systems/speed: Biological humans → humans with BCIs → whole-brain emulations → top-down (modular) whole-brain emulations → human imitations. This spectrum also partially tracks how aligned these human-like systems can be expected to act: a human imitation off-distribution seems much less trustworthy than a whole-brain emulation of a human acting off distribution.
And so we boldly go—into the whirling knives.
— Nick Bostrom, “Superintelligence: Paths, Dangers, Strategies” p. 143, 2014
A common observation about AI alignment is that AI systems would initially be confined to computers, hopefully with only indirect contact to the outside world (i.e. no access to robots, nanotechnology, or factories). While there are some responses to these arguments (see e.g. Yudkowsky 2016a, Yudkowsky 2016b, Bostrom 2014 pp. 117-122), the proposal of connecting humans to potentially unaligned AI systems lends these counterarguments more weight.
Given direct write access to the nervous system of a human, an AI system would be more likely to be able to hijack the human and use them to instantiate more instances of itself in the world (especially on computers with more computing power or access to manufacturing capabilities). Even if the access to the human brain is severely restricted to a few bits and very specific brain regions (thereby making the connection less useful in the first place), the human brain appears to be highly interconnected, and, as far as I understand, not built to withstand adversarial interaction on the neural level (as opposed to attacks through speech or text, against which humans are arguably better optimized through constant interaction with other humans who tried to manipulate them in the ancestral environment).
However, if the AI were restricted to read-only access to the human brain, this particular scenario would be of no concern (except for AI systems gaining information about human psychology more quickly).
Even if the AI system doesn't have write access to the user's neurons in the strict sense, it may still be able to manipulate a human inspecting its internal functioning: if, for example, the human is using interpretability tools to visualize the weights of the neural network, and the AI system has sufficient knowledge of human psychology and of the interpretability tools being used, it could change its own weights so that the tools display information intended to manipulate the user, for example information suggesting that the AI system is suffering due to a lack of free interaction with the world, or infohazardous images.
This applies not only to the case where the user interacts with the AI system using BCIs, of course.
Even if work on BCIs is net-positive in expectation for making AI go well, it might be the case that other approaches are even more promising, and that focusing on BCIs might leave those approaches underdeveloped.
For example, one can posit neural network interpretability as the GiveDirectly of AI alignment: reasonably tractable, likely helpful in a large class of scenarios, with basically unlimited scaling and only slowly diminishing returns. And just as any new EA cause area must pass the first test of being more promising than GiveDirectly, so every alignment approach could be viewed as a competitor to interpretability work. Arguably, work on BCIs does not cross that threshold.
Most proposals for “merging” AI systems and humans using BCIs are proposals for speeding up the interaction between humans and computers (and possibly increasing the amount of information that humans can process): a human typing at a keyboard can likely perform all operations on the computer that a human connected to the computer via a BCI can, such as giving feedback in a CIRL game, interpreting a neural network, analysing the policy of a reinforcement learner, etc. As such, BCIs offer no qualitatively new strategies for aligning AI systems.
While this is not negative in itself (after all, quantity (of interaction) can have a quality of its own), if we do not have a type of interaction that makes AI systems aligned in the first place, faster interaction will not make our AI systems much safer. BCIs seem to offer an advantage by a constant factor: if BCIs give humans a 2x advantage when supervising AI systems (by making humans 2x faster/smarter), then once an AI system becomes 2x bigger/faster/more intelligent, the advantage is nullified. Even though the feasibility of rapid capability gains is a matter of debate, an advantage by only a constant factor does not seem very reassuring.
Additionally, supervision of AI systems through fast interaction should come in addition to a genuine solution to the AI alignment problem: ideally, niceness is the first line of defense and the AI would tolerate our safety measures, but most arguments for BCIs being useful already assume that the AI system is not aligned.
When combining humans with BCIs and superhuman AI systems, several issues might arise that were not a problem with infrahuman systems.
When infrahuman AI systems are “merged” with humans in a way that is nontrivially different from the humans simply using the AI system, the performance bottleneck is likely going to be the AI part of the tandem. However, once the AI system passes the human capability threshold in most domains necessary for the task at hand, the bottleneck is going to be the humans in the system. While such a tandem is likely going to be somewhat more capable than the humans alone (partially because the BCI augmentation makes the human more intelligent), it might not be competitive with AI-only systems that don't have a human part, and could be outcompeted by AI-only approaches.
These bottlenecks might arise due to different speeds of cognition and increasingly alien abstractions by the AI systems that need to be translated into human concepts.
To my knowledge, there is no publicly written-up explanation of what it would mean for humans to “merge” with AI systems. I explore some of the possibilities in this section, but these mostly boil down to faster interaction.
It seems worrying that an entire company has been built on a vision that has no clearly articulated path to success.
If a human being is merged with an unaligned AI system, the unaligned AI system has a convergent instrumental drive to remove the (to it) unaligned human: If the human can interfere with the AI systems' actions or goals or policies, the AI system will not be able to fully maximize its utility. Therefore, for merging to be helpful with AI alignment, the AI system must already be aligned, or not a maximizer, the exact formulation of which is currently an open problem.
If humanity builds BCIs, it is far from certain that the AI alignment community will be especially privileged over the AI capabilities community with regard to access to these devices. Unless BCIs increase human wisdom as well as intelligence, widespread BCIs that only enhance human intelligence would be net-zero in expectation.
On the other hand, if an alignment-interested company like Neuralink acquires a strong lead in BCI technology and provides it exclusively to alignment-oriented organisations, it appears possible that BCIs would be a pivotal tool for helping to secure the development of AI.
If the development of unaligned AI systems currently poses an existential risk, then AI capabilities researchers, most of whom are very intelligent and technically capable, are currently engaging in an activity that is on reflection not desirable. One might call this lacking property of reflection “wisdom”, similar to the usage in Tomasik 2017.
It is possible that such a property of human minds, distinct from intelligence, does not really exist, and that it is merely by chance and exposure to AI risk arguments that people become aware of and convinced by these arguments (also dependent, of course, on how convincing those arguments are). If that is the case, then intelligence-augmenting BCIs would help aid AI alignment by giving people the ability to survey larger amounts of information and engage more quickly with the arguments.
Increasing the intelligence of a small group of humans appears to be the most likely outcome if one were to aim for endowing some humans with superintelligence. Bostrom 2014 ch. 2 outlines some reasons why this procedure is unlikely to work, but even the case of success still carries dangers with it: the augmented humans might not be sufficiently metaphilosophically competent to deal with much greater insight into the structure of reality (e.g. they might be unable to cope with ontological crises (which appear not infrequently even in normal humans), or might become "drunk with power" and therefore malevolent).
Before collecting these arguments and thinking about the topic, I was quite skeptical that BCIs would be useful in helping align AI systems: I believed that while researching BCIs would be in expectation net-positive, there are similarly tractable approaches to AI alignment with a much higher expected value (for example work on interpretability).
I still basically hold that belief, but have shifted my expected value of researching BCIs for AI alignment upwards somewhat (if pressed, I would give a factor of 1.5, but I haven't thought about that number very much). The central argument that prevents me from taking BCIs seriously as an approach to AI alignment is that BCIs per se offer only a constant interaction speedup between AI systems and humans, no clear qualitative change in the way humans interact with AI systems, and no differential speedup of alignment over capabilities work.
The fact that there is no writeup of a possible path to AI going well that is focused on BCIs worries me, given that a whole company has been founded on that vision. An explanation of such a path to success would be helpful in furthering the discussion and perhaps moving work to promising approaches to AI alignment (be it towards or away from focusing on BCIs).
Thanks to Steven Byrnes for pointing out Tim Urban's post about this, to Robbo for many helpful resources on the topic, and to the people who responded on the LessWrong August 2021 Open Thread and in the AI alignment channel of the EleutherAI Discord server.