Written by Ziz
In my experience, to self-modify successfully, it is very useful to have something like a trustworthy, sincere intent to optimize for your own values, whatever they are.
If that sounds like it’s the whole problem, don’t worry. I’m gonna try to show you how to build it in pieces. Starting with a limited form, which is something like decision theory or consequentialist integrity. I’m going to describe it with a focus on actually making it part of your algorithm, not just understanding it.
First, I’ll lay groundwork for the particular kind of fusion this requires, by showing how not to do it and how to tell when you’ve done it. Okay, here we go.
Imagine you were being charged by an enraged grizzly bear and you had nowhere to hide or run, and you had a gun. What would you do? Hold that thought.
I once talked to someone convinced that one major-party presidential candidate was much more likely to start a nuclear war than the other, and that this was the dominant consideration in voting. Riffing off a headline I’d read without clicking through and hadn’t confirmed, I posed a hypothetical.
What if the better candidate knew you’d cast the deciding vote, and believed that the best way to ensure you voted for them was to help the riskier candidate win the primary in the other major party since you’d never vote for the riskier candidate? What if they’d made this determination after hiring the best people they could to spy on and study you? What if their help caused the riskier candidate to win the primary?
- Since the riskier candidate won the primary:
  - If you vote for the riskier candidate, they will win 100% certainly.
  - If you vote for the better candidate, the riskier candidate still has a 25% chance of winning.
- Chances of nuclear war are:
  - 10% if the riskier candidate wins.
  - 1% if anyone else wins.
So, when you are choosing who to vote for in the general election:
- If you vote for the riskier candidate, there is a 10% chance of nuclear war.
- If you vote for the better candidate, there is a 2.5% chance of nuclear war.
- If the better candidate had thought you would vote for the riskier candidate if the riskier candidate won the primary, then the riskier candidate would not have won the primary, and there would be a 1% chance of nuclear war (alas, they did not).
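The arithmetic behind these bullets can be checked in a few lines. This is a minimal sketch, not part of the original hypothetical: the function name `p_war` and the constant names are made up for illustration, and the only inputs are the percentages listed above. Weighting both branches gives 3.25% for voting for the better candidate; the 2.5% figure above is the 25% × 10% term on its own, without the 1% baseline in the other 75% of cases. The ordering of the three options is the same either way.

```python
from fractions import Fraction

# Numbers taken from the hypothetical above.
P_WAR_IF_RISKIER_WINS = Fraction(10, 100)
P_WAR_OTHERWISE = Fraction(1, 100)

def p_war(p_riskier_wins: Fraction) -> Fraction:
    """Chance of nuclear war given the chance that the riskier candidate wins."""
    return (p_riskier_wins * P_WAR_IF_RISKIER_WINS
            + (1 - p_riskier_wins) * P_WAR_OTHERWISE)

# Riskier candidate won the primary, and you vote for them: they win for sure.
vote_riskier = p_war(Fraction(1))
# Riskier candidate won the primary, and you vote for the better candidate.
vote_better = p_war(Fraction(25, 100))
# The better candidate never helps the riskier one, who then loses the primary.
never_helped = p_war(Fraction(0))

for label, p in [("vote riskier", vote_riskier),
                 ("vote better", vote_better),
                 ("riskier never wins primary", never_helped)]:
    print(f"{label}: {float(p):.2%}")
```

The third line of output is the one the rest of the argument turns on: it is only reachable if the better candidate expects that helping the riskier one would not pay off.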
Sitting there on election night, I answered my own hypothetical: I’d vote for the riskier candidate because it would be game-theoretic blackmail. My conversational partner asked how I could put not getting blackmailed over averting nuclear war. They had a point, right? How could I vote the riskier candidate in, knowing they had already won the primary, and whatever this decision theory bullshit motivating me to not capitulate to blackmail was, it had already failed? How could I put my pride in my conception of rationality over winning when the world hung in the balance?
Think back to what you’d do in the bear situation. Would you say, “how could I put acting in accordance with an understanding of modern technology over not getting mauled to death by a bear”, and use the gun as a club instead of firing it?
Under the above unrealistic assumptions about elections, though, voting for the better candidate is kind of the same thing.
Acting on understanding of guns propelling bullets is not a goal in and of itself. That wouldn’t be strong enough motive. You probably could not tie your self-respect and identity to “I do the gun-understanding move” so tight that it outweighed actually not being mauled to death by an actual giant bear actually sprinting at you like a small car made of muscle and sharp bits. If you believed guns didn’t really propel bullets, you’d put your virtue and faith in guns aside and do what you could to save yourself by using the allegedly magic stick as a club. Yet you actually believe guns propel bullets, so you could use a gun even in the face of a bear.
Acting with integrity is not a goal in and of itself. That wouldn’t be strong enough motive. You probably could not tie your self-respect and identity to “I do the integritous thing and don’t capitulate to extortion” so tight that it outweighed actually not having our pale blue dot darkened by a nuclear holocaust. If you believed that integrity does not prevent the better candidate from having helped the riskier one win the primary in the first place, you’d put your virtue and faith in integrity aside so you could stop nuclear war by voting for the better candidate and dropping the chance of nuclear war from 10% to 2.5%. You must actually believe integrity collapses timelines, in order to use integrity even in the face of Armageddon.
Another way of saying this is that you need belief that a tool works, not just belief in belief.
I suspect it’s a common pattern for people to accept as a job well done an installation of a tool like integrity in their minds when they’ve laid out a trail of yummy narrative breadcrumbs along the forest floor in the path they’re supposed to take. But when a bear is chasing you, you ignore the breadcrumbs and take what you believe to be the path to safety. The motive to take a path needs to flow from the motive to escape the bear. Only then can the motive to follow a path grow in proportion to what’s at stake. Only then will the path be used in high stakes where breadcrumbs are ignored. The way to make that flow happen is to actually believe that path is best in a way so that no breadcrumbs are necessary.
I think this is possible for something like decision theory / integrity as well. But what makes me think this is possible, that you don’t have to settle for narrative breadcrumbs? That the part of you that’s in control can understand their power?
How do you know a gun will work? You weren’t born with that knowledge, but it’s made its way into the stuff that’s really in control somehow. By what process?
Well, you’ve seen lots of guns being fired in movies and stuff. You are familiar with the results. And while you were watching them, you knew that unlike lightsabers, guns were real. You’ve also probably seen some results of guns being used in news, history…
But if that’s what it takes, we’re in trouble. Because if there are counterintuitive abstract principles that you never get to see compelling visceral demonstrations of, or maybe even any demonstrations until it’s too late, then you’ll not be able to act on them in life or death circumstances. And I happen to think that there are a few of these.
I still think you can do better.
If you had no gun, and you were sitting in a car with the doors and roof torn off, and that bear was coming, and littering the floor of the car were small cardboard boxes with numbers inked on them, 1 through 100, on the dashboard a note that said, “the key is in the box whose number is the product of 13 and 5”, would you have to win a battle of willpower to check box 65 first? (You might have to win a battle of doing arithmetic quickly, but that’s different.)
If you find the Monty Hall problem counterintuitive, then can you come up with a grizzly bear test for that? I bet most people who are confident in System 2 but not in System 1 that you win more by switching would switch when faced with a charging bear. It might be a good exercise to come up with the vivid details for this test. Make sure to include certainty that an unchosen bad door is revealed whether or not the first chosen door is good.
I don’t think that it’d be a heroic battle of willpower for such people to switch in the Monty Hall bear problem. I think that in this case System 1 knows System 2 is trustworthy and serving the person’s values in a way it can’t see instead of serving an artifact, and lets it do its job. I’m pretty sure that’s a thing that System 1 is able to do. Even if it doesn’t feel intuitive, I don’t think this way of buying into a form of reasoning breaks down under high pressure like narrative breadcrumbs do. I’d guess its main weakness relative to full System 1 grokking is that System 1 can’t help as much to find places to apply the tool with pattern-matching.
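For anyone whose System 2 wants to double-check the claim before trusting it with a bear, here is a small sketch that enumerates the standard Monty Hall setup exactly rather than simulating it. The function name `monty_hall_win_rates` is made up for illustration. It assumes three doors, a uniformly random prize and first pick, and a host who always opens an unchosen bad door whether or not the first pick was good.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)

def monty_hall_win_rates():
    """Return (stay, switch) win rates over all equally likely cases."""
    cases = list(product(DOORS, DOORS))  # (prize door, first pick)
    stay_wins = 0
    switch_wins = 0
    for prize, pick in cases:
        # The host opens an unchosen door with no prize behind it.
        # When pick == prize there are two such doors; either choice
        # leaves switching on a bad door, so taking the first is enough.
        opened = next(d for d in DOORS if d != pick and d != prize)
        # Switching means taking the one door neither picked nor opened.
        switched = next(d for d in DOORS if d != pick and d != opened)
        stay_wins += (pick == prize)
        switch_wins += (switched == prize)
    total = len(cases)
    return Fraction(stay_wins, total), Fraction(switch_wins, total)

stay, switch = monty_hall_win_rates()
print(f"stay: {stay}, switch: {switch}")
```

Staying wins in 1/3 of cases and switching in 2/3. Dropping the assumption that the host always reveals a bad door changes the answer, which is why the test needs that certainty built in.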
Okay. Here’s the test that matters:
Imagine that the emperor, Evil Paul Ekman, loves watching his pet bear chase down fleeing humans and kill them. He has captured you for this purpose and taken you to a forest outside a tower he looks down from. You cannot outrun the bear, but you hold 25% probability that by dodging around trees you can tire the bear into giving up and then escape. You know that any time someone doesn’t put up a good chase, Evil Emperor Ekman is upset because it messes with his bear’s training regimen. In that case, he’d prefer not to feed them to the bear at all. Seizing on inspiration, you shout, “If you sic your bear on me, I will stand still and bare my throat. You aren’t getting a good chase out of me, your highness.” Emperor Ekman, known to be very good at reading microexpressions (99% accuracy), looks closely at you through his spyglass as you shout, then says: “No you won’t, but FYI if that’d been true I’d’ve let you go. OPEN THE CAGE.” The bear takes off toward you at 30 miles per hour, jaw already red with human blood. This will hurt a lot. What do you do?
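One way to look at the numbers in this scenario is to compare the two dispositions you could have, rather than the two actions available once the cage is open. The sketch below is only that, a sketch. The function name `survival_probability` is made up, and it leans on assumptions the scenario does not spell out: Ekman’s 99% accuracy applies equally to both dispositions, being let go counts as surviving, standing still with the bear loose counts as certain death, and what you do once the bear is loose is exactly what your disposition says.

```python
from fractions import Fraction

READ_ACCURACY = Fraction(99, 100)  # from the scenario: 99% accurate reads
P_ESCAPE_IF_RUN = Fraction(1, 4)   # from the scenario: 25% chance of tiring the bear

def survival_probability(would_stand_still: bool) -> Fraction:
    """Chance of surviving for someone with the given disposition."""
    # Ekman lets you go only if he reads you as someone who would stand still.
    if would_stand_still:
        p_let_go = READ_ACCURACY          # read correctly
        p_survive_bear = Fraction(0)      # assumption: standing still is fatal
    else:
        p_let_go = 1 - READ_ACCURACY      # misread as someone who would stand
        p_survive_bear = P_ESCAPE_IF_RUN
    p_bear = 1 - p_let_go
    return p_let_go + p_bear * p_survive_bear

stander = survival_probability(True)
runner = survival_probability(False)
print(f"would stand still: {float(stander):.2%}")
print(f"would run:         {float(runner):.2%}")
```

Under these assumptions the person who would really stand still survives 99% of the time and the person who would run survives 25.75% of the time. The scenario as told drops you into the 1% branch, which is exactly where narrative breadcrumbs stop working.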
What I want you to take away from this post is:
- The ability to distinguish between 3 levels of integration of a tool.
  - Narrative Breadcrumbs: Hacked-in artificial reward for using it. Overridden in high stakes because it does not scale like the instrumental value it’s supposed to represent does. (Nuclear war example)
  - Indirect S1 Buy-In: System 1 not getting it, but trusting enough to delegate. Works in high stakes. (Monty Hall example)
  - Direct S1 Buy-In: System 1 getting it. Works in high stakes. (Guns example)
- Hope that direct or indirect S1 buy-in is always possible.
You know, I shouldn’t’ve picked an example (grizzly bear training) where the TDT behavior is object-level submissive. Because that’s a hole in the former aspiring rationalist community’s conception of TDT. After we blew whistles and pissed them all off, one of them was saying I had a ~“weird collapsing the quantum waveform” version of TDT, in reference to the very simple idea of collapsing timelines that has nothing to do with QM.
But given that MIRI, inventors of TDT, paid out to blackmail (by a former employee who knew all about TDT and the integrity of these people and therefore had high subjunctive dependence for sure), in contradiction of one of their own basic thought experiments, I think my version of TDT actually is weird for covering the case of collapsing timelines.
The conception of TDT that Anna Salamon pushed at WAISS was strictly about preserving timelines (via submission). One-boxing in Newcomb’s problem can be seen this way.
I think if they had any of what I call “S1 buy-in” in this post (not a term I’d consider very accurate now) for TDT, it didn’t address the feeling of doing something even though the universe screams that collapsing the timeline failed, of calling the bluff of the reality of a world you experience.
I bet they S1-imagined that TDT worked like this: they’d say “No, I won’t pay out to blackmail, because I have TDT and we have subjunctive dependence”, and the blackmailer would say, “Zounds! So clever! I’m defeated!” and go away. As opposed to escalating minor harassment like he did, to traumatize them until they accepted his proof that he was just crazy and there was no reasoning with him.
I think vengeance is basically the natural prototypical instance of using TDT.
Lies about post singularity decision theory which Yudkowsky has told:
lie: Anyone who doesn’t believe in vengeance bankruptcy doesn’t have enough hope:
> in a hundred million years the organic lifeform known as Lord Voldemort probably wouldn’t seem much different from all the other bewildered children of Ancient Earth. Whatever Lord Voldemort had done to himself, whatever Dark rituals seemed so horribly irrevocable on a merely human scale, it wouldn’t be beyond curing with the technology of a hundred million years. Killing him, even if you had to do it to save the lives of others, would be just one more death for future sentient beings to be sad about. How could you look up at the stars, and believe anything else?
lie: Pushing the frame that there are only two futures, extinction or heaven, downplaying the possibility of what Yudkowsky calls “hyperexistential risk” e.g. an AGI that cares about humans but not other species (not that Yudkowsky believes in animal rights, he apparently doesn’t). Framing alignment as purely a matter of intelligence and capability rather than justice.
> A dead planet, lifelessly orbiting its star, is also stable. Unlike an intelligence explosion, extinction is not a dynamic attractor—there is a large gap between almost extinct, and extinct. Even so, total extinction is stable. Must not our civilization eventually wander into one mode or the other? As logic, the above argument contains holes. Giant Cheesecake Fallacy, for example: minds do not blindly wander into attractors, they have motives. Even so, I suspect that, pragmatically speaking, our alternatives boil down to becoming smarter or becoming extinct.
> One seemingly obvious patch to avoid disutility maximization might be to give the AGI a utility function U=V+W where W says that the absolute worst possible thing that can happen is for a piece of paper to have written on it the SHA256 hash of “Nopenopenope” plus 17
lie: Excluding nonhuman animals.
> This means that, e.g., a vegan or animal-rights activist should not need to expect that they must seize control of a CEV algorithm in order for the result of CEV to protect animals. It doesn’t seem like most of humanity would be deriving huge amounts of utility from hurting animals in a post-superintelligence scenario, so even a small part of the population that strongly opposes* this scenario should be decisive in preventing it.
> To spell it out in more detail, though still using naive and wrong language for lack of anything better: my model says that a pig that grunts in satisfaction is not experiencing simplified qualia of pleasure, it’s lacking most of the reflectivity overhead that makes there be someone to experience that pleasure.
Newcomb’s problem is a flawed prototype for decision theory because it’s about not being able to beat an authority defined by the first ordinal. Acting for arbitrary reasons. But decision theory is all about not being cut off from a unified UDT stance. And arbitrary implies a semantic stopsign to the endless progression of “why”s, implies trying to make the thought experiment out of cut-off structure.
Reminds me of other cartesian boundary holes opened when “trying to explain” good to evil “people”:
* Evil: “If we’re going to die anyway we should at least have a good time”. Me: “I don’t care what happens if just everyone’s going to die anyway!!! Wait, what am I saying, that’s not what I mean!” (I still care about justice even if everyone’s going to die, but I care about that by just doing justice in the first place)
* My use of the term “spite” when trying to justify that people do care about undoing timelines.