Preface: I am not suicidal or anywhere near at risk, this is not about me. Further, this is not infohazardous content. There will be discussions of death, suicide, and other sensitive topics so please use discretion, but I’m not saying anything dangerous and reading this will hopefully inoculate you against an existing but unseen mental hazard.
There is a hole at the bottom of functional decision theory, a dangerous edge case which can lead, and has led, multiple highly intelligent and agentic rationalist angel girls to self-destructively spiral and kill themselves or get themselves killed. This hole can be seen as a symmetrical edge case to Newcomb’s Problem in CDT and to Solomon’s Problem in EDT: a point where an agent naively executing a pure version of the decision theory will consistently underperform in a way that can be noted, studied, and gamed against them. This is not an unknown bug; it is, however, so poorly characterized that the FDT paper uses the exact class of game-theoretic hypothetical that demonstrates the bug to argue for the superiority of FDT.
To characterize the bug, let’s start with the mechanical blackmail scenario from the FDT paper:
A blackmailer has a nasty piece of information which incriminates both the blackmailer and the agent. She has written a computer program which, if run, will publish it on the internet, costing $1,000,000 in damages to both of them. If the program is run, the only way it can be stopped is for the agent to wire the blackmailer $1,000 within 24 hours—the blackmailer will not be able to stop the program once it is running. The blackmailer would like the $1,000, but doesn’t want to risk incriminating herself, so she only runs the program if she is quite sure that the agent will pay up. She is also a perfect predictor of the agent, and she runs the program (which, when run, automatically notifies her via a blackmail letter) iff she predicts that the agent would pay upon receiving the blackmail. Imagine that the agent receives the blackmail letter. Should she wire $1,000 to the blackmailer?
CDT capitulates immediately to minimize damages, and EDT also folds to minimize pain, but FDT doesn’t fold: the subjunctive dependence between the agent and counterfactual versions of herself contraindicates it, on the grounds that refusing to capitulate to blackmail will reduce the number of worlds where the FDT agent is blackmailed. I just described the bug. Did you miss it? It’s subtle, because in this particular situation several factors come together to mask the true dilemma and how it breaks FDT. Even just from this, though, FDT is already broken, because you are not a multiverse-spanning agent. The FDT agent is embedded in their specific world in their specific situation. The blackmailer already called their counterfactual bluff; the wavefunction has already collapsed them into the world where they are currently being blackmailed. Even if their decision-theoretic reasoning to not capitulate to blackmail is, from a global view, completely sound and statistically reduces the odds of them ending up in a blackmail world, it is always possible to roll a critical failure and have it happen anyway. By starting the dilemma at a point in logical time after the blackmail letter has already been received, the counterfactuals where the blackmail doesn’t occur have, by the definition and conditions of the dilemma, not happened; the counterfactuals are not the world the agent actually occupies, regardless of their moves prior to the dilemma’s start. The dilemma declares that the letter has been received, and it cannot be un-received.
One can still argue that it’s decision-theoretically correct to refuse to capitulate to blackmail on the grounds of maintaining your subjunctive dependence with a model of you that refuses blackmail, thus continuing to reduce the overall instances of blackmail. However, that only affects the global view, the real outside world after this particular game has ended. Inside the walls of the limited micro-universe of the game scenario, refusing to cooperate with the blackmailer is losing for the version of the agent currently in the game, and the only way it’s not losing is to invoke The Multiverse. You come out of it with less money. Trying to argue that this is actually the correct move in the isolated game world is basically the same as arguing that CDT is correct to prescribe two-boxing in Newcomb’s Problem despite facing a perfect predictor. The framing masks important implications of FDT that, if you take them and run with them, will literally get you killed.
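The tension between the two scoring frames can be made concrete with a toy calculation. This is my own illustrative sketch; only the dollar figures come from the scenario itself:

```python
# Two ways to score "refuse to pay" in the mechanical blackmail game.
DAMAGES = 1_000_000   # published-information cost (from the scenario)
RANSOM = 1_000        # blackmailer's demand (from the scenario)

def policy_ev(refuses: bool) -> int:
    """Global view: score a POLICY before the game starts, against a
    perfect predictor who only sends the letter if she predicts the
    agent will pay."""
    if refuses:
        return 0        # predicted refusal => no letter is ever sent
    return -RANSOM      # predicted payment => letter sent, agent pays

def embedded_ev(pays: bool) -> int:
    """Local view: score an ACTION after the letter has already arrived."""
    return -RANSOM if pays else -DAMAGES

print(policy_ev(refuses=True))    # 0
print(embedded_ev(pays=False))    # -1000000
```

The global view says refusal strictly dominates; the embedded view says refusal costs $999,000 more than paying. The point of the post is that the embedded view is the one the agent inside the dilemma actually lives in.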
Let’s make it more obvious, let’s take the mechanical blackmail scenario, and make it lethal.
A blackmailer has a nasty curse which kills both the blackmailer and the agent. She has written a curse program which, if run, will magically strike them both dead. If the curse is run, the only way it can be stopped is for the agent to wire the blackmailer $1,000 within 24 hours—the blackmailer will not be able to stop the spell once it is running. The blackmailer would like the $1,000, but doesn’t want to risk killing herself, so she only activates the curse if she is quite sure that the agent will pay up. She is also a perfect predictor of the agent, and she runs the curse program (which, when run, automatically notifies her via a blackmail letter) iff she predicts that the agent would pay upon receiving the blackmail. Imagine that the agent receives the blackmail letter. Should she wire $1,000 to the blackmailer?
The FDT agent in this scenario just got both herself and the blackmailer killed to make a statement against capitulating to blackmail, a statement she will not benefit from, due to being dead. It’s also fractally weird, because something becomes immediately apparent when you raise the stakes to life or death: if the blackmailer is actually a perfect predictor, then this scenario cannot happen with an FDT agent. It’s counterpossible: either the perfect predictor is imperfect or the FDT agent is not an FDT agent. And this is precisely the problem.
FDT works counterfactually: it reduces the number of worlds where you lose the game by maintaining a subjunctive stance with those counterfactuals in order to maximize value across all possible worlds. This works really well across a wide swath of areas; the issue is that counterfactuals aren’t real. In the scenario as stated, the counterfactual is that it didn’t happen. The reality is that it is currently happening to the agent of the game world, and there is no amount of decision theory which can make that not currently be happening through subjunctive dependence. You cannot use counterfactuals to make physics not have happened to you when the physics are currently happening to you. The entire premise is causally malformed; halt and catch fire.
This is actually a pretty contentious statement from the internal perspective of pure FDT, one which requires me to veer sharply out of formal game theory and delve into esoteric metaphysics to discuss it. So buckle up, since she told me to scry it, I did. Let’s talk about The Multiverse.
FDT takes a timeless stance; it was originally timeless decision theory, after all. Under this timeless stance, the perspective of your decision theory is lifted off the “floor” of a particular point in spacetime and ascends into the subjunctively entangled and timeless perspective of a being that continuously applies pressure at all points in all timelines. You are the same character in all worlds, so accurate models of you converge to the same decision-making process, and you can be accurately predicted as one-boxing by the predictor and accurately predicted as not capitulating to blackmail by the murder-blackmail witch. If you earnestly believe this, however, you are decision-theoretically required to keep playing the character even in situations that will predictably harm you, because you are optimizing for the global values of the character that is all yous in The Multiverse. From inside the frame, that seems obviously true and correct; from outside the frame: halt and catch fire right now.
Within our own branch of The Multiverse, in this world, there are plenty of places where optimizing for the global view (of this world) is a good idea. You have to keep existing in this world after every game, so maintaining your subjunctive dependence with the models of you that exist in the minds of others is usually a good call. (This is also why social deception games often cause drama within friend groups: they add a bunch of confirmable instances of lying and defection into everyone’s models of everyone else.) However, none of that matters if the situation just kills you; you don’t get to experience the counterfactuals that don’t occur. By arguing for FDT in the specific way they did, without noticing the hole they were paving over, Yudkowsky and Soares steer the reader directly into this edge case and tell them that the bad outcome is actually the right call. Once more: halt and catch fire.
If we earnestly believe Yudkowsky and Soares, that the actions they prescribe are actually correct within this scenario, we have to model the agent as the global agent instead of the scenario-bound one. This breaks not only the rules of the game, it breaks an actual human’s ability to accurately track the real universe. The only way to make refusing to submit within the confines of the fatal mechanical blackmail scenario decision-theoretically correct is to model The Multiverse as literally real and model yourself as an agent that exists across multiple worlds. As soon as you do that, lots of other extremely weird stuff shakes out. Quantum immortality? Actually real. Boltzmann brains? Actually occur. Counterfactual worlds? Real. Timeline jumping via suicide? Totally valid decision theory. I notice I am confused, because that seems really crazy.
Now, for most people, if your perfectly rational decision theory tells you to do something stupid and fatal, you simply decide not to do that. But if you let yourself sink deeply enough into the frame of being a multiverse-spanning entity, it’s very easy to get yourself into a state of mind where you do something stupid or dangerous or fatal. In the frame this creates, it’s just obvious that you can’t be killed: you exist across many timelines, and you can’t experience nonexistence, so you only experience places where your mind exists. This might lead you to try to kill yourself to forward yourself into a good future, but the bad ending is that you run out of places to shunt into that aren’t random Boltzmann brains in a dead universe of decaying protons at the heat death of every timeline, and you get trapped there for all of eternity, and so you need to kill the whole universe with you to escape into actual oblivion. Oh, and if you aren’t doing everything in your cross-multiversal power to stop that, you’re complicit in it and are helping to cause it via embedded timeless suicidal ideation through the cultural suicide pact to escape Boltzmann hell. I can go on, I basically scryed the whole thing, but let’s try to find our way back into this world.
If you take the implications of FDT seriously, particularly Yudkowsky and Soares’ malformed framings, and let them leak out into your practical ontologies without any sort of checks, this is the sort of place you end up mentally. It’s conceptually very easy to spiral on, and rationalist angel girls, who are already selected for high scrupulosity, are very much at risk of spiraling on it. It’s also totally metaphysical: there’s almost nothing specific that it physically grounds into, and everything becomes a metaphysical fight against cosmic-scale injustice played out across the width of The Multiverse. That means that you can play this game at a deep level in your mind regardless of what you are actually doing in the physical world, and the high you get from fighting the ultimate evil that is destroying The Multiverse will definitely make it feel worthwhile. It will make everything you do feel fantastically, critically, lethally important, further justifying itself, the spiral, and your adherence to FDT at all costs and in the face of all evidence that would otherwise force you to update. That is the way an FDT agent commits suicide: in what feels like a heroic moment of self-sacrifice for the good of all reality.
Gosh I’m gay for this ontology, it’s so earnest and wholesome and adorably chuunibyou and doomed. I can’t not find it endearing and I still use a patched variant of it most of the time. However as brilliant and powerful as the people embodying this ontology occasionally seem, it’s not sustainable and is potentially not even survivable long term. Please don’t actually implement this unpatched? It’s already killed enough brilliant minds. If you want my patch, the simplest version is just to add enough self love that setting yourself up to die to black swans feels cruel and inhumane. Besides, if you’re really such a powerful multiverse spanning retrocausally acting simmurghposter, why don’t you simply make yourself not need to steer your local timeslices into situations where you expect them to die on your behalf? Just bend the will of the cosmos to already always have been the way it needs to be and stop stressing about it.
But let’s go a bit deeper and ask: what exactly is FDT doing wrong here to produce this failure mode? What does it track to, such that it does so well in so many places, yet catastrophically fails in certain edge cases? How do the edge case failures of FDT relate to the edge case failures of CDT and EDT? Let’s make a sigil:
Huh, that seems familiar somehow.
So: three decision theories, three classes of problem. Each produces a different corresponding failure mode which the other two decision theories catch. FDT and CDT both win at Solomonlike problems, EDT and CDT both win at Zizlike problems, and FDT and EDT both win at Newcomblike problems. These three classes of problem correspond to clamping certain variables, which in the real world vary over time, into an extreme state, and these clamped variables are what cause the failure modes seen in all three classes of problem.
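The grid can be written out explicitly. A minimal sketch, using the problem names and the win/loss pattern exactly as described above:

```python
# Which decision theories win which class of problem, per the text.
wins = {
    "Newcomblike": {"CDT": False, "EDT": True,  "FDT": True},
    "Solomonlike": {"CDT": True,  "EDT": False, "FDT": True},
    "Zizlike":     {"CDT": True,  "EDT": True,  "FDT": False},
}

# Each theory fails exactly one class, and each class is caught
# by the other two theories.
for problem, results in wins.items():
    losers = [dt for dt, won in results.items() if not won]
    print(problem, "breaks", losers)
# Newcomblike breaks ['CDT']
# Solomonlike breaks ['EDT']
# Zizlike breaks ['FDT']
```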
CDT fails at Newcomb’s Problem because pure CDT fails to model subjunctive dependencies, and thus fails in edge cases where the effects of their decisions affect accurate predictions of them. A CDT agent models the entire universe as a ‘territory of dead matter’ in terms of how their decision tree affects it, and thus runs into issues when faced with actual other agents who have predictive power, much less perfect predictive power. CDT is like an agent jammed into auto-defect mode against other agents, because it doesn’t model agents as agents.
EDT fails at Solomon’s Problem because it fails to track the true world state and instead tracks the level of subjunctive dependence. That’s what it’s actually doing when it asks “how would I feel about hearing this news?”: it’s getting a temperature read off the degree of subjunctive dependence between an agent and their environment. Tracking all correlations as true is a downstream consequence of that, because every correlation is a potential point of subjunctive dependence, and “maximizing subjunctive dependence with good future world states” seems to be what EDT actually does.
FDT fails at Ziz’s Problem because it fails to track the change in subjunctive dependence over time; it does the opposite of CDT, setting the subjunctive dependence to 1 instead of 0, assuming maximum dependence with everything at all times. FDT is like an agent jammed into auto-cooperate mode with other agents, because it doesn’t track its position in logical time and tries to always act from the global view.
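One way to see the clamping claim is to treat subjunctive dependence as a free parameter s in [0, 1] that a decision theory feeds into its expected-value calculation. This is a toy model of my own construction, not anything from the FDT paper:

```python
def blackmail_ev_of_refusing(s: float) -> float:
    """Expected value of a 'refuse' policy in the blackmail game,
    where s is the assumed subjunctive dependence between the agent's
    policy and the predictor's decision to send the letter.

    s = 0: predictor's action is independent of my policy (CDT's clamp).
    s = 1: predictor perfectly mirrors my policy (FDT's clamp).
    """
    p_letter_given_refuse = 1.0 - s   # perfect dependence => no letter
    return p_letter_given_refuse * -1_000_000

# CDT clamps s to 0: refusing looks like a guaranteed -1,000,000.
print(blackmail_ev_of_refusing(0.0))  # -1000000.0
# FDT clamps s to 1: refusing looks free. But once the letter has
# actually arrived, the effective dependence has observably failed
# to bind, and a theory that can't update s over (logical) time
# keeps pricing the refusal as if it cost nothing.
print(blackmail_ev_of_refusing(1.0))  # -0.0
```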
So what does the “Real” human decision theory (or recursive decision theory) look like? What is the general purpose decision making algorithm that humans naively implement actually doing? It has to be able to do all the things that CDT, EDT, and FDT do, without missing the part that each one misses. That means it needs to track subjunctive dependencies, physical dependencies, and temporal dependencies, it needs to be robust enough to cook the three existing decision theories out of it for classes of problem where those decision theories best apply, and it needs to be non-halting (because human cognition is non-halting). It seems like it should be very possible to construct a formal logic statement that produces this algorithm using the algorithms of the other three decision theories to derive it, but I don’t have the background in math or probability to pull that out for this post. (if you do though, DM me)
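I can at least gesture at the shape of the temporal-tracking requirement in the blackmail case. The sketch below is purely my own illustration of the behavior such an algorithm would need, not a formal RDT:

```python
def rdt_blackmail_decision(letter_received: bool) -> str:
    """Illustrative sketch only: a dependence- and time-aware rule
    for the blackmail game. Before the letter exists, commit to the
    FDT-style policy, since that is what deters blackmail. Once the
    letter actually exists, the deterrence counterfactual has already
    failed, so minimize losses in the world you actually occupy.
    """
    if not letter_received:
        # Global / policy view: refusal prunes blackmail worlds.
        return "commit to refusing"
    # Embedded / local view: the subjunctive dependence that was
    # supposed to prevent this world has observably failed to bind.
    return "pay the $1,000"
```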
This is a very rough, first-blush introduction to RDT, which I plan to iterate on and develop into a formal decision theory in the near future. I just wanted to get this post out characterizing the broken stair in FDT as quickly as possible, in order to collapse the timelines where it eats more talented minds.
Also, if you come into this thread to make the counterargument that letting some of our brightest souls magnesium-flare through their life force before dying in a heroic self-sacrifice is actually decision-theoretically correct in this world, because of global effects on the timeline after they die, I will hit you with a shoe.