 
Note: The subject of this post is known to have caused actual psychological trauma and enduring distress. According to certain people (e.g. Eliezer Yudkowsky), the expected disutility (negative utility) of learning about the concept known as the Roko Basilisk is enormous.
 
this is freaking awesome, but mostly because of the weird freak-out reactions, the raging lunicorns... hahaha insane ;)
 
The quite ridiculous thing is that somehow nobody ever actually solved for the action the AI would take. The torture is simply postulated out of the blue. Alternatively, an AI could reward you and everyone you care for (up to the entire world) with paradise for working on it, instead of torture for not working. The system of equations here is satisfied equally well by anything - by torture, by nothing at all, by paradise...
 
"Alternatively, an AI could reward you and everyone you care for (up to entire world) with paradise for working on it, instead of torture for not working."

That would still be a bad outcome, since most hopes for AI involve it being beneficial to everyone, not just the tiny minority of people who worked on it or donated to it.

Coming so soon after accusations that LessWrong "brainwashes" people into donating, it is rather amusing that LessWrong is now accused of censoring an argument that would have been directly useful for encouraging donations. LessWrong is either a "mental hazard", or too quick to censor "mental hazards", or both at the same time...

I wonder if (and how much) RationalWiki and Alexander Kruel have inadvertently helped the Singularity Institute gather donations by going so far out of their way to publicize an argument that the LessWrong forum was too ethical to permit. :-)
 
+Aris Katsaris The Roko Basilisk actually indicates, as Roko wrote in his original post, that one should possibly try to trade with unfriendly AIs, because their measure is much larger - which means that it wouldn't be useful for the Singularity Institute.
 
+Aris Katsaris
 If you consider this to be a bad outcome then your idea of paradise includes other people being given benefits as well... you can be the technojesus and save us all.

Alexander Kruel: absolutely. I'm not sure whether they ban discussion to avoid giving people nightmares (a legitimate reason for a ban, if perhaps ineffective), or whether they ban discussion to avoid having the concept debunked. They talk about the basilisk in real life, I think (that interview in the NYT covered it). They have rituals involving citations from H.P. Lovecraft, lol. Maybe there's nothing to it. Or maybe they are trying to protect some toxic meme from blasphemy. It's really hard to tell with these folks.

An interesting thought occurred to me. What if the FAI hypothesises some sufficiently large evil AI in some Everett branch that does something like
if (friendly TDT doesn't work to deduce my code and replace itself with me) {
    torture a lot of people;
}
What will the FAI do? (Note that the FAI doesn't need my suggestion, being hyperintelligent and all.)
 
"If you consider this to be a bad outcome then your idea of paradise includes other people being given benefits as well... you can be the technojesus and save us all."

It seems to me that AIs that followed the reasoning mentioned in the basilisk would seek to motivate the maximum number of people -- to reward a contributor by benefiting unrelated non-contributors would therefore be counterproductive to their goals, in the same way that it would be counterproductive to hurt non-contributors by torturing contributors.
 
+Aris Katsaris Well, it seems to me that the vast majority of non-contributors outside the LW inner circle would employ CDT, and decide as follows: if a TDT FAI would do something bad, then don't build a TDT FAI. TDT can't blackmail CDT. According to the rules of this silly role-playing game, showing the basilisk to people who act according to CDT should force the TDT to return the good outcome.
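
Here is a toy version of that, just to show the shape of the argument - the class name, the policy labels and the payoff numbers are all made up for illustration, and "wants to exist" is compressed into a single number for the AI:

class BasiliskToy {
    enum Policy { TORTURE_NON_CONTRIBUTORS, BENIGN }

    // The builders are modelled as plain CDT: they simply refuse to build
    // any AI whose policy includes torture.
    static boolean getsBuilt(Policy p) {
        return p == Policy.BENIGN;
    }

    // The AI's own payoff (invented numbers): existing is worth +1, not existing 0.
    static double aiPayoff(Policy p) {
        return getsBuilt(p) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> built=" + getsBuilt(p)
                               + ", AI payoff=" + aiPayoff(p));
        }
        // The torture policy scores 0 and the benign policy scores 1, so a
        // would-be blackmailer facing CDT builders "returns the good outcome".
    }
}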

What's so fascinating about it is the whole: wow, some guys who fear an AI doomsday are literally working to formalize insanity and build an insane AI. (Not very surprising, though. You shouldn't expect doomsayers to be good doom preventers.)
 
TDT seems to suffer from the same problem as arithmetical utilitarianism: it keeps leading to absurd results - but instead of going "these results are absurd, does this actually work?" they go "these results are absurd, therefore the absurdity is really important!!!" How long does it take to say "oops", as EY puts it?
 
"Don't build TDT AI" is easy to say, but the example of Parfit's Hitchhiker gives an example of a situation that a CDT agent would find it optimal to transform into a TDT-Agent if it can. So if a seed-AI begins as an CDT agent it might still self-modify to be a TDT (or similar) decision theory, if it found that one optimal by CDT criteria. So one doesn't solve the problem by saying "don't make it a TDT" because the self-improving CDT AI might make itself into a TDT AI (or TDT-similar).

A TDT agent might of course likewise choose to self-modify into a CDT agent, but I've not seen a fair example where it would clearly want to do so. (Fair examples tend to be ones where the reward/punishment is calculated based on the decisions the agent makes, e.g. whether it two-boxes or one-boxes, instead of rewarding or punishing it based on what you name it to be.) So it seems to me as if a TDT agent is more stably TDT than a CDT agent would be stably CDT.

Not building TDT therefore doesn't eliminate the problems that a TDT agent might cause, because the problems that a TDT agent might cause are the same set of problems that a CDT agent might cause once it has transformed itself into a TDT agent (except that the CDT agent also has the problems that a CDT agent might cause before it transforms itself into TDT).

You also say that most people outside LW would employ CDT, but human agents don't really employ any formal theory in their decision process, and (again, as Parfit's Hitchhiker shows) some human heuristics, e.g. valuing "honesty", "honor", "duty" and "promise-keeping", can be approximated by a TDT agent in ways that they can't be by a CDT agent.
 
+David Gerard
Maybe it's this: if you were really smart then you wouldn't need to discard absurdities, and so not discarding absurdities is a way to look smart.
 
The problem, Dmytry, David, is knowing how to program an AI to "discard absurdities". The (partial) human ability to discard absurdities doesn't indicate that an AI will discard them just because you're hoping really, really hard that it will. The AI isn't a human with a better education; the AI does everything as a consequence of what it has been programmed to do.

Unless we've programmed the ability to "discard absurdities" into it, it won't.
 
+Aris Katsaris

1: It is not CDT-optimal to transform into an agent that will waste computational resources trying to encourage people in the past to work harder on AI, compared to an agent that does not waste those resources. Most people are CDT enough not to want to build an AI that would torture people (and to be less interested in building any AI at all if this is a possibility). Thus the TDT decision to torture would lower the probability of it being built, if these people were to think about it (even if CDT were to transform into TDT).

2: I'm not speaking of the problem of building AI. I'm speaking of the unintelligent people trying to imitate intelligence. If we are to speak of the problem of building a safe AI, then any flaws in TDT are of huge importance and need to be discussed.
 
+Aris Katsaris On the one hand you say that an AI might transform into a TDT agent because it might find that to be more optimal, but on the other hand that it won't "discard absurdities" if it hasn't been explicitly programmed to do so.

Does that mean that discarding absurdities has no relevance to being a rational agent, but changing decision theories does?
 
The argument that "an AI is only ever going to do what it has been programmed to do" underlies the whole AI risk scenario. Yet people cherry-pick certain "drives", such as taking over the world, that will emerge in any such AI even if they haven't been explicitly programmed.
 
You need to explicitly outline how a superhuman AI is going to take one action but not another, and not just argue that it won't do something beneficial because it hasn't been programmed to do so, yet will take a lot of negative actions even though it hasn't been programmed to do those either.
 
+Alexander Kruel, the Parfit's Hitchhiker problem is clearly defined - I think it'd be easy enough (and I'm talking as a mere Java programmer who's never particularly focused on AI) to make a simple demonstration program that, starting with CDT-like logic, evaluates this simple scenario, evaluates the outcomes under different decision theories (ones explicitly programmed as alternatives), decides which decision theory leads to the best outcome, and then chooses to overwrite its default decision theory with the new "better" one. CDT with self-improvement would therefore become a TDT-like agent.

But calling something an "absurdity" is human fuzziness -- you've not mathematically defined it, nor shown that AIs would have a way to automatically recognize what humans would recognize as "absurdity". Nor have you proven that they should always follow human intuition on the subject, let alone that they would. That's the type of still-open problem that includes Pascal's Mugging.
 
+Aris Katsaris Parfit's hitchhiker makes implicit assumptions about what it means to be "rational", which also is human fuzziness.

If the AI hasn't been explicitly designed to do everything it can not to die of thirst, then it isn't irrational for it not to precommit to not breaking the deal.

If you claim that the AI will transform into a TDT agent because it would otherwise die of thirst when facing a Parfit's Hitchhiker situation, then you have narrowed down the exact specification of the AI's utility function: namely, that it values world states where it does not die of thirst but has changed its decision theory more than world states where it never changes its decision theory but faces annihilation in certain kinds of game-theoretic scenarios.

In other words, you just shifted the problem of human fuzziness from "absurdities" to "best outcome" and "rational".
 
+Aris Katsaris
Parfit's Hitchhiker, of course, shows no such thing. To demonstrate that CDT will modify into TDT you would need to show that there is nothing more CDT optimal to modify to.
 
"Parfit's hitchhiker makes implicit assumptions about what it means to be "rational", which also is human fuzziness."

CDT is precisely defined (http://en.wikipedia.org/wiki/Causal_decision_theory) -- it doesn't need to involve the word "rationality" at all; just list some desirability functions, e.g. desirability(DIES) = -10, desirability(PAYS) = -1, and evaluate the utility of an action according to them.

Sure, if we additionally assert desirability(SELF-MODIFIES) = -50, then the agent won't self-modify under these circumstances. But I was discussing agents with no such injunction against self-modification.

(There's also a problem about how to define self-modification. Is adding information to its memory self-modification? Is creating a secondary program that holds veto power over the first program's decisions "self-modification"? For every explicit self-modification that is forbidden there's probably a workaround that doesn't explicitly violate the injunction but leads to the same effective result.)

"To demonstrate that CDT will modify into TDT you would need to show that there is nothing more CDT optimal to modify to."

Well, sure, in my demonstration program I would just simplistically and explicitly define two theories, CDT and TDT-lite (e.g. CDT with the injunction to keep its promises), so it would have only two choices and would prefer TDT-lite according to CDT-optimality. I wouldn't be able to program the whole range of decision-theory space, let alone create a program that could map it out itself, so the program wouldn't show that a real AI would actually move to TDT if it had the whole decision-theory configuration space to move to.
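
A very rough sketch of what I mean, in Java since that's what I know - everything here (the two hard-coded "theories", the perfect-predictor driver, the desirability numbers from above) is a toy assumption for illustration, not anyone's actual TDT:

enum Action { PAY, REFUSE }

interface DecisionTheory {
    Action decideInTown();   // what the agent does once it is safely in town
}

class ParfitDemo {
    // Desirabilities as above: dying in the desert is -10, paying is -1.
    static final double DIES = -10, PAYS = -1;

    static final DecisionTheory CDT = () -> Action.REFUSE;    // in town, paying buys nothing causally
    static final DecisionTheory TDT_LITE = () -> Action.PAY;  // CDT plus "keep your promises"

    // The driver is a perfect predictor: he only rescues an agent that will pay,
    // so you only ever reach town if you are the kind of agent that pays.
    static double outcome(DecisionTheory dt) {
        boolean rescued = dt.decideInTown() == Action.PAY;
        return rescued ? PAYS : DIES;
    }

    public static void main(String[] args) {
        double uCdt = outcome(CDT);          // -10
        double uTdtLite = outcome(TDT_LITE); // -1
        // The self-improving agent evaluates both *before* the desert, by plain
        // CDT criteria (pick whatever leads to the better expected outcome),
        // and overwrites its default theory with the winner.
        System.out.println(uTdtLite > uCdt
            ? "overwrite default theory with TDT-lite"
            : "keep CDT");
    }
}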
 
+Aris Katsaris
I think you are confused. CDT with commitment is not TDT-lite. Getting back to our original problem, it is not CDT-optimal to waste resources on torturing people for past wrongs, and thus it is CDT-better to insert modifications that prevent such torture. It is quite straightforwardly the case that conversion to TDT is not CDT-optimal, in the sense that better choices are easily and straightforwardly generated.
 
"CDT with commitment is not TDT-lite"

It seems to me to be so. The way I understand them:

CDT-with-commitment makes commitments when it knows commitments are a winning strategy. So e.g. knowing that a Newcomb's-box-style dilemma is in its future, CDT-with-commitment would know to commit to one-boxing, so as to get the bigger prize. But that means it effectively has to calculate in advance all the possible types of problems before it can make useful commitments for each of them. Thus, not ideal, as it will fail to find the optimal strategy for problems it hadn't calculated in advance.

So a CDT-with-commitment whose code will be duplicated and forced to face the Prisoner's Dilemma will be able to cooperate if and only if it knows it will be duplicated and forced to face the Prisoner's Dilemma.

TDT abstracts over a whole category of such problems by not having to make those commitments in advance and not needing to be aware of a particular dilemma. It instead knows to act as if it had committed itself whenever such "acting as if" is a winning strategy. So it one-boxes in Newcomb's problem and cooperates in the Prisoner's Dilemma (when facing versions of itself), without needing to know in advance that it will face such dilemmas.
 
Look. CDT at time t0 can modify itself into

A: an agent that will act as if it had committed itself at any time t,
B: an agent that will act as if it had committed itself at any time t >= t0.

(and a zillion other things)

A will spend resources in ways that are [CDT at time t0]-ineffective, such as wasting clock cycles on torture or paradise or whatever your fantasy is. B won't. CDT will pick B over A (if not some C over either A or B). The existence of B is a sufficient argument that CDT won't pick A. Edit: and had CDT been able to pick A, it wouldn't need to self-modify in the first place anyway.
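
If you want it as arithmetic - purely made-up numbers, the only thing that matters is that A pays an extra cost that B doesn't:

class SuccessorChoice {
    // Both A and B get whatever value credible commitment from t0 onward is worth.
    static final double COMMITMENT_VALUE_FROM_T0 = 5.0;      // invented number, shared by A and B
    // Only A additionally burns resources "enforcing" deals about pre-t0 behaviour,
    // which buys [CDT at t0] nothing causally.
    static final double RETROACTIVE_ENFORCEMENT_COST = 2.0;  // invented number, paid only by A

    public static void main(String[] args) {
        double scoreA = COMMITMENT_VALUE_FROM_T0 - RETROACTIVE_ENFORCEMENT_COST;
        double scoreB = COMMITMENT_VALUE_FROM_T0;
        System.out.println(scoreB > scoreA ? "CDT at t0 picks B" : "CDT at t0 picks A");
    }
}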
 
You keep arguing that it will pick A, which only looks plausible because you couldn't conceive of B, but it is nonsense: even if you can only conceive of A, you have no reason whatsoever to think that another human - let alone a superintelligence! - can't think of something else.

And in relation to our original argument: it doesn't matter how broadly or narrowly you define TDT; a future instance of CDT neither wants to torture you for nothing nor wants to modify itself into something that would torture you for nothing.