AI control research directions based on what we know of machine learning:

> I. Research directions
> 1. Reliability and robustness
> Traditional ML algorithms optimize a model or policy to perform well on the training distribution. These models can behave arbitrarily badly when we move away from the training distribution. Similarly, they can behave arbitrarily badly on a small part of the training distribution. [this is bad news] [...]
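
A quick toy demonstration of that failure mode (my own sketch, not from the post; the target function and numbers are invented): fit a flexible model on inputs drawn from a narrow range, then evaluate it just outside that range.

```python
# Sketch of off-distribution failure (hypothetical toy example).
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # The "real" target function; chosen arbitrarily for illustration.
    return np.sin(2 * np.pi * x)

# Training distribution covers only x in [0, 1].
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = true_fn(x_train) + rng.normal(0.0, 0.05, size=200)

# A degree-5 polynomial fits the training distribution well...
coeffs = np.polyfit(x_train, y_train, deg=5)

def mse(x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_shift = rng.uniform(1.5, 2.5, size=200)  # inputs outside the training range
print("train MSE:  ", mse(x_train, y_train))          # small
print("shifted MSE:", mse(x_shift, true_fn(x_shift)))  # arbitrarily large
```

Nothing in the training objective constrains the model off-distribution, which is exactly the gap the reliability and robustness direction targets.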

> 2. Reward learning
> ML systems are typically trained by optimizing some objective over the training distribution. For this to yield “good” behavior, the objective needs to be sufficiently close to what we really want. [this is also bad news] [...]
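
To see why “sufficiently close” is a real constraint rather than a formality, here is a small Goodhart-style sketch of my own (the distributions and noise scale are invented): a learned reward that agrees with the true reward on average is still exploited by a strong optimizer, because optimization seeks out exactly the points where the proxy overestimates.

```python
# Sketch of objective mismatch under optimization pressure (hypothetical).
import numpy as np

rng = np.random.default_rng(1)

n = 100_000                       # candidate actions to optimize over
r_true = rng.normal(0.0, 1.0, n)  # true reward of each action
noise = rng.normal(0.0, 0.3, n)   # modest reward-model error
r_hat = r_true + noise            # a proxy that is "close" on average

best = int(np.argmax(r_hat))      # what a strong optimizer picks
print("true reward of proxy-optimal action:", r_true[best])
print("true reward of actually-best action:", r_true.max())
# With many candidates, argmax(r_hat) systematically selects actions whose
# error term is large and positive, so the achieved true reward falls short.
```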

> 3. Deliberation and amplification
> Machine learning is usually applied to tasks where feedback is readily available. The research problem in the previous section aims to obtain quick feedback in general by using human judgments as the “gold standard.” But this approach breaks down if we want to exceed human performance. [...]
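
A schematic of the amplification move, as I read it (the decompose/combine helpers below are hypothetical stand-ins, not the author's algorithm): instead of training against direct human labels, an overseer splits a hard question into easier subquestions, delegates those to the current model, and combines the results; the composite system can then supervise the next model.

```python
# Rough sketch of one amplification step (hypothetical helper names).
from typing import Callable

Answerer = Callable[[str], str]

def decompose(question: str) -> list[str]:
    """Stand-in for an overseer splitting a hard question into easier ones."""
    return [f"sub-question {i} of: {question!r}" for i in range(2)]

def combine(question: str, sub_answers: list[str]) -> str:
    """Stand-in for the overseer aggregating the sub-answers."""
    return f"answer to {question!r} built from {sub_answers}"

def amplify(model: Answerer) -> Answerer:
    """Overseer + calls to the current model = a stronger composite answerer."""
    def amplified(question: str) -> str:
        return combine(question, [model(q) for q in decompose(question)])
    return amplified

def base_model(question: str) -> str:
    return f"guess({question!r})"

# The composite's answers become training targets for the next model,
# which is how the scheme aims to go beyond unaided human judgment.
print(amplify(base_model)("Is this plan safe?"))
```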

> II. Desiderata
> I’m most interested in algorithms that are secure, competitive, and scalable, and I think that most research programs are very unlikely to deliver these desiderata (this is why the lists above are so short).

> Since these desiderata are doing a lot of work in narrowing down the space of possible research directions, it seems worthwhile to be thoughtful and clear about them. It would be easy to gloss over any of them as obviously unobjectionable, but I would be more interested in people pushing back on the strong forms than implicitly accepting a milder form.