Epistemic status as of 10/06/2024: Semi-weakly held, wouldn’t be totally surprised if there was a strong counterargument
The Narrow Path is a document that proposes a plan for handling AI development. I appreciate their efforts and thank them for trying - it’s much easier to critique a plan than to build one. However, the plan’s “safety conditions” are fairly costly to satisfy, and the interventions required to achieve them would likely be quite unpopular.
Context
The plan aims to satisfy the following safety criteria:
a. No AIs improving AIs
b. No AIs capable of breaking out of their environment
c. No unbounded AIs
d. Limit the general intelligence of AI systems so that they cannot reach superhuman level at general tasks
While (b), (c), and potentially (d) are reasonable, the “No AIs improving AIs” criterion (a) feels extreme. I think this holds back the value of the plan and will make it harder for policymakers to seriously engage with it.
Aims of the “No AIs improving AIs” Condition
This condition aims to:
- Prevent an “unmanageable and unforeseen intelligence explosion” so that we can “ensure that the increasingly tight feedback loops of AIs improving AIs remain slow and supervisable, understandable and manageable by humans.”
- Prevent situations where a “motivated actor [can] break through safety boundaries that have been imposed on artificial intelligence development”.
These goals serve the downstream aim of ensuring that no uncontrolled, dangerous AIs are created. Beyond the threat models that the other criteria already cover, this criterion additionally targets threat models where:
- Undesired models are created by accident due to a capabilities explosion
- Malicious, uncooperative actors can build such models faster with AI assistance
The importance and likelihood of the first threat model are fairly controversial, and the second threat model is better addressed through alternative measures!
Implementation and Challenges
The plan achieves this condition by disallowing any AI from having “a major role in the research or development of improving AIs”. The plan specifies that writing “trivial functions” or “letting Github Copilot correct typos” would not be covered under this policy, while using an LLM to train itself by optimizing its parameters would.
I find it likely that this boundary between ‘major’ and ‘minor’ contributions to future AIs will end up either far too restrictive to implement or far too weak to be useful.
Reasons for this include:
- The boundary is extremely fuzzy: for instance, I personally would argue that writing a wandb hyperparameter search function is a ‘trivial function’ that shouldn’t be covered by this condition (see the sketch after this list)! Many programmers begin coding a new function by asking an AI copilot to write it out and then editing it afterwards. While they often could have written the code themselves, the copilot writes it much faster: it’s unclear how large that speedup has to be before the generated code counts as a ‘major’ contribution.
- There will be extraordinary incentives for labs to push against this criterion as much as possible, since it hinders their primary function of doing research. Any competitive lab will strongly object to a policy that so directly constrains its core activities.
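To make the first point concrete, here is a minimal sketch of the kind of wandb hyperparameter search I have in mind. The model, metric, parameter names, and the `run_training_step` helper are all hypothetical placeholders; the point is that this is exactly the sort of boilerplate a copilot will happily autocomplete from a one-line prompt, and it is genuinely unclear whether a policy should treat it as a ‘major’ contribution to AI development.

```python
# Minimal sketch of a wandb hyperparameter sweep (hypothetical model and ranges).
import wandb

sweep_config = {
    "method": "random",  # random search over the space below
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}

def train():
    # Each agent run pulls one hyperparameter combination from the sweep.
    run = wandb.init()
    config = run.config
    # Train the (hypothetical) model and compute a validation loss.
    val_loss = run_training_step(config.learning_rate, config.batch_size)  # hypothetical helper
    run.log({"val_loss": val_loss})

sweep_id = wandb.sweep(sweep=sweep_config, project="example-project")
wandb.agent(sweep_id, function=train, count=10)
```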
So what?
Ultimately, because this criterion addresses the core threat models unclearly or ineffectively, and imposes requirements that would be either too unpopular or too weak to be useful, I currently don’t think it should be in any actual safety plan.
Separately, I think more people should push for plans with moderate safety taxes. I suspect there could be a simplified version of these safety criteria, centered on control evaluations, with a far smaller safety tax.