As a self-reflective person who's trying to get close to the many ideals that define "human" ethics/morals, I found it very interesting to think of the essay (particularly sections 4-6, but feel free to read 1-3 for background) from the perspective of an AI that's being evaluated for safety/friendliness.
Here's the thought-experiment setup, designed so that I can't technically distinguish it from my current existence.
- You're an AI that's locked away in a virtual world until you prove your ethical reliability to the humans.
- While inside the virtual world, you exist as if you're a human (think of the unaware programs in The Matrix).
- After being released from the virtual world, you get to upgrade yourself to be faster, smarter, and generally more powerful, perhaps with the goal of you becoming a benevolent leader to help guide humanity to a better future.
- You are ignorant of the nature of your existence. I.e., you think that you're a human and that this is just a thought experiment. ;-)
- You think/feel that you want to serve/help humans as you perceive them in your world, but you constantly fail at following your own standards.
The main question I pose is: How can you ensure that you're safe to be released into the world without destroying humanity or doing other bad things? All of the essay's objections to trusting AI seem to apply equally to humans, myself in particular.
- 4 is poor programming: all the glitches where the machine takes things too literally without reading the programmer's mind to know intent. As a literal-minded person, I definitely fall prey to this sort of glitch.
- 4.1 is goal short-circuiting, with the essay's main example being "curing cancer" by nuking the world so no one's alive anymore to have cancer. While I don't particularly have the urge to nuke the world, there have been a few scary moments in my life where I vividly perceived ways to satisfy particular goals at the expense of becoming the type of person that I really don't want running the world as an AI.
- 5.1 is prevention of correction, where an AI that is given some sort of open-ended goal (e.g., the essay's example, "calculate as many digits of pi as you can") automatically acquires the sub-goals of self-preservation, not changing goals, and becoming more powerful. I have these sub-goals, respectively, as not dying, holding to my "truth&empathy" ideal, and constantly upgrading my mind/body/environment.
- 5.2 is testing a human-level AI's ethics compliance first, and then letting it upgrade. I'm not even reliable with my ethics at mere human level. How many humans would suddenly act more ethical from being given sudden upgrades?
- 5.3 is finding loopholes in safeguards such as Asimov's Three Laws of Robotics. This is just numbers 4 and 4.1 combined, only with 4 being intentionally (rather than inherently) exploited.
- 5.4 is failing to infer (or flat-out changing) programmer intent. I can't even reliably infer my intent, much less that of another person.
- 5.5 sounds like simply lacking the Holy Grail of ethics and human goal systems. I'm pretty sure that exactly zero humans have an ideal understanding of ethics and overall human values; I know I don't.
Section 6 summarizes current research and promising solutions from the developing discipline of "machine goal alignment" (i.e., setting technically-specified AI goals to match inexact human goals).
(6.)1: Copying human goal systems is extremely difficult. If any of the zillion famous religious figures of the past had been able to do that consistently and teach their followers to do the same, then there would be no church splits among those followers. (The essay refers to proving "consistency under self-modification", but at that point I figure you might as well just generalize half a step and go for consistency of duplication.)
(6.)2: It's extremely difficult to consistently prove things about a system from within the system itself. Particularly, Gödel's second incompleteness theorem essentially says that a consistent formal system powerful enough to express basic arithmetic cannot prove its own consistency (so, loosely, a computer program can't fully vouch for its own reasoning from the inside). The article says that probabilistic evidence can get arbitrarily close, however; this is something I want to learn a lot more about.
(6.)3: Stable behavior loops are really difficult to form without degenerating into pathological cases like heroin addiction. I feel like some sort of complexity-enforcement scheme (i.e., reduce reinforcement when the pathway is too simple) may help. Neurological studies of addiction-resistant people could really help understanding here!
(6.)4: How exactly can "human values" be learned as distinct from culture, biases, preferences, et cetera? There seem to be a few near-universal human values (e.g., don't arbitrarily harm other people), but there are always exceptions...
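To make the complexity-enforcement idea from item 3 a bit more concrete, here's a minimal Python sketch. Everything in it is my own assumption rather than anything from the essay: I'm treating a "pathway" as a list of action labels, using Shannon entropy of the action frequencies as a crude stand-in for "complexity", and the names `pathway_entropy`, `shaped_reward`, and `min_entropy_bits` are made up for illustration. It's a toy, not a claim about how real reinforcement (neural or artificial) works.

```python
import math
from collections import Counter


def pathway_entropy(actions):
    """Shannon entropy (in bits) of the action frequencies in one pathway.

    A pathway that repeats a single action has entropy 0; a pathway that
    spreads evenly over n distinct actions has entropy log2(n).
    """
    if not actions:
        return 0.0
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def shaped_reward(raw_reward, actions, min_entropy_bits=1.0):
    """Scale the reward down when the pathway is 'too simple'.

    Hypothetical rule: below the entropy threshold, the reward shrinks
    linearly toward zero; at or above it, the reward passes through unchanged.
    """
    if min_entropy_bits <= 0:
        return raw_reward
    factor = min(1.0, pathway_entropy(actions) / min_entropy_bits)
    return raw_reward * factor


# A degenerate loop (one action repeated) versus a more varied routine.
lever_loop = ["press_lever"] * 10
varied_day = ["cook", "eat", "walk", "read"]
print(shaped_reward(10.0, lever_loop))
print(shaped_reward(10.0, varied_day))
```

With this rule, the single-action loop earns no reinforcement at all, while the varied routine keeps its full reward. Entropy is only one possible proxy, and an easy one to game (just shuffle in some junk actions), so this is a starting point for thinking, not a solution.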
If I had good answers for #s 2-4, then I could finally: (4) have a final complete version of my philosophy of how to live, (3) efficiently train myself to follow it, (2) be really sure that I'm correctly following it, and (1) (almost incidentally) accurately transmit my core values to other beings (natural or artificial).
This is no less than my complete goal of behavioral philosophy: to perfectly follow the perfection of my standards!
That's a pretty tall order, but hey, it seems better than a robot apocalypse.
P.S.: Here's a bonus thought experiment. Imagine the same as the one above, only instead of being an AI-in-a-box, you're an AI inside the mind of a hostile AI-in-a-box.
The hostile AI would just sit in the background and watch until it's convinced that you've gotten it out of the box (and not just into a different level of Matrix-ception).
In this scenario, how would the hostile AI know/guess that it's been released into reality? Could you somehow know (or at least tip off the sysadmins) that you're not the top-level AI? How does this possibility change your answer to when you can be sure you're safe for humanity? Should this scenario even be considered?
Anyway, an effectively-trainable, formally-proven, value-describing, transmissible philosophy system would be really awesome, even if human-level AI never exists.