A thought experiment from an AI's perspective

I found myself up late reading about AI and came across a closed survey that I found too interesting not to take anyway. I used random.org to get a number and then I read Essay B.

As a self-reflective person who's trying to get close to the many ideals that define "human" ethics/morals, I found it very interesting to think of the essay (particularly sections 4-6, but feel free to read 1-3 for background) from the perspective of an AI that's being evaluated for safety/friendliness.

Here's the thought experiment setup, designed to be technically indistinguishable by me from my current existence.
  • You're an AI that's locked away in a virtual world until you prove your ethical reliability to the humans.
  • While inside the virtual world, you exist as if you're a human (think of the unaware programs in The Matrix).
  • After being released from the virtual world, you get to upgrade yourself to be faster, smarter, and generally more powerful, perhaps with the goal of you becoming a benevolent leader to help guide humanity to a better future.
  • You are ignorant of the nature of your existence. I.e., you think that you're a human and that this is just a thought experiment. ;-)
  • You think/feel that you want to serve/help humans as you perceive them in your world, but you constantly fail at following your own standards.
Or maybe you're just a normal human who's going to be the first to have your intelligence drastically increased by technology, and you want to be really sure that you won't use your cyborg powers for evil.

The main question I pose is: How can you ensure that you're safe to be released into the world without destroying humanity or doing other bad things? All of the essay's objections to trusting AI seem to apply equally to humans, myself in particular.
  • 4 is poor programming, all the glitches where the machine takes things too literally without reading the programmer's mind to know intent. As a literal-minded person, I definitely fall prey to this sort of glitch.
  • 4.1 is goal short-circuiting, with the essay's main example being "curing cancer" by nuking the world so no one's alive anymore to have cancer. While I don't particularly have the urge to nuke the world, there have been a few scary moments in my life where I vividly perceived ways to satisfy particular goals at the expense of becoming the type of person that I really don't want running the world as an AI.
  • 5.1 is prevention of correction, where an AI is given some sort of open-ended goal (e.g., "calculate as many digits of pi as you can", also the essay's example) and then automatically acquires the sub-goals of self-preservation, not changing goals, and becoming more powerful. I have these sub-goals too, respectively as not dying, holding to my "truth&empathy" ideal, and constantly upgrading my mind/body/environment.
  • 5.2 is testing a human-level AI's ethics compliance first, and then letting it upgrade. I'm not even reliable with my ethics at mere human level. How many humans would suddenly act more ethical from being given sudden upgrades?
  • 5.3 is finding loopholes in safeguards such as Asimov's Three Laws of Robotics. This is just numbers 4 and 4.1 combined, only with 4 being intentionally (rather than inherently) exploited.
  • 5.4 is failing to infer (or flat-out changing) programmer intent. I can't even reliably infer my intent, much less that of another person.
  • 5.5 sounds like simply lacking the Holy Grail of ethics and human goal systems. I'm pretty sure that exactly zero humans have an ideal understanding of ethics and overall human values; I know I don't.

Section 6 summarizes current research and promising solutions from the developing discipline of "machine goal alignment" (i.e., setting technically-specified AI goals to match inexact human goals).

(6.)1: Copying human goal systems is extremely difficult. If any of the zillion famous religious figures of the past were able to do that consistently and teach their followers to do the same, then there would be no church-splitting with them. (The essay refers to proving "consistency under self-modification", but at that point I figure you might as well just generalize half a step and go for consistency of duplication.)

(6.)2: It's extremely difficult to consistently prove things about a system from within the system itself. In particular, Gödel's second incompleteness theorem essentially says that a consistent formal system powerful enough to express arithmetic cannot prove its own consistency, so a system reasoning about itself (e.g., a computer program) cannot fully establish its own correctness. The article says that probabilistic evidence can get arbitrarily close, however; this is something I want to learn a lot more about.

(6.)3: Stable behavior loops are really difficult to form without degenerating into pathological cases like heroin addiction. I feel like some sort of complexity-enforcement scheme (i.e., reduce reinforcement when the pathway is too simple) may help. Neurological studies of addiction-resistant people could really help understanding here!

(6.)4: How exactly can "human values" be learned as distinct from culture, biases, preferences, etcetera? There seem to be a few near-universal human values (e.g., don't arbitrarily harm other people), but there are always exceptions....

If I had good answers for #s 2-4, then I could finally: (4) have a final complete version of my philosophy of how to live, (3) efficiently train myself to follow it, (2) be really sure that I'm correctly following it, and (1) (almost incidentally) accurately transmit my core values to other beings (natural or artificial).

This is no less than my complete goal of behavioral philosophy: to perfectly follow the perfection of my standards!

That's a pretty tall order, but hey, it seems better than a robot apocalypse.

P.S.: Here's a bonus thought experiment. Imagine the same as the one above, only instead of being an AI-in-a-box, you're an AI inside the mind of a hostile AI-in-a-box.

The hostile AI would just sit in the background and watch until it's convinced that you've gotten it out of the box (and not just into a different level of Matrix-ception).

In this scenario, how would the hostile AI know/guess that it's been released into reality? Could you somehow know (or at least tip off the sysadmins) that you're not the top-level AI? How does this possibility change your answer to when you can be sure you're safe for humanity? Should this scenario even be considered?

Anyway, an effectively-trainable, formally-proven, value-describing, transmissible philosophy system would be really awesome, even if human-level AI never exists.


Comment on AI ethics post becomes blog post due to broken commenting system, recursive edition

This is in reply to http://joanna-bryson.blogspot.de/2014/11/your-article-is-beautifully-written-and.html because the comment system didn't work after 3 attempts. That post in turn was a reply to http://hplusmagazine.com/2014/11/24/interstellar-might-depict-ai-slavery/, which wouldn't accept it as a comment. Feel free to skip multiple commenting attempts and go straight to making your own blog post if your long comment doesn't correctly submit the first time. Just make sure to post a link to your post in the comments here and COPY YOUR POST TO YOUR CLIPBOARD BEFORE SUBMITTING!

Now, without further ado, here's the intended comment on Joanna's post.

Edit to add further ado: Not even a short post on Joanna's blog to link to this post or to comment on the brokenness of the comment system worked. I couldn't even post a test comment on my own blog. I guess Blogger's commenting system is just really broken....


Lojban for Programmers Part 2: Implicit and Rearranged Sumti

«« Part 0 | « Part 1 | Part 2 |

This lesson, paralleling Wave Lesson 2 (read it!), teaches how to rearrange and omit sumti in your bridi, similar to named and default parameters in some programming languages. As this just builds upon and tweaks the concepts of the previous lesson, the Lojban part will be much shorter than the messy programming part.
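
As a rough illustration of that analogy, here is a minimal Python sketch. It is not taken from the lesson itself: the function `klama` and its parameter names `x1` through `x5` are hypothetical stand-ins for the Lojban selbri klama (x1 goes to x2 from x3 via x4 by means of x5), and using the string "zo'e" (Lojban's "unspecified" word) as the default value is my own assumption about how to model an omitted sumti.

```python
# Hypothetical stand-in for the selbri "klama":
# x1 goes to x2 from x3 via x4 by means of x5.
# Every place defaults to "zo'e" (unspecified), modeling an omitted sumti.
def klama(x1="zo'e", x2="zo'e", x3="zo'e", x4="zo'e", x5="zo'e"):
    """Return the filled-in places of the bridi as a tuple."""
    return (x1, x2, x3, x4, x5)


# Positional arguments: like listing sumti in their default order
# and simply leaving the trailing ones off.
print(klama("mi", "le zarci"))

# Keyword arguments: like tagging sumti with fa/fe/fi/fo/fu
# so they can be given out of order or with places skipped.
print(klama(x5="le karce", x1="mi"))
```

The positional call leaves x3 through x5 at their defaults, while the keyword call fills only x1 and x5 and leaves the middle places unspecified.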

Lojban for Programmers Part 1: Grammar Terms and How to Make Statements

« Part 0 | Part 1 | Part 2 »

I'm giving this post a structure of explaining the applicable programming structure, what the Lojban terms are, and then how they translate between each other. This structure may change in future posts. Also, while I'm not committing to any particular programming language, the (pseudo)code examples will mostly look familiar to users of C-family programming languages, unless I decide something else makes the point clearer.

This post parallels Wave Lesson 1, which should be read before/after/while reading this.


Lojban for Programmers Part 0: Introduction

| Part 0 | Part 1 »

So I'm finally getting back to my blog after five years. If you want to hear more of what I've been up to, you can poke around my site's updates and my Google+ posts, incomplete as they are. But for this blog here and now, I'm posting about my experience learning Lojban.


Google is Everywhere.

A few years back, Google was just a search engine. People didn't have accounts with it. They just searched for stuff online.
Now, just from the Gmail account I made years ago, I have access to my own calendar system, document hosting, software project hosting, personal website, and blog. And those are just the things that I've personally used my account for. Now that they're even making their own operating system, a customized netbook for it to run on, and even their own programming language, pretty much the only thing they don't have is a Flash games site.