
But if you hit the button it will not able to bring the coffee and get reward. Now if we think to continue to dreaming about your keeper android and baby, even if you have that button, your android will not allow you to smash the button because it wants to get you a cup of coffee and it will gain a reward at the end of this big mission. It understands that it's not complete, that the utility function that it's running is not the be all and end all.

That is to say it is open to be corrected. So this has been formalized, as this property that we want early AGI to have called "corrigibility". If you want to build something that you can teach, that means you want to be able to change its utility function.


In almost any situation being given a new utility function is gonna rate very low on your current utility function.
