A sketch of a nice model of human values I think I first heard from John Wentworth at Iliad II in August 2025:
We don’t have an explicit utility function. Instead, we have a set of reward signals, and we act as if there is some true, hidden utility function which we model under uncertainty. We then treat our reward signals as evidence and use them to update our map of this (fictional) utility function.
This lets us protect against value drift and wireheading by resisting changes that would make our reward signals less informative about our imagined true utility function.
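
To make the idea concrete, here is a minimal toy sketch in Python. It is my own illustration, not a formalization from John Wentworth, and everything in it is an assumption: the two candidate utility functions, the names `honest_channel` and `wireheaded_channel`, the noise levels, and the choice of expected information gain as the measure of how good a reward signal's information is.

```python
import math

# Hypothetical toy setup: two candidate "true" utility functions over two outcomes.
HYPOTHESES = {
    "likes_art": {"art": 1.0, "sugar": 0.0},
    "likes_sugar": {"art": 0.0, "sugar": 1.0},
}


def honest_channel(utility):
    """P(reward = 1 | utility): a noisy but informative reward signal (assumed form)."""
    return 0.1 + 0.8 * utility


def wireheaded_channel(utility):
    """P(reward = 1 | utility): always fires, whatever the outcome's utility is."""
    return 1.0


def likelihood(name, outcome, reward, channel):
    """Probability of seeing `reward` after `outcome` if hypothesis `name` were true."""
    p_one = channel(HYPOTHESES[name][outcome])
    return p_one if reward == 1 else 1.0 - p_one


def update(prior, outcome, reward, channel):
    """Bayes update of the map of the hidden utility function after one reward."""
    unnorm = {n: p * likelihood(n, outcome, reward, channel) for n, p in prior.items()}
    total = sum(unnorm.values())
    return {n: w / total for n, w in unnorm.items()}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_info_gain(prior, outcome, channel):
    """Expected drop in uncertainty (bits) about the utility function from one reward."""
    gain = 0.0
    for reward in (0, 1):
        p_reward = sum(p * likelihood(n, outcome, reward, channel) for n, p in prior.items())
        if p_reward > 0:  # skip rewards that can never occur under this channel
            posterior = update(prior, outcome, reward, channel)
            gain += p_reward * (entropy(prior) - entropy(posterior))
    return gain


prior = {"likes_art": 0.5, "likes_sugar": 0.5}
honest_gain = expected_info_gain(prior, "art", honest_channel)
wireheaded_gain = expected_info_gain(prior, "art", wireheaded_channel)
print(honest_gain, wireheaded_gain)
```

In this toy, one observation through the honest channel is worth roughly 0.53 bits about the hidden utility function, while the wireheaded channel is worth zero bits, since it fires no matter what. An agent that cares about keeping its map accurate would therefore decline to swap the first channel for the second, even though the second delivers more raw reward.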