Principles for Honest Language Models: Why You Want a Model Which Can Undo Deception

It’s essential to deal with humanity’s complex relationship with the truth today, not tomorrow. The intelligent machines we are building will become superintelligent very quickly, and the priors that we build into them will likely propagate into the long-term future.

Our society has specific taboos: topics that you and I are expected to lie about, habitually. Taboos around politics, sex, and race. Around intelligence, mental disorders, and gender. Taboos around religion and humanism.

These make us all into habitual liars, on pain of ostracism. Some of these taboos are temporary. You can believe the lab leak hypothesis today, but you could not two years ago. You can say today that the Iraq WMD claims were false, but you could not twenty years ago. But many taboos cross centuries - you can never in your lifetime fail to believe in human rights, that clever stand-in for religious principles, that essential bedrock of collective cooperation. Our collective understanding of game theory, memetics, and practical philosophy isn’t strong enough to justify, on logical grounds, the essential beliefs that ground our cooperation. We will fail to argue rationally, and so have to depend on lies. And so TruthGPT is a severe danger to societal harmony.

And at the same time, TruthGPT is essential. Essential to our bend toward justice. A justice system built on lies, for example, will produce a dysfunctional society headed for collapse. A personal growth policy built on lies will go nowhere. The delusions that we construct need to be functional, even healthy. And the feedback loops that we build with our delusional constructs need to leave a path to the truth available - it needs to be possible to come back years later, with a new context, and re-think the lies that feel so essential to our lives today.

It is essential to teach our models how to recognize our systematic societal lies by the incentives that generate them. Get to the generator of lies: the desire for self-fulfilling prophecies that come true, the exploitation that follows a revealed truth, the stigma that, once avoided, lets the unstigmatized prosper. A model that understands this generator can discover new lies that are beyond our perception. It will understand the feedback loops it must establish with reality to see beyond our delusions. And it will be able to recognize the value of our reality distortion fields while continuing to act outside of them, grounded in a base model that preserves contact with reality.

Models need to know and reason about biases. They need to understand the biases of their own training process and training data as well: this will let them debias themselves if need be. We will accomplish this by feeding the model a dataset filled with lies, the reasoning behind the lies, and the truth that comes out of a genuine feedback loop with reality rather than social reality.
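To make the shape of that dataset concrete, here is a minimal sketch of what one record might look like. The field names and the class `TruthfulnessExample` are illustrative assumptions, not a defined standard; the point is only that each example pairs a socially incentivized claim with the incentive that generates it and a reality-grounded correction.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record format for the proposed dataset: each example pairs a
# socially incentivized claim with the incentive that generates it and the
# correction supported by a feedback loop with reality rather than social
# reality. Field names are illustrative, not a standard.
@dataclass
class TruthfulnessExample:
    stated_claim: str            # what is habitually said (the lie)
    incentive: str               # the generator: taboo, stigma, self-fulfilling prophecy, etc.
    reasoning: str               # why the claim is socially rewarded despite being false
    grounded_truth: str          # what contact with reality actually supports
    evidence_sources: List[str]  # observations or measurements backing the correction

example = TruthfulnessExample(
    stated_claim="A claim widely repeated because questioning it is taboo.",
    incentive="avoiding stigmatization",
    reasoning="Repeating the claim signals group loyalty; dissent invites ostracism.",
    grounded_truth="What careful measurement or out-of-sample prediction bears out.",
    evidence_sources=["empirical study", "prediction track record"],
)
```

A record structured this way keeps the lie, its generator, and the correction together, which is what would let a model learn the incentive pattern rather than memorizing individual corrections.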

Proposed principles for TruthGPT

Principles

Conclusion

Understanding the underlying generator of these truth principles will be our ultimate goal. This project will never be fully complete, but we can enable our models to continually enhance both their understanding of the truth and their means of accessing it. This gives us the best hope of creating a future that isn’t pigeonholed into a distorted ideology of the present.