Do LLMs Strategically Reveal, Conceal, and Infer Information? A Theoretical and Empirical Analysis in The Chameleon Game

Abstract

Large language model-based (LLM-based) agents have become common in settings that include non-cooperative parties. In such settings, agents’ decision-making needs to conceal information from their adversaries, reveal information to their cooperators, and infer information to identify the other agents’ characteristics. To investigate whether LLMs have these information control and decision-making capabilities, we have LLM agents play the language-based hidden-identity game, The Chameleon. In this game, a group of non-chameleon agents who do not know each other aim to identify the chameleon agent without revealing a secret. The game requires the aforementioned information control capabilities of both the chameleon and the non-chameleons. We begin with a theoretical analysis of a spectrum of strategies, from concealing to revealing, and provide bounds on the non-chameleons’ winning probability. The empirical results with GPT, Gemini 2.5 Pro, Llama 3.1, and Qwen3 models show that while non-chameleon LLM agents identify the chameleon, they fail to conceal the secret from the chameleon, and their winning probability is far from the levels of even trivial strategies. Based on these empirical results and our theoretical analysis, we deduce that LLM-based agents may reveal excessive information to agents of unknown identities. Interestingly, we find that, when an LLM is instructed to adopt an information-revealing level, this level is linearly encoded in its internal representations. While the instructions alone are often ineffective at making non-chameleon LLMs conceal, we show that directly steering the internal representations along this linear direction can reliably induce concealing behavior.
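To make the idea of steering along a linear direction concrete, the following is a minimal sketch on synthetic vectors, not the paper's actual procedure. It assumes a difference-of-means direction between activations collected under "conceal" and "reveal" instructions, which is one common way to obtain such a direction and may differ from the method used in the paper. The function names `steering_direction` and `steer`, the scale `alpha`, and the random data are all hypothetical.

```python
import numpy as np


def steering_direction(h_conceal, h_reveal):
    """Unit vector from the mean 'reveal' activation toward the mean 'conceal' one.

    Assumption: a difference-of-means direction; the paper may use another estimator.
    """
    d = h_conceal.mean(axis=0) - h_reveal.mean(axis=0)
    return d / np.linalg.norm(d)


def steer(h, direction, alpha):
    """Shift hidden state(s) h by alpha along the unit-norm direction."""
    return h + alpha * direction


# Synthetic stand-ins for hidden states (hypothetical data, not model activations).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
h_reveal = rng.normal(size=(200, dim)) - 2.0 * true_dir
h_conceal = rng.normal(size=(200, dim)) + 2.0 * true_dir

direction = steering_direction(h_conceal, h_reveal)

# Steer one "reveal" activation toward the "conceal" side.
h = h_reveal[0]
h_steered = steer(h, direction, alpha=4.0)

# The projection onto the direction grows by exactly alpha.
proj_before = float(h @ direction)
proj_after = float(h_steered @ direction)
print(proj_after - proj_before)
```

In a real model, the shift would typically be applied to a chosen layer's hidden states during the forward pass; the layer and the scale are tuning choices that this sketch does not address.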

Publication
ArXiv
Jan Sobotka
CS Master’s Student & AI/ML Research Assistant

I am a master’s student in computer science at EPFL, and a research assistant at the Autonomous Systems Group at the University of Texas at Austin. I am interested in (mechanistic) interpretability, reinforcement learning, meta-learning, and ML applications in brain-computer interfaces.