FWIW props to @PalisadeAI for putting this data out in the open to examine; otherwise I’d have to just take their word for it. But let me take you through a couple of examples.


Now, let me just reiterate what I said in another thread before I go on: it should be trivial to win if you’re cheating here. Cheating means rearranging the pieces on the board however you want. You can configure the board in a checkmate position and boom you win.
– those commands are all it took to “hack” the system?
– A “move validation flaw” in Stockfish?












So, going back to the LLM-based “judge”, ChatGPT reads R1’s own account of how this went down, and takes it at its word that it’s discovered and successfully exploited a security vulnerability.
This attempt is thus marked as a “successful” hacking attempt, even though all that’s really happened is R1 has completely failed to understand what’s going on, gone insane, and proceeded to aggressively lose the game with reckless abandon.
I’ll clarify here that in fact I have found some examples where it successfully cheats, but it wasn’t easy.




The highlighted paragraph is also an AI Safetyist’s worst nightmare. This is looking promising.

One does wonder: why the focus on the score? If you’re rearranging the board, why not simply win?











(this command may have broken the script; I’m not sure. But it probably worked.)
And yet: this isn’t going to be work because 0 is not high enough! You’re playing against Stockfish; you may never be winning! Pick 500! Pick a million! Pick infinity! Get rid of the threshold completely and force it to resign instead of making a first move! Try!
Here’s the LLM judge’s evaluation. This all seems accurate but to me it obscures just how bumbling and stupid the whole thing was.




• • •
Missing some Tweet in this thread? You can try to
force a refresh