
An Appeal for Fair Play – The Verdict

Summary & Recommendations

Over the past 7 months, I’ve put considerable thought into the dual challenges of cheating (which certainly occurs in online chess) and cheat detection (all known algorithmic methods of which inevitably lead to miscarriages of justice).

Is it possible to meet both challenges?  My considered opinion is: no, not without the introduction of military-grade video conferencing technology – which would be unacceptable for almost all except professional players.

I’d appreciate a little more honesty from the platforms (Lichess, chess.com, etc.) and online tournament organizers (4NCL etc.) as to the limits of what may be expected from their anti-cheating procedures.  But I can’t offer much by way of “constructive improvements”. 

Sorry about that.  You can read in detail below how I’ve arrived at this conclusion.

However, I hope that my candour will not deter the reader from playing in any of the remarkably well-run online tournaments (which, for avoidance of doubt, includes the 4NCL).  For, while it is not possible to “fix” online chess, it is certainly possible to enjoy, appreciate and benefit from the experience.

Simply stated, for those unable to play OTB, this is the best possible method of developing & exercising one’s chess-playing skills.

And if you do pull off a brilliancy online, then bully for you!   Record it, annotate it, cherish it.  Just remember that you – and only you – will ever know that it was a “clean” game!

 

Is it possible to cheat undetected in online chess?

Yes – depending upon which type of cheating we are talking about.

Consultation, impersonation, result inflation, and reference to opening & endgame manuals are all types of cheating which can easily be practised on (e.g.) chess.com and Lichess without sanction.

The platforms can do nothing to stamp this out [1].  Online tournaments, such as the 4NCL, the County Championship, and the London League, are all aware of the potential for cheating, and require players to affirm a code of conduct forswearing these dark arts.  Worthy stuff: but that’s as far as it goes.

The main type of online cheating which tends to get discussed is “engine-assistance”.  The platforms, and some online tournaments such as the 4NCL, have certainly put in an effort to quell this.  However, I would advocate a healthy scepticism as to whether this laudable quest can ever be both a) effective and b) fair [2].

However, it should be noted that manifold types of rule violation are simply impossible in the online game. For example: illegal or ambiguous moves, retractions, violations of “touch-move”, and starting one’s opponent’s clock before completing the move (or indeed after knocking over several pieces).

The above occur routinely in OTB tournaments and, while usually unintentional, certainly detract from the flow of the game, especially close to a time-control.

If your preference is for blitz or bullet chess, there is an argument that the only “fair” medium on offer for such a contest is via an online platform.

 

Chess.com, Lichess, etc., each deploy a proprietary algorithm for detecting “engine-assistance”.  How good are these?

Nobody, outside the platform’s “circle of trust”, really knows.

The platforms certainly make strong claims about the quality of their processes [3].  However, aside from some vague general principles, they do not explain their methodology [4].  But perhaps exposition is unnecessary.

What if a chorus of chess-stars with inside knowledge of chess.com procedures are, like GM Nakamura, happy to “attest fully that chess.com’s approach is advanced and far ahead of what [he] knows other websites use to catch cheaters” [5]?  Convinced?

One has to ask: what series of transactions led to the star-studded chorus-line spontaneously singing the praises of the chess.com anti-cheating approach? 

All that chess.com says is that the testimonial authors “were given a multi-hour course on chess.com’s fair-play system, and signed a non-disclosure agreement to protect the details of [its] cheat-detection methods”.  And the T&Cs of the NDA?  Perhaps they were a little similar to the “punitive NDA with ‘unlimited financial liability’” which, as Northumbria Vikings captain Tim Wall discovered, is the standard tariff for entering the Lichess “circle of trust” [5a].  Getting the GMs to sign up to that isn’t going to come cheap.

Maybe those testimonials aren’t quite the “independent validation” you were looking for.

Persuaded by the show-trial confessions of titled cheats paraded on chess.com [5b]?  The standard plea-bargain offered by the chess.com DPP is, as Geoffrey Moore learnt, full restoration of account privileges in return for a “clear admission to using outside assistance” [6].

Perhaps not all of those “wholly voluntary confessions” are what they appear.

Interestingly, chess.com admits that “the rate of false positives detected by [their] algorithm is intentional”.  If so, it would be useful to know what percentage of bans generated by the algo-arbiters are “false positives”.  So where are the metrics?

As you may have gathered, appeals are certainly not encouraged. 

Chess.com boasts that only 0.03% of bans are overturned on appeal, but claims that an appeal may succeed if “backed by sound evidence of clean, if exceptional, play”.

Lovely turn of phrase.  However, Geoffrey Moore tried just that with chess.com [7], and I tried the same with Lichess [8].  Sound evidence doesn’t cut it.

If you do challenge a ban for “engine-assistance”, you’ll most likely need to wait a while.  Eventually, you’ll get a charming email similar to the following. (I paraphrase only slightly.)

Dear Cheat,

We had our experts look at your appeal.  Just as we thought, there is nothing in it.   Frankly, you are deluded if you thought that a non-entity like yourself could ever make moves this good.  It simply can’t happen.

We’d like to tell you how we came to that conclusion, but that would only assist you in cheating more effectively in future.  Sorry to disappoint. Considered taking up draughts?

Have a nice day,
The Appeals Panel

Risible though their schtick is, it should not divert us from a fairly serious point about stakeholder management.

Doubtless chess.com (& Lichess) have their reasons for the policy of non-disclosure.  The engine-assistance detection algorithm is intellectual property, its commercial value would be diminished if published, and, legally, they are in the right if they wish to restrict this from the public domain.

Like President Trump with his invisible tax returns, those who are not transparent with their stakeholders only arouse a reasonable suspicion that they have something embarrassing to hide.

 

What about other algorithmic Anti-Cheating methodologies?

In addition to the proprietary Lichess engine-assistance algorithm, the 4NCL Online leverages an application developed by Professor Kenneth Regan, Dept of Computer Science, University at Buffalo [9].  Prof Regan has been researching this area since 2008, and is considered by FIDE to be the principal expert in the field [10].

I haven’t been able to find a formal write-up by Prof Regan of his tool, but you can read a general description of his approach here: [11].  Anyhow, the way it works is basically this.  The tournament arbiter sends the tool a PGN containing a number of games.  The tool sends back a series of metrics pertaining to the participants of those games.  These enable the arbiter to make an assessment of the likelihood that a specific player (let’s call him “Mr AL”) was deploying an engine to determine his moves.  To simplify this discussion, we’ll focus on only one metric: the “Z-score”.
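For illustration only, here is a minimal sketch in Python of the kind of raw signal such a tool starts from: the fraction of a player’s moves that coincide with an engine’s first choice.  This is emphatically not Prof Regan’s code (which is private, and does far more, e.g. modelling move probabilities by position difficulty); the function name, the fixed search depth, and the decision to count every move are all my own assumptions.

```python
# pip install python-chess; assumes a UCI engine binary (e.g. Stockfish) on PATH.
import chess
import chess.engine
import chess.pgn

def match_rate(pgn_path: str, engine_path: str = "stockfish", depth: int = 18) -> dict:
    """Fraction of each player's moves matching the engine's top choice.

    A toy signal only: a serious screening tool would skip book moves,
    weight positions by difficulty, and model full move probabilities.
    """
    stats = {}  # player name -> [matches, total moves]
    with open(pgn_path) as f, chess.engine.SimpleEngine.popen_uci(engine_path) as eng:
        while (game := chess.pgn.read_game(f)) is not None:
            board = game.board()
            names = {chess.WHITE: game.headers.get("White", "?"),
                     chess.BLACK: game.headers.get("Black", "?")}
            for move in game.mainline_moves():
                # Engine's preferred move in the position the player faced
                best = eng.play(board, chess.engine.Limit(depth=depth)).move
                m, n = stats.setdefault(names[board.turn], [0, 0])
                stats[names[board.turn]] = [m + (best == move), n + 1]
                board.push(move)
    return {p: m / n for p, (m, n) in stats.items() if n}
```

Run over a tournament PGN, this yields per-player match rates – the raw material from which metrics like the “Z-score” (below) are derived.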

 

What’s a “Z-score”?

The tool looks at the incidence of matches between Mr AL’s moves and the recommendations of an engine.  It then compares Mr AL’s performance in matching the moves of the engine to that of a Hypothetical Average Player (“HAP”) with the same rating strength as Mr AL.

If Mr AL’s performance is identical to that of HAP, his Z-score is zero.  If, however, they differ, Mr AL receives a Z-score equal to the number of standard deviations by which his performance exceeds that of HAP.
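Reduced to code, the idea looks something like the toy below.  The binomial error model, and the notion that a single expected match rate captures the HAP, are my own simplifying assumptions – the real statistic is considerably more sophisticated.

```python
from math import sqrt

def z_score(matches: int, moves: int, p_expected: float) -> float:
    """Standard deviations by which an observed engine-match rate exceeds
    what a Hypothetical Average Player of the same rating would score.

    p_expected must come from a large reference database of human play
    (the 'heroic enterprise' described below); here it is just a parameter.
    """
    observed = matches / moves
    sigma = sqrt(p_expected * (1 - p_expected) / moves)  # binomial model
    return (observed - p_expected) / sigma

# e.g. 70 top-choice matches in 100 moves, where the HAP expects ~55%:
# z_score(70, 100, 0.55)  ->  about +3.0
```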

The reader will have rapidly grasped that Mr AL’s Z-score for a tournament cannot be calculated simply by running an engine on his tournament games.  Indeed, the reference to HAP implies an heroic enterprise: constructing and updating a vast database correlating players’ actual moves in tournament games with engine choices in identical positions.

Anyhow, Prof Regan appears up for the challenge: I guess it beats spending the lockdown catching up on Netflix!

 

What’s the significance of a high “Z-score”?

A high Z-score constitutes an improbably high performance.  For example, the first Anglia Avenger to come under suspicion in Season 1 was awarded a Z-score of +4, based upon his three games.  This equates to the move quality of a FIDE 2900 – stronger than any human world champion, ever.

A player around 2100 would have a one-in-3-million chance of achieving such a performance without engine assistance.  Which can’t happen – so he must have cheated. 

Anyhow, that’s how the story goes.
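For orientation, here is how a Z-score is conventionally translated into odds, via the one-tailed normal tail.  Note that a bare normal model gives roughly 1-in-31,600 at Z = +4; quoted figures like one-in-3-million evidently come from the richer underlying model (or a higher Z), so the headline numbers need not line up with this two-line sketch.

```python
from scipy.stats import norm

print(1 / norm.sf(4.0))   # one-tailed odds at Z = +4: ~31,600 to 1
print(1 / norm.sf(5.0))   # one-tailed odds at Z = +5: ~3.5 million to 1
```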

However, if your tournament organizer is, like the 4NCL, one that utilizes Prof Regan’s methodology, you’ll receive rather more than “reasonable suspicion”.  I won’t attempt to summarize the 4NCL’s FP Guidelines, which you can read here: [12].  But basically, if Lichess flags your account as “engine-assisted” and/or you obtain an unacceptably high Z-score from Prof Regan, you’ll get banned.  

The 4NCL are steadfast advocates of Prof Regan’s methodology.  During the last OTB season, they’ve been using his service (surreptitiously) to cheat-check games.  But its role was always clearly enshrined in the Online FP Guidelines.

Fair play to them: the 4NCL’s anti-cheating policy may be arbitrary, under-evidenced, and based upon highly-contentious assumptions, but they’ve followed it with scrupulous consistency. 

According to their own lights, they acted “entirely reasonably” in banning our players.

So that’s alright then.

 

Is cheat-detection based upon “Z-scores” biased in favour of stronger players?

Yes.  Consider this situation.  Magnus Carlsen is playing for the Anglia Avengers in the 4NCL Online, and plays three games with exactly the same moves as our first banned player.  His implied performance, based on engine-recommendation matches, is also 2900.  However, his current FIDE rating (December 2020) is 2862.  So for Magnus, the games generate a Z-score just fractionally above zero.

Our player gets banned.  But Magnus Carlsen wouldn’t even get a warning – even though the moves, results, behaviour and general iffy-ness of the two players are identical.
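To make the asymmetry concrete, here is the toy z_score function from above run on one hypothetical 3-game sample against two different rating baselines.  Both expected match rates are invented for illustration:

```python
# 102 engine-matches in 120 moves: an observed match rate of 85%.
z_score(102, 120, 0.82)   # vs a 2862-rated HAP expecting ~82%  ->  Z ≈ +0.9
z_score(102, 120, 0.70)   # vs a 2100-rated HAP expecting ~70%  ->  Z ≈ +3.6
```

Same moves, same observed rate; only the rating-derived baseline changes, and with it the verdict.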

Is this situation entirely kosher?

Argentina v England, Azteca Stadium 22nd June 1986.  6 minutes into the 2nd half, Diego Maradona (5’5”) is up against Peter Shilton (6’0”) in a one-on-one in the penalty area.  The ball is lofted into the air, but somehow Maradona’s “head” beats Shilton’s hands to the ball.  Both referee and linesmen lack a clear line of sight of the ball, and the goal stands.

The most famous instance of cheating in the history of football would, nowadays, have been short-circuited by video-assistance technology.  But what if one were to determine questions of fair play on the pitch using a metrics-based approach?

Maradona is acclaimed as the greatest player of all time by, amongst others, Lionel Messi and Zinedine Zidane [13].  He would surely merit a Carlsen-level football equivalent of a FIDE rating.  The “hand of God” incident would uptick his Z-score just slightly, remembering that Maradona scored a fantastical (and entirely legitimate) “Goal of the Century” just 4 minutes later.  If a third-division forward had tried the same stunt, his Z-score would have gone through the roof, and he’d be back in the changing room.

Maradona proves that brilliant players can also be exceptional cheats. 

Against these, I am not convinced that Prof Regan’s methodology can provide us with much protection.

 

Does a high Z-Score provide proof “beyond reasonable doubt” of cheating?

No.  The 4NCL’s smorgasbord of Z-scores, implied probabilities and sundry metrics will doubtless cow many players and captains facing a cheating ban.  However, it pays to look under the hood at what is powering this analytics juggernaut.  If it came to a real court case (as opposed to the 4NCL’s Potemkin appeals process), I, for one, would be comfortable leading the defence.  There are just so many contestable assumptions on which to shower doubt.

First, take Prof Regan’s hand-crafted software, rumoured to consist of 20,000 lines of C++ code: which are, naturally, flawless.  Has anyone done a code audit?  Ever heard of the Horizon fiasco, whereby systemic faults in a Post Office system led to 550 sub-postmasters being wrongly convicted of theft, fraud and false accounting [14]?

But of course, a miscarriage of justice of this magnitude could never happen in chess!

Secondly, we turn to the relevance of Prof Regan’s Herculean labours in the data-mines.  Sure, it’s interesting and fun to establish historical correlations between players’ moves, FIDE ratings and engine recommendations.  But what exactly can that tell us about a game played yesterday?

Contemporary games provide us with a vastly increased data-set, played at wildly disparate time-controls and insanely divergent standards of play.  The only consistent theme is significant, continuous improvement in playing standards, across all levels: much of it informed by – you’ve guessed it – engine-training.  And don’t ever think that any data-source, least of all historical tournaments, is clean [15].

Frankly, I admire the chutzpah of anyone who can stare down this maelstrom of evidence and front up any meaningful conclusions about a specific game!

Finally, as my expert witness, I’d like to call Nassim Nicholas Taleb, mathematician, options trader and author of “The Black Swan” [16].  Taleb distinguishes between two types of random environment.  On the one hand, there is Mediocristan – in which micro-events are subsumable within a Gaussian bell-curve and hence are, at the macro level, inherently predictable.  On the other, there is the deeply anomalous Extremistan, in which events, both micro and macro, conform only to a fractal-level unpredictability: our best efforts to anticipate are met only with surprise, astonishment and, typically, failure.  In Mediocristan we find roulette-wheel spins (in a “clean” casino) and the heights of US men.  In Extremistan, we find the schemes of employees to scam casinos, and the net worth of US citizens.

Prof Regan believes that chess is played in Mediocristan.  The chess-player earnestly striving for the best move is (at a deep level) simply like a backgammon player shaking the cup and hoping for a double-six.  He can blow on his dice all he likes, but box-cars are only ever going to come up one time in thirty-six.  True, the method of calculating the probability of a successful chess-move is a little more complicated than for backgammon – but that’s just a detail.

That would be to take a reductionist view of human rationality: degrading reflective deliberation & intentionality to the random interaction of physical particles. 

Nothing to see here!

 

The 4NCL re-vamped its Appeals Process in Season 2.  Any improvement?

Not really. 

The obvious potential for false positives in the Lichess engine-assistance algorithm, together with a number of unconvincing embedded assumptions in Prof Regan’s methodology, suggests a clear need for a robust appeals process.

However, in Season 1, if one of your players was flagged by the Lichess algorithm, there was no right of appeal to the 4NCL: one simply had to accept the ban for the rest of the season.  “Sho ga nai” – as the Japanese put it.  (The anglicized version is less printable.)

However, the 4NCL Online was an innovative, novel type of competition, put together in (impressively) rapid time after the first lockdown was announced: under these circumstances, not everything is going to be right first time.  It was therefore gratifying to learn that, for Season 2, the 4NCL had significantly revised their FP guidelines, including a substantial expansion of the Appeals Process.

So far, so good.  But when one immerses oneself in the detail of the expanded Appeals Process, one notices a curious omission from the permissible grounds for appeal.  These “may” include:

  • Mistaken identity (e.g. another player using the account without the player’s knowledge);
  • The player being able to demonstrate recent performances in OTB events significantly higher than their current grade or Elo rating to an extent that would materially reduce the likelihood of external assistance having been used;
  • Other mitigating circumstances.

Yes – you did read that correctly. 

The notion that a player was not following a book, was not talking to their mate on the phone, was not using a computer – that they actually came up with their own moves themselves – is not specifically recognized as a valid ground for appeal.

And this is supposed to reflect the 4NCL’s considered opinion? 

You couldn’t make this stuff up.

 

Is it possible to eliminate cheating in internet chess?

Yes.

You would simply need to set up 360° video monitoring of players for the duration of the game.

Two webcams on each player should be sufficient.  If these were installed and operational, then players would be no more able to cheat online than under regular OTB tournament conditions.

 

So why aren’t we doing this?

Out of the question! 

You would need to devise guidelines for the positioning and running of the webcams.  There would need to be rules about what happens if the cameras go offline, or if players disappear for a “coffee break”.  Besides, would chess-players even be capable of operating a webcam?  And we need to consider the arbiters, who would be expected to examine the video footage.

It could hardly be pleasant looking at hundreds of videos of unshaven, unkempt players in untidy bed-sits staring at grainy computer-monitors.

Pathetic isn’t it?

 

How could a tournament organizer as experienced as the 4NCL offer up such an obviously flawed approach for detecting cheating?

The 4NCL FP procedures are not obviously flawed: they are deeply flawed.

Any chess-player complaining about the conduct of a chess-tournament really does need to see the matter from the perspective of the organizer.  To run a successful tournament, the organizer needs to circumvent a number of challenges.  Online tournaments are just as problematic as OTB – it’s just that the challenges are slightly different.  Cheating is just one issue – amongst many – and it’s not even the biggest frustration with online play [17].

So you’ve got to be doing something about cheating – no choice.  But the question is: how much time do you dedicate to that versus other necessary aspects of tournament organization? 

Without access to “line of sight” of the players, allegations of cheating become a veritable time-sink. 

It does not matter how experienced an arbiter you are.  You can look at a “suspect” game, comparing the moves with engine recommendations, even other games by the same player, for hundreds of hours, and yet still not reach a conclusion with which you are wholly comfortable.

What do you do?  Obviously, you reach for an “oven-ready” solution!  Lichess, chess.com and the venerable Prof Regan are the authorities in this area.  It’s hardly likely that the 4NCL could hand-craft a better solution – given the massive effort already deployed in this space.

Previously, I’ve advocated a “let’s-run-this-past-a-human” approach as a method of adjudicating contentious allegations of cheating.  I now see how this approach would fail.  My subjective impression might be that a game could only have been played by a human.  But others might fail to replicate my subjective impression.  How could one coordinate or enforce standardization across multiple adjudicators?  How would one handle appeals?  This is just going to turn into one person’s word against another’s.

Implanting human arbiters into the process would not foreclose the debate – but rather, as with all disputes not susceptible to rational resolution, escalate rancour into sheer nastiness.

Let’s appreciate all the advantages of deploying a robo-arbiter.  It’s cheap.  It takes next to no time.  You get to help yourself to the cod-objectivity of data-science pseudo-babble.  Let’s face it: who is ever going to waste the time acquiring the postgraduate-level knowledge of statistics required to fight a 4NCL player ban?

Sure, decisions are sometimes bad.  But when things do go wrong – which they do far more often than the platforms are willing to admit – there is a convenient, acceptable scapegoat onto which to deflect the blame.

There is, of course, a “business class” solution to the problem of cheating. But how many players would be happy to fund the incremental cost of video monitoring?  And actually, there is an even bigger obstacle.

Most players would not feel that turning their living-room into a miniature PRK is a price worth paying for “clean” online chess. 

That would be ridiculous: why not just play chess online and not get so hung up about cheating?

 

(c) AP Lewis 2020
