Spacious Mind's Renaissance School


Spacious Mind's Renaissance School

Post by spacious_mind »

Hi Everyone,

Wow, 3 1/2 weeks have passed since I last posted. But I have not been idle: during this time I revamped and re-sorted the old tests, created 5 new ones, and played quite a few test games together with Eric, who helped me a lot with his bug testing and game playing. This is going to be a long post, so be prepared to be bored to tears. Before sharing these new test games, renamed the Renaissance School because that was the period during which the games were played, I want to explain why I am creating these tests. A lot of you here in the Forum know I have a pretty sizeable collection of dedicated chess computers and chess programs.

This is a test comprising 10 games totaling 415 moves, with 208 White moves and 207 Black moves. What makes this test so interesting for me is that Eric and I played 16 chess computers and programs at Tournament level through these 10 tests, all facing the same 415 positions. We also played the same test at 30 seconds per move, and Eric is currently working on adding another 6 computers.

If you were to do the math and these 16 computers were to play each other 10 times, then you would have to play a total of 2400 games at Tournament level and 2400 at Active 30s/move level, compared to just 160 tournament and 160 active test games. And you would still not be able to compare which computer would play which move in any given position without going back, studying the games played, returning to those computers, setting up your test positions and then seeing who would have played what.

Now just imagine you have, for example, a collection of 500 chess computers and you wanted to play them all against each other 10 times to see for yourself how they rank. At 3 minutes per move it would be impossible: this tournament would be 499 times 500 times 10, which equals 2,495,000 games if my math is correct, times an estimated 3 hours minimum per game, which equals at least 7,485,000 hours. With leap years there are on average 8,766 hours in a year, which means one person playing 24/7 would need 854 years and could never sleep or do anything else.
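For anyone who wants to poke at the assumptions, here is a minimal sketch of that workload arithmetic in Python, using the same figures as above (purely illustrative):

Code:

# Rough workload estimate for a full round robin of dedicated computers.
n_computers = 500
rounds = 10            # each pairing replayed this many times
hours_per_game = 3     # conservative tournament-level estimate

# 500 * 499 counts both colour orders; halve it if each pairing plays once.
games = n_computers * (n_computers - 1) * rounds   # 2,495,000 games
hours = games * hours_per_game                     # 7,485,000 hours
years = hours / 8766                               # ~854 years of nonstop play
print(f"{games:,} games, {hours:,} hours, about {years:.0f} years")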

Take Schachcomputer.info as an example: they just updated their Tournament list and added 5,220 games since their last update (I don't know when that was made). In total there are now just over 18,000 games, collected over roughly 25 years, with 189 computers on the list, never mind 500. That is a massive amount of effort and dedication from the players. I don't know how many people play and send in their games, but it's a huge amount of work and commitment, and kudos to those players.

Since I am a collector of chess computers, you can perhaps now see why I am interested in a good method of rating every one of my computers and favorite chess programs, be it DOS, PC, Commodore or tabletop, and comparing them apples to apples. I want to be able to see the results in this lifetime.

So anyway, even if I must do this alone, now that I am retired you can do the math: health permitting, I can easily complete this without breaking a sweat, cheating a little by playing with two or three computers at the same time. Since the tests are short compared to a full tournament game, I can casually complete 2-3 computers a week, and therefore in possibly less than 4 years I would have the data for 500 computers at both tournament and active level, plus any shorter level that might be of interest, e.g. 10 seconds per move. Instead of taking 1,000 years.

However, just imagine how quickly all this could be done if several people played these tests with their computers and the results were recorded and shared by everyone. Those 4 years might quickly turn into less than a year, and everyone would be in data heaven for chess computer comparisons.

To summarize, this has been my driver: to create a workable, accurate test that can compare every chess program, so long as it has a move take-back option, as well as every human who might be interested in seeing how they rate.

Eric has been of great assistance with his games and spreadsheet testing, finding the human errors I made on the spreadsheets and suggesting improvements such as the “Sneak a Peek” feature I just added. Therefore, a big thank you to Eric.

I won’t bore you further with the many different things you can do with the data from these tests, but do I think these tests rate accurately?

Well, as with regular computer games, the more test games you have, the more accurate the average final ratings become. But within this 10-game universe the ratings are for sure 100% accurate; they cannot be otherwise, since everything is rated 100% the same way across 100% the same move-by-move game positions.

SF16-1 scored 3543 ELO, King Performance 2595 and Chess-Master Diamond 1265. It’s not made up; it’s how they played these games.

Now back to these tests.

[Image: Tournament-level test results by computer]

Above you will see the games played so far by Eric and myself at Tournament level. I have check-marked the results that match other rating lists pretty well and question-marked the ones that are a little high. The hard work was figuring out the scoring calculations and automating them so that the output matches what human chess players are used to comparing themselves against. The scoring of the individual games, based on the moves played, is all over the place, but combined across 415 moves they create a pretty good ELO comparison, and of course more test games would make this even more accurate. You can see from the individual test games how up and down each and every computer performs, as they are all programmed differently by their creators. I mean, who would have thought that V11 plays like a beginner in Game 7 and like a Grandmaster in other games? Some games are just more complex than others for chess computers.

[Image: Test results at 30 seconds per move]

Playing the same test at 30 seconds per move, you can see how their skill level declines. Technically it is wrong to think a player is playing at, for example, 2200 ELO in blitz when the game quality, for all to see, is at an 1800 level. These tests will show the difference, with few exceptions. Some computers just get lucky and score better, but it's mostly timing. The baseline for these tests is Tournament level. More test games will likely eliminate outlying high or low scores.

[Image: Performance change between Tournament and 30 seconds/move levels]

This test will very accurately show the improvement or decline in performance of most computers (excluding any lucky ones) as time is added or reduced.

Lastly, you can see on the list that I searched for and found a weak chess program, listed at CCRL at 1839 ELO. I ran it through these tests and, as expected, it scored higher. In the past, I can't count how many times I tried to play an engine against a dedicated computer, and the dedicated was embarrassed every time because of the engine's speed. So I tried different handicaps, like slowing the engine down. Well, I did the test with ECE_03 at full laptop speed on 1 core, only reducing the hash to 1 MB (that's maybe too low, and experts may suggest a higher hash for testing). I then played it at full speed against TM Lyon, and TM Lyon won 2-0, as its higher test rating had suggested. This is the first time ever that, at a first attempt, without endless searching, playing and slowdowns, both played at their natural speed and the dedicated computer won. That opens up a lot of doors for some great future tournaments.
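If anyone wants to script that kind of hash handicap for a UCI engine instead of setting it in a GUI, here is a minimal sketch using the python-chess library. The engine path is hypothetical, and it assumes ECE_03 speaks UCI and exposes the usual "Hash" option; it is only an illustration, not how the matches above were run:

Code:

import chess
import chess.engine

# Hypothetical path; assumes the engine speaks UCI and has a "Hash" option.
engine = chess.engine.SimpleEngine.popen_uci("./ece_03")
engine.configure({"Hash": 1})            # cap the hash table at 1 MB

board = chess.Board()
result = engine.play(board, chess.engine.Limit(time=180))  # roughly 3 minutes per move
print("Engine chose:", result.move)
engine.quit()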

Here are the games:

Game 1: https://lichess.org/yvyS5bo4
Game 2: https://lichess.org/9LWIscv6

I am finally done with this lengthy post, and all that is left is to give you the link to download the zip file for the 10 test games, which includes the tests played so far.

https://www.spacious-mind.com/forum_rep ... School.zip

Best regards
Nick

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

By the way, I forgot to add that these tests are not meant to replace other lists. They are meant to complement them and provide a way to establish a decent rating for your computer quickly and efficiently.
Nick

Re: Spacious Mind's Renaissance School

Post by gordonr »

Firstly, I want to thank you for all your great efforts and sharing them. I have referred to your work on countless occasions and always appreciate it.

I think your approach of analysing games and scoring is very interesting indeed. I don't think there is a single right method of testing and it is useful to compare different approaches.

I wanted to add some comments on other test methods.

- "If you were to do the math and these 16 computers were to play each other 10 times, then you would have to play a total of 2400 games". You should divide by 2 since each game is a game for 2 computers - so 1200.

- not every computer has to play every other computer in order to determine a reliable rating. You can do something similar to the Swiss pairing system, where computers play other computers that have been performing similarly.

So if I'm testing a new computer that I have no idea about, I could pick an opponent rated 2200. If it loses, I may jump down to an 1800 opponent. If it then wins, pick a 2000 opponent. It doesn't take many matches to find the ballpark and test it with similarly rated opponents.
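A minimal sketch of that search strategy, assuming a small dict of already-ranked machines (the names and ratings are just borrowed from the list further down, purely as an illustration):

Code:

# Pick the next opponent as described above: after a loss, aim lower;
# after a win, aim higher; shrink the jump each time to home in on the level.
def pick_next_opponent(pool, last_rating, won, step):
    target = last_rating + (step if won else -step)
    # choose the already-ranked machine whose rating is closest to the target
    name = min(pool, key=lambda n: abs(pool[n] - target))
    return name, pool[name]

pool = {"Genius 5": 2654, "Atlanta": 2491, "Genius Pro": 2357, "Cosmos": 1920}
print(pick_next_opponent(pool, 2200, won=False, step=400))  # loss -> aim near 1800
print(pick_next_opponent(pool, 1800, won=True, step=200))   # win  -> aim near 2000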

For any two computers that are next to one another on my ranking list, I try to ensure that they have played a minimum number of games against one another. This means that any computer that was initially "lucky" or "unlucky" in its initial positioning in the list will over time move up or down to where it more reliably belongs.

I use PGN Stat to calculate ratings but then modify the output to include my own preferred details. Here is the start of my ranking list. I use Cosmos rated at 1920 as the fixed reference point for the other ratings.

Code:

Platform: D - Dedicated, E - Emulated, O - Old PC/DOS, M - Modern PC

Rank  Name                              Elo       +/-      Games   VsBelow    P    Year    Author             

1     DGT Centaur                       2858      339      2       2 0 0      D    2019    Stockfish Team     
---
---
2     Genius 5                          2654      175      8       2 4 0      O    1996    Lang               
---
3     King Performance                  2577      285      2       0 0 0      D    2019    De Koning          
---
4     Atlanta                           2491      144      16      0 2 6      D    1997    Morsch             
5     CheckCheck                        2439      281      2       0 0 2      O    1992    Delmare            
6     Sparc                             2400      225      6       2 0 0                                      
---
7     Genius Pro                        2357      154      14      0 2 4      D    2016    Lang               
---
8     Maestro D++                       2289      281      2       0 0        E    1990    Kaplan             
---
9     Star Diamond                      2134      195      10      6          D    2003    Kittinger          
---
---
10    Cosmos                            1920      199      14
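For readers wondering how such a list hangs together numerically, here is a rough sketch of the standard Elo expected-score formula and a simple performance-rating approximation against already-rated opponents. This is just the textbook math, not necessarily what PGN Stat computes internally:

Code:

# Standard Elo expectation: the score one rating is "supposed" to make vs another.
def expected_score(own, opp):
    return 1.0 / (1.0 + 10 ** ((opp - own) / 400.0))

# Simple linear performance estimate: average opponent rating
# plus 400 * (wins - losses) / games.
def performance_rating(opponent_ratings, total_score):
    n = len(opponent_ratings)
    avg = sum(opponent_ratings) / n
    return avg + 400.0 * (2 * total_score - n) / n

print(expected_score(2491, 1920))             # Atlanta vs Cosmos: about 0.96
print(performance_rating([1920, 1920], 1.5))  # 1.5/2 against Cosmos -> about 2120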

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

gordonr wrote: Tue Apr 23, 2024 3:46 pm Firstly, I want to thank you for all your great efforts and sharing them. I have referred to your work on countless occasions and always appreciate it.

I think your approach of analysing games and scoring is very interesting indeed. I don't think there is a single right method of testing and it is useful to compare different approaches.

I wanted to add some comments on other test methods.

- "If you were to do the math and these 16 computers were to play each other 10 times, then you would have to play a total of 2400 games". You should divide by 2 since each game is a game for 2 computers - so 1200.

- not every computer has to play every other computer in order to determine a reliable rating. You can do something similar to the Swiss pairing system, where computers play other computers that have been performing similarly.

So if I'm testing a new computer that I have no idea about, I could pick an opponent rated 2200. If it loses, I may jump down to an 1800 opponent. If it then wins, pick a 2000 opponent. It doesn't take many matches to find the ballpark and test it with similarly rated opponents.

For any two computers that are next to one another on my ranking list, I try to ensure that they have played a minimum number of games against one another. This means that any computer that was initially "lucky" or "unlucky" in its initial positioning in the list will over time move up or down to where it more reliably belongs.
Hi Gordon

Thanks for your kind comments. I have also been using EloStat and other rating systems for years, and I know they are great at creating ELO lists; the more games a computer has played, the smaller the +/- tolerance becomes. However, try the tests and you will find that they take about 1/3 of the time you spend playing full games. Compare the results with your EloStat results and you might be surprised :)

Dedicated computers are unevenly tested against each other, and that leaves plenty of room for doubt.

The tests should be done at 3 minutes per move because, as I explained, at shorter times there should be a measurable performance loss, which I am interested in tracking with my computers as well.

What EloStat and other systems cannot do is lay out, right next to each other, move by move, what the computers played, compare them and even rate their move weaknesses move by move. You can see the good moves and blunders forever, right next to each other.

As I mentioned, none of us has unlimited time, so getting more and better stats and results by thinking and doing things outside the box works for me.

I am sure that, like all of us, your dedicated computers sometimes play 60+ moves each to the finish. That means at tournament level you are worn out after one game. My tests average 20 1/2 moves each.

I see you played 16 games with Atlanta, and similar numbers with Cosmos and the others on your list. Doing all the tests at tournament level or at the 30 seconds level should be the equivalent of playing just 5 games with your Atlanta; you can then compare the final rating against what you show now and against what other places show as the rating for Atlanta.

Best regards
Nick

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

What I should have mentioned is that my tests compare time-based playing strengths in a closed universe, not game performances against other computers as you would see in other rating lists. However, in order to allow chess players to visualize this and compare it against what they are used to seeing, I created a method that converts the results to an ELO equivalent, which surprisingly gets very close, even with just 10 tests, to the game-playing ELOs these computers achieve playing each other. Since the original games were played by humans and not computers, these tests also simulate how the computer would respond to what you might play, including your blunders.

I could just as easily have scored them 1 to 1000, or A to Z, or whatever, but that would just seem like gobbledygook to other people, and it would not be able to rate humans who might take the test either.

Best regards
Nick

Re: Spacious Mind's Renaissance School

Post by Tibono2 »

spacious_mind wrote: Tue Apr 23, 2024 1:14 pm We also played the same test at 30 seconds per move, and Eric is currently working on adding another 6 computers.
Hi Nick, hello all,

First, thank you Nick for resuming the nice work you started some time ago with a first set of five "past centuries" master games.
Also, thanks for the kind words about my contribution; the joint work was a great pleasure.

I attach here the scores and ranking at 30 secs/move, with some weak computers' scores added.
Of course, move randomness is more present with (very) weak chess computers, so the final rating is to be taken with a grain of salt; nevertheless, I consider the resulting ranking very accurate, or at least it is what I would have expected. It is well known that the Fidelity Chess Challenger 7 and the Novag CC Super System III are very close in strength, and that is what we get here. We know the CC MK 1 and the Delta 1 are quite weak, and I trust the Delta 1 to be slightly better. We also know the Delta 1 is a messed-up copy of the Boris program, so the ranking order looks perfectly fine. As you can see from the attached scores, results can be negative; of course this can occur through enough blunders or other weak moves.

And at the other end of the ranking list, we already knew Nick's approach for his first set of five "past centuries" master games was impressively accurate; I strongly trust it is further enhanced by this ten-game assessment and the refined move scores. Beyond results, I think playing the test games is also much fun thanks to the Elo Sneak capability. I only shared a rough idea of displaying a "work in progress" Elo score; Nick deserves all the kudos for the nice implementation.

Warm regards to all,
Eric
Attachments: Renaissance_Active2_m.jpg

Re: Spacious Mind's Renaissance School

Post by dbenchebra »

Hello Nick,

you are doing great work, thank you for all the effort.

If it can be of any help, I would be happy to run some tests with some of my machines, let me know.

Re: Spacious Mind's Renaissance School

Post by gordonr »

This is indeed great work. I see that some of the tested computers have an emulation within CB Emu Pro. So I was wondering if any of this effort could be automated.

The test games can be converted to a list of EPDs. Then within MessChess/Arena, it's possible to do auto analysis of the list of EPDs. I get output such as:

Code:

1 Nh7;                 
    Searching move: Ng5xh7
    Best move (Novag Star Diamond (v1.04)): Qe1-h4
    Not found in: 16:40
   Best move: Qe1-h4
   24-Apr-24 11:34:30 AM, Time for this analysis: 00:00:55, Rated time: 16:40

 2 Qc3;                 
    Searching move: Qe1-c3
    Best move (Novag Star Diamond (v1.04)): Qe1-d2
    Not found in: 16:40
   Best move: Qe1-d2
   24-Apr-24 11:35:21 AM, Time for this analysis: 00:00:50, Rated time: 33:20

Here I can see what the Star Diamond chose for each position along with the analysis time. It's then possible to write a short piece of code that will parse this output and score the moves, etc., along with the other calculations.

I've only just looked at this possibility and there may be some issues to consider. But I'm curious if anyone can think of issues or drawbacks. For example, when stepping through a game, maybe some computers don't clear their hash tables, etc., and could use them in the next position. I'm sure the above automation will clear hash tables, etc. And remember, as a test we could compare doing some of the same computers both manually and automated to see if there are any differences.
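As an illustration of the parsing step mentioned above, here is a rough Python sketch that pulls the expected move, the engine's chosen move and the analysis time out of a log in the format shown. The field names come from that sample, other Arena/MessChess versions may format things differently, the log file name is hypothetical, and the actual scoring against the spreadsheet's move values is left out:

Code:

import re

# One record per test position, as in the sample output above.
RECORD = re.compile(
    r"Searching move:\s*(?P<expected>\S+)"
    r".*?Best move\s*\((?P<engine>.*?)\):\s*(?P<chosen>\S+)"
    r".*?Time for this analysis:\s*(?P<secs>[\d:]+)",
    re.DOTALL,
)

def norm(move):
    # crude normalisation so "Ng5xh7" and "Ng5-h7" compare equal
    return move.replace("x", "").replace("-", "").rstrip(";")

def parse_log(text):
    """Yield (expected move, engine move, analysis time, agrees?) per position."""
    for m in RECORD.finditer(text):
        agrees = norm(m["expected"]) == norm(m["chosen"])
        yield m["expected"], m["chosen"], m["secs"], agrees

with open("star_diamond_analysis.log") as f:     # hypothetical file name
    for expected, chosen, secs, agrees in parse_log(f.read()):
        # Mapping each move to its spreadsheet score would go here.
        print(expected, chosen, secs, "match" if agrees else "different")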

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

dbenchebra wrote: Wed Apr 24, 2024 7:32 am Hello Nick,

you are doing great work, thank you for all the effort.

If it can be of any help, I would be happy to run some tests with some of my machines, let me know.
Yes of course, please test any computer you want to, and thank you for your offer. To avoid duplicated work, at least for now while we are at the beginning stage, and to grow the list quickly and efficiently, please let us all know which ones you are planning to test so that others can avoid them. Also, please include the level settings in your report, e.g. on a particular dedicated computer the TM level might be A4 and the 30 seconds level A2, so that everyone can see what settings were used.

If we then share the results and moves played here in the forum, we can periodically update a master list for download so that everyone has access to the same information.

Errors are always possible, so if anyone wants to test their own dedicated computer against the results on the list, that is of course also OK, as any move deviations can be found and discussed. But preferably, adding new computers to the list should be priority #1.

BTW, 5 of the tests are still the original tests, so any games played originally can be copied and pasted into the new tests, which means only the 5 new ones I created need to be played. By copying I mean you have to copy the moves into the computer test tab so it can recalculate the new scores before saving and recording those games. Also, based on Paul's comments about one of the test games being too long and boring, 4 moves were removed, as they were deemed redundant by both Eric and me. We also removed some moves in game 1 and game 5 of the original test for the same redundancy reasons. This allows for maximum efficiency of your time without compromising the rating system.

Best regards
Nick

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

Tibono2 wrote: Wed Apr 24, 2024 6:58 am Beyond results, I think playing the test games is also much fun thanks to the Elo Sneak capability.

Warm regards to all,
Eric
Hi Eric,
I never mentioned it to you, but when I created the ELO Sneak I also used it as a backward way of checking the scoring results of the original calculations, to make sure they matched the original scores and no errors were creeping in. So as a double check in reverse it was a useful exercise as well.

Best regards
Nick

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

I know it's work, but now that we have the 30-second games, it would be extremely useful to also have 3-minute games for Boris, CC7, CC10, Delta-1 and Conic: A) because, except possibly for Boris, TM-level ratings for those computers don't exist anywhere in the world today, and B) it would be good to see if there is improvement with more time.

Also, something to consider with the Chess-Master Diamond results: Level 8 is barely slower than Level 3 (30 seconds) when you play it, so that may be a big reason it does not conform to the more time = improved results pattern. The same applies to the Enterprise S; its level C does not really conform to 3 minutes per move and at best averages out at 1 to 1 1/2 minutes, so there is not much difference from 30 seconds. All the others tested at 3 minutes conform to the time-per-move settings.

The Chess-Master Diamond does have a tournament setting, but when activated it conforms to international chess rules and will not allow moves to be taken back, so it cannot be used for these tests. There might be a workaround: deactivating that level after each move, taking back the move and reactivating it. But I have not tried it, and I don't really have time at the moment to try it out.

Best regards
Nick

Re: Spacious Mind's Renaissance School

Post by gordonr »

spacious_mind wrote: Tue Apr 23, 2024 5:15 pm What EloStat and other systems cannot do is lay out, right next to each other, move by move, what the computers played, compare them and even rate their move weaknesses move by move. You can see the good moves and blunders forever, right next to each other.
I downloaded the zip of test results. When I look at e.g. "De Castellvi - Vinoles 1475.xlsx", how do I tell what was a good move and what was a blunder? How do I see the scores for each move that a given computer chose?

For each test game, is there a list of candidate moves along with the scores?

thanks
Gordon

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

gordonr wrote: Wed Apr 24, 2024 10:54 am This is indeed great work. I see that some of the tested computers have an emulation within CB Emu Pro. So I was wondering if any of this effort could be automated.

The test games can be converted to a list of EPDs. Then within MessChess/Arena, it's possible to do auto analysis of the list of EPDs. I get output such as:

Code:

1 Nh7;                 
    Searching move: Ng5xh7
    Best move (Novag Star Diamond (v1.04)): Qe1-h4
    Not found in: 16:40
   Best move: Qe1-h4
   24-Apr-24 11:34:30 AM, Time for this analysis: 00:00:55, Rated time: 16:40

 2 Qc3;                 
    Searching move: Qe1-c3
    Best move (Novag Star Diamond (v1.04)): Qe1-d2
    Not found in: 16:40
   Best move: Qe1-d2
   24-Apr-24 11:35:21 AM, Time for this analysis: 00:00:50, Rated time: 33:20

Here I can see what the Star Diamond chose for each position along with the analysis time. It's then possible to write a short piece of code that will parse this output and score the moves, etc. along with the other calculations.

I've only just looked at this possibility and there may be some issues to consider. But I'm curious if anyone can think of issues or drawbacks. For example, when stepping through a game, maybe some computers don't clear their hash tables, etc., and could use them in the next position. I'm sure the above automation will clear hash tables, etc. And remember, as a test we could compare doing some of the same computers both manually and automated to see if there are any differences.
Hi Gordon,

Thanks, that's a great question that I knew would be asked eventually.

The only reason I created these tests is to have a common base for dedicated chess computers and old chess programs, such as those for the Commodore, DOS and early Windows, which people loved and still play manually and nostalgically whenever they find one they can get to work on today's Windows systems. As a result, I created a manual test. There are plenty of chess engine automation tournaments and rating lists for people who enjoy automation. My tests are not needed for those.

I don't want this to become another one of those; after all, true chess is played by humans picking up a piece and putting it down. :)

This is why I use human games and put the emphasis on the human aspect, i.e. the players who played them and who they were, and why I also want human players to try the tests out themselves.

BTW, Frank's great work ensuring that the history of these great desktop machines continues is not something I sneer at; I appreciate the emulations as well. But automating in Arena and Winboard does not work for me, as you cannot guarantee the moves are the same as when you, the human, pick up and put down the piece on your dedicated computer at home, or enter it on a computer screen as with a DOS program, and that would cause move deviations that I don't want contaminating these lists.

Purist regards.
Nick

Re: Spacious Mind's Renaissance School

Post by spacious_mind »

gordonr wrote: Wed Apr 24, 2024 1:21 pm
spacious_mind wrote: Tue Apr 23, 2024 5:15 pm What EloStat and other systems cannot do is lay out, right next to each other, move by move, what the computers played, compare them and even rate their move weaknesses move by move. You can see the good moves and blunders forever, right next to each other.
I downloaded the zip of test results. When I look at e.g. "De Castellvi - Vinoles 1475.xlsx", how do I tell what was a good move and what was a blunder? How do I see the scores for each move that a given computer chose?

For each test game, is there a list of candidate moves along with the scores?

thanks
Gordon
Yes, of course. Go to row 9 and click on the 1st Black move; you will see it has a dropdown list. Pick the move you or your computer played. If you want a sneak peek, a score will show in the sneak-peek dropdown after you have entered the move. Now go to column D to see the move the original human played. Now that you have completed the first move, take back the move your computer played on your computer (not on the list) and enter the move the human played instead. This forces your computer to calculate the White move, and you keep doing this throughout the test. If the computer plays a different move from the human player, you record it using the dropdowns, then simply correct the move on your computer and let it calculate the next move, and so on.

At the end of the test, go to the correct time tab, e.g. the 30 seconds tab, and in column A you will see all the moves recorded and the final rating for the game. Columns A and B are locked to avoid corruption, so insert a new column anywhere after column C and copy & paste the formatting and values (text) into that column, and you are done. You have now completed a game. Whenever you paste, don't paste formulas, as that will show errors or 0's since the formulas are locked.

The sneak-peek data, if you check it, will change move by move until you come to the last move; once that is entered, it matches what you will see as the final score in column A of the other tabs.

Remember, this is a spreadsheet, so I have to play by spreadsheet rules when using it.

As for the human players' moves: once you have played the first game, you can keep them visible and save the file that way, so you don't have to unhide them again. The only reason I hid them is for human players doing the test, so they are not tempted to enter those particular moves.

For maintenance when finished (this is the reason you see two tests showing), I copy and paste (as values) the move rows from the other tests, which are blank and only show a - (minus), onto the first test. This quickly resets the start positions for the next computer test and is faster than going to each dropdown and picking the - every time to restart. If you don't do this, you run the risk, through being distracted, of leaving behind a previous move that differs from what the new computer actually played.

I know it sounds complicated but it really is very quick and easy once you have done one.

Best regards
Nick

Re: Spacious Mind's Renaissance School

Post by gordonr »

Hi Nick,

Thanks for your replies; I understand where you're coming from. Like yourself and most others on here, I find computer chess is largely about interest and fun, so we often test computers in ways that appeal to us personally, in addition to gathering some useful data.

I will manually test Mephisto Atlanta using your test method/games and report the results. I will not try to automate any of your tests. If my curiosity for test automation continues, I will keep it separate and use my own set of test positions, scoring, results, etc.

cheers
Gordon