Duplicates not found

yairjazz
Member
Posts: 4
Joined: Thu Nov 17, 2022 12:00 pm

Post by yairjazz »

I created a small database with 2 pairs of identical games (PGN attached). However, when I run the duplicate check, HCE Pro says there are no duplicates.
Attachment: duplicates.pgn.zip (1.93 KiB)
SchuBi
Member
Posts: 73
Joined: Wed Aug 01, 2007 3:23 am
Location: Recklinghausen

Re: Duplicates not found

Post by SchuBi »

The same happens on my Mac. Maybe the problem has to do with the annotations.
When I copy game #4, strip the annotations, and insert it into the database, HCE Pro does find a duplicate.
[Screenshot: duplicates.jpg]
Bxh7
Member
Posts: 12
Joined: Tue Aug 30, 2022 11:54 am
Location: Norway

Re: Duplicates not found

Post by Bxh7 »

Although the games are identical, they have different UTCTime and UTCDate tags.
It would be natural for HCE to treat the games as different when they appear to have been played on different dates and at different times.
Bxh7
Member
Posts: 12
Joined: Tue Aug 30, 2022 11:54 am
Location: Norway

Re: Duplicates not found

Post by Bxh7 »

After some further investigation:

The games were played on chess.com and then uploaded to LiChess for analysis.

And this is the crap about LiChess PGNs: they are not consistent. Some don't use the Date and Time tags at all and only use the UTCDate and UTCTime tags. Other times they use both. Some of them even miss the Site tag, which is part of the Seven Tag Roster and is mandatory in any PGN.

So what happened is that the LiChess PGN retained the original Date/Time tags from chess.com but also included the UTC equivalents, which were set to the download Date/Time, and both tag pairs eventually appeared in the resulting PGN, with different values.
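
A quick way to spot such games before importing them is to read the header of each game and compare the tags. The following is only a rough sketch, not anything HCE does: the function names `parse_tags` and `check_tags` are made up for illustration, and it assumes each tag pair sits on its own line. Note that Date and UTCDate can also legitimately differ by one day for games played around midnight, so a mismatch is a hint, not proof.

```python
import re

# The Seven Tag Roster from the PGN standard.
SEVEN_TAG_ROSTER = ("Event", "Site", "Date", "Round", "White", "Black", "Result")

# One tag pair per line, e.g. [UTCDate "2023.09.10"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"((?:[^"\\]|\\.)*)"\]\s*$')


def parse_tags(tag_section):
    """Return a dict of tag name -> value for one game's header lines."""
    tags = {}
    for line in tag_section.splitlines():
        match = TAG_PAIR.match(line.strip())
        if match:
            tags[match.group(1)] = match.group(2)
    return tags


def check_tags(tags):
    """List the header problems discussed in this thread (hypothetical helper)."""
    problems = [f"missing {name} tag" for name in SEVEN_TAG_ROSTER if name not in tags]
    if "Date" in tags and "UTCDate" in tags and tags["Date"] != tags["UTCDate"]:
        problems.append(f'Date {tags["Date"]} differs from UTCDate {tags["UTCDate"]}')
    return problems
```

Calling `check_tags(parse_tags(header_text))` on the header of a single game returns an empty list when nothing suspicious is found.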

Handle LiChess PGNs with some care. They also have other issues, like the handling of 960 castling rights. Maybe not important for you, but a huge issue for 960 players.

The HCE programmers could probably make some LiChess-specific workaround, but it should rather be a task for LiChess to clean up their act. A hundred million games played a month, and not being able to deliver a PGN according to standard is just silly.
yairjazz
Member
Posts: 4
Joined: Thu Nov 17, 2022 12:00 pm

Re: Duplicates not found

Post by yairjazz »

@Bxh7
Great detective work! Thanks for digging deep into this.
Next time I will make sure to take the games directly from chess.com.
mrudolf
HCE Developer
Posts: 988
Joined: Thu Dec 17, 2020 4:44 pm

Re: Duplicates not found

Post by mrudolf »

yairjazz wrote: Sun Sep 10, 2023 4:08 pm
@Bxh7
Great detective work! Thanks for digging deep into this.
Next time I will make sure to take the games directly from chess.com.

For annotated games we do indeed compare all the comments/tags, as it is impossible to guess what to ignore.

I don't see a good solution here (except for always downloading your games from one source). The basic idea of the duplicate finder was to avoid any data loss, so that users can be sure nothing important is removed. Perhaps we should add a fuzzy matching option, but I don't think many people would like to carefully review thousands of games.
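
To illustrate what a looser, moves-only comparison could look like outside of HCE, here is a rough Python sketch. It is not how HCE's duplicate finder works; the helper names (`split_games`, `move_key`, `find_move_duplicates`) are invented for this example. It assumes the usual PGN layout of header lines followed by movetext, ignores tags, brace comments, variations, NAGs and `!`/`?` suffixes, and does not handle `;` line comments. Games with identical moves but different players would also be reported, so the output is a list of candidates to review, not a list of games to delete.

```python
import re
from collections import defaultdict

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def split_games(pgn_text):
    """Split PGN text into (header, movetext) pairs, one per game."""
    games, tags, moves = [], [], []
    for line in pgn_text.splitlines():
        stripped = line.strip()
        # Header lines look like [Name "value"]; inline commands such as
        # [%eval 0.2] do not end in '"]' and are treated as movetext.
        if stripped.startswith("[") and stripped.endswith('"]'):
            if moves:  # a header after movetext starts the next game
                games.append(("\n".join(tags), " ".join(moves)))
                tags, moves = [], []
            tags.append(stripped)
        elif stripped:
            moves.append(stripped)
    if tags or moves:
        games.append(("\n".join(tags), " ".join(moves)))
    return games


def move_key(movetext):
    """Reduce movetext to a tuple of bare main-line moves."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop brace comments
    previous = None
    while previous != text:  # drop variations, innermost first
        previous = text
        text = re.sub(r"\([^()]*\)", " ", text)
    text = re.sub(r"\$\d+", " ", text)   # drop NAGs such as $1
    text = re.sub(r"\d+\.+", " ", text)  # drop move numbers "12." / "12..."
    tokens = (token.rstrip("!?") for token in text.split())
    return tuple(t for t in tokens if t and t not in RESULTS)


def find_move_duplicates(pgn_text):
    """Return groups of 1-based game numbers that share the same moves."""
    groups = defaultdict(list)
    for number, (_header, movetext) in enumerate(split_games(pgn_text), start=1):
        groups[move_key(movetext)].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]
```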
yairjazz
Member
Posts: 4
Joined: Thu Nov 17, 2022 12:00 pm

Re: Duplicates not found

Post by yairjazz »

In the meantime I've found a temporary workaround: a simple script that removes the UTCTime and UTCDate tags.
That helped HCE find all the duplicates in my current database.
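
For anyone who wants to try the same workaround, a minimal Python sketch of such a script could look like the one below. This is not the exact script mentioned above; the function names are made up, and it assumes each tag pair is on its own line, as in normal PGN export format.

```python
import re

# Matches a whole tag-pair line such as: [UTCDate "2023.09.10"]
# Only the two tags discussed in this thread are removed.
UTC_TAG_LINE = re.compile(r'^\[(?:UTCDate|UTCTime)\s+"[^"]*"\]\s*$')


def strip_utc_tags(pgn_text):
    """Return the PGN text without UTCDate/UTCTime tag lines."""
    kept = [line for line in pgn_text.splitlines() if not UTC_TAG_LINE.match(line)]
    return "\n".join(kept) + "\n"


def strip_utc_tags_file(source_path, target_path):
    """Read a PGN file and write a cleaned copy; the original is left untouched."""
    with open(source_path, encoding="utf-8") as source:
        cleaned = strip_utc_tags(source.read())
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(cleaned)
```

Writing to a separate file, for example `strip_utc_tags_file("games.pgn", "games_clean.pgn")` with file names of your choice, keeps the original download intact in case the UTC tags are needed later.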