
Comments (13)

DeFilippis commented on September 22, 2024

Would really appreciate it. I've been at this for hours. I have no idea what kind of typographic errors are causing the program to hang.

It would also be nice to have a feature to read in only a few games rather than the entire dataset. Something like read_pgn("millionbase.pgn", ngames = 1000). Even better if you could take a random sample of the games!

from pigeon.

hrbrmstr commented on September 22, 2024

I heartily disagree! I don't ever have cause to work with these files and would not have known about that million one or the errors on other ones nor the possible desire to read portions or sample from them. And, the testing is amazingly helpful. So, I'd say you're absolutely a major contributor :-)

I implemented a C++-backed (so compilation again…yay) buffered file reader for this, a fast games-per-file count function, and a function that can read in portions or samples from a file w/o reading in the whole file. Plus I added in some error checking for malformed game records.
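For anyone curious what "sampling without reading in the whole file" can look like, here's a rough base-R sketch (not pigeon's actual C++ implementation, and the function name is made up) that streams a PGN file and reservoir-samples games, assuming each game starts at an [Event tag line:

```r
# Base-R sketch (hypothetical, not pigeon's implementation) of sampling
# n games from a PGN file without loading it all: stream the file in
# chunks, treat each '[Event ' tag line as the start of a new game, and
# reservoir-sample the game texts so memory use stays proportional to
# n rather than to the file size.
sample_pgn_games <- function(path, n = 100, chunk_lines = 50000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  reservoir <- vector("list", n)
  seen <- 0L
  current <- character(0)

  keep_game <- function(game) {
    seen <<- seen + 1L
    if (seen <= n) {
      reservoir[[seen]] <<- game
    } else {
      j <- sample.int(seen, 1L)          # classic reservoir-sampling step
      if (j <= n) reservoir[[j]] <<- game
    }
  }

  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    for (ln in lines) {
      # a new '[Event ' tag closes out the game accumulated so far
      if (startsWith(ln, "[Event ") && length(current) > 0L) {
        keep_game(current)
        current <- character(0)
      }
      current <- c(current, ln)
    }
  }
  if (length(current) > 0L) keep_game(current)  # last game in the file
  reservoir[seq_len(min(n, seen))]
}
```

Reservoir sampling keeps memory bounded by n even on a multi-GB file, at the cost of one full sequential pass.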

I need to sleep on the changes and review them in the morning as I think the game record validator is over-aggressive and also need to write some unit tests.

I've just pushed them up, so if you're game (ugh, pun not intended), test as you have time. I'll jump back on this bright and early.

I'm also thinking of adding a couple of converter functions. I poked around a bit and there are some modern formats, including JSON and one based on SQLite3 (which was a great idea by whoever thought of that), which might be useful as conversion targets. A streaming (non-full-RAM) converter to SQLite would also make it possible to use dplyr ops on the resultant file directly, without ever reading it all into RAM.
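The SQLite idea could look roughly like this. It's only a sketch of the append/query side, assuming the DBI, RSQLite, and dplyr (with dbplyr) packages are available; the toy batch data frame stands in for one parsed slice of a .pgn file:

```r
library(DBI)
library(dplyr)

# Toy stand-in for one parsed batch of game metadata; in practice each
# batch would come from parsing a slice of the .pgn file.
batch <- data.frame(
  White  = c("Aagaard, J.", "Aleksandrov, A"),
  Black  = c("Pires, Gustavo", "Azarov, Sergei"),
  Result = c("0-1", "1/2-1/2"),
  ECO    = c("A30", "A15"),
  stringsAsFactors = FALSE
)

con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))
dbWriteTable(con, "games", batch, append = TRUE)  # call once per batch

# dplyr translates this pipeline to SQL; SQLite does the filtering,
# so the full dataset never has to sit in R's RAM
games_tbl <- tbl(con, "games")
draws <- games_tbl %>% filter(Result == "1/2-1/2") %>% collect()
```

The win is that filter()/summarise() run inside SQLite, and only the collect()-ed result comes back into R.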

I now also kinda want to make some "starting 'n' moves" graph visualizations, since I now know there are ginormous data files spanning years that could make grouped vis kinda cool.

hrbrmstr commented on September 22, 2024

Wow. Some of those are really badly formatted internally (likely encoding issues).

Some initial stabs at it are proving to be not as quick as I had hoped but lemme poke at it a bit more later today and I'll report back.

hrbrmstr commented on September 22, 2024

I've got something brewing. Just need to poke at it a bit more. I'm likely going to dump the C library I was using if this works, as I think I figured out a way to do it with just a couple of other R packages.

hrbrmstr commented on September 22, 2024

Just pushed up a new version that should work (but I don't work with these files so your aid in testing will be greatly appreciated).

I added in the features you noted and also added a pgn_count() function.

I need to add more error checking / docs / etc but if you could give this a go it'd be most appreciated!

I don't show it reading in the whole thing as I didn't have 8m to spare :-)

library(pigeon)
library(tibble) # for glimpse()

pgn_fils <- list.files("~/Data/KingBase2017-pgn", "\\.pgn$", full.names=TRUE)

pgn_fils[[1]]
## [1] "/Users/bob/Data/KingBase2017-pgn/KingBase2017-A00-A39.pgn"

pgn_count(pgn_fils[[1]])
## [1] 235772

xdf <- read_pgn(pgn_fils[[1]], 20)

glimpse(xdf)
## Observations: 20
## Variables: 12
## $ Event     <chr> "83. ch-BLR 2017", "83. ch-BLR 2017", "77. ch-ARM HL 2017", "15. Delhi Open...
## $ Site      <chr> "Minsk BLR", "Minsk BLR", "Erevan ARM", "New Delhi IND", "Wijk aan Zee NED"...
## $ Date      <chr> "2017.01.16", "2017.01.16", "2017.01.16", "2017.01.16", "2017.01.16", "2017...
## $ Round     <chr> "7.4", "7.2", "5.4", "10.7", "3.5", "10.27", "10.6", "9.2", "5.3", "5.5", "...
## $ White     <chr> "Aleksandrov, A", "Fedorov, Alex", "Grigoryan, K2", "Deviatkin, A", "Eljano...
## $ Black     <chr> "Azarov, Sergei", "Nikitenko, M", "Pashikian, A", "Murshed, N", "Harikrishn...
## $ Result    <chr> "1/2-1/2", "1-0", "1-0", "1-0", "1/2-1/2", "0-1", "1/2-1/2", "1/2-1/2", "1-...
## $ WhiteElo  <chr> "2565", "2576", "2571", "2499", "2755", "2314", "2507", "2562", "2592", "25...
## $ BlackElo  <chr> "2594", "2395", "2607", "2444", "2766", "2244", "2448", "2432", "2516", "25...
## $ ECO       <chr> "A15", "A26", "A29", "A00", "A34", "A29", "A00", "A12", "A05", "A07", "A21"...
## $ EventDate <chr> "2017.01.10", "2017.01.10", "2017.01.12", "2017.01.09", "2017.01.14", "2017...
## $ Moves     <list> [<"Nf3", "Nf6", "c4", "b6", "b3", "e6", "Bb2", "Bb7", "e3", "d5", "Be2", "...

set.seed(20171102)
xdf <- read_pgn(pgn_fils[[1]], n = 50, sample = TRUE)

glimpse(xdf)
## Observations: 50
## Variables: 12
## $ Event     <chr> "3. V Nabokov Memorial", "TCh-POR 1. Div", "AUT Ch 2015", "Bundesliga 1999-...
## $ Site      <chr> "Kiev UKR", "Torres Vedras POR", "Pinkafeld AUT", "Neukoelln GER", "Gyula o...
## $ Date      <chr> "2005.04.21", "2011.07.30", "2015.07.26", "1999.11.06", "1996.??.??", "1993...
## $ Round     <chr> "3", "8.8", "2.2", "3", "7", "9", "7", "7", "6", "7.15", "4", "5", "5.5", "...
## $ White     <chr> "Stopkin, Vladimir", "Eggert, Alberto", "Menezes, Christoph", "Franke, Joha...
## $ Black     <chr> "Kernazhitsky, Leonid", "Pires, Gustavo", "Kreisl, Robert", "Stolz, Mike", ...
## $ Result    <chr> "1/2-1/2", "0-1", "1/2-1/2", "1/2-1/2", "0-1", "1/2-1/2", "0-1", "1-0", "1/...
## $ WhiteElo  <chr> "2337", "2161", "2299", "2272", "2335", "2625", "2222", "2403", "2383", "21...
## $ BlackElo  <chr> "2345", "2050", "2454", "2308", "2360", "2515", "2376", "2214", "2188", "22...
## $ ECO       <chr> "A04", "A30", "A35", "A08", "A26", "A34", "A09", "A17", "A11", "A05", "A36"...
## $ EventDate <chr> "2005.04.19", "2011.07.23", "2015.07.25", "1999.10.08", NA, NA, "2000.10.21...
## $ Moves     <list> [<"Nf3", "d6", "g3", "e5", "d3", "f5", "Bg2", "Nf6", "c3", "Nc6", "O-O", "...

DeFilippis commented on September 22, 2024

Fantastic work. I imagine I don't have enough RAM to read in millionbase.pgn. I get the following error:

pgn_count("millionbase.pgn")

rsession(14338,0x7fffc05e43c0) malloc: *** mach_vm_map(size=18446744072537481216) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

With some KingBase files, I get the following error:

king <- read_pgn("KingBase2017-E60-E99.pgn")

Error in meta[, 1] : incorrect number of dimensions

king <- read_pgn("KingBase2017-A00-A39.pgn")

Error in meta[, 1] : incorrect number of dimensions

With others, it works perfectly:

king <- read_pgn("KingBase2017-A40-A79.pgn", 40)

This works fine.

Really impressed with these features, particularly the sample one. Incredible work. I have been wanting a similar "sample" feature in data.table (read in random rows from some underlying CSV or data file) and it has eluded me for a year.
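On the data.table wish: as far as I know fread() has no row-sampling option, but a two-pass base-R sketch (function name made up) gets most of the way there:

```r
# Hypothetical sketch: sample n random data rows from a CSV without
# holding the whole file in memory. Pass 1 counts rows; pass 2 streams
# again and keeps only the pre-chosen line numbers.
sample_csv_rows <- function(path, n, chunk_lines = 50000L) {
  # Pass 1: count data rows without keeping them
  con <- file(path, open = "r")
  header <- readLines(con, n = 1L)
  total <- 0L
  repeat {
    got <- length(readLines(con, n = chunk_lines, warn = FALSE))
    if (got == 0L) break
    total <- total + got
  }
  close(con)

  wanted <- sort(sample.int(total, min(n, total)))

  # Pass 2: stream again, keeping only the chosen line numbers
  con <- file(path, open = "r")
  readLines(con, n = 1L)                       # skip the header line
  kept <- character(length(wanted))
  row <- 0L; k <- 1L
  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    while (k <= length(wanted) && wanted[k] <= row + length(lines)) {
      kept[k] <- lines[wanted[k] - row]        # index within this chunk
      k <- k + 1L
    }
    row <- row + length(lines)
  }
  close(con)

  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}
```

Two passes means two sequential reads, but peak memory is just the header plus the sampled lines. (It assumes one record per line, i.e. no embedded newlines inside quoted fields.)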

hrbrmstr commented on September 22, 2024

Thx for testing! I'm not keen on slurping it all up into RAM either, so lemme work on the idea I had to fix that. I was focused more on the self-parsing than anything else this go-round. If you wouldn't mind, could you add yourself to the DESCRIPTION file as a contributor? I can also do that if you tell me how you'd like to be cited (i.e. how much name/email/etc. you'd like revealed).

DeFilippis commented on September 22, 2024

I forked the repo and added a contributor entry to the DESCRIPTION file, although I really didn't do much! Just really glad you've done all this work. Thanks again!

DeFilippis commented on September 22, 2024

Wow -- can't believe how much you got done. Incredible work. I'm game to bug-test, but won't be able to do much with the code because it's now way over my head. A couple of bugs:

Running a sample on millionbase does not eat up all my RAM, but it returns an empty data frame:

millionbase <- read_pgn("millionbase.pgn", 1000, sample = TRUE)
nrow(millionbase)

[1] 0

pgn_count() suffers from a similar problem:

pgn_count("Downloads/millionbase.pgn")

[1] 0

I suspect this may be a RAM issue lurking in the background?

Running pgn_count on all KingBase files throws errors on two of them (KingBase2017-A00-A39.pgn and KingBase2017-E60-E99.pgn):

files <- list.files(path="Downloads/KingBase", pattern="\\.pgn$", full.names=TRUE, recursive=FALSE)
lapply(files, function(x) {
  try(pgn_count(x))
})

[[1]]
[1] "Error in int_pgn_count(path) : \n  Line number 1 in file \"/Users/edefilippis/Downloads/KingBase/KingBase2017-A00-A39.pgn\" exceeds the maximum length of 2^24-1.\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<io::error::line_length_limit_exceeded in int_pgn_count(path): Line number 1 in file "/Users/edefilippis/Downloads/KingBase/KingBase2017-A00-A39.pgn" exceeds the maximum length of 2^24-1.>

[[2]]
[1] 160881

[[3]]
[1] 37666

[[4]]
[1] 191230

[[5]]
[1] 252193

[[6]]
[1] 172712

[[7]]
[1] 121401

[[8]]
[1] 104075

[[9]]
[1] 85610

[[10]]
[1] 144171

[[11]]
[1] 126977

[[12]]
[1] 51910

[[13]]
[1] 92652

[[14]]
[1] 56075

[[15]]
[1] "Error in int_pgn_count(path) : \n  Line number 1 in file \"/Users/edefilippis/Downloads/KingBase/KingBase2017-E60-E99.pgn\" exceeds the maximum length of 2^24-1.\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<io::error::line_length_limit_exceeded in int_pgn_count(path): Line number 1 in file "/Users/edefilippis/Downloads/KingBase/KingBase2017-E60-E99.pgn" exceeds the maximum length of 2^24-1.>

Other than that, everything is running a lot faster. Great work.

hrbrmstr commented on September 22, 2024

Hrm. When you get a chance, would you be able to paste the output of devtools::session_info() from a session with pigeon loaded?

It looks like we're both running macOS and this is what I get for a similar pgn_count() test:

library(purrr)
library(stringi)
library(crayon)
library(pigeon)

list.files("~/Data/KingBase2017-pgn", "\\.pgn$", full.names=TRUE) %>%
  walk(~{
    cat(green(basename(.x)))
    cat(green(": "))
    cat(yellow(stri_pad_left(scales::comma(pgn_count(.x)), 8, " ")), "\n")
  })

## KingBase2017-A00-A39.pgn:  235,772 
## KingBase2017-A40-A79.pgn:  160,881 
## KingBase2017-A80-A99.pgn:   37,666 
## KingBase2017-B00-B19.pgn:  191,230 
## KingBase2017-B20-B49.pgn:  252,193 
## KingBase2017-B50-B99.pgn:  172,712 
## KingBase2017-C00-C19.pgn:  121,401 
## KingBase2017-C20-C59.pgn:  104,075 
## KingBase2017-C60-C99.pgn:   85,610 
## KingBase2017-D00-D29.pgn:  144,171 
## KingBase2017-D30-D69.pgn:  126,977 
## KingBase2017-D70-D99.pgn:   51,910 
## KingBase2017-E00-E19.pgn:   92,652 
## KingBase2017-E20-E59.pgn:   56,075 
## KingBase2017-E60-E99.pgn:  121,516 

hrbrmstr commented on September 22, 2024

Holy buckets! millionbase is like ~1.5GB! (finally got it extracted from the EXE I found).

system.time(tmp <- pgn_count("~/Data/millionbase-2.22.pgn"))
##    user  system elapsed 
##  26.468   0.355  26.614 
 
tmp
## [1] 2197188

system.time(tmp <- read_pgn("~/Data/millionbase-2.22.pgn", 10))
## Progress bar(s)
##    user  system elapsed 
##  54.356   0.849  52.359 

tmp
## # A tibble: 10 x 11
##              Event            Site       Date Round             White
##              <chr>           <chr>      <chr> <chr>             <chr>
##  1         Spring    Budapest open 1996.??.??     1 Aadrians, M. (wh)
##  2         Masters       Hampstead 1998.03.28     1  Aagaard, J. (wh)
##  3          Arason   Hafnarfjordur 1997.12.14     1  Aagaard, J. (wh)
##  4 Kopenhagen open Kopenhagen open 1993.??.??     1  Aagaard, J. (wh)
##  5 Kopenhagen open Kopenhagen open 1995.06.30     1  Aagaard, J. (wh)
##  6 Kopenhagen open Kopenhagen open 1995.07.02     1  Aagaard, J. (wh)
##  7           Aruna      Kopenhagen 1997.05.02     1  Aagaard, J. (wh)
##  8       FSIM Juni        Budapest 1995.06.??     1  Aagaard, J. (wh)
##  9       FSIM Juni        Budapest 1995.06.??     1  Aagaard, J. (wh)
## 10      Drury Lane          Londen 1997.06.22     1  Aagaard, J. (wh)
## # ... with 6 more variables: Black <chr>, Result <chr>, BlackElo <chr>, ECO <chr>,
## #   Moves <list>, WhiteElo <chr>

^^ is the reason for the request for session info. The new functions only really eat RAM if the entirety of a giant file is being read in, but there definitely is a difference between our systems, which means something in the code is doing something system-dependent. I'm going to try this on a Linux box and another macOS box in a bit.

DeFilippis commented on September 22, 2024

Just saw this (sorry swamped with Finals). Will run some tests when I get some downtime!

hrbrmstr commented on September 22, 2024
