Comments (13)
Would really appreciate it. I've been at this for hours, and I have no idea what kind of typographical errors are causing the program to hang.
It would also be nice to have a feature to read in only a few games rather than the entire dataset, something like read_pgn("millionbase.pgn", ngames = 1000). Even better if you can take a random sample of the games!
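For the sampling part, here's a rough sketch in plain R of how it could work without loading the whole file (this is not pigeon's implementation, and `sample_pgn_games` is a made-up name). It assumes each game starts at an `[Event ` header line, which is true for standard PGN exports, and uses reservoir sampling so memory stays bounded at n games:

```r
# Reservoir-sample n games from a PGN file, streaming line by line.
# A sketch only -- slow compared to a compiled reader, but constant memory.
sample_pgn_games <- function(path, n = 1000) {
  con <- file(path, "r"); on.exit(close(con))
  reservoir <- vector("list", n)  # the n games currently kept
  current <- character(0)         # lines of the game being accumulated
  seen <- 0L                      # games encountered so far
  flush_game <- function() {
    if (length(current) == 0) return(invisible(NULL))
    seen <<- seen + 1L
    if (seen <= n) {
      reservoir[[seen]] <<- current
    } else {
      # keep game with probability n/seen, replacing a random slot
      j <- sample.int(seen, 1L)
      if (j <= n) reservoir[[j]] <<- current
    }
  }
  while (length(line <- readLines(con, n = 1L, warn = FALSE)) == 1L) {
    if (startsWith(line, "[Event ") && length(current) > 0) {
      flush_game()
      current <- character(0)
    }
    current <- c(current, line)
  }
  flush_game()  # don't drop the last game in the file
  reservoir[seq_len(min(seen, n))]
}
```

Each element of the result is the raw lines of one sampled game, ready to hand to a parser.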
from pigeon.
I heartily disagree! I don't ever have cause to work with these files and would not have known about the millionbase one, the errors in the others, or the desire to read portions of (or sample from) them. And the testing is amazingly helpful. So I'd say you're absolutely a major contributor :-)
I implemented a C++-backed (so compilation again…yay) buffered file reader for this, a fast games-per-file count function, and a function that can read portions or samples of a file without reading in the whole thing. I also added some error checking for malformed game records.
I need to sleep on the changes and review them in the morning, as I think the game-record validator is overly aggressive, and I still need to write some unit tests.
I've just pushed them up, so if you're game (ugh, pun not intended), test as you have time. I'll jump back on this bright and early.
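For the curious, the games-per-file count boils down to something like this plain-R sketch (not the pigeon internals, and `count_pgn_games` is a made-up name) -- the C++ version is just a much faster take on the same chunked loop:

```r
# Count games in a PGN file by counting '[Event ' header lines,
# reading in fixed-size chunks so memory stays bounded for huge files.
count_pgn_games <- function(path, chunk_lines = 100000L) {
  con <- file(path, "r"); on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0) break
    total <- total + sum(startsWith(lines, "[Event "))
  }
  total
}
```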
I'm also thinking of adding a couple of converter functions. I poked around a bit and there are some modern formats, including JSON and one based on SQLite3 (which was a great idea by whoever thought of that), which might be useful conversion targets. A converter to SQLite that avoids loading the full file into RAM would also make it possible to use dplyr ops on the resultant file directly (without reading it into RAM).
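The streaming-converter idea could look roughly like this, assuming the DBI and RSQLite packages (`pgn_to_sqlite` is a made-up name, and the parser here is a trivial stand-in that keeps only the Event tag -- a real converter would collect the full tag roster plus moves). Chunks are appended to the table as they're read, so the whole PGN never has to fit in RAM, and dplyr/dbplyr can then query the table directly:

```r
library(DBI)

# Trivial stand-in parser: extracts just the Event tag from each game.
parse_headers <- function(lines) {
  ev <- lines[startsWith(lines, "[Event ")]
  data.frame(Event = sub('^\\[Event "(.*)"\\]$', "\\1", ev),
             stringsAsFactors = FALSE)
}

pgn_to_sqlite <- function(path, db_path, chunk_lines = 100000L) {
  con <- file(path, "r"); on.exit(close(con), add = TRUE)
  db  <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(db), add = TRUE)
  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0) break
    # append this chunk's games; the table is created on first write
    dbWriteTable(db, "games", parse_headers(lines), append = TRUE)
  }
  invisible(db_path)
}
```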
I now also kinda want to make some "starting 'n' moves" graph visualizations, since I now know there are ginormous data files spanning years that could make grouped vis kinda cool.
Wow. Some of those are really badly formatted internally (likely encoding issues).
Some initial stabs at it are proving not as quick as I had hoped, but lemme poke at it a bit more later today and I'll report back.
I've got something brewing. Just need to poke it a bit more. I'm likely going to dump the C library I was using if this works, as I think I figured out a way to do it with just a couple of other R packages.
Just pushed up a new version that should work (but I don't work with these files so your aid in testing will be greatly appreciated).
I added the features you noted and also added a pgn_count() function.
I still need to add more error checking / docs / etc., but if you could give this a go it'd be most appreciated!
I don't show it reading in the whole thing as I didn't have 8m to spare :-)
```r
library(pigeon)

pgn_fils <- list.files("~/Data/KingBase2017-pgn", "\\.pgn$", full.names=TRUE)

pgn_fils[[1]]
## [1] "/Users/bob/Data/KingBase2017-pgn/KingBase2017-A00-A39.pgn"

pgn_count(pgn_fils[[1]])
## [1] 235772

xdf <- read_pgn(pgn_fils[[1]], 20)

glimpse(xdf)
## Observations: 20
## Variables: 12
## $ Event     <chr> "83. ch-BLR 2017", "83. ch-BLR 2017", "77. ch-ARM HL 2017", "15. Delhi Open...
## $ Site      <chr> "Minsk BLR", "Minsk BLR", "Erevan ARM", "New Delhi IND", "Wijk aan Zee NED"...
## $ Date      <chr> "2017.01.16", "2017.01.16", "2017.01.16", "2017.01.16", "2017.01.16", "2017...
## $ Round     <chr> "7.4", "7.2", "5.4", "10.7", "3.5", "10.27", "10.6", "9.2", "5.3", "5.5", "...
## $ White     <chr> "Aleksandrov, A", "Fedorov, Alex", "Grigoryan, K2", "Deviatkin, A", "Eljano...
## $ Black     <chr> "Azarov, Sergei", "Nikitenko, M", "Pashikian, A", "Murshed, N", "Harikrishn...
## $ Result    <chr> "1/2-1/2", "1-0", "1-0", "1-0", "1/2-1/2", "0-1", "1/2-1/2", "1/2-1/2", "1-...
## $ WhiteElo  <chr> "2565", "2576", "2571", "2499", "2755", "2314", "2507", "2562", "2592", "25...
## $ BlackElo  <chr> "2594", "2395", "2607", "2444", "2766", "2244", "2448", "2432", "2516", "25...
## $ ECO       <chr> "A15", "A26", "A29", "A00", "A34", "A29", "A00", "A12", "A05", "A07", "A21"...
## $ EventDate <chr> "2017.01.10", "2017.01.10", "2017.01.12", "2017.01.09", "2017.01.14", "2017...
## $ Moves     <list> [<"Nf3", "Nf6", "c4", "b6", "b3", "e6", "Bb2", "Bb7", "e3", "d5", "Be2", "...

set.seed(20171102)

xdf <- read_pgn(pgn_fils[[1]], n = 50, sample = TRUE)

glimpse(xdf)
## Observations: 50
## Variables: 12
## $ Event     <chr> "3. V Nabokov Memorial", "TCh-POR 1. Div", "AUT Ch 2015", "Bundesliga 1999-...
## $ Site      <chr> "Kiev UKR", "Torres Vedras POR", "Pinkafeld AUT", "Neukoelln GER", "Gyula o...
## $ Date      <chr> "2005.04.21", "2011.07.30", "2015.07.26", "1999.11.06", "1996.??.??", "1993...
## $ Round     <chr> "3", "8.8", "2.2", "3", "7", "9", "7", "7", "6", "7.15", "4", "5", "5.5", "...
## $ White     <chr> "Stopkin, Vladimir", "Eggert, Alberto", "Menezes, Christoph", "Franke, Joha...
## $ Black     <chr> "Kernazhitsky, Leonid", "Pires, Gustavo", "Kreisl, Robert", "Stolz, Mike", ...
## $ Result    <chr> "1/2-1/2", "0-1", "1/2-1/2", "1/2-1/2", "0-1", "1/2-1/2", "0-1", "1-0", "1/...
## $ WhiteElo  <chr> "2337", "2161", "2299", "2272", "2335", "2625", "2222", "2403", "2383", "21...
## $ BlackElo  <chr> "2345", "2050", "2454", "2308", "2360", "2515", "2376", "2214", "2188", "22...
## $ ECO       <chr> "A04", "A30", "A35", "A08", "A26", "A34", "A09", "A17", "A11", "A05", "A36"...
## $ EventDate <chr> "2005.04.19", "2011.07.23", "2015.07.25", "1999.10.08", NA, NA, "2000.10.21...
## $ Moves     <list> [<"Nf3", "d6", "g3", "e5", "d3", "f5", "Bg2", "Nf6", "c3", "Nc6", "O-O", "...
```
Fantastic work. I imagine I don't have enough RAM to read in millionbase.pgn. I get the following error:
```r
pgn_count("millionbase.pgn")
rsession(14338,0x7fffc05e43c0) malloc: *** mach_vm_map(size=18446744072537481216) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
```
With some KingBase files, I get the following error:
```r
king <- read_pgn("KingBase2017-E60-E99.pgn")
Error in meta[, 1] : incorrect number of dimensions

king <- read_pgn("KingBase2017-A00-A39.pgn")
Error in meta[, 1] : incorrect number of dimensions
```
With others, it works perfectly:
```r
king <- read_pgn("KingBase2017-A40-A79.pgn", 40)
```
This works fine.
Really impressed with these features, particularly the sample one. Incredible work. I have been wanting a similar "sample" feature for data.table (read in random rows from some underlying CSV or data file) and it has eluded me for a year.
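For what it's worth, the CSV version of that wish can be faked in base R with two streaming passes (a sketch of the idea, not a data.table feature; `sample_csv_rows` is a made-up name). It's slow compared to fread(), but memory stays bounded:

```r
# Sample n data rows from a CSV without loading the whole file:
# pass 1 counts the rows, pass 2 keeps only the sampled row numbers.
sample_csv_rows <- function(path, n) {
  # pass 1: count data rows (excluding the header)
  con <- file(path, "r")
  n_rows <- 0L
  repeat {
    chunk <- readLines(con, n = 100000L, warn = FALSE)
    if (length(chunk) == 0) break
    n_rows <- n_rows + length(chunk)
  }
  close(con)
  n_rows <- n_rows - 1L
  keep <- sort(sample.int(n_rows, min(n, n_rows)))
  # pass 2: keep the header plus the sampled row numbers
  con <- file(path, "r")
  header <- readLines(con, n = 1L)
  out <- character(length(keep)); got <- 0L; row <- 0L
  while (length(line <- readLines(con, n = 1L, warn = FALSE)) == 1L) {
    row <- row + 1L
    if (got < length(keep) && row == keep[got + 1L]) {
      got <- got + 1L
      out[got] <- line
    }
  }
  close(con)
  read.csv(text = c(header, out), stringsAsFactors = FALSE)
}
```

This only works cleanly when no field contains an embedded newline; a robust version would have to track quoting state.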
Thx for testing! I'm not keen on the slurp-it-all-into-RAM approach either, so lemme work on the idea I had to fix that. I was focused more on the self-parsing than anything else this go-round. If you wouldn't mind, could you add yourself to the DESCRIPTION file as a contributor? I can also do that if you tell me how you'd like to be cited (i.e., how much name/email/etc. you'd like revealed).
I forked it and added myself as a contributor in the DESCRIPTION file, although I really didn't do much! Just really glad you've done all this work. Thanks again!
Wow -- can't believe how much you got done. Incredible work. I'm game to bug test, but won't be able to do much with the code because it's now way over my head. Couple of bugs:
Running a sample on millionbase does not eat up all my RAM, but it returns an empty data frame:
```r
millionbase <- read_pgn("millionbase.pgn", 1000, sample = TRUE)
nrow(millionbase)
[1] 0
```
pgn_count() suffers from a similar problem:
```r
pgn_count("Downloads/millionbase.pgn")
[1] 0
```
I suspect this may be a RAM issue lurking in the background?
Running pgn_count() on all the KingBase files throws errors on two of them (KingBase2017-A00-A39.pgn and KingBase2017-E60-E99.pgn):
```r
files <- list.files(path="Downloads/KingBase", pattern="*.pgn", full.names=TRUE, recursive=FALSE)
lapply(files, function(x) {
  try(pgn_count(x))
})
[[1]]
[1] "Error in int_pgn_count(path) : \n Line number 1 in file \"/Users/edefilippis/Downloads/KingBase/KingBase2017-A00-A39.pgn\" exceeds the maximum length of 2^24-1.\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<io::error::line_length_limit_exceeded in int_pgn_count(path): Line number 1 in file "/Users/edefilippis/Downloads/KingBase/KingBase2017-A00-A39.pgn" exceeds the maximum length of 2^24-1.>

[[2]]
[1] 160881

[[3]]
[1] 37666

[[4]]
[1] 191230

[[5]]
[1] 252193

[[6]]
[1] 172712

[[7]]
[1] 121401

[[8]]
[1] 104075

[[9]]
[1] 85610

[[10]]
[1] 144171

[[11]]
[1] 126977

[[12]]
[1] 51910

[[13]]
[1] 92652

[[14]]
[1] 56075

[[15]]
[1] "Error in int_pgn_count(path) : \n Line number 1 in file \"/Users/edefilippis/Downloads/KingBase/KingBase2017-E60-E99.pgn\" exceeds the maximum length of 2^24-1.\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<io::error::line_length_limit_exceeded in int_pgn_count(path): Line number 1 in file "/Users/edefilippis/Downloads/KingBase/KingBase2017-E60-E99.pgn" exceeds the maximum length of 2^24-1.>
```
Other than that, everything is running a lot faster. Great work.
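One guess about those two failing files: a "line 1 exceeds the maximum length" error usually means the reader never finds the newline byte it expects, which is exactly what old-Mac \r-only line endings would cause. A quick way to check what a file actually uses (this is just a diagnostic sketch, not a pigeon function, and `line_endings` is a made-up name):

```r
# Inspect the first chunk of a file's raw bytes and tally which
# line-ending convention(s) it uses: LF (Unix), CRLF (Windows), or
# bare CR (classic Mac).
line_endings <- function(path, n_bytes = 65536L) {
  raw <- readBin(path, "raw", n = n_bytes)
  lf   <- sum(raw == as.raw(0x0a))
  cr   <- sum(raw == as.raw(0x0d))
  # count CR bytes immediately followed by LF
  crlf <- sum(raw[-length(raw)] == as.raw(0x0d) & raw[-1] == as.raw(0x0a))
  c(LF = lf - crlf, CRLF = crlf, CR_only = cr - crlf)
}
```

A file reporting only CR_only endings would look like one enormous "line 1" to any LF-based reader.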
Hrm. When you get a chance, would you be able to paste the output of devtools::session_info() from a session that has pigeon loaded?
It looks like we're both running macOS, and this is what I get for a similar pgn_count() test:
```r
library(purrr)
library(stringi)
library(crayon)
library(pigeon)

list.files("~/Data/KingBase2017-pgn", "\\.pgn$", full.names=TRUE) %>%
  walk(~{
    cat(green(basename(.x)))
    cat(green(": "))
    cat(yellow(stri_pad_left(scales::comma(pgn_count(.x)), 8, " ")), "\n")
  })
## KingBase2017-A00-A39.pgn:  235,772
## KingBase2017-A40-A79.pgn:  160,881
## KingBase2017-A80-A99.pgn:   37,666
## KingBase2017-B00-B19.pgn:  191,230
## KingBase2017-B20-B49.pgn:  252,193
## KingBase2017-B50-B99.pgn:  172,712
## KingBase2017-C00-C19.pgn:  121,401
## KingBase2017-C20-C59.pgn:  104,075
## KingBase2017-C60-C99.pgn:   85,610
## KingBase2017-D00-D29.pgn:  144,171
## KingBase2017-D30-D69.pgn:  126,977
## KingBase2017-D70-D99.pgn:   51,910
## KingBase2017-E00-E19.pgn:   92,652
## KingBase2017-E20-E59.pgn:   56,075
## KingBase2017-E60-E99.pgn:  121,516
```
Holy buckets! millionbase is like ~1.5GB! (finally got it extracted from the EXE I found).
```r
system.time(tmp <- pgn_count("~/Data/millionbase-2.22.pgn"))
##    user  system elapsed
##  26.468   0.355  26.614

tmp
## [1] 2197188

system.time(tmp <- read_pgn("~/Data/millionbase-2.22.pgn", 10))
## Progress bar(s)
##    user  system elapsed
##  54.356   0.849  52.359

tmp
## # A tibble: 10 x 11
##    Event                Site            Date       Round White
##    <chr>                <chr>           <chr>      <chr> <chr>
##  1 Spring               Budapest open   1996.??.?? 1     Aadrians, M. (wh)
##  2 Masters              Hampstead       1998.03.28 1     Aagaard, J. (wh)
##  3 Arason               Hafnarfjordur   1997.12.14 1     Aagaard, J. (wh)
##  4 Kopenhagen open      Kopenhagen open 1993.??.?? 1     Aagaard, J. (wh)
##  5 Kopenhagen open      Kopenhagen open 1995.06.30 1     Aagaard, J. (wh)
##  6 Kopenhagen open      Kopenhagen open 1995.07.02 1     Aagaard, J. (wh)
##  7 Aruna                Kopenhagen      1997.05.02 1     Aagaard, J. (wh)
##  8 FSIM Juni            Budapest        1995.06.?? 1     Aagaard, J. (wh)
##  9 FSIM Juni            Budapest        1995.06.?? 1     Aagaard, J. (wh)
## 10 Drury Lane           Londen          1997.06.22 1     Aagaard, J. (wh)
## # ... with 6 more variables: Black <chr>, Result <chr>, BlackElo <chr>, ECO <chr>,
## #   Moves <list>, WhiteElo <chr>
```
^^ is the reason for the request for session info. The new functions only really eat RAM if the entirety of a giant file is being read in, but there's definitely a difference between our systems, which means something in the code is doing something system-dependent. I'm going to try this on a linux box and another macOS box in a bit.
Just saw this (sorry swamped with Finals). Will run some tests when I get some downtime!