mlr-org / farff

a faster arff parser

License: Other

R 73.49% C 26.51%

arff rweka

farff's Introduction

farff: A faster ARFF parser.

This is a subproject for better file handling with mlr and OpenML.

Installation instructions

Please install the proper CRAN releases in the usual way. If you absolutely have to install from here (you should not):

devtools::install_github("mlr-org/farff")

What is ARFF

ARFF files are like CSV files, with a little bit of added meta information in a header and standardized NA values. They are quite often used for machine learning data sets and were introduced for the WEKA machine learning Java toolbox.

RWeka's read.arff and write.arff already exist?

Several reasons motivated the development of farff:

The Java dependency of RWeka is annoying.
The I/O code in RWeka is pretty slow; reading files, at least, is much faster with farff.

How does it work?

library(farff)
# import arff format file
d = readARFF("iris.arff")
# export arff format file
writeARFF(iris, path = "iris.arff")

How does it work under the hood?

We read the ARFF header with pure R code.
We preprocess the data section a bit with custom C code and write the result into a temporary file TEMP.
The TEMP file, i.e., the data section, is parsed with readr::read_delim. Support for data.table::fread is planned for future releases.
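
The three steps can be illustrated with a minimal base-R sketch. It is not farff's implementation: `read_arff_sketch` is a hypothetical helper, plain R stands in for the custom C preprocessing, and `read.csv` stands in for `readr::read_delim`. It only handles small dense files with simple attribute names.

```r
# Hypothetical helper for illustration only; not part of farff.
read_arff_sketch = function(path) {
  lines = readLines(path, warn = FALSE)

  # Step 1: parse the header in pure R.
  data.start = grep("^\\s*@data\\s*$", lines, ignore.case = TRUE)[1]
  if (is.na(data.start)) stop("No @data line found in: ", path)
  header = lines[seq_len(data.start - 1)]
  attrs = grep("^\\s*@attribute\\s", header, ignore.case = TRUE, value = TRUE)
  col.names = sub("^\\s*@attribute\\s+'?([^' \t]+)'?\\s.*$", "\\1", attrs,
    ignore.case = TRUE)

  # Step 2: preprocess the data section (here: drop blank and comment lines)
  # and write the result into a temporary file.
  body = lines[-seq_len(data.start)]
  body = body[!grepl("^\\s*(%.*)?$", body)]
  tmp = tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeLines(body, tmp)

  # Step 3: parse the temporary file with a delimited-file reader.
  # "?" is the ARFF marker for missing values.
  read.csv(tmp, header = FALSE, col.names = col.names, na.strings = "?",
    quote = "'\"", strip.white = TRUE, stringsAsFactors = FALSE)
}

# Usage on a tiny hand-written file.
arff = tempfile(fileext = ".arff")
writeLines(c(
  "@relation toy",
  "@attribute x numeric",
  "@attribute 'y' numeric",
  "@attribute cls {a,b}",
  "@data",
  "1,2.5,a",
  "?,3,b"
), arff)
d = read_arff_sketch(arff)
str(d)
```

The sketch ignores the attribute types declared in the header; a real reader would use them to set column types and factor levels.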

farff's Issues

add jenkins test to download runs

sometimes NA-rows are introduced

While fixing my bugs, I found another bug here:

setOMLConfig(arff.reader = "RWeka")
dRWeka = getOMLDataSet(1418)
setOMLConfig(arff.reader = "farff")
dFarff = getOMLDataSet(1418) # an NA row is introduced after each row

str(dRWeka$data)
#'data.frame':  139 obs. of  9 variables:
# $ A1   : num  0.898 0.866 0.927 0.927 0.917 ...
# $ A2   : num  151 156 149 148 150 ...
# $ A3   : num  37 41 34 27 21 40 17 37 12 39 ...
# $ A4   : num  0.833 0.946 0.952 0.752 0.887 ...
# $ A5   : num  151.3 156.2 114.5 84.2 138.2 ...
# $ A6   : num  34 41 5 3 17 22 14 3 4 10 ...
# $ A7   : num  151 156 145 147 150 ...
# $ A8   : num  36 40 32 26 20 39 16 36 14 38 ...
# $ class: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

str(dFarff$data)
#'data.frame':  278 obs. of  9 variables:
# $ A1   : num  0.898 NA 0.866 NA 0.927 ...
# $ A2   : num  151 NA 156 NA 149 ...
# $ A3   : num  37 NA 41 NA 34 NA 27 NA 21 NA ...
# $ A4   : num  0.833 NA 0.946 NA 0.952 ...
# $ A5   : num  151 NA 156 NA 115 ...
# $ A6   : num  34 NA 41 NA 5 NA 3 NA 17 NA ...
# $ A7   : num  151 NA 156 NA 145 ...
# $ A8   : num  36 NA 40 NA 32 NA 26 NA 20 NA ...
# $ class: Factor w/ 2 levels "0","1": 2 NA 2 NA 2 NA 2 NA 2 NA ...

Support parsing strings

Is there a reason why a file needs to be given?

compiling warnings - segfault when using farff

I am not 100% sure whether this is a problem with farff or with my local installation, but when I install farff from GitHub I get a bunch of warnings:

> devtools::install_github("mlr-org/farff")
Downloading GitHub repo mlr-org/farff@master
Installing farff
'/usr/local/Cellar/r/3.2.3/R.framework/Resources/bin/R' --no-site-file  \
  --no-environ --no-save --no-restore CMD INSTALL  \
  '/private/var/folders/ll/m090_qx11tz6t5bd7mjpx3sr0000gp/T/RtmpJjKamu/devtools15a7f717b5875/mlr-org-farff-d8333d1'  \
  --library='/Users/joa/Library/R/3.2/library' --install-tests

* installing *source* package ‘farff’ ...
** libs
clang -I/usr/local/Cellar/r/3.2.3/R.framework/Resources/include -DNDEBUG -I/usr/local/include  -I/usr/local/opt/gettext/include -I/usr/local/opt/readline/include -I/usr/local/opt/openssl/include -I/usr/local/include  -I/usr/local/include   -fPIC  -g -O2  -c preproc_datatable.c -o preproc_datatable.o
preproc_datatable.c:58:13: warning: expression result unused [-Wunused-value]
           i+1;
           ~^~
1 warning generated.
clang -I/usr/local/Cellar/r/3.2.3/R.framework/Resources/include -DNDEBUG -I/usr/local/include  -I/usr/local/opt/gettext/include -I/usr/local/opt/readline/include -I/usr/local/opt/openssl/include -I/usr/local/include  -I/usr/local/include   -fPIC  -g -O2  -c tools.c -o tools.o
tools.c:3:10: warning: implicit declaration of function 'isspace' is invalid in C99 [-Wimplicit-function-declaration]
    if (!isspace(*s))
         ^
1 warning generated.
installing to /Users/joa/Library/R/3.2/library/farff/libs
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
** help
No man pages found in package  ‘farff’
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (farff)

Since these are just warnings perhaps I can ignore them, but then R crashes when I try to download an OpenML dataset:

> b = getOMLDataSet(61)
Data '61' file 'description.xml' found in cache.
Data '61' file 'dataset.arff' found in cache.
[1] "{Iris-setosa,Iris-versicolor,Iris-virginica}"
[1] "Iris-setosa,Iris-versicolor,Iris-virginica}"

[1] "Iris-versicolor,Iris-virginica}"

[1] "Iris-virginica}"

Loading required package: readr

 *** caught segfault ***
address 0x68, cause 'memory not mapped'

Traceback:
 1: g(.Call(c_rd_preproc, path, tmp.file, as.integer(header$line.counter)))
 2: farff::readARFF(file, show.info = FALSE)
 3: arff.reader(f$dataset.arff$path)
 4: getOMLDataSet(61)

Can you see if you can reproduce this? I'm using R 3.2.3, have updated all mlr-related packages to the latest versions, but I keep getting this error.

factors with levels T,F are converted to logical (like RWeka)

We should maybe add an option to deactivate this conversion to logical
(I am referring to 8e4ed73)

Some OpenML datasets can't be parsed

I get

Rscript -e 'library(farff); readARFF("1112.arff")'
Parse with reader=readr : 1112.arff
Loading required package: readr
Warning: 50001 parsing failures.
row col expected actual
1 X1 a double @DaTa
2 -- 1 columns 231 columns
3 -- 1 columns 231 columns
4 -- 1 columns 231 columns
5 -- 1 columns 231 columns
... ... ......... ...........
.See problems(...) for more details.
Error in colnames<-(*tmp*, value = header$col.names) :
'names' attribute [231] must be the same length as the vector [1]
Calls: readARFF -> colnames<-
In addition: Warning message:
Unnamed col_types should have the same length as col_names. Using smaller of the two.
Execution halted

Joaquin says:

Alright, after some experiments I found that the problem goes away if I
remove the features which have more than 15000 (nominal) values.

Maybe farff raises an internal error when it encounters such cases and
skips them, and hence the feature count won't match, which would explain
the error we see.

It happens for 1111, 1112 and 1114.

Quoted strings in nominal values

ds = getOMLDataSet(71) # no problems with RWeka
setOMLConfig(arff.reader = "farff")
ds = getOMLDataSet(71)

Data '71' file 'description.xml' found in cache.
Data '71' file 'dataset.arff' found in cache.
Error in consume(x, "^\s_}\s_", no.match.error = TRUE) :
Error while parsing factor levels in line:
@Attribute carbon {''B1of3'',''B2of3'',''B3of3''}

readARFF with a high cardinality factor does not work if the labels are too long

library(farff)
library(stringi)

set.seed(1)

n = 2000000
n_levels = 25000
label_length = 30

fac_levels = stri_rand_strings(n = n_levels, length = label_length)

# a high cardinality factor with "long" labels
dat1 = data.frame(huge_factor = factor(sample(fac_levels, size = n, replace = TRUE)))

# make the labels as short as possible
dat2 = dat1
levels(dat2$huge_factor) = abbreviate(fac_levels, minlength = 1)

# write arff files (successful for both!)
writeARFF(dat1, path = "datafile1.arff")
writeARFF(dat2, path = "datafile2.arff")

# reading the long label version takes a very long time and breaks in a strange way
dat3 = readARFF("datafile1.arff")
all.equal(dat1, dat3)

# the short label version works fine
dat4 = readARFF("datafile2.arff")
all.equal(dat2, dat4)

Sidenote:
This leads to errors when working with OpenML that are hard to debug: the dataset can be uploaded through the R interface without error, but the download then fails (or, in one case I had, seems to be caught in an infinite loop).

> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  de_DE.UTF-8                 
 tz       Europe/Berlin               
 date     2018-11-08                  

Packages --------------------------------------------------------------------------------------------------------------------------------------------
 package    * version date       source                             
 assertthat   0.2.0   2017-04-11 CRAN (R 3.5.0)                     
 backports    1.1.2   2017-12-13 CRAN (R 3.5.0)                     
 base       * 3.5.1   2018-07-05 local                              
 BBmisc       1.11    2018-11-07 Github (berndbischl/BBmisc@a5a4e45)
 checkmate    1.8.5   2017-10-24 CRAN (R 3.5.0)                     
 cli          1.0.1   2018-09-25 CRAN (R 3.5.0)                     
 compiler     3.5.1   2018-07-05 local                              
 crayon       1.3.4   2017-09-16 CRAN (R 3.5.0)                     
 data.table   1.11.8  2018-09-30 CRAN (R 3.5.0)                     
 datasets   * 3.5.1   2018-07-05 local                              
 devtools     1.13.6  2018-06-27 CRAN (R 3.5.0)                     
 digest       0.6.18  2018-10-10 CRAN (R 3.5.0)                     
 fansi        0.4.0   2018-10-05 CRAN (R 3.5.0)                     
 farff      * 1.0     2018-10-30 Github (mlr-org/farff@2e911b7)     
 graphics   * 3.5.1   2018-07-05 local                              
 grDevices  * 3.5.1   2018-07-05 local                              
 hms          0.4.2   2018-03-10 CRAN (R 3.5.0)                     
 memoise      1.1.0   2017-04-21 CRAN (R 3.5.0)                     
 methods    * 3.5.1   2018-07-05 local                              
 pillar       1.3.0   2018-07-14 CRAN (R 3.5.0)                     
 pkgconfig    2.0.2   2018-08-16 CRAN (R 3.5.0)                     
 R6           2.3.0   2018-10-04 CRAN (R 3.5.0)                     
 Rcpp         0.12.19 2018-10-01 CRAN (R 3.5.0)                     
 readr      * 1.1.1   2017-05-16 CRAN (R 3.5.0)                     
 rlang        0.3.0.1 2018-10-25 cran (@0.3.0.1)                    
 rstudioapi   0.8     2018-10-02 CRAN (R 3.5.0)                     
 stats      * 3.5.1   2018-07-05 local                              
 stringi    * 1.2.4   2018-07-20 CRAN (R 3.5.0)                     
 tibble       1.4.2   2018-01-22 CRAN (R 3.5.0)                     
 tools        3.5.1   2018-07-05 local                              
 utf8         1.1.4   2018-05-24 CRAN (R 3.5.0)                     
 utils      * 3.5.1   2018-07-05 local                              
 withr        2.1.2   2018-03-15 CRAN (R 3.5.0)                     
 yaml         2.2.0   2018-07-25 CRAN (R 3.5.0)

parseFactorLevels is slow

this consumes regexp by regexp in R.
this is extremely slow, but only for header parsing.
best to do this in C.

preproc data section into mem instead of into tmp file

at least data.table can read from a string. this should be much faster.

whether this is done or not (mem or temp.file) should be an option.
maybe readARFF(tmp.file = NULL) should mean: preproc into mem
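
A sketch of how such an option could behave, under stated assumptions: `read_data_section` is a hypothetical helper, and base R's `read.csv` stands in for the real reader (its `text` argument parses from memory, much as the issue notes data.table can).

```r
# Hypothetical helper: tmp.file = NULL means "parse from memory",
# any other value means "write to that file first, then parse it".
read_data_section = function(data.lines, tmp.file = NULL) {
  if (is.null(tmp.file)) {
    # In-memory path: no temporary file is created.
    read.csv(text = data.lines, header = FALSE, na.strings = "?",
      stringsAsFactors = FALSE)
  } else {
    # File-based path, as in the current design.
    writeLines(data.lines, tmp.file)
    read.csv(tmp.file, header = FALSE, na.strings = "?",
      stringsAsFactors = FALSE)
  }
}

data.lines = c("1,a", "?,b")
in.mem = read_data_section(data.lines)
via.file = read_data_section(data.lines, tmp.file = tempfile())
identical(in.mem, via.file)
```

Whether the in-memory path is actually faster would have to be measured; the sketch only shows that both paths can share one interface.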

write small benchmark study

compare timings for RWeka, foreign and farff and sizes of different files
show that foreign produces different output than RWeka on some files

writeARFF is slow

is there a faster writer for table data that we can use?
fwrite does not exist, and will not be written.
maybe in readr?

tests that fail with farff but work with RWeka

In #11, I added a test with a few ARFF files from OpenML that can be read by RWeka but not with farff. The test is skipped on travis, but should run when you use devtools::check() on your local machine.

currently we cannot parse multi-instance data

The OpenML dataset with ID 1438 has multi-instance observations. In ARFF files this is denoted by a 'relational' attribute type, which is currently not supported by any ARFF reader in R.
The header looks like:

 @attribute bag relational 
    @attribute y numeric 
    @attribute x numeric 
    @attribute z numeric 
@end bag

Path expansion with ~ generates segfault

at least on Ubuntu 16.04

e.g.

readARFF("~/data.arff")

generates a segfault

while

readARFF("/home/janek/data.arff")

works

linebreaks of different type should be converted in C

datasets 1028, 1030 are problematic, especially for data.table
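
The normalisation rule itself is small. The sketch below is written in R only to show the mapping (the issue asks for it to be done in C), and `normalize_linebreaks` is a hypothetical name.

```r
# Hypothetical helper: map \r\r\n, \r\n and lone \r to \n.
normalize_linebreaks = function(x) {
  # "\r+\n" covers \r\n and \r\r\n; the lone "\r" covers old Mac line ends.
  gsub("\r+\n|\r", "\n", x)
}

normalize_linebreaks("a\r\r\nb\r\nc\rd\n")
```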

parseHeader is very slow and should be rewritten in C++

Parsing this in R with regexps really makes no sense.
We should probably switch to Rcpp and use stringstreams to parse this much more efficiently.

issue #37 contains a testable example where stuff becomes too slow

Encoding bug?

There seems to be an encoding issue, at least on Windows (not sure whether this is caused by Java on Windows, by Windows itself, or by farff):

    oml.conf = getOMLConfig()
    cachedir = oml.conf$cachedir
    data.id = 376
    data.reader = "readr"
    getOMLDataSet(data.id)
    path = file.path(cachedir, "datasets", data.id, "dataset.arff")
    d1 = readARFF(path, data.reader = data.reader)
    d2 = RWeka::read.arff(path)
    for(i in 1:nrow(d1)){ 
      cat(i, fill = TRUE)
      expect_equal(d1$text[i], d2$text[i])
    }
    expect_equal(d1$text[7], d2$text[7])
    d1$text[7]
    d2$text[7]

The first string mismatch happens in row 7 of this data set and concerns the character ¤, which RWeka represents as Ã‚Â¤. I have experimented with the iconv function to convert the character to UTF-8, but it did not work. Does this work on other operating systems?

> d1$text[7]
[1] "Black Sheep Wall A&M, October 1989 cover 1. Black Sheep Wall (4:20) 2. [1]Broken Circle (Acoustic) (3:21) 3. [2]Notebook (Acoustic) (4:39) Known Formats UK (AM563) 7\" (1,2) UK (AMX563) 10\" (1,2,3) UK (AMCD563) CD (1,2,3) US (CD17875) CD (1) US (SP17801) 12\" (1,2,3) AU (?) 7\" (1,2) _________________________________________________________________ This is how I love you: I wish for a shade I can pull I feel so afraid of watching you grow up This love hurts to much And I try and build a wall So I don't have to see you fall And I pray Go away from my thoughts! Why do you keep coming back Over Black Sheep Wall? Oh, I'd love to hold you close But I play it cool And keep my thoughts in a jar Marked \"dangerous\" And everyone says, \"Never fear - All boys his age experiment with their lives\" But my eyes want to close you out I'll close you out Why do you keep coming back Over Black Sheep Wall? Brother Black Sheep, love is strong There's a shepherd out in every storm And he's not afraid of a little rain Why am I? Why do I keep building up This Black Sheep Wall? Oh, I love you so! Do you really know how much How deep? Black Sheep This is how I love you: With closed eyes With turned back With distance _________________________________________________________________ [3]\"Innocence Mission\" ¤ [4]Discography ¤ [5]Innocence Mission ¤ [6]Tony ¤ [7]NIWEB ¤ ¤ [8]comment References 1. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/circle.html 2. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/notebook.html 3. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/innmiss.html 4. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/discog.html 5. file://localhost/tony/IM 6. file://localhost/tony/ 7. file://localhost/ 8. file://localhost/tony/comment.html"

> d2$text[7]
[1] "Black Sheep Wall A&M, October 1989 cover 1. Black Sheep Wall (4:20) 2. [1]Broken Circle (Acoustic) (3:21) 3. [2]Notebook (Acoustic) (4:39) Known Formats UK (AM563) 7\" (1,2) UK (AMX563) 10\" (1,2,3) UK (AMCD563) CD (1,2,3) US (CD17875) CD (1) US (SP17801) 12\" (1,2,3) AU (?) 7\" (1,2) _________________________________________________________________ This is how I love you: I wish for a shade I can pull I feel so afraid of watching you grow up This love hurts to much And I try and build a wall So I don't have to see you fall And I pray Go away from my thoughts! Why do you keep coming back Over Black Sheep Wall? Oh, I'd love to hold you close But I play it cool And keep my thoughts in a jar Marked \"dangerous\" And everyone says, \"Never fear - All boys his age experiment with their lives\" But my eyes want to close you out I'll close you out Why do you keep coming back Over Black Sheep Wall? Brother Black Sheep, love is strong There's a shepherd out in every storm And he's not afraid of a little rain Why am I? Why do I keep building up This Black Sheep Wall? Oh, I love you so! Do you really know how much How deep? Black Sheep This is how I love you: With closed eyes With turned back With distance _________________________________________________________________ [3]\"Innocence Mission\" Ã‚Â¤ [4]Discography Ã‚Â¤ [5]Innocence Mission Ã‚Â¤ [6]Tony Ã‚Â¤ [7]NIWEB Ã‚Â¤ Ã‚Â¤ [8]comment References 1. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/circle.html 2. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/notebook.html 3. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/innmiss.html 4. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/discog.html 5. file://localhost/tony/IM 6. file://localhost/tony/ 7. file://localhost/ 8. file://localhost/tony/comment.html"

does not work

data.table does not handle quotes in strings correctly

issue reported here
Rdatatable/data.table#1299

ISO_8601_to_POSIX_datetime_format in parseHeader

I am not sure what this is used for anymore.
code needs to be checked.
possibly from foreign, it is an internal function there.

maybe import / copy it or remove the code

RWeka != farff when a feature contains only FALSE

Suppose this is my data:

@relation R_data_frame

@attribute V1 {FALSE}
@attribute V2 {FALSE,TRUE}

@data
FALSE,TRUE
FALSE,FALSE

Reading this data with RWeka gives me:

'data.frame':	2 obs. of  2 variables:
 $ V1: logi  FALSE FALSE
 $ V2: logi  TRUE FALSE

while farff does not recognize the first feature as logical but as a Factor:

'data.frame':	2 obs. of  2 variables:
 $ V1: Factor w/ 1 level "FALSE": 1 1
 $ V2: logi  TRUE FALSE
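
One rule that would reproduce the RWeka output shown above is to treat a nominal attribute as logical whenever all of its declared levels are in {TRUE, FALSE}, not only when both are present. This is a sketch under that assumption; `nominal_to_column` and its `convert.logical` switch are hypothetical names, not farff's API.

```r
# Hypothetical helper: decide between logical and factor for a nominal column.
nominal_to_column = function(values, levels, convert.logical = TRUE) {
  # A level set such as {FALSE} or {FALSE,TRUE} counts as boolean.
  is.bool = length(levels) > 0 && all(levels %in% c("TRUE", "FALSE"))
  if (convert.logical && is.bool) {
    return(as.logical(values))
  }
  factor(values, levels = levels)
}

v1 = nominal_to_column(c("FALSE", "FALSE"), levels = "FALSE")
v2 = nominal_to_column(c("TRUE", "FALSE"), levels = c("FALSE", "TRUE"))
# Opting out keeps the column as a factor.
v3 = nominal_to_column(c("FALSE", "FALSE"), levels = "FALSE",
  convert.logical = FALSE)
str(list(V1 = v1, V2 = v2, V1.factor = v3))
```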

writeARFF should have overwrite = TRUE

RWeka has overwrite = TRUE by default

out = tempfile()
writeARFF(iris, out)
# doing it again yields error
writeARFF(iris, out)
# Error in writeARFF(iris, out) : 
#  Assertion on 'path' failed: File at path already exists: 

# with RWeka this is still possible and overwrites the file
RWeka::write.arff(iris, out)
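
The requested behaviour could be sketched as a thin wrapper. `write_with_overwrite` is a hypothetical helper, and the `writer` argument is a stand-in so the example runs without farff; with farff installed one would pass something like `function(x, path) farff::writeARFF(x, path = path)`.

```r
# Hypothetical wrapper: overwrite = TRUE by default, as in RWeka.
write_with_overwrite = function(x, path, overwrite = TRUE, writer) {
  if (file.exists(path)) {
    if (!overwrite) stop("File at path already exists: ", path)
    # Remove the old file so the writer's own existence check passes.
    unlink(path)
  }
  writer(x, path)
  invisible(path)
}

# Stand-in writer for the demo (assumption: any function(x, path) works).
csv.writer = function(x, path) write.csv(x, path, row.names = FALSE)

out = tempfile()
write_with_overwrite(iris, out, writer = csv.writer)
# A second call now succeeds and replaces the file.
write_with_overwrite(iris, out, writer = csv.writer)
# The old strict behaviour remains available on request.
res = tryCatch(
  write_with_overwrite(iris, out, overwrite = FALSE, writer = csv.writer),
  error = function(e) conditionMessage(e)
)
```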

write a little blog post on the mlr blog about farff with a mini speed test

List of OML files that don't work


  # here a list of other dids that do not work (some of them even don't work for RWeka)
  bad = c(70,71,73,74,75,76,78,115,116,118,119,121,122,123,124,125,126,127,128,129,130,
          131,132,133,135,136,138,140,141,142,144,146,147,148,273,292,293,350,358,383,
          384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,572)

  # some of the files are also "big" and take a long time
  size.bad = vapply(bad, function(X) {
    path = OpenML:::downloadOMLObject(X, object = "data")$files$dataset.arff$path
    file.size(path)
  }, numeric(1))

We need to check those and reduce the list. Maybe convert some of them into new issues.

the preprocessing buffer code is VERY bad

these lines

SEXP c_rd_preproc(SEXP s_path_in, SEXP s_path_out, SEXP s_data_sect_index) {

FILE* handle_in;
FILE* handle_out;
const char* path_in = CHAR(asChar(s_path_in));
const char* path_out = CHAR(asChar(s_path_out));
int data_sect_index = asInteger(s_data_sect_index);
char line_buf_1[400000];
char line_buf_2[400000];

can obviously lead to buffer overflows for very long lines. That is VERY BAD code.
At least we should add a check, but we should probably allocate the memory dynamically here.

See issue #37 for a case where this problem already occurred.
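
The "at least do a check" variant could look roughly like this on the R side, before the file is handed to the C code. `check_line_lengths` is a hypothetical helper; the default of 400000 mirrors the buffer size above, and the 3-byte margin is an assumption. Dynamic allocation in C remains the more robust fix.

```r
# Hypothetical guard: refuse files whose longest line would not fit
# into a fixed-size C line buffer.
check_line_lengths = function(path, buf.size = 400000L) {
  # nchar(type = "bytes") counts bytes, which is what a C char buffer holds.
  n = max(0L, nchar(readLines(path, warn = FALSE), type = "bytes"))
  # Reserve 3 bytes for a possible "\r\n" and the terminating NUL.
  if (n + 3L > buf.size) {
    stop(sprintf("Longest line has %d bytes, but the buffer holds only %d",
      n, buf.size))
  }
  invisible(n)
}

f = tempfile()
writeLines(c("short", strrep("x", 50)), f)
longest = check_line_lengths(f)
too.long = tryCatch(check_line_lengths(f, buf.size = 20L),
  error = function(e) conditionMessage(e))
```

Note that this check reads the whole file once, which costs time on large inputs.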

columns containing question marks

There seems to be a problem when a column contains question marks (this may also occur with other special characters). In OpenML, missing values in an ARFF file are denoted by a question mark; see for example the prediction object in http://www.openml.org/r/506373, which looks like:

@relation 'run$predictions'
@attribute 'repeat' numeric
@attribute 'fold' numeric
@attribute 'row_id' numeric
@attribute 'prediction' {'good','bad'}
@attribute 'truth' {'good','bad'}
@attribute 'confidence.good' {FALSE, TRUE}
@attribute 'confidence.bad' {FALSE, TRUE}
@data
0 0 490 ? "good" ? ?
0 0 406 ? "good" ? ?
0 0 139 ? "good" ? ?
0 0 482 ? "good" ? ?

While RWeka is able to read this, farff is failing.

library(OpenML)

# fails
setOMLConfig(arff.reader = "farff")
d = getOMLRun(506373)

# works
setOMLConfig(arff.reader = "RWeka")
d = getOMLRun(506373)

It seems that the error happens in c_rd_preproc; for the first four lines it produces:

00490NA"good"NANA
00406NA"good"NANA
00139NA"good"NANA
00482NA"good"NANA
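
The output above has lost its delimiters: the `?` markers were replaced, but the whitespace between fields was dropped. What the preprocessing would need to produce for whitespace-separated rows can be sketched in R; `preproc_data_line` is a hypothetical helper, not the C routine.

```r
# Hypothetical helper: turn one whitespace-separated data row into a
# comma-separated row with "?" replaced by NA.
preproc_data_line = function(line) {
  # scan() splits on whitespace and respects single and double quotes.
  tokens = scan(text = line, what = character(), quote = "'\"", quiet = TRUE)
  tokens[is.na(tokens) | tokens == "?"] = "NA"
  paste(tokens, collapse = ",")
}

preproc_data_line('0 0 490 ? "good" ? ?')
# "0,0,490,NA,good,NA,NA"
```

The sketch drops the quotes, so a quoted value that itself contains a comma would need to be re-quoted before being passed to a CSV reader.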

currently we cannot parse sparse files.

we need to

a) throw an error now, if a sparse file is detected

b) figure out which sparse matrix reader on cran exists that parses something close to sparse ARFF

Here is a list of sparse file DIDs: 350, 386, 391, 397, 401
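
For reference, a sparse ARFF data row is written as `{index value, index value, ...}` with 0-based column indices, and omitted entries stand for 0. Detecting and expanding such a row could be sketched as follows; both helper names are hypothetical, and quoted values containing spaces or commas are not handled.

```r
# Hypothetical helper for option a): detect a sparse data row.
is_sparse_data_line = function(line) grepl("^\\s*\\{", line)

# Hypothetical helper: expand one sparse row into a dense character vector.
parse_sparse_row = function(line, n.cols) {
  inner = sub("^\\s*\\{(.*)\\}\\s*$", "\\1", line)
  out = rep("0", n.cols)  # omitted entries are 0
  if (nzchar(trimws(inner))) {
    for (entry in strsplit(inner, ",", fixed = TRUE)[[1]]) {
      parts = strsplit(trimws(entry), "\\s+")[[1]]
      idx = as.integer(parts[1]) + 1L  # ARFF indices are 0-based
      if (is.na(idx) || idx < 1L || idx > n.cols) {
        stop("Bad column index in sparse row: ", entry)
      }
      out[idx] = parts[2]
    }
  }
  out
}

row = "{0 entry1, 1 entry2, 2 entry3, 4 entry5}"
is_sparse_data_line(row)
parse_sparse_row(row, n.cols = 5)
```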

unit test with date columns

make sure reading and writing works on date cols

too long lines in preproc_readr c code

currently we have some bad magic number that restricts ARFF line lengths

that's really bad code, I have a possibly better version in "attic"

but this is not really finished and tested. need to look at it

#FIXME Benni and Flo

See #2 for the updated version

# FIXME: data sets with text features and special chars; they are not stored as UTF-8 on OML
dids = setdiff(dids, c(374, 376,  379,  380))

#187: OD280%2FOD315_of_diluted_wines should evaluate to OD280/2FOD315 says description

# FIXME: strings are broken at "," so "[1,2]" becomes "'[1" and "2]'"
dids = setdiff(dids, c(1047, 1057))

# FIXME: foreign can not read dat set linebreaks are \r\r\n instead of \r\n 
# Might be due to conversion using R download.file()?
dids = setdiff(dids, c(579,585, 581))

# FIXME: dat sets with space in column names
dids = setdiff(dids, c(1058))

# FIXME: Error in data, numeric data sometimes quoted e.g. '1047' instead of 1047
# Weka simply removes quotes
dids = setdiff(dids, c(1092, 1095))

# FIXME: dat set where @Data lines sometimes begin with ",". 
# farff reads NA for first and drops last entry in the line
# rWeka removes ","
dids = setdiff(dids, c(676))

# FIXME: dat set of form {0 entry1, 1 entry2, 2 entry3, 4 entry5}
# Where 0,1,2,4... is the column number.
# dat set with according column number in front of entry,
# if colnumber not  in '{}' tags 
# then fill with 0 (that is what RWeka does)
dids = setdiff(dids, c(292))

Error parsing file

getOMLRun(536902)

fails with the following error:

Warning: 49 parsing failures.
row col               expected                                    actual
  1  X1 no trailing characters NA43"Iris-setosa""Iris-setosa"100
  2  X1 no trailing characters NA76"Iris-versicolor""Iris-versicolor"010
  3  X1 no trailing characters NA49"Iris-setosa""Iris-setosa"100
  4  X1 no trailing characters NA85"Iris-versicolor""Iris-versicolor"010
  5  X1 no trailing characters NA134"Iris-virginica""Iris-virginica"001
... ... ...................... .........................................
.See problems(...) for more details.
Error in `colnames<-`(`*tmp*`, value = header$col.names) :
  'names' attribute [8] must be the same length as the vector [1]
In addition: Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.

This is the content of the arff file:
@relation 'run$predictions'
@Attribute 'repeat' numeric
@Attribute 'fold' numeric
@Attribute 'row_id' numeric
@Attribute 'prediction' {'Iris-setosa','Iris-versicolor','Iris-virginica'}
@Attribute 'truth' {'Iris-setosa','Iris-versicolor','Iris-virginica'}
@Attribute 'confidence.Iris-setosa' numeric
@Attribute 'confidence.Iris-versicolor' numeric
@Attribute 'confidence.Iris-virginica' numeric
@DaTa
0 ? 43 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 76 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 49 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 85 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 134 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 122 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 110 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 130 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 82 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 78 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 132 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 91 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 70 "Iris-virginica" "Iris-versicolor" 0 0 1
0 ? 129 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 72 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 99 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 42 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 96 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 83 "Iris-virginica" "Iris-versicolor" 0 0 1
0 ? 138 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 115 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 67 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 34 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 135 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 4 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 46 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 123 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 26 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 29 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 88 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 25 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 5 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 102 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 14 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 146 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 61 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 68 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 140 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 84 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 124 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 15 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 74 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 41 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 71 "Iris-versicolor" "Iris-versicolor" 0 1 0
0 ? 22 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 27 "Iris-setosa" "Iris-setosa" 1 0 0
0 ? 128 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 113 "Iris-virginica" "Iris-virginica" 0 0 1
0 ? 73 "Iris-versicolor" "Iris-versicolor" 0 1 0

Instance weighting not supported

Weka (>= 3.5.8) supports instance weights encoded directly in the ARFF file.
See: https://weka.wikispaces.com/ARFF+(stable+version)#Instance%20weights%20in%20ARFF%20files
This does not appear to be supported in your library. It'd be pretty nice if this were added!
-August

leading tilde in path -> segfault

We get a segfault if the path to the arff file contains a leading tilde:

farff::readARFF("~/Desktop/iris.arff")

A simple call to path.expand should fix this.

Parse with reader=readr : ~/Desktop/iris.arff
Loading required package: readr

 *** caught segfault ***
address 0x68, cause 'memory not mapped'

Traceback:
 1: system.time(expr)
 2: g(.Call(preproc.fun, path, tmp.file, as.integer(header$line.counter)))
 3: farff::readARFF("~/Desktop/arff/iris.arff")
 4: eval(ei, envir)
 5: eval(ei, envir)
 6: withVisible(eval(ei, envir))
 7: source("run.R")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
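
A plausible cause is that the C code opens the path as given, and C's file functions do not expand `~` the way a shell does. Expanding the path in R before it reaches `.Call` could be sketched as below; `expand_arff_path` is a hypothetical wrapper around base R's `path.expand`.

```r
# Hypothetical wrapper: expand a leading "~" in R before the path is
# handed to C code.
expand_arff_path = function(path) {
  stopifnot(is.character(path), length(path) == 1L)
  path.expand(path)
}

expand_arff_path("~/Desktop/iris.arff")
# Absolute paths are returned unchanged.
expand_arff_path("/home/janek/data.arff")
```

Checking the result of the file open in the C code would additionally turn any remaining bad path into an R error instead of a segfault.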