tixxit / delimited

Scala library for working with {tab,colon,semicolon,comma,pipe,etc}-separated files.

License: Other

Scala 100.00%

delimited's People

Contributors

danking benhutchison non travisbrown ianoc-stripe johnynek scala-steward

delimited's Issues

Look for BOM at the beginning of files

Excel produces UTF-8 files with BOMs in them, so we should probably just cover the gamut and always look for BOMs when reading files, unless explicitly given a character set.

This isn't currently an easy option because Scala.js (as of 0.6.16) doesn't implement java.io.PushbackReader, but in the future that might change, or someone might decide to provide an alternative Scala.js-specific implementation of GuessDelimitedFormat that doesn't use PushbackReader.
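As a rough illustration of the BOM check itself, here is a minimal sketch that works on the first few raw bytes, so it does not depend on PushbackReader. The object and method names are hypothetical and not part of delimited; UTF-32 BOMs are left out to keep it short.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object BomSketch {
  // BOM byte sequences for the charsets handled here. UTF-32 is omitted, so a
  // UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE by this sketch.
  private val boms: List[(Array[Int], Charset)] = List(
    Array(0xEF, 0xBB, 0xBF) -> StandardCharsets.UTF_8,
    Array(0xFE, 0xFF)       -> StandardCharsets.UTF_16BE,
    Array(0xFF, 0xFE)       -> StandardCharsets.UTF_16LE
  )

  /** Returns the charset implied by a leading BOM and the BOM length in bytes, if any. */
  def detect(head: Array[Byte]): Option[(Charset, Int)] =
    boms.collectFirst {
      case (bom, cs) if head.length >= bom.length &&
        bom.indices.forall(i => (head(i) & 0xFF) == bom(i)) => (cs, bom.length)
    }
}
```

A caller would read the first 3 bytes, call `BomSketch.detect`, skip the reported number of bytes, and decode the rest with the returned charset, falling back to the default when the result is `None`.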

version for 2.13?

Is there any chance of publishing a release for Scala 2.13?

Robust validation data set for format inference

We should have a decent set of delimited documents sourced from the internet that have a variety of styles. This includes files produced by some "standard libs" in various programming languages, Excel, etc.

Support for escaping the quote escape character

Is it possible to add support for escaping the quote escape character? I have some MySQL dumps that use " as the quote character and \ as the escape character, but some values have the form "value\\", so the closing quote is wrongly interpreted as an escaped quote. If the escape character itself could be escaped, \\ could be treated as an escaped \ and the value would be valid.

Thanks
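To make the request concrete, here is a minimal sketch of reading one quoted cell where the escape character may escape either the quote or itself. It is standalone code with hypothetical names and does not reflect how delimited's parser is structured.

```scala
object EscapeSketch {
  /** Reads one quoted cell starting at `start`, which must point at the opening quote.
    * Returns the unescaped value and the index just past the closing quote,
    * or None if the cell is never closed. */
  def readQuoted(input: String, start: Int, quote: Char = '"', escape: Char = '\\'): Option[(String, Int)] = {
    require(start < input.length && input.charAt(start) == quote, "cell must start with the quote character")
    val sb = new StringBuilder
    var i = start + 1
    while (i < input.length) {
      val c = input.charAt(i)
      if (c == escape && i + 1 < input.length &&
          (input.charAt(i + 1) == quote || input.charAt(i + 1) == escape)) {
        // Escaped quote or escaped escape: keep the second character literally.
        sb.append(input.charAt(i + 1))
        i += 2
      } else if (c == quote) {
        // Unescaped quote closes the cell.
        return Some((sb.toString, i + 1))
      } else {
        sb.append(c)
        i += 1
      }
    }
    None
  }
}
```

With this rule, the cell `"value\\"` yields a value ending in a single backslash and the final quote closes the cell, instead of being consumed as an escaped quote.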

Binary compatibility problem in Scala 2.12.3

When using delimited 0.9.0 with Scala 2.12.3, I'm seeing:

@ val parser: DelimitedParser = DelimitedParser(DelimitedFormat.CSV)
java.lang.BootstrapMethodError: java.lang.NoSuchMethodError: scala.math.Ordering$.$anonfun$by$1$adapted(Lscala/Function1;Lscala/math/Ordering;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
  net.tixxit.delimited.RowDelim$.<init>(RowDelim.scala:28)
...

The root cause is suspected to be scala/bug#10489.

How to work with Rows

Hi, I was trying out the library and I'm a bit confused by the way the Row type behaves. I see that it inherits from IndexedSeq, but it seems that any inherited method returns an IndexedSeq instead of a Row:

@ import $ivy.`net.tixxit::delimited-core:0.9.0`, net.tixxit.delimited._

@ Row("a", "b", "c") ++ Row("d", "e", "f")
res1: IndexedSeq[String] = Vector("a", "b", "c", "d", "e", "f")

@ Row("a", "b", "c") :+ "d"
res2: IndexedSeq[String] = Vector("a", "b", "c", "d")

So, how is one supposed to work with Rows?

Support chunked input when guessing

Rather than requiring a String, we could actually take chunked input, such as what's accumulated in inferDelimitedFormat in the iteratee module. This would avoid a possibly large string concatenation. It shouldn't really make things slower or be particularly difficult to do.
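A minimal sketch of the idea, assuming the guess is driven by per-character frequency counts: the tallies can be folded over the chunks directly, with no concatenation. The candidate set and names here are hypothetical and much simpler than what GuessDelimitedFormat would need.

```scala
object ChunkedCountSketch {
  // Hypothetical candidate separators; real inference would also track
  // quotes, row delimiters, and so on.
  val candidates: Set[Char] = Set(',', '\t', ';', '|', ':')

  /** Tallies candidate separators across chunks without concatenating them. */
  def count(chunks: Iterator[String]): Map[Char, Long] =
    chunks.foldLeft(Map.empty[Char, Long]) { (acc, chunk) =>
      chunk.foldLeft(acc) { (m, c) =>
        if (candidates(c)) m.updated(c, m.getOrElse(c, 0L) + 1L) else m
      }
    }
}
```

Single-character counts are unaffected by chunk boundaries. Multi-character tokens such as \r\n can straddle two chunks, so a fuller version would need to carry a small amount of state from one chunk to the next.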

Support hard limits on row sizes

If allowRowDelimInQuotes is true, a malformed CSV can potentially OOM the JVM. That is pretty bad. We should allow people to provide a maximum acceptable row size, which can serve purely as a safety mechanism. In the case of inference, the buffer size used is sort of an implicit hard limit on row sizes, since if rows can be larger than the buffer, then we have no hope of inferring the row delimiter (at least).

Ideally this would just be part of the initial parser creation:

DelimitedParser(format, maxRowSize = 64 * 1024)
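The guard itself could look roughly like the sketch below: a scan for the end of the current row that treats row delimiters inside quotes as data but gives up once the row grows past the limit. This is standalone code with hypothetical names, not the parser's actual internals, and it assumes '\n' as the row delimiter.

```scala
object RowLimitSketch {
  /** Scans `input` from `start` for the end of the current row, treating '\n'
    * inside quotes as data. Returns Right(index of the row delimiter), or
    * Right(input.length) at end of input, and Left(message) as soon as the
    * row holds more than `maxRowSize` characters. */
  def findRowEnd(input: String, start: Int, maxRowSize: Int, quote: Char = '"'): Either[String, Int] = {
    var i = start
    var inQuotes = false
    while (i < input.length) {
      val c = input.charAt(i)
      if (c == '\n' && !inQuotes) return Right(i)
      // Any further character would push the row past the limit.
      if (i - start >= maxRowSize)
        return Left(s"row starting at offset $start exceeds $maxRowSize characters")
      // Toggling also handles doubled quotes, which flip the flag twice.
      if (c == quote) inQuotes = !inQuotes
      i += 1
    }
    Right(i)
  }
}
```

With a limit in place, an unterminated quote costs at most maxRowSize characters of buffering before the parser can report an error, rather than consuming the rest of the file.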

java.lang.StringIndexOutOfBoundsException in 0.6.0 but not in 0.5.5

Here's the file (TSV)
sample.txt.zip

Code producing the exception:

import java.io._

import net.tixxit.delimited._

object Main extends App {
  val fileName = "/Users/dw/Desktop/sample.txt"
  val bufferedReader = new BufferedReader(new FileReader(fileName))

  try {
    // Parse the file as TSV and print each parsed row.
    val iterator = DelimitedParser(DelimitedFormat.TSV).parseReader(bufferedReader)
    for (parsed <- iterator) {
      println(parsed)
    }
  } catch {
    case e: Throwable =>
      e.printStackTrace()
      throw e
  } finally {
    bufferedReader.close()
  }
}

stacktrace:

 java.lang.StringIndexOutOfBoundsException: String index out of range: 66100
[error]     at java.lang.String.charAt(String.java:646)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.getChar(DelimitedParserImpl.scala:96)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.loop$2(DelimitedParserImpl.scala:110)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.isFlag(DelimitedParserImpl.scala:118)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isSeparator$1(DelimitedParserImpl.scala:140)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isEndOfCell$1(DelimitedParserImpl.scala:146)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.loop$3(DelimitedParserImpl.scala:166)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.unquotedCell$1(DelimitedParserImpl.scala:179)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.cell$1(DelimitedParserImpl.scala:221)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.row$1(DelimitedParserImpl.scala:263)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.parse(DelimitedParserImpl.scala:287)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.loop$1(DelimitedParserImpl.scala:46)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.parseChunk(DelimitedParserImpl.scala:31)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:51)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:50)
[error]     at scala.collection.Iterator$$anon$15.next(Iterator.scala:499)
[error]     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
[error]     at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[error]     at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[error]     at september.cj.Main$.delayedEndpoint$september$cj$Main$1(Main.scala:18)
[error]     at september.cj.Main$delayedInit$body.apply(Main.scala:8)
[error]     at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
[error]     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.collection.immutable.List.foreach(List.scala:381)
[error]     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
[error]     at scala.App$class.main(App.scala:76)
[error]     at september.cj.Main$.main(Main.scala:8)
[error]     at september.cj.Main.main(Main.scala)
[error] Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 66100
[error]     at java.lang.String.charAt(String.java:646)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.getChar(DelimitedParserImpl.scala:96)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.loop$2(DelimitedParserImpl.scala:110)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.isFlag(DelimitedParserImpl.scala:118)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isSeparator$1(DelimitedParserImpl.scala:140)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isEndOfCell$1(DelimitedParserImpl.scala:146)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.loop$3(DelimitedParserImpl.scala:166)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.unquotedCell$1(DelimitedParserImpl.scala:179)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.cell$1(DelimitedParserImpl.scala:221)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.row$1(DelimitedParserImpl.scala:263)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.parse(DelimitedParserImpl.scala:287)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.loop$1(DelimitedParserImpl.scala:46)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.parseChunk(DelimitedParserImpl.scala:31)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:51)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:50)
[error]     at scala.collection.Iterator$$anon$15.next(Iterator.scala:499)
[error]     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
[error]     at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[error]     at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[error]     at september.cj.Main$.delayedEndpoint$september$cj$Main$1(Main.scala:18)
[error]     at september.cj.Main$delayedInit$body.apply(Main.scala:8)
[error]     at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
[error]     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.collection.immutable.List.foreach(List.scala:381)
[error]     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
[error]     at scala.App$class.main(App.scala:76)
[error]     at september.cj.Main$.main(Main.scala:8)
[error]     at september.cj.Main.main(Main.scala)
java.lang.RuntimeException: Nonzero exit code returned from runner: 1

Add Iteratee support (Cats)

Add a delimited-iteratee package that uses Travis Brown's iteratee library.

Ideally we would have an Enumeratee that can map chunks of character data to rows. I'm not super sure this is possible... but it would be nice to avoid a proliferation of Enumerators for different ways of getting rows out of sources of chunks of character data.

Iteratee support should include dumping CSVs to files, possibly with a BOM marker.

Add column-level schema inference

It would be nice to include some support for inferring column-level, detailed schemas for parsed CSVs too. This comes up often and usually gets hacked around. The goal would be to provide a fairly robust way of gathering statistics per column, including:

inferred type: text, category, ordinal, integer, continuous, boolean, etc
required vs optional
how missing values are marked (empty, NULL, N/A, NaN, etc)
ratio of unique values / rows

An ideal proof of concept would include a tool that, given a CSV, produces a SQL file containing the CREATE TABLE statement along with an INSERT statement for the values. It should at least support Postgres, or maybe allow pluggable backends if that is feasible (without rewriting 80% of the tool).
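As a starting point for discussion, here is a minimal sketch of per-column statistics covering a few of the items listed above. All names, the set of missing-value markers, and the type rules are assumptions made for illustration; category and ordinal detection are left out.

```scala
object ColumnStatsSketch {
  sealed trait ColumnType
  case object IntegerType extends ColumnType
  case object ContinuousType extends ColumnType
  case object BooleanType extends ColumnType
  case object TextType extends ColumnType

  final case class ColumnStats(inferred: ColumnType, optional: Boolean, uniqueRatio: Double)

  // Assumed markers for missing values; a real tool would make this configurable.
  private val missing = Set("", "NULL", "N/A", "NaN")

  // Picks the narrowest type that every present value satisfies.
  private def typeOf(values: Seq[String]): ColumnType =
    if (values.forall(_.matches("[+-]?\\d+"))) IntegerType
    else if (values.forall(_.toDoubleOption.isDefined)) ContinuousType
    else if (values.forall(v => v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false"))) BooleanType
    else TextType

  def infer(column: Seq[String]): ColumnStats = {
    val present = column.filterNot(missing)
    ColumnStats(
      inferred = if (present.isEmpty) TextType else typeOf(present),
      optional = present.size < column.size,
      // Unique values over rows; missing markers count as values here.
      uniqueRatio = if (column.isEmpty) 0.0 else column.distinct.size.toDouble / column.size
    )
  }
}
```

For example, the column `1, 2, (empty), 2` comes out as an optional integer column with a unique ratio of 0.75. This sketch needs Scala 2.13 or later for `toDoubleOption`.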

Support user-supplied buffer sizes for inference

We currently use a constant. However, users may want to use a larger (or smaller) buffer and we should support that.

Inferred DelimitedFormat verification step

Right now we make a good guess at the format based on frequency counts. However, we don't do any validation that our choice was correct. Mainly, we can do 2 types of verification right away:

Check that there are no errors when parsing the sample - or very few, relative to other candidates
Check that the rows all have ~the same number of cells in them

The idea would be that rather than inferring 1 candidate DelimitedFormat, we would return a Stream[DelimitedFormat] of potential formats, ranked based on some sort of score. We would then have a validation step that would sample them until it found a successful one.
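A minimal sketch of that flow, reduced to choosing a single separator character: candidates are ranked by raw frequency, then validated lazily with the "same number of cells" check until one passes. The names and the 0.9 threshold are assumptions, quoting is ignored, and LazyList stands in for Stream, which is deprecated as of Scala 2.13.

```scala
object VerifySketch {
  /** Cheap ranking signal: candidates ordered by how often they occur in the sample. */
  def ranked(sample: Seq[String], candidates: Seq[Char]): LazyList[Char] =
    candidates.sortBy(c => -sample.map(_.count(_ == c)).sum).to(LazyList)

  /** Fraction of rows whose cell count equals the most common cell count.
    * Scores 0 when the most common count is a single column, since a separator
    * that never appears would otherwise look perfectly consistent. */
  def consistency(sample: Seq[String], separator: Char): Double =
    if (sample.isEmpty) 0.0
    else {
      val counts = sample.map(row => row.count(_ == separator) + 1)
      val (modal, freq) = counts.groupBy(identity).map { case (k, v) => (k, v.size) }.maxBy(_._2)
      if (modal <= 1) 0.0 else freq.toDouble / sample.size
    }

  /** Walks the ranked candidates and stops at the first one that validates. */
  def firstValid(sample: Seq[String], candidates: Seq[Char], threshold: Double = 0.9): Option[Char] =
    ranked(sample, candidates).find(c => consistency(sample, c) >= threshold)
}
```

In the sample `1,2,3,4|x`, `5,6|y`, `7|z`, the comma is the most frequent candidate but gives rows of 4, 2, and 1 cells, so validation rejects it and falls through to the pipe, which gives 2 cells per row.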

Partially parse delimited files to LazyRow

The idea here is that instead of having the parser produce fully parsed rows, we may be able to parse things a bit quicker by simply finding the extent of each row, without breaking it down into cells, then using a new LazyRow type to defer fully parsing the row. This may let us speed up a single-threaded reader by only partially parsing the rows, then allowing the remaining work of parsing each row into cells to be done concurrently.

The hope here is that we can keep the reader IO bound, if at all possible.
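A minimal sketch of what such a type could look like, assuming a naive split that ignores quoting. The LazyRow shape and the helper names are hypothetical; nothing here exists in delimited today.

```scala
object LazyRowSketch {
  /** Holds the raw text of one row and splits it into cells only on first access. */
  final class LazyRow(val raw: String, separator: Char) {
    // Scala lazy vals are initialized at most once and are safe to force from
    // any thread, so the split can be deferred to a worker.
    lazy val cells: Vector[String] = split(raw, separator)
  }

  // Naive split that ignores quoting; a real parser would honor the format's quote rules.
  private def split(raw: String, separator: Char): Vector[String] = {
    val out = Vector.newBuilder[String]
    var start = 0
    var i = 0
    while (i < raw.length) {
      if (raw.charAt(i) == separator) { out += raw.substring(start, i); start = i + 1 }
      i += 1
    }
    out += raw.substring(start)
    out.result()
  }

  /** Cheap single-threaded pass: find row boundaries only. */
  def rows(input: String, separator: Char): Iterator[LazyRow] =
    input.linesIterator.map(new LazyRow(_, separator))
}
```

The reader thread only pays for locating row boundaries. Consumers that touch `cells`, possibly on other threads, pay for the split.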