tixxit / delimited
Scala library for working with {tab,colon,semicolon,comma,pipe,etc}-separated files.
License: Other
Please publish for Scala 2.12
Excel produces UTF-8 files with BOMs in them, so we should probably just cover the gamut and always look for BOMs when reading files, unless explicitly given a character set.
This isn't currently an easy option because Scala.js (as of 0.6.16) doesn't implement java.io.PushbackReader, but in the future that might change, or someone might decide to provide an alternative Scala.js-specific implementation of GuessDelimitedFormat that doesn't use PushbackReader.
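As a rough illustration of the BOM check on the JVM, here is a minimal sketch that only handles the UTF-8 BOM. The name BomSkippingReader is hypothetical and not part of delimited, and it uses java.io.PushbackInputStream, so it does not address the Scala.js limitation mentioned above.

```scala
import java.io.{InputStream, InputStreamReader, PushbackInputStream, Reader}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of delimited: strips a leading UTF-8 BOM if present.
object BomSkippingReader {
  private val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  def open(in: InputStream): Reader = {
    val pushback = new PushbackInputStream(in, Utf8Bom.length)
    val head = new Array[Byte](Utf8Bom.length)

    // Read up to 3 bytes; a single read() call may return fewer than requested.
    var n = 0
    var eof = false
    while (n < head.length && !eof) {
      val r = pushback.read(head, n, head.length - n)
      if (r < 0) eof = true else n += r
    }

    // If the bytes are not a BOM, push them back so the reader sees them.
    val hasBom = n == Utf8Bom.length && java.util.Arrays.equals(head, Utf8Bom)
    if (!hasBom && n > 0) pushback.unread(head, 0, n)

    new InputStreamReader(pushback, StandardCharsets.UTF_8)
  }
}
```

The resulting Reader could then be handed to whatever consumes character input; UTF-16 and UTF-32 BOMs would need extra cases.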
Any time to publish for 2.13, sir?
We should have a decent set of delimited documents sourced from the internet that have a variety of styles. This includes files produced by some "standard libs" in various programming languages, Excel, etc.
Is it possible to add support for escaping the quote escape character? I have some MySQL dumps that use " as the quote character and \ as the escape character, but some values have the form "value\\", so the last quote is wrongly interpreted as an escaped quote. If it were possible to escape the escape character, \\ could be treated as an escaped \ and the value would be valid.
Thanks
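To make the request concrete, here is a minimal sketch of reading one quoted cell where the escape character can itself be escaped. It is not how delimited parses cells; EscapedQuotes and readQuoted are hypothetical names, and every escaped character is simply taken literally (so \n stays as the letter n).

```scala
// Hypothetical sketch, not delimited's parser.
object EscapedQuotes {
  /** Reads a quoted cell starting at `start`.
    * Returns the unescaped content and the index just past the closing quote,
    * or None if the cell does not start with a quote or is never closed.
    */
  def readQuoted(
      input: String,
      start: Int = 0,
      quote: Char = '"',
      escape: Char = '\\'
  ): Option[(String, Int)] = {
    if (start >= input.length || input.charAt(start) != quote) None
    else {
      val sb = new StringBuilder
      var i = start + 1
      while (i < input.length) {
        val c = input.charAt(i)
        if (c == escape && i + 1 < input.length) {
          // The character after an escape is literal, including the escape itself.
          sb.append(input.charAt(i + 1))
          i += 2
        } else if (c == quote) {
          return Some((sb.toString, i + 1))
        } else {
          sb.append(c)
          i += 1
        }
      }
      None // ran out of input before a closing quote
    }
  }
}
```

With this rule, a cell written as "value\\" ends at its final quote and yields a value ending in a single backslash.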
When using delimited 0.9.0 with Scala 2.12.3, I'm seeing:
@ val parser: DelimitedParser = DelimitedParser(DelimitedFormat.CSV)
java.lang.BootstrapMethodError: java.lang.NoSuchMethodError: scala.math.Ordering$.$anonfun$by$1$adapted(Lscala/Function1;Lscala/math/Ordering;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
net.tixxit.delimited.RowDelim$.<init>(RowDelim.scala:28)
...
root cause suspected to be scala/bug#10489
Hi, I was trying out the library and I'm a bit confused by the way the Row type behaves. I see that it inherits from IndexedSeq, but it seems that any inherited method returns IndexedSeq instead of a Row:
@ import $ivy.`net.tixxit::delimited-core:0.9.0`, net.tixxit.delimited._
@ Row("a", "b", "c") ++ Row("d", "e", "f")
res1: IndexedSeq[String] = Vector("a", "b", "c", "d", "e", "f")
@ Row("a", "b", "c") :+ "d"
res2: IndexedSeq[String] = Vector("a", "b", "c", "d")
So, how is one supposed to work with Rows?
Rather than requiring a String, we could actually take chunked input, such as what's accumulated in inferDelimitedFormat in the iteratee module. This would avoid a possibly large string concatenation. It shouldn't really make things slower or be particularly difficult to do.
If allowRowDelimInQuotes is true, a malformed CSV can potentially OOM the JVM. That is pretty bad. We should allow people to provide a maximum acceptable row size, which can serve purely as a safety mechanism. In the case of inference, the buffer size used is sort of an implicit hard limit on row sizes, since if rows can be larger than the buffer, then we have no hope of inferring the row delimiter (at least).
Ideally this would just be part of the initial parser creation:
DelimitedParser(format, maxRowSize = 64 * 1024)
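To illustrate the safety mechanism in isolation, here is a minimal sketch of a row splitter that allows newlines inside quotes but fails fast once a row grows past a limit. BoundedRows and RowTooLargeException are hypothetical names and this is not delimited's parser; it only splits on '\n' and tracks quote state by toggling on every quote character.

```scala
// Hypothetical sketch, not delimited's parser.
final class RowTooLargeException(limit: Int)
    extends RuntimeException(s"row exceeds maximum size of $limit characters")

object BoundedRows {
  /** Splits character input into rows on '\n', keeping newlines that occur
    * inside quotes, and throws once a single row exceeds maxRowSize characters.
    */
  def rows(chars: Iterator[Char], maxRowSize: Int, quote: Char = '"'): Iterator[String] =
    new Iterator[String] {
      def hasNext: Boolean = chars.hasNext

      def next(): String = {
        if (!chars.hasNext) throw new NoSuchElementException("no more rows")
        val sb = new StringBuilder
        var inQuotes = false
        var done = false
        while (!done && chars.hasNext) {
          val c = chars.next()
          // A doubled quote toggles twice, so the quoted state is preserved.
          if (c == quote) inQuotes = !inQuotes
          if (c == '\n' && !inQuotes) done = true
          else {
            // Bail out before buffering more than the caller allows.
            if (sb.length >= maxRowSize) throw new RowTooLargeException(maxRowSize)
            sb.append(c)
          }
        }
        sb.toString
      }
    }
}
```

An unterminated quote then produces an exception after at most maxRowSize buffered characters instead of consuming the rest of the file.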
Here's the file (TSV)
sample.txt.zip
Code producing the exception:
import java.io._
import net.tixxit.delimited._
object Main extends App {
  val fileName = "/Users/dw/Desktop/sample.txt"
  val bufferedReader = new BufferedReader(
    new FileReader(fileName)
  )
  try {
    val iterator = DelimitedParser(DelimitedFormat.TSV).parseReader(bufferedReader)
    for {
      parsed <- iterator
    } {
      println(parsed)
    }
  } catch {
    case e: Throwable =>
      e.printStackTrace()
      throw e
  } finally {
    bufferedReader.close()
  }
}
stacktrace:
java.lang.StringIndexOutOfBoundsException: String index out of range: 66100
[error] at java.lang.String.charAt(String.java:646)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.getChar(DelimitedParserImpl.scala:96)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.loop$2(DelimitedParserImpl.scala:110)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.isFlag(DelimitedParserImpl.scala:118)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.isSeparator$1(DelimitedParserImpl.scala:140)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.isEndOfCell$1(DelimitedParserImpl.scala:146)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.loop$3(DelimitedParserImpl.scala:166)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.unquotedCell$1(DelimitedParserImpl.scala:179)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.cell$1(DelimitedParserImpl.scala:221)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.row$1(DelimitedParserImpl.scala:263)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.parse(DelimitedParserImpl.scala:287)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl.loop$1(DelimitedParserImpl.scala:46)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl.parseChunk(DelimitedParserImpl.scala:31)
[error] at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:51)
[error] at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:50)
[error] at scala.collection.Iterator$$anon$15.next(Iterator.scala:499)
[error] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
[error] at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[error] at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[error] at september.cj.Main$.delayedEndpoint$september$cj$Main$1(Main.scala:18)
[error] at september.cj.Main$delayedInit$body.apply(Main.scala:8)
[error] at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
[error] at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error] at scala.App$$anonfun$main$1.apply(App.scala:76)
[error] at scala.App$$anonfun$main$1.apply(App.scala:76)
[error] at scala.collection.immutable.List.foreach(List.scala:381)
[error] at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
[error] at scala.App$class.main(App.scala:76)
[error] at september.cj.Main$.main(Main.scala:8)
[error] at september.cj.Main.main(Main.scala)
[error] Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 66100
[error] at java.lang.String.charAt(String.java:646)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.getChar(DelimitedParserImpl.scala:96)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.loop$2(DelimitedParserImpl.scala:110)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.isFlag(DelimitedParserImpl.scala:118)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.isSeparator$1(DelimitedParserImpl.scala:140)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.isEndOfCell$1(DelimitedParserImpl.scala:146)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.loop$3(DelimitedParserImpl.scala:166)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.unquotedCell$1(DelimitedParserImpl.scala:179)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.cell$1(DelimitedParserImpl.scala:221)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.row$1(DelimitedParserImpl.scala:263)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl$.parse(DelimitedParserImpl.scala:287)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl.loop$1(DelimitedParserImpl.scala:46)
[error] at net.tixxit.delimited.parser.DelimitedParserImpl.parseChunk(DelimitedParserImpl.scala:31)
[error] at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:51)
[error] at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:50)
[error] at scala.collection.Iterator$$anon$15.next(Iterator.scala:499)
[error] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
[error] at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[error] at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[error] at september.cj.Main$.delayedEndpoint$september$cj$Main$1(Main.scala:18)
[error] at september.cj.Main$delayedInit$body.apply(Main.scala:8)
[error] at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
[error] at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error] at scala.App$$anonfun$main$1.apply(App.scala:76)
[error] at scala.App$$anonfun$main$1.apply(App.scala:76)
[error] at scala.collection.immutable.List.foreach(List.scala:381)
[error] at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
[error] at scala.App$class.main(App.scala:76)
[error] at september.cj.Main$.main(Main.scala:8)
[error] at september.cj.Main.main(Main.scala)
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
Add a delimited-iteratee package that uses Travis Brown's iteratee library.
Ideally we would have an Enumeratee that can map chunks of character data to rows. I'm not super sure this is possible... but it would be nice to avoid a proliferation of Enumerators for different ways of getting rows out of sources of chunks of character data.
Iteratee support should include dumping CSVs to files, possibly with a BOM marker.
It would be nice to include some support for inferring column-level, detailed schemas for parsed CSVs too. This comes up often and usually gets hacked around. The goal would be to provide a fairly robust way of gathering statistics per column, including:
An ideal proof of concept would include a tool that, given a CSV, produces a SQL file containing the CREATE TABLE statement along with INSERT statements for the values, at least supporting Postgres, or maybe allowing pluggable backends if that is feasible (without rewriting 80% of the tool).
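As a very rough sketch of the CREATE TABLE half of that tool, the following infers one of three column types from already-parsed cells. Everything here is an assumption for illustration: SchemaSketch, the three-type lattice, and the choice of Postgres type names. It expects all rows to have the same number of cells, treats empty cells as missing, and does no identifier quoting.

```scala
import scala.util.Try

// Hypothetical sketch, not part of delimited.
object SchemaSketch {
  sealed trait ColType { def sql: String }
  case object IntCol extends ColType { val sql = "BIGINT" }
  case object DoubleCol extends ColType { val sql = "DOUBLE PRECISION" }
  case object TextCol extends ColType { val sql = "TEXT" }

  // Narrowest type that can represent a single cell.
  private def typeOf(cell: String): ColType =
    if (Try(cell.toLong).isSuccess) IntCol
    else if (Try(cell.toDouble).isSuccess) DoubleCol
    else TextCol

  // Smallest type that can hold values of both types.
  private def widen(a: ColType, b: ColType): ColType =
    if (a == b) a
    else if (a == TextCol || b == TextCol) TextCol
    else DoubleCol

  /** One inferred type per column; columns with only empty cells fall back to TEXT. */
  def inferTypes(rows: Seq[Seq[String]]): Seq[ColType] =
    rows.transpose.map { col =>
      col.filter(_.nonEmpty).map(typeOf).reduceOption(widen).getOrElse(TextCol)
    }

  def createTable(name: String, header: Seq[String], rows: Seq[Seq[String]]): String =
    header
      .zip(inferTypes(rows))
      .map { case (h, t) => s"$h ${t.sql}" }
      .mkString(s"CREATE TABLE $name (", ", ", ");")
}
```

A fuller version would gather more per-column statistics before settling on a type.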
We currently use a constant. However, users may want to use a larger (or smaller) buffer and we should support that.
Right now we make a good guess at the format based on frequency counts. However, we don't do any validation that our choice was correct. Mainly, we can do 2 types of verification right away:
The idea would be that rather than inferring 1 candidate DelimitedFormat, we would return a Stream[DelimitedFormat] of potential formats, ranked based on some sort of score. We would then have a validation step that would sample them until it found a successful one.
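A toy version of the rank-then-validate idea, restricted to guessing the separator, might look like the sketch below. RankedFormats and Candidate are hypothetical names, the score is a plain frequency count, the validation check (every non-empty line has the same number of columns, and more than one) is just one possible choice, and quoting is ignored entirely.

```scala
// Hypothetical sketch, not delimited's inference.
object RankedFormats {
  final case class Candidate(separator: Char, score: Int)

  /** Candidate separators that occur in the sample, most frequent first. */
  def rank(
      sample: String,
      separators: Seq[Char] = Seq(',', '\t', ';', '|', ':')
  ): List[Candidate] =
    separators
      .map(s => Candidate(s, sample.count(_ == s)))
      .filter(_.score > 0)
      .sortBy(c => -c.score)
      .toList

  /** A candidate is accepted if every non-empty line splits into the same
    * number of columns and that number is greater than one.
    */
  def isValid(sample: String, candidate: Candidate): Boolean = {
    val widths = sample
      .split('\n')
      .filter(_.nonEmpty)
      .map(line => line.count(_ == candidate.separator) + 1)
      .distinct
    widths.length == 1 && widths.head > 1
  }

  /** First ranked candidate that passes validation, if any. */
  def infer(sample: String): Option[Candidate] =
    rank(sample).find(isValid(sample, _))
}
```

For a sample such as "10:30:00;a\n11:45;b\n", the colon is the most frequent character but gives inconsistent column counts, so the semicolon is selected instead.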
The idea here is that instead of having the parser produce fully parsed rows, we may be able to parse things a bit quicker by simply parsing the entire row, without breaking it down into cells, then using a new LazyRow type to defer actually parsing the row fully. We may be able to speed up a single-threaded reader by only partially parsing the rows, then allowing a chunk of the work required to further parse the rows to be done concurrently.
The hope here is that we can keep the reader IO bound, if at all possible.
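A minimal sketch of what such a LazyRow could look like, assuming the reader has already found the row boundaries. The splitting here is deliberately naive (no quote handling) and only meant to show where the deferred work would live; it is not an existing type in delimited.

```scala
import java.util.regex.Pattern

// Hypothetical sketch: holds the raw row text and splits it into cells on first access.
final class LazyRow(val raw: String, separator: Char) {
  // Evaluated at most once, on whichever thread first asks for the cells.
  // The -1 limit keeps trailing empty cells.
  lazy val cells: Vector[String] =
    raw.split(Pattern.quote(separator.toString), -1).toVector

  def apply(i: Int): String = cells(i)
}
```

The reader thread would only construct LazyRow values, and a consumer (possibly on another thread) would pay the splitting cost when it first touches cells.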