
actson's People

Contributors

dependabot[bot], michel-kraemer, nexoscp


actson's Issues

String value parsing should be memory-bounded or multi-event

I have a use case that involves parsing a very big JSON document structured like this:

{
  "fieldName":"35cdf2614146t14314[...]" /* many megabytes of base-64 data */
}

It seems that the StringBuilder currentValue currently just keeps growing without bound as a string value rolls in.

To keep memory bounded, there could be a maximumStringValueLength setting that puts an upper limit on memory usage. Even better would be to allow (possibly behind another flag) the string value to be split into multiple events, so it can be handled in a streaming way.

It'd need a new VALUE_PARTIAL_STRING event, or something like that, as sketched below.
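A minimal sketch of how consumption might look, assuming a hypothetical VALUE_PARTIAL_STRING event and flag (neither exists in actson today); the feeder loop follows actson's documented pattern, and largeValueSink is just a stand-in for wherever the chunks go:

// hypothetical sketch: JsonEvent.VALUE_PARTIAL_STRING does not exist in actson today
import de.undercouch.actson.JsonEvent;
import de.undercouch.actson.JsonParser;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class PartialStringSketch {
  public static void consume(byte[] json, OutputStream largeValueSink) throws Exception {
    JsonParser parser = new JsonParser(StandardCharsets.UTF_8);
    int pos = 0;
    int event;
    do {
      while ((event = parser.nextEvent()) == JsonEvent.NEED_MORE_INPUT) {
        pos += parser.getFeeder().feed(json, pos, json.length - pos);
        if (pos == json.length) {
          parser.getFeeder().done();
        }
      }
      if (event == JsonEvent.VALUE_PARTIAL_STRING) { // hypothetical event
        // write each chunk out instead of buffering the whole value in memory
        largeValueSink.write(parser.getCurrentString().getBytes(StandardCharsets.UTF_8));
      }
    } while (event != JsonEvent.EOF && event != JsonEvent.ERROR);
  }
}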

Support for unicode escapes

Unicode symbols represented as UTF-8 in strings are parsed as expected. I also see that the parser has an ES (escape) state. However, Unicode escapes of code points are not converted to the respective code points.
For example, the source text "\"\\u00e9\"".getBytes() is parsed as the single token "\\u00e9" instead of "é".

Here's the test ba58753
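Roughly, what the test expects is something like this sketch, using actson's documented feeder loop (this is not the exact test from the commit above):

// sketch of the expected behaviour, not the exact test from commit ba58753
import static org.junit.Assert.assertEquals;
import de.undercouch.actson.JsonEvent;
import de.undercouch.actson.JsonParser;
import java.nio.charset.StandardCharsets;
import org.junit.Test;

public class UnicodeEscapeTest {
  @Test
  public void unicodeEscapeIsDecoded() {
    byte[] json = "\"\\u00e9\"".getBytes(StandardCharsets.UTF_8);
    JsonParser parser = new JsonParser(StandardCharsets.UTF_8);
    int pos = 0;
    int event;
    while ((event = parser.nextEvent()) == JsonEvent.NEED_MORE_INPUT) {
      pos += parser.getFeeder().feed(json, pos, json.length - pos);
      if (pos == json.length) {
        parser.getFeeder().done();
      }
    }
    assertEquals(JsonEvent.VALUE_STRING, event);
    assertEquals("é", parser.getCurrentString()); // currently yields "\u00e9" instead
  }
}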

Streamline API, remove feeder.

I am thinking about getting rid of more of the validating legacy of this code. Procedures that use the feeder look cumbersome (see the loops in d.u.a.JsonParserTest#parseFail, for example), so removing the feeder from the parser altogether should simplify both the code and its usage. Looking at d.u.a.JsonParser#nextEvent, I now think that the whole point of having a feeder is the ability to detect a premature end of input and return EOF or ERROR events appropriately. An application that is only interested in parsing can trivially infer these conditions anyway. Here are the changes I propose.
Change the parser API to:

public class JsonParser {
  public JsonParser(int depth) {...} // initial state; the only constructor
  public int getMaxDepth() {...} // keep this
  public int nextEvent(char nextChar) {...} // advance the state with the next input character and return an event
  // everything else would be non-public or removed
}

This effectively reduces the parser to a pure state machine. It keeps input buffering out of the parser, so clients are free to use whatever buffering they want without having to implement an interface.

Replace the EOF event with, say, END, which is emitted when the current JSON structure has been completely parsed. This happens once a scalar value has been parsed, or once the end of a nested structure is popped and the stack is empty.
There are two options: after issuing END, all subsequent calls produce ERROR, which is similar to the current behavior; or parsing continues as if we were back at the beginning. When the input is expected to contain a sequence of multiple JSON documents, the first option would require re-creating the parser for each new document, while the second allows reusing it and simply interpreting the END event where necessary. Since I am interested in this use case, I'd prefer to keep parsing after END.

The setDepth setter should be removed, since it is not clear what should happen if we are in the middle of parsing at, say, depth 100 and the setting is changed to 50.

The parser usage would look like this:

final JsonParser jsonParser = new JsonParser(100);
final CharSequence jsonSource = "{\"abc\":123}";
int i = 0;
int event = NEED_MORE_INPUT;
while (i < jsonSource.length()) {
    event = jsonParser.nextEvent(jsonSource.charAt(i++));
    if (event != NEED_MORE_INPUT) {
      // one could handle the END event explicitly here, e.g. break out of the loop
      processEvent(event);
    }
}
if (event == NEED_MORE_INPUT) {
    throw new IllegalStateException("Incomplete JSON input.");
}

Not possible to add content from a netty ByteBuf to a feeder without an additional copy.

This is a feature request rather than a defect.

I'm using actson in a netty 4.1.17 environment to stream JSON content received from a chunked HTTP transfer.

Using the JsonFeeder interface, it is not possible to feed data from a netty ByteBuf without copying it to a byte array first. It is, however, possible to wrap the ByteBuf as a standard java.nio ByteBuffer. Would it be reasonable to add a feed method that takes a ByteBuffer, roughly as sketched below?
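Something like this (the feed(ByteBuffer) overload is hypothetical; ByteBuf.nioBuffer() is standard netty and shares the underlying memory instead of copying it):

// hypothetical sketch: JsonFeeder has no feed(ByteBuffer) overload today
import de.undercouch.actson.JsonParser;
import io.netty.buffer.ByteBuf;
import java.nio.ByteBuffer;

public class NettyFeedSketch {
  public static void feedChunk(JsonParser parser, ByteBuf chunk) {
    // nioBuffer() wraps the readable bytes without copying them to a byte[]
    ByteBuffer nio = chunk.nioBuffer();
    parser.getFeeder().feed(nio); // hypothetical overload taking a ByteBuffer
  }
}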

I've experimented with implementing my own feeder but I would like to avoid duplicating DefaultJsonFeeder's charset decoding. An alternative approach would be to factor the charset decoding into a separate class.

As a side note, the feeder interface could reasonably be split into two interfaces (one that the parser uses to consume input and one that application code uses to provide input). I suspect this change would be too disruptive to existing users.

If any of these ideas sounds reasonable, I am happy to help put together a pull request.

Bug: Unicode escapes seem to break parsing down the line

I've encountered a weird issue. In the example I've attached, line 18 contains a unicode escape. That line seems to parse fine, but the parser then breaks on line 49. If I remove line 18, line 49 (then line 48) is read properly. Likewise, if I remove the unicode escape from that line, everything works fine.

parsing-example.txt

To clarify: I'm using the JsonParser stream parser with the default encoding. The file is also UTF-8 encoded but still contains these escapes.

Large string values

I am building an application that consumes large files of structured data and does some processing based on the content.
I'm using the Aalto streaming parser for XML files and this looks like the most similar option for JSON.

One of the requirements for my project is that it must be able to handle large string values in the JSON, which may be too large to hold in memory. I'm wondering whether there is any way to use this parser in its current form for this case, since as far as I can tell it only emits a VALUE_STRING JsonEvent once it has seen the closing ".

JsonFeeder needs to copy bytes, even if the source is already a ByteBuffer

Currently JsonFeeder lacks the ability to feed directly off read-only ByteBuffer objects. This has the disadvantage of requiring an extra copy for frameworks that expose input bytes as a ByteBuffer (e.g. akka's ByteString and google protobuf's ByteString do this).

JsonFeeder should be extended to read from byte buffers as well. On the implementation side, I see two possible approaches:

  1. Hold on to the byte buffers if they're read-only, and process them once fillBuffer is called (no copying).
  2. Immediately decode them into characters while feeding, holding on to only the few final bytes if we happen to land in the middle of a multi-byte sequence (sketched at the end of this issue).

Actually, glancing at the code, is (2) even handled currently?

I'd be happy to do a PR.
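For approach (2), a rough sketch of incremental decoding with java.nio.charset.CharsetDecoder (this is not actson's actual DefaultJsonFeeder, and error handling is omitted):

// decode each incoming ByteBuffer directly; keep only the last few bytes when
// a chunk ends in the middle of a multi-byte UTF-8 sequence
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class IncrementalDecodeSketch {
  private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
  // at most a few bytes of an incomplete sequence are carried between chunks
  private final ByteBuffer leftover = ByteBuffer.allocate(8);

  public CharBuffer feed(ByteBuffer chunk, boolean endOfInput) {
    CharBuffer out = CharBuffer.allocate(leftover.position() + chunk.remaining());

    // finish a sequence split at the previous chunk boundary by moving bytes
    // over one at a time until the decoder consumes it
    while (leftover.position() > 0 && chunk.hasRemaining()) {
      leftover.put(chunk.get());
      leftover.flip();
      decoder.decode(leftover, out, false); // underflow until the sequence is complete
      leftover.compact();
    }

    // decode the bulk of the chunk in place, without copying it anywhere
    decoder.decode(chunk, out, endOfInput);

    // stash a trailing partial sequence (if any) for the next call
    while (chunk.hasRemaining()) {
      leftover.put(chunk.get());
    }

    out.flip();
    return out;
  }
}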

performance review?

Hi

How does the performance compare with existing libraries such as Jackson? I'm considering using it for a network application in which JSON messages arrive in chunks.

Proposal: Add reasons for ERROR event

It would be useful if, when receiving the JsonEvent.ERROR event, there were a property that explained why the parse failed. Looking through the JsonParser source, this seems simple enough to do, since most error conditions are the result of an unexpected token. Something like the sketch below comes to mind.
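A rough sketch of the kind of API I mean (getErrorReason() is hypothetical and not part of actson today; the surrounding loop follows the documented usage):

// hypothetical sketch: JsonParser has no getErrorReason() accessor today
import de.undercouch.actson.JsonEvent;
import de.undercouch.actson.JsonParser;
import java.nio.charset.StandardCharsets;

public class ErrorReasonSketch {
  public static void parse(byte[] json) {
    JsonParser parser = new JsonParser(StandardCharsets.UTF_8);
    int pos = 0;
    int event;
    do {
      while ((event = parser.nextEvent()) == JsonEvent.NEED_MORE_INPUT) {
        pos += parser.getFeeder().feed(json, pos, json.length - pos);
        if (pos == json.length) {
          parser.getFeeder().done();
        }
      }
      if (event == JsonEvent.ERROR) {
        // hypothetical accessor, e.g. "unexpected character ',' at position 17"
        throw new IllegalStateException(parser.getErrorReason());
      }
    } while (event != JsonEvent.EOF);
  }
}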

I'd be happy to tackle this myself.
