
masala-parser's People

Contributors

d-plaindoux, dependabot[bot], domdomegg, lndamaral, ltearno, nicolas-zozol, scamden, simon-zozol, thecrypticace


masala-parser's Issues

Thoughts about Bennu?

I started writing my own parser combinators library until I came across this project and Bennu. Bennu hasn't been worked on in a while, but it seems quite mature. It's really lightweight and has all the bells and whistles I could imagine.

Any reason you decided to build a library of your own rather than using Bennu? Just curious, what are you guys using this library for?

Parser Extension: p.sequence(x1,x2,x3)=>[X1,X2,X3]

The goal is to easily write these frequently seen structures:

2+2

P.sequence(this.number(), '+', this.number())

will result in an array: [2, '+', 2]

So we can write:

P.sequence(this.number(), '+', this.number()).map(values => values[0] + values[2]);
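A minimal sketch of how such an n-ary sequence could collect its values, using a toy functional parser model (plain functions, not masala-parser's actual Parser class):

```javascript
// Toy parser model: a parser is a function (input, index) => {value, index} or null.
// This only sketches the proposed behaviour; masala-parser's real API differs.
const char = (c) => (input, index) =>
    input[index] === c ? { value: c, index: index + 1 } : null;

const digit = () => (input, index) =>
    /\d/.test(input[index]) ? { value: Number(input[index]), index: index + 1 } : null;

// sequence(p1, p2, ...) runs each parser in order and collects values in a flat array
const sequence = (...parsers) => (input, index) => {
    const values = [];
    for (const p of parsers) {
        const result = p(input, index);
        if (result === null) return null;   // one failure fails the whole sequence
        values.push(result.value);
        index = result.index;
    }
    return { value: values, index };
};

const addition = sequence(digit(), char('+'), digit());
const parsed = addition('2+2', 0);
console.log(parsed.value);                       // [2, '+', 2]
console.log(parsed.value[0] + parsed.value[2]);  // 4
```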

Brainfuck parser

Write the parser of a Kiss compiler. The created context must reference variables and scopes. If this stupid language has none, use ultra-basic JavaScript.

Overloading

We often use combinator.parse(Stream.of(document)). We could add a combinator.parseString(document) function, or modify the parse(doc | stream) function to test whether the argument is a String or a Stream.

The question is the same for the extractor. Should we create textUntil(string | combinator), or create textUntil(combinator) and textUntilString(string)?
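A sketch of the runtime-dispatch option, with a stand-in StringStream class (the names here are assumptions for illustration, not the library's confirmed API):

```javascript
// Hypothetical stand-in for the library's stream type.
class StringStream {
    constructor(source) { this.source = source; }
}

// One possible shape for parse(doc | stream): test the argument's type at runtime
// and wrap plain strings into a stream before parsing.
function parse(input, index = 0) {
    const stream = typeof input === 'string' ? new StringStream(input) : input;
    return { streamSource: stream.source, index };  // placeholder for real parsing
}

console.log(parse('hello').streamSource);                 // 'hello'
console.log(parse(new StringStream('hi')).streamSource);  // 'hi'
```

The same dispatch pattern would apply to textUntil(string | combinator); the trade-off is one polymorphic entry point versus two explicitly named ones.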

Build automatically only master on dev

It's mainly to avoid emails when we push on WIP branches. I don't know what the best practice is, but I'm pretty sure we don't need emails at midnight to say the build is broken.

Parser Extension: thenSpread

The objective is to quickly combine a single element a with multiple elements b.

In Parser class :

thenSpread(p) {
    return this.flatmap((a) => p.map((b) => [a, ...b]));
}

Real example: we have a paragraph, then a succession of others.

function paragraphs() {
    return P.try(paragraph().thenSpread(followingParagraph().rep()));
}

It will return :

parsing Accept {
  offset: 90,
  consumed: true,
  value: 
   [ { paragraph: [Object] },
     { paragraph: [Object] },
     { paragraph: [Object] } ],
  input: 
   StringStream {
     source: 'Lorem ipsum is a *first* paragraph\nSecond line\n\nThe second paragraph\n\nThe Third paragraph\n' } }

The counterpart, `return this.flatmap((a) => p.map((b) => [...a, b]));`, might also be interesting in some cases.

Extractor features

I will open a branch for a text Extractor in the standard/ directory. It comes after my "real world mission". The main features will be:

  • stringIn(['John', 'Jack']): like charIn, but with an array of strings
  • charNumber: searches for numbers, but returns a string and keeps leading zeros; useful for dates or phone numbers (0502)
  • simpleWords(separators): grabs words, defined as letters separated by separators. Separators default to [' ', '\n']
  • textUntil(stop, including:bool = false): eats text until stop. Stop can be a combinator or a text. Set the including flag to true to also eat the stop
  • looseDate(): a very loose search for a date. A real ISO-formatted date parser would be a full-time project; giving a simple reusable example is a better idea

"We are the 23-09-2013 at noon.": textUntil(looseDate()) will return "We are the " and the offset is placed at the beginning of the date. textUntil(looseDate(), true) would return "We are the 23-09-2013".
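The behaviour for a plain-string stop can be sketched directly on strings (textUntilString is a hypothetical helper name, used here only to illustrate the semantics):

```javascript
// Plain-string sketch of the proposed textUntil semantics; combinator stops
// are out of scope here.
function textUntilString(text, stop, including = false) {
    const at = text.indexOf(stop);
    if (at === -1) return null;                 // stop not found: reject
    const end = including ? at + stop.length : at;
    return { value: text.slice(0, end), offset: end };
}

const source = 'We are the 23-09-2013 at noon.';
console.log(textUntilString(source, '23-09-2013').value);        // 'We are the '
console.log(textUntilString(source, '23-09-2013', true).value);  // 'We are the 23-09-2013'
```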

Setting up build for using ES6

Hello @d-plaindoux
I have set up ES6 with grunt-babel, and jshint does its job correctly. Grunt needs to stay at the old 0.4 version because of the grunt-coverage dependency.

At the end babel produces a dist/app.js ES5 file with sourcemap. Options are defined in the Gruntfile.

When running grunt --gruntfile Gruntfile_Coverage.js, I get no error, but I see no coverage information and no lib-cov/report directory. What is it supposed to do?

If it's ok, I will ES6 everything as soon as possible.

How to select up to a given sequence?

What is the proper way to select everything up to a given sequence?

Let's say that I have this source : "wordSTOPwordagainSTOP"
And I want my value to be : ["word","wordagain"]

I have written:

function stop(){
    return P.string('STOP').rep();
}
function detectStop(){
    return P.try(stop().or(P.letters())).rep();
}
function parseText( line, offset=0){
    return detectStop().parse(stream.ofString(line), offset)
}

So once the parser enters letters(), it does not come back to check whether there is a STOP, which nevertheless has higher priority. How would I do that?

====

But suppose that the separator is now -stop-, which contains not only letters:

function stop(){
    return P.string('-stop-').rep();
}

If we test test-stop-test-stop-, then my parser works.

It is easy, because the parser detects the special character '-'. But we don't always want to introduce a special character.
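Leaving combinators aside, the expected values can be checked with a plain split; this only illustrates what the parser should ultimately produce, not how to fix the combinator priority:

```javascript
// Split the source on the separator and drop empty tokens: these are the
// values the parser is expected to yield.
function tokensBetween(source, separator) {
    return source.split(separator).filter((token) => token.length > 0);
}

console.log(tokensBetween('wordSTOPwordagainSTOP', 'STOP'));   // ['word', 'wordagain']
console.log(tokensBetween('test-stop-test-stop-', '-stop-'));  // ['test', 'test']
```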

====

Add a debug function for Parser so that we understand what's going on

I propose a debug function used like this:

function paragraph() {
    // if we have found a line, then we will enter in debug
    return line().debug('found a line').thenLeft(eol.opt()).map(paragraphText);
}

Let's take a line of Markdown as input. The console output is:

[debug] : found a line [ { text: 'Lorem ' }, { bold: { text: 'ipsum' } }...]

Because debug() is placed after line(), the parser enters debug mode. If there were no line, it would not.
I will send a PR with an example of code. If you have more ideas, they are welcome.

letter is only for US ascii letters

The P.letter parser will not accept accents like é or any other foreign UTF-8 characters. There is a trick that is sufficient for the moment:

var firstLetter = name.charAt(0).toUpperCase();
if( firstLetter.toLowerCase() != firstLetter) {
    // it's a letter
}
else {
    // it's a symbol
}

The first solution is to rename it P.asciiLetter. The second is to use the trick, but that will make it noticeably slower. The third is to redefine the method and use a flag, as in P.letter(onlyAscii=true).
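A fourth option worth noting: since ES2018, JavaScript regexes support Unicode property escapes, which classify letters directly:

```javascript
// \p{L} matches any Unicode letter; the u flag enables property escapes (ES2018).
const isLetter = (ch) => /\p{L}/u.test(ch);

console.log(isLetter('e'));  // true
console.log(isLetter('é'));  // true
console.log(isLetter('5'));  // false
console.log(isLetter('-'));  // false
```

Unlike the case-flipping trick, this also handles caseless scripts (e.g. CJK characters), where toUpperCase() and toLowerCase() return the same string and the trick would classify a letter as a symbol.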

Test `optrep` does not match its message

The question is: when you use optrep(), is it OK to have zero elements? I think so, according to the passing tests.

'expect (optrep) to accepted': function(test) {

    test.deepEqual(parser.char("a").optrep().parse(stream.ofString("a"), 0).isAccepted(),
                   true,
                   'should be accepted.');
},

'expect (optrep) none to accepted': function(test) {

    test.deepEqual(parser.char("a").optrep().parse(stream.ofString("b"), 0).isAccepted(),
                   true,
                   'should be rejected.');  // <===== HERE: did you mean 'should be accepted'?
},

I'm not fond of this semantics: in a real language, you repeat when you have at least two elements. optrep() should be OK with one element or more, but not with zero elements.

So we need to build something else for testing a real repetition (at least two elements). Something like:

    P.try(anyTitle()).blankLine().blankLine().optrep().paragraph()  // at least one blankline

Optimise Buffered stream

A stream can be buffered. For this purpose, an entropic cache is built. When a parser is accepted, a cut mechanism can be applied in order to flush the cache and optimise memory usage.

Operation Parser

Creating a simple operation parser mainly for educational purposes

Release 0.2 and integration

  • Make sure one can use parsec minified
  • Make sure one can access the standard elements
  • Make the source code visible, mainly for examples

Creating an extensible LineParser class

I want to create a class LineParser :

export default class LineParser {  // ideally it could extend Parser or ParserHelper directly

    textValue(chars) {
        return { text: chars.join('').trim() };
    }

    text(separator) {
        if (separator) {
            // vvvv 'this' is null inside this.textValue below
            return P.not(eol.or(P.string(separator))).optrep().map(this.textValue);
        } else { /* .... */ }
    }
}

The problem is that this function is extracted from the class by this code:

// (('b -> Parser 'a 'c) * 'b) -> Parser 'a 'c
function lazy(p, parameters) {
    // vvvv 'this' will always be null :(
    return new Parser((input, index=0) => p.apply(null, parameters).parse(input, index));
}

Do you have any clue how to avoid this p.apply(null)?
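One possible way out, sketched with plain functions rather than the real Parser class: let lazy accept an optional self context and pass it to apply instead of null. The self parameter here is an assumption for illustration, not the library's confirmed signature:

```javascript
// Minimal demonstration of the problem and one fix.
class LineParser {
    textValue(chars) { return { text: chars.join('').trim() }; }
    describe() { return this.textValue(['a', 'b', ' ']); }
}

// Toy stand-in for lazy: accept an explicit context instead of hard-coding null.
function lazy(p, parameters = [], self = null) {
    return () => p.apply(self, parameters);
}

const helper = new LineParser();
const broken = lazy(helper.describe);            // calling broken() would throw: `this` is null
const fixed = lazy(helper.describe, [], helper); // `this` is restored to the instance
console.log(fixed().text);                       // 'ab'
```

Alternatively, the caller can bind the method up front (lazy(helper.describe.bind(helper))) and leave lazy unchanged.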

FlattenDeep

This method is unused. I think it is useless, since map and flatmap provide the best approach for data transformation.

Code simplification

In the markdown token.js, the following function:

function fourSpacesBlock() {
    return P.char('\t').or(P.try(P.charIn(' \u00A0').then(P.charIn(' \u00A0'))
        .then(P.charIn(' \u00A0')).then(P.charIn(' \u00A0'))));
}

can be simplified and replaced by:

function fourSpacesBlock() {
    return P.char('\t').or(P.charIn(' \u00A0').occurrence(4));
}

Infix operators using Sweet

Expressiveness can be increased using infix operators.

p1 <*> p2    // == p1.then(p2)
p1 <|> p2    // == p1.or(p2)
p1 >>= p2    // == p1.flatmap(p2)
p1 || p2     // == p1.chain(p2)

This can be achieved using Sweet.JS meta language.

operator <*> left 1 = (left, right) => #`${left}.then(${right})`;
operator <|> left 1 = (left, right) => #`${left}.or(${right})`;
operator >>= left 1 = (left, right) => #`${left}.flatmap(${right})`;
operator || left 1  = (left, right) => #`${left}.chain(${right})`;

Performance issue

Since v0.3, hotelhub automated tests are 10x slower. The main differences are that their custom code has been wrapped into the Extractor bundle, and that we have split the code into bundles.

Accept compareTo function in stream.substreamAt

Here is the current implementation of subStreamAt:

// Stream 'a => [Comparable 'a] -> number -> boolean
subStreamAt(s, index){
    for (var i = 0; i < s.length; i++) {
        var value = this.get(i + index);
        if (!value.isSuccess() || value.success() !== s[i]) { // <=== compareTo
            return false;
        }
    }
    return true;
}

Suppose we want to create P.stringIgnoreCase("john doe"). We could pass a compareTo function to subStreamAt(s, index, [compareTo]).

if (!value.isSuccess() || !compareTo(value.success(), s[i])) {
    return false;
}

Is it a good idea?
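The proposed comparator can be sketched with a plain array standing in for the stream (subArrayAt and the comparator names are illustrative, not the library's API):

```javascript
// Default comparator: strict equality, matching today's behaviour.
const defaultCompare = (a, b) => a === b;
// Comparator for the stringIgnoreCase use case.
const ignoreCase = (a, b) => a.toLowerCase() === b.toLowerCase();

// Does sequence `s` occur in `source` at `index`, under the given comparator?
function subArrayAt(source, s, index, compareTo = defaultCompare) {
    for (let i = 0; i < s.length; i++) {
        const value = source[index + i];
        if (value === undefined || !compareTo(value, s[i])) return false;
    }
    return true;
}

const source = [...'John Doe'];
console.log(subArrayAt(source, [...'john'], 0));              // false
console.log(subArrayAt(source, [...'john'], 0, ignoreCase));  // true
```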

F.first, F.last

It's a mapping function to pick the first or last element of an array:

var helloParser = C.string("Hello")
                    .then(C.char(' ').rep())
                    .then(C.char("'"))
                    .thenRight(C.letter.rep()) // keeping repeated ascii letters
                    .thenLeft(C.char("'"))     // keeping previous letters
                    .map(F.last);              // keep last letter
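A possible implementation of the two helpers (the names follow this issue; they may not match what the F module eventually ships):

```javascript
// first/last as plain mapping functions over the parser's value array.
const first = (values) => values[0];
const last = (values) => values[values.length - 1];

console.log(first(['a', 'b', 'c']));  // 'a'
console.log(last(['a', 'b', 'c']));   // 'c'
```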

Use kebab-case for file names

Underscores or upperCase in file names look weird in JavaScript. We are used to kebab-case such as
line-parser.js, mostly because of Windows/Linux file compatibility.

poll

Improving build system

  • Allow use of rm -rf or mkdir tasks on Windows, using rimraf
  • Copy sample files easily

Standard reorganisation

The standard directory contains a json parser, a naive markdown parser (obsolete), and a markdown parser. This should be reorganised with sub-directories, i.e. one per parser.

Mix of tabs and spaces for bullets

This test passes. I think it should not.

'test bullet niveau 2': function (test) {
    const line = "\t  * This is another lvl2 bullet \n  ";
    testLine(line);
    test.deepEqual({bullet: {level: 2, content: [{text: 'This is another lvl2 bullet '}]}}, value, 'probleme test:test bullet Lvl2');
    test.done();
},

RxJS and similar streams: `parser.ofRx()`

It should be quite easy to create a stream from RxJS, and as it is a de facto standard, it should be well accepted.
A bit more complex is allowing a Response to be sent inside another Rx stream.

1.0 Roadmap discussion

Suppose there is no bug. What feature do we need for 1.0 ?

  • Kiss compiler
  • Clear separation of concerns:
    • Use of Bundles #52
    • isolated libs in standard
    • examples in examples
  • Good naming
  • Dealing with internationalization: #51
  • Compatible with Fantasy Land library for monadic support
  • Build a pattern matching library on top of parsec
  • Binary stream decoder (Scala Scodec)
  • Data Marshaller based on binary stream decoder

Any ideas ?

Add an easy-to-start parseString function

Instead of

var parsec = require('parser-combinator');
var S = require('parser-combinator').stream;
const document = "Hello World in 2017";
const stream = S.ofString(document);
var P = parsec.parser;

We could just write parser.parseString(document); or something like that.
To be defined ...
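A sketch of the shortcut with stubs in place of the real library, just to show the shape of the wrapper (the stub stream and parse result are assumptions, not the actual implementation):

```javascript
// Stub stream factory standing in for the library's S.ofString.
const S = { ofString: (text) => ({ source: text, get: (i) => text[i] }) };

const parser = {
    // Stub parse: the real one would run the combinators over the stream.
    parse(stream, index = 0) { return { input: stream, offset: index }; },
    // The proposed shortcut: fold S.ofString into the entry point.
    parseString(text, index = 0) { return this.parse(S.ofString(text), index); },
};

const response = parser.parseString('Hello World in 2017');
console.log(response.input.source);  // 'Hello World in 2017'
```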

Parser Extension: flattenDeep

The goal is to easily parse:

!image: duck.png

Using something like :

P.char('!').then(text()).then(P.char(':')).thenLeft(spaces()).then(text())

We now have a strange soup of nested arrays. Using:

P.char('!').then(text()).then(P.char(':')).thenLeft(spaces()).then(text()).flattenDeep()

We now have a flat array : ['!','image', ':', 'duck']
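The transformation described here can be written as a small recursive reduce over nested arrays (a standalone sketch, not the library's implementation):

```javascript
// Recursively flatten arbitrarily nested arrays into one flat array.
const flattenDeep = (values) =>
    values.reduce(
        (acc, v) => acc.concat(Array.isArray(v) ? flattenDeep(v) : v),
        []
    );

console.log(flattenDeep([[['!', 'image'], ':'], 'duck']));  // ['!', 'image', ':', 'duck']
```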

Windows Support for eol

Windows uses \r\n for line feeds. There are a lot of \n tests, especially in Markdown, so we should check whether \r\n also works.

Example: markdown/bullet-parser.js

function bulletLv1(){
    // TODO: check if T.eol is better on Windows
    return C.char('\n').optrep();
}
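An end-of-line predicate that accepts both conventions can be sketched with plain character tests (outside the combinator API):

```javascript
// True if an end-of-line starts at `index`: '\n' (Unix) or '\r\n' (Windows).
const isEol = (text, index) =>
    text[index] === '\n' || (text[index] === '\r' && text[index + 1] === '\n');

console.log(isEol('a\nb', 1));    // true  (Unix)
console.log(isEol('a\r\nb', 1));  // true  (Windows)
console.log(isEol('ab', 1));      // false
```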

Choose Token place

There is a Token class, some tokens (?) in Parser such as P.letter, and a standard/token file. Maybe the term token is inappropriate for an email.

Integration tests

Making a two-step integration:

  • prepublishing: npm run prepublish will package parser-combinator.js to /pre-integration and verify that it can be used
  • postpublishing: npm run integration will download from npm and check that it's OK to be used

Rename thenLeft and thenRight

I think it's not easy for beginners to understand.

We could change x.thenLeft(y) by x.thenSkip(y) and x.thenRight(y) by x.thenKeep(y);

What do you think ?
