
token.js

Token.js is a simple lexer written in JavaScript. You specify rules as regular expressions, each paired with either a token to return or, optionally, a function to execute.

Example usage:

var numIDs = 0;
var lexer = new TokenJS.Lexer('num := 3 + 4 #add the numbers', {
  root: [
    [/#.*/, TokenJS.Ignore],  //ignore comments
    [/\s+/, TokenJS.Ignore],  //ignore whitespace
    [/[a-zA-Z]+/, function() {
      numIDs++;     //perform a side-effect
      return 'ID';
    }],
    [/[0-9]+/, 'NUMBER'],
    [/:=/, 'ASSIGN'],
    [/\+/, 'PLUS']
  ]
});

console.log(lexer.tokenize());
console.log("Identifiers: " + numIDs);

This outputs the following:

[{text: "num", token: "ID", index: 0}, {text: ":=", token: "ASSIGN", index: 4},
 {text: "3", token: "NUMBER", index: 7}, {text: "+", token: "PLUS", index: 9},
 {text: "4", token: "NUMBER", index: 11}]
Identifiers: 1

constructor

TokenJS.Lexer(input, rules, [ignoreUnrecognized])

The first argument to the constructor is the input text. The second is an object mapping state names to arrays of rules. Each rule is a tuple consisting of a regular expression followed by either a token, a function to execute, or TokenJS.Ignore.

root is a required rule set and is the default state of the lexer. Additional rule sets may be added as needed. See state below for instructions on switching states.

As the example above illustrates, use TokenJS.Ignore to indicate that characters should be consumed without producing a token. This is useful for discarding whitespace and comments. To ignore all unrecognized characters, pass in true as the third parameter.
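The effect of the third parameter can be sketched with a minimal scanning loop. This is a hypothetical illustration of the described behavior, not TokenJS's actual implementation (for brevity it takes the first matching rule rather than the longest match):

```javascript
// Sketch of ignoreUnrecognized: rules are [regex, token] pairs; characters
// matched by no rule are either silently skipped (flag true) or raise an error.
function scan(input, rules, ignoreUnrecognized) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [re, token] of rules) {
      const anchored = new RegExp(re.source, 'y'); // sticky: match at pos only
      anchored.lastIndex = pos;
      const m = anchored.exec(input);
      if (m && m[0].length > 0) {
        tokens.push({ text: m[0], token: token, index: pos });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      if (!ignoreUnrecognized) {
        throw new Error('Unexpected character at index ' + pos);
      }
      pos++; // consume the unrecognized character without emitting a token
    }
  }
  return tokens;
}

const rules = [[/[0-9]+/, 'NUMBER']];
console.log(scan('1?2', rules, true)); // the '?' is skipped; two NUMBER tokens
```

With the flag set to false, the same input would throw at index 1 instead.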

A custom function takes the matched text as a parameter, and its return value becomes the token. If a function does not return a value, as in the example above, the lexer falls through to the remaining rules for the current state until it either 1) finds a rule that yields a token, 2) encounters TokenJS.Ignore, or 3) exhausts the rules, in which case a syntax error is thrown.

When multiple rules match, the rule producing the longest match wins. If there is a tie for longest match, the rule listed first is used.
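The disambiguation policy can be sketched as follows. This is a self-contained illustration of the longest-match-then-rule-order rule, not TokenJS's source:

```javascript
// Try every rule at the current position, keep the longest match,
// and break ties in favor of the rule listed first.
function longestMatch(input, pos, rules) {
  let best = null;
  for (const [re, token] of rules) {
    const anchored = new RegExp(re.source, 'y'); // sticky flag anchors at pos
    anchored.lastIndex = pos;
    const m = anchored.exec(input);
    if (m && (best === null || m[0].length > best.text.length)) {
      best = { text: m[0], token: token, index: pos }; // strict '>' keeps the earlier rule on ties
    }
  }
  return best;
}

// 'if' alone is a keyword, but on 'iffy' the identifier rule matches more
// characters, so it wins even though the keyword rule is listed first.
const keywordRules = [[/if/, 'IF'], [/[a-z]+/, 'ID']];
console.log(longestMatch('iffy', 0, keywordRules)); // ID: 'iffy' (4) beats 'if' (2)
console.log(longestMatch('if', 0, keywordRules));   // tie at length 2: IF, the first rule, wins
```

This is the same policy most lexer generators use to let keyword rules coexist with a general identifier rule.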

tokenize

tokenize returns an array of all resulting tokens, executing any side effects from rule functions along the way.

getNextToken

To get tokens one at a time, call getNextToken. When all input is exhausted, getNextToken will return TokenJS.EndOfStream.

var token = lexer.getNextToken();
while (token !== TokenJS.EndOfStream) {
  console.log(token);
  token = lexer.getNextToken();
}

state

Sometimes it is helpful to switch the lexer between states so that a different set of rules applies.

To change the state of the lexer, call this.state in your callback function, passing the new state name as a string. Note that after changing state within a function, you must still return either a token or TokenJS.Ignore; otherwise a syntax error will be raised.

Here is an example of using states to handle HTML comments:

var lexer = new TokenJS.Lexer(
  'before<!-- consumed-text with before and after and -- dashes -->after', {
  root: [
    [/before/, 'BEFORE'],
    [/after/, 'AFTER'],
    [/<!--/, function() {
      this.state('comment');
      return TokenJS.Ignore;
    }]
  ],
  comment: [
    [/-->/, function() {
      this.state('root');
      return TokenJS.Ignore;  //omitting this will cause a syntax error
    }],
    [/./, TokenJS.Ignore] //consume anything that is not an end-of-comment
  ]
});

Output:

[{text: 'before', token: 'BEFORE', index: 0}, {text: 'after', token: 'AFTER', index: 64}]

reset

Use the reset method to reset the lexer to the root state and start scanning from the beginning. This is called implicitly whenever you call tokenize.
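Conceptually, reset just rewinds the scan position and restores the root state. The following is a hypothetical sketch of that behavior (SketchLexer is an illustrative stand-in, not the TokenJS class):

```javascript
// Hypothetical sketch of what reset does to the lexer's internal state.
class SketchLexer {
  constructor(input) {
    this.input = input;
    this.pos = 0;          // current scan position
    this.current = 'root'; // name of the active rule set
  }
  state(name) { this.current = name; }
  reset() {
    this.pos = 0;          // start scanning from the beginning again
    this.current = 'root'; // back to the default rule set
  }
}

const lex = new SketchLexer('abc');
lex.pos = 3;          // pretend we scanned to the end...
lex.state('comment'); // ...and left the lexer mid-state
lex.reset();
console.log(lex.pos, lex.current); // prints: 0 root
```

Because tokenize calls reset implicitly, you only need to call it yourself when mixing getNextToken with repeated scans of the same input.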
