Giter VIP home page Giter VIP logo

regex-lexing-in-scala's Introduction

Regex-Lexing-in-Scala

An implementation of a Regex Engine and Lexer done in Scala as part of a Formal Languages and Automata course from late 2022.

The point of the assignment was to build an understanding about finite state machines and formal language theory.

Regex

We start from the regex itself in string form, the syntax is similar to usual regex implementations found in standard libraries, but less syntactic "sugars" :

  • character literal
  • [character]"|"[character] => UNION of two characters, can be interpreted as "OR"
  • [character][character] => CONCATENATION of two character, can be interpreted as "AND"
  • [character]* => KLEENE STAR, can be interpreted as "0 or more times"
  • [character]? => MAYBE, can be interpreted as "0 or exactly once"
  • [character]+ => PLUS, can be interpreted as "once or more times"
  • [A-Z], [a-z], [0-9] => Syntactic sugar for all characters from capital A to Z, lowercase a to z and digits 0 through 9.
  • Escaping is done by placing the character in '' 's

We transform this Regex using Djikstra's Shunting Yard algorithm into a Prefix notation we call "Prenex", at this step we also handle the syntactic sugars [A-Z] and the others by transforming them into a series of literals unified with each other.

We then build an Abstract Syntactic Tree using the Prenex notation in order to ease the construction of an NFA in the following step.

We use Thompson's Construction to construct an NFA (Nondeterministic Finite Automaton) from the Prenex we generated at the last step, followed by Subset Construction to convert the NFA to a DFA (Deterministic Finite Automaton).

The Lexer itself takes in a configuration file with the specification listed as follows:

  • #TOKEN_NAME: [REGEX];

The implementation creates a DFA that concatenates all the token regexes and remembers their final states, it then runs the automaton on the input and returns the longest matching token it finds, or it specifies at which line and what character a potential error has occured if it ever reaches its sink state.

Improvements

This was my first project done in Scala so I missed out on a lot of its features early in development, a few things are done in somewhat hacky and ugly ways. I would like to return to this project and make the code cleaner in the future.

regex-lexing-in-scala's People

Contributors

sohanian-daniel avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.