acm-uoa-student-chapter / dil
The Department of Informatics Programming Language.
This is a discussion thread to choose a software license together. It's an important omission on my part, so we'd better add one soon.
Anyone who doesn't know what a software license is can read this.
In a nutshell: a bunch of rules stating how (and whether) one can use or redistribute the software.
I like choosealicense.com if you want to take a look at the differences between the most popular licenses. I'm thinking about MIT, which is a very permissive license.
Create a character map that will help us check if a character belongs to a certain category.
For example, it will help us check if a character is an octal digit. Or if a character can appear in the middle of an identifier.
More specifically, we need to be able to check if the character:
That last one may be unfamiliar to most. Try to include the character @ anywhere in a C program and compile. The character @, perhaps interestingly, can't appear in a C program (that doesn't mean it can't appear in a DIL program).
This issue deliberately has a vague description. I think it's interesting to discuss what the best
way to achieve the goal is.
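To get the discussion going, one common scheme (a sketch, not necessarily the one we'll adopt) is a 256-entry table of bit flags, so every category check becomes a single lookup. The flag names below are illustrative:

```cpp
#include <cstdint>

// Bit flags for each character category we care about.
// (Names and categories are placeholders for what the issue decides.)
enum CharFlags : uint8_t {
    C_DIGIT = 1 << 0,  // 0-9
    C_OCTAL = 1 << 1,  // 0-7
    C_HEX   = 1 << 2,  // 0-9, a-f, A-F
    C_IDENT = 1 << 3,  // can appear in the middle of an identifier
};

// One entry per possible `unsigned char` value.
static uint8_t char_map[256];

static void init_char_map() {
    for (int c = '0'; c <= '9'; ++c) {
        char_map[c] |= C_DIGIT | C_HEX | C_IDENT;
        if (c <= '7') char_map[c] |= C_OCTAL;
    }
    for (int c = 'a'; c <= 'z'; ++c) char_map[c] |= C_IDENT;
    for (int c = 'A'; c <= 'Z'; ++c) char_map[c] |= C_IDENT;
    char_map['_'] |= C_IDENT;
    for (int c = 'a'; c <= 'f'; ++c) char_map[c] |= C_HEX;
    for (int c = 'A'; c <= 'F'; ++c) char_map[c] |= C_HEX;
}

// Category checks become a single table lookup.
static bool is_octal_digit(char c) { return char_map[(unsigned char)c] & C_OCTAL; }
static bool is_ident_char(char c)  { return char_map[(unsigned char)c] & C_IDENT; }
```

Note that the character @ simply gets no flags, so every check on it is false for free.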
Interface: const char *token_name(TOK).
Template:
const char *token_name(TOK kind) {
switch(kind) {
case TOK::EOI: return "end of input";
case TOK::INTLIT: return "integer literal";
...
default: assert(0);
}
assert(0);
}
As stated in the README, this is an educational project. Although it is not intended to replace the relevant course or textbook, I think it would be wise to document potential sources for anyone interested in learning about compilers in general, and specifically in helping with the project.
Grammar construction is one thing that I assume most people are not experienced with.
However, I think it's worth it for you to try it. You can check the samples to try to understand the grammar, then try to construct the EBNF.
To separate the effort, here's a list that I hope I can make more detailed:
DIL will follow an LLVM-like structure (as many other languages do). That is, the top-level entity is a module, which you can think of as a file. A module has 0 or more functions. A function has 0 or more statements. And a statement can use / contain expressions.
Note: An important difference from e.g. C / C++ is that expressions are not intermingled with statements and vice versa. In C, an assignment is both a statement and an expression, e.g. you can do this: a = b = c; where b = c acts both as a statement and an expression.
If you're interested in this issue but you have no experience in formal grammars, we can incrementally build the grammar on Slack, starting from very easy grammars.
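If you want a concrete starting point, here is what a first, deliberately tiny EBNF draft could look like. Every rule and keyword name here is a placeholder; the real grammar is exactly what this issue is meant to decide:

```
(* Hypothetical starting point, rule and keyword names are placeholders *)
Module     = { Function } ;
Function   = Identifier "(" ")" Block ;
Block      = "{" { Statement } "}" ;
Statement  = Expression ";" ;
Expression = IntLiteral | Identifier ;
```

Note how the module / function / statement / expression layering above maps directly onto the rules, one per level.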
Write a function that reads integer literals. It should handle base 10, base 8 (octal) and base 16 (hexadecimal) numbers. It should provide warnings and handle erroneous input gracefully.
That includes: out-of-base digits and overflow.
We want something like atoi(), but atoi() has a bad interface because the function has no way to clearly notify the caller about failure.
A usual idiom in C/C++, when we have a function that returns a value of a specific type, is to pick an invalid value (of this type) that we return to signify that there was an error.
For example, say we have a function that is supposed to find a number in an array and return the index of this number. Let's say it returns an int. Note that int can represent negative values, but those are invalid values for an index (array indexes are >= 0). So, we could pick a specific value out of those invalid values (e.g. -1) which we will return to notify the caller that we could not find the number. This is a contract between the caller and callee, and the caller should know about it.
In the same way, functions that are supposed to return a pointer can use NULL to signify failure.
This method of error handling, while good and widely used, fails when there is no invalid value to pick. In our example, we call a function that reads an integer value and returns it. Well... all possible integer values are valid. What this means is that we have to return 2 values. One is the integer value just read, and one holds whether there was an error or not.
Some programming languages provide multiple return values, but C / C++ do not.
In C / C++ the usual ways to implement this are:
a) By creating a struct that holds those values, for example:
typedef struct read_int_ret {
int val;
int error;
} read_int_ret_t;
Unfortunately, this makes the code unnecessarily convoluted by introducing naming all over the place.
So, another way is:
int read_int(const char *input, int *val);
So, we pass a pointer in which we save the value while what we return signifies the error.
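Putting the pieces together, a sketch of option b) could look like the following. The exact rules (which prefixes select which base, what counts as overflow) are assumptions to be settled in the issue:

```cpp
// Sketch of the interface above: parses a non-negative integer literal in
// base 10, base 8 (leading 0) or base 16 (leading 0x), stores the result in
// *val and returns 0 on success, non-zero on error.
// Errors covered: empty input, out-of-base digits, overflow of int.
int read_int(const char *input, int *val) {
    int base = 10;
    if (input[0] == '0') {
        if (input[1] == 'x' || input[1] == 'X') { base = 16; input += 2; }
        else if (input[1] != '\0')              { base = 8;  input += 1; }
    }
    if (*input == '\0') return 1;       // no digits at all (e.g. "0x")
    long long result = 0;
    for (; *input; ++input) {
        char c = *input;
        int d;
        if      (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return 1;                  // not a digit at all
        if (d >= base) return 1;        // out-of-base digit, e.g. 9 in octal
        result = result * base + d;
        if (result > 2147483647LL) return 1;  // overflow of int
    }
    *val = (int)result;
    return 0;
}
```

The caller then checks the return value before touching *val, which is exactly the contract atoi() fails to offer.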
Write a variadic template function that displays syntax error messages. You might not know what template or variadic means but trust me, if you do want to learn, this is a very easy issue.
We basically want a function where we can e.g. write: syntax_error("Invalid character literal");
And get Syntax Error: Invalid character literal. But we want the Syntax Error: part to be colored. Preferably red. :)
Still, though, I haven't explained why we need the function to be a variadic template function.
Well, we need to do things like this:
const char *id = ...;
...
syntax_error("Type with id: ", id, " is invalid.");
So, note that we pass parameters of multiple types to the function. And we want to pass as many as we want. Those 2 requirements together, most of the time, mean variadic templates.
The function should be able to recognize that id is of type const char * and print it accordingly.
But that is maybe not the easiest thing in the world. You may want to start with a function that just prints one string. Then, multiple strings (with variadic templates), then try to recognize different types, etc.
Provide an enum class that will help us identify tokens.
Tokens are the smallest piece of information that the compiler understands.
For example, the characters "in" in C mean nothing. But the characters "int" mean something - they're one token signifying the keyword int.
In the same way a number, e.g. 102, is an integer literal token.
We want the list of all possible tokens of DIL.
Important note: We want all the possible categories of tokens. That is,
102 and 103 are different tokens but they belong to the same category: integer literals.
You can start off like this:
enum class TOK {
UNDEFINED, // token of undefined kind (i.e. error)
EOI, // end of input
INTLIT, // integer literal
...
};
The structure we'll use to save tokens will have a field of type TOK named e.g. kind.
Then we can distinguish tokens as:
if (kind == TOK::INTLIT) {
// Do something specific for integer literals
}
For the compiler, we'll pick a simple and effective scheme to read the input. That is, we will read each file as a whole into memory as a NUL-terminated, contiguous array of characters.
Initially we will have only one input file and we won't support Unicode.
To start off, we can do our job with just a const char *. Eventually we might need to make our own string class, but certainly it's better to not use std::string.
So, we want a function const char *read_file_into_memory(const char *filepath).
How exactly can we avoid working on the same issues since we cannot assign ourselves to issues? Shall we leave a comment or something?
It's pretty obvious that we need a test suite. I was thinking of a very trivial test suite, where basically we just have a /tests folder with .cpp files. Every file in this folder should have a first line denoting how it should be compiled. Then, we run a script that, for every file, reads its first line, compiles it and runs it. If an assert fires, then the test failed.
Write a generic linear probing hash table for data keyed by a unique string identifier.
Why linear probing?
Linear probing is fast! Period. It is optimized for cache locality, as the data are contiguous.
What's more, the operations that are hazardous for a linear probing hash table won't be needed for a compiler. Those are deletion and non-constant-length data. By the last one I mean that when the hash table is created, we will know exactly how many entries it will hold.
Structure
The data that we will save will be searched by a unique pointer (specifically a const char *).
This is the topic of another issue, but essentially all the identifiers will be string-interned. That is, equal strings point to the same memory. For example, let's say we save "foo" as a const char * in one place and in another. These 2 const char * pointers will be equal. What this means is that essentially the hash table can treat those strings as integers (i.e. the values of the pointers).
We will keep 2 parallel arrays. One will save the actual data and the other will save the pointers. In that way, we can do the searching in the pointers (very fast) and incur the penalty only when we find the correct id. At that point we return the corresponding entry of the data array.
For this hash table we should constrain the data to only be pointers. That's important to save memory in the unused slots of the data array that linear probing implies (i.e. the hash table won't be full, of course).
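Under the assumptions above (interned const char * keys compared by pointer, fixed capacity known up front, no deletion), a sketch could look like this. The names and the capacity are placeholders:

```cpp
#include <cstddef>

// Linear probing hash table keyed by interned `const char *` pointers.
// Key equality is pointer equality; capacity is fixed; no deletion.
struct PtrHashTable {
    static const size_t CAP = 1024;  // power of two, larger than #entries
    const char *keys[CAP] = {};      // parallel array 1: interned ids
    void *data[CAP] = {};            // parallel array 2: the payloads

    static size_t hash(const char *key) {
        // Drop the low bits (heap pointers tend to be 8-byte aligned),
        // then mask into the table. A placeholder hash, not a studied one.
        return ((size_t)key >> 3) & (CAP - 1);
    }

    void insert(const char *key, void *value) {
        size_t i = hash(key);
        while (keys[i] && keys[i] != key)   // linear probing: scan forward
            i = (i + 1) & (CAP - 1);
        keys[i] = key;
        data[i] = value;
    }

    void *lookup(const char *key) {
        size_t i = hash(key);
        while (keys[i]) {                   // an empty slot ends the probe
            if (keys[i] == key) return data[i];
            i = (i + 1) & (CAP - 1);
        }
        return nullptr;                     // not found
    }
};
```

The probe loop only ever touches the small keys array, which is the cache-locality argument from above in action; data is read exactly once, on a hit.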
We now have a lexer started in lex.cpp and also an integer scanning utility in int_scan.cpp. We want to integrate the latter into the lexer, as that was the purpose of its creation anyway.
Interning strings means that we keep only one copy in memory of each unique string.
For example, let's say we want to save the string "foo". Then, we allocate some memory to save its chars. If we then want again to save "foo", instead of allocating new memory,
we point again to the same previous memory.
That means a couple of things:
For now it's better to start with the easiest way that you can think of implementing this.
Then, we'll improve it.
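In that spirit, here is about the easiest correct version I can think of, leaning on std::unordered_set (whose elements are guaranteed to stay put when new ones are inserted, so the returned pointers remain valid). The class name is a placeholder:

```cpp
#include <string>
#include <unordered_set>

// Simplest-possible string interner: every unique string is stored once,
// and intern() always returns the pointer to that single stored copy.
class StrIntern {
    // unordered_set gives stable references: inserting new strings never
    // moves the ones already stored, so returned pointers stay valid.
    std::unordered_set<std::string> pool;
public:
    const char *intern(const char *s) {
        // insert() is a no-op if the string is already in the pool;
        // either way we get an iterator to the stored copy.
        return pool.insert(s).first->c_str();
    }
};
```

After interning, string equality becomes pointer equality, which is exactly what the hash table issue relies on.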
Note: I won't try to replace the theoretical background that compiler textbooks and courses provide. Certainly GitHub is not the place for that. But I think it's important to give some description of what we're trying to do. Feel free to skip it if you already know the basics.
What is a lexical analyzer (lexer)?
One very important thing that a compiler tries to do is understand the input (so that it can
then tell the computer - in assembly language - what to do based on what the input says).
Now, the input, as it goes into the compiler, is essentially a bunch of characters that (initially)
mean nothing. There has to exist some way to separate / group those characters so that
they have some meaning.
Compare that with natural languages: To understand English text, you first have to separate the input (i.e. a bunch of characters) into words, because words are the smallest pieces in the English language that mean something. Then of course, we combine those words in different ways to create sentences, paragraphs etc., which help us convey information (i.e. meaning; i.e. help the one reading the text understand what we think).
But it all starts with words, the smallest pieces of meaning. Kind of like atoms in the physical world.
So, the same way that a natural language has words, a programming language has tokens.
A token is anything that can mean something on its own. The character + is a token on its own, because it means something on its own. The keyword int is also a token.
We want to start off with a lexical analyzer that can recognize simple tokens.
Things like +, int, return, [ etc.
How do we implement them?
There are basically 2 ways to implement lexical analyzers:
1) Generate them automatically from a lexical specification.
2) Write them by hand.
In 1), there is a program that takes a lexical specification and constructs a program that acts as a lexer. This may sound complicated, so feel free to ignore it for now. We won't use that.
For most programming languages, it's easy enough to make the lexical analyzer by hand. And the usual way is to make an automaton-like program. If you don't know what an automaton is, then again, don't worry. Basically, we try to decide, based on the characters that we come across, which token we might be in.
That is, if I see i, then the token I'm trying to recognize is definitely not e.g. return. It may be if or it may be int. So, I try to read a second character that will help me decide, etc.
In the general case all those are more difficult because there are user-defined tokens, aka
identifiers.
When you write: int foo; you have introduced a user-defined identifier - foo.
In the same way, int i; introduces the identifier i. So when you see i, it might be if, it might be int, but it might also be just i.
Feel free to forget those complicated situations for now. Try to focus only on simple
keywords.
There will be a big switch that, based on the first character that we see, will try to figure out which token we're in. Then, we'll test all the rest of the characters for all tokens that start with the same character. For example:
const char *input = ...;
switch (input[0]) {
case 'i':
{
if(input[1] == 'f') {
// We saw `if`
return;
}
if (input[1] == 'n' &&
input[2] == 't')
{
// We saw `int`
return;
}
} break;
case 'r':
{
...
} break;
}