M-Forth
Abstract
M-Forth is an implementation of FORTH for the M1 processor on Mac OS X.
Revision history
2021-12-06: M-Forth is now capable of running all the code in stdlib.f. Not all primitives are ready though, and documentation is still incomplete. The primitive "." prints in hex until I've written the proper definition in stdlib.f.
2021-12-15: M-Forth can now print numbers with . which respects the number base stored in the BASE variable. The words DECIMAL and HEX in stdlib.f changes BASE to 10 or 16, respectively.
2021-12-20: Fixed some serious bugs with LITSTRING and >CFA so now you can print strings with ." Hello world".
2021-12-28: Very stable now. Almost all of the library code in stdlib.f is ready. (Expect to be finished in a few days.)
2021-12-31: It's New Year's Eve and I managed to get M-FORTH up and running. There's an annoying bug (?) with FILE-OPEN which returns the file descriptor 2 rather than a negative value. Also, it would be nice if DOES> could be implemented. But we have to save something for 2022!
Introduction and background
This document will not explain FORTH in detail — there are plenty of resources for that. Rather, we will just cover the salient features of the language, in order to understand the implementation.
Forth words are akin to subroutines in other programming languages. Words can be either primitive, i.e., written in assembly language, or defined in terms of other words. The syntax for defining a new word is
: word word1 word2 … wordn ;
which is equivalent to
procedureword() { call word1(); call word2(); … call wordn(); }
in a traditional programming language. Notice that in Forth, arguments to words are not named. Instead, they are assumed to appear on a stack. As an example, let's look at the following definitions:: DOUBLE DUP PLUS ; : QUADRUPLE DOUBLE DOUBLE ;Here, we define two new words — DOUBLE and QUADRUPLE — using the primitives DUP and PLUS. (The primitive DUP duplicates the top element of the stack.) In a working FORTH system, we could then continue and enter
42 QUADRUPLE .and the system would reply 168. (If you find this confusing, I urge you read a tutorial on FORTH first.)
FORTH programmers can enjoy identifiers with pretty much any character (but not space), so . is another primitive which removes and prints out the top stack element.
The inner interpreter
As we've seen, FORTH words can be either primitives — consisting entirely of assembly language snippets — or user-defined, referencing a list of previously defined words (which, in themselves, reference other words, etc, until they ultimately reference a primitive).
We will define the inner interpreter — the core of the FORTH system. To execute a primitive word like DUP or PLUS is obvious: simply call the assembly code. But in order to execute a user-defined word, we must potentially invoke the interpreter recursively. For instance, when executing QUADRUPLE we must (twice) execute the word DOUBLE which, in itself, must execute DUP and PLUS.
In a higher-level language, the interpreter could be expressed recursively but here we must create a tight and fast interpreter in assembly language. It is therefore necessary for the interpreter to maintain another stack to remember the location before going off to execute the list of words in the referenced word. More specifically, when we execute the body of QUADRUPLE we first encounter DOUBLE so before calling the interpreter again with DOUBLE we save the location of where we must continue when we are done (which, incidentally, is another call to DOUBLE).
M-Forth uses "Indirect Threading". In the first stage we write the so called "inner interpreter" which can execute words but does not handle user input, i.e., all words (primitive or not) must be defined in the assembler file.
First we need to look at the data structure for word definitions. Let's look at the previous example:
: DOUBLE DUP PLUS ; : QUADRUPLE DOUBLE DOUBLE ;
In this figure, gray cells denote the so called "codewords" which contains a pointer to a piece of code that handles the word. What is meant by "handling" the word depends on the word being a primitive (written in assembler) or not (consisting of links to other words).
M-FORTH language specifics
Control structures
IF
true-part THEN
condition Algol equivalent: IF
condition THEN
true-part
( Calculate the absolute value, e.g., -3 ABS -- 3 )
: ABS ( a -- |a| )
DUP ( a a )
0< IF ( a )
NEGATE ( -a )
THEN
;
IF
true-part ELSE
false-part THEN
condition Algol equivalent: IF
condition THEN
true-part ELSE
else-part
( Find maximum value, e.g., 5 9 MAX -- 9 )
: MAX ( a b -- max(a,b) )
2DUP ( a b a b )
> IF ( a b )
DROP ( a )
ELSE
SWAP DROP ( b )
THEN
;
BEGIN
loop-part condition UNTIL
Algol: DO
loop-part WHILE
condition
( Write n spaces, e.g., 10 SPACES writes 10 " " )
: SPACES ( n -- )
BEGIN
SPACE
-1 + ( n-1 -- )
DUP 0 =
UNTIL ( until n=0 )
;
This is compiled to
,-----------------------------.
V \
: SPACES SPACE -1 + DUP 0 = 0BRANCH ( -72 ) :
BEGIN
loop-part AGAIN
Algol: WHILE
true DO
loop-part
( Print random numbers 0..9 until 0 occurs )
: RANDOM-NUMBERS ( -- )
RANDOMIZE
BEGIN
RND 10 MOD
DUP . CR ( print random number 0..9 )
0= IF
EXIT ( use EXIT to leave infinite loop )
THEN
AGAIN
;
Note that since BEGIN
... AGAIN
forms an infinite loop, you must exit with EXIT
, which returns from the word.
BEGIN
condition WHILE
loop-part REPEAT
Algol: WHILE
condition DO
loop-part
( Calculate sum of a + ... + b, e.g., 1 10 SUM -- 55 )
: SUM ( a b -- sum ) ( if a < b, sum is 0 )
0 ( a b 0 )
-ROT ( 0 a b )
BEGIN
2DUP <= ( while a <= b ... )
WHILE ( sum a b )
-ROT ( b sum a )
DUP ( b sum a a )
-ROT ( b a sum a )
+ ( b a sum+a )
-ROT ( sum+a b a )
1+ ( sum+a b a+1 )
SWAP ( sum+a a+1 b )
REPEAT
DROP DROP ( sum )
;
UNLESS
true-part THEN
condition Algol: IF
not condition THEN
true-part
( Return TRUE if negative, otherwise FALSE, e.g., -3 NEGATIVE -- TRUE )
: NEGATIVE ( a -- TRUE | FALSE )
TRUE ( a TRUE )
SWAP ( TRUE a )
0< UNLESS
1+ ( TRUE -> FALSE )
THEN
;
Deviations from Jones Forth
-
Jones relies heavily on macros defcode, defvar and defconst for defining primitives, variables and constants. The assembler I use does not let me re-define the variable link which is used to chain together all the words. Instead I wrote a small Python script words.py which generates the necessary assembler definitions for all the primitives, variables and constants. The Makefile generates the file primitives.s which is then included in the main file M.s.
-
Rather than putting magic numbers like 4 and 8 in the code, I've decided to add two constants __WORDSIZE and __STACKITEMSIZE. These are (as the names imply) the number of bytes in a word (8) and the size of a stack element (16). These are restrictions enforced by the ARM architecture, so leave them alone unless you are porting this to another platform.
-
As a result of the above, I have not implemented primitives 4+ and 4-. You should use __WORDSIZE + if you are messing around with the internal data structures.
-
M-FORTH checks for stack underflow on the data stack, and stack overflow on the return stack. You can disable this with the STACKWARNING option in the Makefile but in my experience, checking for stack underflow/overflow doesn't add much to overall execution time. While it would be nice to check for data stack overflow and return stack underflow, this is unfortunately not implemented (yet).
-
I've decided to use -1 for true and 0 for false, unlike Jones Forth but in sync with ANS FORTH. I think the reason for sticking to the ANS FORTH convention is that you can use AND, OR, etc both as boolean and bitwise logical operations.
-
Jones Forth relies on the brk(2) system call to figure out where the data segment starts and to request some initial space. In recent Mac OS X, brk has been discontinued so I allocate space for the data segment with a constant INITIAL_DATA_SEGMENT_SIZE in the BSS segment. This does not seem to affect the size of the binary for M-FORTH.
-
The user-defined word .S prints the contents of the stack (non-destructively) but in my implementation I've decided to (1) print the content with . rather than U. so that negative values don't appear unsigned; (2) the stack is printed in the order so that the top element is shown to the right.
-
Normally, M-FORTH is started with
cat stdlib.f - | ./M
but if M-FORTH is started without loading stdlib.f
, there is still
a primitive form of .
which prints the top-of-stack as a 64-bit
hexadecimal number. This is a useful debugging tool when working
with the core interpreter.
Things to do (unsorted)
DO
...LOOP
withLEAVE
,CONTINUE
andI
inside the body- A variable, like
BASE
for case sensitivity, so we can writedup
as well asDUP
. - Documentation
>DFA
usesINCR8
which should be called something else.