Present the architecture of a modern RISC digital computer circa 1996 and its relationship to the Unix operating system and the C programming language. Understand how high-level languages are represented in a form executable by such a computer, and the underlying machine programming language and structure. The project will consider the C programming language, SPARC architecture, boolean logic, number systems, and computer arithmetic; macro assembly language programming and subroutine linkages; the operating system interface and input/output; understanding the output of the C compiler; and the use of the C programming language to generate specific assembly language instructions.
Understand the architecture of a RISC machine, specifically a Sun SPARCstation 5 workstation; understand the microSPARC-II (code-named Swift) microprocessor, which implements the SPARC V8 instruction set architecture (ISA) developed by Sun Microsystems; learn the C language grammar and assembly language programming, the form of assembly language generated by a compiler, and the interface to an operating system; and design a complete SPARC C compiler using flex and bison.
- Richard Paul, “SPARC Architecture, Assembly Language Programming, and C.” Prentice Hall.
- Brian Kernighan and Dennis Ritchie, “The C Programming Language.” Second Edition, Prentice Hall.
- Samuel P. Harbison and Guy L. Steele Jr., “C: A Reference Manual.” Third Edition, Prentice Hall.
- Alfred V. Aho et al., “Compilers: Principles, Techniques, and Tools.” Addison-Wesley.
- John R. Levine et al., “lex & yacc.” Second Edition, O'Reilly & Associates.
- Richard M. Stallman, “GNU Emacs Manual.” Free Software Foundation.
Building the compiler itself only requires a clang environment, but assembling the SPARC assembly code that the compiler generates requires a cross-platform toolchain or a real or virtual SPARCstation 5/10/20 machine. On any Linux machine that uses apt, run the following as root:
apt install git make clang vim bison flex
git clone https://github.com/ekbann/sparc-compiler
cd sparc-compiler
make
./CC < tests/test1.c
I also successfully compiled on a macOS machine using the Xcode environment from a Terminal. An alternative is to use QEMU and Buildroot to make tiny virtual machines. The following tutorial describes how to compile or assemble simple user-level programs for a SPARC V8 target and step through their execution using QEMU and gdb. It assumes you're using Linux.
A cross compiler is needed when the machine on which the compiler runs (called the host) has a different architecture (say x86) than the machine for which the executable is produced (called the target, which is SPARC V8 in our case). The simplest way of obtaining a working cross compiler is to use Buildroot.
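The host/target distinction can be observed from inside a C source file. The sketch below is only an illustration: the function name `target_arch` is hypothetical, and it relies on the predefined macros that gcc and clang set for the architecture they generate code for (for example `__sparc__` on SPARC targets).

```c
/*
 * Return a short name for the architecture this file was compiled FOR
 * (the target), which need not be the machine the compiler ran on.
 * target_arch is a hypothetical helper used only for illustration.
 */
const char *target_arch(void)
{
#if defined(__sparc__) || defined(__sparc)
    return "sparc";
#elif defined(__x86_64__) || defined(__i386__)
    return "x86";
#elif defined(__aarch64__) || defined(__arm__)
    return "arm";
#else
    return "unknown";
#endif
}
```

Compiled with the host compiler on an x86 machine, the function returns "x86"; compiled with a SPARC cross compiler, the preprocessor selects the "sparc" branch instead, even though the compiler itself ran on x86.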
Download the latest buildroot tarball and untar it or simply clone from GitHub:
$ sudo apt install rsync
$ git clone https://github.com/buildroot/buildroot.git buildroot
Change to the Buildroot directory and run the following commands:
$ cd buildroot
$ make qemu_sparc_ss10_defconfig
$ make menuconfig
This opens a text-based menu interface. Go to the Toolchain menu, scroll down, and select the option `Build cross gdb for the host`. This is needed because the default Buildroot configuration for SPARC V8 (`qemu_sparc_ss10_defconfig`) does not include cross-gdb. Save and exit the interface. Now run make:
$ make
This will download and build the required packages and can take a while. At the end of make, we get a working cross compiler toolchain. The binaries (sparc-linux-gcc, sparc-linux-as, sparc-linux-gdb, etc.) are in the folder `<path-to-buildroot>/output/host/usr/bin`. Add this location to your system's PATH variable to use the cross compiler binaries outside Buildroot.
Consider a simple assembly program Foo.s:
Foo.s
.global _start
_start:
! comments start with '!'
mov 2, %g1 !
mov 3, %g2 !
add %g1, %g2, %g3 ! g3 should now contain 5
nop
nop
nop
Assemble and link it to get an executable Foo. (The `-g` option includes debugging symbols in the generated executable.)
$ sparc-linux-as -g -o Foo.o Foo.s
$ sparc-linux-ld -g -o Foo Foo.o
Instead of assembly, you can start with a simple C program Bar.c:
Bar.c
int a,b,c=0;
int main()
{
a=2;
b=3;
c=a+b;
return 0;
}
Compile, assemble and link it as follows.
$ sparc-linux-gcc -g -S -o Bar.s Bar.c
$ sparc-linux-as -g -o Bar.o Bar.s
$ sparc-linux-ld -g -e main -o Bar Bar.o
The `-e` option tells the linker the location of the first executable instruction (the entry point). We set the entry point to the function main() in our case. The disassembled instructions in Bar can be viewed using the objdump utility:
$ sparc-linux-objdump -d -S Bar
Install the package qemu-user. This installs binaries for several targets, for example qemu-alpha, qemu-mips, and qemu-sparc.
$ sudo apt-get install qemu-user
In a terminal, start `qemu-sparc` and set it up for remote debugging with gdb:
$ qemu-sparc -g 1234 Foo
In another terminal, open gdb.
$ sparc-linux-gdb Foo
Inside gdb, attach to QEMU:
(gdb) target remote :1234
In gdb, enter `s` to step through the assembly instructions. Use the command `info reg <reg-name>` to examine register contents.
(gdb) target remote :1234
Remote debugging using :1234
_start () at Foo.s:4
4 mov 2, %g1 !
(gdb) s
5 mov 3, %g2 !
(gdb) s
6 add %g1, %g2, %g3 ! g3 should now contain 5
(gdb) info reg g3
g3 0x0 0
(gdb) s
7 nop
(gdb) info reg g3
g3 0x5 5
(gdb)
Summary of the important aspects of my compiler:
- The code was originally written in 1996 on a Sun SPARCstation 20 (32-bit RISC architecture) running Solaris 2.6 using gcc 2.7.x, which was not ANSI C compliant. Some parts of the code had to be rewritten or fixed to eliminate compiler warnings and errors, but the pointer usage of that era was somewhat hazardous and sometimes relied on undefined behavior (UB) to make the code work. Building with a modern gcc 10.2.x broke the code, causing occasional segmentation faults (try compiling with gcc and running `./CC < tests/gcc-segfault.c`). Luckily, building with clang preserved the behavior the code relied on, and the compiler ran smoothly. Perhaps one day I'll rewrite the compiler with proper pointer usage.
- All `external_decls` are assigned the `modifier` type `EXTERN` unless specifically defined in the source code.
- I added a debug directive, `debug(node_dump_on)`, and its counterpart `debug(node_dump_off)`, to keep track of the creation of syntax tree nodes. The output of a few sample nodes is:
node_type: STATEMENT [0x600000f7c240]
left: LEAF [0x600000f7c1e0]
right: NODE [0x0]
node_type: STATEMENT [0x600000f7c2a0]
left: LEAF [0x600000f7c180]
right: NODE [0x600000f7c240]
The number in square brackets is a pointer to that specific node. At the end of parsing the source code, this directive outputs the pointer to the `ROOT` of the program syntax tree, e.g.
syntax tree root = [0x600000f7c2a0]
This allows the user to manually reconstruct the syntax tree and verify that it was built properly. Another way is to use the directive `debug(statement_dump)` to get a verbose view of the syntax tree:
*** STATEMENT DUMP
=, e_var, t_int, c_scalar
x, e_var, t_int, c_scalar
3, e_const, t_int, c_scalar
=, e_var, t_int, c_scalar
y, e_var, t_int, c_scalar
10, e_const, t_int, c_scalar
=, e_var, t_int, c_scalar
z, e_var, t_int, c_scalar
+, e_var, t_int, c_scalar
x, e_var, t_int, c_scalar
y, e_var, t_int, c_scalar
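The node dump above suggests a binary tree in which each STATEMENT node's right child appears to link to the previously created statement, so the last statement created becomes the root. The following is a minimal sketch of such a structure, not the compiler's actual code: the names `node`, `make_node`, `dump_node`, and `count_nodes` are hypothetical, and the dump prints each child's own kind, which only approximates the labels shown in the sample output.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node kinds; the real compiler's set is larger. */
typedef enum { K_NODE, K_LEAF, K_STATEMENT } node_kind;

typedef struct node {
    node_kind kind;
    struct node *left;   /* e.g. the expression of a statement */
    struct node *right;  /* e.g. the previously created statement */
} node;

static const char *kind_name(node_kind k)
{
    switch (k) {
    case K_LEAF:      return "LEAF";
    case K_STATEMENT: return "STATEMENT";
    default:          return "NODE";
    }
}

/* Allocate a node; returns NULL if malloc fails. */
node *make_node(node_kind kind, node *left, node *right)
{
    node *n = malloc(sizeof *n);
    if (n != NULL) {
        n->kind = kind;
        n->left = left;
        n->right = right;
    }
    return n;
}

/* Print one child line; an absent child is shown as NODE [0x0]. */
static void dump_child(FILE *out, const char *label, const node *child)
{
    if (child != NULL)
        fprintf(out, "%s: %s [%p]\n", label, kind_name(child->kind),
                (const void *)child);
    else
        fprintf(out, "%s: NODE [0x0]\n", label);
}

/* Print a node and the addresses of its two children. */
void dump_node(FILE *out, const node *n)
{
    fprintf(out, "node_type: %s [%p]\n", kind_name(n->kind),
            (const void *)n);
    dump_child(out, "left", n->left);
    dump_child(out, "right", n->right);
}

/* Count every node reachable from root (0 for an empty tree). */
size_t count_nodes(const node *root)
{
    if (root == NULL)
        return 0;
    return 1 + count_nodes(root->left) + count_nodes(root->right);
}
```

Calling `dump_node(stdout, n)` right after each `make_node` call gives a trace that can be followed by hand from the root pointer, in the same way as the dump shown above.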
- The debug directive `debug(symtab_dump)` dumps the symbol table at the current context level. After the closing brace of a compound statement (see `statement` in CC.y), the compiler deletes the closed context level because those symbols are no longer required.
*** SYMBOL TABLE DUMP, e_<entry type>, t_<variable type>, c_<constructor type>
bucket 24
<"main" scope 0, e_fn, t_void, c_scalar, references: 1>
bucket 4
<""hello"" scope 0, e_const, t_char, c_array, references: 1>
bucket 2
<"'c'" scope 0, e_const, t_char, c_scalar, references: 1>
- The debug directive `debug(comment_on)` and its counterpart `debug(comment_off)` toggle the output of comments. This is very useful if one wants to analyze a specific segment of code. So far, comments are only emitted for variable identification and some trivial operations such as `++`, `--`, and register flushing. A future version of my compiler will have detailed comments in the output code.
- A global structure pointer named `fn_p` is used to store the main function entry in the symbol table so that type checking can be performed on `RETURN` nodes.
- Multi-source compilation is implemented, allowing one to compile multiple sources into objects and then link them together, e.g. compiling main(), init(), sort(), and dump_array() from the tests/sort directory and linking them into an executable.
- I added a node type `ARRAY` to represent an `ID` and an index.
- Test code snippets can be found in the tests directory along with the multi-source programs in tests/sort and tests/euclid.
- There is a basic design flaw in the insertion of constants in the symbol table. Because each distinct INTEGER CONSTANT value gets a single shared table entry, there are conflicts in `p->where` if the same integer constant is used in both the `LVAL` and `RVAL` expressions, e.g. `array[1] = 1;`. A future version of my compiler should treat each occurrence of a constant as a unique entry. This bug should also affect `FCON`, `CCON`, and `SCON`.
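Two points from the summary above, deleting a context level at a closing brace and the shared-constant conflict, can be illustrated with a small hashed symbol table. This is a sketch under stated assumptions rather than the compiler's implementation: every name here (`symtab`, `insert`, `lookup`, `open_scope`, `close_scope`, `NBUCKETS`) is hypothetical, and the meaning of `where` is a guess, assumed to be the number of the register currently holding the value.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 31  /* hypothetical table size */

typedef struct symbol {
    char *name;
    int scope;            /* context level the entry belongs to */
    int references;
    int where;            /* ASSUMED: register holding the value, -1 if none */
    struct symbol *next;  /* next entry in the same bucket */
} symbol;

typedef struct {
    symbol *bucket[NBUCKETS];
    int scope;            /* current context level, 0 = global */
} symtab;

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Find the innermost entry for name; new entries sit at the chain head. */
symbol *lookup(symtab *t, const char *name)
{
    for (symbol *p = t->bucket[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Insert name at the current level, or reuse the entry already there. */
symbol *insert(symtab *t, const char *name)
{
    symbol *p = lookup(t, name);
    if (p != NULL && p->scope == t->scope) {
        p->references++;          /* shared entry: this is the flaw for constants */
        return p;
    }
    p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->name = malloc(strlen(name) + 1);
    if (p->name == NULL) {
        free(p);
        return NULL;
    }
    strcpy(p->name, name);
    p->scope = t->scope;
    p->references = 1;
    p->where = -1;
    unsigned h = hash(name);
    p->next = t->bucket[h];
    t->bucket[h] = p;
    return p;
}

void open_scope(symtab *t) { t->scope++; }

/* Delete every entry of the level being closed, then step back one level. */
void close_scope(symtab *t)
{
    if (t->scope == 0)
        return;                   /* never delete the global level */
    for (int i = 0; i < NBUCKETS; i++) {
        while (t->bucket[i] != NULL && t->bucket[i]->scope == t->scope) {
            symbol *dead = t->bucket[i];
            t->bucket[i] = dead->next;
            free(dead->name);
            free(dead);
        }
    }
    t->scope--;
}
```

With this layout, inserting the constant `"1"` twice for `array[1] = 1;` returns the same entry both times, so setting `where` for the right-hand side overwrites the value recorded for the index. Giving each occurrence its own entry, as proposed above, would avoid that overwrite.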
The following features have not yet been implemented in my compiler:
- `FLOATS`, and the related functions `ITOF` and `FTOI`;
- Special chars `\ooo` and `\xhh`, i.e. octal and hex ASCII codes;
- Passing `ARRAY` pointers to external functions. (This has been implemented but does not yet work.)
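As a rough idea of what supporting the `\ooo` and `\xhh` escapes might involve, here is a minimal sketch of a decoder. It is not part of the compiler: `decode_numeric_escape` and `hex_value` are hypothetical names, the hex form is limited to two digits to match the `\xhh` notation above (ISO C itself allows more), and the letter ranges assume an ASCII-compatible character set.

```c
#include <stddef.h>

/* Value of a hex digit, or -1 if c is not one (assumes ASCII letters). */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode a numeric escape that starts at s[0] == '\\'.
 * On success, stores the number of source characters consumed in *len
 * and returns the character code (0..255).
 * Returns -1, leaving *len untouched, if s is not a \ooo or \xhh escape.
 */
int decode_numeric_escape(const char *s, size_t *len)
{
    size_t i;
    int value = 0;

    if (s[0] != '\\')
        return -1;

    if (s[1] == 'x') {
        /* \xhh: one or two hex digits */
        i = 2;
        if (hex_value((unsigned char)s[i]) < 0)
            return -1;
        while (i < 4 && hex_value((unsigned char)s[i]) >= 0) {
            value = value * 16 + hex_value((unsigned char)s[i]);
            i++;
        }
    } else if (s[1] >= '0' && s[1] <= '7') {
        /* \ooo: one to three octal digits */
        i = 1;
        while (i <= 3 && s[i] >= '0' && s[i] <= '7') {
            value = value * 8 + (s[i] - '0');
            i++;
        }
        value &= 0xFF;            /* keep the result within one byte */
    } else {
        return -1;
    }

    *len = i;
    return value;
}
```

A scanner action for character or string constants could call this whenever it meets a backslash and fall back to the existing handling of escapes such as `\n` when the function returns -1.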
A good portion of my code was written ad hoc with little structure, relying heavily on my intuition. Proper planning went into the design of the node structure and syntax tree generation using a hash table. Debugging consisted of many strategically placed printf calls, with minimal use of the gdb debugger for code tracing. Writing the compiler this way has shown me many different ways in which things can go wrong, especially with the improper pointer usage of the pre-ANSI C era, which does not follow the modern ISO C17 standard. Rewriting my compiler entirely from scratch with the knowledge gained here would probably produce much cleaner and more efficient code without redundancies.