A rudimentary scanner, written in C with Flex, that extracts tokens from Standard Generalized Markup Language (SGML) documents. The scanner tracks a set of "relevant" tags ("TEXT", "DATE", "DOC", "DOCNO", "HEADLINE", "LENGTH", and "P") and prints only the tokens found between them. If a tag is closed incorrectly, an error is displayed. Once scanning finishes, any tags left unclosed are listed.
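The relevance check described above can be sketched as a simple lookup. This is only an illustration, not the scanner's actual code: the names RELEVANT_TAGS and is_relevant_tag are hypothetical, and the sketch assumes tag names have already been normalized to upper case before the lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The tag names this README lists as "relevant". */
static const char *const RELEVANT_TAGS[] = {
    "TEXT", "DATE", "DOC", "DOCNO", "HEADLINE", "LENGTH", "P"
};

/* Hypothetical helper: returns 1 if name is a relevant tag, 0 otherwise.
 * Assumption: name is already upper case, so an exact match is enough. */
int is_relevant_tag(const char *name)
{
    assert(name != NULL);

    size_t count = sizeof RELEVANT_TAGS / sizeof RELEVANT_TAGS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(name, RELEVANT_TAGS[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
```

With seven entries, a linear search is fast enough; a larger tag set might justify a sorted table or a hash lookup.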
While developing the scanner, I initially overlooked the distinction between "relevant" and "irrelevant" tags: the first "working" version printed every properly formed tag. To patch this, I used an int variable to track when an irrelevant tag was open, incrementing it as further tags were nested inside that irrelevant tag. If I were to redo this assignment, I would refactor this entire portion and instead have my stack structures record which tags are irrelevant (or relevant but nested inside an irrelevant tag).
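One possible shape for that refactor is sketched below. It is not the code in this repository: TagEntry, TagStack, the tag_stack_* functions, and the fixed limits MAX_DEPTH and MAX_NAME are all hypothetical. Each stack entry carries a "suppressed" flag that it inherits from its parent, so no separate counter is needed.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64   /* assumed maximum nesting depth */
#define MAX_NAME  32   /* assumed maximum tag name length, including '\0' */

typedef struct {
    char name[MAX_NAME];
    int  suppressed;   /* 1 if tokens inside this tag must not be printed */
} TagEntry;

typedef struct {
    TagEntry items[MAX_DEPTH];
    int      size;
} TagStack;

void tag_stack_init(TagStack *s)
{
    assert(s != NULL);
    s->size = 0;
}

/* Push an opening tag. The tag is suppressed if it is irrelevant itself
 * or if its parent is already suppressed. Returns 0, or -1 if full. */
int tag_stack_push(TagStack *s, const char *name, int relevant)
{
    if (s->size >= MAX_DEPTH) {
        return -1;
    }
    int parent_suppressed = s->size > 0 && s->items[s->size - 1].suppressed;
    TagEntry *entry = &s->items[s->size++];
    strncpy(entry->name, name, MAX_NAME - 1);
    entry->name[MAX_NAME - 1] = '\0';
    entry->suppressed = !relevant || parent_suppressed;
    return 0;
}

/* Pop on a closing tag. Returns 0 if name matches the innermost open tag,
 * or -1 (leaving the stack unchanged) on a mismatch or an empty stack. */
int tag_stack_pop(TagStack *s, const char *name)
{
    if (s->size == 0 || strcmp(s->items[s->size - 1].name, name) != 0) {
        return -1;
    }
    s->size--;
    return 0;
}

/* Tokens are printed only inside a tag that is not suppressed. */
int tag_stack_should_print(const TagStack *s)
{
    return s->size > 0 && !s->items[s->size - 1].suppressed;
}
```

Because the flag is computed once at push time, closing a nested tag restores the parent's state automatically, and any entries still on the stack at end of input are the unclosed tags to report.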
To build and compile the scanner, run:
make
Test files were written throughout development. Three were left over, aiming to cover both ideal and error cases. To run all three tests, run:
make test
Separate commands are included for the provided test file, newsdata.txt. To scan newsdata.txt and print the output, run:
make run
To scan newsdata.txt and write the output to sample.out, run:
make run_out
To run the scanner on your own files, redirect an SGML-formatted document to the program's stdin:
./scanner < your_file.txt
To clean up the compiled files and scanned output, run:
make clean
To save time, rather than developing my own stack data structure, I used code from: http://stackoverflow.com/questions/1919975/creating-a-stack-of-strings-in-c