tshikaboom / libone

A small stab in trying to parse OneNote files

License: Mozilla Public License 2.0

Makefile 2.58% Shell 3.83% M4 4.59% C++ 88.67% Meson 0.34%

libone's Introduction

libone

This is a library that tries to parse OneNote files. It currently does not do much besides parsing structures and verifying that they are actually present in the file (mostly they are, because we use OneNote-generated files to test this stuff, as that is all we have). The library is based around librevenge, as the goal is to eventually have something that can be imported into LibreOffice Draw & co.

Inspiration

The file format is described in two references, [MS-ONE] and [MS-ONESTORE]. The latter is what is currently being worked on, as the storage part is still to be finished. The former is the more interesting one, as it contains the meat of what the user does with the document.

The file format itself is structured in transactions and revisions (think of git) which themselves contain the user data. We currently parse the header, the root file node list and its file nodes, and would try to parse object spaces after that.

These documents depend on other reference documents provided by Microsoft (for CRC checking algorithms, primitive type sizes and whatnot).

Contributing

Please do! Code clean-ups would be much appreciated, as C++ is not the thing I'm good at. Otherwise, reading the references and implementing them is the way to go, so this stuff should be straightforward, if time-consuming.

Building and testing

This is a Meson-based project. As such, invoking

meson build
ninja -C build

would suffice to get this built. After that, running ./src/conv/raw/one2raw <file.one> would get you some output.

To run the tests, make sure that your build directory has been configured with tests enabled (currently on by default). Then change into your build directory and run:

cd build
meson test all_tests

Licence

The library is available under MPL 2.0+, as other librevenge-based projects are.

libone's Issues

FileNodeType implementation

Hi Oskar,

the next milestone is likely to be implementing the logic to deal with different file node types. In #11, you have touched on this, I guess.

In FileNode.cpp, parsing of the FileNodeTypes is currently empty.

To address this task, i would like to discuss the possible solution approaches. Currently, i see two different approaches.

Type classes approach

The first approach is to deal with the FileNodeTypes by implementing them as individual classes. This has been done in a test suite developed by Microsoft themselves. See FileNode.cs for reference.

This approach has the advantage of producing manageable chunks of code, for which it will also be easier to write tests.
But it will require lots of implementation code built around a base type class.
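
To make the trade-off concrete, here is a minimal sketch of what the type classes approach could look like. All names (FileNodeData, make_file_node_data) are hypothetical and not part of libone. The FileNodeID 0x021 is the one mentioned in the FileNode question further down this page; 0x008 is my reading of [MS-ONESTORE] and should be checked against the spec. The payload is only stored raw here, where a real class would decode it into typed members.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical base class: one subclass per FileNodeID.
class FileNodeData {
public:
  virtual ~FileNodeData() = default;
  // Parse the node payload, i.e. the bytes following the 4-byte header.
  virtual void parse(const std::vector<std::uint8_t> &payload) = 0;
  virtual std::string name() const = 0;
};

// Sketch of one node type; it only keeps the raw payload bytes.
class ObjectSpaceManifestListReferenceFND : public FileNodeData {
public:
  void parse(const std::vector<std::uint8_t> &payload) override { m_raw = payload; }
  std::string name() const override { return "ObjectSpaceManifestListReferenceFND"; }
  std::size_t payload_size() const { return m_raw.size(); }

private:
  std::vector<std::uint8_t> m_raw;
};

// A node type without payload needs almost no code.
class GlobalIdTableStartFNDX : public FileNodeData {
public:
  void parse(const std::vector<std::uint8_t> &) override {}
  std::string name() const override { return "GlobalIdTableStartFNDX"; }
};

// Factory: maps a FileNodeID to its class. Returns nullptr for types that
// are not implemented, so the caller can skip those nodes.
std::unique_ptr<FileNodeData> make_file_node_data(std::uint32_t file_node_id) {
  switch (file_node_id) {
  case 0x008:
    return std::unique_ptr<FileNodeData>(new ObjectSpaceManifestListReferenceFND());
  case 0x021:
    return std::unique_ptr<FileNodeData>(new GlobalIdTableStartFNDX());
  default:
    return nullptr;
  }
}
```

Each class can then be unit-tested on its own, which is the main advantage named above; the cost is one class plus one switch entry per node type.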

No type classes approach

The second approach is to deal with all the types 'inline'. The parser by Dropbox went with this approach.
Only the parser will need logic to deal with the FileNodeType's content.

Such a style could limit the amount of code needed, since not all features are relevant for converting a .one file to an open format.
However, I have the impression that such an approach is too monolithic and likely error-prone. Writing tests also seems to be difficult.

A mixed implementation approach

A mix of the above approaches was done by Apache (Tika). They have a FileNode class which has member variables for the most frequently occurring types. Only a few, more complex FileNodes are implemented in separate classes.

This approach seems to be the best mix of amount of code and class abstraction required. They only parse what is necessary to extract meta information from the file.
However, it is required to actually know which pieces of information are relevant. And again, tests will need to be more complex.

Although I have read the specifications a couple of times, I don't have a comprehensive overview of the flow of information while parsing. That's why I am not very comfortable with the "no type classes" or the mixed approach.
At the cost of more code and files, I went with the first approach in my own library attempt; see FileNode.cpp for reference.
I think it has the highest flexibility, since I could concentrate on the local implementations of the higher-level features.

How should FileNodeTypes be implemented in `libone`

So, with this issue, I would like to open a discussion on what would likely be the best route for this library.

This library inherits the design goals of librevenge: it is only designed to parse files, not to write proprietary formats.
That's why we wouldn't need all the functionality of MS-ONESTORE.

What do you think?

Use the Meson build system

This would rid us of a lot of autotools-based stuff, and would (I guess) simplify the project.

Thoughts on parsing stages RevisionStoreFile and OneDocument

While studying the OneNote file spec further, I got the impression that it might be a good idea to split the file entities described in MS-ONESTORE from the actual content described by MS-ONE.

This would roughly mean that the lib would first parse the RevisionStoreFile, with all the Revisions, FileNodes, etc.

Once all chunks are declared, the lib would instantiate a higher-level Document class which represents the MS-ONE spec, with Sections, Pages, Textboxes, etc. It would also extract binary data, such as images, embedded files and so on.

This new document class could be further processed by the librevenge converters since they don't need to care anymore how chunks were originally stored.

Why would this be a good idea? It will result in more boilerplate and additional moving/copying of data in memory.
However, the lib could compartmentalize the RevisionStoreFile into a specific stream. This stream would inherit from a more general libone stream which is used as the general input stream for that Document class, masking the revision store file structure.
In the long run, this would also mean we could write other input streams which fetch data from other sources, such as OneDrive, without needing to touch the revision store file parser again.

However, I don't really know whether this is necessary, since there is a REST API for notebooks in the MS cloud. The other protocol used by SharePoint is likely out of scope for libone. So splitting up the different parsing stages might be unnecessary if no other adapter is ever needed.

Do some kind of proper logging

Currently we use #defined instructions to log everything. This is going to get messy as we go on; maybe try to use some structured logging controlled by an environment variable.

I would imagine using an environment variable to control the verbosity, such as

LIBONE_LOG=NONE
LIBONE_LOG=DEBUG
LIBONE_LOG=WARNING

etc.
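
As a rough sketch of how such a variable could be read: the level names NONE, WARNING and DEBUG come from the list above, while everything else (the LogLevel enum, the function names, falling back to NONE for unset or unknown values, and the "libone: " prefix) is an assumption made for illustration and not existing libone code.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical verbosity levels, ordered from quietest to most verbose.
enum class LogLevel { None = 0, Warning = 1, Debug = 2 };

// Map the text of LIBONE_LOG to a level. Unset (nullptr) or unknown
// values fall back to None; that default is an assumption.
LogLevel parse_log_level(const char *value) {
  if (value == nullptr) return LogLevel::None;
  const std::string text(value);
  if (text == "DEBUG") return LogLevel::Debug;
  if (text == "WARNING") return LogLevel::Warning;
  return LogLevel::None;
}

// Read the environment only once and cache the result.
LogLevel current_log_level() {
  static const LogLevel level = parse_log_level(std::getenv("LIBONE_LOG"));
  return level;
}

// A message is emitted if its level is at or below the configured verbosity.
bool should_log(LogLevel configured, LogLevel message) {
  return message != LogLevel::None &&
         static_cast<int>(message) <= static_cast<int>(configured);
}

void log_message(LogLevel level, const std::string &text) {
  if (should_log(current_log_level(), level)) {
    std::clog << "libone: " << text << '\n';
  }
}
```

With this in place, running a converter as LIBONE_LOG=DEBUG ./one2raw file.one would print both warnings and debug messages, while leaving the variable unset would keep the library silent.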

Idea for Ambitious Goal

I was reading some articles when I came across the LibreOffice conference, which also hosts presentations on the Document Liberation Project, where librevenge seems to originate.
libone could be included in that project at some point.

This year, the conference is held in October, and would still accept abstracts till the 4th of August.
I think this would hardly be enough time to have something to show yet...

But I guess this could be a nice goal to present this library in one, maybe two years at that conference.

[Question] Understanding FileNode

I am trying to do something similar in Python (I don't understand C/C++).

I am trying to read the following file: Problemas.zip, but I think that I am not parsing the header of the FileNode correctly: FileNodeID, Size, A, B, C, D.

For the attached file I am getting:

λ python ex01.py                                                            
FileNodeListFragment                                                        
  FileNodeListHeader: pos=1024                                              
    uintMagic:    hex: c4f4f7f5b17a56a4                                     
    FileNodeListID:  hex:10000000  bin:00010000000000000000000000000000     
    nFragmentSequence:  hex:00000000  bin:00000000000000000000000000000000  
  FileNode fnd[0]                                                           
    FileNodeHeader                                                          
    (4bytes: bin:00001000011011000000000010010101                           
    FileNodeID: hex:0x021 bin:0b0000100001                                  
    Size: dec:5632 bin:0b1011000000000                                      
    StpFormat: dec:1 bin:0b01                                               
    CbFormat: dec:0 bin:0b00                                                
    BaseType: dec:10 bin:1010                                               
    Reserved: bin:True

The size looks weird to me, because 0x021 seems to be linked to GlobalIdTableStartFNDX. Besides, I get BaseType=10, when only 0, 1 or 2 are allowed.

What I am doing is:

I parse the FileNodeListHeader at position 1024 (it takes 16 bytes).
I take the next 4 bytes and try to read 10 bits, 13 bits, 2 bits, 2 bits, 4 bits, 1 bit.

Do you think I am not doing it right or do you get the same results via C++?

Python code (in case you want to take a look)

The python code that I am using:

from bitstring import BitArray
import struct

# https://github.com/tshikaboom/libone/blob/master/src/lib/FileNode.cpp

def show(val, spaces = 0):
    spc = " " * spaces
    d = BitArray(val)
    #print(dir(d))
    return f"{spc}hex:{val.hex()}  bin:{d.bin}"



def fileNodeFragment(data, n):
    header1 = data[n:n+16]  # Header
    print("FileNodeListFragment")
    print("  FileNodeListHeader: pos=1024 ")
    print("    uintMagic: ", "  hex:",header1[0:8].hex()) #  show(header1[0:8],4))#
    print("    FileNodeListID: ", show(header1[8:12])) #" hex:", header1[8:12].hex(), "bin:",header1[8:12])
    print("    nFragmentSequence: ", show(header1[12:]))


def fileNodeHeader(data, n ):
    header2 = data[n:n+4]
    d = BitArray(header2)
    print("  FileNode fnd[0]")
    print("    FileNodeHeader")
    print(f"    (4bytes: bin:{d.bin}")
    tmp = d[0:10] #.reverse()
    print(f"    FileNodeID: hex:{tmp.uint:#05x} bin:{tmp}")
    tmp = d[10:23]
    print(f"    Size: dec:{tmp.uint} bin:{tmp}")

    tmp = d[23:25]  #[::-1]
    print(f"    StpFormat: dec:{tmp.uint} bin:{tmp}")
    tmp = d[25:27]  #[::-1]
    print(f"    CbFormat: dec:{tmp.uint} bin:{tmp}")
    tmp = d[27:31]#[::-1]
    print(f"    BaseType: dec:{tmp.uint} bin:{tmp.bin}")
    tmp = d[31]
    print(f"    Reserved: bin:{tmp}")    

data = open("Abrir bloc de notas.onetoc2", "rb").read()
data = open("Problemas.one", "rb").read()

pos = 1024
fileNodeFragment(data, pos)


#--- First File Node (13bytes) header:4, 1
ini = 1024 + 16
fileNodeHeader(data, ini)
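
A possible explanation, offered as a sketch and not as a confirmed diagnosis: the code above slices the bits in the order the bytes appear in the file, whereas the 4-byte header is, as far as I understand [MS-ONESTORE], a little-endian 32-bit integer whose fields are counted from the least significant bit. The C++ below decodes the same 4 bytes that appear in the output above (08 6C 00 95) that way. The struct and function names are made up for this example.

```cpp
#include <cstdint>

// Field layout assumed here, counted from the least significant bit:
// FileNodeID 10 bits, Size 13 bits, StpFormat 2, CbFormat 2, BaseType 4, Reserved 1.
struct FileNodeHeaderFields {
  std::uint32_t file_node_id;
  std::uint32_t size;
  std::uint32_t stp_format;
  std::uint32_t cb_format;
  std::uint32_t base_type;
  std::uint32_t reserved;
};

// Assemble a little-endian 32-bit value from 4 bytes in file order.
std::uint32_t read_u32_le(const std::uint8_t *p) {
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// Extract each field with a shift and a mask.
FileNodeHeaderFields decode_file_node_header(std::uint32_t header) {
  FileNodeHeaderFields f;
  f.file_node_id = header & 0x3FF;   // bits 0-9
  f.size = (header >> 10) & 0x1FFF;  // bits 10-22
  f.stp_format = (header >> 23) & 0x3;
  f.cb_format = (header >> 25) & 0x3;
  f.base_type = (header >> 27) & 0xF;
  f.reserved = (header >> 31) & 0x1;
  return f;
}
```

For the bytes 08 6C 00 95 this yields FileNodeID 0x008, Size 27, StpFormat 2, CbFormat 2 and BaseType 2, so the BaseType falls into the allowed range and the size is a plausible node length.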

Simplify FileNode

This is more or less me writing down what #19 has made me notice about the state of class FileNode.

Simplify the class definition and its parsing (the header itself is one uint32_t only; use that to infer the FileNode's properties afterwards, and do not explicitly store them as separate members)
Move skip_node() to a place better suited for what this function does, like FileNodeListFragment
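
The two points could look roughly like this. This is a sketch under assumptions: the class below is not the current libone FileNode, the bit positions follow my reading of [MS-ONESTORE], and next_node_offset() is a hypothetical stand-in for what skip_node() might become once it lives with the fragment.

```cpp
#include <cstdint>

// The node keeps only the raw 32-bit header; properties are derived on demand.
class FileNode {
public:
  explicit FileNode(std::uint32_t header) : m_header(header) {}

  std::uint32_t id() const { return m_header & 0x3FF; }            // low 10 bits
  std::uint32_t size() const { return (m_header >> 10) & 0x1FFF; } // next 13 bits
  // The remaining fields would follow the same shift-and-mask pattern.

private:
  std::uint32_t m_header;
};

// Hypothetical fragment-level helper: the fragment knows where a node
// starts, and the node's size (header included) tells where the next one begins.
std::uint64_t next_node_offset(std::uint64_t node_offset, const FileNode &node) {
  return node_offset + node.size();
}
```

Keeping the offset arithmetic outside FileNode means the node class no longer needs to know anything about the stream it was read from.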

Test name: GUID to_string test
equality assertion failed
- Expected: {00000000-0000-0000-0000-000000000000}
- Actual  : {0-0-0-0-000}

The std::hex manipulator does not seem to pad with leading zeros.
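
One way to get the expected output is to combine std::hex with std::setfill('0') and a std::setw() before every number, since the width resets after each insertion. The Guid struct and guid_to_string() below are stand-ins written for this example; they are not the libone GUID class.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Stand-in GUID layout for this example.
struct Guid {
  std::uint32_t data1;
  std::uint16_t data2;
  std::uint16_t data3;
  std::uint8_t data4[8];
};

std::string guid_to_string(const Guid &g) {
  std::ostringstream out;
  // std::hex and std::setfill persist; std::setw applies to the next item only.
  out << std::hex << std::setfill('0');
  out << '{' << std::setw(8) << g.data1;
  out << '-' << std::setw(4) << g.data2;
  out << '-' << std::setw(4) << g.data3;
  out << '-';
  for (int i = 0; i < 8; ++i) {
    if (i == 2) out << '-';
    // Cast so the byte is printed as a number and not as a character.
    out << std::setw(2) << static_cast<unsigned>(g.data4[i]);
  }
  out << '}';
  return out.str();
}
```

An all-zero Guid then formats as {00000000-0000-0000-0000-000000000000}, which is the string the test expects.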

I'll pull in the tests with the bug fix once we are clear on the Meson build system.

Use FileChunkReferences for seeking to a structure's location

Currently we manually seek to the location where a structure is stored in the file. It would be nice to use the fcr (FileChunkReference) to do it, as the structure itself is referenced by such a struct.