malwarefrank / dnfile
Parse .NET executable files.
License: MIT License
As noted in #17 (comment)
are stream names guaranteed to be ASCII? What does the spec say? What do implementations do?
Looks like you tagged v0.11.2, but PyPI still hosts v0.11.0. Could you please push v0.11.2 to PyPI? We would love to pull in your recent changes, thank you!
Objects' attributes throughout the project should be accessed consistently to reduce cognitive load on developers and library users.
Right now some objects have attributes that are set to None when there is a parse error, while others are not set at all.
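Until the convention is unified, a user-side workaround is to collapse both cases (attribute set to None vs. never set) into one check with getattr(). The attribute names below are illustrative, not real dnfile fields:

```python
# Defensive access: some objects set attributes to None on parse errors,
# others leave them unset entirely; getattr() with a default treats both
# the same.  "Flags" and FakeRow are illustrative stand-ins.
def safe_attr(obj, name):
    value = getattr(obj, name, None)
    return value  # None means "unset or failed to parse"

class FakeRow:        # stand-in for a parsed metadata row
    Flags = None      # parse error: attribute set to None
```
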
Requesting that you add the ability to parse BMP images stored as entries within the .NET resources.
Sample:
https://www.virustotal.com/gui/file/0a5dc3b6669cf31e8536c59fe1315918eb4ecfd87998445e2eeb8fed64bd2f2c
dnfile properly identified the resource names and types, but the data property is NoneType. Attached is the output from the following code (curly quotes replaced with straight quotes so it runs):
pe = dnfile.dnPE(filepath)
for r in pe.net.resources:
    if r.name == "20a87df82283.Resources.resources":
        for entry in r.data.entries:
            print(f"{r.name}: {entry.name} - {type(entry.data)}")
            print(entry.__dict__)
            print(entry.struct.__dict__)
I know that the open-source project dnSpy does an excellent job of parsing this resource type from .NET executables so maybe some of that logic can be ported into this project.
https://github.com/dnSpyEx/dnSpy
https://github.com/dnSpyEx/dnSpy/blob/master/Extensions/dnSpy.BamlDecompiler/Baml/KnownTypes.cs
Could possibly use this code to dramatically increase support for other types at the same time.
GitHub Actions recommends using a Trusted Publisher instead of API tokens in workflows that push to PyPI. The PyPI documentation also strongly recommends using a GitHub environment.
Submitting a request to have things like strings, user_strings, and GUIDs processed when dnfile first loads an executable. Basically implementing the code provided in the following example into dnfile:
It would be great if the extracted strings could then be simply referenced by the user via a property like dnfile.net.user_strings, which would return a set of extracted user strings.
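In the meantime, the #US heap layout (ECMA-335 §II.24.2.4: length-prefixed blobs whose contents are UTF-16LE text plus one trailing flag byte) makes a standalone extractor straightforward. A sketch that walks raw heap bytes — function names are mine, not dnfile API:

```python
def read_compressed_uint(data, offset):
    """Decode an ECMA-335 II.23.2 compressed unsigned int; returns (value, size)."""
    b = data[offset]
    if b & 0x80 == 0:
        return b, 1
    if b & 0xC0 == 0x80:
        return ((b & 0x3F) << 8) | data[offset + 1], 2
    return ((b & 0x1F) << 24) | (data[offset + 1] << 16) \
        | (data[offset + 2] << 8) | data[offset + 3], 4

def iter_user_strings(heap):
    """Yield decoded strings from raw #US heap bytes."""
    offset = 1                      # the heap starts with a single empty entry
    while offset < len(heap):
        length, n = read_compressed_uint(heap, offset)
        offset += n
        if length == 0:             # zero padding at the end of the heap
            continue
        blob = heap[offset:offset + length]
        offset += length
        # each entry is UTF-16LE text plus one trailing flag byte
        yield blob[:length - 1].decode("utf-16-le", errors="replace")
```
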
I was wondering if you'd be interested in this error, caused by this file.
I found it using CAPA, with dnfile 0.14.1, but it also triggers on 0.15.0.
>>> import dnfile
>>> pe = dnfile.dnPE("e94f7c475e7db0691a2698b5dd349c2b412ffddafa7a3ff85785cbd5ac144fcb")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../dnfile/__init__.py", line 64, in __init__
super().__init__(name, data, fast_load)
File ".../pefile.py", line 2895, in __init__
self.__parse__(name, data, fast_load)
File ".../dnfile/__init__.py", line 132, in __parse__
super().__parse__(fname, data, fast_load)
File ".../pefile.py", line 3328, in __parse__
self.full_load()
File ".../pefile.py", line 3439, in full_load
self.parse_data_directories()
File ".../dnfile/__init__.py", line 178, in parse_data_directories
value = entry[1](dir_entry.VirtualAddress, dir_entry.Size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/__init__.py", line 221, in parse_clr_structure
return ClrData(self, rva, size, self.clr_lazy_load)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/__init__.py", line 526, in __init__
self._init_resources(pe)
File ".../dnfile/__init__.py", line 574, in _init_resources
rsrc.parse()
File ".../dnfile/resource.py", line 289, in parse
rs.parse()
File ".../dnfile/resource.py", line 433, in parse
rsrc_factory.read_rsrc_data_v1(self._data, e_data_offset, self.resource_types, e)
File ".../dnfile/resource.py", line 113, in read_rsrc_data_v1
d, v = self.type_str_to_type(entry.type_name, data, offset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/resource.py", line 166, in type_str_to_type
final_bytes, n = self.read_serialized_data(data, offset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/resource.py", line 72, in read_serialized_data
x = utils.read_compressed_int(data[offset:offset + 4])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/utils.py", line 46, in read_compressed_int
value |= data[1]
~~~~^^^
IndexError: index out of range
The file doesn't look to be too badly corrupted, but I may be wrong. 🙂
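The crash happens in the two-byte branch of the compressed-integer decoder when the buffer ends after the first byte. A bounds-checked variant — a sketch, not dnfile's actual utils.read_compressed_int — would return None instead of raising IndexError on truncated input:

```python
def read_compressed_int(data):
    """Decode an ECMA-335 compressed unsigned int from `data`.

    Returns (value, bytes_consumed), or None when the buffer is too
    short for the encoding its first byte announces."""
    if not data:
        return None
    b = data[0]
    if b & 0x80 == 0:                       # 1-byte form
        return b, 1
    if b & 0xC0 == 0x80:                    # 2-byte form
        if len(data) < 2:
            return None                     # truncated: the reported crash
        return ((b & 0x3F) << 8) | data[1], 2
    if len(data) < 4:                       # 4-byte form
        return None
    return ((b & 0x1F) << 24) | (data[1] << 16) | (data[2] << 8) | data[3], 4
```
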
We found an issue where, if a type has exactly one method (e.g., a .cctor), the logic that fills the list exits prematurely due to a bug.
This line is responsible:
Line 463 in 366aa88
For example, if the MethodList index is 122 and the next type's MethodList index is 123, the logic in the lines above the quoted line computes the end index as 122 because it subtracts 1.
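A minimal illustration of the correct bound (a toy function, not the dnfile code at line 463): MethodList runs are half-open, so the end bound is the next type's index itself, not that index minus one:

```python
def owned_method_rows(method_list, next_method_list):
    """Rows owned by a type: the half-open range [method_list, next_method_list).

    Subtracting 1 to build an inclusive end turns the single-method case
    (122, 123) into the empty range 122..121, which is the reported bug."""
    return list(range(method_list, next_method_list))
```
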
Hello, I just got this bug when I was trying to list the strings:
pip3 install git+https://github.com/malwarefrank/dnfile -U
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Collecting git+https://github.com/malwarefrank/dnfile
Cloning https://github.com/malwarefrank/dnfile to /private/var/folders/yt/mbh1wxlj6fq0qqxbjnjfqr200000gn/T/pip-req-build-efg5499r
Running command git clone --filter=blob:none --quiet https://github.com/malwarefrank/dnfile /private/var/folders/yt/mbh1wxlj6fq0qqxbjnjfqr200000gn/T/pip-req-build-efg5499r
Resolved https://github.com/malwarefrank/dnfile to commit 92847841e6496453598947a74eb78fa7299ad579
Running command git submodule update --init --recursive -q
Preparing metadata (setup.py) ... done
Requirement already satisfied: pefile>=2019.4.18 in /usr/local/lib/python3.9/site-packages (from dnfile==0.10.0) (2021.9.3)
Requirement already satisfied: future in /usr/local/lib/python3.9/site-packages (from pefile>=2019.4.18->dnfile==0.10.0) (0.18.2)
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
➜ python3 dnstring.py b9efc289ffd8951a65f66ddf2649c1959ad1b94f1177002b20f05e8ae86853ae
reference to missing table: File
reference to missing table: File
Traceback (most recent call last):
File "<censured>/dnstring.py", line 42, in <module>
show_strings(fname)
File "<censured>/dnstring.py", line 33, in show_strings
s = dnfile.stream.UserString(buf)
File "/usr/local/lib/python3.9/site-packages/dnfile/stream.py", line 116, in __init__
self.value: str = data.decode(encoding)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 28: truncated data
I'm not sure how a string can be 29 bytes in wide mode:
'utf-16-le' codec can't decode byte 0x00 in position 28: truncated data b'S\x00t\x00u\x00b\x00.\x00R\x00e\x00s\x00o\x00u\x00r\x00c\x00e\x00s\x00\x00'
Any ideas?
Thank you!
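For what it's worth, ECMA-335 (§II.24.2.4) specifies that each #US heap entry is UTF-16LE text followed by a single trailing flag byte, which is why a 14-character string occupies 29 bytes. A decode sketch that accounts for that flag byte (helper name is mine):

```python
# the blob from the error message: "Stub.Resources" (14 UTF-16 code units,
# 28 bytes) plus the one-byte trailing flag required by ECMA-335 II.24.2.4
buf = (b"S\x00t\x00u\x00b\x00.\x00R\x00e\x00s\x00"
       b"o\x00u\x00r\x00c\x00e\x00s\x00\x00")

def decode_user_string(blob):
    """Strip the trailing flag byte (present when the length is odd) and decode."""
    if len(blob) % 2 == 1:
        blob = blob[:-1]
    return blob.decode("utf-16-le")
```
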
I'm interested in adding tests (using pytest, unless you have other preferences) that demonstrate the functionality of dnfile. Do you prefer they live under tests/test_*.py?
For reference, in capa we have a separate repository, capa-testfiles, that holds all the files used during testing and that we reference as a submodule under tests/data/. This makes it possible to check out in CI via --recurse-submodules, but also easy to check out the source code without pulling down MBs of test data. Of course, this introduces a bit more configuration and maintenance of two repos vs. one.
What would you like to do for dnfile?
Encountered an unexpected exception when parsing the file 0033ca037e0496c5c33e3dc19714fb3e:
❯ python foo.py tests/data/0033ca037e0496c5c33e3dc19714fb3e
Traceback (most recent call last):
...
File "scripts/print_cil.py", line 29, in main
pe = dnfile.dnPE(args.path)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 61, in __init__
super().__init__(name, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 2743, in __init__
self.__parse__(name, data, fast_load)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 129, in __parse__
super().__parse__(fname, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3148, in __parse__
self.full_load()
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3259, in full_load
self.parse_data_directories()
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 174, in parse_data_directories
value = entry[1](dir_entry.VirtualAddress, dir_entry.Size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 219, in parse_clr_structure
return ClrData(self, rva, size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 481, in __init__
self.metadata = ClrMetaData(pe, metadata_rva, metadata_size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 346, in __init__
s.parse(self.streams_list)
File "/home/user/code/dnfile/src/dnfile/stream.py", line 375, in parse
table = mdtable.ClrMetaDataTableFactory.createTable(
File "/home/user/code/dnfile/src/dnfile/mdtable.py", line 2031, in createTable
table = cls._table_number_map[number](
File "/home/user/code/dnfile/src/dnfile/base.py", line 542, in __init__
assert hasattr(self, "_row_class")
AssertionError
I only need to load the AssemblyRef mdtable, but currently there is no way to restrict the tables that get loaded. Restricting dnfile to only loading the AssemblyRef and ManifestResource tables (the latter is required because it's used to parse the resources) results in a considerable speedup:
# before
debug: Parsed data directories in 7.221096945999989 seconds
mons show main --debug 7.39s user 0.27s system 97% cpu 7.891 total
# after
debug: Parsed data directories in 0.010396081999942908 seconds
mons show main --debug 0.29s user 0.03s system 99% cpu 0.326 total
My thoughts on how this could be implemented are either a parameter on dnPE.__init__ that restricts which tables are parsed, or lazy-loading. I've already set up the former to test this, but I will look into lazy-loading before opening a PR.
It can be useful to identify the exact location in the underlying PE, whether by rva or offset, of items in streams, such as Strings, UserStrings, and GUIDs. Maybe if each item had an rva or file_offset attached to it.
Credit to @c3rb3ru5d3d53c for suggesting something like this in #82
Move to pyproject.toml for the build process; setup.py is deprecated.
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
When parsing a private sample, we encounter an exception like:
❯ python -m pdb -- ./examples/dndump.py <redacted>
Traceback (most recent call last):
File "/usr/lib/python3.8/pdb.py", line 1705, in main
pdb._runscript(mainpyfile)
File "/usr/lib/python3.8/pdb.py", line 1573, in _runscript
self.run(statement)
File "/usr/lib/python3.8/bdb.py", line 580, in run
exec(cmd, globals, locals)
File "<string>", line 1, in <module>
File "/home/user/code/dnfile/examples/dndump.py", line 2, in <module>
'''
File "/home/user/code/dnfile/examples/dndump.py", line 320, in main
dn = dnfile.dnPE(args.input)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 61, in __init__
super().__init__(name, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 2743, in __init__
self.__parse__(name, data, fast_load)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 129, in __parse__
super().__parse__(fname, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3148, in __parse__
self.full_load()
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3259, in full_load
self.parse_data_directories()
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 174, in parse_data_directories
value = entry[1](dir_entry.VirtualAddress, dir_entry.Size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 219, in parse_clr_structure
return ClrData(self, rva, size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 481, in __init__
self.metadata = ClrMetaData(pe, metadata_rva, metadata_size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 346, in __init__
s.parse(self.streams_list)
File "/home/user/code/dnfile/src/dnfile/stream.py", line 336, in parse
table = mdtable.ClrMetaDataTableFactory.createTable(
File "/home/user/code/dnfile/src/dnfile/mdtable.py", line 2091, in createTable
table = cls._table_number_map[number](
File "/home/user/code/dnfile/src/dnfile/base.py", line 542, in __init__
assert hasattr(self, "_row_class")
AssertionError
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /home/user/code/dnfile/src/dnfile/base.py(542)__init__()
-> assert hasattr(self, "_row_class")
(Pdb) self
<dnfile.mdtable.FieldPtr object at 0x7fba4852a460>
(Pdb) self.name
'FieldPtr'
(Pdb) self.number
3
(Pdb)
The #GUID stream is easily iterable, since all items are exactly 16 bytes long and can only be referenced in that fashion. So it should be easy to make the .net.guids stream object iterable, indexable, and len-able.
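A sketch of what that wrapper could look like, assuming the 16-byte GUID entries and 1-based indexing that ECMA-335 metadata uses (the class name and API are hypothetical, not dnfile's):

```python
import uuid

class GuidHeap:
    """Sketch of an iterable, indexable, len-able #GUID heap wrapper."""

    def __init__(self, data: bytes):
        self.data = data

    def __len__(self):
        return len(self.data) // 16          # one GUID per 16 bytes

    def __getitem__(self, index):            # 1-based, as metadata tables use
        if not 1 <= index <= len(self):
            raise IndexError(index)
        start = (index - 1) * 16
        return uuid.UUID(bytes_le=self.data[start:start + 16])

    def __iter__(self):
        return (self[i] for i in range(1, len(self) + 1))
```
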
From ECMA-335 §II.22.29, MethodSpec : 0x2B:
The MethodSpec table has the following columns:
- Method (an index into the MethodDef or MemberRef table, specifying to which generic method this row refers; that is, which generic method this row is an instantiation of; more precisely, a MethodDefOrRef (§II.24.2.6) coded index)
- Instantiation (an index into the Blob heap (§II.23.2.15), holding the signature of this instantiation)
The MethodSpec table records the signature of an instantiated generic method. Each unique instantiation of a generic method (i.e., a combination of Method and Instantiation) shall be represented by a single row in the table.
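The Method column's MethodDefOrRef coded index packs a 1-bit table tag below the row index (ECMA-335 §II.24.2.6); decoding it is a one-liner, sketched here with a hypothetical helper name:

```python
def decode_methoddeforref(coded):
    """Split a MethodDefOrRef coded index into (table_name, 1-based row).

    Per ECMA-335 II.24.2.6, the low bit selects the table
    (0 = MethodDef, 1 = MemberRef) and the remaining bits are the row."""
    table = ("MethodDef", "MemberRef")[coded & 0x1]
    return table, coded >> 1
```
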
Using ECMA's standard naming would help make it easier to read code that leverages dnfile to parse MethodSpec:
Lines 1988 to 2025 in 498f6c6
Issue: The offset returned by get_file_offset() is wrong by 0x1E00.
Details:
I try to get the file offset for all the structs printed by dndump.py with struct.get_file_offset():
Add:
ostream.writeln("[%d]:" % (i + 1))
ostream.writeln("File offset: " + str(row.struct.get_file_offset()))
Which gives with dotnet-test.dll for example:
MethodDef:
[1]:
File offset: 8748
Rva: 0x2048
Name: .ctor
Signature: 200001
ParamList: (empty)
ImplFlags:
miIL
miManaged
Flags:
mdHideBySig
mdPublic
mdRTSpecialName
mdReuseSlot
mdSpecialName
The file offset is 8748, but the file is only 2023 bytes big.
The effective offset is 8748 - 7680:
$ hexdump -vC -s $((8748 - 7680)) -n 16 dotnet-test.dll
0000042c 48 20 00 00 00 00 86 18 20 02 06 00 01 00 50 20 |H ...... .....P |
The RVA=0x2048 is at the beginning of the MethodDef, as little endian: 48 20. Note that 7680 is 0x1E00.
Any idea where this 0x1E00 offset is coming from? Is it stable, so I can just subtract it? Is this a bug?
Probably the RVA-to-offset calculation is not done completely correctly. There also seems to be no header or section at offset 0x1E00 in .NET PE files.
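One plausible explanation (a guess, sketched below): the struct was parsed from data mapped at its RVA, so get_file_offset() is effectively returning an RVA-relative position. In a typical .NET PE, the .text section has VirtualAddress 0x2000 and PointerToRawData 0x200, which yields exactly the constant 0x1E00 difference (8748 = 0x222C, and 0x222C - 0x1E00 = 0x42C):

```python
def rva_to_file_offset(rva, sections):
    """Map an RVA to a raw file offset via the section table.

    `sections` is a list of (virtual_address, raw_pointer, raw_size)
    tuples; the test values mirror a typical .NET .text section."""
    for virtual_address, raw_pointer, raw_size in sections:
        if virtual_address <= rva < virtual_address + raw_size:
            return rva - virtual_address + raw_pointer
    raise ValueError(f"RVA {rva:#x} falls outside all sections")
```
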
Method (and field, and ...) signatures are represented by data in a custom binary format that is stored in the #Blob stream. The best references I've found for parsing this data are:
Lines 250 to 252 in cc97eca
This is used by at least:
- TypeDef.FieldList_Index
- TypeDef.MethodList_Index
- MethodDef.ParamList_Index
Which are fairly interesting structures (at least to me).
The ManifestResource metadata table may contain rows for .NET resources, external and internal. These are different from PE resources and have their own format as far as I can tell.
There may be useful parsing information in the dotnet (.NET) runtime vs ECMA-335 specification documentation:
https://github.com/dotnet/runtime/tree/main/docs/design/specs
dump_info() is raising an exception: the ClrMetaDataTable class has lost its rva member, but that member is still referenced in dump_info().
Parse the Method data (pointed to by RVA, see mdtable.MethodDefRow), as much as is needed to perform data-agnostic computation over the bytecode (cryptographic and fuzzy hashes, entropy, value distributions, etc).
See ECMA-335 6th Edition, Section II.25.4 Common Intermediate Language physical layout
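As a starting point, the method header format in §II.25.4 distinguishes a tiny form (one byte: 2-bit kind, 6-bit code size) from a fat form (12-byte header with a 32-bit code size). A sketch of locating the raw CIL bytes — the function name is mine, and it ignores extra sections and the local-variable signature token:

```python
def locate_cil(body):
    """Return (code_offset, code_size) for a CIL method body per II.25.4."""
    first = body[0]
    kind = first & 0x3
    if kind == 0x2:                      # tiny: code size packed in upper 6 bits
        return 1, first >> 2
    if kind == 0x3:                      # fat: 12-byte header
        header_dwords = body[1] >> 4     # header size in 4-byte units
        code_size = int.from_bytes(body[4:8], "little")
        return header_dwords * 4, code_size
    raise ValueError("not a recognized method body header")
```

With the offset and size in hand, the data-agnostic measures mentioned above (hashes, entropy, value distributions) can be computed directly over body[offset:offset + size].
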