malwarefrank / dnfile
Parse .NET executable files.
License: MIT License
As noted in #17 (comment)
are stream names guaranteed to be ASCII? What does the spec say? What do implementations do?
Looks like you tagged v0.11.2, but PyPI still hosts v0.11.0. Could you please push v0.11.2 to PyPI? We would love to pull in your recent changes, thank you!
Objects' attributes throughout the project should be accessed consistently to reduce cognitive load on developers and library users.
Right now some objects have attributes that are set to None when there is a parse error, while others are not set at all.
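Until the convention is unified, a user-side workaround is to collapse both cases (attribute set to None vs. never set) into one check with getattr(). The attribute names below are illustrative, not real dnfile fields:

```python
# Defensive access: some objects set attributes to None on parse errors,
# others leave them unset entirely; getattr() with a default treats both
# the same.  "Flags" and FakeRow are illustrative stand-ins.
def safe_attr(obj, name):
    value = getattr(obj, name, None)
    return value  # None means "unset or failed to parse"

class FakeRow:        # stand-in for a parsed metadata row
    Flags = None      # parse error: attribute set to None
```
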
Requesting that you add the ability to parse BMP images stored as entries within the .NET resources.
Sample:
https://www.virustotal.com/gui/file/0a5dc3b6669cf31e8536c59fe1315918eb4ecfd87998445e2eeb8fed64bd2f2c
dnfile properly identified the resource names and types, but the data property is NoneType. Attached is the output from the following code (curly quotes replaced with straight quotes so it runs):
pe = dnfile.dnPE(filepath)
for r in pe.net.resources:
    if r.name == "20a87df82283.Resources.resources":
        for entry in r.data.entries:
            print(f"{r.name}: {entry.name} - {type(entry.data)}")
            print(entry.__dict__)
            print(entry.struct.__dict__)
I know that the open-source project dnSpy does an excellent job of parsing this resource type from .NET executables so maybe some of that logic can be ported into this project.
https://github.com/dnSpyEx/dnSpy
https://github.com/dnSpyEx/dnSpy/blob/master/Extensions/dnSpy.BamlDecompiler/Baml/KnownTypes.cs
Could possibly use this code to dramatically increase support for other types at the same time.
GitHub Actions recommends using a Trusted Publisher instead of API tokens in workflows that push to PyPI. The PyPI documentation also strongly recommends using a GitHub environment.
Submitting a request to have things like strings, user_strings, and GUIDs processed when dnfile first loads an executable. Basically implementing the code provided in the following example into dnfile:
It would be great if the extracted strings could then be simply referenced by the user via a property like dnfile.net.user_strings, which would return a set of extracted user strings.
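In the meantime, the #US heap layout (ECMA-335 §II.24.2.4: length-prefixed blobs whose contents are UTF-16LE text plus one trailing flag byte) makes a standalone extractor straightforward. A sketch that walks raw heap bytes — function names are mine, not dnfile API:

```python
def read_compressed_uint(data, offset):
    """Decode an ECMA-335 II.23.2 compressed unsigned int; returns (value, size)."""
    b = data[offset]
    if b & 0x80 == 0:
        return b, 1
    if b & 0xC0 == 0x80:
        return ((b & 0x3F) << 8) | data[offset + 1], 2
    return ((b & 0x1F) << 24) | (data[offset + 1] << 16) \
        | (data[offset + 2] << 8) | data[offset + 3], 4

def iter_user_strings(heap):
    """Yield decoded strings from raw #US heap bytes."""
    offset = 1                      # the heap starts with a single empty entry
    while offset < len(heap):
        length, n = read_compressed_uint(heap, offset)
        offset += n
        if length == 0:             # zero padding at the end of the heap
            continue
        blob = heap[offset:offset + length]
        offset += length
        # each entry is UTF-16LE text plus one trailing flag byte
        yield blob[:length - 1].decode("utf-16-le", errors="replace")
```
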
I was wondering if you'd be interested in this error, caused by this file.
I found it using CAPA, with dnfile 0.14.1, but it also triggers on 0.15.0.
>>> import dnfile
>>> pe = dnfile.dnPE("e94f7c475e7db0691a2698b5dd349c2b412ffddafa7a3ff85785cbd5ac144fcb")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../dnfile/__init__.py", line 64, in __init__
super().__init__(name, data, fast_load)
File ".../pefile.py", line 2895, in __init__
self.__parse__(name, data, fast_load)
File ".../dnfile/__init__.py", line 132, in __parse__
super().__parse__(fname, data, fast_load)
File ".../pefile.py", line 3328, in __parse__
self.full_load()
File ".../pefile.py", line 3439, in full_load
self.parse_data_directories()
File ".../dnfile/__init__.py", line 178, in parse_data_directories
value = entry[1](dir_entry.VirtualAddress, dir_entry.Size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/__init__.py", line 221, in parse_clr_structure
return ClrData(self, rva, size, self.clr_lazy_load)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/__init__.py", line 526, in __init__
self._init_resources(pe)
File ".../dnfile/__init__.py", line 574, in _init_resources
rsrc.parse()
File ".../dnfile/resource.py", line 289, in parse
rs.parse()
File ".../dnfile/resource.py", line 433, in parse
rsrc_factory.read_rsrc_data_v1(self._data, e_data_offset, self.resource_types, e)
File ".../dnfile/resource.py", line 113, in read_rsrc_data_v1
d, v = self.type_str_to_type(entry.type_name, data, offset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/resource.py", line 166, in type_str_to_type
final_bytes, n = self.read_serialized_data(data, offset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/resource.py", line 72, in read_serialized_data
x = utils.read_compressed_int(data[offset:offset + 4])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../dnfile/utils.py", line 46, in read_compressed_int
value |= data[1]
~~~~^^^
IndexError: index out of range
The file doesn't look to be too badly corrupted, but I may be wrong. 🙂
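The crash happens in the two-byte branch of the compressed-integer decoder when the buffer ends after the first byte. A bounds-checked variant — a sketch, not dnfile's actual utils.read_compressed_int — would return None instead of raising IndexError on truncated input:

```python
def read_compressed_int(data):
    """Decode an ECMA-335 compressed unsigned int from `data`.

    Returns (value, bytes_consumed), or None when the buffer is too
    short for the encoding its first byte announces."""
    if not data:
        return None
    b = data[0]
    if b & 0x80 == 0:                       # 1-byte form
        return b, 1
    if b & 0xC0 == 0x80:                    # 2-byte form
        if len(data) < 2:
            return None                     # truncated: the reported crash
        return ((b & 0x3F) << 8) | data[1], 2
    if len(data) < 4:                       # 4-byte form
        return None
    return ((b & 0x1F) << 24) | (data[1] << 16) | (data[2] << 8) | data[3], 4
```
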
We found an issue where, if a type has exactly one method (e.g., a .cctor), the logic that fills the list exits prematurely due to a bug.
This line is responsible:
Line 463 in 366aa88
For example, if the MethodList index is 122 and the next type's MethodList index is 123, the logic in the lines above the quoted line computes the end index as 122 because it subtracts 1.
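A minimal illustration of the correct bound (a toy function, not the dnfile code at line 463): MethodList runs are half-open, so the end bound is the next type's index itself, not that index minus one:

```python
def owned_method_rows(method_list, next_method_list):
    """Rows owned by a type: the half-open range [method_list, next_method_list).

    Subtracting 1 to build an inclusive end turns the single-method case
    (122, 123) into the empty range 122..121, which is the reported bug."""
    return list(range(method_list, next_method_list))
```
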
Hello, I just got this bug when I was trying to list the strings:
pip3 install git+https://github.com/malwarefrank/dnfile -U
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Collecting git+https://github.com/malwarefrank/dnfile
Cloning https://github.com/malwarefrank/dnfile to /private/var/folders/yt/mbh1wxlj6fq0qqxbjnjfqr200000gn/T/pip-req-build-efg5499r
Running command git clone --filter=blob:none --quiet https://github.com/malwarefrank/dnfile /private/var/folders/yt/mbh1wxlj6fq0qqxbjnjfqr200000gn/T/pip-req-build-efg5499r
Resolved https://github.com/malwarefrank/dnfile to commit 92847841e6496453598947a74eb78fa7299ad579
Running command git submodule update --init --recursive -q
Preparing metadata (setup.py) ... done
Requirement already satisfied: pefile>=2019.4.18 in /usr/local/lib/python3.9/site-packages (from dnfile==0.10.0) (2021.9.3)
Requirement already satisfied: future in /usr/local/lib/python3.9/site-packages (from pefile>=2019.4.18->dnfile==0.10.0) (0.18.2)
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
➜ python3 dnstring.py b9efc289ffd8951a65f66ddf2649c1959ad1b94f1177002b20f05e8ae86853ae
reference to missing table: File
reference to missing table: File
Traceback (most recent call last):
File "<censured>/dnstring.py", line 42, in <module>
show_strings(fname)
File "<censured>/dnstring.py", line 33, in show_strings
s = dnfile.stream.UserString(buf)
File "/usr/local/lib/python3.9/site-packages/dnfile/stream.py", line 116, in __init__
self.value: str = data.decode(encoding)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 28: truncated data
I'm not sure how a string can be 29 bytes in wide mode:
'utf-16-le' codec can't decode byte 0x00 in position 28: truncated data b'S\x00t\x00u\x00b\x00.\x00R\x00e\x00s\x00o\x00u\x00r\x00c\x00e\x00s\x00\x00'
Any ideas?
Thank you!
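For what it's worth, ECMA-335 (§II.24.2.4) specifies that each #US heap entry is UTF-16LE text followed by a single trailing flag byte, which is why a 14-character string occupies 29 bytes. A decode sketch that accounts for that flag byte (helper name is mine):

```python
# the blob from the error message: "Stub.Resources" (14 UTF-16 code units,
# 28 bytes) plus the one-byte trailing flag required by ECMA-335 II.24.2.4
buf = (b"S\x00t\x00u\x00b\x00.\x00R\x00e\x00s\x00"
       b"o\x00u\x00r\x00c\x00e\x00s\x00\x00")

def decode_user_string(blob):
    """Strip the trailing flag byte (present when the length is odd) and decode."""
    if len(blob) % 2 == 1:
        blob = blob[:-1]
    return blob.decode("utf-16-le")
```
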
I'm interested in adding tests (using pytest, unless you have other preferences) that demonstrate the functionality of dnfile. Do you prefer they live under tests/test_*.py?
For reference, in capa we have a separate repository, capa-testfiles, that holds all the files used during testing and that we reference as a submodule under tests/data/. This makes it possible to check out in CI via --recurse-submodules, but also easy to check out the source code without pulling down MBs of test data. Of course, this introduces a bit more configuration and maintenance of two repos vs. one.
What would you like to do for dnfile?
Encountered an unexpected exception when parsing the file 0033ca037e0496c5c33e3dc19714fb3e:
❯ python foo.py tests/data/0033ca037e0496c5c33e3dc19714fb3e
Traceback (most recent call last):
...
File "scripts/print_cil.py", line 29, in main
pe = dnfile.dnPE(args.path)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 61, in __init__
super().__init__(name, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 2743, in __init__
self.__parse__(name, data, fast_load)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 129, in __parse__
super().__parse__(fname, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3148, in __parse__
self.full_load()
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3259, in full_load
self.parse_data_directories()
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 174, in parse_data_directories
value = entry[1](dir_entry.VirtualAddress, dir_entry.Size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 219, in parse_clr_structure
return ClrData(self, rva, size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 481, in __init__
self.metadata = ClrMetaData(pe, metadata_rva, metadata_size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 346, in __init__
s.parse(self.streams_list)
File "/home/user/code/dnfile/src/dnfile/stream.py", line 375, in parse
table = mdtable.ClrMetaDataTableFactory.createTable(
File "/home/user/code/dnfile/src/dnfile/mdtable.py", line 2031, in createTable
table = cls._table_number_map[number](
File "/home/user/code/dnfile/src/dnfile/base.py", line 542, in __init__
assert hasattr(self, "_row_class")
AssertionError
I only need to load the AssemblyRef mdtable, but currently there is no way to restrict the tables that get loaded. Restricting dnfile to only loading the AssemblyRef and ManifestResource tables (the latter is required because it's used to parse the resources) results in a considerable speedup:
# before
debug: Parsed data directories in 7.221096945999989 seconds
mons show main --debug 7.39s user 0.27s system 97% cpu 7.891 total
# after
debug: Parsed data directories in 0.010396081999942908 seconds
mons show main --debug 0.29s user 0.03s system 99% cpu 0.326 total
My thoughts on how this could be implemented are either a parameter on dnPE.__init__ that restricts which tables are parsed, or lazy-loading. I've already set up the former to test this, but I will look into lazy-loading before opening a PR.
It can be useful to identify the exact location in the underlying PE, whether by rva or offset, of items in streams, such as Strings, UserStrings, and GUIDs. Maybe if each item had an rva or file_offset attached to it.
Credit to @c3rb3ru5d3d53c for suggesting something like this in #82
Move to pyproject.toml for the build process; setup.py is deprecated.
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
When parsing a private sample, we encounter an exception like:
❯ python -m pdb -- ./examples/dndump.py <redacted>
Traceback (most recent call last):
File "/usr/lib/python3.8/pdb.py", line 1705, in main
pdb._runscript(mainpyfile)
File "/usr/lib/python3.8/pdb.py", line 1573, in _runscript
self.run(statement)
File "/usr/lib/python3.8/bdb.py", line 580, in run
exec(cmd, globals, locals)
File "<string>", line 1, in <module>
File "/home/user/code/dnfile/examples/dndump.py", line 2, in <module>
'''
File "/home/user/code/dnfile/examples/dndump.py", line 320, in main
dn = dnfile.dnPE(args.input)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 61, in __init__
super().__init__(name, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 2743, in __init__
self.__parse__(name, data, fast_load)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 129, in __parse__
super().__parse__(fname, data, fast_load)
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3148, in __parse__
self.full_load()
File "/home/user/env/lib/python3.8/site-packages/pefile.py", line 3259, in full_load
self.parse_data_directories()
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 174, in parse_data_directories
value = entry[1](dir_entry.VirtualAddress, dir_entry.Size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 219, in parse_clr_structure
return ClrData(self, rva, size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 481, in __init__
self.metadata = ClrMetaData(pe, metadata_rva, metadata_size)
File "/home/user/code/dnfile/src/dnfile/__init__.py", line 346, in __init__
s.parse(self.streams_list)
File "/home/user/code/dnfile/src/dnfile/stream.py", line 336, in parse
table = mdtable.ClrMetaDataTableFactory.createTable(
File "/home/user/code/dnfile/src/dnfile/mdtable.py", line 2091, in createTable
table = cls._table_number_map[number](
File "/home/user/code/dnfile/src/dnfile/base.py", line 542, in __init__
assert hasattr(self, "_row_class")
AssertionError
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /home/user/code/dnfile/src/dnfile/base.py(542)__init__()
-> assert hasattr(self, "_row_class")
(Pdb) self
<dnfile.mdtable.FieldPtr object at 0x7fba4852a460>
(Pdb) self.name
'FieldPtr'
(Pdb) self.number
3
(Pdb)
The #GUID stream is easily iterable, since all items are exactly 16 bytes long and can only be referenced in that fashion. So it should be easy to make the .net.guids stream object iterable, indexable, and len-able.
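A sketch of what that wrapper could look like, assuming the 16-byte GUID entries and 1-based indexing that ECMA-335 metadata uses (the class name and API are hypothetical, not dnfile's):

```python
import uuid

class GuidHeap:
    """Sketch of an iterable, indexable, len-able #GUID heap wrapper."""

    def __init__(self, data: bytes):
        self.data = data

    def __len__(self):
        return len(self.data) // 16          # one GUID per 16 bytes

    def __getitem__(self, index):            # 1-based, as metadata tables use
        if not 1 <= index <= len(self):
            raise IndexError(index)
        start = (index - 1) * 16
        return uuid.UUID(bytes_le=self.data[start:start + 16])

    def __iter__(self):
        return (self[i] for i in range(1, len(self) + 1))
```
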
From ECMA-335 §II.22.29, MethodSpec : 0x2B:
The MethodSpec table has the following columns:
- Method (an index into the MethodDef or MemberRef table, specifying to which generic method this row refers; that is, which generic method this row is an instantiation of; more precisely, a MethodDefOrRef (§II.24.2.6) coded index)
- Instantiation (an index into the Blob heap (§II.23.2.15), holding the signature of this instantiation)
The MethodSpec table records the signature of an instantiated generic method. Each unique instantiation of a generic method (i.e., a combination of Method and Instantiation) shall be represented by a single row in the table.
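The Method column's MethodDefOrRef coded index packs a 1-bit table tag below the row index (ECMA-335 §II.24.2.6); decoding it is a one-liner, sketched here with a hypothetical helper name:

```python
def decode_methoddeforref(coded):
    """Split a MethodDefOrRef coded index into (table_name, 1-based row).

    Per ECMA-335 II.24.2.6, the low bit selects the table
    (0 = MethodDef, 1 = MemberRef) and the remaining bits are the row."""
    table = ("MethodDef", "MemberRef")[coded & 0x1]
    return table, coded >> 1
```
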
Using ECMA's standard naming would help make it easier to read code that leverages dnfile to parse MethodSpec:
Lines 1988 to 2025 in 498f6c6
Issue: The offset returned by get_file_offset() is wrong by 0x1E00.
Details:
I try to get the file offset for all the structs printed by dndump.py with struct.get_file_offset():
Add:
ostream.writeln("[%d]:" % (i + 1))
ostream.writeln("File offset: " + str(row.struct.get_file_offset()))
Which gives with dotnet-test.dll for example:
MethodDef:
[1]:
File offset: 8748
Rva: 0x2048
Name: .ctor
Signature: 200001
ParamList: (empty)
ImplFlags:
miIL
miManaged
Flags:
mdHideBySig
mdPublic
mdRTSpecialName
mdReuseSlot
mdSpecialName
The file offset is 8748, but the file is only 2023 bytes big.
The effective offset is 8748 - 7680:
$ hexdump -vC -s $((8748 - 7680)) -n 16 dotnet-test.dll
0000042c 48 20 00 00 00 00 86 18 20 02 06 00 01 00 50 20 |H ...... .....P |
The RVA=0x2048 is at the beginning of the MethodDef, as little endian: 48 20. Note that 7680 is 0x1E00.
Any idea where this 0x1E00 offset is coming from? Is it stable, so I can just subtract it? Is this a bug?
Probably the RVA-to-offset calculation is not done completely correctly. There also seems to be no header or section at offset 0x1E00 in .NET PE files.
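One plausible explanation (a guess, sketched below): the struct was parsed from data mapped at its RVA, so get_file_offset() is effectively returning an RVA-relative position. In a typical .NET PE, the .text section has VirtualAddress 0x2000 and PointerToRawData 0x200, which yields exactly the constant 0x1E00 difference (8748 = 0x222C, and 0x222C - 0x1E00 = 0x42C):

```python
def rva_to_file_offset(rva, sections):
    """Map an RVA to a raw file offset via the section table.

    `sections` is a list of (virtual_address, raw_pointer, raw_size)
    tuples; the test values mirror a typical .NET .text section."""
    for virtual_address, raw_pointer, raw_size in sections:
        if virtual_address <= rva < virtual_address + raw_size:
            return rva - virtual_address + raw_pointer
    raise ValueError(f"RVA {rva:#x} falls outside all sections")
```
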
Method (and field, and ...) signatures are represented by data in a custom binary format that is stored in the #Blob stream. The best references I've found for parsing this data are:
Lines 250 to 252 in cc97eca
This is used by at least:
- TypeDef.FieldList_Index
- TypeDef.MethodList_Index
- MethodDef.ParamList_Index
Which are fairly interesting structures (at least to me).
The ManifestResource metadata table may contain rows for .NET resources, external and internal. These are different from PE resources and have their own format as far as I can tell.
There may be useful parsing information in the dotnet (.NET) runtime vs ECMA-335 specification documentation:
https://github.com/dotnet/runtime/tree/main/docs/design/specs
dump_info() is raising an exception: the ClrMetaDataTable class has lost its rva member, but that member is still referenced in dump_info().
Parse the Method data (pointed to by RVA, see mdtable.MethodDefRow), as much as is needed to perform data-agnostic computation over the bytecode (cryptographic and fuzzy hashes, entropy, value distributions, etc).
See ECMA-335 6th Edition, Section II.25.4 Common Intermediate Language physical layout
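As a starting point, the method header format in §II.25.4 distinguishes a tiny form (one byte: 2-bit kind, 6-bit code size) from a fat form (12-byte header with a 32-bit code size). A sketch of locating the raw CIL bytes — the function name is mine, and it ignores extra sections and the local-variable signature token:

```python
def locate_cil(body):
    """Return (code_offset, code_size) for a CIL method body per II.25.4."""
    first = body[0]
    kind = first & 0x3
    if kind == 0x2:                      # tiny: code size packed in upper 6 bits
        return 1, first >> 2
    if kind == 0x3:                      # fat: 12-byte header
        header_dwords = body[1] >> 4     # header size in 4-byte units
        code_size = int.from_bytes(body[4:8], "little")
        return header_dwords * 4, code_size
    raise ValueError("not a recognized method body header")
```

With the offset and size in hand, the data-agnostic measures mentioned above (hashes, entropy, value distributions) can be computed directly over body[offset:offset + size].
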