mandiant / capa
The FLARE team's open-source tool to identify capabilities in executable files.
License: Apache License 2.0
IDAPython: Error while calling Python callback <OnCreate>:
Traceback (most recent call last):
File "ida_capa_explorer.py", line 99, in OnCreate
self.load_capa_results()
File "capa/capa/ida/ida_capa_explorer.py", line 342, in load_capa_results
capabilities = capa.main.find_capabilities(rules, capa.features.extractors.ida.IdaFeatureExtractor(), True)
File "capa\capa\main.py", line 99, in find_capabilities
for f in tqdm.tqdm(extractor.get_functions(), disable=disable_progress, unit=" functions"):
File "C:\Python27\lib\site-packages\tqdm\_tqdm.py", line 997, in __iter__
for obj in iterable:
File "capa\capa\features\extractors\ida\__init__.py", line 54, in get_functions
from capa.features.extractors.ida import helpers
ImportError: cannot import name helpers
INFO:capa:form closed.
Python>sys.version
'2.7.15 (v2.7.15:ca079a3ea3, Apr 30 2018, 16:30:26) [MSC v.1500 64 bit (AMD64)]'
and maybe suggest "reference" -> "references"
use this JSON as the source data for all formatters. this will ensure it has all data necessary to render complete details of capa matches.
the JSON document will be the primary method of integration for external tools and scripts, rather than supporting a multitude of integrations.
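As a sketch of what such an external integration might look like (the top-level `"rules"` mapping of rule name to match details is an assumption for illustration, not a frozen schema):

```python
import json


def capability_names(doc):
    """List the matched rule names from a capa JSON report.

    Assumes (hypothetically) a top-level "rules" mapping of
    rule name -> match details.
    """
    return sorted(doc.get("rules", {}))


def load_capability_names(path):
    """Load a capa JSON report from disk and list its capabilities."""
    with open(path) as f:
        return capability_names(json.load(f))
```

A script like this only needs the JSON document, not capa's internals, which is the point of making JSON the primary integration method.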
The `Element` class is just used for testing. By using `Element` we are not testing the actual code. Also, every time we implement a new feature for the `Feature` class, we need to implement it for `Element` as well. I think it would be a better idea to use real classes for testing and get rid of `Element`. We could start by substituting it with `Number`, which should be straightforward. Although I think it could be a good idea to add some more tests for the different `Feature` classes.
Related to #5, as it simplifies the implementation.
@mr-tz @williballenthin what do you think?
it would be nice to format rules with a consistent style.
this includes:
`meta` before `features`
by default, python yaml emits keys alphabetically. as an example:
rule:
meta:
att&ck:
- Defense Evasion::Obfuscated Files or Information T1027.002
author: [email protected]
examples:
- CD2CBA9E6313E8DF2C1273593E649682
- Practical Malware Analysis Lab 01-02.exe_:0x0401000
mbc:
- Anti-Static Analysis::Software Packing
name: packed with UPX
namespace: anti-analysis/packer/upx
scope: file
features:
- or:
- section: UPX0
- section: UPX1
this would look nicer:
rule:
meta:
name: packed with UPX
namespace: anti-analysis/packer/upx
author: [email protected]
att&ck:
- Defense Evasion::Obfuscated Files or Information T1027.002
mbc:
- Anti-Static Analysis::Software Packing
examples:
- CD2CBA9E6313E8DF2C1273593E649682
- Practical Malware Analysis Lab 01-02.exe_:0x0401000
scope: file
features:
- or:
- section: UPX0
- section: UPX1
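The desired ordering above can be produced by disabling the emitter's key sorting. A minimal sketch, assuming PyYAML ≥ 5.1 (where `yaml.dump` takes `sort_keys`) and Python 3.7+ insertion-ordered dicts; the author email is a placeholder:

```python
import yaml

# keys are emitted in insertion order when sort_keys=False,
# instead of PyYAML's default alphabetical order
meta = {
    "name": "packed with UPX",
    "namespace": "anti-analysis/packer/upx",
    "author": "author@example.com",  # placeholder
    "scope": "file",
}

doc = yaml.dump({"rule": {"meta": meta}}, default_flow_style=False, sort_keys=False)
print(doc)
```

For older PyYAML (or Python 2), the same effect needs a custom representer for `collections.OrderedDict`.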
I propose the following formats to reduce duplicate information (MD5) and display the most important information first.
`capa report` could be included as a header/heading as well
default before
+------------------------+--------------------------------------------------------------+
| capa report for | 34404a3fb9804977c6ab86cb991fb130 |
| timestamp | 2020-07-03T12:41:55.267000 |
| version | 0.0.0 |
| path | tests\data\34404a3fb9804977c6ab86cb991fb130.exe_ |
| md5 | 34404a3fb9804977c6ab86cb991fb130 |
+------------------------+--------------------------------------------------------------+
>>>>>
after
+------------------------+--------------------------------------------------------------+
| md5 | 34404a3fb9804977c6ab86cb991fb130 |
| path | tests\data\34404a3fb9804977c6ab86cb991fb130.exe_ |
| timestamp | 2020-07-03T12:41:55.267000 |
| capa version | 0.0.0 |
+------------------------+--------------------------------------------------------------+
verbose, vverbose (should use same function) before
capa report for 34404a3fb9804977c6ab86cb991fb130
timestamp 2020-07-03T12:42:07.813000
version 0.0.0
path tests\data\34404a3fb9804977c6ab86cb991fb130.exe_
md5 34404a3fb9804977c6ab86cb991fb130
sha1 b345e6fae155bfaf79c67b38cf488bb17d5be56d
sha256 c6930e298bba86c01d0fe2c8262c46b4fce97c6c5037a193904cfc634246fbec
format auto
extractor VivisectFeatureExtractor
base address 0x400000
>>>>>
after
md5 34404a3fb9804977c6ab86cb991fb130
sha1 b345e6fae155bfaf79c67b38cf488bb17d5be56d
sha256 c6930e298bba86c01d0fe2c8262c46b4fce97c6c5037a193904cfc634246fbec
path tests\data\34404a3fb9804977c6ab86cb991fb130.exe_
timestamp 2020-07-03T12:42:07.813000
capa version 0.0.0
format auto
extractor VivisectFeatureExtractor
base address 0x400000
Tests are failing in master:
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ERROR collecting tests/test_freeze.py ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
tests/test_freeze.py:27: in <module>
0x401002: {"features": [(0x401002, capa.features.insn.Mnemonic("mov")),],},
E TypeError: Can't instantiate abstract class NullFeatureExtractor with abstract methods get_base_address
==================================================================================================== warnings summary ====================================================================================================
/usr/local/lib/python2.7/site-packages/vivisect/parsers/__init__.py:14
/usr/local/lib/python2.7/site-packages/vivisect/parsers/__init__.py:14: DeprecationWarning: the md5 module is deprecated; use hashlib instead
import md5
-- Docs: https://docs.pytest.org/en/latest/warnings.html
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Introduced in ff44801
After #39 it is really obvious that `args` and `value` are a duplication for most of the features. In most cases `args = [value]`. In a few features `value` has a different name, but I think it makes sense to rename this attribute. We could think about it as the value in the yaml file. So, I propose to get rid of `args` and introduce `value` for `Feature` (the main class instead of the subclasses). Removing duplication would simplify the code.
@mr-tz @williballenthin what do you think?
capa shows the file feature count
INFO:capa:analyzed file and extracted 21 file features
to avoid confusion, this should be removed or extended to also show function features
`vivisect` and/or `viv_utils` updates may result in modified workspaces. By default `getWorkspace` loads existing `.viv` files if they exist. This can lead to confusion, misleading analysis, and errors.
we should probably report this upstream.
in the meantime, maybe we can stuff the viv version in a meta field and do the check ourselves.
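A sketch of that in-the-meantime check; the `capa:viv-version` meta key and the exact vivisect API calls are assumptions for illustration, not capa's actual implementation:

```python
def workspace_is_stale(stored_version, current_version):
    """Return True when a cached .viv workspace was produced by a
    different vivisect release and should be regenerated.

    In practice (assumed API), the version would be stamped at save time:
        vw.setMeta("capa:viv-version", vivisect.version)
    and read back before reusing the workspace:
        stored = vw.getMeta("capa:viv-version")
    """
    # no recorded version means we can't trust the cached analysis
    return stored_version is None or stored_version != current_version
```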
...unless in nursery
capa relies on analysis of code structures to identify patterns. this is similar to matching sequences of API calls or other events in a sandbox, but not exactly. right now, capa rules don't directly translate to identifying behaviors from sandbox or debugging output, but it seems like there's a lot of overlap. maybe we can find a way to re-use a lot of work we've done for the static analysis rules.
capa relies on vivisect for its standalone code analysis (when run within IDA, it uses IDA's analysis). since vivisect is py2-only, this means capa is py2-only, when used standalone or as a library. we should provide an analysis backend that can be used on py3, as this is the future.
we're aware that everyone (actually, including ourselves) has already moved on to py3. you should be aware that using vivisect was the path of least resistance to developing capa. now that we've proved that capa works and is useful, it's finally appropriate to dedicate substantial time towards the upgrade.
note, the capa code base is already py3 compatible. this is strictly a limitation of the backend that we ship by default.
add another scope, `program`, to encompass file and function (and lower) scopes
Should we prioritize this feature? We have various instances from Ana's work where this would be helpful. According to @mwilliams31, `schannel` is also likely implemented across multiple functions.
works for me. shall we have @Ana06 tackle it? will require getting familiar with the matching logic, which is a good lesson (and maybe torture???).
sounds good 😄 if it becomes too much torture, let us know, @Ana06
I have the following questions/comments after changing the IDA plugin to use the new JSON format:
Does it make sense to define (if not done already) a JSON schema for the new format?
Does it make sense to include the original rule content for `match`? This data can be found in the `source` field of the parent `match`, but finding the original source this way isn't as convenient.
Does it make sense to include the locations for `range`? These locations, and the corresponding context, e.g. the instruction at a location, used to be displayed in the IDA plugin.
Does it make sense to include additional metadata, e.g. hash value, entry point, etc., specific to the binary file from which the output was produced?
Does it make sense to include feature comments, e.g. `PAGE_EXECUTE_READWRITE` from `number: 0x40 = PAGE_EXECUTE_READWRITE`?
The doc format does not include locations for the `calls from` characteristic. From my understanding, these locations are recorded and should be included?
{'children': [],
'locations': (),
'node': {'statement': {'child': {'characteristic': 'calls from',
'type': 'characteristic'},
'max': 4,
'min': 0,
'type': 'range'},
'type': 'statement'},
'success': True},
rather than using an inline string `" = "` that is prone to typos and cannot be used with "find references", use a constant like `DESCRIPTION_SEPARATOR = " = "` and use it throughout the code. (from @williballenthin's comment in #39)
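A sketch of what that could look like (the helper name is illustrative, not capa's actual API):

```python
DESCRIPTION_SEPARATOR = " = "


def split_description(s):
    """Split a feature value with an optional human-readable description,
    e.g. "0x40 = PAGE_EXECUTE_READWRITE" -> ("0x40", "PAGE_EXECUTE_READWRITE").
    Returns (value, None) when there is no description."""
    value, _, description = s.partition(DESCRIPTION_SEPARATOR)
    return value, (description or None)
```

Beyond preventing typos, a single constant means "find references" in an editor shows every place the separator convention is relied upon.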
$ capa -f sc32 tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32
INFO:capa:--------------------------------------------------------------------------------
INFO:capa: Using default embedded rules.
INFO:capa: To provide your own rules, use the form `capa.exe ./path/to/rules/ /path/to/mal.exe`.
INFO:capa: You can see the current default rule set here:
INFO:capa: https://github.com/fireeye/capa-rules
INFO:capa:--------------------------------------------------------------------------------
WARNING:capa:skipping non-.yml file: .git
WARNING:capa:skipping non-.yml file: README.md
INFO:capa:successfully loaded 277 rules
INFO:capa:generating vivisect workspace for: tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32
Traceback (most recent call last):
File "c:\python27\lib\site-packages\vivisect\impemu\monitor.py", line 147, in prehook
cb(self, emu, op, starteip)
File "c:\python27\lib\site-packages\vivisect\analysis\generic\switchcase.py", line 19, in analyzeJmp
ctx = getSwitchBase(vw, op, starteip, emu)
File "c:\python27\lib\site-packages\vivisect\analysis\generic\switchcase.py", line 69, in getSwitchBase
imgbase = vw.getFileMeta(filename, 'imagebase')
File "c:\python27\lib\site-packages\vivisect\__init__.py", line 2484, in getFileMeta
raise Exception("Invalid File: %s" % filename)
Exception: Invalid File: shellcode
[...]
INFO:capa:format: blob, platform: windows, architecture: i386, number of functions: 42
INFO:capa:analyzed file and extracted 112 features
+------------------------+----------------------------------------------------------------+
| ATT&CK Tactic | ATT&CK Technique |
|------------------------+----------------------------------------------------------------|
| DEFENSE EVASION | Obfuscated Files or Information [T1027] |
| EXECUTION | Shared Modules [T1129] |
+------------------------+----------------------------------------------------------------+
+---------------------------------------------+----------------------------------------------+
| CAPABILITY | NAMESPACE |
|---------------------------------------------+----------------------------------------------|
| contain obfuscated stackstrings (2 matches) | anti-analysis/obfuscation/string/stackstring |
| encode data using XOR | data-manipulation/encoding/xor |
| parse PE header | load-code/pe |
+---------------------------------------------+----------------------------------------------+
INFO:capa:done.
notably, this is found under the Python Software Foundation (PSF) organization. seems to lend some weight. also, tons of stars and engagements.
right now we support matching on other rule names, like `match: encrypt data with RC4 KSA`.
we should support matching on namespaces as well, like `match: data-manipulation/encryption`.
this would mean that rule authors don't have to know about all the possible techniques to do a thing (like encryption).
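The matching itself could be a simple prefix test on the namespace path. A sketch (illustrative helper, not capa's engine):

```python
def matches_namespace(rule_namespace, query):
    """True when `query` names the rule's namespace or any ancestor of it,
    so `match: data-manipulation/encryption` also catches rules living in
    data-manipulation/encryption/rc4, etc."""
    # the "/" guard prevents "data-manipulation/enc" from matching
    # "data-manipulation/encryption" by accident
    return rule_namespace == query or rule_namespace.startswith(query + "/")
```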
after months of use, it seems that characteristic features are only used like `characteristic(nzxor): True`. that is, the value is always `True`. we can simplify and make the rule syntax more consistent by changing the format to look like `characteristic: nzxor` and `count(characteristic(nzxor))`.
to match the non-existence of this feature, use `not: characteristic: ...` or `count(characteristic(...)): 0`.
from #91
also, this:
INFO:capa:format: blob, platform: windows, architecture: i386, number of functions: 42
INFO:capa:analyzed file and extracted 112 features
blocked on gh actions being available, though.
rather than putting the python installation into the `setup-hooks.py` script, maybe use an `extras_require` for `[dev]`?
This was not implemented in #39, as `RegExp` is not a `Feature`. We need to either make `RegExp` a feature or implement this for `RegExp` as well. It should work in the same way as for strings.
Just tracking it here, so that we don't forget about it. 😉
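A minimal sketch of how a `RegExp` feature could evaluate, assuming (hypothetically) it sees the same extracted strings the string feature does:

```python
import re


class Regex:
    """Hypothetical RegExp feature: matches when any extracted string
    satisfies the pattern, mirroring how string features match."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def evaluate(self, strings):
        # re.search so the pattern may match anywhere in the string,
        # unless it is explicitly anchored
        return any(self.pattern.search(s) for s in strings)
```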
modulo some stripping of special characters
capa contains a small amount of code and a large amount of default rules that assume the input file is a Windows PE file. this is because the original authors primarily analyze Windows malware. there is nothing stopping analysis of Linux ELF or MacOS Mach-O binaries; however, we haven't yet had the experience, sample binaries, nor time to make this happen.
support for additional platforms may be added in the future, especially with (1) contributions from experts in those fields, and (2) sufficient sample binaries to demonstrate capa works as expected. if you're interested in helping out in these areas, please get in touch!
currently this gets bytes features for many invalid immediate operands:
if isinstance(oper, envi.archs.i386.disasm.i386ImmOper):
    v = oper.getOperValue(oper)
for example `add ebp, 0Bh`, etc.
should this case be fine-tuned or removed?
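One way to fine-tune it: only emit a bytes feature when the immediate falls inside a mapped region of the workspace. Sketched here without the vivisect types (a real workspace's `getMemoryMaps()` returns richer tuples; simplified to `(start, size)` pairs for illustration):

```python
def is_plausible_pointer(value, memmaps):
    """Return True when `value` lands inside some mapped region, i.e. it
    could be a pointer worth dereferencing for a bytes feature, rather
    than a small arithmetic constant like the 0x0B in `add ebp, 0Bh`.

    `memmaps` is a simplified list of (start, size) pairs."""
    return any(start <= value < start + size for start, size in memmaps)
```

The extractor would then skip `oper.getOperValue(oper)` results that fail this check instead of reading bytes at every immediate.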
I think we should add a CONTRIBUTING file to collect some important information we now have in other documents. This information is usually in the CONTRIBUTING file in other projects, and it is where people expect it to be. In addition, it is used by GitHub to help guide new contributors. For example, when someone opens a pull request or creates an issue, they will see a link to that file:
I think this document should include the following information:
the `capa-rules` repository and which issues belong to which repo. This should also be linked from the issues template.
Something else?
there are a number of interesting rules, like manual PEB parsing, that fire on standard routines inserted by the MSVC compiler. typically, we'd want to include these in the output, except that some of these normal runtime functions aren't doing anything nefarious (as the rule might suggest, like anti-vm).
this leads to the desire that we'd want to filter out some known functions from matching.
there are at least two obvious approaches:
use capa rule infrastructure to `not` the matches
rely on the analysis backend to recognize the functions
both of these have tradeoffs, and it's not clear what we should do.
if we use capa infrastructure to match functions,
pro:
con:
if we rely on the analysis backend to match functions,
pro:
con:
function/name: __init_iob
at least include a screenshot in the main readme so people can get a sense for what it does.
$ isort --length-sort --line-width 120 --thirdparty idc --thirdparty idaapi --thirdparty idautils --thirdparty ida_gdl --thirdparty PyQt5 --thirdparty argparse --builtin posixpath --thirdparty tabulate --thirdparty viv_utils --recursive .
lots of people use ghidra, which is free and open source. we should recommend a way of integrating capa results into ghidra.
via GitHub Actions
Currently has to be done manually, see https://stackoverflow.com/questions/5828324/update-git-submodule-to-latest-commit-on-origin
<in capa base dir>
cd rules/
git checkout master
git pull origin master
cd ..
git add rules/
git commit
git push origin master
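The same sequence can be collapsed using git's built-in submodule remote tracking (assuming the `rules` submodule tracks `master`):

```shell
# fetch and check out the latest upstream commit of the submodule,
# then record the new pointer in the superproject
git submodule update --remote rules
git add rules
git commit -m "rules: sync submodule to latest master"
git push origin master
```

A scheduled GitHub Actions workflow could run exactly these commands to keep the pointer fresh automatically.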
test case: rule and output (it should match on functions with no calls)
rule:
meta:
name: calls from
namespace: test
author: [email protected]
scope: function
features:
- or:
- count(mnemonic(call)): 0
- count(characteristic(calls from)): 0
capa tests/data/34404a3fb9804977c6ab86cb991fb130.exe_ -t test -vv
INFO:capa:--------------------------------------------------------------------------------
INFO:capa: Using default embedded rules.
INFO:capa: To provide your own rules, use the form `capa.exe ./path/to/rules/ /path/to/mal.exe`.
INFO:capa: You can see the current default rule set here:
INFO:capa: https://github.com/fireeye/capa-rules
INFO:capa:--------------------------------------------------------------------------------
WARNING:capa:skipping non-.yml file: .git
WARNING:capa:skipping non-.yml file: README.md
INFO:capa:successfully loaded 278 rules
INFO:capa:selected 1 rules
INFO:capa:generating vivisect workspace for: tests/data/34404a3fb9804977c6ab86cb991fb130.exe_
INFO:capa:format: pe, platform: windows, architecture: i386, number of functions: 853
INFO:capa:analyzed file and extracted 1549 features
INFO:capa:done.
capa/engine.py:156
def evaluate(self, ctx):
    if self.child not in ctx:
        return Result(False, self, [])
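One possible fix, sketched outside the real engine (assuming `ctx` maps each feature to the list of locations where it was found): treat a feature missing from the match context as occurring zero times, so `count(...): 0` and zero-minimum ranges can still succeed instead of failing outright.

```python
class Range:
    """Simplified model of a count(...) statement; evaluate returns a
    bare bool here rather than capa's Result object."""

    def __init__(self, child, min=0, max=(1 << 64) - 1):
        self.child = child
        self.min = min
        self.max = max

    def evaluate(self, ctx):
        # a feature absent from ctx occurred zero times; only fail
        # when zero falls outside the requested range
        count = len(ctx.get(self.child, []))
        return self.min <= count <= self.max
```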
This rule is not working as I expect, I get no results. Am I using this wrong?
rule:
meta:
name: count bb
namespace: test
scope: function
features:
- and:
- count(basic blocks): 1 or more
add a flag `--debug` to enable DEBUG level logging. this is independent of the `--verbose` mode that affects result output.
when i stage and commit only some of the pending changes, the post-commit git hook places unstaged changes into the git stash stack. i have to manually pop them with `git stash pop stash@{0}`. i would rather have these unstaged changes untouched by the git hook (at least, they should be there when the hook completes its job).
last night i was really scared that i had lost hours of work until i noticed the changes were hidden in the stack.
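a sketch of a hook pattern that avoids the problem: only stash when there are unstaged changes, and always restore them before the hook finishes (this is an illustrative pre-commit shape, not capa's actual hook; note that `pop` can still conflict when staged and unstaged hunks touch the same lines):

```shell
#!/bin/sh
# set unstaged edits aside so checks see only the staged tree,
# then put the working tree back exactly as it was
STASHED=0
if ! git diff --quiet; then
    git stash push --quiet --keep-index -m "pre-commit autostash"
    STASHED=1
fi

# ... run formatters/linters against the staged files here ...

if [ "$STASHED" -eq 1 ]; then
    git stash pop --quiet
fi
```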
this could include:
capa relies on disassembly and code analysis that can easily be defeated by packing. right now, capa doesn't attempt to do any auto-unpacking, so even trivially packed samples can bypass capa. fortunately, capa can often recognize when packing is in use (if you notice a bypass, submit a rule!), and will emit a warning about this.
doing auto-unpacking is a non-trivial job, and not really in scope for what capa does. however, if there are easy ways to make this work, we can revisit the idea.
In some cases it could be useful to associate context with a string as it can be done with numbers. For example:
- string: "{3E5FC7F9-9A51-4367-9063-A120244FBEC7}" = CLSID_CMSTPLUA
hm, good point! maybe it makes sense to make the extra context available to all features.
+1
remove special handling of characteristic feature when serializing and refreeze testbed files.
it currently maintains backwards compatibility with an old format, by using a list of two elements.