
capa's People

Contributors

aaronatp, aayush-goel-04, ana06, anushkavirgaonkar, atlas-64, capa-bot, captaingeech42, cclauss, colton-gabertan, dependabot[bot], doomedraven, ggold7046, jcrussell, jsoref, kn0wl3dge, manasghandat, mike-hunhoff, mr-tz, psifertex, rainrat, re-fox, recvfrom, ronniesalomonsen, ruppde, stevemk14ebr, threathive, uckelman-sf, williballenthin, xusheng6, yelhamer

capa's Issues

capa explorer fails on Python 2, IDA 7.5

IDAPython: Error while calling Python callback <OnCreate>:
Traceback (most recent call last):
  File "ida_capa_explorer.py", line 99, in OnCreate
    self.load_capa_results()
  File "capa/capa/ida/ida_capa_explorer.py", line 342, in load_capa_results
    capabilities = capa.main.find_capabilities(rules, capa.features.extractors.ida.IdaFeatureExtractor(), True)
  File "capa\capa\main.py", line 99, in find_capabilities
    for f in tqdm.tqdm(extractor.get_functions(), disable=disable_progress, unit=" functions"):
  File "C:\Python27\lib\site-packages\tqdm\_tqdm.py", line 997, in __iter__
    for obj in iterable:
  File "capa\capa\features\extractors\ida\__init__.py", line 54, in get_functions
    from capa.features.extractors.ida import helpers
ImportError: cannot import name helpers
INFO:capa:form closed.
Python>sys.version
'2.7.15 (v2.7.15:ca079a3ea3, Apr 30 2018, 16:30:26) [MSC v.1500 64 bit (AMD64)]'

add JSON-formatted output mode

use this JSON as the source data for all formatters. this will ensure it has all data necessary to render complete details of capa matches.

the JSON document will be the primary method of integration for external tools and scripts, rather than supporting a multitude of integrations.
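for illustration, a rough sketch of what such a document might carry. every field name below is a placeholder, not a committed schema; the match node loosely follows the doc structure shown elsewhere in this tracker:

# hypothetical sketch only: field names are placeholders, not the real capa JSON format
import json

doc = {
    "meta": {
        "md5": "34404a3fb9804977c6ab86cb991fb130",
        "path": "tests/data/34404a3fb9804977c6ab86cb991fb130.exe_",
        "version": "0.0.0",
    },
    "rules": {
        "packed with UPX": {
            "meta": {"namespace": "anti-analysis/packer/upx", "scope": "file"},
            "matches": {"0x0": {"success": True, "children": [], "locations": []}},
        },
    },
}

# every renderer (default table, verbose, IDA plugin, ...) would consume this document
print(json.dumps(doc, indent=2))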

Get rid of the Element class

The Element class is just used for testing. By using Element we are not testing the actual code. Also, every time we implement a new feature in the Feature class, we need to implement it for Element as well. I think it would be a better idea to use real classes for testing and get rid of Element. We could start by substituting it with number, which should be straightforward. I also think it could be a good idea to add some more tests for the different Feature classes.

Related to #5, as it simplifies the implementation.
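For concreteness, a minimal sketch of a test written against a real feature class. This assumes (per the engine snippet further down in this tracker) that evaluation is a membership check against a ctx mapping of feature to locations, and that the returned Result is truthy on success:

# sketch only: uses the real Mnemonic feature instead of Element; the assertions
# assume evaluate() returns a Result that is truthy when the feature was matched
import capa.features.insn

def test_mnemonic_feature():
    mov = capa.features.insn.Mnemonic("mov")
    ctx = {mov: {0x401002}}  # feature -> set of locations where it was extracted
    assert bool(mov.evaluate(ctx)) is True
    assert bool(capa.features.insn.Mnemonic("xor").evaluate(ctx)) is False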

@mr-tz @williballenthin what do you think?

add capafmt utility for consistent formatting of rules

it would be nice to format rules with a consistent style.

this includes:

  • whitespacing, especially with lists
  • order meta before features

by default, python yaml emits keys alphabetically. as an example:

rule:
  meta:
    att&ck:
    - Defense Evasion::Obfuscated Files or Information T1027.002
    author: [email protected]
    examples:
    - CD2CBA9E6313E8DF2C1273593E649682
    - Practical Malware Analysis Lab 01-02.exe_:0x0401000
    mbc:
    - Anti-Static Analysis::Software Packing
    name: packed with UPX
    namespace: anti-analysis/packer/upx
    scope: file
  features:
  - or:
    - section: UPX0
    - section: UPX1

this would look nicer:

rule:
  meta:
    name: packed with UPX
    namespace: anti-analysis/packer/upx
    author: [email protected]
    att&ck:
    - Defense Evasion::Obfuscated Files or Information T1027.002
    mbc:
    - Anti-Static Analysis::Software Packing
    examples:
    - CD2CBA9E6313E8DF2C1273593E649682
    - Practical Malware Analysis Lab 01-02.exe_:0x0401000
    scope: file
  features:
  - or:
    - section: UPX0
    - section: UPX1
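a very rough sketch of the reordering idea (assuming PyYAML 5.1+ for sort_keys=False and Python 3.7+ dict ordering; a real capafmt would also need to preserve comments, which plain PyYAML drops):

# sketch only, not the capafmt implementation: re-emit meta keys in a fixed order
import yaml

META_ORDER = ["name", "namespace", "author", "att&ck", "mbc", "examples", "scope"]

def reorder_rule(text):
    doc = yaml.safe_load(text)
    meta = doc["rule"]["meta"]
    ordered = {k: meta[k] for k in META_ORDER if k in meta}
    ordered.update({k: v for k, v in meta.items() if k not in ordered})
    doc["rule"]["meta"] = ordered
    return yaml.safe_dump(doc, sort_keys=False, default_flow_style=False)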

simplify metadata rendering

I propose the following formats to reduce duplicate information (MD5) and display the most important information first.

capa report could be included as a header/heading as well

default before

+------------------------+--------------------------------------------------------------+
| capa report for        | 34404a3fb9804977c6ab86cb991fb130                             |
| timestamp              | 2020-07-03T12:41:55.267000                                   |
| version                | 0.0.0                                                        |
| path                   | tests\data\34404a3fb9804977c6ab86cb991fb130.exe_             |
| md5                    | 34404a3fb9804977c6ab86cb991fb130                             |
+------------------------+--------------------------------------------------------------+

after

+------------------------+--------------------------------------------------------------+
| md5                    | 34404a3fb9804977c6ab86cb991fb130                             |
| path                   | tests\data\34404a3fb9804977c6ab86cb991fb130.exe_             |
| timestamp              | 2020-07-03T12:41:55.267000                                   |
| capa version           | 0.0.0                                                        |
+------------------------+--------------------------------------------------------------+



verbose, vverbose (should use same function) before

capa report for  34404a3fb9804977c6ab86cb991fb130
timestamp        2020-07-03T12:42:07.813000
version          0.0.0
path             tests\data\34404a3fb9804977c6ab86cb991fb130.exe_
md5              34404a3fb9804977c6ab86cb991fb130
sha1             b345e6fae155bfaf79c67b38cf488bb17d5be56d
sha256           c6930e298bba86c01d0fe2c8262c46b4fce97c6c5037a193904cfc634246fbec
format           auto
extractor        VivisectFeatureExtractor
base address     0x400000

after

md5              34404a3fb9804977c6ab86cb991fb130
sha1             b345e6fae155bfaf79c67b38cf488bb17d5be56d
sha256           c6930e298bba86c01d0fe2c8262c46b4fce97c6c5037a193904cfc634246fbec
path             tests\data\34404a3fb9804977c6ab86cb991fb130.exe_
timestamp        2020-07-03T12:42:07.813000
capa version     0.0.0
format           auto
extractor        VivisectFeatureExtractor
base address     0x400000
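a quick sketch of rendering the proposed order with tabulate (which capa already depends on); the values are copied from the example above:

# illustration only: render metadata rows in the proposed order
import tabulate

meta = {
    "md5": "34404a3fb9804977c6ab86cb991fb130",
    "path": "tests\\data\\34404a3fb9804977c6ab86cb991fb130.exe_",
    "timestamp": "2020-07-03T12:41:55.267000",
    "version": "0.0.0",
}
rows = [
    ("md5", meta["md5"]),
    ("path", meta["path"]),
    ("timestamp", meta["timestamp"]),
    ("capa version", meta["version"]),
]
print(tabulate.tabulate(rows, tablefmt="psql"))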

TypeError: Can't instantiate abstract class NullFeatureExtractor

Tests are failing in master:

――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ERROR collecting tests/test_freeze.py ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
tests/test_freeze.py:27: in <module>
   0x401002: {"features": [(0x401002, capa.features.insn.Mnemonic("mov")),],},
E   TypeError: Can't instantiate abstract class NullFeatureExtractor with abstract methods get_base_address

==================================================================================================== warnings summary ====================================================================================================
/usr/local/lib/python2.7/site-packages/vivisect/parsers/__init__.py:14
 /usr/local/lib/python2.7/site-packages/vivisect/parsers/__init__.py:14: DeprecationWarning: the md5 module is deprecated; use hashlib instead
   import md5

-- Docs: https://docs.pytest.org/en/latest/warnings.html
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Introduced in ff44801
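the obvious fix is for the test-only extractor to implement the newly abstract method; a sketch (the base-class import path and the 0x0 return value are assumptions):

# sketch: satisfy the new abstract method on the test-only extractor
import capa.features.extractors  # assumed module providing the FeatureExtractor base class

class NullFeatureExtractor(capa.features.extractors.FeatureExtractor):
    # ... the existing test helpers (get_functions, extract_* methods) stay as they are ...

    def get_base_address(self):
        return 0x0  # placeholder: the synthetic test data has no real image base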

remove args from Features

After #39 it is really obvious that args and value are a duplication for most of the features. In most cases args = [value]. In a few features value has a different name, but I think it makes sense to rename this attribute. We could think about it as the value in the yaml file. So, I propose to get rid of args and introduce value on Feature (the main class) instead of on the subclasses. Removing the duplication would simplify the code.
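Roughly, the shape being proposed (a simplified sketch, not the actual class):

# simplified sketch of the proposal: a single value attribute on the base class,
# mirroring the value written in the yaml rule, instead of per-subclass args
class Feature:
    def __init__(self, value, description=None):
        self.name = self.__class__.__name__.lower()
        self.value = value
        self.description = description

    def __eq__(self, other):
        return (self.name, self.value) == (other.name, other.value)

    def __hash__(self):
        return hash((self.name, self.value))

class Mnemonic(Feature):
    pass  # Mnemonic("mov").value == "mov"; no separate args list needed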

@mr-tz @williballenthin what do you think?

output feature count

capa shows the file feature count

INFO:capa:analyzed file and extracted 21 file features

to avoid confusion, this should be removed or extended to also show function features

vivisect workspace creation

@mr-tz

vivisect and/or viv_util updates may result in modified workspaces. By default getWorkspace loads existing .viv files if they exist. This can lead to confusion, misleading analysis and errors.

@williballenthin

we should probably report this upstream.

@williballenthin

in the meantime, maybe we can stuff the viv version in a meta field and do the check ourselves.
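a rough sketch of that workaround (vivisect's setMeta/getMeta workspace accessors are assumed here; how to obtain the installed vivisect version is left open):

# sketch only: record the vivisect version used to build the .viv and warn on mismatch
import logging

logger = logging.getLogger("capa")

def check_viv_version(vw, current_version):
    # vw.getMeta/vw.setMeta are assumed workspace-metadata accessors
    saved = vw.getMeta("capa:viv_version")
    if saved is None:
        vw.setMeta("capa:viv_version", current_version)
    elif saved != current_version:
        logger.warning(
            "existing .viv built with vivisect %s, now running %s; consider re-analyzing",
            saved, current_version,
        )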

discussion: capa doesn't handle sandbox or API traces

capa relies on analysis of code structures to identify patterns. this is similar to matching sequences of API calls or other events in a sandbox, but not exactly. right now, capa rules don't directly translate to identifying behaviors from sandbox or debugging output, but it seems like there's a lot of overlap. maybe we can find a way to re-use a lot of work we've done for the static analysis rules.

capa can't be used as a library on py3

capa relies on vivisect for its standalone code analysis (when run within IDA, it uses IDA's analysis). since vivisect is py2-only, this means capa is py2-only, when used standalone or as a library. we should provide an analysis backend that can be used on py3, as this is the future.

we're aware that everyone (actually, including ourselves) has already moved on to py3. you should be aware that using vivisect was the path of least resistance to developing capa. now that we've proved that capa works and is useful, it's finally appropriate to dedicate substantial time towards the upgrade.

note, the capa code base is already py3 compatible. this is strictly a limitation of the backend that we ship by default.

pull function scope features into file scope

@mr-tz

add another scope, "program", to encompass file and function (and lower) scopes

@mr-tz

Should we prioritize this feature? We have various instances from Ana's work where this would be helpful. According to @mwilliams31 schannel is also likely implemented across multiple functions.

@williballenthin

works for me. shall we have @Ana06 tackle it? will require getting familiar with the matching logic, which is a good lesson (and maybe torture???).

@mr-tz

sounds good 😄 if it becomes too much torture, let us know, @Ana06

discussion: capa JSON format

I have the following questions/comments after changing the IDA plugin to use the new JSON format:

  • Does it make sense to define (if not done already) a JSON schema for the new format? (a small validation sketch follows this list)

    • Pros: Schema would allow for easy validation of the format and serve as documentation for developers wanting to ingest the data into their systems
    • Cons: Time and effort
  • Does it make sense to include the original rule content for a match? This data can be found in the source field of the parent match, but finding the original source this way isn't as convenient

    • Pros: Convenience when parsing/displaying rule data for match
    • Cons: Duplicate data in output
  • Does it make sense to include the locations for range? These locations, and corresponding context, e.g. the instruction at a location, used to be displayed in the IDA plugin.

    • Pros: Locations can be rendered providing additional context
    • Cons: More data in output
  • Does it make sense to include additional meta data e.g. hash value, entry point, etc. specific to the binary file from which the output was produced?

    • Pros: Systems looking to ingest the data could render the additional context - meta data could be used to map output back to original binary
    • Cons: More data in output and more work on extractor end to get the meta data
  • Does it make sense to include feature comments e.g. PAGE_EXECUTE_READWRITE from number: 0x40 = PAGE_EXECUTE_READWRITE

    • Pros: Additional context/comments can be rendered
    • Cons: More data in output
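regarding the schema question above, even a small schema would let downstream consumers validate documents before ingesting them; a hypothetical sketch (every field name is illustrative, not the actual capa format):

# hypothetical sketch: field names are illustrative, not the real capa JSON format
import jsonschema  # third-party validator

CAPA_DOC_SCHEMA = {
    "type": "object",
    "required": ["meta", "rules"],
    "properties": {
        "meta": {"type": "object"},
        "rules": {
            "type": "object",
            "additionalProperties": {
                "type": "object",
                "required": ["meta", "matches"],
            },
        },
    },
}

def validate_doc(doc):
    # raises jsonschema.ValidationError when the document does not conform
    jsonschema.validate(instance=doc, schema=CAPA_DOC_SCHEMA)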

doc missing locations for "calls from" characteristic

The doc format does not include locations for calls from characteristic. From my understanding these locations are recorded and should be included?

{'children': [],
 'locations': (),
 'node': {'statement': {'child': {'characteristic': 'calls from',
                                  'type': 'characteristic'},
                        'max': 4,
                        'min': 0,
                        'type': 'range'},
          'type': 'statement'},
 'success': True},

vivisect/viv-utils - Exception: Invalid File: shellcode

$ capa -f sc32 tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32
INFO:capa:--------------------------------------------------------------------------------
INFO:capa: Using default embedded rules.
INFO:capa: To provide your own rules, use the form `capa.exe  ./path/to/rules/  /path/to/mal.exe`.
INFO:capa: You can see the current default rule set here:
INFO:capa:     https://github.com/fireeye/capa-rules
INFO:capa:--------------------------------------------------------------------------------
WARNING:capa:skipping non-.yml file: .git
WARNING:capa:skipping non-.yml file: README.md
INFO:capa:successfully loaded 277 rules
INFO:capa:generating vivisect workspace for: tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\vivisect\impemu\monitor.py", line 147, in prehook
    cb(self, emu, op, starteip)
  File "c:\python27\lib\site-packages\vivisect\analysis\generic\switchcase.py", line 19, in analyzeJmp
    ctx = getSwitchBase(vw, op, starteip, emu)
  File "c:\python27\lib\site-packages\vivisect\analysis\generic\switchcase.py", line 69, in getSwitchBase
    imgbase = vw.getFileMeta(filename, 'imagebase')
  File "c:\python27\lib\site-packages\vivisect\__init__.py", line 2484, in getFileMeta
    raise Exception("Invalid File: %s" % filename)
Exception: Invalid File: shellcode
[...]
INFO:capa:format: blob, platform: windows, architecture: i386, number of functions: 42
INFO:capa:analyzed file and extracted 112 features
+------------------------+----------------------------------------------------------------+
| ATT&CK Tactic          | ATT&CK Technique                                               |
|------------------------+----------------------------------------------------------------|
| DEFENSE EVASION        | Obfuscated Files or Information [T1027]                        |
| EXECUTION              | Shared Modules [T1129]                                         |
+------------------------+----------------------------------------------------------------+

+---------------------------------------------+----------------------------------------------+
| CAPABILITY                                  | NAMESPACE                                    |
|---------------------------------------------+----------------------------------------------|
| contain obfuscated stackstrings (2 matches) | anti-analysis/obfuscation/string/stackstring |
| encode data using XOR                       | data-manipulation/encoding/xor               |
| parse PE header                             | load-code/pe                                 |
+---------------------------------------------+----------------------------------------------+

INFO:capa:done.

engine: support matching on rule namespace prefixes

right now we support matching on other rule names, like match: encrypt data with RC4 KSA

we should support matching on namespaces, as well, like match: data-manipulation/encryption

this would mean that rule authors don't have to know about all the possible techniques to do a thing (like encryption).
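for example, a single rule could then reference the whole namespace instead of enumerating every algorithm-specific rule (sketch only, not an existing rule):

rule:
  meta:
    name: encrypt data                         # illustrative only
    namespace: test
    scope: function
  features:
    - or:
      - match: encrypt data with RC4 KSA       # current: match by rule name
      - match: data-manipulation/encryption    # proposed: match by namespace prefix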

assume characteristics always encode existence

after months of use, it seems that characteristic features are only used like characteristic(nzxor): True. that is, the value is always True. we can simplify and make the rule syntax more consistent by changing the format to look like characteristic: nzxor and count(characteristic(nzxor)).

to match the non-existence of this feature, use not: characteristic: ... or count(characteristic(...)): 0.
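in rule form, the change would look roughly like this (sketch):

# current syntax
- characteristic(nzxor): true

# proposed syntax
- characteristic: nzxor
- count(characteristic(nzxor)): 2 or more

# proposed non-existence checks
- not:
  - characteristic: nzxor
- count(characteristic(nzxor)): 0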

plan: rule reorganization

  • agree on proposed rule names and namespaces. see shared excel spreadsheet.
  • develop script to do migration #25
  • develop formatter to ensure consistency of formatted rules #8
  • run formatter on all rules (and confirm results) mandiant/capa-rules#12
  • execute migration mandiant/capa-rules#14
  • post snapshot of excel spreadsheet mandiant/capa-rules#14
  • run formatter on all rules
  • update linter to support namespaces
  • document rule naming and namespacing conventions
  • update outputter to support namespaces #34
  • update readme with new rules and output examples
  • update ida plugin to support namespaces @mike-hunhoff #58
  • update readme with screenshots of IDA plugin @mike-hunhoff #66
  • update FC service to support namespaces @MalwareMechanic

Support descriptions for regular expressions

This was not implemented in #39, as RegExp is not a Feature. We either need to make RegExp a feature or implement this for RegExp separately. It should work the same way as it does for strings.

Just tracking it here, so that we don't forget about it. 😉

discussion: capa doesn't do anything for non-Windows or non-PE files

capa contains a small amount of code and a large amount of default rules that assume the input file is a Windows PE file. this is because the original authors primarily analyze Windows malware. there is nothing stopping analysis of Linux ELF or MacOS Mach-O binaries; however, we haven't yet had the experience, sample binaries, nor time to make this happen.

support for additional platforms may be added in the future, especially with (1) contributions from experts in those fields, and (2) sufficient sample binaries to demonstrate capa works as expected. if you're interested in helping out in these areas, please get in touch!

vivisect extractor: bytes features for immediate operands

currently this gets bytes features for many invalid immediate operands

        if isinstance(oper, envi.archs.i386.disasm.i386ImmOper):
            v = oper.getOperValue(oper)

for example add ebp, 0Bh etc.

should this case be fine-tuned or removed?
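one possible fine-tuning, sketched as a helper around the snippet above (vw.isValidPointer is the assumed vivisect check for "points into mapped memory", and the 0x1000 cutoff for small constants is arbitrary):

import envi.archs.i386.disasm

def is_interesting_immediate(vw, oper):
    # sketch: only treat an immediate as a bytes-feature candidate when it points
    # into mapped memory; this would skip small constants like "add ebp, 0Bh"
    if not isinstance(oper, envi.archs.i386.disasm.i386ImmOper):
        return False
    v = oper.getOperValue(oper)
    return v >= 0x1000 and vw.isValidPointer(v)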

Add a CONTRIBUTING file

I think we should add a CONTRIBUTING file to collect some important information we now have in other documents. This information is usually in the CONTRIBUTING file in other projects, and it is where people expect it to be. In addition, it is used by GitHub to help guide new contributors. For example, when someone opens a pull request or creates an issue, they will see a link to that file:

[screenshot: GitHub surfacing the contributing guidelines link on a new issue/pull request]

Reference: https://help.github.com/en/github/building-a-strong-community/setting-guidelines-for-repository-contributors

I think this document should include the following information:

  • How to contribute with issues, including a reference to the capa-rules repository and which issues belong to which repo. This should also be linked from the issue template.
  • How to write rules, linking current documentation and explaining the linter
  • How to contribute with code, including how to set the project up (currently in different documents) and how to run the tests.

Something else?

discussion: false positives in vcrt functions

there are a number of interesting rules, like manual PEB parsing, that fire on standard routines inserted by the MSVC compiler. typically, we'd want to include these in the output, except that some of these normal runtime functions aren't doing anything nefarious (as the rule might suggest, like anti-vm).

this leads to the desire that we'd want to filter out some known functions from matching.

there are at least two obvious approaches:

  1. using existing capa logic/rules to match known functions (like count of basic blocks, count and/or distribution of mnemonics, etc.) and then negate (not) those matches
  2. rely on the analysis backend to provide metadata about functions, such as auto-detected function name, and let rules match against this

both of these have tradeoffs, and it's not clear what we should do.

if we use capa infrastructure to match functions,

pro:

  • need no new features or syntax, can do it today
  • works across all analysis backends
  • easy to inspect

con:

  • we have to maintain function signatures (not our goal here)
  • our signatures may not be as good as purpose-built tech, like FLIRT or Ghidra's database
  • matching N signatures against M functions may introduce performance issues (maybe, this is a guess)

if we rely on the analysis backend to match functions,

pro:

  • rely on backend expertise to do function id very well
  • less maintenance

con:

  • need new syntax, maybe like function/name: __init_iob
  • different analysis backends have different quality, e.g. IDA is very good while vivisect has minimal coverage
  • different analysis backends may use different names/formats for function names that we have to normalize

ci: configure isort for code formatting

$ isort --length-sort --line-width 120 --thirdparty idc --thirdparty idaapi --thirdparty idautils --thirdparty ida_gdl --thirdparty PyQt5 --thirdparty argparse --builtin posixpath --thirdparty tabulate --thirdparty viv_utils --recursive .
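the same settings could live in setup.cfg so CI and local runs stay in sync; a sketch (option names are assumed from isort 4.x config keys and should be double-checked against the pinned version):

# sketch of an equivalent setup.cfg section; key names are assumptions, not verified
[isort]
line_length = 120
length_sort = true
known_third_party = idc,idaapi,idautils,ida_gdl,PyQt5,argparse,tabulate,viv_utils
known_standard_library = posixpath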

integrate capa with ghidra

lots of people use ghidra, which is free and open source. we should recommend a way of integrating capa results into ghidra.

count: 0 - range fails if no feature extracted

test case: rule and output (it should match on functions with no calls)

rule:
  meta:
    name: calls from
    namespace: test
    author: [email protected]
    scope: function
  features:
    - or:
      - count(mnemonic(call)): 0
      - count(characteristic(calls from)): 0

capa tests/data/34404a3fb9804977c6ab86cb991fb130.exe_ -t test -vv
INFO:capa:--------------------------------------------------------------------------------
INFO:capa: Using default embedded rules.
INFO:capa: To provide your own rules, use the form `capa.exe  ./path/to/rules/  /path/to/mal.exe`.
INFO:capa: You can see the current default rule set here:
INFO:capa:     https://github.com/fireeye/capa-rules
INFO:capa:--------------------------------------------------------------------------------
WARNING:capa:skipping non-.yml file: .git
WARNING:capa:skipping non-.yml file: README.md
INFO:capa:successfully loaded 278 rules
INFO:capa:selected 1 rules
INFO:capa:generating vivisect workspace for: tests/data/34404a3fb9804977c6ab86cb991fb130.exe_
INFO:capa:format: pe, platform: windows, architecture: i386, number of functions: 853
INFO:capa:analyzed file and extracted 1549 features

INFO:capa:done.

capa/engine.py:156

    def evaluate(self, ctx):
        if self.child not in ctx:
            return Result(False, self, [])
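a possible fix sketch (not a committed change): treat a never-extracted child feature as a count of zero instead of failing outright, keeping the Result shape shown above:

    # sketch of a fix: a missing child feature means a count of zero, which can
    # still satisfy the range; handling of an unbounded max ("or more") is omitted
    def evaluate(self, ctx):
        count = len(ctx.get(self.child, []))
        if self.min <= count <= self.max:
            return Result(True, self, [])
        return Result(False, self, [])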

count basic block

This rule is not working as I expect, I get no results. Am I using this wrong?

rule:
  meta:
    name: count bb
    namespace: test
    scope: function
  features:
    - and:
      - count(basic blocks): 1 or more

post-commit git hook incorrectly stashes unstaged changes

when i stage and commit only some of the pending changes, the post-commit git hook places the unstaged changes onto the git stash stack. i have to manually pop them with git stash pop stash@{0}. i would rather the git hook left these unstaged changes untouched (at the very least, they should be restored when the hook completes).

last night i was really scared that i had lost hours of work until i noticed the changes were hidden in the stack.

discussion: capa doesn't extract features from packed files

capa relies on disassembly and code analysis that can easily be defeated by packing. right now, capa doesn't attempt to do any auto-unpacking, so even trivially packed samples can bypass capa. fortunately, capa can often recognize when packing is in use (if you notice a bypass, submit a rule!), and will emit a warning about this.

doing auto-unpacking is a non-trivial job, and not really in scope for what capa does. however, if there are easy ways to make this work, we can revisit the idea.

Associate context with a string

@Ana06

In some cases it could be useful to associate context with a string, as can be done with numbers. For example:

- string: "{3E5FC7F9-9A51-4367-9063-A120244FBEC7}" = CLSID_CMSTPLUA

@mr-tz

hm, good point! maybe it makes sense to make the extra context available to all features.

@williballenthin

+1

update serialization of characteristic feature

remove special handling of characteristic feature when serializing and refreeze testbed files.

it currently maintains backwards compatibility with an old format, by using a list of two elements.
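roughly, the shape change being described (hypothetical illustration; the real frozen format may differ):

# hypothetical illustration of the freeze-format change; actual serialized shapes may differ
old = ("characteristic", ("nzxor", True))  # legacy: value is a (name, True) pair
new = ("characteristic", "nzxor")          # proposed: value is just the characteristic name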
