mandiant / capa
The FLARE team's open-source tool to identify capabilities in executable files.
License: Apache License 2.0
IDAPython: Error while calling Python callback <OnCreate>:
Traceback (most recent call last):
File "ida_capa_explorer.py", line 99, in OnCreate
self.load_capa_results()
File "capa/capa/ida/ida_capa_explorer.py", line 342, in load_capa_results
capabilities = capa.main.find_capabilities(rules, capa.features.extractors.ida.IdaFeatureExtractor(), True)
File "capa\capa\main.py", line 99, in find_capabilities
for f in tqdm.tqdm(extractor.get_functions(), disable=disable_progress, unit=" functions"):
File "C:\Python27\lib\site-packages\tqdm\_tqdm.py", line 997, in __iter__
for obj in iterable:
File "capa\capa\features\extractors\ida\__init__.py", line 54, in get_functions
from capa.features.extractors.ida import helpers
ImportError: cannot import name helpers
INFO:capa:form closed.
Python>sys.version
'2.7.15 (v2.7.15:ca079a3ea3, Apr 30 2018, 16:30:26) [MSC v.1500 64 bit (AMD64)]'
and maybe suggest "reference" -> "references"
use this JSON as the source data for all formatters. this will ensure it has all data necessary to render complete details of capa matches.
the JSON document will be the primary method of integration for external tools and scripts, rather than supporting a multitude of integrations.
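As a sketch of what such an external integration might look like (the top-level `"rules"` mapping of rule name to match details is an assumption for illustration, not a frozen schema):

```python
import json


def capability_names(doc):
    """List the matched rule names from a capa JSON report.

    Assumes (hypothetically) a top-level "rules" mapping of
    rule name -> match details.
    """
    return sorted(doc.get("rules", {}))


def load_capability_names(path):
    """Load a capa JSON report from disk and list its capabilities."""
    with open(path) as f:
        return capability_names(json.load(f))
```

A script like this only needs the JSON document, not capa's internals, which is the point of making JSON the primary integration method.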
The `Element` class is just used for testing. By using `Element` we are not testing the actual code. Also, every time we implement a new feature for the `Feature` class, we need to implement it for `Element` as well. I think it would be a better idea to use real classes for testing and get rid of `Element`. We could start by substituting it with `Number`, which should be straightforward. Although I think it could be a good idea to add some more tests for the different `Feature` classes.
Related to #5, as it simplifies the implementation.
@mr-tz @williballenthin what do you think?
it would be nice to format rules with a consistent style.
this includes:
`meta` before `features`
by default, python yaml emits keys alphabetically. as an example:
rule:
meta:
att&ck:
- Defense Evasion::Obfuscated Files or Information T1027.002
author: [email protected]
examples:
- CD2CBA9E6313E8DF2C1273593E649682
- Practical Malware Analysis Lab 01-02.exe_:0x0401000
mbc:
- Anti-Static Analysis::Software Packing
name: packed with UPX
namespace: anti-analysis/packer/upx
scope: file
features:
- or:
- section: UPX0
- section: UPX1
this would look nicer:
rule:
meta:
name: packed with UPX
namespace: anti-analysis/packer/upx
author: [email protected]
att&ck:
- Defense Evasion::Obfuscated Files or Information T1027.002
mbc:
- Anti-Static Analysis::Software Packing
examples:
- CD2CBA9E6313E8DF2C1273593E649682
- Practical Malware Analysis Lab 01-02.exe_:0x0401000
scope: file
features:
- or:
- section: UPX0
- section: UPX1
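The desired ordering above can be produced by disabling the emitter's key sorting. A minimal sketch, assuming PyYAML ≥ 5.1 (where `yaml.dump` takes `sort_keys`) and Python 3.7+ insertion-ordered dicts; the author email is a placeholder:

```python
import yaml

# keys are emitted in insertion order when sort_keys=False,
# instead of PyYAML's default alphabetical order
meta = {
    "name": "packed with UPX",
    "namespace": "anti-analysis/packer/upx",
    "author": "author@example.com",  # placeholder
    "scope": "file",
}

doc = yaml.dump({"rule": {"meta": meta}}, default_flow_style=False, sort_keys=False)
print(doc)
```

For older PyYAML (or Python 2), the same effect needs a custom representer for `collections.OrderedDict`.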
I propose the following formats to reduce duplicate information (MD5) and display the most important information first.
`capa report` could be included as a header/heading as well
default before
+------------------------+--------------------------------------------------------------+
| capa report for | 34404a3fb9804977c6ab86cb991fb130 |
| timestamp | 2020-07-03T12:41:55.267000 |
| version | 0.0.0 |
| path | tests\data\34404a3fb9804977c6ab86cb991fb130.exe_ |
| md5 | 34404a3fb9804977c6ab86cb991fb130 |
+------------------------+--------------------------------------------------------------+
>>>>>
after
+------------------------+--------------------------------------------------------------+
| md5 | 34404a3fb9804977c6ab86cb991fb130 |
| path | tests\data\34404a3fb9804977c6ab86cb991fb130.exe_ |
| timestamp | 2020-07-03T12:41:55.267000 |
| capa version | 0.0.0 |
+------------------------+--------------------------------------------------------------+
verbose, vverbose (should use same function) before
capa report for 34404a3fb9804977c6ab86cb991fb130
timestamp 2020-07-03T12:42:07.813000
version 0.0.0
path tests\data\34404a3fb9804977c6ab86cb991fb130.exe_
md5 34404a3fb9804977c6ab86cb991fb130
sha1 b345e6fae155bfaf79c67b38cf488bb17d5be56d
sha256 c6930e298bba86c01d0fe2c8262c46b4fce97c6c5037a193904cfc634246fbec
format auto
extractor VivisectFeatureExtractor
base address 0x400000
>>>>>
after
md5 34404a3fb9804977c6ab86cb991fb130
sha1 b345e6fae155bfaf79c67b38cf488bb17d5be56d
sha256 c6930e298bba86c01d0fe2c8262c46b4fce97c6c5037a193904cfc634246fbec
path tests\data\34404a3fb9804977c6ab86cb991fb130.exe_
timestamp 2020-07-03T12:42:07.813000
capa version 0.0.0
format auto
extractor VivisectFeatureExtractor
base address 0x400000
Tests are failing in master:
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ERROR collecting tests/test_freeze.py ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
tests/test_freeze.py:27: in <module>
0x401002: {"features": [(0x401002, capa.features.insn.Mnemonic("mov")),],},
E TypeError: Can't instantiate abstract class NullFeatureExtractor with abstract methods get_base_address
==================================================================================================== warnings summary ====================================================================================================
/usr/local/lib/python2.7/site-packages/vivisect/parsers/__init__.py:14
/usr/local/lib/python2.7/site-packages/vivisect/parsers/__init__.py:14: DeprecationWarning: the md5 module is deprecated; use hashlib instead
import md5
-- Docs: https://docs.pytest.org/en/latest/warnings.html
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Introduced in ff44801
After #39 it is really obvious that `args` and `value` are a duplication for most of the features. In most cases `args = [value]`. In a few features `value` has a different name, but I think it makes sense to rename this attribute. We could think about it as the value in the yaml file. So, I propose to get rid of `args` and introduce `value` for `Feature` (the main class instead of the subclasses). Removing duplication would simplify the code.
@mr-tz @williballenthin what do you think?
capa shows the file feature count
INFO:capa:analyzed file and extracted 21 file features
to avoid confusion, this should be removed or extended to also show function features
`vivisect` and/or `viv_utils` updates may result in modified workspaces. By default `getWorkspace` loads existing `.viv` files if they exist. This can lead to confusion, misleading analysis, and errors.
we should probably report this upstream.
in the meantime, maybe we can stuff the viv version in a meta field and do the check ourselves.
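A sketch of that in-the-meantime check; the `capa:viv-version` meta key and the exact vivisect API calls are assumptions for illustration, not capa's actual implementation:

```python
def workspace_is_stale(stored_version, current_version):
    """Return True when a cached .viv workspace was produced by a
    different vivisect release and should be regenerated.

    In practice (assumed API), the version would be stamped at save time:
        vw.setMeta("capa:viv-version", vivisect.version)
    and read back before reusing the workspace:
        stored = vw.getMeta("capa:viv-version")
    """
    # no recorded version means we can't trust the cached analysis
    return stored_version is None or stored_version != current_version
```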
...unless in nursery
capa relies on analysis of code structures to identify patterns. this is similar to matching sequences of API calls or other events in a sandbox, but not exactly. right now, capa rules don't directly translate to identifying behaviors from sandbox or debugging output, but it seems like there's a lot of overlap. maybe we can find a way to re-use a lot of work we've done for the static analysis rules.
capa relies on vivisect for its standalone code analysis (when run within IDA, it uses IDA's analysis). since vivisect is py2-only, this means capa is py2-only, when used standalone or as a library. we should provide an analysis backend that can be used on py3, as this is the future.
we're aware that everyone (actually, including ourselves) has already moved on to py3. you should be aware that using vivisect was the path of least resistance to developing capa. now that we've proved that capa works and is useful, it's finally appropriate to dedicate substantial time towards the upgrade.
note, the capa code base is already py3 compatible. this is strictly a limitation of the backend that we ship by default.
add another scope, `program`, to encompass file and function (and lower) scopes
Should we prioritize this feature? We have various instances from Ana's work where this would be helpful. According to @mwilliams31, `schannel` is also likely implemented across multiple functions.
works for me. shall we have @Ana06 tackle it? will require getting familiar with the matching logic, which is a good lesson (and maybe torture???).
sounds good 😄 if it becomes too much torture, let us know, @Ana06
I have the following questions/comments after changing the IDA plugin to use the new JSON format:
Does it make sense to define (if not done already) a JSON schema for the new format?
Does it make sense to include the original rule content for `match`? This data can be found in the `source` field of the parent `match`, but finding the original source this way isn't as convenient.
Does it make sense to include the locations for `range`? These locations, and the corresponding context, e.g. the instruction at a location, used to be displayed in the IDA plugin.
Does it make sense to include additional metadata, e.g. hash value, entry point, etc., specific to the binary file from which the output was produced?
Does it make sense to include feature comments, e.g. `PAGE_EXECUTE_READWRITE` from `number: 0x40 = PAGE_EXECUTE_READWRITE`?
The doc format does not include locations for the `calls from` characteristic. From my understanding, these locations are recorded and should be included?
{'children': [],
'locations': (),
'node': {'statement': {'child': {'characteristic': 'calls from',
'type': 'characteristic'},
'max': 4,
'min': 0,
'type': 'range'},
'type': 'statement'},
'success': True},
rather than using an inline string `" = "` that is prone to typos and cannot be used with "find references", use a constant like `DESCRIPTION_SEPARATOR = " = "` and use it throughout the code. (from @williballenthin's comment in #39)
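A sketch of what that could look like (the helper name is illustrative, not capa's actual API):

```python
DESCRIPTION_SEPARATOR = " = "


def split_description(s):
    """Split a feature value with an optional human-readable description,
    e.g. "0x40 = PAGE_EXECUTE_READWRITE" -> ("0x40", "PAGE_EXECUTE_READWRITE").
    Returns (value, None) when there is no description."""
    value, _, description = s.partition(DESCRIPTION_SEPARATOR)
    return value, (description or None)
```

Beyond preventing typos, a single constant means "find references" in an editor shows every place the separator convention is relied upon.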
$ capa -f sc32 tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32
INFO:capa:--------------------------------------------------------------------------------
INFO:capa: Using default embedded rules.
INFO:capa: To provide your own rules, use the form `capa.exe ./path/to/rules/ /path/to/mal.exe`.
INFO:capa: You can see the current default rule set here:
INFO:capa: https://github.com/fireeye/capa-rules
INFO:capa:--------------------------------------------------------------------------------
WARNING:capa:skipping non-.yml file: .git
WARNING:capa:skipping non-.yml file: README.md
INFO:capa:successfully loaded 277 rules
INFO:capa:generating vivisect workspace for: tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32
Traceback (most recent call last):
File "c:\python27\lib\site-packages\vivisect\impemu\monitor.py", line 147, in prehook
cb(self, emu, op, starteip)
File "c:\python27\lib\site-packages\vivisect\analysis\generic\switchcase.py", line 19, in analyzeJmp
ctx = getSwitchBase(vw, op, starteip, emu)
File "c:\python27\lib\site-packages\vivisect\analysis\generic\switchcase.py", line 69, in getSwitchBase
imgbase = vw.getFileMeta(filename, 'imagebase')
File "c:\python27\lib\site-packages\vivisect\__init__.py", line 2484, in getFileMeta
raise Exception("Invalid File: %s" % filename)
Exception: Invalid File: shellcode
[...]
INFO:capa:format: blob, platform: windows, architecture: i386, number of functions: 42
INFO:capa:analyzed file and extracted 112 features
+------------------------+----------------------------------------------------------------+
| ATT&CK Tactic | ATT&CK Technique |
|------------------------+----------------------------------------------------------------|
| DEFENSE EVASION | Obfuscated Files or Information [T1027] |
| EXECUTION | Shared Modules [T1129] |
+------------------------+----------------------------------------------------------------+
+---------------------------------------------+----------------------------------------------+
| CAPABILITY | NAMESPACE |
|---------------------------------------------+----------------------------------------------|
| contain obfuscated stackstrings (2 matches) | anti-analysis/obfuscation/string/stackstring |
| encode data using XOR | data-manipulation/encoding/xor |
| parse PE header | load-code/pe |
+---------------------------------------------+----------------------------------------------+
INFO:capa:done.
notably, this is found under the Python Software Foundation (PSF) organization. seems to lend some weight. also, tons of stars and engagements.
right now we support matching on other rule names, like `match: encrypt data with RC4 KSA`.
we should support matching on namespaces as well, like `match: data-manipulation/encryption`.
this would mean that rule authors don't have to know about all the possible techniques to do a thing (like encryption).
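The matching itself could be a simple prefix test on the namespace path. A sketch (illustrative helper, not capa's engine):

```python
def matches_namespace(rule_namespace, query):
    """True when `query` names the rule's namespace or any ancestor of it,
    so `match: data-manipulation/encryption` also catches rules living in
    data-manipulation/encryption/rc4, etc."""
    # the "/" guard prevents "data-manipulation/enc" from matching
    # "data-manipulation/encryption" by accident
    return rule_namespace == query or rule_namespace.startswith(query + "/")
```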
after months of use, it seems that characteristic features are only used like `characteristic(nzxor): True`. that is, the value is always `True`. we can simplify and make the rule syntax more consistent by changing the format to look like `characteristic: nzxor` and `count(characteristic(nzxor))`.
to match the non-existence of this feature, use `not: characteristic: ...` or `count(characteristic(...)): 0`.
from #91
also, this:
INFO:capa:format: blob, platform: windows, architecture: i386, number of functions: 42
INFO:capa:analyzed file and extracted 112 features
blocked on gh actions being available, though.
rather than putting the python installation into the `setup-hooks.py` script, maybe use an `extras_require` for `[dev]`?
This was not implemented in #39, as `RegExp` is not a `Feature`. We need to either make `RegExp` a feature or implement this for `RegExp` as well. It should work in the same way as for strings.
Just tracking it here, so that we don't forget about it. 😉
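A minimal sketch of how a `RegExp` feature could evaluate, assuming (hypothetically) it sees the same extracted strings the string feature does:

```python
import re


class Regex:
    """Hypothetical RegExp feature: matches when any extracted string
    satisfies the pattern, mirroring how string features match."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def evaluate(self, strings):
        # re.search so the pattern may match anywhere in the string,
        # unless it is explicitly anchored
        return any(self.pattern.search(s) for s in strings)
```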
modulo some stripping of special characters
capa contains a small amount of code and a large amount of default rules that assume the input file is a Windows PE file. this is because the original authors primarily analyze Windows malware. there is nothing stopping analysis of Linux ELF or MacOS Mach-O binaries; however, we haven't yet had the experience, sample binaries, nor time to make this happen.
support for additional platforms may be added in the future, especially with (1) contributions from experts in those fields, and (2) sufficient sample binaries to demonstrate capa works as expected. if you're interested in helping out in these areas, please get in touch!
currently this gets bytes features for many invalid immediate operands:
if isinstance(oper, envi.archs.i386.disasm.i386ImmOper):
    v = oper.getOperValue(oper)
for example `add ebp, 0Bh`, etc.
should this case be fine-tuned or removed?
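One way to fine-tune it: only emit a bytes feature when the immediate falls inside a mapped region of the workspace. Sketched here without the vivisect types (a real workspace's `getMemoryMaps()` returns richer tuples; simplified to `(start, size)` pairs for illustration):

```python
def is_plausible_pointer(value, memmaps):
    """Return True when `value` lands inside some mapped region, i.e. it
    could be a pointer worth dereferencing for a bytes feature, rather
    than a small arithmetic constant like the 0x0B in `add ebp, 0Bh`.

    `memmaps` is a simplified list of (start, size) pairs."""
    return any(start <= value < start + size for start, size in memmaps)
```

The extractor would then skip `oper.getOperValue(oper)` results that fail this check instead of reading bytes at every immediate.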
I think we should add a CONTRIBUTING file to collect some important information we now have in other documents. This information is usually in the CONTRIBUTING file in other projects, and it is where people expect it to be. In addition, it is used by GitHub to help guide new contributors. For example, when someone opens a pull request or creates an issue, they will see a link to that file:
I think this document should include the following information:
the `capa-rules` repository and which issues belong to which repo. This should also be linked from the issues template.
Something else?
there are a number of interesting rules, like manual PEB parsing, that fire on standard routines inserted by the MSVC compiler. typically, we'd want to include these in the output, except that some of these normal runtime functions aren't doing anything nefarious (as the rule might suggest, like anti-vm).
this leads to the desire that we'd want to filter out some known functions from matching.
there are at least two obvious approaches:
use capa rule infrastructure to `not` the matches
rely on the analysis backend to recognize the functions
both of these have tradeoffs, and it's not clear what we should do.
if we use capa infrastructure to match functions,
pro:
con:
if we rely on the analysis backend to match functions,
pro:
con:
function/name: __init_iob
at least include a screenshot in the main readme so people can get a sense for what it does.
$ isort --length-sort --line-width 120 --thirdparty idc --thirdparty idaapi --thirdparty idautils --thirdparty ida_gdl --thirdparty PyQt5 --thirdparty argparse --builtin posixpath --thirdparty tabulate --thirdparty viv_utils --recursive .
lots of people use ghidra, which is free and open source. we should recommend a way of integrating capa results into ghidra.
via GitHub Actions
Currently has to be done manually, see https://stackoverflow.com/questions/5828324/update-git-submodule-to-latest-commit-on-origin
<in capa base dir>
cd rules/
git checkout master
git pull origin master
cd ..
git add rules/
git commit
git push origin master
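The same sequence can be collapsed using git's built-in submodule remote tracking (assuming the `rules` submodule tracks `master`):

```shell
# fetch and check out the latest upstream commit of the submodule,
# then record the new pointer in the superproject
git submodule update --remote rules
git add rules
git commit -m "rules: sync submodule to latest master"
git push origin master
```

A scheduled GitHub Actions workflow could run exactly these commands to keep the pointer fresh automatically.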
test case: rule and output (it should match on functions with no calls)
rule:
meta:
name: calls from
namespace: test
author: [email protected]
scope: function
features:
- or:
- count(mnemonic(call)): 0
- count(characteristic(calls from)): 0
capa tests/data/34404a3fb9804977c6ab86cb991fb130.exe_ -t test -vv
INFO:capa:--------------------------------------------------------------------------------
INFO:capa: Using default embedded rules.
INFO:capa: To provide your own rules, use the form `capa.exe ./path/to/rules/ /path/to/mal.exe`.
INFO:capa: You can see the current default rule set here:
INFO:capa: https://github.com/fireeye/capa-rules
INFO:capa:--------------------------------------------------------------------------------
WARNING:capa:skipping non-.yml file: .git
WARNING:capa:skipping non-.yml file: README.md
INFO:capa:successfully loaded 278 rules
INFO:capa:selected 1 rules
INFO:capa:generating vivisect workspace for: tests/data/34404a3fb9804977c6ab86cb991fb130.exe_
INFO:capa:format: pe, platform: windows, architecture: i386, number of functions: 853
INFO:capa:analyzed file and extracted 1549 features
INFO:capa:done.
capa/engine.py:156
def evaluate(self, ctx):
    if self.child not in ctx:
        return Result(False, self, [])
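One possible fix, sketched outside the real engine (assuming `ctx` maps each feature to the list of locations where it was found): treat a feature missing from the match context as occurring zero times, so `count(...): 0` and zero-minimum ranges can still succeed instead of failing outright.

```python
class Range:
    """Simplified model of a count(...) statement; evaluate returns a
    bare bool here rather than capa's Result object."""

    def __init__(self, child, min=0, max=(1 << 64) - 1):
        self.child = child
        self.min = min
        self.max = max

    def evaluate(self, ctx):
        # a feature absent from ctx occurred zero times; only fail
        # when zero falls outside the requested range
        count = len(ctx.get(self.child, []))
        return self.min <= count <= self.max
```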
This rule is not working as I expect, I get no results. Am I using this wrong?
rule:
meta:
name: count bb
namespace: test
scope: function
features:
- and:
- count(basic blocks): 1 or more
add a flag `--debug` to enable DEBUG level logging. this is independent of the `--verbose` mode that affects result output.
when i stage and commit only some of the pending changes, the post-commit git hook places unstaged changes into the git stash stack. i have to manually pop them with `git stash pop stash@{0}`. i would rather have these unstaged changes untouched by the git hook (at least, they should be there when the hook completes its job).
last night i was really scared that i had lost hours of work until i noticed the changes were hidden in the stack.
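a sketch of a hook pattern that avoids the problem: only stash when there are unstaged changes, and always restore them before the hook finishes (this is an illustrative pre-commit shape, not capa's actual hook; note that `pop` can still conflict when staged and unstaged hunks touch the same lines):

```shell
#!/bin/sh
# set unstaged edits aside so checks see only the staged tree,
# then put the working tree back exactly as it was
STASHED=0
if ! git diff --quiet; then
    git stash push --quiet --keep-index -m "pre-commit autostash"
    STASHED=1
fi

# ... run formatters/linters against the staged files here ...

if [ "$STASHED" -eq 1 ]; then
    git stash pop --quiet
fi
```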
this could include:
capa relies on disassembly and code analysis that can easily be defeated by packing. right now, capa doesn't attempt to do any auto-unpacking, so even trivially packed samples can bypass capa. fortunately, capa can often recognize when packing is in use (if you notice a bypass, submit a rule!), and will emit a warning about this.
doing auto-unpacking is a non-trivial job, and not really in scope for what capa does. however, if there are easy ways to make this work, we can revisit the idea.
In some cases it could be useful to associate context with a string as it can be done with numbers. For example:
- string: "{3E5FC7F9-9A51-4367-9063-A120244FBEC7}" = CLSID_CMSTPLUA
hm, good point! maybe it makes sense to make the extra context available to all features.
+1
remove special handling of characteristic feature when serializing and refreeze testbed files.
it currently maintains backwards compatibility with an old format, by using a list of two elements.