ehrenb / machina
A scalable and recursive binary analysis pipeline
Home Page: https://machina.behren.me
Currently, modules are triggered by incoming data types. Modules like Similarity might be better triggered on a schedule than once per input.
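A minimal stdlib sketch of what a schedule-based trigger could look like. `PeriodicTrigger` and the task passed to it are hypothetical names for illustration, not part of Machina's API; a real worker would likely use a scheduling library instead.

```python
import threading


class PeriodicTrigger:
    """Run a task at a fixed interval instead of once per incoming message."""

    def __init__(self, interval_s, task):
        self.interval_s = interval_s
        self.task = task
        self._stop = threading.Event()

    def stop(self):
        """Ask the loop to exit; safe to call from another thread."""
        self._stop.set()

    def run(self, max_runs=None):
        """Call the task repeatedly; return how many times it ran."""
        runs = 0
        while not self._stop.is_set():
            self.task()
            runs += 1
            if max_runs is not None and runs >= max_runs:
                break
            # wait() returns early if stop() is called in the meantime
            self._stop.wait(self.interval_s)
        return runs
```

For example, `PeriodicTrigger(3600, run_similarity).run()` would invoke a (hypothetical) `run_similarity` function once an hour until `stop()` is called.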
Create a new base image to enable some Ghidra analyses.
Currently, the Docker build process performs a 'git clone' to rebuild images with updated code. This prevents Docker's layer caching from being useful, since Docker cannot detect that the cloned code has changed.
Experimenting with moving from OrientDB -> Neo4J free. https://github.com/ehrenb/machina/tree/neo4j
Can't do multi-stage builds in the docs Dockerfile, because there is a circular dependency problem.
...
# multi-stage build to copy in worker source modules
# for autodoc'ing their source and schemas
# TODO: resolve how to mock imports for each, as
# we dont want to have to install all 3rd party deps
# for all workers. see https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#confval-autodoc_mock_imports
# FROM behren/machina-androdguard:latest as androguard
# FROM behren/machina-binwalk:latest as binwalk_img
# RUN mkdir /machina/binwalk && touch /machina/binwalk/__init__.py
# COPY --from=binwalk_img /machina/src /machina/binwalk
# FROM behren/machina-bz2:latest as bz2
# FROM behren/machina-exif:latest as exif
# FROM behren/machina-findurls:latest as findurls
# FROM behren/machina-gzip:latest as gzip
# FROM behren/machina-identifier:latest as identifier
# RUN mkdir /machina/identifier && touch /machina/identifier/__init__.py
# COPY --from=identifier /machina/src /machina/identifier
# FROM behren/machina-jar:latest as jar
# FROM behren/machina-similarity:latest as similarity
# FROM behren/machina-ssdeep:latest as ssdeep
# FROM behren/machina-tar:latest as tar
# FROM behren/machina-zip:latest as zip
# FROM behren/machina-ghidra-project-creator:latest as ghidra-project-creator
...
Also, to import these for autodoc, we can use autodoc_mock_imports to suppress the import warnings instead of bloating the image with all dependencies:
conf.py
autodoc_mock_imports = [
    'magic'  # entries are module names; the python-magic package is imported as "magic"
]
For now, just keeping referential documentation in workers.csv in the docs repo.
Add an ELK stack to monitor all container logs in the namespace.
The error below occurs at system startup. It is an incompatibility between pydantic 2 and rocketry; see Miksus/rocketry#225 and Miksus/rocketry#210. For now, I manually downgraded pydantic to 1.10.13.
machina-similarityanalysis-1 | Traceback (most recent call last):
machina-similarityanalysis-1 | File "/machina/src/run.py", line 3, in <module>
machina-similarityanalysis-1 | from similarityanalysis import SimilarityAnalysis
machina-similarityanalysis-1 | File "/machina/src/similarityanalysis.py", line 6, in <module>
machina-similarityanalysis-1 | from machina.core.periodic_worker import PeriodicWorker
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/machina-0.1-py3.10.egg/machina/core/periodic_worker.py", line 8, in <module>
machina-similarityanalysis-1 | from rocketry import Rocketry
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/rocketry/__init__.py", line 1, in <module>
machina-similarityanalysis-1 | from .session import Session
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/rocketry/session.py", line 18, in <module>
machina-similarityanalysis-1 exited with code 1
machina-similarityanalysis-1 | from rocketry.log.defaults import create_default_handler
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/rocketry/log/defaults.py", line 1, in <module>
machina-similarityanalysis-1 | from redbird.logging import RepoHandler
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/redbird/__init__.py", line 2, in <module>
machina-similarityanalysis-1 | from .base import BaseRepo, BaseResult
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/redbird/base.py", line 116, in <module>
machina-similarityanalysis-1 | class BaseRepo(ABC, BaseModel):
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/redbird/base.py", line 153, in BaseRepo
machina-similarityanalysis-1 | ordered: bool = Field(default=False, const=True)
machina-similarityanalysis-1 | File "/usr/local/lib/python3.10/dist-packages/pydantic/fields.py", line 764, in Field
machina-similarityanalysis-1 | raise PydanticUserError('`const` is removed, use `Literal` instead', code='removed-kwargs')
machina-similarityanalysis-1 | pydantic.errors.PydanticUserError: `const` is removed, use `Literal` instead
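Until the upstream issues are resolved, a worker could fail fast with a readable message instead of the deep traceback above. This is only a sketch: `check_pydantic_v1` and `major_version` are hypothetical helpers, and the assumption that any pydantic 1.x release is acceptable comes from the downgrade described above.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a dotted version string."""
    return int(version_string.split(".")[0])


def check_pydantic_v1():
    """Raise a clear error if the installed pydantic is 2.x or missing."""
    try:
        installed = version("pydantic")
    except PackageNotFoundError:
        raise RuntimeError("pydantic is not installed") from None
    if major_version(installed) >= 2:
        raise RuntimeError(
            f"pydantic {installed} found; rocketry currently needs pydantic<2"
        )
    return installed
```

Calling `check_pydantic_v1()` before `from rocketry import Rocketry` would surface the version problem in one line.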
Support for whitelist/blacklist options
OrientDB's Python client has been consistently broken for the past couple of years. A newer fork (orientechnologies/pyorient#42, https://github.com/brucetony/pyorient) claims to support 3.1.x, but using the 3.1.12 Docker image results in the error below:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/machina/images/identifier/src/identifier.py", line 121, in callback
type=resolved_type)
File "/src/pyorient/pyorient/ogm/broker.py", line 56, in create
return self.g.create_vertex(self.element_cls, **kwargs)
File "/src/pyorient/pyorient/ogm/graph.py", line 532, in create_vertex
result = self.client.command(self.create_vertex_command(vertex_cls, **kwargs))[0]
File "/src/pyorient/pyorient/orient.py", line 481, in command
return self.get_message("CommandMessage").prepare((QUERY_CMD,) + args).send().fetch_response()
File "/src/pyorient/pyorient/utils.py", line 48, in wrap_function
return wrap(*args, **kwargs)
File "/src/pyorient/pyorient/utils.py", line 61, in wrap_function
return wrap(*args, **kwargs)
File "/src/pyorient/pyorient/messages/commands.py", line 128, in prepare
self._encode_field(x) for x in _payload_definition
File "/src/pyorient/pyorient/messages/commands.py", line 128, in <genexpr>
self._encode_field(x) for x in _payload_definition
File "/src/pyorient/pyorient/messages/database.py", line 379, in _encode_field
_content = struct.pack("!i", len(v)) + v
TypeError: object of type 'VertexCommand' has no len()
Until official (or stable) maintenance of the pyorient project happens, Machina will have to continue to depend on an older version of OrientDB and another pyorient fork.
Working client: https://github.com/alanmeeson/pyorient.git@0317a87369675df9b33fd38af451099c3c011d40#egg=pyorient
Working server: 2.2
Currently, each worker attempts to initialize the OGM. While no duplicate OGMs will be created, this causes a significant delay at startup. There should be either one dedicated service that initializes the OGMs, or a check for whether the OGM already exists before initializing it.
OrientDB shows the following error:
2022-12-25 16:30:39:479 SEVER Exception `5EA22C9A` in storage `plocal:/orientdb/databases/machina`: 3.2.13 (build 1b0940491143c734d9f7338b321c2cde319a79ef, branch UNKNOWN) [OLocalPaginatedStorage]
com.orientechnologies.orient.core.exception.OCommandExecutionException: Property 'apk.md5' already exists. Remove it before to retry.
DB name="machina"
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLCreateProperty.execute(OCommandExecutorSQLCreateProperty.java:298)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:74)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:4205)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:4171)
at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:63)
at com.orientechnologies.orient.server.OConnectionBinaryExecutor.executeCommand(OConnectionBinaryExecutor.java:618)
In workers, use https://github.com/mogui/pyorient/blob/fb74c5da75c14b568c79949b219b98549d1c732a/pyorient/ogm/graph.py#L101 to bind the class OGM
Create an Initializer that performs the create_all https://github.com/mogui/pyorient/blob/fb74c5da75c14b568c79949b219b98549d1c732a/pyorient/ogm/graph.py#L527
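The "check before init" idea can be sketched independently of pyorient. Everything here is hypothetical: `create_property` stands in for whatever actually issues the schema command, and `existing` for whatever the initializer reads back from the database. The point is only that properties which already exist (like 'apk.md5' in the log above) are skipped rather than re-created.

```python
def missing_properties(existing, desired):
    """Return the (class, property) pairs in `desired` that are not in `existing`."""
    return sorted(set(desired) - set(existing))


def init_schema(existing, desired, create_property):
    """Create only the missing properties; return what was created."""
    created = []
    for class_name, prop in missing_properties(existing, desired):
        create_property(class_name, prop)  # hypothetical callback
        created.append((class_name, prop))
    return created
```

A single Initializer service would run `init_schema` once, and workers would only bind to the already-created classes.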
The import below is unnecessary; just use Pika to make the CLI more lightweight.
machinacli.py
from machina.core.api import BaseAPI
When machina-base is built, also trigger subsequent builds of downstream images.
"gh release create" creates tags automatically. Do away with the git tag + manual release script and replace them with the "gh release" command.
May need to patch MAXMEM in analyzeHeadless so that it attempts to read the value from the environment first.
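One way such a patch could be applied at image build time, sketched in Python. It assumes the script contains a line of the form `MAXMEM=<number><unit>` (for example `MAXMEM=2G`); that format is an assumption and should be checked against the installed Ghidra version. `patch_maxmem` is a hypothetical helper.

```python
import re

# Matches only a literal value such as "MAXMEM=2G", so an already
# patched line is left alone and the patch can be re-run safely.
_MAXMEM_LINE = re.compile(r"^MAXMEM=([0-9]+[KkMmGg]?)$", re.MULTILINE)


def patch_maxmem(script_text):
    """Rewrite MAXMEM so the environment value wins, with the old value as default."""
    return _MAXMEM_LINE.sub(r'MAXMEM="${MAXMEM:-\g<1>}"', script_text)
```

With this in place, running the container with `MAXMEM=8G` in its environment would override the default without editing the script again.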
Create a ClamAV worker module that fires on (all?) types:
In Dockerfile:
At runtime:
run `clamd` and `freshclam -d` to background clamd and db updates
communicate with clamd via Python over the socket interface described here: https://manpages.debian.org/unstable/clamav-daemon/clamd.8.en.html
add to csv in docs
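A rough sketch of the socket communication, using clamd's INSTREAM command as described in the man page linked above: each chunk is sent as a 4-byte big-endian length followed by the data, and a zero-length chunk ends the stream. The socket path is an assumption (it depends on the `LocalSocket` setting in clamd.conf), and `scan_bytes` / `instream_frames` are hypothetical helper names.

```python
import socket
import struct

CHUNK_SIZE = 4096


def instream_frames(data, chunk_size=CHUNK_SIZE):
    """Frame `data` for INSTREAM: command, length-prefixed chunks, terminator."""
    frames = [b"zINSTREAM\0"]
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        frames.append(struct.pack("!I", len(chunk)) + chunk)
    frames.append(struct.pack("!I", 0))  # zero-length chunk ends the stream
    return b"".join(frames)


def scan_bytes(data, socket_path="/var/run/clamav/clamd.ctl"):
    """Send `data` to clamd over its Unix socket and return the reply text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(instream_frames(data))
        reply = b""
        # 'z'-prefixed commands get a NUL-terminated reply
        while not reply.endswith(b"\0"):
            part = sock.recv(4096)
            if not part:
                break
            reply += part
    return reply.rstrip(b"\0").decode("utf-8", errors="replace")
```

The worker would then parse the reply string to decide whether a signature was found; the exact reply format should be confirmed against a running clamd.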
After some time (20 minutes) of analyzing a jffs2, workers start throwing the following error:
Exception in thread Thread-3820 (callback):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/machina/src/identifier.py", line 137, in callback
origin_node = self.graph.get_vertex(data['origin']['id'])
File "/usr/lib/python3.10/site-packages/pyorient/ogm/graph.py", line 612, in get_vertex
record = self.client.command('SELECT FROM {}'.format(vertex_id))
File "/usr/lib/python3.10/site-packages/pyorient/orient.py", line 481, in command
return self.get_message("CommandMessage").prepare((QUERY_CMD,) + args).send().fetch_response()
File "/usr/lib/python3.10/site-packages/pyorient/messages/commands.py", line 143, in fetch_response
super(CommandMessage, self).fetch_response()
File "/usr/lib/python3.10/site-packages/pyorient/messages/database.py", line 300, in fetch_response
self._decode_all()
File "/usr/lib/python3.10/site-packages/pyorient/messages/database.py", line 283, in _decode_all
self._decode_header()
File "/usr/lib/python3.10/site-packages/pyorient/messages/database.py", line 229, in _decode_header
raise PyOrientCommandException(
pyorient.exceptions.PyOrientSecurityAccessException: com.orientechnologies.orient.core.exception.OSecurityAccessException - Invalid authentication info for access to the database com.orientechnologies.orient.core.metadata.security.auth.OTokenAuthInfo@18287ceb
DB name="machina"
Binwalk module failing to build due to deps.sh failing to reach an apt link.
Unrelated, but may want to update custom deps.sh due to: devttys0/sasquatch#48 (comment)
bzip2 decompression
The base logging config should default to INFO. Override the level for individual modules in their respective config files.
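With Python's standard `logging` module this could look roughly like the following. The `module_levels` mapping and the logger names in the example are hypothetical; how Machina actually stores per-module config is not shown here.

```python
import logging


def configure_logging(base_level="INFO", module_levels=None):
    """Set a base level on the root logger, then apply per-module overrides."""
    logging.basicConfig(level=base_level)
    # basicConfig is a no-op if handlers already exist, so set the level explicitly
    logging.getLogger().setLevel(base_level)
    for logger_name, level in (module_levels or {}).items():
        logging.getLogger(logger_name).setLevel(level)
```

For example, `configure_logging("INFO", {"machina.identifier": "DEBUG"})` keeps every module at INFO except the one being debugged.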
cpio unarchiving
Recent release of neomodel (https://github.com/neo4j-contrib/neomodel/releases/tag/4.0.9) supports newer versions of Neo4J. Move both Neo4J and neomodel to latest.
For findurls, a URL has no mimetype or detailed type data, so the worker has to type it as 'url' manually when resubmitting. Blind resubmission (e.g. CLI submission) should also be supported, but without mime data the existing resolution cannot type a URL. To support this, there should be a new resolution method based not on mime/detailed type but on a regex or pattern matched against the data.
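A small sketch of what pattern-based resolution might look like. The table layout, the function name `resolve_by_pattern`, and the deliberately loose URL regex are all assumptions for illustration; a real implementation would probably run this only after mime-based resolution fails.

```python
import re

# (type name, compiled pattern) pairs, checked in order
PATTERN_TYPES = [
    ("url", re.compile(rb"https?://\S+", re.IGNORECASE)),
]


def resolve_by_pattern(data):
    """Return the first type whose pattern matches all of `data`, else None."""
    stripped = data.strip()
    for type_name, pattern in PATTERN_TYPES:
        if pattern.fullmatch(stripped):
            return type_name
    return None
```

Using `fullmatch` means a binary that merely contains a URL is not typed as 'url'; only data that is a URL in its entirety matches.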
lzma decompression
The build actions are giving the following warning about an upcoming deprecation:
Warning: The `save-state` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
The build-push-action version should be bumped.
Now that URL objects are created by the Identifier (instead of findurls), there needs to be another worker that analyzes URLs. This worker should:
Binwalk analysis module using the API
Test using binwalk test input data: https://github.com/ReFirmLabs/binwalk/tree/master/testing/tests/input-vectors
'depends_on' no longer guarantees build order in Compose. According to docker/compose#6332 (comment), it stopped being dependable when parallel builds were introduced. Also see docker/compose#8538.
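If build order has to be enforced outside Compose, one option is to compute it explicitly from the image dependencies and build one image at a time. A sketch using `graphlib` from the standard library (Python 3.9+); `build_order` and the image names other than machina-base are hypothetical.

```python
from graphlib import TopologicalSorter


def build_order(depends_on):
    """Return image names ordered so every image comes after its dependencies.

    `depends_on` maps each image to the images it is built FROM.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

Each name in the returned list could then be passed, in sequence, to a separate `docker compose build <service>` invocation.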