Comments (9)
"""remove the "-1" labeled rows"""
Check out the train_ember function for how I filter out the -1 labeled rows from the ember benchmark model training: https://github.com/endgameinc/ember/blob/master/ember/__init__.py#L146-L160
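For reference, that filtering comes down to a boolean mask on the labels. A minimal sketch, using small made-up arrays in place of the real X_train and y_train (labels: 0 benign, 1 malicious, -1 unlabeled):

```python
import numpy as np

# Toy stand-ins for the arrays returned by ember.read_vectorized_features
X_train = np.arange(12, dtype=np.float32).reshape(6, 2)
y_train = np.array([0, -1, 1, -1, 1, 0], dtype=np.float32)

# Keep only the rows that carry a real label
labeled_rows = y_train != -1
X_labeled = X_train[labeled_rows]
y_labeled = y_train[labeled_rows]

print(X_labeled.shape, y_labeled)
```

The same mask has to be applied to both X and y so that rows and labels stay aligned.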
"""select certain features"""
If you want to select only certain columns from X_train or X_test, you can use numpy indexing to achieve this:
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
https://stackoverflow.com/questions/8386675/extracting-specific-columns-in-numpy-array
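As a quick illustration of what those links describe, here is a sketch of column selection on a toy matrix (the shapes and indices are made up, not taken from the ember data):

```python
import numpy as np

X_train = np.arange(20).reshape(4, 5)  # toy matrix: 4 samples, 5 features

# A contiguous block of columns, selected with a slice
first_three = X_train[:, :3]

# Arbitrary columns, selected with a list of indices
picked = X_train[:, [0, 2, 4]]

print(first_three.shape, picked.shape)
```

Slicing returns a view of the original array, while indexing with a list returns a copy, which matters if you modify the result in place.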
If you want to select rows from the feature matrix based on the metadata dataframe, I would suggest doing something like:
selected_rows = metadata["appeared"] < "2017-08"
X_train_filtered = X_train[selected_rows]
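One thing to watch: the boolean mask needs exactly one entry per row of X_train, in the same order. The sketch below assumes the metadata frame holds both train and test rows, distinguished by a "subset" column, and that the train rows appear in the same order as X_train. Both are assumptions to verify against your own metadata; the data here is made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata dataframe (column names are assumed)
metadata = pd.DataFrame({
    "appeared": ["2017-01", "2017-09", "2017-03", "2017-11", "2017-02"],
    "subset":   ["train",   "train",   "train",   "test",    "test"],
})
X_train = np.arange(6).reshape(3, 2)  # 3 train rows, 2 features

# Restrict the metadata to the train rows so the mask lines up with X_train
train_meta = metadata[metadata["subset"] == "train"]
assert len(train_meta) == len(X_train)

# Convert the boolean Series to a plain array before indexing the matrix
selected_rows = (train_meta["appeared"] < "2017-08").to_numpy()
X_train_filtered = X_train[selected_rows]

print(X_train_filtered)
```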
Good luck!
from ember.
As I look through the code with fresh eyes I think I can work it out. The thing that was throwing me yesterday was the feature hashing/importing steps.
from ember.
You're probably on the right track, but I'll just stress that you can use this code to vectorize the features only. At that point, you'll have a large feature space that you can read in and hand to whatever model you're interested in, including all those in scikit-learn.
from ember.
To confirm, if I just want to vectorize and go from there, I should follow the "Import usage" steps up to:
metadata_dataframe = ember.read_metadata("/data/ember/")?
I might even want to take over earlier, can't wait to dig in!
from ember.
That's right. If you complete those steps, then the X_train, y_train, X_test, y_test
variables in your environment will be immediately ready to be handed to scikit-learn models.
from ember.
(Temp re-open) I imagine there is some data pre-processing required to select only certain features or to remove the "-1" labeled rows from the datasets for a purely supervised approach. I can remove them from the dataframe step but the X_train, y_train, X_test, y_test = ember.read_vectorized_features(data_dir)
operation goes straight to the already written vectorized data files, instead of the metadata dataframe. The only way I can see making this adjustment would be to strip them from the JSONL files prior to vectorization. Does that seem accurate?
from ember.
What is not clear is the mapping from the header columns to the numpy array needed to slice a particular feature. For instance, if I only want to look at Imports info, how can I determine which array indices that corresponds to in X_train? After all, there are 2,351 feature columns to choose from in X_train/X_test. The FeatureHasher also makes it very difficult to characterize feature importance.
from ember.
Since the hashing trick is used to convert, e.g., a ragged count of imports to a fixed-length vector, you'd only be able to back out "these columns are imports", but have a many-to-one problem of many imports mapping to any one column. (One column corresponds to many imported names.)
Hashing trick: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
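To make the many-to-one point concrete, here is a toy version of the hashing trick in plain Python. It is not scikit-learn's FeatureHasher, which uses a different hash function and signed updates; hash_names, the column count, and the import names are all made up for illustration.

```python
import hashlib

def hash_names(names, n_columns):
    """Count names into a fixed-length vector by hashing each name to a column."""
    vector = [0] * n_columns
    for name in names:
        digest = hashlib.md5(name.encode("utf-8")).digest()
        column = int.from_bytes(digest[:4], "little") % n_columns
        vector[column] += 1
    return vector

imports = [
    "kernel32.dll:CreateFileA", "kernel32.dll:ReadFile",
    "user32.dll:MessageBoxA", "advapi32.dll:RegOpenKeyExA",
    "ws2_32.dll:connect",
]

# Five names go into four columns, so at least two names must share a column
vector = hash_names(imports, n_columns=4)
print(vector)
```

Given only the vector, there is no way to tell which names produced a given count, which is why feature importance can be attributed to a column but not to an individual import.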
For any one section, you can figure out where the feature type begins by noting the order of features:
https://github.com/endgameinc/ember/blob/master/ember/features.py#L441-L443
features = [
ByteHistogram(), ByteEntropyHistogram(), StringExtractor(), GeneralFileInfo(),
HeaderFileInfo(), SectionInfo(), ImportsInfo(), ExportsInfo()
]
and noting that every FeatureType has a dim
that specifies the number of columns it spans in the feature matrix. So, the ImportsInfo features begin at
imports_offset = sum([fe.dim for fe in features[:6]])
and are features[6].dim
columns long.
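The same bookkeeping can be done for every feature type at once. In the sketch below the dims are assumed values, picked to be consistent with the 2,351 total mentioned earlier in the thread; read the real ones from each FeatureType's dim attribute in features.py instead of trusting these numbers.

```python
from itertools import accumulate

# (name, dim) pairs in the order listed above. The dims are ASSUMED values
# and should be replaced by the .dim of each FeatureType in features.py.
feature_dims = [
    ("ByteHistogram", 256), ("ByteEntropyHistogram", 256),
    ("StringExtractor", 104), ("GeneralFileInfo", 10),
    ("HeaderFileInfo", 62), ("SectionInfo", 255),
    ("ImportsInfo", 1280), ("ExportsInfo", 128),
]

# Running total of dims gives the end column of each feature type
ends = list(accumulate(dim for _, dim in feature_dims))

# Map each feature type to the slice of columns it occupies
column_ranges = {
    name: slice(end - dim, end)
    for (name, dim), end in zip(feature_dims, ends)
}

imports_columns = column_ranges["ImportsInfo"]
print(imports_columns)  # usable as X_train[:, imports_columns]
```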
from ember.
I have it working now. Dug into the code and figured out where each of those fixed-length feature sizes are so I can compare the various categories against each other. It's working well!
from ember.