microsoft_malware_prediction_kaggle_2nd's Introduction

Microsoft Malware Prediction on Kaggle

This repository contains our 2nd-place solution to the Microsoft Malware Prediction Challenge on Kaggle. For details on our approach, see the overview of our solution.

Link to the competition homepage: https://www.kaggle.com/c/microsoft-malware-prediction

Team:

  • Stephan Michaels
  • Florian Imorde

Competition Description:

The "Microsoft Malware Prediction" competition was based on the question of whether a computer is infected by malware. Based on properties and features provided by Microsoft's Windows Defender, an algorithm had to be created that predicts the probability of such an infection.

Solution Description:

To solve the malware prediction problem, two models were trained:

  • Model 1 (later called M1): The data set was cleaned and string values were encoded. Afterwards, a LightGBM model was trained.
  • Model 2 (later called M2): The preprocessed data from model 1 was extended with new features. Next, the important features were selected and a LightGBM model was trained.
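
The string encoding step mentioned for model 1 can be illustrated with a minimal sketch in plain Python. This is not the code from the notebooks: the function names, the example values, and the choice to fit the encoder on the train column only and map unseen test values to -1 are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def fit_label_encoder(values: Iterable[str]) -> Dict[str, int]:
    """Map each distinct string to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values: Iterable[str], encoder: Dict[str, int], unknown: int = -1) -> List[int]:
    """Replace strings by their codes; unseen values get the `unknown` code."""
    return [encoder.get(value, unknown) for value in values]


# Hypothetical values of one categorical column in the train and test sets.
train_column = ["win8defender", "mse", "win8defender", "scep"]
test_column = ["mse", "fep"]

# Assumption: the encoder is fitted on the train values only,
# so "fep" is unseen and falls back to -1.
encoder = fit_label_encoder(train_column)
train_encoded = encode(train_column, encoder)
test_encoded = encode(test_column, encoder)

print(train_encoded)
print(test_encoded)
```

Integer codes like these can be passed to a gradient boosting library as categorical or numeric features. How the notebooks treat values that occur only in the test set is not stated here.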

Finally, the predictions of both models were averaged.
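
As a rough sketch of the averaging step, the snippet below blends two submission files with an unweighted arithmetic mean. The column names `MachineIdentifier` and `HasDetections` are assumed to match the competition's submission format, the in-memory files and their values are made up, and the helper names are hypothetical. The actual blending is done in 8_Submission_Solution.ipynb.

```python
import csv
import io
from typing import Dict, TextIO

ID_COLUMN = "MachineIdentifier"   # assumed id column of a submission file
TARGET_COLUMN = "HasDetections"   # assumed prediction column


def read_predictions(handle: TextIO) -> Dict[str, float]:
    """Read a submission file into {machine id: predicted probability}."""
    return {row[ID_COLUMN]: float(row[TARGET_COLUMN]) for row in csv.DictReader(handle)}


def average_predictions(first: Dict[str, float], second: Dict[str, float]) -> Dict[str, float]:
    """Return the per-machine arithmetic mean of two prediction sets."""
    if first.keys() != second.keys():
        raise ValueError("both submissions must cover the same machines")
    return {key: (first[key] + second[key]) / 2 for key in first}


# In-memory stand-ins for the M1 and M2 submission files (made-up values).
submission_m1 = io.StringIO("MachineIdentifier,HasDetections\na,0.25\nb,0.5\n")
submission_m2 = io.StringIO("MachineIdentifier,HasDetections\na,0.75\nb,1.0\n")

blended = average_predictions(read_predictions(submission_m1), read_predictions(submission_m2))
print(blended)  # {'a': 0.5, 'b': 0.75}
```

Checking that both files cover the same machines before averaging avoids silently blending misaligned rows.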

Archive Contents

  • code
    • 1_Data_Cleaning_train_set.ipynb
      Cleaning the train set.
    • 2_Data_Cleaning_test_set.ipynb
      Cleaning the test set.
    • 3_Data_Encoding_M1.ipynb
      Encoding the train and test data for model 1.
    • 4_Submission_M1.ipynb
      Building model 1.
    • 5_Feature_Engineering_M2.ipynb
      Creating new features for model 2.
    • 6_Data_Encoding_M2.ipynb
      Encoding the train and test data for model 2.
    • 7_Submission_M2.ipynb
      Building model 2.
    • 8_Submission_Solution.ipynb
      Building the final solution by averaging the solutions from model 1 and model 2.
    • Optional_Feature_Selection_M2.ipynb
      Selecting the most important features for building model 2.
    • Optional_Submission_Simple_Model
      Building a simplified model.
  • data
    • Data_Description.xlsx
      Feature information: relevant for the future, type, description.
    • encoding_dictionary.p
      Dictionary containing the encoders for relabeling feature values.
  • feature_importance
    • Featureimportance_M1.csv
      List of all features used in model 1 with corresponding importance.
    • Featureimportance_Feature_Selection_M2.csv
      List of all features in model 2 after Feature Engineering with corresponding importance.
    • Featureimportance_M2.csv
      List of all features used in model 2 with corresponding importance.
    • FeatureImportance_Simple.csv
      List of all features used in the simplified model.
  • models
    • Placeholder
      Models are saved here.
  • submissions
    • Placeholder
      Final submissions are stored here.
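
The encoding dictionary listed under data is a pickled Python object. The sketch below shows one way such a file could be written, loaded, and applied to a row. The layout used here (feature name mapped to a dictionary of original value and integer code) and the example values are assumptions for illustration; the real structure of encoding_dictionary.p may differ.

```python
import os
import pickle
import tempfile

# Hypothetical structure: feature name -> {original value: integer code}.
encoding_dictionary = {
    "ProductName": {"mse": 0, "win8defender": 1},
    "Platform": {"windows10": 0, "windows8": 1},
}

# Write the dictionary to a temporary file and load it back,
# mirroring how a file like data/encoding_dictionary.p could be used.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "encoding_dictionary.p")
    with open(path, "wb") as handle:
        pickle.dump(encoding_dictionary, handle)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

# Relabel the values of one made-up row with the loaded encoders.
row = {"ProductName": "win8defender", "Platform": "windows10"}
encoded_row = {feature: loaded[feature][value] for feature, value in row.items()}
print(encoded_row)  # {'ProductName': 1, 'Platform': 0}
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a source you trust.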

Data

The original train and test data can be downloaded from the competition homepage.

Link to the data: https://www.kaggle.com/c/microsoft-malware-prediction/data

The datasets have to be stored in the data folder.

Hardware:

Notebook with:

  • Intel(R) Core(TM) i7-8850H
  • 16GB RAM

Software

  • Windows 10 Pro, 64 Bit (Version: 1809)
  • Anaconda 1.9.6
  • Python 3.7.1
  • Jupyter Notebook 5.7.4

Libraries

The following libraries are required:

  • numpy (Version 1.15.4)
  • pandas (Version 0.23.4)
  • dask (Version 1.0.0)
  • scikit-learn (Version 0.20.1)
  • tqdm (Version 4.28.1)
  • lightgbm (Version 2.2.1)
  • pickle (Version 4.0; part of the Python standard library)
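
To compare an environment against the versions listed above, a small check like the following can help. It is a convenience sketch, not part of the repository: the helper names are hypothetical, and it assumes each package exposes a `__version__` attribute, which the listed third-party packages do. pickle is left out because it ships with Python.

```python
import importlib
from typing import Dict, Optional

# Import name -> version listed in this README.
REQUIRED = {
    "numpy": "1.15.4",
    "pandas": "0.23.4",
    "dask": "1.0.0",
    "sklearn": "0.20.1",  # import name of scikit-learn
    "tqdm": "4.28.1",
    "lightgbm": "2.2.1",
}


def installed_version(module_name: str) -> Optional[str]:
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:  # missing package or a broken native dependency
        return None
    return getattr(module, "__version__", None)


def check_versions(required: Dict[str, str]) -> Dict[str, str]:
    """Report, per module, whether the installed version matches the listed one."""
    report = {}
    for name, wanted in required.items():
        found = installed_version(name)
        report[name] = "ok" if found == wanted else f"expected {wanted}, found {found}"
    return report


for name, status in check_versions(REQUIRED).items():
    print(f"{name}: {status}")
```

A mismatch does not necessarily mean the notebooks will fail; the listed versions are the ones the solution was developed with.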

Licence

Our code is submitted under the MIT license.
