ALF - Active Learning Framework

About project

Recent network traffic classification methods benefit from machine learning (ML). However, using ML brings challenges of its own, such as a lack of high-quality annotated datasets, data drift and other effects that age datasets and ML models, and high volumes of network traffic. We present a novel Active Learning Framework (ALF) to address these challenges. ALF provides ready-made software components that can be used to deploy an active learning loop and maintain an ALF instance that continuously and automatically evolves a dataset and an ML model. The resulting solution is deployable for IP flow-based analysis of high-speed (100 Gb/s) networks, and also supports research experiments on different strategies and methods for annotation, evaluation, dataset optimization, etc.


Architecture

ALF implements an active learning (AL) loop, and its design is visualized with an activity diagram. ALF can be defined as the AL core + input interface + preprocessing and postprocessing steps + evaluation.
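The composition described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not ALF's actual engine: the function name `run_loop`, the stage callables, and the order in which the stages run are all assumptions made for the example.

```python
from typing import Any, Callable, Iterable, List, Tuple

Labeled = Tuple[Any, int]  # (sample, label) pair stored in the dataset


def run_loop(
    batches: Iterable[Any],
    dataset: List[Labeled],
    preprocess: Callable[[Any], List[Any]],
    train: Callable[[List[Labeled]], Any],
    select: Callable[[Any, List[Any]], List[Any]],
    annotate: Callable[[Any], int],
    postprocess: Callable[[List[Labeled]], List[Labeled]],
    evaluate: Callable[[Any, List[Labeled]], Any],
) -> Tuple[List[Labeled], List[Any]]:
    """Hypothetical active learning loop built from pluggable stages."""
    history = []
    for raw in batches:                      # input interface: stream of batches
        batch = preprocess(raw)              # preprocessing step
        model = train(dataset)               # AL core: fit on the current dataset
        queried = select(model, batch)       # AL core: query strategy picks samples
        dataset = dataset + [(x, annotate(x)) for x in queried]  # annotation
        dataset = postprocess(dataset)       # postprocessing step (e.g. size cap)
        history.append(evaluate(model, dataset))  # evaluation
    return dataset, history
```

Each stage is passed in as a callable, so a strategy can be swapped without touching the loop itself.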

Below, a class diagram shows how ALF is implemented. Note that the diagram is simplified: inheritance is omitted.

Use

Install all dependencies:

make init

There are 4 main dependencies:

  • Python 3.10
  • essential: requirements.txt
  • developers: requirements-dev.txt
  • NEMEA: requirements-nemea.txt

The NEMEA dependencies are needed for ALF to cooperate with the NEMEA framework. For now, the tests and the quick start assume that NEMEA is used. We plan to remove this dependency in the future.

Quick start

  • Tests, linting, documentation:
make test # unit tests
make lint # linter
firefox docs/_build/html/index.html # documentation
  • Online stream demo:

Terminal 1:

mkdir workdir

python nemea_module_doh.py --i u:alf_socket --id test_random --workdir ./workdir --model single --query_strategy random --blacklist conf/blacklist.txt --query_nmax 1 --max_db_size 10000 --dpath conf/doh_D0.csv

Terminal 2:

/usr/bin/nemea/traffic_repeater -i "f:example.trapcap,u:alf_socket"

The i parameter defines the NEMEA interface (IFC_SPEC). See the NEMEA documentation for more.

Note: While nemea_module_doh.py waits for data to arrive on the socket, it does not respond to the standard SIGINT (CTRL-C). You need to either kill the process (SIGKILL, kill -9 $PID), or send SIGINT and then send another stream (as in the example above); once the loop continues, the first thing the program does is terminate (with a Python KeyboardInterrupt). This behavior comes from Python and the indefinitely blocking wait inside the generator. We are aware of a solution, but since this behavior does no harm, we decided not to address it for now.
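One common workaround for this kind of blocking wait, which is not necessarily the solution the authors have in mind, is to restore the operating system's default SIGINT action before entering the receive loop. The sketch below uses only the standard `signal` module; the helper name `make_sigint_fatal` is hypothetical.

```python
import signal


def make_sigint_fatal() -> None:
    """Restore the OS default SIGINT action so CTRL-C ends the process at once."""
    # With SIG_DFL the process is terminated by the OS instead of Python raising
    # KeyboardInterrupt, so `finally` blocks and atexit handlers do not run.
    # Must be called from the main thread.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
```

Calling `make_sigint_fatal()` once at startup trades graceful shutdown for an immediate exit, so it only fits if nothing needs to be flushed or saved on termination.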

How to create your own application

For simplicity we do not use parameters and all constants are hardcoded.

# logging
import logging
import sys

# use Random Forest as classifier
from sklearn.ensemble import RandomForestClassifier

# import parts of ALF
import alf.anotator
import alf.context_manager
import alf.d_manager
import alf.engine
import alf.evaluator
import alf.input_manager
import alf.ml_model
import alf.postprocess
import alf.preprocess
import alf.query_strategy

The framework makes heavy use of the logging module to log messages. Configure it:

logging.basicConfig(
    stream=sys.stdout,
    format='[%(asctime)s]: %(message)s',
    level=logging.DEBUG
)

Now let us set up the constants and parameters. Usually these are set by the user, by a configuration file, etc.:

# list of features from flows, type: list[str]
DATASET_COLUMNS = ["f1", "f2", ...]  # placeholder: list your own feature names
# interface IFC_SPEC defined by NEMEA
IFC = "u:alf_socket"
# id, workdir; id should be unique
EXP_ID = "showcase"
WORKDIR = "/tmp/alf"
# annotator specific:
BLACKLIST = "conf/blacklist.txt"
# D0 is init train dataset
D0 = "conf/doh_train_db_small.csv"
# maximum size of the D_i database
MAX_SIZE = 5000
# query strategy specific:
N = 10
THRESHOLD = 0.1
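To make the roles of N and THRESHOLD concrete, here is a minimal sketch of an unranked, uncertainty-based batch query. It is one plausible reading of these two parameters, not ALF's implementation: the function name `query_uncertain_batch`, the uncertainty score (1 minus the highest class probability), and the random choice among candidates are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def query_uncertain_batch(
    probabilities: Sequence[Sequence[float]],
    max_samples: int,
    score_threshold: float,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return indices of at most `max_samples` sufficiently uncertain samples."""
    rng = rng or random.Random()
    # Uncertainty score: 1 - highest predicted class probability (assumed).
    candidates = [
        i for i, probs in enumerate(probabilities)
        if 1.0 - max(probs) >= score_threshold
    ]
    if len(candidates) > max_samples:
        # "Unranked": pick a random subset rather than the top-scoring samples.
        candidates = rng.sample(candidates, max_samples)
    return sorted(candidates)
```

With N = 10 and THRESHOLD = 0.1, a call such as `query_uncertain_batch(model_probabilities, 10, 0.1)` would forward at most ten samples to the annotator in each round.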

Now we create contexts:

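# Assumption: ContextProvider is exposed by alf.context_manager and
# DbProvider by alf.d_manager; adjust the imports if your version differs.
from alf.context_manager import ContextProvider
from alf.d_manager import DbProvider

ContextProvider.create_context("file")
ContextProvider.get_context().set_features(DATASET_COLUMNS)
ContextProvider.get_context().set_experiment_id(EXP_ID)
ContextProvider.get_context().set_working_dir(WORKDIR)
DbProvider.create_context(context_type="file", d_0_path=D0)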
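The provider calls above follow a create-once, fetch-anywhere pattern: the context is created a single time and every later `get_context()` call returns the same object. The sketch below illustrates that pattern in isolation. The class names `FileContext` and `Provider` are hypothetical, and ALF's real classes may be structured differently.

```python
from typing import List, Optional


class FileContext:
    """Hypothetical settings holder; ALF's real context class may differ."""

    def __init__(self) -> None:
        self.features: List[str] = []
        self.experiment_id: Optional[str] = None
        self.working_dir: Optional[str] = None


class Provider:
    """Creates one shared context and hands the same object to every caller."""

    _context: Optional[FileContext] = None

    @classmethod
    def create_context(cls, kind: str) -> None:
        if kind != "file":
            raise ValueError(f"unsupported context type: {kind!r}")
        cls._context = FileContext()

    @classmethod
    def get_context(cls) -> FileContext:
        if cls._context is None:
            raise RuntimeError("call create_context() first")
        return cls._context
```

Because the context is shared, parts of the framework can read the experiment settings without receiving them as constructor arguments.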

Next, define the ALF parts:

anotator = alf.anotator.AnotatorDoH(blacklist_path=BLACKLIST)
model = alf.ml_model.SupervisedMLModel(RandomForestClassifier())
query_strategy = alf.query_strategy.UncertanityUnrankedBatch(
    anotator_obj=anotator, max_samples=N,
    score_threshold=THRESHOLD, dry_run=True)
input_manager = alf.input_manager.TrapcapSocketInputManager(
    definition=IFC)
postprocessor = alf.postprocess.PostprocessorUndersample(MAX_SIZE)
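The postprocessor keeps the dataset under MAX_SIZE. As an illustration of what undersampling to a size cap can look like, the sketch below drops random rows from the currently largest class until the cap is met. Whether ALF's PostprocessorUndersample balances classes this way is an assumption; the function name `undersample` and the (features, label) row layout are hypothetical.

```python
import random
from collections import defaultdict
from typing import Any, List, Optional, Tuple

Row = Tuple[Any, int]  # (features, label)


def undersample(
    rows: List[Row], max_size: int, rng: Optional[random.Random] = None
) -> List[Row]:
    """Cap `rows` at `max_size`, removing random rows from the largest class."""
    if max_size < 0:
        raise ValueError("max_size must be non-negative")
    rng = rng or random.Random()
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    for group in by_label.values():
        rng.shuffle(group)  # random order, so pop() below drops a random row
    for _ in range(max(len(rows) - max_size, 0)):
        largest = max(by_label.values(), key=len)
        largest.pop()
    return [row for group in by_label.values() for row in group]
```

Removing rows from the largest class first also nudges the dataset toward class balance, which is a design choice of this sketch rather than a documented ALF behavior.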

Then pass the parts to the Engine:

engine = alf.engine.Engine(
    preprocessor=alf.preprocess.PreprocessorDoH(),
    postprocessor=postprocessor,
    ml_model_obj=model,
    query_strategy_obj=query_strategy,
    evaluator_obj=alf.evaluator.EvaluatorTestAnotatedAndAllPredicted(),
    input_manager_obj=input_manager
)

Finally, run the engine:

engine.run()

GUI

ALF comes with a simple GUI demo built with Streamlit.

Run it with:

streamlit run alf_gui.py

Further Information

  • @jaroslavpesek here on GitHub
  • pesek (at) cesnet.cz or pesekja8 (at) fit.cvut.cz
