iroko's Introduction

Iroko: The Data Center RL Gym

DISCLAIMER: This project is still very early-stage research. It is not stable or well tested, and it changes quickly. If you want to use this project, be warned.

Iroko is an open source project focused on providing OpenAI Gym-compliant environments. The aim is to develop machine learning algorithms that address data center problems and to evaluate these solutions fairly against traditional techniques.

A concrete description is available in our arXiv paper. A more elaborate version is presented in this master's thesis. There is also a published workshop paper on the topic:

Iroko: A Framework to Prototype Reinforcement Learning for Data Center Traffic Control. Fabian Ruffy, Michael Przystupa, Ivan Beschastnikh. Workshop on ML for Systems at NIPS 2018.

Requirements

The data center emulator makes heavy use of Linux tooling and its networking features. It operates most reliably on a recent Linux kernel (4.15+) and is written in Python 3.6+. The supported platform is Ubuntu (16.04 or later). Using the emulator requires full sudo access.
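
The version requirements above can be checked before installation. The following is a minimal sketch, not part of the project: the function names are hypothetical, and the thresholds are simply the 4.15 kernel and Python 3.6 minimums stated above.

```python
import platform
import re
import sys

MIN_KERNEL = (4, 15)  # minimum kernel version stated in the requirements
MIN_PYTHON = (3, 6)   # minimum Python version stated in the requirements


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.15.0-112-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_requirements(release, python_version):
    """Return True if both the kernel and the interpreter are recent enough."""
    kernel_ok = parse_kernel_release(release) >= MIN_KERNEL
    python_ok = tuple(python_version[:2]) >= MIN_PYTHON
    return kernel_ok and python_ok


if __name__ == "__main__":
    ok = meets_requirements(platform.release(), sys.version_info)
    print("requirements met" if ok else "kernel or Python too old")
```

The check only compares version numbers; it does not verify that the platform is Ubuntu or that sudo access is available.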

Package Dependencies

  • GCC or Clang, plus build-essential, as the compiler toolchain
  • git for version control
  • libnl-route-3-dev to compile the traffic managers
  • ifstat and tcpdump to monitor traffic
  • python3 and python3-setuptools to build Python packages and run the emulator

Python Dependencies

The generator supports only Python 3. pip3 can be used to install the packages.

  • numpy for matrix operations
  • gym for the OpenAI Gym interface
  • seaborn, pandas and matplotlib to generate plots
  • gevent for lightweight threading

Mininet Dependencies

The data center networks are emulated using Mininet. At minimum, Mininet requires the following packages:

  • openvswitch-switch, cgroup-bin, help2man

Ray Dependencies

The emulator uses Ray to implement and evaluate reinforcement learning algorithms. Ray's dependencies include:

  • Pip: tensorflow, setproctitle, psutil, opencv-python, lz4
  • Apt: libsm6, libxext6, libxrender-dev

Goben Dependencies

The emulator generates and measures traffic using Goben. While an amd64 binary is already provided in the repository, the generator submodule can also be compiled using Go 1.11. The contrib/ folder contains a script to install Goben locally.

Installation

A convenient, self-contained way to install the emulator is to run ./install.sh. It will install most dependencies locally via Poetry.

Run

To test the emulator you can run sudo -E python3 run_basic.py. This is the most basic usage example of the Iroko environment.
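
For readers new to Gym, the interaction pattern that a basic script of this kind follows can be sketched with a stub. The sketch below is an illustration only: StubDataCenterEnv and its toy dynamics are hypothetical and are not Iroko's environment class, and it assumes the classic Gym convention in which reset() returns an observation and step() returns (observation, reward, done, info).

```python
import random


class StubDataCenterEnv:
    """Hypothetical stand-in for a Gym-style environment (not Iroko's class)."""

    def __init__(self, num_hosts=4, episode_length=10, seed=0):
        self.num_hosts = num_hosts
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0] * self.num_hosts  # one queue reading per host

    def step(self, action):
        self.steps += 1
        # Toy dynamics: queues grow with the average requested bandwidth.
        load = sum(action) / self.num_hosts
        obs = [max(0.0, load - self.rng.random()) for _ in range(self.num_hosts)]
        # Toy reward: favor throughput, penalize queue build-up.
        reward = load - 2.0 * sum(obs) / self.num_hosts
        done = self.steps >= self.episode_length
        return obs, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the accumulated reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total


if __name__ == "__main__":
    env = StubDataCenterEnv()
    # Fixed policy: every host requests half of its bandwidth.
    print(run_episode(env, lambda obs: [0.5] * len(obs)))
```

Replacing the stub with a real environment leaves run_episode unchanged, which is the point of a Gym-compliant interface.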

run_ray.py contains examples on how to use Ray with this project.

benchmark.py is a test suite, which runs multiple tests in sequence and produces a comparison plot at the end of all runs.

iroko's Issues

Switch from OVS-based Mininet to VPP

OVS is slow, cumbersome, and overengineered for our purposes. A nice alternative would be a VPP or XDP based switching framework which is much more lightweight and flexible. The hope is that this will result in saved CPU cycles and more accurate results.

Issues with 4.19 kernel

Some work has been done in the 4.19 kernel that influences our bufferbloat emulation. We need to investigate why we cannot reproduce our experiments in this environment any more.

Did the tcp work in the background?

Hi, I'm currently studying your code and I found that your control pipeline works as shown in the flow chart below. The node controller starts during the initialization of the environment and keeps working in the pattern shown in the figure.
[Figure: flow chart of the control pipeline]
Please correct me if I'm wrong.

However, I'm wondering: when deploying an RL algorithm such as APPO or DDPG, does TCP congestion control still work in the background? If so, will there be a conflict between the call rtnl_qdisc_tbf_set_rate(ctrl_handle, (uint32_t) tx_rate, burst, 0); and TCP congestion control? As I understand it, changing the sending rate under TCP essentially means changing the size of CWND. If both algorithms are doing the same thing, there seems to be a data race.

Thanks for your time for reading my issue. I'd appreciate it if you could give me some help.
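
One way to reason about this question is a rough steady-state model. It is a simplification and not a statement about Iroko's or the kernel's exact behavior: the congestion window permits roughly cwnd/RTT, a tbf qdisc limits what leaves the interface, and the slower of the two bounds the sending rate. The function and parameter names are hypothetical.

```python
def effective_rate_bps(cwnd_bytes, rtt_s, qdisc_rate_bps):
    """Rough steady-state sending rate in bits per second."""
    # Rate the congestion window would allow on its own.
    tcp_rate_bps = cwnd_bytes * 8 / rtt_s
    # The qdisc caps what actually leaves the interface.
    return min(tcp_rate_bps, qdisc_rate_bps)


if __name__ == "__main__":
    # A 125 kB window over a 10 ms RTT allows about 100 Mbit/s.
    print(effective_rate_bps(125_000, 0.01, 10e6))  # qdisc is the bottleneck
    print(effective_rate_bps(125_000, 0.01, 1e9))   # cwnd is the bottleneck
```

Under this model the two mechanisms act at different layers rather than writing to the same variable: the qdisc delays or drops packets, and TCP then adapts its window to the delay or loss it observes.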

Some basic documentation

The framework looks very interesting and can probably be very helpful. However, I am a little confused in terms of how to start with this project. Can you give some basic documentation about how to define topology (or use topology), how to implement RL algs on top of this framework, and how to collect results? I think it can be great if there can be a minimal example or such. Thank you!

Make the rate adjustment more reliable.

Unfortunately, using the Linux tbf qdisc does not always work as expected. Depending on the rate that is pushed in and the size of the buckets, rate limiting outcomes may vary substantially. It would be nice to try out a more reliable and accurate rate limiting option for interfaces, such as hbm.
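
The sensitivity to bucket size can be illustrated with a toy token bucket. This is a simplified discrete-time model, not the kernel's tbf implementation, and all names and parameter values are hypothetical.

```python
def simulate_token_bucket(rate_bps, burst_bytes, packet_bytes,
                          offered_pps, duration_s, tick_s=0.001):
    """Return the number of bytes a token bucket lets through in duration_s."""
    tokens = burst_bytes  # the bucket starts full
    sent = 0
    backlog = 0.0         # packets waiting to be sent
    for _ in range(int(duration_s / tick_s)):
        # Refill at the configured rate, never beyond the bucket size.
        tokens = min(burst_bytes, tokens + rate_bps / 8 * tick_s)
        backlog += offered_pps * tick_s
        # Send queued packets while enough tokens are available.
        while backlog >= 1 and tokens >= packet_bytes:
            tokens -= packet_bytes
            sent += packet_bytes
            backlog -= 1
    return sent


if __name__ == "__main__":
    for burst in (1_000, 3_000, 15_000):
        sent = simulate_token_bucket(10e6, burst, 1_500, 10_000, 1.0)
        print(f"burst={burst} bytes -> {sent * 8 / 1e6:.2f} Mbit sent in 1 s")
```

In this model a bucket smaller than one packet never accumulates enough tokens to send anything, while a larger bucket approaches the configured rate. The real qdisc has more parameters, but the example shows why rate and bucket size have to be chosen together.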

iroko_env problem

Hi, iroko team:
Thanks for Iroko, it is powerful and easy to use. But when I try to run run_basic.py with a large number of timesteps, I encounter the following error. When I train my RL algorithm, I also encounter this problem if the number of episodes is too large. Do you know how to solve it?
Best wishes!

Generate traffic on a distribution.

Investigate generating traffic according to some distribution (e.g. send a packet every x ~ N(5, 3) seconds, where x is a sample from the normal distribution). Since getting actual traffic data can be challenging, previous research supposedly makes assumptions about what the distribution of sending patterns looks like. Being able to define such distributions and patterns in Iroko could be quite useful for the ML part of Iroko.
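
A minimal sketch of the sampling step, assuming the second parameter of N(5, 3) is the standard deviation (it could equally denote the variance) and that non-positive samples are simply redrawn. The function name is hypothetical.

```python
import random


def send_intervals(count, mean_s=5.0, stddev_s=3.0, seed=None):
    """Draw `count` positive waiting times (seconds) from a normal distribution."""
    rng = random.Random(seed)
    intervals = []
    while len(intervals) < count:
        x = rng.gauss(mean_s, stddev_s)
        if x > 0:  # a waiting time cannot be negative, so redraw
            intervals.append(x)
    return intervals


if __name__ == "__main__":
    for wait in send_intervals(5, seed=42):
        print(f"wait {wait:.2f} s, then send a packet")
```

Redrawing truncates the distribution, so the empirical mean ends up slightly above 5 seconds. A distribution defined only on positive values would avoid that side effect.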

Trouble getting run_ray.py to work

Hey, nice work!

I'm trying to play with this and reproduce results from https://arxiv.org/abs/1812.09975. I followed the installation instructions and got run_basic.py working, but sudo python run_ray.py --tune fails with the following exception:

Error creating interface pair (s1-eth2,h1-eth0): RTNETLINK answers: File exists

This happens without --tune as well. I've tried doing sudo mn -c, but it doesn't help. Here's the entire log file: https://pastebin.com/3gvZwNTT.

Any help appreciated. Thank you!

Implement more baselines

Add a baseline NUM solver and a random agent to compare their performance against the trained models.
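
The random agent half of this request could look roughly like the sketch below. It assumes the action is a vector with one bandwidth fraction per host in the range [0, 1]. The class and method names are hypothetical and do not follow any particular RL library's interface.

```python
import random


class RandomAgent:
    """Baseline that ignores observations and picks uniform random actions."""

    def __init__(self, num_hosts, low=0.0, high=1.0, seed=None):
        self.num_hosts = num_hosts
        self.low = low
        self.high = high
        self.rng = random.Random(seed)

    def compute_action(self, observation=None):
        # One independently sampled bandwidth fraction per host.
        return [self.rng.uniform(self.low, self.high)
                for _ in range(self.num_hosts)]


if __name__ == "__main__":
    agent = RandomAgent(num_hosts=4, seed=0)
    print(agent.compute_action())
```

A trained model that cannot beat this agent on the same traffic pattern has likely not learned anything useful, which is what makes it a convenient lower bound.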

Reduce size of Mininet install

The Mininet install is horrendously bloated and causes all sorts of dependency issues. We only need minimal features. There must be a better way to install the Python code.

Sunset Python2

Stop support for Python2. There is hardly any reason why Python 2 is still required.

Add option to cycle through traffic for an unlimited time.

Currently, Goben can only generate traffic for a fixed period of time, which is configured by the totalDuration flag. It would probably be better to add an option to generate traffic for an unlimited amount of time.

Topologies

  • Dumbbell
  • Nonblocking
  • Fattree
  • Parking Lot
  • Jellyfish
  • DCell
  • BCube

Implement scalability test

It is important to measure the scalability limitations of the emulator. We need to test the scalability properties on an 8-core and a 60-core machine by iteratively scaling up the number of hosts until a decline in throughput is measured.
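
The search loop of such a test can be sketched independently of the emulator. The sketch assumes that "decline" means per-host throughput falling more than a tolerance below the first measurement. measure_throughput is a hypothetical callback that would run the emulator with n hosts and return the aggregate throughput.

```python
def find_scaling_limit(measure_throughput, host_counts, tolerance=0.05):
    """Return the largest host count measured before per-host throughput declines."""
    baseline = None
    last_good = None
    for n in host_counts:
        per_host = measure_throughput(n) / n
        if baseline is None:
            baseline = per_host  # the first measurement is the reference
        elif per_host < baseline * (1 - tolerance):
            return last_good     # decline detected, stop scaling up
        last_good = n
    return last_good


if __name__ == "__main__":
    # Fake measurement: aggregate throughput saturates at 160 units.
    fake = lambda n: min(n * 10.0, 160.0)
    print(find_scaling_limit(fake, [2, 4, 8, 16, 32, 64]))
```

Running the same loop on both machines would give one comparable number per machine.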

tcp trace not found

Hi:
Iroko works well, but when I try to plot the RTT with

    if transport == "tcp":
        analyze_pcap(rl_algos, tcp_algos, plt_name, runs, data_dir)
    plot_barchart(algos, plt_stats, plt_name)

in plot.py, no TCP trace seems to be produced and the output shows:
result_folder /home/kong/iroko/results/kong-VirtualBox_4/tcp_run0/ppo
sh: 1: tcptrace: not found.
How can I solve this problem?
Best wishes.
Wu
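
The shell message "tcptrace: not found" indicates that the tcptrace binary is not on the PATH, so installing it (for example through the distribution's package manager) should resolve the error. A plotting script can also check for external tools up front. The helper below is a hypothetical sketch and is not taken from plot.py.

```python
import shutil


def require_tool(name):
    """Return the full path of an external tool or raise a readable error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name} not found on PATH; install it before analyzing traces")
    return path


if __name__ == "__main__":
    try:
        print("using", require_tool("tcptrace"))
    except RuntimeError as err:
        print(err)
```

Failing early with a message like this is easier to diagnose than an empty plot after a long run.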

Is run_openai_baselines.py able to run?

Hi:
Thanks for your excellent work. I tried to run run_openai_baselines.py in the baselines branch, but I ran into some errors. I would like to know whether run_openai_baselines.py is complete and works correctly, or whether the problem is on my side.
Best wishes!
Wu

New/custom RL algorithm

How do I add a custom RL algorithm? Is there a file I need to modify? A class that I need to implement? Any documentation would be great! I am new to gym and ray/rllib.

Add a listener to the traffic generator that allows the remote adjustment of traffic sending rate.

The goal here is to let the traffic manager (controller) be able to adjust the traffic (bandwidth) limits based on the action (a vector with each entry as the limit to each host).

One way to do this is to write an independent "traffic mediator" that listens for "action packets" from the traffic manager and sends a packet containing the specific traffic limit to each host (goben interface) in real time. This can be done using netcat.

  • Maybe there are better designs?

Use Arrow to share data between data collector and agent

Currently, the entire agent is prototyped in Python, including the network and data collection operations. They should be written in a more efficient language to reduce operational latency and resource usage. A nice way to compose the framework is to use Arrow to glue all modules together. Data can be collected in C and shared with the Python agent using the Arrow interface.

Add asynchronous checkpointing of collected statistics

Checkpointing the collected bandwidth, queue, and reward stats every 1000 iterations causes a significant slowdown and affects the behavior of the agent. Checkpointing should be asynchronous from the agent's actions.
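
One possible shape for this is a background writer thread fed through a queue, so that the agent only pays for an enqueue. The sketch below is an assumption about how it could be done, not the project's implementation. The class name and the JSON-lines output format are hypothetical.

```python
import json
import queue
import threading


class AsyncCheckpointer:
    """Write statistics snapshots to disk from a background thread."""

    def __init__(self, path):
        self.path = path
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def submit(self, stats):
        """Hand a snapshot to the writer thread and return immediately."""
        # Shallow copy so that later mutation by the agent is not recorded.
        self.pending.put(dict(stats))

    def _drain(self):
        with open(self.path, "a") as out:
            while True:
                stats = self.pending.get()
                if stats is None:  # sentinel: stop the writer
                    break
                out.write(json.dumps(stats) + "\n")
                out.flush()

    def close(self):
        """Flush all pending snapshots and stop the writer thread."""
        self.pending.put(None)
        self.worker.join()


if __name__ == "__main__":
    checkpointer = AsyncCheckpointer("stats.jsonl")
    checkpointer.submit({"iteration": 1000, "reward": 0.5})
    checkpointer.close()
```

The agent's loop would call submit() every 1000 iterations and close() once at shutdown. Large arrays would need a deeper copy or a binary format, which this sketch leaves out.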

Questions regarding bandwidth control

Hey Fabian, I have a few questions about the internals of the framework that I'm having trouble wrapping my head around. Hoping to get some clarity on the following:

  1. I'm trying to understand the interaction between the tx_rate being enforced based on the policy by BandwidthController and how cwnd gets changed by the native congestion control algorithm in the TCP setting. I see that the way we enforce the action taken by the policy is by capping the bandwidth allocated to the interface (https://github.com/dcgym/iroko/blob/master/dc_gym/control/bw_control.c#L55). This makes sense for UDP, but in the TCP scenario, isn't the underlying congestion control algorithm also doing its own thing and potentially interfering with the bandwidth control?
  2. Have you been able to, or is there a way to simulate bursts of traffic to simulate a more realistic setting? Based on what I understand from iroko_traffic.py (https://github.com/dcgym/iroko/blob/master/dc_gym/iroko_traffic.py#L174), we generate a constant stream of traffic from src to dst right?

Thanks!

Move flow measurements to end hosts.

Instead of sampling hosts from centralized switches, hosts should notify the arbiter of intention to send a flow to a location. Path can be inferred by the arbiter. This is much more lightweight.
In addition, a queue sampling approach such as SIMON (NSDI'19) can be used.

myproject.toml

There appears to be an extra * in line 38:
matplotlib = "3.3.**"

Use netlink API instead of TC

Calling into Linux tc is a clunky method to collect data from interfaces and regulate sending rate. A better way is to directly write a C program that uses netlink.

The Output files are empty?

Hi,

Your work is pretty clever! However, I ran into some problems when inspecting the result folder. The host server.out&err /client.out&err /ctrl.out files are all empty, and the host ctrl.err says "no such file or directory" for the PPO/PG/DDPG/DCTCP agents (the agents I have used so far in my current work). The program seems to be running smoothly.

Any ideas on how to solve this? I even tried to commit the _start_controller func, but there is still nothing in those files. The .csv files, however, contain plenty of data, which I guess is generated by Goben?

Unify logging

Mininet, Ray, and the Iroko platform log and print to stdout completely independently, which causes a big output mess. Debugging is also complicated. It would be nice to unify all logging output under a single framework.
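
As a starting point, Python's standard logging module can route every logger that propagates to the root through a single handler and format. The sketch below shows only that mechanism. The logger name iroko.env is hypothetical, and whether Mininet's and Ray's own loggers can be redirected this way is an open assumption.

```python
import logging
import sys


def configure_logging(level=logging.INFO, stream=sys.stdout):
    """Route all propagating loggers through one handler with one format."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.handlers[:] = [handler]  # replace any handlers installed earlier
    root.setLevel(level)
    return root


if __name__ == "__main__":
    configure_logging()
    logging.getLogger("iroko.env").info("environment started")
```

Each module would then obtain its own named logger with logging.getLogger and stop printing directly to stdout.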

DCTCP is behaving strangely

Trying to use DCTCP in the current emulator results in strange congestion behavior. It seems that DCTCP bypasses the qdisc which rate limits bandwidth and is pushing beyond the maximum possible amount. This could be caused by Mininet or the Linux kernel. Unclear so far.
