godka / pensieve-ppo

The simplest implementation of Pensieve (SIGCOMM '17) via state-of-the-art RL algorithms, including PPO, DQN, and SAC, with support for both TensorFlow and PyTorch.

Home Page: https://godka.github.io/Pensieve-PPO/

License: BSD 2-Clause "Simplified" License

pensieve reinforcement-learning a2c ppo dqn tensorflow deep-learning pytorch

pensieve-ppo's Introduction

Pensieve PPO

Updates

May 4, 2024: We removed Elastic, revised BOLA, and added the new baselines Comyco [3] and Genet [2].

Jan. 26, 2024: We are excited to announce significant updates to Pensieve-PPO! We have replaced TensorFlow with PyTorch while achieving similar training speed and comparable model performance.

For the TensorFlow version, please check Pensieve-PPO TF Branch.

Dec. 28, 2021: In a previous update, we enhanced Pensieve-PPO with several state-of-the-art technologies, including Dual-Clip PPO and adaptive entropy decay.
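
For reference, a minimal PyTorch sketch of the dual-clip surrogate objective mentioned above (the function name, default hyperparameters, and tensor shapes are illustrative, not the repository's exact code):

import torch

def dual_clip_ppo_loss(log_prob, old_log_prob, adv, clip_eps=0.2, dual_clip=3.0):
    # probability ratio between the updated policy and the old (behaviour) policy
    ratio = torch.exp(log_prob - old_log_prob)
    # standard PPO clipped surrogate objective
    surr = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)
    # dual clip: for negative advantages, bound the objective from below by dual_clip * adv
    surr = torch.where(adv < 0, torch.max(surr, dual_clip * adv), surr)
    return -surr.mean()  # negate because optimizers minimize

The extra max(., dual_clip * adv) term only activates for negative advantages, so the clipped objective cannot push the new policy arbitrarily far when the ratio becomes very large.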

About Pensieve-PPO

Pensieve-PPO is a user-friendly PyTorch implementation of Pensieve [1], a neural adaptive video streaming system. Unlike the original A3C-based implementation, it is trained with the Proximal Policy Optimization (PPO) algorithm.

This stable version of Pensieve-PPO includes both the training and test datasets.

You can start training by executing the following command:

python train.py

The model is evaluated on the test set (HSDPA traces) every 300 epochs.

TensorBoard Integration

To monitor the training process in real time, you can use TensorBoard. Simply run the following command:

tensorboard --logdir=./
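
If you add your own metrics, here is a minimal sketch of how scalars can be written so that TensorBoard picks them up, assuming the PyTorch SummaryWriter (the tag name and log directory are illustrative):

from torch.utils.tensorboard import SummaryWriter

# the log directory must sit under whatever you pass to `tensorboard --logdir=...`
writer = SummaryWriter(log_dir='./runs/demo')
for epoch in range(10):
    avg_reward = 0.1 * epoch  # placeholder value; log your real training metric here
    writer.add_scalar('train/avg_reward', avg_reward, epoch)
writer.close()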

Pretrained Model

We have also added a pretrained model, which can be found at this link. It demonstrates a substantial improvement of 7.03% (from 0.924 to 0.989) in average Quality of Experience (QoE) over the original Pensieve model [1]. For a more detailed performance analysis, refer to the comparison figures in the repository.

If you have any questions or require further assistance, please don't hesitate to reach out.

Additional Reinforcement Learning Algorithms

For more implementations of reinforcement learning algorithms (e.g., DQN and SAC), please visit the corresponding branches of this repository.

[1] Mao, Hongzi, Ravi Netravali, and Mohammad Alizadeh. "Neural adaptive video streaming with Pensieve." Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM). 2017: 197-210.

[2] Xia, Zhengxu, et al. "Genet: Automatic curriculum generation for learning adaptation in networking." Proceedings of the ACM SIGCOMM 2022 Conference. 2022.

[3] Huang, Tianchi, et al. "Comyco: Quality-aware adaptive video streaming via imitation learning." Proceedings of the 27th ACM International Conference on Multimedia. 2019.

pensieve-ppo's People

Contributors

dependabot[bot], godka, kasimte, ruixiaozhang, shuffle0412


pensieve-ppo's Issues

The time to train the model

Could you give me a hint on how long it takes to train the model for 1,000,000 epochs? Several days, or weeks? Many thanks.

TQL

tql ("太强了", i.e. "this is awesome")

a question about compute_v

Hi godka,
thanks for your work, but I'm confused by your calculation of the advantage.

Pensieve-PPO/src/ppo2.py

Lines 150 to 151 in f310bf1

for t in reversed(range(ba_size - 1)):
    R_batch[t, 0] = r_batch[t] + GAMMA * R_batch[t + 1, 0]

You define R_t = r_t + GAMMA * R_{t+1}, but the loop iterates in reverse. Does that mean it actually computes R_t = r_t + GAMMA * R_{t-1}?
Is this a typo?
Looking forward to your reply, thanks.
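
For context, a small NumPy sketch (not the repository's code) of the reversed recursion under discussion: iterating t backwards means R[t + 1] is already final when R[t] is filled in, so the update really is R_t = r_t + GAMMA * R_{t+1}:

import numpy as np

GAMMA = 0.99
r = np.array([1.0, 0.5, 0.2, 0.0])   # example rewards
R = np.zeros_like(r)
R[-1] = r[-1]                        # last entry acts as the bootstrap value
for t in reversed(range(len(r) - 1)):
    R[t] = r[t] + GAMMA * R[t + 1]   # R[t + 1] is already final because we iterate backwards
print(R)                             # discounted returns R_t = r_t + GAMMA * R_{t+1}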

setting entropy to TD_loss summary vars

In train.py, why did you feed actor._entropy into summary_vars[0], which is the TD_loss summary?
You already compute avg_entropy and feed it into summary_vars[2].

 summary_str = sess.run(summary_ops, feed_dict={
     summary_vars[0]: actor._entropy,
     summary_vars[1]: avg_reward,
     summary_vars[2]: avg_entropy
 })
 writer.add_summary(summary_str, epoch)
 writer.flush()

Could you please clarify?
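
For readers unfamiliar with this TF1-style summary setup, here is a hedged sketch of how such summary_vars are typically wired up (placeholder and tag names are illustrative, not the repository's exact code); whichever value is fed to summary_vars[0] ends up plotted under the first tag:

import tensorflow as tf  # TensorFlow 1.x style, matching the snippet above

td_loss_ph = tf.placeholder(tf.float32)
avg_reward_ph = tf.placeholder(tf.float32)
avg_entropy_ph = tf.placeholder(tf.float32)
tf.summary.scalar('TD_loss', td_loss_ph)
tf.summary.scalar('Avg_reward', avg_reward_ph)
tf.summary.scalar('Avg_entropy', avg_entropy_ph)
summary_ops = tf.summary.merge_all()
summary_vars = [td_loss_ph, avg_reward_ph, avg_entropy_ph]
# whatever value is fed into summary_vars[0] is plotted under the 'TD_loss' tag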

Monitor cross-validation curve

I was trying to monitor the TD_loss and rewards in TensorBoard by launching it with this command:

tensorboard --logdir=./results

However, I couldn't see the validation curve from rl-test.py; I can only see the TD_loss, entropy, and rewards from train.py.
How can I enable the cross-validation curve so that I can monitor model convergence?
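
One common pattern for getting a separate validation curve into the same TensorBoard view is to write the test-set metric with its own writer whose log directory also sits under --logdir; a minimal sketch, assuming torch.utils.tensorboard and illustrative tags and directories (not the repository's setup):

from torch.utils.tensorboard import SummaryWriter

train_writer = SummaryWriter(log_dir='./results/train')   # both dirs sit under --logdir=./results
valid_writer = SummaryWriter(log_dir='./results/valid')

for epoch in range(30):
    train_writer.add_scalar('reward', 0.10 * epoch, epoch)       # placeholder training metric
    if epoch % 10 == 0:                                          # periodic test-set evaluation
        valid_writer.add_scalar('reward', 0.09 * epoch, epoch)   # placeholder validation metric

train_writer.close()
valid_writer.close()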

SAC import error

In train_sac.py there is an import, from sac import Network, which raises an error.

How did you install Soft Actor-Critic (SAC)?
The default installation via pip install sac leads to a "Network not found" error.

How is your baseline method implemented?

I can see from your experimental results comparison chart that there are many algorithms, including BOLA, HYB, RB, etc. Could the implementation details of these algorithms be made public?

Way to replicate baseline results for different data sets

Hey, I am exploring GitHub for solutions similar to or based on Pensieve by Hongzimao. I am looking for a way to create a benchmark platform for Pensieve vs. non-RL ABR implementations popular in research, like BB, RB, BOLA, MPC, and the others included here in the baseline directory. I have already created tools for generating different datasets based on https://dash.itec.aau.at/dash-dataset/. This is part of my master's degree work, so it would be very helpful to receive some hints.

Training with Multiple videos with random number of bitrates masked.

We are trying to train this model on multiple videos using PPO with different masks (a variable number of bitrates masked). For example, there is currently a maximum of 12 bitrates, and some of them are randomly masked in different videos: sometimes only 9 of them are available, sometimes only 6, and so on.

So far we have been using the same dataset as for A3C, and we have modified abr.py and abrenv.py accordingly. But with our approach there seem to be issues in experience-batch creation and its further processing. Can you share a reference on how to create a video dataset for PPO with different masks?
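
One common way to handle a variable number of available bitrates is to mask the logits of the unavailable actions before the softmax so that they receive zero probability; a minimal PyTorch sketch under that assumption (the shapes and the example mask are illustrative, not the repository's code):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 12)                       # batch of 4 states, 12 candidate bitrates
mask = torch.zeros(4, 12)
mask[:, :6] = 1.0                                 # e.g. only the first 6 bitrates exist for this video
masked_logits = logits.masked_fill(mask == 0, float('-inf'))
probs = F.softmax(masked_logits, dim=-1)          # masked actions get probability 0
action = torch.multinomial(probs, num_samples=1)  # samples only among the available bitrates

If you go this route, the mask typically has to be stored alongside each transition so that the same masked distribution is used when the batch is replayed during the PPO update.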

a2c vs ppo NN architecture

According to the code, PPO uses the same network architecture (actor-critic), state space, and reward design as A2C.
The difference between A2C and PPO resides in clipping the policy update, right?

 # adaptive entropy weight
 # https://arxiv.org/abs/2003.13590
 p_batch = np.clip(p_batch, ACTION_EPS, 1. - ACTION_EPS)
 _H = np.mean(np.sum(-np.log(p_batch) * p_batch, axis=1))
 _g = _H - self.H_target
 self._entropy_weight -= self.lr_rate * _g * 0.1

Therefore, the NN architecture of Pensieve-PPO looks like Pensieve's.

Is that so, or does the architecture have another design?
Could you please share the design and the paper?

Thanks
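
To make the contrast in this question concrete, here is a schematic PyTorch sketch of the two policy losses (names are illustrative, not the repository's code); the value loss, network, state, and reward can stay identical, and the PPO variant only changes how the policy-gradient term is weighted and clipped:

import torch

def a2c_policy_loss(log_prob, adv):
    # vanilla policy gradient, weighted by the advantage
    return -(log_prob * adv).mean()

def ppo_policy_loss(log_prob, old_log_prob, adv, clip_eps=0.2):
    # importance ratio between the new and old policies, clipped to limit each update
    ratio = torch.exp(log_prob - old_log_prob)
    surr = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)
    return -surr.mean()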

How to improve exploration?

Which parameter enables better exploration in PPO? Is it EPS?
How can I control the exploration-exploitation trade-off in PPO?
Do you recommend a good approach for this purpose?
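
As background for this kind of question: in on-policy setups like this one, exploration is usually governed mainly by the entropy bonus weight (adapted automatically in the snippet above) rather than by the clip range; a schematic sketch of where that weight enters the loss (illustrative names, not the repository's code):

import torch
import torch.nn.functional as F

def pg_loss_with_entropy_bonus(logits, actions, adv, entropy_weight=0.02):
    # actions: LongTensor of shape (batch,), adv: FloatTensor of shape (batch,)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()    # higher entropy = more exploration
    action_log_prob = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # a larger entropy_weight keeps the policy closer to uniform, i.e. more exploratory
    return -(action_log_prob * adv).mean() - entropy_weight * entropy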
