
ris-miso-deep-reinforcement-learning's Introduction

Joint Transmit Beamforming and Phase Shifts Design with Deep Reinforcement Learning

PyTorch implementation of the paper Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning. The paper solves a Reconfigurable Intelligent Surface (RIS) Assisted Multiuser Multi-Input Single-Output (MISO) system problem using the Deep Deterministic Policy Gradient (DDPG) deep reinforcement learning algorithm, targeting sixth-generation (6G) applications.

The algorithm is tested, and the results are reproduced, on a custom RIS-assisted multiuser MISO environment.

I've updated the repository after 10 months. So, what's new?

  • Minor mistakes in the computation of the channel matrices and responses, which had no effect on the results but unnecessarily increased the computational complexity, have been fixed.
  • Channel noise is now added to obtain a noisy channel estimate for a more realistic implementation. It can be enabled by setting the argument channel_est_error to True (default is False).
  • Results are now saved as an array of shape (# of episodes, # of time steps per episode). You can visualize the results for a specific episode by indexing result[desired episode num.], where result is the .npy file loaded from the custom results directory; see the sketch after this list.
  • The way the paper treats the transmit and received powers is incorrect: a power quantity cannot be complex; it is a real scalar. This has also been fixed. Consequently, the number of elements each power term contributes is now the number of users. Stability improved since the computational complexity is reduced.
  • Due to the reduced computational complexity, please decrease the number of time steps per episode to approximately 10,000. DRL agents can suddenly diverge when they combine off-policy learning, deep function approximation, and bootstrapping, a phenomenon known as the deadly triad; DDPG combines all three. Therefore, as a reinforcement learning researcher, I suggest not increasing the training duration significantly; otherwise, you may observe sudden, unrecoverable divergence during learning.
  • Also check out our recent work (paper, repo) on the same system, that is, DRL-based RIS MU-MISO, but now with the phase-dependent amplitude (PDA) reflection model. The PDA model matters because most RIS papers assume ideal reflections at the RIS; in reality, the reflections are scaled by a factor between 0 and 1 that depends on the chosen phase angles. We address this by introducing a novel DRL algorithm.
  • IMPORTANT: I receive too many emails about the repository. Please open an issue instead so that everyone can follow possible problems with the code.
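A minimal sketch of loading and visualizing the saved results referenced above (the file path is illustrative; point it to the .npy file saved under your own results directory):

    import numpy as np
    import matplotlib.pyplot as plt

    # The saved array has shape (# of episodes, # of time steps per episode).
    result = np.load("./results/result.npy")  # illustrative path; use your own results file

    episode = 0  # index of the episode to visualize
    plt.plot(result[episode])
    plt.xlabel("Time step")
    plt.ylabel("Instant reward")
    plt.title("Episode {}".format(episode))
    plt.show()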

Results

Reproduced figures can be found under ./Learning Figures, named according to the figure numbers in the paper. Reproduced learning and evaluation curves can be found under ./Learning Curves. The hyper-parameter setting follows the one presented in the paper, except for the variance of the AWGN, the scale of the Rayleigh distribution, and the number of hidden units in the networks; these values were tuned to match the original results.

Run

0. Requirements

matplotlib==3.3.4
numpy==1.21.4
scipy==1.5.4
torch==1.10.0

1. Installing

  • Clone this repo:
    git clone https://github.com/baturaysaglam/RIS-MISO-Deep-Reinforcement-Learning
    cd RIS-MISO-Deep-Reinforcement-Learning
  • Install Python requirements:
    pip install -r requirements.txt

2. Reproduce the results provided in the paper

  • Usage:
    usage: reproduce.py [-h] [--figure_num {4,5,6,7,8,9,10,11,12}]
  • Optional arguments:
    optional arguments:
      -h, --help            show this help message and exit
      --figure_num {4,5,6,7,8,9,10,11,12}
                            Choose one of the figures from the paper to reproduce
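For example, to reproduce Figure 5 (any of the figure numbers listed above can be passed):

    python reproduce.py --figure_num 5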

3. Train the model from scratch

  • Usage:
usage: main.py [-h]
            [--experiment_type {custom,power,rsi_elements,learning_rate,decay}]
            [--policy POLICY] [--env ENV] [--seed SEED] [--gpu GPU]
            [--start_time_steps N] [--buffer_size BUFFER_SIZE]
            [--batch_size N] [--save_model] [--load_model LOAD_MODEL]
            [--num_antennas N] [--num_RIS_elements N] [--num_users N]
            [--power_t N] [--num_time_steps_per_eps N] [--num_eps N]
            [--awgn_var G] [--exploration_noise G] [--discount G] [--tau G]
            [--lr G] [--decay G]
  • Optional arguments:
optional arguments:
  -h, --help            show this help message and exit
  --experiment_type {custom,power,rsi_elements,learning_rate,decay}
                        Choose one of the experiment types to reproduce the
                        learning curves given in the paper
  --policy POLICY       Algorithm (default: DDPG)
  --env ENV             OpenAI Gym environment name
  --seed SEED           Seed number for PyTorch and NumPy (default: 0)
  --gpu GPU             GPU ordinal for multi-GPU computers (default: 0)
  --start_time_steps N  Number of exploration time steps sampling random
                        actions (default: 0)
  --buffer_size BUFFER_SIZE
                        Size of the experience replay buffer (default: 100000)
  --batch_size N        Batch size (default: 16)
  --save_model          Save model and optimizer parameters
  --load_model LOAD_MODEL
                        Model load file name; if empty, does not load
  --num_antennas N      Number of antennas in the BS
  --num_RIS_elements N  Number of RIS elements
  --num_users N         Number of users
  --power_t N           Transmission power for the constrained optimization in
                        dB
  --num_time_steps_per_eps N
                        Maximum number of steps per episode (default: 20000)
  --num_eps N           Maximum number of episodes (default: 5000)
  --awgn_var G          Variance of the additive white Gaussian noise
                        (default: 0.01)
  --exploration_noise G
                        Std of Gaussian exploration noise
  --discount G          Discount factor for reward (default: 0.99)
  --tau G               Learning rate in soft/hard updates of the target
                        networks (default: 0.001)
  --lr G                Learning rate for the networks (default: 0.001)
  --decay G             Decay rate for the networks (default: 0.00001)
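For example, a training run with the reduced episode length suggested above; the system dimensions and transmit power below are illustrative, not the paper's exact setting:

    python main.py --num_antennas 4 --num_RIS_elements 16 --num_users 4 --power_t 30 --num_time_steps_per_eps 10000 --num_eps 100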

Using the Code

If you use our code, please cite this repository:

@misc{saglam2021,
  author = {Saglam, Baturay},
  title = {RIS MISO Deep Reinforcement Learning},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/baturaysaglam/RIS-MISO-Deep-Reinforcement-Learning}},
  commit = {8c15c4658051cc2dc18a81591126a3686923d4c2}
}


ris-miso-deep-reinforcement-learning's Issues

Generating Fig. 4 and 5

I am sorry to bother you. This is my first time working on an RL project. I would like to know what procedure to follow to generate Figures 4 and 5 using your code and training from scratch.

.npy file for reproducing Figure 4?

Dear @baturaysaglam and @Amjad-iqbal3,
I have been working on this code for a while, and I am also stuck at the same issue (as described in issue no. 14 by @Amjad-iqbal) for Figure 4. How did you generate the data (result.npy file) to produce this figure? There is no option in the code to generate this data. Did you do it manually to match the result in the paper?
Please let me know if anyone has an idea.

Some doubts about program run time

How long does it take to run this program with the default parameters?
--num_time_steps_per_eps = 20000
--num_eps = 5000

I run main.py on an RTX 3090 GPU, and one episode takes 4 minutes. Is this all right?

Unit modulus constraint of RIS

Hi, I debugged the code and have a problem. I observed the variable values in environment.py and found that the diagonal elements of Phi ("Phi" denotes the RIS matrix) do not satisfy the unit modulus constraint, i.e., |Phi(n,n)|^2 is not equal to 1.
This seems to mean that the "Normalize the phase matrix" part in DDPG.py does not guarantee that |Phi(n,n)|^2 = 1.
If you have any suggestions, please do not hesitate to advise. Thank you!
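For reference, one standard way to satisfy the unit modulus constraint is to parameterize the RIS matrix directly by phase angles, so that every diagonal entry has magnitude one by construction. The sketch below only illustrates that idea and is not the repository's normalization code:

    import numpy as np

    def unit_modulus_phi(theta):
        # Phi = diag(e^{j*theta}): every diagonal entry has |Phi[n, n]| = 1 by construction.
        return np.diag(np.exp(1j * np.asarray(theta)))

    theta = np.random.uniform(0.0, 2.0 * np.pi, size=4)  # 4 RIS elements (illustrative)
    Phi = unit_modulus_phi(theta)
    assert np.allclose(np.abs(np.diag(Phi)), 1.0)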

Transmit power handling

Hi, I am also confused about the power. As far as I can see, the power equals Trace{G G^H}, but when you compute the power in the step part of the environment, you use "np.real(np.diag(self.G.conjugate().T @ self.G)).reshape(1, -1) ** 2",
which amounts to P ** 2.
So I think the transmit power will be a wrong value.
Could you tell me why you use this equation? Maybe there is something I haven't learned yet.
Thank you.

Furthermore, when you normalize the phase, why do you use sqrt(2) here: torch.sum(torch.abs(Phi_real), dim=1).reshape(-1, 1) * np.sqrt(2)?
Honestly, I don't understand this whole expression. Why does normalizing the phase require the sum of Phi_real? As far as I know, Phi is computed from the angles, and its entries are complex values whose range cannot be bounded in that way.
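For context, the transmit power of a precoding matrix G is the real scalar Tr(G^H G); a small numerical check, independent of the repository's code, is:

    import numpy as np

    M, K = 4, 3  # number of BS antennas and users (illustrative)
    G = (np.random.randn(M, K) + 1j * np.random.randn(M, K)) / np.sqrt(2)

    power = np.real(np.trace(G.conj().T @ G))    # total transmit power: a real scalar
    per_user = np.real(np.diag(G.conj().T @ G))  # per-user powers, without any extra squaring
    assert np.isclose(power, per_user.sum())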

On the problem of plotting the reward-vs-steps diagram

I'm having some problems when running a simulation with your code.
For example, I want to reproduce Figure 7 in the paper.
The size of the trained instant rewards is (num episodes, num steps per eps), and I know that each row should represent the rewards of that episode.

But when I load the file you posted on GitHub, I find that its size is (1, num steps per eps).
Did you take the first episode for plotting?
Otherwise, I found that the reward hardly increases until the last episode.

Sorry to bother you; I'd be very grateful if you could help me out!

Help needed for Figure 4 and the data you generated

I have been diligently working on this code, and currently, I find myself at an impasse regarding the generation of Figure 4 data. Could you kindly shed some light on the process you employed to generate the data for the "Sum_rate_power" folder? If it's not too much to ask, I would greatly appreciate it if you could pinpoint the relevant sections within the code.

Additionally, I am curious about the contents of the "result.npy" file present in the "sum_rate_ris" folder. I'm wondering if these values correspond to the optimal reward (opt_reward) or if they were derived from an alternative source. I would be immensely grateful if you could furnish me with the comprehensive code that facilitated the creation of these two folders. A step-by-step breakdown of the process would be invaluable in aiding my understanding.

Thank you immensely for your assistance.

Comparison algorithm problem

Hello, I have encountered some difficulties with the comparison algorithms. Could you provide the code for the ZF and WMMSE comparison algorithms shown in the figure? Thank you very much.

Some questions about rewards in the training process

Greetings. Recently, I ran a simulation with your code. I used the default values for all the parameters without changing anything. I found that, at the end of the first episode, the reward can reach 11 or higher.
[screenshot: eps 1]
But at the end of the second episode, the reward was stuck at 2 or less.
[screenshot: eps 2]
What's more, the performance of the subsequent episodes is much lower than that of the first episode. I don't know why this happens. Shouldn't the performance of each episode increase as the number of training episodes increases? Why is the performance best in the first episode?

Action dimension problem

I have modified your code in order to implement a method that associates users with the available RISs, allocates the transmission power of multiple users and configures the phase shifts of multiple RISs. The state now is defined as the achievable SNR per user, however the action space becomes very large when the number N of RIS elements (per RIS) is very high (e.g., 64 or 128) or more users and RISs are added. The complexity is very high, any hints on how this could be improved algorithm-wise?

Questions in 'DDPG.py' and 'environment.py'

Dear Mr. Baturay Saglam,

I have some questions regarding 'DDPG.py' and 'environment.py'.
In 'DDPG.py', I think the code at lines 47-52 and 65-67, which aims to normalize the phase shift matrix, is wrong, because the norm of the phases after normalization is not equal to 1 (it is actually much less than 1).

In 'environment.py', at line 114, is the '**2' on the power necessary? 'self.G.conjugate().T @ self.G' is already the power. Meanwhile, at line 90, why do you multiply the noise power 'self.awgn_var' by '(self.K - 1)'?

Could you please double-check these three places?

Issue in regenerating figure 11

Hi everyone,
I'm trying to regenerate Figure 11 for the learning rates.
May I ask for how many episodes you ran the code? I tried 300 episodes and 10,000 steps, and it still does not converge!
Any advice?
I also see that the arrays the author saved under Learning_rates are vectors of length 10,000 (steps). Shouldn't they be of shape (number of episodes x number of steps)?
Thanks!

How do I generate comparison data with a trained model?

Appreciate your contribution. I am a beginner in the DRL field, and I noticed that the project only provides training code, so how do I use the trained model? For example, how can I use the code of this project to run diagrams such as 4_reproduced.jpg, 5_reproduced.jpg, and 9_reproduced.jpg? More specifically, how can I get the data that leads to these graphs?

Transmitted power handling

Hi,
I was reading your repository again after your new update, but I couldn't tell whether you handle the power transmitted from the BS or not.
Do you only check that the power can be at most the maximum power, without forcing it to be exactly the maximum power? (As far as I know, transmitting at maximum power is not always the best choice because of the interference it can create and the current users' CSI.) I think you normalize your precoding matrix and multiply it by sqrt(power), don't you?
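For reference, a common convention for enforcing a total transmit power budget is to rescale the precoder by its Frobenius norm; the sketch below only illustrates that convention and is not taken from the repository:

    import numpy as np

    def scale_to_power(G, power_t_dB):
        # Rescale G so that Tr(G^H G) equals the power budget (dB converted to linear scale).
        power_t = 10 ** (power_t_dB / 10)
        current = np.real(np.trace(G.conj().T @ G))
        return G * np.sqrt(power_t / current)

    G = (np.random.randn(4, 3) + 1j * np.random.randn(4, 3)) / np.sqrt(2)
    G_scaled = scale_to_power(G, power_t_dB=30)
    assert np.isclose(np.real(np.trace(G_scaled.conj().T @ G_scaled)), 1000.0)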

Variation in channel at each time step.

We reset the channels (H_1, H_2) in the reset() part of the environment. Resetting is done only at the start of each episode, which means the channel changes once per episode and stays constant during all the steps within that episode, as there is no command to change the channel at each time step. As far as I know, the channel should change at each time step.
Kindly let me know if I am getting something wrong.
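For context, under a block-fading assumption the channels would be re-drawn at every time step; a minimal illustration of sampling a fresh Rayleigh-fading channel, independent of the repository's environment code, is:

    import numpy as np

    def sample_rayleigh(rows, cols, scale=1.0):
        # i.i.d. complex Gaussian entries give Rayleigh-distributed magnitudes.
        return scale * (np.random.randn(rows, cols) + 1j * np.random.randn(rows, cols)) / np.sqrt(2)

    for t in range(3):  # calling this inside the step loop gives a fresh realization every step
        H_1 = sample_rayleigh(16, 4)   # e.g., BS-to-RIS link (dimensions illustrative)
        H_2 = sample_rayleigh(3, 16)   # e.g., RIS-to-users link (dimensions illustrative)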

Optimization constraints

Why does the normalization of the power satisfy the constraints of the optimization problem? If there were other constraints, what should we do?

Calculation of Unit Modulus constraint.

Hope you are doing fine.
I have gone through all the issues related to the calculation of the unit modulus constraint and your responses. In my opinion, you have calculated the unit modulus constraint incorrectly: you added the real/imaginary values of all the RIS elements and then multiplied by sqrt(2). According to the constraint, the magnitude of each RIS element should be one (not the sum over all elements). If I am getting something wrong, please let me know.

Power conditions used to create figures 8, 11 and 12

Hi, I was trying to reproduce the results in the paper, but I am unable to get the same results. Would it be possible for you to share the power values that were used to create the plots for Figures 8, 11, and 12? The final reward in my case does not match the figures. Thank you.
