reinforcement-learning-kr / pg_travel



Policy Gradient algorithms (REINFORCE, NPG, TRPO, PPO)

License: MIT License

Python 100.00%

pg_travel's Issues

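Build an A2C-style PPO agent to improve training speed and performance

Collecting samples with a single actor-runner and training on them seems slow. The resulting policy is also of considerably lower quality than that of an agent trained with multiple actor-runners, so we should train with multiple actor-runners. I think we can proceed in the following order.

Create an environment with multiple actor-runners
Store each actor-runner's samples in its own memory
Compute GAE separately for each memory
Compute gradients from each memory, then average them to update the actor and critic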
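
The last two steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the repo's code: `compute_gae`, `average_gradients`, the `masks` convention, and the choice of a zero bootstrap value after the last stored step are all hypothetical, and gradients are represented as flat lists of floats.

```python
from typing import List, Sequence


def compute_gae(rewards: Sequence[float], values: Sequence[float],
                masks: Sequence[float], gamma: float = 0.99,
                lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for ONE actor-runner's memory.

    masks[t] is 0.0 if the episode ended at step t, otherwise 1.0.
    Assumption: the value after the last stored step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running_adv = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * masks[t] - values[t]
        running_adv = delta + gamma * lam * masks[t] * running_adv
        advantages[t] = running_adv
        next_value = values[t]
    return advantages


def average_gradients(per_runner_grads: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the flat gradient vectors from each runner."""
    n = len(per_runner_grads)
    return [sum(components) / n for components in zip(*per_runner_grads)]


# Two made-up memories, one per actor-runner: (rewards, values, masks).
memories = [
    ([1.0, 1.0], [0.0, 0.0], [1.0, 0.0]),
    ([0.5, 0.0, 1.0], [0.2, 0.1, 0.3], [1.0, 1.0, 0.0]),
]
# GAE is computed separately for each memory, as in the list above.
per_runner_advantages = [compute_gae(r, v, m) for r, v, m in memories]
```

Averaging the per-runner gradients before a single optimizer step keeps the update in the synchronous A2C style: every runner contributes equally to each update.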

The other follow-up work depends on this being done first, so it would be great if you could put it together as quickly as possible.

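Add various settings for running the code on a server

The unity ppo code currently runs on a local laptop (CPU only), but unlike mujoco, the state and action spaces are large, so it needs to run on a server with a GPU. In addition, PPO can make better use of a GPU than TRPO.

Therefore, the following needs to be done.

Make the hyper-params configurable through argparser
Use tensorboardX to monitor the training process
Periodically save training results as video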
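
The first item could look roughly like the sketch below, which uses only the standard-library `argparse` module. Every flag name and default value here is an illustrative assumption, not taken from the repo.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags and defaults, chosen only for illustration.
    parser = argparse.ArgumentParser(description="PPO training settings (sketch)")
    parser.add_argument("--gamma", type=float, default=0.99, help="discount factor")
    parser.add_argument("--lam", type=float, default=0.95, help="GAE lambda")
    parser.add_argument("--lr", type=float, default=3e-4, help="learning rate")
    parser.add_argument("--clip-param", type=float, default=0.2, help="PPO clip range")
    parser.add_argument("--logdir", type=str, default="logs",
                        help="directory for training logs")
    parser.add_argument("--video-interval", type=int, default=100,
                        help="save a video every N iterations")
    return parser


# An empty list yields the defaults; a real script would call parse_args()
# with no arguments so that sys.argv is used.
args = build_parser().parse_args([])
```

With something like this in place, a run on the server could override individual values on the command line (for example `--lr 0.001`) without editing the source, and `args.logdir` could be handed to whatever logger is used.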

Does the code in this repo run on the CPU by default?

When I run the code, there appears to be no GPU usage during execution, hence the question.

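Train the agent in an environment with slopes

For training, use the PPO agent previously trained on flat ground as the baseline.
For the environment, get help from 민규식 if possible.
The split below is only a rough one, so how about the two of you discuss it as you go?
Please post your progress on this issue along the way!

Environment setup: @pz1004 (장수영)
Training in the environment: @Hyeokreal (양혁렬)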

PermissionError when importing mujoco_py after installing mujoco

I installed mujoco successfully, but when I import mujoco_py, I get this error:

--
PermissionError Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/lockfile/linklockfile.py in acquire(self, timeout)
18 try:
---> 19 open(self.unique_name, "wb").close()
20 except IOError:

PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/mujoco_py-1.50.1.59-py3.5.egg/mujoco_py/generated/wonchul-60572700.8099-2917094554558463988'

I followed all the steps you mentioned.

Could you help me?
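
The traceback above shows the PermissionError being raised while a lock file is created inside the installed mujoco_py package directory, which suggests the current user may not have write access there. Assuming that is the cause, a minimal standard-library sketch for checking write access to a directory is shown below; `is_writable_dir` is a hypothetical helper, not part of mujoco_py.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if `path` is a directory the current user can create files in."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


# The temp directory is used here only as a stand-in; in practice you would
# pass the mujoco_py/generated directory named in the traceback.
print(is_writable_dir(tempfile.gettempdir()))
```

If the check returns False for the package directory, the import is being attempted by a user who cannot write to it, which matches the error shown.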

action = get_action(mu, std)[0]?

In main.py line 93,
action = get_action(mu, std)[0]
then action is just a scalar.
Is that a problem?

Why is the log standard deviation fixed to 0?

In the actor-critic model (model.py), the actor returns mu and logstd. In the code, logstd is fixed to 0 by defining it as "logstd = torch.zeros_like(mu)", which fixes the standard deviation to 1. As far as I know, logstd should also be learned by the network (in which case it would be the output of some layer). Is there a reason for this behavior?
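
To make the effect of that line concrete, here is a small one-dimensional sketch of a Gaussian policy's log-density written in terms of logstd. It is an illustration only, not the repo's implementation; `gaussian_log_prob` and the sample values are hypothetical.

```python
import math


def gaussian_log_prob(action: float, mu: float, logstd: float) -> float:
    """Log-density of N(mu, exp(logstd)^2) evaluated at `action` (one dimension)."""
    std = math.exp(logstd)
    var = std * std
    return (-((action - mu) ** 2) / (2.0 * var)
            - 0.5 * math.log(2.0 * math.pi)
            - logstd)


# logstd = 0 gives std = exp(0) = 1: a policy with fixed unit variance.
fixed = gaussian_log_prob(action=0.5, mu=0.0, logstd=0.0)

# A different logstd (here -1.0, as if it had been learned) changes the
# density assigned to the same action.
learned = gaussian_log_prob(action=0.5, mu=0.0, logstd=-1.0)
```

With logstd held at 0, only mu can change during training, so the amount of exploration noise stays constant; making logstd a learned quantity lets the policy adjust its own variance.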

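Try training an agent with PPO in the Pyramid environment

I think we can probably proceed in the following order.
Feel free to ask for help anytime!

Compile the pyramid environment and test it (print out states, rewards, etc.)
Train with the existing PPO algorithm
Add curiosity and train again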

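Add a README and code comments

The README should include the following.

Project goals
A brief installation guide for each environment (it would be best to write it with Linux as the reference platform)
A description of each algorithm
Training results for each environment
A guide for training and testing
Repositories referenced

Code comments should be added where the algorithm would be hard to understand without them.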

PPO Model RuntimeError

When I run main.py, I get this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

How can I solve this? Thanks.

Error when running PPO

Hi, if I run the ppo, I get,

What is it?

Frequency of saving stats

I think this line (https://github.com/reinforcement-learning-kr/pg_travel/blob/master/mujoco/main.py#L117) should be changed to the following:

if iter % 100 == 0:
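
For reference, the condition proposed above fires only on iterations that are exact multiples of 100. A minimal sketch of that behavior, where `should_save` and the loop bounds are hypothetical:

```python
def should_save(iteration: int, interval: int = 100) -> bool:
    """True on iterations that are exact multiples of `interval`."""
    return iteration % interval == 0


# Iterations in 1..500 on which stats would be saved.
saved = [i for i in range(1, 501) if should_save(i)]
print(saved)  # [100, 200, 300, 400, 500]
```

Note that iteration 0 also satisfies the condition, so a loop that starts counting from 0 would save once before any training has happened.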

Quick question about environment normalization

Hello,

I'm planning to use your PPO implementations, which seem well-written, clear and easy to understand. But first, I'd like to have the answer to the following question:

In OpenAI baselines, environments are passed to various classes, such as VecNormalize, Observation/Reward Wrappers, or even Monitor. In these cases, observations and rewards are transformed to ease learning. However, there is a lot of encapsulation, which makes it kinda difficult to follow the chain. After a quick glance at your implementations, I'm under the impression that you do transform the observations in unity/utils/running_state.py. Is that so? Are there other transformations? Or were you just careful while designing the environment, making sure rewards were appropriately scaled?

Thanks a lot for your answers.
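
For readers unfamiliar with the kind of observation transformation the question refers to, the sketch below shows one common approach: normalizing each observation by a running mean and standard deviation (Welford's algorithm) and clipping the result. It is an assumption-laden illustration for scalar observations; `RunningNormalizer` is hypothetical and is not claimed to match unity/utils/running_state.py.

```python
import math


class RunningNormalizer:
    """Running mean/variance (Welford's algorithm) for scalar observations."""

    def __init__(self, clip: float = 5.0, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        # With fewer than two samples the variance is undefined; fall back to 1.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / (self.count - 1))

    def __call__(self, x: float) -> float:
        """Update the statistics with x, then return x normalized and clipped."""
        self.update(x)
        z = (x - self.mean) / (self.std() + self.eps)
        return max(-self.clip, min(self.clip, z))


normalizer = RunningNormalizer()
normalized = [normalizer(obs) for obs in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

Because the statistics are updated online, the same raw observation can map to different normalized values early and late in training, which is worth keeping in mind when comparing runs.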
