real-stanford / diffusion_policy
[RSS 2023] Diffusion Policy Visuomotor Policy Learning via Action Diffusion
Home Page: https://diffusion-policy.cs.columbia.edu/
License: MIT License
Hi there! Thanks for your impressive work and beautiful code :)
I tried to run lift_image_abs with transformer hybrid workspace HEADLESS, but it logged that:
[root][INFO] Command '['/mambaforge/envs/robodiff/lib/python3.9/site-packages/egl_probe/build/test_device', '0']' returned non-zero exit status 1.
[root][INFO] - Device 0 is not available for rendering
and this repeats for all 4 GPUs. Afterwards, I found that the "Eval LiftImage" process is really slow. Should I enable or install a driver for hardware-accelerated rendering?
Hello @cheng-chi ! Thank you for sharing your beautiful code as open-source. I have integrated your code into a custom environment that I've developed. After training, I noticed that the loss on the training set for the DDPM algorithm consistently decreased, whereas the loss on the validation set kept increasing (the final loss on the training set was 10e-5, and on the validation set, it was 1.3), showing a significant difference in magnitude. I also checked the training logs you provided and observed a similar magnitude difference, although the increase in validation set loss wasn't very pronounced (for example, the final losses in data/experiments/image/pusht/diffusion_policy_cnn/train_0/logs.json were 0.00024978463497540187 and 0.24248942732810974, respectively). Moreover, the success rate of the closed-loop test for the final model was just over 70% (compared to a 98% success rate with expert data). Therefore, I would like to inquire whether this issue could be affecting the test performance and if you have any good debugging experience to share.
Thank you and look forward to your reply!
Hello, thank you for providing the code! I just started learning about diffusion models and was deeply inspired after reading your paper. In your paper, you mention using observations to guide the CNN-based diffusion model for action generation, transforming the observation Ot into an observation embedding sequence with a shared MLP. Is this something you first proposed or an improvement on previous papers? Thank you!
Hi, I am wondering how much harder the transformer-based diffusion policy is to train. Would it be possible to adapt it to the vision-based Colab example? @NeilNie @pointW @jingxixu @cheng-chi
Thank you for the beautiful code!
In the evaluation during training, I see you log train_action_mse_error, which samples trajectories from the training set and calculates the error between the predicted action and the target. Is there a specific reason why there isn't a corresponding validation_action_mse_error that calculates this error on the validation set?
Hi Cheng, this work is incredible and elegant! I tried the code, and I found that the training time for each task and method is around 12 hrs. I am also testing on the robomimic can task, and it seems to take a while to get a performant policy. I am wondering whether you have any suggestions for speeding up the training process.
I understand from the paper that the keypoints in the pusht task are 9 2D positions obtained from the true pose of the block, but I still don't know what they mean or which positions they are. How do they relate to the x and y position of the block?
Hi Cheng, I'm wondering what are the differences between low_dim.hdf5 and low_dim_abs.hdf5 since the dataset in the robomimic environment only has low_dim.hdf5 and image.hdf5. How is the low_dim_abs.hdf5 and image_abs.hdf5 dataset produced?
Thanks for your help!
The conda environment takes forever to build on my end. Does anyone know how to solve it?
Hi,
In the paper, you mentioned that you executed the policy on two different robots: UR5 and Franka. Is it possible to release the code for Franka as well? Thanks!
I haven't found a way to inspect the depth data. It seems the Robomimic version in use is fairly old, but I am not sure about that.
Do you have any idea how to visualise the depth image and how to access the camera parameters?
Hello guys,
Very impressive work. I would love to try it out with the state-based notebook, but it fails to run in the Google Colab runtime. The import cell fails with:
XlaRuntimeError: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.
on this line:
from diffusers.training_utils import EMAModel
Could you please advise on fix/workaround?
Hi,
I would like to use this amazing Diffusion Policy method in a custom simulation. As a first step, I am trying to reproduce some of the simulations given in the paper ('lift', 'can', 'square', ...) and understand how this code works to see how I can make an environment for my own simulation.
In the ReadMe, there are instructions on how to train and evaluate simulations, but using already gathered data for the pygame pusht example. Moreover, in the code itself, I can only find demo_real_robot.py (which obviously interfaces with the real robot) and a demo_pusht.py which interfaces with pygame but not with e.g. the robomimic simulations (which would be interesting to see how e.g. visual information is retrieved from the simulation so I can try replicate that).
Maybe I am missing something obvious, but is there an easy way to reproduce one of the simulated examples from the paper, from demonstration to evaluation? Or would this require some custom code that is not in this repo?
Thank you!
Thanks for providing this beautiful code and documentation!
I have been reading the implementation of Dataset and SequenceSampler in the Colab example and I have a question about it.
def create_sample_indices(
        episode_ends:np.ndarray, sequence_length:int,
        pad_before: int=0, pad_after: int=0):
    indices = list()
    for i in range(len(episode_ends)):
        start_idx = 0
        if i > 0:
            start_idx = episode_ends[i-1]
        end_idx = episode_ends[i]
        episode_length = end_idx - start_idx
        min_start = -pad_before
        max_start = episode_length - sequence_length + pad_after
        # range stops one idx before end
        for idx in range(min_start, max_start+1):
            buffer_start_idx = max(idx, 0) + start_idx
            buffer_end_idx = min(idx+sequence_length, episode_length) + start_idx
            start_offset = buffer_start_idx - (idx+start_idx)
            end_offset = (idx+sequence_length+start_idx) - buffer_end_idx
            sample_start_idx = 0 + start_offset
            sample_end_idx = sequence_length - end_offset
            indices.append([
                buffer_start_idx, buffer_end_idx,
                sample_start_idx, sample_end_idx])
    indices = np.array(indices)
    return indices
I understand that we need to pad max_start to be episode_length-sequence_length?
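To see what the padding arguments do to max_start, here is a small worked run of the function quoted above on a single five-step episode. The episode length and the values of sequence_length, pad_before and pad_after are made up for illustration; the function body is the one from the question.

```python
import numpy as np


def create_sample_indices(episode_ends, sequence_length, pad_before=0, pad_after=0):
    """Same logic as the function quoted in the question."""
    indices = []
    for i in range(len(episode_ends)):
        start_idx = 0 if i == 0 else episode_ends[i - 1]
        end_idx = episode_ends[i]
        episode_length = end_idx - start_idx
        min_start = -pad_before
        max_start = episode_length - sequence_length + pad_after
        for idx in range(min_start, max_start + 1):
            buffer_start_idx = max(idx, 0) + start_idx
            buffer_end_idx = min(idx + sequence_length, episode_length) + start_idx
            start_offset = buffer_start_idx - (idx + start_idx)
            end_offset = (idx + sequence_length + start_idx) - buffer_end_idx
            indices.append([
                buffer_start_idx, buffer_end_idx,
                start_offset, sequence_length - end_offset])
    return np.array(indices)


# Illustrative values only: one episode of 5 steps, windows of 4 steps,
# windows may start 1 step early and overrun the episode end by 2 steps.
indices = create_sample_indices(
    np.array([5]), sequence_length=4, pad_before=1, pad_after=2)

# Columns: buffer_start, buffer_end, sample_start, sample_end
for row in indices.tolist():
    print(row)
# [0, 3, 1, 4]  <- starts one step early: slot 0 of the window has no data
# [0, 4, 0, 4]
# [1, 5, 0, 4]
# [2, 5, 0, 3]  <- overruns the end by 1: last slot has no data
# [3, 5, 0, 2]  <- overruns the end by 2: last two slots have no data
```

With pad_after=0 the loop would stop at idx = episode_length - sequence_length, so every window would end inside the episode; a positive pad_after raises max_start and adds the windows that run past the end. Rows where sample_start is greater than 0 or sample_end is less than sequence_length are the ones the sampler has to fill in afterwards.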
Hi @cheng-chi, this work is incredible! I read the code carefully and I have a doubt about image normalization.
For example, in real_pusht_image_dataset.py, the following code normalizes the image to [-1, 1]:
    for key in self.rgb_keys:
        normalizer[key] = get_image_range_normalizer()
    return normalizer
In multi_image_obs_encoder.py, ImageNet statistics are also used in the code, but this requires the image to be in [0, 1]:
    if imagenet_norm:
        this_normalizer = torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
I don’t know what the impact of this bug is on the final performance, but it may be that there is something wrong with my understanding.
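The concern can be made concrete with a little arithmetic. The sketch below assumes the range normalizer maps a pixel value p in [0, 1] to 2p - 1, as the question describes, and then applies the ImageNet mean and standard deviation both to p and to 2p - 1. The helper names are hypothetical, and whether both steps are really applied to the same tensor in the repository is not confirmed here.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def imagenet_normalize(x):
    """Per-channel (x - mean) / std, the arithmetic behind Normalize."""
    return (x - IMAGENET_MEAN) / IMAGENET_STD


def range_normalize(p):
    """Assumed behaviour of the range normalizer: [0, 1] -> [-1, 1]."""
    return 2.0 * p - 1.0


pixel = np.array([0.5, 0.5, 0.5])  # a mid-grey RGB pixel in [0, 1]

expected = imagenet_normalize(pixel)                  # input in [0, 1]
stacked = imagenet_normalize(range_normalize(pixel))  # input in [-1, 1]

print("ImageNet norm on [0, 1] input: ", expected)
print("ImageNet norm on [-1, 1] input:", stacked)
```

If the two normalizations are stacked, a mid-grey pixel lands at -mean/std in each channel instead of near zero, so the encoder would see a shifted and rescaled input distribution. Since both transforms are affine, a network trained from scratch can in principle absorb the difference, which may be why the effect is hard to notice.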
Hello,
Thank you so much for your amazing work and beautiful code. However, when I was reading the code, I got confused by this reward collection section. Could you please clarify why you use len(self.env_fns) in the first line of this block?
https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/robomimic_lowdim_runner.py#L320-L325
My understanding is that your current code only takes part of the trajectories into consideration when the number of running simulators is smaller than the number of trajectories to test. Please correct me if I am wrong, but I think this line should be for i in range(n_inits): so that all trajectories are taken into consideration. Could you take a look? Will this issue affect the numbers you reported in the paper?
Similarly at:
https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/robomimic_image_runner.py#L327
https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/kitchen_lowdim_runner.py#L282
https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/blockpush_lowdim_runner.py#L238
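A toy version of the situation described above, with made-up numbers: 2 parallel simulators (standing in for len(self.env_fns)) evaluating 5 initial conditions (n_inits). The reward values and the helper name are hypothetical and only illustrate how the two loop bounds differ; this is not the runner's actual code.

```python
n_envs = 2    # stands in for len(self.env_fns)
n_inits = 5   # number of trajectories to evaluate

# One list of per-step rewards per evaluated trajectory (made-up values).
all_rewards = [
    [0.0, 1.0],
    [0.0, 0.0],
    [0.2, 0.9],
    [0.0, 0.0],
    [0.1, 0.3],
]


def mean_max_reward(num_entries):
    """Average the per-trajectory max reward over the first num_entries."""
    max_rewards = [max(all_rewards[i]) for i in range(num_entries)]
    return sum(max_rewards) / len(max_rewards)


partial = mean_max_reward(n_envs)   # loop bounded by the simulator count
full = mean_max_reward(n_inits)     # loop bounded by the trajectory count

print("first n_envs trajectories only:", partial)
print("all n_inits trajectories:     ", full)
```

When the number of simulators equals the number of trajectories, the two bounds give the same result, so any discrepancy would only appear in configurations where fewer simulators than trajectories are used.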
By the way, I'm also curious why you take the mean over ten checkpoints as your evaluation metric. Do you have a specific reason for doing so?
Thank you so much for your time.
Best regards,
Xiang Li
Hi,
Thanks a lot for making the code very easy to interpret and set up!
I have been playing around with the policy, and it seems like currently, it is not able to handle "pred_horizon" which is not a power of 2. For example, it works for pred_horizon = 2, 4, 8, 16 .... but doesn't work for other values. Is there a quick solution to this?
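One way to reason about which horizons can work is to track the sequence length through the down- and up-sampling stages of a 1D U-Net. The sketch below uses the standard output-length formulas for strided and transposed 1D convolutions. The layer hyperparameters (two downsampling stages with kernel 3, stride 2, padding 1, and upsampling with kernel 4, stride 2, padding 1) are assumptions for illustration and should be checked against the actual model definition.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a strided 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def conv_transpose1d_out_len(length, kernel, stride, padding):
    """Output length of a 1D transposed convolution (no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel


def skip_lengths_match(horizon, num_downsamples=2):
    """True if every upsampled length equals the matching skip length."""
    lengths = [horizon]
    for _ in range(num_downsamples):
        lengths.append(conv1d_out_len(lengths[-1], kernel=3, stride=2, padding=1))
    length = lengths.pop()  # bottleneck length
    while lengths:
        length = conv_transpose1d_out_len(length, kernel=4, stride=2, padding=1)
        if length != lengths.pop():
            return False  # skip connection could not be concatenated
    return True


valid = [h for h in range(1, 17) if skip_lengths_match(h)]
print(valid)  # [4, 8, 12, 16]
```

Under these assumptions the constraint is divisibility by 2 raised to the number of downsampling stages rather than being a power of 2, so a horizon of 12 would pass while 10 would not. If a different horizon is needed, one possible workaround is to pad the action sequence up to the next valid length and discard the extra steps after sampling.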
Hi, I am currently looking at implementing a diffusion model for policy learning and was very impressed by your work! I was wondering what components of your approach you found to be particularly important for good results? 3 things I specifically was curious about were:
Hi, thank you for your beautiful code ❤️
In section III.B of your paper you mention that you replace the global average pooling in the ResNet with a spatial softmax.
However, I cannot find where this is done in your code.
I can only see where you change the batch norm for a group norm
https://github.com/columbia-ai-robotics/diffusion_policy/blob/0d00e02b45e9e3f37f4eeb68bff076b68d9e9d44/diffusion_policy/model/vision/multi_image_obs_encoder.py#L62-L69
and where you remove the fully connected final layer
https://github.com/columbia-ai-robotics/diffusion_policy/blob/0d00e02b45e9e3f37f4eeb68bff076b68d9e9d44/diffusion_policy/model/vision/model_getter.py#L15
but not where you change the average pooling.
Am I missing something or did you actually use average pooling, contrary to what's stated in the paper?
I was having an issue with the Kitchen environment training. I was receiving this error and the training was never starting
co/index.py", line 628, in struct_indexer
attr = getattr(struct, field_name)
AttributeError: 'MjModel' object has no attribute 'eq_active'
After some debugging, I found that the issue is with the dm_control package version. I solved it by updating to v1.0.16.
I could update the conda environment YAML, but I am not sure whether everyone is hitting this error.
Eval PushtImageRunner 1/1: 0%| | 0/300 [00:00<?, ?it/s][2024-03-05 13:47:41,822][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 210.5008311236281 342.48464083840656 at 210.5008311236281 342.48464083840656
[2024-03-05 13:47:41,822][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 234.99189481952544 314.67771107771085 at 234.99189481952544 314.67771107771085
[2024-03-05 13:47:41,822][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 220.39022562435579 235.97833701476938 at 220.39022562435579 235.97833701476938
[2024-03-05 13:47:41,822][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 178.67475743781893 209.88693535477299 at 178.67475743781893 209.88693535477299
[2024-03-05 13:47:41,822][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 289.14878950291342 266.5939992541135 at 289.14878950291342 266.5939992541135
[2024-03-05 13:47:41,824][shapely.geos][INFO] - Self-intersection at or near point 289.14878950291342 266.5939992541135
[2024-03-05 13:47:41,824][shapely.geos][INFO] - Self-intersection at or near point 234.99189481952544 314.67771107771085
[2024-03-05 13:47:41,824][shapely.geos][INFO] - Self-intersection at or near point 210.5008311236281 342.48464083840656
[2024-03-05 13:47:41,824][shapely.geos][INFO] - Self-intersection at or near point 220.39022562435579 235.97833701476938
[2024-03-05 13:47:41,824][shapely.geos][INFO] - Self-intersection at or near point 178.67475743781893 209.88693535477299
[2024-03-05 13:47:41,825][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 309.8263772698329 215.10372685673218 at 309.8263772698329 215.10372685673218
[2024-03-05 13:47:41,827][shapely.geos][INFO] - Self-intersection at or near point 309.8263772698329 215.10372685673218
[2024-03-05 13:47:41,825][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 291.18848680659426 277.83949209027594 at 291.18848680659426 277.83949209027594
[2024-03-05 13:47:41,827][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 283.61346399000001 286.18896886258517 at 283.61346399000001 286.18896886258517
[2024-03-05 13:47:41,829][shapely.geos][INFO] - Self-intersection at or near point 291.18848680659426 277.83949209027594
[2024-03-05 13:47:41,829][shapely.geos][INFO] - Self-intersection at or near point 283.61346399000001 286.18896886258517
[2024-03-05 13:47:41,835][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 282.46518743654559 322.54902054743388 at 282.46518743654559 322.54902054743388
[2024-03-05 13:47:41,840][shapely.geos][INFO] - Self-intersection at or near point 282.46518743654559 322.54902054743388
[2024-03-05 13:47:41,825][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 268.47390492719433 304.81559802469485 at 268.47390492719433 304.81559802469485
[2024-03-05 13:47:41,841][shapely.geos][INFO] - Self-intersection at or near point 268.47390492719433 304.81559802469485
[2024-03-05 13:47:41,838][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 318.04646451990203 363.62674201727964 at 318.04646451990203 363.62674201727964
[2024-03-05 13:47:41,841][shapely.geos][INFO] - Self-intersection at or near point 318.04646451990203 363.62674201727964
[2024-03-05 13:47:41,858][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 297.62376437258138 216.36702689446531 at 297.62376437258138 216.36702689446531
[2024-03-05 13:47:41,858][shapely.geos][ERROR] - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 228.68062339104259 244.17213369889146 at 228.68062339104259 244.17213369889146
[2024-03-05 13:47:41,859][shapely.geos][INFO] - Self-intersection at or near point 228.68062339104259 244.17213369889146
[2024-03-05 13:47:41,860][shapely.geos][INFO] - Self-intersection at or near point 297.62376437258138 216.36702689446531
ERROR: Received the following error from Worker-13: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033b37f0>
ERROR: Shutting down Worker-13.
ERROR: Received the following error from Worker-24: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033c0190>
ERROR: Shutting down Worker-24.
ERROR: Received the following error from Worker-4: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033b7580>
ERROR: Shutting down Worker-4.
ERROR: Received the following error from Worker-25: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033bf190>
ERROR: Shutting down Worker-25.
ERROR: Received the following error from Worker-23: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033bf220>
ERROR: Shutting down Worker-23.
ERROR: Received the following error from Worker-21: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033b82b0>
ERROR: Shutting down Worker-21.
ERROR: Received the following error from Worker-9: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033b62b0>
ERROR: Shutting down Worker-9.
ERROR: Received the following error from Worker-5: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033b66d0>
ERROR: Shutting down Worker-5.
ERROR: Received the following error from Worker-7: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033b7970>
ERROR: Shutting down Worker-7.
ERROR: Received the following error from Worker-38: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033c12e0>
ERROR: Shutting down Worker-38.
ERROR: Received the following error from Worker-52: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033c75b0>
ERROR: Shutting down Worker-52.
ERROR: Received the following error from Worker-39: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033c1430>
ERROR: Shutting down Worker-39.
ERROR: Received the following error from Worker-53: TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f8d033c7700>
ERROR: Shutting down Worker-53.
ERROR: Raising the last exception back to the main process.
train_robomimic_image_workspace.yaml
What does this configuration file mean?
Will the data from robomimic be used to train with your method?
I want to collect my own data from robomimic and train with diffusion_policy. How can I do that?
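I rented a V100 on Alibaba Cloud. After installing diffusion_policy there and following the instructions in the repository, under "Running for a single seed", I ran
python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn.yaml training.seed=42 training.device=cuda:0 hydra.run.dir='data/outputs/${now:%Y.%m.%d}/${now:%H.%M.%S}${name}${task_name}'
With this command, the program only completes one round of training normally. After that round, when it calls the reset_async function in async_vector_env.py from the gym project, it crashes. Is something written incorrectly in the run(self, policy: BaseImagePolicy) function in pusht_image_runner.py that causes this failure, so that the source code has to be modified before it runs properly? Or is a single V100 unable to run diffusion_policy, which leads to the failure described above?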
Great project!
I understand that rtde_interpolation_controller.py controls the UR5.
Is there any way to use a uFactory Lite 6 robot arm instead?
Thanks for your help!
Hi Cheng, I found that in demo_real_robot.py, you sync data from different sources manually instead of using ROS. Also, multiple RealSense cameras are handled by your own code instead of multiple ROS nodes. Is there any reason behind this design?
Hi, I am trying to run the command in the README.md for Reproducing Simulation Benchmark Results:
============= Initialized Observation Utils with Obs Spec =============
using obs modality: low_dim with keys: ['agent_pos']
using obs modality: rgb with keys: ['image']
using obs modality: depth with keys: []
using obs modality: scan with keys: []
/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
[2023-12-02 23:07:29,677][diffusion_policy.model.diffusion.conditional_unet1d][INFO] - number of parameters: 2.515119e+08
Diffusion params: 2.515119e+08
Vision params: 1.119709e+07
pygame 2.1.2 (SDL 2.0.16, Python 3.9.15)
Hello from the pygame community. https://www.pygame.org/contribute.html
wandb: Currently logged in as: jehanyang (jehan_testcrew). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.3
wandb: Run data is saved locally in /home/projectimit/diffusion_project/diffusion_policy/data/outputs/2023.12.02/23.07.27_train_diffusion_unet_hybrid_pusht_image/wandb/run-20231202_230734-1g8u9a71
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2023.01.16-20.20.06_train_diffusion_unet_hybrid_pusht_image
wandb: ⭐️ View project at https://wandb.ai/jehan_testcrew/diffusion_policy_debug
wandb: 🚀 View run at https://wandb.ai/jehan_testcrew/diffusion_policy_debug/runs/1g8u9a71
Process Worker<AsyncVectorEnv>-55:
Killed
(robodiff) projectimit@RCHI-CPU-4:~/diffusion_project/diffusion_policy$ Traceback (most recent call last):
File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 622, in _worker_shared_memory
command, data = pipe.recv()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 255, in recv
buf = self._recv_bytes()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
raise EOFError
EOFError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 669, in _worker_shared_memory
pipe.send((None, False))
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process Worker<AsyncVectorEnv>-54:
Traceback (most recent call last):
File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 622, in _worker_shared_memory
command, data = pipe.recv()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 255, in recv
buf = self._recv_bytes()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
raise EOFError
EOFError
The above block repeats about 50 times.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 669, in _worker_shared_memory
pipe.send((None, False))
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread MsgRouterThr:
Traceback (most recent call last):
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
msg = self._read_message()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/wandb/sdk/interface/router_queue.py", line 36, in _read_message
msg = self._response_queue.get(timeout=1)
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/queues.py", line 117, in get
res = self._recv_bytes()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 217, in recv_bytes
self._check_closed()
File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 141, in _check_closed
raise OSError("handle is closed")
OSError: handle is closed
Dear Cheng @cheng-chi,
Thank you for your elegant and inspiring codes! I have a little question about the loss computation of noise prediction.
I think that actions before
If noise of predict_action
)?
Thank you for your time!
Regards,
Dongjie
Hello,
Thank you so much for your amazing work and beautiful code.
I'm currently working with Franka Kitchen dataset where the dimensions of a numpy array masks suggest that there are 566 demonstrations, each represented by a column in a (409, 566) shape array. Each column in masks corresponds to the existence of data per timestep for each demonstration.
However, I have encountered what seems to be a potential issue in the code where episodes are being added to the replay buffer. Here is the snippet:
diffusion_policy/diffusion_policy/dataset/kitchen_lowdim_dataset.py
Lines 23 to 37 in 548a52b
    data_directory = pathlib.Path(dataset_dir)
    observations = np.load(data_directory / "observations_seq.npy")
    actions = np.load(data_directory / "actions_seq.npy")
    masks = np.load(data_directory / "existence_mask.npy")
    self.replay_buffer = ReplayBuffer.create_empty_numpy()
    for i in range(len(masks)):
        eps_len = int(masks[i].sum())
        obs = observations[i,:eps_len].astype(np.float32)
        action = actions[i,:eps_len].astype(np.float32)
        data = {
            'obs': obs,
            'action': action
        }
        self.replay_buffer.add_episode(data)
From this code, it appears that each iteration of the loop is supposed to handle a single demonstration. However, the indexing used (observations[i,:eps_len] and actions[i,:eps_len]) seems to imply that the demonstrations are organized by rows rather than columns. If each demonstration is indeed a column in the observations and actions arrays, the correct indexing should possibly be column-wise rather than row-wise.
Could you please confirm whether the demonstrations are intended to be represented as rows or columns in the dataset? If they are indeed columns, would the correct approach be to modify the indexing to reflect this structure?
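The difference between the two layouts can be shown on a tiny synthetic example. The sketch below assumes the time-major layout the question describes (masks of shape (T, N), observations of shape (T, N, D)); the sizes and values are made up, and nothing here confirms what the files on disk actually contain.

```python
import numpy as np

T, N, D = 4, 3, 2  # timesteps, demonstrations, observation dim (toy sizes)

# Time-major mask: column n marks the valid timesteps of demonstration n.
masks = np.array([
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
], dtype=np.float32)
observations = np.arange(T * N * D, dtype=np.float32).reshape(T, N, D)

# Column-wise reading: one episode per demonstration.
episode_lengths = [int(masks[:, n].sum()) for n in range(N)]
episodes = [observations[:episode_lengths[n], n] for n in range(N)]

# Row-wise reading of the same arrays, as in the quoted loop: each "episode"
# would actually be a slice across demonstrations at a single timestep.
row_sums = [int(masks[t].sum()) for t in range(len(masks))]

# Transposing once to (N, T, ...) makes the quoted row-wise loop valid.
masks_nt = masks.T
observations_ntd = np.swapaxes(observations, 0, 1)

print("episode lengths (columns):", episode_lengths)
print("row sums (rows):          ", row_sums)
```

A quick check on the real files is to compare len(masks) with the expected number of demonstrations: if the first axis has 409 entries rather than 566, the arrays are time-major and need transposing before the row-wise loop is applied.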
Thank you for looking into this matter. I am looking forward to your clarification.
Best regards
Hi,
Thank you for your fantastic work and beautiful code! As described in the paper, you use position control instead of velocity control on the robomimic tasks for diffusion policy. However, I didn't find the corresponding changes to the robomimic environment controller in the code, except for the use of "abs_action". Did I miss something?
Thank you so much for your time.
Best regards,
Weikang
Hello,
Thanks again for this great project. It would be great if you could help diagnose this issue regarding the pretrained models.
When I try to evaluate your pretrained models of the hybrid CNN setting, I found them not working properly on Push-T, Transport ph, and transport mh. There could be more but I haven't tried them yet.
Basically, the action trajectory is relatively reasonable (not random noisy actions), but the agent just could not finish the task. (Push-T mean score: 0.09, Transport mean score: 0)
However, when I tried to train a model from scratch and evaluate it, it works fine, which may indicate that the evaluation code is correct.
I tested them on two machines at different locations, and all models are directly downloaded from your website. I also performed the integrity check and confirm that the two copies of the models on the two machines are identical. The training code can properly load the model file and the num of epochs matches the filename. But, it just does not generate the correct actions. After days of debugging, I could not find any possible directions to look into.
So could you please share some insights on what may cause this issue?
Thank you so much!
Best regards,
Hello, Could you introduce your hardware setup, for example, how many GPUs you use or how much computing power is required? Thanks!
Hi,
Really great work. We are trying to extend this codebase to a different robot and with a significantly different learning setup.
In your README, it is mentioned:
"Most of our implementations of Dataset uses a combination of ReplayBuffer and SequenceSampler to generate samples. Correctly handling padding at the beginning and the end of each demonstration episode according to To and Ta is important for good performance. Please read our SequenceSampler before implementing your own sampling method."
Is it really that simple to understand exactly how SequenceSampler works? I feel like it's a bit unapproachable and it would take quite some time to really parse the code and understand what's happening. Has anyone else tried to extend the code without adhering to this structure?
I am in the process of doing that, but as I continue, I am altering significantly parts of the training code because again I am not using the existing SequenceSampler, etc. Any pointers would be helpful!
I like your work, I have learned so much from your code and paper. 👍
I still have a question about how to visualize intermediate results, for example the generated trajectories at different denoising steps K.
Do you have any suggestions? I would greatly appreciate it.
and README!
Tried to train on the pusht_real w. images training data but I can't seem to find the config.yaml for this experiment. Is it me or is there only a sim pusht config file?
Appreciate your work and detailed github page, well done!
I just have a minor question. When I tried to reproduce the "pusht" example following the instructions, it all worked well. However, when I switch to other examples (I only changed the corresponding yaml and ckpt files and names), it always throws the error below:
~/diffusion_policy$ python eval.py --checkpoint data/epoch=5900-test_mean_score=1.000.ckpt --output_dir data/transport_eval_output --device cuda:0
Traceback (most recent call last):
  File "/home/ubuntu/diffusion_policy/eval.py", line 64, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/robodiff/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/robodiff/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/envs/robodiff/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/envs/robodiff/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/diffusion_policy/eval.py", line 34, in main
    workspace = cls(cfg, output_dir=output_dir)
TypeError: __init__() got an unexpected keyword argument 'output_dir'
I get the same error for the "can" case, for example. Why does this happen, and how can I fix it?
I am attempting to run the state-based colab. But the first cell seems to be running into an issue.
I think the first cell tries to run the installation code, but I get the error as shown in the screenshot above. For your copy/paste convenience, the error message is:
Python 3.9.16
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Is there something that I am missing with regards to how to run and use colab? I am also running into the same error with the vision-based Diffusion Policy colab. For colab, normally I just run the cells by clicking the arrow that runs the cell, or just SHIFT+ENTER.
Hi, thanks for this amazing project!
I'm having trouble using the ConditionalUnet1D as part of a custom low dimensional policy.
I have a pretty simple set up with a set of 12-dimensional actions and corresponding 42-dimensional observations. I want to predict the action for a given observation, i.e. horizon=1, n_action_steps=1 and n_obs_steps=1.
When calling the ConditionalUnet1D model (i.e. the forward method) I keep getting a mismatch of dimensions error:
226 for idx, (resnet, resnet2, upsample) in enumerate(self.up_modules):
227 x = torch.cat((x, h.pop()), dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 1 for tensor number 1 in the list.
Here is the link to the line in the repo.
This seems to stem from the previous iteration of the for loop, where the upsample call returns a tensor with dimensions (256, 512, 2). This is incompatible with the next entry in the h list, which has dimensions (256, 512, 1). Due to the mismatch in dimension 2, they cannot be concatenated along axis 1.
If I simply comment out the upsample call (i.e. this line), everything seems to be working fine and I even get reasonable results.
Might there be an issue with the upsample module, or did I not configure my dimensions correctly?
Thanks!
Jannes
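A small length calculation may help explain the shapes reported above. This is only a sketch under stated assumptions: that the downsample layer behaves like a 1D convolution with kernel 3, stride 2, padding 1, and the upsample layer like a 1D transposed convolution with kernel 4, stride 2, padding 1 (these values are assumed, not taken from the thread), and that each upsampled tensor is concatenated with the skip tensor saved before the matching downsample. The helper names are hypothetical; the two length formulas are the standard ones for these layer types.

```python
def conv1d_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def conv_transpose1d_out_len(length, kernel=4, stride=2, padding=1):
    """Output length of a 1D transposed convolution (no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel


def skip_lengths_match(horizon, n_downsamples=2):
    """Check a simplified U-Net: every upsampled length must equal the
    length of the skip tensor stored before the matching downsample."""
    skips = []
    length = horizon
    for _ in range(n_downsamples):
        skips.append(length)
        length = conv1d_out_len(length)
    for _ in range(n_downsamples):
        length = conv_transpose1d_out_len(length)
        if length != skips.pop():
            return False
    return True


print(conv1d_out_len(1))            # a length-1 sequence stays length 1
print(conv_transpose1d_out_len(1))  # but upsampling it gives length 2
print([h for h in range(1, 17) if skip_lengths_match(h)])
```

Under these assumptions a horizon of 1 cannot round-trip: downsampling keeps the length at 1 while upsampling doubles it to 2, which is the same 2 versus 1 mismatch as in the error message. In this simplified model with two downsampling stages, only horizons that are multiples of 4 pass the check in the tested range.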
Thanks for your novel work!
I'm curious about the end-effector of the Panda robot in the 6DoF Flipping task. Did you 3D-print it, or is it an official or third-party component? I'd appreciate it if you could point me to it.
In single_realsense.py there is the following code:
    # grab data
    data = dict()
    data["camera_receive_timestamp"] = receive_time
    # realsense report in ms
    data["camera_capture_timestamp"] = frameset.get_timestamp() / 1000
    if self.enable_color:
        color_frame = frameset.get_color_frame()
        data["color"] = np.asarray(color_frame.get_data())
        t = color_frame.get_timestamp() / 1000
        data["camera_capture_timestamp"] = t
        # print('device', time.time() - t)
        # print(color_frame.get_frame_timestamp_domain())
    if self.enable_depth:
        data["depth"] = np.asarray(frameset.get_depth_frame().get_data())
    if self.enable_infrared:
        data["infrared"] = np.asarray(
            frameset.get_infrared_frame().get_data()
        )
I'm wondering why the update for camera_capture_timestamp only occurs when color_frame is involved, but not when dealing with depth and infrared. Is it because enable_color is always set to True, so the timestamp is always based on the acquisition of the color image for all three types of images?
Could you give me some hints about how to prepare a custom skill dataset in the simulator?
Authors,
First, I greatly appreciate your insightful paper and well-written code!
My question is in regard to goal-conditioning. In section 3.1 of your paper (Network Architecture Options), when discussing the CNN-based Diffusion Policy, you mention:
"However, goal conditioning is still possible with the same FiLM conditioning method used for observations."
I wonder if you could comment along these lines a bit? As the sampled trajectory follows the conditional distribution p(A|O), would one simply encode the goal observation and concatenate it to the conditioning observations Ot? Or do you mean something else, like a restructuring to sample the joint trajectory p(A, O)? I'm sorry, I think I'm missing something.
Thanks very much for your time,
Robert Mash
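To make the first reading in the question concrete, here is a toy sketch of FiLM conditioning where a goal embedding is concatenated to the observation embedding. This is one possible interpretation, not something confirmed in this thread or by the authors. All names (linear, film, obs_feat, goal_feat) and all weights are hypothetical, and plain Python lists stand in for tensors.

```python
def linear(weights, bias, vec):
    """Tiny dense layer: weights is a list of rows, vec a 1-D list."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Feature-wise linear modulation.

    x is a list of channels, each a list over time steps. The conditioning
    vector produces one scale and one shift per channel.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [[s * v + h for v in channel]
            for channel, s, h in zip(x, scale, shift)]


# Hypothetical embeddings: the goal is simply appended to the observation
# features, so the conditioning path itself is unchanged.
obs_feat = [0.5, -1.0]
goal_feat = [2.0]
cond = obs_feat + goal_feat

x = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 time steps
out = film(
    x, cond,
    w_scale=[[1, 0, 0], [0, 0, 1]], b_scale=[0, 0],
    w_shift=[[0, 1, 0], [0, 0, 0]], b_shift=[0, 1],
)
print(out)
```

Under this reading the model still samples from a conditional distribution over actions; only the conditioning vector grows to include the goal.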
Dear @cheng-chi,
I would like to bring to your attention a performance issue I've encountered when working with the Transport MH dataset (image input) in robomimic. In particular, the performance of the diffusion policy seems to be significantly slower than expected.
Here are the details of my setup:
The command I use to run the training process is as follows:
python train.py --config-dir configs/image/transport_mh/diffusion_policy_cnn --config-name=config.yaml training.seed=42 training.device=cuda:2 hydra.run.dir=data/outputs/${now:%Y.%m.%d}/${now:%H.%M.%S}_${name}_${task_name} dataloader.batch_size=64 dataloader.num_workers=8
Given these circumstances, I was wondering if there might be some room for optimization or if this is the expected speed considering the complexity of the task.
Also, could you provide details about the hardware you are using and the amount of time it typically takes for the diffusion policy to train on your setup? This could help me understand if what I am experiencing is within the expected range.
Looking forward to your insights.
Best Regards,
@shim0114
Hello 🤝 , @cheng-chi . I encountered an issue while applying your code to collect data on the iiwa7 robot. When the machine reaches certain specific positions, I noticed a problem with using Euler angles for interpolation. It causes sudden jerks or acceleration, which can be potentially dangerous. I suggest using quaternions for interpolation. I have tested this and found that using quaternion interpolation resolves the problem I described.
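For reference, a minimal sketch of spherical linear interpolation (slerp) between two unit quaternions is given below. It is not the code the poster tested, which is not shown in the thread; the (w, x, y, z) ordering and the function name are assumptions made for illustration. SciPy also provides scipy.spatial.transform.Slerp for the same purpose.

```python
import math


def slerp(q0, q1, t):
    """Interpolate between unit quaternions q0 and q1 (w, x, y, z), t in [0, 1]."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one to follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(dot, 1.0)
    theta = math.acos(dot)
    if theta < 1e-6:
        # Nearly identical orientations: linear interpolation is stable here.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        s = math.sin(theta)
        w0 = math.sin((1.0 - t) * theta) / s
        w1 = math.sin(t * theta) / s
        out = tuple(w0 * a + w1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)


# Halfway between identity and a 90 degree rotation about z
# should be a 45 degree rotation about z.
q_start = (1.0, 0.0, 0.0, 0.0)
q_end = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
q_mid = slerp(q_start, q_end, 0.5)
print(q_mid)
```

Slerp moves along the shortest arc at constant angular velocity, whereas interpolating Euler angles component-wise can take a long or uneven path near angle wrap-around and gimbal-lock configurations, which is consistent with the jerks described above.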
Hello:
@cheng-chi Thank you very much for your work. Currently, I want to reproduce the diffusion policy on the real ur5 robot, but I still have the following questions. Can you give me some advice?
1. I observed that you recorded more than a hundred demonstrations for the pushT task, so I tried training directly on these demonstrations with your code. However, each epoch takes about 10 minutes on my computer. At that rate, the 600 epochs in the configuration file would take about 100 hours. Is this normal? What is the minimum number of epochs required to train a task? I ask because it seems that you can train a policy in only 12 hours. My GPU is an RTX 3060 12G.
2. In addition, I noticed that you do not have demonstration examples for the cup-righting and spilling tasks. If I want to train a brand new task, do I only need to use the script you provided on github to perform the same operation? Are the action spaces the same for these different tasks?
Thank you and looking forward to your reply!🥺
In the dataset, the observation sequence has size N*60. Are there any code snippets for generating the low-dim embeddings?
I read about the velocity control mentioned in your paper and am curious about the specific implementation details in both simulation and real-world settings.
Hi,
In real experiments, the output (action sequences) always points in a strange direction. Can you please advise me on the following questions:
BTW, I have been stuck on the problem of how to map the camera and the end effector into a global coordinate frame. After revisiting the diffusion policy in detail, my understanding is that the desired end-effector pose is output in base coordinates, and that the objects' positions observed by the camera are implicitly mapped by the diffusion model into the base frame rather than the camera frame. Please correct me if I misunderstood anything~ thanks!
Appreciate your kind reply and help!
Best regards