braceal / molecules

This project forked from yngtodd/molecules

Machine learning for molecular dynamics.

License: MIT License

Makefile 0.51% Python 99.49%

molecules's People

Contributors

hengma1001 azrael417 ndvybios candicet233 hjjvandam

molecules's Issues

SystemError: Negative size passed to PyBytes_FromStringAndSize

This happened when I tried to aggregate 240 dcd files across 40 Summit nodes:

jsrun -n 40 -r 1 -a 6 -c 7 -d packed /gpfs/alpine/proj-shared/med110/conda/pytorch/bin/python 
"/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py" 
"-t" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/MD_to_CVAE/tmp.2VrCh27TOx"
 "-p" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/Parameters/input_protein/prot.pdb"
 "-r" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/Parameters/input_protein/prot.pdb"
 "-o" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/MD_to_CVAE/cvae_input.h5" 
"--contact_maps_parameters" "kernel_type=threshold,threshold=16" "-s" "protein and name CA" "--rmsd" "--fnc"
"--contact_map" "--point_cloud" "--num_workers" "2" "--distributed" "--verbose"

and the error:

Traceback (most recent call last):
  File "/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py", line 99, in <module>
    main()
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py", line 94, in main
    sel=selection, cm_format=cm_format, num_workers=num_workers, comm=mpi_comm, verbose=verbose)
  File "/gpfs/alpine/med110/proj-shared/hrlee/git/braceal/molecules/molecules/sim/dataset.py", line 547, in traj_to_dset
    rows_ = comm.gather(rows_, 0)
  File "mpi4py/MPI/Comm.pyx", line 1262, in mpi4py.MPI.Comm.gather
  File "mpi4py/MPI/msgpickle.pxi", line 680, in mpi4py.MPI.PyMPI_gather
  File "mpi4py/MPI/msgpickle.pxi", line 685, in mpi4py.MPI.PyMPI_gather
  File "mpi4py/MPI/msgpickle.pxi", line 148, in mpi4py.MPI.Pickle.allocv
  File "mpi4py/MPI/msgpickle.pxi", line 139, in mpi4py.MPI.Pickle.alloc
SystemError: Negative size passed to PyBytes_FromStringAndSize

I tried adding an exception handler at line 547 that sets rows_ and cols_ to 0 so that the corrupted result is ignored, but that does not seem like a correct patch. I will dig further but wanted to report this first.
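
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the pickle-based comm.gather collects one serialized payload per rank at the root, and if the combined size exceeds what a signed 32-bit byte count can hold (about 2 GiB), the size can wrap around to a negative number. The sketch below uses only the standard library to split a list into chunks that stay under a byte budget, so that rows_ could be gathered in several smaller rounds. The helper names pickled_nbytes and split_for_gather are hypothetical and are not part of mpi4py or molecules.

```python
import pickle

INT32_MAX = 2**31 - 1  # largest byte count a signed 32-bit integer can hold


def pickled_nbytes(obj, protocol=pickle.HIGHEST_PROTOCOL):
    """Return the number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, protocol=protocol))


def split_for_gather(items, max_bytes=INT32_MAX // 2):
    """Greedily group items into chunks whose summed pickled size stays under max_bytes.

    A single item larger than max_bytes still gets a chunk of its own,
    so callers should pick a budget well above the largest item.
    """
    chunks, current, current_bytes = [], [], 0
    for item in items:
        nbytes = pickled_nbytes(item)
        if current and current_bytes + nbytes > max_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(item)
        current_bytes += nbytes
    if current:
        chunks.append(current)
    return chunks
```

Because the root concatenates the payloads of all ranks, the per-rank budget would need to be roughly INT32_MAX divided by the number of ranks. Every rank also has to take part in the same number of gather rounds, so the maximum chunk count would have to be agreed on first.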

Testing h5py

Write a small benchmark that reads data through the data loader, loads the same data manually from the file, and compares the two.

Run the PyTorch data loader, then load the same samples manually from the file and check them with np.array_equal.
If a variant can run like that for a few epochs without issues, it should be fine.

Test with and without the Fletcher checksum.

update LatspaceStatisticsCallback and PointCloud3dCallback

Update with new base class interface and SaveEmbeddingsCallback

change atom selection for alignment

add select=sel to alignTraj in _traj_to_dset.py

Add values bool (or dset name) to ContactMap dataset class

This will allow the distance matrices to be interpreted as ordinary contact maps.

Run tsne callbacks in a subprocess

Could make a t-SNE plotting module.

Make impl functions for 2d and 3d that can be called as normal Python functions.
Pass in an h5 or npy file path to the embedding coordinates and have the function save the plot to save_path, write to tensorboard, wandb, etc.
EX: plot_tsne2d_impl(embeddings_path, save_path, wandb=None, tensorboard_writer=None, **kwargs)

Then make a click interface to the 2d, 3d tsne impls. '2d' vs '3d' can be a CLI param.

Now make another function that uses the subprocess module to call the click CLI (below)

def plot_tsne(embeddings_path, save_path, plot_dim, subprocess=False, **kwargs):
    # Note: the subprocess flag shadows the subprocess module inside this function.
    if subprocess:
        # call click CLI with subprocess module
        ...
    elif plot_dim == '2d':
        plot_tsne2d_impl(...)
    elif plot_dim == '3d':
        plot_tsne3d_impl(...)

Note: both the click interface and the subprocess interface are very small functions... just passing args basically.

This makes the callbacks very simple. They only need to save the embeddings to disk as a npy file (or something) and then call the plot_tsne function with subprocess=True. In fact, this way we only need 1 tsne callback and we can specify 2d,3d,both as an input parameter. This way the embeddings file won't get saved twice.

Putting 2d and 3d together is not the most general approach. For example, what if a future callback also needs the embeddings? There is probably a better solution for saving the embeddings for use with multiple callbacks...
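
A minimal sketch of the subprocess side of this idea, using only the standard library. The module path molecules.plot.tsne_cli and the flag names are hypothetical placeholders for whatever the click interface ends up exposing. The callback would save the embeddings to disk and then call launch_tsne without waiting, so that training is not blocked by plotting.

```python
import subprocess
import sys


def build_tsne_command(embeddings_path, save_path, plot_dim,
                       module="molecules.plot.tsne_cli"):
    """Build the argv list for the t-SNE CLI (module and flag names are assumptions)."""
    if plot_dim not in ("2d", "3d"):
        raise ValueError(f"plot_dim must be '2d' or '3d', got {plot_dim!r}")
    return [
        sys.executable, "-m", module,
        "--embeddings_path", str(embeddings_path),
        "--save_path", str(save_path),
        "--plot_dim", plot_dim,
    ]


def launch_tsne(embeddings_path, save_path, plot_dim, wait=False):
    """Start the plotting CLI in a child process and return the Popen handle."""
    proc = subprocess.Popen(build_tsne_command(embeddings_path, save_path, plot_dim))
    if wait:
        proc.wait()
    return proc
```

Keeping the command construction separate from the launch makes the argument passing easy to check without starting a process.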

Encoder/Decoder base classes

Would enforce the interface for all models.

resnet vae latent dim restriction

I have an idea for how to get rid of the weird resnet restrictions on the latent space dim.

Instead of flattening the vector, average pool over the spatial dims and feed just the filter outputs into the next layer.

That way the latent space dim does not change with image size, and you can adjust it by adjusting the filter dim.

You might need to add more filters to capture the spatial features you lose, but that could work too.
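
To illustrate why average pooling removes the dependence on image size, here is a framework-free sketch that works on nested lists. The function name global_average_pool is made up for this example. In PyTorch the same effect would typically come from nn.AdaptiveAvgPool2d(1) followed by a flatten.

```python
def global_average_pool(feature_maps):
    """Average each channel over its spatial dims.

    feature_maps is a list of C channels, each an H x W list of lists.
    The result always has length C, whatever H and W are.
    """
    pooled = []
    for channel in feature_maps:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two channels at 2x2 and two channels at 4x5 both pool to a length-2 vector.
small = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
large = [[[2.0] * 5 for _ in range(4)], [[4.0] * 5 for _ in range(4)]]
print(global_average_pool(small), global_average_pool(large))
```

The pooled vector length equals the number of filters in the last conv layer, which is why the latent input size is then controlled by the filter count rather than by the image size.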

display helpful value error if no files are found in directory for traj_to_dset.py
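
One way this check could look, as a sketch only. The function name find_trajectory_files and the default "*.dcd" pattern are assumptions, not the current behavior of traj_to_dset.py.

```python
from pathlib import Path


def find_trajectory_files(directory, pattern="*.dcd"):
    """Return the sorted files matching pattern, or raise a descriptive ValueError."""
    directory = Path(directory)
    if not directory.is_dir():
        raise ValueError(f"Trajectory directory does not exist: {directory}")
    files = sorted(directory.glob(pattern))
    if not files:
        raise ValueError(f"No files matching {pattern!r} found in {directory}")
    return files
```

Failing early with the directory and pattern in the message is easier to debug than an error raised later from an empty file list.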

Add plotly plot to wandb log

Add interactive tsne visualization chart to molecules.plot
wandb allows you to pass a custom html string

TODO: play around with this to test it
use the plot figure object and call to_html()
https://plotly.com/python-api-reference/generated/plotly.graph_objects.Figure.html#plotly.graph_objects.Figure.to_html

Then use wandb api:
https://docs.wandb.com/library/log#media

wandb.log({"custom_string": wandb.Html(fig.to_html())})

Then save html file to disk:
plotly.offline.plot(fig, filename='plot.html')

HDF5 intermittent failure 2

Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 259, in main
    cms_transform=False,
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 118, in get_dataset
    cms_transform=cms_transform,
  File "/p/gpfs1/brace3/src/molecules/molecules/ml/datasets/point_cloud.py", line 62, in __init__
    with open_h5(self.file_path, 'r', libver = 'latest', swmr = False) as f:
  File "/p/gpfs1/brace3/src/molecules/molecules/utils/read_file.py", line 20, in open_h5
    return h5py.File(h5_file, mode, **kwargs)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (file is already open for write (may use <h5clear file> to clear file consistency flags))

HDF5 intermittent failure

In the dataset class, add a try/except block that retries opening the h5 file. It should retry a parameterized number of times and wait 10 seconds between attempts.

Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 259, in main
    cms_transform=False,
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 118, in get_dataset
    cms_transform=cms_transform,
  File "/p/gpfs1/brace3/src/molecules/molecules/ml/datasets/point_cloud.py", line 62, in __init__
    with open_h5(self.file_path, 'r', libver = 'latest', swmr = False) as f:
  File "/p/gpfs1/brace3/src/molecules/molecules/utils/read_file.py", line 20, in open_h5
    return h5py.File(h5_file, mode, **kwargs)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 48, sblock->base_addr = 0, stored_eof = 2048)
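
A minimal sketch of the retry described in this issue. The helper name open_with_retry and its parameters are hypothetical. It takes any zero-argument callable, so the dataset class could pass something like lambda: open_h5(self.file_path, 'r', libver='latest', swmr=False).

```python
import time


def open_with_retry(opener, retries=5, wait_seconds=10.0, sleep=time.sleep):
    """Call opener() until it succeeds or it has failed `retries` times with OSError."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return opener()
        except OSError as error:
            last_error = error
            # Only wait if another attempt is coming.
            if attempt < retries:
                sleep(wait_seconds)
    raise last_error
```

Retrying is likely to help when the file is still being written, as the truncated-file message suggests. It may not help if a writer exited without closing the file and left the consistency flags set, which is the case the h5clear hint in the other traceback points at.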

Additional unsupervised models

Variational Autoencoders with Inverse Autoregressive Flows:
Allows modelling multi-modal latent distributions as could be present in contact matrix datasets.
https://bjlkeng.github.io/posts/variational-autoencoders-with-inverse-autoregressive-flows/

Add sklearn to setup

Save shuffled data index to h5 in save_embeddings callback

Rectangular contact matrices

Adapt VAE models to account for rectangular matrices.
Write tests for this
Test preprocessing for rectangular matrices

yaml parameter file

Have a --config parameters.yaml option,
then parse the other command-line argument (CLA) parameters
and use them to override the values loaded from parameters.yaml.

The config file is the only required command-line arg.
The others are all optional; check whether they are specified and then conditionally override.

function: process parameters, which does that in a clean way
function: generate yaml files on the fly for hparam sweeps and for documenting model params
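
The override step could be sketched as follows. This assumes the YAML file has already been loaded into a dict (for example with yaml.safe_load) and that every optional CLI option defaults to None when it is not given. The parameter names in the usage lines are invented for illustration.

```python
def process_parameters(file_params, cli_params):
    """Return file_params with every CLI value that was actually given layered on top."""
    merged = dict(file_params)  # copy so the loaded config is left untouched
    for key, value in cli_params.items():
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical parameters: only batch_size was passed on the command line.
from_yaml = {"latent_dim": 10, "batch_size": 128, "epochs": 100}
from_cli = {"latent_dim": None, "batch_size": 64, "epochs": None}
print(process_parameters(from_yaml, from_cli))
```

Using None as the "not specified" marker means a flag cannot be used to set a value to None explicitly, which is usually acceptable for hyperparameters.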

issue with storing all the embedding vectors on the node for the big data.

need to project down to the frequency we are going to use.

Add weight initialization hparam

ConvTranspose2d in PyTorch

TODO: revisit output_padding. This code may not generalize to other examples and needs testing.
See https://github.com/pytorch/pytorch/pull/904/files. One fix would be to store the list of
output_sizes from the encoder conv layers, remove the decoder conv layers from the
nn.Sequential, and store them as a list instead. However, this approach has problems:
output_sizes must be passed in the forward function, so we can't use an nn.Sequential
of conv layers, and the conv layers would be stored as a plain list rather than
as member variables, as nn.Module requires.

TODO referenced in molecules/ml/unsupervised/conv_vae/pytorch_cvae/cvae.py
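
As a possible alternative to passing output_sizes at forward time, the output_padding for each decoder layer can be computed once from the encoder geometry, using the size formulas given in the PyTorch Conv2d and ConvTranspose2d documentation. The sketch below is plain Python for a single spatial dimension, and the helper names are made up for this example.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of a conv layer along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def required_output_padding(size, kernel, stride=1, padding=0, dilation=1):
    """Remainder dropped by the conv's floor division.

    Passing it as output_padding to the mirrored transposed conv restores `size`.
    """
    return (size + 2 * padding - dilation * (kernel - 1) - 1) % stride


def conv_transpose_out_size(size, kernel, stride=1, padding=0, dilation=1,
                            output_padding=0):
    """Output length of a transposed conv layer along one spatial dimension."""
    return ((size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)
```

With kernel 3, stride 2, and padding 1, inputs of length 7 and 8 both shrink to 4. The required output_padding is 0 for 7 and 1 for 8, which is the ambiguity that output_padding exists to resolve.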

"CUDA error: invalid device ordinal" when non 0 gpus are specified.

python examples/pytorch/example_vae.py -i ../data/contact_maps.h5 -o ../output/ -m 3 -t symmetric -e 2 -b 128 -E 1 -D 2
CUDA devices:  1,2
Traceback (most recent call last):
  File "examples/pytorch/example_vae.py", line 141, in <module>
    main()
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "examples/pytorch/example_vae.py", line 81, in main
    gpu=(encoder_gpu, decoder_gpu))
  File "/lambda_stor/homes/abrace/molecules/molecules/ml/unsupervised/vae/vae.py", line 167, in __init__
    self.model = VAEModel(input_shape, hparams, self.device)
  File "/lambda_stor/homes/abrace/molecules/molecules/ml/unsupervised/vae/vae.py", line 32, in __init__
    self.decoder.to(device.decoder)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 443, in to
    return self._apply(convert)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 441, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

Branch: feature/multi-gpu-vae

This happens when we set the encoder gpu to 1 and the decoder gpu to 2 (-E 1, -D 2). Any idea what is going on here? @atrifan2

Settings that work: -E 0 -D 1, -E 1 -D 0
Settings that don't: -E 0 -D 2, -E 1 -D 1
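
One hypothesis, which is an assumption since the example script is not shown here: the "CUDA devices: 1,2" line suggests that the script restricts the visible GPUs, for instance through CUDA_VISIBLE_DEVICES. In that case CUDA renumbers the visible devices starting from 0, so with two visible GPUs only ordinals 0 and 1 are valid and a request for device 2 is out of range. The sketch below shows the remapping in plain Python. logical_device_index is a hypothetical helper, and this may not explain every failing combination listed above.

```python
def logical_device_index(physical_id, visible_devices):
    """Map a physical GPU id to the ordinal CUDA exposes for a CUDA_VISIBLE_DEVICES string."""
    visible = [int(d) for d in visible_devices.split(",") if d.strip()]
    if physical_id not in visible:
        raise ValueError(f"GPU {physical_id} is not in the visible set {visible}")
    return visible.index(physical_id)


# With "1,2" visible, physical GPU 1 becomes cuda:0 and physical GPU 2 becomes cuda:1.
print(logical_device_index(1, "1,2"), logical_device_index(2, "1,2"))
```

If this is the cause, the model would need to be moved to the remapped ordinals rather than to the physical ids passed with -E and -D.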

Perhaps this could help.

1. Take random sample of train and validation contact matrices.
2. Compute RMSD to native state and fraction of native contacts for sampled matrices.
   2.1 Use openmm RMSD and fraction of native contacts reporters.
3. Make a callback which stores a matplotlib plot as a member variable. The plot has the axis/labels and color map formatted at initialization of the callback. The only thing that changes when the callback is called during training is the coordinates of the t-SNE embeddings. Then during training we can observe the points moving. Draw train samples as circles and validation samples as triangles.
4. Save dictionary of embeddings and indices to disk, save plot png to disk.
5. Add images to tensorboard during training.

Bonus: Make a movie of the embeddings in bokeh

def generate_embeddings(
    hparams_path: PathLike,
    checkpoint_path: PathLike,
    input_path: PathLike,
    input_shape: Tuple[int, ...],
    device: str = "cpu",
    batch_size: int = 528,
    dataset_name: str = "point_cloud",
):
    ...  # returns the embeddings