
molecules's People

Contributors

atrifan2, azrael417, braceal, hengma1001, lee212, yngtodd

molecules's Issues

SystemError: Negative size passed to PyBytes_FromStringAndSize

This happened when I tried to aggregate 240 dcd files across 40 Summit nodes:

jsrun -n 40 -r 1 -a 6 -c 7 -d packed /gpfs/alpine/proj-shared/med110/conda/pytorch/bin/python 
"/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py" 
"-t" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/MD_to_CVAE/tmp.2VrCh27TOx"
 "-p" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/Parameters/input_protein/prot.pdb"
 "-r" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/Parameters/input_protein/prot.pdb"
 "-o" "/gpfs/alpine/proj-shared/med110/hrlee/pasc/exp1/test.200frames.480/MD_to_CVAE/cvae_input.h5" 
"--contact_maps_parameters" "kernel_type=threshold,threshold=16" "-s" "protein and name CA" "--rmsd" "--fnc"
"--contact_map" "--point_cloud" "--num_workers" "2" "--distributed" "--verbose"

and the error:

Traceback (most recent call last):
  File "/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py", line 99, in <module>
    main()
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/alpine/proj-shared/med110/conda/pytorch/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/gpfs/alpine/proj-shared/med110/hrlee/git/braceal/molecules/scripts/traj_to_dset.py", line 94, in main
    sel=selection, cm_format=cm_format, num_workers=num_workers, comm=mpi_comm, verbose=verbose)
  File "/gpfs/alpine/med110/proj-shared/hrlee/git/braceal/molecules/molecules/sim/dataset.py", line 547, in traj_to_dset
    rows_ = comm.gather(rows_, 0)
  File "mpi4py/MPI/Comm.pyx", line 1262, in mpi4py.MPI.Comm.gather
  File "mpi4py/MPI/msgpickle.pxi", line 680, in mpi4py.MPI.PyMPI_gather
  File "mpi4py/MPI/msgpickle.pxi", line 685, in mpi4py.MPI.PyMPI_gather
  File "mpi4py/MPI/msgpickle.pxi", line 148, in mpi4py.MPI.Pickle.allocv
  File "mpi4py/MPI/msgpickle.pxi", line 139, in mpi4py.MPI.Pickle.alloc
SystemError: Negative size passed to PyBytes_FromStringAndSize

I tried adding an exception handler around line 547 that sets rows_ and cols_ to 0 so that corrupted data is ignored, but that does not seem to be a correct patch. I will dig further, but wanted to report this first.
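
One possible cause, stated here as an assumption rather than a diagnosis: the lowercase comm.gather pickles each rank's object, and a negative size can appear when the byte count of the gathered pickles no longer fits in a signed 32-bit integer. The sketch below only uses the standard library and shows one way to split a list into chunks whose approximate pickled size stays under a budget, so that each chunk could be gathered in its own call. The names pickled_size and split_for_gather are hypothetical and not part of molecules or mpi4py.

```python
import pickle
from typing import Any, Iterator, List

# Largest byte count that fits in a signed 32-bit C int.
INT32_MAX = 2**31 - 1


def pickled_size(obj: Any) -> int:
    """Return the number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def split_for_gather(
    items: List[Any], max_bytes: int = INT32_MAX // 2
) -> Iterator[List[Any]]:
    """Yield consecutive chunks of items whose summed pickled size is <= max_bytes.

    The sum of per-item sizes only approximates the pickled size of the chunk,
    so max_bytes should leave some headroom. A single item larger than
    max_bytes is still yielded, alone in its own chunk.
    """
    chunk: List[Any] = []
    chunk_bytes = 0
    for item in items:
        size = pickled_size(item)
        if chunk and chunk_bytes + size > max_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(item)
        chunk_bytes += size
    if chunk:
        yield chunk
```

Because the root rank receives the data of every rank, the per-rank budget would have to be roughly the 32-bit limit divided by the number of ranks. Each chunk could then be passed to comm.gather(chunk, 0) in a loop, with all ranks agreeing on the number of rounds.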

Testing h5py

Write a small benchmark that reads the data through the data loader, loads the same samples manually from the file, and compares the two.

  1. Run the PyTorch data loader, then load the same samples manually from the file and check them with np.array_equal.

  2. If a variant can run like that for a few epochs without issues, it should be fine.

Test with and without the Fletcher checksum.

Run tsne callbacks in a subprocess

Could make a tsne plotting module.

Make impl functions for 2d and 3d that can be called as normal Python functions.
Pass in an h5 or npy file path to the embedding coordinates and have the function save the plot to save_path, write to tensorboard, wandb, etc.
EX: plot_tsne2d_impl(embeddings_path, save_path, wandb=None, tensorboard_writer=None, **kwargs)

Then make a click interface to the 2d, 3d tsne impls. '2d' vs '3d' can be a CLI param.

Now make another function that uses the subprocess module to call the click CLI (below)

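def plot_tsne(embeddings_path, save_path, plot_dim, subprocess=False, **kwargs):
    # Note: the subprocess flag shadows the subprocess module inside this
    # function, so the module would need to be imported under an alias.
    if subprocess:
        # call the click CLI with the subprocess module
        ...
    elif plot_dim == '2d':
        plot_tsne2d_impl(...)
    elif plot_dim == '3d':
        plot_tsne3d_impl(...)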

Note: both the click interface and the subprocess interface are very small functions... just passing args basically.

This makes the callbacks very simple. They only need to save the embeddings to disk as a npy file (or something) and then call the plot_tsne function with subprocess=True. In fact, this way we only need 1 tsne callback and we can specify 2d,3d,both as an input parameter. This way the embeddings file won't get saved twice.

Putting the 2d,3d together is not the most general approach e.g. what if a future callback also needs the embeddings. There is probably a better solution to saving the embeddings for use with multiple callbacks...
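
The subprocess branch described above could look roughly like the following. This is a minimal sketch under assumptions: the module path molecules.plot.tsne and the flags --embeddings_path, --save_path and --plot_dim are hypothetical names for a click CLI that does not exist yet.

```python
import subprocess
import sys
from typing import List


def build_tsne_command(
    embeddings_path: str,
    save_path: str,
    plot_dim: str,
    module: str = "molecules.plot.tsne",  # hypothetical CLI module
) -> List[str]:
    """Build the argument list that invokes the (assumed) t-SNE click CLI."""
    if plot_dim not in ("2d", "3d"):
        raise ValueError(f"plot_dim must be '2d' or '3d', got {plot_dim!r}")
    return [
        sys.executable, "-m", module,
        "--embeddings_path", embeddings_path,  # hypothetical flag
        "--save_path", save_path,              # hypothetical flag
        "--plot_dim", plot_dim,                # hypothetical flag
    ]


def plot_tsne_in_subprocess(
    embeddings_path: str, save_path: str, plot_dim: str
) -> subprocess.Popen:
    """Start the plotting CLI without blocking the caller (e.g. the training loop)."""
    cmd = build_tsne_command(embeddings_path, save_path, plot_dim)
    return subprocess.Popen(cmd)
```

Returning the Popen handle lets the callback decide whether to wait for the plot or let it finish in the background.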

resnet vae latent dim restriction

I have an idea for how you can get rid of the weird resnet restrictions on the latent space dim.

Instead of flattening the vector, average pool over the spatial dims and just take the filter outputs to feed into the next layer.

That way your latent space dim does not change with image size, and you can adjust it by adjusting that filter dim.

You might need to add more filters to capture the spatial features you lose, but that could work too.
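
To illustrate the shape argument, here is a small NumPy sketch (NumPy is used only for illustration; it assumes feature maps laid out as batch, channels, height, width, and the function name is hypothetical). Averaging over the spatial axes leaves one value per filter, so the output width depends on the number of filters and not on the image size.

```python
import numpy as np


def global_average_pool(feature_maps: np.ndarray) -> np.ndarray:
    """Average over the spatial axes: (batch, channels, H, W) -> (batch, channels)."""
    return feature_maps.mean(axis=(2, 3))


# Two inputs with different spatial sizes but the same number of filters.
small = np.ones((4, 64, 8, 8))
large = np.ones((4, 64, 22, 22))

pooled_small = global_average_pool(small)
pooled_large = global_average_pool(large)
```

Both pooled arrays have shape (4, 64), whereas flattening would give 64 * 8 * 8 and 64 * 22 * 22 features respectively.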

Add plotly plot to wandb log

Add interactive tsne visualization chart to molecules.plot
wandb allows you to pass a custom html string

TODO: play around with this to test it
use the plot figure object and call to_html()
https://plotly.com/python-api-reference/generated/plotly.graph_objects.Figure.html#plotly.graph_objects.Figure.to_html

Then use wandb api:
https://docs.wandb.com/library/log#media

wandb.log({"custom_string": wandb.Html(fig.to_html())})

Then save html file to disk:
plotly.offline.plot(fig, filename='plot.html')

HDF5 intermittent failure 2

Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 259, in main
    cms_transform=False,
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 118, in get_dataset
    cms_transform=cms_transform,
  File "/p/gpfs1/brace3/src/molecules/molecules/ml/datasets/point_cloud.py", line 62, in __init__
    with open_h5(self.file_path, 'r', libver = 'latest', swmr = False) as f:
  File "/p/gpfs1/brace3/src/molecules/molecules/utils/read_file.py", line 20, in open_h5
    return h5py.File(h5_file, mode, **kwargs)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (file is already open for write (may use <h5clear file> to clear file consistency flags))

HDF5 intermittent failure

In the dataset class, add a try/except block that retries opening the h5 file. It should retry a parameterized number of times and wait 10 seconds between attempts.

Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 259, in main
    cms_transform=False,
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 118, in get_dataset
    cms_transform=cms_transform,
  File "/p/gpfs1/brace3/src/molecules/molecules/ml/datasets/point_cloud.py", line 62, in __init__
    with open_h5(self.file_path, 'r', libver = 'latest', swmr = False) as f:
  File "/p/gpfs1/brace3/src/molecules/molecules/utils/read_file.py", line 20, in open_h5
    return h5py.File(h5_file, mode, **kwargs)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 48, sblock->base_addr = 0, stored_eof = 2048)
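
The retry described in this issue could be sketched as follows. The helper name open_with_retry and its parameters are hypothetical; the opener is passed in as a callable so that the same wrapper works for h5py.File or for the repository's open_h5.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def open_with_retry(
    opener: Callable[[], T],
    retries: int = 5,
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call opener() up to `retries` times, sleeping between failed attempts.

    Only OSError is retried, since that is what the tracebacks above raise.
    The last failure is re-raised so the caller still sees the real error.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return opener()
        except OSError:
            if attempt == retries:
                raise
            sleep(wait_seconds)
    raise AssertionError("unreachable")
```

In the dataset class this might be used as open_with_retry(lambda: open_h5(self.file_path, 'r', libver='latest', swmr=False)). Whether retrying is enough depends on why the file is truncated or still open for writing, which the tracebacks alone do not show.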

Rectangular contact matrices

  1. Adapt VAE models to account for rectangular matrices.
  2. Write tests for this
  3. Test preprocessing for rectangular matrices

yaml parameter file

Have a --config parameters.yaml option,
then parse the other command-line parameters
and use them to override the values from parameters.yaml.

config file is the only required command line arg
the others are all optional, check if they are specified and then conditionally overwrite

function: process parameters which does that in a clean way
function: generate yaml files on the fly for hparam sweep and documenting model params
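
A minimal sketch of the override logic, using only the standard library. It uses argparse for brevity even though the scripts in this repository use click, and the option names --batch_size and --latent_dim are placeholders. The config dict is assumed to come from the YAML file, for example via yaml.safe_load.

```python
import argparse
from typing import Any, Dict, List, Optional


def parse_cli(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Parse the command line; every option except --config defaults to None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="path to parameters.yaml")
    parser.add_argument("--batch_size", type=int, default=None)  # placeholder option
    parser.add_argument("--latent_dim", type=int, default=None)  # placeholder option
    return vars(parser.parse_args(argv))


def process_parameters(
    config: Dict[str, Any], cli_args: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of config where explicitly given CLI values take precedence."""
    merged = dict(config)
    for key, value in cli_args.items():
        if value is not None:  # None means the option was not specified
            merged[key] = value
    return merged
```

Using None as the default is what makes "was this option specified?" checkable. The merged dict could then be dumped back to YAML to document the parameters of each run in a sweep.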

ConvTranspose2d in PyTorch

TODO: revisit output_padding. This code may not generalize to other examples and needs testing.
See https://github.com/pytorch/pytorch/pull/904/files. One fix would be to store the list of
output_sizes from the encoder conv layers, remove the decoder conv layers from the
nn.Sequential, and store them as a list instead. However, this approach has problems:
output_sizes must be passed in the forward function, so we can't use an nn.Sequential
of conv layers, and conv layers stored in a plain list are not registered as member
variables the way nn.Module requires.

TODO referenced in molecules/ml/unsupervised/conv_vae/pytorch_cvae/cvae.py
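
As an alternative to passing output_sizes at forward time, the required output_padding can be computed up front from the layer hyperparameters. The sketch below applies the output-size formulas given in the PyTorch documentation for Conv2d and ConvTranspose2d to one spatial dimension; the function names are hypothetical, and it assumes the decoder layer mirrors the encoder layer's kernel, stride, padding and dilation.

```python
def conv_output_size(
    size: int, kernel: int, stride: int = 1, padding: int = 0, dilation: int = 1
) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def required_output_padding(
    input_size: int, kernel: int, stride: int = 1, padding: int = 0, dilation: int = 1
) -> int:
    """output_padding a mirrored transposed convolution needs to recover input_size."""
    encoded = conv_output_size(input_size, kernel, stride, padding, dilation)
    # Size the transposed convolution produces with output_padding=0.
    base = (encoded - 1) * stride - 2 * padding + dilation * (kernel - 1) + 1
    return input_size - base
```

The result is the remainder lost to the floor division in the encoder, so it is always smaller than the stride. For example, a 22-wide input with kernel 3, stride 2 and padding 1 needs output_padding 1, while a 21-wide input needs 0.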

"CUDA error: invalid device ordinal" when non 0 gpus are specified.

python examples/pytorch/example_vae.py -i ../data/contact_maps.h5 -o ../output/ -m 3 -t symmetric -e 2 -b 128 -E 1 -D 2
CUDA devices:  1,2
Traceback (most recent call last):
  File "examples/pytorch/example_vae.py", line 141, in <module>
    main()
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "examples/pytorch/example_vae.py", line 81, in main
    gpu=(encoder_gpu, decoder_gpu))
  File "/lambda_stor/homes/abrace/molecules/molecules/ml/unsupervised/vae/vae.py", line 167, in __init__
    self.model = VAEModel(input_shape, hparams, self.device)
  File "/lambda_stor/homes/abrace/molecules/molecules/ml/unsupervised/vae/vae.py", line 32, in __init__
    self.decoder.to(device.decoder)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 443, in to
    return self._apply(convert)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 441, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

Branch: feature/multi-gpu-vae

This happens when we set the encoder gpu to 1 and the decoder gpu to 2 (-E 1, -D 2). Any idea what is going on here? @atrifan2

Settings that work: -E 0 -D 1, -E 1 -D 0
Settings that don't: -E 0 -D 2, -E 1 -D 1

Perhaps this could help.
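
One hypothesis that is consistent with all four settings listed above, assuming the script sets CUDA_VISIBLE_DEVICES to the requested IDs (the "CUDA devices:  1,2" line suggests this, but it is not confirmed here): once CUDA_VISIBLE_DEVICES is set, the visible GPUs are renumbered from 0, so the physical IDs are no longer valid ordinals. The sketch below shows the remapping; logical_ordinals is a hypothetical helper.

```python
from typing import List, Tuple


def logical_ordinals(requested: List[int]) -> Tuple[str, List[int]]:
    """Map physical GPU IDs to the ordinals they get under CUDA_VISIBLE_DEVICES.

    Returns the value to put in CUDA_VISIBLE_DEVICES and, for each requested
    physical ID, the logical ordinal to use in strings such as "cuda:<n>".
    """
    visible = list(dict.fromkeys(requested))  # de-duplicate, keep order
    position = {physical: index for index, physical in enumerate(visible)}
    env_value = ",".join(str(gpu) for gpu in visible)
    return env_value, [position[gpu] for gpu in requested]


# -E 1 -D 2: only two devices are visible, so ordinal 2 does not exist.
env_value, (encoder_ordinal, decoder_ordinal) = logical_ordinals([1, 2])
```

Under this reading, -E 0 -D 1 and -E 1 -D 0 work only because the physical IDs happen to be valid ordinals among two visible devices, while -E 0 -D 2 and -E 1 -D 1 each ask for an ordinal beyond the visible range.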

t-SNE callback

  1. Take random sample of train and validation contact matrices.
  2. Compute RMSD to native state and fraction of native contacts for sampled matrices.
    2.1 Use openmm RMSD and fraction of native contacts reporters.
  3. Make a callback which stores a matplotlib plot as a member variable. The plot has the axis/labels and color map formatted at initialization of the callback. The only thing that changes when the callback is called during training is the coordinates of the t-SNE embeddings. Then during training we can observe the points moving. Draw train samples as circles and validation samples as triangles.
  4. Save dictionary of embeddings and indices to disk, save plot png to disk.
  5. Add images to tensorboard during training.

Bonus: Make a movie of the embeddings in bokeh

implement a cropping transform

Dataset classes should take an arbitrary composed transform.

Test the cropping transform and add it to the main training scripts.

Add AAE inference function

def generate_embeddings(
    hparams_path: PathLike,
    checkpoint_path: PathLike,
    input_path: PathLike,
    input_shape: Tuple[int, ...],
    device: str = "cpu",
    batch_size: int = 528,
    dataset_name: str = "point_cloud",
):
    ...

embeddings = generate_embeddings(...)
