smi2sdf3d's Introduction

smi2sdf3d

3D diverse conformers generation using rdkit

Algorithm and parameters come from:

@article{DBLP:journals/jcisd/EbejerMD12,
  author    = {Jean{-}Paul Ebejer and Garrett M. Morris and
               Charlotte M. Deane},
  title     = {Freely Available Conformer Generation Methods:
               How Good Are They?},
  journal   = {Journal of Chemical Information and Modeling},
  volume    = {52},
  number    = {5},
  pages     = {1146--1158},
  year      = {2012},
  url       = {https://doi.org/10.1021/ci2004658},
  doi       = {10.1021/ci2004658},
}

smi2sdf3d's People

Contributors

Stargazers

Watchers

smi2sdf3d's Issues

try rdkit's ETKDG

would allow to throw away the FF optimization pass, if I understand well

add visualize option

show all conformers of one molecule in a separate pymol
if that's possible

add ocaml driver for CLI and parallelization

faster rmsd pruning

it might be possible to use the lower bound of rmsd to do the pruning:
if the lower bound is already over the threshold, then we don't need to compute the exact rmsd

run on the dataset from Ebejer and monitor quality

quality, diversity and speed, as in their paper

optional alignment by crippen contribs

cf. https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/rdkit_3d.ipynb
for code example

paralellize on a per molecule basis

parallelize at molecule granularity level

each molecule in the input file should be handled by a different process

proper parsing of CLI args

Hi Francois,

Comments and contributions are welcome.

Even auditing the code for correctness is welcome since it is
doing some scientific computation.

I have some stylistic comments.

Why do you have the "RobustSmilesMolSupplier"?

The two differences I see from the built-in SMILES supplier are that (a) it doesn't skip the first line and (b) it gives access to the name even if the SMILES cannot be parsed.

However, titleLine=False handles case (a) and the return values are ignored if case (b) occurs.

A function like "how_many_conformers()" is by convention more often named "get_num_conformers()".
As a minor point, you use "n_confs" and "nb_rot_bonds" and "nprocs". You might want to be consistent.
You might want to look into using argparse to parse your command-line. It would look something like:

import argparse

parser = argparse.ArgumentParser(
description="generate up to N low energy conformers from 2D input")
parser.add_argument("n_confs", metavar="N", type=int, help="number of conformers")
parser.add_argument("smiles_filename", metavar="input.smi")
parser.add_argument("sdf_filename", metavar="output.sdf")

def main(argv=None):
args = parser.parse_args(argv)
if args.N < 1:
parser.error("N must be positive")
n_confs = args.n_confs
smiles_filename = args.smiles_filename
sdf_filename = args.sdf_filename

if name == "main":
main()

This would also give you a "-h"/"--help" option.

If you do this then it's easy to switch to having a default value for N, and to allow people to specify a different value using "-n" or some other command-line option. Similarly, you can lets users set "rmsd_threshold" using "-r"/"--rmsd-threshold" and "nprocs" using "-j". It would look something like:

parser = argparse.ArgumentParser(
description="generate up to N low energy conformers from 2D input")
parser.add_argument("-n", "--n-confs", metavar="N", type=int, default=20,
help="number of conformers (default: 20)")
parser.add_argument("-r", "--rmsd-threshold", metavar="FLT", type=float, default=0.35,
help="minimum RMSD to distinguish two conformers")
parser.add_argument("-j", "--n-jobs", metavar="N", type=int, default=1,
help="number of jobs (default: 1)")
parser.add_argument("smiles_filename", metavar="input.smi")
parser.add_argument("sdf_filename", metavar="output.sdf")

However, I see that "nprocs" doesn't actually do anything at this point.

The "-j" option is short for "number of jobs" and is the command-line argument often used for this value.

You can also write "while len(kept) < n_confs and len(conf_energies) > 0:" as "while len(kept) < n_confs and conf_energies:". The difference is "must have at least 1 energy" vs. "must have energies".
You might consider moving the processing code into its own function so that people can import the code as a library and call the function directly:

def generate_conformers(reader, writer, n_confs=20, rmsd_threshold=0.35, verbose=True)
for name, mol in reader:
if mol is None:
continue
...
if verbose:
print("generating starting conformers ...")
...
writer.write(res, confId=cid)

def main():
...
with closing(Chem.SDWriter(output_sdf)) as writer:
reader = RobustSmilesMolSupplier(input_smi)
generate_conformers(reader, writer, n_confs, rmsd_threshold)

(I usually have a "--quiet" option in my tools to disable verbose progress reporting, which would pass in verbose=False.)

You might consider renaming "rmsd_filter()" to "remove_neighbors()" because "rmsd_filter" is very generic.
Also, if you change "(e, conf) = conf_energies.pop(0)" to "(e, ref_conf) = conf_energies.pop(0)" then it's more clear that you want to use the first conformer as the reference conformer.

Cheers,

			Andrew
			[email protected]

unixjunkie / smi2sdf3d Goto Github PK

smi2sdf3d's Introduction

smi2sdf3d

smi2sdf3d's People

Contributors

Stargazers

Watchers

Forkers

smi2sdf3d's Issues

Recommend Projects

Recommend Topics

Recommend Org