Hi Francois,
Comments and contributions are welcome.
Even auditing the code for correctness is welcome since it is
doing some scientific computation.
I have some stylistic comments.
- Why do you have the "RobustSmilesMolSupplier"?
The two differences I see from the built-in SMILES supplier are that (a) it doesn't skip the first line and (b) it gives access to the name even if the SMILES cannot be parsed.
However, titleLine=False handles case (a) and the return values are ignored if case (b) occurs.
-
A function like "how_many_conformers()" is by convention more often named "get_num_conformers()".
-
As a minor point, you use "n_confs" and "nb_rot_bonds" and "nprocs". You might want to be consistent.
-
You might want to look into using argparse to parse your command-line. It would look something like:
import argparse
parser = argparse.ArgumentParser(
description="generate up to N low energy conformers from 2D input")
parser.add_argument("n_confs", metavar="N", type=int, help="number of conformers")
parser.add_argument("smiles_filename", metavar="input.smi")
parser.add_argument("sdf_filename", metavar="output.sdf")
def main(argv=None):
args = parser.parse_args(argv)
if args.N < 1:
parser.error("N must be positive")
n_confs = args.n_confs
smiles_filename = args.smiles_filename
sdf_filename = args.sdf_filename
if name == "main":
main()
This would also give you a "-h"/"--help" option.
If you do this then it's easy to switch to having a default value for N, and to allow people to specify a different value using "-n" or some other command-line option. Similarly, you can lets users set "rmsd_threshold" using "-r"/"--rmsd-threshold" and "nprocs" using "-j". It would look something like:
parser = argparse.ArgumentParser(
description="generate up to N low energy conformers from 2D input")
parser.add_argument("-n", "--n-confs", metavar="N", type=int, default=20,
help="number of conformers (default: 20)")
parser.add_argument("-r", "--rmsd-threshold", metavar="FLT", type=float, default=0.35,
help="minimum RMSD to distinguish two conformers")
parser.add_argument("-j", "--n-jobs", metavar="N", type=int, default=1,
help="number of jobs (default: 1)")
parser.add_argument("smiles_filename", metavar="input.smi")
parser.add_argument("sdf_filename", metavar="output.sdf")
However, I see that "nprocs" doesn't actually do anything at this point.
The "-j" option is short for "number of jobs" and is the command-line argument often used for this value.
-
You can also write "while len(kept) < n_confs and len(conf_energies) > 0:" as "while len(kept) < n_confs and conf_energies:". The difference is "must have at least 1 energy" vs. "must have energies".
-
You might consider moving the processing code into its own function so that people can import the code as a library and call the function directly:
def generate_conformers(reader, writer, n_confs=20, rmsd_threshold=0.35, verbose=True)
for name, mol in reader:
if mol is None:
continue
...
if verbose:
print("generating starting conformers ...")
...
writer.write(res, confId=cid)
def main():
...
with closing(Chem.SDWriter(output_sdf)) as writer:
reader = RobustSmilesMolSupplier(input_smi)
generate_conformers(reader, writer, n_confs, rmsd_threshold)
(I usually have a "--quiet" option in my tools to disable verbose progress reporting, which would pass in verbose=False.)
-
You might consider renaming "rmsd_filter()" to "remove_neighbors()" because "rmsd_filter" is very generic.
-
Also, if you change "(e, conf) = conf_energies.pop(0)" to "(e, ref_conf) = conf_energies.pop(0)" then it's more clear that you want to use the first conformer as the reference conformer.
Cheers,