metamolecular / balsadoc

Specification of Balsa, a chemical line notation based on SMILES.

License: MIT License

balsadoc's Introduction

Balsa

This working paper describes Balsa, a compact molecular line notation based on SMILES. For an introduction, see the blog post. A draft working paper is available on ChemRxiv. A reference implementation is in progress on GitHub. This project was previously hosted on GitHub as Dialect.

Goals

The purpose of the working paper is to fully specify Balsa, a language for molecular serialization. Balsa's compact string representation makes it suitable for the storage, retrieval, and manipulation of molecular structures.

Balsa is grounded in the concept of a language subset. A language subset contains some of the features of its parent, but adds none of its own. This means that in principle every feature found in Balsa will also be found in SMILES. The opposite is, however, not true. Existing SMILES implementations should in general be capable of reading all Balsa strings. Balsa readers, however, may not be able to read every SMILES string. As a language subset, Balsa subtracts features from SMILES due to obsolescence, errors in specification, low utility, or ambiguity.

There is no single, widely recognized SMILES specification, a shortfall that poses barriers to users and developers alike. This problem was resolved through the introduction of ProtoSMILES. As described in the manuscript, ProtoSMILES is the language defined by David Weininger in a 2003 book chapter. Because it is based on an authoritative, substantial source, ProtoSMILES should span every feature likely to be considered part of SMILES. Defining Balsa as a subset of ProtoSMILES, rather than "SMILES," bypasses the problem of defining SMILES in the first place.

Balsa is closed to further modification, but open to further extension. Ideas for extending Balsa are discussed, but the core language will remain as defined in the manuscript. The main reason is that ProtoSMILES offers no support whatsoever for versioning, a limitation Balsa must inherit. Allowing an unversionable language to be extended would just create many of the same problems that Balsa is trying to address.

To ensure maximum compatibility of Balsa-branded readers and writers, certain aspects of implementation are addressed.

Non-Goals

to add any feature to SMILES
to support SMARTS, SMIRKS, Reaction SMILES, or SMILES File compatibility
to preserve any aspect of SMILES that is ambiguous, redundant, or self-contradictory
to prescribe any particular style of string
to prescribe or require any form of canonicalization
to maintain compatibility with any single SMILES implementation (e.g., Daylight Toolkit, OpenEye, or JChem)
to define a path for base language extensions, other than new element symbols
to preserve any legacy terms or concepts used previously for SMILES
to leave any aspect of Balsa undefined, regardless of perceived importance

FAQ

Will Balsa support multi-center bonding? No. Balsa is based on the valence bond (VB) model, which views a bond as a feature consisting of two atoms and a nonzero, even electron count drawn equally from each atom. This simplification is key to Balsa's brevity. Any structure compatible with the VB model can be encoded and decoded through Balsa without information loss. Other structures can be encoded and decoded using more capable (and verbose) methods should they become available.
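The VB constraint described above can be pictured as a small data type. This is a minimal Python sketch, not the reference implementation: the class name `Bond`, its fields, and the requirement that the two atoms be distinct are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    """Hypothetical VB-style bond between two atom indices."""
    source: int
    target: int
    electrons: int  # total electrons in the bond

    def __post_init__(self):
        # Assumption: "two atoms" means two distinct atoms.
        if self.source == self.target:
            raise ValueError("a bond joins two distinct atoms")
        # The count must be nonzero and even so each atom gives an equal share.
        if self.electrons <= 0 or self.electrons % 2 != 0:
            raise ValueError("electron count must be nonzero and even")

    @property
    def per_atom(self):
        """Electrons contributed by each of the two atoms."""
        return self.electrons // 2


double_bond = Bond(0, 1, 4)
print(double_bond.per_atom)  # prints 2
```

A three-center, two-electron bond has no representation in this type, which mirrors the answer given above.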

Building the PDF

Given pdflatex and BibTeX installations, a PDF can be built with:

./bin/build.sh

balsadoc's Issues

Figure: Configurational Descriptor with One Virtual Hydrogen

Show re-orientation of view axis.

Figure 26 should be graphic over table

Graphic is the tree, which appears on top. Table appears below graphic and has 3 columns: Step; Action; and Stack.

This can be done in LaTeX:

% ...

\begin{tabular}{c}
    % the graphics
    \includegraphics[width=2in]{name.pdf}
\end{tabular}

\bigskip

\begin{tabular}{l l l}
% the actual table (columns: Step, Action, Stack)
\end{tabular}

% ...

Figure: Bridged PPB Bond

Again, use the graphical language rather than string syntax.

Figure: Implicit Hydrogen Count

Example molecules with highlighted implicit hydrogen count. Include counter-intuitive cases.

Figure: Partial Parity Bond Placement

Depict what's in the text.

Figure: Deselection

A graphical method to indicate selected status would be nice.

Figure: Testing Bridge Bond Compatibility

Examples of compatible and incompatible pairings for left/right.

All figures should be referenced from text

Figure 4, for example, is not.

Selection algorithm passage should point out arbitrary nature.

Selected and unselected forms of a molecular tree are 100% equivalent to each other. So the passage should point out that the selection algorithm itself is an implementation detail only, provided that it produces DS with perfect matchings after pruning.
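The acceptance condition named here, a DS with a perfect matching, can be tested independently of whichever selection algorithm produced the DS. The sketch below is a brute-force check for small graphs. It assumes the DS is given as atom indices plus bonded pairs, and it does not model pruning. The function name is illustrative.

```python
def has_perfect_matching(atoms, bonds):
    """Backtracking test: can every atom be paired with exactly one bonded neighbor?"""
    remaining = set(atoms)
    if not remaining:
        return True
    first = min(remaining)
    for a, b in bonds:
        if first not in (a, b) or a == b:
            continue
        partner = b if a == first else a
        if partner not in remaining:
            continue
        # Pair `first` with `partner`, then try to match everything left over.
        rest = remaining - {first, partner}
        kept = [(x, y) for x, y in bonds if x in rest and y in rest]
        if has_perfect_matching(rest, kept):
            return True
    return False


six_ring = [(i, (i + 1) % 6) for i in range(6)]
five_ring = [(i, (i + 1) % 5) for i in range(5)]
print(has_perfect_matching(range(6), six_ring))   # prints True
print(has_perfect_matching(range(5), five_ring))  # prints False
```

Any two selection algorithms whose output passes this check would be interchangeable in the sense the issue describes.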

Mismatch between text and production rule

Earlier in the text, it is stated that:

A selected atom must use one of the following five elements: B; C; N; O; P; or S.

But later the selection production rule was defined without "b".

<selection> ::= "c" | "n" | "o" | "p" | "s"
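The mismatch can be confirmed mechanically. This is a minimal Python sketch that assumes the quoted sentence and the production rule are transcribed exactly as shown above. The names `TEXT_ELEMENTS`, `RULE_TERMINALS`, and `missing_terminals` are illustrative.

```python
# Elements named in the quoted sentence.
TEXT_ELEMENTS = {"B", "C", "N", "O", "P", "S"}
# Terminals of the <selection> production rule as quoted above.
RULE_TERMINALS = {"c", "n", "o", "p", "s"}


def missing_terminals(elements, terminals):
    """Return lowercase symbols named in the text but absent from the rule."""
    return sorted({e.lower() for e in elements} - terminals)


print(missing_terminals(TEXT_ELEMENTS, RULE_TERMINALS))  # prints ['b']
```

The quoted sentence also announces five elements while listing six, so the text may need correcting regardless of which way the rule is resolved.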

Figure 27 should be composed as a LaTeX table with top graphic

See #33.

The terms "node" and "atom" should be used consistently.

For example, the passage on selection algorithm notes: "A selection algorithm selects two or more nodes, thereby adding them to the DS." It should instead refer to atoms, because only atoms can be selected.

Add parser generator test harness

It should be possible to directly demonstrate the functionality of the grammar by running a test suite over test strings. Running this test suite as changes are made can prevent regressions and syntax errors in the grammar itself as well.

Candidates include:

Ruby EBNF
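Whatever tool is chosen, the harness itself can stay small. The sketch below shows the shape in Python under loud assumptions: `accepts` is a stand-in for a recognizer generated from the grammar, and it only knows the five selection terminals quoted elsewhere in these issues. The test cases are illustrative, not taken from the specification.

```python
import re

# Stand-in recognizer. A real harness would call a parser generated from
# the Balsa grammar; this toy pattern accepts one selection terminal only.
_SELECTION = re.compile(r"[cnops]")


def accepts(text):
    return _SELECTION.fullmatch(text) is not None


# (input, expected outcome) pairs; hypothetical examples.
CASES = [("c", True), ("n", True), ("b", False), ("", False), ("cc", False)]


def run_suite(recognizer, cases):
    """Return the inputs whose outcome differs from the expectation."""
    return [text for text, expected in cases if recognizer(text) != expected]


failures = run_suite(accepts, CASES)
print("failures:", failures)  # prints failures: []
```

Rerunning such a suite after each grammar change would surface regressions as a non-empty failure list.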

Figure: Electron Counting

Left, center, and right panels should be depicted.

Figure: Gratuitous Selection

Example molecules whose marked atoms would be selected gratuitously.

Figure: PPB Error States

The symbols up arrow/down arrow might be helpful labels on the molecular tree.

Ensure consistent use of parity variants

Discussion in Syntax uses the variants Left and Right. Correct this and any other occurrences.

Ensure use of Clockwise and Counterclockwise is consistent

They are not currently. Table 3 lists the two atom parity variants as Clockwise and Counterclockwise. But under syntax, "... corresponding to the values Left and Right respectively."

Pick one pair of variants and use it consistently.

Figure: Invalid Uses of Configurational Descriptors

Example molecules whose marked atom should not use a configurational descriptor.

Figure: Fastener

Might want to consult the reference implementation to see if this one still applies.

Revise section headings

Current organization:

Introduction
Goals and Overall Design
Molecular Tree
Atoms and Bonds
Electron Counting
Delocalization Subgraph
Valence and Subvalence
Computing Implicit Hydrogen Count
Conformation
Configuration
Syntax (might be broken down further)
- Atom
- Sequence
Reading Strings
Writing Strings
Working with Molecular Graphs
Compatibility
Discussion
Conclusion

These topics could be broken down more evenly like so:

Introduction
Goals
Molecular Representation (starts with last paragraph of Goals...)
- Molecular Tree
- Atoms and Bonds
- Electron Counting
- Delocalization Subgraph
- Valence and Subvalence
- Computing Implicit Hydrogen Count
- Conformation
- Configuration
Syntax
- Grammar
- Atom
- Sequence
Implementation
- Reading Strings
- Pruning
- Writing Strings
- Working with Molecular Graphs
Compatibility
Discussion
Conclusion

Figure: Cyclooctatetraene

Structure drawing of unrepresentable conformation.

Stack defined twice

Stack is defined twice (Reading and Writing). It should only be defined once.

Define characters using ASCII table

To avoid any confusion. List each character together with its ASCII code.
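Such a listing can be generated rather than typed by hand. A minimal Python sketch follows. The sample string is illustrative only (the selection terminals and parentheses mentioned in these issues), not the full Balsa character set.

```python
# Illustrative sample, not the complete alphabet.
SAMPLE = "cnops()"


def ascii_table(chars):
    """Pair each character with its decimal and hexadecimal ASCII code."""
    return [(ch, ord(ch), f"0x{ord(ch):02X}") for ch in chars]


for ch, dec, hexcode in ascii_table(SAMPLE):
    print(f"{ch}\t{dec}\t{hexcode}")
```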

Ensure non-terminals are wrapped by angle brackets

For example, the closing bracket is missing from "<virtual_hydrogen" in the sentence beginning "The virtual hydrogen count is set to zero by omitting...". Correct similar mistakes.

Add introduction to Implementation section

Explain what this section is for. It's to guide implementation by answering specific questions likely to come up.

Implicit H handling on nitrogen

Is there any real example for such a pentavalent nitrogen center? There are about 550 examples of N atoms with a coordination number (= "valence") of five in the Cambridge Structural Database, but most of these are clusters with nitrogen bonded to small 1st or 2nd row metals such as lithium. A rare example which does not match this description is here:

Much more likely, you will find an ammonium cation:

Valence [= coordination number]: 1+1+1+1=4
Subvalence: 4-4 = 0
Implicit H: 0
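The arithmetic above can be written as a short function. This Python sketch assumes the target valence is supplied by the caller (4 here, taken from the "4-4" line), that implicit hydrogen count equals subvalence, and that subvalence is floored at zero. None of these assumptions is taken from the specification itself.

```python
def subvalence(bond_orders, target):
    """Target valence minus the sum of bond orders, floored at zero (assumption)."""
    return max(target - sum(bond_orders), 0)


# Ammonium nitrogen with four single bonds, as in the example above.
print(subvalence([1, 1, 1, 1], 4))  # prints 0
```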

Figure 19 incorrect structure.

The right-hand side of Figure 19 should place the H atom at the leftmost child position. The cyan atom should have a "[H]" next to it to indicate virtual hydrogen.

Figure: Stack for Branches

Example use of a stack to push/pop head atom as open/close parens are read.
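The push/pop behavior the figure would show can be sketched in Python. The simplifying assumptions are strong: every letter is treated as a one-character atom, bonds are implicit, and the input has balanced parentheses. The function name `branch_bonds` is illustrative.

```python
def branch_bonds(text):
    """Return (parent, child) atom-index pairs for a simplified string."""
    stack = []    # saved head atoms, one per open branch
    head = None   # index of the atom the next atom attaches to
    count = 0     # number of atoms read so far
    bonds = []
    for ch in text:
        if ch == "(":
            stack.append(head)   # remember the branch point
        elif ch == ")":
            head = stack.pop()   # return to the branch point
        else:
            if head is not None:
                bonds.append((head, count))
            head = count
            count += 1
    return bonds


print(branch_bonds("CC(O)C"))  # prints [(0, 1), (1, 2), (1, 3)]
```

In the example, atom 1 is pushed at the open parenthesis and popped at the close, so both atom 2 and atom 3 attach to it.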

Figure: Perfect Matching

It might be possible to re-purpose one already found on depth-first.com.

Figure: Stack for Branch Assembly

Figure: Configurational Transformations

Illustrates the five transformations defined in the text.

Address parentheses balancing

Not defined at the level of syntax, but could be. Readers encountering unbalanced parentheses must report an error. Writers must not write such strings.
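A reader-side check could look like the following Python sketch. The function name and error messages are assumptions, and the check covers balance only, not any other rule on where parentheses may appear.

```python
def check_parentheses(text):
    """Raise ValueError if parentheses in `text` are unbalanced."""
    depth = 0
    for position, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at position {position}")
    if depth > 0:
        raise ValueError("unclosed '(' at end of input")


check_parentheses("CC(O)C")  # balanced, so no error is raised
```

A writer could run the same check on its own output before returning a string.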

Figure 22 should be a table with top graphic

See #32 for one way to do this.

Put tree into graphic and put graphic at top of table. The table columns would be: Step; Character; Stack; and Action.

Levels of detail

Intro gives 3 but Conclusion lists 4.

Figure: Selection Algorithm

Maybe two examples, one of which selects gratuitously.

Figure 1 incorrect structure

"C[NH4+]" should be "C[NH3+]". The graphical structure should also be fixed.

Figure: Partial Parity Bond

Highlight all of the bonds that participate in the system.

Figure: Depth-First Traversal

Show traversal order, including bridges and gaps.

Figure: PPB for anti-2-butene

A worked example based on text description.

Figure: DIME

An example molecule showing DIME and the corresponding molecular trees.

Qualify "atomic index" in discussion of meta formats.

Paragraph starting with "Broader expansion..." uses the term.

Add semicolons as production terminators for grammar

Turns out they're required and some tools won't like it if they're missing.

Figure: Configurational Descriptor

Handling of atoms which do not follow octet rule?

How do you want to handle atoms which do not follow the octet rule?

Classical examples of 2nd row compounds are diborane (B2H6) and nitric oxide (NO). How do you want to make sure Balsa handles NO and HNO differently?

And more importantly, how to handle diborane with its 3c2e bonds, which cannot be expressed in terms of Lewis formulas?

https://en.wikipedia.org/wiki/Diborane
https://en.wikipedia.org/wiki/Nitric_oxide
https://en.wikipedia.org/wiki/Nitroxyl

Rework grammar to use W3C EBNF style

This specification appears to be the most coherent and widely-supported of the "EBNF-like" notations.

Its main differences are that angle brackets around production rule names are not allowed, nor are semicolons at the end of rules.

Switching to W3C EBNF will make it possible to use parser generator tooling more easily.
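The two differences noted above are mechanical enough to apply with a script. This Python sketch assumes that terminals are double-quoted with no escaped quotes and that rule names consist of letters, digits, and underscores. The function name `to_w3c_ebnf` is illustrative.

```python
import re


def to_w3c_ebnf(rule):
    """Drop angle brackets around rule names and any trailing semicolon."""
    parts = rule.strip().split('"')
    # Even-indexed parts lie outside quoted terminals, so only those change.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"<([A-Za-z_][A-Za-z0-9_]*)>", r"\1", parts[i])
    converted = '"'.join(parts).rstrip()
    if converted.endswith(";"):
        converted = converted[:-1].rstrip()
    return converted


print(to_w3c_ebnf('<selection> ::= "c" | "n" | "o" | "p" | "s"'))
# prints: selection ::= "c" | "n" | "o" | "p" | "s"
```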

Electron counting does not update referenced atom

As described, the procedure deducts double the electrons it should from the referenced atom.

The referenced atom's electron count will be deducted when it is added as a parent and then bonded to its own referenced atom.

It could be that the concept of referenced atom is no longer needed if this use is dropped.