balsadoc's Introduction

Balsa

This working paper describes Balsa, a compact molecular line notation based on SMILES. For an introduction, see the blog post. A draft working paper is available on ChemRxiv. A reference implementation is in progress on GitHub. This project was previously hosted on GitHub as Dialect.

Goals

The purpose of the working paper is to fully specify Balsa, a language for molecular serialization. Balsa's compact string representation makes it suitable for the storage, retrieval, and manipulation of molecular structures.

Balsa is grounded in the concept of a language subset. A language subset contains some of the features of its parent, but adds none of its own. This means that in principle every feature found in Balsa will also be found in SMILES. The opposite is, however, not true. Existing SMILES implementations should in general be capable of reading all Balsa strings. Balsa readers, however, may not be able to read every SMILES string. As a language subset, Balsa subtracts features from SMILES due to obsolescence, errors in specification, low utility, or ambiguity.

There is no single, widely-recognized SMILES specification, a shortfall that poses barriers to users and developers alike. This problem was resolved through the introduction of ProtoSMILES. As described in the manuscript, ProtoSMILES is the language defined by David Weininger in a 2003 book chapter. Because it is based on an authoritative, substantial source, ProtoSMILES should span every feature likely to be considered part of SMILES. Defining Balsa as a subset of ProtoSMILES, rather than "SMILES," bypasses the problem of defining SMILES in the first place.

Balsa is closed to further modification, but open to further extension. Ideas for extending Balsa are discussed, but the core language will remain as defined in the manuscript. The main reason is that ProtoSMILES offers no support whatsoever for versioning, a limitation Balsa must inherit. Allowing an unversionable language to be extended would just create many of the same problems that Balsa is trying to address.

To ensure maximum compatibility of Balsa-branded readers and writers, certain aspects of implementation are addressed.

Non-Goals

  • to add any feature to SMILES
  • to support SMARTS, SMIRKS, Reaction SMILES, or SMILES File compatibility
  • to preserve any aspect of SMILES that is ambiguous, redundant, or self-contradictory
  • to prescribe any particular style of string
  • to prescribe or require any form of canonicalization
  • to maintain compatibility with any single SMILES implementation (e.g., Daylight Toolkit, OpenEye, or JChem)
  • to define a path for base language extensions, other than new element symbols
  • to preserve any legacy terms or concepts used previously for SMILES
  • to leave any aspect of Balsa undefined, regardless of perceived importance

FAQ

  • Will Balsa support multi-center bonding? No. Balsa is based on the valence bond (VB) model, which views a bond as a feature consisting of two atoms and a nonzero, even electron count drawn equally from each atom. This simplification is key to Balsa's brevity. Any structure compatible with the VB model can be encoded and decoded through Balsa without information loss. Other structures can be encoded and decoded using more capable (and verbose) methods should they become available.

Building the PDF

Given pdflatex and BibTeX installations, a PDF can be built with:

./bin/build.sh

balsadoc's Issues

Figure 26 should be graphic over table

The graphic is the tree, which appears on top. The table appears below the graphic and has three columns: Step, Action, and Stack.

This can be done in LaTeX:

% ...

\begin{tabular}{c}
    % the graphics
    \includegraphics[width=2in]{name.pdf}
\end{tabular}

\bigskip

\begin{tabular}{l l l}
    % the actual table, one column each for Step, Action, and Stack
    Step & Action & Stack \\
\end{tabular}

% ...

Selection algorithm passage should point out arbitrary nature.

Selected and unselected forms of a molecular tree are fully equivalent to each other. The passage should therefore point out that the selection algorithm itself is an implementation detail only, provided that it produces a delocalization subgraph (DS) with a perfect matching after pruning.

Mismatch between text and production rule

Earlier in the text, it is stated that:

A selected atom must use one of the following five elements: B; C; N; O; P; or S.

But later the selection production rule was defined without "b".

<selection> ::= "c" | "n" | "o" | "p" | "s"

Add parser generator test harness

It should be possible to directly demonstrate the functionality of the grammar by running a test suite over test strings. Running this test suite as changes are made can prevent regressions and syntax errors in the grammar itself as well.
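As a rough illustration of the kind of harness this issue asks for, the sketch below runs a parser over lists of strings that should be accepted and rejected, and collects any disagreements. The names `run_suite` and `toy_parse` are hypothetical, and `toy_parse` is only a stand-in for a generated parser: it accepts runs of the lowercase selected-atom symbols and nothing else, which is not the Balsa grammar.

```python
import re
from typing import Callable, Iterable, List


def run_suite(parse: Callable[[str], object],
              valid: Iterable[str],
              invalid: Iterable[str]) -> List[str]:
    """Return a list of failure messages; an empty list means the suite passed."""
    failures = []
    for text in valid:
        try:
            parse(text)
        except ValueError as error:
            failures.append(f"rejected valid string {text!r}: {error}")
    for text in invalid:
        try:
            parse(text)
        except ValueError:
            continue  # rejection is the expected outcome here
        failures.append(f"accepted invalid string {text!r}")
    return failures


def toy_parse(text: str) -> str:
    # Hypothetical stand-in for a parser generated from the grammar.
    if not re.fullmatch(r"[cnops]+", text):
        raise ValueError("not a run of selected-atom symbols")
    return text


if __name__ == "__main__":
    for message in run_suite(toy_parse, ["c", "cnops"], ["", "b", "C"]):
        print(message)
```

Swapping `toy_parse` for a real parser entry point would let the same lists of strings act as a regression suite each time the grammar changes.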

Candidates include:

Ensure use of Clockwise and Counterclockwise is consistent

Usage is currently inconsistent. Table 3 lists the two atom parity variants as Clockwise and Counterclockwise. But the Syntax section says, "... corresponding to the values Left and Right respectively."

Pick one pair of variants and use it consistently.

Revise section headings

Current organization:

  • Introduction
  • Goals and Overall Design
  • Molecular Tree
  • Atoms and Bonds
  • Electron Counting
  • Delocalization Subgraph
  • Valence and Subvalence
  • Computing Implicit Hydrogen Count
  • Conformation
  • Configuration
  • Syntax (might be broken down further)
    • Atom
    • Sequence
  • Reading Strings
  • Writing Strings
  • Working with Molecular Graphs
  • Compatibility
  • Discussion
  • Conclusion

These topics could be broken down more evenly like so:

  • Introduction
  • Goals
  • Molecular Representation (starts with last paragraph of Goals...)
    • Molecular Tree
    • Atoms and Bonds
    • Electron Counting
    • Delocalization Subgraph
    • Valence and Subvalence
    • Computing Implicit Hydrogen Count
    • Conformation
    • Configuration
  • Syntax
    • Grammar
    • Atom
    • Sequence
  • Implementation
    • Reading Strings
    • Pruning
    • Writing Strings
    • Working with Molecular Graphs
  • Compatibility
  • Discussion
  • Conclusion

Stack defined twice

Stack is defined twice (Reading and Writing). It should only be defined once.

Implicit H handling on nitrogen

[Image: NR4]

Is there any real example of such a pentavalent nitrogen center? There are about 550 examples of N atoms with a coordination number (= "valence") of five in the Cambridge Structural Database, but most of these are clusters with nitrogen bonded to small 1st or 2nd row metals such as lithium. A rare example which does not match this description is here:

[Image: n4]

Much more likely, you will find an ammonium cation:

Valence [= coordination number]: 1+1+1+1=4
Subvalence: 4-4 = 0
Implicit H: 0
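The arithmetic above can be written out as a short sketch. It follows only the calculation shown in this issue, not the manuscript's full definition: the function name `implicit_hydrogens`, the `target_valence` parameter, and the clamping of negative results to zero are all assumptions made for illustration.

```python
from typing import Sequence


def implicit_hydrogens(bond_orders: Sequence[int], target_valence: int) -> int:
    """Compute an implicit hydrogen count from the issue's arithmetic.

    valence    = sum of bond orders to neighboring atoms
    subvalence = target valence - valence
    implicit H = subvalence (assumed here to be clamped at zero)
    """
    valence = sum(bond_orders)
    subvalence = target_valence - valence
    return max(subvalence, 0)


# Ammonium-like center from the issue: four single bonds, target valence 4.
print(implicit_hydrogens([1, 1, 1, 1], 4))
```

For the ammonium case this reproduces the values listed: valence 4, subvalence 0, and no implicit hydrogens.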

Figure 19 incorrect structure.

The right-hand side of Figure 19 should place the H atom at the leftmost child position. The cyan atom should have a "[H]" next to it to indicate virtual hydrogen.

Address parentheses balancing

Parentheses balancing is not defined at the level of syntax, but could be. Readers encountering unbalanced parentheses must report an error. Writers must not write such strings.
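A reader-side check along these lines could look like the minimal sketch below. The function name `check_parentheses` and the choice of `ValueError` are assumptions. The check covers balance only; it does not validate branch contents, so an empty branch such as `()` would still pass.

```python
def check_parentheses(text: str) -> None:
    """Raise ValueError if the parentheses in `text` are unbalanced."""
    depth = 0
    for index, char in enumerate(text):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before any matching opener.
                raise ValueError(f"unmatched ')' at position {index}")
    if depth > 0:
        raise ValueError(f"{depth} unclosed '(' at end of string")


check_parentheses("CC(C)C")  # balanced, returns without error
```

A writer could run the same check on its output before returning a string, which would enforce the second requirement as well.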

Figure: DIME

An example molecule showing DIME and the corresponding molecular trees.

Handling of atoms which do not follow octet rule?

How do you want to handle atoms which do not follow the octet rule?

Classical examples of 2nd row compounds are diborane (B2H6) and nitric oxide (NO). How do you want to make sure Balsa handles NO and HNO differently?

And more importantly, how to handle diborane with its 3c2e bonds, which cannot be expressed in terms of Lewis formulas?

https://en.wikipedia.org/wiki/Diborane
https://en.wikipedia.org/wiki/Nitric_oxide
https://en.wikipedia.org/wiki/Nitroxyl

Rework grammar to use W3C EBNF style

This specification appears to be the most coherent and widely-supported of the "EBNF-like" notations.

Its main differences are that angle brackets around production rule names are not allowed, nor are semicolons at the end of rules.

Switching to W3C EBNF will make it possible to use parser generator tooling more easily.
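To show how mechanical the two changes named above are, here is a small sketch that rewrites a single rule. The helper `to_w3c_ebnf` is hypothetical and handles only those two differences (angle brackets around rule names and a trailing semicolon); quoted terminals are left untouched, and any other notational differences are out of scope.

```python
import re

# Match either a double-quoted terminal (left alone) or an <angle-bracketed> rule name.
_TOKEN = re.compile(r'"[^"]*"|<([A-Za-z][A-Za-z0-9_-]*)>')


def to_w3c_ebnf(rule: str) -> str:
    """Strip angle brackets from rule names and drop a trailing semicolon."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        # group(1) is None when the match was a quoted terminal.
        return name if name is not None else match.group(0)

    converted = _TOKEN.sub(replace, rule.strip())
    if converted.endswith(";"):
        converted = converted[:-1].rstrip()
    return converted


print(to_w3c_ebnf('<selection> ::= "c" | "n" | "o" | "p" | "s"'))
# prints: selection ::= "c" | "n" | "o" | "p" | "s"
```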

Electron counting does not update referenced atom

As described, the procedure deducts double the electrons it should from the referenced atom.

The referenced atom's electron count will be deducted when it is added as a parent and then bonded to its own referenced atom.

It could be that the concept of referenced atom is no longer needed if this use is dropped.

Figure: Pool

It should illustrate exactly how index reuse works. Two levels deep should do it.
