Giter VIP home page Giter VIP logo

Comments (20)

dmaziuk avatar dmaziuk commented on September 9, 2024

AFAICT this is not a valid loop. So if PyNMRSTAR can write it out like that, that's the bug.

To wit, the 3rd value in the 2nd row is Optional.

from pynmrstar.

dmaziuk avatar dmaziuk commented on September 9, 2024

PS. this is why I added STAR-2012's triple-quoted values to SAS.

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

So if PyNMRSTAR can write it out like that, that's the bug.

Exactly.

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

This is resolved (pending further testing) in the pure python implementation in these commits: 69c3208 and b90667f

I need to implement it in the C library before pushing the update to pip.

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

This warrants further discussion. (@elulrich @perzeus @dmaziuk @kumar-physics ) There are several ways to "resolve" this, each of which has its own benefits and drawbacks. My commit above implements one of them (2), but I think we should have consensus before I push it to the master branch. Here are the options as I see them (in no particular order):

  1. Implement full STAR-2012 support in the library and use triple-quoted values to contain the newline+semi-colon delineated values. If triple quoted values are also inserted, then use the new bell escape sequence to escape them. The major downside of this is that all STAR parsers that don't implement STAR-2012 will fail to parse the file. My understanding is that the BMRB officially releases files in the original STAR format. If we start generating files with STAR-2012 features then other's STAR-parsing libraries will break. We can make the transition official though, and try to encourage other users to use our existing libraries. This could have other drawbacks though - we would have to start officially supporting unicode in files. Dimitri, my understanding from your comment above is that you already use this method when writing STAR which contains STAR in data values? Or do you just support parsing files with triple-quotes?
  • Use a BMRB-defined NMR-STAR specific (not official STAR) method to escape these problematic values. I see at least two different ways to do this:
  1. If a data value contains the string \n; then add 3 space characters before every line in the data value. This way the existing \n; will transform into \n\ \ \ ;. (Interpret \s as - Markdown doesn't preserve multiple spaces in a row.) During the parsing of these files, if a field is semi-colon delineated, contains the string \n\ \ \ ;, and every line starts with at least 3 spaces we can automatically replace all instances of \n\ \ \ with \n. This has the benefit of being visually clear of what is happening, and it works recursively. (I.e. STAR-in-STAR-in-STAR) I believe something like this is what we have used in the past when we have inserted STAR loops as the value of tags? The downside is that if the user inserts data into a tag that contains the literal string \n\ \ \ ; and has every line beginning with at least 3 spaces then the spaces will be removed from their data upon parsing. I think this would be extremely rare, but it is still worth acknowledging. This solution does not break existing parsers, though they would fail to properly interpret the STAR-in-STAR values. (They would not remove the whitespace and therefore if they tried to parse the inner STAR data would encounter an error.) On the other hand, I think most other people have not thoroughly implemented their parsers and therefore simply having the STAR-in-STAR may break their parsers as it is. Furthermore, upon the error visual inspection of the file should make it clear what is happening.

  2. Borrowing from the STAR-2012 specification in a backwards-compatible way, simply replace all instances of \n; in a text value with \n\a; where \a is the BELL control character when printing. Upon parsing replace all \n\a; in a data value with \n;. Like solution 2 this does not break backwards compatibility, though it shares the same downside of preventing the inner-STAR data from being reinterpreted without updating the code to understand it. This solution has a downside in that it doesn't work recursively. To make it work recursively would require additional BELL-escaping logic.

To summarize the pros and cons:

  1. Most robust solution, human readable, recursive, but prevents all existing parsers from parsing the file at all and means we should probably update to full STAR-2012 support everywhere.
  1. Backwards compatible, human readable, recursive, and doesn't break existing parsers from parsing the file. It would only prevent existing parsers from properly parsing the inner-STAR values, but this is an easy fix either in code or by copying and pasting the inner STAR text. I think this also makes it clear when viewing the file what is going on.
  2. Backwards compatible, the BELL sequence makes it clear when viewing the file that something special is happening, not recursive without additional logic that BELL-escapes the BELL character. May be confusing when viewing the file, but the BELL character is more unambiguous than spaces and wouldn't ever* be inserted into an actual data value.

Regardless of what we choose, I think that all BMRB libraries/tools should implement the same behavior and all NMR-STAR files we produce should use the same method.

My vote is for (2). Thoughts?

from pynmrstar.

perzeus avatar perzeus commented on September 9, 2024

from pynmrstar.

perzeus avatar perzeus commented on September 9, 2024

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

Yes, up to this point in time all of the 3.1 files we generate are pure STAR. (While Dimitri's code supports reading in triple-quoted values, I don't see any released entries using that feature.)

We don't use the full STAR feature set, but everything we use is currently a legal subset of it. I have to admit I'm a bit concerned with breaking existing parsers - it seemed to take a very long time to get the community to upgrade to NMR-STAR 3.x from 2. Furthermore, I believe a lot of people in the community use regular-expression based parsers they implemented themselves for simple scripts and updating those to support this might actually be a lot of work for them. (Perhaps that isn't all bad, since when they come to us to complain we could point them to use our robust tools rather than their own implementations.)

This would also mean that all of our existing software would need to be updated including ADIT-NMR.

Nevertheless, I'm happy to go this route if it is the group consensus.

from pynmrstar.

perzeus avatar perzeus commented on September 9, 2024

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

I definitely agree that we should consider adopting some of the features from STAR-2012 when NMR-STAR 4.0 is released. It will be a great opportunity to resolve some of the oddities from STAR.

from pynmrstar.

dmaziuk avatar dmaziuk commented on September 9, 2024

The other option could be to replace \n; in the embedded STAR with \n''' instead of \n\a; -- that way if someone reads it with a STAR-2012 parser it'll "just work" and if not: they have to put the \n; back anyway.

Unfortunately the triple-quotes, like most other STAR-2012 "improvements" are incompatible with previous specification. Including BELL escapes. All existing STAR software out there will need to be rewritten for STAR-2012. Personally I don't see the urgency, as the 2012 specification doesn't actually fix any of the STAR practical "oddities", it just adds more breakage,

from pynmrstar.

perzeus avatar perzeus commented on September 9, 2024

from pynmrstar.

elulrich avatar elulrich commented on September 9, 2024

I vote for option 2, for now. It is not clear how many NEF files will have software specific tags and data so I am guessing that there will be little impact.

Eldon

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

The other option could be to replace \n; in the embedded STAR with \n''' instead of \n\a; -- that way if someone reads it with a STAR-2012 parser it'll "just work" and if not: they have to put the \n; back anyway.

That is an interesting idea, and I see the appeal in that it would work natively for STAR-2012, but I think it could lead to confusion when looking at a file. I think it is more likely that a user would figure out the embedded-STAR needs to be shifted over versus knowing the """ is a delimeter and not part of the tag value. (I mean, it isn't that hard to figure out, but in STAR whitespace is usually not important to proper parsing, whereas quotes always are.) Also, imagine if you put a whole NMR-STAR entry in a tag and only had a STAR parser - with the spaces method you could just copy the embedded-STAR, paste it elsewhere, and shift the text over in your IDE or text editor. With the quote method you would have to go through and remove all the triple quotes with syntactic-awareness. In other words, you would pretty much need to have a STAR-2012 parser.

Finally, if there actually are triple quotes in the data value then they would need to be escaped - which would require the BELL character which we were trying to avoid.

from pynmrstar.

dmaziuk avatar dmaziuk commented on September 9, 2024

Well, you can't win in general, but a global search and replace of \n''' with \n; is actually easier than column-wise editing. Typically multiple quotes used in nucleic acid atom names, like H-double-prime, so you shouldn't see them at the start of a line. (And I don't think they do triple-primes; certainly not triple-double-quotes.)

If someone was silly enough to use triple-quotes in a value they'll get to keep the pieces: at some point you just have to let it break. They may have used a \a, too, who knows.

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

Well, you can't win in general, but a global search and replace of \n''' with \n; is actually easier than column-wise editing.

Yeah, good point. You could also replace \n\ \ \ with \n globally too.

If someone was silly enough to use triple-quotes in a value they'll get to keep the pieces: at some point you just have to let it break.

I see a few entries with triple quotes within values.

I agree we can't always handle every edge-case, but it seems to me the spaces-method is less effected by them. At worst, it ensures that the file will parse okay but may return data with extra spaces. At worst the quoting method will fail to parse (even in STAR-2012).

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

C test implementation at f753a93.

from pynmrstar.

dmaziuk avatar dmaziuk commented on September 9, 2024

Worst case scenario is they fail to replace \n ; with \n; and get parse errors reading the inner file. They should be able to figure them out.

(You're right: in general you can't use triple quotes without BELs. It could have been a quick workaround for NEF, but otherwise it potentially creates more problems than just adding space before semicolon. Like the rest of STAR-2012: not really a fix that adds more breakage.)

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

Completely fixed at 859b1b1.

from pynmrstar.

jonwedell avatar jonwedell commented on September 9, 2024

Further improvements in edge-case behavior: ada7976

from pynmrstar.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.