There is a problem when I read a text block containing ; in it. There was no problem w

So if PyNMRSTAR can write it out like that, that's the bug. </blockqu

This warrants further discussion. (<a class="user-mention notranslate" data-hovercard-

Escaping ';' in text block about pynmrstar HOT 20 CLOSED

bmrb-io commented on September 9, 2024

Escaping ';' in text block

from pynmrstar.

Comments (20)

dmaziuk commented on September 9, 2024

AFAICT this is not a valid loop. So if PyNMRSTAR can write it out like that, that's the bug.

To wit, the 3rd value in the 2nd row is Optional.

from pynmrstar.

dmaziuk commented on September 9, 2024

PS. this is why I added STAR-2012's triple-quoted values to SAS.

from pynmrstar.

jonwedell commented on September 9, 2024

So if PyNMRSTAR can write it out like that, that's the bug.

Exactly.

from pynmrstar.

jonwedell commented on September 9, 2024

This is resolved (pending further testing) in the pure python implementation in these commits: 69c3208 and b90667f

I need to implement it in the C library before pushing the update to pip.

from pynmrstar.

jonwedell commented on September 9, 2024

This warrants further discussion. (@elulrich @perzeus @dmaziuk @kumar-physics ) There are several ways to "resolve" this, each of which has its own benefits and drawbacks. My commit above implements one of them (2), but I think we should have consensus before I push it to the master branch. Here are the options as I see them (in no particular order):

Implement full STAR-2012 support in the library and use triple-quoted values to contain the newline+semi-colon delineated values. If triple quoted values are also inserted, then use the new bell escape sequence to escape them. The major downside of this is that all STAR parsers that don't implement STAR-2012 will fail to parse the file. My understanding is that the BMRB officially releases files in the original STAR format. If we start generating files with STAR-2012 features then other's STAR-parsing libraries will break. We can make the transition official though, and try to encourage other users to use our existing libraries. This could have other drawbacks though - we would have to start officially supporting unicode in files. Dimitri, my understanding from your comment above is that you already use this method when writing STAR which contains STAR in data values? Or do you just support parsing files with triple-quotes?

Use a BMRB-defined NMR-STAR specific (not official STAR) method to escape these problematic values. I see at least two different ways to do this:

If a data value contains the string \n; then add 3 space characters before every line in the data value. This way the existing \n; will transform into \n\ \ \ ;. (Interpret \s as - Markdown doesn't preserve multiple spaces in a row.) During the parsing of these files, if a field is semi-colon delineated, contains the string \n\ \ \ ;, and every line starts with at least 3 spaces we can automatically replace all instances of \n\ \ \ with \n. This has the benefit of being visually clear of what is happening, and it works recursively. (I.e. STAR-in-STAR-in-STAR) I believe something like this is what we have used in the past when we have inserted STAR loops as the value of tags? The downside is that if the user inserts data into a tag that contains the literal string \n\ \ \ ; and has every line beginning with at least 3 spaces then the spaces will be removed from their data upon parsing. I think this would be extremely rare, but it is still worth acknowledging. This solution does not break existing parsers, though they would fail to properly interpret the STAR-in-STAR values. (They would not remove the whitespace and therefore if they tried to parse the inner STAR data would encounter an error.) On the other hand, I think most other people have not thoroughly implemented their parsers and therefore simply having the STAR-in-STAR may break their parsers as it is. Furthermore, upon the error visual inspection of the file should make it clear what is happening.
Borrowing from the STAR-2012 specification in a backwards-compatible way, simply replace all instances of \n; in a text value with \n\a; where \a is the BELL control character when printing. Upon parsing replace all \n\a; in a data value with \n;. Like solution 2 this does not break backwards compatibility, though it shares the same downside of preventing the inner-STAR data from being reinterpreted without updating the code to understand it. This solution has a downside in that it doesn't work recursively. To make it work recursively would require additional BELL-escaping logic.

To summarize the pros and cons:

Most robust solution, human readable, recursive, but prevents all existing parsers from parsing the file at all and means we should probably update to full STAR-2012 support everywhere.

Backwards compatible, human readable, recursive, and doesn't break existing parsers from parsing the file. It would only prevent existing parsers from properly parsing the inner-STAR values, but this is an easy fix either in code or by copying and pasting the inner STAR text. I think this also makes it clear when viewing the file what is going on.
Backwards compatible, the BELL sequence makes it clear when viewing the file that something special is happening, not recursive without additional logic that BELL-escapes the BELL character. May be confusing when viewing the file, but the BELL character is more unambiguous than spaces and wouldn't ever* be inserted into an actual data value.

Regardless of what we choose, I think that all BMRB libraries/tools should implement the same behavior and all NMR-STAR files we produce should use the same method.

My vote is for (2). Thoughts?

from pynmrstar.

perzeus commented on September 9, 2024

(Working from home today - grant proposal writing crunch mode) I agree with Jon on using the second alternative, if we worry about existing parsers and staying within the STAR specification. I would say I prefer the idea of triple quotes as a cleaner and concise method for encoding verbatim text blocks, but as far as I understand the issue, that solution would make NMR-STAR non STAR compliant, am I right? Cheers, Pedro

from pynmrstar.

perzeus commented on September 9, 2024

When I said triple quotes would make NMR-STAR non STAR compliant, I meant the STAR version we currently conform to. Seeing that STAR 2012 includes them, i think we should move towards implementing at least that aspect of STAR 2012. To be precise, we don't conform to the full STAR specification anyway (e.g. no nested loops). Triple quotes should be easy enough to adopt for any parser out there. Pedro

from pynmrstar.

jonwedell commented on September 9, 2024

Yes, up to this point in time all of the 3.1 files we generate are pure STAR. (While Dimitri's code supports reading in triple-quoted values, I don't see any released entries using that feature.)

We don't use the full STAR feature set, but everything we use is currently a legal subset of it. I have to admit I'm a bit concerned with breaking existing parsers - it seemed to take a very long time to get the community to upgrade to NMR-STAR 3.x from 2. Furthermore, I believe a lot of people in the community use regular-expression based parsers they implemented themselves for simple scripts and updating those to support this might actually be a lot of work for them. (Perhaps that isn't all bad, since when they come to us to complain we could point them to use our robust tools rather than their own implementations.)

This would also mean that all of our existing software would need to be updated including ADIT-NMR.

Nevertheless, I'm happy to go this route if it is the group consensus.

from pynmrstar.

perzeus commented on September 9, 2024

I see your point. Let's implement your solution 2 now. We will discuss any STAR upgrades later, possibly for a new NMR-STAR major version (NMR-STAR 4.0). Major versions typically necessitate parser upgrades anyway. Thanks, Pedro

from pynmrstar.

jonwedell commented on September 9, 2024

I definitely agree that we should consider adopting some of the features from STAR-2012 when NMR-STAR 4.0 is released. It will be a great opportunity to resolve some of the oddities from STAR.

from pynmrstar.

dmaziuk commented on September 9, 2024

The other option could be to replace \n; in the embedded STAR with \n''' instead of \n\a; -- that way if someone reads it with a STAR-2012 parser it'll "just work" and if not: they have to put the \n; back anyway.

Unfortunately the triple-quotes, like most other STAR-2012 "improvements" are incompatible with previous specification. Including BELL escapes. All existing STAR software out there will need to be rewritten for STAR-2012. Personally I don't see the urgency, as the 2012 specification doesn't actually fix any of the STAR practical "oddities", it just adds more breakage,

from pynmrstar.

perzeus commented on September 9, 2024

Thanks, Dimitri. This further justifies leaving STAR 2012 for later discussions. I would say we should not worry about anyone parsing NMR-STAR with a STAR 2012 parser. I believe we will know when that happens, but I won't hold my breath. Cheers, Pedro

from pynmrstar.

elulrich commented on September 9, 2024

I vote for option 2, for now. It is not clear how many NEF files will have software specific tags and data so I am guessing that there will be little impact.

Eldon

from pynmrstar.

jonwedell commented on September 9, 2024

The other option could be to replace \n; in the embedded STAR with \n''' instead of \n\a; -- that way if someone reads it with a STAR-2012 parser it'll "just work" and if not: they have to put the \n; back anyway.

That is an interesting idea, and I see the appeal in that it would work natively for STAR-2012, but I think it could lead to confusion when looking at a file. I think it is more likely that a user would figure out the embedded-STAR needs to be shifted over versus knowing the """ is a delimeter and not part of the tag value. (I mean, it isn't that hard to figure out, but in STAR whitespace is usually not important to proper parsing, whereas quotes always are.) Also, imagine if you put a whole NMR-STAR entry in a tag and only had a STAR parser - with the spaces method you could just copy the embedded-STAR, paste it elsewhere, and shift the text over in your IDE or text editor. With the quote method you would have to go through and remove all the triple quotes with syntactic-awareness. In other words, you would pretty much need to have a STAR-2012 parser.

Finally, if there actually are triple quotes in the data value then they would need to be escaped - which would require the BELL character which we were trying to avoid.

from pynmrstar.

dmaziuk commented on September 9, 2024

Well, you can't win in general, but a global search and replace of \n''' with \n; is actually easier than column-wise editing. Typically multiple quotes used in nucleic acid atom names, like H-double-prime, so you shouldn't see them at the start of a line. (And I don't think they do triple-primes; certainly not triple-double-quotes.)

If someone was silly enough to use triple-quotes in a value they'll get to keep the pieces: at some point you just have to let it break. They may have used a \a, too, who knows.

from pynmrstar.

jonwedell commented on September 9, 2024

Well, you can't win in general, but a global search and replace of \n''' with \n; is actually easier than column-wise editing.

Yeah, good point. You could also replace \n\ \ \ with \n globally too.

If someone was silly enough to use triple-quotes in a value they'll get to keep the pieces: at some point you just have to let it break.

I see a few entries with triple quotes within values.

I agree we can't always handle every edge-case, but it seems to me the spaces-method is less effected by them. At worst, it ensures that the file will parse okay but may return data with extra spaces. At worst the quoting method will fail to parse (even in STAR-2012).

from pynmrstar.

jonwedell commented on September 9, 2024

C test implementation at f753a93.

from pynmrstar.

dmaziuk commented on September 9, 2024

Worst case scenario is they fail to replace \n ; with \n; and get parse errors reading the inner file. They should be able to figure them out.

(You're right: in general you can't use triple quotes without BELs. It could have been a quick workaround for NEF, but otherwise it potentially creates more problems than just adding space before semicolon. Like the rest of STAR-2012: not really a fix that adds more breakage.)

from pynmrstar.

jonwedell commented on September 9, 2024

Completely fixed at 859b1b1.

from pynmrstar.

jonwedell commented on September 9, 2024

Further improvements in edge-case behavior: ada7976

from pynmrstar.

Escaping ';' in text block about pynmrstar HOT 20 CLOSED

Comments (20)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent