
miniseed3's Issues

extract docs from json schema file

Need to pull docs from json schema into rst.

Related to changing the documentation of extra headers: I realized that the canonical documentation for each field is not displayed in the specification proper. This documentation exists, as I think it should, in the JSON schema itself. I had not settled on how best to render this in the spec, but am working on it (volunteers welcome!).

IF ADOPTED: steps identified if the specification is adopted

If this proposal is adopted:

fewer zeros in reference data and variation in sample rate

From @crotwell:

Might be better for testing if there were fewer zeros in the reference data. In particular, the hour, minute, second, and nanosecond are all zeros, making unit tests less reliable. Perhaps the time could be something like 12:34:56.123456789.

Related, it would be good if the sample rate/period were not always 1. Because of the "negative means period" convention, reference files with a high sample rate (>> 1) and a low sample rate (<< -1) would be really useful to catch parse errors.

add int64 encoding

Consider adding

64-bit integer (two’s complement), little endian byte order

as an additional "primitive" data encoding. It is unlikely this would ever be needed for digitized data from an actual sensor, but since most computers are 64-bit it may be useful to support this as a type. The resulting file size would be 2x, but writing an array of native 64-bit integers would be possible without having to check whether any value exceeds 32 bits, for example from a numpy int_ array.
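A minimal sketch of the range check an int64 encoding would make unnecessary, using NumPy as mentioned above (the dtype strings are standard NumPy little-endian codes):

```python
import numpy as np

data = np.array([1, 2**40, -3], dtype=np.int64)

# Without an int64 encoding, a writer must range-check before downcasting:
fits_in_32 = (data >= np.iinfo(np.int32).min).all() and \
             (data <= np.iinfo(np.int32).max).all()

if fits_in_32:
    payload = data.astype("<i4").tobytes()  # 32-bit little-endian encoding
else:
    payload = data.astype("<i8").tobytes()  # would need the proposed int64 encoding
```

With a native int64 encoding the `if` branch disappears: the array is written as-is.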

20 appears to be the next unused encoding.

miniseed version number in the header (header field 2)

We'd recommend allowing for a version indication of both major and minor versions
(minor versions indicating those where a file valid according to the spec of the previous minor version is still valid according to the new one; major versions being not backward-compatible)
Reasoning:
A typical case for a new minor version would be the extension of the list of valid payload encodings (header field 5), a process anticipated in the spec and information relevant for software checking compatibility with a miniSEED file.
Potential implementations:
a) add a new byte for the minor version (in particular, alongside proposal/issue #6, the record header size would remain unchanged)
b) interpret the existing byte as version × 10, allowing 25 major versions and up to 10 minors per major.
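Option (b) could be decoded with a trivial helper (a hypothetical sketch for illustration; `split_version` is not part of any specification):

```python
def split_version(version_byte: int) -> tuple[int, int]:
    """Decode option (b): one byte interpreted as major * 10 + minor."""
    return version_byte // 10, version_byte % 10

# A byte value of 31 would mean major version 3, minor version 1.
major, minor = split_version(31)
```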

expression of the sampling rate (header field 6)

We would actually prefer a representation of the sampling rate by two integers (numerator, denominator)

Reasoning:
At least as long as the sampling rate is rational (afaik, we have not encountered others), this representation is more precise and less prone to implementation issues and equality-check mismatches caused by slightly different floating-point representations of the same sampling rate across platforms, code, and compiler versions.
It would probably be good enough to use 4-byte integers for both numerator and denominator, leaving the header size unchanged.
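A sketch of the exact-equality benefit, using Python's stdlib `Fraction` to stand in for the proposed (numerator, denominator) pair:

```python
from fractions import Fraction

# Two integers represent the rate exactly, e.g. 0.1 Hz as 1/10.
rate_a = Fraction(1, 10)
rate_b = Fraction(1, 10)

# Exact equality -- no platform-dependent floating-point rounding:
exact_equal = rate_a == rate_b

# The float route can already surprise: 0.1 has no exact binary form,
# so summing three of them does not equal 0.3.
float_sum = 0.1 + 0.1 + 0.1
frac_sum = Fraction(1, 10) * 3
```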

Proposal: move the CRC and number of samples from the header section to an 8-byte footer section *after* the data payload

reasoning:
When writing a record, the CRC and (depending on the record sizing paradigm) the number of samples become available only after all data of the record has been collected.
Now, the miniseed3 format is used for both data transfer and data storage.
In many cases of data transfer, small incremental transfers are an asset (regular data flow, low latency in real-time applications).
In many cases of data storage, larger records are an asset: less overhead from repetitive header information, and more efficient seeking in larger (e.g. day-) files.
With the proposed change, a miniseed producer can write a header (including the intended payload size) and start an incremental transfer of the data before the CRC (and, in case of compressed data, the number of samples fitting into the payload) is known. A miniseed receiver/processor can decide whether to

  • start interpretation right away upon receiving the first chunk of the record (with the typically small risk of needing to discard results later on if the CRC turns out to be invalid) -> in this case, transfer latency is minimized without causing inefficient target-side storage or the need for record-size transcoding
  • or wait for the entire record to be transferred (with no relevant advantages or drawbacks compared to classical miniseed2 handling).

Note that

  • as many data transfers use TCP (which implements its own checksums at the IP packet level), the CRC these days is much more a sanity-check tool for storage than for transfer
  • moving the CRC and number of samples to the end of the record does not imply any overhead in accessing this information, as its position is known from the beginning (header size, plus the values of header fields 10, 11, and 12)
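The offset computation in the second bullet can be sketched as follows (the fixed-header size and the meaning of fields 10–12 as length fields are illustrative assumptions, not spec values):

```python
# Assumed fixed-header size, for illustration only.
FIXED_HEADER_LEN = 40

def footer_offset(identifier_len: int, extra_headers_len: int,
                  payload_len: int) -> int:
    """Offset of a trailing 8-byte footer, relative to the record start.

    The three arguments stand in for the variable-length section sizes
    that the header is assumed to carry (fields 10, 11, and 12).
    """
    return FIXED_HEADER_LEN + identifier_len + extra_headers_len + payload_len
```

A reader can therefore seek directly to the footer as soon as the header has been parsed, without scanning the payload.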

non-FDSN extra headers missing

In Appendix A, the non-FDSN extra headers example is missing. Would be useful to add something to show the structure relative to FDSN.

Proposal: Drop data publication version (header field 9)

Reasoning:
Imagine a raw data set using publication version 1.
Two different authors could pick this up, one doing gap filling while the other does a time correction. Both, if unaware of each other, would increment the data publication version, resulting in two datasets carrying the same version but having different content. This seems to be a potential source of confusion.
As a first-order attempt to avoid the problem, one could consider adding a data field like "agency", implying that version numbering should be consistent among dataset versions of the same agency, while between dataset versions authored by different agencies it may not be.
However, with the source identifier URI (and uri-style ...#fragment extensions of it), miniseed3 already provides a much more flexible and powerful tool for versioning and provenance indication. Thus, we think the error-prone version byte can be dropped
(note that, together with proposal #5, this leads to an unchanged header size).

Overall record size

In most cases when evolving a data format, an increase in space requirements is inevitable; at the same time, it is recognized that storage density continues to increase and corresponding costs continue to decrease. However, it is important to consider that there is still a cost to incur, and this cost falls on the low-power dataloggers producing the data, the data center storing and distributing the data, the computers consuming the data, and the transmission of the data on the networks connecting all of these. A technical review of the proposed data format is an opportunity to step back and consider whether the technical proposal is cost effective overall - in other words, whether the level of cost optimization undertaken in the design of the new data format is suitable.

The space requirements associated with a few aspects of the new data format can be considered via a few examples:

  • Source identifier: a text-based, variable length field
  • Record start time: The original straw man proposal specified 8 bytes and the current proposal specifies 12 as an outcome of a brief discussion thread
  • Record length: A proposal to reduce this in half to 2 bytes was not successful
  • Sample rate/period: 8 byte float

In the case of the Source Identifier field, the need to expand the namespace is clear, the approach taken here is mostly consistent with today's conventions, and the benefits of an expanded namespace will be constantly leveraged going forward.

The original arguments put forth in support of moving to an 8-byte time representation are compelling and, in the years since, momentum has continued to build toward dispensing entirely with new leap seconds; the rationale for this includes a recognition that the world is unprepared for a possible negative leap second ("What could possibly go wrong?" ;-))

With regard to the remaining two examples cited, is too much weight being placed on possible edge cases to justify a 4 GB record length and the very high level of precision offered by a 64-bit float for the sample rate? It is hard to predict the community's needs over the fullness of time, but the new data format is designed to be more easily extended in the future as significant new use cases emerge.

formatting of times in reference data

Times in reference data are inconsistent.
For example in reference-detectiononly.txt:

             start time: 2004,210,20:28:09.000000

while in same file

                        "OnsetTime": "2022-06-05T20:32:39.120000Z",

which are slightly different. The first has day of year, the second month-day; the first also lacks a Z for UTC.

Also, mseed3 has nanosecond precision, but both times are given only to microsecond precision.

JSON version of same file has

    "StartTime": "2004-07-28T20:28:09.000000000Z",

which is again slightly different: nanosecond precision with a Z, and month-day. The JSON also has

                        "OnsetTime": "2022-06-05T20:32:39.120000Z",

with only microsecond precision.

It would be nice if all of these were unified. While I like the month-day format in general, the mseed3 format uses day of year, so that might be the more accurate representation.
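One possible unified convention, as a sketch: day-of-year, nanosecond precision, trailing Z for UTC (the helper name is hypothetical, and the extra `nanos` argument is an assumption to work around `datetime`'s microsecond limit):

```python
from datetime import datetime, timezone

def format_mseed3_time(dt: datetime, nanos: int = 0) -> str:
    """Render a UTC time as year,day-of-year with nanosecond precision
    and a trailing Z. `nanos` carries sub-microsecond digits."""
    base = dt.astimezone(timezone.utc).strftime("%Y,%j,%H:%M:%S")
    total_ns = dt.microsecond * 1000 + nanos
    return f"{base}.{total_ns:09d}Z"

# The StartTime from the reference data, rendered in the unified form:
t = datetime(2004, 7, 28, 20, 28, 9, tzinfo=timezone.utc)
```

Here `format_mseed3_time(t)` yields `2004,210,20:28:09.000000000Z`, matching both the day-of-year convention of the text dump and the nanosecond/Z convention of the JSON.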

version of extra headers schema

Currently the json-schema for the extra headers is unversioned. As it is anticipated that the FDSN can update it by adding to the "reserved" section of the extra headers, the schema should be versioned, if only to allow verification that any change is backward compatible.
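One common way to version a JSON Schema is through its `$id` keyword (optionally paired with a `title`); a sketch with illustrative values only — the host, filename, and version number here are placeholders, not proposals:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "https://example.org/ExtraHeaders-FDSN-v1.1.schema.json",
    "title": "FDSN Extra Headers, version 1.1"
}
```

A validator or human reader can then tell at a glance which revision of the reserved section a document was checked against.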

URI syntax

The syntax in RFC3986 specifies that the forward slash ("/") be used as the delimiter between the hierarchical path components and the underscore ("_") is explicitly defined as an unreserved character. Using the forward slash is the conventional and familiar way of delimiting path components in a hierarchical namespace and is well supported by existing software used for parsing URIs. It would be ideal to (re)surface the rationale for using underscore "_" instead.

add fdsn standard extra header keys for stationxml and quakml

While not useful in most cases for data centers, it is often useful in research to associate an event with a time series within a single file. For example, the sac file format has headers for event latitude, longitude and depth. Because these are standard, tools can easily find and use them. Putting channel metadata into the record, while perhaps space-inefficient, is also often useful.

While we could create specialized JSON objects to store this type of information, we already have the QuakeML and StationXML XML formats, which are of course just strings.

Propose that we reserve keys within the fdsn portion of the json extra headers for QuakeML and StationXML. The idea is that one could insert the contents of a QuakeML or StationXML file as a string into the json extra headers, allowing this metadata to be contained within a mseed record. Because these two file formats (actually string formats) are already standardized, tools should already know how to deal with them and so the standard key is all that is needed.

Optionally, these could also be stored in otherwise "empty" records, e.g. with no time series data, allowing one copy within a mseed3 file that might have many records. The association here would be to other records in the file, but this would still provide a simple and useful way to aggregate related information in a standard way.

Suggest adding to the json schema:

                "QuakeML": {
                    "description": "QuakeML XML as a string",
                    "type": "string"
                },
                "StationXML": {
                    "description": "StationXML XML as a string",
                    "type": "string"
                },

And yes I do realize it is weird to have one file format (mseed3) containing another file format (json) which then contains yet another file format (XML).
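A sketch of how a producer might embed a QuakeML string under the proposed key (the `FDSN`/`QuakeML` names follow the proposal above and are not yet in any schema; the XML text is a placeholder):

```python
import json

# Placeholder QuakeML document, stored verbatim as a string value.
quakeml_text = ("<q:quakeml xmlns:q='http://quakeml.org/xmlns/quakeml/1.2'>"
                "</q:quakeml>")

extra_headers = {"FDSN": {"QuakeML": quakeml_text}}
encoded = json.dumps(extra_headers)

# A reader recovers the XML string by key alone -- no new parser needed.
recovered = json.loads(encoded)["FDSN"]["QuakeML"]
```

JSON string escaping handles the embedded XML transparently, which is what makes the "format inside a format inside a format" nesting workable in practice.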

packet time accuracy

The FDSN object in the Extra Headers section has the FDSN.Time.Quality property, which is carried forward from SEED 2.4 Blockette 1001, field 3. In the original specification it is defined as "...a vendor specific value from 0 to 100% of maximum accuracy, taking into account both clock quality and data flags". There is also the FDSN.Time.MaxEstimatedError property defined in the JSON schema. The latter is arguably more useful and does not require supplementary manufacturer-specific documentation to interpret. Propose that FDSN.Time.Quality be dropped from the record, but I recognize that this likely will not be successful due to compatibility concerns with 2.4.

Type of timing quality value should be integer

From: https://github.com/iris-edu/miniSEED3/blob/main/extra-headers/ExtraHeaders-FDSN-v1.0.schema-2020-12.json#L37-L40

"Quality": {
  "description": "DEPRECATED Timing quality.  A vendor specific value from 0 to 100% of maximum accuracy. [same as SEED 2.4 Blockette 1001, field 3] It is recommended to use MaxEstimatedError instead.",
  "type": "number"
},

The /FDSN/Timing/Quality extra header is specified in the schema as a number, yet is also stated to be the same as SEED 2.4, B1001, field 3, which is an integer.

I believe the correct type should be integer.
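A minimal illustration of the distinction, approximating JSON Schema's `number` and `integer` type checks with plain `isinstance` tests (a real validator library would be used in practice):

```python
import json

# JSON Schema's "number" admits fractional values; "integer" does not.
quality = json.loads('{"Quality": 99.5}')["Quality"]

valid_as_number = isinstance(quality, (int, float))   # 99.5 passes as "number"
valid_as_integer = isinstance(quality, int)           # 99.5 fails as "integer"
```

With `"type": "number"`, a value like 99.5 validates even though B1001 field 3 could never hold it; `"type": "integer"` rejects it.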
