fdsn / miniseed3
Home Page: https://docs.fdsn.org/projects/miniseed3/
The link to the json schema file for the extra headers is broken. I know this will change upon transfer to fdsn, but would be nice to have it correct during review.
https://miniseed3.readthedocs.io/en/latest/extra-headers.html#fdsn-reserved-headers
@chad-iris can you fix this easily?
Need to pull docs from json schema into rst.
Related to changing the documentation of extra headers: I realized that the canonical documentation for each field is not displayed in the specification proper. This documentation exists, as I think it should, in the JSON schema itself. I have not settled on how best to render this in the spec, but am working on it (volunteers welcome!).
If this proposal is adopted:
fdsn-reserved.rst should be updated to reflect the new location.
software.rst: from "iris-edu" to "EarthScope".
conf.py, 2 places: release string and sphinxmark_enable boolean.
From @crotwell:
Might be better for testing if there were fewer zeros in the reference data. In particular, the hour, min, sec and nanosec are all zeros, making unit tests less reliable. Perhaps the time could be something like: 12:34:56.123456789.
Related, it would be good if the sample rate/period was not always 1. Because of the "negative means period" concept, reference files with a high sample rate (>> 1) and a low sample rate (<< -1) would be really useful to catch parse errors.
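As a minimal sketch of why values of 1 make weak test data, the "negative means period" convention could be handled as below. The function name `samples_per_second` is hypothetical and not taken from any miniseed3 library.

```python
def samples_per_second(rate_or_period: float) -> float:
    """Interpret a sample rate/period header value (hypothetical helper).

    Non-negative values are taken as a rate in samples per second;
    negative values are taken as a period in seconds, so the rate is
    -1 / value.
    """
    if rate_or_period < 0:
        return -1.0 / rate_or_period
    return rate_or_period


# At 1 and -1 a parser that merely drops the sign still gets the right
# answer, so only values far from 1 expose the mistake.
print(samples_per_second(1.0))     # rate of 1 Hz
print(samples_per_second(-1.0))    # period of 1 s, also 1 Hz
print(samples_per_second(100.0))   # rate of 100 Hz
print(samples_per_second(-100.0))  # period of 100 s, i.e. 0.01 Hz
```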
Consider adding
64-bit integer (two’s complement), little endian byte order
as an additional "primitive" data encoding. It is unlikely this would ever be needed for digitized data from an actual sensor, but since most computers are 64-bit it may be useful to support this as a type. The resulting file size would be 2x, but writing an array of native 64-bit integers would be possible without having to check whether any value exceeds 32 bits, for example from a numpy int_ array.
20 appears to be the next unused encoding.
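For illustration only, the byte layout of such a payload can be sketched with the Python standard library. Nothing here is part of the specification; the sample values are made up and the encoding number is not used.

```python
import struct

# Made-up samples; 2**40 does not fit in a 32-bit integer.
values = [1, -2, 2**40, -(2**62)]

# "<" selects little-endian byte order, "q" is a signed 64-bit integer
# (two's complement).
payload = struct.pack(f"<{len(values)}q", *values)

# 8 bytes per sample, twice the size of a 32-bit integer payload.
print(len(payload))

# Round trip: no range check against 32 bits was needed when writing.
decoded = list(struct.unpack(f"<{len(values)}q", payload))
print(decoded == values)
```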
We'd recommend to allow for a version indication of both major and minor versions
(minor versions indicating those where a file according to the spec of the previous minor is still valid according to the new minor, major versions being not backward-compatible)
Reasoning:
A typical case for a new minor version would be the extension of the list of valid payload encodings (header field 5), a process anticipated in the spec and a piece of information relevant for software to check compatibility with a seedlink file.
Potential implementations:
a) add a new byte for the minor (especially, alongside with proposal/issue #6, the record header would remain unchanged)
b) interpret the byte as version x10, allowing 25 majors and up to 10 minors per major.
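Option (b) could be sketched as follows. The helper name `split_version` is hypothetical, and the mapping is only one reading of "version x10".

```python
from typing import Tuple


def split_version(version_byte: int) -> Tuple[int, int]:
    """Split a single version byte into (major, minor), assuming the
    byte holds major * 10 + minor (hypothetical interpretation)."""
    if not 0 <= version_byte <= 255:
        raise ValueError("version must fit in one unsigned byte")
    return divmod(version_byte, 10)


print(split_version(30))   # major 3, minor 0
print(split_version(31))   # major 3, minor 1
print(split_version(255))  # largest byte value: major 25, minor 5
```

Under this reading a reader that supports major 3 could accept any byte from 30 to 39 and reject everything else.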
We would actually prefer a representation of the sampling rate by two integers (numerator, denominator)
Reasoning:
At least as long as the sampling rate is rational (afaik, we have not met others), this representation is more precise and less prone to code implementation issues and equality check mismatches due to slightly different floating point representation of the same sampling rate, depending on platform, code and compiler versions.
It would probably be good enough to use 4 bytes integers for both numerator and denominator, leaving the header size unchanged.
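The equality concern can be illustrated with a small sketch. The signed little-endian layout of the two integers is an assumption made for the example, not something the proposal specifies.

```python
import struct
from fractions import Fraction

# A 0.1 Hz rate that passed through a 32-bit float somewhere upstream
# no longer compares equal to the 64-bit value.
via_float32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]
print(via_float32 == 0.1)  # False

# The rational form compares exactly, whatever route it took.
rate = Fraction(1, 10)
print(rate == Fraction(2, 20))  # True

# Two 4-byte integers occupy the same 8 bytes as a 64-bit float.
packed = struct.pack("<ii", rate.numerator, rate.denominator)
print(len(packed) == struct.calcsize("<d"))  # True
```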
Once EarthScope/mseed3-utils#26 is fixed, regenerate json reference data so record json object is inside array instead of at top level. Testing is easier if reference data stays consistent with output of mseed3-json.
In the description of the "Length of data payload" field, it would be useful to clearly state that padding is allowed. If that weren't the case the field would be redundant, since the number of samples and encoding information would be enough to know the payload length.
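A rough sketch of the redundancy argument, for fixed-width encodings only: the mapping of encoding codes to sample sizes below is an assumption based on my reading of the draft, and `padding_bytes` is a hypothetical helper. Compressed encodings are not covered.

```python
# Bytes per sample for fixed-width encodings (assumed mapping):
# 1 = 16-bit integer, 3 = 32-bit integer, 4 = 32-bit float, 5 = 64-bit float.
SAMPLE_SIZE = {1: 2, 3: 4, 4: 4, 5: 8}


def padding_bytes(encoding: int, num_samples: int, payload_length: int) -> int:
    """Return how many payload bytes are not accounted for by samples."""
    needed = SAMPLE_SIZE[encoding] * num_samples
    if payload_length < needed:
        raise ValueError("payload too short for the declared samples")
    return payload_length - needed


# Without padding the length field carries no new information ...
print(padding_bytes(3, 100, 400))
# ... with padding allowed, it is the only way to find the record end.
print(padding_bytes(3, 100, 512))
```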
Reasoning:
When writing a record, the CRC, and (depending on the record sizing paradigm) the number of samples become available only after all data of the record is collected.
Now, miniseed3 format is used for both data transfer and data storage.
In many cases of data transfer, small incremental transfer is an asset (regular data flow, small latency in real-time applications).
In many cases of data storage, larger records are an asset: less overhead due to repetitive header information, more efficient seek processes in larger (e.g. day-) files.
With the proposed change, a miniseed producer can write a header (including intended payload size) and start an incremental transfer of the data before the CRC (and, in case of compressed data, the number of samples fitting into the payload) is known. A miniseed receiver/processor can decide whether to
Note that
In Appendix A, the non-FDSN extra headers example is missing. Would be useful to add something to show the structure relative to FDSN.
Reasoning:
Imagine a raw data set using publication version 1.
Two different authors could pick this up, one doing gap filling while the other does time correction. Both, if unaware of each other, would increment the data publication version, resulting in two datasets carrying the same version but having different content. This seems to be a potential source of confusion.
In a first-order attempt to avoid the problem, one could consider adding a data field like "agency", implying that among dataset versions of the same agency, version numbering should be consistent, while between dataset versions authored by different agencies, it may not be.
However, with the source identifier URI (and URI-style ...#fragment extensions of it), miniseed3 already provides a much more flexible and powerful tool for versioning and provenance indication. Thus, we think the error-prone version byte can be dropped.
(Note: together with proposal #5, this leads to an unchanged header size.)
In most cases when evolving a data format an increase in space requirements is inevitable; at the same time it is recognized that storage density continues to increase and corresponding costs continue to decrease. However it is important to consider that there is still a cost to incur, and this cost is incurred on low power dataloggers producing the data, the data center storing and distributing the data, the computers consuming the data, and the transmission of the data on the networks connecting all of these. A technical review of the proposed data format is an opportunity to take a step back to consider whether the technical proposal is cost effective overall - in other words whether the level of cost optimization in the design undertaken for the new data format is suitable.
The space requirements associated with a few aspects of the new data format can be considered via a few examples:
In the case of the Source Identifier field, the need to expand the namespace is clear, the approach taken here is mostly consistent with today's conventions, and the benefits of an expanded namespace will be constantly leveraged going forward.
The original arguments put forth in support of moving to an 8 byte time representation are compelling and, in the years since, momentum has continued to gain with regard to dispensing entirely with creating new leap seconds, and the rationale for this includes a recognition that the world is unprepared for a possible negative leap second ("What could possibly go wrong?" ;-))
With regard to the remaining 2 examples cited, is there too much weight being placed on possible edge cases justifying a 4GB record length and the very high level of precision offered by a 64-bit float for the sample rate? It is hard to predict the community's needs over the fullness of time, but the new data format is designed to be more easily extended in the future as new significant use cases emerge.
Times in reference data are inconsistent.
For example in reference-detectiononly.txt:
start time: 2004,210,20:28:09.000000
while in same file
"OnsetTime": "2022-06-05T20:32:39.120000Z",
which are slightly different. The first has day of year, the second month-day. The first doesn't have Z for UTC.
Also, mseed3 has nanosec precision, but both times are microsecond.
JSON version of same file has
"StartTime": "2004-07-28T20:28:09.000000000Z",
which is again slightly different: nanosecond with Z, and month-day. The JSON also has
"OnsetTime": "2022-06-05T20:32:39.120000Z",
with only microsec precision.
Would be nice if all of these were unified. While I like the month-day format in general, the mseed3 format has day of year, so that might be a more accurate representation?
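As a sketch of what unifying the two styles involves, the day-of-year text form can be converted to the ISO month-day form used in the JSON. The helper name `doy_to_iso` is hypothetical. Python's `datetime` only carries microseconds, so the last three digits are zero padding here, not real nanosecond precision.

```python
from datetime import datetime, timezone


def doy_to_iso(text: str) -> str:
    """Convert 'YYYY,DDD,HH:MM:SS.ffffff' to ISO 8601 with 9 fractional
    digits and a trailing Z (hypothetical helper)."""
    stamp = datetime.strptime(text, "%Y,%j,%H:%M:%S.%f")
    stamp = stamp.replace(tzinfo=timezone.utc)
    # Pad microseconds to nine digits; true nanoseconds would need an
    # integer-based time representation instead of datetime.
    return stamp.strftime("%Y-%m-%dT%H:%M:%S.") + f"{stamp.microsecond:06d}000Z"


print(doy_to_iso("2004,210,20:28:09.000000"))
# 2004-07-28T20:28:09.000000000Z
```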
Currently the json-schema for the extra headers is unversioned. As it is anticipated that the FDSN can update this by adding to the "reserved" section of the extra headers, this schema should be versioned, if only to allow verification that any change is backwards compatible.
The syntax in RFC3986 specifies that the forward slash ("/") be used as the delimiter between the hierarchical path components and the underscore ("_") is explicitly defined as an unreserved character. Using the forward slash is the conventional and familiar way of delimiting path components in a hierarchical namespace and is well supported by existing software used for parsing URIs. It would be ideal to (re)surface the rationale for using underscore "_" instead.
See EarthScope/mseed3-utils#25
Bit flags in the text version of the reference data have the wrong order.
flags: [00100000] 8 bits
[Bit 2] Clock locked
but clock locked should be third bit from right, not left, so should be
flags: [00000100] 8 bits
[Bit 2] Clock locked
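The expected rendering can be checked with a short sketch, numbering bits from the least significant bit. The constant name `CLOCK_LOCKED` is made up for the example.

```python
CLOCK_LOCKED = 1 << 2  # bit 2, counting from the least significant bit

flags = CLOCK_LOCKED

# Binary formatting prints the most significant bit first, so bit 2
# appears third from the right.
print(f"flags: [{flags:08b}] 8 bits")  # flags: [00000100] 8 bits

# List which bit numbers are set.
set_bits = [bit for bit in range(8) if flags & (1 << bit)]
print(set_bits)  # [2]
```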
While not useful in most cases for data centers, it is often useful in research to associate an event with a time series within a single file. For example, the sac file format has headers for event latitude, longitude and depth. Because these are standard, tools can easily find and use them. Putting channel metadata into the record, while perhaps space-inefficient, is also often useful.
While we could create specialized json objects to store this type of information, we already have the QuakeML and StationXML XML formats, which are of course just strings.
Propose that we reserve keys within the fdsn portion of the json extra headers for QuakeML and StationXML. The idea is that one could insert the contents of a QuakeML or StationXML file as a string into the json extra headers, allowing this metadata to be contained within a mseed record. Because these two file formats (actually string formats) are already standardized, tools should already know how to deal with them and so the standard key is all that is needed.
Optionally, these could also be stored in otherwise "empty" records, e.g. with no time series data, allowing one copy within a mseed3 file that might have many records. The association here would be to other records in the file, but would still provide a simple and useful way to aggregate related information in a standard way.
Suggest adding to the json schema:
"QuakeML": {
"description": "QuakeML XML as a string",
"type": "string"
},
"StationXML": {
"description": "StationXML XML as a string",
"type": "string"
},
And yes I do realize it is weird to have one file format (mseed3) containing another file format (json) which then contains yet another file format (XML).
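A minimal sketch of the nesting, assuming the proposed "StationXML" key sits directly under the FDSN object: the XML below is a placeholder stub, not a valid StationXML document.

```python
import json

# Placeholder stub standing in for the contents of a real StationXML file.
station_xml = (
    '<?xml version="1.0"?>\n'
    '<FDSNStationXML schemaVersion="1.1"></FDSNStationXML>'
)

# The XML is stored as an ordinary JSON string under the proposed key.
extra_headers = {"FDSN": {"StationXML": station_xml}}
encoded = json.dumps(extra_headers)

# JSON escaping of quotes and newlines is handled by the encoder and
# undone by the decoder, so the XML comes back unchanged.
restored = json.loads(encoded)["FDSN"]["StationXML"]
print(restored == station_xml)
```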
json-schema.org is at draft 2020-12 and we use draft-07.
Might not really be that much difference, but maybe should upgrade ExtraHeaders-FDSN-v1.0.schema.json from
"$schema": "http://json-schema.org/draft-07/schema#",
to
"$schema": "https://json-schema.org/draft/2020-12/schema",
and update any changes that need to be.
See https://json-schema.org/specification.html#migrating-from-older-drafts
The FDSN object in the Extra Headers section has the FDSN.Time.Quality property which is carried forward from SEED 2.4 Blockette 1001, field 3. In the original specification it is defined as "...a vendor specific value from 0 to 100% of maximum accuracy, taking into account both clock quality and data flags". There is also the FDSN.Time.MaxEstimatedError property defined in the JSON schema. The latter is arguably more useful and does not require itself to be supplemented by manufacturer-specific documentation to interpret it. Propose that FDSN.Time.Quality be dropped for the record, but I recognize that this likely will not be successful due to compatibility concerns with 2.4.
"Quality": {
"description": "DEPRECATED Timing quality. A vendor specific value from 0 to 100% of maximum accuracy. [same as SEED 2.4 Blockette 1001, field 3] It is recommended to use MaxEstimatedError instead.",
"type": "number"
},
The /FDSN/Time/Quality extra header is specified in the schema as a number, and the schema also says it is the same as SEED 2.4, B1001, field 3, which is an integer.
I believe the correct type should be integer.