libraryofcongress / bagit-spec
This project forked from jkunze/bagitspec
This issue contains feedback on the bagit1.0 branch, in particular on the ABNF grammars in section 7.
First, a small grammatical issue in the prose: "Future version of BagIt may disallow" should maybe be "Future versions of BagIt may disallow" or "A future version of BagIt may disallow" at https://github.com/LibraryOfCongress/bagit-spec/blob/bagit1.0/bagit.xml#L719.
The following commentary on the ABNF section of the BagIt 1.0 spec is based on an attempt to create parsers from the supplied ABNF grammars using the Instaparse library. The (still in-development) baglidate repository can be used to test proposed ABNF grammars and determine whether they can parse inputs that are expected to be valid.
bagit.txt on the left-hand side of the start rule seems a bit odd given that no other nonterminals contain periods; maybe bagit-txt would be better?
Here is the original grammar:
payload-manifest = checksum 1*WSP filename ending
checksum = 1*hex-val
hex-val = "x" 1*case-hexdig
[ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
"d" / "F" / "f"
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "F"
filename = (
"data"
"/"
*( unreserved / pct-encoded / sub-delims )
)
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*"
/ "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
ending = CR / LF / CRLF
The original grammar only captures one line of a manifest-algorithm.txt file. If it is desirable to capture the entire file, the provisional grammar given below shows one approach.
The original grammar requires the character "x" as a prefix for all "hex-encoded checksums". Is this intentional? Previous versions of BagIt did not, I believe, require this.
The original grammar is missing the "e" and "E" characters from its hexadecimal digit sets.
The original grammar allows for effectively empty filenames, i.e., "data/". See the definition of filename above. I presume this should not be permitted.
The original grammar does not appear to allow for solidus (forward slash) characters in filenames.
A provisional amended grammar is given below. (Note that it does not allow for Unicode characters in filenames, unlike the provisional amended grammar given for 7.3 below.) This grammar addresses the issues described above:
payload-manifest = 1*p-m-line
p-m-line = checksum 1*WSP filename ending
checksum = 1*hex-val
hex-val = 1*case-hexdig
[ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
"d" / "E" / "e" / "F" / "f"
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
filename = (
"data"
"/"
1*( unreserved / pct-encoded / sub-delims )
)
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*"
/ "+" / "," / ";" / "=" / "/"
pct-encoded = "%" HEXDIG HEXDIG
ending = CR / LF / CRLF
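To make the intent of the amended grammar concrete, here is a minimal Python sketch of a line checker. It is not generated from the ABNF; the regular expression is a hand translation and an assumption of this example, and the optional "." / "-" suffix groups of hex-val are left out for brevity. The names MANIFEST_LINE and parse_manifest are hypothetical.

```python
import re

# Hand-written approximation of the amended grammar (an assumption, not a
# mechanical translation). ABNF string literals are case-insensitive, so the
# hex digits after "%" are accepted in either case here.
MANIFEST_LINE = re.compile(
    r"(?P<checksum>[0-9A-Fa-f]+)"   # checksum: hex digits, no "x" prefix
    r"[ \t]+"                       # 1*WSP
    r"(?P<filename>data/"           # filename must start with "data/"
    r"(?:[A-Za-z0-9\-._~!$&'()*+,;=/]|%[0-9A-Fa-f]{2})+)"  # and must not stop there
)


def parse_manifest(text):
    """Return a list of (checksum, filename) pairs, or raise ValueError."""
    lines = re.split(r"\r\n|\r|\n", text)  # ending = CR / LF / CRLF
    if lines and lines[-1] == "":
        lines.pop()  # the final line terminator leaves an empty trailing piece
    entries = []
    for line in lines:
        match = MANIFEST_LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"invalid manifest line: {line!r}")
        entries.append((match.group("checksum"), match.group("filename")))
    return entries
```

With this sketch, a line such as "e6f8 data/" is rejected because the filename is empty after "data/", and a checksum with an "x" prefix is rejected as well.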
Here is the original grammar:
metadata = key ":" WSP value ending
key = 1*alpha-numeric
value = 1*(alpha-numeric ending) *(WSP 1*alpha-numeric ending)
alpha-numeric = ALPHA / DIGIT
ending = CR / LF / CRLF
This grammar only captures one line of a bag-info.txt file. If it is desirable to capture the entire file, one could add a rule like metadata-set = 1*metadata.
The grammar does not allow for non-ASCII (i.e., Unicode) characters despite the fact that the prose of the spec makes it clear that they are licit. The ABNF spec would appear to lack a provision for Unicode also. A provisional definition of a Unicode counterpart to ABNF's VCHAR (using Java-compatible regexes) would be UNICODE-VCHAR = ( #'\p{Punct}' / #'\p{L}' / #'\d' ).
The prose clearly states that the keys of bag-info.txt lines can contain spaces, yet the grammar does not allow for this. One solution would be to define key as follows (using UNICODE-VCHAR defined in the bullet above): key = UNICODE-VCHAR *( UNICODE-VCHAR / WSP ).
In the original grammar, ending occurs both at the end of metadata and value. This causes problems for parsing.
The original grammar does not capture the fact that continuation lines of a key-value pair must be indented by whitespace.
A provisional amended grammar is given below. Putting aside the fact that it would need revision because it uses regexes outside of the ABNF spec, it does seem to solve the issues described above:
metadata-set = 1*metadata
metadata = key ":" WSP value
key = UNICODE-VCHAR *( UNICODE-VCHAR / WSP )
value = value-line *( 1*WSP value-line )
value-line = UNICODE-VCHAR *( UNICODE-VCHAR / WSP ) ending
ending = CR / LF / CRLF
UNICODE-VCHAR = ( #'\p{Punct}' / #'\p{L}' / #'\d' )
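As a rough illustration of the continuation-line behaviour that the amended grammar is meant to capture, here is a small Python sketch. It is a hand-written reader, not a parser derived from the ABNF. Splitting at the first colon, skipping blank lines, and joining continuation lines with a single space are all assumptions of this example, and parse_bag_info is a hypothetical name.

```python
import re


def parse_bag_info(text):
    """Return a list of (key, value) pairs from bag-info.txt style text."""
    entries = []
    for line in re.split(r"\r\n|\r|\n", text):  # ending = CR / LF / CRLF
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if line[0] in " \t":
            # A line indented by whitespace continues the previous value.
            if not entries:
                raise ValueError("continuation line before any key")
            key, value = entries[-1]
            entries[-1] = (key, value + " " + line.strip())
        else:
            # Assumption: the key ends at the first colon on the line.
            key, sep, value = line.partition(":")
            if not sep or not key:
                raise ValueError(f"missing key or colon: {line!r}")
            entries.append((key, value.strip()))
    return entries
```

Because Python strings are Unicode, keys and values containing non-ASCII characters or internal spaces pass through unchanged.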
Here is the original grammar:
fetch = url 1*WSP length 1*WSP filename line-terminator
url = <absolute-URI, see [RFC3986], Section 4.3>
length = DIGIT
filename = (
"data"
"/"
*( unreserved / pct-encoded / sub-delims )
)
line-terminator = CR / LF / CRLF
Again, this describes a single line of a fetch.txt file, and not the entire file.
The length non-terminal only allows integers 0-9.
The length rule does not allow for use of the hyphen "-" instead of an integer, as the text states is permissible.
Again, empty filenames like data/ are permitted when they probably should not be.
Small point, but line-terminator is used in this grammar where ending was used for the same rule in the previous grammars.
Here is a provisional revised grammar that addresses the issues listed above:
fetch-file = 1*fetch-line
fetch-line = url 1*WSP length 1*WSP filename line-terminator
url = <absolute-URI, see [RFC3986], Section 4.3>
length = 1*DIGIT / "-"
filename = (
"data"
"/"
1*( unreserved / pct-encoded / sub-delims )
)
line-terminator = CR / LF / CRLF
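The revised length rule can be sketched in Python as follows. This is only an approximation: the url and filename parts are matched loosely as runs of non-whitespace characters rather than with the full RFC 3986 rules, and the names FETCH_LINE and parse_fetch_line are hypothetical.

```python
import re

# Loose approximation of fetch-line (an assumption of this sketch):
# url and filename are matched as non-whitespace runs.
FETCH_LINE = re.compile(
    r"(?P<url>\S+)[ \t]+"
    r"(?P<length>[0-9]+|-)[ \t]+"   # length = 1*DIGIT / "-"
    r"(?P<filename>data/\S+)"       # at least one character after "data/"
)


def parse_fetch_line(line):
    """Parse one fetch.txt line (without its line terminator)."""
    match = FETCH_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"invalid fetch line: {line!r}")
    raw_length = match.group("length")
    # "-" means the length is unspecified; represent that as None.
    length = None if raw_length == "-" else int(raw_length)
    return match.group("url"), length, match.group("filename")
```

Multi-digit lengths and the "-" placeholder are both accepted, while a bare "data/" filename is rejected.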
Via https://twitter.com/jakkbl/status/1141492471655886848?s=09
(((sfn))) @ampoffcom
BagIt question: It is a MUST that all resources from fetch.txt are in all manifest-ALG.txt files, but not in the oxum. Does this make sense? #rfc8493
From John Scancella
Agreed. Though why would the fetch items not be in the payload oxum? The bag isn't complete until you fetch the items missing
I ran across a bag today that used sha256 for the payload manifest and md5 for the tag manifest. While there's nothing technically wrong with this according to the spec, it is awkward to work with. A recommendation to use the same algorithms between manifests would be useful to discourage that type of bag.
E.g., tag manifests should only use the same hashing algorithm(s) as the payload manifests present in the bag.
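A check for this recommendation could look like the Python sketch below. It only inspects the file names found in the bag's base directory; the function names are hypothetical, and reading "only use" as "tag algorithms are a subset of payload algorithms" is an assumption of this example.

```python
import re


def manifest_algorithms(filenames):
    """Split manifest file names into (payload, tag) sets of algorithm names."""
    payload, tag = set(), set()
    for name in filenames:
        match = re.fullmatch(r"(tag)?manifest-(.+)\.txt", name)
        if match:
            target = tag if match.group(1) else payload
            target.add(match.group(2).lower())
    return payload, tag


def tag_algorithms_consistent(filenames):
    """True when every tag manifest algorithm is also a payload manifest algorithm."""
    payload, tag = manifest_algorithms(filenames)
    return tag <= payload
```

For the bag described above, with manifest-sha256.txt and tagmanifest-md5.txt, the check returns False.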
With regard to writing file paths in manifest files, the spec states the following:
If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].
My reading of the intent of the spec is for the manifest files to be usable by Unix checksum utilities. However, this percent-encoding requirement breaks compatibility. While CR and LF are rare to find in a file path, this encoding requirement becomes a problem because it necessitates the encoding of % too. It is fairly common to percent-encode a file name if you're worried about special characters. Per spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the file unusable by checksum utilities.
I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single implementation that implements this requirement correctly. Implementations either 1) do no encoding or 2) only encode CR and LF and do not encode %. The first behavior is broken for file names that contain an LF or CR and the second behavior is broken for file names that are naturally percent-encoded. And they're both broken for an actual implementation of the spec.
I am currently working on yet another implementation and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable by any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is more prevalent and less broken than doing the partial encoding.
Unix checksum utilities use an entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as \n and a \ is added to the beginning of the line. Additionally, literal \ characters are escaped with another \ and a \ is also added to the beginning of the line.
For example, suppose we have a file named new\nline (importantly, this must be an actual newline and not the characters \ and n) and one named back\slash, and then execute the following:
# On linux
$ sha256sum *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee new\nline
# On mac
$ shasum -a 256 *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee new\nline
This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.
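The escaping described above can be approximated with a short Python sketch. It covers only the two cases mentioned here (newline and backslash), so it should be read as an illustration of the convention rather than a reimplementation of the utilities. The function name is hypothetical, and the two-space separator is the text-mode format these tools usually print.

```python
def checksum_line(checksum, filename, separator="  "):
    """Format one output line in the style described above."""
    # Escape backslashes first, then newlines, so the backslash that is
    # introduced for "\n" is not doubled.
    escaped = filename.replace("\\", "\\\\").replace("\n", "\\n")
    # A leading backslash marks lines whose file name was escaped.
    prefix = "\\" if escaped != filename else ""
    return f"{prefix}{checksum}{separator}{escaped}"
```

File names without special characters are left untouched, so ordinary lines remain identical to what existing manifests already contain.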
Currently, every payload file must be in at least one payload manifest. The proposal is to change this to require every payload file to be in every payload manifest.
The motivation for this is that it reflects actual real world usage and reduces errors. However, it would mean that some existing bags may not be forward compatible with version 1.0.
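For discussion purposes, the proposed rule can be expressed as a small Python check. The input shapes (a list of payload paths and a dictionary from algorithm name to the set of paths listed in that manifest) and the function name are assumptions of this sketch.

```python
def missing_from_manifests(payload_files, manifests):
    """Map each algorithm to the payload files its manifest does not list.

    Algorithms whose manifest lists every payload file are omitted, so an
    empty result means the bag satisfies the proposed rule.
    """
    payload = set(payload_files)
    report = {}
    for algorithm, listed in manifests.items():
        missing = payload - set(listed)
        if missing:
            report[algorithm] = sorted(missing)
    return report
```

A bag that is valid under the current "at least one manifest" rule but not under the proposal shows up as a non-empty report.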
Please add comments to this ticket.
Earlier drafts of bagit included a section on Serialization that recommended that a bag be archived with a single folder in the zip/tar that then had bagit.txt etc. (rather than bagit.txt being in the root).
This was removed in 94bcdaa by @acdha but I can't see the discussion that led to this.
Is this still the convention for ZIP or .tar archiving a bag? I see that https://github.com/fair-research/bdbag by @mikedarcy et al assumes the opposite: it will only correctly validate a zip if bagit.txt is in the root.
I am asking because we are making such archives for downloading a bag, and obviously would want to be on the right side of the convention :)
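To make the two layouts easy to tell apart, here is a minimal Python sketch that classifies an archive from its member names. The function name and return convention are hypothetical; the list of names could come from zipfile.ZipFile(path).namelist() or tarfile.open(path).getnames().

```python
def bag_root_in_archive(names):
    """Locate the bag's base directory within a list of archive member names.

    Returns "" if bagit.txt sits at the archive root, the folder name if the
    archive holds a single top-level folder containing bagit.txt, and None
    if neither layout is found.
    """
    if "bagit.txt" in names:
        return ""
    top_level = {name.split("/", 1)[0] for name in names if name}
    if len(top_level) == 1:
        folder = next(iter(top_level))
        if f"{folder}/bagit.txt" in names:
            return folder
    return None
```

A tool that wants to accept both conventions could call this first and then validate relative to the returned folder.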
The current 1.0 branch doesn't mention RFC2234 as the source of ABNF syntax.
It might also be good to explicitly mention the core rules used (e.g. DIGIT, HEXDIG), as is done in RFC3986 (URI syntax), for example.
I work with a group that uses the Python Bagit library to package downloaded software applications. On some occasions, we've encountered an inability to use the package because of odd behaviors with soft links in Linux and macOS environments. (Regular links became copies of the referenced file; broken links caused the script to fail when it tried hashing.) Discussing this with @edsu a bit, he helped me realize that the Bagit specification does not cover soft links.
Soft links are worth recording, as their ability to reference arbitrary paths can provide important context when capturing some file systems. (E.g. Apache web server configurations once recommended, and possibly still recommend, soft links to configuration files. I recently observed some game data that does the same.) However, I appreciate that determining a specification form might not be straightforward, as soft links don't intrinsically have contents to hash like regular files, and there are decisions that need to be made clear about whether to follow links that exit the file system scope of the bagging target.
I have a branch of the Bagit Python library that has a shell script that drafts soft link support requirements. The usage comments at the top of that script show my first draft of the test matrix, but I realized later that there's a bit more combinatoric expansion to do. For instance, a bag's manifest file should be able to represent a link whether it points to:
That brainstorming implies these test dimensions:
Before this kind of testing can be implemented, some kind of special manifest is needed for soft links. That Bash test script assumes a file manifest-links.txt (living alongside manifest-$hashname.txt) that has a two-column tab-delimited format: column 0 the link contents, column 1 the path to the link file. This follows the "content, whitespace, path" summarizing pattern of the Bag hash manifests, but relies on tab as a safe character that should not appear in soft links. (Classic HFS allowed tabs in file names, but supported aliases, a subtly different file type from soft links. No more-modern file systems, to my knowledge, allow tabs.)
On a related note, I also work with another representation for file system metadata, Digital Forensics XML. That language represents file metadata as extracted from the file system, with some summaries of file content available (including coverage of the hashes in the Bagit spec). Its solution to representing soft links is to give them a designated type, code "l". In discussion with someone, I recently realized the language could also use a representation for the type of a link's target. So, I've since also come to believe the manifest-links.txt format I'd originally drafted could use an embedded type for the link target.
Is this data/metadata-recording feature support something the Bagit community would be interested in developing further? I'm happy to help spell out the test conditions.
From the "Beyond the Repository" project (notes https://docs.google.com/document/d/1QXLbvEqUS5GGfH7CozcyvX_om8w_xOTH0psmNWCATzM/edit): suggested changes to "2.2.2. Bag Metadata: bag-info.txt":
Contact-Name
Provenance: BagIt
Definition: Person at the source organization who is responsible for the content transfer.
Suggested edit to BagIt spec: consider changing “person” to “Name of individual or group”
Contact-Phone
Provenance: BagIt
Definition: International format telephone number of person or position responsible.
Suggested edit to BagIt spec: consider changing “person or position” to “individual or group”
Contact-Email
Provenance: BagIt
Definition: Fully qualified email address of person or position responsible.
Suggested edit to BagIt spec: consider changing “person or position” to “individual or group”
Should we start working on version 2.0 of the spec for breaking changes?
Some ideas that come to mind:
Payload-Oxum in bagit.txt
According to https://github.com/search?q=bag-software-agent&type=Code there are Python, Ruby, Java, Scala, and Bash implementations of bagit which all set Bag-Software-Agent in bag-info.txt. Should we add that to the well-known field list in the spec?
First of all, great work on the RFC. I feel a bit embarrassed opening an issue for this, but I have a minor suggestion.
Section 2.2.1. "Tag Manifest: tagmanifest-algorithm.txt" of the RFC finishes with the sentence
As a result, no filepath listed in a tag manifest begins "data/".
It should probably read
As a result, no filepath listed in a tag manifest begins with "data/".
Maybe this should be mentioned in the Errata section of the RFC? Anyway, great work!
---
Need to reduce the front page authors to five.
Those names should appear in the "Authors' Addresses" section.
The other names should be moved into a new "Contributors" section.
---
Loads of little nits ...
Abstract
s/specifies/describes/
s/should be/is/
---
When you cite LC-CONFORMANCE-SUITE you are using [[ not [
---
You cite RFC 2234, but that was obsoleted by RFC 4234. Any reason not to
use the more recent reference?
---
Could you please split the References section into two subsections
called "Normative References" and "Informative References" and divide
the references accordingly. Roughly speaking, the Normative References
are those that you absolutely must read to understand this document,
the others are background.
---
Section 1.2
OLD
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
NEW
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
END
...and add a normative reference to RFC 8174
---
A little pedantically, I would like to reduce the emphasis on this
being a "specification" since that has connotations in the RFC series.
We can do that relatively easily as follows.
s/specification/document/
1.3 seven times
2. once
2.1.2 once
2.2.4 once
2.3 once
Abstract
s/specifies/describes/
2.1.1
OLD
The number for this version of the specification is "1.0".
NEW
The number for this version of BagIt is "1.0".
END
3.
OLD
4. For BagIt 1.0, every payload file MUST be listed in every payload
manifest. Note that older versions of this specification allowed
payload files to be listed in just one of the manifests.
5. Every element present MUST comply with this specification.
NEW
4. For BagIt 1.0, every payload file MUST be listed in every payload
manifest. Note that older versions of BagIt allowed payload
files to be listed in just one of the manifests.
5. Every element present MUST conform to BagIt 1.0.
END
---
1.3
I wonder whether you should add "manifest" as a term in this section:
you use it quite a lot without specific explanation.
---
Error handling. I see a lot of description of what must be in a bag and
how it must be formatted, but couldn't find anything about what to do if
a bag is found to be deficient or, for example, if a checksum fails.
I don't think this needs much text. Just a statement about not
processing a bag if it is in error, and possibly something about logging
or reporting the problem.
The spec currently doesn't indicate where community discussion occurs on small (or large) change requests. Should we add a pointer to this GitHub issues list?
Currently, 6.1.1.3. Recommendations includes:
Implementations MUST prevent the creation of bags containing files which differ only in normalization form.
If this stays in the non-normative section 6, then I think it can only be a SHOULD. (Another approach would be to avoid RFC2119 words in the non-normative portion.)
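Whatever keyword is chosen, the check itself is straightforward. A minimal Python sketch follows, using the standard unicodedata module; the function name is hypothetical, and comparing under NFC is an assumption (any single normalization form would serve for detecting such collisions).

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(paths):
    """Group paths that become identical after Unicode normalization."""
    groups = defaultdict(set)
    for path in paths:
        # Paths differing only in normalization form share one NFC key.
        groups[unicodedata.normalize("NFC", path)].add(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

An implementation could run this over the payload file list before writing manifests and refuse to create the bag if the result is non-empty.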