libraryofcongress / bagit-spec Goto Github PK

This project forked from jkunze/bagitspec


bagit-spec's Introduction

bagitspec

This repository is used for managing the development and maintenance of the BagIt File Packaging Format specification as an IETF draft. The spec itself is written in the XML format specified in RFC 2629.

Please feel free to use the issue tracker here to submit feature requests, bugs, etc. Discussion on the digital-curation discussion list is also welcome.

To convert bagit.xml to HTML and text files you will need xml2rfc.

install Python
install pip
pip install -r requirements.txt
make

bagit-spec's Issues

Some feedback on the bagit1.0 branch

This issue contains feedback on the bagit1.0 branch, in particular on the ABNF grammars in section 7.

First, a small grammatical issue in the prose: "Future version of BagIt may disallow" should maybe be "Future versions of BagIt may disallow" or "A future version of BagIt may disallow" at https://github.com/LibraryOfCongress/bagit-spec/blob/bagit1.0/bagit.xml#L719.

ABNF Grammar feedback

The following commentary on the ABNF section of the BagIt 1.0 spec is based on an attempt to create parsers from the supplied ABNF grammars using the Instaparse library. The (still in-development) baglidate repository can be used to test proposed ABNF grammars and determine whether they can parse inputs that are expected to be valid.

7.1. Bag Declaration: bagit.txt

The 4-space indentation at https://github.com/LibraryOfCongress/bagit-spec/blob/bagit1.0/bagit.xml#L1114 and subsequent lines should probably be removed so that it matches the formatting of the following grammars. Also, using bagit.txt on the left-hand side of the start rule seems a bit odd given that no other nonterminals contain periods; maybe bagit-txt would be better?

7.2. Payload Manifest: manifest-algorithm.txt

Here is the original grammar:

payload-manifest = checksum 1*WSP filename ending
checksum         = 1*hex-val
hex-val          = "x" 1*case-hexdig
                 [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig      = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
                 "d" / "F" / "f"
HEXDIG           = DIGIT / "A" / "B" / "C" / "D" / "F"
filename         = (
                  "data"
                  "/"
                  *( unreserved / pct-encoded / sub-delims )
                 )
unreserved       = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims       = "!" / "$" / "&" / "'" / "(" / ")" / "*"
                 / "+" / "," / ";" / "="
pct-encoded      = "%" HEXDIG HEXDIG
ending           = CR / LF / CRLF

The original grammar only captures one line of a manifest-algorithm.txt file. If it is desirable to capture the entire file, the provisional grammar given below shows one approach.
The original grammar requires the character "x" as a prefix for all "hex-encoded checksums". Is this intentional? Previous versions of BagIt did not, I believe, require this.
The original grammar is missing the "e" and "E" characters from its hexadecimal digit sets.
The original grammar allows for effectively empty filenames, i.e., "data/". See the definition of filename above. I presume this should not be permitted.
The original grammar does not appear to allow for solidus (forward slash) characters in filenames.

A provisional amended grammar is given below. (Note that it does not allow for Unicode characters in filenames, unlike the provisional amended grammar given for 7.3 below.) This grammar addresses the issues described above:

payload-manifest = 1*p-m-line
p-m-line         = checksum 1*WSP filename ending
checksum         = 1*hex-val
hex-val          = 1*case-hexdig
                 [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig      = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
                 "d" / "E" / "e" / "F" / "f"
HEXDIG           = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
filename         = (
                    "data"
                    "/"
                    1*( unreserved / pct-encoded / sub-delims )
                   )
unreserved       = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims       = "!" / "$" / "&" / "'" / "(" / ")" / "*"
                 / "+" / "," / ";" / "=" / "/"
pct-encoded      = "%" HEXDIG HEXDIG
ending           = CR / LF / CRLF
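
To make the amended grammar easier to try out, here is a rough Python sketch that approximates it with a regular expression. It is not a faithful ABNF translation: the checksum is simplified to a plain run of hexadecimal digits (the optional "." and "-" groups of hex-val are left out), and the names P_M_LINE and parse_payload_manifest are invented for this illustration.

```python
import re

# Approximation of the amended grammar; all names here are illustrative.
HEXDIG = r"[0-9A-Fa-f]"
# unreserved / pct-encoded / sub-delims, with "/" allowed as in the amended grammar
PATH_CHAR = rf"(?:[A-Za-z0-9\-._~]|%{HEXDIG}{{2}}|[!$&'()*+,;=/])"
P_M_LINE = re.compile(
    rf"(?P<checksum>{HEXDIG}+)[ \t]+(?P<filename>data/{PATH_CHAR}+)(?:\r\n|\r|\n)"
)


def parse_payload_manifest(text):
    """Return a list of (checksum, filename) pairs, or raise ValueError."""
    entries = []
    pos = 0
    while pos < len(text):
        match = P_M_LINE.match(text, pos)
        if match is None:
            raise ValueError(f"invalid manifest line at offset {pos}")
        entries.append((match.group("checksum"), match.group("filename")))
        pos = match.end()
    if not entries:
        # payload-manifest = 1*p-m-line, so an empty file is rejected
        raise ValueError("manifest must contain at least one line")
    return entries
```

With this sketch, a line such as "abc123  data/dir/file.txt" is accepted, while "abc123 data/" (empty filename) and "x1f data/a" (the "x" prefix) are rejected.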

7.3. Bag Metadata: bag-info.txt

Here is the original grammar:

metadata      = key ":" WSP value ending
key           = 1*alpha-numeric
value         = 1*(alpha-numeric ending) *(WSP 1*alpha-numeric ending)
alpha-numeric = ALPHA / DIGIT
ending        = CR / LF / CRLF

This grammar only captures one line of a bag-info.txt file. If it is desirable to capture the entire file, one could add a rule like metadata-set = 1*metadata.
The grammar does not allow for non-ASCII (i.e., Unicode) characters despite the fact that the prose of the spec makes it clear that they are licit. The ABNF spec would appear to lack a provision for Unicode also. A provisional definition of a Unicode counterpart to ABNF's VCHAR (using Java-compatible regexes) would be UNICODE-VCHAR = ( #'\p{Punct}' / #'\p{L}' / #'\d' ).
The prose clearly states that the keys of bag-info.txt lines can contain spaces, yet the grammar does not allow for this. One solution would be to define key as follows (using UNICODE-VCHAR defined in the bullet above): key = UNICODE-VCHAR *( UNICODE-VCHAR / WSP ).
In the original grammar, ending occurs both at the end of metadata and value. This causes problems for parsing.
The original grammar does not capture the fact that continuation lines of a key-value pair must be indented by whitespace.

A provisional amended grammar is given below. Putting aside the fact that it would need revision because it uses regexes outside of the ABNF spec, it does seem to solve the issues described above:

metadata-set            = 1*metadata
metadata                = key ":" WSP value
key                     = UNICODE-VCHAR *( UNICODE-VCHAR / WSP )
value                   = value-line *( 1*WSP value-line )
value-line              = UNICODE-VCHAR *( UNICODE-VCHAR / WSP ) ending
ending                  = CR / LF / CRLF
UNICODE-VCHAR           = ( #'\p{Punct}' / #'\p{L}' / #'\d' )
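
As a sanity check on the intended behavior (keys that may contain spaces, indented continuation lines), the following Python sketch parses bag-info.txt text line by line. It is a loose reading of the grammar rather than a translation of it: it does not restrict characters to UNICODE-VCHAR, it folds continuation lines with a single space, and the function name parse_bag_info is made up for this example.

```python
import re


def parse_bag_info(text):
    """Return a list of (key, value) pairs from bag-info.txt content."""
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final line ending
    entries = []
    for line in lines:
        if line[:1] in (" ", "\t"):
            # Continuation line: indented by whitespace, extends the previous value.
            if not entries:
                raise ValueError("continuation line before any key")
            key, value = entries[-1]
            entries[-1] = (key, value + " " + line.strip())
        else:
            # metadata = key ":" WSP value
            key, sep, value = line.partition(": ")
            if not sep or not key or not value.strip():
                raise ValueError(f"malformed line: {line!r}")
            entries.append((key, value.strip()))
    return entries
```

Splitting on the first ": " is a design shortcut; since the provisional UNICODE-VCHAR includes punctuation, a key containing a colon would need a stricter rule.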

7.4. Fetch File: fetch.txt

Here is the original grammar:

fetch           = url 1*WSP length 1*WSP filename line-terminator
url             = <absolute-URI, see [RFC3986], Section 4.3>
length          = DIGIT
filename        = (
                  "data"
                  "/"
                  *( unreserved / pct-encoded / sub-delims )
                )
line-terminator = CR / LF / CRLF

Again, this describes a single line of a fetch.txt file, and not the entire file.
The length non-terminal only allows a single digit (0-9), not multi-digit integers.
The length rule does not allow for use of the hyphen "-" instead of an integer, as the text states is permissible.
Again, empty filenames like data/ are permitted when they probably should not be.
Small point, but line-terminator is used in this grammar where ending was used for the same rule in the previous grammars.

Here is a provisional revised grammar that addresses the issues listed above:

fetch-file      = 1*fetch-line
fetch-line      = url 1*WSP length 1*WSP filename line-terminator
url             = <absolute-URI, see [RFC3986], Section 4.3>
length          = 1*DIGIT / "-"
filename        = (
                  "data"
                  "/"
                  1*( unreserved / pct-encoded / sub-delims )
                )
line-terminator = CR / LF / CRLF
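
A small Python sketch of the revised fetch-line rule follows. It is only an approximation: the URL is accepted as any run of non-whitespace characters rather than being validated as an RFC 3986 absolute-URI, the filename is likewise only checked for the "data/" prefix and a non-empty remainder, and the names FETCH_LINE and parse_fetch_line are hypothetical.

```python
import re

# Illustrative pattern: url, length (digits or "-"), filename under data/
FETCH_LINE = re.compile(
    r"(?P<url>\S+)[ \t]+(?P<length>[0-9]+|-)[ \t]+(?P<filename>data/\S+)"
)


def parse_fetch_line(line):
    """Return (url, length, filename); length is None when given as "-"."""
    match = FETCH_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"invalid fetch line: {line!r}")
    length = None if match["length"] == "-" else int(match["length"])
    return match["url"], length, match["filename"]
```

Representing "-" as None keeps an unknown length distinct from a length of zero.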

Should not have a MUST in a non-normative section

Currently, 6.1.1.3. Recommendations includes:

Implementations MUST prevent the creation of bags containing files which differ only in normalization form.

If this stays in the non-normative section 6 then I think it can only be a SHOULD. (Another approach would be to avoid RFC 2119 words in the non-normative portion.)

Minor suggestion

First of all, great work on the RFC. I feel a bit embarrassed opening an issue for this, but I have a minor suggestion.

Section 2.2.1. "Tag Manifest: tagmanifest-algorithm.txt" of the RFC finishes with the sentence

As a result, no filepath listed in a tag manifest begins "data/".

It should probably read

As a result, no filepath listed in a tag manifest begins with "data/".

Maybe this should be mentioned in the Errata section of the RFC? Anyway, great work!

Tag and payload manifest checksum concordance

I ran across a bag today that used sha256 for the payload manifest and md5 for the tag manifest. While there's nothing technically wrong with this according to the spec, it is awkward to work with. A recommendation to use the same algorithms between manifests would be useful to discourage that type of bag.

E.g., tag manifests should only use the same hashing algorithm(s) as the payload manifests present in the bag.
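
A check along these lines could be sketched in Python as below. This is only one possible reading of the recommendation (tag manifest algorithms must be a subset of the payload manifest algorithms), and the function names are invented for illustration; the input is a plain list of file names from the bag's base directory.

```python
def split_manifest_algorithms(filenames):
    """Return (payload_algorithms, tag_algorithms) derived from manifest file names."""
    payload, tag = set(), set()
    for name in filenames:
        if not name.endswith(".txt"):
            continue
        stem = name[: -len(".txt")]
        if stem.startswith("tagmanifest-"):
            tag.add(stem[len("tagmanifest-"):].lower())
        elif stem.startswith("manifest-"):
            payload.add(stem[len("manifest-"):].lower())
    return payload, tag


def tag_algorithms_match_payload(filenames):
    """True when every tag manifest algorithm is also used by a payload manifest."""
    payload, tag = split_manifest_algorithms(filenames)
    return tag <= payload
```

The bag described above (sha256 payload manifest, md5 tag manifest) would fail this check.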

being explicit: all fetch.txt items in oxum?

Via https://twitter.com/jakkbl/status/1141492471655886848?s=09
(((sfn))) @ampoffcom
BagIt question: It is a MUST that all resources from fetch.txt are in all manifest-ALG.txt files, but not in the oxum. Does this make sense? #rfc8493

From John Scancella
Agreed. Though why would the fetch items not be in the payload oxum? The bag isn't complete until you fetch the items missing
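
To make the two readings concrete, here is a tiny Python sketch that formats a Payload-Oxum value as OctetCount.StreamCount, with and without the sizes of files still to be fetched. The function name and the idea of passing fetch sizes as a separate argument are assumptions made for this illustration, not something the spec defines.

```python
def payload_oxum(local_sizes, fetch_sizes=()):
    """Format an oxum from file sizes in octets: "<total octets>.<file count>"."""
    sizes = list(local_sizes) + list(fetch_sizes)
    return f"{sum(sizes)}.{len(sizes)}"
```

For two local files of 10 and 20 octets and one 5-octet file listed in fetch.txt, the two readings give "30.2" and "35.3" respectively, so a validator and a bag creator that disagree on the reading will disagree on the oxum.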

Version 2.0

Should we start working on version 2.0 of the spec for breaking changes?
Some ideas that come to mind:

.bagit
putting Payload-Oxum in bagit.txt

Consider adding Bag-Software-Agent to bag info well-known field list

According to https://github.com/search?q=bag-software-agent&type=Code there are Python, Ruby, Java, Scala, and Bash implementations of bagit which all set Bag-Software-Agent in bag-info.txt. Should we add that to the well-known field list in the spec?

cc: @dbrunton @johnscancella @justinlittman @jkunze

Manifest filename escaping

In regards to writing file paths in manifest files, the spec states the following:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].

My reading of the intent of the spec is for the manifest files to be usable by unix checksum utilities. However, this percent-encoding requirement breaks compatibility. While CR and LF are rare to find in a file path, this encoding requirement becomes a problem because it necessitates the encoding of % too. It is fairly common to percent-encode a file name if you're worried about special characters. Per spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the file unusable by checksum utilities.
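
The double-encoding effect can be shown with a short Python sketch of the rule as quoted. The function names are hypothetical, and this is one straightforward reading of the requirement rather than code from any existing implementation.

```python
def encode_filepath(path):
    """Percent-encode only "%", CR and LF, as the quoted rule requires."""
    # "%" is encoded first so that the escapes added for CR and LF are not re-encoded.
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def decode_filepath(encoded):
    """Reverse encode_filepath; "%25" is decoded last for the same reason."""
    return encoded.replace("%0D", "\r").replace("%0A", "\n").replace("%25", "%")
```

A file literally named my%20file.txt is written to the manifest as my%2520file.txt, which a Unix checksum utility would then look up under the wrong name.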

I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single implementation that implements this requirement correctly. Implementations either 1) do no encoding or 2) only encode CR and LF and do not encode %. The first behavior is broken for file names that contain an LF or CR and the second behavior is broken for file names that are naturally percent-encoded. And they're both broken for an actual implementation of the spec.

I am currently working on yet another implementation and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable by any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is more prevalent and less broken than partial encoding.

Unix checksum utilities use an entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as \n and a \ is added to the beginning of the line. Additionally, literal \ characters are escaped with another \ and a \ is also added to the beginning of the line.

For example, let's say that we have a file named new\nline (important: this must be an actual newline and not the characters \ and n) and one named back\slash, and then execute the following:

# On linux
$ sha256sum *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

# On mac
$ shasum -a 256 *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.
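
For comparison with the percent-encoding rule, the escaping described above can be sketched in Python as follows. It covers only the two cases mentioned here (newline and backslash); the utilities themselves may handle further characters, and the function name checksum_line is made up for this example.

```python
def checksum_line(digest, filename):
    """Format one manifest line in the style of the checksum utilities shown above."""
    # Escape backslashes first, then newlines, so the added "\" is not doubled.
    escaped = filename.replace("\\", "\\\\").replace("\n", "\\n")
    # A leading "\" marks a line whose file name had to be escaped.
    prefix = "\\" if escaped != filename else ""
    return f"{prefix}{digest}  {escaped}"
```

File names that need no escaping produce an ordinary "digest, two spaces, name" line, so existing manifests would be unaffected.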

Add note to spec pointing to this issues list?

The spec currently doesn't indicate where community discussion occurs on small (or large) change requests. Should we add a pointer to this GitHub issues list?

Changes requested from ISE review

---

Need to reduce the front page authors to five.
Those names should appear in the "Authors' Addresses" section.
The other names should be moved into a new "Contributors" section.

---

Loads of little nits ...

Abstract
s/specifies/describes/
s/should be/is/

---

When you cite LC-CONFORMANCE-SUITE you are using [[ not [

---

You cite RFC 2234, but that was obsoleted by RFC 4234. Any reason not to
use the more recent reference?

---

Could you please split the References section into two subsections
called "Normative References" and "Informative References" and divide
the references accordingly. Roughly speaking, the Normative References
are those that you absolutely must read to understand this document,
the others are background.

---

Section 1.2

OLD
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].
NEW
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
END

...and add a normative reference to RFC 8174

---

A little pedantically, I would like to reduce the emphasis on this
being a "specification" since that has connotations in the RFC series.
We can do that relatively easily as follows.

s/specification/document/
1.3 seven times
2. once
2.1.2 once
2.2.4 once
2.3 once

Abstract
s/specifies/describes/

2.1.1
OLD
   The number for this version of the specification is "1.0".
NEW
   The number for this version of BagIt is "1.0".
END

3.
OLD
   4.  For BagIt 1.0, every payload file MUST be listed in every payload
       manifest.  Note that older versions of this specification allowed
       payload files to be listed in just one of the manifests.

   5.  Every element present MUST comply with this specification.
NEW
   4.  For BagIt 1.0, every payload file MUST be listed in every payload
       manifest.  Note that older versions of BagIt allowed payload
       files to be listed in just one of the manifests.

   5.  Every element present MUST conform to BagIt 1.0.
END

---

1.3
I wonder whether you should add "manifest" as a term in this section:
you use it quite a lot without specific explanation.

---

Error handling. I see a lot of description of what must be in a bag and
how it must be formatted, but couldn't find anything about what to do if
a bag is found to be deficient or, for example, if a checksum fails.

I don't think this needs much text. Just a statement about not
processing a bag if it is in error, and possibly something about logging
or reporting the problem.

Reference RFC2234 for ABNF and core rules?

The current 1.0 branch doesn't mention RFC2234 as the source of ABNF syntax.

It might also be good to explicitly mention the core rules used (e.g., DIGIT, HEXDIG), as is done in RFC 3986 (URI syntax), for example.

Requiring every payload file to be in every payload manifest

Currently, every payload file must be in at least one payload manifest. The proposal is to change this to require every payload file to be in every payload manifest.

The motivation for this is that it reflects actual real world usage and reduces errors. However, it would mean that some existing bags may not be forward compatible with version 1.0.
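
The proposed rule is simple to check mechanically. A minimal Python sketch, assuming the manifests have already been parsed into a mapping from algorithm name to the set of listed file paths (that data shape and the function name are assumptions for this example):

```python
def files_missing_from_manifests(payload_files, manifests):
    """Return {algorithm: [missing paths]} for manifests that omit payload files."""
    payload = set(payload_files)
    missing = {}
    for algorithm, listed in manifests.items():
        absent = sorted(payload - set(listed))
        if absent:
            missing[algorithm] = absent
    return missing
```

An empty result means the bag satisfies the stricter "every file in every manifest" rule; under the current "at least one manifest" rule, a non-empty result is not necessarily an error.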

Please add comments to this ticket.

bag-info.txt clarifications

From the "Beyond the Repository" project (notes https://docs.google.com/document/d/1QXLbvEqUS5GGfH7CozcyvX_om8w_xOTH0psmNWCATzM/edit): suggested changes to "2.2.2. Bag Metadata: bag-info.txt":

Contact-Name
Provenance: BagIt
Definition: Person at the source organization who is responsible for the content transfer.
Suggested edit to BagIt spec: consider changing “person” to “Name of individual or group”

Contact-Phone
Provenance: BagIt
Definition: International format telephone number of person or position responsible.
Suggested edit to BagIt spec: consider changing “person or position” to “individual or group”

Contact-Email
Provenance: BagIt
Definition: Fully qualified email address of person or position responsible.
Suggested edit to BagIt spec: consider changing “person or position” to “individual or group”

Specification-level support needed for soft links

I work with a group that uses the Python Bagit library to package downloaded software applications. On some occasions, we've encountered an inability to use the package because of odd behaviors with soft links in Linux and macOS environments. (Regular links became copies of the referenced file; broken links caused the script to fail when it tried hashing.) Discussing this with @edsu a bit, he helped me realize that the Bagit specification does not cover soft links.

Soft links are worth recording, as their ability to reference arbitrary paths can provide important context when capturing some file systems. (E.g., Apache web server configurations once recommended, and possibly still recommend, soft links to configuration files. I recently observed some game data that does the same.) However, I appreciate that determining a specification form might not be straightforward, as soft links don't intrinsically have contents to hash like regular files, and decisions need to be made clear about whether to follow links that exit the file system scope of the bagging target.

I have a branch of the Bagit Python library that has a shell script that drafts soft link support requirements. The usage comments at the top of that script show my first draft of the test matrix, but I realized later that there's a bit more combinatoric expansion to do. For instance, a bag's manifest file should be able to represent a link whether it points to:

A regular file
A directory
A path to a non-existent file
A link to a regular file
A link to a directory
All of the above when the link references a file either (a) still within a bag, or (b) external to the bag
All of the above when the link is absolute or relative
Not capturing any file or directory references that are under a link to a directory.
The applicable cases above when the file exists but without read permissions (i.e. where file system metadata can be captured, but not file contents)

That brainstorming implies these test dimensions:

name type - r, d, -, l(r), l(d), l(-)
relativity - relative, absolute
containment (within bag target directory) - internal, external
under soft link to directory - yes, no
link target has read permission - yes, no

Before this kind of testing can be implemented, some kind of special manifest is needed for soft links. That Bash test script assumes a file manifest-links.txt (living alongside manifest-$hashname.txt) that has a two-column tab-delimited format: column 0 is the link contents, column 1 the path to the link file. This follows the "content, whitespace, path" summarizing pattern of the Bag hash manifests, but relies on tab as a safe character that should not appear in soft links. (Classic HFS allowed tabs in file names, but supported aliases, a subtly different file type from soft links. No more-modern file systems, to my knowledge, allow tabs.)
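
As a rough illustration of that two-column format, the Python sketch below walks a bag directory and produces one "link contents, tab, path" line per soft link. It is not the Bash script mentioned above; the function name is invented, and links are recorded without being followed, so nothing beneath a link to a directory is captured.

```python
import os


def link_manifest_lines(bag_root):
    """Return sorted "<link contents>\\t<path relative to bag_root>" lines."""
    lines = []
    # os.walk does not follow links to directories by default; such links
    # show up in dirnames, so both lists are inspected.
    for dirpath, dirnames, filenames in os.walk(bag_root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                target = os.readlink(full)  # works for broken links too
                rel = os.path.relpath(full, bag_root).replace(os.sep, "/")
                lines.append(f"{target}\t{rel}")
    return sorted(lines)
```

Because os.readlink returns the stored link text, relative and absolute targets are preserved exactly as written, which matches the "relativity" test dimension listed earlier.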

On a related note, I also work with another representation for file system metadata, Digital Forensics XML. That language represents file metadata as extracted from the file system, with some summaries of file content available (including coverage of the hashes in the Bagit spec). Its solution to representing soft links is to give them a designated type, code "l". In discussion with someone, I recently realized the language could also use a representation for the type of a link's target. So, I've since also come to believe the manifest-links.txt format I'd originally drafted could use an embedded type for the link target.

Is this data/metadata-recording feature support something the Bagit community would be interested in developing further? I'm happy to help spell out the test conditions.

Convention for ZIP archiving a bag?

Earlier drafts of BagIt included a section on Serialization that recommended that a bag be archived with a single folder in the zip/tar that then contained bagit.txt etc. (rather than bagit.txt being in the root).

This was removed in 94bcdaa by @acdha but I can't see the discussion that led to this.

Is this still the convention for ZIP or .tar archiving a bag? I see that https://github.com/fair-research/bdbag by @mikedarcy et al. assumes the opposite: it will only correctly validate a zip if bagit.txt is in the root.
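
Until the convention is settled, a consumer could tolerate both layouts. The following Python sketch (function names are hypothetical) reports whether bagit.txt sits at the archive root or inside a single top-level folder, using only the member names from the standard zipfile module.

```python
import zipfile


def locate_bag_root(names):
    """Return "" if bagit.txt is at the root, the folder name if it is inside
    a single top-level folder, or None if neither layout is found."""
    if "bagit.txt" in names:
        return ""
    top_levels = {name.split("/", 1)[0] for name in names if name}
    if len(top_levels) == 1:
        (top,) = top_levels
        if f"{top}/bagit.txt" in names:
            return top
    return None


def locate_bag_root_in_zip(path):
    """Apply locate_bag_root to the member names of a ZIP file."""
    with zipfile.ZipFile(path) as archive:
        return locate_bag_root(archive.namelist())
```

Returning the folder name rather than a boolean lets the caller decide which layout to accept or how to unpack.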

I am asking because we are making such archives for downloading a bag, and obviously would want to be on the right side of the convention :)
