bagit-spec's People

Contributors

acdha, edsu, jkunze, johnscancella, jscancella, justinlittman, stain, zimeon

bagit-spec's Issues

Some feedback on the bagit1.0 branch

This issue contains feedback on the bagit1.0 branch, in particular on the ABNF grammars in section 7.

First, a small grammatical issue in the prose: "Future version of BagIt may disallow" should probably be "Future versions of BagIt may disallow" or "A future version of BagIt may disallow"; see https://github.com/LibraryOfCongress/bagit-spec/blob/bagit1.0/bagit.xml#L719.

ABNF Grammar feedback

The following commentary on the ABNF section of the BagIt 1.0 spec is based on an attempt to create parsers from the supplied ABNF grammars using the Instaparse library. The (still in-development) baglidate repository can be used to test proposed ABNF grammars and determine whether they can parse inputs that are expected to be valid.

7.1. Bag Declaration: bagit.txt

7.2. Payload Manifest: manifest-algorithm.txt

Here is the original grammar:

payload-manifest = checksum 1*WSP filename ending
checksum         = 1*hex-val
hex-val          = "x" 1*case-hexdig
                 [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig      = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
                 "d" / "F" / "f"
HEXDIG           = DIGIT / "A" / "B" / "C" / "D" / "F"
filename         = (
                  "data"
                  "/"
                  *( unreserved / pct-encoded / sub-delims )
                 )
unreserved       = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims       = "!" / "$" / "&" / "'" / "(" / ")" / "*"
                 / "+" / "," / ";" / "="
pct-encoded      = "%" HEXDIG HEXDIG
ending           = CR / LF / CRLF

  • The original grammar only captures one line of a manifest-algorithm.txt file. If it is desirable to capture the entire file, the provisional grammar given below shows one approach.

  • The original grammar requires the character "x" as a prefix for all "hex-encoded checksums". Is this intentional? Previous versions of BagIt did not, I believe, require this.

  • The original grammar is missing the "e" and "E" characters from its hexadecimal digit sets.

  • The original grammar allows for effectively empty filenames, i.e., "data/". See the definition of filename above. I presume this should not be permitted.

  • The original grammar does not appear to allow for solidus (forward slash) characters in filenames.

A provisional amended grammar is given below. (Note that it does not allow for Unicode characters in filenames, unlike the provisional amended grammar given for 7.3 below.) This grammar addresses the issues described above:

payload-manifest = 1*p-m-line
p-m-line         = checksum 1*WSP filename ending
checksum         = 1*hex-val
hex-val          = 1*case-hexdig
                 [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig      = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
                 "d" / "E" / "e" / "F" / "f"
HEXDIG           = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
filename         = (
                    "data"
                    "/"
                    1*( unreserved / pct-encoded / sub-delims )
                   )
unreserved       = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims       = "!" / "$" / "&" / "'" / "(" / ")" / "*"
                 / "+" / "," / ";" / "=" / "/"
pct-encoded      = "%" HEXDIG HEXDIG
ending           = CR / LF / CRLF
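As a rough illustration of what the amended grammar is meant to accept, here is a minimal Python sketch that approximates it with a regular expression. The names `P_M_LINE` and `parse_payload_manifest` are hypothetical, the dotted and hyphenated checksum forms are left out for brevity, pct-encoded is assumed to allow the full hexadecimal range, and a missing final line ending is tolerated, which is more lenient than the grammar.

```python
import re

# One allowed filename character: unreserved, sub-delims (plus "/"), or pct-encoded.
_PATH_CHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=/]|%[0-9A-Fa-f]{2})"

# Approximation of p-m-line: checksum, 1*WSP, then "data/" plus at least one character.
P_M_LINE = re.compile(rf"([0-9A-Fa-f]+)[ \t]+(data/{_PATH_CHAR}+)")


def parse_payload_manifest(text):
    """Return (checksum, filename) pairs; raise ValueError on a line the sketch rejects."""
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # the empty string left after the final line ending
    if not lines:
        raise ValueError("a payload manifest needs at least one line")
    entries = []
    for number, line in enumerate(lines, start=1):
        match = P_M_LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"line {number} does not match: {line!r}")
        entries.append((match.group(1), match.group(2)))
    return entries
```

With this sketch, `data/dir/file.txt` is accepted, while `data/` on its own and a checksum prefixed with "x" are rejected.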

7.3. Bag Metadata: bag-info.txt

Here is the original grammar:

metadata      = key ":" WSP value ending
key           = 1*alpha-numeric
value         = 1*(alpha-numeric ending) *(WSP 1*alpha-numeric ending)
alpha-numeric = ALPHA / DIGIT
ending        = CR / LF / CRLF

  • This grammar only captures one line of a bag-info.txt file. If it is desirable to capture the entire file, one could add a rule like metadata-set = 1*metadata.

  • The grammar does not allow for non-ASCII (i.e., Unicode) characters despite the fact that the prose of the spec makes it clear that they are licit. The ABNF spec would appear to lack a provision for Unicode also. A provisional definition of a Unicode counterpart to ABNF's VCHAR (using Java-compatible regexes) would be UNICODE-VCHAR = ( #'\p{Punct}' / #'\p{L}' / #'\d' ).

  • The prose clearly states that the keys of bag-info.txt lines can contain spaces, yet the grammar does not allow for this. One solution would be to define key as follows (using UNICODE-VCHAR defined in the bullet above): key = UNICODE-VCHAR *( UNICODE-VCHAR / WSP ).

  • In the original grammar, ending occurs both at the end of metadata and value. This causes problems for parsing.

  • The original grammar does not capture the fact that continuation lines of a key-value pair must be indented by whitespace.

A provisional amended grammar is given below. Putting aside the fact that it would need revision because it uses regexes outside of the ABNF spec, it does seem to solve the issues described above:

metadata-set            = 1*metadata
metadata                = key ":" WSP value
key                     = UNICODE-VCHAR *( UNICODE-VCHAR / WSP )
value                   = value-line *( 1*WSP value-line )
value-line              = UNICODE-VCHAR *( UNICODE-VCHAR / WSP ) ending
ending                  = CR / LF / CRLF
UNICODE-VCHAR           = ( #'\p{Punct}' / #'\p{L}' / #'\d' )
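To make the intended behaviour concrete, here is a minimal Python sketch of a reader that follows the shape of the amended grammar: keys may contain spaces and non-ASCII characters, and continuation lines must start with whitespace. The function name `parse_bag_info` is hypothetical, the key is assumed to end at the first colon, and character classes are not checked, so this is looser than the grammar rather than an implementation of it.

```python
import re


def parse_bag_info(text):
    """Return (key, value) pairs, folding indented continuation lines into the value."""
    pairs = []
    for line in re.split(r"\r\n|\r|\n", text):
        if not line:
            continue  # lenient: skip empty lines, including the one after the final ending
        if line[0] in (" ", "\t"):
            # Continuation line: must follow an existing key-value pair.
            if not pairs:
                raise ValueError("continuation line with no preceding key")
            key, value = pairs[-1]
            pairs[-1] = (key, value + " " + line.strip())
            continue
        # Assumption: the key ends at the first colon, which must be followed by WSP.
        key, colon, rest = line.partition(":")
        if not key or not colon or rest[:1] not in (" ", "\t") or not rest.strip():
            raise ValueError(f"not a 'key: value' line: {line!r}")
        pairs.append((key, rest.strip()))
    return pairs
```

Folding continuation lines with a single space is a design choice of this sketch; the grammar itself says nothing about how folded values are joined.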

7.4. Fetch File: fetch.txt

Here is the original grammar:

fetch           = url 1*WSP length 1*WSP filename line-terminator
url             = <absolute-URI, see [RFC3986], Section 4.3>
length          = DIGIT
filename        = (
                  "data"
                  "/"
                  *( unreserved / pct-encoded / sub-delims )
                )
line-terminator = CR / LF / CRLF

  • Again, this describes a single line of a fetch.txt file, and not the entire file.

  • The length non-terminal only allows integers 0-9.

  • The length rule does not allow for use of the hyphen "-" instead of an integer, as the text states is permissible.

  • Again, empty filenames like data/ are permitted when they probably should not be.

  • Small point, but line-terminator is used in this grammar where ending was used for the same rule in the previous grammars.

Here is a provisional revised grammar that addresses the issues listed above:

fetch-file      = 1*fetch-line
fetch-line      = url 1*WSP length 1*WSP filename line-terminator
url             = <absolute-URI, see [RFC3986], Section 4.3>
length          = 1*DIGIT / "-"
filename        = (
                  "data"
                  "/"
                  1*( unreserved / pct-encoded / sub-delims )
                )
line-terminator = CR / LF / CRLF
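A minimal Python sketch of a reader for one fetch-line under the revised grammar follows. The name `parse_fetch_line` is hypothetical, the URL is not validated against RFC 3986, and filename characters are not checked; the sketch only shows the multi-digit length, the "-" placeholder, and the rejection of an empty `data/` filename.

```python
def parse_fetch_line(line):
    """Return (url, length, filename); length is None when the line uses "-"."""
    parts = line.split()  # 1*WSP between fields; the trailing line terminator is dropped
    if len(parts) != 3:
        raise ValueError(f"expected three fields: {line!r}")
    url, length, filename = parts
    if length == "-":
        size = None  # length unknown
    elif length.isascii() and length.isdigit():
        size = int(length)  # 1*DIGIT, not just a single digit
    else:
        raise ValueError(f"length must be digits or '-': {length!r}")
    if not filename.startswith("data/") or filename == "data/":
        raise ValueError(f"filename must be a non-empty path under data/: {filename!r}")
    return url, size, filename
```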

Tag and payload manifest checksum concordance

I ran across a bag today that used sha256 for the payload manifest and md5 for the tag manifest. While there's nothing technically wrong with this according to the spec, it is awkward to work with. A recommendation to use the same algorithms between manifests would be useful to discourage that type of bag.

E.g., "Tag manifests should only use the same hashing algorithm(s) as the payload manifests present in the bag."
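A check for this recommendation could look something like the following Python sketch. It infers the algorithms from the manifest filenames in the bag's base directory; the function names are hypothetical, and "concordance" is interpreted here as the tag manifest algorithms being a subset of the payload manifest algorithms.

```python
import re

# manifest-<alg>.txt is a payload manifest, tagmanifest-<alg>.txt is a tag manifest.
_MANIFEST_NAME = re.compile(r"(tag)?manifest-(.+)\.txt")


def manifest_algorithms(filenames):
    """Return (payload_algorithms, tag_algorithms) as sets of lowercase names."""
    payload, tag = set(), set()
    for name in filenames:
        match = _MANIFEST_NAME.fullmatch(name)
        if match:
            (tag if match.group(1) else payload).add(match.group(2).lower())
    return payload, tag


def tag_algorithms_concordant(filenames):
    """True when every tag manifest algorithm is also used by a payload manifest."""
    payload, tag = manifest_algorithms(filenames)
    return tag <= payload
```

The bag described above (sha256 payload manifest, md5 tag manifest) would fail this check.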

Manifest filename escaping

With regard to writing file paths in manifest files, the spec states the following:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].

My reading of the intent of the spec is for the manifest files to be usable by unix checksum utilities. However, this percent-encoding requirement breaks compatibility. While CR and LF are rare to find in a file path, this encoding requirement becomes a problem because it necessitates the encoding of % too. It is fairly common to percent-encode a file name if you're worried about special characters. Per spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the file unusable by checksum utilities.
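To illustrate the double-encoding problem, here is a minimal Python sketch of the encoding as I read the quoted requirement. The function names are hypothetical, and the uppercase escapes `%0D`, `%0A`, and `%25` are an assumption about how an implementation would write them.

```python
import re


def encode_manifest_path(path):
    """Percent-encode only "%", CR, and LF, as the quoted requirement describes."""
    # "%" goes first so the escapes introduced for CR and LF are not re-encoded.
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


_DECODE = {"%0D": "\r", "%0A": "\n", "%25": "%"}


def decode_manifest_path(encoded):
    """Reverse encode_manifest_path in a single pass over the string."""
    return re.sub(
        r"%(?:0D|0A|25)",
        lambda match: _DECODE[match.group(0).upper()],
        encoded,
        flags=re.IGNORECASE,
    )
```

A file that is already named `my%20file.txt` is written to the manifest as `data/my%2520file.txt`, which is the name a checksum utility would then fail to find on disk.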

I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single implementation that implements this requirement correctly. Implementations either 1) do no encoding or 2) only encode CR and LF and do not encode %. The first behavior is broken for file names that contain an LF or CR and the second behavior is broken for file names that are naturally percent-encoded. And they're both broken for an actual implementation of the spec.

I am currently working on yet another implementation, and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable by any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is more prevalent and less broken than doing the partial encoding.

Unix checksum utilities use an entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as \n and a \ is added to the beginning of the line. Additionally, literal \ characters are escaped with another \ and a \ is also added to the beginning of the line.

For example, let's say that we have a file named new\nline (important: this must be an actual newline and not the characters \ and n) and one named back\slash, and then execute the following:

# On linux
$ sha256sum *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

# On mac
$ shasum -a 256 *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.
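For comparison, the escaping shown in that output can be reproduced with a few lines of Python. This is a sketch based only on the behaviour described above (newline and backslash); the function name `checksum_line` is hypothetical, and some versions of these utilities may escape additional characters that are not handled here.

```python
def checksum_line(digest, filename):
    """Format one line the way the sha256sum/shasum output above does."""
    if "\\" in filename or "\n" in filename:
        # Escape backslashes first, then newlines, and flag the line with a leading "\".
        escaped = filename.replace("\\", "\\\\").replace("\n", "\\n")
        return "\\" + digest + "  " + escaped
    return digest + "  " + filename
```

File names without a backslash or newline are written unchanged, which is what keeps ordinary manifests readable by the checksum utilities.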

Requiring every payload file to be in every payload manifest

Currently, every payload file must be in at least one payload manifest. The proposal is to change this to require every payload file to be in every payload manifest.

The motivation for this is that it reflects actual real world usage and reduces errors. However, it would mean that some existing bags may not be forward compatible with version 1.0.
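The proposed rule is easy to check mechanically. A minimal Python sketch, assuming the manifests have already been read into a mapping from algorithm name to the set of listed paths (the function name and the data shape are hypothetical):

```python
def missing_from_manifests(payload_files, manifests):
    """Map each algorithm to the payload files its manifest fails to list.

    payload_files: iterable of paths such as "data/a.txt"
    manifests: dict of algorithm name -> iterable of paths listed in that manifest
    """
    payload = set(payload_files)
    report = {}
    for algorithm, listed in manifests.items():
        missing = payload - set(listed)
        if missing:
            report[algorithm] = sorted(missing)
    return report
```

An empty result means every payload file appears in every payload manifest; a bag that is valid under the current "at least one manifest" rule can still produce a non-empty result.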

Please add comments to this ticket.

Convention for ZIP archiving a bag?

Earlier drafts of BagIt included a section on Serialization that recommended that a bag be archived with a single folder in the zip/tar, which then contained bagit.txt etc. (rather than bagit.txt being in the root).

This was removed in 94bcdaa by @acdha but I can't see the discussion that led to this.

Is this still the convention for ZIP or .tar archiving a bag? I see that https://github.com/fair-research/bdbag by @mikedarcy et al. assumes the opposite: it will only validate a zip correctly if bagit.txt is in the root.

I am asking because we are making such archives for downloading a bag, and obviously would want to be on the right side of the convention :)
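A reader that wants to tolerate both layouts could detect which one an archive uses. The following Python sketch is one way to do that for ZIP files; the function names are hypothetical and it does not reflect how bdbag or any other existing tool behaves.

```python
import zipfile


def bag_root_in_names(names):
    """Return "" if bagit.txt is at the archive root, the folder name if the bag
    sits inside a single top-level folder, or None if neither layout is found."""
    if "bagit.txt" in names:
        return ""
    top_levels = {name.split("/", 1)[0] for name in names if name}
    if len(top_levels) == 1:
        (top,) = top_levels
        if f"{top}/bagit.txt" in names:
            return top
    return None


def bag_root_in_zip(path):
    """Apply bag_root_in_names to the member list of a ZIP archive."""
    with zipfile.ZipFile(path) as archive:
        return bag_root_in_names(archive.namelist())
```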

Specification-level support needed for soft links

I work with a group that uses the Python Bagit library to package downloaded software applications. On some occasions, we've encountered an inability to use the package because of odd behaviors with soft links in Linux and macOS environments. (Regular links became copies of the referenced file; broken links caused the script to fail when it tried hashing.) Discussing this with @edsu a bit, he helped me realize that the Bagit specification does not cover soft links.

Soft links are worth recording, as their ability to reference arbitrary paths can provide important context when capturing some file systems. (E.g. Apache web server configurations once recommended, and possibly still recommend, soft links to configuration files. I recently observed some game data that does the same.) However, I appreciate that determining a specification form might not be straightforward, as soft links don't intrinsically have contents to hash like regular files, and there are decisions that need to be made clear about whether to follow links that exit the file system scope of the bagging target.

I have a branch of the Bagit Python library that has a shell script that drafts soft link support requirements. The usage comments at the top of that script show my first draft of the test matrix, but I realized later that there's a bit more combinatoric expansion to do. For instance, a bag's manifest file should be able to represent a link whether it points to:

  • A regular file
  • A directory
  • A path to a non-existent file
  • A link to a regular file
  • A link to a directory
  • All of the above when the link references a file either (a) still within a bag, or (b) external to the bag
  • All of the above when the link is absolute or relative
  • Not capturing any file or directory references that are under a link to a directory.
  • Whichever of the above apply when the file exists but without read permissions (i.e. where file system metadata can be captured, but not file contents)

That brainstorming implies these test dimensions:

  • name type - r, d, -, l(r), l(d), l(-)
  • relativity - relative, absolute
  • containment (within bag target directory) - internal, external
  • under soft link to directory - yes, no
  • link target has read permission - yes, no

Before this kind of testing can be implemented, some kind of special manifest is needed for soft links. That Bash test script assumes a file manifest-links.txt (living alongside manifest-$hashname.txt) that has a two-column tab-delimited format: column 0 is the link contents, column 1 is the path to the link file. This follows the "content, whitespace, path" summarizing pattern of the Bag hash manifests, but relies on tab as a safe character that should not appear in soft links. (Classic HFS allowed tabs in file names, but supported aliases, a subtly different file type from soft links. No more-modern file systems, to my knowledge, allow tabs.)
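As a sketch of the two-column format described above, the following Python code collects soft links under a directory and formats them as tab-delimited entries. It is not the Bash script mentioned earlier; the function names are hypothetical, and rejecting any field that contains a tab or line break is an assumption made to keep the format unambiguous.

```python
import os


def format_link_entry(target, link_path):
    """Format one line: column 0 is the link contents, column 1 the path to the link."""
    for field in (target, link_path):
        if "\t" in field or "\n" in field or "\r" in field:
            raise ValueError(f"field cannot be represented: {field!r}")
    return f"{target}\t{link_path}"


def parse_link_entry(line):
    """Split one line back into (target, link_path)."""
    target, tab, link_path = line.rstrip("\r\n").partition("\t")
    if not tab or "\t" in link_path:
        raise ValueError(f"expected exactly two tab-separated columns: {line!r}")
    return target, link_path


def collect_links(root):
    """Return (target, relative link path) pairs for every soft link under root."""
    entries = []
    # os.walk does not descend into linked directories by default, so nothing
    # underneath a link to a directory is captured.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                entries.append((os.readlink(full), os.path.relpath(full, root)))
    return sorted(entries, key=lambda entry: entry[1])
```

Because `os.readlink` returns the link contents without resolving them, broken links and links that point outside the bag are recorded the same way as any other link.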

On a related note, I also work with another representation for file system metadata, Digital Forensics XML. That language represents file metadata as extracted from the file system, with some summaries of file content available (including coverage of the hashes in the Bagit spec). Its solution to representing soft links is to give them a designated type, code "l". In discussion with someone, I recently realized the language could also use a representation for the type of a link's target. So, I've since also come to believe the manifest-links.txt format I'd originally drafted could use an embedded type for the link target.

Is this data/metadata-recording feature support something the Bagit community would be interested in developing further? I'm happy to help spell out the test conditions.

bag-info.txt clarifications

From the "Beyond the Repository" project (notes https://docs.google.com/document/d/1QXLbvEqUS5GGfH7CozcyvX_om8w_xOTH0psmNWCATzM/edit): suggested changes to "2.2.2. Bag Metadata: bag-info.txt":

Contact-Name
Provenance: BagIt
Definition: Person at the source organization who is responsible for the content transfer.
Suggested edit to BagIt spec: consider changing “person” to “Name of individual or group”

Contact-Phone
Provenance: BagIt
Definition: International format telephone number of person or position responsible.
Suggested edit to BagIt spec: consider changing “person or position” to “individual or group”

Contact-Email
Provenance: BagIt
Definition: Fully qualified email address of person or position responsible.
Suggested edit to BagIt spec: consider changing “person or position” to “individual or group”

Version 2.0

Should we start working on version 2.0 of the spec for breaking changes?
Some ideas that come to mind:

  • .bagit
  • putting Payload-Oxum in bagit.txt

Minor suggestion

First of all, great work on the RFC. I feel a bit embarrassed opening an issue for this, but I have a minor suggestion.

Section 2.2.1. "Tag Manifest: tagmanifest-algorithm.txt" of the RFC finishes with the sentence

As a result, no filepath listed in a tag manifest begins "data/".

It should probably read

As a result, no filepath listed in a tag manifest begins with "data/".

Maybe this should be mentioned in the Errata section of the RFC? Anyway, great work!

Changes requested from ISE review

---

Need to reduce the front page authors to five.
Those names should appear in the "Authors' Addresses" section.
The other names should be moved into a new "Contributors" section.

---

Loads of little nits ...

Abstract
s/specifies/describes/
s/should be/is/

---

When you cite LC-CONFORMANCE-SUITE you are using [[ not [

---

You cite RFC 2234, but that was obsoleted by RFC 4234. Any reason not to
use the more recent reference?

---

Could you please split the References section into two subsections
called "Normative References" and "Informative References" and divide
the references accordingly. Roughly speaking, the Normative References
are those that you absolutely must read to understand this document,
the others are background.

---

Section 1.2

OLD
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].
NEW
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
END

...and add a normative reference to RFC 8174

---

A little pedantically, I would like to reduce the emphasis on this
being a "specification" since that has connotations in the RFC series.
We can do that relatively easily as follows.

s/specification/document/
1.3 seven times
2. once
2.1.2 once
2.2.4 once
2.3 once

Abstract
s/specifies/describes/

2.1.1
OLD
   The number for this version of the specification is "1.0".
NEW
   The number for this version of BagIt is "1.0".
END

3.
OLD
   4.  For BagIt 1.0, every payload file MUST be listed in every payload
       manifest.  Note that older versions of this specification allowed
       payload files to be listed in just one of the manifests.

   5.  Every element present MUST comply with this specification.
NEW
   4.  For BagIt 1.0, every payload file MUST be listed in every payload
       manifest.  Note that older versions of BagIt allowed payload
       files to be listed in just one of the manifests.

   5.  Every element present MUST conform to BagIt 1.0.
END

---

1.3
I wonder whether you should add "manifest" as a term in this section:
you use it quite a lot without specific explanation.

---

Error handling. I see a lot of description of what must be in a bag and
how it must be formatted, but couldn't find anything about what to do if
a bag is found to be deficient or, for example, if a checksum fails.

I don't think this needs much text. Just a statement about not
processing a bag if it is in error, and possibly something about logging
or reporting the problem.

Should not have a MUST in a non-normative section

Currently, 6.1.1.3. Recommendations includes:

Implementations MUST prevent the creation of bags containing files which differ only in normalization form.

If this stays in the non-normative section 6, then I think it can only be a SHOULD. (Another approach would be to avoid RFC 2119 words in the non-normative portion.)
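For context, the check that this recommendation asks of implementations can be sketched in Python as follows. The function name is hypothetical, and comparing under NFC is an assumption; any single normalization form would expose the same collisions.

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(paths):
    """Group paths that differ only in Unicode normalization form."""
    groups = defaultdict(set)
    for path in paths:
        # Paths that normalize to the same NFC string land in the same group.
        groups[unicodedata.normalize("NFC", path)].add(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For example, `caf\u00e9.txt` (precomposed) and `cafe\u0301.txt` (combining accent) end up in one group, so a bag containing both would be flagged.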
