jkunze / bagitspec

bagitspec's Introduction

bagitspec

This repository is used for managing the development and maintenance of the BagIt File Packaging Format specification as an IETF draft. The spec itself is written in the XML format specified in RFC 2629.

Please feel free to use the issue tracker here to submit feature requests, bugs, etc. Discussion on the digital-curation discussion list is also welcome.

To convert bagit.xml to HTML and text files you will need xml2rfc.

install Python
install pip
pip install -r requirements.txt
make

bagitspec's People

Contributors

edsu ardvaark stain libraryofcongress

bagitspec's Issues

Release new BagIt spec?

Hi, the current draft draft-kunze-bagit-14 expired 11 months ago (http://www.ietf.org/id/draft-kunze-bagit now 404) - should we not publish a new spec? What is stopping BagIt from moving to RFC status?

There has been continued wide interest in BagIt, most recently at a Research Data Alliance meeting on Data Archives where all but 1 presenter used BagIt.

Are files from fetch.txt part of the payload?

It is unclear from the spec whether files that fetch.txt places under the data/ directory must be listed in the manifest-* files. Because fetch.txt permits "-" for an unknown file size, my first interpretation is "no". But in that case, completing the bag by downloading everything in fetch.txt would turn a valid, incomplete bag into an invalid one, which is a bit odd.

It is unclear if file sizes from fetch.txt should be included in the calculations of bag-info.txt properties like Bag-Size (bag being transferred), Payload-Oxum. Are "Payload files" only the files that actually exist within the data/ folder, or does that include the fetch.txt payload files?

I understand fetch.txt files may also be tagfiles - but the spec allows for tagfiles to not be listed in the tag manifests - so this question is only relevant for fetch files to data/.
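The ambiguity can be made concrete with a minimal sketch that computes Payload-Oxum (octet count, then file count) both ways. The function name `payload_oxum` and the `include_fetch` flag are hypothetical, and the code assumes fetch.txt lines have the form `URL LENGTH PATH` with "-" meaning an unknown length.

```python
from pathlib import Path


def payload_oxum(bag_dir, include_fetch=False):
    """Return "octets.streams", or None if a needed fetch size is "-"."""
    bag = Path(bag_dir)
    # Sizes of the payload files that physically exist under data/.
    sizes = [p.stat().st_size for p in (bag / "data").rglob("*") if p.is_file()]

    fetch = bag / "fetch.txt"
    if include_fetch and fetch.exists():
        for line in fetch.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            _url, length, path = line.split(None, 2)
            if (bag / path.lstrip("/")).exists():
                continue  # already downloaded, so already counted above
            if length == "-":
                return None  # size not declared, so the total is unknowable
            sizes.append(int(length))
    return f"{sum(sizes)}.{len(sizes)}"
```

With `include_fetch=False` the result describes the bag as transferred; with `include_fetch=True` it describes the bag as it would be once complete. Which of the two the spec intends is exactly the open question.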

More complex data structures for bag-info.txt

I'm running tools that update our bags between receipt and ingest. I would like to record PREMIS events for each of these, and bag-info.txt seems like the most appropriate location. However, bag-info.txt can only contain key: value lines.

My PREMIS implementation would use a YAML format like this:

Bagging-Date: 2017-06-22
Payload-Oxum: 12345.67
Bag-History:
  - Event-Date-Time: 20170622155934EDT
    Event-Detail-Information: "md_1.json, md_2.json, md_3.json updated"
    Event-Outcome: Pass
    Event-Outcome-Detail-Note: "Bag no longer valid"
    Event-Type: "Payload Metadata Update"
  - Event-Date-Time: 20170622160001EDT
    Event-Detail-Information: "Hashes updated as follows filename, previous hash, new hash md_1.json, 09678de75874f324793a8cafd2db4ea3, 8066e52b17095446e41f57fdd88fe405 ...\n"
    Event-Outcome: Pass
    Event-Type: "Bag Hash Update"
  - Event-Date-Time: 2017062216045EDT
    Event-Detail-Information: "Previous Oxum - 12345.67 New Oxum - 12344.67\n"
    Event-Outcome: Pass
    Event-Type: "Bag Oxum Update"

The simpler route would be to create a standalone bag-premis.txt for this information, but I wanted to see if there was interest in incorporating a more complex structured data format into bag-info.txt
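To show why nested events do not fit the current format, here is a minimal sketch of a bag-info.txt reader. The function name `parse_bag_info` is hypothetical, and the code assumes the only structure available is "Label: value" lines plus indented continuation lines. Everything comes out as a flat, ordered list of pairs.

```python
def parse_bag_info(text):
    """Parse bag-info.txt text into an ordered list of (label, value) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and pairs:
            # Indented line: continuation of the previous value.
            label, value = pairs[-1]
            pairs[-1] = (label, value + " " + line.strip())
        else:
            label, _, value = line.partition(":")
            pairs.append((label.strip(), value.strip()))
    return pairs
```

A reader like this would treat the indented "- Event-Date-Time: ..." lines as continuation text of Bag-History rather than as a list of events, which is why a structured history needs either a richer format or a separate tag file.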

What are the semantics of fetch.txt items that already exist?

The meaning of files that are listed in fetch.txt but already exist in the bag is currently undefined.

How should a consumer of such a bag interpret this?

a) The existing file came from that URL (but may no longer be available)
b) The existing file should also be available at that URL (and can thus be removed from the payload directory)
c) The existing file should be replaced with a download from that URL
d) None of the above

My suggestion is b).
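A consumer first has to notice the situation before choosing among these interpretations. The following minimal sketch lists fetch.txt entries whose target already exists in the bag; the function name is hypothetical and fetch.txt lines are assumed to have the form `URL LENGTH PATH`.

```python
from pathlib import Path


def fetch_targets_already_present(bag_dir):
    """Return (path, url) for each fetch.txt entry whose file already exists."""
    bag = Path(bag_dir)
    present = []
    for line in (bag / "fetch.txt").read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        url, _length, path = line.split(None, 2)
        # Treat the path as relative to the bag root, even if written "/data/...".
        if (bag / path.lstrip("/")).is_file():
            present.append((path, url))
    return present
```

Under interpretation b), a tool could safely delete every file this returns (after verifying its checksum) and rely on fetch.txt to restore it later.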

How should a resource in fetch.txt be content negotiated?

When resolving a URL in fetch.txt, you may get different results depending on content negotiation. Therefore you may get a different resource back (e.g. HTML instead of JSON) depending on the browser and client setting used to retrieve such a resource - obviously if you get the "wrong one" the bagit checksums will be wrong.

I think the specification should recognize this, and perhaps specify the default Accept headers to use, e.g.:

Accept-Language: *
Accept-Charset: *
Accept: application/octet-stream, */*;q=0.1

The headers Accept-Language and Accept-Charset may be excluded as their default is *.
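As a minimal sketch of what a client following this suggestion might do, the request below carries the proposed headers using Python's standard `urllib.request`. The constant and function names are hypothetical, and the header values are the ones proposed above, not anything the spec currently mandates.

```python
import urllib.request

# Header values taken from the proposal above; not defined by the spec.
DEFAULT_FETCH_HEADERS = {
    "Accept": "application/octet-stream, */*;q=0.1",
    "Accept-Language": "*",
    "Accept-Charset": "*",
}


def build_fetch_request(url):
    """Build a request for a fetch.txt URL with explicit negotiation headers."""
    return urllib.request.Request(url, headers=DEFAULT_FETCH_HEADERS)
```

The request would then be sent with `urllib.request.urlopen(build_fetch_request(url))`, and the downloaded bytes checked against the manifest as usual.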

Missing media type (MIME type) for BagIt

A media type (also known as a MIME type) is a widely used system for labelling the format of data. There is a central database of Media/MIME types maintained by the Internet Assigned Numbers Authority (IANA). More information about Media/MIME types is available at the Media Type Wikipedia entry.

Currently, there is no media type for BagIt.

This lack of a media type can cause problems, particularly in situations where a file might (or might not) be a BagIt file. As a concrete example, DataCite is updating their metadata schema so that it supports accessing the files in a dataset. One possibility is to provide the data directly (e.g., as a zip file); another possibility is to describe how to fetch the data using an empty BagIt file (one with an empty /data directory and details on how to fetch the data via the fetch.txt file). The DataCite metadata scheme supports recording the Media Type of the file; however, in both cases, the file would have the media type application/zip. A client may wish to download the data if it is a BagIt file (for example, to obtain metadata), but is currently unable to determine whether the linked zip file is a BagIt file.

Media type labels are somewhat sophisticated and include a few features that may prove useful for BagIt.

One feature of media types is the availability of suffixes. This allows a media type to describe both the file format and the underlying format; e.g., application/bagit+zip could describe a BagIt file that is based on the zip archive format. This allows clients that do not support BagIt but that do support zip (application/zip) archives to process the file; for example, to check the integrity of the files in the archive or to scan the file for viruses.

Another feature of media types is parameters. Parameters allow a media type to include metadata about the file. One common parameter is profile. This provides a flexible way to be more specific about the nature of the file without creating many new media types. There is already a profile language for BagIt: BagIt-profiles.

Altogether, this is an example of my suggestion for a BagIt Media Type:

application/bagit+zip;profile=https://example.org/bagit/my-profile

I would advocate a discussion on what the BagIt media type should look like. Once a consensus is established, the corresponding media type should then be registered with IANA, so that it may be used to describe BagIt files.
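To illustrate how a client could take advantage of the suffix and the profile parameter, here is a minimal sketch that splits such a string into its parts. The function name `parse_media_type` is hypothetical, and application/bagit+zip is only the proposal made above, not a registered type.

```python
def parse_media_type(value):
    """Split a media type string into (type, subtype, suffix, params)."""
    essence, *raw_params = [part.strip() for part in value.split(";")]
    top, _, subtype = essence.lower().partition("/")
    # A structured suffix, if any, follows the last "+" in the subtype.
    base, plus, suffix = subtype.rpartition("+")
    if not plus:
        base, suffix = subtype, None
    params = {}
    for item in raw_params:
        if not item:
            continue
        name, _, val = item.partition("=")
        params[name.strip().lower()] = val.strip().strip('"')
    return top, base, suffix, params
```

A client that does not know "bagit" can still see the "zip" suffix and fall back to generic zip handling, while a BagIt-aware client can read the profile parameter before downloading anything.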

Multiple URLs in fetch.txt for same file?

In one use-case for handling Big Data (tm) with BagIt, we've been discussing if it's valid to list the same file multiple times in fetch.txt from different locations, e.g.:

fetch.txt

http://www.example.com/bigfile.txt 1099511627776 /data/bigfile.txt
http://cdn.example.org/bigfile.txt 1099511627776 /data/bigfile.txt
ftp://ftp.example.com/pub/bigfile.txt 1099511627776 /data/bigfile.txt
gsiftp://grid.example.com/store5/bigfile.txt 1099511627776 /data/bigfile.txt
magnet:?xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C 1099511627776 /data/bigfile.txt

Reading the spec I don't see how this is invalid (except perhaps the magnet link not being a URL, just a URI).

I think this could be quite powerful - should this be explicitly permitted? Obviously the choice of which one to use would have to be down to the client, falling back to top-first or something.
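A client that accepts such a fetch.txt would need to group the alternatives per target file. This is a minimal sketch under that assumption; the function name is hypothetical, and the "try in listed order" policy is just the top-first fallback mentioned above.

```python
def group_fetch_by_target(fetch_text):
    """Map each target path to its candidate (url, length) pairs, in file order."""
    targets = {}
    for line in fetch_text.splitlines():
        if not line.strip():
            continue
        url, length, path = line.split(None, 2)
        targets.setdefault(path, []).append((url, length))
    return targets
```

For each target the client could then skip schemes it does not support (for example gsiftp or magnet) and try the remaining URLs in order until one download matches the manifest checksum.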

Don't allow multiple incomplete manifests (was: Make explicit that incomplete manifests are permitted)

3.4 says

Every file in every payload manifest MUST be present.

Every payload file MUST be listed in at least one manifest.
Payload files MAY be listed in more than one payload manifest.

so does that mean it is perfectly valid to have payload files mentioned in only some manifests? E.g.

manifest-md5.txt

e2b25c29051b9d6f0ae7ef5b12a3251b data/file1.txt

manifest-sha1.txt

24f5c8113eade58f5d5fa1b38fbbe1076ad3408a data/file2.txt

That means it is impossible to validate a bag without going through ALL the manifest-* files - for nothing else but to check that there is not a spurious extra file in any of the manifest files that the consumer is not able to checksum against. (at least the fileformat of the file is given :)

What is the purpose of allowing multiple manifest files? I would assume it would be to allow consumers to pick one they support and validate the content of the bag - rather than for producers that update an existing bag without caring about existing manifest files. If this is the case, then I would hope for all manifests to be complete and list the same payload files (but not necessarily in the same order.)

This applies as well to the tag manifests, if present.
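The check a consumer is currently forced to make can be sketched as follows. The function names are hypothetical, and manifest lines are assumed to be "CHECKSUM PATH" separated by whitespace.

```python
from pathlib import Path


def manifest_paths(manifest_file):
    """Return the set of file paths listed in one manifest."""
    paths = set()
    for line in Path(manifest_file).read_text(encoding="utf-8").splitlines():
        if line.strip():
            _digest, path = line.split(None, 1)
            paths.add(path)
    return paths


def incomplete_manifests(bag_dir):
    """Map each payload manifest to the paths it omits but another one lists."""
    listed = {m.name: manifest_paths(m) for m in Path(bag_dir).glob("manifest-*.txt")}
    everything = set().union(*listed.values())
    missing = {}
    for name, paths in listed.items():
        if everything - paths:
            missing[name] = everything - paths
    return missing
```

For the two-manifest example above, this reports that each manifest is missing the file listed in the other. If the spec required every manifest to be complete, an empty result would be guaranteed and a consumer could validate against any single manifest.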

release v0.98

Does the fix for #2 merit another IETF draft release?

Where is the 1.0 spec?

Congrats on RFC 8493! Is the plan to put the latest spec text in this repository or over at https://github.com/libraryofcongress/bagit-spec ?

manifest filename with newline

Over in the bagit-python repository we had an issue opened regarding a validation error for a newly created bag. The issue was tracked down to filenames that had an embedded carriage return (0x0d) in them, which made their way into the manifest, and ultimately disrupted validation.

One approach would be to prevent the creation of bags with filenames that have embedded CR, LF or CRLF. This would involve throwing an exception or error during bag creation. Another would be to allow these filenames to exist in the manifest, but to take care to encode them in such a way that doesn't disturb the line oriented format of the manifest.

I think it's in the spirit of BagIt to do the latter, accepting that some filesystems allow CR, LF and CRLF to be present in the filename.
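One way the second approach could look is sketched below. Percent-encoding is an assumption made for illustration (the discussion above does not pick an encoding), and both function names are hypothetical.

```python
def encode_manifest_path(path):
    """Escape characters that would break the one-entry-per-line manifest."""
    # Encode "%" first so the escapes added afterwards are not re-encoded.
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def decode_manifest_path(encoded):
    """Reverse encode_manifest_path (expects upper-case hex escapes)."""
    # Decode "%25" last so a literal "%0D" in a filename survives the round trip.
    return encoded.replace("%0D", "\r").replace("%0A", "\n").replace("%25", "%")
```

Only CR, LF and the escape character itself are touched, so existing manifests without those characters would read exactly as before.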

Unicode Normalization?

The BagIt specification lets you specify that UTF-8 encoding be used in tag manifests. But it doesn't appear to assume a particular normalization form.

I have a problem where files are bagged and transferred from an OS X filesystem (which uses NFD) and are copied to Linux (which uses NFC). During validation the NFC normalized form from the filesystem is compared against the NFD normalized form from the manifest and validation fails.

Should a particular normalization form (NFC?) be assumed for unicode encodings?
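The mismatch is easy to reproduce with Python's standard `unicodedata` module. In this minimal sketch the helper name `same_filename` is hypothetical, and NFC is used only because it is the form suggested above.

```python
import unicodedata


def same_filename(a, b, form="NFC"):
    """Compare two filenames after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


nfc_name = "caf\u00e9.txt"   # "é" as one precomposed code point
nfd_name = "cafe\u0301.txt"  # "e" followed by a combining acute accent
```

Here `nfc_name == nfd_name` is false even though both render identically, while `same_filename(nfc_name, nfd_name)` is true. A validator that normalizes both the manifest entry and the filesystem name before comparing would not fail on the transfer described above.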

How to handle empty directories

Both are valid according to the spec, but I think that dropping the empty directory is not expected behavior. The spec has a suggestion in 2.1.3 to use a zero-length file with the same name as the directory and keep as an extension (pointed out to me by @andrewjbtw), but I haven't seen any bagging tools follow this suggestion. Andrew and I do not see this as an expected behavior.

Can the specification be more explicit about how to handle empty directories? I prefer bagit-python's strategy, but realize that this means empty directories have no bearing on the completeness or fixity of a bag.
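Because manifests list only files, a tool has to look for empty directories explicitly if it wants to warn about them or preserve them. A minimal sketch, with a hypothetical function name:

```python
import os


def empty_payload_dirs(bag_dir):
    """Return payload directories (relative to the bag) that contain nothing."""
    data_dir = os.path.join(bag_dir, "data")
    return sorted(
        os.path.relpath(root, bag_dir)
        for root, dirs, files in os.walk(data_dir)
        if not dirs and not files
    )
```

Whatever the spec ends up saying, a bagging tool could use a list like this to warn the user or to add placeholder files.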