bagit_indexer's People

Contributors

mjordan

bagit_indexer's Issues

Provide a 'tombstone' option to indicate that Bag was deleted

If a bag has been deleted, we should not delete its document in the index. Rather, we should

  1. update the tombstone field from false to true, and
  2. update the document's timestamp

That way, the last state of the document, and all of its previous versions, remain in the index.
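A minimal sketch of the two steps above, using plain dictionaries rather than a real index client. The "tombstone" field comes from this issue; the "timestamp" field name, the sample document, and the tombstone_version helper are assumptions made for illustration only.

```python
from copy import deepcopy
from datetime import datetime, timezone


def tombstone_version(document, now=None):
    """Return a new version of `document` marked as deleted.

    The input is left untouched so the caller can keep it as a prior version.
    """
    now = now or datetime.now(timezone.utc)
    updated = deepcopy(document)
    updated["tombstone"] = True
    updated["timestamp"] = now.isoformat()  # field name is an assumption
    return updated


# Illustrative data only: append the new version instead of deleting the old one.
versions = [
    {"id": "bag_567", "tombstone": False, "timestamp": "2017-01-01T00:00:00+00:00"}
]
versions.append(tombstone_version(versions[-1]))
```

Appending instead of overwriting is what keeps the earlier versions available; how versions are actually stored in the index is not decided here.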

Index bag-info.txt fields that are moved to a METS file when a Bag is processed by Archivematica

When Archivematica processes a zipped bag that contains a bag-info.txt file, the information is transferred to a METS file, where it can be found in a predictable location: the analog/digital source metadata sub-section of the administrative metadata section. See the attached example (the original bag-info.txt file and the METS file generated by Archivematica are included).

The tool that extracts data from the bags and indexes it will need to know where the data we want to index is located (e.g., in a METS file, in the analog/digital source metadata sub-section of the administrative metadata section, or, if that data doesn't exist, in the bag-info.txt file).

Using something like the proof of concept BagIt Indexer, example logic would be: If there is a file named "METS.xml" at the root of the Bag's /data directory, look for data at /data/METS.xml// and index it so each element is in a searchable field; if "METS.xml" doesn't exist, index the fields in /bag-info.txt. (We'll need a third fallback option here, in case there is a METS.xml file but it doesn't contain /.) Pretty standard stuff. The gotcha here is that if we add a third deposit type (say a non-bag deposit), the indexer script would need to know where to get the relevant data for that deposit type. Given today's software development practices, this sort of extensibility can be handled by using a plugin architecture where a different plugin detects the presence of the desired data and extracts it, and then passes it off to the indexing engine.

Of course, using the same names for the METS elements and bag-info.txt fields will result in better queries. However, plugins can also map source fields to a common field name for indexing purposes.
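The plugin idea described above could look roughly like the following sketch. It assumes the bag has already been unpacked to a directory. The class names, the detect/extract interface, and FIELD_MAP are all hypothetical. The METS plugin's extract step is deliberately a placeholder, because the element location inside METS.xml is not spelled out here; returning nothing lets the runner fall back to bag-info.txt, which also covers the "METS.xml exists but lacks the data" case.

```python
from pathlib import Path

# Hypothetical mapping from source labels to common index field names.
FIELD_MAP = {"Source-Organization": "source_organization"}


def parse_bag_info(text):
    """Parse 'Label: value' lines; indented lines continue the previous value.

    Repeated labels overwrite earlier ones in this simplified sketch.
    """
    fields, last = {}, None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and last is not None:
            fields[last] += " " + line.strip()
        elif ":" in line:
            label, value = line.split(":", 1)
            last = label.strip()
            fields[last] = value.strip()
    return fields


class MetsPlugin:
    def detect(self, bag_dir):
        return (Path(bag_dir) / "data" / "METS.xml").is_file()

    def extract(self, bag_dir):
        # Placeholder: the element location is not specified in this issue,
        # so this sketch yields nothing and the next plugin takes over.
        return {}


class BagInfoPlugin:
    def detect(self, bag_dir):
        return (Path(bag_dir) / "bag-info.txt").is_file()

    def extract(self, bag_dir):
        text = (Path(bag_dir) / "bag-info.txt").read_text(encoding="utf-8")
        return parse_bag_info(text)


def extract_fields(bag_dir, plugins):
    """Return mapped fields from the first plugin that detects and yields data."""
    for plugin in plugins:
        if plugin.detect(bag_dir):
            fields = plugin.extract(bag_dir)
            if fields:
                return {FIELD_MAP.get(k, k): v for k, v in fields.items()}
    return {}
```

Calling extract_fields(bag_dir, [MetsPlugin(), BagInfoPlugin()]) tries the plugins in order. Supporting a new deposit type would then mean adding another plugin to the list rather than changing the indexing code.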

Support subdirectories

Currently, the directory specified in --input needs to be flat. The ability to put Bags in subdirectories would be useful.

The watcher script should be updated to be recursive too.
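As a rough illustration of recursive discovery, the sketch below walks the input directory at any depth. The find_bags name and the "*.zip" pattern are assumptions; the indexer may need to recognise other serializations as well.

```python
from pathlib import Path


def find_bags(input_dir, pattern="*.zip"):
    """Return every file matching `pattern` under input_dir, at any depth."""
    return sorted(p for p in Path(input_dir).rglob(pattern) if p.is_file())
```

Path.rglob also matches files directly inside input_dir, so a flat directory keeps working unchanged.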

Add option to index specific files in data directory

Maybe by providing a command-line option that takes a comma-separated list of paths, like

--files_to_index="foo.txt,dir/bar.xml"

Not sure what the resulting field names should be. Perhaps ES's dynamic mapping is useful here.

To start, make this a dumb indexer, with no preprocessing of XML files, etc.
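One possible shape for the "dumb" version, sketched with argparse. The --files_to_index option name comes from this issue. The helper names, and the choice to use each relative path as the key, are assumptions, since the field names are still undecided. The sketch also assumes the bag is unpacked and that the listed paths are relative to its data directory.

```python
import argparse
from pathlib import Path


def parse_paths(value):
    """Split a comma-separated option value into a list of relative paths."""
    return [p.strip() for p in value.split(",") if p.strip()]


def read_data_files(bag_dir, relative_paths):
    """Read the raw text of each listed file under <bag_dir>/data, if present."""
    data_dir = Path(bag_dir) / "data"
    contents = {}
    for rel in relative_paths:
        path = data_dir / rel
        if path.is_file():
            # No preprocessing: the file content is passed through as-is.
            contents[rel] = path.read_text(encoding="utf-8", errors="replace")
    return contents


parser = argparse.ArgumentParser()
parser.add_argument("--files_to_index", type=parse_paths, default=[])
args = parser.parse_args(["--files_to_index=foo.txt,dir/bar.xml"])
# args.files_to_index is ['foo.txt', 'dir/bar.xml']
```

Missing files are skipped silently in this sketch; whether that should instead be reported as an error is left open.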

Use sha1 hashes as bag identifiers

Currently the indexer uses the bag's filename as its ID. For example, for a bag at the location /mnt/storage/bag_567.zip, the ID will be 'bag_567'. This ID is only unique for bags within a given directory, and if we rename a file, any association between its ID and its name is lost.

To work around these limitations, we should use the bag's sha1 checksum as its ID, since the probability of a collision is extremely low. If the probability of collisions is low enough for Git to use sha1 hashes as unique IDs, it's good enough for an index of bags. Perhaps we can even allow users to enter the first x digits of a sha1 hash in find queries.

The main disadvantage is that bag filenames are much more human-readable than hashes.
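A small sketch of the hashing idea, using Python's hashlib. The function names are hypothetical, and the prefix lookup is only one way the "first x digits" suggestion might work.

```python
import hashlib


def bag_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_by_prefix(ids, prefix):
    """Return all IDs that start with `prefix` (case-insensitive)."""
    prefix = prefix.lower()
    return [i for i in ids if i.startswith(prefix)]
```

Because the ID is derived from the file's contents, renaming or moving a bag does not change it. A short prefix can match more than one bag, so find_by_prefix returns a list and leaves disambiguation to the caller.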
