mjordan / bagit_indexer

Proof-of-concept tool for extracting data from Bags and indexing it in Elasticsearch

License: The Unlicense

PHP 80.02% Ruby 2.41% Shell 7.94% Python 9.63%

bagit digital-preservation elasticsearch

bagit_indexer's Issues

Provide a 'tombstone' option to indicate that a Bag was deleted

If a bag has been deleted, we should not delete its document in the index. Rather, we should

update the tombstone field from false to true, and
update the document's timestamp

That way, the last state of the document, and all of its previous versions, remain in the index.
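A minimal sketch of the update described above, in Python for illustration only (the indexer itself is PHP). It builds the body of a partial update, using the `doc` wrapper that Elasticsearch's Update API expects. The `timestamp` field name is borrowed from the timestamp issue further down and is an assumption, as is the helper name `tombstone_update`.

```python
import time


def tombstone_update(now=None):
    """Build a partial-update body that marks a Bag's document as deleted.

    The document is kept in the index; only the tombstone flag and the
    timestamp change. 'timestamp' is an assumed field name holding
    whole seconds since the epoch.
    """
    if now is None:
        now = int(time.time())
    return {"doc": {"tombstone": True, "timestamp": int(now)}}


if __name__ == "__main__":
    # The resulting dict would be sent as the body of an update request
    # for the deleted Bag's document ID.
    print(tombstone_update())
```

Sending the request is left out on purpose, since it depends on which client the index script ends up using.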

'content' field should be text type

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/text.html

Index bag-info.txt fields that are moved to a METS file when a Bag is processed by Archivematica

When Archivematica processes a zipped bag with a bag-info.txt file in it, the information is transferred to a METS file, and can be found in a predictable location in the analog/digital source metadata sub-section of the administrative metadata section. See the attached example (the original bag-info.txt file and the METS file generated by Archivematica are included).

The tool that extracts data from the bags and indexes it will need to know where the data we want to index is located (e.g., in a METS file in the analog/digital source metadata sub-section of the administrative metadata section, or, if that data doesn't exist, in the bag-info.txt file).

Using something like the proof of concept BagIt Indexer, example logic would be: If there is a file named "METS.xml" at the root of the Bag's /data directory, look for data at /data/METS.xml// and index it so each element is in a searchable field; if "METS.xml" doesn't exist, index the fields in /bag-info.txt. (We'll need a third fallback option here, in case there is a METS.xml file but it doesn't contain /.) Pretty standard stuff. The gotcha here is that if we add a third deposit type (say a non-bag deposit), the indexer script would need to know where to get the relevant data for that deposit type. Given today's software development practices, this sort of extensibility can be handled by using a plugin architecture where a different plugin detects the presence of the desired data and extracts it, and then passes it off to the indexing engine.

Of course, using the same names for the METS elements and bag-info.txt fields will result in better queries. However, plugins can also map source fields to a common field name for indexing purposes.
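The plugin idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not the indexer's design: a Bag is modeled as a dict of relative path to file content, the class and function names are hypothetical, and the element path inside METS.xml is passed in by the caller because the exact location is not spelled out here. Namespaced METS elements would need extra handling that this sketch skips.

```python
import xml.etree.ElementTree as ET


def parse_bag_info(text):
    """Parse 'Label: value' lines; indented lines continue the previous value.

    A repeated label overwrites the earlier value in this sketch.
    """
    fields, last = {}, None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and last is not None:
            fields[last] += " " + line.strip()
        elif ":" in line:
            label, value = line.split(":", 1)
            last = label.strip()
            fields[last] = value.strip()
    return fields


class MetsPlugin:
    path = "data/METS.xml"

    def __init__(self, element_path):
        # Assumption: the caller knows where the metadata lives in the METS file.
        self.element_path = element_path

    def extract(self, bag):
        if self.path not in bag:
            return None
        parent = ET.fromstring(bag[self.path]).find(self.element_path)
        if parent is None:
            return None  # METS.xml exists but lacks the section: fall through
        return {child.tag: (child.text or "").strip() for child in parent}


class BagInfoPlugin:
    path = "bag-info.txt"

    def extract(self, bag):
        if self.path not in bag:
            return None
        return parse_bag_info(bag[self.path])


def extract_fields(bag, plugins, field_map=None):
    """Return fields from the first plugin that finds data, renamed via field_map."""
    field_map = field_map or {}
    for plugin in plugins:
        fields = plugin.extract(bag)
        if fields:
            return {field_map.get(name, name): value for name, value in fields.items()}
    return {}
```

Ordering the plugins as `[MetsPlugin(...), BagInfoPlugin()]` gives the METS-first, bag-info.txt-second fallback, and a new deposit type would only need one more class with an `extract` method.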

Queries on bag_location return all records

For example, ./find -q bag_location:bag_01.tgz will return all records in the index.

Add a field that contains the timestamp the document was created

ES doesn't do this automatically (_timestamp was deprecated in version 2 and removed in 5.0). It would be useful to be able to query when the document was added to our index. Maybe something like:

  timestamp: {
    type: 'date',
    format: 'epoch_second'
  }
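For illustration, here is how a value for that field could be produced at index time. The sketch is in Python (the indexer is PHP), the mapping is simply the fragment above written as a dict, and the helper name `with_timestamp` is hypothetical.

```python
import time

# The mapping fragment proposed above, expressed as a Python dict.
TIMESTAMP_MAPPING = {"timestamp": {"type": "date", "format": "epoch_second"}}


def with_timestamp(document, now=None):
    """Return a copy of the document with 'timestamp' set to whole epoch seconds."""
    stamped = dict(document)
    stamped["timestamp"] = int(time.time()) if now is None else int(now)
    return stamped


if __name__ == "__main__":
    print(with_timestamp({"bag_location": "/mnt/storage/bag_567.zip"}))
```

The `epoch_second` format expects seconds rather than milliseconds, which is why the value is truncated to an integer.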

Add support to index script for renamed and moved bags

Related to #9.

Make script names look cool

php bagit_indexer.php => ./index
php bagit_search.php => ./find
python bagit_watcher.py => ./watch

bag_location field should contain full path

For some reason the extension is missing.

Install more stuff on VM

php5-cli, composer, git

Also, remove /home/vagrant/elasticsearch-5.4.1.deb

Support subdirectories

Currently, the directory specified in --input needs to be flat. Ability to put Bags in subdirectories would be useful.

Watcher script should be updated to be recursive too.
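A recursive scan of the --input directory might look like the sketch below. It is written in Python for illustration, the function name `find_bags` is hypothetical, and the list of extensions is an assumption based on the .zip and .tgz examples that appear in other issues.

```python
import os

# Assumption: serialized Bags end in one of these extensions.
BAG_EXTENSIONS = (".zip", ".tgz")


def find_bags(input_dir):
    """Yield the path of every Bag file under input_dir, at any depth."""
    for dirpath, dirnames, filenames in os.walk(input_dir):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if name.lower().endswith(BAG_EXTENSIONS):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for bag_path in find_bags("."):
        print(bag_path)
```

For the watcher, if it is built on the watchdog library (the on_moved() handler mentioned in another issue suggests so, but that is an assumption), passing `recursive=True` to `Observer.schedule` should cover subdirectories.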

Figure out how to allow AND queries

For example, AND tombstone:false.
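One possible approach is to turn each field:value clause into a `match` clause inside a `bool` query's `must` list, which Elasticsearch treats as a logical AND. The sketch below only builds the request body. It is in Python for illustration, and the function name `build_and_query` is hypothetical.

```python
def build_and_query(clauses):
    """Combine 'field:value' strings into a bool query where all must match."""
    must = []
    for clause in clauses:
        # Split on the first colon only, so values may contain colons.
        field, sep, value = clause.partition(":")
        if not sep or not field or not value:
            raise ValueError("expected field:value, got %r" % clause)
        must.append({"match": {field: value}})
    return {"query": {"bool": {"must": must}}}


if __name__ == "__main__":
    print(build_and_query(["bag_location:bag_01.tgz", "tombstone:false"]))
```

Alternatively, Elasticsearch's `query_string` query accepts the `AND` operator directly in the query text, which may be closer to how ./find takes its -q argument.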

Add option to index specific files in data directory

Maybe by providing a command-line option that takes a comma-separated list of paths, like

--files_to_index="foo.txt,dir/bar.xml"

Not sure what the resulting field names should be. Perhaps ES's dynamic mapping is useful here.

To start, make this a dumb indexer, with no preprocessing of XML files, etc.
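A rough sketch of the "dumb" version, in Python for illustration: parse the comma-separated option, then read each listed file from the Bag's data directory as raw text. Using the relative path as the field name is only a placeholder, since the naming question above is still open. The function names are hypothetical.

```python
import argparse
import os


def parse_files_to_index(argv):
    """Return the list of relative paths given in --files_to_index."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--files_to_index", default="")
    args = parser.parse_args(argv)
    return [p.strip() for p in args.files_to_index.split(",") if p.strip()]


def read_data_files(data_dir, relative_paths):
    """Read each listed file as text, with no preprocessing; skip missing files."""
    fields = {}
    for rel in relative_paths:
        full = os.path.join(data_dir, rel)
        if os.path.isfile(full):
            with open(full, encoding="utf-8", errors="replace") as handle:
                fields[rel] = handle.read()  # placeholder field name: the path
    return fields


if __name__ == "__main__":
    import sys
    paths = parse_files_to_index(sys.argv[1:])
    print(read_data_files("data", paths))
```

One thing to watch if paths are used as field names: Elasticsearch interprets dots in field names as object paths, so a name like "foo.txt" may need escaping or renaming before indexing.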

For new, moved, renamed, and modified bags, the indexer script should compare the bag's checksum with the one in the index

The watcher script should use the bag's checksum to confirm that it is dealing with the same bag file in its on_moved() method.
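The comparison itself is small. The sketch below, in Python like the watcher, assumes the checksum is a SHA-1 hex digest of the serialized bag file (as proposed in the sha1 issue) and that the indexed value has already been fetched from the index. The function names are hypothetical.

```python
import hashlib


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_bag(path, indexed_checksum):
    """True if the file at path has the same checksum as the indexed document."""
    return file_sha1(path) == indexed_checksum.lower()
```

In an on_moved() handler, a match would mean only the location needs updating, while a mismatch would suggest the file at the new path is a different or modified bag.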

Use sha1 hashes as bag identifiers

Currently the indexer uses the bag's filename as its ID. For example, for a bag at the location /mnt/storage/bag_567.zip, the ID will be 'bag_567'. This ID is only unique for bags within a given directory, and if we rename a file, any association between its ID and its name is lost.

To work around these limitations, we should use the bag's sha1 checksum as its ID, since the probability of a collision is extremely low. If the probability of collisions is low enough for Git to use sha1 hashes as unique IDs, it's good enough for an index of bags. Perhaps we can even allow users to enter the first x digits of a sha1 hash in find queries.

The main disadvantage is that bag filenames are much more human-readable than hashes.
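To make the prefix idea concrete, here is a minimal Python sketch that resolves a partial hash against a list of known IDs, accepting it only when exactly one ID matches (similar in spirit to how Git handles abbreviated hashes). The function names are hypothetical, and a real implementation would query the index rather than an in-memory list.

```python
import hashlib


def bag_id(data):
    """Return the SHA-1 hex digest of the bag's bytes, used as its ID."""
    return hashlib.sha1(data).hexdigest()


def resolve_prefix(prefix, known_ids):
    """Return the single ID that starts with prefix, or raise LookupError."""
    prefix = prefix.lower()
    matches = [known for known in known_ids if known.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError("no bag ID starts with %r" % prefix)
    raise LookupError("%r is ambiguous: %d bag IDs match" % (prefix, len(matches)))


if __name__ == "__main__":
    ids = [bag_id(b"first bag"), bag_id(b"second bag")]
    print(resolve_prefix(ids[0][:10], ids))
```

Keeping the filename as a separate field alongside the hash ID would preserve some of the readability that is lost.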
