exponential-decay / demystify Goto Github PK

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes, break the export into its components and store them within a SQLite database; create additional columns to augment the output where useful; and query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.

Home Page: http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export

License: zlib License

Python 42.18% HTML 57.69% Makefile 0.13%

code4lib digital-preservation pronom archives format-analysis duplicate-detection digipres collection-profiling

demystify's Introduction

Demystify

Static analysis and reporting for file-format reports generated by digital preservation tools, DROID and Siegfried.

Working example Siegfried: Siegfried Govdocs Select Results...
Working example DROID: DROID Govdocs Select Results...

Introduction

Utility for the analysis of DROID CSV and Seigfried file-format reports. The tool has three purposes:

break the export into its components and store them within a set of tables in a SQLite database for performance and consistent access;
provide additional information about a collection's profile where useful;
and query the SQLite database, outputting results in a visually pleasant report for further analysis by digital preservation specialists and archivists.

For departments on-boarding archivists or building digital capability, the report contains descriptions, written by archivists for each of the statistics output.

Analysis of file format reports

This Code4Lib article published early in 2022 describes some of the important information in file-format reports that appear, in-aggregate. It describes the challenges of accessing that information consistently also.

Fractal in detail: What information is in a file format identification report?

2020/2021 refactor

This utility was first written in 2013. The code was pretty bad, but worked. It wrapped a lot of technical debt into a small package.

The 2020/2021 refactor tries to do three things:

Fix minor issues.
Make compatible with Python 3 and temporarily, one last time with Python 2.
Add unit tests.

Adding unit tests is the key to your contributions and greater flexibility with refactoring. One a release candidate is available of this work, there is more freedom to think about next steps including exposing queries more generically so that more folk can work with sqlitefid. And finding more generic API-like abstractions in general so the utility is less like a monolith and more like a configurable static analysis engine analogous to something you might work with in Python or Golang.

More information

See the following blogs for more information:

[2014-06-03] On the creation of this tool
[2015-08-25] Creating a digital preservation rogues gallery
[2016-05-23] Consistent and repeatable digital preservation reporting
[2016-05-24] A multi-lingual lingua-franca and exploring ID methods

COPTR Link: DROID_Siegfried_Sqlite_Analysis_Engine

Components

There are three components to the tool.

sqlitefid

Adds identification data to an SQLite database that forms the basis of the entire analysis. There are five tables.

DBMD - Database Metadata
FILEDATA - File manifest and filesystem metadata
IDDATA - Identification metadata
IDRESULTS - FILEDATA/IDRESULTS junction table
NSDATA - Namespace metadata, also secondary key (NS_ID) in IDDATA table

Will also augment DROID or Siegfried export data with additional columns:

URI_SCHEME: Separates the URI_SCHEME from the DROID URI column. This is to enable the identification of container objects found in the export specifically, and the distinction of files stored in container objects from standalone files.
DIR_NAME: Returns the base directory name from the file path to enable analysis of directory names, e.g. the number of directories in the collection.

demystify

Outputs an analysis from extensive querying of the SQLite database created by sqlitefid,

HTML is the default report output, with plain-text, and file-listings also available.

It is a good idea to run the analysis and > pipe the result to a file, e.g. python demystify.py --export my_export.csv > my_analysis.htm.

Rogues Gallery (v.0.2.0, v0.5.0+)

The following flags provide Rogue or Hero output:

--rogues

Outputs a list of files returned by the identification tool that might require more analysis e.g. non-IDs, multiple IDs, extension mismatches, zero-byte objects and duplicate files.

--heroes

Outputs a list of files considered to need less analysis.

The options can be configured by looking at denylist.cfg. More information can be found here.

pathlesstaken

A string analysis engine created to highlight when string values, e.g. file paths might need more care taken of them in a digital preservation environment, e.g. so we don't lose diacritics during transfer - providing a checklist of items to look at.

Includes:

Class to handle analysis of non-recommended filenames from Microsoft.
Copy of a library from Cooper Hewitt to enable writing of plain text descriptions of Unicode characters.

Architecture

The tool is designed to be easily modified to create your own output by using the Analysis Results class as a further abstraction layer (API).

The recent re-factor resulted in more generic python data structures being returned from queries and less (if not zero) formatted output. This means a little more work has to be put into presentation of results, but it is more flexible to what you want to do.

Tests are being implemented to promote the reliability of data returned.

Design Decisions

There should be no dependencies associated with this tool. That being said, you may need lxml for HTML output. An alternative may be found as the tool is refactored.

If we can maintain a state of few repositories then it should promote use across a wide-number of institutions. This has been driven by my previous two working environments where installing Python was the first challenge... PIP and the ability to get hold of code dependencies another - especially on multiple user's machines where we want this tool to be successful.

Usage Notes

Summary/Aggregate Binary / Text / Filename identification statistics are output with the following priority:

Namespace (e.g. ordered by PRONOM first [configurable])

Binary and Container Identifiers
XML Identifiers
Text Identifiers
Filename Identifiers
Extension Identifiers

We need to monitor how well this works. Namespace specific statistics are also output further down the report.

TODO, and how you can get involved

Internationalizing archivist descriptions here.
Improved container listing/handling.
Improved 'directory' listing and handling.
Output formatting unit tests!

As you use the tool or find problems, please report them. If you find you are missing summaries that might be useful to you please let me know. The more the utility is used, the more we can all benefit.

I have started a discussion topic for improvements: here.

Installation

Installation should be easy. Until the utility is packaged, you need to do the following:

Find a directory you want to install demystify to.
Run git clone.
Navigate into the demystify repository, cd demystify.
Checkout the sub-modules (pathlesstaken, and sqlitefid): git submodule update --init --recursive.
Install lxml: python -m pip install -r requirements/production.txt.
Run tests to make sure everything works: tox -e py39.

NB. tox is cool. If you're working on this code and want to format it idiomatically, run tox -e linting. If there are errors, they will point to where you may need to improve your code.

Virtual environment

A virtual environment is recommended in some instances, e.g. you don't want to pollute your Python environment with other developer's code. To do this, for Linux you can do the following:

Create a virtual environment: python3 -m virtualenv venv-py3.
Activate the virtual environment: source venv-py3/bin/activate.

Then follow the installation instructions above this.

Releases

See the Releases section on GitHub.

License

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.

Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.

This notice may not be removed or altered from any source distribution.

demystify's People

Contributors

Stargazers

Watchers

Forkers

eckardm floatingbugs archives-new-zealand kieranjol

demystify's Issues

File-size analysis, mean, median, medium, top-five

A report that can produce various statistics relating to file-size.

Note: Could be configured to produce a list larger than top five.
Note: May not always be useful, but might help users to understand their collections better.
Note: Configure via external config file.

Understand the implications of outputting folders in ROGUE denylists, e.g. affecting business classification schemes

Because a split ingest could be wildly impacted by this feature, and would require quite a bit of effort.

Add .strip() to csv handler class for headers and cell content

Based on sscompare work this will be useful. It makes the CSV input more flexible.

Namespace counts are all off (broken)... Need to rework.

Since creation so need to have a look at how to better do these counts.

Remove hard coded indexes, e.g. URI_COL = 2

Remove hard coded indexes from the script. Proposal: read CSV header, index. Hard code the column names we are interested in, but not their index. We are more interested in the content than the index of the header. Hypothesis is that the index is more likely to change than the column name.

Two column indexes that are currently hard coded are:

URI_COL = 2
PATH_COL = 3

Output i18n strings in README.md

Make README international.

Wildcard searches cannot be used for disposal actions

e.g. for creating a distilled set of objects during transfer. (the opposite being a distilled set of objects for destruction/passing back to an agency to make a decision on.)

Binder-ize me!

Via @joshuatj, inspured by @anjackson and his work on ident-o-matic

This is likely an epic, so epic related tasks will appear below:

Renaming
Python 3
Molularizing (pathlesstaken, sqlitfid, etc.)

Tasks on this are likely to begin early July.

Google Doc to help organize things...

Add code of conduct

Related to #46 if folks are to engage, folks need to do so knowing there is a code of conduct to help ensure everyone is looked after.

Source RE-DUMP functionality? i.e. get a DROID report back out of the database

Used to be able to output DROID CSV again from SQLITE DB... do we re-add this functionality?

One idea might be to create a dump per namespace? https://github.com/exponential-decay/droid-sqlite-analysis/blob/master/libs/ExportDBClass.py

Can Cooper Hewitt code identify Private User Areas?

ref: http://ndsr.nycdigital.org/trojan-dots-and-diy-solutions/

Large swathes of code are not compatible with Python 3

https://pythonclock.org/

Add a contents section to point to anchors in HTML5 version

Contents section to link to specific anchor sections of the document so it is easier to browse.

Package the tool correctly when renaming

Workarounds are currently happening in the tests folder... for example.

Highlight NTFS limits (minimum) in file paths

MAX_PATH is 255 characters in NTFS. If we see a path longer, raise it as a digital preservation risk in the script.

https://msdn.microsoft.com/en-us/library/aa365247.aspx#maxpath

Container black-list PUIDs and Extensions

Black-list of container file types that we can identify DROID won't analyse. There will need to be further analysis done on these files in any collection. E.g. RAR.

Sqlite error with Yaml

C:\Working\git\droid-siegfried-sqlite-analysis-engine>python droidsqliteanalysis.py --export sf-working-copies.json > working-copies.htm
Traceback (most recent call last):
  File "droidsqliteanalysis.py", line 129, in 
    main()
  File "droidsqliteanalysis.py", line 112, in main
    handleDROIDCSV(args.export, True, args.txt, blacklist, args.rogues, args.heroes)
  File "droidsqliteanalysis.py", line 73, in handleDROIDCSV
    dbfilename = droid2sqlite.identifyinput(droidcsv)
  File "C:\Working\git\droid-siegfried-sqlite-analysis-engine\droid2sqlite.py", line 22, in identifyinput
    return handleSFYAML(export)
  File "C:\Working\git\droid-siegfried-sqlite-analysis-engine\droid2sqlite.py", line 41, in handleSFYAML
    loader.sfDBSetup(sfexport, basedb.getcursor())
  File "C:\Working\git\droid-siegfried-sqlite-analysis-engine\libs\SFLoaderClass.py", line 152, in sfDBSetup
    cursor.execute(i)
sqlite3.OperationalError: near "s": syntax error

Log HASH type used by DROID in analysis

As a result of making the code more generic we may need more pointers as to the hash mechanism used in the code. Output this as a value in the report.

Do more with unidentified files in HTML report

Because right now only rogues can be used to access the unidentified listing...

Rename this application to Demystify!

By popular opinion!

Output results based on collection 'risk' issues first, before summary information

Sorting results based on the issues that put a collection/ingest at most risk, first, might be a more useful output than a mixture of summary and risk based information.

An example might be seeing unidentified files and extension only identifications output before frequency charts.

Similarly, identical content could be output immediately after those two as well.

Initial list with biggest priority:

Unidentified list
Total files in container objects
Extension only identifications
Extension mismatches
Multiple identifications
Duplicate file names
Duplicate content

Output total number of files per archive object...

Total files Files inside containersArchive Objects:
The number of files inside a container archival objects

SF YAML time parsing error

Hi Ross,

Loving the new version of the analysis tool, but I seem to have hit a snag in trying it out. I tried to run droid2sqlite.py against a YAML export from Siegfried 1.5, using both PRONOM and tika namespaces. It's unsuccessful, seemingly due to the parsing of years in the SFHandlerClass.

STDERR is copied below:

mfmmessier:droid-sqlite-analysis-0.4.0 twalsh$ python droid2sqlite.py --export kolmactest.yaml
Traceback (most recent call last):
  File "droid2sqlite.py", line 79, in <module>
    main()
  File "droid2sqlite.py", line 72, in main
    identifyinput(args.export)
  File "droid2sqlite.py", line 21, in identifyinput
    return handleSFYAML(export)
  File "droid2sqlite.py", line 40, in handleSFYAML
    loader.sfDBSetup(sfexport, basedb.getcursor())
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFLoaderClass.py", line 108, in sfDBSetup
    sf.addYear(sfdata)
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFHandlerClass.py", line 284, in addYear
    row[self.FIELDYEAR] = self.getYear(year)
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFHandlerClass.py", line 290, in getYear
    dt = datetime.datetime.strptime(datestring.split('+', 1)[0], '%Y-%m-%dT%H:%M:%S')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 328, in _strptime
    data_string[found.end():])
ValueError: unconverted data remains: -04:00

And the relevant lines from _strptime.py:

if len(data_string) != found.end():
        raise ValueError("unconverted data remains: %s" %
                          data_string[found.end():])

It appears that the issue may be in the time zone parsing but I wasn't able to figure out exactly what in the limited time I had to tinker with getYear in SFHandlerClass. Any ideas? (I'm running Python 2.7.10, if that makes any difference)

Thanks!

Null character in string breaks analysis

Via: https://twitter.com/euanc/status/793108956021489664

Blacklist PUIDs

Mechanism to blacklist certain PUIDs to place them into a separate listing in some useful manner.

Examples might be fmt/111 - OLE2 Compound Object Format, or x-fmt/411 - Windows Portable Executable.

These are files identified by DROID but are likely to require action before ingest, e.g. the more specific identification of fmt/111 as a child format, e.g. Microsoft Word, San Serif Page Plus.

This will be institution specific.

Note: Likely to be configured by external config file for usability.

Sort signature identified PUIDs listing by format name rather than PUID

Sort the identified PUID listing by format name to make it more useful to spot trends. There is no relationship between PUID numbers and so a sort is useless. PDF 1.1 and PDF 1.2 could be separated by 100 other PUIDs. As such, sort by format name to highlight trends across a collection.

Translation Project

These variables need translating to other languages! We've started with French and Czech, but need to keep going :D

https://docs.google.com/spreadsheets/d/1dVsRsXgD9V2GarNHHpf6Tzhrfx99_MXt3LjSSDrNLOY/edit#gid=0

Check filetype on loading to ensure correct, e.g. csv is csv

To provide more robust input and output.

Hyperlink important pieces of information within definitions

E.g. https://en.wikipedia.org/wiki/PRONOM#The_PRONOM_Persistent_Unique_Identifier_.28PUID.29_scheme

List extension mismatches

List all extension mismatches found in a file.

Figure out a way to do this nicely.

Cumulative analysis results

Have removed getContentHash() and getTimeStamp()

Original idea was to prevent a DB from generating again, but also to try and begin generating cumulative results per improved signature release to understand the evolution of a collection.

Will an add function for the DB prove useful? - Namespace could include a timedatestamp also.

Filter for certain characters

In any given context, a large number of Unicode characters is very likely. It may be useful to understand groupings, or themes, but not necessarily all. I'm considering a filter for certain characters so that they don't explore the report, but I am also considering other ways of presenting this information, e.g. organize by category and providing a collapsable file-list underneath.

Connected to #49

Output a (suggested) denylist config file per collection?

This would enable users to better use the empty-folders script in the exponential-decay repositories.

https://github.com/exponential-decay/detect-empty-folders

Analysis not output if SF report doesn't have hashes!

Blacklist file names and format extensions

Use a blacklist of file extensions to generate special listings. E.g. for the recognition of Thumbs.db files, .tmp files, .bat files etc.

Note: May keep listings separate depending on how this works.
Note: Likely to refer to an external config file.

Consider what to do with Top Fives (and/or) Pareto 80:20 stats

Removing top IDs for now because it does not make sense in a multi-identifier environment.

Output how much space is used on duplicates

This should be pretty easy to return, and might be quite nice for users.

Container heuristic may not be accurate enough...

See stat in HTML: Archive File Types in Accession/Extract

os.path.dirname() not consistently working for windows paths

e.g. Z:\transfers-working\18-Jan-2016\dispositions\disp_20160115143947_334133\Docs_WIN1648_108_12\88800.1.doc'

DROID only mode

For SF with any number of namespaces - always allow output of a simplified PRONOM only mode.

Multiple namespaces need to be documented better

E.G. For stat; Frequency of Extension Only Identification - We don't filter on distinct file ID and so a result of 22 extension only, might present as:

ns:pronom 22
ns:freedesktop 22
ns:tika 22

Be wiser in MsoftFilenameAnalysis output handling, e.g enums/controlled lists

Currently for each character issue discovered we do something a bit verbose:

def reportIssue(self, s, msg, value=''):
    self.report = self.report + "File: " + s + " " + msg + " " + value + "\n"

E.g.

self.reportIssue(s, "contains, non-printable character:", hex(c) + ", " + self.unicodename(c))

We might be able to create a list of value pairs, e.g. (enum, char), e.g. ('non-ascii-char', '{char}) we can then process this more intelligently at the end of the analysis and perhaps in different ways too.

Change blacklist to denylist

Additionally, revisit OPF blog to correct language usage there.

Remove folders from IDDATA table

They're redundant but do not interfere with results. Probably slows things down.

Better column population logic

Poor IF logic currently used, line 150 droid2sqlite.py onwards.

ImportError on one of the internationalstrings modules when trying to run analysis

Hi,

I just tried to run the analysis on a Siegfried CSV export file, and I got the following error message:

akb@debian:~$ python git-repos/droid-siegfried-sqlite-analysis-engine-master/droidsqliteanalysis.py --export ResurectionMen.csv
Traceback (most recent call last):
File "git-repos/droid-siegfried-sqlite-analysis-engine-master/droidsqliteanalysis.py", line 10, in
from libs.DroidAnalysisClass import DROIDAnalysis
File "/home/akb/git-repos/droid-siegfried-sqlite-analysis-engine-master/libs/DroidAnalysisClass.py", line 7, in
import MsoftFnameAnalysis
File "/home/akb/git-repos/droid-siegfried-sqlite-analysis-engine-master/libs/MsoftFnameAnalysis.py", line 8, in
from internationalstrings import AnalysisStringsEN as IN_EN
ImportError: No module named internationalstrings

It looks to me like the MsoftFnameAnalysis.py module is trying to import a module that doesn't exist in the repo?

Thank you!