exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: to break the export into its components and store them within a SQLite database; to create additional columns to augment the output where useful; and to query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.

Home Page: http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export

License: zlib License

Python 42.18% HTML 57.69% Makefile 0.13%
code4lib digital-preservation pronom archives format-analysis duplicate-detection digipres collection-profiling

demystify's Issues

Output results based on collection 'risk' issues first, before summary information

Sorting results based on the issues that put a collection/ingest at most risk, first, might be a more useful output than a mixture of summary and risk based information.

An example might be seeing unidentified files and extension only identifications output before frequency charts.

Similarly, identical content could be output immediately after those two as well.

Initial list with biggest priority:

  • Unidentified list
  • Total files in container objects
  • Extension only identifications
  • Extension mismatches
  • Multiple identifications
  • Duplicate file names
  • Duplicate content

File-size analysis, mean, median, medium, top-five

A report that can produce various statistics relating to file-size.

Note: Could be configured to produce a list larger than top five.
Note: May not always be useful, but might help users to understand their collections better.
Note: Configure via external config file.
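
A minimal sketch of what such a report could compute, using only the standard library. The function name `file_size_report` and the `top_n` parameter are hypothetical, and the mode is included as an extra that the issue does not explicitly ask for.

```python
import statistics
from collections import Counter


def file_size_report(sizes, top_n=5):
    """Summarise a list of file sizes in bytes (hypothetical helper)."""
    if not sizes:
        return {}
    # most_common(1) returns the most frequent size; ties go to the first seen.
    mode_size, _ = Counter(sizes).most_common(1)[0]
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "mode": mode_size,
        # top_n is configurable, so the list can be larger than five.
        "largest": sorted(sizes, reverse=True)[:top_n],
    }


sizes = [10, 20, 20, 4096, 512, 1, 300]
report = file_size_report(sizes, top_n=3)
```

In a real run, `top_n` would presumably come from the external config file mentioned in the notes above.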

Be wiser in MsoftFilenameAnalysis output handling, e.g. enums/controlled lists

Currently for each character issue discovered we do something a bit verbose:

def reportIssue(self, s, msg, value=''):
    self.report = self.report + "File: " + s + " " + msg + " " + value + "\n"

E.g.

self.reportIssue(s, "contains, non-printable character:", hex(c) + ", " + self.unicodename(c))

We might be able to create a list of value pairs, e.g. (enum, char), such as ('non-ascii-char', '{char}'). We can then process this more intelligently at the end of the analysis, and perhaps in different ways too.
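
One way the value-pair idea could look, as a rough sketch. `Issue`, `IssueLog` and their methods are hypothetical names, not part of the existing module, and the sketch assumes Python 3 for the `enum` module.

```python
from collections import defaultdict
from enum import Enum


class Issue(Enum):
    # Hypothetical controlled list; the real set would mirror the existing checks.
    NON_ASCII = "contains non-ASCII character"
    NON_PRINTABLE = "contains non-printable character"


class IssueLog(object):
    def __init__(self):
        self.issues = []  # list of (filename, Issue, char) tuples

    def report_issue(self, filename, issue, char):
        # Store structured data instead of building the report string eagerly.
        self.issues.append((filename, issue, char))

    def group_by_issue(self):
        # One possible "more intelligent" view: files grouped per issue type.
        grouped = defaultdict(list)
        for filename, issue, char in self.issues:
            grouped[issue].append((filename, char))
        return dict(grouped)

    def as_text(self):
        # Text output close to the current style, rendered at the end.
        return "\n".join(
            "File: {} {}: {}".format(filename, issue.value, hex(ord(char)))
            for filename, issue, char in self.issues
        )


log = IssueLog()
log.report_issue("caf\u00e9.txt", Issue.NON_ASCII, "\u00e9")
log.report_issue("bad\u0007name.txt", Issue.NON_PRINTABLE, "\u0007")
```

Because the raw tuples are kept until the end, the same data can feed a plain-text report, an HTML report, or a per-issue summary without re-running the analysis.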

Blacklist PUIDs

Mechanism to blacklist certain PUIDs to place them into a separate listing in some useful manner.

Examples might be fmt/111 - OLE2 Compound Object Format, or x-fmt/411 - Windows Portable Executable.

These are files identified by DROID that are likely to require action before ingest, e.g. the more specific identification of an fmt/111 file as a child format such as Microsoft Word or Serif PagePlus.

This will be institution specific.

Note: Likely to be configured by external config file for usability.
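
A sketch of how an institution-specific PUID blacklist might be read from a config file and applied. The `[blacklist]` section, the `puids` key, and both function names are assumptions made for illustration; the config text is inlined here only so the example is self-contained.

```python
import configparser

# Hypothetical config layout; a real file would live alongside the tool.
CONFIG_TEXT = """
[blacklist]
puids = fmt/111, x-fmt/411
"""


def load_blacklisted_puids(text):
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("blacklist", "puids", fallback="")
    return {puid.strip() for puid in raw.split(",") if puid.strip()}


def split_by_blacklist(rows, blacklist):
    """Return (flagged, others) so flagged files get a separate listing."""
    flagged = [row for row in rows if row["puid"] in blacklist]
    others = [row for row in rows if row["puid"] not in blacklist]
    return flagged, others


blacklist = load_blacklisted_puids(CONFIG_TEXT)
rows = [
    {"name": "letter.doc", "puid": "fmt/111"},
    {"name": "setup.exe", "puid": "x-fmt/411"},
    {"name": "photo.jpg", "puid": "fmt/43"},
]
flagged, others = split_by_blacklist(rows, blacklist)
```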

Blacklist file names and format extensions

Use a blacklist of file extensions to generate special listings. E.g. for the recognition of Thumbs.db files, .tmp files, .bat files etc.

Note: May keep listings separate depending on how this works.
Note: Likely to refer to an external config file.
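
A small sketch of the matching step, assuming the blacklist has already been loaded from wherever it is configured. The constants and `is_blacklisted` are hypothetical names, and matching is case-insensitive by assumption.

```python
import posixpath

# Hypothetical blacklist values taken from the examples in the issue.
BLACKLIST_NAMES = {"thumbs.db"}
BLACKLIST_EXTS = {".tmp", ".bat"}


def is_blacklisted(path):
    # Compare in lower case so "THUMBS.DB" and "Thumbs.db" are treated alike.
    name = posixpath.basename(path).lower()
    ext = posixpath.splitext(name)[1]
    return name in BLACKLIST_NAMES or ext in BLACKLIST_EXTS


paths = ["photos/Thumbs.db", "work/draft.TMP", "scripts/run.bat", "docs/report.pdf"]
special_listing = [p for p in paths if is_blacklisted(p)]
```

The sketch uses forward-slash paths; Windows-style paths would need `ntpath` or a normalisation step first.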

Sort signature identified PUIDs listing by format name rather than PUID

Sort the identified PUID listing by format name to make it easier to spot trends. PUID numbers bear no relationship to the formats they identify, so sorting by PUID is of little use: PDF 1.1 and PDF 1.2 could be separated by 100 other PUIDs. Sorting by format name instead highlights trends across a collection.
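
As a sketch, the sort could key on the format name with the PUID as a tie-breaker. The rows below are illustrative sample data with abbreviated format names and made-up counts, not output from the tool.

```python
# (puid, format name, count): illustrative rows only.
listing = [
    ("fmt/276", "Acrobat PDF 1.7", 12),
    ("fmt/43", "JPEG File Interchange Format", 40),
    ("fmt/18", "Acrobat PDF 1.4", 7),
    ("fmt/11", "Portable Network Graphics", 3),
]

# Sort case-insensitively by format name, then by PUID to keep ties stable.
by_name = sorted(listing, key=lambda row: (row[1].lower(), row[0]))
puid_order = [row[0] for row in by_name]
```

With this ordering the two PDF versions sit next to each other even though their PUIDs are far apart.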

Sqlite error with Yaml

C:\Working\git\droid-siegfried-sqlite-analysis-engine>python droidsqliteanalysis.py --export sf-working-copies.json > working-copies.htm
Traceback (most recent call last):
  File "droidsqliteanalysis.py", line 129, in <module>
    main()
  File "droidsqliteanalysis.py", line 112, in main
    handleDROIDCSV(args.export, True, args.txt, blacklist, args.rogues, args.heroes)
  File "droidsqliteanalysis.py", line 73, in handleDROIDCSV
    dbfilename = droid2sqlite.identifyinput(droidcsv)
  File "C:\Working\git\droid-siegfried-sqlite-analysis-engine\droid2sqlite.py", line 22, in identifyinput
    return handleSFYAML(export)
  File "C:\Working\git\droid-siegfried-sqlite-analysis-engine\droid2sqlite.py", line 41, in handleSFYAML
    loader.sfDBSetup(sfexport, basedb.getcursor())
  File "C:\Working\git\droid-siegfried-sqlite-analysis-engine\libs\SFLoaderClass.py", line 152, in sfDBSetup
    cursor.execute(i)
sqlite3.OperationalError: near "s": syntax error
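
The traceback does not show the offending statement, so the cause is unconfirmed. One plausible explanation, offered here only as an assumption, is a value containing an apostrophe being concatenated into the SQL text. The sketch below reproduces that class of failure against a hypothetical `filedata` table and shows the parameterised alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filedata (name TEXT)")  # hypothetical table

name = "Bob's report.doc"  # the apostrophe ends the SQL string literal early

try:
    # String concatenation: the quote in the value breaks the statement.
    conn.execute("INSERT INTO filedata (name) VALUES ('" + name + "')")
    concatenation_failed = False
except sqlite3.OperationalError:
    concatenation_failed = True

# Parameter binding: sqlite3 handles quoting, so the value is stored intact.
conn.execute("INSERT INTO filedata (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM filedata").fetchall()
```

If this is the cause, passing values as parameters to `cursor.execute` rather than building each statement as a string would avoid it.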

ImportError on one of the internationalstrings modules when trying to run analysis

Hi,

I just tried to run the analysis on a Siegfried CSV export file, and I got the following error message:

akb@debian:~$ python git-repos/droid-siegfried-sqlite-analysis-engine-master/droidsqliteanalysis.py --export ResurectionMen.csv
Traceback (most recent call last):
  File "git-repos/droid-siegfried-sqlite-analysis-engine-master/droidsqliteanalysis.py", line 10, in <module>
    from libs.DroidAnalysisClass import DROIDAnalysis
  File "/home/akb/git-repos/droid-siegfried-sqlite-analysis-engine-master/libs/DroidAnalysisClass.py", line 7, in <module>
    import MsoftFnameAnalysis
  File "/home/akb/git-repos/droid-siegfried-sqlite-analysis-engine-master/libs/MsoftFnameAnalysis.py", line 8, in <module>
    from internationalstrings import AnalysisStringsEN as IN_EN
ImportError: No module named internationalstrings

It looks to me like the MsoftFnameAnalysis.py module is trying to import a module that doesn't exist in the repo?

Thank you!

Multiple namespaces need to be documented better

E.g. for the stat "Frequency of Extension Only Identification": we don't filter on distinct file ID, so a result of 22 extension-only identifications might present as:

ns:pronom 22
ns:freedesktop 22
ns:tika 22
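
A sketch of the difference between the per-namespace count and a count over distinct files, using an in-memory SQLite table. The table and column names (`idresults`, `file_id`, `namespace`, `method`) are hypothetical and do not reflect the tool's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idresults (file_id INTEGER, namespace TEXT, method TEXT)")

# 22 files, each identified by extension only in all three namespaces.
rows = [
    (file_id, namespace, "extension")
    for file_id in range(1, 23)
    for namespace in ("pronom", "freedesktop", "tika")
]
conn.executemany("INSERT INTO idresults VALUES (?, ?, ?)", rows)

# Grouped by namespace: every namespace reports the same 22 files.
per_namespace = conn.execute(
    "SELECT namespace, COUNT(*) FROM idresults "
    "WHERE method = 'extension' GROUP BY namespace ORDER BY namespace"
).fetchall()

# Filtering on distinct file ID gives the number of affected files.
distinct_files = conn.execute(
    "SELECT COUNT(DISTINCT file_id) FROM idresults WHERE method = 'extension'"
).fetchone()[0]
```

Either figure can be valid; the documentation just needs to say which one a given statistic reports.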

Add code of conduct

Related to #46. If folks are to engage, they need to do so knowing there is a code of conduct to help ensure everyone is looked after.

DROID only mode

For SF with any number of namespaces - always allow output of a simplified PRONOM only mode.

SF YAML time parsing error

Hi Ross,

Loving the new version of the analysis tool, but I seem to have hit a snag in trying it out. I tried to run droid2sqlite.py against a YAML export from Siegfried 1.5, using both PRONOM and tika namespaces. It's unsuccessful, seemingly due to the parsing of years in the SFHandlerClass.

STDERR is copied below:

mfmmessier:droid-sqlite-analysis-0.4.0 twalsh$ python droid2sqlite.py --export kolmactest.yaml
Traceback (most recent call last):
  File "droid2sqlite.py", line 79, in <module>
    main()
  File "droid2sqlite.py", line 72, in main
    identifyinput(args.export)
  File "droid2sqlite.py", line 21, in identifyinput
    return handleSFYAML(export)
  File "droid2sqlite.py", line 40, in handleSFYAML
    loader.sfDBSetup(sfexport, basedb.getcursor())
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFLoaderClass.py", line 108, in sfDBSetup
    sf.addYear(sfdata)
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFHandlerClass.py", line 284, in addYear
    row[self.FIELDYEAR] = self.getYear(year)
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFHandlerClass.py", line 290, in getYear
    dt = datetime.datetime.strptime(datestring.split('+', 1)[0], '%Y-%m-%dT%H:%M:%S')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 328, in _strptime
    data_string[found.end():])
ValueError: unconverted data remains: -04:00

And the relevant lines from _strptime.py:

if len(data_string) != found.end():
        raise ValueError("unconverted data remains: %s" %
                          data_string[found.end():])

It appears that the issue may be in the time zone parsing but I wasn't able to figure out exactly what in the limited time I had to tinker with getYear in SFHandlerClass. Any ideas? (I'm running Python 2.7.10, if that makes any difference)

Thanks!
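
The traceback suggests the cause: `datestring.split('+', 1)[0]` only strips positive UTC offsets, so a negative offset such as `-04:00` is left on the string and `strptime` rejects it. A minimal sketch of one possible fix follows, stripping any trailing offset with a regular expression. The function name `get_year` and the pattern are assumptions, not the fix adopted by the project, and fractional seconds are not handled.

```python
import datetime
import re

# Matches a trailing "Z" or a +/-HH:MM (or +/-HHMM) offset at the end of the string.
UTC_OFFSET = re.compile(r"(Z|[+-]\d{2}:?\d{2})$")


def get_year(datestring):
    # Drop the offset, whatever its sign, before parsing the naive timestamp.
    naive = UTC_OFFSET.sub("", datestring)
    parsed = datetime.datetime.strptime(naive, "%Y-%m-%dT%H:%M:%S")
    return parsed.year


years = [
    get_year("2016-04-12T10:15:30-04:00"),
    get_year("2015-01-02T03:04:05+12:00"),
    get_year("2014-06-03T00:00:00Z"),
    get_year("2013-12-31T23:59:59"),
]
```

Discarding the offset is acceptable here only because the year is all that is extracted; a timestamp close to midnight on 31 December could fall in a different year once converted to UTC.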

Binder-ize me!

Via @joshuatj, inspired by @anjackson and his work on ident-o-matic

This is likely an epic, so epic related tasks will appear below:

  • Renaming
  • Python 3
  • Modularizing (pathlesstaken, sqlitfid, etc.)

Tasks on this are likely to begin early July.

Remove hard coded indexes, e.g. URI_COL = 2

Remove hard coded indexes from the script. Proposal: read the CSV header and look up each column's index from it. Hard code the column names we are interested in, but not their indexes. We are more interested in the content than in the position of a column in the header. The hypothesis is that the index is more likely to change than the column name.

Two column indexes that are currently hard coded are:

URI_COL = 2
PATH_COL = 3
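
A sketch of the proposal: resolve the indexes from the header row at run time. The column names `URI` and `FILE_PATH` and the sample row are assumptions about the export layout, and `column_indexes` is a hypothetical helper.

```python
import csv
import io

# Hard code the names we care about, not their positions.
WANTED_COLUMNS = ("URI", "FILE_PATH")


def column_indexes(header, wanted=WANTED_COLUMNS):
    # list.index raises ValueError if a column is missing, which fails loudly.
    return {name: header.index(name) for name in wanted}


# Inline sample standing in for an export file.
sample = io.StringIO(
    '"ID","PARENT_ID","URI","FILE_PATH","NAME"\n'
    '"1","","file:/tmp/a.txt","/tmp/a.txt","a.txt"\n'
)
reader = csv.reader(sample)
header = next(reader)
indexes = column_indexes(header)

first_row = next(reader)
uri = first_row[indexes["URI"]]
path = first_row[indexes["FILE_PATH"]]
```

`csv.DictReader` would achieve the same effect with less code, at the cost of building a dictionary for every row.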

Filter for certain characters

In any given context, a large number of Unicode characters is very likely. It may be useful to understand groupings, or themes, but not necessarily every character. I'm considering a filter for certain characters so that they don't explode the report, but I am also considering other ways of presenting this information, e.g. organizing by category and providing a collapsible file list underneath.

Connected to #49
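
One way the "organize by category" idea could be sketched is with `unicodedata.category`, which returns the two-letter Unicode general category of a character. The function name and the choice to group only non-ASCII characters are assumptions.

```python
import unicodedata
from collections import defaultdict


def group_files_by_category(filenames):
    """Map Unicode general category -> sorted list of file names using it."""
    groups = defaultdict(set)
    for name in filenames:
        for char in name:
            if ord(char) > 127:  # only non-ASCII characters are of interest
                groups[unicodedata.category(char)].add(name)
    return {category: sorted(names) for category, names in groups.items()}


filenames = ["caf\u00e9.txt", "na\u00efve.doc", "price\u20ac.xls", "plain.txt"]
grouped = group_files_by_category(filenames)
```

Each category could then become one heading in the report, with its file list collapsed underneath.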

Lxml is an external dependency...

This app shouldn't use any external dependencies. It was designed to work in an environment that doesn't have the ability to use PIP.

Lxml.etree should be replaced by etree pretty easily. Lxml.html, I am not sure about.
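
For the `lxml.etree` half, a sketch of the standard-library equivalent using `xml.etree.ElementTree`. The element and attribute names are made up for illustration. Features specific to lxml, such as `pretty_print` or full XPath, have no direct counterpart and would need separate handling.

```python
import xml.etree.ElementTree as ET

# Build a small tree with the standard library instead of lxml.etree.
root = ET.Element("report")
item = ET.SubElement(root, "file", {"puid": "fmt/111"})
item.text = "example.doc"

# encoding="unicode" returns a str rather than bytes.
xml_text = ET.tostring(root, encoding="unicode")

# Round-trip to confirm the output parses back to the same content.
parsed = ET.fromstring(xml_text)
first_file = parsed.find("file")
```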

Container black-list PUIDs and Extensions

A blacklist of container file types that we can identify but that DROID won't analyse. Further analysis will need to be done on these files in any collection. E.g. RAR.

Cumulative analysis results

Have removed getContentHash() and getTimeStamp()

Original idea was to prevent a DB from generating again, but also to try and begin generating cumulative results per improved signature release to understand the evolution of a collection.

Will an add function for the DB prove useful? - Namespace could include a timedatestamp also.
