METADATA DEDUPLICATION VIA SRU SEARCH

Description

Perl script for data deduplication using the SRU service of swissbib or GVI. The deduplication process is specifically adapted to data from the IFF institute of the University of St. Gallen.

Versions

There are several versions of this script. The recommended version is v4_combined/dedup.pl, and this README describes that version. Older versions can be found in the directory old_versions; there is a separate README for versions 1 and 2.

Prerequisites

You need to have Perl and libxml2 installed to run this script.

Developed with Strawberry Perl v5.28.1 (LibXML is included). For Strawberry Perl (Windows): include the path to .\iff in @INC, see here.

An image for Ubuntu (VM) with all necessary modules installed can be downloaded here: switch drive.

Please read the installation notes.

Usage

The script calls an SRU service for each document in the input file, so you will need an active internet connection. Performance depends on the internet connection as well as on the availability of the SRU service. Call the script like this:

perl dedup.pl -c [swissbib|gvi] -f [filename]

For more information about the script, see the POD documentation for dedup.pl.
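As an illustration of how the two options might be handled, here is a minimal sketch using the core module Getopt::Long. It is not taken from dedup.pl: the helper `parse_args`, the configuration file names `swissbib.conf` and `gvi.conf`, and the sample input path are assumptions made for this example.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical mapping from the -c value to a configuration file name.
my %CONF_FILE = (swissbib => 'swissbib.conf', gvi => 'gvi.conf');

# parse_args is a hypothetical helper, not taken from dedup.pl.
# It returns the configuration file name and the input file name.
sub parse_args {
    my @args = @_;
    my ($service, $file);
    GetOptionsFromArray(\@args, 'c=s' => \$service, 'f=s' => \$file)
        or die "Usage: perl dedup.pl -c [swissbib|gvi] -f [filename]\n";
    die "Unknown service, use swissbib or gvi\n"
        unless defined $service && exists $CONF_FILE{$service};
    die "No input file given\n" unless defined $file;
    return ($CONF_FILE{$service}, $file);
}

# Sample arguments; a real script would pass @ARGV instead.
my ($conf, $input) = parse_args('-c', 'gvi', '-f', 'data/test.csv');
print "config: $conf, input: $input\n";
```

Keeping the service names in one hash means that an additional SRU interface would only need one more entry there, plus its own .conf file.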

Parameters

SRU Service (Configuration file)

You can choose between the swissbib and the GVI SRU interface.

  • There is a .conf file for each of these SRU interfaces.
  • You choose the service with the command line parameter -c gvi or -c swissbib.
  • Currently, there are no other SRU interfaces implemented.
  • To implement another SRU interface, the new interface needs its own .conf file.

The dedup.pl script would also need to be adapted slightly where it reads the options from the command line.

In the configuration file, you can set several values:

  • weights for matching fields and preferred origin
  • sru server configuration and query strings
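To show how a server address and a query string could combine into a request, here is a minimal sketch that builds an SRU searchRetrieve URL with core Perl only. The parameter names operation, version, query and maximumRecords come from the SRU standard. Everything else is an assumption for illustration: the helpers `escape` and `build_sru_url`, the base URL, the version value, and the CQL index `dc.title`. The values actually used are in the .conf files.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Percent-encode a string for a URL query (RFC 3986 unreserved characters kept).
sub escape {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# build_sru_url is a hypothetical helper, not taken from dedup.pl.
sub build_sru_url {
    my ($base_url, $cql, $max_records) = @_;
    my @params = (
        [ operation      => 'searchRetrieve' ],
        [ version        => '1.1' ],            # assumed protocol version
        [ query          => $cql ],
        [ maximumRecords => $max_records ],
    );
    my @pairs;
    for my $param (@params) {
        push @pairs, $param->[0] . '=' . escape($param->[1]);
    }
    return $base_url . '?' . join('&', @pairs);
}

# Placeholder server and query, for illustration only.
my $url = build_sru_url('https://sru.example.org/search', 'dc.title = "Steuerrecht"', 10);
print "$url\n";
```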

Each section in the configuration file needs a header, and each value needs its own line. Full documentation on the configuration file (how to edit or add entries, how to call it in the script) can be found in the documentation for Config::Tiny.
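A configuration in this layout can be read with Config::Tiny roughly as sketched below. The section names (`weights`, `sru`), the keys and all values are hypothetical and only illustrate the "header per section, one value per line" structure. The real gvi and swissbib .conf files may be organized differently.

```perl
use strict;
use warnings;
use Config::Tiny;

# Hypothetical configuration text; section and key names are assumptions.
my $conf_text = <<'END_CONF';
[weights]
isbn=10
title=15

[sru]
base_url=https://sru.example.org/search
END_CONF

# read_string parses INI-style text; read($filename) does the same for a file.
my $config = Config::Tiny->read_string($conf_text)
    or die "Cannot parse configuration: " . Config::Tiny->errstr . "\n";

# Values are plain strings in a two-level hash: section first, then key.
my $isbn_weight = $config->{weights}{isbn};
my $base_url    = $config->{sru}{base_url};

print "ISBN weight: $isbn_weight\n";
print "SRU base URL: $base_url\n";
```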

Input file (Data)

You can feed this script with an input file of your choice. It needs to be in CSV format.

  • Several test files and the full metadata file for the IFF library catalogue are in the subdirectory ./data.
  • Warning: IFF_Katalog_FULL_normalized will take about 20 to 30 minutes to process, depending on the network. Try a smaller file first!

The data needs to be arranged in columns like the example files in the subdirectory ./data, otherwise this script will not work. Each document needs its data in the following columns (columns may be empty unless stated otherwise):

  • 1: author1
  • 2: author2
  • 3: author3
  • 4: title (mandatory)
  • 5: subtitle
  • 6: vol-info1
  • 7: vol-info2
  • 8: isbn
  • 9: pages
  • 10: material type (possible values: Druckerzeugnis, CD-ROM/DVD, Loseblattwerk, Online-Publikation, Zeitung)
  • 11: addendum
  • 12: library or collection
  • 13: call no.
  • 14: place of publication
  • 15: publisher
  • 16: year
  • 17: code1 (refers to a subject table, see map)
  • 18: code2 (refers to a subject table, see map)
  • 19: code3 (refers to a subject table, see map)
  • 20: subj1 (see above)
  • 21: subj2 (see above)
  • 23: subj3 (see above)
  • columns 24ff. need to be empty.
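Before starting a long run, it can help to check an input record against the layout above. The following is a minimal sketch, not part of dedup.pl: `check_row` is a hypothetical helper that receives the fields of one record (already split by a CSV parser) and reports a missing title, a material type outside the listed values, and any content from field 24 onwards. The sample record is made up.

```perl
use strict;
use warnings;

# Material types listed for field 10.
my %MATERIAL = map { $_ => 1 }
    ('Druckerzeugnis', 'CD-ROM/DVD', 'Loseblattwerk', 'Online-Publikation', 'Zeitung');

# check_row is a hypothetical helper: it returns a list of problems found.
sub check_row {
    my @fields = @_;
    my @problems;

    # Field 4 (index 3) holds the title and is mandatory.
    my $title = $fields[3] // '';
    push @problems, 'title (field 4) is empty' if $title !~ /\S/;

    # Field 10 (index 9) may be empty, otherwise it should be a listed type.
    my $type = $fields[9] // '';
    push @problems, "unknown material type '$type' (field 10)"
        if $type ne '' && !$MATERIAL{$type};

    # Fields 24 and beyond (index 23 and up) need to be empty.
    for my $i (23 .. $#fields) {
        push @problems, 'field ' . ($i + 1) . ' is not empty'
            if defined $fields[$i] && $fields[$i] ne '';
    }
    return @problems;
}

# Made-up sample record: author1, two empty authors, title, five empty
# fields, then the material type in field 10.
my @row = ('Muster, Anna', '', '', 'Steuerrecht', ('') x 5, 'Druckerzeugnis');
my @problems = check_row(@row);
print @problems ? join("\n", @problems) . "\n" : "row looks fine\n";
```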

Output

Console output will show a progress bar and give you the logfile name at the end. The script creates the following output:

  • an export file with the original data and the document numbers that need to be replaced and/or imported: export.csv
  • an XML file with the exported metadata, which is linked to the export file via MARC field 001: metadata.xml
  • a report with debugging info: report.txt
  • a logfile: log_<timestamp>.txt

export.csv

Contains all documents from the input file. Additional mapping information can be found in the following columns:

  • w: what to do with the documents. Cases:

    • bestcase: already matched correctly
    • iffonly: only original iff found
    • replace: replace col. x with col. y, data: see metadata.xml
    • reimport: replace col. x with col. y, data: see metadata.xml
    • iffnotfound: original not found, could be replaced manually with docnr. from col. y
    • notfound: no result
    • unsure: no certain result
    • ignore: excluded from deduplication
  • x: docnr. of document to be replaced (swissbib only: system number)

  • y: docnr. of replacement document (system number from swissbib or gvi: MARC field 001)
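To get a quick overview of a finished run, the cases from column w can be tallied. The sketch below works on a plain list of case values, so it makes no assumption about how the column is read from export.csv. The helper `tally_cases` and the sample values are hypothetical. The cases replace and reimport are marked because their data is exported to metadata.xml.

```perl
use strict;
use warnings;

# Cases documented for column w; a true value marks those whose
# data is exported to metadata.xml.
my %CASES = (
    bestcase    => 0, iffonly  => 0, replace => 1, reimport => 1,
    iffnotfound => 0, notfound => 0, unsure  => 0, ignore   => 0,
);

# tally_cases is a hypothetical helper: it counts how often each case
# occurs and collects undocumented values under the key "unknown".
sub tally_cases {
    my @cases = @_;
    my %count;
    for my $case (@cases) {
        my $key = exists $CASES{$case} ? $case : 'unknown';
        $count{$key}++;
    }
    return %count;
}

# Made-up sample values, for illustration only.
my @sample = qw(bestcase replace replace notfound reimport);
my %count  = tally_cases(@sample);

my $with_metadata = 0;
for my $case (grep { $CASES{$_} } keys %CASES) {
    $with_metadata += $count{$case} // 0;
}

printf "%-12s %d\n", $_, $count{$_} for sort keys %count;
print "records expected in metadata.xml: $with_metadata\n";
```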

metadata.xml

MARCXML export for the cases reimport and replace. The docnr. can be found in controlfield 001 and corresponds to the docnr. in the export file.
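To look up a record by its docnr., controlfield 001 can be read with XML::LibXML, which is among the prerequisites. This is a minimal sketch under assumptions: the sample record and its docnr. are made up, and the file is assumed to use the standard MARC21 slim namespace with a collection root element.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Made-up MARCXML sample; real records carry many more fields.
my $xml = <<'END_XML';
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">123456789</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Steuerrecht</subfield>
    </datafield>
  </record>
</collection>
END_XML

# For a file, use XML::LibXML->load_xml(location => 'metadata.xml') instead.
my $doc = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(marc => 'http://www.loc.gov/MARC21/slim');

# Map each docnr. (controlfield 001) to the title in field 245, subfield a.
my %title_for;
for my $record ($xpc->findnodes('//marc:record')) {
    my $docnr = $xpc->findvalue('marc:controlfield[@tag="001"]', $record);
    my $title = $xpc->findvalue(
        'marc:datafield[@tag="245"]/marc:subfield[@code="a"]', $record);
    $title_for{$docnr} = $title;
}

print "$_: $title_for{$_}\n" for sort keys %title_for;
```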

report.txt

Contains debugging info (quite chatty) for each document, its result set and matching values.

log_<timestamp>.txt

Logfile with statistics.

Contributors

kathrin77