
cds-migrator-kit's Issues

Books: clean aleph numbers

Approximately 25,000 records need to be corrected.
One way to tell whether a value is an Aleph number is that the number starts with 000:
https://cds.cern.ch/search?ln=en&sc=1&p=962__b%3A%22000*%22+or+785%3A%22000*%22+or+770%3A%22000*%22+or+780%3A%22000*%22+or+787%3A%22000*%22+or+772%3A%22000*%22&action_search=Search&op1=a&m1=a&p1=&f1=&c=Articles+%26+Preprints&c=Books+%26+Proceedings&c=Presentations+%26+Talks&c=Periodicals+%26+Progress+Reports&c=Multimedia+%26+Outreach

(The search may not be 100% accurate.)

The fields that need to be checked are:

  • 962b
  • 770w
  • 772w
  • 780w
  • 785w
  • 787w

Additionally:

  • 035$$9CERCER

The matching needs to be done against 970__a where there is ‘CER’ and one needs to replace this value with the corresponding CDS record number.

Here is an example:
https://cds.cern.ch/record/1163043?ln=en

As far as I know, field 035 also contains Aleph numbers when $$9CERCER:
Ex: https://cds.cern.ch/record/1220684?ln=en -> to be checked if it is still in use for something
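The matching described above could be sketched roughly as follows. This is only an illustration under assumptions: records are modelled as plain dicts keyed by (tag, subfield), the helper names are hypothetical, and the sample values in the usage note are made up rather than taken from CDS.

```python
# Fields listed above whose values may still hold an Aleph number.
# Modelling a record as {(tag, subfield): [values]} is an assumption of this sketch.
ALEPH_FIELDS = [("962", "b"), ("770", "w"), ("772", "w"),
                ("780", "w"), ("785", "w"), ("787", "w")]


def aleph_key(value):
    """Strip whitespace and a trailing 'CER' so both notations compare equal."""
    value = value.strip()
    return value[:-3] if value.endswith("CER") else value


def build_aleph_index(records):
    """Map each Aleph number found in 970__a (values ending in 'CER') to its recid."""
    index = {}
    for recid, fields in records.items():
        for value in fields.get(("970", "a"), []):
            if value.strip().endswith("CER"):
                index[aleph_key(value)] = recid
    return index


def replace_aleph_numbers(fields, index):
    """Return a copy of `fields` where known Aleph numbers become record numbers."""
    cleaned = dict(fields)
    for key in ALEPH_FIELDS:
        if key not in fields:
            continue
        new_values = []
        for value in fields[key]:
            candidate = aleph_key(value)
            # Only touch values that look like Aleph numbers and have a match.
            if candidate.startswith("000") and candidate in index:
                new_values.append(str(index[candidate]))
            else:
                new_values.append(value)
        cleaned[key] = new_values
    return cleaned
```

For example, with a made-up record 42 carrying 970__a "000123456CER", a 962__b value of "000123456" would become "42", while values without a match are left untouched so they can be reviewed by hand.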

( requires #21 )

internal locations: health check

  • check if all locations migrated (especially make sure that all internal/external were split correctly)
  • check if addresses are correct

document requests: health check

  • check if all document requests migrated (formerly known as "proposal-new", "proposal-put aside")
  • check if the document requests have the fields migrated correctly

journals: health check

  • check if all journals migrated
  • check if the journals have correct fields migrated
  • check if the access_urls are correct

loans: health check

  • check if all loans migrated (current state is on 05 March 2021)
  • check if patrons attached correctly
  • check if statuses make sense (e.g. no ongoing loan on the anonymous patron)
  • check if the end dates/start dates are correct

Books: identify all records that need to be migrated

Based on the requirements, the following collections need to be migrated. Compute a query (or generate only one collection) in order to identify all records that need to be moved as part of the Books satellite.

Collection | # of records
Books held by LHC (obsolete) | 0
Books held by PS-PO (obsolete) | 0
Customary law part of Legal Service Library | 2
Highway Code part of Legal Service Library | 6
Environmental Law part of Legal Service Library | 7
Law of research part of Legal Service Library | 12
Criminal Law part of Legal Service Library | 21
Nuclear Law part of Legal Service Library | 22
Fiscal Law part of Legal Service Library | 24
Social Security & Public Health part of Legal Service Library | 24
Building Law part of Legal Service Library | 36
Legal Research part of Legal Service Library | 52
Books held by EST (obsolete) | 70
Labour Law part of Legal Service Library | 75
Books held by TIS (obsolete) | 104
Public & Administrative Law part of Legal Service Library | 122
Civil Law part of Legal Service Library | 211
Pauli's scientific book collection to be moved with the Archives | 357
International Law part of Legal Service Library | 530
Book Proposals | 633
CERN Bookshop | 1049
CERN Computing Bookshop (same as CERN Bookshop) | 1049
Legal Service Library | 1160
CERN Yellow Reports | 1183
Periodicals -> different display, no individual items per volume | 2290
English Book Club | 4705
UDC (marked as hidden) | 5667
Standards | 12959
Proceedings | 22973
eBooks | 81184
Books | 105341

serials: health check

  • check if all the serials migrated correctly
  • check if the identifiers are present (ISSN)
  • check if the documents were attached correctly (there should be no empty serials, without documents)

borrowing requests: health check

borrowing requests are formerly known as ILLs of type "book"

  • check if all borrowing requests migrated
  • check if statuses are correct
  • check if fields migrated correctly

multiparts: health check

  • check if all multiparts migrated
  • check if there are no empty multiparts
  • check if all documents created
  • check if documents of multiparts have attached correct items
  • check if all fields are correct

Series migration clean-up

  • fix the series similarity check on qa (it works locally but not on qa, probably due to the python-Levenshtein library, which is used to find similarities)
  • remove 'View' button from the results of the series if they were already classified as a part of another - because they don't produce the file
  • make sure that nothing is missing from CERNDocumentServer/cds-dojson#222 (it should be closed by a PR but it was not)
  • fix the failing tests on PR #35, merge to master and change deployment ref on openshift to master
  • make sure that migration script works (production script) and switch the membership of @agentilb to books-migrator-qa from books-migrator-dev openshift project

loan requests: health check

  • check if correct number of loans in state PENDING is migrated (they correspond with the loan requests)
  • check if valid until dates are correct
  • check if loan requests dates are correct

Books: Items: migrate volumes and series

In our system, to model Series/Volumes/Items, we will have the following:

  • Series: a Document referencing a list of Volumes
  • Volume: a Document referencing a list of Items
  • Edition: it is just the name of the Document (of a Series, of a Volume, of a simple Document)
                                        - Item 1 (item) 
                - Volume 1 (doc)   < 
                                        - Item 2 (item)
Series (doc) <
                                        - Item 1 (item) 
                - Volume 2 (doc)   < 
                                        - Item 2 (item)

BibRecs have information about Volumes in fields:

  • 246__v: volume
  • 300__a: will contain the number of volumes

In legacy, Series are BibRecs and Volumes are Items where Item description states the volume name, for example:
Series
3 volumes

TODO before migration

  1. Find all BibRecs that have volume info (246__v/300__a)
  2. For each BibRec:
    1. find all the Items and get the description field that contains the volume name (several different formats, tricky...). The number of Items attached to that BibRec should correspond to the 300__a number...
    2. create a new BibRec that will be the volume, copy the needed metadata, and set the name as the Item description
    3. change the original BibRec to be the Series by fixing metadata and attaching all the Volumes BibRecs
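As a starting point for step 2.1, the volume could be pulled out of the Item description with a regular expression and the Items grouped per volume. The sketch below is an assumption about how this might look: the pattern only covers the "v."/"vol."/"volume" spellings followed by a number, the function names are hypothetical, and descriptions it cannot parse are collected under None for manual review.

```python
import re
from collections import defaultdict

# Matches "v. 1", "v 1", "vol.39", "Vol. 2", "volume 3"; other formats are not covered.
VOLUME_RE = re.compile(r"\bv(?:ol(?:ume)?)?\.?\s*(\d+)", re.IGNORECASE)


def extract_volume(description):
    """Return the volume number found in an Item description, or None."""
    if not description or description.strip() == "-":
        return None
    match = VOLUME_RE.search(description)
    return match.group(1) if match else None


def group_items_by_volume(items):
    """Group (barcode, description) pairs by volume number.

    Items whose description cannot be parsed end up under the None key.
    """
    volumes = defaultdict(list)
    for barcode, description in items:
        volumes[extract_volume(description)].append(barcode)
    return dict(volumes)
```

Each non-None key would then correspond to one new Volume BibRec, and the number of such keys can be compared with the 300__a value as a sanity check.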

documents: health check

  • check if all documents migrated (random check)
  • check migrated fields
  • check migrated urls
  • check if all the identifiers are present on the documents

vendors: health check

  • check if vendors are correctly merged (Wiley)
  • check if all vendors migrated
  • check if all fields migrated correctly

[US] Books: migrate items

parent of #16
parent of #17 [CLOSED]
parent of #18

Tasks

  • Prepare the migration of items (~300K items).
  • Identify the ones that need to be migrated as records.

Analysis

Items migration consists of 2 main objects:

  • Libraries
  • Items

Libraries

  • id (barcode), primary key, unique, not null
  • type, possible values:
    • external: 43 libraries (ex. RERO, Nebis, Springer, ...), 0 items referencing them
    • internal: 24 libraries (ex. CERN LHC Library, CERN EP Library, CERN ARC Library, CERN Courier, ...)
    • main: 2 libraries, CERN Central Library and Other but 0 items referencing this last one
    • hidden: 1 library, CERN Central Storage, 0 items referencing it
  • name, address, email, phone, notes

Comparing this legacy data model with the new data model, libraries will correspond to different rooms or buildings, all belonging to a single Location. This means that Items will reference a room (internal location) and not a location.

Migration

The proposed schemas for libraries are the following:

  • 1 single location:
{
    "location": {
        "locid": "1", # id format to be defined
        "name": "CERN Central Library"
      }
}
  • multiple internal locations:
{
    "internal_locations": [
        {
            "phone": "", 
            "ilocid": 0, # id format to be defined
            "name": "Legal Service Library", 
            "legacy_id": 3, 
            "address": "", 
            "notes": "", 
            "locid": "1", 
            "email": ""
        }, 
        {
            "phone": "", 
            "ilocid": 1, 
            "name": "CERN Central Library", 
            "legacy_id": 6, 
            "address": "CH-1211 Geneva 23", 
            "notes": "", 
            "locid": "1", 
            "email": "[email protected]"
        }, 
        ...
    ]
}

The location -> internal location structure is needed because Invenio ILS, in case of multiple locations, will change the loan workflow to "transit" books between different locations.
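A conversion from legacy library rows to the internal location schema above might look like the following sketch. The input dict keys, the function name, and the choice to keep only libraries of type internal or main are assumptions made for illustration; the id formats are still to be defined, so a simple counter is used.

```python
def to_internal_locations(libraries, locid="1"):
    """Turn legacy library rows into internal locations of one single Location.

    Assumption: only 'internal' and 'main' libraries become internal locations;
    other types (e.g. 'external') are skipped here.
    """
    kept = [lib for lib in libraries if lib.get("type") in ("internal", "main")]
    internal_locations = []
    for ilocid, lib in enumerate(kept):
        internal_locations.append({
            "ilocid": ilocid,  # id format to be defined
            "locid": locid,
            "legacy_id": lib["id"],
            "name": lib["name"],
            "address": lib.get("address", ""),
            "email": lib.get("email", ""),
            "phone": lib.get("phone", ""),
            "notes": lib.get("notes", ""),
        })
    return internal_locations
```

Keeping legacy_id on every entry makes it possible to resolve the id_crcLIBRARY reference of each Item to the new ilocid later on.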

Items

  • barcode, primary key, unique, not null
  • id_bibrec, reference to the document
  • id_crcLIBRARY, reference to the library
  • collection, possible values Monograph, Reference, Archives, Library, Conference, LSL Depot, Oversize, Official, Pamphlet, CDROM, Standards, Video & Trainings, Periodical. To be deleted #3.
  • location, a string for the UDC classification (ex. R 614(02) PAN blue, Blacksburg 1981, 530.145.29 MAN, 539.125 WOR, 621.3.02 NAI, ...)
  • description, Volumes/Series identifier, see below
  • loan_period, 4 weeks, 1 week, Reference
  • status: see below
  • number_of_requests: int, number of requests while item on loan
  • expected_arrival_date: related to acquisition
  • creation_date, modification_date

One item does not have the barcode, to be fixed.

Volumes/Series

Currently, volume or series definition is in the description field. A few numbers:

  • Description with v. or vol.: ~39K
  • Description without v. or vol.: ~5K (ex. "volume", "missing", "part. 1", "1978", "Part 2", "v 1")
  • Description is NOT NULL or NOT "-": ~58K
  • Description is NULL or "-": ~29K

Status

  • missing: ~12K -> MISSING
  • on shelf ~280K -> LOANABLE (maybe better name)
  • in binding: ~120 -> IN_BINDING
  • on loan: ~2.7K -> not needed, handled by Loan object
  • on order: ~160 -> related to acquisition
  • out of print: ~40 -> to keep?
  • not published: ~84 -> to keep?
  • cancelled: ~10 -> acquisition/ILL?
  • not arrived: ~260 -> acquisition/ILL?
  • untraceable: ~700 -> what is the difference with missing?
  • under review: ~1 -> to keep?
  • scanning: not existing for now, do you need?
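The decided part of this mapping can be written down as a small lookup. The sketch below only encodes the three mappings stated above plus the "on loan" case; the function name is hypothetical, and every status that is still an open question deliberately raises an error instead of being guessed.

```python
# Mappings that are stated explicitly in the list above.
STATUS_MAP = {
    "missing": "MISSING",
    "on shelf": "LOANABLE",
    "in binding": "IN_BINDING",
}

# "on loan" is not an item status in the new model; it is handled by the Loan object.
HANDLED_BY_LOAN = {"on loan"}


def map_status(legacy_status):
    """Return the new item status, or None when the Loan object covers it.

    Raises ValueError for statuses whose mapping has not been decided yet.
    """
    status = legacy_status.strip().lower()
    if status in HANDLED_BY_LOAN:
        return None
    if status not in STATUS_MAP:
        raise ValueError("no mapping decided for status: %r" % legacy_status)
    return STATUS_MAP[status]
```

Failing loudly on undecided statuses such as "untraceable" keeps them visible until the questions above are answered.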

Migration

The proposed schema for item is the following:

{
    "itemid": "1", # id format to be defined
    "docid": "1", # id format to be defined
    "ilocid": "1",  # id format to be defined
    "legacy_bibrec_id": "111111",
    "legacy_library_id": "3",
    "barcode": "83-0384-4",
    "classification": "530.24 SEI",
    "description": "",
    "status": "LOANABLE",
    "circulation_restriction": "Reference",
    "medium": "",
    "created": "date",
    "updated": "date"
}
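To illustrate how a legacy item row could be reshaped into this schema, here is a minimal sketch. Several things are assumptions: the legacy row is a dict keyed by the column names listed under Items, the legacy location string is taken to be the new classification, loan_period is copied into circulation_restriction unchanged, and the new ids and the status are supplied by the caller because their format and mapping are not settled.

```python
from datetime import datetime


def build_item(row, itemid, docid, ilocid, status):
    """Build a new-style item dict from a legacy item row (hypothetical helper)."""
    return {
        "itemid": itemid,    # id format to be defined
        "docid": docid,      # id format to be defined
        "ilocid": ilocid,    # id format to be defined
        "legacy_bibrec_id": str(row["id_bibrec"]),
        "legacy_library_id": str(row["id_crcLIBRARY"]),
        "barcode": row["barcode"],
        "classification": row["location"],  # assumption: UDC string from legacy "location"
        "description": row["description"] or "",
        "status": status,
        "circulation_restriction": row["loan_period"],  # assumption: copied as-is
        "medium": "",  # does not exist in legacy
        "created": row["creation_date"].isoformat(),
        "updated": row["modification_date"].isoformat(),
    }


# Made-up legacy row, for illustration only.
example_row = {
    "barcode": "83-0384-4",
    "id_bibrec": 111111,
    "id_crcLIBRARY": 3,
    "location": "530.24 SEI",
    "description": None,
    "loan_period": "Reference",
    "creation_date": datetime(2014, 9, 28, 14, 35, 35),
    "modification_date": datetime(2015, 3, 5, 10, 42, 27),
}
example_item = build_item(example_row, "1", "1", "1", "LOANABLE")
```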

Questions

Library

  • What is type external of libraries used for? Looks like ILL.
  • What is type hidden of libraries used for?
  • What is type = main, name =Other (0 items)?

Items

  • What is collection for?
  • Collection Periodical, there are 47K items, but they are not displayed in the interface to the user. For example: https://cds.cern.ch/record/229779. Take into account that we wanted to delete collections, see #3.
  • What is location for? UDC classification?
  • Can we have loan_period 4 weeks as default and then restrict for Reference for some items? Is 1 week really needed? (There are 19K items with 1 week)
  • Review all status.
  • #3 Medium field wanted: how does this affect loans? Possible values: --, online, paper, CD-ROM, DVD, VHS. Currently it does not exist; are you going to update 300K items?
  • number_of_requests: how do you use this today?
  • expected_arrival_date: related to acquisition ?

TO DO

electronic items: health check

  • check if all electronic items migrated
  • check if EBL/ezproxy works correctly
  • check if all files migrated (files now migrated as EItems)

Migration of files

Find and implement a solution for the following cases:

  1. charts of the articles from Inspire (?): to discuss
  2. compressed full text attached to the record: EItem
  3. icons of subformats: they should be dropped

Photos Migration: clean 962

962__l:PHOPHO it seems to correspond to records (mostly Bulletin articles) linked with photos.
In this case, the corresponding record has a 035__ with $$9PHOPHO
Ex: https://cds.cern.ch/record/46124 linked to https://cds.cern.ch/record/43022?ln=en

962__l:MMD it seems to correspond to records (mostly Bulletin articles) linked with photos.
In this case, the corresponding record has 970__a:'MMD' (and/or 035__9:'MMD')
Ex: https://cds.cern.ch/record/749053 linked to https://cds.cern.ch/record/615876

962__l:ADMBUL it seems to correspond to photo records linked with Bulletin issues (all are from the years 2000-2001).
In this case, the corresponding record has 035:'ADMBUL'
Ex: https://cds.cern.ch/record/41801 linked to https://cds.cern.ch/record/44476?ln=en

Related to #15

items: broken data on CDS

  • item with barcode: P00025946 has \ in the barcode field, remove "\"
  • item attached to recid 356068 has \ instead of barcode and status missing. remove the item ?
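If these two cases are handled in the migration code rather than fixed on CDS, a small normalisation step could cover both. This is a sketch with a hypothetical function name; whether an item left without any barcode should be dropped is still the open question above, so the function only reports it by returning None.

```python
def clean_barcode(barcode):
    """Remove stray backslashes and whitespace; return None if nothing is left."""
    cleaned = (barcode or "").replace("\\", "").strip()
    return cleaned or None
```

A barcode with a stray backslash is returned without it, while a barcode consisting only of a backslash yields None and can be flagged for review.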

items: health check

  • check if all items migrated
  • check if item statuses are consistent
  • check if barcodes are correct
  • check if items are attached to correct locations

orders: health check

orders were formerly known as ILLs with types ("acq-book", "acq-standard", "proposal-book", "article")

  • check if orders migrated correctly
  • check order lines
  • check if the statuses make sense

General: Aleph number in 035 does not match the Aleph number on 970

There are 58 records (in addition to the 13 mentioned in #15 ) that have different Aleph numbers in 035 and 970. These records need to be fixed manually (they concern future migrations of CERN Research Output).

Record: http://cds.cern.ch/record/194091        035$$9CERCER$$a0104950  970$$a000105690CER
Record: http://cds.cern.ch/record/195686        035$$9CERCER$$a0099711  970$$a000107310CER
Record: http://cds.cern.ch/record/202170        035$$9CERCER$$a0117910  970$$a000113835CER
Record: http://cds.cern.ch/record/209171        035$$9CERCER$$a0127446  970$$a000121097CER
Record: http://cds.cern.ch/record/209860        035$$9CERCER$$a0162005  970$$a000121816CER
Record: http://cds.cern.ch/record/213210        035$$9CERCER$$a2209846  970$$a000125367CER
Record: http://cds.cern.ch/record/243637        035$$9CERCER$$a0269426  970$$a000159398CER
Record: http://cds.cern.ch/record/269418        035$$9CERCER$$a2226562  970$$a000188427CER
Record: http://cds.cern.ch/record/271584        035$$9CERCER$$a2226566  970$$a000190935CER
Record: http://cds.cern.ch/record/284699        035$$9CERCER$$a2226573  970$$a000205010CER
Record: http://cds.cern.ch/record/288412        035$$9CERCER$$a0219865  970$$a000209164CER
Record: http://cds.cern.ch/record/309561        035$$9CERCER$$a0222517  970$$a000232098CER
Record: http://cds.cern.ch/record/326783        035$$9CERCER$$a0251850  970$$a000250165CER
Record: http://cds.cern.ch/record/388496        035$$9CERCER$$a2187195  970$$a000314599CER
Record: http://cds.cern.ch/record/392830        035$$9CERCER$$a0319049  NO 970
Record: http://cds.cern.ch/record/410746        035$$9CERCER$$a0338008  NO 970
Record: http://cds.cern.ch/record/426599        035$$9CERCER$$a2175829  NO 970
Record: http://cds.cern.ch/record/426600        035$$9CERCER$$a2175830  NO 970
Record: http://cds.cern.ch/record/430066        035$$9CERCER$$a2188979  970$$a002179441CER
Record: http://cds.cern.ch/record/433058        035$$9CERCER$$a2271797  970$$a002182579CER
Record: http://cds.cern.ch/record/448286        035$$9CERCER$$a2199243  NO 970
Record: http://cds.cern.ch/record/476281        035$$9CERCER$$a2229795  NO 970
Record: http://cds.cern.ch/record/501204        035$$9CERCER$$a2256379  NO 970
Record: http://cds.cern.ch/record/504321        035$$9CERCER$$a2259580  NO 970
Record: http://cds.cern.ch/record/504326        035$$9CERCER$$a2259586  NO 970
Record: http://cds.cern.ch/record/504759        035$$9CERCER$$a2266321  970$$a002260033CER
Record: http://cds.cern.ch/record/506607        035$$9CERCER$$a2176523  970$$a002261933CER
Record: http://cds.cern.ch/record/519146        035$$9CERCER$$a2275568  NO 970
Record: http://cds.cern.ch/record/532655        035$$9CERCER$$a2173483  970$$a002289647CER
Record: http://cds.cern.ch/record/535960        035$$9CERCER$$a2293011  NO 970
Record: http://cds.cern.ch/record/545306        035$$9CERCER$$a2302560  NO 970
Record: http://cds.cern.ch/record/553396        035$$9CERCER$$a2194463  970$$a002310926CER
Record: http://cds.cern.ch/record/566853        035$$9CERCER$$a2343329  970$$a002324720CER
Record: http://cds.cern.ch/record/585796        035$$9CERCER$$a2343691  970$$a002344568CER
Record: http://cds.cern.ch/record/599531        035$$9CERCER$$a2358567  NO 970
Record: http://cds.cern.ch/record/610249        035$$9CERCER$$a2369623  NO 970
Record: http://cds.cern.ch/record/621944        035$$9CERCER$$a2348993  970$$a002382392CER
Record: http://cds.cern.ch/record/682499        035$$9CERCER$$a0271495  970$$a002408955CER
Record: http://cds.cern.ch/record/684098        035$$9CERCER$$a2212346  970$$a002410511CER
Record: http://cds.cern.ch/record/684138        035$$9CERCER$$a2192143  970$$a002410551CER
Record: http://cds.cern.ch/record/684225        035$$9CERCER$$a2196586  970$$a002410638CER
Record: http://cds.cern.ch/record/685431        035$$9CERCER$$a2350522  970$$a002411814CER
Record: http://cds.cern.ch/record/685675        035$$9CERCER$$a0218878  970$$a002412058CER
Record: http://cds.cern.ch/record/686348        035$$9CERCER$$a2371741  970$$a002412713CER
Record: http://cds.cern.ch/record/688730        035$$9CERCER$$a2284468  970$$a002415020CER
Record: http://cds.cern.ch/record/689235        035$$9CERCER$$a2361267  970$$a002415471CER
Record: http://cds.cern.ch/record/698612        035$$9CERCER$$a0113323  970$$a000410321CER
Record: http://cds.cern.ch/record/700006        035$$9CERCER$$a0113318  970$$a000411845CER
Record: http://cds.cern.ch/record/781363        035$$9CERCER$$a2194481  970$$a002471749CER
Record: http://cds.cern.ch/record/798281        035$$9CERCER$$a4091040  970$$a002487763CER
Record: http://cds.cern.ch/record/879171        035$$9CERCER$$a0043327  970$$a002552298CER
Record: http://cds.cern.ch/record/978549        035$$9CERCER$$a0236412  970$$a002642021CER
Record: http://cds.cern.ch/record/1015056       035$$9CERCER$$a0197554  970$$a002675719CER
Record: http://cds.cern.ch/record/1023569       035$$9CERCER$$a0232429  970$$a002682666CER
Record: http://cds.cern.ch/record/1330744       035$$9CERCER$$a0068296  970$$a002949798CER
Record: http://cds.cern.ch/record/2026616       035$$9CERCER$$a0016553  NO 970
Record: http://cds.cern.ch/record/2306622       035$$9CERCER$$a0245770  NO 970
Record: http://cds.cern.ch/record/2306623       035$$9CERCER$$a0245770  NO 970
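A check like the one that produced this list could be sketched as below. The comparison rule is an assumption: the 035 values above have 7 digits and the 970 values have 9 digits plus the 'CER' suffix, so the sketch strips the suffix and compares the numbers as integers. The function names are hypothetical.

```python
def aleph_from_970(value):
    """Strip the trailing 'CER' from a 970__a value."""
    return value[:-3] if value.endswith("CER") else value


def compare_aleph(a035, a970):
    """Classify a record by comparing 035$$a (with $$9CERCER) against 970$$a.

    Returns 'missing_970', 'match' or 'mismatch'.
    Assumption: leading zeros are padding, so values are compared as integers.
    """
    if a970 is None:
        return "missing_970"
    if int(a035) == int(aleph_from_970(a970)):
        return "match"
    return "mismatch"
```

For the first row of the list, compare_aleph("0104950", "000105690CER") reports a mismatch, and rows marked "NO 970" are reported as missing_970.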

ILL libraries: health check

  • verify if all the external libraries are migrated
  • verify if the libraries are merged correctly
  • verify if all fields migrated

Books: Items: clean items data before migration

TODO before migration

Libraries

Some libraries need to be cleaned. This is the list of libraries to keep:

'163322','CERN Central Library','CH-1211 Geneva 23'
'1046','CERN LSL Library','CH-1211 Geneva 23'
'38875','CERN Depot 1, bldg. 2 (DE1)','CH-1211 Geneva 23, basement, building 2'
'49101','CERN Depot 2, bldg. 2 (DE2)','Basement, building 2'
'8070','CERN Depot 3, bldg. 2 (DE3)','Basement, building 2'
'15732','CERN Depot 4, bldg. 500-S-023 (DE4)',''
'15156','CERN ARC Library','CH-1211 Geneva 23'
'167','CERN Didactic Library',''
'4886','English Book Club','English Book Club'
- CERN TE-VSC Library

The following libraries should be deleted, and their attached items moved to CERN Central Library:

- CERN Central Library BIB
- Same for Press Office

Items

  1. In legacy, items with status:

    • scanning: there will be a new list of items to bulk update items and set the status to scanning.
    • cancelled, not arrived, untraceable: there should be 0 items with these statuses, so they should disappear
    • not published, out of print: to be kept
  2. Ensure that the item without barcode has been fixed:

(('', 810479L, 12L, 'Periodical', '', '2014 vol.39 no.10', 'Reference', 'on shelf', datetime.datetime(2014, 9, 28, 14, 35, 35), datetime.datetime(2015, 3, 5, 10, 42, 27), 0L, ''),)
