combine's Introduction

🚜 Combine

Overview

Combine is a Django application to facilitate the harvesting, transformation, analysis, and publishing of metadata records by Service Hubs for inclusion in the Digital Public Library of America (DPLA).

The name "Combine", pronounced /kämˌbīn/, is a nod to the combine harvester used in farming famous for, "combining three separate harvesting operations - reaping, threshing, and winnowing - into a single process" Instead of grains, we have metadata records! These metadata records may come in a variety of metadata formats, various states of transformation, and may or may not be valid in the context of a particular data model. Like the combine equipment used for farming, this application is designed to provide a single point of interaction for multiple steps along the way of harvesting, transforming, and analyzing metadata in preperation for inclusion in DPLA.

Documentation

Documentation Status

The Combine team is in the process of updating the documentation, as the installation process and user interface have changed significantly. In the meantime, some out-of-date documentation is available at Read the Docs.

Installation

Combine has a fair number of server components, dependencies, and configurations that must be in place for it to work, as it leverages Apache Spark, among other applications, for processing on the backend. Previous versions of Combine offered a couple of deployment options; for the current and future versions (v0.11.1 and after), only the Docker option is available.

Docker

A GitHub repository Combine-Docker exists to help stand up an instance of Combine as a series of interconnected Docker containers.

Security Warning

Combine should be run behind your institution's firewall on a secured server. Access to Combine should be protected by your institution-wide identity and password system, preferably using two-factor authentication. If your institution supports using a VPN for access to the server's network, that is a good additional step.

This is in addition to Combine's own passwords. While we do not yet have explicit documentation on how to set up SSL inside the nginx instance provided with Combine, it is possible and strongly recommended.

Tech Stack Details

Django

The whole app is a Django app.

MySQL

The system configuration is stored in MySQL. This includes users, organizations, record groups, jobs, transformations, validation scenarios, and so on.

Mongo

The harvested and transformed Records themselves are stored in MongoDB, to deal with MySQL's scaling problems.

ElasticSearch

We use ElasticSearch for indexing and searching the contents of Records.

Celery

Celery runs background tasks that don't deal with large-scale data, like prepping job reruns or importing/exporting state.

Redis

Redis keeps track of Celery's task queue.

Livy

Livy is a REST interface to make it easier to interact with Spark.

Apache Spark

Spark runs all the Jobs that harvest or alter records, for better scalability.

Hadoop

Hadoop (HDFS) provides the storage layer backing Spark.

User-suppliable Configurations

Field Mapper

Field Mappers let you make changes when mapping a Record from XML to key/value pairs (JSON).

XML to Key-Value Pair

XSL Stylesheet

Python Code Snippet

Transformation

Transformations let you take a Record in one format and turn it into a new Record in another format.

XSLT Stylesheet

Python Code Snippet

Open Refine Actions

Validation Scenario

You can run Validation Scenarios against the Records in a Job to find out which records do or do not meet the requirements of the Validation Scenario.

Schematron

Python Code Snippet

ElasticSearch Query

XML Schema

Record Identifier Transformation Scenario

RITS are used to transform a Record's Identifier, which is used for publishing and for uniqueness checks.

Regular Expression

Python Code Snippet

XPath Expression

combine's People

Contributors

antmoth, blancoj, colehudson, dependabot[bot], ghukill, gordonleacock

combine's Issues

"mysql has gone away"

Occasionally, if Combine has been idle for a while and you perform a Spark-related task, the returning Livy statement will report the error that MySQL has gone away.

This Django forum post touches on this error.

Without much investigation so far, it appears this affects the Spark code that is uploaded from core/spark/jobs.py and core/spark/es.py. It might be worth looking into whether or not these jobs should close their database connection when complete and re-open it when starting. The post above suggests this is possible with:

from django.db import connection
connection.close()
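
A minimal sketch of how this might be applied inside a Spark job, assuming the entry points in core/spark/jobs.py can simply drop the stale connection before touching the ORM (the function name here is hypothetical):

from django.db import connection

# hypothetical wrapper; the real entry points live in core/spark/jobs.py
def run_spark_job(spark, **kwargs):

	# close any stale connection left over from an earlier Livy statement;
	# Django will transparently open a fresh one on the next ORM query
	connection.close()

	# ... existing harvest/transform logic that uses the ORM ...

	# optionally close again when complete, so the next statement starts clean
	connection.close()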

use .get instead of .filter

Convert instances of known DB retrieval from .filter().first():

models.RecordGroup.objects.filter(id=record_group_id).first()

to .get():

models.Job.objects.get(pk=job_id)
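
One wrinkle: .get() raises DoesNotExist where .filter().first() quietly returned None, so call sites will need to handle that. A minimal sketch, reusing the model and logger conventions shown elsewhere in Combine:

try:
	record_group = models.RecordGroup.objects.get(id=record_group_id)
except models.RecordGroup.DoesNotExist:
	logger.debug('RecordGroup %s not found' % record_group_id)
	record_group = None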

Spark UI link in record group details is flawed

It looks as though it's taking the Livy Job ID, which works when there is a 1:1 relationship between the Livy session and the container. But this isn't always the case.

Not sure what to do now, but noting here.

Detect duplicates in Merge job

Currently, during a Merge job, it will happily merge records with the same ID. It should probably detect duplicate IDs and report them as errors.

Exceeding memory for OAI Harvests for large sets

After some testing, it would appear that OAIHarvester is failing for harvests where a single OAI set exceeds a threshold of records. Sets of ~46k records are confirmed to work, but sets of ~90k consistently fail, which suggests the threshold is somewhere in between.

Encountered this problem once before when ingesting the VMC collection, which was 38k records. Bumping up memory for the executors with the following fixed it:

# memory
spark.driver.memory 2g
spark.executor.memory 2g
spark.driver.cores 1
spark.executor.instances 1

However, it looks like we might be encountering this again.

Receiving the following error in the executor logs:

java.lang.OutOfMemoryError: Java heap space

Could potentially bump the memory for the executors (though not possible on the current 8gb VM), but that would just "kick the can" on this issue. Holding off for now until speaking with DPLA folks to see if it is a known issue.

We have observed there might be a "sweet spot" that obscures this issue: in a typical harvest, it's quite common to have many sets, but of smaller size. Or, because 90k is not that large, even slightly more memory for the Spark executors might be enough to avoid this problem. If not resolved entirely, this might be something for documentation, noting that if this error is encountered, the only option is more memory.

Filesystem cleanup

When deleting orgs, record groups, or jobs, remove the appropriate directories from the filesystem as well.

retrieving record stages efficiency

With about 130k records in the core_record table, retrieving a record's stages -- up and downstream -- takes about 1s. This is not terribly slow, but the method for retrieving these record stages might not scale when the DB has 500k or 1m records.

app/server crash: update statuses of Livy sessions and Jobs

When there is a catastrophic failure, which should be rare, Job statuses are not updated, because the jobs page does not attempt to update the LivySession that the app state still thinks is active.

What might be good, when Combine starts, would be to check any LivySessions that claim to be active.
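
A rough sketch of what that startup check could look like, assuming the LivySession fields used elsewhere in this document (session_url, status, active), the Livy REST API for session status, and a hypothetical LIVY_BASE setting for the Livy host:

import requests

from core.models import LivySession  # assumed import path

def refresh_livy_sessions():

	# loop through sessions Combine still believes are active
	for session in LivySession.objects.filter(active=True):

		# LIVY_BASE is a hypothetical setting pointing at the Livy host
		resp = requests.get(LIVY_BASE + session.session_url)

		if resp.status_code == 404:
			# session no longer exists in Livy; mark it inactive
			session.active = False
			session.status = 'gone'
		else:
			session.status = resp.json().get('state')

		session.save()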

Single record details job/record ancestry

When viewing details for a record, it computes the ancestry of the record. This works, but there are confusing elements and performance problems.

  • With merge jobs, ancestry can repeat upstream jobs
    • Maybe this isn't an issue? But it's a bit confusing
  • If a downstream job is running, it can be slow to look for the record
    • Not entirely sure what to do here, yet

handle indexing errors

Now that records are being written to SQL, indexing results (failures) are the lone data written to Avro that is not accessible by other means.

This results in the same memory hit when trying to load failures for a job; it would most likely be better to store them in SQL.

Unlike job errors, which are stored right alongside jobs, indexing errors are a slightly different beast.

Indexing failures might warrant their own Django model / DB table...
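
A hypothetical shape for such a model, assuming a FK to Job and a reference back to the failed Record (field names here are purely illustrative):

from django.db import models

class IndexMappingFailure(models.Model):

	# job the failure belongs to
	job = models.ForeignKey('Job', on_delete=models.CASCADE)

	# identifier of the record that failed to map/index
	record_id = models.CharField(max_length=1024)

	# captured exception / traceback from the mapping attempt
	mapping_error = models.TextField(null=True, default=None)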

check for dupes

This is something that will likely touch MergeJobs and PublishJobs; a rough sketch of detecting duplicates follows the list below.

  • Merge
    • would make sense to alert users if merged jobs contain records with the same identifier
    • could error out duplicates, nothing more
  • Publish / OAI
    • dupes would confuse harvesters and OAI GetRecord
    • probably other ramifications...
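
A rough sketch of the duplicate check in Spark, assuming the merged records are available as a DataFrame with a record_id column (not an actual implementation):

# records_df: merged records as a Spark DataFrame (assumed)
dupes = (
	records_df
	.groupBy('record_id')
	.count()
	.filter('count > 1')
)

# collect the offending identifiers, e.g. to error them out or alert the user
dupe_ids = [row['record_id'] for row in dupes.collect()]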

Generic field analysis

Field analysis, specifically the URL route, has been tethered to a particular job. This was fine until we wanted to analyze fields from the published ES index.

Propose reworking field analysis to have a URL structure like:

/analysis/es/[index]/field/[field_name]

That way, we can call it from job details or from published records the same way. This makes sense, as ES analysis is somewhat removed from the Combine data model, at least as it relates to a particular job.
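
A sketch of what that route might look like in Django's urls.py, assuming Django 2.x-style path() routing and a hypothetical field_analysis view:

from django.urls import path

from core import views  # assumed import path

urlpatterns = [
	# generic field analysis against any ES index, job-based or published
	path('analysis/es/<str:es_index>/field/<str:field_name>',
		views.field_analysis, name='field_analysis'),
]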

edit Record document (likely XML)

It would be nice to have the ability to edit a Record's document after Harvest/Transform/Publish, what have you.

This has some ramifications:

  • would need to re-index in ES
  • what to do about version in avro files on disk? consider as a restore point?

"Harvest" static files in addition to OAI

One major area that has not yet been addressed is the "harvesting" of static files.

The form of static records may include:

  • discrete XML files, one record per
  • aggregate XML file, with multiple records contained
  • CSV file mapped to metadata fields (e.g. DC, where nested standards like MODS might be hard to map)

To what degree these will all be supported remains to be seen, but we are looking into the most common use cases.

Link metrics from ES analysis to Combine Records view

When analyzing indexed documents and fields in ES, we can easily generate links that point to a raw ES search that contains those documents. Unfortunately, there is no connection between that raw search and the documents as they exist in Combine (from the DB).

It would be interesting to try and unite this, to provide a record query that includes only the subset of records that satisfy the ES query.

One approach might be to issue a secondary, quiet query that retrieves only the ES _ids for that query, and then pipe those into a Django DB query. I wonder how that would scale, where an ES query might bring back 10, 100, 1,000, or 100,000 _ids. That seems unreasonable to wire across the HTTP request/response cycle, and even if done on the backend, these are still pretty large queries.

Noting here, return to later.
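
A sketch of that secondary-query idea, assuming the elasticsearch-py client, an example query, and that the ES _id lines up with Combine's record_id column (all assumptions, not confirmed here):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=['localhost'])

# example query (assumed); in practice this is whatever the analysis view ran
es_query = {'match': {'mods_subject_topic': 'Strikes'}}

# quiet query: bring back only document _ids, no _source payload
results = es.search(
	index='j187',
	body={'query': es_query, '_source': False, 'size': 10000}
)
ids = [hit['_id'] for hit in results['hits']['hits']]

# pipe those ids into a Django DB query for the same records
records = Record.objects.filter(record_id__in=ids)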

Repartitioning Spark jobs

Coalescing a couple of cards/issues into one.

There is some room for improving spark performance by attending to partitions throughout our rather complex spark jobs.

For example, this line from a Transform job was handled by Spark as a single partition (it would appear). Repartitioning here -- currently, to 4x the number of max Spark workers -- improves speed quite dramatically when there are multiple Spark workers, as the partitions are farmed out to the workers in parallel. Formerly, even with two workers, this would operate on a single thread/worker.

It's possible more repartitioning could take place to further improve speed and consistency of partitions throughout Spark pipelines.
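
For reference, the kind of repartitioning described above, assuming a DataFrame of records and a configured maximum worker count (names here are illustrative):

# repartition to 4x the max number of Spark workers before the heavy
# per-record work, so partitions are farmed out to workers in parallel
records_df = records_df.repartition(spark_workers_max * 4)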

Create test data and structure for unit tests

Now that some preliminary unit tests are being created, and there is the ability to "harvest" static XML files, there is a good opportunity to include some demo data that can ship with Combine.

One good option might be a zip/tar file of a large number of discrete XML records, real or fabricated.

These records could have some known quantities like poorly formed XML records that would result in errors that should be caught.

Analyze all published records

With progress made on Western/XSLT 2.0 transformations, it's obvious that it would be nice -- short of a merge -- to be able to preview all published records in Combine.

What would that look like?

  • Record table
    • would be relatively trivial to select all published jobs and display
  • Indexed fields
    • would require a modified ES query, and then field analysis on that query set
    • might make sense to do this under the PublishedRecords class

Record table filtering performance

Currently, the search goes across record_id and document, case sensitive:

qs = qs.filter(Q(record_id__contains=search) | Q(document__contains=search))

It can be made case insensitive with __icontains, but that slows down the query:

qs = qs.filter(Q(record_id__icontains=search) | Q(document__icontains=search))

Testing on 45k records is usable, but performance will likely suffer exponentially with more records. Consider throttling DataTables' eagerness to search on every keystroke, which would dramatically reduce the number of queries.

Create generic XPath parser for ES indexing

For early prototyping, we used a MODS flattener originally written for indexing MODS records to Solr. However, Combine will need to flatten/map documents from multiple formats, not just MODS.

Additionally, while looking into a potential XPath approach, it was observed that repeating fields from MODS are not getting indexed into ES. This is an additional reason to work on a generic mapper for indexing into ES.
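
A minimal sketch of the kind of generic flattening being described, using lxml to walk an arbitrary XML document and derive field names from element paths while preserving repeating fields as lists (purely illustrative, not Combine's eventual mapper):

from lxml import etree

def flatten_record(xml_string):

	root = etree.fromstring(xml_string)
	flat = {}

	for elem in root.iter():

		# skip elements without text content
		if elem.text is None or not elem.text.strip():
			continue

		# build a field name from the element's path, e.g. mods_subject_topic
		ancestry = list(elem.iterancestors())[::-1] + [elem]
		field = '_'.join(etree.QName(e).localname for e in ancestry)

		# repeating fields collect into lists instead of overwriting
		flat.setdefault(field, []).append(elem.text.strip())

	return flat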

error reporting for failed livy session start

Currently, errors from starting a Livy session through Combine are suppressed.

Starting a session is triggered by saving a LivySession DB object, so error reporting will need to bubble up from:

@receiver(models.signals.pre_save, sender=LivySession)
def create_livy_session(sender, instance, **kwargs):

	'''
	Before saving a LivySession instance, check if brand new, or updating status
		- if not self.id, assume new and create new session with POST
		- if self.id, assume checking status, only issue GET and update fields
	'''

	# not instance.id, assume new
	if not instance.id:

		logger.debug('creating new Livy session')

		# create livy session, get response
		livy_response = LivyClient().create_session()

		# parse response and set instance values
		response = livy_response.json()
		headers = livy_response.headers

		instance.name = 'Livy Session, sessionId %s' % (response['id'])
		instance.session_id = int(response['id'])
		instance.session_url = headers['Location']
		instance.status = response['state']
		instance.session_timestamp = headers['Date']
		instance.active = True
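
One way errors could bubble up here would be to check the response before parsing it, rather than assuming a session was created; a sketch (the exact status codes and how the error should surface in the UI are open questions):

	# create livy session, get response
	livy_response = LivyClient().create_session()

	# surface failures instead of silently parsing a bad response
	if livy_response.status_code >= 400:
		logger.error('Livy session creation failed: %s' % livy_response.text)
		raise Exception('could not create Livy session, HTTP %s' % livy_response.status_code)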

hash oai identifier for ES identifier

Logs are showing some OAI identifiers failing as IDs; consider md5 hashing them.

Update: imagine this means that OAI identifiers acquired through harvesting, which are then used as the record_id in Combine, sometimes fail due to character encoding.

Have not seen this in quite a while now, but keeping the issue open, as it's still not a bad idea. It would be a shame to lose that semantic meaning from the record_id column, but it is worth considering.
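
For reference, the hashing under consideration is a one-liner; a sketch, with the trade-off that the semantic identifier becomes an opaque digest:

import hashlib

# derive a safe, fixed-length ES id from the harvested OAI identifier
es_id = hashlib.md5(record_id.encode('utf-8')).hexdigest()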

Livy sessions restart numbering after Livy restart

This issue is a bit preemptive, but it presents a potential problem. When creating new sessions, Livy returns a new sessionId used to build a URL such as /sessions/1, /sessions/2, etc. When Livy restarts, the numbering for new sessions begins at 1 again.

The problem would arise if Combine saved a user's Livy session foobar with the URL /sessions/1, but Livy was restarted and a new /sessions/1 was started. This would result in a scenario where foobar points to a different Livy session, even though the URL is the same at /sessions/1.

This might be putting too much emphasis on which sessions jobs are run in. Perhaps we skew towards the user simply selecting an available session, and that being sufficient. The only helpful thing about saving the session a job was run from would be looking at the configurations used for that session, which might inform the success/failure of a job.

However, if the configuration options are the only thing worth saving, we can still point from the Job table to the Session table, which will contain the code used to start that Livy session, whether it's still available or not.

Working through this, it's probably not going to be an issue; we'll just have to tailor the tables accordingly so it's clear that Combine Livy session foobar @ /1 is distinct from baz @ /1 when looking at them historically. We don't envision sharing much data in memory, so it'll be okay from session to session.

speed up job delete

When a job is deleted in Combine, the Record table contains records with FK pointers to the Job table. So, deleting job 42 might require deleting 100k rows from the Record table.

While Django can handle this, it does so by emulating SQL's ON DELETE CASCADE. In all likelihood this is normally just about as fast and efficient as InnoDB's native FK cascades, but it may rely on indexing patterns internal to Django. In our situation, even though the core_record table was created by Django, we are writing rows directly to this DB table from Spark with jdbc.write(). In this way, we might be bypassing any indexing internal to Django that would speed up the delete.

Holding off on this for now, but will need to return to it. Deleting a job with ~45k records takes 7-10 seconds, and this only increases as the Record table grows.
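
If Django's emulated cascade does prove to be the bottleneck, one option might be to drop to raw SQL for the child rows before deleting the Job row itself; a sketch, assuming the FK column on core_record is named job_id:

from django.db import connection

def delete_job_records(job_id):

	# bulk-delete child rows in a single statement, bypassing Django's
	# per-object cascade emulation
	with connection.cursor() as cursor:
		cursor.execute('DELETE FROM core_record WHERE job_id = %s', [job_id])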

Explore separation of successful records vs. errors

Extension of #57.

It has become clear, now that jobs of 500k or 1m records are possible, that the methods for counting documents vs. errors within a job are prohibitively slow. This is because successful documents have content in the document column, while documents that erred along the way have content in error.

So, counting successful records was WHERE document != '', and likewise for errors. This decision was made to mimic the way Avro files were written, but it is not efficient in a parsed SQL context.

Looking into what it might mean to split errors from successful records in the DB...

Performance of loading job details

Specifically, benchmark and look at the method get_detailed_job_record_count from CombineJob.

Because it counts records and errors for the job each time, it's conceivable that it will be prohibitively slow for large jobs.

'MySQL server has gone away' for Livy sessions

Now that Spark is interacting with SQL more, this error cropped up after sitting idle for 8+ hours:

'MySQL server has gone away'

This was caused by this line in HarvestJob in spark/jobs.py:

job = Job.objects.get(pk=int(kwargs['job_id']))

Which, as can be seen, is a common Django ORM request. Django itself, when not going through Livy, does not appear to have this problem, as it can sit idle indefinitely and then work. Will need to keep an eye on this.

Aggregate published RG sets for OAI server

After digging into the OAI endpoint for Combine, it's becoming clear that it will likely make sense to aggregate all published RecordGroups to a single location.

This will allow:

  • detection of record dupes (by ID)
  • retrieval of a single record via GetRecord
  • much easier logic for resumption tokens
  • customized indexing in ES for this aggregate record group

Filter out / remove OAI docs that are marked as "deleted"

While harvesting from CONTENTdm, noticed that some OAI servers return tombstones for deleted records. Will need to handle these:

<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
    <header status="deleted">
        <identifier>oai:cdm16259.contentdm.oclc.org:p124301coll2/279</identifier>
        <datestamp>2016-10-18</datestamp>
        <setSpec>p124301coll2</setSpec>
    </header>
</record>
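
A sketch of how these tombstones might be filtered during harvest, using lxml against the OAI-PMH namespace (where exactly this would hook into OAIHarvester is left open):

from lxml import etree

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

def is_deleted(record_elem):

	# OAI-PMH marks tombstones with status="deleted" on the record header
	header = record_elem.find(OAI_NS + 'header')
	return header is not None and header.get('status') == 'deleted'

# records: parsed <record> elements from an OAI response (assumed)
live_records = [r for r in records if not is_deleted(r)]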

From Job / Record GUI, identify select fields and apply job-wide

From within a Job, or perhaps a record field, or both, allow the user to associate a parsed ES field with a known type of field important for DPLA records.

e.g. if a user knows that thumbnail URLs are usually parsed to mods_location_url_@access_preview in ES, allow them to associate this field with a parameter somewhere in the Job that will facilitate showing that thumbnail on the Record page (or the Job DT view!).

Do the same for other important fields:

  • thumbnail (e.g. mods_location_url_@access_preview)
  • access URL (e.g. mods_location_url_@usage_primary )
  • title (e.g. mods_titleInfo_title)
  • description (e.g. mods_abstract )

This makes sense at the Job level, as a Transform job might change the mapping.

These could be stored as JSON in the Job table, perhaps in a field like dpla_mappings? Not ideal, as it would require JSON parsing in and out.
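
A hypothetical example of what such a dpla_mappings blob might hold, with field names taken from the list above (structure purely illustrative):

dpla_mappings = {
	'thumbnail': 'mods_location_url_@access_preview',
	'access_url': 'mods_location_url_@usage_primary',
	'title': 'mods_titleInfo_title',
	'description': 'mods_abstract'
}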

What is known are the target fields (see above), but these might shift or change. These fields are useful only insofar as they are used to add information to the preview pages, or are used in analysis.

This begins to dip into a DPLA mapping, as separate from ES mapping/indexing, which is primarily a rough comb for QAing records before DPLA mapping.

Holding off on a decision for now, but might be groundwork for exploring DPLA mapping options.

Create ES - DataTables connector

Not able to find any pre-existing ElasticSearch to DataTables connector. This would be handy for providing DT interfaces for things like the values of fields in an index (which might be 10, 50, or 100k values).

ES search broken for multivalued facets

The facet Architecture; Detroit; Buildings. (18), with the constructed URL (URL-quoted in the app, but it still breaks):

http://192.168.45.10:9200/j187/_search?q=mods_titleInfo_title:"Architecture; Detroit; Buildings. "

fails with the ES error:

{
    error: {
        root_cause: [
            {
                type: "illegal_argument_exception",
                reason: "request [/j187/_search] contains unrecognized parameters: [ Buildings. "], [ Detroit]"
            }
        ],
        type: "illegal_argument_exception",
        reason: "request [/j187/_search] contains unrecognized parameters: [ Buildings. "], [ Detroit]"
    },
    status: 400
}

Looks like the ; is breaking the ES query...
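
One way around query-string parsing issues like this might be to send the search as a request body instead of packing the value into q=; a sketch using the requests library, with the index and field taken from the failing example:

import requests

query = {
	'query': {
		'match_phrase': {
			'mods_titleInfo_title': 'Architecture; Detroit; Buildings. '
		}
	}
}

# body queries avoid the q= parameter, so ; and quotes are not re-parsed
resp = requests.post('http://192.168.45.10:9200/j187/_search', json=query)

A term query against a raw/keyword subfield might be an even closer match for an exact facet value, but the main point is getting the value out of the URL.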

Address file:// vs hdfs:// shims throughout

As Spark can read natively from either the filesystem or HDFS, it requires a bit of juggling with the appropriate file:// and hdfs:// prefixes.

Additionally, a bare path like /foo is, by default, interpreted as an HDFS path.

There are shims throughout to convert file:///foo to /foo, but it would be worth standardizing how this is handled.
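
A small helper could centralize the juggling; a sketch, assuming a single flag decides whether Combine is reading from the local filesystem or HDFS:

def localize_path(path, use_hdfs=False):

	'''
	Normalize a path for Spark: strip any existing scheme prefix, then
	re-apply the appropriate one, so callers can pass plain paths,
	file:// URIs, or hdfs:// URIs interchangeably.
	'''

	for prefix in ('file://', 'hdfs://'):
		if path.startswith(prefix):
			path = path[len(prefix):]

	return ('hdfs://' if use_hdfs else 'file://') + path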

Generic index parser, hierarchical MODS, and complex subject terms

The generic parser is proving effective, but we noticed an unfortunate effect while parsing MODS. Here are three subjects from MODS:

<mods:subject>
    <mods:topic>Labor unions</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Saginaw</mods:geographic>
    <mods:temporal>1930-1940</mods:temporal>
</mods:subject>
<mods:subject>
    <mods:topic>Strikes</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Saginaw</mods:geographic>
    <mods:temporal>1930-1940</mods:temporal>
</mods:subject>
<mods:subject>
    <mods:name>
        <mods:namePart>Consumers Power Company (Michigan)</mods:namePart>
    </mods:name>
    <mods:topic>Strikes</mods:topic>
    <mods:temporal>1930-1940</mods:temporal>
</mods:subject>

This gets parsed as:

mods_subject_geographic | ['Michigan', 'Saginaw', 'Michigan', 'Saginaw']
mods_subject_name_namePart | Consumers Power Company (Michigan)
mods_subject_temporal | ['1930-1940', '1930-1940', '1930-1940']
mods_subject_topic | ['Labor unions', 'Strikes', 'Strikes']

However, this is how these MODS fields are concatenated on our digital collections front-end, which is likely much more accurate to their original intent as complex/composite subjects:

Labor unions
Labor unions--Michigan--Saginaw--1930-1940
Strikes
Strikes--Michigan--Saginaw--1930-1940
StrikesConsumers Power Company (Michigan)

Not sure how to address this. This is where a dedicated MODS parser (XSLT) would probably concatenate these in the proper way.

remove symlinks from Publish jobs

Now that records are stored in the DB, we don't need to symlink the Avro files from published jobs. We can remove the creation and teardown of these symlinks to simplify the publish process.

deleting jobs that failed

On failed jobs (those that result in 0 objects harvested), there seems to be an issue with deleting them. This error appears: 'tuple' object has no attribute 'split'

DB configuration

Have settings.py default to sqlite3, but create a template for a MySQL DB config in localsettings.py.
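
A sketch of that arrangement, with a sqlite3 default in settings.py and a commented MySQL template to copy into localsettings.py (database names and credentials are placeholders; the usual BASE_DIR boilerplate is assumed):

import os

# settings.py -- default, works out of the box
DATABASES = {
	'default': {
		'ENGINE': 'django.db.backends.sqlite3',
		'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
	}
}

# localsettings.py template -- uncomment and fill in for MySQL
# DATABASES = {
#     'default': {
#         'ENGINE': 'django.db.backends.mysql',
#         'NAME': 'combine',
#         'USER': 'combine',
#         'PASSWORD': 'password',
#         'HOST': '127.0.0.1',
#         'PORT': '3306',
#     }
# }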

Dynamic OAI identifier x25 slower

Generating a dynamic OAI identifier for a record takes, on average, about 1ms, as opposed to about 0.06ms for a static id from the DB. However, this adds up.

  • record_id straight from DB, @ 500 records: ~32ms
  • dynamic OAI identifier @ 500 records: ~800ms

That works out to 25x slower with dynamically generated (not stored in the DB) OAI identifiers. This would not have an enormous effect on OAI performance, as the majority of time is spent sending XML over the wire, but it is still worth noting.
