codeforkjeff / conciliator


OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.

License: GNU General Public License v3.0

Shell 0.45% Java 99.00% Dockerfile 0.55%

reconciliation-service viaf openrefine orcid solr entity-resolution openlibrary

conciliator's Issues

VIAF name matching should exclude related persons

Hi,

when reconciling against VIAF I frequently get candidates pointing to related persons of the actual person to be reconciled.
For example, VIAF stores co-authors in the MARC field 950 (see https://viaf.org/viaf/viaftags.xml#mrca950), and this field seems to be one of the sources for candidate retrieval. Naturally, these candidates only get very low scores, but it can be distracting and costly during validation (because in OpenRefine you can't immediately judge whether a candidate name is the same person's pseudonym, or birth name, or an entirely different person that is simply someone's co-author).

Is it possible to exclude MARC field 950 from the matching code?

Thanks and many regards,

Christiane

CORS support

The reconciliation endpoints are only available via JSONP at the moment. It would be great to enable it for CORS too.

We are currently planning to phase out JSONP in the reconciliation API: see reconciliation-api/specs#19.

Support for CORS will be added in OpenRefine: OpenRefine/OpenRefine#2260

solr data source, parse multiValued fields

a very common use case is to populate a solr index with a csv, fairly straightforward:

solr create -c reconcile
post -c reconcile data.csv

the default "schemaless" configuration defines all fields as multiValued.
for example, take a field (csv column) label_en that has no explicit "multiValued":false

http://localhost:8983/solr/reconcile/schema/fields/label_en

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"label_en",
    "type":"strings"
    }}

the query will result in:

<doc>
    <arr name="label_en">
      <str>forgery, falsification and theft of artworks</str>
    </arr>
   ....
</doc>

would it be easy to implement parsing of this kind of result, rather than having to modify the solr schema?
thanks
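A minimal sketch of what such parsing could look like, in Python for brevity rather than conciliator's Java. The function `field_value` and the document id `42` are hypothetical; the strategy names simply mirror the `multivalue.strategy` values ('concat' and 'first') quoted in another issue on this page, and nothing here is taken from conciliator's actual code.

```python
import xml.etree.ElementTree as ET


def field_value(doc, name, strategy="concat", delimiter=","):
    """Return a field from a Solr <doc>, whether it is single- or multi-valued.

    Hypothetical helper: a single-valued field is a direct child such as
    <str name="...">, while a multiValued field is wrapped in <arr name="...">.
    """
    for child in doc:
        if child.get("name") != name:
            continue
        if child.tag == "arr":
            values = [v.text or "" for v in child]
            if not values:
                return None
            # 'first' keeps only the first value; anything else joins them all.
            return values[0] if strategy == "first" else delimiter.join(values)
        return child.text
    return None


# Sample response shaped like the one in the issue (the id is made up).
SAMPLE = """<response><result name="response" numFound="1" start="0">
  <doc>
    <str name="id">42</str>
    <arr name="label_en">
      <str>forgery, falsification and theft of artworks</str>
    </arr>
  </doc>
</result></response>"""

doc = ET.fromstring(SAMPLE).find("./result/doc")
print(field_value(doc, "label_en", strategy="first"))
```

Because the same function handles both `<str>` and `<arr>` children, the Solr schema would not need to be changed just to reconcile against a "schemaless" core.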

Service does not work as of around August 2022

As of around August 2022, the reconciliation service will run but will not return any results, even for exact matches, at least for the http://refine.codefork.com/reconcile/viafproxy/LC option. I have tried the service using different versions of OpenRefine (3.5.2, which historically worked fine with this service, and 3.6), but it does not seem to work. I wonder if the VIAF APIs have been updated, necessitating changes to the reconciliation service?

ORCID data source is broken

ORCID has turned off its v1.2 API, which is what conciliator uses.

I need to change the code to use the v2.0 API and see if we run into rate limiting problems because of needing to fetch names for each result in a separate request. There's a chance it will work, though it'll be slow.

ORCID seems to have deliberately designed this limitation into the new API, which is a shame:

https://groups.google.com/d/topic/orcid-api-users/xVk-JDua2c0/discussion

Issue with Java 10

hello, I'd like to report an error with my Java:

openjdk version "10.0.2" 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2, mixed mode)

here's the complete log, I hope it's helpful.
java1.txt

Add more information on configuring Solr datasources

I found it a bit tricky to set up a Solr datasource. I eventually succeeded, but only after grepping the source code for a few of the properties listed in conciliator.properties and doing lots of trial and error.

The comments in conciliator.properties could be expanded with a bit more information, for example: what each parameter does, what values they expect, if any are optional, how to add multiple sources (ie, for multiple fields), etc.

I've annotated conciliator.properties based on my limited knowledge of Solr and my testing of conciliator:

##  Name will appear in OpenRefine's reconciliation interface
# datasource.solr.name=A Solr Collection of Books

## Seems to be some internal type?
# datasource.solr.nametype.id=/book/book

## Seems to correspond to internal type somehow?
# datasource.solr.nametype.name=Book

## Solr query URL with placeholders for query term and rows (will be replaced by conciliator for each query)
# datasource.solr.url.query=http://localhost:8983/solr/test-core/select?wt=xml&q={{QUERY}}&rows={{ROWS}}

## Not sure why we need to get each matching document
# datasource.solr.url.document=http://localhost:8983/solr/test-core/get?id={{id}}

## ???
# datasource.solr.field.id=id

## Solr field name (sounds like Solr's filter list parameter, not sure why we need this if we could just search for the field directly in the query URL?)
# datasource.solr.field.name=title_display

## can be 'concat' or 'first'. defaults to 'concat'
# datasource.solr.field.name.multivalue.strategy=first
# datasource.solr.field.name.multivalue.delimiter=,

I'd appreciate your feedback. Thanks!
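To illustrate the placeholder comment on datasource.solr.url.query above, here is a rough sketch of how a `{{QUERY}}`/`{{ROWS}}` template could be expanded for one query. It is written in Python for brevity, `build_query_url` is a hypothetical name, and the URL-encoding choice is an assumption; conciliator's own substitution logic may differ.

```python
from urllib.parse import quote


def build_query_url(template, query, rows):
    """Fill the {{QUERY}} and {{ROWS}} placeholders of a Solr query URL template.

    Hypothetical helper: the query term is percent-encoded so that spaces and
    reserved characters cannot break the URL.
    """
    return (
        template
        .replace("{{QUERY}}", quote(query, safe=""))
        .replace("{{ROWS}}", str(rows))
    )


# Template copied from the annotated properties above.
TEMPLATE = (
    "http://localhost:8983/solr/test-core/select"
    "?wt=xml&q={{QUERY}}&rows={{ROWS}}"
)

url = build_query_url(TEMPLATE, "Monte Cristi", 5)
print(url)
```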

NPE in com.codefork.refine.solr.Solr

Trying to reconcile using Solr as a backend leads to a null pointer exception for every query. Example:

2022-11-01 11:48:29.865 ERROR 13338 --- [pool-5-thread-3] com.codefork.refine.solr.Solr            : error for query=Monte Cristi

java.lang.NullPointerException: null
	at com.codefork.refine.solr.Solr.createURL(Solr.java:57) ~[classes!/:3.1.0]
	at com.codefork.refine.solr.Solr.search(Solr.java:63) ~[classes!/:3.1.0]
	at com.codefork.refine.datasource.WebServiceDataSource.searchCheckCache(WebServiceDataSource.java:272) ~[classes!/:3.1.0]
	at com.codefork.refine.datasource.WebServiceSearchTask.call(WebServiceSearchTask.java:45) ~[classes!/:3.1.0]
	at com.codefork.refine.datasource.WebServiceSearchTask.call(WebServiceSearchTask.java:15) ~[classes!/:3.1.0]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]

The configuration is simple:

cache.enabled=true
cache.ttl=3600
cache.size=64MB

datasource.solr.name=A Solr Second Level administrative division of Dominican Republic
datasource.solr.nametype.id=/location/location
datasource.solr.nametype.name=Geographic Name
datasource.solr.url.query=http://localhost:8983/solr/iso_adm2_dom/select?wt=xml&df=nombre_provincia&fl=id%20score%20nombre_provincia&q={{QUERY}}&rows={{ROWS}}&sort=score%20desc
datasource.solr.url.document=http://localhost:8983/solr/iso_adm2_dom/get?id={{id}}
datasource.solr.field.id=id
datasource.solr.field.name=nombre_provincia

The solr backend has zero problem responding to the queries:

➜  ~ curl http://localhost:8983/solr/iso_adm2_dom/select\?wt\=xml\&df\=nombre_provincia\&fl\=id%20score%20nombre_provincia\&q\=Monte%20Cristi\&rows=5\&sort=score%20desc
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">46</int>
  <lst name="params">
    <str name="q">Monte Cristi</str>
    <str name="df">nombre_provincia</str>
    <str name="fl">id score nombre_provincia</str>
    <str name="sort">score desc</str>
    <str name="rows">5</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="3.966993" numFoundExact="true">
  <doc>
    <str name="id">15</str>
    <str name="nombre_provincia">Monte Cristi</str>
    <float name="score">3.966993</float></doc>
  <doc>
    <str name="id">29</str>
    <str name="nombre_provincia">Monte Plata</str>
    <float name="score">1.8048377</float></doc>
</result>
</response>

Reconcile against multiple Solr sources?

I can define a Solr source in conciliator.properties, but let's say I want to reconcile against a list of countries and a list of languages without having to change the definition and restart conciliator. Is it possible to define multiple sources, for example one collection with countries and one with languages? Am I missing something here or is this beyond the use case of conciliator's Solr support?

It seems like this could be what the datasource.solr.nametype.id property is for, but I am not really sure how to use that (and again, it seems that you can only set one type).

Parsing time of VIAF results: test threshold too low

When running mvn package on my computer:

Failed tests: 
  VIAFParserTest.testParseTime:160 should take less than 50ms, on average, to parse a big XML doc, but took 52ms

I see you have added a special case for travis, I think it would make sense to use 100ms regardless of the environment.

Also: thank you so much for your implementation of the data extension API! It looks very exciting.

Order reconciliation results by decreasing score

Sometimes, the reconciliation results returned by the services are not sorted by decreasing score: at the moment, http://refine.codefork.com/reconcile/viaf?query=Kamila gives the following results:

{

    "result": [
        {
            "id": "18951129",
            "name": "Varano, Camilla Battista da 1458-1524",
            "type": [
                {
                    "id": "/people/person",
                    "name": "Person"
                }
            ],
            "score": 0.1282051282051282,
            "match": false
        },
        {
            "id": "102271932",
            "name": "Shamsie, Kamila, 1973-....",
            "type": [
                {
                    "id": "/people/person",
                    "name": "Person"
                }
            ],
            "score": 0.23076923076923078,
            "match": false
        },
        {
            "id": "63233597",
            "name": "Camilla, Duchess of Cornwall, 1947-",
            "type": [
                {
                    "id": "/people/person",
                    "name": "Person"
                }
            ],
            "score": 0.14285714285714285,
            "match": false
        }
    ]

}

See the corresponding SO question:
https://stackoverflow.com/questions/53852042/openrefine-reconcile-by-second-or-third-candidate
Corresponding OpenRefine issue:
OpenRefine/OpenRefine#1913
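For illustration, the requested ordering amounts to a single descending sort on the `score` key before the response is serialized. The sketch below is in Python rather than conciliator's Java, uses the ids and scores from the response above, and `sort_by_score` is a hypothetical name.

```python
def sort_by_score(candidates):
    """Return candidates ordered by decreasing score (stable for ties)."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


# Ids and scores taken from the response quoted in this issue.
candidates = [
    {"id": "18951129", "score": 0.1282051282051282},
    {"id": "102271932", "score": 0.23076923076923078},
    {"id": "63233597", "score": 0.14285714285714285},
]

ordered = sort_by_score(candidates)
print([c["id"] for c in ordered])
```

With this ordering the "Shamsie, Kamila" candidate (102271932) would come first.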

Orcid reference links in documentation are broken

New orcid documentation is here: https://members.orcid.org/api/tutorial/search-orcid-registry

accuracy issue for orcid reconciliation

Hi
Sometimes the results of the ORCID reconciliation service are perfect; sometimes they seem broken.
See two examples, "Igor Ozerov" and "Li Xi":
-> for Igor Ozerov, he should be the first result, because this name is unique in the ORCID database
-> for Li Xi, we should get the list of all people named Li Xi, and not "Li-Li Xi" or "Li Bo Xi"

Do you think it could be improved?

Configurable port (if 8080 is already used)

Is there an easy way to run the service on some other port? If so it would make sense to add it to the README.

Scores aren't returned for Solr sources: is this correct?

I'm using conciliator to allow reconciliation against a custom private database. Some of the results are perfect matches, but I can't tell OpenRefine to accept them automatically because the score returned is 0. Is there something I'm missing in my Solr configuration that would make it return the matching score? Or should the score be calculated by conciliator?

Full stop behind some VIAF person headings prevent automatic matching

Hi,

I'm using your VIAF recon service to reconcile scholars' names from the field of Lexicography and Dictionary Research, to construct a domain bibliography and person registry in the Linked Open Data environment.

After reconciling and manually validating 200 person names with VIAF (and getting very good results in general!), I came across a peculiar feature in VIAF that seems to prevent automatic matching in many cases, and increases tedious manual validation. Apparently, one of the VIAF contributors, NUKAT, sets a full stop behind a person name heading, resulting in an otherwise non-existent edit distance and causing the score to drop below 1. Even with the selected option in OpenRefine to auto-match candidates with a high confidence during reconciling, the score is often below the threshold.

Typical example from my data:

Name literal: Quasthoff, Uwe
VIAF candidate: Quasthoff, Uwe. (score: 0.933)
VIAF URI: https://viaf.org/viaf/22741331/

As far as I can see, NUKAT is the only VIAF contributor that puts a full stop behind a person's name, and yet this particular heading is always ranked highest in the VIAF cluster. As we have no way to anticipate whether a matching VIAF cluster includes NUKAT headings or not, is there a way to modify the matching algorithm and chop off the full stop (if it exists) for the candidates returned from VIAF?

This would really help to improve your VIAF recon service even further. Thanks for all the work you've already done!

Regards,
Christiane
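A sketch of the requested normalization, in Python for brevity. The scoring formula here (1 minus edit distance divided by the longer length) is only an assumption: it happens to reproduce the 0.933 reported above, but it is not confirmed to be what conciliator uses. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed score: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def strip_trailing_full_stop(name):
    """Drop a single trailing full stop, leaving endings like '1973-....' alone."""
    if name.endswith(".") and not name.endswith(".."):
        return name[:-1]
    return name


query = "Quasthoff, Uwe"
candidate = "Quasthoff, Uwe."
print(round(similarity(query, candidate), 3))
print(similarity(query, strip_trailing_full_stop(candidate)))
```

One caveat with this approach: headings that legitimately end in an abbreviated initial would also lose their final full stop, so the stripping might be better applied only when comparing, not when displaying the candidate name.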

I am using VIAF as a data source. Is it possible to use the additional properties section of OpenRefine?

When using Wikidata, you have the option to specify additional information to get more accurate results.

Is it possible to provide additional information for VIAF, or can you only reconcile using a single column?

Available properties for ORCID?

Hi!
Thanks for your ORCID reconciliation service. I wonder if it supports any property? It would be great if we could use Researcher ID, Scopus ID, ISNI, institutions, or other things like that. We're using it at WikiCite 2017 to add ORCID ids to Wikidata items for researchers.

Encoding issue on some reconciliation results

I'm using this program to perform reconciliation on a database of names, specifically attempting to retrieve LCNAF IDs (by way of VIAF). Some results, however, seem to be encountering an encoding issue of some sort. I am unsure if this has to do with OpenRefine, the conciliator program, or both.

For example, reconciling "Menéndez Pidal, Ramón 1869-1968" results in "MenÃ©ndez Pidal, RamÃ³n, 1869-1968.", and reconciling "Ōe, Kenzaburō 1935-" gave me ÅŒe, KenzaburÅ�, 1935-.

Any suggestions?
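The garbled forms above are consistent with UTF-8 bytes being decoded as Windows-1252 somewhere along the way. The Python sketch below only demonstrates that hypothesis; it does not show whether the mix-up happens in OpenRefine, in conciliator, or elsewhere, and `garble` and `repair` are hypothetical names.

```python
def garble(text):
    """Reproduce the suspected bug: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text):
    """Undo the misreading, when every character survived the round trip."""
    return text.encode("cp1252").decode("utf-8")


original = "Menéndez Pidal, Ramón"
broken = garble(original)
print(broken)          # matches the garbled form reported in this issue
print(repair(broken))  # recovers the original spelling
```

The second example in this issue cannot be round-tripped in the same way, because one of its bytes has no Windows-1252 mapping, which would explain the replacement character that appears in it.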
