codeforkjeff / conciliator
OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.
License: GNU General Public License v3.0
Hi,
when reconciling against VIAF I frequently get candidates pointing to related persons of the actual person to be reconciled.
For example, VIAF stores co-authors in the MARC field 950 (see https://viaf.org/viaf/viaftags.xml#mrca950), and this field seems to be one of the sources for candidate retrieval. Naturally, these candidates only get very low scores, but it can be distracting and costly during validation (because in OpenRefine you can't immediately judge whether a candidate name is the same person's pseudonym, or birth name, or an entirely different person that is simply someone's co-author).
Is it possible to exclude MARC field 950 from the matching code?
Thanks and many regards,
Christiane
The reconciliation endpoints are only available via JSONP at the moment. It would be great to enable CORS for them too.
We are currently planning to phase out JSONP in the reconciliation API: see reconciliation-api/specs#19.
Support for CORS will be added in OpenRefine: OpenRefine/OpenRefine#2260
A very common use case is to populate a Solr index from a CSV file, which is fairly straightforward:
solr create -c reconcile
post -c reconcile data.csv
The default "schemaless" configuration defines all fields as multiValued.
For example, given a field (CSV column) label_en that has no explicit "multiValued":false:
http://localhost:8983/solr/reconcile/schema/fields/label_en
{
"responseHeader":{
"status":0,
"QTime":0},
"field":{
"name":"label_en",
"type":"strings"
}}
the query will result in:
<doc>
<arr name="label_en">
<str>forgery, falsification and theft of artworks</str>
</arr>
....
</doc>
Would it be easy to implement parsing of this result, rather than modifying the Solr schema?
thanks
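To make the request concrete, here is a minimal sketch in Python (not conciliator's own Java) of reading a field from a Solr XML document whether it arrives as a single `<str>` or as an `<arr>` of `<str>` elements. The helper name `field_values` and the sample document are assumptions made for illustration; they are not part of conciliator.

```python
import xml.etree.ElementTree as ET

def field_values(doc, field_name):
    """Return all string values of a field in a Solr XML <doc> element."""
    values = []
    for child in doc:
        if child.get("name") != field_name:
            continue
        if child.tag == "arr":
            # multiValued field: <arr name="..."><str>...</str></arr>
            values.extend(s.text or "" for s in child.findall("str"))
        else:
            # single-valued field: <str name="...">...</str>
            values.append(child.text or "")
    return values

# Hypothetical sample mirroring the multiValued response shown above
sample = """<doc>
  <str name="id">1</str>
  <arr name="label_en">
    <str>forgery, falsification and theft of artworks</str>
  </arr>
</doc>"""

doc = ET.fromstring(sample)
labels = field_values(doc, "label_en")
first_label = labels[0] if labels else None
```

Returning a list in both cases leaves the choice between taking the first value and joining all values to the caller.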
As of around August 2022, the reconciliation service will run but will not return any results, even for exact matches, at least for the http://refine.codefork.com/reconcile/viafproxy/LC option. I have tried the service using different versions of OpenRefine (3.5.2, which historically worked fine with this service, and 3.6), but it does not seem to work. I wonder if the VIAF APIs have been updated, necessitating changes to the reconciliation service?
ORCID has turned off its v1.2 API, which is what conciliator uses.
I need to change the code to use the v2.0 API and see if we run into rate limiting problems because of needing to fetch names for each result in a separate request. There's a chance it will work, though it'll be slow.
ORCID seems to have deliberately designed this limitation into the new API, which is a shame:
https://groups.google.com/d/topic/orcid-api-users/xVk-JDua2c0/discussion
hello, I'd like to report an error with my Java:
openjdk version "10.0.2" 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2, mixed mode)
here's the complete log, I hope it's helpful.
java1.txt
I found it a bit tricky to set up a Solr datasource. I eventually succeeded, but only after grepping the source code for a few of the properties listed in conciliator.properties and doing lots of trial and error.
The comments in conciliator.properties could be expanded with a bit more information, for example: what each parameter does, what values they expect, whether any are optional, how to add multiple sources (i.e., for multiple fields), etc.
I've annotated conciliator.properties based on my limited knowledge of Solr and my testing of conciliator:
## Name will appear in OpenRefine's reconciliation interface
# datasource.solr.name=A Solr Collection of Books
## Seems to be some internal type?
# datasource.solr.nametype.id=/book/book
## Seems to correspond to internal type somehow?
# datasource.solr.nametype.name=Book
## Solr query URL with placeholders for query term and rows (will be replaced by conciliator for each query)
# datasource.solr.url.query=http://localhost:8983/solr/test-core/select?wt=xml&q={{QUERY}}&rows={{ROWS}}
## Not sure why we need to get each matching document
# datasource.solr.url.document=http://localhost:8983/solr/test-core/get?id={{id}}
## ???
# datasource.solr.field.id=id
## Solr field name (sounds like Solr's filter list parameter, not sure why we need this if we could just search for the field directly in the query URL?)
# datasource.solr.field.name=title_display
## can be 'concat' or 'first'. defaults to 'concat'
# datasource.solr.field.name.multivalue.strategy=first
# datasource.solr.field.name.multivalue.delimiter=,
I'd appreciate your feedback. Thanks!
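As an illustration of the placeholder substitution described in the annotation for datasource.solr.url.query, here is a rough Python sketch. It only reflects the behaviour as described in the comments above; conciliator's actual Java code may differ, and the function name `build_query_url` and the sample query are assumptions.

```python
from urllib.parse import quote

def build_query_url(template, query, rows):
    """Fill the {{QUERY}} and {{ROWS}} placeholders of a Solr URL template."""
    if template is None:
        # Fail with a readable message instead of a bare null error
        raise ValueError("datasource.solr.url.query is not set")
    return (template
            .replace("{{QUERY}}", quote(query, safe=""))
            .replace("{{ROWS}}", str(rows)))

template = ("http://localhost:8983/solr/test-core/select"
            "?wt=xml&q={{QUERY}}&rows={{ROWS}}")
url = build_query_url(template, "War and Peace", 5)
```

The query term is percent-encoded before substitution so that spaces and reserved characters do not break the URL.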
Trying to reconcile using Solr as the backend leads to a null pointer exception for every query. Example:
2022-11-01 11:48:29.865 ERROR 13338 --- [pool-5-thread-3] com.codefork.refine.solr.Solr : error for query=Monte Cristi
java.lang.NullPointerException: null
at com.codefork.refine.solr.Solr.createURL(Solr.java:57) ~[classes!/:3.1.0]
at com.codefork.refine.solr.Solr.search(Solr.java:63) ~[classes!/:3.1.0]
at com.codefork.refine.datasource.WebServiceDataSource.searchCheckCache(WebServiceDataSource.java:272) ~[classes!/:3.1.0]
at com.codefork.refine.datasource.WebServiceSearchTask.call(WebServiceSearchTask.java:45) ~[classes!/:3.1.0]
at com.codefork.refine.datasource.WebServiceSearchTask.call(WebServiceSearchTask.java:15) ~[classes!/:3.1.0]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
The configuration is simple:
cache.enabled=true
cache.ttl=3600
cache.size=64MB
datasource.solr.name=A Solr Second Level administrative division of Dominican Republic
datasource.solr.nametype.id=/location/location
datasource.solr.nametype.name=Geographic Name
datasource.solr.url.query=http://localhost:8983/solr/iso_adm2_dom/select?wt=xml&df=nombre_provincia&fl=id%20score%20nombre_provincia&q={{QUERY}}&rows={{ROWS}}&sort=score%20desc
datasource.solr.url.document=http://localhost:8983/solr/iso_adm2_dom/get?id={{id}}
datasource.solr.field.id=id
datasource.solr.field.name=nombre_provincia
The solr backend has zero problem responding to the queries:
➜ ~ curl http://localhost:8983/solr/iso_adm2_dom/select\?wt\=xml\&df\=nombre_provincia\&fl\=id%20score%20nombre_provincia\&q\=Monte%20Cristi\&rows=5\&sort=score%20desc
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">46</int>
<lst name="params">
<str name="q">Monte Cristi</str>
<str name="df">nombre_provincia</str>
<str name="fl">id score nombre_provincia</str>
<str name="sort">score desc</str>
<str name="rows">5</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="3.966993" numFoundExact="true">
<doc>
<str name="id">15</str>
<str name="nombre_provincia">Monte Cristi</str>
<float name="score">3.966993</float></doc>
<doc>
<str name="id">29</str>
<str name="nombre_provincia">Monte Plata</str>
<float name="score">1.8048377</float></doc>
</result>
</response>
I can define a Solr source in conciliator.properties, but let's say I want to reconcile against a list of countries and a list of languages without having to change the definition and restart conciliator. Is it possible to define multiple sources, for example one collection with countries and one with languages? Am I missing something here or is this beyond the use case of conciliator's Solr support?
It seems like this could be what the datasource.solr.nametype.id property is for, but I am not really sure how to use that (and again, it seems that you can only set one type).
When running mvn package on my computer:
Failed tests:
VIAFParserTest.testParseTime:160 should take less than 50ms, on average, to parse a big XML doc, but took 52ms
I see you have added a special case for travis, I think it would make sense to use 100ms regardless of the environment.
Also: thank you so much for your implementation of the data extension API! It looks very exciting.
Sometimes, the reconciliation results returned by the services are not sorted by decreasing score: at the moment, http://refine.codefork.com/reconcile/viaf?query=Kamila gives the following results:
{
"result": [
{
"id": "18951129",
"name": "Varano, Camilla Battista �da� 1458-1524",
"type": [
{
"id": "/people/person",
"name": "Person"
}
],
"score": 0.1282051282051282,
"match": false
},
{
"id": "102271932",
"name": "Shamsie, Kamila, 1973-....",
"type": [
{
"id": "/people/person",
"name": "Person"
}
],
"score": 0.23076923076923078,
"match": false
},
{
"id": "63233597",
"name": "Camilla, Duchess of Cornwall, 1947-",
"type": [
{
"id": "/people/person",
"name": "Person"
}
],
"score": 0.14285714285714285,
"match": false
}
]
}
See the corresponding SO question:
https://stackoverflow.com/questions/53852042/openrefine-reconcile-by-second-or-third-candidate
Corresponding OpenRefine issue:
OpenRefine/OpenRefine#1913
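For illustration, a minimal Python sketch of ordering candidates by decreasing score before returning them, using the ids and scores from the response above. This is a sketch of the expected ordering only, not conciliator's actual code; `sort_by_score` is a hypothetical helper name.

```python
def sort_by_score(candidates):
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

# ids and scores copied from the response above; other keys omitted
results = [
    {"id": "18951129", "score": 0.1282051282051282},
    {"id": "102271932", "score": 0.23076923076923078},
    {"id": "63233597", "score": 0.14285714285714285},
]
ordered_ids = [c["id"] for c in sort_by_score(results)]
```

Python's `sorted` is stable, so candidates with equal scores keep the order in which the upstream service returned them.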
New orcid documentation is here: https://members.orcid.org/api/tutorial/search-orcid-registry
Hi
Sometimes the results of the ORCID reconciliation service are perfect; sometimes they seem broken.
See two examples, for "Igor Ozerov" and "Li Xi":
-> for Igor Ozerov, it should be the first result, because this name is unique in the ORCID registry
-> for Li Xi, we should get the list of all people named Li Xi, and not "Li-Li Xi" or "Li Bo Xi"
Do you think it could be improved?
Is there an easy way to run the service on some other port? If so it would make sense to add it to the README.
I'm using conciliator to allow reconciliation against a custom private database. Some of the candidates are perfect matches, but I can't tell OpenRefine to accept them automatically because the score returned is 0. Is there something missing from my Solr configuration that would make it return the matching score? Or should the score be calculated by conciliator?
Hi,
I'm using your VIAF recon service to reconcile scholars' names from the field of Lexicography and Dictionary Research, to construct a domain bibliography and person registry in the Linked Open Data environment.
After reconciling and manually validating 200 person names with VIAF (and getting very good results in general!), I came across a peculiar feature in VIAF that seems to prevent automatic matching in many cases and increases tedious manual validation. Apparently, one of the VIAF contributors, NUKAT, puts a full stop after a person name heading, which introduces an edit distance that would otherwise not exist and causes the score to drop below 1. Even with the OpenRefine option to auto-match high-confidence candidates selected during reconciliation, the score is often below the threshold.
Typical example from my data:
Name literal: Quasthoff, Uwe
VIAF candidate: Quasthoff, Uwe. (score: 0.933)
VIAF URI: https://viaf.org/viaf/22741331/
As far as I can see, NUKAT is the only VIAF contributor that puts a full stop after a person's name, and yet this particular heading is always ranked highest in the VIAF cluster. As we have no way to anticipate whether a matching VIAF cluster includes NUKAT headings or not, is there a way to modify the matching algorithm to chop off the full stop (if it exists) from the candidates returned from VIAF?
This would really help to improve your VIAF recon service even further. Thanks for all the work you've already done!
Regards,
Christiane
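The normalization asked for above could look roughly like the following Python sketch. It strips a single trailing full stop from both the query and the candidate before comparing them. `SequenceMatcher` from the standard library is used here only as a stand-in similarity measure; it is not the scoring function conciliator uses, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher

def normalize_name(name):
    """Strip surrounding whitespace and one trailing full stop."""
    name = name.strip()
    # Leave open-ended dates such as "1973-...." untouched
    if name.endswith(".") and not name.endswith("..."):
        name = name[:-1].rstrip()
    return name

def similarity(query, candidate):
    """Stand-in score in [0, 1] computed on the normalized names."""
    a, b = normalize_name(query), normalize_name(candidate)
    return SequenceMatcher(None, a, b).ratio()

score = similarity("Quasthoff, Uwe", "Quasthoff, Uwe.")
```

Normalizing both sides means a heading that legitimately ends in an abbreviated initial still matches a query written the same way.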
Hi!
Thanks for your ORCID reconciliation service. I wonder if it supports any property? It would be great if we could use Researcher ID, Scopus ID, ISNI, institutions, or other things like that. We're using it at WikiCite 2017 to add ORCID ids to Wikidata items for researchers.
I'm using this program to perform reconciliation on a database of names, specifically attempting to retrieve LCNAF IDs (by way of VIAF). Some results, however, seem to be encountering an encoding issue of some sort. I am unsure if this has to do with OpenRefine, the conciliator program, or both.
For example, reconciling "Menéndez Pidal, Ramón 1869-1968" results in "Menéndez Pidal, Ramón, 1869-1968.", and reconciling "Ōe, Kenzaburō 1935-" gave me ÅŒe, KenzaburÅ�, 1935-.
Any suggestions?
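The garbled second example is consistent with UTF-8 bytes being decoded as Windows-1252 somewhere along the way. The Python sketch below reproduces that effect. It does not show where in OpenRefine, conciliator, or the upstream service the wrong decoding happens, and both helper names are hypothetical.

```python
def simulate_mojibake(text):
    """Encode as UTF-8, then decode those bytes as Windows-1252."""
    # Bytes with no Windows-1252 mapping (e.g. 0x8D) become U+FFFD
    return text.encode("utf-8").decode("cp1252", errors="replace")

def repair_mojibake(text):
    """Try to undo the wrong decoding; return the input if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Information was already lost (e.g. a U+FFFD is present)
        return text

garbled = simulate_mojibake("Ōe, Kenzaburō, 1935-")
round_trip = repair_mojibake(simulate_mojibake("Menéndez Pidal, Ramón"))
```

When every byte maps to a Windows-1252 character the damage is reversible, as in the Menéndez Pidal round trip. The Ōe example cannot be repaired after the fact, because one byte has already been replaced by U+FFFD.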