conciliator's People

Contributors

codeforkjeff, dependabot[bot], ruebot, wetneb

conciliator's Issues

VIAF name matching should exclude related persons

Hi,

When reconciling against VIAF, I frequently get candidates that point to persons related to the person being reconciled, rather than to that person.
For example, VIAF stores co-authors in the MARC field 950 (see https://viaf.org/viaf/viaftags.xml#mrca950), and this field seems to be one of the sources for candidate retrieval. Naturally, these candidates only get very low scores, but it can be distracting and costly during validation (because in OpenRefine you can't immediately judge whether a candidate name is the same person's pseudonym, or birth name, or an entirely different person that is simply someone's co-author).

Is it possible to exclude MARC field 950 from the matching code?

Thanks and many regards,

Christiane

solr data source, parse multiValued fields

A very common use case is to populate a Solr index from a CSV file, which is fairly straightforward:

solr create -c reconcile
post -c reconcile data.csv

The default "schemaless" configuration defines all fields as multiValued.
For example, take a field (CSV column) label_en that has no explicit "multiValued":false:

http://localhost:8983/solr/reconcile/schema/fields/label_en

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"label_en",
    "type":"strings"
    }}

A query then returns:

<doc>
    <arr name="label_en">
      <str>forgery, falsification and theft of artworks</str>
    </arr>
   ....
</doc>

Would it be easy to implement parsing of this result, rather than having to modify the Solr schema?
Thanks
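
A minimal sketch of what such parsing could look like, in Python for brevity. It assumes the XML shape shown above; the "id" field, the second label value, and the function names are made up for illustration, and the "first"/"concat" strategy names are only illustrative. It is not conciliator's actual implementation.

```python
import xml.etree.ElementTree as ET

# Sample document shaped like the Solr output above.
# The "id" field and the second <str> value are hypothetical additions.
SAMPLE_DOC = """
<doc>
  <str name="id">42</str>
  <arr name="label_en">
    <str>forgery, falsification and theft of artworks</str>
    <str>art forgery</str>
  </arr>
</doc>
""".strip()


def field_values(doc, field_name):
    """Return every value of a field, whether single-valued or wrapped in <arr>."""
    values = []
    for child in doc:
        if child.get("name") != field_name:
            continue
        if child.tag == "arr":
            # multiValued field: one child element per value
            values.extend(item.text or "" for item in child)
        else:
            # single-valued field: the element text is the value
            values.append(child.text or "")
    return values


def field_as_string(doc, field_name, strategy="concat", delimiter=", "):
    """Collapse a possibly multi-valued field to one string, or None if absent."""
    values = field_values(doc, field_name)
    if not values:
        return None
    if strategy == "first":
        return values[0]
    return delimiter.join(values)


doc = ET.fromstring(SAMPLE_DOC)
print(field_as_string(doc, "label_en", strategy="first"))
print(field_as_string(doc, "label_en", strategy="concat", delimiter="; "))
```

Handling both shapes in one helper means the same code works whether or not the schema declares the field as multiValued.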

Service does not work as of around August 2022

As of around August 2022, the reconciliation service will run but will not return any results, even for exact matches, at least for the http://refine.codefork.com/reconcile/viafproxy/LC option. I have tried the service using different versions of OpenRefine (3.5.2, which historically worked fine with this service, and 3.6), but it does not seem to work. I wonder if the VIAF APIs have been updated, necessitating changes to the reconciliation service?

Issue with Java 10

Hello, I'd like to report an error with my Java version:

openjdk version "10.0.2" 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2, mixed mode)

Here's the complete log; I hope it's helpful.
java1.txt

Add more information on configuring Solr datasources

I found it a bit tricky to set up a Solr datasource. I eventually succeeded, but only after grepping the source code for a few of the properties listed in conciliator.properties and doing lots of trial and error.

The comments in conciliator.properties could be expanded with a bit more information, for example: what each parameter does, what values it expects, whether any are optional, how to add multiple sources (i.e., for multiple fields), etc.

I've annotated conciliator.properties based on my limited knowledge of Solr and my testing of conciliator:

##  Name will appear in OpenRefine's reconciliation interface
# datasource.solr.name=A Solr Collection of Books

## Seems to be some internal type?
# datasource.solr.nametype.id=/book/book

## Seems to correspond to internal type somehow?
# datasource.solr.nametype.name=Book

## Solr query URL with placeholders for query term and rows (will be replaced by conciliator for each query)
# datasource.solr.url.query=http://localhost:8983/solr/test-core/select?wt=xml&q={{QUERY}}&rows={{ROWS}}

## Not sure why we need to get each matching document
# datasource.solr.url.document=http://localhost:8983/solr/test-core/get?id={{id}}

## ???
# datasource.solr.field.id=id

## Solr field name (sounds like Solr's filter list parameter, not sure why we need this if we could just search for the field directly in the query URL?)
# datasource.solr.field.name=title_display

## can be 'concat' or 'first'. defaults to 'concat'
# datasource.solr.field.name.multivalue.strategy=first
# datasource.solr.field.name.multivalue.delimiter=,

I'd appreciate your feedback. Thanks!
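
To make the placeholder comment above concrete, here is a rough Python sketch of how a query URL could be built from the template. How conciliator performs the substitution internally is an assumption on my part; build_query_url is a hypothetical helper, not something from the code base.

```python
from urllib.parse import quote

# Template copied from the datasource.solr.url.query example above.
QUERY_TEMPLATE = (
    "http://localhost:8983/solr/test-core/select"
    "?wt=xml&q={{QUERY}}&rows={{ROWS}}"
)


def build_query_url(template, query, rows):
    """Fill in the {{QUERY}} and {{ROWS}} placeholders (hypothetical helper)."""
    # Percent-encode the query so spaces, '&' and '=' cannot break the URL.
    encoded_query = quote(query, safe="")
    return (
        template
        .replace("{{QUERY}}", encoded_query)
        .replace("{{ROWS}}", str(rows))
    )


print(build_query_url(QUERY_TEMPLATE, "Monte Cristi", 5))
```

Anything else in the template (wt, fl, df, sort and so on) passes through untouched, which is why extra Solr parameters can be put directly into the URL property.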

NPE in com.codefork.refine.solr.Solr

Trying to reconcile using Solr as a backend leads to a null pointer exception for every query. Example:

2022-11-01 11:48:29.865 ERROR 13338 --- [pool-5-thread-3] com.codefork.refine.solr.Solr            : error for query=Monte Cristi

java.lang.NullPointerException: null
	at com.codefork.refine.solr.Solr.createURL(Solr.java:57) ~[classes!/:3.1.0]
	at com.codefork.refine.solr.Solr.search(Solr.java:63) ~[classes!/:3.1.0]
	at com.codefork.refine.datasource.WebServiceDataSource.searchCheckCache(WebServiceDataSource.java:272) ~[classes!/:3.1.0]
	at com.codefork.refine.datasource.WebServiceSearchTask.call(WebServiceSearchTask.java:45) ~[classes!/:3.1.0]
	at com.codefork.refine.datasource.WebServiceSearchTask.call(WebServiceSearchTask.java:15) ~[classes!/:3.1.0]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]

The configuration is simple:

cache.enabled=true
cache.ttl=3600
cache.size=64MB

datasource.solr.name=A Solr Second Level administrative division of Dominican Republic
datasource.solr.nametype.id=/location/location
datasource.solr.nametype.name=Geographic Name
datasource.solr.url.query=http://localhost:8983/solr/iso_adm2_dom/select?wt=xml&df=nombre_provincia&fl=id%20score%20nombre_provincia&q={{QUERY}}&rows={{ROWS}}&sort=score%20desc
datasource.solr.url.document=http://localhost:8983/solr/iso_adm2_dom/get?id={{id}}
datasource.solr.field.id=id
datasource.solr.field.name=nombre_provincia

The Solr backend has no problem responding to the queries:

➜  ~ curl http://localhost:8983/solr/iso_adm2_dom/select\?wt\=xml\&df\=nombre_provincia\&fl\=id%20score%20nombre_provincia\&q\=Monte%20Cristi\&rows=5\&sort=score%20desc
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">46</int>
  <lst name="params">
    <str name="q">Monte Cristi</str>
    <str name="df">nombre_provincia</str>
    <str name="fl">id score nombre_provincia</str>
    <str name="sort">score desc</str>
    <str name="rows">5</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="3.966993" numFoundExact="true">
  <doc>
    <str name="id">15</str>
    <str name="nombre_provincia">Monte Cristi</str>
    <float name="score">3.966993</float></doc>
  <doc>
    <str name="id">29</str>
    <str name="nombre_provincia">Monte Plata</str>
    <float name="score">1.8048377</float></doc>
</result>
</response>

Reconcile against multiple Solr sources?

I can define a Solr source in conciliator.properties, but let's say I want to reconcile against a list of countries and a list of languages without having to change the definition and restart conciliator. Is it possible to define multiple sources, for example one collection with countries and one with languages? Am I missing something here or is this beyond the use case of conciliator's Solr support?

It seems like this could be what the datasource.solr.nametype.id property is for, but I am not really sure how to use that (and again, it seems that you can only set one type).

Parsing time of VIAF results: test threshold too low

When running mvn package on my computer:

Failed tests: 
  VIAFParserTest.testParseTime:160 should take less than 50ms, on average, to parse a big XML doc, but took 52ms

I see you have added a special case for Travis; I think it would make sense to use 100ms regardless of the environment.

Also: thank you so much for your implementation of the data extension API! It looks very exciting.

Order reconciliation results by decreasing score

Sometimes, the reconciliation results returned by the services are not sorted by decreasing score: at the moment, http://refine.codefork.com/reconcile/viaf?query=Kamila gives the following results:

{

    "result": [
        {
            "id": "18951129",
            "name": "Varano, Camilla Battista da 1458-1524",
            "type": [
                {
                    "id": "/people/person",
                    "name": "Person"
                }
            ],
            "score": 0.1282051282051282,
            "match": false
        },
        {
            "id": "102271932",
            "name": "Shamsie, Kamila, 1973-....",
            "type": [
                {
                    "id": "/people/person",
                    "name": "Person"
                }
            ],
            "score": 0.23076923076923078,
            "match": false
        },
        {
            "id": "63233597",
            "name": "Camilla, Duchess of Cornwall, 1947-",
            "type": [
                {
                    "id": "/people/person",
                    "name": "Person"
                }
            ],
            "score": 0.14285714285714285,
            "match": false
        }
    ]

}

See the corresponding SO question:
https://stackoverflow.com/questions/53852042/openrefine-reconcile-by-second-or-third-candidate
Corresponding OpenRefine issue:
OpenRefine/OpenRefine#1913
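
For illustration, the requested ordering amounts to a descending sort on the score field. A small Python sketch using the ids and scores from the response above (sort_by_score is a made-up name, and this is not the service's code):

```python
# Candidates from the response above, reduced to the fields that matter here.
results = [
    {"id": "18951129", "score": 0.1282051282051282},
    {"id": "102271932", "score": 0.23076923076923078},
    {"id": "63233597", "score": 0.14285714285714285},
]


def sort_by_score(candidates):
    """Return candidates ordered by decreasing score.

    sorted() is stable, so candidates with equal scores keep the order
    in which the upstream service returned them.
    """
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


for candidate in sort_by_score(results):
    print(candidate["id"], candidate["score"])
```

With this ordering, Shamsie (102271932) comes first, followed by 63233597 and then 18951129.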

accuracy issue for orcid reconciliation

Hi
Sometimes the results of the ORCID reconciliation service are perfect; sometimes it seems broken.
See the two examples for "Igor Ozerov" and "Li Xi":
-> for Igor Ozerov, it should be the first result, because this name is unique in the ORCID database
-> for Li Xi, we should get the list of all people named Li Xi, and not "Li-Li Xi" or "Li Bo Xi"

Do you think it could be improved?

image

Scores aren't returned for Solr sources: is this correct?

I'm using conciliator to allow reconciliation against a custom private database. Some of the candidates are perfect matches, but I can't tell OpenRefine to accept them automatically because the score returned is 0. Is there something missing from my Solr configuration that would return the matching score? Or should it be calculated by conciliator?
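
For context, here is a rough Python sketch of reading the score out of a Solr XML response, assuming the query's fl parameter includes score. The sample response, the nombre_provincia field, and the exact-match rule are illustrative assumptions; whether conciliator reads this value at all is exactly the open question. Note that Solr relevance scores are unbounded, so they do not map directly onto a 0-100 or 0-1 confidence.

```python
import xml.etree.ElementTree as ET

# Trimmed sample of a Solr XML response requested with fl=id score <name field>.
SOLR_RESPONSE = """<response>
<result name="response" numFound="2" start="0" maxScore="3.966993">
  <doc>
    <str name="id">15</str>
    <str name="nombre_provincia">Monte Cristi</str>
    <float name="score">3.966993</float></doc>
  <doc>
    <str name="id">29</str>
    <str name="nombre_provincia">Monte Plata</str>
    <float name="score">1.8048377</float></doc>
</result>
</response>"""


def extract_candidates(xml_text, name_field, query):
    """Build candidate dicts carrying the raw Solr score (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    candidates = []
    for doc in root.iter("doc"):
        # Map each field name to its text; assumes single-valued fields.
        fields = {el.get("name"): (el.text or "") for el in doc}
        name = fields.get(name_field, "")
        candidates.append({
            "id": fields.get("id"),
            "name": name,
            # Falls back to 0.0 when the response carries no score field.
            "score": float(fields.get("score", 0.0)),
            # Illustrative rule: flag a match only on case-insensitive equality.
            "match": name.strip().casefold() == query.strip().casefold(),
        })
    return candidates


for candidate in extract_candidates(SOLR_RESPONSE, "nombre_provincia", "monte cristi"):
    print(candidate)
```

If score is not listed in fl, the float element is absent and the sketch falls back to 0.0, which would look the same as the behaviour described above.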

Full stop after some VIAF person headings prevents automatic matching

Hi,

I'm using your VIAF recon service to reconcile scholars' names from the field of Lexicography and Dictionary Research, to construct a domain bibliography and person registry in the Linked Open Data environment.

After reconciling and manually validating 200 person names with VIAF (and getting very good results in general!), I came across a peculiar feature in VIAF that seems to prevent automatic matching in many cases, and increases tedious manual validation. Apparently, one of the VIAF contributors, NUKAT, sets a full stop behind a person name heading, resulting in an otherwise non-existent edit distance and causing the score to drop below 1. Even with the selected option in OpenRefine to auto-match candidates with a high confidence during reconciling, the score is often below the threshold.

Typical example from my data:

Name literal: Quasthoff, Uwe
VIAF candidate: Quasthoff, Uwe. (score: 0.933)
VIAF URI: https://viaf.org/viaf/22741331/

As far as I can see, NUKAT is the only VIAF contributor that puts a full stop after a person's name, and yet this particular heading is always ranked highest in the VIAF cluster. As we have no way to anticipate whether a matching VIAF cluster includes NUKAT headings or not, is there a way to modify the matching algorithm and chop off the full stop (if it exists) for the candidates returned from VIAF?

This would really help to improve your VIAF recon service even further. Thanks for all the work you've already done!

Regards,
Christiane
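
To illustrate the request, here is a small Python sketch. It assumes the score is a normalized Levenshtein similarity, which happens to reproduce the 0.933 above but is only a guess at how the service scores candidates; strip_trailing_full_stop is a hypothetical helper.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1.0 for identical strings, lower as the edit distance grows (assumed scoring)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def strip_trailing_full_stop(name):
    """Drop one trailing full stop, but leave open-date markers like '1973-....' alone."""
    name = name.rstrip()
    if name.endswith(".") and not name.endswith("..."):
        return name[:-1]
    return name


query = "Quasthoff, Uwe"
candidate = "Quasthoff, Uwe."
print(round(similarity(query, candidate), 3))
print(similarity(query, strip_trailing_full_stop(candidate)))
```

One caveat: a heading that legitimately ends in an initial (for example "Smith, J.") would lose its full stop too, so a real implementation may want to strip it from both the query and the candidate before comparing.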

Available properties for ORCID?

Hi!
Thanks for your ORCID reconciliation service. I wonder whether it supports any properties? It would be great if we could use Researcher ID, Scopus ID, ISNI, institutions, or other things like that. We're using it at WikiCite 2017 to add ORCID iDs to Wikidata items for researchers.

Encoding issue on some reconciliation results

I'm using this program to perform reconciliation on a database of names, specifically attempting to retrieve LCNAF IDs (by way of VIAF). Some results, however, seem to be encountering an encoding issue of some sort. I am unsure if this has to do with OpenRefine, the conciliator program, or both.

For example, reconciling "Menéndez Pidal, Ramón 1869-1968" results in "Menéndez Pidal, Ramón, 1869-1968.", and reconciling "Ōe, Kenzaburō 1935-" gave me ÅŒe, KenzaburÅ�, 1935-.

Any suggestions?
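
One possibility, offered as a guess rather than a diagnosis: the garbled output is consistent with UTF-8 bytes being decoded as Windows-1252 somewhere along the way. The Python sketch below reproduces the symptom; the function names are made up, and it does not show where in OpenRefine or conciliator the mix-up would occur.

```python
def simulate_mojibake(text):
    """Encode as UTF-8, then wrongly decode as Windows-1252.

    Bytes that Windows-1252 leaves undefined (such as 0x8D) become U+FFFD.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair_mojibake(garbled):
    """Reverse the mix-up; only works if no byte was lost to U+FFFD."""
    return garbled.encode("cp1252").decode("utf-8")


print(simulate_mojibake("Ōe"))        # the two UTF-8 bytes of 'Ō' show up as 'ÅŒ'
print(simulate_mojibake("Menéndez"))  # 'é' shows up as 'Ã©'
print(repair_mojibake("ÅŒe"))         # back to 'Ōe'
```

The final "ō" in "Kenzaburō" encodes to a byte that Windows-1252 does not define, which would explain why it shows up as a replacement character and cannot be recovered after the fact.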
