String2Vocabulary

Look for literals in an RDF graph and substitute them with URIs from controlled vocabularies. Built with Gradle and Apache Jena.

Vocabularies are grouped into families based on their filenames. For example, city-italy.ttl and city-france.ttl are both part of the family city.
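
A minimal sketch of this grouping, assuming the family is whatever precedes the first hyphen of the file name (or the whole base name when there is no hyphen). The class and method names are hypothetical and are not part of the library's API; the library's actual rule may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class VocabularyFamilies {
    // Hypothetical helper: derive the family from a vocabulary file name.
    // Assumption: the family is the part of the base name before the first hyphen.
    static String familyOf(String fileName) {
        // Drop the extension, e.g. "city-italy.ttl" -> "city-italy"
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        // Keep what precedes the first hyphen, e.g. "city-italy" -> "city"
        int hyphen = base.indexOf('-');
        return hyphen > 0 ? base.substring(0, hyphen) : base;
    }

    // Group file names by family, sorted by family name.
    static Map<String, List<String>> groupByFamily(List<String> fileNames) {
        Map<String, List<String>> families = new TreeMap<>();
        for (String name : fileNames) {
            families.computeIfAbsent(familyOf(name), k -> new ArrayList<>()).add(name);
        }
        return families;
    }

    public static void main(String[] args) {
        List<String> files = List.of("city-italy.ttl", "city-france.ttl", "key.ttl");
        System.out.println(groupByFamily(files));
        // prints {city=[city-italy.ttl, city-france.ttl], key=[key.ttl]}
    }
}
```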

Input

The library requires as input:

  • a folder containing vocabularies
  • for a full graph replacement, a configuration in CSV that declares the property to match, the related vocabulary family, and whether it should also check for the singular form of the label. Example:
http://data.doremus.org/ontology#U2_foresees_use_of_medium_of_performance,mop,singular
http://data.doremus.org/ontology#U11_has_key,key,
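
To make the three columns concrete, here is a sketch of how one line of this configuration could be read. It assumes that the columns are separated by plain commas, that property URIs contain no commas, and that the third column is either the word singular or empty. The PropertyMapping class is hypothetical and only illustrates the file layout; it is not the library's own parser.

```java
class PropertyMapping {
    final String property;   // URI of the property to match
    final String family;     // vocabulary family to search in
    final boolean singular;  // whether to also try the singular form

    PropertyMapping(String property, String family, boolean singular) {
        this.property = property;
        this.family = family;
        this.singular = singular;
    }

    // Parse one configuration line such as "<propertyURI>,mop,singular".
    static PropertyMapping parseLine(String line) {
        // The -1 limit keeps a trailing empty column ("...,key,")
        String[] cols = line.split(",", -1);
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected at least 2 columns: " + line);
        }
        boolean singular = cols.length > 2 && cols[2].trim().equalsIgnoreCase("singular");
        return new PropertyMapping(cols[0].trim(), cols[1].trim(), singular);
    }

    public static void main(String[] args) {
        PropertyMapping m = parseLine(
            "http://data.doremus.org/ontology#U2_foresees_use_of_medium_of_performance,mop,singular");
        System.out.println(m.family + " " + m.singular);
        // prints mop true
    }
}
```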

Given as input:

ns:myMusicWork mus:U11_has_key [
          a mus:M4_Key ;
          rdfs:label "Ré majeur"@fr ] ;
   mus:U2_foresees_use_of_medium_of_performance "mezzosoprano" .

... this produces as output:

ns:myMusicWork mus:U11_has_key <http://data.doremus.org/vocabulary/key/d> ;
   mus:U2_foresees_use_of_medium_of_performance  <http://data.doremus.org/vocabulary/iaml/mop/vms> .

Features

  • Vocabulary syntax supported:
    • SKOS
    • MODS
  • Support for families of vocabularies
  • Replace literals that match the given label
  • Replace objects that have an rdfs:label or ecrm:P1_is_identified_by value that matches the given label
  • Strict mode: match both label and language
  • Normalise the labels by removing punctuation, decoding to ASCII, using lowercase
  • Search also for the singular version of the word with Stanford CoreNLP
  • Support for RDF Dataset:
    • replace content at the default graph level
    • replace content at a given named graph level
  • Supported textual syntax for RDF (serialization):
    • Turtle
    • TriG
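
The label normalisation and strict mode listed above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: accents are stripped through Unicode NFD decomposition (characters without a decomposition, such as "ø", are left untouched), ASCII punctuation is replaced by a space rather than deleted so that hyphenated names stay split into words, and strict mode is read as "the normalised labels and the language tags must both match". The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

class LabelNormalizer {
    // Normalise a label: strip accents, drop punctuation, lowercase.
    static String normalize(String label) {
        // Split accented characters into base letter + combining mark
        String decomposed = Normalizer.normalize(label, Normalizer.Form.NFD);
        // Remove the combining marks, e.g. "é" -> "e"
        String unaccented = decomposed.replaceAll("\\p{M}+", "");
        // Assumption: ASCII punctuation becomes a space, then spaces are collapsed
        String noPunct = unaccented.replaceAll("\\p{Punct}+", " ");
        return noPunct.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Compare a searched text with a vocabulary label.
    // In strict mode the language tags must match as well.
    static boolean matches(String text, String textLang,
                           String label, String labelLang, boolean strict) {
        if (!normalize(text).equals(normalize(label))) {
            return false;
        }
        return !strict || (textLang != null && textLang.equalsIgnoreCase(labelLang));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Ré majeur"));     // prints re majeur
        System.out.println(normalize("Saint-Étienne")); // prints saint etienne
        System.out.println(matches("Violin", "en", "violin", "it", true)); // prints false
    }
}
```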

Dependencies:

  • Build tool: Gradle 7+
  • See the dependencies section in the build.gradle file for project dependencies.

Usage

As a module

  1. Add it as a dependency, e.g. in build.gradle:
dependencies {
   implementation 'com.github.DOREMUS-ANR:string2vocabulary:0.7'
}
  2. Import and initialise it in your Java class
import org.doremus.string2vocabulary.VocabularyManager;

// ...

// print full logs
VocabularyManager.setVerbose(true);

// set the folder containing the vocabularies
VocabularyManager.setVocabularyFolder("/location/to/vocabularyFolder");
// set the path of the config CSV
VocabularyManager.init("/location/to/property2family.csv");
// set the language to be used for singularising the words
VocabularyManager.setLang("fr");
  3. Use it :)
// Search for a term in a given family
// this performs a normal full search and one in strict mode
VocabularyManager.searchInCategory("violin", "en", "mop");
// --> http://www.mimo-db.eu/InstrumentsKeywords/3573

// or
// Search for a term in a given vocabulary
VocabularyManager.getVocabulary("mop-iaml").findConcept("violin", false);
// --> http://data.doremus.org/vocabulary/iaml/mop/svl
// strict mode
VocabularyManager.getVocabulary("mop-iaml").findConcept("violin@it", true);
// --> null

// or
// Get the URI by code (what is written after the namespace)
VocabularyManager.getVocabulary("key").getConcept("dm");
// --> http://data.doremus.org/vocabulary/key/dm

// or
// Full graph replacement
// search and substitute in the whole Jena Model
// (following the csv configuration)
VocabularyManager.string2uri(model);

See the test folder for another example of usage.

Command Line

Run the library from CLI with gradle run:

# Canonical form
gradle run -Pmap="/location/to/property2family.csv" \
  -Pinput="/location/to/input.ttl" \
  -Pvocabularies="/location/to/vocabularyFolder"

Available CLI parameters:

param | example | comment
--- | --- | ---
map | /location/to/property2family.csv | A table mapping properties to vocabulary families
vocabularies | /location/to/vocabularyFolder | Folder containing the vocabularies in Turtle format
input | /location/to/input.ttl | The input file (Turtle or TriG syntax)
output (optional) | /location/to/output.ttl | The output file. Default: <inputPath/inputName>_output.<inputFileExt>
lang (optional) | fr | Language to be used for singularising the words. Default: en
graph (optional) | http://example.org/graph/object/ | The named graph to process. Default: empty string (i.e. the default graph)
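
As an illustration of the default output naming rule (<inputPath/inputName>_output.<inputFileExt>), the following sketch derives the output path from the input path. The helper is hypothetical and only mirrors the rule as documented in the table; how the tool treats inputs without an extension is an assumption here.

```java
class DefaultOutput {
    // Build "<inputPath/inputName>_output.<inputFileExt>" from the input path.
    static String defaultOutputPath(String input) {
        int slash = Math.max(input.lastIndexOf('/'), input.lastIndexOf('\\'));
        int dot = input.lastIndexOf('.');
        // Assumption: with no extension (or a dot-file), just append the suffix
        if (dot <= slash + 1) {
            return input + "_output";
        }
        // Insert "_output" between the name and the extension
        return input.substring(0, dot) + "_output" + input.substring(dot);
    }

    public static void main(String[] args) {
        System.out.println(defaultOutputPath("/location/to/input.ttl"));
        // prints /location/to/input_output.ttl
    }
}
```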

The default gradle run behavior relies on project properties set in the gradle.properties file. See the Gradle documentation for details about project properties.

CLI examples with provided test files:

# Example: Turtle syntax
gradle run -Pmap="src/test/resources/property2family.csv" \
  -Pinput="src/test/resources/input.ttl" \
  -Pvocabularies="src/test/resources/vocabulary"

# Example: TriG syntax, replace at the default graph level
gradle run -Pmap="src/test/resources/property2family.csv" \
  -Pinput="src/test/resources/input.trig" \
  -Poutput="src/test/resources/output.trig" \
  -Pvocabularies="src/test/resources/vocabulary"

# Example: TriG syntax, replace at the default graph level (alternative)
gradle run -Pmap="src/test/resources/property2family.csv" \
  -Pinput="src/test/resources/input.trig" \
  -Poutput="src/test/resources/output.trig" \
  -Pvocabularies="src/test/resources/vocabulary" \
  -Pgraph=""

# Example: TriG syntax, replace at a given named graph level
gradle run -Pmap="src/test/resources/property2family.csv" \
  -Pinput="src/test/resources/input.trig" \
  -Poutput="src/test/resources/output.trig" \
  -Pvocabularies="src/test/resources/vocabulary" \
  -Pgraph="http://example.org/graph/object/"

Documentation

Generating local code documentation:

javadoc -d doc/ ./org/doremus/string2vocabulary/VocabularyManager.java

Contribute

To contribute, please either

  • fork the repository and create a merge request, or
  • raise an issue in the project's space.

Citation

If you use this software in a scientific publication, please cite:

Pasquale Lisena, Konstantin Todorov, Cécile Cecconi, Françoise Leresche, Isabelle Canno, Frédéric Puyrenier, Martine Voisin, Thierry Le Meur, & Raphaël Troncy. (2018). Controlled Vocabularies for Music Metadata. Proceedings of the 19th International Society for Music Information Retrieval Conference, 424–430. https://doi.org/10.5281/zenodo.1492441

In BibTeX:

@inproceedings{lisena2018vocabularies,
  author       = {Pasquale Lisena and
                  Konstantin Todorov and
                  Cécile Cecconi and
                  Françoise Leresche and
                  Isabelle Canno and
                  Frédéric Puyrenier and
                  Martine Voisin and
                  Thierry Le Meur and
                  Raphaël Troncy},
  title        = {Controlled Vocabularies for Music Metadata},
  booktitle    = {{19th International Society for 
                   Music Information Retrieval Conference}},
  year         = 2018,
  pages        = {424--430},
  publisher    = {ISMIR},
  address      = {Paris, France},
  month        = sep,
  venue        = {Paris, France},
  doi          = {10.5281/zenodo.1492441},
  url          = {https://doi.org/10.5281/zenodo.1492441}
}

string2vocabulary's People

Contributors

genears, pasqlisena, rtroncy

string2vocabulary's Issues

Use rdfs:label

Currently it works only if the label is in ecrm:P1_is_identified_by. Make it also work with labels in rdfs:label.

Singularise

Add the possibility of automatically searching for the singular form if a first search fails.

Example:

  • search for sopranos : fail
  • singularise -> soprano
  • search for soprano: success
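
The fallback described in the steps above can be sketched as follows. The vocabulary is modelled as a plain label-to-URI map, the URI is a placeholder, and the singulariser is passed in as a function. The stripTrailingS helper is a deliberately naive stand-in used only to keep the example self-contained; all names here are hypothetical and none of them belong to the library.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

class SingularFallback {
    // Look up a term; if that fails, retry once with its singular form.
    static Optional<String> find(Map<String, String> labelToUri, String term,
                                 UnaryOperator<String> singularise) {
        String uri = labelToUri.get(term);
        if (uri != null) {
            return Optional.of(uri);            // first search succeeded
        }
        String singular = singularise.apply(term);
        if (singular.equals(term)) {
            return Optional.empty();            // nothing new to try
        }
        return Optional.ofNullable(labelToUri.get(singular));
    }

    // Naive stand-in for a real singulariser (assumption: English-style plural "s").
    static String stripTrailingS(String word) {
        return word.length() > 1 && word.endsWith("s")
            ? word.substring(0, word.length() - 1)
            : word;
    }

    public static void main(String[] args) {
        // Placeholder URI, for illustration only
        Map<String, String> vocabulary = Map.of("soprano", "http://example.org/vocabulary/soprano");
        System.out.println(find(vocabulary, "sopranos", SingularFallback::stripTrailingS));
        // prints Optional[http://example.org/vocabulary/soprano]
    }
}
```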

Add more complex context-aware methods for matching strings to entities

When we applied string2vocabulary with strings representing cities and towns to match with Geonames in SILKNOW, we obtained a lot of bad results.

Example: http://data.silknow.org/production/41481202-0c96-3171-82ca-099088faf425.
The original city mentioned is simply "Saint Etienne", identified by http://www.geonames.org/2980291/. Strangely, string2vocabulary has matched it with a much smaller town, "Saint-Étienne-du-Rouvray", identified by http://sws.geonames.org/2980236/. That said, there are a hundred cities in France named "Saint Etienne something".

This shows the limits of pure fuzzy string matching. Should we consider more complex matching techniques, e.g. relying on pre-trained word embeddings? It is possible that "Saint Etienne", used together with the other contextual words (satin, faille, soie, tissu façonné), would have led to the right city.
