alexhoorn / movielocationsontology

A linked data ontology about movies and their filming locations with an accompanying application. Built as a project for the course Knowledge and Data @ Vrije Universiteit Amsterdam.

Python 63.74% Jupyter Notebook 36.26%

movielocationsontology's People

Contributors

Stargazers

Watchers

Forkers

joshuaseth ramon284

movielocationsontology's Issues

Ontology fixes

As it's not realistic to load all individuals into the ontology through Protege yet again, I have made a small script that fixes the existing issues.

The following changes have been made to the ontology:

Removed wrong class assignments for:
- ml:Show
- ml:Scene
- ml:Location
- ml:Character
- ml:Person
Renamed some predicates that were supposed to be denoted with rdfs:label:
- ml:hasSceneName
- ml:hasFullName
- ml:hasPrimaryTitle

If there are any mistakes in these fixes, or if anything needs to be added, please mention it in the comments. The original ontology is untouched and rerunning the script is easy.
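
The fix script itself is not shown in this issue. Purely as an illustration, the predicate renaming could look something like the sketch below. It treats triples as plain tuples instead of using an RDF library, and the `rename_predicates` helper and the sample triples are hypothetical, not taken from the repository.

```python
RDFS_LABEL = "rdfs:label"

# Predicates listed in this issue that should have been rdfs:label
LABEL_PREDICATES = {"ml:hasSceneName", "ml:hasFullName", "ml:hasPrimaryTitle"}


def rename_predicates(triples, old_predicates=LABEL_PREDICATES, new_predicate=RDFS_LABEL):
    """Return a new list of triples with the given predicates replaced."""
    return [
        (s, new_predicate if p in old_predicates else p, o)
        for s, p, o in triples
    ]


# Hypothetical sample data, for illustration only
sample = [
    ("ml:scene1", "ml:hasSceneName", "Dragonpit summit"),
    ("ml:scene1", "ml:hasLocation", "ml:location1"),
]
fixed = rename_predicates(sample)
```

Because the function returns a new list, the input stays untouched, which matches the idea of keeping the original ontology as-is and rerunning the fixes when needed.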

An ontology that explicitly represents the conceptualization in OWL. (200-300 words)

Use at least 3 class restrictions. The ontology should reuse at least 2 existing vocabularies or ontologies. Describe the ontology, and the decisions made (200-300 words). Also provide the resulting ontology as a separate Turtle file.

Update UI with Streamlit update/layouts

Add description of converted dataset for importing into Protege

Documentation has been added. See the README file in /converted_data

We should probably standardize a code formatter

Code formatters are great for standardizing code formatting and therefore increasing readability. I've already seen you guys use one, which is great, but I believe we're all using different tools to format our code.

I'm personally a big fan of black. It's known as an opinionated formatter, which means it comes with a preset configuration and leaves nothing that has to be configured beforehand. From its description: "Blackened code looks the same regardless of the project you're reading."

We could set this up as a GitHub Action, which means any code we push to the repository is automatically formatted. Perhaps even better is to use its pre-commit hook, which automatically formats code locally before it is committed.

What do you guys think? Black is not the only formatter out there. So please share your opinion!

A description of the domain and scope of the ontology (100-200 words)

A description of the domain and scope of the ontology, as determined by the application (100-200 words)

Fetch location data - Davey

Convert and clean geocoded location data to maps for Protege.

Fix duplicate scenes

Check milestone 1

Let's use this issue to make sure everyone has checked the document for milestone 1. Please leave a 👍🏻 if you've reread it and agree with its current state. With at least 3 thumbs we can say the document is final and ready to hand in.

Description of the Application and Users

Goal
The users of the application

From the description it is very clear what the application is intended to do, and the motivation and need for the application are very convincing. The prospective users of the application are described in detail, and it is clear what problem the application solves for them.

Populate ontology with new data

Convert data source numbers to categories

Identify 2 ontologies that you can reuse, and motivate your choice.

The goal of the application

What does it do? What task does it perform? (100-200 words)

Make filters dependent on each other

Fetch data from Wikidata within application with IMDB ID

Fetch location data - Seth

A description of how you integrated the external datasets with your ontology and how you used inferencing and/or mapping constructs. (200-300 words)

Fix missing scenes

Currently our ontology holds many scenes which are completely identical to each other, except for their identifier. There also seems to be a severe lack of scenes compared to what can be found on the official IMDB pages.
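
As a rough sketch of how the duplicate part of this problem could be approached, the snippet below keeps one scene per unique combination of non-identifier values. The dictionary layout, the `id` key and the `drop_duplicate_scenes` helper are assumptions for illustration; the project's actual scene representation is not shown here.

```python
def drop_duplicate_scenes(scenes, id_key="id"):
    """Keep the first scene for each unique set of non-identifier values."""
    seen = set()
    unique = []
    for scene in scenes:
        # Build a hashable fingerprint from every field except the identifier
        fingerprint = tuple(sorted((k, v) for k, v in scene.items() if k != id_key))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(scene)
    return unique


# Hypothetical sample data, for illustration only
scenes = [
    {"id": "scene1", "name": "Dragonpit summit", "location": "loc1"},
    {"id": "scene2", "name": "Dragonpit summit", "location": "loc1"},
    {"id": "scene3", "name": "Water Gardens of Dorne", "location": "loc2"},
]
deduplicated = drop_duplicate_scenes(scenes)
```

This only addresses the identical scenes. Scenes that are missing compared to the IMDB pages would still need to be fetched separately.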

A walk-through of how users will interact with the application (150-300 words)

A description and evidence that running the SPARQL queries against the ontology and data produces inferences (screenshots reasoning on/off). Discuss the inferences. (200-300 words)

Should we use CSV formatting + Pandas for SPARQLWrapper results?

I noticed SPARQLWrapper supports converting query results to CSV, whereas we're currently using JSON, which can often be painful to parse.

If we use CSV as a return format we can very simply load the results in Pandas DataFrames and work from there. Would it be an idea to use that from now on?

from io import StringIO

import pandas as pd
from SPARQLWrapper import CSV, SPARQLWrapper


def query_to_dataframe(query):
    endpoint = "your endpoint"
    repository = "your repository"
    sparql = SPARQLWrapper(f"http://{endpoint}/repositories/{repository}")
    sparql.setReturnFormat(CSV)
    sparql.setQuery(query)

    # With CSV as the return format, convert() returns raw bytes. Decoding
    # them and wrapping the text in StringIO lets Pandas read the results
    # without having to save them to a local csv file first.
    results = sparql.query().convert()
    csv_buffer = StringIO(results.decode())
    df = pd.read_csv(csv_buffer)

    return df


example_query = """
PREFIX ex: <http://example.com/projectkand/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select DISTINCT ?sceneName ?lon ?lat
WHERE
{ 
    ?show a ex:Show.
    ?show ex:hasPrimaryTitle ?title.
    ?show ex:hasScene ?scene.
    ?scene rdfs:label ?sceneName.
    ?scene ex:hasLocation ?location.
    ?location ex:hasLongitude ?lon;
        ex:hasLatitude ?lat
    FILTER(?title = 'Game of Thrones')
}
"""

df = query_to_dataframe(example_query)

Where df is now an easy-to-use DataFrame with nicely parsed results:

	sceneName	lon	lat
0	Water Gardens of Dorne	-5.99113	37.384
1	Long Bridge of Volantis	-4.77601	37.8846
2	Dragonstone beach exterior scenes	-2.96099	43.4066
3	Dragonpit summit	-6.04539	37.4419
4	House Martell, Water gardens of Dorne	-2.47235	36.8413

Fetch location data - Alex

Fetch location data - Ramon

Add inference rules to ontology

Initialize ontology in Protege

IMDB Dataset processing

Python file for streamlit inputs to SPARQL Query logic. (input=inputs, output=list of markers that folium map understands?)
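
One possible shape for that output is sketched below: a list of dictionaries whose keys match the `location` and `popup` arguments of `folium.Marker`. The `rows_to_markers` helper is hypothetical, and the `sceneName`, `lat` and `lon` keys are assumed to follow the variable names used in the example query of the CSV + Pandas issue.

```python
def rows_to_markers(rows):
    """Convert query result rows into marker dicts, skipping unusable rows."""
    markers = []
    for row in rows:
        try:
            lat = float(row["lat"])
            lon = float(row["lon"])
        except (KeyError, TypeError, ValueError):
            # Skip rows with missing or non-numeric coordinates
            continue
        markers.append({"location": [lat, lon], "popup": row.get("sceneName", "")})
    return markers


# Sample rows, shaped like the results of the example query
rows = [
    {"sceneName": "Dragonpit summit", "lon": "-6.04539", "lat": "37.4419"},
    {"sceneName": "Scene without coordinates", "lon": "", "lat": "37.0"},
]
markers = rows_to_markers(rows)
```

Each dict could then be passed on as `folium.Marker(**marker).add_to(some_map)`. Note that folium expects latitude first, then longitude.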

The users of the application

For whom is the application intended? (people, machines, mobile users). How does the application satisfy a need of the users? (100-200 words)

A description of multiple complex SPARQL queries, relevant for the application, that produce results over the integrated data and ontology (200-300 words)

A description of the methodology that is used in the construction of the ontology (100-200 words)

Querying data from Wikidata

While trying to create a Wikidata query to find the coordinates of locations based on user input, I ran into multiple issues. The main issue is that not all cities are actually classified as cities. Querying for a more general term like "location" also doesn't work, since the Wikidata endpoint does not use transitivity / inferencing.

The current query I am using is the one below, where 'Amsterdam' in filter(?cityLabel = 'Amsterdam'@en) represents the user input of a location.

Endpoint = https://query.wikidata.org/bigdata/namespace/wdq/sparql

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?cityLabel ?coordinates WHERE {   
  ?city wdt:P31 wd:Q515;
        rdfs:label ?cityLabel.
  filter(LANG(?cityLabel) = 'en').
  filter(?cityLabel = 'Amsterdam'@en).
  ?city wdt:P625 ?coordinates
}LIMIT 10
#wd:Q17334923 = location
  #wd:Q2221906 = geographic location
    #wd:Q486972 = Human Settlement
      #wd:Q515 = City
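
One possible workaround, sketched here and not tested against the live endpoint, is a SPARQL property path: `wdt:P31/wdt:P279*` follows "instance of" and then any number of "subclass of" steps, so the class hierarchy listed above is traversed inside the query itself without relying on inferencing. The `build_settlement_query` and `parse_wkt_point` helpers are hypothetical. The sketch also assumes that coordinates come back as WKT text of the form `Point(longitude latitude)`.

```python
import re


def build_settlement_query(place_name, limit=10):
    """Build a query for places that are (a subclass of) human settlement."""
    # Escape backslashes and quotes so user input cannot break out of the literal
    safe = place_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?place ?coordinates WHERE {{
  ?place rdfs:label "{safe}"@en;
         wdt:P31/wdt:P279* wd:Q486972;
         wdt:P625 ?coordinates.
}} LIMIT {int(limit)}
"""


_POINT_RE = re.compile(
    r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$", re.IGNORECASE
)


def parse_wkt_point(wkt):
    """Parse 'Point(lon lat)' into a (lat, lon) tuple of floats."""
    match = _POINT_RE.match(wkt.strip())
    if match is None:
        raise ValueError(f"Not a simple WKT point: {wkt!r}")
    lon, lat = (float(group) for group in match.groups())
    return lat, lon


query = build_settlement_query("Amsterdam")
lat_lon = parse_wkt_point("Point(4.9 52.37)")
```

Matching the label directly in the triple pattern replaces both filters from the query above. Whether Q486972 is the right level of the hierarchy for every location users might enter would still need checking.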

A description of at least 2 external sources of data that will be used by your application (i.e. for creating the instances).

Motivate your choice of these 2 sources of data. At least one of these must be an external SPARQL endpoint. The other dataset should not be in RDF. (100-200 words).

Fetch_nominatim_data: Create script to get latlong from Nominatim for 4 clients with 1/4 datasets
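
A minimal sketch of the splitting step, assuming the locations are available as a simple list; the `split_into_chunks` helper and the sample names are hypothetical. The geocoding requests themselves are left out. The public Nominatim service asks for at most one request per second, so each client would still need to pace its own requests.

```python
def split_into_chunks(items, n_chunks=4):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical location names, for illustration only
locations = [f"location {i}" for i in range(10)]
chunks = split_into_chunks(locations, n_chunks=4)
```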

A conceptualization of the domain (concepts, relations) described, discussed and depicted in a drawing. (200-300 words)

The conceptualization should encompass more than 15 classes and at least 5 properties

Start creating rules for import to Protege

Filter data by new locations

Scrape show locations to CSV

Streamlit has been updated and brings an amazing feature

So I was just collecting the packages we depend on and checking their versions on my local system. I decided to update Streamlit and noticed it received an update on the 8th of this month.

It's now updated to version 0.68 and brings a very useful feature: new layout options! Read more about it in their blog post. I think the timing couldn't have been better. This can be extremely useful for making our app even more beautiful. Thus far we've been restricted by a vertical layout only.

So everyone make sure to run pip install -U streamlit. I'll create a requirements.txt to make sure we're all running the same version of other dependencies.

Combine milestone 2 with a revised version of milestone 1

Spotify embed requires Spotify Premium

https://developer.spotify.com/documentation/web-playback-sdk/quick-start/

Never mind, an embedded component does work. No Spotify SDK needed.

Describe the inferences (100-500 words)

The ontology should produce meaningful inferences that are essential for the application. This should be evidenced by screenshots of, e.g., Protégé reasoning results. (NB: For the final report: inferences should be on the external data)

Problem discussion: Streamlit max 100 text entries

But apparently resolved: streamlit/streamlit#1059 (comment)

Might still be clunky for 160,000 items though.

Initial UI with Streamlit

Determine taxonomy

See step 4 in Ontology Engineering.

The design of the application

The design of the application: what does it look like, how does it present information to users (what views of the data are presented), and why does that make sense? For instance, showing restaurants on a map may be more intuitive than showing a list (maybe show both?). (150-300 words)

Add data fixes to convert_data.py

Special chars in director_map
Naming conflict with genre Short and titleType
titleTypes missing capitalization