Assignments for the Information Integration on the Web course, Fall 2016
iiw_assignments's Introduction
HW2 : Extract key information from unstructured text in webpages.
Train a Conditional Random Field (CRF) classifier to extract the desired information.
Apply the classifier to a large amount of unstructured text to extract that information.
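A CRF labels each token in a sequence, so the text first has to be turned into per-token features and labels. The sketch below shows one common way to do that with the standard library only. The feature names, the BIO tagging scheme and the `PRICE` tag are illustrative assumptions, not the setup used in the homework; the resulting lists would be passed to whichever CRF library is chosen.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features: neighbouring tokens, or sentence-boundary markers.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def bio_labels(tokens, span_start, span_end, tag):
    """BIO labels for one annotated span [span_start, span_end)."""
    labels = []
    for i in range(len(tokens)):
        if i == span_start:
            labels.append("B-" + tag)
        elif span_start < i < span_end:
            labels.append("I-" + tag)
        else:
            labels.append("O")
    return labels


tokens = ["Price", ":", "1200", "USD"]
X = sentence_features(tokens)           # one feature dict per token
y = bio_labels(tokens, 2, 4, "PRICE")   # tokens 2..3 form the PRICE span
```

Keeping feature extraction separate from training makes it easy to reuse the same function when the trained classifier is later applied to new text.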
HW3 : Construct a wrapper to extract data from semi-structured sources.
Identify a data source and the data to extract. Choose a website and scrape at least 500 webpages, then identify at least 5 fields to extract from them.
Construct a wrapper using any existing tools available. Extract at least 5 fields of data from all the webpages, and save them to a file in JSON format.
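As a rough illustration of what a wrapper does, the sketch below pulls five fields out of one page and serialises them as JSON, using only Python's standard `html.parser`. The page markup, the CSS class names and the field names are all made up for the example; a real wrapper would use rules learned from, or written for, the chosen website, and text inside nested tags would need more careful handling than this.

```python
import json
from html.parser import HTMLParser


class FieldWrapper(HTMLParser):
    """Collect the text of elements whose class maps to a field name."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field
        self.record = {}
        self._current = None  # field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.class_to_field:
                self._current = self.class_to_field[cls]
                self.record.setdefault(self._current, "")

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current capture.
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data


def extract_record(html, class_to_field):
    parser = FieldWrapper(class_to_field)
    parser.feed(html)
    parser.close()
    return {field: text.strip() for field, text in parser.record.items()}


# Hypothetical page and field mapping (CSS class -> output field).
PAGE = (
    '<div><h1 class="title">Blue Lamp</h1>'
    '<span class="price"> $19.99 </span>'
    '<span class="brand">Acme</span>'
    '<span class="rating">4.5</span>'
    '<span class="stock">In stock</span></div>'
)
FIELDS = {"title": "title", "price": "price", "brand": "brand",
          "rating": "rating", "stock": "availability"}

record = extract_record(PAGE, FIELDS)
as_json = json.dumps(record, indent=2)
```

Running `extract_record` over every scraped page and writing the resulting list with `json.dump` would produce the required output file.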
HW4 : Write SPARQL queries to retrieve data from DBpedia and become familiar with RDF/S data.
Write queries to retrieve data on entities of the Artist class and its subclasses (http://dbpedia.org/resource/Artist) in DBpedia, according to the specifications in the homework description hw4-sparql.pdf.
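For orientation, here is a minimal sketch of a query over a class and its subclasses, together with the request URL for the public DBpedia endpoint. It is not one of the homework queries. It assumes the class IRI is `http://dbpedia.org/ontology/Artist` (the `dbo:` namespace), which differs from the resource URL quoted above, and it assumes the endpoint accepts a `format` parameter for JSON results; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def artist_query(limit=10):
    """SPARQL query for instances of dbo:Artist or any of its subclasses.

    Assumption: the Artist class lives in the dbo: (ontology) namespace.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artist ?name WHERE {{
  ?cls rdfs:subClassOf* dbo:Artist .
  ?artist a ?cls ;
          rdfs:label ?name .
  FILTER (lang(?name) = "en")
}}
LIMIT {int(limit)}
""".strip()


def request_url(query):
    """Build a GET URL for the endpoint; the 'format' parameter is an assumption."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{ENDPOINT}?{urlencode(params)}"


query = artist_query(limit=5)
url = request_url(query)
```

The property path `rdfs:subClassOf*` matches the class itself as well as every class below it, which is what "the Artist class and its subclasses" calls for.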
HW5 : Become familiar with ontology management using Protégé, and understand description logics and the services a reasoner provides.
Provide solutions for all 10 problems in hw5-owl.pdf.
HW6 : Perform data cleaning on the data you have extracted in a previous homework (e.g. for wrappers) or for your project.
Identify at least 3 fields that require cleaning in your data.
For each field chosen, describe what it is and what operation was performed to clean it.
Include an example of a record from each chosen field before and after cleaning in your report.
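The sketch below shows what three such cleaning operations might look like, with a before/after record at the end. The field names (`price`, `date`, `name`), the accepted date formats and the sample values are assumptions chosen for illustration, not the fields cleaned in the actual homework.

```python
import re
from datetime import datetime


def clean_price(raw):
    """Strip currency symbols and thousands separators; None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    return float(match.group().replace(",", "")) if match else None


def clean_date(raw, formats=("%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d")):
    """Normalise a date to ISO 8601; None if no assumed format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_name(raw):
    """Collapse runs of whitespace and normalise capitalisation."""
    return " ".join(raw.split()).title()


# Hypothetical record before and after cleaning.
before = {"name": "  blue   LAMP ", "price": " $1,299.00 ", "date": "March 5, 2016"}
after = {
    "name": clean_name(before["name"]),
    "price": clean_price(before["price"]),
    "date": clean_date(before["date"]),
}
```

Returning `None` for values that cannot be parsed keeps bad input visible in the output instead of silently guessing a value.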
HW7 : Find matching records among two datasets by record linkage, and evaluate the performance of the methodology.
Choose two datasets that contain records that can be matched. Each dataset should contain at least 100 records. Manually identify at least 20 matching records across the datasets.
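One simple way to approach this is to score string similarity between records, link the best-scoring pair above a threshold, and compare the links against the manually identified matches. The sketch below does this with the standard library's `difflib`. The similarity measure, the 0.85 threshold, the record layout and the sample data are all assumptions for illustration, not the methodology used in the homework.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_records(left, right, field, threshold=0.85):
    """Link each left record to its most similar right record, if similar enough."""
    links = set()
    for l_rec in left:
        best = max(right, key=lambda r_rec: similarity(l_rec[field], r_rec[field]))
        if similarity(l_rec[field], best[field]) >= threshold:
            links.add((l_rec["id"], best["id"]))
    return links


def evaluate(predicted, gold):
    """Precision, recall and F1 of predicted links against manually labelled links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy datasets and manually identified matches.
left = [{"id": "a1", "name": "Blue Lamp "}, {"id": "a2", "name": "Oak Table"}]
right = [{"id": "b1", "name": "blue lamp"}, {"id": "b2", "name": "Steel Chair"}]
gold = {("a1", "b1"), ("a2", "b2")}

links = link_records(left, right, "name")
precision, recall, f1 = evaluate(links, gold)
```

Comparing every left record against every right record is quadratic in the dataset size, which is acceptable for a few hundred records; larger datasets would normally add a blocking step first.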