datamart's People

Contributors

bhavyagulati, ckxz105, cybergla, jcarlosroldan, kyao, linqyd, lituta, rqs, sukritisharma, xkgoodbest

datamart's Issues

Break down WorldBank data into chunks for materialization

Materializing WorldBank data for all time across all countries takes too long to finish; it cannot be materialized within an acceptable time.

  1. Consider breaking it down per country. We would then have one dataset covering the full time range for a single country, each with its own metadata. There are already around 16,000 datasets, so this would grow the number of datasets (metadata entries) to 16,000 × (#countries). And we need to en…
  2. Otherwise, since the dataset is too big to materialize and profile, we can put a fixed set of countries in the country column as named_entity. For materialization, though, we would need to query with a list of named_entities (countries).

We can discuss the right way of dealing with WorldBank here.
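
A rough sketch of option 1; the metadata layout (a materialization.arguments dict holding the country) is an assumption for illustration, not the repo's actual schema:

import copy

def split_metadata_by_country(base_metadata: dict, countries: list) -> list:
    """Clone one WorldBank indicator's metadata into one entry per country."""
    per_country = []
    for country in countries:
        m = copy.deepcopy(base_metadata)
        # hypothetical layout: the country restriction lives in materialization.arguments
        m.setdefault('materialization', {}).setdefault('arguments', {})['country'] = country
        per_country.append(m)
    return per_country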

Wikidata

@linqyd Please update the Wikidata progress and todos here, and communicate with @Lituta about this.

Write script to create properties to represent datasets in Wikidata

We need to create the following properties to represent datasets in Wikidata:


C2001:
  label: datamart identifier
  description: identifier of a dataset in the Datamart system
  datatype: MonolingualText
  P31: Q19847637
  P1629: Q1172284

C2004:
  label: keywords
  description: keywords associated with an item to facilitate finding the item using text search
  datatype: StringValue
  P31: Q18616576

C2005:
  label: variable measured
  description: the variables measured in a dataset
  datatype: StringValue
  P31: Q18616576
  P1628: http://schema.org/variableMeasured

C2006:
  label: values
  description: the values of a variable represented as a text document
  datatype: StringValue
  P31: Q18616576

C2007:
  label: data type
  description: the data type used to represent the values of a variable, integer (Q729138), Boolean (Q520777), Real (Q4385701), String (Q184754), Categorical (Q2285707)
  datatype: Item
  P31: Q18616576

C2008:
  label: semantic type
  description: a URL that identifies the semantic type of a variable in a dataset
  datatype: URL
  P31: Q18616576
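
A minimal sketch of such a script, assuming a Wikibase API endpoint for our instance (the URL below is hypothetical) and an already-authenticated requests session holding a CSRF edit token. It only creates the label, description, and datatype; the P31/P1629 statements would be added afterwards as claims:

import json
import requests

API_URL = 'https://wikibase.example.org/w/api.php'  # hypothetical endpoint

def create_property(session: requests.Session, token: str,
                    label: str, description: str, datatype: str) -> dict:
    """Create one property via the Wikibase API (action=wbeditentity)."""
    entity = {
        'labels': {'en': {'language': 'en', 'value': label}},
        'descriptions': {'en': {'language': 'en', 'value': description}},
        'datatype': datatype,
    }
    res = session.post(API_URL, data={
        'action': 'wbeditentity',
        'new': 'property',
        'format': 'json',
        'token': token,
        'data': json.dumps(entity),
    })
    res.raise_for_status()
    return res.json()

# e.g. C2001 above (Wikibase datatype names are lowercase):
# create_property(session, token, 'datamart identifier',
#                 'identifier of a dataset in the Datamart system', 'monolingualtext')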

generate_tradingeconomics_market_schema.py errors out

Error message

Generating schema for Trading economics HYPE3:BS
Traceback (most recent call last):
  File "generate_tradingeconomics_market_schema.py", line 123, in <module>
    generate_json_schema(args.dst_path)
  File "generate_tradingeconomics_market_schema.py", line 48, in generate_json_schema
    data = res_indicator.json()
  File "/anaconda3/envs/datamart_env/lib/python3.6/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/anaconda3/envs/datamart_env/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/anaconda3/envs/datamart_env/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/anaconda3/envs/datamart_env/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
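
A likely cause (an assumption, not confirmed): the Trading Economics API returned an empty or non-JSON body (e.g. a rate-limit or HTML error page), so res_indicator.json() failed on the very first character. A defensive pattern generate_json_schema could use:

import requests

def fetch_json(url: str) -> dict:
    """Fetch a URL and fail loudly when the body is not JSON."""
    res = requests.get(url)
    res.raise_for_status()  # surface HTTP errors (403, 429, ...) instead of parsing their bodies
    try:
        return res.json()
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        raise RuntimeError('Non-JSON response from %s: %r' % (url, res.text[:200]))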

Trading Economics Materializer

For the Trading Economics dataset schemas, the arguments objects under the materialization objects are always empty. All the information the materializer needs to retrieve the dataset should be in the arguments section; the materializer should not use information from elsewhere in the dataset schema.

Trading Economics market data needs a different materializer from the original tradingeconomics_materializer

It looks to me like Trading Economics market data can only be queried by start and end date; no location restriction is allowed. So the general tradingeconomics_materializer does not fit as-is.

You can make the change by one of the following:

  1. Since your schema for Trading Economics market data already points to tradingeconomics_materializer, you can modify tradingeconomics_materializer to check whether the metadata describes market or indicator data (by putting a parameter in materialization.arguments in the schema JSON). Then, in tradingeconomics_materializer.py, treat the two cases differently when forming the query URL (see the sketch after this list).

  2. Create a separate tradingeconomics_market_materializer.py and generate new schema JSON whose materialization.python_path points to this new materializer.
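
A minimal sketch of option 1; the 'category' flag, the argument names, and the exact URL shapes are assumptions, not the repo's or the Trading Economics API's confirmed formats:

def build_query_url(metadata: dict) -> str:
    """Option 1: branch on a flag stored in materialization.arguments."""
    args = metadata['materialization'].get('arguments', {})
    if args.get('category') == 'market':
        # market data: only a start/end date can be specified
        return ('https://api.tradingeconomics.com/markets/historical/{symbol}'
                '?d1={start}&d2={end}').format(**args)
    # indicator data: country-restricted query, as the materializer forms today
    return ('https://api.tradingeconomics.com/historical/country/{country}'
            '/indicator/{indicator}/{start}/{end}').format(**args)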

User upload dataset API

Implement an API through which users can provide a description of an online dataset, with the URL of the real data under 'materialization', so that Datamart can index the user-provided dataset.

Need to implement the following parsers to support different dataset types (a dispatch sketch follows the list):
https://github.com/usc-isi-i2/datamart/tree/development/datamart/materializers/parsers

  • csv

  • html

  • json

  • excel
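
A rough dispatch sketch for these parsers; it uses plain pandas readers as stand-ins, since the actual parser classes in the linked directory are not shown here:

import pandas as pd

def parse_upload(url: str, file_type: str) -> pd.DataFrame:
    """Pick a reader from the declared file type; return a DataFrame to profile and index."""
    if file_type == 'csv':
        return pd.read_csv(url)
    if file_type == 'json':
        return pd.read_json(url)
    if file_type == 'excel':
        return pd.read_excel(url)
    if file_type == 'html':
        return pd.read_html(url)[0]  # first table on the page
    raise ValueError('unsupported file type: %s' % file_type)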

----- bulk load -----

  • Implement link detection on HTML pages and try to index the data one by one.
  • Support compressed files

Conda environment checksums

The environment.yml includes build strings (checksums) for specific package versions, which are probably the OSX builds. Removing the checksums fixes installation on Ubuntu 16.04 and Windows 7.
For instance, changing libffi=3.2.1=h475c297_4 to libffi=3.2.1 allows me to install it.
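
A throwaway sketch for stripping the build strings in place; it assumes every pin follows the name=version=build format shown above (back up the file first):

import re

with open('environment.yml') as f:
    text = f.read()

# turn pins like "libffi=3.2.1=h475c297_4" into "libffi=3.2.1";
# lines with a single "=" (no build string) are left untouched
text = re.sub(r'^(\s*- [\w.-]+=[^=\s]+)=\S+$', r'\1', text, flags=re.MULTILINE)

with open('environment.yml', 'w') as f:
    f.write(text)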

Wikitables

I merged pull request #33 from @juancroldan.

The imports are messed up and some packages are missing. Please double-check the imports and add any missing dependencies, with versions, to environment.yml and requirements.txt.

NOAA is not able to return more than one year of data with a single query

NOAA is not able to return more than one year of data with a single query, and issuing multiple queries to return the dataset for all years by default is not feasible either (the result is too big).

In this case, if users want to join their datasets with NOAA, we need a time range for querying the data.

In noaa_materializer we need to form one query per year when the input time range spans more than a year (see the sketch below); by default, it could return just the last year.

But still, how are we going to get the time range? As another user input from the UI?

Or break NOAA data down by year. But then, with the current system, we may not be able to find the dataset for a join; we would need to get the cluster metadata working for multi-year data.
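
A minimal sketch of the per-year splitting, assuming ISO start/end dates; the helper name is made up, and the startdate/enddate parameters are those of the CDO v2 API used in the NOAA materializer below:

from datetime import date

def yearly_ranges(start: date, end: date):
    """Yield (startdate, enddate) ISO pairs, each spanning at most one calendar year."""
    cur = start
    while cur <= end:
        year_end = min(date(cur.year, 12, 31), end)
        yield cur.isoformat(), year_end.isoformat()
        cur = date(cur.year + 1, 1, 1)

# list(yearly_ranges(date(2015, 6, 1), date(2017, 3, 1)))
# -> [('2015-06-01', '2015-12-31'), ('2016-01-01', '2016-12-31'),
#     ('2017-01-01', '2017-03-01')]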

wikitable dependencies/env

When running the unit tests I hit the following error:

ERROR: test_get (datamart.unit_tests.test_wikitables_materializer.TestWikitablesMaterializer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nas/home/dongyul/datamart_work/datamart/datamart/utilities/utils.py", line 156, in __decorator
    func(self)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/unit_tests/test_wikitables_materializer.py", line 25, in test_get
    result = self.wikitables_materializer.get(metadata=mock_metadata).to_dict(orient="records")
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_materializer.py", line 31, in get
    tab = tables(article=args['url'], lang=lang, store_result=False, xpath=args['xpath'])
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/wikitables.py", line 43, in tables
    document = cache(get_with_render, (url, SELECTOR_ROOT), identifier=url)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 163, in cache
    res = target(*args)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 219, in get_with_render
    driver = get_driver(headless, disable_images, open_links_same_tab)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 204, in get_driver
    _driver = Firefox(options=opts)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 164, in __init__
    self.service.start()
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

After manually downloading geckodriver and specifying its path, I got:

ERROR: test_get (datamart.unit_tests.test_wikitables_materializer.TestWikitablesMaterializer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/nas/home/dongyul/datamart_work/datamart/datamart/utilities/utils.py", line 156, in __decorator
    func(self)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/unit_tests/test_wikitables_materializer.py", line 25, in test_get
    result = self.wikitables_materializer.get(metadata=mock_metadata).to_dict(orient="records")
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_materializer.py", line 31, in get
    tab = tables(article=args['url'], lang=lang, store_result=False, xpath=args['xpath'])
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/wikitables.py", line 43, in tables
    document = cache(get_with_render, (url, SELECTOR_ROOT), identifier=url)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 163, in cache
    res = target(*args)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 219, in get_with_render
    driver = get_driver(headless, disable_images, open_links_same_tab)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 204, in get_driver
    _driver = Firefox(options=opts)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    keep_alive=True)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

This happens both on macOS and CentOS 7.
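
The second error ("Unable to find a matching set of capabilities") usually points to a version mismatch between Firefox, geckodriver, and Selenium, so aligning those three versions is the first thing to check. For the first error, a workaround sketch (the driver path is a placeholder; the headless flag mirrors the option get_driver passes above and needs Selenium 3.8+):

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.headless = True  # keep Firefox headless, matching utils.get_driver above
driver = Firefox(executable_path='/path/to/geckodriver', options=opts)  # placeholder path
driver.get('https://en.wikipedia.org')
driver.quit()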

Decide to break down open API data into small components

Decided to break down API data into small components, e.g., one location and one year for one indicator, for NOAA, WorldBank, and so on.

We need to update the scripts for generating dataset schemas and the materializers, then use the cluster to manage calling the materializers: stacking datasets, filtering, and so on (see the sketch below).

This may produce too many dataset schemas for indexing. We will see.
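
A speculative sketch of what the breakdown could produce; the schema fields and python_path value are assumptions based on the materialization discussion in the other issues:

from itertools import product

def enumerate_schemas(indicators, locations, years):
    """One dataset schema per (indicator, location, year) combination."""
    for indicator, location, year in product(indicators, locations, years):
        yield {
            'title': '{} / {} / {}'.format(indicator, location, year),
            'materialization': {
                'python_path': 'noaa_materializer',  # or the WorldBank one, etc.
                'arguments': {'indicator': indicator,
                              'location': location,
                              'year': year},
            },
        }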

TODO for NOAA materializer

  1. locationid and datatypeid are hard-coded if data_range is not provided.

    api = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&datatypeid=TAVG&locationid=CITY:US060013&startdate=' \

    Right now, if data_range is None, it will always return the same dataset.

  2. Change data_range to date_range or time_range; the name data_range is misleading.

  3. location should perhaps be a list of locations, defaulting to 'los angeles' (see the sketch after this list).

    def fetch_data(self, data_range=None, location: str = 'los angeles', datatype: str = 'TAVG'):

    That way, one can query multiple locations later on.

There are new changes; please pull the latest into your fork before implementing.
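
A sketch of items 2 and 3 combined; the helper name and return type are made up, and only the base URL and parameter names come from the snippet above:

from typing import List, Optional, Tuple

API = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND'

def build_queries(date_range: Optional[Tuple[str, str]] = None,
                  locations: Optional[List[str]] = None,
                  datatype: str = 'TAVG') -> List[str]:
    """One CDO query URL per location; 'los angeles' default kept from the current signature."""
    locations = locations or ['los angeles']
    urls = []
    for location in locations:
        # the location string would still need mapping to a CDO locationid such as CITY:US060013
        url = '{}&datatypeid={}&locationid={}'.format(API, datatype, location)
        if date_range is not None:
            url += '&startdate={}&enddate={}'.format(*date_range)
        urls.append(url)
    return urls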

Create example dataset in our Blazegraph

The dataset would be modeled roughly like the following:

D1000001:
  label: Table 10 ...
  description: ...
  P31: Q1172284
  P2699: http://...
  C2001: D1000001
  C2004: "whatever text we can put here to match the keywords in the query"
  C2005: County # variable measured
    C2007: string # data type
    C2008: http://schema.org/city #semantic type
    C2006: "Autauga Baldwin ..." # text
    P1545: 0 # column index
  C2005: Violent Crime
    C2007: integer
    C2008: ??? # semantic type always required?
    C2006: ??? # maybe for numeric we don't store the values
    P1545: 1
  C2005: County_wikidata_0
    C2007: string
    C2008: http://wikidata.org/Q???
    C2006: "Q1234, Q2345, ..."
    P1545: 9
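
A minimal rdflib sketch of loading a few of these triples into Blazegraph; the C-property namespace and entity URIs are hypothetical, and the per-variable qualifiers (C2007, C2008, P1545, ...) would need statement nodes that this sketch omits:

from rdflib import Graph, Literal, Namespace, URIRef

WD = Namespace('http://www.wikidata.org/entity/')
WDT = Namespace('http://www.wikidata.org/prop/direct/')
DM = Namespace('http://datamart.example.org/prop/')  # hypothetical namespace for the C-properties

g = Graph()
d = URIRef('http://datamart.example.org/entity/D1000001')  # hypothetical entity URI
g.add((d, WDT.P31, WD.Q1172284))           # instance of: data set
g.add((d, DM.C2001, Literal('D1000001')))  # datamart identifier
g.add((d, DM.C2004, Literal('whatever text we can put here to match the keywords in the query')))
print(g.serialize(format='turtle'))        # Turtle ready to load into Blazegraph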

Downloader needs to be updated

Migrate downloader here: https://github.com/usc-isi-i2/datamart/tree/development/datamart/materializers/tradingeconomics_downloader

The materialization method has the following problem:

  1. Reading the downloaded CSV file throws an error.
    The CSV reading error is fixed by changing the encoding to 'utf-16', df6754b.

Focus on the following improvements:

  1. If possible, add a flag saying whether we want to save the metadata and CSV file. In Datamart we just want a DataFrame; saving it and reading it back seems overly complex. Can we just query and return the result instead of saving it (see the sketch below)?

  2. Add some unit tests for tradingeconomics_materializer.
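
A sketch of improvement 1 (the function name and flag are assumptions): query, build the DataFrame, and only touch disk when explicitly asked:

import pandas as pd

def materialize(url: str, save_path: str = None) -> pd.DataFrame:
    """Return the DataFrame directly; persist it only when a path is given."""
    df = pd.read_csv(url, encoding='utf-16')  # encoding fix from df6754b
    if save_path is not None:                 # optional persistence flag
        df.to_csv(save_path, index=False)
    return df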
