datamart's People

Contributors

bhavyagulati, ckxz105, cybergla, jcarlosroldan, kyao, linqyd, lituta, rqs, sukritisharma, xkgoodbest

datamart's Issues

Break down WorldBank data into chunks for materialization

Materializing WorldBank data for all time across all countries takes too long to finish; it cannot be materialized within an acceptable time.

  1. Consider breaking it down per country. We would then have one dataset covering the full time range for a single country, each with its own metadata. There are already around 16,000 datasets, so this would grow the number of datasets (metadata entries) to 16,000 × (#countries). And we need to en…
  2. Otherwise, since the dataset is too big to materialize and profile, we can put a fixed set of countries in the country column as named_entity. For materialization, though, we would need to query with a list of named_entities (countries).

We can discuss the right way of dealing with WorldBank here.
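
A rough sketch of option 1; the metadata layout (a materialization.arguments dict holding the country) is an assumption for illustration, not the repo's actual schema:

import copy

def split_metadata_by_country(base_metadata: dict, countries: list) -> list:
    """Clone one WorldBank indicator's metadata into one entry per country."""
    per_country = []
    for country in countries:
        m = copy.deepcopy(base_metadata)
        # hypothetical layout: the country restriction lives in materialization.arguments
        m.setdefault('materialization', {}).setdefault('arguments', {})['country'] = country
        per_country.append(m)
    return per_country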

Wikidata

@linqyd Please update the Wikidata progress and todos here, and communicate with @Lituta about this.

Write script to create properties to represent datasets in Wikidata

We need to create the following properties to represent datasets in Wikidata:


C2001:
  label: datamart identifier
  description: identifier of a dataset in the Datamart system
  datatype: MonolingualText
  P31: Q19847637
  P1629: Q1172284

C2004:
  label: keywords
  description: keywords associated with an item to facilitate finding the item using text search
  datatype: StringValue
  P31: Q18616576

C2005:
  label: variable measured
  description: the variables measured in a dataset
  datatype: StringValue
  P31: Q18616576
  P1628: http://schema.org/variableMeasured

C2006:
  label: values
  description: the values of a variable represented as a text document
  datatype: StringValue
  P31: Q18616576

C2007:
  label: data type
  description: the data type used to represent the values of a variable, integer (Q729138), Boolean (Q520777), Real (Q4385701), String (Q184754), Categorical (Q2285707)
  datatype: Item
  P31: Q18616576

C2008:
  label: semantic type
  description: a URL that identifies the semantic type of a variable in a dataset
  datatype: URL
  P31: Q18616576
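
A minimal sketch of such a script, assuming a Wikibase API endpoint for our instance (the URL below is hypothetical) and an already-authenticated requests session holding a CSRF edit token. It only creates the label, description, and datatype; the P31/P1629 statements would be added afterwards as claims:

import json
import requests

API_URL = 'https://wikibase.example.org/w/api.php'  # hypothetical endpoint

def create_property(session: requests.Session, token: str,
                    label: str, description: str, datatype: str) -> dict:
    """Create one property via the Wikibase API (action=wbeditentity)."""
    entity = {
        'labels': {'en': {'language': 'en', 'value': label}},
        'descriptions': {'en': {'language': 'en', 'value': description}},
        'datatype': datatype,
    }
    res = session.post(API_URL, data={
        'action': 'wbeditentity',
        'new': 'property',
        'format': 'json',
        'token': token,
        'data': json.dumps(entity),
    })
    res.raise_for_status()
    return res.json()

# e.g. C2001 above (Wikibase datatype names are lowercase):
# create_property(session, token, 'datamart identifier',
#                 'identifier of a dataset in the Datamart system', 'monolingualtext')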

generate_tradingeconomics_market_schema.py errors out

Error message

Generating schema for Trading economics HYPE3:BS
Traceback (most recent call last):
  File "generate_tradingeconomics_market_schema.py", line 123, in <module>
    generate_json_schema(args.dst_path)
  File "generate_tradingeconomics_market_schema.py", line 48, in generate_json_schema
    data = res_indicator.json()
  File "/anaconda3/envs/datamart_env/lib/python3.6/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/anaconda3/envs/datamart_env/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/anaconda3/envs/datamart_env/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/anaconda3/envs/datamart_env/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
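
A likely cause (an assumption, not confirmed): the Trading Economics API returned an empty or non-JSON body (e.g. a rate-limit or HTML error page), so res_indicator.json() failed on the very first character. A defensive pattern generate_json_schema could use:

import requests

def fetch_json(url: str) -> dict:
    """Fetch a URL and fail loudly when the body is not JSON."""
    res = requests.get(url)
    res.raise_for_status()  # surface HTTP errors (403, 429, ...) instead of parsing their bodies
    try:
        return res.json()
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        raise RuntimeError('Non-JSON response from %s: %r' % (url, res.text[:200]))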

Trading Economics Materializer

For the Trading Economics dataset schemas, the arguments objects under the materialization objects are always empty. All the information the materializer needs to retrieve the dataset should be in the arguments section; the materializer should not use information from elsewhere in the dataset schema.

Trading Economics market data needs a different materializer from the original tradingeconomics_materializer

It looks to me like Trading Economics market data can only be queried by start and end date; no location restriction is allowed. So the general tradingeconomics_materializer does not fit as-is.

You can make the change by one of the following:

  1. Since your schema for Trading Economics market data already points to tradingeconomics_materializer, you can modify tradingeconomics_materializer to check whether the metadata describes market or indicator data (by putting a parameter in materialization.arguments in the schema JSON). Then, in tradingeconomics_materializer.py, treat the two cases differently when forming the query URL (see the sketch after this list).

  2. Create a separate tradingeconomics_market_materializer.py and generate new schema JSON whose materialization.python_path points to this new materializer.
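
A minimal sketch of option 1; the 'category' flag, the argument names, and the exact URL shapes are assumptions, not the repo's or the Trading Economics API's confirmed formats:

def build_query_url(metadata: dict) -> str:
    """Option 1: branch on a flag stored in materialization.arguments."""
    args = metadata['materialization'].get('arguments', {})
    if args.get('category') == 'market':
        # market data: only a start/end date can be specified
        return ('https://api.tradingeconomics.com/markets/historical/{symbol}'
                '?d1={start}&d2={end}').format(**args)
    # indicator data: country-restricted query, as the materializer forms today
    return ('https://api.tradingeconomics.com/historical/country/{country}'
            '/indicator/{indicator}/{start}/{end}').format(**args)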

User upload dataset API

Implement an API through which users can provide a description of an online dataset, with the URL of the real data under 'materialization', so that Datamart can index the user-provided dataset.

Need to implement the following parsers to support different dataset types (a dispatch sketch follows the list):
https://github.com/usc-isi-i2/datamart/tree/development/datamart/materializers/parsers

  • csv

  • html

  • json

  • excel
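
A rough dispatch sketch for these parsers; it uses plain pandas readers as stand-ins, since the actual parser classes in the linked directory are not shown here:

import pandas as pd

def parse_upload(url: str, file_type: str) -> pd.DataFrame:
    """Pick a reader from the declared file type; return a DataFrame to profile and index."""
    if file_type == 'csv':
        return pd.read_csv(url)
    if file_type == 'json':
        return pd.read_json(url)
    if file_type == 'excel':
        return pd.read_excel(url)
    if file_type == 'html':
        return pd.read_html(url)[0]  # first table on the page
    raise ValueError('unsupported file type: %s' % file_type)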

----- bulk load -----

  • Implement link detection on HTML pages and try to index the data one by one.
  • Support compressed files

Conda environment checksums

The environment.yml includes build strings (checksums) for specific package versions, which are probably the OSX builds. Removing the checksums fixes installation on Ubuntu 16.04 and Windows 7.
For instance, changing libffi=3.2.1=h475c297_4 to libffi=3.2.1 allows me to install it.
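
A throwaway sketch for stripping the build strings in place; it assumes every pin follows the name=version=build format shown above (back up the file first):

import re

with open('environment.yml') as f:
    text = f.read()

# turn pins like "libffi=3.2.1=h475c297_4" into "libffi=3.2.1";
# lines with a single "=" (no build string) are left untouched
text = re.sub(r'^(\s*- [\w.-]+=[^=\s]+)=\S+$', r'\1', text, flags=re.MULTILINE)

with open('environment.yml', 'w') as f:
    f.write(text)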

Wikitables

I merged pull request #33 from @juancroldan.

The imports are messed up and some packages are missing. Please double-check the imports and add any missing dependencies, with versions, to environment.yml and requirements.txt.

NOAA is not able to return more than one year of data with a single query

NOAA is not able to return more than one year of data with a single query, and issuing multiple queries to return the dataset for all years by default is not feasible either (the result is too big).

In this case, if users want to join their datasets with NOAA, we need a time range for querying the data.

In noaa_materializer we need to form one query per year when the input time range spans more than a year (see the sketch below); by default, it could return just the last year.

But still, how are we going to get the time range? As another user input from the UI?

Or break NOAA data down by year. But then, with the current system, we may not be able to find the dataset for a join; we would need to get the cluster metadata working for multi-year data.
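
A minimal sketch of the per-year splitting, assuming ISO start/end dates; the helper name is made up, and the startdate/enddate parameters are those of the CDO v2 API used in the NOAA materializer below:

from datetime import date

def yearly_ranges(start: date, end: date):
    """Yield (startdate, enddate) ISO pairs, each spanning at most one calendar year."""
    cur = start
    while cur <= end:
        year_end = min(date(cur.year, 12, 31), end)
        yield cur.isoformat(), year_end.isoformat()
        cur = date(cur.year + 1, 1, 1)

# list(yearly_ranges(date(2015, 6, 1), date(2017, 3, 1)))
# -> [('2015-06-01', '2015-12-31'), ('2016-01-01', '2016-12-31'),
#     ('2017-01-01', '2017-03-01')]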

wikitable dependencies/env

When running the unit tests I hit the following error:

ERROR: test_get (datamart.unit_tests.test_wikitables_materializer.TestWikitablesMaterializer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nas/home/dongyul/datamart_work/datamart/datamart/utilities/utils.py", line 156, in __decorator
    func(self)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/unit_tests/test_wikitables_materializer.py", line 25, in test_get
    result = self.wikitables_materializer.get(metadata=mock_metadata).to_dict(orient="records")
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_materializer.py", line 31, in get
    tab = tables(article=args['url'], lang=lang, store_result=False, xpath=args['xpath'])
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/wikitables.py", line 43, in tables
    document = cache(get_with_render, (url, SELECTOR_ROOT), identifier=url)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 163, in cache
    res = target(*args)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 219, in get_with_render
    driver = get_driver(headless, disable_images, open_links_same_tab)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 204, in get_driver
    _driver = Firefox(options=opts)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 164, in __init__
    self.service.start()
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

After manually downloading geckodriver and specifying its path, I got:

ERROR: test_get (datamart.unit_tests.test_wikitables_materializer.TestWikitablesMaterializer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/nas/home/dongyul/datamart_work/datamart/datamart/utilities/utils.py", line 156, in __decorator
    func(self)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/unit_tests/test_wikitables_materializer.py", line 25, in test_get
    result = self.wikitables_materializer.get(metadata=mock_metadata).to_dict(orient="records")
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_materializer.py", line 31, in get
    tab = tables(article=args['url'], lang=lang, store_result=False, xpath=args['xpath'])
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/wikitables.py", line 43, in tables
    document = cache(get_with_render, (url, SELECTOR_ROOT), identifier=url)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 163, in cache
    res = target(*args)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 219, in get_with_render
    driver = get_driver(headless, disable_images, open_links_same_tab)
  File "/nas/home/dongyul/datamart_work/datamart/datamart/materializers/wikitables_downloader/utils.py", line 204, in get_driver
    _driver = Firefox(options=opts)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    keep_alive=True)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/nas/home/dongyul/miniconda3/envs/datamart_env/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

This happens both on macOS and CentOS 7.
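
The second error ("Unable to find a matching set of capabilities") usually points to a version mismatch between Firefox, geckodriver, and Selenium, so aligning those three versions is the first thing to check. For the first error, a workaround sketch (the driver path is a placeholder; the headless flag mirrors the option get_driver passes above and needs Selenium 3.8+):

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.headless = True  # keep Firefox headless, matching utils.get_driver above
driver = Firefox(executable_path='/path/to/geckodriver', options=opts)  # placeholder path
driver.get('https://en.wikipedia.org')
driver.quit()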

Decide to break down open API data into small components

Decided to break down API data into small components, e.g., one location and one year for one indicator, for NOAA, WorldBank, and so on.

We need to update the scripts for generating dataset schemas and the materializers, then use the cluster to manage calling the materializers: stacking datasets, filtering, and so on (see the sketch below).

This may produce too many dataset schemas for indexing. We will see.
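
A speculative sketch of what the breakdown could produce; the schema fields and python_path value are assumptions based on the materialization discussion in the other issues:

from itertools import product

def enumerate_schemas(indicators, locations, years):
    """One dataset schema per (indicator, location, year) combination."""
    for indicator, location, year in product(indicators, locations, years):
        yield {
            'title': '{} / {} / {}'.format(indicator, location, year),
            'materialization': {
                'python_path': 'noaa_materializer',  # or the WorldBank one, etc.
                'arguments': {'indicator': indicator,
                              'location': location,
                              'year': year},
            },
        }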

TODO for NOAA materializer

  1. locationid and datatypeid are hard-coded if data_range is not provided.

    api = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&datatypeid=TAVG&locationid=CITY:US060013&startdate=' \

    Right now, if data_range is None, it will always return the same dataset.

  2. Change data_range to date_range or time_range; the name data_range is misleading.

  3. location should perhaps be a list of locations, defaulting to 'los angeles' (see the sketch after this list).

    def fetch_data(self, data_range=None, location: str = 'los angeles', datatype: str = 'TAVG'):

    That way, one can query multiple locations later on.

There are new changes; please pull the latest into your fork before implementing.
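
A sketch of items 2 and 3 combined; the helper name and return type are made up, and only the base URL and parameter names come from the snippet above:

from typing import List, Optional, Tuple

API = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND'

def build_queries(date_range: Optional[Tuple[str, str]] = None,
                  locations: Optional[List[str]] = None,
                  datatype: str = 'TAVG') -> List[str]:
    """One CDO query URL per location; 'los angeles' default kept from the current signature."""
    locations = locations or ['los angeles']
    urls = []
    for location in locations:
        # the location string would still need mapping to a CDO locationid such as CITY:US060013
        url = '{}&datatypeid={}&locationid={}'.format(API, datatype, location)
        if date_range is not None:
            url += '&startdate={}&enddate={}'.format(*date_range)
        urls.append(url)
    return urls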

Create example dataset in our Blazegraph

The dataset would be modeled roughly like the following:

D1000001:
  label: Table 10 ...
  description: ...
  P31: Q1172284
  P2699: http://...
  C2001: D1000001
  C2004: "whatever text we can put here to match the keywords in the query"
  C2005: County # variable measured
    C2007: string # data type
    C2008: http://schema.org/city #semantic type
    C2006: "Autauga Baldwin ..." # text
    P1545: 0 # column index
  C2005: Violent Crime
    C2007: integer
    C2008: ??? # semantic type always required?
    C2006: ??? # maybe for numeric we don't store the values
    P1545: 1
  C2005: County_wikidata_0
    C2007: string
    C2008: http://wikidata.org/Q???
    C2006: "Q1234, Q2345, ..."
    P1545: 9
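
A minimal rdflib sketch of loading a few of these triples into Blazegraph; the C-property namespace and entity URIs are hypothetical, and the per-variable qualifiers (C2007, C2008, P1545, ...) would need statement nodes that this sketch omits:

from rdflib import Graph, Literal, Namespace, URIRef

WD = Namespace('http://www.wikidata.org/entity/')
WDT = Namespace('http://www.wikidata.org/prop/direct/')
DM = Namespace('http://datamart.example.org/prop/')  # hypothetical namespace for the C-properties

g = Graph()
d = URIRef('http://datamart.example.org/entity/D1000001')  # hypothetical entity URI
g.add((d, WDT.P31, WD.Q1172284))           # instance of: data set
g.add((d, DM.C2001, Literal('D1000001')))  # datamart identifier
g.add((d, DM.C2004, Literal('whatever text we can put here to match the keywords in the query')))
print(g.serialize(format='turtle'))        # Turtle ready to load into Blazegraph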

Downloader needs to be updated

Migrate downloader here: https://github.com/usc-isi-i2/datamart/tree/development/datamart/materializers/tradingeconomics_downloader

The materialization method has the following problem:

  1. Reading the downloaded CSV file throws an error.
    The CSV reading error is fixed by changing the encoding to 'utf-16', df6754b.

Focus on the following improvements:

  1. If possible, add a flag saying whether we want to save the metadata and CSV file. In Datamart we just want a DataFrame; saving it and reading it back seems overly complex. Can we just query and return the result instead of saving it (see the sketch below)?

  2. Add some unit tests for tradingeconomics_materializer.
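
A sketch of improvement 1 (the function name and flag are assumptions): query, build the DataFrame, and only touch disk when explicitly asked:

import pandas as pd

def materialize(url: str, save_path: str = None) -> pd.DataFrame:
    """Return the DataFrame directly; persist it only when a path is given."""
    df = pd.read_csv(url, encoding='utf-16')  # encoding fix from df6754b
    if save_path is not None:                 # optional persistence flag
        df.to_csv(save_path, index=False)
    return df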
