
nowcastlib's Introduction

Nowcast Library

🧙‍♂️🔧 Utils that can be reused and shared across and beyond the ESO Nowcast project

This is a public repository hosted on GitHub via a push mirror setup in the internal ESO GitLab repository

Installation

Simply run

pip install nowcastlib

Usage and Documentation

Nowcast Library (nowcastlib) consists of a collection of functions organized in submodules (API) and a tool accessible via the command line (CLI). The latter is primarily intended for accessing the Nowcast Library Pipeline, an opinionated yet configurable set of processing steps for wrangling data and evaluating models in a consistent and rigorous way. More information can be found on the nowcastlib pipeline index page (link to markdown and link to hosted docs).

Please refer to the examples folder on GitHub for usage examples.

API

Here is a quick example of how one may import nowcastlib and access one of its functions:

"""Example showing how to access compute_trig_fields function"""
import nowcastlib as ncl
import pandas as pd
import numpy as np

data_df = pd.DataFrame(
    np.array([[0, 3, 4, np.NaN], [32, 4, np.NaN, 4], [56, 8, 0, np.NaN]]).T,
    columns=["A", "B", "C"],
    index=pd.date_range(start="1/1/2018", periods=4, freq="2min"),
)

result = ncl.datasets.compute_trig_fields(data_df, ["A", "C"])

More in-depth API documentation can be found here.

CLI

Some of the library's functionality is bundled into configurable subcommands accessible from the terminal via the nowcastlib command:

usage: nowcastlib [-h] [-v]
                  {triangulate,preprocess,sync,postprocess,datapipe} ...

positional arguments:
  {triangulate,preprocess,sync,postprocess,datapipe}
                        available commands
    triangulate         Run `nowcastlib triangulate -h` for further help
    preprocess          Run `nowcastlib preprocess -h` for further help
    sync                Run `nowcastlib sync -h` for further help
    postprocess         Run `nowcastlib postprocess -h` for further help
    datapipe            Run `nowcastlib datapipe -h` for further help

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase verbosity level from INFO to DEBUG

Repository Structure

The following output is generated with tree . -I 'dist|docs|*.pyc|__pycache__'

.
├── LICENSE
├── Makefile # currently used to build docs
├── README.md
├── de421.bsp # not committed
├── docs/ # html files for the documentation static website
├── examples
│   ├── README.md
│   ├── cli_triangulate_config.yaml
│   ├── data/  # not committed
│   ├── datasync.ipynb
│   ├── output/ # not committed
│   ├── pipeline_datapipe.json
│   ├── pipeline_preprocess.json
│   ├── pipeline_sync.json
│   ├── signals.ipynb
│   └── triangulation.ipynb
├── images
│   └── pipeline_flow.png
├── nowcastlib # the actual source code for the library
│   ├── __init__.py
│   ├── cli
│   │   ├── __init__.py
│   │   └── triangulate.py
│   ├── datasets.py
│   ├── dynlag.py
│   ├── gis.py
│   ├── pipeline
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── cli.py
│   │   ├── process
│   │   │   ├── __init__.py
│   │   │   ├── postprocess
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cli.py
│   │   │   │   └── generate.py
│   │   │   ├── preprocess
│   │   │   │   ├── __init__.py
│   │   │   │   └── cli.py
│   │   │   └── utils.py
│   │   ├── split
│   │   │   └── __init__.py
│   │   ├── structs.py
│   │   ├── sync
│   │   │   ├── __init__.py
│   │   │   └── cli.py
│   │   └── utils.py
│   ├── signals.py
│   └── utils.py
├── poetry.lock # lock file generated by python poetry for dependency mgmt
└── pyproject.toml # general information file, handled by python poetry

Directories and Files not Committed

There are a number of files and folders that are not committed because their large, static nature makes them inappropriate for git version control. The following files and folders warrant a brief explanation.

  • Certain functions of the Nowcast Library (e.g. time since sunset, sun elevation) rely on a .bsp file, containing information on the locations through time of various celestial bodies in the sky. This file is downloaded automatically the first time one of these functions is used (see the sketch after this list).
  • The example scripts make use of a data/ directory containing a series of CSV files. Most of the data used in the examples can be downloaded from the ESO Ambient Condition Database. Users can then change the paths set in the examples to fit their needs. For users interested in replicating the exact structure and contents of the data directory, a compressed copy of it (1.08 GB) is available to ESO members through this Microsoft Sharepoint link.
  • At times the examples showcase the serialization functionality of the nowcastlib pipeline or need to write some data to disk. In these situations the output/ directory in the examples folder is used.
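
For illustration, here is a minimal sketch of how such an ephemeris file is typically obtained with skyfield, the library nowcastlib uses for these astronomical calculations; the exact loading code inside nowcastlib may differ:

"""Sketch only: typical skyfield pattern for obtaining a .bsp ephemeris file."""
from skyfield.api import load

# load() downloads de421.bsp into the working directory on first use
# and reuses the cached copy on subsequent calls
eph = load("de421.bsp")
sun = eph["sun"]
earth = eph["earth"]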

Development Setup

This repository relies on Poetry for tracking dependencies, building, and publishing. It is therefore recommended that developers install Poetry and use it throughout their development of the project.

Dependencies

Make sure you are in the right Python environment and run

poetry install

This reads pyproject.toml, resolves the dependencies, and installs them.

Deployment

The package is published to PyPI, so as to make it accessible via the pip install command mentioned earlier.

To publish changes, follow these steps. Ideally this process would be automated via a CI tool triggered by a push/merge to the master branch:

  1. Optionally run poetry version with the appropriate bump rule (e.g. patch, minor, or major) based on semver guidelines.

  2. Update the documentation by running

    make document
  3. Prepare the package by running

    poetry build
  4. Ensure you have TestPyPI and PyPI configured as your poetry repositories:

    poetry config repositories.testpypi https://test.pypi.org/legacy/
    poetry config repositories.pypi https://pypi.org/
  5. Publish the package to TestPyPI to verify that everything works as expected:

    poetry publish -r testpypi
  6. Stage, commit and push your changes (to master) with git.

  7. Publish the package to PyPI:

    poetry publish -r pypi

Upon successful deployment, the library should be available for installation via pip.

nowcastlib's Issues

Tensorflow Compatibility

tensorflow (2.6.0) depends on numpy (>=1.19.2,<1.20.0),
tensorflow (>=2.6.0,<3.0.0) requires numpy (>=1.19.2,<1.20.0).
nowcastlib (3.0.12) depends on numpy (>=1.20.3,<2.0.0)

So nowcastlib cannot be used with tensorflow unless numpy is downgraded to (>=1.19.2,<1.20.0).

skyfield calculations may be more accurate than required, at the cost of computation time

I have not analyzed the big-O performance, but it is slow enough to be a nuisance for larger datasets, especially since this calculation needs to be repeated across train/test sets and perhaps even across folds.

The following lines need addressing:

# for each timestamp, count how many sunsets (excluding the first) have already occurred
sunset_idxs = np.zeros(len(datetime_series), dtype=int)
for sunset in sunsets[1:]:
    change = np.where(datetime_series > sunset)[0][0]
    sunset_idxs[change:] += 1
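
One possible direction, sketched below under the assumption that both datetime_series and sunsets are sorted in ascending order and are comparable as datetime64 values, is to replace the Python loop with a single vectorized call:

import numpy as np

# Sketch only: for every timestamp, count how many of sunsets[1:] fall
# strictly before it. With side="left", searchsorted returns exactly that
# count, matching the cumulative increments performed by the loop above.
sunset_idxs = np.searchsorted(
    np.asarray(sunsets[1:]), np.asarray(datetime_series), side="left"
)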

More efficient Data Synchronization

The current data synchronisation implementation, in particular with regard to finding overlapping contiguous chunks across data sources, might ultimately require a lot of memory if the time series is long enough or the sampling rate is too high.

P. Fluxa mentions:

A colleague of mine and I figured out a "compressed" way of synchronising chunks, which only requires knowing the start and end times of every interval. That is very cheap to obtain and scales as O(n). Then, the operation of finding all relevant intervals (the ones where there is data in all "channels") scales even better, as it only depends on the number of intervals found.
This is a quick-and-dirty implementation showing how it works:

"""
Sample script showing the solution of the following problem:

"given N channels of data with R continous ranges each, find all the
ranges where there is data for all N channels"
"""

import random
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# create a set of random ranges. this is just a formality
numChan = 5
nRanges = 10
data = list()
for nch in range(numChan):
    ms = random.randint(0, 5)
    for nr in range(nRanges):
        jitter1 = 0
        jitter2 = 1 #random.randint(2, 6)
        width = 7
        start = ms + jitter1
        end = start + width
        entry = dict()
        entry['start'] = start
        entry['sflag'] = 1
        entry['end'] = end
        entry['eflag'] = -1
        entry['channel'] = nch
        entry['rangeidx'] = nr
        data.append(entry)        
        ms = end + jitter2
rangesdf = pandas.DataFrame(data)  
 
# extract all timestamps from ranges, keeping track of whether they
# correspond to start or end of ranges
timest = rangesdf['start'].values.tolist() 
flags = rangesdf['sflag'].values.tolist()
flags += rangesdf['eflag'].values.tolist()
timest += rangesdf['end'].values.tolist()
# build intermediate dataframe
sdf = pandas.DataFrame(dict(st = timest, flag = flags))
sdf.sort_values(by='st', inplace=True)
cumsum = sdf.flag.cumsum()
print(cumsum)
cr = numpy.where(cumsum == numChan)
crlist = cr[0].tolist()
crarr = list()
for e in crlist:
    crarr.append(e)
    crarr.append(e + 1)
crarr = numpy.asarray(crarr)
crmask = tuple((crarr,))
cmnRanges = sdf.iloc[crmask].st.values.reshape((-1, 2))

# make a figure showing the result
fig, ax = plt.subplots()
# plot all ranges
for idx, entry in rangesdf.iterrows():
    xs = entry['start']
    xe = entry['end']
    ys = entry['channel']
    ax.hlines(ys, xs, xe)
# plot common ranges
for cr in cmnRanges:
    # avoid drawing ranges with no width
    if cr[1] == cr[0]:
        continue
    ax.vlines(cr[0], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
    ax.vlines(cr[1], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
plt.savefig('ranges.pdf')

And this is the kind of result you get:

[figure: per-channel ranges plotted as horizontal lines, with the common ranges marked by dashed vertical lines (saved as ranges.pdf)]

Example in README contains mistake

The example listed here in the README leads to the following error trace:

data_df = pd.DataFrame(
...     [[0, 3, 4, np.NaN], [32, 4, np.NaN, 4], [56, 8, 0, np.NaN]],
...     columns=["A", "B", "C"],
...     index=pd.date_range(start="1/1/2018", periods=4, freq="2min"),
... )
Traceback (most recent call last):
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 898, in _finalize_columns_and_data
    columns = _validate_or_indexify_columns(contents, columns)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 947, in _validate_or_indexify_columns
    f"{len(columns)} columns passed, passed data had "
AssertionError: 3 columns passed, passed data had 4 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\frame.py", line 700, in __init__
    dtype,
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 483, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 799, in to_arrays
    content, columns = _finalize_columns_and_data(arr, columns, dtype)
  File "C:\Users\angel\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 901, in _finalize_columns_and_data
    raise ValueError(err) from err
ValueError: 3 columns passed, passed data had 4 columns

Get rid of cascade-like nature of pipeline

Currently, because the pipeline assumes an order of operations, running an individual process (e.g. postprocessing) will also run all the processes leading up to it.

For example, suppose the user wants to run postprocessing. The pipeline will run preprocessing, synchronization and postprocessing in that order.

At the moment, the best way to keep things truly independent of previous processes is to keep the configuration for those processes to a minimum, so that minimal processing is performed.

This is however a bit cumbersome, as the user needs to open, edit and maintain different configuration files for different processes, which defeats the purpose of having a single configuration schema (the DataSet config struct).

The reason the pipeline works this way is that the output of a given process serves as the input to the next process, and the only input the user can specify in the configuration is the input to the first step of the pipeline, i.e. preprocessing. Therefore, if a user wishes to run a given process, all the processes before it need to run so that it receives the right input.


Ideally, the user should be able to have a very complete configuration (if they wish) but choose to run only a part of the pipeline by using the right CLI command and providing the necessary input themselves.

So, if the user wanted to postprocess a synchronized dataset that they already have, they would call nowcastlib postprocess with the relevant configuration and the path to the file they wish to postprocess.

Ideally, this would tell the pipeline to only perform postprocessing, rather than the current form in which preprocessing and synchronization are performed beforehand.


Each subprocess CLI command should therefore take at least one additional (optional) argument, -i or --input, with which the user can specify the path to an input file, so that all the previous steps can be skipped.
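
For illustration, a minimal argparse sketch of the proposed interface; the --config flag and subcommand wiring here are hypothetical and only stand in for however nowcastlib actually parses its arguments:

"""Hypothetical sketch of the proposed -i/--input option for a subcommand."""
import argparse

parser = argparse.ArgumentParser(prog="nowcastlib")
subparsers = parser.add_subparsers(dest="command")

postprocess = subparsers.add_parser("postprocess")
# hypothetical config flag, standing in for the existing configuration argument
postprocess.add_argument("-c", "--config", required=True, help="path to the configuration file")
# the proposed optional input: if provided, skip preprocessing and synchronization
# and postprocess the given file directly
postprocess.add_argument(
    "-i",
    "--input",
    default=None,
    help="path to an already-synchronized dataset to postprocess directly",
)

args = parser.parse_args()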

ModuleNotFoundError: No module named 'importlib_metadata'

When importing nowcastlib, the following error is raised:

>>> import nowcastlib as ncl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/py3.7/lib/python3.7/site-packages/nowcastlib/__init__.py", line 5, in <module>
    from importlib_metadata import version
ModuleNotFoundError: No module named 'importlib_metadata'

As such, the import fails and the library remains unusable.
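
A common remedy for this class of error, sketched here as one possibility rather than the fix adopted upstream, is to prefer the standard-library importlib.metadata (available since Python 3.8) and only fall back to the importlib_metadata backport:

# Possible sketch for nowcastlib/__init__.py, not necessarily the upstream fix:
try:
    # standard library, Python >= 3.8
    from importlib.metadata import version
except ImportError:
    # backport for older Python versions, must be installed separately
    from importlib_metadata import version

# e.g. expose the installed package version (illustrative usage)
__version__ = version("nowcastlib")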

Postprocessing is slow in general

Splitting data into test/train/val before the vast majority of our postprocessing seems unnecessary, and we actually end up performing redundant computations this way, for example when generating new fields. Splitting is essentially only necessary for standardization.

The steps should instead run in this order:

  • preprocess
  • sync
  • postprocess
  • generate new fields
  • split
  • standardize
