Timeseries database

License: Other

PhilDB project

Timeseries database project: For storing potentially changing timeseries data. For example hydrological data, like streamflow data, where the timeseries may be revised as quality control processes improve the recorded dataset over time.

PhilDB should be capable of storing data at any frequency supported by Pandas. At this time only daily data has been extensively tested with some limited sub-daily usage.

Further information about the design of PhilDB can be found in the paper: PhilDB: the time series database with built-in change logging. That paper explores existing time series database solutions, discusses the motivation for PhilDB, describes the architecture and philosophy of the PhilDB software, and includes an evaluation between InfluxDB, PhilDB, and SciDB.

Dependencies

Requires Python 3.7 or greater (mostly tested on Mac OSX and Linux). Test suite runs on Linux using Travis CI with Python 3.6, 3.7, and 3.8. Test suite runs on Windows using Appveyor with Python 3.7.

All the Python dependencies are recorded in the python_requirements file.

Installation

PhilDB is pip installable.

The latest stable version can be installed from PyPI with:

pip install phildb

The latest stable version can also be installed from conda with:

conda install -c amacd31 phildb

The latest development version can be installed from GitHub with:

pip install git+https://github.com/amacd31/phildb.git@dev

The latest development version can be installed from conda with:

conda install -c amacd31/label/dev phildb

Development environment

A number of processes for a development environment with tests and documentation generation have been automated in a Makefile.

The virtualenv package can be used to create an isolated install of required Python packages.

Create a virtual environment with dependencies installed:

make venv

Test everything is working:

make test

Build the documentation:

make docs

View the generated documentation at doc/build/html/index.html

For additional details see the INSTALL file.

Usage

Create a new PhilDB

phil-create new_tsdb

Open the newly created PhilDB

phildb new_tsdb

If using the development environment built with make, load it, which also adds the PhilDB tools to your path:

. load_env

Examples

See the examples directory for code on setting up test phil databases with different data sets. Each example comes with a README file outlining the steps to acquire some data and load it. The loading scripts in each example can be used as a basis for preparing a timeseries database and loading it with data.

The examples/hrs/ example also contains an example script (autocorr.py) for processing the HRS data using phildb. The script calculates auto-correlation for all the streamflow timeseries in the HRS dataset.

Presently there are three sets of example code: acorn-sat, bom_observations, and hrs.

ACORN-SAT

ACORN-SAT Example.ipynb located in examples/acorn-sat demonstrates loading minimum and maximum daily temperature records for 112 stations around Australia.

The dataset used in this example is the Australian Climate Observations Reference Network – Surface Air Temperature (ACORN-SAT), available from the Australian Bureau of Meteorology's ACORN-SAT website.

BOM Observations

Bureau of Meterology observations example.ipynb located in examples/bom_observations demonstrates loading half hourly air temperature data from a 72 hour observations JSON file.

The data used in this example is a 72 hour observations JSON file from the Australian Bureau of Meteorology website (e.g. the JSON file linked on the Sydney Airport observations page).

HRS

HRS Example.ipynb located in examples/hrs demonstrates loading daily streamflow data for 221 streamflow stations around Australia.

The dataset used in this example is the Hydrologic Reference Stations (HRS) dataset, available from the Australian Bureau of Meteorology's HRS website.

This example also includes a script to calculate the auto-correlation for all the streamflow timeseries in the HRS dataset.

phildb's Issues

Convert examples to Jupyter notebooks

Jupyter notebooks can be readily executed to ensure the examples still work, and they render nicely on GitHub for easy reading.

NaN values constantly being written to the log.

Every time there is an update any existing NaN values are written to the log again as 'nan'.
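
One way to avoid re-logging unchanged NaN values is to treat a NaN that stays NaN as "not modified" when comparing old and new data. The sketch below is a minimal illustration in plain Python; the function name changed_points and the list-based inputs are assumptions for the example and are not taken from the PhilDB code base.

```python
import math

def changed_points(existing, incoming):
    """Return (position, old, new) for values that differ.

    A NaN that stays NaN is treated as unchanged, since NaN != NaN
    would otherwise flag it as modified on every update.
    Hypothetical helper, not part of PhilDB.
    """
    changes = []
    for position, (old, new) in enumerate(zip(existing, incoming)):
        both_nan = math.isnan(old) and math.isnan(new)
        if not both_nan and old != new:
            changes.append((position, old, new))
    return changes

nan = float("nan")
result = changed_points([1.0, nan, 3.0], [1.0, nan, 3.5])
unchanged = changed_points([nan, 2.0], [nan, 2.0])
```

Here only the third value is reported as changed; the NaN in the second position is skipped because it was already NaN.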

Need methods to easily query logs

Release version 1.0

IOError when reading a time series instance that hasn't had data written

After creating a time series instance an IOError ('IOError: [Errno 2] No such file or directory') can be triggered by attempting to read the series before any data has been written.

Rename commandline tool to `phildb` from `phil`

The server is phildb-server and the imports are phildb, so having the command phil run the interactive prompt is confusing.

The commandline tool should be renamed from phil to phildb.

Initially the phil command should remain available alongside phildb, perhaps with a deprecation warning, and be removed in a later release.
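
A transition like this could be sketched with the standard library warnings module: the old entry point emits a DeprecationWarning and then delegates to the new one. The function names phildb_main and phil_main below are hypothetical stand-ins, not the actual PhilDB entry points.

```python
import warnings

def phildb_main(argv=None):
    """Hypothetical stand-in for the new `phildb` entry point."""
    return 0

def phil_main(argv=None):
    """Hypothetical deprecated `phil` entry point that forwards to `phildb`."""
    warnings.warn(
        "the 'phil' command is deprecated; use 'phildb' instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return phildb_main(argv)

# Demonstrate that calling the old name warns and still works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    exit_code = phil_main([])
```

Keeping the old name as a thin wrapper means both commands behave identically until the alias is removed.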

Improve performance of ts_list

Currently ts_list performs poorly when there are thousands of timeseries instances.

Eager loading the timeseries id should greatly improve things.
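
As a rough illustration of the idea, the sketch below fetches the timeseries identifier together with each instance in a single joined query, rather than issuing one lookup per instance. The table and column names are assumptions made for this example and may not match the PhilDB schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.executescript("""
CREATE TABLE timeseries (id INTEGER PRIMARY KEY, primary_id TEXT);
CREATE TABLE timeseries_instance (
    id INTEGER PRIMARY KEY,
    timeseries_id INTEGER REFERENCES timeseries(id),
    freq TEXT
);
INSERT INTO timeseries VALUES (1, 'station_a');
INSERT INTO timeseries VALUES (2, 'station_b');
INSERT INTO timeseries_instance VALUES (1, 1, 'D');
INSERT INTO timeseries_instance VALUES (2, 2, 'D');
INSERT INTO timeseries_instance VALUES (3, 2, 'H');
""")

def ts_list_joined(connection, freq):
    """List identifiers for a frequency using one joined query."""
    rows = connection.execute(
        "SELECT DISTINCT t.primary_id "
        "FROM timeseries_instance AS i "
        "JOIN timeseries AS t ON t.id = i.timeseries_id "
        "WHERE i.freq = ? "
        "ORDER BY t.primary_id",
        (freq,),
    )
    return [row[0] for row in rows]

daily = ts_list_joined(conn, "D")
hourly = ts_list_joined(conn, "H")
```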

Improve logging

Currently only modified values are logged, so there is no way to identify when a value was first added.

Ability to prepend to time series when calling write

Presently prepending to a series is not supported and results in an uncaught exception. The current workaround is to ensure the earliest value to be used is written during the very first write.

Update versioneer

Newer versioneer produces better development version strings from git. Plus it moves some information out of setup.py variables and into setup.cfg.

Ability to remove data points

In particular, removing points in an irregular time series.

An irregular series could potentially accrue points if the data is moved relative to the timestamps.

Being able to trim regular series as well would be useful, but is less critical than removing specific values from an irregular series. As a workaround, a regular series can represent a removed point by writing a NaN value at that timestamp.

Loss of datetime precision with Pandas < 0.20

With Pandas < 0.20 the precision of datetimes can be lost when writing irregular series.

This can occur when the data to be written can be stored as float32 and Pandas treats the data as float32 instead of float64. The below snippet of code demonstrates the problem and the final assert will fail with older versions of Pandas.


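import numpy as np
import pandas as pd

# Ten float32 values on a one-minute, timezone-aware index.
# (pd.np is no longer available in recent pandas, so numpy is imported directly.)
sample = pd.DataFrame(
    np.array(
        [x + 0.1 for x in range(10)],
        dtype=np.float32
    ),
    index=pd.date_range(
        '2017-08-06 06:50:00+00:00',
        periods=10,
        freq='min'
    )
)

# Add the index as integer seconds since the epoch (an int64 column).
sample['datestamp'] = pd.Series(
    sample.index.map(
        lambda dateval: dateval.value // 1000000000
    ),
    index=sample.index
)

# .values has to cast the float32 and int64 columns to a common dtype.
datestamp = sample.values.astype(np.int64)[0][1]

print(datestamp)
assert datestamp == 1502002200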

This issue stems from the fact that calling .values on a pandas.DataFrame should cast the columns to a common data type capable of holding all of them. However, the bug in pandas results in the int64 being cast to float32, losing precision in the process, instead of correctly upcasting everything to float64 (which can hold both types successfully).

This bug can be worked around either by ensuring that the time series to be written has a dtype of float64 or by using Pandas >= 0.20.
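
The precision loss itself can be reproduced without pandas. The sketch below uses the standard library struct module to round-trip the epoch-seconds value from the snippet above through 32-bit and 64-bit floats; the helper names are made up for this illustration.

```python
import struct

def roundtrip_float32(value):
    """Store a number as a 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]

def roundtrip_float64(value):
    """Store a number as a 64-bit float and read it back."""
    return struct.unpack("<d", struct.pack("<d", float(value)))[0]

epoch_seconds = 1502002200  # 2017-08-06 06:50:00 UTC

as_float32 = int(roundtrip_float32(epoch_seconds))
as_float64 = int(roundtrip_float64(epoch_seconds))

print(as_float32 == epoch_seconds)  # False: float32 has only 24 significant bits
print(as_float64 == epoch_seconds)  # True: float64 holds the value exactly
```

At this magnitude adjacent float32 values are 128 apart, so a timestamp in seconds is rounded to the nearest representable value, which is why the datestamp column comes back wrong when the frame is cast to float32.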

Can't instantiate two PhilDB classes at once

The scope of the SQLite instance appears to be module level, meaning multiple PhilDB instances can't be used in the one script (only the most recent has the correct SQLite metadata access).
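
A common way to avoid this kind of clash is to hold the database connection on each object rather than in a module-level variable. The sketch below shows the pattern with the standard library sqlite3 module; the MetaStore class and its meta table are hypothetical and do not reflect how PhilDB is actually implemented.

```python
import sqlite3

class MetaStore:
    """Hypothetical metadata store that owns its own SQLite connection."""

    def __init__(self, path):
        # The connection lives on the instance, not at module level,
        # so creating a second store does not replace the first one's handle.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, key, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO meta VALUES (?, ?)", (key, value)
        )
        self._conn.commit()

    def get(self, key):
        row = self._conn.execute(
            "SELECT value FROM meta WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

# Two independent in-memory databases used side by side.
first = MetaStore(":memory:")
second = MetaStore(":memory:")
first.set("name", "first_db")
second.set("name", "second_db")
```

Each store keeps reading its own metadata regardless of the order in which the instances were created.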

Index error when writing an empty series

.../phildb/database.py in write(self, identifier, freq, ts, **kwargs)
    343             :type ts: pd.Series
    344         """
--> 345         modified = writer.write(self.get_file_path(identifier, freq, **kwargs), ts, freq)
    346 
    347         log_file = self.get_file_path(identifier, freq, ftype = 'hdf5', **kwargs)

.../phildb/writer.py in write(tsdb_file, ts, freq)
     89         return write_irregular_data(tsdb_file, series)
     90     else:
---> 91         return write_regular_data(tsdb_file, series)
     92 
     93 def write_regular_data(tsdb_file, series):

.../phildb/writer.py in write_regular_data(tsdb_file, series)
    103         :type series: pandas.Series
    104     """
--> 105     start_date = series.index[0]
    106     end_date = series.index[-1]
    107 

.../anaconda/envs/dev/lib/python2.7/site-packages/pandas/indexes/base.pyc in __getitem__(self, key)
   1262 
   1263         if lib.isscalar(key):
-> 1264             return getitem(key)
   1265 
   1266         if isinstance(key, slice):

IndexError: index 0 is out of bounds for axis 0 with size 0

Debugging:

.../phildb/writer.py(105)write_regular_data()
    104     """
--> 105     start_date = series.index[0]
    106     end_date = series.index[-1]

ipdb> series
Series([], dtype: float64)
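
The traceback shows the failure comes from indexing position 0 of an empty index. A guard along the following lines would avoid it; whether an empty write should be a silent no-op or an explicit error is a design decision this sketch does not settle. The helper name series_bounds is hypothetical.

```python
def series_bounds(index):
    """Return (start, end) of an index, or None when it is empty.

    Checking the length first avoids the IndexError raised by index[0]
    on an empty index. Hypothetical helper, not part of PhilDB.
    """
    if len(index) == 0:
        return None
    return index[0], index[-1]

empty_bounds = series_bounds([])
filled_bounds = series_bounds(["2017-01-01", "2017-01-02", "2017-01-03"])
```

A caller could then return early, reporting no modified values, when the bounds are None.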

Performance issue when loading sparse data

Need to fix a performance issue where loading values that are far apart in a regular series results in an expensive pandas offset operation per intervening missing value.
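
For fixed-width frequencies such as daily or hourly, the position of a value relative to the start of a series can be computed arithmetically instead of stepping one offset at a time. The sketch below contrasts the two approaches using only the standard library; it illustrates the general idea under that fixed-width assumption, is not the actual PhilDB fix, and would not apply directly to calendar-based frequencies such as monthly.

```python
from datetime import datetime, timedelta

def periods_between_stepwise(start, end, step):
    """Count periods by repeatedly adding the step (cost grows with the gap)."""
    count = 0
    current = start
    while current < end:
        current += step
        count += 1
    return count

def periods_between_direct(start, end, step):
    """Count periods with a single integer division of the time difference."""
    return (end - start) // step

start = datetime(2000, 1, 1)
end = datetime(2017, 1, 1)
one_day = timedelta(days=1)

stepwise = periods_between_stepwise(start, end, one_day)
direct = periods_between_direct(start, end, one_day)
```

Both functions agree when the end is aligned to the step, but the direct version does the same amount of work no matter how far apart the two values are.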

Fix FutureWarning from newer numpy

Numpy 1.10 issues a future warning in the reader code. The code should be updated so that this won't be a problem in the future.

See:
phildb/reader.py:17: FutureWarning: elementwise == comparison failed and returning scalar instead; this will raise an error or perform elementwise comparison in the future.
if records == []: return pd.DataFrame(None, columns = ['date', 'value', 'metaID'])
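
The warning arises because comparing an array with an empty list using == is ambiguous. Testing the length instead sidesteps the comparison entirely. This is a minimal sketch of that check, assuming records supports len(), which holds for lists and NumPy arrays; the helper name is_empty is made up for the example.

```python
def is_empty(records):
    """True when there are no records, without using an == [] comparison."""
    return len(records) == 0

no_records = is_empty([])
some_records = is_empty([("2017-01-01", 1.5, 1)])
```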