artesiawater / hydropandas

Module for loading observation data into custom DataFrames

Home Page: https://hydropandas.readthedocs.io
License: MIT License
Currently you can only read data using class methods such as GroundwaterObs.from_knmi() or ObsCollection.from_dino(). It would be nice to have more general, pandas-style methods such as read_knmi or read_dino.
The only challenge is making a clear distinction between reading a single observation and reading a collection of observations. One option would be for the read functions to always return a collection.
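A rough sketch of how such a read function could sidestep the single-vs-collection ambiguity by always returning a collection. The name read_knmi follows the proposal above; the body is purely illustrative, not the hydropandas implementation:

```python
def read_knmi(stns, **kwargs):
    """Hypothetical reader that always returns a collection."""
    if not isinstance(stns, (list, tuple)):
        stns = [stns]  # promote a single station to a collection of one
    # in hydropandas this would call e.g. GroundwaterObs.from_knmi per station
    return [{"station": stn, **kwargs} for stn in stns]

single = read_knmi(260)        # still a collection, of length 1
multi = read_knmi([260, 310])  # a collection of two
```

A caller that knows it asked for one station can then simply take the first element.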
I would like a feature where we can read KNMI observations (into an Obs or ObsCollection) from a local file. I know we already do this internally, so it should not be that hard to implement, right?
Often there are multiple observations at the same location. It is useful to know which observations share a location, so you can, for example, plot them in the same figure.
Currently we only handle this for GroundwaterObs-type observations. I think it is better to make a more generic method for this.
It is nice to have consistent methods. Therefore we decided that get_... methods will always return data and never change the ObsCollection, while set_... methods are allowed to modify the values in the ObsCollection and the corresponding observations. I applied this to the geo methods in the dev branch; the gwobs and stats methods still need to be changed.
It seems like there is something wrong with the dependencies at Read the Docs, see this issue: readthedocs/readthedocs.org#8639
When cache=True is used in ObsCollection.from_knmi(), the start and end parameters are ignored. This can cause problems when you run a script daily and need data for the last day.
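One way to avoid serving stale data, sketched below under the assumption that the cached download can be wrapped in a single function: make start and end part of the cache key, so a moving daily window always triggers a fresh request. _fetch_knmi is a hypothetical stand-in for the real download:

```python
from functools import lru_cache

CALLS = []  # record downloads, for illustration only

def _fetch_knmi(stn, start, end):
    # hypothetical stand-in for the real KNMI download
    CALLS.append((stn, start, end))
    return f"data for {stn} from {start} to {end}"

@lru_cache(maxsize=None)
def fetch_cached(stn, start, end):
    # start and end are part of the key: a new window means a new download
    return _fetch_knmi(stn, start, end)

fetch_cached(260, "2024-01-01", "2024-01-02")
fetch_cached(260, "2024-01-01", "2024-01-02")  # served from the cache
fetch_cached(260, "2024-01-02", "2024-01-03")  # new window, downloaded again
```

The same idea applies to a file-based cache: the filename (or key) should encode the requested period, not just the station.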
In the latest version, the function io_knmi.fill_missing_measurements() uses the condition meteo_var == 'RH' to decide when to run _check_latest_measurement_date_RD_debilt. In previous versions (I only checked 0.4.2) the condition was meteo_var == 'RD', which is correct in my opinion: you only want to do this for the precipitation stations ('dagstations'), not for the 'meteostations'.
Please have a look at this.
Discovered by @martinvonk.
If you request meteo data for a station over a period in which it has no data, and fill_missing_measurements=True, the data from another station is returned. If that other station has all the data for the period, there will be no 'station_opvulwaarde' column in the observation time series.
The first example below shows what goes wrong. The metadata is confusing because some parts show the originally requested station (210 Valkenburg) while other parts show the station used to fill the data (215 Voorschoten).
>>> import hydropandas as hpd
>>> hpd.EvaporationObs.from_knmi(210, et_type='EV24', startdate='2020-1-1', enddate='2020-1-10')
-----metadata------
name : EV24_VALKENBURG
x : 88603.9252503658
y : 466467.2721140531
meta : {'LON_east': {'215': 4.437}, 'LAT_north': {'215': 52.141}, 'ALT_m': {'215': -1.1}, 'NAME': {'215': 'Voorschoten'}, 'EV24': 'Referentiegewasverdamping (Makkink) (in m) / Potential evapotranspiration (Makkink) (in m)', 'x': 88603.9252503658, 'y': 466467.2721140531, 'station': 210, 'name': 'EV24_VALKENBURG'}
filename :
station : 210
meteo_var : EV24
EV24
2020-01-01 01:00:00 0.0004
2020-01-02 01:00:00 0.0001
2020-01-03 01:00:00 0.0001
2020-01-04 01:00:00 0.0001
2020-01-05 01:00:00 0.0002
2020-01-06 01:00:00 0.0001
2020-01-07 01:00:00 0.0004
2020-01-08 01:00:00 0.0004
2020-01-09 01:00:00 0.0002
2020-01-10 01:00:00 0.0001
If you do have data for a part of the requested period at the requested station, everything is fine:
>>> import hydropandas as hpd
>>> hpd.EvaporationObs.from_knmi(210, et_type='EV24', startdate='2016-5-1', enddate='2016-5-10')
-----metadata------
name : EV24_VALKENBURG
x : 88603.9252503658
y : 466467.2721140531
meta : {'LON_east': {'210': 4.43}, 'LAT_north': {'210': 52.171}, 'ALT_m': {'210': -0.2}, 'NAME': {'210': 'Valkenburg Zh'}, 'EV24': 'Referentiegewasverdamping (Makkink) (in m) / Potential evapotranspiration (Makkink) (in m)', 'x': 88603.9252503658, 'y': 466467.2721140531, 'station': 210, 'name': 'EV24_VALKENBURG'}
filename :
station : 210
meteo_var : EV24
EV24 station_opvulwaarde
2016-05-01 01:00:00 0.0023 NaN
2016-05-02 01:00:00 0.0036 NaN
2016-05-03 01:00:00 0.0029 NaN
2016-05-04 01:00:00 0.0034 215
2016-05-05 01:00:00 0.0028 215
2016-05-06 01:00:00 0.0042 215
2016-05-07 01:00:00 0.0043 215
2016-05-08 01:00:00 0.0036 215
2016-05-09 01:00:00 0.0048 215
2016-05-10 01:00:00 0.0045 215
I messed up the 'getting started' part of the documentation, see https://hydropandas.readthedocs.io/en/latest/getting_started.html. I will fix this.
When you have an ObsCollection, a property of an observation point can be defined on three different levels.
This can be confusing when you want to change or set the value of such a property. I think it would be nice to be able to access a property on all three levels, so maybe we should use a setter method that changes the value on all levels at once. Does this seem reasonable, or is this not how setters work?
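A minimal sketch of what such a setter could look like, assuming the levels include the ObsCollection DataFrame, an attribute on the Obs object, and the Obs.meta dict. DummyObs and set_property are made up for illustration; a real implementation would be a method on ObsCollection:

```python
import pandas as pd

class DummyObs:
    # minimal stand-in for an Obs object
    def __init__(self, name, x):
        self.name = name
        self.x = x
        self.meta = {"x": x}

def set_property(oc, name, prop, value):
    """Set `prop` on every level at once: the collection's DataFrame,
    the Obs attribute, and the Obs.meta dict (levels assumed here)."""
    oc.loc[name, prop] = value
    o = oc.loc[name, "obs"]
    setattr(o, prop, value)
    o.meta[prop] = value

o = DummyObs("well1", x=100.0)
oc = pd.DataFrame({"x": [o.x], "obs": [o]}, index=[o.name])
set_property(oc, "well1", "x", 200.0)  # all three levels now agree
```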
.get_regis_layers() raises an OSError:
OSError: [Errno -68] NetCDF: I/O failure: 'http://www.dinodata.nl:80/opendap/REGIS/REGIS.nc'
# src/netCDF4/_netCDF4.pyx:2012: OSError
# ----------------------------- Captured stderr call -----------------------------
# Error:curl error: Problem with the SSL CA cert (path? access rights?)
# curl error details:
# Warning:oc_open: Could not read url
Temporarily fixed (in an ugly way) in f4f89ce.
Maybe @rubencalje is familiar with this issue? It may have something to do with the netCDF4 Python package.
The start date currently defaults to one year of data. It would be nice to get the full time series by default.
Separate the reading of XML files from the API request so users can read manually downloaded XML files.
Originally posted by rt84ro, March 2, 2023:
Hi everyone,
I have downloaded some wells from the BROloket website, but their format is .xml. I want to read them using HydroPandas, but I do not know how to open these files. Could you please let me know how to read them?
Right now KNMI data is downloaded in ObsCollection.from_knmi(). KNMI data from stations is combined to generate an equidistant series.
For some reason there is a duplicate index at 1-1-1970; I think the KNMI simply sends this date twice. We should drop this duplicate index to keep the series equidistant.
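Dropping the duplicate index is a one-liner in pandas; a minimal illustration with a fabricated series containing the duplicated 1-1-1970 timestamp:

```python
import pandas as pd

# fabricated series with a duplicated timestamp, like the KNMI response
s = pd.Series(
    [0.1, 0.1, 0.2],
    index=pd.to_datetime(["1970-01-01", "1970-01-01", "1970-01-02"]),
)
# keep only the first occurrence of each duplicated timestamp
s = s[~s.index.duplicated(keep="first")]
```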
When I use:
from hydropandas import ObsCollection

stns = [344]  # Rotterdam
oc_knmi = ObsCollection.from_knmi(stns=stns,
                                  meteo_vars=["EV24", "RD"],
                                  start=['2010', '2010'],
                                  end=['2015', '2015'],
                                  verbose=True)
I get a really strange result. Evaporation comes from the correct station (Rotterdam), but RD suddenly comes from a different station (Tollebeek). I think this is because station 344 has no 'RD' measurements, only 'RH'; the combination with fill_missing_obs=True probably causes this result.
The function get_knmi_obslist has an option to get a time series at any location using spline interpolation. To clean up the io_knmi module, I propose to remove it from there and write a more general function that does the trick for any ObsCollection. I am aware that it would probably give bad results on most ObsCollections, but I would like to try it anyway. Furthermore, we could add methods other than spline interpolation in the future.
The function would probably look something like this:
def interpolate_oc(xy, oc, method='spline'):
    # some magic
    return observation
Curious what your thoughts are on this @martinvonk
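As a proof of concept, here is a sketch of the spatial part of such a function using inverse-distance weighting instead of splines (a deliberately simpler stand-in method; the function name and signature are made up). The real interpolate_oc would apply something like this per timestamp across the observations in an ObsCollection:

```python
import numpy as np

def interpolate_value(xy, station_xy, station_values, power=2):
    """Inverse-distance-weighted estimate at xy from station values.

    A simpler stand-in for the proposed spline interpolation; assumes
    all inputs are in the same coordinate system.
    """
    xy = np.asarray(xy, dtype=float)
    station_xy = np.asarray(station_xy, dtype=float)
    d = np.linalg.norm(station_xy - xy, axis=1)
    if np.any(d == 0):
        # exact hit on a station: return its value directly
        return float(station_values[int(np.argmin(d))])
    w = 1.0 / d**power
    return float(np.sum(w * station_values) / np.sum(w))

# estimate halfway between two stations with values 1.0 and 3.0
v = interpolate_value((0.5, 0.0), [(0.0, 0.0), (1.0, 0.0)], np.array([1.0, 3.0]))
```

Swapping in a spline (e.g. via scipy) would then only change the body, not the interface.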
Wondering out loud whether it would make sense to move some of our custom functionality to a submodule, e.g. moving all the custom plotting capability to ObsCollection.customplots or something similar.
Because of the large number of methods and attributes on pandas DataFrames, it is sometimes hard to find our custom methods without looking at the source code. Restructuring this way might help in that regard, but it might also introduce difficulties I haven't yet thought about.
Curious about your thoughts!
Issue: reading a Dino zipfile returns a ValueError:
ValueError: Shape of passed values is (3329, 9), indices imply (3327, 9)
For example in tube B22D0155, filter 1.
This occurs while reshaping in io_dino.py, line 297:
measurements = pd.concat([measurements, s], axis=1)
This filter has a duplicate time index, which is the probable culprit.
Since Dinoloket is in a frozen state (no more data will be added by TNO), perhaps we can change _read_dino_groundwater_measurements to accommodate this?
Proposed change, from line 156 of io_dino.py:
try:
    measurements = pd.read_csv(f, header=None, names=titel,
                               parse_dates=['peildatum'],
                               index_col='peildatum',
                               dayfirst=True,
                               usecols=usecols)
    # drop duplicate timestamps, keeping the last entry
    measurements = measurements[~measurements.index.duplicated(keep='last')]
projects.knmi.nl seems to be down (permanently?).
The description page still seems to be up: https://www.knmi.nl/kennis-en-datacentrum/achtergrond/data-ophalen-vanuit-een-script
BRO data is now also available via Dinoloket. The BRO data, and also some of the DINO data, is available in XML format. If data is requested from Dinoloket as a zip file, the metadata of the wells is delivered in the folder 'Grondwatermonitoringput BRO' and the measurement data in the folder 'Grondwaterstandonderzoek BRO'. In the coming years increasingly more BRO data will become available.
Could you maybe add functionality for reading this BRO data to your package?
As discussed during the "Pastas gebruikersdag" (Pastas user day).
When I cloned the repository there was no directory named 'plots' inside the 'tests\data\2019-Dino-test' directory, which makes the tests for the plot functions fail. I created the plots directory, but it was not picked up by git when adding with 'git add .'. I could not find 'plots' in the .gitignore, so I don't understand why it is not added. Why is this directory not there?
ObsCollection does not yet have any methods for downloading KNMI data, which would be nice to have.
I propose something along these lines:
ObsCollection.from_knmi(stns=None, df=None, variable="RD")
This way either a list of stations or a DataFrame with 'x' and 'y' columns can be passed.
We have quite a few Dutch words in the repository (bovenkant_filter, maaiveld, etc.). For consistency we should switch to English completely. Maybe we can store a translation file somewhere that we can use when you need a Dutch keyword.
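The translation file could be as simple as a dict used with DataFrame.rename. The English names below are suggestions for illustration, not settled choices:

```python
import pandas as pd

# hypothetical Dutch-to-English translation table
NL_TO_EN = {
    "bovenkant_filter": "screen_top",
    "onderkant_filter": "screen_bottom",
    "maaiveld": "ground_level",
}

# rename Dutch metadata columns on the way in
df = pd.DataFrame(columns=["bovenkant_filter", "maaiveld"])
df = df.rename(columns=NL_TO_EN)
```

Inverting the dict would give the Dutch keyword back whenever a Dutch data source requires it.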
Shouldn't line 237 use start instead of the first None?
knmi_df, variables = get_knmi_daily_rainfall(url, 550, "RD", start, None, False, verbose=verbose)
Furthermore, the current last measurement at De Bilt is at 2020-11-30, so checking for the last 3 weeks of measurements seems too short.
Best wishes ;)
Possibly somewhat related to the list of ToDos in #90.
I'd like to pass a single setting to hpd.read_knmi:
hpd.read_knmi(
    locations=pstore.oseries,
    meteo_vars=("RD",),
    starts="2010",
    ends="2023-01-31",
    settings={"normalize_index": False},
)
Currently I have to provide all settings before this works. It can be fixed by updating the default settings with the user-provided settings:
default_settings = _get_default_settings(settings)
if settings is not None:
    # user-provided settings override the defaults
    default_settings.update(settings)
settings = default_settings
I believe there are quite some differences between the various from_* methods. Sometimes all the logic lives in the io/io_*.py files, whereas in other cases some of the logic is still performed in the from_*() method itself. I think this is mostly the case in ObsCollection.
So I'm wondering out loud whether this is something that could be made more uniform?
Proposal
All io methods used for constructing Obs and ObsCollection objects should return either a DataFrame plus metadata or a list of Obs from the io_*.py files. The Obs.from_* methods take the DataFrame and metadata and convert them to an Obs instance in the classmethod. The ObsCollection.from_* methods take the list of Obs and convert it to an ObsCollection instance.
I think making this more uniform will make it easier to write new methods. Thoughts?
Edit: typos
When I call the test test_knmi_collection_from_locations(), data from precipitation station (neerslagstation) Heibloem should be downloaded using this URL:
This URL returns the message:
We're sorry, but something went wrong.
If you are the application owner check the logs for more information.
I don't know what is wrong. When I replace the station number with '550' for De Bilt, everything works fine. When I check the station number on this website it seems correct. Also, when I manually download the .txt file I get the requested data for the requested period. When I check on this website, Heibloem has a different station number: '096'. When I call the URL with that station number, I get an empty text file for station Heibloem.
I thought about the way metadata is stored in Obs and ObsCollection objects. A lot of metadata (bovenkant_filter, onderkant_filter, surface_level) is currently stored as a float. In practice these values can change over time. I think we should have the option to store all values with a date, while still showing a single value for the metadata when you have an ObsCollection.
I was thinking we could create a metadata object that behaves like this:
from datetime import datetime

o = Obs()
o.bovenkant_filter = NewMetadataObject({datetime(2010, 1, 1): -9.26,
                                        datetime(2014, 1, 1): -10.2})
print(o.bovenkant_filter)  # you get the latest bovenkant_filter available
>>> -10.2
print(o.bovenkant_filter.history)  # you get the full history
>>> {datetime.datetime(2010, 1, 1): -9.26,
     datetime.datetime(2014, 1, 1): -10.2}
In the ObsCollection dataframe we can just use the latest value, so we still have a single value in the ObsCollection, while the full history is retained in the NewMetadataObject stored as an attribute of the Obs object.
What are your thoughts on this?
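A minimal sketch of what such an object could look like; the class name and behaviour are assumptions based on the example above, not an existing hydropandas class:

```python
from datetime import datetime

class MetadataHistory:
    """Sketch of the proposed NewMetadataObject: keeps dated values in
    .history and presents the most recent one by default."""

    def __init__(self, history):
        # store entries sorted by date, so the last key is the latest
        self.history = dict(sorted(history.items()))

    @property
    def latest(self):
        return self.history[max(self.history)]

    def __float__(self):
        return float(self.latest)

    def __repr__(self):
        # printing the object shows only the latest value
        return repr(self.latest)

top = MetadataHistory({datetime(2010, 1, 1): -9.26,
                       datetime(2014, 1, 1): -10.2})
```

Arithmetic support (so the object behaves like a plain float in existing code) could be added via further dunder methods.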
The WSDL service seems to have been taken down... :(
For reference this was the webservice URL: http://www.dinoservices.nl/gwservices/gws-v11?wsdl
Time to figure out how we can download the timeseries data now...
There is already code to read a waterinfo csv file, but no code yet to use an API to get waterinfo data. This package (https://github.com/openearth/ddlpy) uses the API, so maybe we can build on it.
When two observation points from Dinoloket are part of the same cluster list, the time series you get with the from_dino method are identical. This can be annoying if you want to get all observations within an extent and end up with multiple identical time series with different metadata. Below is an example to illustrate this.
A possible solution is to slice each time series using the startDate and endDate in its metadata. This way you end up with the same dataset as you would get from a manual Dinoloket download. Example:
import observations as obs
o1 = obs.GroundwaterObs.from_dino(location='B38E0249')
o2 = obs.GroundwaterObs.from_dino(location='B38E0250')
print(o1.x)
>>> 120340
print(o2.x)
>>> 120350
print(o1.meta)
>>> {'topDepthMv': None,
'bottomDepthMv': None,
'diver': 'FALSE',
'startDate': '1960-01-28',
'endDate': '1965-05-14',
'headCount': 91,
'clusterId': 'B38E0248',
'clusterList': 'B38E0248;B38E0250',
'stCount': 1,
'saCount': 0,
'crs': 'RD',
'startDepthNap': None,
'endDepthNap': None,
'startDepthMv': None,
'endDepthMv': None,
'startDateLevels': '1960-01-28',
'endDateLevels': '1965-05-14',
'startDateSamples': None,
'endDateSamples': None}
print(o2.meta)
>>> {'topDepthMv': 1.53,
'bottomDepthMv': 2.03,
'diver': 'FALSE',
'startDate': '1968-03-28',
'endDate': '2000-10-14',
'headCount': 727,
'clusterId': 'B38E0248',
'clusterList': 'B38E0248;B38E0249',
'stCount': 1,
'saCount': 0,
'crs': 'RD',
'startDepthNap': -2.17,
'endDepthNap': -2.67,
'startDepthMv': 1.53,
'endDepthMv': 2.03,
'startDateLevels': '1968-03-28',
'endDateLevels': '2000-10-14',
'startDateSamples': None,
'endDateSamples': None}
print(o1)
>>> stand_m_tov_nap remarks
1952-01-14 -1.04 None
1952-01-28 -1.15 None
1952-02-15 -1.15 None
1952-02-28 -1.35 None
1952-03-14 -1.38 None
... ... ...
2000-08-14 -1.47 None
2000-08-28 -1.57 None
2000-09-14 -1.63 None
2000-09-28 -1.55 None
2000-10-14 -1.40 None
print(o2)
>>> stand_m_tov_nap remarks
1952-01-14 -1.04 None
1952-01-28 -1.15 None
1952-02-15 -1.15 None
1952-02-28 -1.35 None
1952-03-14 -1.38 None
... ... ...
2000-08-14 -1.47 None
2000-08-28 -1.57 None
2000-09-14 -1.63 None
2000-09-28 -1.55 None
2000-10-14 -1.40 None
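The slicing solution could look something like this sketch, which trims a series to the startDate/endDate recorded in its own metadata. slice_to_own_period is a hypothetical helper; the metadata keys follow the example above:

```python
import pandas as pd

def slice_to_own_period(series, meta):
    """Trim a cluster time series to the startDate/endDate in the
    observation's own metadata."""
    return series.loc[pd.Timestamp(meta["startDate"]):pd.Timestamp(meta["endDate"])]

# fabricated series spanning the whole cluster period
s = pd.Series(
    [-1.04, -1.15, -1.40],
    index=pd.to_datetime(["1952-01-14", "1968-05-01", "2000-10-14"]),
)
meta = {"startDate": "1968-03-28", "endDate": "2000-10-14"}
trimmed = slice_to_own_period(s, meta)  # drops the 1952 value
```

Applied to o1 and o2 above, the two trimmed series would cover disjoint periods instead of being identical.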
The Dino WFS service (for metadata) seems to have been down since at least yesterday... It might be temporary, but maybe they've replaced it with something new? Something to keep an eye on, and to fix if it doesn't come back online in the near future.
ConnectionError: HTTPSConnectionPool(host='broinspireservices.nl', port=443): Max retries exceeded with url: /wfs/osgegmw-a-v1.0?service=WFS&request=GetCapabilities&version=2.0.0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000012FEE011FD0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
When trying to make an interactive map with map_labels (using 'monitoring_well' as the label), I get a KeyError. This happens because almost all of the metadata is popped (e.g. monitoring_well = meta.pop("monitoring_well")) when the metadata is returned in the from_bro function in observation.py. Why is that done?
I see that you've added a file called pytest.ini. What is the use of this file?
I ask this because I cannot run pytest from command line anymore. I get the following error:
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...] pytest: error: unrecognized arguments: --cov-report --cov-report html:htmlcov --cov observations
I have recently created some plots of my observations. I think these might be useful for other users as well and could be included in HydroPandas. What do others think?
Section and observations
On the left, a cross-section with characteristics of the observation wells and the bandwidth of the observations; on the right, all observations.
Annual mean and observations
On the left, the annual mean per month; on the right, all observations. I highlighted 10 September 2022 because of the severe precipitation on that day.
A bunch of features are only available with the art_tools module.
I think my current preference would be to keep some of these methods private (MapGraph, OpenTopo backgrounds, KNMI filling of missing measurements, AHN) and make the rest work without art_tools.
If we ever want to make this public, we need to think about how we implement these features that would not be available to the general public.
Thoughts?
For the next release it would be nice to run through all the docstrings and add/improve them where necessary.
Also we can start thinking of creating docs with sphinx...
In the newest version of pytest, a warning is issued when a test returns something other than None. I changed most of the tests so they return None, but I could not change all of them because some of our tests call other tests and expect a return value. Tests calling other tests is probably bad practice and should be changed.
Sometimes I want to create an Obs object from another Obs object. For example I want a GroundwaterObs object from a WaterlvlObs object.
import numpy as np
import pandas as pd
import hydropandas as hpd

# create WaterlvlObs
df = pd.DataFrame({'measurements': np.random.randint(0, 10, 5)},
                  index=pd.date_range('2020-1-1', '2020-1-5'))
o_wl = hpd.WaterlvlObs(df, name='obs', x=0, y=0, source='my fantasy',
                       meta={'place': 'Winterfell'})

# This is what I want to do, but now I will lose all metadata
o_gw = hpd.GroundwaterObs(o_wl)

# This is what I have to do now to keep all metadata
o_gw = hpd.GroundwaterObs(o_wl, name=o_wl.name, x=o_wl.x, y=o_wl.y,
                          source=o_wl.source, filename=o_wl.filename,
                          monitoring_well=o_wl.monitoring_well,
                          metadata_available=o_wl.metadata_available,
                          unit=o_wl.unit)
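One possible remedy, sketched with a dummy class: a from_obs classmethod that copies every attribute registered in pandas' _metadata mechanism, which hydropandas subclasses use to register their extra attributes. The classmethod name and the ObsLike stand-in are assumptions, not existing API:

```python
class ObsLike:
    # _metadata mirrors how pandas subclasses register extra attributes
    _metadata = ["name", "x", "y", "source"]

    def __init__(self, name="", x=0.0, y=0.0, source=""):
        self.name, self.x, self.y, self.source = name, x, y, source

    @classmethod
    def from_obs(cls, obs):
        # copy every registered metadata attribute in one go
        return cls(**{k: getattr(obs, k) for k in obs._metadata})

o_wl = ObsLike(name="obs", x=0.0, y=0.0, source="my fantasy")
o_gw = ObsLike.from_obs(o_wl)  # all metadata carried over
```

With such a classmethod on each Obs subclass, the long constructor call above would reduce to GroundwaterObs.from_obs(o_wl).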
Since the read functionality has been removed from pastas, we should add the example for reading Menyanthes files to hydropandas.
I am used to reading a certain file format with observations and creating a pandas DataFrame from it, where each row of the DataFrame corresponds to one location. The columns contain the metadata of the locations, and the time series information is stored in one of the columns.
Is it possible to easily make an ObsCollection from this DataFrame?
When I do (where df is the DataFrame I mentioned):
oc = obs.ObsCollection(df)
I get no error (I expected to get one), but I cannot use any of the ObsCollection methods. When I run pts = oc.to_pastastore(), for example, I get the following error:
AttributeError: 'ObsCollection' object has no attribute 'obs'
My guess is that I should create Obs objects, add them to a list, and then call ObsCollection.from_list(). Am I right?
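The restructuring step itself can be sketched with plain pandas, assuming hypothetical column names: split each row into a metadata dict plus a time series, which is the shape needed to build Obs objects for ObsCollection.from_list:

```python
import pandas as pd

# metadata: one row per location (column names are assumptions)
meta_df = pd.DataFrame({"x": [100.0, 200.0], "y": [300.0, 400.0]},
                       index=["well1", "well2"])
# time series per location, e.g. parsed from the same file
series = {
    "well1": pd.Series([1.0, 2.0], index=pd.date_range("2020-01-01", periods=2)),
    "well2": pd.Series([3.0, 4.0], index=pd.date_range("2020-01-01", periods=2)),
}

obs_parts = []
for name, row in meta_df.iterrows():
    # each (name, metadata, series) triple is what an Obs would be built
    # from, roughly:
    #   hpd.GroundwaterObs(series[name].to_frame("value"), name=name,
    #                      **row.to_dict())
    obs_parts.append((name, row.to_dict(), series[name]))
# the resulting list of Obs objects would then go into ObsCollection.from_list
```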
There are quite a few dependencies that are barely used; geopandas is an example. This makes hydropandas harder to install. We should treat these as optional dependencies.
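A common pattern for making such dependencies optional, shown as a sketch (the extras name in the error message is an assumption):

```python
# import lazily and fail with a helpful message only when the
# geopandas-dependent feature is actually used
try:
    import geopandas  # noqa: F401
    HAS_GEOPANDAS = True
except ImportError:
    HAS_GEOPANDAS = False

def to_gdf(df):
    # hypothetical method that needs geopandas
    if not HAS_GEOPANDAS:
        raise ImportError(
            "this function requires geopandas; install it e.g. with "
            "'pip install hydropandas[full]' (extra name is an assumption)"
        )
    return geopandas.GeoDataFrame(df)
```

Combined with an extras group in the packaging metadata, a plain install would stay lightweight while power users opt in.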
The name of this repository is too general. The repository cannot be used for every kind of observation (e.g. satellite observations) and is currently only used for Dutch hydrological observations. Therefore I think we should change the name to something more specific. I thought of something like:
After updating the pystore package from 0.1.9 to 0.1.14, I get an IndexError: list index out of range when running the following code (from test_001.py):
import pystore

pystore.set_path("./tests/data/2019-Pystore-test")
store = pystore.store("test_pystore")
coll = store.collection(store.collections[0])
Apparently store.collections is an empty list.
Hi!
Great stuff! Could it be that the headers in DINO changed?
This is what I've done:
I get a KeyError: 'startdatum'.
That is not surprising, as I see START DATUM in the header of the attached file.
If you can confirm this is due to a change in DINO (and not my ignorance), I can modify the code myself and submit a merge request for you to review.
Thanks in advance!
The files with metadata about KNMI stations, 'knmi_meteostation.json' and 'knmi_neerslagstation.json', are not up to date. Some stations that can be downloaded are not in the list, and some stations in the list cannot be downloaded.
The functions I just committed deal with the errors this causes, but the fill_missing_measurements function only uses the stations from the json files and is therefore less accurate than it could be.
It would be nice if we can find new metadata to replace these files.
KNMI meteostations have a daily and hourly export format.
E.g. on a daily basis, pressure is available as:
PG = Daily mean sea level pressure (in 0.1 hPa) calculated from 24 hourly values
PX = Maximum hourly sea level pressure (in 0.1 hPa)
PXH = Hourly division in which PX was measured
E.g. on an hourly basis, pressure is available as:
P = Air pressure (in 0.1 hPa) reduced to mean sea level, at the time of observation
The documentation of MeteoObs.from_knmi only shows the parameters that are in the daily format.
I tried to import hourly pressure data, but the column contains no data:
o = hpd.MeteoObs.from_knmi(310, meteo_var='P', interval='hourly', fill_missing_obs=False)