searvey's Issues

dtype Category not writeable

country=df.country.astype("category"),

The normalization referenced above creates a problem when trying to save to (Geo)JSON.

Ex.

from searvey import ioc
import shapely
geo_box = shapely.geometry.box(-25.0, 56.0, -12.0, 74.0)
tg = ioc.get_ioc_stations(region=geo_box)
tg.dtypes
tg.to_file('stations.json')

Setting

tg['country']=tg.country.astype('str')

solves the problem.

Maybe this conversion should be included in the final output on the searvey side? I am not sure why the category dtype is needed.

Source catalogs

In the past, I have used this procedure to create the catalogs from the providers' websites.

We need to review it, possibly update it, and include something similar in searvey.

@yosoyjay @zacharyburnettNOAA @pmav99

naming convention for attributes

Hi Guys, I have been able to successfully read both the COOPS and IOC GeoDataFrames and display the stations on a map (yellow is COOPS).
I noticed that some variable names are different in the data sets. E.g., the station name in the COOPS dataset is named "name" and in the IOC dataset "location". Will this be unified at some point (naming convention)?
I will start with some filtering of attributes soon and I assume this could be an interesting topic to discuss.
Best!

CERA_Searvey_v1

windows and multiprocessing

Hey All,
I am trying to retrieve the USGS stations for the CERA workflow and I am just running the following two lines (in a python script, python version is 3.10):

from searvey import usgs, stations
usgs_stations = usgs.get_usgs_stations()

I get a pretty lengthy error log - any idea what that could be? Thanks!

usgs_test_error.log

Originally posted by @carolakaiser in #79 (comment)
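A likely culprit on Windows is the multiprocessing start method: spawned worker processes re-import the script, so any module-level code runs again in every worker. A minimal sketch of the usual workaround, assuming the error does come from multiprocessing (the script layout is hypothetical):

# cera_stations.py -- hypothetical script name
from searvey import usgs

def main() -> None:
    # Runs only in the parent process, not in the re-imported workers.
    usgs_stations = usgs.get_usgs_stations()
    print(len(usgs_stations))

if __name__ == "__main__":
    # Required on Windows, where multiprocessing uses the "spawn" start method.
    main()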

retrieve IOC sensor types as metadata

Hi All,

Would it be possible to retrieve the sensor types for the IOC stations as a metadata attribute (with a list of sensors) for each station?

Including support for Army Corps WL data

@gseroka @allenea @carolakaiser @SorooshMani-NOAA

All

Here I am adding some communication between @allenea @carolakaiser and others on how to access Army Corps WL data and its relative vertical datum.

It would be great if we continued this discussion here so we have good documentation until we find a brave volunteer to implement the support in searvey.

Please tag others here as well.
Thanks
-Saeed

Update
I revised the formatting of the quoted emails below to make them easier to read:

FYI... I followed up with the US Army Corp (for southern region location) and the only way to know those offsets is by looking at the [dated] website. It looks like most gages are going to be in NGVD29 and the numbers seem to line up with the VDatum offset to NAVD88. However, that likely doesn't apply for all gages from what I've been told. So it may take some reverse engineering to determine what the offset to get to NAVD88 is and from there you could use VDatum to figure out which datum correction gives you that offset value. Or we'll have to manually collect those offsets and store them somewhere that they can be referenced.

Eric


Thank you. That's what we'll do. I just wanted to make sure there was no better streamlined approach for determining that information.

Best,

Eric


If memory serves correct NGVD29 is predominant. You can navigate to the gage on the site and it should tell you what datum is used.

lb


Hi Lysanias,

This is very informative. Thank you for providing such a quick reply.

I am noticing one key piece of information missing. The piece about the "Adjustment for vertical datum NAVD88 (2009.55): -0.96 ft." is important, because the vertical datum in which the gage collects data isn't explicitly stated in any of these options. Maybe all gages are NGVD29?

https://rivergages.mvr.usace.army.mil/WaterControl/stationinfo2.cfm?sid=01300

If that is the case, can you confirm or provide documentation that supports that assumption? I used VDatum and it shows the offset to be -0.961. So not exact but pretty close.

Is this information in the API/web service anywhere or is there documentation that provides this information?

V/r,

Eric


Eric,
RiverGages

The only parameter that may need to change is the site id, unless you are pulling time series data. GetVariableInfo, GetSites, and GetSiteInfo only rely on the “site” parameter, while GetValues and GetFValues require the additional parameters to be set to pull the correct data. All parameters require a value for all methods, even if they are not used by the method, because of how ColdFusion components operate when invoked.

The first link is an example of how to query for site information. The other methods query for meta-data and observations:

https://rivergages.mvr.usace.army.mil/watercontrol/webservices/rest/webserviceWaterML.cfc?method=RGWML&meth=getSites&site=rcki2&location=rcki2&variable=HP&beginDate=2019-03-25T00:00&endDate=2019-03-26T23:59&authtoken=RiverGages&authToken=RiverGages

Below are all the "Meth"-ods for the RiverGages WaterML webservice. All the URL parameters are always required for any call; follow the above example as a template.

  • GetValues
  • GetVariableInfo
  • GetSites
  • GetSiteInfo
  • GetFValues (same as GetValues except it only returns observed/measured values rather than forecast values)

lb


Eric,

We have a web service that provides temporal and meta-data. I will send you the details.

lb


Hi,

I was wondering if you had an API or some sort of way for us to easily look up things like lat/lon, gage elevation, gage datum, offset to NAVD88 to other offsets that you have collected and provided via the website. Do we have to manually go through the website and record these offsets?

We have learned about this method for retrieving data, however, we aren't sure what datum this data is in or what offsets we need to make to correctly adjust the water level datum.

https://rivergages.mvr.usace.army.mil/watercontrol/webservices/rest/webserviceWaterML.cfc?method=RGWML&meth=getValues&location=01300&site=01300&variable=HG&beginDate=2020-04-05T00:00&endDate=2020-04-10T23:59&authToken=RiverGages

Can you please advise?

Eric
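Based on the web service details above, a minimal sketch of querying the getSites method with requests (the site/location/variable values are simply the ones from the example URL; the response format has not been verified here):

import requests

BASE_URL = (
    "https://rivergages.mvr.usace.army.mil/watercontrol/webservices/rest/"
    "webserviceWaterML.cfc"
)

# Per the email above, every URL parameter must be supplied for every "meth",
# even when the chosen method does not use it.
params = {
    "method": "RGWML",
    "meth": "getSites",
    "site": "rcki2",
    "location": "rcki2",
    "variable": "HP",
    "beginDate": "2019-03-25T00:00",
    "endDate": "2019-03-26T23:59",
    "authToken": "RiverGages",
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()
print(response.text[:500])  # WaterML (XML) payload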

Simplify the COOPS stations API

Currently there are 3 different functions for retrieving the COOPS station data:

coops_stations()
coops_stations_within_region()
coops_stations_within_bounds()

I think that these functions could be merged in a single function with something like the following signature:

def coops_stations(
    region: Optional[Union[Polygon, MultiPolygon]] = None,
    lon_min: Optional[float] = None,
    lon_max: Optional[float] = None,
    lat_min: Optional[float] = None,
    lat_max: Optional[float] = None,
    station_status: Optional[StationStatus] = None
) -> pd.DataFrame

The idea is that:

  • if neither region nor the BBox are specified, then all the stations are returned. This should be the default behavior.
  • if only region is specified then the stations outside of the region are filtered out
  • if only BBox is specified then the stations outside of the BBox are filtered out
  • if both region and BBox are specified then an exception is raised

@zacharyburnett what do you think?
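For illustration, a rough sketch of the dispatch logic the merged function could use (following the proposed signature above; the helper function and the status column are hypothetical):

import shapely.geometry

def coops_stations(region=None, lon_min=None, lon_max=None,
                   lat_min=None, lat_max=None, station_status=None):
    bbox_values = (lon_min, lon_max, lat_min, lat_max)
    has_bbox = any(value is not None for value in bbox_values)
    if region is not None and has_bbox:
        raise ValueError("Specify either `region` or a BBox, not both")
    if has_bbox:
        region = shapely.geometry.box(lon_min, lat_min, lon_max, lat_max)
    stations = _retrieve_all_coops_stations()  # hypothetical helper: fetch everything
    if region is not None:
        stations = stations[stations.within(region)]
    if station_status is not None:
        stations = stations[stations.status == station_status]  # hypothetical column
    return stations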

Avoid function calls as default arguments in functions

Function calls should be avoided as default arguments.

import datetime
import time

def print_isotime(timestamp: datetime.datetime = datetime.datetime.now()) -> None:
    print(timestamp.isoformat())

for i in range(3):
    print_isotime()
    time.sleep(1)

The above snippet will print:

2023-03-03T12:15:00.714514
2023-03-03T12:15:00.714514
2023-03-03T12:15:00.714514

I.e. the timestamps do not change. This happens due to the early binding of default argument values and is a well-known gotcha.

We use this anti-pattern in IOC and USGS. I introduced it and I guess that @SorooshMani-NOAA just followed my lead.

endtime: Union[str, datetime.date] = datetime.date.today(),

endtime: Union[str, datetime.date] = datetime.date.today(),

In practical terms this is a problem because, in a long-running script, the date will not be updated after midnight.

There are several ways to avoid this. The most common one is to use None as the default value and make the function call in the body of the function. E.g.:

def print_time(timestamp: datetime.datetime | None = None) -> None:
    if timestamp is None:
        timestamp = datetime.datetime.now()
    print(timestamp.isoformat())

Nevertheless, this is not ideal when the code is statically typed, because it changes the signature. Anyhow, we should remove the anti-pattern.

architecture discussion

As far as the abstraction architecture goes, I propose that the data getters be abstracted into at least 2 layers:

  1. the data source layer - this should encompass specific data sources, and methods that query their APIs. Data sources should output an (at least) semi-standardized data format, such as geopandas.GeoDataFrame or xarray.Dataset.
  2. the data product layer - this should encompass abstractions of data products (e.g. water level, water velocity, storm track, etc.). The output of this layer should be much more standardized as this is what the end user actually cares about.

The key point here is that one data source may contribute to multiple data products, and one data product might be contributed to by multiple data sources. Separating into these distinct layers would make it much easier to maintain data sources and add new sources and / or products in the future.
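A purely illustrative sketch of the two layers (all names are invented for the example):

import geopandas as gpd
import xarray as xr

# Layer 1: data sources -- one function/module per provider, semi-standardized output.
def fetch_coops_water_level(station_id: str) -> xr.Dataset:
    """Query the CO-OPS API and return a Dataset (hypothetical source function)."""
    ...

def fetch_ioc_water_level(ioc_code: str) -> xr.Dataset:
    """Query the IOC API and return a Dataset (hypothetical source function)."""
    ...

# Layer 2: data products -- fully standardized output built from one or more sources.
def water_level(stations: gpd.GeoDataFrame) -> xr.Dataset:
    """Combine whichever sources cover the requested stations (hypothetical)."""
    datasets = []
    for _, station in stations.iterrows():
        if station["provider"] == "coops":  # hypothetical column
            datasets.append(fetch_coops_water_level(station["id"]))
        elif station["provider"] == "ioc":
            datasets.append(fetch_ioc_water_level(station["id"]))
    return xr.merge(datasets)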

Rate limit object and multi processing/threading

In IOC code at:

searvey/searvey/ioc.py

Lines 327 to 344 in 0fc8bae

for ioc_code in ioc_metadata.ioc_code:
    func_kwargs.append(
        dict(
            period=period,
            endtime=endtime,
            ioc_code=ioc_code,
            rate_limit=rate_limit,
            truncate_seconds=truncate_seconds,
        ),
    )
results = multithread(
    func=get_ioc_station_data,
    func_kwargs=func_kwargs,
    n_workers=5,
    print_exceptions=False,
    disable_progress_bar=disable_progress_bar,
)

Is the RateLimit object shared between all the threads, or is a new copy created for each one? If the latter, then it might not work as intended.
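For what it's worth, in CPython an object passed by reference through func_kwargs is the same instance in every thread (threads share memory); a toy check, unrelated to searvey's actual RateLimit implementation:

import threading

class Counter:
    """Stand-in for the RateLimit object."""
    def __init__(self) -> None:
        self.seen_ids = set()

def worker(counter: Counter) -> None:
    counter.seen_ids.add(id(counter))  # record the identity of the object we received

counter = Counter()
threads = [threading.Thread(target=worker, args=(counter,)) for _ in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(counter.seen_ids))  # 1 -> every thread saw the very same instance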

Modify critech function

This is in fact the EMODNET function custom made for a subset of the available data.

Since critech has another API, soon to be available from Azure, we should generalise this one and later develop a new one for critech.

USGS improvements

Notes from today's meeting

  • Add a state column
  • If possible, get_usgs_stations() should contain a column with the data_type/variable.
  • get_usgs_stations(region) should retrieve all the stations and use region to filter them. Currently it uses the USGS "region" API, which limits the number of stations we can retrieve.
  • we should add USGS data to searvey.stations.get_stations()
  • Add new parameters to the data fetched (#65 (comment))

IOC: Stations which return data multiple times per minute are dropped.

IOC has a lot of stations. Each station sends data at a different moment. When you aggregate all the data in a single xr.Dataset, you end up with a huge number of timestamps, which means that the dataset has a huge number of NaNs which also means that we need a huge amount of RAM.

In order to reduce the RAM requirements, we decided to truncate the seconds from the timestamps. This helps a lot with the size and should not significantly change the signal. Nevertheless, if there are stations that return data multiple times per minute, then we end up with duplicate timestamps. At the moment we are dropping these stations.

Instead of that we should:

  1. Make the truncation optional
  2. Drop one of the duplicate timestamps, not the whole station!

This is the relevant code:

searvey/searvey/ioc.py

Lines 233 to 241 in 08941a3

df = df.assign(
    ioc_code=ioc_code,
    # Normalize timestamps to minutes: https://stackoverflow.com/a/14836359/592289
    # WARNING: the astype() truncates seconds. This can potentially lead to duplicates!
    time=pd.to_datetime(df.time).astype("datetime64[m]"),
)
if len(df) != len(df.time.drop_duplicates()):
    msg = f"{ioc_code}: The sampling ratio is less than 1 minute. Data have been lost"
    raise ValueError(msg)
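For point 2 above, a possible sketch (not the actual fix) that keeps the first value per minute instead of raising:

import logging

import pandas as pd

logger = logging.getLogger(__name__)

def normalize_timestamps(df: pd.DataFrame, ioc_code: str) -> pd.DataFrame:
    """Truncate seconds, then drop duplicate timestamps instead of dropping the station."""
    df = df.assign(
        ioc_code=ioc_code,
        time=pd.to_datetime(df.time).astype("datetime64[m]"),
    )
    if len(df) != len(df.time.drop_duplicates()):
        logger.warning("%s: sub-minute sampling; keeping the first value per minute", ioc_code)
        df = df.drop_duplicates(subset="time", keep="first")
    return df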

Make the example Notebooks binder friendly

Both the IOC and COOPS notebooks work fine on Binder. The CERA one produces some errors, and the USGS_data one crashes the kernel, most likely due to the memory constraint. Note that Binder has a ceiling of 2GB of memory.

@SorooshMani-NOAA Can you have a look and push new versions of the notebooks? There is no need for a release since the notebooks are read from the repo.

gather information about vertical datum from USGS stations

Hey All,
As discussed in our meeting, we have been asked to explore the possibility of adding USGS stations for the US Pacific Coast. This will very likely be a larger number of stations and your service to retrieve USGS stations is very helpful. Thank you for that!
Would it be possible to get any information about the vertical datum for each USGS station? The USGS website has this info, but it requires manual work to retrieve it. Any way to get this in an automated way would be very much appreciated!
Thank you.

Building docs on CI fails with an error for searchindex.js.tmp

While working on #28 I noticed some random CI failures while trying to build the docs. The error was something like this:

Exception occurred:
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/sphinx/builders/html/__init__.py", line 1125, in dump_search_index
    os.replace(searchindexfn + '.tmp', searchindexfn)
FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/searvey/searvey/docs/build/html/searchindex.js.tmp' -> '/home/runner/work/searvey/searvey/docs/build/html/searchindex.js'

It seems that there is a pending sphinx issue: sphinx-doc/sphinx#10111

Until the issue gets fixed and/or a workaround gets posted, I don't think that there is anything we can do about it. I am creating this issue just to raise awareness of it.

FWIW, I don't think we need to block any PR if we get such an error.

usgs tests are flaky

This test is currently failing:

searvey/tests/usgs_test.py

Lines 90 to 105 in 2a510e9

@pytest.mark.parametrize(
    "truncate_seconds,no_records",
    [
        pytest.param(True, 11, id="truncate_seconds=True"),
        pytest.param(False, 11, id="truncate_seconds=False"),
    ],
)
def test_get_usgs_station_data(truncate_seconds, no_records):
    """Truncate_seconds=False returns more datapoints compared to `=True`"""
    df = usgs.get_usgs_station_data(
        usgs_code="15484000",
        endtime=datetime.date(2022, 10, 1),
        period=1,
        truncate_seconds=truncate_seconds,
    )
    assert len(df) == no_records

with:

=============================================================================================================== FAILURES ===============================================================================================================
__________________________________________________________________________________________ test_get_usgs_station_data[truncate_seconds=False] __________________________________________________________________________________________
tests/usgs_test.py:105: in test_get_usgs_station_data
    assert len(df) == no_records
E   assert 18 == 11

The dataframe we get is:

df = usgs.get_usgs_station_data(
    usgs_code="15484000",
    endtime=datetime.date(2022, 10, 1),
    period=1,
    truncate_seconds=False,
)
                                                                  value qualifier unit                                               name
site_no  datetime                  code  option                                                                                          
15484000 2022-09-30 08:00:00+00:00 00065                           7.32         A   ft                                  Gage height, feet
         2022-09-30 16:30:00+00:00 00065                           7.31         A   ft                                  Gage height, feet
         2022-10-01 00:00:00+00:00 00065                            7.3         A   ft                                  Gage height, feet
         2022-10-01 02:30:00+00:00 00065                           7.28         A   ft                                  Gage height, feet
         2022-10-01 04:15:00+00:00 00065                           7.29         A   ft                                  Gage height, feet
         2022-10-01 19:00:00+00:00 00065                           7.27         A   ft                                  Gage height, feet
         2022-10-02 03:15:00+00:00 00065                           7.26         A   ft                                  Gage height, feet
         2022-09-30 08:00:00+00:00 00065 ak dot&pf bridge datum  639.04         P   ft                                  Gage height, feet
         2022-09-30 16:30:00+00:00 00065 ak dot&pf bridge datum  639.03         P   ft                                  Gage height, feet
         2022-10-01 00:00:00+00:00 00065 ak dot&pf bridge datum  639.02         P   ft                                  Gage height, feet
         2022-10-01 02:30:00+00:00 00065 ak dot&pf bridge datum   639.0         P   ft                                  Gage height, feet
         2022-10-01 04:15:00+00:00 00065 ak dot&pf bridge datum  639.01         P   ft                                  Gage height, feet
         2022-10-01 19:00:00+00:00 00065 ak dot&pf bridge datum  638.99         P   ft                                  Gage height, feet
         2022-10-02 03:15:00+00:00 00065 ak dot&pf bridge datum  638.98         P   ft                                  Gage height, feet
         2022-09-30 08:25:00+00:00 72414 ak dot&pf bridge datum   632.3         P   ft  Streambed elevation at measurement point, abov...
         2022-09-30 08:55:00+00:00 72414 ak dot&pf bridge datum  632.21         P   ft  Streambed elevation at measurement point, abov...
         2022-09-30 12:55:00+00:00 72414 ak dot&pf bridge datum  632.12         P   ft  Streambed elevation at measurement point, abov...
         2022-09-30 14:55:00+00:00 72414 ak dot&pf bridge datum  632.38         P   ft  Streambed elevation at measurement point, abov...
[18 rows x 4 columns]

I guess that USGS started to return the additional 7 rows recently. Judging from the output, the extra rows that cause the issue are the first 7 ones, i.e. the ones without an option.

Is it normal for USGS to add data like this?

Regardless, this specific test does not seem to be testing what it is supposed to test (i.e. truncate_seconds), because the timestamps don't contain any seconds to begin with. We should probably choose a different combination of usgs_station and/or time period. Furthermore, it will probably be more robust if we make the assertion like this:

df = usgs.get_usgs_station_data(..., truncate_seconds=False)
df_trunc = usgs.get_usgs_station_data(..., truncate_seconds=True)
assert len(df) > len(df_trunc)

seconds_trunc = df_trunc.index.to_frame().datetime.dt.second
assert (seconds_trunc == 0).all()

pinging @SorooshMani-NOAA

PS. There are more tests that are failing, probably due to USGS returning extra data, but I didn't dig deeper into those.

FAILED tests/usgs_test.py::test_get_usgs_station_data[truncate_seconds=True] - assert 18 == 11
FAILED tests/usgs_test.py::test_get_usgs_station_data[truncate_seconds=False] - assert 18 == 11
FAILED tests/usgs_test.py::test_get_usgs_data[truncate_seconds=True] - AssertionError: assert 226 == 222
FAILED tests/usgs_test.py::test_get_usgs_data[truncate_seconds=False] - AssertionError: assert 226 == 222
FAILED tests/usgs_test.py::test_get_usgs_station_data_by_string_enddate - assert 30 == 17

coops: Include datum information

As discussed yesterday in the meeting, it might be a good idea to include datum information in the COOPS metadata/data.

I only had a brief look, but I think that in order to retrieve the datum of each station you need to make an extra HTTP request for each station. E.g.
https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/1611400/datums.json

That being said, if I understand this correctly, the data we retrieve from COOPS are being requested using the STND datum (i.e. Station Datum) by default, which I believe is 0 for all the stations. This might be a bit problematic when you request data from multiple stations, because the values will not be comparable across stations.

Another issue that we might have is what happens when you request historical data. Some stations might be referencing a different epoch, which means that we might need to retrieve the "supersededdatums" instead. E.g.
https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/2695540/supersededdatums.xml
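For reference, a minimal sketch of that extra per-station request against the mdapi endpoint mentioned above (I have only glanced at the response, so treat the key names as a guess):

import requests

def get_coops_datums(station_id: str) -> dict:
    """Fetch the datums for a single CO-OPS station from the metadata API."""
    url = (
        "https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/"
        f"{station_id}/datums.json"
    )
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

datums = get_coops_datums("1611400")
print(datums.get("datums"))  # assumed key; adjust after inspecting the payload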

Implementing light wrapper around `Dataframe`s and `Dataset`s

Right now, since all the functions return basic Dataframe and Dataset objects, it is not possible to chain method calls for easier access. The following snippet demonstrates what I mean:

    stations = coops.coops_stations_within_region(region=bbox)
    data = coops.coops_product_within_region(
        'water_level', region=bbox, start_date=start, end_date=end)

In this case the stations dataframe has some information not included in the data object, e.g. name. If we had a lightweight wrapper around these return values, then additional functionality could be provided to do something like this:

    stations = coops.coops_stations_within_region(region=bbox)
    data = stations.get_product('water_level',  start_date=start, end_date=end)

The wrapper could also provide more sophisticated functionality such as normalization, vdatum conversions, etc. In this case the raw underlying data can be exposed via a .raw or .data method or property of the light wrapper.

Of course I understand this goes against the idea of providing a light API that gets raw data for the user, so maybe this functionality could go into a downstream package that lightly wraps searvey.
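To make the idea concrete, a rough sketch of such a wrapper (the class name and behavior are invented for illustration only):

import geopandas as gpd

from searvey import coops

class StationsFrame:
    """Thin wrapper over the stations GeoDataFrame (illustrative only)."""

    def __init__(self, stations: gpd.GeoDataFrame) -> None:
        self._stations = stations

    @property
    def raw(self) -> gpd.GeoDataFrame:
        """Expose the underlying data untouched."""
        return self._stations

    def get_product(self, product: str, start_date, end_date):
        # Delegate to the existing getter; the wrapper could later attach
        # station metadata (e.g. `name`) to the returned object.
        region = self._stations.unary_union
        return coops.coops_product_within_region(
            product, region=region, start_date=start_date, end_date=end_date
        )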

Improve readme

The README should contain some basic examples of using the API.

Getting COOPS stations within region

In stormevents there's an assumption in one of the automated tests that if an empty geometry is passed to coops_stations_within_region, the returned data frame will be empty. Right now that test is broken because the logic in searvey is:

searvey/searvey/coops.py

Lines 824 to 827 in d654ad6

if region:
    return stations[stations.within(region)]
else:
    return stations

@pmav99, @carolakaiser in your opinions, does it make sense to keep searvey as it is and just update the stormevents test or should I modify how coops_stations_within_region works in searvey?

What was the use case for adding this code, @carolakaiser? Do you have a case where you pass an empty geometry and expect to get all of the stations? Or are you passing None instead of an empty geometry to get all the stations? If your use case is covered by passing None, then we can change the check to `if region is None:` and both of our use cases are covered.

What do you think?
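For concreteness, if the None convention is adopted, the check could look something like this (just a sketch; an explicitly empty geometry would then yield an empty frame):

if region is not None:
    # An empty geometry contains no stations, so the stormevents test would get
    # an empty frame back.
    return stations[stations.within(region)]
else:
    # Only an explicit `None` means "no filtering": return everything.
    return stations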

realtime streamflow and river data

Hi Zach,
Laura and I had a chance to go through the stormevents package. We wanted to send an email to you to describe our current data gaps in case stormevents would like to tackle some of these issues. Laura mentioned that it can be difficult to obtain some USGS and ACE data, especially streamflow data. See the attached spreadsheets for such data. Having this data accessible, and in the same format as CO-OPS data, would be a big help. In addition, it would be nice to have more unified and standardized data (e.g., similar metadata, similar datums, similar units, similar variable names, etc.). We saw that searvey is working on data standardization and were wondering if stormevents is working on data standardization too.
Just some thoughts! Sorry for the tardiness of this email - my internet was down this morning!
Best,
Josh
ACE_stationlist.csv
USGS_stationlist.csv

Filter COOPS stations in maintenance mode

When COOPS stations are in maintenance mode, the API might respond with corrupted data for those stations (e.g. duplicate timestamps). This causes an exception in the searvey code when we try to create a Dataset from the Dataframe.

Note that this is true even for the historical data of the stations. We need to distinguish the stations in normal vs maintenance mode and automatically filter stations under maintenance (maybe throw warnings too?) in order to avoid Python exceptions.

Example of a response Dataframe that causes problems:

                         v      s        f  q                                                                                                                                                                      t
2018-08-30 00:00:00 -0.788  0.033  0,0,0,0  v
2018-08-30 00:00:00  0.186  0.042  0,0,0,0  v
2018-08-30 00:06:00 -0.812  0.039  0,0,0,0  v
2018-08-30 00:06:00  0.218  0.045  0,0,0,0  v
2018-08-30 00:12:00 -0.835  0.033  0,0,0,0  v
...                    ...    ...      ... ..
2018-09-17 23:48:00  0.661  0.073  0,0,0,0  v
2018-09-17 23:54:00  0.701  0.104  0,0,0,0  v                                                                                                                                                                     
2018-09-17 23:54:00 -0.442  0.054  0,0,0,0  v
2018-09-18 00:00:00  0.699  0.107  0,0,0,0  v
2018-09-18 00:00:00 -0.449  0.061  0,0,0,0  v

And the line of code is

data = data.to_xarray()
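Until stations under maintenance can be detected and filtered upstream, a defensive workaround (a sketch, assuming the duplicate timestamps sit in the Dataframe index as in the example above) could be:

# Keep the first row for each duplicated timestamp before converting to xarray.
if data.index.duplicated().any():
    data = data[~data.index.duplicated(keep="first")]
data = data.to_xarray()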

Support other variables (temperature, velocity) from CO-OPS

@zacharyburnett @pmav99 @brey

Thanks for the great tool. I just want to report that the download process for water temperature went through nicely:

water_temperature

However, it generated an error on the netCDF file output:


In [1]: run data_region.py
 > get data ... water_temperature
  > plot figure for  water_temperature
 > save netcdf file
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /mnt/c/Users/Saeed.Moghimi/Documents/work/linux_working/00-working/15-stormevents/data_region.py:48, in <module>
     46     if True:
     47         print(' > save netcdf file')
---> 48         data.to_netcdf(var + '.nc')
     51 if True:
     52     import geopandas

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/core/dataset.py:1901, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1898     encoding = {}
   1899 from ..backends.api import to_netcdf
-> 1901 return to_netcdf(
   1902     self,
   1903     path,
   1904     mode,
   1905     format=format,
   1906     group=group,
   1907     engine=engine,
   1908     encoding=encoding,
   1909     unlimited_dims=unlimited_dims,
   1910     compute=compute,
   1911     invalid_netcdf=invalid_netcdf,
   1912 )

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/backends/api.py:1072, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1067 # TODO: figure out how to refactor this logic (here and in save_mfdataset)
   1068 # to avoid this mess of conditionals
   1069 try:
   1070     # TODO: allow this work (setting up the file for writing array data)
   1071     # to be parallelized with dask
-> 1072     dump_to_store(
   1073         dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1074     )
   1075     if autoclose:
   1076         store.close()

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/backends/api.py:1119, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1116 if encoder:
   1117     variables, attrs = encoder(variables, attrs)
-> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/backends/common.py:261, in AbstractWritableDataStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    258 if writer is None:
    259     writer = ArrayWriter()
--> 261 variables, attributes = self.encode(variables, attributes)
    263 self.set_attributes(attributes)
    264 self.set_dimensions(variables, unlimited_dims=unlimited_dims)

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/backends/common.py:350, in WritableCFDataStore.encode(self, variables, attributes)
    347 def encode(self, variables, attributes):
    348     # All NetCDF files get CF encoded by default, without this attempting
    349     # to write times, for example, would fail.
--> 350     variables, attributes = cf_encoder(variables, attributes)
    351     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    352     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/conventions.py:859, in cf_encoder(variables, attributes)
    856 # add encoding for time bounds variables if present.
    857 _update_bounds_encoding(variables)
--> 859 new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    861 # Remove attrs from bounds variables (issue #2921)
    862 for var in new_vars.values():

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/conventions.py:859, in <dictcomp>(.0)
    856 # add encoding for time bounds variables if present.
    857 _update_bounds_encoding(variables)
--> 859 new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    861 # Remove attrs from bounds variables (issue #2921)
    862 for var in new_vars.values():

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/conventions.py:279, in encode_cf_variable(var, needs_copy, name)
    277 var = maybe_default_fill_value(var)
    278 var = maybe_encode_bools(var)
--> 279 var = ensure_dtype_not_object(var, name=name)
    281 for attr_name in CF_RELATED_DATA:
    282     pop_to(var.encoding, var.attrs, attr_name)

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/conventions.py:234, in ensure_dtype_not_object(var, name)
    231             inferred_dtype = np.dtype(float)
    232         fill_value = inferred_dtype.type(np.nan)
--> 234     data = _copy_with_dtype(data, dtype=inferred_dtype)
    235     data[missing] = fill_value
    236 else:

File ~/miniconda3/envs/adcircpy_env/lib/python3.10/site-packages/xarray/conventions.py:195, in _copy_with_dtype(data, dtype)
    189 """Create a copy of an array with the given dtype.
    190
    191 We use this instead of np.array() to ensure that custom object dtypes end
    192 up on the resulting array.
    193 """
    194 result = np.empty(data.shape, dtype)
--> 195 result[...] = data
    196 return result

TypeError: float() argument must be a string or a real number, not 'NAType'

In [2]:


Here is my code:

from shapely.geometry import box
from datetime import datetime,timedelta
from stormevents.coops import coops_product_within_region

import geopandas
from  matplotlib import pyplot as plt


#vars = ['water_level','water_temperature','wind','salinity']
#vars = ['wind','salinity']
vars = ['water_level','water_temperature']
vars = ['water_temperature']


#hsofs bnd
#bnd = -99.0,5,-52.8,46.3    #US mainland
bnd = -98.5 ,-84.5,24.,31.5  #GulfM
bnd = -74.5 , -73.55  , 40.35 ,   41.2  #san_newyork

now_    = datetime.now()
past_   = now_ - timedelta(0.25)

for var in vars:
    print (' > get data ...', var)
    data = coops_product_within_region(
        product = var,
        start_date = past_.isoformat(),
        end_date   = now_.isoformat(),
        region=box(*bnd)
    )

    print (' > plot figure for ' , var)
    plt.figure()
    countries = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

    figure, axis = plt.subplots(1, 1)
    figure.set_size_inches(12, 12 / 1.61803398875)
    countries.plot(color='lightgrey', ax=axis, zorder=-1)
    plt.scatter(x=data.x,y=data.y,c=data.v.mean(axis=1))
    plt.colorbar()
    plt.xlim(*bnd[:2])
    plt.ylim(*bnd[2:])
    axis.set_title(var)
    plt.savefig(var + '.png')
    plt.close('all')

    if True:
        print(' > save netcdf file')
        data.to_netcdf(var + '.nc')

Please look into this.
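Not a fix on the searvey/stormevents side, but a possible workaround in the user script (an untested sketch): the traceback points at an object-dtype variable holding pandas.NA, which the netCDF encoder cannot coerce, so replacing NA values with np.nan before writing may help:

import numpy as np
import pandas as pd

# Hypothetical clean-up before writing: replace pd.NA with np.nan in object-dtype
# variables so xarray's CF encoder can infer a numeric dtype for them.
for name, variable in data.data_vars.items():
    if variable.dtype == object:
        cleaned = np.where(pd.isna(variable.values), np.nan, variable.values)
        data[name] = (variable.dims, cleaned)

data.to_netcdf(var + '.nc')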

active status for COOPS stations

When filtering the COOPS stations for being active like this:

coops_active = stations_coops[stations_coops['is_active']] # comes back as True/False

there are some results that show a station as active, but those stations either do not exist on the https://tidesandcurrents.noaa.gov website or deliver a "No data was found" result.

Which API do you use to get the active status (is it the NOAA OPeNDAP server or something like that)?

Examples:
https://tidesandcurrents.noaa.gov/waterlevels.html?id=8632837
https://tidesandcurrents.noaa.gov/waterlevels.html?id=8574728

Add USGS storm based time series source

@zacharyburnettNOAA @SorooshMani-NOAA

Adding USGS timeseries from https://stn.wim.usgs.gov/FEV/#FlorenceSep2018

This is some code I got from @flackdl. See this repo developed by Danny as well: https://github.com/flackdl/cwwed .

import os
import re
import sys
import errno
import requests
from io import open

EVENT_ID_MATTHEW = 135  # default

# capture an event_id from the command line, defaulting to Matthew
EVENT_ID = sys.argv[1] if len(sys.argv) > 1 else EVENT_ID_MATTHEW

# file type "data"
# https://stn.wim.usgs.gov/STNServices/FileTypes.json
FILE_TYPE_DATA = 2

# deployment types
# https://stn.wim.usgs.gov/STNServices/DeploymentTypes.json
DEPLOYMENT_TYPE_WATER_LEVEL = 1
DEPLOYMENT_TYPE_WAVE_HEIGHT = 2
DEPLOYMENT_TYPE_BAROMETRIC = 3
DEPLOYMENT_TYPE_TEMPERATURE = 4
DEPLOYMENT_TYPE_WIND_SPEED = 5
DEPLOYMENT_TYPE_HUMIDITY = 6
DEPLOYMENT_TYPE_AIR_TEMPERATURE = 7
DEPLOYMENT_TYPE_WATER_TEMPERATURE = 8
DEPLOYMENT_TYPE_RAPID_DEPLOYMENT = 9

# create output directory
output_directory = 'output'
try:
    os.makedirs(output_directory)
except OSError as exception:
    if exception.errno != errno.EEXIST:
        raise

# fetch event data files
files_req = requests.get('https://stn.wim.usgs.gov/STNServices/Events/{}/Files.json'.format(EVENT_ID))
files_req.raise_for_status()
files_json = files_req.json()

# fetch event sensors
sensors_req = requests.get('https://stn.wim.usgs.gov/STNServices/Events/{}/Instruments.json'.format(EVENT_ID))
sensors_req.raise_for_status()
sensors_json = sensors_req.json()

# filter sensors down to barometric ones
barometric_sensors = [sensor for sensor in sensors_json if sensor.get('deployment_type_id') == DEPLOYMENT_TYPE_BAROMETRIC]

# print file urls for barometric sensors for this event
for file in files_json:
    if file['filetype_id'] == FILE_TYPE_DATA and file['instrument_id'] in [s['instrument_id'] for s in barometric_sensors]:

        file_url = 'https://stn.wim.usgs.gov/STNServices/Files/{}/item'.format(file['file_id'])

        # fetch the actual file
        file_req = requests.get(file_url, stream=True)

        # capture the filename from the headers so we can save it appropriately
        match = re.match('.*filename="(?P<filename>.*)"', file_req.headers['Content-Disposition'])
        if match:
            filename = match.group('filename')
        else:
            filename = '{}.unknown'.format(file['file_id'])
            print('COULD NOT FIND "filename" in header, saving as {}'.format(filename))

        print('{}\t\t({})'.format(filename, file_url))

        with open('{}/{}'.format(output_directory, filename), 'wb') as f:
            for chunk in file_req.iter_content(chunk_size=1024):
                f.write(chunk)

USGS stations filter by region

Hey All,

I have a shapefile that I would like to use to filter the USGS stations by region.
The shapefile can be downloaded here: https://tinyurl.com/4h6evhfk

I convert the shapefile to a geopandas GeoDataFrame. This is read in successfully (I can see a matplotlib plot). However, the retrieval of the USGS stations for this region fails. Could you please help?

My code is this (please note, this is a Python environment under Windows):
CERA_test.py.zip
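For reference, the pattern I would expect to work (a sketch; the shapefile path is a placeholder and I have not run it against the attached script):

import geopandas as gpd
from searvey import usgs

# Read the shapefile and merge its polygons into a single (multi)polygon.
gdf = gpd.read_file('region.shp')  # placeholder path
region = gdf.to_crs(epsg=4326).unary_union

usgs_stations = usgs.get_usgs_stations(region=region)
print(len(usgs_stations))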

Adding functionality to download obs in format compatible with OCS's autoval code

@SorooshMani-NOAA

Spoke with Saeed today about linking searvey with OCS's autoval (https://github.com/noaa-ocs-modeling/autoval), using the Pacific SCHISM model as a test case. Autoval uses csdllib (https://github.com/noaa-ocs-modeling/csdllib) to download CO-OPS observed water level data.

The specific piece of code in csdllib that does this is
https://github.com/noaa-ocs-modeling/csdllib/blob/master/csdllib/data/coops.py -- the getData function (see below).

A sample URL that it downloads is https://opendap.co-ops.nos.noaa.gov/axis/webservices/waterlevelrawsixmin/plain/response.jsp?stationId=8410140&beginDate=20230302%2018:06&endDate=20230310%2012:00&datum=MSL&unit=0&timeZone=0&Submit=Submit

8410140 1 WL 2023-03-02 18:06 -1.868 0.017 1 0 0 0
8410140 1 WL 2023-03-02 18:12 -1.887 0.007 1 0 0 0
8410140 1 WL 2023-03-02 18:18 -1.901 0.010 0 0 0 0
8410140 1 WL 2023-03-02 18:24 -1.905 0.008 1 0 0 0
8410140 1 WL 2023-03-02 18:30 -1.910 0.010 1 0 0 0
8410140 1 WL 2023-03-02 18:36 -1.914 0.009 1 0 0 0
8410140 1 WL 2023-03-02 18:42 -1.923 0.009 0 0 0 0

Would you be able to add capability to searvey to download CO-OPS (and eventually IOC and USGS) observations into this format?

def getData (stationID,  dateRange, tmpDir=None,
             product='waterlevelrawsixmin', datum='MSL', units='meters', verbose=False):
    """
    Allows for downloading the observations from NOAA's 
    Center for Operational Oceanographic Products and Services (COOPS)
    via OpenDAP server    http://opendap.co-ops.nos.noaa.gov .
    
    Args:
        stationID (str):              7 character-long CO-OPS station ID.
        dateRange (datetime, datetime): start and end dates of retrieval.

Error when installing locally

Trying python -m pip install -e . gives:

  ERROR: Command errored out with exit status 1:
   command: /Users/brey/miniconda3/envs/searvey/bin/python /Users/brey/miniconda3/envs/searvey/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /var/folders/rz/63vzrg3547x1_d0z04_vrb_00000gn/T/tmpsva1lntj
       cwd: /Users/brey/GitHub/searvey
  Complete output (14 lines):
  Traceback (most recent call last):
    File "/Users/brey/miniconda3/envs/searvey/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
      main()
    File "/Users/brey/miniconda3/envs/searvey/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/Users/brey/miniconda3/envs/searvey/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 164, in prepare_metadata_for_build_wheel
      return hook(metadata_directory, config_settings)
    File "/private/var/folders/rz/63vzrg3547x1_d0z04_vrb_00000gn/T/pip-build-env-8dcboa1n/overlay/lib/python3.9/site-packages/poetry/core/masonry/api.py", line 43, in prepare_metadata_for_build_wheel
      poetry = Factory().create_poetry(Path(".").resolve(), with_dev=False)
    File "/private/var/folders/rz/63vzrg3547x1_d0z04_vrb_00000gn/T/pip-build-env-8dcboa1n/overlay/lib/python3.9/site-packages/poetry/core/factory.py", line 43, in create_poetry
      raise RuntimeError("The Poetry configuration is invalid:\n" + message)
  RuntimeError: The Poetry configuration is invalid:
    - Additional properties are not allowed ('group' was unexpected)

unexpected data value raises an error for IOC

Hi All,
I have been trying to get the metadata for the IOC stations and I am currently experiencing an error when trying to set the activity_threshold. As of today (June 27), there seems to be at least one data record in the IOC stations that does not have a date in the required field but instead the string 'NA'.

The code snippet in stations.py is currently doing this:

ioc_gdf = ioc_gdf.assign(
    delay=pd.concat(
        (
            ioc_gdf.delay[ioc_gdf.delay.str.endswith("'")].str[:-1].astype(int),
            ioc_gdf.delay[ioc_gdf.delay.str.endswith("h")].str[:-1].astype(int) * 60,
            ioc_gdf.delay[ioc_gdf.delay.str.endswith("d")].str[:-1].astype(int) * 24 * 60,
        )
    )
)

I made the following workaround:

def as_int(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return -1

# Normalize IOC
# Convert delay to minutes
ioc_gdf = ioc_gdf.assign(
    delay=pd.concat(
        (
            ioc_gdf.delay[ioc_gdf.delay.str.endswith("'")].str[:-1].apply(as_int),
            ioc_gdf.delay[ioc_gdf.delay.str.endswith("h")].str[:-1].apply(as_int) * 60,
            ioc_gdf.delay[ioc_gdf.delay.str.endswith("d")].str[:-1].apply(as_int) * 24 * 60,
        )
    )
)

This is working, but maybe you have better syntax for it. In any case, I wanted to bring this to your attention.

Thanks!

coordinate precision for IOC stations

I have checked out the latest searvey repository (tag v0.2.1) - it works great, thank you!
For the IOC stations, the coordinates are given with lower precision (2 or 3 decimal places), but sometimes the stations have more precise coordinates on the details page of the IOC website.
Would it be possible to get high precision coordinates for IOC? I understand that this would require additional requests, so maybe as a code option?

Station activity status

is_active=ioc_gdf.last_observation > activity_threshold_ts,

Currently we define a station as active in a very ad-hoc way. COOPS has an upstream definition while a time threshold is used for IOC (see above).

We need to come up with a consistent definition or avoid the tag altogether and let the user subset the data after retrieval.

CERA_workflow.ipynb is broken

@SorooshMani-NOAA I tried to execute the CERA_workflow.ipynb notebook and it seems that it is broken. You get a traceback in cell 7, but the root problem is in cell 6:

usgs_stations_w_param = usgs_stations[usgs_stations.parm_cd.isin(params_of_interest)]
is_active = (datetime.now() - usgs_stations_w_param.end_date) < timedelta(days=3)
usgs_stations_of_interest = usgs_stations_w_param[is_active]

More specifically, the end_date of usgs_stations_w_param is NaT for all rows, and as a result usgs_stations_of_interest is an empty dataframe. This causes the next cell, which does the actual plotting, to fail.

Can you have a look?

Move "get data" functionality from `StormEvents` to `searvey`

@zacharyburnettNOAA @pmav99 @brey what is the plan to move "get data" functionality of StormEvents to searvey?

Should there be standardized input parameters and output dataframes for all data sources in searvey? In that case, for example, does it make sense to move the USGS high water mark functionality to searvey, given that it takes event_id as input and returns a dataframe with non-station locations based on the input county and state? Also, how should dataframes be indexed?

Example Notebooks

Can we have some examples in a Notebook(s) like the ones in my original repo?

Unfortunately I have deleted them so I do hope you have them somewhere.

What is the maximum number of requests that we can send to IOC and COOPS (rate limiting)?

In order to speed up data retrieval, it makes sense to parallelize the requests for the station data - e.g. using async or multithreading. Nevertheless, I couldn't find any reference WRT rate limiting on the respective websites. Does anyone know the maximum number of concurrent requests that we can send to IOC and COOPS without violating their ToS?

https://api.tidesandcurrents.noaa.gov/api/prod/
https://www.ioc-sealevelmonitoring.org/disclaimer.php

Do we know anyone that we could ask?
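Until we hear back, one conservative option is to keep the concurrency low and explicit; a minimal sketch with a small thread pool (the URLs are placeholders):

import concurrent.futures

import requests

def fetch(url: str) -> str:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

urls = ["https://www.ioc-sealevelmonitoring.org/..."]  # placeholder station URLs

# Keep the number of workers small until the providers confirm their limits.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))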

IOC technical questions

  • maximum number of concurrent requests (see #27)
  • Is there data cap?
  • Metadata: Complete set? Update? Consistency (e.g. Lat/Lon)
  • How are the data collected from primary providers
  • operational API? OpenDAP?
  • Is there any data post-processing? If yes what?
  • Time frame, e.g. time origin?
  • Datum (vertical)
  • Imminent changes that we should anticipate?
