meteoswiss / ampycloud Goto Github PK

View Code? Open in Web Editor NEW

4.0 4.0 1.0 6.27 MB

Python package to determine the sky coverage fraction and base height of cloud layers using ceilometer data

Home Page: https://meteoswiss.github.io/ampycloud/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

ampycloud's People

Stargazers

Watchers

Forkers

uhamann

ampycloud's Issues

Should there be a limit to the number of significant cloud layers ?

To decide which cloud layers are "significant" (i.e. should be reported), ampycloud follows the ICAO rules in this document. The relevant part (in S.4.5.4.3) is as follows:

e) when several layers or masses of cloud of operational significance are observed, their amount and height of cloud
base should be reported in increasing order of the height of cloud base, and in accordance with the following
criteria:

the lowest layer or mass, regardless of amount to be reported as FEW, SCT, BKN or OVC as appropriate;

the next layer or mass, covering more than 2/8 to be reported as SCT, BKN or OVC as appropriate;

the next higher layer or mass, covering more than 4/8 to be reported as BKN or OVC as appropriate; and

cumulonimbus and/or towering cumulus clouds, whenever observed and not reported in 1) to 3);

This does set restrictions of the first three layers to be reported. But it does not explicitly exclude the possibility of additional cloud layers beyond that .... right ? 😕

Let's take an example: BKN010 BKN020 BKN050 BKN080

Should the last layer be reported ? I think so. If not, why not ?

Afterall, the same ICAO document is being very explicit, in the case of present weather:

[...] one or more, up to a maximum of three, of the present weather abbreviations [...]

If they were really only a maximum of three non-CB/TCU cloud layers that get reported in a METAR, the document would have said so ... right ?

Summary: for now, ampycloud does not limit the number of significant cloud layers. I stand ready to be convinced otherwise, in which case I'll update the code happily 😬

Add CITATION file

Describe the change
When the time comes, it'll be nice to have a proper CITATION.cff file in the repo.

https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files

Need to cleanup format_..._axes() once yaconfigobject supports deepcopy

Describe the bug
While fixing #24, this bug in yaconfigobjectwas revealed.

As a work around, in plots.diagnostic.format_slice_axes() and plots.diagnostic.format_group_axes(), the use of deepcopy was replaced by a manual initialization-and-update of a new dictionnary, e.g.

de_dt_scale_kwargs = {}
de_dt_scale_kwargs.update(dynamic.AMPYCLOUD_PRMS.SLICING_PRMS.dt_scale_kwargs)

If yacongiobject gets updated to support the use of deepcopy, these lines could be cleaned-up.

Document the ``99`` type of simulated cloud hits (and fix it?)

Describe the change
The mocker module is still rather crude, and so far cannot properly simulate "hits". It sets all the types to 99, which is confusing to users.

Before we go public, we should really fix this or document it properly. But ideally fix it.

Make MSA a scientific variable

Describe the change
Currently, msa is only a keywords applied to the functions issuing metars. However, cropping very high measurements prior to feeding them to the ampycloud algorithm does impact its outcome, by allowing for thinner slices.

To ease this, we need to create two new parameters: MSA and MSA_HIT_BUFFER.
The former is meant to be the official MSA value of the site, and the latter is a (positive) altitude buffer above which hits will not be considered by the ampycloud algorithm.

In p[articular, there should be a test setup to verify that MSA_HIT_BUFFER is postitive.

ValueError: Expected n_samples >= n_components ...

Describe the bug
Running the algorithm with the following input crashes:

demo_data = pd.DataFrame(np.array([['CT31_Oberglattt', 703877.16838, 2300, 1],
                                                               ['CT31_Oberglattt', 703877.16838, 4000, 2],
                                                               ['CT31_Oberglattt', 703877.16838, 4500, 3],
                                                               ['CT31_Oberglattt', 703892.16838, 1000, 4],
                                                               ['CT31_SomewherElse', 703877.16838, -1, 0],
                                                               ]),
                                                              columns=['ceilo', 'dt', 'alt', 'type'])

with:

ValueError: Expected n_samples >= n_components but got n_components = 3, n_samples = 2

To Reproduce

chunk = ampycloud.run(demo_data)

Expected behavior
Should work.

Additional context
Reported by @nestabur.

ref_dt should allow for type ``datetime.datetime``

Describe the bug
The examples says:

# Run the ampycloud algorithm on it
chunk = ampycloud.run(mock_data, geoloc='Mock data', ref_dt=datetime.now())

which works fine. However, datetime.datetime is not actually allowed by the type hints of the function.

Additional context
Initially reported by @nestabur

Move the mpl setup out of the main init.py

Describe the change
Loading ampycloud sets the default plotting style right away. This should be done only if a plot is actually required.

The relevant call should be moved to the __init__.py of the plots sub-package.

Edit: also, we should probably avoid setting any plotting style by default, and let the user decide to do so if they desire (without any guarantee as to the look of plots).

Need examples in the docstrings of the core routine

Describe the change
It would be useful to include properly formatted examples in each of the autometpy.core functions. That way, we will be able to pull them directly inside the doc and avoid duplicating information.

Using multiple config setups is not thread-safe

Describe the bug
Because of the current use of the dynamic module directly by the class methods of CeiloChunk, ampycloud is not thread-safe for users trying to process different data with differing sets of parameters at the same time.

To fix this, we need to:

Create an instance variable of the full sets of parameters for any new instance of CeiloChunk being created
Refer to these rather than the dynamic module in CeiloChunk methods

Side-note: the user should be able to specify only the specific parameters that are of interest, while getting the rest from the (current) defaults set in dynamic.

Add downsampling step

Describe the change
@loforest asked to have an additional, final step to the algorithm that ensures that the list of final cloud layer (bases) are separated by a specific amount (specifiable by the user as a function of altitude). Layers too close from one another should get merged in that step.

`step_scale` crashes with a single scale

Describe the bug
Specifying a single scale (and thus no steps) crashes the step_scale() routine.

Applying the MSA should set hits to NaN, rather than crop them

Describe the bug
At the moment, any hits above the MSA + MSA_HIT_BUFFER are cropped. This is incorrect. These should be converted as NaN, since they actually represent a non-detection.

Consider allowing to feed parameters via yaconfiobject ?

Describe the change
As the title says. Since this is where it'll go, some users may be interested to directly feed a yaconfigobject to ampycloud ?

Layering step crashes when all the altitudes are the same

Describe the bug
Following a report by @nestabur, it appears that the layering step does not handle well the case where all the altitude values are the same.

As of #74, the code will explicitly check for this, and skip the layering step when it occurs.

Deal with SYNOP properly

Describe the change
At the moment, the concept of synop is used throughout autometpy to mean all cloud layers identified.

Following a comment from @loforest, the concept of SYNOP is more complicated.

And we should really handle this properly.

Specify `OKTA_LIM0/8` in absolute hit counts rather than percentages

Describe the change
@loforest asked to have the OKTA_LIM0 and OKTA_LIM8 parameters in absolute counts rather than percentages.

Improve the mocker module

Describe the change
The mocker module is a bit clunky, and should be cleaned-up, particularly the "type A" routine that ought to be replaced by something more generic. Also, tests !

Should require a minimum number of points to do GMM

Describe the change
At present, Gaussian Mixture Modelling is being executed if cloud layers have at least 2 oktas and more than 3 points. But 3 points still leads to unstable results - this number should be increased to a reasonable value, based on simulations.

E.g.: 50 ?

Add more info about CeiloChunk in the docs

Describe the change
Right now, CeiloChunk is not exactly described in the docs. We should correct this and provide examples, in particular for users that need more than what the courtesy chunk.metar_msg() provides (e.g. okta value, altitude, exact sky coverage fraction).

Too many layers identified for very thin groups

Describe the bug
Very thin groups can lead to the over-detection of layers if the group thickness is close to the resolution of the hit measurements. In those cases, the hits appear strongly quantized to the eye of the Gaussian Mixture Modelling, which starts to see too many components.

Enable CI/CD tests for Python 3.9 and 3.10

Describe the change

Now that the repo is public, there are no limits on Github Actions minutes. We should thus enable automated checks for newer version of Pythons to catch problems early.

Suggestion:

run pylint on the latest Python version
run pytest on Python 3.8, 3.9, and 3.10 (for now).
run docs check on the latest Python version

Add a performance module

Describe the change
It would be interesting to have a utils.performance module that could run a benchmark case to assess the speed of the algorithm. Also to measure the overheads associated to plots.

Layering returns different results for the same dataset

Describe the bug
ncom_from_gmm() was found to return a different number of best_ncomp for certain datasets, if run multiple times.

This was related to the lack of fixed random seeds in the code, which leads to different results for "borderline" datasets: specifically, very thin groups that have a nearly quantized distribution of hits.

Expected behavior
The code should always return the same number of layer components.

Create high-level entry point to get a local copy of the parameter file

Describe the change
I suppose that a command-line high-level entry point of ampycloud.copy_prm_file() would be nice ...

Add ref to docstring of frac2okta()

Describe the change
It would be useful to add a link to the article of Boers et al. (2010) to the frac2okta() docstring in wmo.py, as a reference for the conversion scheme where the 1 and 7 okta bins are larger than the others by 50%.

`LAYER_BASE_FRAC` not fed to `self.metarize()`

Describe the bug
This bug was reported by @loforest in #83. The parameter LAYER_BASE_FRAC is never fed to the self.metarize() routine of the slicing, grouping, and layering steps.

Tie repo to Zenodo

Describe the change
When the time comes for v1.0.0, we should make sure to tie this repo to Zenodo before it actually happens.

Please add API documentation

At the moment no APIs are documented. These should at least be documented in the top-level README.

It is possible to look at some test cases in order to reverse engineer this, but this is not the preferred way, could change, and takes a lot of effort.

Should use yaconfigobject instead of the dynamic module

Describe the change
It looks like https://github.com/MeteoSwiss/yaconfigobject could be used to replace the dynamic module entirely.

Doing so would probably be a good idea, since currently the config values are present both in the reference YAML file and the dynamic module itself.

Users should be able to feed an 'msa' to plots.diagnostic

Describe the change
As the title says. The diagnostic plots should be able to receive a msa from the user.

Check that type 0 = nan for alts

Describe the bug
For the moment, we do not explicitely check that type 0 corresponds to NaN (no detection). We should check this, and issue a Wrning (or Exception ?) if it is not the case.

We should also clarify the docs to say that non-detection should be reported as NaNs, and that non-detection are different from no observations.

Document and add an interface using JSON as input and output in order to make it compatible with other systems

In order to interface properly with this lib, a clear and well documented API, should be specified.

I suggest using JSON and in- and output to this API. This makes it easier to integrate with other software, written in other languages.

I suggest doing the following:

At the moment the inputs seem to be pickle files (see tests). This should be converted to JSON.
The output is in a so-called "METAR format", which is a sorted list of text strings, that integrates badly. Change this to a sorted JSON list of numeric values.

At the moment no interfaces or APIs are documented. see #43.
Please document the API (#43) before fixing this PR.

Resampling large gaps should result in NaNs

Describe the change
When resampling profiles, large gaps should result in NaNs. We absolutely do not want to interpolate over large distances.

The maximum time gaps should be set to: 2s ?

@GonzagueRomanens, @g-martucci, @cfelix-ch: do you think this number is reasonable ?

Pandas v1.5.0 FutureWarning

Describe the change
The Github CI/CD checks are picking-up a new FutureWarning from pandas coming from v1.5.0.

I will fix those warnings using backward compatible code (hopefully).

To be safe, I could bump the minimum required version of pandas to be v1.5.0. @nestabur, in a general sense, how problematic/annoying is it on the autometpy side of things if ampycloud has new package version requirements ?

Use new Issue forms

Describe the change
The new issue forms would be a nice improvement to the current issue and feature request templates.

Fix plots/test_core.py once pytest 7.0 is pip-installable

Describe the change
The test_empty_plots() routine must be corrected once pytest 7.0 is properly out.

Should be any day now.

Duplicate points in X warning in GMM

Warning message when running ampycloud on 2022.01.02 at 00:20 in Geneva:
python3.8/site-packages/sklearn/mixture/_base.py:143: ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (3). Possibly due to duplicate points in X. cluster.KMeans( python3.8/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice. return _methods._mean(a, axis=axis, dtype=dtype, python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount)
Screenshots

Desktop (please complete the following information):

OS: Ubuntu 20
ampycloud version 0.4.0.dev0
Python version 3.8.10

Submitting empty dataframe crashes

Describe the bug
At the moment, feeding ampycloud with an empty data frame crashes the code.

This should raise a proper error instead.

It should be noted here that an empty dataframe is indicative of no measurements (such that ampycloud evidently cannot compute a METAR code). This is NOT the same as a dataframe with NaN altitudes, which corresponds to valid ceilometer measurements of clear sky conditions.

Avoid testing all Python versions on all OS

Describe the change
Right now, the CI_pytest.yml is being executed for Python 3.8, 3.9,and 3.10 on Ubuntu, OSX, and Windows.

This is time consuming, and not ideal in an active dev phase. It would be best to test OSX and Windows deployment only for the latest Python version, and stick to Ubuntu to test different versions as well.

The syntax of CI_pytest.yml should be adjusted accordingly.

Surely this can be done without having to duplicate the action, right ?!

Secondary axis in diagnostic plots have wrong limits

Describe the bug
The secondary axis in the diagnostic plots, i.e. the axis showing the scaled dt and alt, have wrong limits for the minmax and const scaling.

This bug was found while fixing #66. Essentially, it is related to the fact that the scaling functions are "lossy", in the sense that one cannot "undo" the scaling just from the scaled data - knowledge of the original data is required.

Since the secondary axis were feeding a different set of data to the scaling functions, the resulting scaling was slightly different.

The problem was fixed by distinguishing between "user" keyword arguments from those actually required by the scaling functions.

Compute the fluffiness of cloud layers

Describe the change

Certain sets of ceilometer hits can display clear structure, possibly with ordered altitude fluctuations. Alternatively, ceilometer hits can be spread throughout the entire spatial extent of a given slice/group/layer.

To aid in differentiating between these two types of structures (and possibly pave the way for better algorithms), we bring forth the idea of "fluffiness", ~~ranging between 0 (perfectly structured) to 1 (ultra-fluffy). This number is computed as the ratio between:~~
~~1) the area of the complex hull encompassing all the ceilometer hits in the slice/group/layer, and~~
~~2) the full extent (in ft x s space) of the ceilometer hits span, i.e. (alt_max - alt_min) * (dt_max - dt_min).~~

Edit: the initial idea of the Convex Hull was abandoned in favor of LOWESS. See comments below for details.

Parameter file not included in wheel

Describe the bug
The file ampycloud_default_prms.yml is not included in the wheels, only in the tar.gz.

This makes using ampycloud via the pypi wheel impossible until it gets fixed.

Display the rescale alt axis for the grouping step

Describe the change
The use of slices fluffiness to derive the altitude rescaling value in the GROUPING step of ampycloud implies (possibly) a different scaling for distinct slice bundles.

Currently, these scaling values are not stored anywhere, such that from #86 onwards, the rescaled altitude axis can no longer be easily shown in the ampycloud diagnostic plots.

It would be great to correct this situation, but doing so is tough because:

the rescaling value (per slice bundle) is not stored anywhere (should it be ? Or should the plot re-compute it ?), and
we need to show "little" axis for each slice bundle (instead of one global one - no idea how to do that cleanly).

This is going to be a tough issue to fix.

Add a synop property to CeiloChunk

Describe the change
It would good to add a synop property to the CeiloChunk class, that unlike CeiloChunk.metar would return the full list of cloud layers.

`test_scientific_stability()` should always process all cases, even if one fails

Describe the change
At the moment, test-scientific_stability() checks all cases sequentially, and stops as soon as one of them fails.

This is not ideal, as fixing one may then break another, etc ...

A better approach would be to create some sort of test factory, such that each case is assessed individually independently of the others.

The const_scaling function should be done in relative space

Describe the bug
At the moment, the const_scaling using for the time is absolute, with respect to 0. This should be done relatively to the first point in the dataset instead, to ensure the correct behavior of ampycloud no matter what date of reference is used by the users.

New Action to test processing time

Describe the change
We need a new Github Action to test the ampycloud speed. As a start, I'd say the mean + 3std of the speed test should not tke longer than 1s to pass.

Add tests of key (real) situations to easily monitor the scientific stability of the code

Describe the change
Reference data from real measurements is stored in dedicated pickle files under test, and should be used to generate the metars from interesting/representative situations, to ensure that the outcome of ampycloud does not evolve unexpectedly.

Also, add a new pytest fixture to optionally generate the diagnostic plots and ease the assessment of changes.

plt.style.context() does not enable 'text.latex.preamble'

Describe the bug
ampycloud uses custom matplotlib rcParams parameters to tune the look of the plots. To avoid messing with the user's setup, these are set temporarily, for each ampycloud plot, using plt.style.context() via the plots.utils.set_mplstyle() decorator.

In particular, ampycloud offers users the ability to use their system-wide LaTeX installation to generate the prettiest plots, which requires to load a specific text.latex.preamble.

Sadly, plt.style.context() does not correctly set this specific rcParams element, a bug which has been reported before: matplotlib/matplotlib#17931

As a work around, text.latex.preamble is set outside of context (for now) using plt.style.use(), which implies that the user's setup will be modified. See plots.utils.set_mplstyle() for details.

This is a very specific issue that will have no impact on the vast majority of users (since it is very uncommon to use usetex: True). Fixing it would be nice, but since text.latex.preamble is not officially supported, it may not be that easy to do ...

meteoswiss / ampycloud Goto Github PK

ampycloud's People

Stargazers

Watchers

Forkers

ampycloud's Issues

Recommend Projects

Recommend Topics

Recommend Org