holoviz-topics / earthml
Tools for working with machine learning in earth science
Home Page: https://earthml.holoviz.org
License: BSD 3-Clause "New" or "Revised" License
At the end of this page of the docs, the resampling section has a couple of bugs.
I think it should read:
import numpy as np

# Target resolution of 1000 m; build the bin edges for each axis
res_1000 = 1000
x_1000 = np.arange(xmin, xmax, res_1000)
y_1000 = np.arange(ymin, ymax, res_1000)

# Bin along each axis, average within each bin, then restore the dimension names
diff_res_1000 = (
    diff_regridded.groupby_bins("x", x_1000, labels=x_1000[:-1])
    .mean(dim="x")
    .groupby_bins("y", y_1000, labels=y_1000[:-1])
    .mean(dim="y")
    .rename(x_bins="x", y_bins="y")
)
diff_res_1000
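For reference, here is a minimal pure-NumPy sketch of the same mean-binning idea along one axis; the array and bin edges below are made up for illustration, not the notebook's data.

```python
import numpy as np

# Synthetic 1-D signal sampled every 1 unit on x in [0, 8)
x = np.arange(8)
values = np.array([0., 1., 2., 3., 4., 5., 6., 7.])

# Coarsen to a resolution of 2 by averaging the samples in each bin,
# mirroring what groupby_bins(...).mean(...) does per axis
edges = np.arange(0, 9, 2)           # bin edges: 0, 2, 4, 6, 8
which = np.digitize(x, edges) - 1    # bin index for each sample
coarse = np.array([values[which == i].mean() for i in range(len(edges) - 1)])

print(coarse)  # [0.5 2.5 4.5 6.5]
```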
I remember a comment made during the presentation that the UC Merced dataset was too large to include in the instructional data. That said, it would be nice to include a note that the data can be downloaded from the site, or possibly to add a commented-out download using intake. Also, since the data is so large, maybe include a small subset that will get the examples working, even if they will produce bad results.
Hi.
Is it possible that the original nee_data_repo is offline? Or did it move?
https://github.com/greyNearing/nee_data_fusion/
Also, I get a kernel restart at the cell
metadata.hvplot.points('lon', 'lat', geo=True, color='vegetation',
                       height=420, width=800, cmap='Category20') * gts.OSM
but that might be my env.
Great notebooks!
Best,
C
In a recent meeting we (@ebo @jsignell @jbednar) came up with some new ideas for public labelled data that can be applied to public satellite imagery (which mostly implies LANDSAT data).
Good criteria for a task are that (1) all the data can be made public, (2) the labelled features are big enough to spot with LANDSAT, and (3) the features can be easily spotted by a human to evaluate the ML performance. The two most promising suggestions were:
Using the National Inventory of Dams database to mark dams on US imagery. This data has latitude/longitude coordinates, so the labels are points. There is one Excel file per state, and there are > 90k dams in total.
Labelling lakes using the Global Lakes and Wetlands Database, which is polygon data. The GLWD-2 dataset has > 250,000 polygons, though since this is a global database I don't know how many of them fall in the US if we want to focus on that.
Another nice thing about these two datasets is that there is a good chance they are correlated with each other!
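To make the dam-labelling idea concrete, here is a minimal sketch of turning point labels into a raster mask aligned with an imagery grid. All names, coordinates, and the grid itself are made up, standing in for a real tile and for points parsed from the NID Excel files:

```python
import numpy as np

# Hypothetical imagery tile: 10x10 pixels covering lon [-100, -99), lat [40, 41)
nrows, ncols = 10, 10
lon0, lat0, res = -100.0, 40.0, 0.1

# Hypothetical dam locations as (lon, lat) points
dams = np.array([[-99.95, 40.05],
                 [-99.45, 40.85]])

# Binary label mask: 1 wherever a dam falls inside a pixel
mask = np.zeros((nrows, ncols), dtype=np.uint8)
cols = ((dams[:, 0] - lon0) / res).astype(int)
rows = ((dams[:, 1] - lat0) / res).astype(int)
mask[rows, cols] = 1

print(mask.sum())  # 2 labelled pixels
```

The same mask-building step would apply unchanged to the GLWD polygons after rasterizing them to the tile's grid.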
I'd like a better explanation of the motivation, and some domain knowledge to know which variables to exclude.
A task for the HoloViz group, as part of the PangeoML project, is to put together examples using the HoloViz tools that support geo-machine-learning workflows, and, when need be, to improve the existing tools to better support these workflows. Such content was already produced a couple of years ago as part of the EarthML project.
Beyond that task, several projects in the Python ecosystem are tackling a similar issue, i.e. producing and maintaining examples in this domain. These include Project Pythia and the Pangeo Gallery. Members of these projects met in September 2022 to discuss best practices and a way forward. One of the outcomes of this discussion was that Project Pythia is best equipped to accept and maintain examples (and that new developments are required to improve the workflow).
The tasks for the HoloViz group are then:
cc @jbednar
crs = gv.util.proj_to_cartopy(landsat_8_da.crs)
generated:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-d99b01f392df> in <module>
----> 1 crs = gv.util.proj_to_cartopy(landsat_8_da.crs)
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\geoviews\util.py in proj_to_cartopy(proj)
453 proj = check_crs(proj)
454
--> 455 if proj.is_latlong():
456 return ccrs.PlateCarree()
457
AttributeError: 'Proj' object has no attribute 'is_latlong'
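This error appears because `Proj.is_latlong()` was removed in newer pyproj releases, which expose an `is_geographic` property on the CRS object instead. Until geoviews handles both APIs, a duck-typed helper along these lines can smooth it over; this is only a sketch of the compatibility check, not the library's fix, and the stand-in classes below just mimic the two API generations so the example runs without pyproj installed:

```python
def proj_is_geographic(proj):
    """True if a pyproj Proj/CRS is geographic (lat/lon), across pyproj APIs."""
    if hasattr(proj, "crs"):                # pyproj >= 2.x: Proj wraps a CRS
        return bool(proj.crs.is_geographic)
    if hasattr(proj, "is_geographic"):      # CRS object passed directly
        return bool(proj.is_geographic)
    return bool(proj.is_latlong())          # pyproj 1.x fallback

# Stand-ins for the two API generations (no pyproj needed to illustrate)
class OldProj:
    def is_latlong(self):
        return True

class NewCRS:
    is_geographic = True

class NewProj:
    crs = NewCRS()

print(proj_is_geographic(OldProj()), proj_is_geographic(NewProj()))  # True True
```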
The landsat_spectral_clustering notebook updated in #12 uses a tif file that is missing CRS information, which results in lots of warnings. We should add the CRS and update the S3 file with it to make the example cleaner.
In the EarthML project, we need to apply machine-learning tools like sklearn to multidimensional array data like xarrays and other data that doesn't fit naturally into sklearn's single-column input data format. Of course, arrays can be flattened before running the algorithm, then reshaped in the opposite way afterwards, as in Tom Augspurger's spectral clustering example.
However, doing so is awkward, error prone, and likely to lose metadata such as lat/lon coordinates, especially for more complicated multidimensional arrays whose data needs to be selected along certain ranges of dimensions or sliced inside the array, and then restored to that range and slice afterwards for analysis and visualization. It is presumably especially painful and error prone if the dimensionality changes as a result of any of the sklearn operations (e.g. PCA).
Existing libraries to deal with these issues take one of two approaches:
These two approaches have very different implications, and it would be good if we can tease those out explicitly here before moving forward with a particular choice. They are also at very different stages of development, with phausamann/sklearn-xarray seeming further along than the rest, but I don't know how well any of them handle the end stages in this process (reshaping back to the original shape or some appropriately reduced version of it). @jlstevens, any thoughts?
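As a baseline for comparing those libraries, the manual flatten → fit → reshape round trip can be sketched with plain NumPy; a toy thresholding step stands in for the actual sklearn estimator, so the example runs without sklearn and the shapes are made up:

```python
import numpy as np

# Toy "image": 2 bands, 4x5 pixels, laid out (band, y, x)
data = np.arange(2 * 4 * 5, dtype=float).reshape(2, 4, 5)

# Flatten to sklearn's (n_samples, n_features) layout: one row per pixel
flat = data.reshape(2, -1).T            # shape (20, 2)

# Stand-in for model.fit_predict(flat): label pixels by thresholding band 0
labels_flat = (flat[:, 0] >= flat[:, 0].mean()).astype(int)

# Reshape the predictions back to the spatial grid so the lat/lon
# coordinates can be reattached afterwards
labels = labels_flat.reshape(4, 5)

print(flat.shape, labels.shape)  # (20, 2) (4, 5)
```

The bookkeeping the libraries above automate is exactly these two reshapes, plus carrying the coordinate metadata through a dimensionality-changing step such as PCA.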
I tried to update the repository today (with: git pull) and got:
"Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists."
I thought that I might have corrupted the copy on my thumb drive, so I moved it over and tried to re-clone it. I got the same error.
Probably something simple, but I do not think it is on my end.
Just wanted to track this issue dask/dask-ml#230 which was causing failures as described in #81
The original version of landsat_spectral_clustering.ipynb used redding.tif
which was obtained originally from planet.com. As I wasn't sure whether this image could be made available, I updated the notebook to use a landsat example taken from a datashader example.
@ebo then informed me by e-mail that this image is not really suitable, as it only has two bands. We would like some example images that can be made public and that also have a decent number of bands. This is important, as it would let us compute the various indices that we might then want to learn on.
One suggestion from @ebo was to use these images of a disappearing lake, although I only see a link to download them in JPG format?
Lastly, a new notebook has been committed referencing 'Midwest_Mosaic.tif' which I don't think we have discussed yet. Is this something we could slice down and add to the repo as an example?
While driving with my wife for the holidays, we got to talking about some apps she and her field crews recently published in Applications in Plant Sciences. It got me thinking about an example I started playing with one night, which scrapes local forecast information for her field sites so that they can receive site-specific forecasts targeting when they expect to be at the field sites. Since they are often gone for a month or two at a time, these have to be regenerated on the fly...
Anyway, I reached out to the editor regarding copyright and got the following reply:
"...Regarding the copyright/license, articles published in APPS are published under a Creative Commons license. With Creative Commons licenses, the author retains copyright and the public is allowed to reuse the content. The author grants the Botanical Society of America a license to publish the article. ..."
Is CC an acceptable license for one of the notebooks? I figure that I could work this up as an example for either EarthML or EarthSIM, and it could also be published with APPS.
I thought I would ask before spending much time on it as I can easily get it to the point where it is useful for my wife and her field crew, but it is an interesting offshoot to things I already discussed with weather/climate data collection.
training = intake.open_csv('../data/landsat*_training.csv')
worked fine, but
training = intake.open_csv('../data/landsat{version:d}_training.csv')
training_df = training.read()
training_df.head()
produced a value error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-6ade6c87b33e> in <module>
1 training = intake.open_csv('../data/landsat{version:d}_training.csv')
----> 2 training_df = training.read()
3 training_df.head()
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\intake\source\csv.py in read(self)
140
141 def read(self):
--> 142 self._get_schema()
143 return self._dataframe.compute()
144
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\intake\source\csv.py in _get_schema(self)
125
126 if self._dataframe is None:
--> 127 self._open_dataset(urlpath)
128
129 dtypes = self._dataframe._meta.dtypes.to_dict()
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\intake\source\csv.py in _open_dataset(self, urlpath)
116
117 # add the new columns to the dataframe
--> 118 self._set_pattern_columns(path_column)
119
120 if drop_path_column:
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\intake\source\csv.py in _set_pattern_columns(self, path_column)
73 col.cat.codes.map(dict(enumerate(values))).astype(
74 "category" if not _HAS_CDT else CategoricalDtype(set(values))
---> 75 ) for field, values in reverse_formats(self.pattern, paths).items()
76 }
77 self._dataframe = self._dataframe.assign(**column_by_field)
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\intake\source\utils.py in reverse_formats(format_string, resolved_strings)
126 args = {field_name: [] for field_name in field_names}
127 for resolved_string in resolved_strings:
--> 128 for field, value in reverse_format(format_string, resolved_string).items():
129 args[field].append(value)
130
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\intake\source\utils.py in reverse_format(format_string, resolved_string)
193
194 # get a list of the parts that matter
--> 195 bits = _get_parts_of_format_string(resolved_string, literal_texts, format_specs)
196
197 for i, (field_name, format_spec) in enumerate(zip(field_names, format_specs)):
C:\ProgramData\Anaconda3\envs\earthml\lib\site-packages\intake\source\utils.py in _get_parts_of_format_string(resolved_string, literal_texts, format_specs)
41 if literal_text not in _text:
42 raise ValueError(("Resolved string must match pattern. "
---> 43 "'{}' not found.".format(literal_text)))
44 bit, _text = _text.split(literal_text, 1)
45 if bit:
ValueError: Resolved string must match pattern. '../data/landsat' not found.
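For what it's worth, the traceback says the resolved file path no longer contains the literal prefix `../data/landsat` from the pattern. I'm not certain this is the cause here, but on Windows (as in this environment) paths are often resolved with backslashes, which would break exactly that substring match. A minimal reproduction of the mismatch, in plain Python and independent of intake:

```python
literal = '../data/landsat'

posix_path = '../data/landsat5_training.csv'
win_path = '..\\data\\landsat5_training.csv'

print(literal in posix_path)  # True
print(literal in win_path)    # False: backslashes break the literal match

# Normalizing separators before matching avoids the mismatch
print(literal in win_path.replace('\\', '/'))  # True
```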