
generic-catalog-reader's Introduction

Generic Catalog Reader (GCR)


A ready-to-use, abstract base class for creating a common reader interface that accesses generic table-like catalogs.

This project was started to meet the needs of the DESCQA framework. It is now used in the LSST DESC GCR Catalogs.

Installation

You can install GCR from conda-forge:

conda install gcr --channel conda-forge

Or from PyPI:

pip install GCR

Concept

The reader should specify: (1) how to translate (assemble) requested quantities from the native quantities; and (2) how to access native quantities from the underlying data format.


Usage

You can find API documentation here. However, looking at some real examples is probably more useful.

Basically, you will subclass GCR.BaseGenericCatalog, set the member dict _quantity_modifiers inside _subclass_init, and implement the member methods _generate_native_quantity_list and _iter_native_dataset. Here's a minimal example.

import h5py
import GCR

class YourCatalogReader(GCR.BaseGenericCatalog):
    
    def _subclass_init(self, **kwargs):
        self._file = kwargs['filename']
        
        self._quantity_modifiers = {
            'galaxy_id' :    'galaxyID',
            'ra':            (lambda x: x/3600.0, 'ra'),
            'dec':           (lambda x: x/3600.0, 'dec'),
            'is_central':    (lambda x, y: x == y, 'haloId', 'parentHaloId'),
        }
        
    def _generate_native_quantity_list(self):
        """
        Must return an iterable of all native quantity names.
        """
        with h5py.File(self._file, 'r') as fh:
            # Copy the names into a list before the file is closed.
            return list(fh.keys())
        
    def _iter_native_dataset(self, native_filters=None):
        """
        Must be a generator.
        Must yield a callable, *native_quantity_getter*.
        This function must iterate over subsets of rows, not columns!
        Below are specifications of *native_quantity_getter*
        ---
        Must take a single argument of a native quantity name.
        Should assume the argument is valid.
        Must return a numpy 1d array.
        """
        assert not native_filters, '*native_filters* is not supported'
        with h5py.File(self._file, 'r') as fh:
            def native_quantity_getter(native_quantity):
                # Dataset.value was removed in h5py 3.0; [()] reads the whole dataset.
                return fh[native_quantity][()]
            yield native_quantity_getter
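
To make the role of _quantity_modifiers more concrete, here is a minimal, self-contained sketch of how entries of the two forms used above (a plain string alias, and a tuple of a callable followed by native quantity names) could be resolved against a getter. The helper name translate_quantity and the toy data are hypothetical, and plain lists stand in for numpy arrays. This only illustrates the idea and is not GCR's actual implementation.

```python
def translate_quantity(name, modifiers, getter):
    """Resolve one requested quantity using a modifier table (toy version)."""
    # Assumption for this sketch: no entry means "use the native name as-is".
    modifier = modifiers.get(name, name)
    if isinstance(modifier, str):
        # Plain alias: the requested name maps to a single native quantity.
        return getter(modifier)
    # Tuple form: (callable, native_name_1, native_name_2, ...)
    func, *native_names = modifier
    return func(*(getter(n) for n in native_names))


# Toy native data standing in for the underlying file (hypothetical values).
native = {
    'galaxyID': [1, 2, 3],
    'ra': [3600.0, 7200.0, 10800.0],
    'haloId': [10, 11, 12],
    'parentHaloId': [10, 99, 12],
}

modifiers = {
    'galaxy_id': 'galaxyID',
    'ra': (lambda x: [v / 3600.0 for v in x], 'ra'),
    'is_central': (lambda x, y: [a == b for a, b in zip(x, y)],
                   'haloId', 'parentHaloId'),
}

galaxy_id = translate_quantity('galaxy_id', modifiers, native.__getitem__)
ra_deg = translate_quantity('ra', modifiers, native.__getitem__)
is_central = translate_quantity('is_central', modifiers, native.__getitem__)
```

In the real reader the getter is the native_quantity_getter yielded by _iter_native_dataset, so the same translation is applied to each subset of rows in turn.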

generic-catalog-reader's People

Contributors

wmwv, yymao


generic-catalog-reader's Issues

New class to create composite catalog

Motivated by LSSTDESC/gcr-catalogs#65, we'd like to implement a new class that creates a composite catalog. This class will take several GCR catalogs as input and return an instance that behaves like a regular GCR catalog. The GCR quantities in the composite catalog will be mapped to the first input catalog, but overwritten by other input catalogs in sequence. The user can change the mapping by modifying the entries in quantity_modifiers.
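
The mapping rule described above (start from the first catalog, let later catalogs overwrite, then allow manual changes) could be sketched roughly as follows. The function name build_quantity_mapping and the quantity lists are hypothetical; this shows only the proposed precedence rule, not code from the composite-catalog branch.

```python
def build_quantity_mapping(catalogs):
    """Map each quantity name to the index of the catalog that provides it.

    `catalogs` is a sequence of iterables of quantity names (stand-ins for
    real catalog instances). Later catalogs overwrite earlier ones.
    """
    mapping = {}
    for index, quantities in enumerate(catalogs):
        for quantity in quantities:
            mapping[quantity] = index
    return mapping


# Hypothetical quantity lists for a main catalog and one add-on.
main_quantities = ['galaxy_id', 'ra', 'dec', 'mag_r']
addon_quantities = ['galaxy_id', 'mag_r', 'shear_1']

mapping = build_quantity_mapping([main_quantities, addon_quantities])

# A user could then change the mapping by hand, for example to keep the
# main catalog's magnitudes.
mapping['mag_r'] = 0
```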

Development for this will take place in the composite-catalog branch

(cc @EiffL @duncandc)

error message from empty native_filter could be more helpful

This is not a huge priority, but if you pass the GCR a native_filter that returns an empty set, you get an error message that does not make it obvious what went wrong. For example

>>> import GCRCatalogs
>>> from GCR import GCRQuery
>>> query = GCRQuery('healpix_pixel==-1')
>>> cat = GCRCatalogs.load_catalog('cosmoDC2_v1.0_image')
>>> cat.get_quantities('galaxy_id', native_filters=[query])['galaxy_id']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/GCR/base.py", line 77, in get_quantities
    return {q: (np.concatenate(data_all[q]) if len(data_all[q]) > 1 else data_all[q][0]) for q in quantities}
  File "/global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/GCR/base.py", line 77, in <dictcomp>
    return {q: (np.concatenate(data_all[q]) if len(data_all[q]) > 1 else data_all[q][0]) for q in quantities}
IndexError: list index out of range

It might be nice for the GCR to tell you that your native filter returned nothing, so that you don't worry that something more fundamental has gone wrong.

Again: this is not a high priority.
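
One possible shape for such a check, as a sketch only: test for an empty list of chunks before indexing into it, and raise an error that names native_filters. The function combine_chunks is hypothetical, and plain lists stand in for the numpy arrays that the line in the traceback concatenates.

```python
def combine_chunks(data_all, quantities):
    """Join per-chunk results for each quantity, failing clearly when empty."""
    combined = {}
    for quantity in quantities:
        chunks = data_all[quantity]
        if not chunks:
            # No dataset was iterated over, so there is nothing to return.
            raise ValueError(
                "No data returned for quantity {!r}; check whether "
                "native_filters excluded every dataset.".format(quantity)
            )
        # Flatten the list of chunks (np.concatenate in the real code).
        combined[quantity] = [value for chunk in chunks for value in chunk]
    return combined


try:
    combine_chunks({'galaxy_id': []}, ['galaxy_id'])
except ValueError as err:
    message = str(err)
```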

Provide method to query for quantity attributes

@yyymao I think it would be useful to give users a way to query for more information about a quantity that the catalog creators have seen fit to provide. For example, in protoDC2, each quantity has a units attribute and will have a description attribute. I am thinking of a method such as:
gc.get_quantity_info(quantity_name, keyword='')
keyword could be '' (fetches all info), 'units' (fetches units), or 'description' (fetches description).
If the catalog doesn't have anything, it returns 'Not available'. Catalog providers will implement this method in their reader to provide the information in whatever form they wish.
Then the list_quantities method could have an option to return the list plus all extra information. This would then serve as nice online documentation for everything in the catalog.
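
A rough sketch of what this proposal might look like, using a toy class rather than a real reader. ToyCatalog, its constructor, and the with_info option are hypothetical names chosen for illustration; only the get_quantity_info signature and the 'Not available' fallback come from the text above.

```python
class ToyCatalog:
    """Hypothetical stand-in for a reader that stores per-quantity attributes."""

    def __init__(self, quantity_info):
        # quantity_info: dict mapping quantity name -> dict of attributes
        self._quantity_info = quantity_info

    def get_quantity_info(self, quantity_name, keyword=''):
        info = self._quantity_info.get(quantity_name)
        if not info:
            return 'Not available'
        if keyword == '':
            return dict(info)  # all available info
        return info.get(keyword, 'Not available')

    def list_quantities(self, with_info=False):
        names = sorted(self._quantity_info)
        if not with_info:
            return names
        return {name: self.get_quantity_info(name) for name in names}


gc = ToyCatalog({
    'ra': {'units': 'degree', 'description': 'Right ascension'},
    'galaxy_id': {},
})
ra_units = gc.get_quantity_info('ra', 'units')
id_info = gc.get_quantity_info('galaxy_id')
```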

Develop capability for composite catalog to contain add-ons where only select rows are modified

When joining an add-on catalog to a large catalog such as cosmoDC2, one use case that is currently not supported is replacing specific rows in the large catalog. An example is using the AGN catalog as an add-on: the composite catalog should modify only the rows that have an AGN component and leave the rest as in the original. I made a composite catalog "by hand" in this notebook. I first matched the galaxy IDs between the large catalog and the AGN catalog (because only some of the AGN galaxies appeared in cosmoDC2_v1.1.4_small). Then I matched the remaining AGN galaxies with those in the cosmoDC2 catalog and replaced the cosmoDC2 magnitudes with the magnitudes from the AGN catalog. (The reader for the AGN catalog is set up to deliver the sum of galaxy + AGN contributions.)
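
The row-level replacement described here can be sketched in a few lines: match rows by galaxy ID, take the add-on value where a match exists, and keep the original value otherwise. The function name and all values below are hypothetical, and plain lists stand in for catalog columns; this is not the code from the notebook.

```python
def replace_matched_rows(main_ids, main_values, addon_ids, addon_values):
    """Return main_values with entries replaced wherever the ID is in the add-on."""
    addon_lookup = dict(zip(addon_ids, addon_values))
    # Rows without a match in the add-on keep their original value.
    return [addon_lookup.get(gid, value)
            for gid, value in zip(main_ids, main_values)]


# Toy data: only galaxies 101 and 103 have an AGN component in the main catalog.
main_ids = [101, 102, 103, 104]
main_mag = [22.5, 23.1, 21.8, 24.0]
agn_ids = [103, 101, 999]       # 999 does not appear in the main catalog
agn_mag = [20.9, 21.7, 19.5]

new_mag = replace_matched_rows(main_ids, main_mag, agn_ids, agn_mag)
```

Add-on rows whose ID is absent from the main catalog (999 here) are simply ignored, which mirrors the situation where only some AGN galaxies appear in the smaller catalog.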

len(catalog) is very slow

Getting the length of a catalog can be very slow - it is implemented by loading in an entire column and measuring its length.

Could this be sped up? One option would be including the total catalog size in its configuration file, since catalogs don't typically change.

Another would be looking at file metadata rather than loading it all in.
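
As a sketch of the first option, a reader could prefer a size recorded in its configuration and fall back to the slow column load only when that is missing, caching the result either way. ToyCatalog and the 'catalog_size' key are hypothetical names, not part of GCR.

```python
class ToyCatalog:
    """Hypothetical catalog whose length can come from its configuration."""

    def __init__(self, column, config=None):
        self._column = column
        self._config = config or {}
        self._length = None
        self.full_loads = 0  # counts how often the slow path is taken

    def _load_column(self):
        # Stand-in for reading an entire column from disk.
        self.full_loads += 1
        return list(self._column)

    def __len__(self):
        if self._length is None:
            size = self._config.get('catalog_size')  # hypothetical config key
            if size is None:
                size = len(self._load_column())      # slow fallback
            self._length = int(size)                 # cache for later calls
        return self._length


fast = ToyCatalog(range(1000), config={'catalog_size': 1000})
slow = ToyCatalog(range(1000))
fast_len = len(fast)
slow_len = len(slow)
slow_len_again = len(slow)
```

The trade-off is that a size stored in the configuration has to be kept in sync with the files, which is reasonable only if catalogs rarely change.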
