
pandagg's Introduction

What is it?

pandagg is a Python package providing a simple interface to manipulate Elasticsearch queries and aggregations.

Disclaimer: this is a pre-release version.

Features

  • flexible aggregation and search queries declaration
  • query validation based on provided mapping
  • parsing of aggregation results in handy formats: tree with interactive navigation, csv-like tabular breakdown, and others
  • mapping interactive navigation
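
To give an idea of what a "csv-like tabular breakdown" of an aggregation response means, here is a minimal sketch in plain Python. It does not use pandagg and is not its implementation: the `flatten` helper and the aggregation names (`genres`, `decade`, `avg_rank`) are made up for illustration, and only the standard Elasticsearch bucket layout (`buckets`, `key`, `doc_count`, `value`) is assumed.

```python
# Hypothetical sketch: flatten nested bucket aggregations into flat rows.
# The helper and the aggregation names are assumptions, not pandagg's API.

def flatten(aggs, prefix=None):
    """Yield one dict per leaf bucket, carrying the keys of parent buckets."""
    prefix = prefix or {}
    for agg_name, agg_body in aggs.items():
        if not isinstance(agg_body, dict) or "buckets" not in agg_body:
            continue
        for bucket in agg_body["buckets"]:
            row = dict(prefix)
            row[agg_name] = bucket["key"]
            # Sub-aggregations that themselves contain buckets.
            sub_aggs = {k: v for k, v in bucket.items()
                        if isinstance(v, dict) and "buckets" in v}
            if sub_aggs:
                for sub_row in flatten(sub_aggs, row):
                    yield sub_row
            else:
                row["doc_count"] = bucket["doc_count"]
                # Single-value metrics such as avg or sum.
                for k, v in bucket.items():
                    if isinstance(v, dict) and "value" in v:
                        row[k] = v["value"]
                yield row


response_aggs = {"genres": {"buckets": [
    {"key": "Action", "doc_count": 3, "decade": {"buckets": [
        {"key": 1990, "doc_count": 1, "avg_rank": {"value": 6.0}},
        {"key": 2000, "doc_count": 2, "avg_rank": {"value": 7.5}},
    ]}},
    {"key": "Drama", "doc_count": 1, "decade": {"buckets": [
        {"key": 2000, "doc_count": 1, "avg_rank": {"value": 8.0}},
    ]}},
]}}

rows = list(flatten(response_aggs))
for row in rows:
    print(row)
```

Each row can then be written to a CSV file or loaded into a pandas dataframe.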

Usage

Documentation

Full documentation and user guide are available on Read the Docs.

Quick sneak peek

Elasticsearch dict syntax

>>> from pandagg.query import Query

>>> expected_query = {'bool': {'must': [
    {'terms': {'genres': ['Action', 'Thriller']}},
    {'range': {'rank': {'gte': 7}}},
    {'nested': {
        'path': 'roles',
        'query': {'bool': {'must': [
            {'term': {'roles.gender': {'value': 'F'}}},
            {'term': {'roles.role': {'value': 'Reporter'}}}]}
         }
    }}
]}}
>>> q = Query(expected_query)
>>> q
<Query>
bool
└── must
    ├── nested
    │   ├── path="roles"
    │   └── query
    │       └── bool
    │           └── must
    │               ├── term, field=roles.gender, value="F"
    │               └── term, field=roles.role, value="Reporter"
    ├── range, field=rank, gte=7
    └── terms, field=genres, values=['Action', 'Thriller']

DSL syntax

>>> from pandagg.query import Nested, Bool, Query, Range, Term, Terms as TermsFilter
>>> q = Query(
    Bool(must=[
        TermsFilter('genres', terms=['Action', 'Thriller']),
        Range('rank', gte=7),
        Nested(
            path='roles', 
            query=Bool(must=[
                Term('roles.gender', value='F'),
                Term('roles.role', value='Reporter')
            ])
        )
    ])
)

# serialized query is computed by `query_dict` method
>>> q.query_dict() == expected_query
True

Chained syntax

>>> from pandagg.query import Query, Range, Term

>>> q = Query()\
    .query({'terms': {'genres': ['Action', 'Thriller']}})\
    .nested(path='roles', _name='nested_roles', query=Term('roles.gender', value='F'))\
    .query(Range('rank', gte=7))\
    .query(Term('roles.role', value='Reporter'), parent='nested_roles')

>>> q
<Query>
bool
└── must
    ├── nested
    │   ├── path="roles"
    │   └── query
    │       └── bool
    │           └── must
    │               ├── term, field=roles.gender, value="F"
    │               └── term, field=roles.role, value="Reporter"
    ├── range, field=rank, gte=7
    └── terms, field=genres, values=['Action', 'Thriller']
     

Notes:

  • both DSL and dict syntaxes are accepted in Query compound clause methods (query, nested, must, etc.).
  • the last query uses the nested clause _name to detect where it should be inserted
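
The idea of locating an insertion point through a clause `_name` can be sketched on plain query dicts. This is only an illustration under stated assumptions, not how pandagg implements it: `find_named` and `insert_under` are hypothetical helpers, and the sketch assumes the named clause holds its sub-query under a `query` key, as a nested clause does.

```python
# Hypothetical helpers working on plain Elasticsearch query dicts.
# They illustrate name-based insertion; this is not pandagg's code.

def find_named(clause, name):
    """Return the body of the first clause whose `_name` equals `name`."""
    if isinstance(clause, dict):
        for body in clause.values():
            if isinstance(body, dict) and body.get("_name") == name:
                return body
            found = find_named(body, name)
            if found is not None:
                return found
    elif isinstance(clause, list):
        for item in clause:
            found = find_named(item, name)
            if found is not None:
                return found
    return None


def insert_under(query, name, new_clause):
    """Add `new_clause` to the `must` list of the clause named `name`."""
    body = find_named(query, name)
    if body is None:
        raise KeyError(name)
    existing = body["query"]
    if "bool" in existing:
        existing["bool"].setdefault("must", []).append(new_clause)
    else:
        # Wrap the single existing clause and the new one in a bool/must.
        body["query"] = {"bool": {"must": [existing, new_clause]}}


query = {"bool": {"must": [
    {"terms": {"genres": ["Action", "Thriller"]}},
    {"nested": {
        "_name": "nested_roles",
        "path": "roles",
        "query": {"term": {"roles.gender": {"value": "F"}}},
    }},
]}}

insert_under(query, "nested_roles",
             {"term": {"roles.role": {"value": "Reporter"}}})
nested_query = query["bool"]["must"][1]["nested"]["query"]
print(nested_query)
```

After the call, the nested clause holds a bool/must with both term clauses, while the rest of the query is left untouched.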

Installation

pip install pandagg

Dependencies

Hard dependency: lighttree (0.0.2 or higher)

Soft dependency: pandas, to parse aggregation results as a tabular dataframe

Motivations

pandagg is inspired by the official high-level Python client elasticsearch-dsl, and is intended to make it more convenient to deal with deeply nested queries and aggregations.

The fundamental difference between the two libraries is how they deal with the tree structure of aggregation queries and their responses.

Suppose we have the following aggregation structure (the aggregation types don't matter). Let's call A, B, C and D our aggregation nodes, and the whole structure our tree.

A           (Terms agg)
└── B       (Filters agg)
    ├── C   (Avg agg)
    └── D   (Sum agg)

The question is: who is in charge of storing the tree structure (how nodes are connected)?

In the elasticsearch-dsl library, each aggregation node is responsible for knowing its direct children.

In pandagg, nodes are agnostic about their parents and children, and a tree object is in charge of storing this structure. It is therefore possible to add, update or remove aggregation nodes or sub-trees at specific locations of the initial tree, which allows more flexible ways to build your queries.
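
The "tree object owns the structure" approach can be sketched in a few lines of plain Python. This is a simplified illustration, not pandagg's actual implementation: the `Node` and `Tree` classes and their methods are hypothetical names.

```python
# Hypothetical sketch: a tree object owns the structure, while nodes know
# nothing about their parents or children. Not pandagg's implementation.

class Node:
    def __init__(self, name, agg_type):
        self.name = name
        self.agg_type = agg_type


class Tree:
    def __init__(self):
        self.nodes = {}      # name -> Node
        self.children = {}   # name -> list of child names
        self.parent = {}     # name -> parent name (None for a root)

    def insert(self, node, parent=None):
        if node.name in self.nodes:
            raise ValueError("duplicate node name: %s" % node.name)
        if parent is not None and parent not in self.nodes:
            raise KeyError(parent)
        self.nodes[node.name] = node
        self.children[node.name] = []
        self.parent[node.name] = parent
        if parent is not None:
            self.children[parent].append(node.name)

    def remove(self, name):
        """Remove a node and its whole sub-tree."""
        for child in list(self.children[name]):
            self.remove(child)
        parent = self.parent.pop(name)
        if parent is not None:
            self.children[parent].remove(name)
        del self.children[name]
        del self.nodes[name]

    def show(self, name=None, depth=0):
        """Return one indented line per node, depth-first."""
        if name is None:
            lines = []
            for root in [n for n, p in self.parent.items() if p is None]:
                lines.extend(self.show(root, 0))
            return lines
        node = self.nodes[name]
        lines = ["%s%s (%s)" % ("    " * depth, node.name, node.agg_type)]
        for child in self.children[name]:
            lines.extend(self.show(child, depth + 1))
        return lines


tree = Tree()
tree.insert(Node("A", "terms"))
tree.insert(Node("B", "filters"), parent="A")
tree.insert(Node("C", "avg"), parent="B")
tree.insert(Node("D", "sum"), parent="B")

# A node can be removed at a specific location without touching the others.
tree.remove("C")
print("\n".join(tree.show()))
```

Because the nodes hold no references to each other, inserting or removing a sub-tree only updates the tree's own bookkeeping.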

Another difference concerns the response class: pandagg favors "explicit" attributes and methods over automatically generated attributes (except for classes whose purpose is exclusively interactive).

Disclaimers

pandagg is not as mature as the official client, and some interfaces might change.

It does not guarantee backward compatibility with previous versions of Elasticsearch (it is intended to work with versions >=7). The roadmap includes tagging pandagg versions according to the Elasticsearch versions they target (i.e. pandagg v7.1.4 would work with Elasticsearch v7.X.X).
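
As a rough illustration of that planned versioning scheme, a compatibility check could compare major version numbers. The helpers below are hypothetical and are not part of pandagg; they only assume that matching major versions mean compatibility, as described above.

```python
# Hypothetical helpers: the names and the rule are assumptions that
# illustrate the versioning scheme planned in the roadmap.

def major(version):
    """Return the major component of a version string such as 'v7.1.4'."""
    return int(version.lstrip("vV").split(".")[0])


def is_compatible(pandagg_version, es_version):
    """True when both versions share the same major number."""
    return major(pandagg_version) == major(es_version)


print(is_compatible("v7.1.4", "7.10.2"))  # True
print(is_compatible("v7.1.4", "6.8.0"))   # False
```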

It does not yet provide all the functionality of the official client (for instance ORM-like inserts/updates, index operations, etc.). The primary focus of pandagg has been read operations.

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

Roadmap

  • implement CI workflow: python2/3 tests, coverage
  • documentation: explain the challenges induced by nested node syntaxes, for instance why nested query clauses are saved in the children attribute before tree deserialization
  • on aggregation nodes, ensure all allowed fields are listed
  • expand functionalities: proper ORM similar to elasticsearch-dsl Document classes, index managing operations
  • package versions for different Elasticsearch versions

pandagg's People

Contributors

alk-lbinet, leonardbinet
