`query_builder`

The query_builder was developed to automate SQL operations based on input data and configuration provided by the user. The concept behind the query_builder is that a complex operation in a relational DB can be optimized by breaking it down into smaller operations. These operations can be native DB operations like join, bitwise operations or extended operations defined by the user. As a result of an operation, intermediate tables are created. These tables can be permanent in the database or trully temporary. The job of the query builder is to combine those intermediate tables in a predefined order to reduce the overall execution time. For instance, time consuming operations like join between large tables are performed at the end only when the size of the involved tables are reduced by previous operations.

The main concepts of the query_builder implementation are:

Workflow: the sequence of operations is described in a workflow like representation, the query_builder builds the queries for each operation, create intermediat tables and combine them in a predefined order to optimize the execution time. It also manages the parallelization of independent operations that can be run simultaneously.

Operation: a relational DB operation or a new operation defined by the user combining other operations. The result of an operation is an SQL query which is built based on input tables, intermediate tables created in previous operation of the workflow and on the configuration provided by the user for each operation.

New operations can be easily added and specified in the JSON file which represents the workflow. A new operation is defined overriding the get_statement method in the IOperation class. All operations are defined in Python using the SQLAlchemy core library.

Intermediate table: is a table created on the DB to store temporary data that are used to compute the final result. These tables can either be 'permanent' or 'temporary' depending on the configuration specified by the user. Permanent tables are useful for gather information on intermediate steps like fraction of rows removes, and other dianostics.

Example of a simple workflow: An workflow is defined using a JSON file. Each Object, represents an operation that must have at least 3 fields: "name" - Is the operation name and will be used to compose the intermediate table name. "op" - Is the type of the operation that will be executed. The operation must be defined in the operations file. "permanent_table" - If true, the table will be kept on the database. Otherwise, it will be removed as soon as possible.

Other fields can be added accordingly the operation needs. For example, the exposure_time_i operation has a field "value" that will be used by the operation "great_equal" only. To access this fields the method get_statement has an dictionary attribute called params.

When an operation depends on others operations, we use the reserved field "sub_op" and declare the list of operations that it depends on. In this sample, the exposure_time uses the intermediate tables produced by exposure_time_i and exposure_time_r. To access this data, the get_statement method has the attribute sub_operations in which the key is the operation name and the value associated is an object representation of an intermediate table.

Workflow definition: { "name": "exposure_time", "op": "join", "permanent_table": true, "sub_op": [ { "name": "exposure_time_i", "op": "great_equal", "permanent_table": false, "db": "y1a1_coadd_cosmos_d04_4096_exptime_i_10023575", "value": "0.55" }, { "name": "exposure_time_r", "op": "great_equal", "permanent_table": false, "db": "y1a1_coadd_cosmos_d04_4096_exptime_i_10023575", "value": "0.33" } ] }

The arguments received by the get_statement function for the example above is:

This way, all the operations can be written overriding the method get_statement.

Running the `query_builder` locally

Clone the project, create a virtualenv and install dependencies

  git clone https://github.com/lucasdpn/query_builder.git

  cd query_builder

  virtualenv env -p python3
  source env/bin/activate
  pip install -r requirements.txt

install postgresql
create databae

  CREATE DATABASE query_builder

create schema to save the results

  CREATE SCHEMA tst_oracle_output

download testing data and import in the database - sub_set_y1a1.sql has a set of data from y1a1_coadd_cosmos_d04 and others pypelines.

  download sub_set_y1a1.sql from http://devel2.linea.gov.br/~lucas.nunes/

import data in local database

  psql -U postgres query_builder < sub_set_y1a1.sql

execute tests (optional)

  PYTHONPATH=. python test/test_queries_y1ay_subset.py

lucasdpn / query_builder Goto Github PK

query_builder's Introduction

`query_builder`

Running the `query_builder` locally

query_builder's People

Contributors

Watchers

Forkers

query_builder's Issues

Image for README

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

lucasdpn / query_builder Goto Github PK

query_builder's Introduction

query_builder

Running the query_builder locally

query_builder's People

Contributors

Watchers

Forkers

query_builder's Issues

Recommend Projects

Recommend Topics

Recommend Org

`query_builder`

Running the `query_builder` locally