
Floorist

Floorist is a simple script that dumps the results of SQL queries from a PostgreSQL database into an S3 bucket in Parquet format.

Usage

The tool is designed for use in Kubernetes or OpenShift, but you can also run it locally using Docker or Podman. It is configured through environment variables and a single floorplan (YAML) file.

docker build -t floorist .

docker run --rm \
  -e POSTGRES_SERVICE_HOST=localhost \
  -e POSTGRESQL_USER=root \
  -e POSTGRESQL_PASSWORD=123456 \
  -e POSTGRESQL_DATABASE=postgres \
  -e AWS_ACCESS_KEY_ID=your_key_id \
  -e AWS_SECRET_ACCESS_KEY=your_secret_access_key \
  -e AWS_REGION=us-east-1 \
  -e AWS_BUCKET=floorist-dumps \
  -e FLOORPLAN_FILE=/floorplan.yml \
  -v "$(pwd)/floorplan.yml:/floorplan.yml" \
  floorist

Environment variables

  • POSTGRES_SERVICE_HOST
  • POSTGRESQL_USER
  • POSTGRESQL_PASSWORD
  • POSTGRESQL_DATABASE
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION
  • AWS_BUCKET
  • AWS_ENDPOINT - optional, only needed when using MinIO
  • FLOORPLAN_FILE - path to the floorplan (YAML) file
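
A minimal sketch of how such settings could be validated up front. The `load_config` helper is hypothetical (not Floorist's actual code), but the required/optional split matches the list above:

```python
import os

# Variables the documentation lists as required
REQUIRED_VARS = [
    "POSTGRES_SERVICE_HOST", "POSTGRESQL_USER", "POSTGRESQL_PASSWORD",
    "POSTGRESQL_DATABASE", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION", "AWS_BUCKET", "FLOORPLAN_FILE",
]

def load_config(environ=os.environ):
    """Collect the required settings, failing fast with a list of what is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    config = {name: environ[name] for name in REQUIRED_VARS}
    # AWS_ENDPOINT is optional (only needed for MinIO)
    config["AWS_ENDPOINT"] = environ.get("AWS_ENDPOINT")
    return config
```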

Floorplan file

The floorplan file defines a list of prefix-query pairs. The prefix should be a valid folder path; it will be created under the bucket if it does not exist. For the queries, it is recommended to assign simple aliases to dynamically created columns (joins or aggregates) using AS. Optionally, you can set a custom chunksize per query (default: 1000), which caps the number of records written to a single parquet file.
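
One way to picture the chunksize behavior is a pandas-style chunked read. This is an illustrative sketch with a hypothetical `dump_query` helper, not Floorist's actual implementation; the S3/parquet writing step is omitted:

```python
import pandas as pd

def dump_query(query, connection, chunksize=1000):
    """Read a SQL query in chunks and yield one DataFrame per future parquet file.

    Each chunk holds at most `chunksize` records, matching the floorplan's
    per-file limit.
    """
    for chunk in pd.read_sql(query, connection, chunksize=chunksize):
        yield chunk
```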

- prefix: dumps/people
  query: >-
    SELECT name, email, birthyear FROM people;
- prefix: dumps/cities
  query: >-
    SELECT name AS city_name, zip, country FROM cities;
  chunksize: 100

The example above creates two dumps under the S3 bucket specified in the AWS_BUCKET environment variable, as <prefix>/year_created=<Y>/month_created=<M>/day_created=<D>/<UUID>.parquet files.
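
The date-partitioned key layout can be sketched with a hypothetical helper. Whether Floorist zero-pads the month and day is not confirmed by this document; the sketch below does not pad:

```python
import uuid
from datetime import date

def parquet_key(prefix: str, day: date = None) -> str:
    """Build an S3 object key of the form
    <prefix>/year_created=<Y>/month_created=<M>/day_created=<D>/<UUID>.parquet
    (zero-padding of month/day is an assumption left out here).
    """
    day = day or date.today()
    return (
        f"{prefix}/year_created={day.year}"
        f"/month_created={day.month}/day_created={day.day}"
        f"/{uuid.uuid4()}.parquet"
    )
```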

Clowder - How to add Floorist to your Clowder template

You only need to add a new job definition to your ClowdApp, plus a ConfigMap with the floorplan definition your app needs.

The example below configures the Floorist job using a secret for the S3 bucket configuration, and includes a floorplan ConfigMap for reference.

# clowdapp.yaml

- apiVersion: cloud.redhat.com/v1alpha1
  kind: ClowdApp
  metadata:
    name: "${APP_NAME}"
  spec:
    jobs:
    - name: floorist
      schedule: ${FLOORIST_SCHEDULE}
      suspend: ${FLOORIST_SUSPEND}
      concurrencyPolicy: Forbid
      podSpec:
        image: ${FLOORIST_IMAGE}:${FLOORIST_IMAGE_TAG}
        env:
        - name: AWS_BUCKET
          valueFrom:
            secretKeyRef:
              name: ${FLOORIST_BUCKET_SECRET_NAME}
              key: bucket
        - name: AWS_REGION
          valueFrom:
            secretKeyRef:
              name: ${FLOORIST_BUCKET_SECRET_NAME}
              key: aws_region
        - name: AWS_ENDPOINT
          valueFrom:
            secretKeyRef:
              name: ${FLOORIST_BUCKET_SECRET_NAME}
              key: endpoint
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: ${FLOORIST_BUCKET_SECRET_NAME}
              key: aws_access_key_id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: ${FLOORIST_BUCKET_SECRET_NAME}
              key: aws_secret_access_key
        - name: FLOORPLAN_FILE
          value: "/tmp/floorplan/floorplan.yaml"
        - name: LOGLEVEL
          value: ${FLOORIST_LOGLEVEL}
        volumeMounts:
        - name: floorplan-volume
          mountPath: "/tmp/floorplan"
        volumes:
          - name: floorplan-volume
            configMap:
              name: floorplan
        resources:
          limits:
            cpu: "${CPU_LIMIT_FLOORIST}"
            memory: "${MEMORY_LIMIT_FLOORIST}"
          requests:
            cpu: "${CPU_REQUEST_FLOORIST}"
            memory: "${MEMORY_REQUEST_FLOORIST}"
- apiVersion: v1
  kind: ConfigMap
  metadata:
    name: floorplan
  data:
    floorplan.yaml: |
      - prefix: insights/your-service-name/hosts-query
        query: >-
          SELECT
            "inventory"."hosts"."id",
            ...
      - prefix: insights/your-service-name/policies-query
        query: >-
          SELECT DISTINCT
            "policies"."id",
            "profiles"."ref_id",
            ...
      - prefix: insights/your-service-name/policy_hosts-query
        query: >-
          SELECT "policy_hosts"."host_id", "policy_hosts"."policy_id" FROM "policy_hosts";

parameters:
- name: MEMORY_LIMIT_FLOORIST
  value: 200Mi
- name: MEMORY_REQUEST_FLOORIST
  value: 100Mi
- name: CPU_LIMIT_FLOORIST
  value: 100m
- name: CPU_REQUEST_FLOORIST
  value: 50m
- name: FLOORIST_SCHEDULE
  description: Cronjob schedule definition
  required: true
  value: "0 2 * * *"
- name: FLOORIST_SUSPEND
  description: Disable Floorist cronjob execution
  required: true
  value: "true"
- description: Floorist image name
  name: FLOORIST_IMAGE
  value: quay.io/cloudservices/floorist
- description: Floorist Image tag
  name: FLOORIST_IMAGE_TAG
  required: true
  value: latest
- description: Shared bucket name
  name: FLOORIST_BUCKET_NAME
  required: true
  value: floorist-bucket

Testing

Testing the tool requires PostgreSQL and MinIO; a Docker Compose file with everything prepared and configured is provided in the tests folder. The configuration for these two services is stored in the tests/env.yaml file; for the Docker Compose setup it is enough to copy tests/env.yaml.example to tests/env.yaml. However, if you would like to bring your own PostgreSQL server, or use a real S3 bucket, edit these values accordingly. The tests are started via pytest.

There are two ways to run the tests: locally with pytest, or entirely in containers, as our CI process does.

Running tests locally

# Install with Test dependencies
pip install -e .[test] -r requirements.txt

# Set the environment config file
cp tests/env.yaml.example tests/env.yaml

# Bring up all the required containers
docker-compose -f tests/docker-compose.yml up -d

# Run the tests locally
pytest tests

# Tear-down
docker-compose -f tests/docker-compose.yml down

Running tests from containers

Alternatively, you can run the same process the CI system runs locally, by invoking the pr_check.sh script with the LOCAL_BUILD=true environment variable set:

LOCAL_BUILD=true ./pr_check.sh

The pr_check.sh script will:

  • Build a new image for Floorist using the test dependencies (see build_deploy.sh for details)
  • Run the tests in a container using the aforementioned image (see run-tests.sh for details)

Please note: since the pr_check.sh script creates and deletes the containers it uses on each run, it has to generate a custom env.yaml file with the correct container names (i.e., to connect to the right database and the right MinIO container), overriding any existing env file in the process. The tests/env.yaml.example file uses localhost defaults for running pytest locally, so if you alternate between pr_check.sh and local pytest runs, make sure tests/env.yaml has the correct values each time.

Contributing

Bug reports and pull requests are welcome, here are some ideas for improvement:

  • More fine-grained specification of the output path and filename in the floorplan (e.g. custom timestamps)
  • Additional parameters for the parquet exporting (column partitioning, compression, indexing, column types)
  • Support for other databases
  • Support for different object storage services
  • Better tests
  • Pylint support

License

The application is available as open source under the terms of the Apache License, version 2.0.

Contributors

romanblanco, skateman, victoremepunto, vkrizan


Issues

There's no proper handling of non-YAML floorplan files

The parser opens the file as YAML and reads rows expecting them to have the right structure, which is not safe; when the format is not as expected, it raises exceptions:

 ERROR    root:floorist.py:55 string indices must be integers
 Traceback (most recent call last):
   File "/opt/app-root/lib64/python3.9/site-packages/floorist/floorist.py", line 34, in main
     logging.debug(f"Dumping #{dump_count}: {row['query']} to {row['prefix']}")
 TypeError: string indices must be integers

This is reproducible by passing a file containing a random string (empty files also cause #2).
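
A hedged sketch of the kind of validation that would give a clearer error. The `load_floorplan` helper is hypothetical, not a patch to Floorist's actual code:

```python
import yaml

def load_floorplan(path: str):
    """Parse a floorplan file and verify it is a list of prefix/query mappings.

    Raises ValueError with a clear message instead of the opaque
    'string indices must be integers' TypeError seen above.
    """
    with open(path) as stream:
        plan = yaml.safe_load(stream)
    if not isinstance(plan, list):
        raise ValueError(f"{path} must contain a YAML list of prefix/query pairs")
    for i, row in enumerate(plan):
        if not isinstance(row, dict) or "prefix" not in row or "query" not in row:
            raise ValueError(f"floorplan entry #{i} is missing 'prefix' or 'query'")
    return plan
```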

Update dependencies to make Floorist compatible with Python 3.10

The currently pinned package versions (specifically pyarrow, pinned to 5.0.0, which requires Python <= 3.9) make Floorist incompatible with Python 3.10.

Reviewing the available packages may turn up updates that make Floorist work with the latest stable Python release (3.10.2).

There's no handling of empty floorplans

The code raises an exception if an empty file is passed as a floorplan:

with open(env['FLOORPLAN_FILE'], 'r') as stream:
         # This try block allows us to proceed if a single SQL query fails
 >       for row in yaml.safe_load(stream):
 E       TypeError: 'NoneType' object is not iterable

There should be a check or better handling of empty files to avoid this type of error.
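
A minimal guard for this case might look like the following (illustrative only; `rows_from` is a hypothetical helper):

```python
import yaml

def rows_from(stream):
    """Return floorplan rows, treating an empty YAML document as an empty list.

    yaml.safe_load returns None for empty input, which is not iterable;
    coercing it to [] avoids the TypeError above.
    """
    plan = yaml.safe_load(stream)
    return plan if plan is not None else []
```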
