
raystack / optimus

737 stars · 18 watchers · 153 forks · 13.53 MB

Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.

Home Page: https://raystack.github.io/optimus

License: Apache License 2.0

Dockerfile 0.02% Makefile 0.20% Go 98.18% Python 1.49% Shell 0.11%
airflow etl workflows automation golang bigquery data-warehouse analytics data-modelling analytics-engineering

optimus's People

Contributors

anuraagbarde, arinda-arif, dependabot[bot], deryrahman, irainia, kushsharma, lollyxsrinand, mauliksoneji, mryashbhardwaj, novanxyz, ravisuhag, rootcss, sbchaos, scortier, siddhanta-rath, smarchint, sravankorumilli, sumitagrawal03071989, tharun1718333, vianhazman

optimus's Issues

Support custom date range generation via SQL query

Right now in Optimus, the date range is generated by this window config in the task section of the job.yaml file:

window:
    size: 24h
    offset: 24h
    truncate_to: d

In some of our use cases, we need to generate a custom date range based on certain conditions expressed as a SQL query. For example:

SELECT DISTINCT DATE(event_date) as data_date
FROM some_table
WHERE (event_date >= start_date and event_date < end_date)

UNION DISTINCT

SELECT DISTINCT DATE(created_date) as data_date
FROM some_table_2
WHERE ((created_date >= start_date AND created_date < end_date)  
OR (last_modified_date >= start_date AND last_modified_date < end_date))
ORDER BY 1

The date range generated by the above query will then be used as parameters to the job.

Provide a provision to configure resources for Jobs

Is your feature request related to a problem? Please describe.
Pods should have defaults configured, but there should also be a provision to override them; currently there is no way to configure CPU and memory for the jobs. Provide a mechanism to configure the resources for the jobs.
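
One possible shape for such a configuration, sketched as the Go structs a hypothetical resources block in job.yaml could unmarshal into; the field names are illustrative, not an existing Optimus spec:

package spec

// Hypothetical resources block under a job's task section; field names are
// illustrative only and not part of the current Optimus spec.
type JobResources struct {
	Request ResourceConfig `yaml:"request"` // what the pod asks for at scheduling time
	Limit   ResourceConfig `yaml:"limit"`   // hard ceiling enforced on the pod
}

type ResourceConfig struct {
	CPU    string `yaml:"cpu"`    // e.g. "250m"
	Memory string `yaml:"memory"` // e.g. "512Mi"
}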

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Optimus command to start a program with context envs injected by default

When a plugin docker container executes and requires configuration/asset files as input, it can either use GRPC/REST calls to fetch them from Optimus, or use the existing command optimus admin build instance, which writes them to the local filesystem.

What if we had a run command instead that injects these configuration variables as environment variables by default? Something like optimus admin run instance python3 main.py --some-arg, where python3 main.py --some-arg varies per plugin.
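
A rough sketch of what such a wrapper could do, with a hypothetical fetchInstanceEnvs helper standing in for the real call to the Optimus server:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// fetchInstanceEnvs stands in for the existing "admin build instance" logic:
// it would fetch the compiled config/asset variables from the Optimus server.
func fetchInstanceEnvs() (map[string]string, error) {
	// GRPC/REST call to Optimus would go here.
	return map[string]string{"DSTART": "2021-01-01", "DEND": "2021-01-02"}, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: run <program> [args...]")
		os.Exit(1)
	}
	envs, err := fetchInstanceEnvs()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to fetch instance envs:", err)
		os.Exit(1)
	}
	// Everything after the subcommand is the user program,
	// e.g. "python3 main.py --some-arg".
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Env = os.Environ()
	for k, v := range envs {
		cmd.Env = append(cmd.Env, k+"="+v) // inject Optimus context as env vars
	}
	cmd.Stdout, cmd.Stderr, cmd.Stdin = os.Stdout, os.Stderr, os.Stdin
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}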

Delete call of job specification doesn't work

Sending a REST call to delete a job specification returns a 404, whereas the GRPC call works fine. Steps to reproduce:

curl -X DELETE "http://localhost:9100/v1/project/my-project/namespace/kush/helloworld" -H  "accept: application/json"

Show recent replay list of a project

Currently, the available Replay subcommands are run, to start a replay, and status, to get the status of a single replay by ID. However, Optimus should also be able to list the recent replays of a project, so that users can check the latest replay requests, including the ID, which job and date range is being replayed, when the replay happened, and its status. A list subcommand will be added to accommodate this.

Example: optimus replay list --project [project_name]

Create backup resource run

  • Add the command
  • Generate the job spec from the destination (already available, need to test)
  • Resolve the dependencies of the job (just use the resolver)
  • Response of list of tables to be backed up
    • As part of the response, we can highlight which tables will be backed up and which will not (if the user chose to ignore downstream backups)
  • Prompt for confirmation to proceed with the backup
  • Decide whether the job can be backed up, and for which datastore and destination
  • Backup the resource
    • Bigquery
      • Table
      • View
      • External table (to be verified)

Support Replay and Backup for multiple namespaces project

Users should be able to back up and replay downstream jobs in a different namespace, as long as they are authorized to do so.

  • should be able to accept allowed_downstream with possible values * (all namespaces) or empty (only the requesting namespace); applies to both Replay and Backup (a sketch of this check follows the list)
  • should be able to accept ignore_downstream in Replay
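
A possible shape for that namespace check, with hypothetical function and field names:

package replay

// canIncludeDownstream decides whether a downstream job in another namespace
// may be included in a replay/backup request, based on the allowed_downstream
// config: "*" allows all namespaces, an empty list allows only the requesting one.
func canIncludeDownstream(allowedDownstream []string, requestNamespace, jobNamespace string) bool {
	if jobNamespace == requestNamespace {
		return true
	}
	for _, ns := range allowedDownstream {
		if ns == "*" || ns == jobNamespace {
			return true
		}
	}
	return false
}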

User should be able to list all secrets.

A user should be provided an option to list all secrets within the project through the API and CLI; only digests are shown, to protect the secrets.
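
A minimal sketch of how a display digest could be produced for each secret value; the scheme here (truncated SHA-256) is an assumption for illustration, not the actual Optimus implementation:

package secret

import (
	"crypto/sha256"
	"encoding/hex"
)

// digestFor returns a short, non-reversible digest of a secret value so the
// list endpoint can identify secrets without ever exposing their contents.
func digestFor(value []byte) string {
	sum := sha256.Sum256(value)
	return hex.EncodeToString(sum[:8]) // first 8 bytes are enough to identify
}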

Acceptance Criteria

  • All secrets along with the digest to be shown to the user when requested.
  • Operation to fail with relevant details shown on invalid/insufficient params provided.
  • Documentation to be updated accordingly.

Option to ignore inferred dependencies

Optimus supports figuring out dependencies automatically by parsing task assets. This dependency-finding logic is implemented per task. Users can also choose not to rely on the task's inferred dependencies and instead pass them explicitly in the job.yaml specification. We need a way to explicitly ignore the task's automatically inferred dependencies.

We can use the existing specification file to add a non-breaking change as follows:

name: job1
dependencies:
- job: hello_world1
- job: hello_world3
- ignore_job: hello_world2
- ignore_job: hello_world4

In this case, if the task used in this job somehow infers hello_world2 as one of the upstreams, we will choose not to treat it as an upstream dependency. Similarly, if the inference logic did not find hello_world4 as an upstream, nothing happens and no error should be thrown.
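
A sketch of how the resolver could apply the ignore list after inference (names are illustrative, not the actual Optimus code):

package resolver

// filterIgnored drops any inferred upstream that the job spec explicitly marks
// with ignore_job; ignored names that were never inferred are simply a no-op.
func filterIgnored(inferred, ignored []string) []string {
	ignoreSet := make(map[string]bool, len(ignored))
	for _, name := range ignored {
		ignoreSet[name] = true
	}
	kept := inferred[:0]
	for _, dep := range inferred {
		if !ignoreSet[dep] {
			kept = append(kept, dep)
		}
	}
	return kept
}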

Basic user authentication and authorisation

We can support basic authentication with minimal permission-based rule enforcement that can be read from a file by the Optimus server. The file could be stored locally, in a k8s config map, in GCS, etc., for the server to fetch.

[
  {
     "username": "foo",
     "password": "bar",
     "perms": ["*"]
  },
  {
     "username": "optimus",
     "password": "$2a$10$fKRHxrEuyDTP6tXIiDycr.nyC8Q7UMIfc31YMyXHDLgRDyhLK3VFS",
     "perms": ["deploy:t-data", "deploy:g-data"]
  },
  {
     "username": "prime",
     "password": "pass",
     "perms": ["deploy:*", "register:project", "register:secret"]
  }
]

Passwords can be cleartext or bcrypt hashes. Each permission is mapped as action:entity, and * is used as a wildcard for all. To avoid authentication for internal clients (Airflow docker images), we can break the Optimus API into two parts, public and internal, exposed on different ports; only the public one will be served to external users.
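
Verifying either stored form could look roughly like this, using golang.org/x/crypto/bcrypt (a sketch, not the proposed implementation):

package auth

import (
	"crypto/subtle"
	"strings"

	"golang.org/x/crypto/bcrypt"
)

// passwordMatches accepts both stored forms described above: bcrypt hashes
// (which start with the "$2" version prefix) and cleartext entries.
func passwordMatches(stored, supplied string) bool {
	if strings.HasPrefix(stored, "$2") {
		return bcrypt.CompareHashAndPassword([]byte(stored), []byte(supplied)) == nil
	}
	return subtle.ConstantTimeCompare([]byte(stored), []byte(supplied)) == 1
}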

From the cli we can either

  • Use a .netrc file to store user credentials.
  • Have users provide these credentials in their .optimus.yaml file while running the command (auth method and username can only be configured in the file; the password will either be passed as a flag or asked from the user on stdin).

Optimus should not fail on startup because of bad plugins

When loading a plugin after the discovery phase, if the first GRPC client handshake between the Optimus core and the plugin server fails, don't terminate the application; instead, simply log a warning and continue, as if an invalid binary had been discovered as a plugin.
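
A minimal sketch of such a tolerant loading loop (function names are illustrative, not the actual plugin manager code):

package plugin

import "log"

// Plugin is a stand-in for the real plugin client handle.
type Plugin interface{}

// LoadAll performs the GRPC handshake with every discovered plugin binary; on
// failure it logs a warning and moves on, so a single bad binary cannot stop
// the Optimus server from starting.
func LoadAll(paths []string, connect func(string) (Plugin, error), register func(Plugin)) {
	for _, path := range paths {
		client, err := connect(path)
		if err != nil {
			log.Printf("warn: skipping invalid plugin %s: %v", path, err)
			continue
		}
		register(client)
	}
}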

Support for opentelemetry metrics

Currently, no stats/metrics/traces are being pushed by the Optimus service. It should support emitting basic stats like CPU/memory/GC usage, time to complete GRPC calls, etc.
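
For the GRPC-call timing part, a minimal sketch using the OpenTelemetry Go metric API could look like this; the meter and instrument names are made up for illustration:

package telemetry

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// TimeGRPCCall records how long a single GRPC handler took; meter and
// instrument names here are illustrative only.
func TimeGRPCCall(ctx context.Context, method string, handler func(context.Context) error) error {
	hist, err := otel.Meter("optimus").Float64Histogram("grpc.call.duration_ms")
	if err != nil {
		return handler(ctx) // a metrics failure should never break the call
	}
	start := time.Now()
	callErr := handler(ctx)
	hist.Record(ctx, float64(time.Since(start).Milliseconds()),
		metric.WithAttributes(attribute.String("method", method)))
	return callErr
}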

Broken Link

The link to the contributing guide on line 101 of Readme.md is broken.
The line:
Read our [contributing guide](docs/contribute/contribution.md) to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to Optimus.

Using integer type in job spec configs causes panic

Currently, job spec YAML configuration only supports string key-value pairs. Having string kv pairs is fine, but passing an int (or any other type) should fail gracefully instead of causing a panic.

panic: interface conversion: interface {} is int, not string
goroutine 1 [running]:
github.com/odpf/optimus/store/local.JobSpecAdapter.ToSpec
....
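
The graceful alternative to the failing assertion is the comma-ok form plus an explicit conversion, roughly (a sketch, not the actual fix):

package local

import "fmt"

// configValueToString replaces the bare v.(string) assertion that panics on
// non-string YAML values: known scalars are stringified, everything else
// returns an error instead of crashing the CLI.
func configValueToString(v interface{}) (string, error) {
	if s, ok := v.(string); ok {
		return s, nil
	}
	switch v.(type) {
	case int, int64, float64, bool:
		return fmt.Sprintf("%v", v), nil
	default:
		return "", fmt.Errorf("unsupported config value type %T", v)
	}
}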

Refactor logger used in packages

The current implementation of the logger is rough: a global variable is used across different packages. It should be properly injected from the top wherever it is needed.
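
Injection would look roughly like this: the logger is constructed once at the top level and passed down through constructors instead of being read from a package-level global (a sketch):

package job

import "log"

// Service receives its logger through the constructor instead of reading a
// package-level global, so the top level decides how and where to log.
type Service struct {
	logger *log.Logger
}

func NewService(logger *log.Logger) *Service {
	return &Service{logger: logger}
}

func (s *Service) Deploy(name string) {
	s.logger.Printf("deploying job %s", name)
	// ... actual deploy logic ...
}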

Refactor optimus plugins to have a base plugin interface

Currently, each plugin type has its own interface with a set of functions in it. We should break it down into multiple interfaces, like a BaseInterface implemented by all plugins, a CLIInterface for those who want their plugin exposed in CLI questions/answers, etc.
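
One possible split, with illustrative interface and method names (not the final design):

package plugin

import "context"

// BasePlugin is the minimal contract every plugin would implement.
type BasePlugin interface {
	PluginInfo() (PluginInfo, error)
}

// CLIPlugin is optional: only plugins that want to drive the CLI
// question/answer flow implement it in addition to BasePlugin.
type CLIPlugin interface {
	GetQuestions(ctx context.Context) ([]string, error)
	ValidateAnswers(ctx context.Context, answers map[string]string) error
}

// PluginInfo is a stand-in for the real info payload.
type PluginInfo struct {
	Name    string
	Version string
}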

User should be able to delete a secret.

A user should be able to delete a secret that is no longer used in any job specs.

Acceptance Criteria

  • If the secret is referenced in any job spec, the delete should fail & list all the jobs where it is referenced.
  • If the secret is not referenced the deletion of the secret to be successful.
  • If the secret doesn't exist, show a clear message that the secret doesn't exist.

Two jobs with same destination will cause ambiguous dependency resolution

The current database model doesn't properly handle two jobs within a project writing to a single destination, which causes ambiguity during dependency resolution. The destination should also capture the type of destination, not just the name, to handle a variety of destinations like buckets/databases/tables/etc.

Create backup resource dry run

  • Add backup command
  • Generate the job spec using datastore destination (already available, need to test)
  • Resolve the dependencies of the job (use the resolver)
  • Response of list of tables to be backed up

User should be able to create/update a secret through apis & cli.

A user, through the APIs or the CLI, should be able to create/update secrets so that they can be referenced and used in Optimus jobs.

The user should be provided an option to create a secret at the project or namespace level.

Acceptance Criteria

  • GRPC endpoints to create/update a secret to accept base64-encoded values.
  • CLI to create/update secrets to accept both base64 & plain text.
  • Update documentation
  • Secrets to be encrypted securely.

Implement google sheets external data table type for bigquery datastore

Enable Google Sheets external table management for the BigQuery datastore via Optimus.

User should be able to:

  • Create a google sheets external table by specifying sheets URI
  • Define schema for the google sheets external table
  • Use metadata management feature such as Labels

The implementation should be extensible to other BigQuery-supported external data sources for future development.
About BigQuery external tables: https://cloud.google.com/bigquery/docs/external-tables

Allow plugins to skip assets compilation with Go template

Some plugins might have their own asset compilation method, for example using Go templates or Jinja with their own set of variables. For example, an email-sending plugin might have variables like:

Dear {{ .RECIPIENTS }},
Attached in this email is the monthly report for {{ .MONTH }}

Based on the discussion with @kushsharma, the proposed workaround is to add a SkipCompile flag to CompileAssetsResponse, configurable at the plugin level.
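
On the plugin side that could look like the following sketch; the request/response types here are stand-ins for the ones in Optimus' plugin contract, and SkipCompile is the proposed new field:

package plugin

import "context"

// Stand-ins for the request/response types in Optimus' plugin contract;
// SkipCompile is the proposed new field.
type CompileAssetsRequest struct{ Assets map[string]string }

type CompileAssetsResponse struct {
	Assets      map[string]string
	SkipCompile bool
}

type emailPlugin struct{}

// CompileAssets for a plugin that templates its own assets (e.g. with Jinja):
// it returns the assets untouched and asks Optimus core not to run its own
// Go template compilation over them.
func (p *emailPlugin) CompileAssets(_ context.Context, req CompileAssetsRequest) (*CompileAssetsResponse, error) {
	return &CompileAssetsResponse{
		Assets:      req.Assets, // already rendered by the plugin's own templating
		SkipCompile: true,
	}, nil
}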

Secrets can be used through macros in the job specs.

All user-created secrets at the namespace or project level can be referenced by users in the job spec and should be evaluated while fetching the assets and the spec.
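
Since asset compilation already uses Go templates, evaluating a secret macro could look roughly like this; the .secret namespace is an assumed macro name for illustration:

package compiler

import (
	"os"
	"text/template"
)

// renderWithSecrets compiles an asset with Go templates, exposing the
// project/namespace secrets under a ".secret" map so a spec can reference
// e.g. {{ .secret.BQ_SERVICE_ACCOUNT }}.
func renderWithSecrets(asset string, secrets map[string]string) error {
	tmpl, err := template.New("asset").Parse(asset)
	if err != nil {
		return err
	}
	return tmpl.Execute(os.Stdout, map[string]interface{}{"secret": secrets})
}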

Acceptance Criteria

  • When the secrets are referenced properly, they should be evaluated and returned by the registerInstance API.
  • When the secrets are not referenced properly, the API call should fail with a relevant message.

Duplicate cross project dependencies should be handled gracefully

If a job's task infers a cross-project dependency and the same dependency is also mentioned in the job.yaml specification, they are treated as duplicates. The reason is that inferred dependencies use the job name as the key in the dependency map, whereas a cross-project dependency mentioned in the specification uses project_name/job_name, so duplicates are created inside the same dependency map.
For example:

...
dependencies:
- job: foo-project/bar-job
  type: inter

The map will then contain two entries: bar-job and foo-project/bar-job.

Although users can choose to simply write the spec properly, the expectation is that Optimus handles this gracefully.
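
One way to avoid the duplicate is to normalize every dependency key to the fully qualified project/job form before inserting it into the map, along these lines:

package resolver

import "strings"

// canonicalKey returns a "project/job" key for a dependency, whether it came
// from inference (bare job name plus its resolved project) or from the spec
// ("project/job" already). One canonical form keeps the dependency map free
// of duplicates like "bar-job" vs "foo-project/bar-job".
func canonicalKey(project, name string) string {
	if strings.Contains(name, "/") {
		return name // the spec already used the fully qualified form
	}
	return project + "/" + name
}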

Support for RateLimits for replay such that the scheduled jobs are not impacted.

Is your feature request related to a problem? Please describe.
The Replay feature can be handy, but it can also be a problem when it disrupts regular scheduled runs by hitting worker pool limits or the limits of the underlying datastore.
It would be nice to have a feature to limit the number of job runs caused by replay.

Describe the solution you'd like
A config at the project or namespace level to rate-limit the number of active job runs due to replay, considering all downstream runs. Also, an option to force this validation, mainly for admin usage when replaying a higher-priority job.

Describe alternatives you've considered
Currently, the design doesn't provide a mechanism to identify whether a job run was triggered by a replay or by the schedule, so configuring those jobs differently such that limits can be applied at the underlying datastore or any downstream level is not possible.
Even assigning all replay requests to a different pool at the scheduler level is not an option now. For all of that, a custom scheduler with state management handled by Optimus would be the solution, but that would be a big change.

Additional context
With secret management, Optimus will have the flexibility to configure different service accounts for different jobs, so one can configure the service accounts accordingly such that resources are properly used across jobs.

Support for External Sensor for Optimus Jobs

Currently, Optimus supports sensors for job dependencies within and outside the project, but only when they are managed by the same Optimus server. It would be helpful if Optimus supported job sensors for jobs managed by a different Optimus deployment, as within an organisation there will be many deployments; checking for data availability alone may not guarantee the completeness and correctness of data, which is guaranteed through Optimus dependencies.

Expectation:
The sensor checks the status of the jobs within the input window.

Configuration:

dependencies:
- job:
  type: external
  project:
  host:
  start_time:   # start time of the data that the job depends on
  end_time:     # end time of the data that the job depends on

The Optimus server that accepts the request checks, based on its window and schedule configuration, for all the jobs that output data for the given window.

This has the challenge of breaking dependencies when a job name changes.

Backup & Replay Improvements.

The table name of the backup result should preferably have a timestamp suffix instead of a UUID.

To be considered:
  • the table name length limit
  • a separator in the timestamp to keep it readable
  • the timestamp should equal the backup time and be the same suffix for all downstream tables

Extra useful information can be added to the backup list response:
  • high-level information (from the user-request point of view)
  • the ignore-downstream choice should be included
  • the TTL (not the expiry time)

Add a backup details subcommand to show the list of all the tables backed up.
