
raystack / optimus

737 stars · 18 watchers · 153 forks · 13.53 MB

Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.

Home Page: https://raystack.github.io/optimus

License: Apache License 2.0

Dockerfile 0.02% Makefile 0.20% Go 98.18% Python 1.49% Shell 0.11%
airflow etl workflows automation golang bigquery data-warehouse analytics data-modelling analytics-engineering

optimus's People

Contributors

anuraagbarde, arinda-arif, dependabot[bot], deryrahman, irainia, kushsharma, lollyxsrinand, mauliksoneji, mryashbhardwaj, novanxyz, ravisuhag, rootcss, sbchaos, scortier, siddhanta-rath, smarchint, sravankorumilli, sumitagrawal03071989, tharun1718333, vianhazman

optimus's Issues

Support custom date range generation via SQL query

Right now in Optimus, the date range is generated by this window config in the task section of the job.yaml file:

window:
    size: 24h
    offset: 24h
    truncate_to: d

In some of our use cases, we need to generate a custom date range based on certain conditions expressed as a SQL query. For example:

SELECT DISTINCT DATE(event_date) as data_date
FROM some_table
WHERE (event_date >= start_date and event_date < end_date)

UNION DISTINCT

SELECT DISTINCT DATE(created_date) as data_date
FROM some_table_2
WHERE ((created_date >= start_date AND created_date < end_date)  
OR (last_modified_date >= start_date AND last_modified_date < end_date))
ORDER BY 1

The date range generated by the above query will then be used as parameters to the job.

Provide a provision to configure resources for Jobs

Is your feature request related to a problem? Please describe.
Pods should have defaults configured, but there should also be a provision to override them; currently there is no way to configure CPU and memory for the jobs. Provide a mechanism to configure the resources for the jobs.
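
One possible shape for such a configuration, sketched as the Go structs a hypothetical resources block in job.yaml could unmarshal into; the field names are illustrative, not an existing Optimus spec:

package spec

// Hypothetical resources block under a job's task section; field names are
// illustrative only and not part of the current Optimus spec.
type JobResources struct {
	Request ResourceConfig `yaml:"request"` // what the pod asks for at scheduling time
	Limit   ResourceConfig `yaml:"limit"`   // hard ceiling enforced on the pod
}

type ResourceConfig struct {
	CPU    string `yaml:"cpu"`    // e.g. "250m"
	Memory string `yaml:"memory"` // e.g. "512Mi"
}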

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Optimus command to start a program with context envs injected by default

When a plugin docker container executes and requires configuration/asset files as input, it can either use GRPC/REST calls to fetch them from Optimus, or use the existing command optimus admin build instance, which writes them to the local filesystem.

What if we had a run command instead that injects these configuration variables as environment variables by default? Something like optimus admin run instance python3 main.py --some-arg, where python3 main.py --some-arg varies per plugin.
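
A rough sketch of what such a wrapper could do, with a hypothetical fetchInstanceEnvs helper standing in for the real call to the Optimus server:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// fetchInstanceEnvs stands in for the existing "admin build instance" logic:
// it would fetch the compiled config/asset variables from the Optimus server.
func fetchInstanceEnvs() (map[string]string, error) {
	// GRPC/REST call to Optimus would go here.
	return map[string]string{"DSTART": "2021-01-01", "DEND": "2021-01-02"}, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: run <program> [args...]")
		os.Exit(1)
	}
	envs, err := fetchInstanceEnvs()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to fetch instance envs:", err)
		os.Exit(1)
	}
	// Everything after the subcommand is the user program,
	// e.g. "python3 main.py --some-arg".
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Env = os.Environ()
	for k, v := range envs {
		cmd.Env = append(cmd.Env, k+"="+v) // inject Optimus context as env vars
	}
	cmd.Stdout, cmd.Stderr, cmd.Stdin = os.Stdout, os.Stderr, os.Stdin
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}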

Delete call of job specification doesn't work

Sending a REST call to delete a job specification returns a 404, whereas the GRPC call works fine. Steps to reproduce:

curl -X DELETE "http://localhost:9100/v1/project/my-project/namespace/kush/helloworld" -H  "accept: application/json"

Show recent replay list of a project

Currently, the available Replay subcommands are run, to start a replay, and status, to get the status of a single replay by ID. However, Optimus should also be able to list the recent replays of a project, so that users can check the latest replay requests, including the ID, which job and date range is being replayed, when the replay happened, and its status. A list subcommand will be added to accommodate this.

Example: optimus replay list --project [project_name]

Create backup resource run

  • Add the command
  • Generate the job spec from the destination (already available, need to test)
  • Resolve the dependencies of the job (just use the resolver)
  • Response of list of tables to be backed up
    • As part of the response, we can highlight which tables will be backed up and which will not (if the user chose to ignore downstream backups)
  • Prompt for confirmation to proceed with the backup
  • Decide whether the job can be backed up, and for which datastore and destination
  • Backup the resource
    • Bigquery
      • Table
      • View
      • External table (to be verified)

Support Replay and Backup for multiple namespaces project

Users should be able to back up and replay downstream jobs in a different namespace, as long as they are authorized to do so.

  • should be able to accept allowed_downstream with possible values * (all namespaces) or empty (only the requesting namespace); applies to both Replay and Backup (a sketch of this check follows the list)
  • should be able to accept ignore_downstream in Replay
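
A possible shape for that namespace check, with hypothetical function and field names:

package replay

// canIncludeDownstream decides whether a downstream job in another namespace
// may be included in a replay/backup request, based on the allowed_downstream
// config: "*" allows all namespaces, an empty list allows only the requesting one.
func canIncludeDownstream(allowedDownstream []string, requestNamespace, jobNamespace string) bool {
	if jobNamespace == requestNamespace {
		return true
	}
	for _, ns := range allowedDownstream {
		if ns == "*" || ns == jobNamespace {
			return true
		}
	}
	return false
}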

User should be able to list all secrets.

A user should be provided an option to list all secrets within the project through the API and CLI; only digests are shown, to protect the secrets.
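
A minimal sketch of how a display digest could be produced for each secret value; the scheme here (truncated SHA-256) is an assumption for illustration, not the actual Optimus implementation:

package secret

import (
	"crypto/sha256"
	"encoding/hex"
)

// digestFor returns a short, non-reversible digest of a secret value so the
// list endpoint can identify secrets without ever exposing their contents.
func digestFor(value []byte) string {
	sum := sha256.Sum256(value)
	return hex.EncodeToString(sum[:8]) // first 8 bytes are enough to identify
}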

Acceptance Criteria

  • All secrets along with the digest to be shown to the user when requested.
  • Operation to fail with relevant details shown on invalid/insufficient params provided.
  • Documentation to be updated accordingly.

Option to ignore inferred dependencies

Optimus supports figuring out dependencies automatically by parsing task assets. This dependency-finding logic is implemented per task. Users can also choose not to rely on the task's inferred dependencies and instead pass them explicitly in the job.yaml specification. We need a way to explicitly ignore the task's automatically inferred dependencies.

We can use the existing specification file to add a non-breaking change as follows:

name: job1
dependencies:
- job: hello_world1
- job: hello_world3
- ignore_job: hello_world2
- ignore_job: hello_world4

In this case, if the task used in this job somehow infers hello_world2 as one of the upstreams, we will choose not to treat it as an upstream dependency. Similarly, if the inference logic did not find hello_world4 as an upstream, nothing happens and no error should be thrown.
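
A sketch of how the resolver could apply the ignore list after inference (names are illustrative, not the actual Optimus code):

package resolver

// filterIgnored drops any inferred upstream that the job spec explicitly marks
// with ignore_job; ignored names that were never inferred are simply a no-op.
func filterIgnored(inferred, ignored []string) []string {
	ignoreSet := make(map[string]bool, len(ignored))
	for _, name := range ignored {
		ignoreSet[name] = true
	}
	kept := inferred[:0]
	for _, dep := range inferred {
		if !ignoreSet[dep] {
			kept = append(kept, dep)
		}
	}
	return kept
}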

Basic user authentication and authorisation

We can support basic authentication with minimal permission-based rule enforcement that can be read from a file by the Optimus server. The file could be stored locally, in a k8s config map, in GCS, etc., for the server to fetch.

[
  {
     "username": "foo",
     "password": "bar",
     "perms": ["*"]
  },
  {
     "username": "optimus",
     "password": "$2a$10$fKRHxrEuyDTP6tXIiDycr.nyC8Q7UMIfc31YMyXHDLgRDyhLK3VFS",
     "perms": ["deploy:t-data", "deploy:g-data"]
  },
  {
     "username": "prime",
     "password": "pass",
     "perms": ["deploy:*", "register:project", "register:secret"]
  }
]

Passwords can be cleartext or bcrypt hashes. Each permission is mapped as action:entity, and * is used as a wildcard for all. To avoid authentication for internal clients (Airflow docker images), we can break the Optimus API into two parts, public and internal, exposed on different ports; only the public one will be served to external users.
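
Verifying either stored form could look roughly like this, using golang.org/x/crypto/bcrypt (a sketch, not the proposed implementation):

package auth

import (
	"crypto/subtle"
	"strings"

	"golang.org/x/crypto/bcrypt"
)

// passwordMatches accepts both stored forms described above: bcrypt hashes
// (which start with the "$2" version prefix) and cleartext entries.
func passwordMatches(stored, supplied string) bool {
	if strings.HasPrefix(stored, "$2") {
		return bcrypt.CompareHashAndPassword([]byte(stored), []byte(supplied)) == nil
	}
	return subtle.ConstantTimeCompare([]byte(stored), []byte(supplied)) == 1
}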

From the cli we can either

  • Use a .netrc file to store user credentials.
  • Have users provide these credentials in their .optimus.yaml file while running the command (auth method and username can only be configured in the file; the password will either be passed as a flag or asked from the user on stdin).

Optimus should not fail on startup because of bad plugins

When loading a plugin after the discovery phase, if the first GRPC client handshake between the Optimus core and the plugin server fails, don't terminate the application; instead, simply log a warning and continue, as if an invalid binary had been discovered as a plugin.
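
A minimal sketch of such a tolerant loading loop (function names are illustrative, not the actual plugin manager code):

package plugin

import "log"

// Plugin is a stand-in for the real plugin client handle.
type Plugin interface{}

// LoadAll performs the GRPC handshake with every discovered plugin binary; on
// failure it logs a warning and moves on, so a single bad binary cannot stop
// the Optimus server from starting.
func LoadAll(paths []string, connect func(string) (Plugin, error), register func(Plugin)) {
	for _, path := range paths {
		client, err := connect(path)
		if err != nil {
			log.Printf("warn: skipping invalid plugin %s: %v", path, err)
			continue
		}
		register(client)
	}
}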

Support for opentelemetry metrics

Currently, no stats/metrics/traces are being pushed by the Optimus service. It should support emitting basic stats like CPU/memory/GC usage, time to complete GRPC calls, etc.
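
For the GRPC-call timing part, a minimal sketch using the OpenTelemetry Go metric API could look like this; the meter and instrument names are made up for illustration:

package telemetry

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// TimeGRPCCall records how long a single GRPC handler took; meter and
// instrument names here are illustrative only.
func TimeGRPCCall(ctx context.Context, method string, handler func(context.Context) error) error {
	hist, err := otel.Meter("optimus").Float64Histogram("grpc.call.duration_ms")
	if err != nil {
		return handler(ctx) // a metrics failure should never break the call
	}
	start := time.Now()
	callErr := handler(ctx)
	hist.Record(ctx, float64(time.Since(start).Milliseconds()),
		metric.WithAttributes(attribute.String("method", method)))
	return callErr
}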

Broken Link

The link to the contributing guide on line 101 of Readme.md is broken.
The line:
Read our [contributing guide](docs/contribute/contribution.md) to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to Optimus.

Using integer type in job spec configs causes panic

Currently, job spec YAML configuration only supports string key-value pairs. Having string kv pairs is fine, but passing an int (or any other type) should fail gracefully instead of causing a panic.

panic: interface conversion: interface {} is int, not string
goroutine 1 [running]:
github.com/odpf/optimus/store/local.JobSpecAdapter.ToSpec
....
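
The graceful alternative to the failing assertion is the comma-ok form plus an explicit conversion, roughly (a sketch, not the actual fix):

package local

import "fmt"

// configValueToString replaces the bare v.(string) assertion that panics on
// non-string YAML values: known scalars are stringified, everything else
// returns an error instead of crashing the CLI.
func configValueToString(v interface{}) (string, error) {
	if s, ok := v.(string); ok {
		return s, nil
	}
	switch v.(type) {
	case int, int64, float64, bool:
		return fmt.Sprintf("%v", v), nil
	default:
		return "", fmt.Errorf("unsupported config value type %T", v)
	}
}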

Refactor logger used in packages

The current implementation of the logger is rough: a global variable is used across different packages. It should be properly injected from the top wherever it is needed.
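
Injection would look roughly like this: the logger is constructed once at the top level and passed down through constructors instead of being read from a package-level global (a sketch):

package job

import "log"

// Service receives its logger through the constructor instead of reading a
// package-level global, so the top level decides how and where to log.
type Service struct {
	logger *log.Logger
}

func NewService(logger *log.Logger) *Service {
	return &Service{logger: logger}
}

func (s *Service) Deploy(name string) {
	s.logger.Printf("deploying job %s", name)
	// ... actual deploy logic ...
}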

Refactor optimus plugins to have a base plugin interface

Currently, each plugin type has its own interface with a set of functions in it. We should break it down into multiple interfaces, like a BaseInterface implemented by all plugins, a CLIInterface for those who want their plugin exposed in CLI questions/answers, etc.
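
One possible split, with illustrative interface and method names (not the final design):

package plugin

import "context"

// BasePlugin is the minimal contract every plugin would implement.
type BasePlugin interface {
	PluginInfo() (PluginInfo, error)
}

// CLIPlugin is optional: only plugins that want to drive the CLI
// question/answer flow implement it in addition to BasePlugin.
type CLIPlugin interface {
	GetQuestions(ctx context.Context) ([]string, error)
	ValidateAnswers(ctx context.Context, answers map[string]string) error
}

// PluginInfo is a stand-in for the real info payload.
type PluginInfo struct {
	Name    string
	Version string
}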

User should be able to delete a secret.

A user should be able to delete a secret that is no longer used in any job specs.

Acceptance Criteria

  • If the secret is referenced in any job spec, the delete should fail & list all the jobs where it is referenced.
  • If the secret is not referenced the deletion of the secret to be successful.
  • If the secret doesn't exist, show a clear message that the secret doesn't exist.

Two jobs with same destination will cause ambiguous dependency resolution

The current database model doesn't properly handle two jobs within a project writing to a single destination, which causes ambiguity during dependency resolution. The destination should also capture the type of destination, not just the name, to handle a variety of destinations like buckets/databases/tables/etc.

Create backup resource dry run

  • Add backup command
  • Generate the job spec using datastore destination (already available, need to test)
  • Resolve the dependencies of the job (use the resolver)
  • Response of list of tables to be backed up

User should be able to create/update a secret through apis & cli.

A user, through the APIs or the CLI, should be able to create/update secrets so that they can be referenced and used in Optimus jobs.

The user should be provided an option to create a secret at the project or namespace level.

Acceptance Criteria

  • GRPC endpoints to create/update a secret to accept base64-encoded values.
  • CLI to create/update secrets to accept both base64 & plain text.
  • Update documentation
  • Secrets to be encrypted securely.

Implement google sheets external data table type for bigquery datastore

Enable Google Sheets external table management for the BigQuery datastore via Optimus.

User should be able to:

  • Create a google sheets external table by specifying sheets URI
  • Define schema for the google sheets external table
  • Use metadata management feature such as Labels

The implementation should be extensible to other BigQuery-supported external data sources for future development.
About BigQuery external tables: https://cloud.google.com/bigquery/docs/external-tables

Allow plugins to skip assets compilation with Go template

Some plugins might have their own asset compilation method, for example using Go templates or Jinja with their own set of variables. For example, an email-sending plugin might have variables like:

Dear {{ .RECIPIENTS }},
Attached in this email is the monthly report for {{ .MONTH }}

Based on the discussion with @kushsharma, the proposed workaround is to add a SkipCompile flag to CompileAssetsResponse, configurable at the plugin level.
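
On the plugin side that could look like the following sketch; the request/response types here are stand-ins for the ones in Optimus' plugin contract, and SkipCompile is the proposed new field:

package plugin

import "context"

// Stand-ins for the request/response types in Optimus' plugin contract;
// SkipCompile is the proposed new field.
type CompileAssetsRequest struct{ Assets map[string]string }

type CompileAssetsResponse struct {
	Assets      map[string]string
	SkipCompile bool
}

type emailPlugin struct{}

// CompileAssets for a plugin that templates its own assets (e.g. with Jinja):
// it returns the assets untouched and asks Optimus core not to run its own
// Go template compilation over them.
func (p *emailPlugin) CompileAssets(_ context.Context, req CompileAssetsRequest) (*CompileAssetsResponse, error) {
	return &CompileAssetsResponse{
		Assets:      req.Assets, // already rendered by the plugin's own templating
		SkipCompile: true,
	}, nil
}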

Secrets can be used through macros in the job specs.

All user-created secrets at the namespace or project level can be referenced by users in the job spec and should be evaluated while fetching the assets and the spec.
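
Since asset compilation already uses Go templates, evaluating a secret macro could look roughly like this; the .secret namespace is an assumed macro name for illustration:

package compiler

import (
	"os"
	"text/template"
)

// renderWithSecrets compiles an asset with Go templates, exposing the
// project/namespace secrets under a ".secret" map so a spec can reference
// e.g. {{ .secret.BQ_SERVICE_ACCOUNT }}.
func renderWithSecrets(asset string, secrets map[string]string) error {
	tmpl, err := template.New("asset").Parse(asset)
	if err != nil {
		return err
	}
	return tmpl.Execute(os.Stdout, map[string]interface{}{"secret": secrets})
}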

Acceptance Criteria

  • When the secrets are referenced properly, they should be evaluated and returned by the registerInstance API.
  • When the secrets are not referenced properly, the API call should fail with a relevant message.

Duplicate cross project dependencies should be handled gracefully

If a job's task infers a cross-project dependency and the same dependency is also mentioned in the job.yaml specification, they are treated as duplicates. The reason is that inferred dependencies use the job name as the key in the dependency map, whereas a cross-project dependency mentioned in the specification uses project_name/job_name, so duplicates are created inside the same dependency map.
For example:

...
dependencies:
- job: foo-project/bar-job
  type: inter

The map will then contain two entries: bar-job and foo-project/bar-job.

Although users can choose to simply write the spec properly, the expectation is that Optimus handles this gracefully.
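
One way to avoid the duplicate is to normalize every dependency key to the fully qualified project/job form before inserting it into the map, along these lines:

package resolver

import "strings"

// canonicalKey returns a "project/job" key for a dependency, whether it came
// from inference (bare job name plus its resolved project) or from the spec
// ("project/job" already). One canonical form keeps the dependency map free
// of duplicates like "bar-job" vs "foo-project/bar-job".
func canonicalKey(project, name string) string {
	if strings.Contains(name, "/") {
		return name // the spec already used the fully qualified form
	}
	return project + "/" + name
}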

Support for RateLimits for replay such that the scheduled jobs are not impacted.

Is your feature request related to a problem? Please describe.
The Replay feature can be handy, but it can also be a problem when it disrupts regular scheduled runs by hitting worker pool limits or the limits of the underlying datastore.
It would be nice to have a feature to limit the number of job runs caused by replay.

Describe the solution you'd like
A config at the project or namespace level to rate-limit the number of active job runs due to replay, considering all downstream runs. Also, an option to force this validation, mainly for admin usage when replaying a higher-priority job.

Describe alternatives you've considered
Currently, the design doesn't provide a mechanism to identify whether a job run was triggered by a replay or by the schedule, so configuring those jobs differently such that limits can be applied at the underlying datastore or any downstream level is not possible.
Even assigning all replay requests to a different pool at the scheduler level is not an option now. For all of that, a custom scheduler with state management handled by Optimus would be the solution, but that would be a big change.

Additional context
With secret management, Optimus will have the flexibility to configure different service accounts for different jobs, so one can configure the service accounts accordingly such that resources are properly used across jobs.

Support for External Sensor for Optimus Jobs

Currently, Optimus supports sensors for job dependencies within and outside the project, but only when they are managed by the same Optimus server. It would be helpful if Optimus supported job sensors for jobs managed by a different Optimus deployment, as within an organisation there will be many deployments; checking for data availability alone may not guarantee the completeness and correctness of data, which is guaranteed through Optimus dependencies.

Expectation:
The sensor checks the status of the jobs within the input window.

Configuration:

dependencies:
- job:
  type: external
  project:
  host:
  start_time:   # start time of the data that the job depends on
  end_time:     # end time of the data that the job depends on

The Optimus server that accepts the request checks, based on its window and schedule configuration, for all the jobs that output data for the given window.

This has the challenge of breaking dependencies when a job name changes.

Backup & Replay Improvements.

The table name of the backup result should preferably have a timestamp suffix instead of a UUID.

To be considered:
  • the table name length limit
  • a separator in the timestamp to keep it readable
  • the timestamp should equal the backup time and be the same suffix for all downstream tables

Extra useful information can be added to the backup list response:
  • high-level information (from the user-request point of view)
  • the ignore-downstream choice should be included
  • the TTL (not the expiry time)

Add a backup details subcommand to show the list of all the tables backed up.
