
whale's Introduction

Whale is actively being built and maintained by hyperquery. For our full data workspace for teams, check out hyperquery.

The simplest way to find tables, write queries, and take notes

whale is a lightweight, CLI-first SQL workspace for your data warehouse.

  • Execute SQL in .sql files using wh run, or in sql blocks within .md files using the --!wh-run flag and wh run.
  • Automatically index all of the tables in your warehouse as plain markdown files -- so they're easily versionable, searchable, and editable either locally or through a remote git server.
  • Search for tables and documentation.
  • Define and schedule basic metric calculations (in beta).

😁 Join the discussion on slack.



For a demo of a git-backed workflow, check out dataframehq/whale-bigquery-public-data.

📔 Documentation

Read the docs for a full overview of whale's capabilities.

Installation

Mac OS

brew install dataframehq/tap/whale

All others

Make sure Rust is installed on your local system. Then, clone this repository and run the following in the base directory of the repo:

make && make install

If you are running this multiple times, make sure ~/.whale/libexec does not exist, or your virtual environment may not rebuild. We don't explicitly add an alias for the whale binary, so you'll want to add the following alias to your .bash_profile or .zshrc file.

alias wh=~/.whale/bin/whale

Getting started

Setup

For individual use, run the following command to go through the onboarding process. It will (a) set up all necessary files in ~/.whale, (b) walk you through cron job scheduling to periodically scrape metadata, and (c) set up a warehouse:

wh init

The cron job will run as you schedule it (by default, every 6 hours). If you're feeling impatient, you can also manually run wh etl to pull down the latest data from your warehouse.
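For reference, the scheduled entry looks roughly like the following crontab line (illustrative only; wh init writes the actual entry for you, and the path assumes the default install location):

0 */6 * * * ~/.whale/bin/whale etl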

For team use, see the docs for instructions on how to set up and point your whale installation at a remote git server.

Seeding some sample data

If you just want to get a feel for how whale works, remove the ~/.whale directory and follow the instructions at dataframehq/whale-bigquery-public-data.

Go go go!

Run:

wh

to search over all metadata. Hitting enter will open the editable part of the docs in your default text editor, defined by the environment variable $EDITOR (if no value is specified, whale will use the command open).

To execute .sql files, run:

wh run your_query.sql

To execute markdown files, write the query in a ```sql block and place --!wh-run on its own line within that block. When the markdown file is executed, any sql block containing this comment will be run, and the `--!wh-run` line will be replaced with the result set. To run the markdown file, run:

wh run your_markdown_file.md
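For example, a markdown file might contain something like the following (the table and column names here are hypothetical):

## Scratch notes on signups

```sql
select count(*) from analytics.signups
--!wh-run
```

When wh run is executed against this file, the --!wh-run line is replaced in place by the query's result set.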

A common pattern is to set up a shortcut in your IDE to execute wh run % for a smooth editing + execution workflow. For an example of how to do this in vim, see the docs here. This is one of the most powerful features of whale, enabling you to take notes and write executable queries seamlessly side-by-side.

whale's People

Contributors

bachng2017, biomunky, cantzakas, dependabot[bot], duyquang6, jhh3000, jwodder, manuelzander, peterthesling, pipboyguy, riccardotonini, rsyi, shettysaish20, victoriapm, yosupmoon


whale's Issues

Add "alerts" to metrics.

We want to add alerting support within metrics yaml definitions, so that, given a user-provided condition, a slack message is sent.

Detailed requirements

Metrics can be defined that assess the quality of data in a table, e.g. by using a definition like:

id-nulls:
  sql: |
    select count(*) from schema.table where id is null

This would be substantially more useful if coupled with alert setup, which would tell you when a field is broken. Perhaps something like this would be reasonable:

id-nulls:
  sql: |
    select count(*) from schema.table where id is null
  alerts:
    - condition: "> 0"
      message: "Nulls found in schema.table, column 'id'."
      slack: ["#channel", "@somebody"]
      email: "[email protected]"

This would require support for several things:

  • Scaffolding to send slack messages: Here is the official SDK, which provides a nice example on how to do this.
  • Triggering the send of slack messages: Within MetricRunner, loop through the alerts section of the yaml, run eval on each condition against the metric result, and, given a failure, send the specified message to the listed slack recipients (a sketch follows this list).
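A minimal sketch of this check-and-notify loop, assuming the metrics yaml has already been parsed into a dict and using the official slack_sdk client (the token environment variable and function name are hypothetical):

import logging
import os

from slack_sdk import WebClient

LOGGER = logging.getLogger(__name__)
slack_client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # hypothetical token source


def check_alerts(metric_name, metric_details, result):
    """Evaluate each alert condition against the metric result and notify slack on failure."""
    for alert in metric_details.get("alerts", []):
        # e.g. result = 3, condition = "> 0"  ->  eval("3 > 0")
        if eval("{} {}".format(result, alert["condition"])):
            for recipient in alert.get("slack", []):
                LOGGER.info("Alert triggered for metric %s; notifying %s", metric_name, recipient)
                slack_client.chat_postMessage(channel=recipient, text=alert["message"])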

Future considerations:

  • Email support would also be nice, though as a heavy slack user, I don't see a huge need for it, so will leave it out-of-scope until someone else argues. :)

Directory Organization

Today whale pulls table information and organizes it all in the same directory, so tables from different databases end up mixed together. I think it would be more organized if we had a directory for each database. Does that make sense to you?

Thank you!

Add snowflake support

A number of people have asked for snowflake support. Adding this issue here to track progress on this front.

Support query templating, not just running.

Currently, our Jinja engine (wh run) combines the templating and execution of the provided query into one step. For pipelining purposes, it could be useful to enable wh render that simply returns the query, without execution.

There are two parts to this issue (for python contributors, feel free to build the python end without building the rust part):

  • Python: (The python library lives in the pipelines sub-directory.) We'll need to write a function (perhaps render_sql_file) that takes a file path, opens it, runs whale.utils.template_query on the contents, and returns/logs the final result; see the sketch after this list. The appropriate place for this is whale.__init__, where the run and execute_sql_file functions already live. template_query does all the hard work, so this function is just a wrapper, just as execute_sql_file is to run (see both for an example of what I mean).
  • Python: As-is, the API is just a set of scripts that are executed from a shell by rust, so you'll have to write a script that runs the whale.render_sql_file function. See pipelines/build_script.py for an example.
  • Rust: you'll have to modify main.rs to create a new render subcommand for the CLI, then have it reference a function in lib.rs that sources the whale virtual environment in ~/.whale/libexec. I'd just copy over the logic from the pull or run subcommands, which do the same thing.
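A minimal sketch of the Python side, assuming template_query takes the raw query text and returns the rendered string (its exact signature should be checked against the pipelines code):

import logging

from whale.utils import template_query

LOGGER = logging.getLogger(__name__)


def render_sql_file(filepath):
    """Template a .sql file with Jinja and return the rendered query, without executing it."""
    with open(filepath, "r") as f:
        raw_query = f.read()
    rendered_query = template_query(raw_query)
    LOGGER.info(rendered_query)
    return rendered_query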

Google Cloud Spanner support

This will involve:

  1. Writing a new Extractor that runs a query that scrapes metadata from Spanner.
  2. Writing logic into pipelines/utils/task_wrappers.py and pipelines/utils/extractor_wrappers.py to run the above Extractor when wh.pull is called.

We'll also later have to add this as an option to the wh init workflow in rust, but the library will function without that, so it's a lower priority for now.
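For the extraction query itself, a rough sketch using the google-cloud-spanner client could look like the following (the project, instance, and database names are placeholders, and the surrounding whale Extractor scaffolding is omitted):

from google.cloud import spanner

# Placeholder identifiers; in practice these would come from connections.yaml.
client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("my-database")

metadata_sql = """
select t.table_schema, t.table_name, c.column_name, c.spanner_type
from information_schema.tables t
join information_schema.columns c
  on t.table_schema = c.table_schema and t.table_name = c.table_name
where t.table_schema not in ('INFORMATION_SCHEMA', 'SPANNER_SYS')
"""

with database.snapshot() as snapshot:
    for schema, table, column, column_type in snapshot.execute_sql(metadata_sql):
        print(schema, table, column, column_type)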

whale Installation fails in Ubuntu 20.04

My environment is

Ubuntu 20.04

I ran the git clone command

git clone https://github.com/dataframehq/whale.git

Then I go to the whale folder and run the following command

linttu@linttu-VirtualBox:~/Desktop/whale$ make && make install

The resulting error is

cargo build --release --manifest-path cli/Cargo.toml
make: cargo: Command not found
make: *** [Makefile:10: rust] Error 127
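The error indicates that cargo, Rust's build tool, is not installed. Assuming nothing else is wrong with the setup, installing the Rust toolchain first (for example via rustup) and then re-running make && make install should resolve this:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh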

Enable multi-value metrics query automation

Currently, we restrict the "metrics" feature to queries that can only return single values.
We should enable multiple values to be returned and viewed in the rendered markdown metrics payload.

Requirements:

  • Must be backwards-compatible with the original metrics calculation scheme.

Revise ./shell/mf to open both the .docs.md and the .md file upon entering.

Currently, opening a table entry selected in mf will open the .docs.md file, but this means that for tables with many columns or on small screens, you may not be able to see all the basic metadata. This is an edge case, but the only current workaround is to run mf --a. We might want to consider changing the default behavior to open both files, while making the .md metadata stub read-only. Ideally, these would show up in a split-pane setup.

In vim, you can run vim -o to open two splits, which we really ought to do (with the focus on the .docs.md split). But we need to add some case statements to ensure that, for users not using vim, the resulting behavior is reasonable.

Plans for Apache Atlas support/integration

Hi,

This is related to #3 . Are there plans to support Apache Atlas (https://atlas.apache.org)? It's a metadata store that'll include other things like business catalogs and glossaries.
There's some integration with Amundsen, where the latter can store data on Atlas instead of Neo4j. In that case, supporting Amundsen API might be one way to support Atlas.

Postgres views are not indexed

Postgres views are not indexed after wh pull. The postgres SQL seems to miss views: the pg_catalog.pg_statio_all_tables table, which is joined in the query, doesn't contain views in my case.

Improve rust test coverage.

The rust side of things is embarrassingly uncovered by tests. :)

This will be a standing issue, until we get to the point where our test coverage matches Python.

Isolated executability of metrics

Add the ability to execute MetricRunner against a single table_stub_path, and add the ability to execute this in rust through the CLI.

Detailed requirements

If MetricRunner is executed with table_stub_paths specified, it will only run the metrics specified in the table_stub_paths list. This means it's possible to use something like the following python snippet to execute only a single metric.

# Import paths below are approximate, inferred from module references elsewhere in this repo.
from pyhocon import ConfigFactory
from whale.extractor.metric_runner import MetricRunner
from whale.loader.whale_loader import WhaleLoader
from whale.task import WhaleTask
from whale.transformer.markdown_transformer import MarkdownTransformer


def pull_metrics(table_address):
    conf = ConfigFactory.from_dict({
        "markdown_metric.table_stub_paths": table_address,
        "loader.whale.database_name": table_address.split("/")[0],
    })
    task = WhaleTask(
        extractor=MetricRunner(),
        transformer=MarkdownTransformer(),
        loader=WhaleLoader(),
    )
    task.init(conf)
    task.run()

This ticket therefore entails:

  • Enabling a skim hotkey to execute the above python snippet through the rust CLI (this appears to be possible through the library we're using to handle the search, skim: https://github.com/lotabout/skim#keymap).
  • Add [a functional version of] the above snippet to the python codebase (perhaps create a new function in utils.task_wrappers called pull_metrics() or something.)

Add jinja templating support to `utils.task_wrappers.run` and `extractor.metric_runner.MetricRunner`

For more modular SQL-writes within the platform, we should enable jinja templating in metrics queries (in MetricRunner) & raw queries (in run), using templates stored in the ~/.whale/templates folder.

Detailed specifications

  • All Jinja templates should be stored in ~/.whale/templates.
  • A template ~/.whale/templates/my_template.sql should be able to be added to a query by inserting {% extends "my_template.sql" %}, following the Jinja template engine pattern (similar to the pattern found in Django). This will require invoking a jinja2.FileSystemLoader with the path ~/.whale/templates given as an argument (an example is shown here).
  • Both utils.task_wrappers.run and extractor.metric_runner.MetricRunner.extract() should run the templating engine first, then execute the rendered query.
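A small sketch of the templating call, assuming a parent template ~/.whale/templates/my_template.sql that defines a {% block body %}:

import os

from jinja2 import Environment, FileSystemLoader

# Point the environment at the templates folder so {% extends %} can resolve parent templates.
env = Environment(loader=FileSystemLoader(os.path.expanduser("~/.whale/templates")))

query = """
{% extends "my_template.sql" %}
{% block body %}
select * from schema.table
{% endblock %}
"""

rendered_query = env.from_string(query).render()
print(rendered_query)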

Add roadmap.

For visibility. I'll add this in later, but if you have a feature request, either post it here, in slack, or make an issue and I'll add it to the roadmap!

Possible features:

  • Data profiling
  • Additional bigquery metadata (e.g. tags)
  • Cloud Spanner support

Add docs for contributors.

Currently, there are no documented best practices for how to test the stack, how to contribute, and so on.

Add command to delete stale docs

Currently, there's no way to delete old documentation except by manually rm-ing the associated files. We should enable mf clean to remove all stale files, though I'm not entirely sure of the best way to do this.

Using the Amundsen's APIs as a bridge rather than interacting with Neo4J?

Amundsen's APIs (specifically metadataservice) might be a good idea to integrate as a backend since it abstracts the Atlas/Neo4J/other backends and will not depend on the schema of the backend data.

If the goal is to be able to run completely offline with a local copy of the backing data then I can understand that too, but if it's not a concern to depend on connectivity/access to Amundsen then it might be worthwhile.

Serve results as an API

It would be really useful if we could mount this on a machine (or multiple) and allow the search to be done through an API.
This could allow for multiple use cases like building a frontend for the tool.
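As a purely illustrative sketch (not an existing feature), a thin HTTP layer over the locally indexed markdown files might look something like this; the metadata directory location is an assumption:

from pathlib import Path

from fastapi import FastAPI

app = FastAPI()
METADATA_DIR = Path.home() / ".whale" / "metadata"  # assumed location of the table stubs


@app.get("/search")
def search(q: str):
    """Naive full-text search over the locally indexed markdown files."""
    hits = [
        str(path.relative_to(METADATA_DIR))
        for path in METADATA_DIR.rglob("*.md")
        if q.lower() in path.read_text(errors="ignore").lower()
    ]
    return {"query": q, "results": hits}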

Error when try get data from Glue Catalog

Hello,

I'm trying to get metadata from glue catalog and I got this error:

Traceback (most recent call last):
  File "/usr/local/Cellar/whale/v1.1.0/bin/../libexec/build_script.py", line 20, in <module>
    pull()
  File "/usr/local/Cellar/whale/v1.1.0/libexec/env/lib/python3.8/site-packages/whale/utils/task_wrappers.py", line 107, in pull
    task.run()
  File "/usr/local/Cellar/whale/v1.1.0/libexec/env/lib/python3.8/site-packages/whale/task/__init__.py", line 26, in run
    record = self.transformer.transform(record)
  File "/usr/local/Cellar/whale/v1.1.0/libexec/env/lib/python3.8/site-packages/whale/transformer/markdown_transformer.py", line 118, in transform
    return formatter(record)
  File "/usr/local/Cellar/whale/v1.1.0/libexec/env/lib/python3.8/site-packages/whale/transformer/markdown_transformer.py", line 29, in format_table_metadata
    description = record.description + "\n"
TypeError: unsupported operand type(s) for +: 'DescriptionMetadata' and 'str'
thread 'main' panicked at 'Can't read file.: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/filesystem.rs:27:50
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Thank you

Glue parsing behaves poorly in certain edge cases.

Surfaced by @rubenssoto:

Traceback (most recent call last):
  File "/usr/local/Cellar/whale/v1.1.4/bin/../libexec/build_script.py", line 23, in <module>
    pull()
  File "/usr/local/Cellar/whale/v1.1.4/libexec/env/lib/python3.8/site-packages/whale/utils/task_wrappers.py", line 113, in pull
    task.run()
  File "/usr/local/Cellar/whale/v1.1.4/libexec/env/lib/python3.8/site-packages/whale/task/__init__.py", line 23, in run
    record = self.extractor.extract()
  File "/usr/local/Cellar/whale/v1.1.4/libexec/env/lib/python3.8/site-packages/whale/extractor/glue_extractor.py", line 28, in extract
    return next(self._extract_iter)
  File "/usr/local/Cellar/whale/v1.1.4/libexec/env/lib/python3.8/site-packages/whale/extractor/glue_extractor.py", line 52, in _get_extract_iter
    catalog, schema, table = self._parse_location(
  File "/usr/local/Cellar/whale/v1.1.4/libexec/env/lib/python3.8/site-packages/whale/extractor/glue_extractor.py", line 91, in _parse_location
    table = splits[-1]
IndexError: list index out of range

I suppose "Location" doesn't always follow the same format within Glue. As a quick patch, let's revert this by default back to an unparsed, _-based table name.

Priority for manual comments

Hello,

Whale gets information about comments from (in my case) the hive metastore, but access to the hive metastore is limited, so if more people in the company want to collaborate on documentation, it is not possible.
Whale docs live in git, and git allows a very safe method of collaboration: pull requests. When a manual comment exists in the whale docs, whale shouldn't erase that comment when a new pull is made.

Whale should give priority to manual comments.

Add better messaging for when there are no tables in a warehouse.

Currently, when no tables are detected, whale throws an error:

FileNotFoundError: [Errno 2] No such file or directory: '~/.whale/manifests/tmp_manifest.txt'

The function to change is copy_manifest in pipelines/whale/utils/__init__.py, which should actually check that the file exists before attempting to copy it. See the transfer_manifest function, right above it, for how this might look.
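A minimal sketch of the suggested guard (the exact signature of copy_manifest is assumed here):

import logging
import os
import shutil

LOGGER = logging.getLogger(__name__)


def copy_manifest(tmp_manifest_path, manifest_path):
    # When no tables are detected, the temporary manifest is never written,
    # so warn and return rather than raising FileNotFoundError.
    if not os.path.exists(tmp_manifest_path):
        LOGGER.warning("No manifest found at %s; no tables were detected.", tmp_manifest_path)
        return
    shutil.copy(tmp_manifest_path, manifest_path)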

Enable custom ETL jobs without having to re-build the library.

At present, the only way to run an ETL job that we don't natively support is to make heavy-handed modifications to the source code (build_script.py) and rebuild the library. I'd like to enable a slightly different pattern instead, where custom ETL code can be specified within the config file connections.yaml. You could, for example, imagine storing all scripts in a build_scripts directory and adding the ability to specify something like this:

- type: build_script
  name: vertica
  build_script_path: ~/.whale/build_scripts/vertica_build_script.py
  venv_path: ~/env/default
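A sketch of how such entries might be dispatched (the connections.yaml location and field handling here are assumptions, not existing behavior):

import os
import subprocess

import yaml

with open(os.path.expanduser("~/.whale/config/connections.yaml")) as f:
    connections = yaml.safe_load(f) or []

for connection in connections:
    if connection.get("type") != "build_script":
        continue
    # Run the user-supplied build script inside its own virtual environment.
    python_bin = os.path.join(os.path.expanduser(connection["venv_path"]), "bin", "python")
    script = os.path.expanduser(connection["build_script_path"])
    subprocess.run([python_bin, script], check=True)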

Add SSH tunneling capabilities.

Currently, it's a bit difficult to set up whale in some companies where you have to ssh into a devbox to run your queries. It's possible, however, to get around this by using paramiko to set up an SSH client that executes your query on the devbox, pipes the results to a csv, then scps the results back to localhost.

To enable this, we can add an ssh argument to our extractors to enable this pattern, using a gateway machine. I'm not entirely sure how this works for every sort of database, but I know for presto this looks something like client.execute('presto --execute {}'.format(query)).
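A rough sketch of that pattern with paramiko (hostname, credentials, and the presto invocation are placeholders):

import os

import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("devbox.example.com", username="me",
               key_filename=os.path.expanduser("~/.ssh/id_rsa"))

query = "select count(*) from schema.table"
stdin, stdout, stderr = client.exec_command("presto --execute '{}'".format(query))
results = stdout.read().decode()
client.close()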

Latest PyPI release does not contain any code

The latest release (1.2.0) of whale-pipelines on PyPI does not contain any of the project's code. You can verify this by downloading the assets and listing their contents with tar ztf whale-pipelines-1.2.0.tar.gz and zipinfo whale_pipelines-1.2.0-py2.py3-none-any.whl.

I suspect that the cause of this problem is that the files were built by running python3 pipelines/setup.py ... from the root of the repository. setup.py should instead be run from the directory in which setup.py is located (or, even better, use build, which takes care of the change in directory for you).

Also, while I'm here, I noticed that, when your wheel is built correctly, it contains the tests/ directory at the top level. This is inadvisable; see here for why and how to fix it.

Convert Bigquery tags to standard format, compatible (at least) with AWS glue tags.

Given our recent discussion in slack, we should ideally adjust the format in which Bigquery tags are saved to the TableMetadata model, so the format is consistent between Glue and Bigquery.

Let's do this after merging #93.

@manuelzander Would love your input here -- it seems like there are mostly just key value pairs in Bigquery tags, but there's also the "template" and "templateDisplayName". Are these necessary to surface? Perhaps it could just suffice to show the fields dict?

sh: 1: source: not found

I can't manage to successfully connect to a postgres database. I followed all the instructions at https://docs.whale.cx/ without success: I ran make && make rust and make install_rust, so at this point I have the whale binary at ~/.whale/bin/whale and I'm able to init my first data warehouse. However, when I try to execute wh pull, this error appears over and over again. I'm on Ubuntu 18.04. Thank you for this project and for your time; hope you can help me. This seems to be related to issue #113.

Debian installation error

This issue manifests in Debian GNU/Linux 10 (buster).

When running make && make install, setuptools throws an error:

  File "/tmp/pip-install-r2twi631/grpcio/src/python/grpcio/support.py", line 118, in diagnose_build_ext_error
    diagnostic(build_ext, error)
  File "/tmp/pip-install-r2twi631/grpcio/src/python/grpcio/support.py", line 98, in diagnose_attribute_error
    "We expect a missing `_needs_stub` attribute from older versions of "
commands.CommandError: We expect a missing `_needs_stub` attribute from older versions of setuptools. Consider upgrading setuptools.

I can fix this on my side by amending the Makefile to upgrade pip. @rsyi Would this be a valuable pull request?

Generation of Html Documentation

Hello :)

There is a great piece of software I use at my company called Great Expectations; it's a tool to check data quality. It has a feature called data docs: HTML documentation about data quality checks. I host all the HTML in an S3 bucket, and the whole company can access it.

https://greatexpectations.io/

Whale could have a feature like this: simple HTML pages with all the table documentation and some simple search fields.

thank you

Metrics improvement: continue on a single metric failure, and throw better error messages

At the moment, when the metrics runner fails, whale does not supply useful information about which metric failed, making it quite difficult to debug.

Detailed specifications

When self.execute(metric_details["sql"]) is run (in extractor.metric_runner.py, here: https://github.com/dataframehq/whale/blob/master/pipelines/whale/extractor/metric_runner.py#L90), use a try/except block so that a failing metric is skipped rather than aborting the run. In the except branch, write a useful error message to the logger, indicating the metric name and the table address where the issue is located (printing table_stub_path would be sufficient).
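A minimal sketch of that guard as a standalone helper (names follow the issue text; how it slots into MetricRunner is left open):

import logging

LOGGER = logging.getLogger(__name__)


def run_metric_safely(execute, metric_name, metric_details, table_stub_path):
    """Run a single metric query, logging a useful message and returning None on failure."""
    try:
        return execute(metric_details["sql"])
    except Exception:
        LOGGER.exception(
            "Metric '%s' failed for table stub %s; skipping.", metric_name, table_stub_path
        )
        return None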
