henriblancke / dbt-formatter Goto Github PK

View Code? Open in Web Editor NEW

45.0 3.0 9.0 843 KB

Formatting for dbt jinja-flavored sql

License: MIT License

TypeScript 100.00%

dbt sql formatting jinja

dbt-formatter's Introduction

DBT Formatter

Install

npm install -s dbt-formatter

Usage

import formatter from 'dbt-formatter';

const mySql = "SELECT * FROM {{ ref('myTableRef') }}";
const myOpts = { sql: 'default', indent: 2, upper: false };

formatter.format(mySql, myOpts);

This will result in:

SELECT
  *
FROM
  {{ ref('myTableRef') }}

Usage options

Fine tune dbt-formatter behavior with the following options:

Option	Default	Description
sql	`default`	The sql dialect you want to use, currently only `default` is available
indent	`2`	How many spaces you want an indentation to be
upper	`false`	Formats sql reserved words to be uppercase when set to `true`
newline	`false`	Appends a new line at the end of the formatted sql string
lowerWords	`false`	Lowercases all `words` as identified by the tokenizer
allowCamelcase	`true`	Allows column names to be camelcased

Development

NPM scripts

npm test: Run test suite
npm start: Run npm run build in watch mode
npm run build: Generate bundles and typings, create docs
npm run lint: Lints code
npm run package: Package dbt-formatter as a binary

Roadmap

Add more sql dialects:
- snowflake
- redshift
- bigquery
- postgres
- presto

dbt-formatter's People

Contributors

Stargazers

Watchers

Forkers

limess jackdreilly samkessaram betamatt whetstone-ventures hackstrap richiebful pricenow-ch

dbt-formatter's Issues

Support `.editorconfig`

Support .editorconfig configuration for SQL files rather than requiring custom config. This will make it easier to centralise formatting configuration.

prettier supports this out-of-the-box.

Formatter changes `null end` to `nullend` in case statement

The formatter can introduce an error when formatting the end of a case statement:
Before formatting:

when GTTV_28 between 2500 and 4000 then 'Trending'
when GTTV_28 < 2500 then 'Low'
else null end csm_status

After formatting:

WHEN GTTV_28 BETWEEN 2500 
AND 4000 THEN 'Trending'
WHEN GTTV_28 < 2500 THEN 'Low'
ELSE NULLEND csm_status

NULL END should remain as separate words.

This does not happen on all case statements; in this case it may have been related to use of between X and Y in one of the WHEN clauses

Justified "AS" column aliases?

We would love to start using this in our DBT projects. We are currently formatting SQL queries manually, and we are using a justified AS blocks for column aliases - which adds a lot in terms of readability. Has anyone looked into what this would require?

For example:

SELECT
    CAST(1 as int)                     AS col_a
  , CAST('text_value' as varchar(255)) AS col_b

...instead of:

SELECT
    CAST(1 as int) AS col_a
  , CAST('text_value' as varchar(255)) AS col_b

I'm happy to help contribute code if someone is able to point me in the right direction.

asin seen as reserved word

I typed:

SELECT
  updated_at,
  asin,
  data->'title' AS product_title
FROM
  product

And it formatted to:

SELECT
  updated_at,
  ASIN,
  DATA -> 'title' AS product_title
FROM
  product

Why is asin and data capitalized?

Top level {% set %} template should have a new line

Open to PRs?

Hello maintainers of this package 👋 thanks for your hard work on this! I have been using this for a little while now at work. I've noticed a few bugs in the formatter such that it produces invalid SQL. I'm wondering if you're open to PRs to address bugs?

Fails to parse square brackets in variables correctly

Hello! We're keen to use this to start formatting our dbt codebase; it looks like exactly what we want to have a standardised SQL output that takes into account jinja code blocks, but have come across a couple of issues. I'm happy to look at fixing this but want to validate the approach first.

Reproducing the issue

The issue is to do with the way the tokeniser treats [ and ] as string delimiters, which is fine for SQL Server SQL, but breaks when we start using arrays and dictionaries in our jinja code (which we currently do). The errors can be reproduced as follows:

  it('works with arrays in the jinja statement', () => {
    const formatted = formatter(`{{ some_macro(arg_1 = 'foo', arg_2 = ['[string with square brackets]'], arg_3 = 'bar') }}`);

    expect(formatted).toBe(`{{ some_macro(
  arg_1 = 'foo',
  arg_2 = ['[string with square brackets]'],
  arg_3 = 'bar'
) }}`);
  })

  it('does not treat square brackets in jinja statements as strings', () => {
    const formatted = formatter(`{{ variable['key'] }}`);
    expect(formatted).toBe(`{{ variable['key'] }}`);
  })

When I run these tests, I get the following output:

  ● jinja flavored queries › works with arrays in the jinja statement

    expect(received).toBe(expected) // Object.is equality

    - Expected  - 3
    + Received  + 2

      {{ some_macro(
        arg_1 = 'foo',
    -   arg_2 = ['[string with square brackets]'],
    -   arg_3 = 'bar'
    - ) }}
    +   arg_2 = ['[string with square brackets] '], arg_3 = ' bar ') }}
    +

  ● jinja flavored queries › does not treat square brackets in jinja statements as strings

    expect(received).toBe(expected) // Object.is equality

    - Expected  - 1
    + Received  + 1

    - {{ variable['key'] }}
    + {{ variable ['key'] }}
      ↵

Why these tests fail

The problem in the first test is that it tokenises ['[string with square brackets]'] into ['[string with square brackets] and '], arg_3 = '" }. The incorrect match on the trailing bracket ultimately results in tokenising the entirety of the remainder of the statment into a single token, and we end up with our entire file on one line.

The problem in the second test is that due to the treatment of [ ... ] as a string, it separates a dictionary access in the jinja code variable['key'] into two tokens variable and ['key'], which are then separated by whitespace because they're just two different strings.

Proposed solution

I think the root of this problem is due to the fact that the tokeniser doesn't distinguish between SQL and code in the jinja templates. Obviously we still need to treat [...] as a string in the SQL but treating it like a string in python doesn't make sense. Consequently I think what we need to do is to parse and format jinja separately to the SQL:

Firstly, tokenise the query as SQL and break out jinja variables or templates into whole tokens, instead of the current behaviour. So a string like SELECT * FROM {{ ref("table") }} would get tokenised into something like:
- {'type': 'reserved', 'value': 'SELECT'}
- {'type': 'string', 'value': '*'}
- {'type': 'reserved', 'value': 'FROM'}
- {'type': 'jinja', 'value': '{{ ref("table") }}'}
Have a separate parser for jinja variables and templates which get applied when we are going through the jinja token type.

Does the problem and the proposed solution make sense? Like I say, I'm happy to work on a fix for this but given it's a fairly substantial change I thought I'd make sure I'm not missing something first 😄

Case statement bug

When else is NULL it will append END without space to NULL

Example:

CASE
WHEN a > 1 THEN b
WHEN b > 1 THEN a
ELSE NULL
END AS col

Formatter expands => to = > causing snowflake issues

Snowflake complains when arrows are expanded with a space between = and >

JOIN `ON` statement no new line when expression contains a jinja variable

Desired:

LEFT JOIN {{ ref('mytable') }} AS tb
ON {{ var }} = tb.col

Actual:

LEFT JOIN {{ ref('mytable') }} AS tb
ON {{ var }}
= tb.col

Joins with dynamic table name insert extra newline before alias

Join with alias formats as:

INNER JOIN user_mins um
ON um.user_id = C.user_id

Join using jinja and an alias adds an extra newline before the alias:

INNER JOIN {{ ref("dim_user") }}
dim_user
ON dim_user.id = C.user_id

Add binary

A nice to have is a binary so this package can easily be ran on files using a runner such as pre-commit or https://github.com/okonet/lint-staged.

e.g.

npx dbt-formatter ./path-to-some/file.sql --spaces 2 --indent 2 --sql redshift

JSONB and type casting

I think JSONB and type casting is being mis-formated:

From:

(data ->> 'newPriceIsMAP')::boolean as new_price_is_map,

To:

  (
    data ->> 'newPriceIsMAP'
  ) :: boolean as new_price_is_map,

Seems like the first option is more readable.