Hello! We're keen to use this to start formatting our dbt codebase; it looks like exactly what we want to have a standardised SQL output that takes into account jinja code blocks, but have come across a couple of issues. I'm happy to look at fixing this but want to validate the approach first.
Reproducing the issue
The issue is to do with the way the tokeniser treats [
and ]
as string delimiters, which is fine for SQL Server SQL, but breaks when we start using arrays and dictionaries in our jinja code (which we currently do). The errors can be reproduced as follows:
it('works with arrays in the jinja statement', () => {
const formatted = formatter(`{{ some_macro(arg_1 = 'foo', arg_2 = ['[string with square brackets]'], arg_3 = 'bar') }}`);
expect(formatted).toBe(`{{ some_macro(
arg_1 = 'foo',
arg_2 = ['[string with square brackets]'],
arg_3 = 'bar'
) }}`);
})
it('does not treat square brackets in jinja statements as strings', () => {
const formatted = formatter(`{{ variable['key'] }}`);
expect(formatted).toBe(`{{ variable['key'] }}`);
})
When I run these tests, I get the following output:
โ jinja flavored queries โบ works with arrays in the jinja statement
expect(received).toBe(expected) // Object.is equality
- Expected - 3
+ Received + 2
{{ some_macro(
arg_1 = 'foo',
- arg_2 = ['[string with square brackets]'],
- arg_3 = 'bar'
- ) }}
+ arg_2 = ['[string with square brackets] '], arg_3 = ' bar ') }}
+
โ jinja flavored queries โบ does not treat square brackets in jinja statements as strings
expect(received).toBe(expected) // Object.is equality
- Expected - 1
+ Received + 1
- {{ variable['key'] }}
+ {{ variable ['key'] }}
โต
Why these tests fail
The problem in the first test is that it tokenises ['[string with square brackets]']
into ['[string with square brackets]
and '], arg_3 = '" }
. The incorrect match on the trailing bracket ultimately results in tokenising the entirety of the remainder of the statment into a single token, and we end up with our entire file on one line.
The problem in the second test is that due to the treatment of [ ... ]
as a string, it separates a dictionary access in the jinja code variable['key']
into two tokens variable
and ['key']
, which are then separated by whitespace because they're just two different strings.
Proposed solution
I think the root of this problem is due to the fact that the tokeniser doesn't distinguish between SQL and code in the jinja templates. Obviously we still need to treat [...]
as a string in the SQL but treating it like a string in python doesn't make sense. Consequently I think what we need to do is to parse and format jinja separately to the SQL:
- Firstly, tokenise the query as SQL and break out jinja variables or templates into whole tokens, instead of the current behaviour. So a string like
SELECT * FROM {{ ref("table") }}
would get tokenised into something like:
{'type': 'reserved', 'value': 'SELECT'}
{'type': 'string', 'value': '*'}
{'type': 'reserved', 'value': 'FROM'}
{'type': 'jinja', 'value': '{{ ref("table") }}'}
- Have a separate parser for jinja variables and templates which get applied when we are going through the
jinja
token type.
Does the problem and the proposed solution make sense? Like I say, I'm happy to work on a fix for this but given it's a fairly substantial change I thought I'd make sure I'm not missing something first ๐