macbre / sql-metadata Goto Github PK

View Code? Open in Web Editor NEW

773.0 16.0 119.0 931 KB

Uses tokenized query returned by python-sqlparse and generates query metadata

Home Page: https://pypi.python.org/pypi/sql-metadata

License: MIT License

Makefile 0.15% Python 99.85%

sql parser database sql-parser metadata sqlparse python-package python3-library hiveql hive

sql-metadata's Introduction

sql-metadata

Uses tokenized query returned by python-sqlparse and generates query metadata.

Extracts column names and tables used by the query. Automatically conduct column alias resolution, sub queries aliases resolution as well as tables aliases resolving.

Provides also a helper for normalization of SQL queries.

Supported queries syntax:

MySQL
PostgreSQL
Sqlite
MSSQL
Apache Hive

(note that listed backends can differ quite substantially but should work in regard of query types supported by sql-metadata)

You can test the capabilities of sql-metadata with an interactive demo: https://sql-app.infocruncher.com/

Usage

pip install sql-metadata

Extracting raw sql-metadata tokens

from sql_metadata import Parser

# extract raw sql-metadata tokens
Parser("SELECT * FROM foo").tokens
# ['SELECT', '*', 'FROM', 'foo']

Extracting columns from query

from sql_metadata import Parser

# get columns from query - for more examples see `tests/test_getting_columns.py`
Parser("SELECT test, id FROM foo, bar").columns
# ['test', 'id']

Parser("INSERT /* VoteHelper::addVote xxx */  INTO `page_vote` (article_id,user_id,`time`) VALUES ('442001','27574631','20180228130846')").columns
# ['article_id', 'user_id', 'time']

parser = Parser("SELECT a.* FROM product_a.users AS a JOIN product_b.users AS b ON a.ip_address = b.ip_address")

# note that aliases are auto-resolved
parser.columns
# ['product_a.*', 'product_a.users.ip_address', 'product_b.users.ip_address']

# note that you can also extract columns with their place in the query
# which will return dict with lists divided into select, where, order_by, group_by, join, insert and update
parser.columns_dict
# {'select': ['product_a.users.*'], 'join': ['product_a.users.ip_address', 'product_b.users.ip_address']}

Extracting columns aliases from query

from sql_metadata import Parser
parser = Parser("SELECT a, (b + c - u) as alias1, custome_func(d) alias2 from aa, bb order by alias1")

# note that columns list do not contain aliases of the columns
parser.columns
# ["a", "b", "c", "u", "d"]

# but you can still extract aliases names
parser.columns_aliases_names
# ["alias1", "alias2"]

# aliases are resolved to the columns which they refer to
parser.columns_aliases
# {"alias1": ["b", "c", "u"], "alias2": "d"}

# you can also extract aliases used by section of the query in which they are used
parser.columns_aliases_dict
# {"order_by": ["alias1"], "select": ["alias1", "alias2"]}

# the same applies to aliases used in queries section when you extract columns_dict
# here only the alias is used in order by but it's resolved to actual columns
assert parser.columns_dict == {'order_by': ['b', 'c', 'u'],
                               'select': ['a', 'b', 'c', 'u', 'd']}

Extracting tables from query

from sql_metadata import Parser

# get tables from query - for more examples see `tests/test_getting_tables.py`
Parser("SELECT a.* FROM product_a.users AS a JOIN product_b.users AS b ON a.ip_address = b.ip_address").tables
# ['product_a.users', 'product_b.users']

Parser("SELECT test, id FROM foo, bar").tables
# ['foo', 'bar']

# you can also extract aliases of the tables as a dictionary
parser = Parser("SELECT f.test FROM foo AS f")

# get table aliases
parser.tables_aliases
# {'f': 'foo'}

# note that aliases are auto-resolved for columns
parser.columns
# ["foo.test"]

Extracting values from insert query

from sql_metadata import Parser

parser = Parser(
    "INSERT /* VoteHelper::addVote xxx */  INTO `page_vote` (article_id,user_id,`time`) " 
    "VALUES ('442001','27574631','20180228130846')"
)
# extract values from query
parser.values
# ["442001", "27574631", "20180228130846"]

# extract a dictionary with column-value pairs
parser.values_dict
#{"article_id": "442001", "user_id": "27574631", "time": "20180228130846"}

# if column names are not set auto-add placeholders
parser = Parser(
    "INSERT IGNORE INTO `table` VALUES (9, 2.15, '123', '2017-01-01');"
)
parser.values
# [9, 2.15, "123", "2017-01-01"]

parser.values_dict
#{"column_1": 9, "column_2": 2.15, "column_3": "123", "column_4": "2017-01-01"}

Extracting limit and offset

from sql_metadata import Parser

Parser('SELECT foo_limit FROM bar_offset LIMIT 50 OFFSET 1000').limit_and_offset
# (50, 1000)

Parser('SELECT foo_limit FROM bar_offset limit 2000,50').limit_and_offset
# (50, 2000)

Extracting with names

from sql_metadata import Parser

parser = Parser(
    """
WITH
    database1.tableFromWith AS (SELECT aa.* FROM table3 as aa 
                                left join table4 on aa.col1=table4.col2),
    test as (SELECT * from table3)
SELECT
  "xxxxx"
FROM
  database1.tableFromWith alias
LEFT JOIN database2.table2 ON ("tt"."ttt"."fff" = "xx"."xxx")
"""
)

# get names/ aliases of with statements
parser.with_names
# ["database1.tableFromWith", "test"]

# get definition of with queries
parser.with_queries
# {"database1.tableFromWith": "SELECT aa.* FROM table3 as aa left join table4 on aa.col1=table4.col2"
#  "test": "SELECT * from table3"}

# note that names of with statements do not appear in tables
parser.tables
# ["table3", "table4", "database2.table2"]

Extracting sub-queries

from sql_metadata import Parser

parser = Parser(
"""
SELECT COUNT(1) FROM
(SELECT std.task_id FROM some_task_detail std WHERE std.STATUS = 1) a
JOIN (SELECT st.task_id FROM some_task st WHERE task_type_id = 80) b
ON a.task_id = b.task_id;
"""
)

# get sub-queries dictionary
parser.subqueries
# {"a": "SELECT std.task_id FROM some_task_detail std WHERE std.STATUS = 1",
#  "b": "SELECT st.task_id FROM some_task st WHERE task_type_id = 80"}


# get names/ aliases of sub-queries / derived tables
parser.subqueries_names
# ["a", "b"]

# note that columns coming from sub-queries are resolved to real columns
parser.columns
#["some_task_detail.task_id", "some_task_detail.STATUS", "some_task.task_id", 
# "task_type_id"]

# same applies for columns_dict, note the join columns are resolved
parser.columns_dict
#{'join': ['some_task_detail.task_id', 'some_task.task_id'],
# 'select': ['some_task_detail.task_id', 'some_task.task_id'],
# 'where': ['some_task_detail.STATUS', 'task_type_id']}

See tests file for more examples of a bit more complex queries.

Queries normalization and comments extraction

from sql_metadata import Parser
parser = Parser('SELECT /* Test */ foo FROM bar WHERE id in (1, 2, 56)')

# generalize query
parser.generalize
# 'SELECT foo FROM bar WHERE id in (XYZ)'

# remove comments
parser.without_comments
# 'SELECT foo FROM bar WHERE id in (1, 2, 56)'

# extract comments
parser.comments
# ['/* Test */']

See test/test_normalization.py file for more examples of a bit more complex queries.

Migrating from `sql_metadata` 1.x

sql_metadata.compat module has been implemented to make the introduction of sql-metadata v2.0 smoother.

You can use it by simply changing the imports in your code from:

from sql_metadata import get_query_columns, get_query_tables

into:

from sql_metadata.compat import get_query_columns, get_query_tables

The following functions from the old API are available in the sql_metadata.compat module:

generalize_sql
get_query_columns (since #131 columns aliases ARE NOT returned by this function)
get_query_limit_and_offset
get_query_tables
get_query_tokens
preprocess_query

Authors and contributors

Created and maintained by @macbre with a great contributions from @collerek and the others.

aborecki (https://github.com/aborecki)
collerek (https://github.com/collerek)
dylanhogg (https://github.com/dylanhogg)
macbre (https://github.com/macbre)

Stargazers over time

sql-metadata's People

Contributors

Stargazers

Watchers

Forkers

vpasler windyzhousir zsysuper liugaohui lazzyfu fbeneventi yanglipeng likun1234 yanickjair dylanhogg johnyan2017 koclevent benzzdan daniel1205 taojian2009 romainr aborecki bandrisca harshach mozartdata collerek wchan757 annihilater anurag-shekhar nishuihanqiu sheri528 davehowell christopherpickering hanneshil cjlynch278 wyfunique mars-lan maoxingda huanghao1975 akshay288 metlo-labs governa xufengsun 9527error gordonwang niyue clot09 jedi18 peterpig arnab-turing uc89 sumartoyo lexarx dongsc1994 rednixon playfloor vakhmenin gauthamarcod bala-anjane sharon-accenture mileschuang jpatel1985 wwwlkk ssattari hex-inc georgeezgm 759803573 lisennku bhoy-troy tobymao minhduc304 mawuva ego goutham470 huiyuanlu superinsight miunice muyu1202 dynamiterdx passren pppavlov ww166 vanducng huzl1 bobmerevel ashish10alex liupengjugx landaio bert2me 383747787 alex1919 yyccww123 arpitjain799 chaceboss dataneek jfontestad mustaphac1 lavkeshg vimko hcymysql lightwind1 865547889 rwforest jian1273 shreyarora150

sql-metadata's Issues

Support for valid syntax for OUTER JOIN

After using this library and trying out different SQL queries i have noticed that there is not support for valid sql syntax for FULL|LEFT|RIGHT (OUTER) JOIN. The examples that i have found that didnt work are:

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate FROM Orders FULL JOIN Customers ON Orders.CustomerID=Customers.CustomerID;

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate FROM Orders FULL OUTER JOIN Customers ON Orders.CustomerID=Customers.CustomerID;

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate FROM Orders RIGHT OUTER JOIN Customers ON Orders.CustomerID=Customers.CustomerID;

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate FROM Orders LEFT OUTER JOIN Customers ON Orders.CustomerID=Customers.CustomerID;

https://www.w3schools.com/sql/sql_join.asp

After using the get_query_tables method in each of these cases the result was:
['Orders']

CTE makes get_query_tables() case-sensitive for SELECT

For the following query:

WITH x AS (
    SELECT * FROM n
)
SELECT
    *
FROM x
JOIN y ON x.a = y.a

get_query_tables returns the expected

['n', 'x', 'y']

However, if I lowercase the SELECT in the main query:

WITH x AS (
    SELECT * FROM n
)
select    -- this is the only change
    *
FROM x
JOIN y ON x.a = y.a

then the select gets included in the results:

['n', 'select', 'x', 'y']

Update:

Strangely, this doesn't affect UPDATE. This works as expected:

>>> sql_metadata.get_query_tables( 
... """ 
... WITH x AS ( 
...     SELECT * FROM n 
... ) 
... update z 
... from x 
... set z.q = x.y 
... """ 
... )                                                                                                                                                                                                                  
['n', 'z', 'x']

Environment:

Python: 3.7.2
sqlparse: 0.2.4
sql-metadata: 1.5.0

Handle queries that touch several tables

Handle JOINS and multiple tables in FROM statement.

SQL SELECT /* Elecena\Models\Product::newFromIds */ prod_id AS id,code,products.name AS name,products.description AS description,producers.producer_id AS producer_id,producers.name AS producer,extra,parameters,url,price AS price_original,price *  currencies.value / (currencies.multiply * 100) AS price,last_price,price_history,products.multiply AS multiply,prices,min_amount,currency,currencies.name AS currency_name,shops.shop_id AS shop_id,shops.internal_code AS shop_code,shops.name AS shop_name,img,img_width,img_height,img_src,added,products.updated AS updated,last_seen FROM products LEFT JOIN shops ON shop_id = shop LEFT JOIN producers ON producer_id = producer LEFT JOIN currencies ON currencies.symbol = currency WHERE prod_id IN ("43672","1234827","752637","2568089","2966413","367950","5014492","3814059","1671966","5123458") ORDER BY prod_id LIMIT 10

Union's don't return the correct result

Hey,

Thanks for the great library, I noticed that when a query has a "UNION ALL" statement an incorrect result can be returned

sql_metadata.get_query_tables("SELECT a,b,c FROM d UNION ALL SELECT e,f FROM g")
['d', 'f', 'g']

missing tables returned by sql_metadata.get_query_tables (2)

This might be a duplicate of #112

my DB2 SQL query:

select ca.IDENTIFICATION_CODE identificationCode, 
eo.KBO_NUMBER kboNumber, 
eo.PARTY_NAME,
ca.total_guaranteed totale_borgtocht, 
coalesce(sum(ae1.remainder),0) Saldo, 
coalesce(sum(ae3.remainder),0) uitstel_van_betaling, 
coalesce(sum(ae4.remainder),0) reservering_aangifte, 
coalesce(sum(ae5.remainder),0) reservering_vergunning,
coalesce(sum(ae6.remainder),0) zekerheid_douanevervoer, 
coalesce(sum(ae7.remainder),0) zekerheid_accijnsbeweging,
coalesce(sum(ae8.remainder),0) FRCT 
from CUSTOMER_ACCOUNT ca 
inner join economic_operator eo on eo.id = ca.economic_operator_id 
join contact_details cd on cd.id = ca.contact_details_id 
left join ( ca1_remainder_total_guaranteed crtg 
inner join accounting_entity ae1 on ae1.id = crtg.accounting_entity_id)
on crtg.id = ca.ca1_id 
left join (ca3_credit_account cca inner join accounting_entity ae3 on ae3.id = 
cca.accounting_entity_id) on cca.id = ca.ca3_id 
left join (ca4_reservations_declaration crd inner join accounting_entity ae4 on 
ae4.id = crd.accounting_entity_id) on crd.id = ca.ca4_id 
left join (ca5_reservations_permits crp inner join accounting_entity ae5 on ae5.id 
= crp.accounting_entity_id) on crp.id = ca.ca5_id 
left join (CA6_GUARANTEE_CUSTOMS_TRANSPORT gct inner join accounting_entity ae6 on 
ae6.id = gct.accounting_entity_id) on gct.id = ca.ca6_id 
left join (CA7_GUARANTEE_EXCISE_PRODUCTS gep inner join accounting_entity ae7 on 
ae7.id = gep.accounting_entity_id) on gep.id = ca.ca7_id 
left join (ca8_frct cf inner join ca8_frct_per_discharge cfpd on cfpd.CA8_ID = 
cf.id inner join accounting_entity ae8 on ae8.id = cfpd.accounting_entity_id) on 
cf.id = ca.ca8_id 
group by eo.PARTY_NAME,eo.KBO_NUMBER, ca.IDENTIFICATION_CODE, ca.total_guaranteed 
order by eo.KBO_NUMBER, ca.IDENTIFICATION_CODE 
with ur

As preprocessing step I am replacing the new lines with spaces. So this is the output after the preprocessing step:

select ca.IDENTIFICATION_CODE identificationCode,  eo.KBO_NUMBER kboNumber,  eo.PARTY_NAME, ca.total_guaranteed totale_borgtocht,  coalesce(sum(ae1.remainder),0) Saldo,  coalesce(sum(ae3.remainder),0) uitstel_van_betaling,  coalesce(sum(ae4.remainder),0) reservering_aangifte,  coalesce(sum(ae5.remainder),0) reservering_vergunning, coalesce(sum(ae6.remainder),0) zekerheid_douanevervoer,  coalesce(sum(ae7.remainder),0) zekerheid_accijnsbeweging, coalesce(sum(ae8.remainder),0) FRCT  from CUSTOMER_ACCOUNT ca  inner join economic_operator eo on eo.id = ca.economic_operator_id  join contact_details cd on cd.id = ca.contact_details_id  left join ( ca1_remainder_total_guaranteed crtg  inner join accounting_entity ae1 on ae1.id = crtg.accounting_entity_id) on crtg.id = ca.ca1_id  left join (ca3_credit_account cca inner join accounting_entity ae3 on ae3.id =  cca.accounting_entity_id) on cca.id = ca.ca3_id  left join (ca4_reservations_declaration crd inner join accounting_entity ae4 on  ae4.id = crd.accounting_entity_id) on crd.id = ca.ca4_id  left join (ca5_reservations_permits crp inner join accounting_entity ae5 on ae5.id  = crp.accounting_entity_id) on crp.id = ca.ca5_id  left join (CA6_GUARANTEE_CUSTOMS_TRANSPORT gct inner join accounting_entity ae6 on  ae6.id = gct.accounting_entity_id) on gct.id = ca.ca6_id  left join (CA7_GUARANTEE_EXCISE_PRODUCTS gep inner join accounting_entity ae7 on  ae7.id = gep.accounting_entity_id) on gep.id = ca.ca7_id  left join (ca8_frct cf inner join ca8_frct_per_discharge cfpd on cfpd.CA8_ID =  cf.id inner join accounting_entity ae8 on ae8.id = cfpd.accounting_entity_id) on  cf.id = ca.ca8_id  group by eo.PARTY_NAME,eo.KBO_NUMBER, ca.IDENTIFICATION_CODE, ca.total_guaranteed  order by eo.KBO_NUMBER, ca.IDENTIFICATION_CODE  with ur

If I then apply method sql_metadata.get_query_tables to it then I am getting:

['CUSTOMER_ACCOUNT', 'economic_operator', 'contact_details', 'accounting_entity', 'ca8_frct_per_discharge']

So it is missing several tables :

ca1_remainder_total_guaranteed
ca3_credit_account
ca4_reservations_declaration
ca5_reservations_permits
CA7_GUARANTEE_EXCISE_PRODUCTS
ca8_frct

FYI these are the acutal python commands I have used:

with open('Requêtes Métier SQL/01. FRCT.sql','r') as file :
  data = file.read().replace('\n', ' ')
  sql = sql_metadata.get_query_tables(data)
sql

Improve SQL queries parsing

SELECT date_format(time_id,'%Y-%m-%d') AS date, pageviews AS cnt         FROM rollup_wiki_pageviews      WHERE period_id = '2'   AND wiki_id = '1676379'         AND time_id BETWEEN '2018-01-08'        AN
D '2018-01-01'

Tables with leading digits are not properly recognized

https://github.com/macbre/index-digest/runs/2567720826

ERROR    indexdigest.linters.linter_0006_not_used_columns_and_tables:linter_0006_not_used_columns_and_tables.py:91 Unable to extract tables and columns used from the query: SELECT test
ERROR    indexdigest.linters.linter_0006_not_used_columns_and_tables:linter_0006_not_used_columns_and_tables.py:91 Unable to extract tables and columns used from the query: SELECT t.val as value, count(*) FROM 0020_big_table as t WHERE id BETWEEN 10 AND 20 GROUP BY val

Parser("SELECT t.val as value, count(*) FROM 0020_big_table as t WHERE id BETWEEN 10 AND 20 GROUP BY val").tables  # []

However:

Parser("SELECT t.val as value, count(*) FROM `0020_big_table` as t WHERE id BETWEEN 10 AND 20 GROUP BY val").tables  # ['0020_big_table']

Identifiers may begin with a digit but unless quoted may not consist solely of digits.

https://dev.mysql.com/doc/refman/8.0/en/identifiers.html

sql_metadata.get_query_columns result error

demo

sql_metadata.get_query_columns("select max(col3)+avg(col)+1+sum(col2) from dual")
// result :  ['col3', 'avg', 'col', 'sum', 'col2'].   avg sum  shouldn't be a column

sql_metadata.get_query_columns("select avg(col)+sum(col2) from dual")
// result: ['avg', 'col', 'sum', 'col2']  avg sum  shouldn't be a column

sql_metadata.get_query_columns("select count(col)+max(col2)+ min(col3)+ count(distinct  col4) + custom_func(col5) from dual")
// result ['col', 'col2', 'col3', 'col4', 'custom_func', 'col5']   custom_func shouldn't be a column

it looks like function matching by keyword, I recommend matching by function pattern

query columns get wrong result

sql metadata is a pretty sql parser for me, but I found it get wrong result at this situation:

import sql_metadata

columns = sql_metadata.get_query_columns("SELECT count(c) as test, id FROM foo where cast(d as bigint) > e")

print(columns)

## ------------
['c', 'id', 'cast', 'd', 'e']

It is quite obvious ’cast' should not be appeared in this case

Wrong parsing results from get_query_tables for select with UNION

sql_metadata.get_query_tables("""
select col1, col2, col3 from tab union all select col1, col2, col3  from tab2
""")

returns
['tab', 'col2', 'col3', 'tab2']
expected:
['tab', 'tab2']

Regression 1.12 -> 2.x with CREATE TABLE AS (...) requests

I use sql-metadata (great module btw) to extract tables used by CREATE TABLE requests. I recently upgraded sql-metadata to 2.X, and broke my entire code :'(

Behavior with sql-metadata 1.12 (Good, at least for my use case)

>>> from sql_metadata import get_query_tables
>>> sql = CREATE TABLE records AS (SELECT t.id, t.name, e.name as energy FROM t JOIN e ON t.e_id = e.id)
>>> get_query_tables(sql)
['records', 't', 'e']

Behavior with sql-metadata 2.X (Less good...)

>>> from sql_metadata.compat import get_query_tables
>>> from sql_metadata import Parser
>>> sql = CREATE TABLE records AS (SELECT t.id, t.name, e.name as energy FROM t JOIN e ON t.e_id = e.id)
>>> get_query_tables(sql)
['records']
>>> Parser(sql).tables
['records']

Return information about which column is joined with which column

When parsing an sql query then the attribute parser.columns_dict returns also information about the join.
E.g. see https://github.com/macbre/sql-metadata/blob/master/test/test_getting_columns.py

{
        "select": ["fw1.wiki_id"],
        "join": ["fw1.wiki_id", "fw2.wiki_id", "fw2.time_id", "dw.wiki_id"],
        "where": [
            "fw1.time_id",
            "dw.url",
            "dw.title",
            "fw1.vertical_id",
            "fw1.wiki_id",
        ]
    }

The above query is correctly mentioning the 4 fields used in the join operations:

 ["fw1.wiki_id", "fw2.wiki_id", "fw2.time_id", "dw.wiki_id"]

but it would also be interesting to see also which field is joined with which field.
So for the above example it would be interesting to have an output array like:

 [ 
  {
     "operator" : "=",
     "left_hand_side" : ["fw1.wiki_id"], 
     "right_hand_side": ["fw2.wiki_id"]
    }, 
    {
            "operator" : "=",
            "left_hand_side" : ["fw1.wiki_id"], 
            "right_hand_side": ["dw.wiki_id"]
   }
]

New API [draft]

Implement a new API that will return an object with SQL query parsing results. This will ultimately replace separate functions for getting the list of tables, columns, aliases, etc.

API

import sql_metadata
query = sql_metadata.Parser(sql="SELECT foo AS value FROM bar")

Properties

query.query (the same as str(query))
query.query_normalized
query.columns
query.tables
query.where_columns
query.type (SELECT, UPDATE, ...)
query.tokens (exposes internal sql-metatdata tokenizer iterator)
query.comments (returns SQL inline comments, like SELECT /* process_data */ foo FROM bar)

Tokenizer

Internal sql-metadata tokenizer should now return dataclass wrapping the state of the current token (including states from previous ones, like the last keywords etc.).

Support Postgres quoted identifiers

Within Postgres, it's possible (and sometimes necessary) to put double quotes around table & column names. This causes undesired behavior within sql_metadata:

get_query_columns won't find any columns, even if they exist.
get_query_tables identifies the table names (though it doesn't recognize the quotes as being part of the name and will only return the unquoted part).
generalize_sql will generify both table and column names, which defeats the point of the call.

I'm able to work around the table issue by using the result of get_query_table to remove the quotes around the table names from the query, so that generalize_sql won't eat my table names - but this trick doesn't work for the columns, as get_query_columns doesn't recognize them.

Reasons why this should be implemented:

Legacy projects that can't just rewrite their database scheme
Django projects. <-- I'm part of this group. Even though Django generates valid Postgres table & column names (e. g. lowercase names without spaces), it will still quote them when constructing the queries because it's possible to manually specify names - and by using quotes, the chosen names won't cause errors. And the quotes are valid syntax for Postgres (see https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS for reference, the part about quoted identifiers), so I believe that this is a very valid user group, especially with how widely spread Django is - and since sql_metadata claims to support Postgres SQL, even more so!

Improve get_query_tokens iterator

Currently it returns a list of sqlparse.sql.Token objects. In all case sql-metadata code tries to keep the state of previous keywords as it iterates over tokens.

Instead of Token object return QueryTokenState dataclass with:

@dataclasses.dataclass
class SQLToken:
    value: str
    is_keyword: bool
    is_name: bool
    is_punctuation: bool
    is_wildcard: bool

    # and the state
    last_keyword: Optional[str]  # uppercased
    previous_token: Optional[Token]

get_query_tokens will be responsible for keeping the state in returned tokens.

Token will also have sub-classes that will indicate their "function" within the query:

table names
keywords
column names
functions

http://datacharmer.blogspot.com/2008/03/mysql-proxy-recipes-tokenizing-query.html

MySQL Proxy ships equipped with a tokenizer, a method that, given a query, returns its components as an array of tokens. Each token contains three elements:
name, which is a human readable name of the token (e.g. TK_SQL_SELECT)
id, which is the identifier of the token (e.g. 204)
text, which is the content of the token (e.g. "select").
For example, the query SELECT 1 FROM dual will be returned as the following tokens:

1:
   text          select
   token_name    TK_SQL_SELECT'
   token_id      204
2: 
   text          1
   token_name    TK_INTEGER
   token_id      11
3:
   text          from
   token_name    TK_SQL_FROM
   token_id      105
4: 
   text          dual
   token_name    TK_SQL_DUAL
   token_id      87

Support for lateral view explode clause

getting error when trying to get the tables for sql below,

sql = """
WITH uisd_filter_table as (
    select
        session_id,
        srch_id,
        srch_ci,
        srch_co,
        srch_los,
        srch_sort_type,
        impr_list
    from
        uisd
    where
        datem <= date_sub(date_add(current_date(), 92), 7 * 52)
        and lower(srch_sort_type) in ('expertpicks', 'recommended')
        and srch_ci <= date_sub(date_add(current_date(), 92), 7 * 52)
        and srch_co >= date_sub(date_add(current_date(), 1), 7 * 52)
)
select
    DISTINCT session_id,
    srch_id,
    srch_ci,
    srch_co,
    srch_los,
    srch_sort_type,
    l.impr_property_id as expe_property_id,
    l.impr_position_across_pages
from
    uisd_filter_table lateral view explode(impr_list) table as l
"""

How to reproduce the error:

p = Parser(sql)
print(p.tables)

Error:

Traceback (most recent call last):
  File "/Users/asalam/vscode/new/source_analyser/utils.py", line 152, in <module>
    print(p.tables)
  File "/Users/asalam/Library/Python/3.8/lib/python/site-packages/sql_metadata/parser.py", line 398, in tables
    with_names = self.with_names
  File "/Users/asalam/Library/Python/3.8/lib/python/site-packages/sql_metadata/parser.py", line 544, in with_names
    if token.next_token.normalized in WITH_ENDING_KEYWORDS:
AttributeError: 'NoneType' object has no attribute 'normalized'

complex query can't get all columns

query:
select targetingtype, sellerid, sguid, 'd01' as datetype, adgroupname, targeting, customersearchterm, 'product_search_term' as type, sum(impressions) as impr, sum(clicks) as clicks, sum(seventotalunits) as sold, sum(sevenadvertisedskuunits) as advertisedskuunits, sum(sevenotherskuunits) as otherskuunits, sum(sevendaytotalsales) as totalsales, round(sum(spend), 4) as spend, if(sum(impressions) > 0, round(sum(clicks)/sum(impressions), 4), 0) as ctr, if(sum(clicks) > 0, round(sum(seventotalunits)/sum(clicks), 4), 0) as cr, if(sum(clicks) > 0, round(sum(spend)/sum(clicks), 2), 0) as cpc from amazon_pl.search_term_report_impala where reportday >= to_date('2021-05-16 00:00:00.0') and reportday <= to_date('2021-05-16 00:00:00.0') and targetingtype in ('auto','manual') and sguid is not null and sguid != '' group by targetingtype,sellerid,sguid,adgroupname,targeting,customersearchterm order by impr desc

I get the defect columns list:

['targetingtype', 'sellerid', 'sguid', 'adgroupname', 'targeting', 'customersearchterm', 'reportday', 'impr']

impressions,clicks etc. columns is missing

Queries with in-line comments (--) are commenting out the entire rest of the query

it seems the parser cannot parse anything after -- in front of any item in the query

Parser('select --k
n from m')

returns empty table list.

It seems it comments everything after --

Columns with names that start with a keyword are not recognized as a column

I ran into an issue with a query with some keywords in it like below, when trying to get the columns from the parser nothing is returned:

>query1` = "SELECT schema_name, full_table_name, column_name, catalog_name, table_name, column_length, annotation FROM corporate.all_tables"
>Parser(query1).columns
[]

If I add a character in front of it everything works as expected:

>query2 = "SELECT k_schema_name, full_table_name, k_column_name, k_catalog_name, k_table_name, K_column_length, annotation FROM corporate.all_tables"
>Parser(query2).columns
['k_schema_name', 'full_table_name', 'k_column_name', 'k_catalog_name', 'k_table_name', 'K_column_length', 'annotation']

It seems as though any columns after a keyword is used in the lead of a column name all the rest of the columns are ignored regardless of their naming pattern:

>query3 = "SELECT k_schema_name, full_table_name, column_name,catalog_name, table_name, column_length, annotation FROM corporate.all_tables"
>Parser(query3).columns
['k_schema_name', 'full_table_name']

Move normalize_sql to sql-metadata

https://github.com/macbre/query-digest/blob/master/digest/helpers.py

parser seems to ignore UNION keyword

I have following query which I saved in a file test.sql

SELECT
ACCOUNTING_ENTITY.VERSION as "accountingEntityVersion",
ACCOUNTING_ENTITY.ACTIVE as "active",
ACCOUNTING_ENTITY.CATEGORY as "category",
ACCOUNTING_ENTITY.CREATION_DATE as "creationDate",
ACCOUNTING_ENTITY.DESCRIPTION as "description",
ACCOUNTING_ENTITY.ID as "accountingEntityId",
ACCOUNTING_ENTITY.MINIMAL_REMAINDER as "minimalRemainder",
ACCOUNTING_ENTITY.REMAINDER as "remainder",
ACCOUNTING_ENTITY.SYSTEM_TYPE_ID as "aeSystemTypeId",
ACCOUNTING_ENTITY.DATE_CREATION as "dateCreation",
ACCOUNTING_ENTITY.DATE_LAST_MODIFICATION as "dateLastModification",
ACCOUNTING_ENTITY.USER_CREATION as "userCreation",
ACCOUNTING_ENTITY.USER_LAST_MODIFICATION as "userLastModification" />
FROM ACCOUNTING_ENTITY
WHERE ACCOUNTING_ENTITY.ID IN (
SELECT DPD.ACCOUNTING_ENTITY_ID AS "ACCOUNTINGENTITYID" FROM DEBT D
INNER JOIN DUTY_PER_DEBT DPD ON DPD.DEBT_ID = D.ID
INNER JOIN DECLARATION_V2 DV2 ON DV2.ID = D.DECLARATION_ID
WHERE DV2.DECLARATION_REF = #MRNFORMOVEMENT#
UNION
SELECT ACCOUNTING_ENTITY_ID AS "ACCOUNTINGENTITYID" FROM BENELUX BX
INNER JOIN DECLARATION_V2 DV2 ON DV2.ID = BX.DECLARATION_ID
WHERE DV2.DECLARATION_REF = #MRNFORMOVEMENT#
UNION
SELECT CA4D.ACCOUNTING_ENTITY_ID AS "ACCOUNTINGENTITYID" FROM RESERVATION R
INNER JOIN CA4_RESERVATIONS_DECLARATION CA4D ON CA4D.ID = R.CA4_ID
INNER JOIN DECLARATION_V2 DV2 ON DV2.ID = R.DECLARATION_ID
WHERE DV2.DECLARATION_REF = #MRNFORMOVEMENT#

When I run the script below

from sql_metadata import Parser
file=open('test.sql','r')
data = file.read().replace('\n', ' ')
parser=Parser(data)
parser.tables
parser.columns_dict

I am getting:

>>> from sql_metadata import Parser
>>> file=open('test.sql','r')
>>> data = file.read().replace('\n', ' ')
>>> parser=Parser(data)
>>> parser.tables
['ACCOUNTING_ENTITY', 'DEBT', 'DUTY_PER_DEBT', 'DECLARATION_V2']
>>> parser.columns_dict
{'select': ['ACCOUNTING_ENTITY.VERSION', 'ACCOUNTING_ENTITY.ACTIVE', 'ACCOUNTING_ENTITY.CATEGORY', 'ACCOUNTING_ENTITY.CREATION_DATE', 'ACCOUNTING_ENTITY.DESCRIPTION', 'ACCOUNTING_ENTITY.ID', 'ACCOUNTING_ENTITY.MINIMAL_REMAINDER', 'ACCOUNTING_ENTITY.REMAINDER', 'ACCOUNTING_ENTITY.SYSTEM_TYPE_ID', 'ACCOUNTING_ENTITY.DATE_CREATION', 'ACCOUNTING_ENTITY.DATE_LAST_MODIFICATION', 'ACCOUNTING_ENTITY.USER_CREATION', 'ACCOUNTING_ENTITY.USER_LAST_MODIFICATION', 'DUTY_PER_DEBT.ACCOUNTING_ENTITY_ID'], 'where': ['ACCOUNTING_ENTITY.ID', 'DECLARATION_V2.DECLARATION_REF', '#MRNFORMOVEMENT'], 'join': ['DUTY_PER_DEBT.DEBT_ID', 'DEBT.ID', 'DECLARATION_V2.ID', 'DEBT.DECLARATION_ID']}
>>>

The problem is that

parser.tables is not returning the tables BENELUX and RESERVATION and CA4_RESERVATION_DECLARATION which is used in 2nd and 3rd UNION statement.
parser.columns_dict is only returning the joins : 'join': ['DUTY_PER_DEBT.DEBT_ID', 'DEBT.ID', 'DECLARATION_V2.ID', 'DEBT.DECLARATION_ID'] - so it is not seeing the joins in the 2nd and 3rd UNION statement (= DECLARATION_V2.ID with BENELUX.DECLARATION_ID and DECLARATION_V2.ID with RESERVATION.DECLARATION_ID and CA4_RESERVATIONS_DECLARATION.ID with RESERVATION.CA4_ID)

get_query_tables() CTE Case Sensitive for SELECT

For the following query:
import sql_metadata
sql_metadata.get_query_tables("create table abc.foo as SELECT pqr.foo1 , ab.foo2 FROM foo pqr, bar ab")

get_query_tables return the expected output

['abc.foo', 'foo', 'bar']

But when SELECT is in lowercase -
import sql_metadata
sql_metadata.get_query_tables("create table abc.foo as select pqr.foo1 , ab.foo2 FROM foo pqr, bar ab")

then the select columns get added in the
['pqr.foo1', 'ab.foo2', 'foo', 'bar']

Column alias after WITH fails

The following example fails:

from sql_metadata import Parser

sql = """
WITH foobar AS (
  SELECT
      col_a
  FROM
    table_a
),
SELECT
  col_1
  ,col_2 as my_col
FROM
  table_b
"""
p = Parser(sql)

print(p.tables)

with the exception:

raceback (most recent call last):
  File "parse.py", line 25, in <module>
    print(p.tables)
  File "/.../python3.8/site-packages/sql_metadata/parser.py", line 346, in tables
    with_names = self.with_names
  File "/.../python3.8/site-packages/sql_metadata/parser.py", line 495, in with_names
    if token.next_token.normalized in WITH_ENDING_KEYWORDS:
AttributeError: 'NoneType' object has no attribute 'normalized'

The problem seems to be with the combination of WITH and the alias AS my_col.

Using version 2.0.0 of the package.

`get_query_tables` doesn't respect table names with dashes in their name

To reproduce:

import sql_metadata

sql_metadata.get_query_tables('SELECT * FROM "schema-with-dash"."tablename"')
['dash.tablename']

I expect this to return schema-with-dash.tablename or for them to be quoted.

get_query_tables() breaks on `INSERT INTO ... SELECT`

Inserting data into a table via a SELECT query seems to break get_query_tables():

Minimal example:

For this query...

INSERT INTO foo
SELECT
  *
FROM bar

...running get_query_tables() returns this:

['foo', 'SELECT', 'bar']

The expected output is ['bar'] or maybe ['foo', 'bar'].

Environment:

Python: 3.7.0
OS: macOS High Sierra 10.13.6

Packages installed:

sql-metadata==1.1.2
sqlparse==0.2.4

Support for extracting where clause conditions

SELECT *
FROM COMPANY
WHERE ID =92

This query's ID value is not returned by your Parser class. Ive tried several other queries with numeric where clause values, None of them returns any value with your parser.

(Parser(query).values_dict)

CTE breaks INSERT INTO ... SELECT and DELETE FROM

Related to #18; it appears the fix doesn't work if the query includes a CTE. To reproduce, run get_query_tables() on this query:

WITH x AS (
    SELECT * FROM n
)
INSERT INTO z
SELECT
    *
FROM x

The result is

['n', 'INSERT', 'z', 'x']

You get a similar problem with DELETE FROM:

>>> sql_metadata.get_query_tables( 
... """ 
... WITH x AS ( 
...     SELECT * FROM n 
... ) 
... DELETE FROM z JOIN x ON z.a = x.a 
... """ 
... )                                                                                                                                                                                                                  
['n', 'DELETE', 'z', 'x']

Update: UPDATE doesn't seem to be affected.

Environment:

Python: 3.7.2
sqlparse: 0.2.4
sql-metadata: 1.5.0

Not able to import parser however get_query_tables() is working

from sql_metadata import Parser

Identify which part of the query columns are a part of

I'd like to be able to identify which part of the query columns belong to.

For example, in

SELECT foo FROM bar WHERE baz = 1

get_query_columns will return ["foo", "baz"]

I think it would be useful to know that "foo" is part of the projection, while "baz" is part of the WHERE clause.

Since get_query_columns keep track of the keyword the column belongs too (https://github.com/macbre/sql-metadata/blob/master/sql_metadata.py#L132), it should be possible to associate the column to it's query part.

Handle more complex SQL queries

Const values with an alias

SELECT 1 as c    FROM rollup_wiki_article_pageviews      WHERE time_id = '2018-01-07 00:00:00'   AND period_id = '2' LIMIT 1

Table aliases

SELECT r.wiki_id AS id, pageviews_7day AS pageviews      FROM report_wiki_recent_pageviews AS r                  INNER JOIN dimension_wikis AS d                 ON              r.wiki_id =  d.wiki_id          WHERE d.public = '1'    AND r.lang IN ( 'en', 'ru' )    AND r.hub_name = 'gaming'               ORDER BY pageviews DESC LIMIT 300

JOINS on self

SELECT  count(fw1.wiki_id) as wam_results_total  FROM `fact_wam_scores` `fw1` left join `fact_wam_scores` `fw2` ON ((fw1.wiki_id = fw2.wiki_id) AND (fw2.time_id = FROM_UNIXTIME(1466380800))) left join `dimension_wikis` `dw` ON ((fw1.wiki_id = dw.wiki_id))  WHERE (fw1.time_id = FROM_UNIXTIME(1466467200)) AND (dw.url like '%%' OR dw.title like '%%') AND fw1.vertical_id IN ('0','1','2','3','4','5','6','7')  AND (fw1.wiki_id NOT IN ('23312','70256','168929','463633','381622','524772','476782','9764','214934','170145','529622','52149','96420','390','468156','690804','197434','29197','88043','37317','466775','402313','169142','746246','119847','57268','1089624')) AND ((dw.url IS NOT NULL AND dw.title IS NOT NULL))

Incorrect column names when table aliases are used

sql_metadata.get_query_tables("SELECT a.* FROM users1 AS a JOIN users2 AS b ON a.ip_address = b.ip_address")

returns

users1, users2

While,

*sql_metadata.get_query_columns("SELECT a. FROM users1 AS a JOIN users2 AS b ON a.ip_address = b.ip_address")

returns

a.*, a.ip_address, b.ip_address**

Although, according to me, returning the below output makes more sense.

users1.*, users1.ip_address, users2.ip_adress

As, the aliases can be different for the query, while the table names and column names are still the same.

Handle more queries and add tests from query-digest

https://github.com/macbre/query-digest/blob/master/tests/test_query_metadata.py / macbre/query-digest#17

Best way to get an array of values?

If I wanted to get an array of the VALUES from an INSERT statement, is there a straightforward way of doing that?

I'm doing this for now:

def parse_sql_values(sql: str) -> List[str]:
    values = []
    found_values = False

    for token in sql_metadata.get_query_tokens(sql):
        if token.is_keyword and token.value.upper() == "VALUES":
            found_values = True
            continue
        if found_values and token.ttype in [String.Single, Number.Integer]:
            values.append(remove_quotes(token.value))

    return values

Extract the code from /index-digest

https://github.com/macbre/index-digest/blob/master/indexdigest/query.py

get_query_tables() is returning columns on WITH statements

With sql-metadata==1.12.0,

from sql_metadata import get_query_tables

q = "WITH t1 AS (SELECT * FROM t2), t3 (c1, c2) AS (SELECT c3, c4 FROM t4) SELECT * FROM t1, t3, t5;"

print(get_query_tables(q))

prints ['t2', 't3', 'c2', 'c4', 't4', 't1', 't5'] where c2 and c4 are columns.

This issue is not present with sql-metadata==1.11.0 and the code above prints ['t2', 't3', 't4', 't1', 't5'] as expected.

Add support for table schemas in get_query_tables()

The library currently doesn't support PostgreSQL-style table schemas. This is particularly useful when doing queries involving two tables with the same name but in different schemas, as the current behavior makes it impossible to tell the difference between the two.

Minimal example:

# Current behavior:
>>> sm.get_query_tables("SELECT * FROM s.t")
['t']

# Expected:
>>> sm.get_query_tables("SELECT * FROM s.t")
['s.t']

Something a bit less contrived:

SELECT
  a.*
FROM product_a.users AS a
JOIN product_b.users AS b
  ON a.ip_address = b.ip_address

get_query_tables() only returns users here, instead of the ideal product_a.users and product_b.users.

get_query_columns does not return nothing if parsing SELECT DISTINCT

Hi, with the following query, I do not get any results:

print(sql_metadata.get_query_columns("SELECT DISTINCT DATA.ASSAY_ID FROM foo"))
[]
print(sql_metadata.get_query_columns("SELECT DATA.ASSAY_ID FROM foo"))
['DATA.ASSAY_ID']

Is this normal?

Regards

Handle empty query argument of get_query_tokens

query = ''

    def get_query_tokens(query):
        """
        :type query str
        :rtype: list[sqlparse.sql.Token]
        """
        query = preprocess_query(query)
    
>       tokens = TokenList(sqlparse.parse(query)[0].tokens).flatten()
E       IndexError: tuple index out of range

get_query_table_aliases(query) is far from being final

While I was testing on Oracle SQL:

missed LEFT JOINs (it was easy to add to the code)
missed table names starting with schema name (SCHEMA.TABLE)
missed aliases without AS (this is mentioned in the code but not handled)
missed recognizing subselects (e.g. FROM (SELECT ... ) )

I touched the code to overcome these issues and found solution for them. Tested with a huge SQL but may need to do some more testing.

New code:

def get_query_table_aliases(query: str) -> Dict[str, str]:
    """
    Returns tables aliases mapping from a given query

    E.g. SELECT a.* FROM users1 AS a JOIN users2 AS b ON a.ip_address = b.ip_address
    will give you {'a': 'users1', 'b': 'users2'}
    """
    aliases = dict()
    last_keyword_token = None
    last_table_name = None

    for token in get_query_tokens(query):
        # print(token.ttype, token, last_table_name)

        # handle "FROM foo alias" syntax (i.e, "AS" keyword is missing)
        # if last_table_name and token.ttype is Name:
        #     aliases[token.value] = last_table_name
        #     last_table_name = False

        if last_table_name:
            if token.value=='.': 
                last_table_name = last_table_name + token.value #add the dot
            if token.value==',' or token.is_keyword and token.value.upper()!='AS': #there is no alias
                aliases[''] = last_table_name
                last_table_name = False
            if prev_token.value.upper()=='AS': #previous keyword was AS then we found the alias
                aliases[token.value] = last_table_name
                last_table_name = False
            if token.ttype is Name: 
                if prev_token.value=='.': 
                    last_table_name = last_table_name + token.value #add Name to last_table_name
                else: #found alias
                    aliases[token.value] = last_table_name
                    last_table_name = False


        if last_keyword_token:
           if last_keyword_token.value.upper() in ["FROM", "JOIN", "INNER JOIN","LEFT JOIN"] and token.value!='(':
                last_table_name = token.value

        last_keyword_token = token if token.is_keyword else False
        prev_token = token

    return aliases

Wrong parsing results get_query_tables for spaces in GROUP BY keyword

sql_metadata.get_query_tables("""
SELECT *
FROM  tab AS ab GROUP       BY group_by_col, group_by_col2
""")

Result:
['tab', 'group_by_col2']

Expected:

['tab']

Cause:
The issue is because there are multiple spaces in between GROUP and BY, which is not handled in the code.

missing tables returned by sql_metadata.get_query_tables

Hello,

I have 80 long queries, and for 27 of them, results are not correct for "sql_metadata.get_query_tables(ddl)"

Here is an example:

ddl_for_sql_metadata.sql

SELECT
  "attr1"
FROM
  (
   SELECT
     "attr2"
   FROM
     (database1.table1 "aliasTable"
   LEFT JOIN (
      SELECT
        "attr3"
      FROM
        ((
         SELECT
           "attr4"
         FROM
           database2.table2
         GROUP BY 1, 2, 3, 4
      )  "aliasTable2"
      LEFT JOIN (
         SELECT
           "attr5"
         FROM
           database3.table3
         GROUP BY 1, 2
      )  "X" ON ("table"."attr" = "table"."attr"))
   )  "Y" ON ("table"."attr" = "table"."attr"))
)  "Z"
WHERE (myCondition)
GROUP BY 1, 2, 3, 4, 5

Quick way to test:

import sql_metadata

def read_from_file(file_path):
    return open(file_path, "r").read()

ddl = read_from_file("ddl_for_sql_metadata.sql")
tables = sql_metadata.get_query_tables(ddl)
print(tables)

It returns only ['database2.table2', 'database3.table3']
and it should return database1.table1

The tables property is still having difficulties finding the tables in a CTAS

I created the following test:

def test_create_table_as_select_with_joins_with():
    qry = """
        CREATE table xyz as 
        with sub as (select it_id from internal_table)
        SELECT *
        from table_a
        join table_b on (table_a.name = table_b.name)
        left join table_c on (table_a.age = table_c.age)
        left join sub on (table.it_id = sub.it_id)
        order by table_a.name, table_a.age
        """
    parser = Parser(qry)
    assert set(parser.tables) == set(["xyz", "table_a", "table_b", "table_c", "internal_table"])

And the code is not able to pass it as is. It seems that it struggles with with created subqueries.

Add support for ALTER TABLE ... APPEND

Can we add support for Redshift's ALTER TABLE ... APPEND statement? I know it's a proprietary thing but it'd be nice.

Expected behavior:

>>> sql_metadata.get_query_tables("ALTER TABLE x APPEND FROM y")                                                                                                                                                       
['x', 'y']

Currently ALTER is included in the returned set of tables.

Use Poetry as dependency manager and PyPI publish tool

Handle Apache Hive queries

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries

INSERT OVERWRITE TABLE foo_report
SELECT
  d.domain, r.wiki_id, r.beacon, r.pageviews
FROM
  (SELECT wiki_id, beacon, sum(pageviews) AS pageviews FROM rollup_wiki_beacon_pageviews
   WHERE period_id = '1' AND
     (
       year > '{{ beginYear28 }}' OR 
       (year = '{{ beginYear28 }}' AND month > '{{ beginMonth28 }}') OR 
       (year = '{{ beginYear28 }}' AND month = '{{ beginMonth28 }}' AND day > '{{ beginDay28 }}')
     ) AND (
       year < '{{ beginYear }}' OR 
       (year = '{{ beginYear }}' AND month < '{{ beginMonth }}') OR 
       (year = '{{ beginYear }}' AND month = '{{ beginMonth }}' AND day <= '{{ beginDay }}')
     )
   GROUP BY wiki_id, beacon) r
JOIN statsdb.dimension_wikis d ON r.wiki_id = d.wiki_id;

TSV output for `data-flow-graph`

statsdb.dimension_wikis	DAG	foo_report
rollup_wiki_beacon_pageviews	DAG	foo_report

Generalize SQL fails with column names

Hi,

Thanks for the very nice library! I have a table with a natural field and therefore I have to refer to it as "natural". See for instance the query

(
        SELECT way, way_area AS area, COALESCE(landuse, leisure, "natural", highway, amenity) AS type
        FROM planet_osm_polygon
        WHERE landuse IN ('cemetery', 'commercial', 'farmland', 'forest', 'grass', 'industrial', 'meadow', 'military', 'residential', 'religious')
          OR leisure IN ('common', 'garden', 'golf_course', 'park', 'pitch', 'sports_centre', 'stadium', 'track')
          OR "natural" IN ('grassland', 'heath', 'wood')
          OR highway IN ('pedestrian')
          OR amenity IN ('college', 'hospital', 'parking', 'school', 'university')
        ORDER BY way_area DESC
      ) AS data

When calling generalize_sql on it, it converts the "natural" to X, which is not expected I think,

>>> sql_metadata.generalize_sql(sql)
'( SELECT way, way_area AS area, COALESCE(landuse, leisure, X, highway, amenity) AS type FROM planet_osm_polygon WHERE landuse IN (XYZ) OR leisure IN (XYZ) OR X IN (XYZ) OR highway IN (XYZ) ORDER BY way_area DESC ) AS data'

Thanks!
Best,

Parser raises an exception for not supported query type when query starts with a comment

When working on index-digest's PR I have found the following issue.

An example call:

tables = get_query_tables("/* foo */ INSERT IGNORE INTO `0070_insert_ignore` VALUES (9, '123', '2017-01-01');")

Error raised:

/* foo */ INSERT IGNORE INTO `0070_insert_ignore` VALUES (9, '123', '2017-01-01');

        if self._query_type == "UNSUPPORTED":
>           raise ValueError("Not supported query type!")
E           ValueError: Not supported query type!

env/lib/python3.9/site-packages/sql_metadata/parser.py:94: ValueError

Additionally, we should also log the query itself that caused this exception to be raised.

Parser.tables returns wrong set of tables when parsing a CTAS

I think the following code inllustrate this behavior:

from sql_metadata import Parser

qry = """
    create table xyz as 
    select *
    from table_a
    join table_b on (table_a.name = table_b.name)
    left join table_c on (table_a.age = table_c.age)
    order by table_a.name, table_a.age
    """
p = Parser(qry)
p.tables
['xyz', 'table_a', 'table_b']

I would have expected the above to return ['xyz', 'table_a', 'table_b', 'table_c'] as it catches the table_c below when the CTAS statement is removed.

qry = """
    select *
    from table_a
    join table_b on (table_a.name = table_b.name)
    left join table_c on (table_a.age = table_c.age)
    order by table_a.name, table_a.age
    """
p = Parser(qry)
p.tables
['table_a', 'table_b', 'table_c']

Ignoring Multiple FROM tables when they prefixed by DATASET

ok_sql = """
SELECT A.FIELD1, B.FIELD1, (A.FIELD1 * B.FIELD1) AS QTY
FROM TABLE1 AS A, TABLE2 AS B
"""

fail_sql = """
SELECT A.FIELD1, B.FIELD1, (A.FIELD1 * B.FIELD1) AS QTY
FROM DATASET1.TABLE1 AS A, DATASET2.TABLE2 AS B
"""

Results

sql_metadata.get_query_tables(fail_sql)
['DATASET1.TABLE1']

sql_metadata.get_query_tables(ok_sql)
['TABLE1', 'TABLE2']

macbre / sql-metadata Goto Github PK

sql-metadata's Introduction

sql-metadata

Usage

Extracting raw sql-metadata tokens

Extracting columns from query

Extracting columns aliases from query

Extracting tables from query

Extracting values from insert query

Extracting limit and offset

Extracting with names

Extracting sub-queries

Queries normalization and comments extraction

Migrating from sql_metadata 1.x

Authors and contributors

Stargazers over time

sql-metadata's People

Contributors

Stargazers

Watchers

Forkers

sql-metadata's Issues

API

Properties

Tokenizer

Const values with an alias

Table aliases

JOINS on self

TSV output for data-flow-graph

Results

Recommend Projects

Recommend Topics

Recommend Org

Migrating from `sql_metadata` 1.x

TSV output for `data-flow-graph`