
bigquery-schema-generator's People

Contributors

abroglesc, bxparks, de-code, jonwarghed, jtschichold, kdeggelman, korotkevics, riccardomc, ziggerzz


bigquery-schema-generator's Issues

create symlink at /usr/local/bin/generate-schema on MacOS

To install from PyPI, we use the following pip3 command:
$ sudo -H pip3 install bigquery-schema-generator

On Ubuntu (verified on 17.10, 16.04), the 'generate-schema' shell script is installed at: /usr/local/bin/generate-schema

On MacOS (verified on 10.13.2, using Python 3.6.4), the 'generate-schema' script is installed at:
/Library/Frameworks/Python.framework/Versions/3.6/bin/generate-schema
This is not an obvious location for the user.

We need to create a symlink from /usr/local/bin/generate-schema -> (the above location) on MacOS.
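
For reference, a minimal sketch of that symlink step (assuming the macOS install path shown above; this is not part of the package and typically needs sudo or write access to /usr/local/bin):

    # Hypothetical post-install helper, not part of bigquery-schema-generator.
    import os

    src = "/Library/Frameworks/Python.framework/Versions/3.6/bin/generate-schema"
    dst = "/usr/local/bin/generate-schema"

    if os.path.exists(src) and not os.path.exists(dst):
        os.symlink(src, dst)  # requires write access to /usr/local/bin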

Loading GCP stackdriver logs into BQ

Schema nested too deeply for field protoPayload.request.spec.validation.openAPIV3Schema.properties.spec.properties.match.properties.kinds.items.properties.apiGroups.items, maximum allowed depth is 15.

"logName": "projects/xxxxxxxxxxxxx/logs/cloudaudit.googleapis.com%2Fdata_access",
"type": "k8s_cluster"

We need some way to handle this.

support timezone without a colon (:) character

Currently, the optional timezone indicator on a TIMESTAMP field is expected to contain a colon (:) character. For example:
2017-05-22 12:33:01-07:30

However, ISO8601 allows a timezone format without the colon character, like this:
2017-05-22 12:33:01-0730

I have not needed this feature yet, but this should be easy to add if someone needs it.
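
For illustration, a minimal sketch of a matcher whose timezone colon is optional (this is not the library's actual TIMESTAMP_MATCHER, just the idea):

    import re

    # Sketch only: the offset accepts both '-07:30' and '-0730' because the
    # colon is optional ( :? ).
    TIMESTAMP_RE = re.compile(
        r'^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}'
        r'(\.\d+)?'
        r'(Z|[+-]\d{2}:?\d{2})?$'
    )

    assert TIMESTAMP_RE.match('2017-05-22 12:33:01-07:30')
    assert TIMESTAMP_RE.match('2017-05-22 12:33:01-0730')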

Fails to generate schema on nested json

This example JSON is valid but cannot be parsed into a schema:

from bigquery_schema_generator.generate_schema import SchemaGenerator
import json

test = {
	"a": "a",
	"b": "20220101",
	"c": "c",
	"values": [{
		"percentage": 3,
		"values": [{
			"a": "20220101",
			"b": 100,
		}]
	}]
}

generator = SchemaGenerator(input_format='json', infer_mode='NULLABLE')

schema_map, error_logs = generator.deduce_schema(input_data=json.dumps(test))

Issue with invalid csv column names

I'm having an issue where my CSV column names are not valid BigQuery names. Renaming and handling this is outside my control, and the schema is updated frequently. I ran into this library, which makes things much easier, but noticed I had to process everything twice to clean up the invalid names.

BigQuery does an automatic substitution, in line with what is described in issue #14.

So I added a pull request that allows running the tool with an optional sanitize-names mode.

Quick note to say thanks!

This library saved me a bunch of time. Despite using BQ for a long time, I hadn't heard of it until someone referred me to it on SO. Thanks a lot!

add configurable csv.field_size_limit in SchemaGenerator

File "/lib/python3.11/site-packages/bigquery_schema_generator/generate_schema.py", line 190, in deduce_schema
for json_object in reader:
File "/lib/python3.11/csv.py", line 111, in next
row = next(self.reader)
^^^^^^^^^^^^^^^^^
_csv.Error: field larger than field limit (131072)

version = '1.5.1'
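
Until such an option exists on SchemaGenerator, a possible workaround sketch is to raise the standard library's limit before calling deduce_schema (csv.field_size_limit() is stdlib; the chosen fallback value is an assumption):

    import csv
    import sys

    # On some platforms sys.maxsize overflows a C long, so fall back to a
    # smaller explicit limit in that case.
    try:
        csv.field_size_limit(sys.maxsize)
    except OverflowError:
        csv.field_size_limit(2**24)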

apply quoted_values_are_strings to timestamps?

Hello, I have a string column in BQ where I store timestamps.

Is there a way to prevent deduce_schema from converting a field that contains string timestamps to TIMESTAMP when I have quoted_values_are_strings=True?

Or maybe another solution: if I pass in the original schema, have an option to prevent changing the types of the existing columns, e.g. add a flag dont_modify_original_columns and, whenever it's true, don't modify the columns of the existing schema (only add new ones).

schema inference involving nulls and arrays produces inconsistent results

The following call:
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':None}, {'1':['c','d','e']} ])
Produces OrderedDict([('1', None)])

Other calls of a similar nature produce inconsistent results:
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':['c','d','e']} ])
Produces OrderedDict([('1', OrderedDict([('status', 'hard'), ('filled', True), ('info', OrderedDict([('mode', 'REPEATED'), ('name', '1'), ('type', 'STRING')]))]))])

And
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':None} ])
Produces OrderedDict([('1', OrderedDict([('status', 'soft'), ('filled', False), ('info', OrderedDict([('mode', 'NULLABLE'), ('name', '1'), ('type', 'STRING')]))]))])

The specific issue I have involves a column that is 90% nulls and 10% string arrays. It results in the third of the above outcomes, when I'd have hoped for something with a mode of 'REPEATED'.

infer_mode still returns "REQUIRED" for json file

I set infer_mode=False, and after scanning the json file, the resulting schema still had 'REQUIRED' fields instead of the default 'NULLABLE'.

Is that the expected behavior? Or am I misunderstanding what the argument does?

Schema generation from a list

Hi,

I think it would be useful (at least for me :)) to add a function to the library that generates a schema from some data, something like

from bigquery_schema_generator import generate_schema
data = [{"first_column": 1, "second_column": "value"}, {"first_column": 2, "second_column": "another value"}]
schema = generate_schema(data)  # returns a list

I came up with this function:

import json
from subprocess import check_output

def generate_schema(data):
    data_string = ""
    for d in data:
        if d:
            data_string = data_string + json.dumps(d) + '\n'
    data_bytes = data_string.encode('utf-8')
    s = check_output(['generate-schema'], input=data_bytes)
    schema = json.loads(s)
    return schema

but I'm sure there's a more efficient way.
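
One possible in-process alternative, sketched from the deduce_schema / flatten_schema calls shown in other issues on this page (feeding the records as newline-delimited JSON through a StringIO is an assumption about what deduce_schema accepts):

    import json
    from io import StringIO

    from bigquery_schema_generator.generate_schema import SchemaGenerator

    def generate_schema(data):
        # Run the generator in-process instead of shelling out to the
        # generate-schema script.
        generator = SchemaGenerator(input_format='json')
        lines = StringIO('\n'.join(json.dumps(d) for d in data if d))
        schema_map, errors = generator.deduce_schema(lines)
        return generator.flatten_schema(schema_map)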

Can the schema generation tool suppress case-insensitive duplicates that are not accepted by BigQuery?

Hi

I have been trying to export asset metadata to GCS. The idea is to export the asset metadata generated into bigquery and then visualize in Data Studio.

However whenever I use the cloud asset API (either using curl or 'gcloud asset export' command), the generated raw json data file contains two duplicate fields, 'IPProtocol' and 'ipProtocol'.

Because of this, when I try to load this data into BigQuery (via bq mk or bq load), I get the following error.

$ bq mk inventory_dataset.2019_09_20_11_00_00 schema.json
BigQuery error in mk operation: Field resource.data.allowed.ipProtocol already exists in schema

Is this a bug, or am I doing something wrong?

I am using the bigquery-schema-generator tool (https://pypi.org/project/bigquery-schema-generator/) to generate the schema.

Please help.

Skip bad records instead of throwing an exception?

I have a newline delimited JSON file with a few bad (i.e. undecodable) lines. Currently this results in a JSONDecodeError halting execution.
Given that BigQuery can cope with bad records (--max_bad_records parameter) by skipping them, would it be useful to have a similar option in the schema generator? (This could be useful for e.g. CSV files with missing trailing columns as well.)
Concretely, the issue with my JSON file could be resolved by adding an (optional) try/except to

def json_reader(file):
    """A generator that converts an iterable of newline-delimited JSON objects
    ('file' could be a 'list' for testing purposes) into an iterable of Python
    dict objects.
    """
    for line in file:
        yield json.loads(line)
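
For illustration, a hedged sketch of that optional try/except (the function name and the max_bad_records semantics are assumptions, loosely mirroring bq load's --max_bad_records):

    import json

    def json_reader_skipping_bad_records(file, max_bad_records=0):
        # Sketch: skip undecodable lines, but re-raise once the budget of
        # bad records is exhausted.
        bad_records = 0
        for line in file:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                bad_records += 1
                if bad_records > max_bad_records:
                    raise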

Support for "UTC" suffix in TIMESTAMP data

Using bq extract to export table data from BigQuery produces UTC timestamps in the format "YY-MM-DD HH:MI:SS UTC" by default. This is the same format as displayed in the BigQuery Web UI when previewing data.
When this data is passed through the schema generator, the regex on the TIMESTAMP_MATCHER fails and the data is interpreted as a STRING in the JSON schema.
Attempting to use bq update using the JSON schema on the same table the data was exported from then fails due to the change in data type from TIMESTAMP to STRING.
This should be quite simple to fix: add an optional " UTC" suffix to the regex as an alternative to "Z".
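
For illustration, a sketch of that alternation (again, not the library's actual TIMESTAMP_MATCHER):

    import re

    # Sketch only: ' UTC' is accepted alongside 'Z' and numeric offsets.
    TIMESTAMP_UTC_RE = re.compile(
        r'^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?'
        r'(Z| UTC|[+-]\d{2}:?\d{2})?$'
    )

    assert TIMESTAMP_UTC_RE.match('2020-01-31 23:59:59 UTC')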

Unsupported array element type: __array__

Hello,

First of all, thank you very much for creating this. Looks like it's saved me heaps of time already.

I have, however, received an error that seems to be related to the format of the JSON file, specifically to the file having an array as one of the nested elements.

Example error message:
INFO:root:Problem on line 4: Unsupported array element type: __array__

This repeats for almost all rows of the file.

Row 4 of the file looks like this:
{"op":"mcm","clk":"1304450546","pt":1585613976590,"mc":[{"id":"1.170258437","rc":[{"batl":[[0,2.66,2.53],[1,1000,2.2]],"ltp":0.0,"tv":0.0,"id":110503}]}]}

Questions:

  1. Is this a limitation of the generator or limitation of BQ in general?
  2. If it is a limitation of the generator - do you have any ideas on how it can be fixed? I'm willing to contribute with a bit of guidance.
  3. If it is a limitation of BQ in general - what do you think could be the workaround here? To normalise per array element?

[FEATURE] When a nested field has mismatched type print the full path to that nested field

Summary

In a complex structure like the following:

{
  "source_machine": {
    "port": 80
  },
  "dest_machine": {
    "port": "http-port"
  }
}

If another log record had dest_machine.port as an integer, this would error and simply state something like:
Ignoring field with mismatched type: old=(hard,port,NULLABLE,STRING); new=(hard,port,NULLABLE,INTEGER)

At this point you are left to figure out which structure this port column actually lives in. This is a simple example, but as the schema grows more complex, the problem becomes harder to resolve manually.

Ideally, we could track the path using a JSON path or dpath expression, something like dest_machine.port. This will likely require adding an argument to the recursive function merge_schema_entry, something like base_path=None, and continually building up that base_path string on each recursive call so it can be used in the errors, e.g. "{}.{}".format(base_path, new_name) and "{}.{}".format(base_path, old_name).
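
For illustration, a tiny sketch of the path-building helper implied above (the helper name is hypothetical, not an existing library function):

    def full_path(base_path, name):
        # Build the dotted path used in the proposed error messages.
        return name if base_path is None else "{}.{}".format(base_path, name)

    assert full_path(None, "dest_machine") == "dest_machine"
    assert full_path("dest_machine", "port") == "dest_machine.port"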

Doesn't install properly from PyPI

Attempting to install from PyPI produces the following errors:

$ pip3 install bigquery-schema-generator
Collecting bigquery-schema-generator
  Downloading bigquery-schema-generator-0.1.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/_q/d57c1qhn5fb9ng6ycg2_3sxc0000gp/T/pip-build-i66hkje9/bigquery-schema-generator/setup.py", line 5, in <module>
        import pypandoc
    ModuleNotFoundError: No module named 'pypandoc'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/_q/d57c1qhn5fb9ng6ycg2_3sxc0000gp/T/pip-build-i66hkje9/bigquery-schema-generator/

After installing pypandoc and trying again, I encountered this error:

pip3 install bigquery-schema-generator
The directory '/Users/call/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/call/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting bigquery-schema-generator
  Downloading bigquery-schema-generator-0.1.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/tmp/pip-build-mr9vuupc/bigquery-schema-generator/setup.py", line 6, in <module>
        long_description = pypandoc.convert('README.md', 'rst')
      File "/usr/local/lib/python3.6/site-packages/pypandoc/__init__.py", line 66, in convert
        raise RuntimeError("Format missing, but need one (identified source as text as no "
    RuntimeError: Format missing, but need one (identified source as text as no file with that name was found).

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/tmp/pip-build-mr9vuupc/bigquery-schema-generator/

I was able to work around this by cloning this repo, cd'ing into the local repo and installing via pip3 install ., but it'd be great if this installed properly from PyPI.

Also, this is a much-needed utility. I'd previously been semi-solving this problem using wolverdude/genSON to infer a JSON schema, then converting that to BigQuery schema with some custom code, but this looks much more idiomatic. Looking forward to taking it for a spin. Thanks, and keep up the good work.

sanitize_name and field names starting with a number

Hi folks.
First: your tool works great, thanks for it.
Unfortunately, the data I work with is a mess. What I am fighting with now is this JSON:

{"objects":{"0":{"mime_type":"application/octet-stream","type":"artifact","hashes":{"MD5":"6..1","SHA-1":"4..0","SHA-256":"4..f"},"url":"https://URL/artifacts/4..f","x_cta_hash_identity":"6..a","x_cta_hash_context":"2..2","spec_version":"2.0"}}}

As you can see, there is a map whose key is the number "0"... but BigQuery doesn't allow a field name to start with a number:

BigQuery error in load operation: Invalid field name "0". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long.

So I propose changing "0" to "_0" in this case when --sanitize_names is applied.

Thanks

Make quoted integer detection optional

#15 introduced automatic conversion of quoted numeric values to the INTEGER type.
For my use-case I do not want that behaviour. Some identifiers contain only digits but should still be treated as strings, e.g. in other files they may not consist solely of digits. I am generating the JSON myself and would generate it with the corresponding type (i.e. if I wanted something to be represented as an INTEGER then I wouldn't quote it).

Integers exceeding the bigquery integer limit are still converted to integer in the schema

To replicate:

test.json:

{"name": "111222333444555666777"}
{"name": "111222333444555666777"}

Expected:

 % python3 -m bigquery_schema_generator.generate_schema --keep_nulls < ../data/test.json
INFO:root:Processed 2 lines
[
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "STRING"
  }
]

Actual:

 % python3 -m bigquery_schema_generator.generate_schema --keep_nulls < ../data/test.json
INFO:root:Processed 2 lines
[
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "INTEGER"
  }
]
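
For reference, a hedged sketch of the bounds check a fix would need (BigQuery's INTEGER is a signed 64-bit value; the helper name is hypothetical):

    # Values outside the INT64 range should fall back to another type,
    # e.g. STRING, instead of INTEGER.
    INTEGER_MIN = -2**63
    INTEGER_MAX = 2**63 - 1

    def fits_bigquery_integer(value_str):
        try:
            return INTEGER_MIN <= int(value_str) <= INTEGER_MAX
        except ValueError:
            return False

    assert not fits_bigquery_integer("111222333444555666777")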

Allow sections in test DataReader to appear in any order

From a comment by @bxparks in #57 regarding the sections within the DataReader class.

Hmm, it's getting harder to keep track of which tags are allowed in which sections. Originally, the order of the tags was just: DATA, [ERRORS], SCHEMA, END. But now it's DATA, [EXISTING_SCHEMA], [ERRORS], SCHEMA, END. A better way would be to allow these sections to appear in any order. But that's a bit out of scope for this PR. If I get motivated, maybe I'll take a crack at it after merging in this PR... but realistically, it will probably not rise high enough on my priority list with so many other things going on. Too bad. At least this tidbit is recorded here.

support sending an existing schema to deduce_schema

Support sending an existing schema to deduce_schema so we can merge an existing BigQuery schema with new rows in a file.
Something like:

def deduce_schema(self, file, schema_map=None):
    if schema_map is None:
        schema_map = OrderedDict()

mode conflation in nullable nested, repeated records

with data in a file:

{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}

If I manually upload the sample file to BigQuery, I get this schema:

{"name":"model","type":"RECORD","mode":"REPEATED","fields":[
    {"name":"data","type":"RECORD","mode":"REPEATED","fields":[
        {"name":"Inventory","type":"RECORD","mode":"REPEATED","fields":[
            {"name":"Observations","type":"STRING","mode":"NULLABLE"}
        ]}
    }
}
The code I run with the library is:

from google.cloud import bigquery
from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(
    infer_mode=True,
    input_format="json",
    quoted_values_are_strings=True,
    preserve_input_sort_order=True,
    keep_nulls=True,
    debugging_map=True,
    sanitize_names=True,
)
with open(file) as f:
    schema_map, errors = generator.deduce_schema(f)
if errors:
    for error in errors:
        print("Problem on line %s: %s" % (error['line_number'], error['msg']))

specs = generator.flatten_schema(schema_map)
schema_fields = [
    bigquery.SchemaField(
        name=spec["name"], field_type=spec["type"], mode=spec["mode"]
    )
    for spec in specs
]

But I get this error from the library:

Ignoring non-RECORD field with mismatched mode: 
old=(hard,model.data.Inventory.Observations,REPEATED,STRING); 
new=(soft,model.data.Inventory.Observations,NULLABLE,STRING)

My questions:

  1. The docs suggest using generator.flatten_schema(schema_map), but is there an alternative method to get a list of SchemaField objects in the original nested structure, i.e. SchemaFields without the flattening?

  2. Are my batch sizes too big? What's the guidance? I'm scanning 1000 records with 42 tags in the generated schema, but that's after it eliminates all the nesting. Even on 2 MB files with ~280 records I get weird errors.

  3. I get intermittent errors like "Problem on line 278: Unsupported array element type: __null__". There are 14 nulls in line 278 and 13 in line 279, but none of them are in an array.

Ignoring non-RECORD field with mismatched mode: - error

Hi, when trying to recreate the example (using Ubuntu and a venv), I run into the following problem:

user@DESKTOP:/mnt/c/X/venv_dir$ cat > file.data.json
{ "a": [1, 2] }
{ "i": 3 }
Ctrl-D
user@DESKTOP:/mnt/c/X/venv_dir$ generate-schema < file.data.json > file.schema.json
Traceback (most recent call last):
  File "/home/user/.local/bin/generate-schema", line 7, in <module>
    from bigquery_schema_generator.generate_schema import main
  File "/home/user/.local/lib/python3.5/site-packages/bigquery_schema_generator/generate_schema.py", line 303
    f'Ignoring non-RECORD field with mismatched mode: '
    ^
SyntaxError: invalid syntax

What might be wrong with this?

Type Inference in inconsistent lists of dictionaries

Hi, I would expect the module to pass the following test:

DATA
{ "r" : [{ "i": 4 },{ "i": "4px" }] }
SCHEMA
[
  {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "i",
        "type": "STRING"
      }
    ],
    "mode": "REPEATED",
    "name": "r",
    "type": "RECORD"
  }
]
END

Unfortunately, the type returned for the "i" field is INTEGER. I have trouble deciding whether this is a bug: it seems technically doable and useful, but it also seems to be a case mentioned somewhere in the README ("but bq load does not support it, so we follow its behavior").
Is this a bug to be fixed or not?

[BUG] - empty nested records with keep_nulls=False produce a record with no fields

Summary

When we remove nulls, we only remove inner nulls and do not handle the case where a record has all of its fields removed in the process. This produces the following error when attempting to load into BigQuery with the generated schema:
Field outer_nested_record is type RECORD but has no schema.

Example input

test_data.json

{"test": "thing", "empty_record": {}, "outer_nested_record": {"inner_empty_record": {}}}

Command Ran

generate-schema --input_format json --quoted_values_are_strings < test_file.json

Current Output

[
  {
    "fields": [],
    "mode": "NULLABLE",
    "name": "outer_nested_record",
    "type": "RECORD"
  },
  {
    "mode": "NULLABLE",
    "name": "test",
    "type": "STRING"
  }
]

Expected Output

[
  {
    "mode": "NULLABLE",
    "name": "test",
    "type": "STRING"
  }
]

recursive name sanitization of records of type RECORD fails

Names in RECORD fields are not sanitized. I reproduced the issue consistently by introducing the following test data:

# Sanitize the names to comply with BigQuery, recursively.
DATA sanitize_names
{ "r" : { "a-name": [1, 2] } }
SCHEMA
[
  {
    "fields": [
      {
        "mode": "REPEATED",
        "name": "a_name",
        "type": "INTEGER"
      }
    ],
    "mode": "NULLABLE",
    "name": "r",
    "type": "RECORD"
  }
]
END

Which results in the following failure:

======================================================================
FAIL: test (__main__.TestFromDataFile)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./tests/test_generate_schema.py", line 423, in test
    self.verify_data_chunk(chunk_count, chunk)
  File "./tests/test_generate_schema.py", line 450, in verify_data_chunk
    self.assertEqual(expected, schema)
AssertionError: Lists differ: [Orde[62 chars]', 'a_name'), ('type', 'INTEGER')])]), ('mode'[46 chars]')])] != [Orde[62 chars]', 'a-name'), ('type', 'INTEGER')])]), ('mode'[46 chars]')])]

First differing element 0:
Order[61 chars]', 'a_name'), ('type', 'INTEGER')])]), ('mode'[45 chars]D')])
Order[61 chars]', 'a-name'), ('type', 'INTEGER')])]), ('mode'[45 chars]D')])

  [OrderedDict([('fields',
                 [OrderedDict([('mode', 'REPEATED'),
-                              ('name', 'a_name'),
?                                         ^

+                              ('name', 'a-name'),
?                                         ^

                               ('type', 'INTEGER')])]),
                ('mode', 'NULLABLE'),
                ('name', 'r'),
                ('type', 'RECORD')])]

----------------------------------------------------------------------
Ran 14 tests in 0.006s

[FEATURE] Starting with an existing schema, exclude rows that do not match the existing schema

Current Behavior

Whether starting with an existing schema or not, when the script encounters a change it logs the changed line, giving errors like:

Error Log

INFO:root:Problem on line 47730: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,RECORD); new=(hard,dimensionValue,REPEATED,STRING)
INFO:root:Problem on line 47732: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,STRING); new=(hard,dimensionValue,REPEATED,RECORD)

Expected Behavior

For example, our file has about 100,000 rows, but only 100 rows do not match the existing schema. However, when those non-matching lines come consecutively, the script flags the first one as problematic, and then the matching line that comes right after the run of non-matching lines is also marked as problematic, even though it actually matches the existing schema.

Suggested solution

Add a new feature that checks the input file against a schema file, excludes rows that do not match the schema, and writes them out to a separate JSON/CSV file.

Cannot flatten schema due to schema map being a tuple[OrderedDict, list]

Following the readme tutorial, I reproduced the steps:

schema_map = generator.deduce_schema(
    input_data=table_data
)


schema = generator.flatten_schema(schema_map)

the input_data is a dictionary, which I specified in the generator configurations.

The exception raised is: Exception: Unexpected type '<class 'tuple'>' for schema_map
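
A minimal fix sketch, assuming the same API used in the other reports above: deduce_schema returns a (schema_map, error_logs) tuple, so unpack it before calling flatten_schema (table_data here is a hypothetical stand-in for the real input):

    from bigquery_schema_generator.generate_schema import SchemaGenerator

    generator = SchemaGenerator(input_format='json')
    table_data = [{"first_column": 1, "second_column": "value"}]  # hypothetical input

    # deduce_schema returns two values; flatten_schema wants only the first.
    schema_map, error_logs = generator.deduce_schema(input_data=table_data)
    schema = generator.flatten_schema(schema_map)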

CSV Delimiter Option

I haven't seen an option for it, and the schema is not generated for a pipe-delimited CSV. I was wondering if this is something that could be added.

I might be able to do the code edit myself and push it, but this would be the first project I would be contributing to, so I would want to take the time to look at all the code first.

When creating a load job programmatically, load_job.schema has to be a list of bigquery.SchemaField objects

I wrote a recursive function to walk a flattened schema_map and convert everything to bigquery.SchemaField objects (code below).

This could probably be done somewhere higher up instead of post-generation, and would be more performant. It works well for me and could be helpful for others.

from google.cloud import bigquery

def walk_schema(s):
    result = []
    for field in s:
        # Recurse into nested RECORD fields first.
        if field.get('fields', None):
            field['fields'] = walk_schema(field['fields'])

        # bigquery.SchemaField expects 'field_type' rather than 'type'.
        if field.get('type', None):
            field['field_type'] = field.pop('type')

        field = bigquery.SchemaField(**field)
        result.append(field)
    return result

[FEATURE] Starting with an existing schema, epoch time causes a mismatched type TIMESTAMP --> INTEGER

Current Behavior

When starting with an existing BigQuery schema that has a TIMESTAMP field in it, we get an error when trying to load logs which contain an epoch time, because it is detected as an INTEGER:

Error: [{'line_number': 1, 'msg': 'Ignoring field with mismatched type: old=(hard,event_time,NULLABLE,TIMESTAMP); new=(hard,event_time,NULLABLE,INTEGER)'}]

Expected Behavior

If the integer has the correct number of digits for an epoch timestamp that BigQuery supports, we should be able to assume that this INTEGER is in fact a TIMESTAMP and maintain it as such.

Suggested solution

Add a new if block to the convert_type function which will allow btype = TIMESTAMP and atype = INTEGER to return TIMESTAMP if the integer matches the correct number of digits for an epoch time.

There is added complexity because we do not pass the actual data for the record into this function. We may need to start doing this.
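
For illustration, a sketch of just the digit-count heuristic (the exact digit counts and the helper name are assumptions, not the library's convert_type logic):

    def looks_like_epoch(value_str):
        # Treat an all-digit value with 10 digits (epoch seconds) or 13 digits
        # (epoch milliseconds) as a plausible TIMESTAMP.
        return value_str.isdigit() and len(value_str) in (10, 13)

    assert looks_like_epoch("1585613976")       # seconds
    assert looks_like_epoch("1585613976590")    # milliseconds
    assert not looks_like_epoch("80")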

Allow option to pass in iterable of dictionaries when used as a library

Right now, if I want to generate the schema for a list of dictionaries, I first need to convert each dictionary into a JSON string just so that it can be loaded back into a dictionary and yielded by json_reader. When using this as a library, accepting an iterable of dictionaries directly would be a useful feature.

I am happy to create a PR for this if you would like but wanted to propose and make sure you are onboard with it before sending the PR @bxparks

Let me know if I should continue with a PR that includes some added tests for it.

error with white spaces & other wrong characters in column names

When the schema is created, column names with spaces are written as-is.
Therefore, uploading to bq generates the following error:
<BigQuery error in load operation: Invalid field name "utm_medium-partners". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.>

Would it be possible to substitute blank spaces and other invalid characters with '_', as the '--autodetect' option does? For example:
'Column.example 1' would be written as 'Column_example_1'
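
For illustration, a sketch of that substitution (the behaviour of --autodetect is assumed here, and the helper is hypothetical):

    import re

    def sanitize_column_name(name):
        # Replace characters BigQuery rejects in field names with '_'.
        return re.sub(r'[^a-zA-Z0-9_]', '_', name)

    assert sanitize_column_name('Column.example 1') == 'Column_example_1'
    assert sanitize_column_name('utm_medium-partners') == 'utm_medium_partners'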

AttributeError: 'NoneType' object has no attribute 'lower'

I used bigquery-schema-generator on a relatively small CSV file (320 MB). After reading about 30,000 lines, the following AttributeError was thrown:
AttributeError: 'NoneType' object has no attribute 'lower'

INFO:root:Processing line 1000
INFO:root:Processing line 2000
INFO:root:Processing line 3000
...
INFO:root:Processing line 29000
INFO:root:Processing line 30000
INFO:root:Processed 30334 lines
Traceback (most recent call last):
  File "/usr/local/bin/generate-schema", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/bigquery_schema_generator/generate_schema.py", line 1074, in main
    generator.run(schema_map=existing_schema_map)
  File "/usr/local/lib/python3.7/dist-packages/bigquery_schema_generator/generate_schema.py", line 707, in run
    input_file, schema_map=schema_map
  File "/usr/local/lib/python3.7/dist-packages/bigquery_schema_generator/generate_schema.py", line 201, in deduce_schema
    schema_map=schema_map,
  File "/usr/local/lib/python3.7/dist-packages/bigquery_schema_generator/generate_schema.py", line 237, in deduce_schema_for_record
    canonical_key = self.sanitize_name(key).lower()
AttributeError: 'NoneType' object has no attribute 'lower'

getting error after running the generate-schema

generate-schema file2.data.json file.schema.json
usage: generate-schema [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
[--quoted_values_are_strings] [--infer_mode]
[--debugging_interval DEBUGGING_INTERVAL]
[--debugging_map] [--sanitize_names]
generate-schema: error: unrecognized arguments: file2.data.json file.schema.json

I have tried both Linux and Mac and get the same error. The file2.data.json has only one JSON object, but basically it errors on the arguments.

Dates which are not in ISO format...

Hi
Currently the code identifies a date/timestamp field only if it is in ISO format. Is it possible to add a feature to identify dates/timestamps in other formats? I can work on this if required.
