capitalone / locopy
locopy: Loading/Unloading to Redshift and Snowflake using Python.
Home Page: https://capitalone.github.io/locopy/
License: Apache License 2.0
Seems like a minor bug and easy fix. Particularly this line: we need to remove the .head() call, as it will only grab the first 5 rows.
https://github.com/capitalone/Data-Load-and-Copy-using-Python/blob/master/locopy/utility.py#L296
There are a lot of elifs; maybe we can add an else clause and set it to something generic like VARCHAR. For example, if we have a pandas categorical data type right now, that column would just be skipped...
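A minimal sketch of that suggestion, assuming a dtype-mapping helper along these lines (the function name and exact SQL type names are illustrative, not locopy's actual find_column_type):

import pandas as pd

def infer_sql_type(series: pd.Series) -> str:
    # check the pandas dtype directly rather than scanning rows
    if pd.api.types.is_integer_dtype(series):
        return "INTEGER"
    elif pd.api.types.is_float_dtype(series):
        return "FLOAT"
    elif pd.api.types.is_datetime64_any_dtype(series):
        return "TIMESTAMP"
    elif pd.api.types.is_bool_dtype(series):
        return "BOOLEAN"
    else:
        # generic fallback so categorical/object columns are not skipped
        return "VARCHAR"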
Based on a previous PR (#17 ) and comment:
class Cmd:
    # Redshift stuff
    ...

class S3:
    # S3-only stuff
    ...

class Copy(Cmd, S3):
    # Stuff that needs both (like COPY/UNLOAD/etc.)
    ...
JavaScript library for DOM operations
Library home page: https://cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.js
Path to vulnerable library: /Data-Load-and-Copy-using-Python/_static/jquery-3.2.1.js
Dependency Hierarchy:
Found in HEAD commit: fc064b132c13e4214bde3ffae659bafa1d52ae52
In jQuery versions greater than or equal to 1.2 and before 3.5.0, passing HTML from untrusted sources - even after sanitizing it - to one of jQuery's DOM manipulation methods (i.e. .html(), .append(), and others) may execute untrusted code. This problem is patched in jQuery 3.5.0.
Publish Date: 2020-04-29
URL: CVE-2020-11022
Base Score Metrics:
Type: Upgrade version
Origin: https://blog.jquery.com/2020/04/10/jquery-3-5-0-released/
Release Date: 2020-04-29
Fix Resolution: jQuery - 3.5.0
Seems like the advantages of loguru are not really being met.
Proposal: switch back to standard logging to ensure easier compatibility with dependent workflows.
Seems like we need a workaround for this "bug". Relates to sphinx-doc/sphinx#759.
Basically, default values in function signatures get fully resolved when the docs are built, so things like local_path=os.getcwd() render as the doc builder's actual working directory.
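A minimal sketch of the usual workaround for this Sphinx behaviour, assuming a function that currently defaults to os.getcwd() (the function name is illustrative): default to None and resolve the path at call time, so the rendered signature shows local_path=None instead of whatever directory the doc build happened to run in.

import os

def upload(local_path=None):
    # resolve the default lazily so it isn't baked into the rendered signature
    if local_path is None:
        local_path = os.getcwd()
    return local_path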
In the Snowflake class, I'd like to add a feature for automatically running the command USE SCHEMA {{ schema_name }} upon connecting to Snowflake. The package already has a similar feature with database and warehouse, so I think it would be a relatively simple feature addition.
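A minimal sketch of the proposed behaviour (names are illustrative, not locopy's actual API): after the Snowflake connection is established, issue a USE statement for each context key that was supplied, with schema as the new addition alongside warehouse and database.

def set_session_context(cursor, connection_params):
    # warehouse/database mirror the existing behaviour; schema is the proposed addition
    for key, statement in (
        ("warehouse", "USE WAREHOUSE {0}"),
        ("database", "USE DATABASE {0}"),
        ("schema", "USE SCHEMA {0}"),
    ):
        value = connection_params.get(key)
        if value:
            cursor.execute(statement.format(value))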
Might lead to some cleaner mocking code: https://github.com/pytest-dev/pytest-mock/
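A rough sketch of what a test could look like with pytest-mock's mocker fixture instead of nested unittest.mock.patch context managers (the patch target is illustrative):

def test_upload_called_once(mocker):
    mocked_upload = mocker.patch("locopy.s3.S3.upload_to_s3")
    # ... exercise code that should upload exactly once ...
    mocked_upload.assert_called_once()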
Snowflake can run on Azure now (and GCP sometime this year, I think), so we should at least note in the docs that we don't support Azure/GCP, and maybe look at supporting them (though that could be tricky with testing).
This function will help insert a pandas dataframe into Snowflake tables.
It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.3.6. You're in good company, about 5% of other
projects updated in the last year are also missing files.
+ /tmp/venv/bin/pip3 wheel --no-binary locopy -w /tmp/ext locopy==0.3.6
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting locopy==0.3.6
Downloading http://10.10.0.139:9191/root/pypi/%2Bf/4f1/46b583dff9457/locopy-0.3.6.tar.gz (20 kB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /tmp/venv/bin/python3 /tmp/tmp8ev7dvhg get_requires_for_build_wheel /tmp/tmp_l094jkx
cwd: /tmp/pip-wheel-w_6j1bvw/locopy
Complete output (18 lines):
Traceback (most recent call last):
File "/tmp/tmp8ev7dvhg", line 280, in <module>
main()
File "/tmp/tmp8ev7dvhg", line 263, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/tmp/tmp8ev7dvhg", line 114, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 147, in get_requires_for_build_wheel
return self._get_build_requires(
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 128, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 249, in run_setup
super(_BuildMetaLegacyBackend,
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 143, in run_setup
exec(compile(code, __file__, 'exec'), locals())
File "setup.py", line 26, in <module>
with open(os.path.join(CURR_DIR, "requirements.txt"), encoding="utf-8") as file_open:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-wheel-w_6j1bvw/locopy/requirements.txt'
----------------------------------------
ERROR: Command errored out with exit status 1: /tmp/venv/bin/python3 /tmp/tmp8ev7dvhg get_requires_for_build_wheel /tmp/tmp_l094jkx Check the logs for full command output.
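The underlying cause is that setup.py reads requirements.txt, but the sdist doesn't ship it. The real fix is to include the file in the manifest; a hedged sketch of a defensive setup.py read, so that incomplete sdists still build, would be:

import os

CURR_DIR = os.path.abspath(os.path.dirname(__file__))
req_file = os.path.join(CURR_DIR, "requirements.txt")
if os.path.exists(req_file):
    with open(req_file, encoding="utf-8") as file_open:
        install_requires = file_open.read().splitlines()
else:
    # fall back to no pinned requirements if the file was not packaged
    install_requires = []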
JavaScript library for DOM operations
Library home page: https://cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.js
Path to vulnerable library: /Data-Load-and-Copy-using-Python/_static/jquery-3.2.1.js
Dependency Hierarchy:
Found in HEAD commit: fc064b132c13e4214bde3ffae659bafa1d52ae52
jQuery before 3.4.0, as used in Drupal, Backdrop CMS, and other products, mishandles jQuery.extend(true, {}, ...) because of Object.prototype pollution. If an unsanitized source object contained an enumerable __proto__ property, it could extend the native Object.prototype.
Publish Date: 2019-04-20
URL: CVE-2019-11358
Base Score Metrics:
Type: Upgrade version
Origin: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-11358
Release Date: 2019-04-20
Fix Resolution: 3.4.0
A question was raised in an internal Slack channel about getting an S3CredentialsError when trying to sample data from Redshift (not doing any ETL job, so it doesn't really require S3 credentials).
The msg I would expect to get here is: "S3 credentials we not found. S3 functionality is disabled"
Not sure if we both set the inputs wrong, but want to do some investigation on this.
Move examples into docs and/or simplify the usage.
Switching to hypothesis for testing data
The current way is to loop through every row of an 'object' type column to determine if it could potentially be a timestamp (e.g. 2019-01-01) or a float (e.g. Decimal(2.0)). This will be compute-intensive if the dataframe is large.
One idea is to use sampling, but that can lead to false positives.
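A minimal sketch of the sampling idea (the sample size is illustrative): check a random sample of the non-null values instead of every row, accepting the risk of false positives mentioned above.

import pandas as pd

def sample_looks_like_timestamp(series: pd.Series, sample_size: int = 1000) -> bool:
    non_null = series.dropna()
    if non_null.empty:
        return False
    sample = non_null.sample(min(sample_size, len(non_null)), random_state=0)
    try:
        pd.to_datetime(sample)
        return True
    except (ValueError, TypeError):
        return False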
gh-pages branch.
Snowflake has a new method to efficiently export to pandas:
https://docs.snowflake.net/manuals/user-guide/python-connector-pandas.html#
A test with 8463105 rows of data shows a ~20x speed up on Snowflake.
Based on the following release:
https://community.snowflake.com/s/article/4-2-Release-Notes-January-27-2020
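A hedged sketch of what using the connector's pandas export could look like. This assumes the pandas extras of snowflake-connector-python are installed and that the underlying cursor is reachable as .cursor; locopy's to_dataframe could defer to this instead of building the frame from fetched rows:

import locopy
import snowflake.connector

with locopy.Snowflake(dbapi=snowflake.connector, config_yaml="creds.yml") as sf:
    sf.execute("SELECT * FROM my_schema.my_table")  # table name is illustrative
    df = sf.cursor.fetch_pandas_all()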
Just to align better, we should allow the OVERWRITE flag to be set in upload_to_internal.
Would be nice to set up a pre-commit hook for using black.
Please be sure to add trusted reviewers to your codeowners file
Right now if you want to just use the S3 functionality you need to provide some redshift credentials.
This isn't ideal behaviour and we should maybe look into refactoring the code a bit to decouple some of this functionality so that we can interact with S3 independently.
This relates to the Cmd and S3 class.
I think moving to GitHub Actions might make sense. I've been noticing that Travis' support for open source projects has led to long queuing times. GitHub Actions sort of replaces Travis for our purposes.
I'd like to move to a develop / master workflow and update the docs a bit to reflect the proper release instructions. I think it is important to set this up to ensure more structure and better practices for the future. Right now the package is relatively small, so it's not a huge deal in its current state.
@theianrobertson, thoughts on this?
Would be nice to maybe expand the usage.
Maybe even just regular Postgres too?
Just want to have the discussion here vs in the PR.
Proposal: add a flag use_write_pandas to insert_dataframe_to_table which will defer to the write_pandas method rather than running the INSERT INTO statements. This would give people both options.
Basically:
if use_write_pandas:
    # run self.cur.write_pandas(......)
    pass
else:
    insert_query = """INSERT INTO {table_name} {columns} VALUES {values}""".format(
        table_name=table_name, columns=column_sql, values=string_join
    )
We can keep the table creation / metadata part in this scenario.
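For reference, a hedged sketch of what the write_pandas branch could call: the Snowflake connector exposes write_pandas as a module-level helper (snowflake.connector.pandas_tools.write_pandas) that takes the connection rather than the cursor, so assuming the connection object is available it could look like:

from snowflake.connector.pandas_tools import write_pandas

def insert_via_write_pandas(conn, df, table_name):
    # loads the dataframe into an existing table via a temporary stage + COPY
    success, nchunks, nrows, _ = write_pandas(conn, df, table_name)
    return nrows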
Given what locopy is doing, we need a bit of an explanation and warning on SQL injections, particularly around COPY/UNLOAD statements, as these can't be bound via parameterization the way you would for, say, a WHERE clause.
I tried to connect to a local SQLite3 database using locopy and it threw the following error:
ValueError: parameters are of unsupported type
After doing some digging, it looks like the error is being thrown by the default argument for params in locopy.database.Database.execute(). Once I started using params=(), it started working.
e.g.
import sqlite3
import locopy
with locopy.Database(dbapi=sqlite3, database=':memory:') as cmd:
    cmd.execute('''CREATE TABLE stocks (date text, qty real)''', params=())
I tested params=() with locopy.snowflake.Snowflake as well and it worked.
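A minimal sketch (not locopy's actual internals) of how execute could sidestep the unsupported default: only forward params to the DBAPI cursor when the caller actually supplied some.

def execute(cursor, sql, params=None):
    # sqlite3 rejects a non-sequence params value, so omit it unless provided
    if params:
        cursor.execute(sql, params)
    else:
        cursor.execute(sql)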
NULL values in Snowflake are loaded into Python as None. When using Database.to_dataframe, these values remain as None rather than being converted to numpy.nan. As a result, any column containing None is forced to an "object" data type.
This makes it difficult to validate our data, since the data type has already changed. It also necessitates an extra step for type conversion.
I haven't tested this fully, but this issue could be fixed by changing this line:
fetched = [tuple(column for column in row) for row in fetched]
to something like:
fetched = [
    tuple(column if column is not None else np.nan for column in row)
    for row in fetched
]
np.nan seems to be the de facto null value to use with pandas, as it doesn't mess up one's dtypes. If Database.to_dataframe is meant to be a convenient way of porting data into pandas, I think it makes sense for the method to be aware of the issue with None and to handle nulls more gracefully.
Please refer to the documentation requirements to update the SPDX header.
I think it would make sense to change the pinning of the boto3 version to a minimum version.
Sometimes I have connection issues when uploading data via load_and_copy, for example:
botocore.exceptions.ConnectionClosedError:
Connection was closed before we received a valid response from endpoint URL: "https://MY-S3-FILE".
I found a suggestion to use a bigger connection_timeout in this thread. I'm also interested in the retries option (it's zero by default, I guess). Both are attributes of botocore.config.Config.
Can I somehow pass a Config to a Redshift object?
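A minimal sketch of the kind of configuration being asked about, using plain boto3; how (or whether) this can be threaded into locopy's Redshift/S3 objects is the open question here:

import boto3
from botocore.config import Config

config = Config(connect_timeout=60, retries={"max_attempts": 5})
s3_client = boto3.client("s3", config=config)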
I think switching to loguru might clean up the logging portion and make things a bit simpler. Been testing it internally and so far very happy with it.
This could lead to a cleaner code base, remove some of the complexity, and reduce maintenance.
Needs some investigating to scope out the workload etc.
Version: 0.3.7
After downloading data from Redshift via S3, values in boolean columns become t or f instead of True or False. Therefore I can't read a dataframe with (nullable) boolean dtype columns properly.
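A hedged workaround sketch (column and file names are illustrative): map Redshift's 't'/'f' strings back to pandas' nullable boolean dtype after reading the unloaded file.

import pandas as pd

df = pd.read_csv("unloaded_data.csv")
df["flag_col"] = df["flag_col"].map({"t": True, "f": False}).astype("boolean")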
This might be some nice functionality which we can build out.
I can see a bunch of people wrangling with Pandas and then wanting to get this into Redshift.
locopy, since Redshift is just a derivative of it.
Python 3.7.3
locopy==0.3.6
When locopy.Redshift().load_and_copy() is called with both splits=N and copy_options=["IGNOREHEADER AS 1"], N-1 data rows are lost, because load_and_copy doesn't recreate the CSV file's header in the chunks: the first chunk still has the header and is copied correctly, while the first row of each remaining chunk is ignored because of copy_options=["IGNOREHEADER AS 1"].
There's a workaround: remove the CSV file's header and call load_and_copy() without copy_options=["IGNOREHEADER AS 1"].
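A minimal sketch of that workaround (file names are illustrative): rewrite the CSV without its header row so IGNOREHEADER is no longer needed when splits > 1.

import pandas as pd

df = pd.read_csv("data_with_header.csv")
df.to_csv("data_no_header.csv", index=False, header=False)
# then call load_and_copy on "data_no_header.csv" with splits=N and no IGNOREHEADER option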
YAML parser and emitter for Python
Library home page: https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz
Path to dependency file: Data-Load-and-Copy-using-Python
Path to vulnerable library: Data-Load-and-Copy-using-Python,Data-Load-and-Copy-using-Python/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 0b930613055b1f748f7ca0422981cd4c9d47bb5b
A vulnerability was discovered in the PyYAML library in all versions, where it is susceptible to arbitrary code execution when it processes untrusted YAML files through the full_load method or with the FullLoader loader. .load() defaults to using FullLoader and FullLoader is still vulnerable to RCE when run on untrusted input. Applications that use the library to process untrusted input may be vulnerable to this flaw. An attacker could use this flaw to execute arbitrary code on the system by abusing the python/object/new constructor.
The fix for CVE-2020-1747 was not enough to fix this issue.
Publish Date: 2020-07-21
URL: CVE-2020-14343
Base Score Metrics:
There's a strict dependency on a specific version of pyyaml, but sometimes I'm trying to use this alongside other packages that may have a range. For example, if I install pre-commit into my environment first, it installs the latest pyyaml:
https://github.com/pre-commit/pre-commit/blob/master/setup.cfg#L32
What are your thoughts on having a minimum version for pyyaml but leaving the max version out, since this doesn't rely on a huge amount of the yaml package?
The error msg is:
.conda/envs/py37/lib/python3.7/site-packages/asn1crypto/keys.py", line 1065, in unwrap
    'asn1crypto.keys.PublicKeyInfo().unwrap() has been removed, '
asn1crypto._errors.APIException: asn1crypto.keys.PublicKeyInfo().unwrap() has been removed, please use oscrypto.asymmetric.PublicKey().unwrap() instead
which is probably triggered by the new asn1crypto release. We need to add asn1crypto==0.24.0 as a dependency.