capitalone / locopy
locopy: Loading/Unloading to Redshift and Snowflake using Python.
Home Page: https://capitalone.github.io/locopy/
License: Apache License 2.0
Seems like a minor bug and easy fix. Particularly this line: we need to remove the .head() call, as it will only grab the first 5 rows.
https://github.com/capitalone/Data-Load-and-Copy-using-Python/blob/master/locopy/utility.py#L296
There are a lot of elifs; maybe we can add an else clause and set it to something generic like VARCHAR. For example, if we have a pandas categorical data type right now, that column would just be skipped...
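A minimal sketch of that suggestion, assuming a dtype-mapping helper along these lines (the function name and exact SQL type names are illustrative, not locopy's actual find_column_type):

import pandas as pd

def infer_sql_type(series: pd.Series) -> str:
    # check the pandas dtype directly rather than scanning rows
    if pd.api.types.is_integer_dtype(series):
        return "INTEGER"
    elif pd.api.types.is_float_dtype(series):
        return "FLOAT"
    elif pd.api.types.is_datetime64_any_dtype(series):
        return "TIMESTAMP"
    elif pd.api.types.is_bool_dtype(series):
        return "BOOLEAN"
    else:
        # generic fallback so categorical/object columns are not skipped
        return "VARCHAR"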
Based on a previous PR (#17 ) and comment:
class Cmd:
    # Redshift stuff
    ...

class S3:
    # S3-only stuff
    ...

class Copy(Cmd, S3):
    # Stuff that needs both (like COPY/UNLOAD/etc.)
    ...
JavaScript library for DOM operations
Library home page: https://cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.js
Path to vulnerable library: /Data-Load-and-Copy-using-Python/_static/jquery-3.2.1.js
Dependency Hierarchy:
Found in HEAD commit: fc064b132c13e4214bde3ffae659bafa1d52ae52
In jQuery versions greater than or equal to 1.2 and before 3.5.0, passing HTML from untrusted sources - even after sanitizing it - to one of jQuery's DOM manipulation methods (i.e. .html(), .append(), and others) may execute untrusted code. This problem is patched in jQuery 3.5.0.
Publish Date: 2020-04-29
URL: CVE-2020-11022
Base Score Metrics:
Type: Upgrade version
Origin: https://blog.jquery.com/2020/04/10/jquery-3-5-0-released/
Release Date: 2020-04-29
Fix Resolution: jQuery - 3.5.0
Seems like the advantages of loguru are not really being met.
Proposal: switch back to standard logging to ensure easier compatibility with dependent workflows.
Seems like we need a workaround for this "bug". Relates to sphinx-doc/sphinx#759.
Basically, default values in function signatures get fully resolved when the docs are built, so things like local_path=os.getcwd() render as the doc builder's actual working directory.
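A minimal sketch of the usual workaround for this Sphinx behaviour, assuming a function that currently defaults to os.getcwd() (the function name is illustrative): default to None and resolve the path at call time, so the rendered signature shows local_path=None instead of whatever directory the doc build happened to run in.

import os

def upload(local_path=None):
    # resolve the default lazily so it isn't baked into the rendered signature
    if local_path is None:
        local_path = os.getcwd()
    return local_path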
In the Snowflake class, I'd like to add a feature for automatically running the command USE SCHEMA {{ schema_name }} upon connecting to Snowflake. The package already has a similar feature with database and warehouse, so I think it would be a relatively simple feature addition.
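A minimal sketch of the proposed behaviour (names are illustrative, not locopy's actual API): after the Snowflake connection is established, issue a USE statement for each context key that was supplied, with schema as the new addition alongside warehouse and database.

def set_session_context(cursor, connection_params):
    # warehouse/database mirror the existing behaviour; schema is the proposed addition
    for key, statement in (
        ("warehouse", "USE WAREHOUSE {0}"),
        ("database", "USE DATABASE {0}"),
        ("schema", "USE SCHEMA {0}"),
    ):
        value = connection_params.get(key)
        if value:
            cursor.execute(statement.format(value))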
Might lead to some cleaner mocking code: https://github.com/pytest-dev/pytest-mock/
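A rough sketch of what a test could look like with pytest-mock's mocker fixture instead of nested unittest.mock.patch context managers (the patch target is illustrative):

def test_upload_called_once(mocker):
    mocked_upload = mocker.patch("locopy.s3.S3.upload_to_s3")
    # ... exercise code that should upload exactly once ...
    mocked_upload.assert_called_once()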
Snowflake can run on Azure now (and GCP sometime this year, I think), so we should at least note in the docs that we don't support Azure/GCP, and maybe look at supporting them (though that could be tricky with testing).
This function will help insert a pandas dataframe into Snowflake tables.
It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.3.6. You're in good company, about 5% of other
projects updated in the last year are also missing files.
+ /tmp/venv/bin/pip3 wheel --no-binary locopy -w /tmp/ext locopy==0.3.6
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting locopy==0.3.6
Downloading http://10.10.0.139:9191/root/pypi/%2Bf/4f1/46b583dff9457/locopy-0.3.6.tar.gz (20 kB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /tmp/venv/bin/python3 /tmp/tmp8ev7dvhg get_requires_for_build_wheel /tmp/tmp_l094jkx
cwd: /tmp/pip-wheel-w_6j1bvw/locopy
Complete output (18 lines):
Traceback (most recent call last):
File "/tmp/tmp8ev7dvhg", line 280, in <module>
main()
File "/tmp/tmp8ev7dvhg", line 263, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/tmp/tmp8ev7dvhg", line 114, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 147, in get_requires_for_build_wheel
return self._get_build_requires(
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 128, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 249, in run_setup
super(_BuildMetaLegacyBackend,
File "/tmp/pip-build-env-3ri7vmu3/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 143, in run_setup
exec(compile(code, __file__, 'exec'), locals())
File "setup.py", line 26, in <module>
with open(os.path.join(CURR_DIR, "requirements.txt"), encoding="utf-8") as file_open:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-wheel-w_6j1bvw/locopy/requirements.txt'
----------------------------------------
ERROR: Command errored out with exit status 1: /tmp/venv/bin/python3 /tmp/tmp8ev7dvhg get_requires_for_build_wheel /tmp/tmp_l094jkx Check the logs for full command output.
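The underlying cause is that setup.py reads requirements.txt, but the sdist doesn't ship it. The real fix is to include the file in the manifest; a hedged sketch of a defensive setup.py read, so that incomplete sdists still build, would be:

import os

CURR_DIR = os.path.abspath(os.path.dirname(__file__))
req_file = os.path.join(CURR_DIR, "requirements.txt")
if os.path.exists(req_file):
    with open(req_file, encoding="utf-8") as file_open:
        install_requires = file_open.read().splitlines()
else:
    # fall back to no pinned requirements if the file was not packaged
    install_requires = []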
JavaScript library for DOM operations
Library home page: https://cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.js
Path to vulnerable library: /Data-Load-and-Copy-using-Python/_static/jquery-3.2.1.js
Dependency Hierarchy:
Found in HEAD commit: fc064b132c13e4214bde3ffae659bafa1d52ae52
jQuery before 3.4.0, as used in Drupal, Backdrop CMS, and other products, mishandles jQuery.extend(true, {}, ...) because of Object.prototype pollution. If an unsanitized source object contained an enumerable __proto__ property, it could extend the native Object.prototype.
Publish Date: 2019-04-20
URL: CVE-2019-11358
Base Score Metrics:
Type: Upgrade version
Origin: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-11358
Release Date: 2019-04-20
Fix Resolution: 3.4.0
A question was raised in an internal Slack channel about getting an S3CredentialsError when trying to sample data from Redshift (not doing any ETL job, so it doesn't really require S3 credentials).
The msg I would expect to get here is: "S3 credentials we not found. S3 functionality is disabled"
Not sure if we both set the inputs wrong, but want to do some investigation on this.
Move examples into docs and/or simplify the usage.
Switching to hypothesis for testing data
The current way is to loop through every row of an 'object' type column to determine if it could potentially be a timestamp (e.g. 2019-01-01) or a float (e.g. Decimal(2.0)). This will be compute-intensive if the dataframe is large.
One idea is to use sampling, but that can lead to false positives.
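A minimal sketch of the sampling idea (the sample size is illustrative): check a random sample of the non-null values instead of every row, accepting the risk of false positives mentioned above.

import pandas as pd

def sample_looks_like_timestamp(series: pd.Series, sample_size: int = 1000) -> bool:
    non_null = series.dropna()
    if non_null.empty:
        return False
    sample = non_null.sample(min(sample_size, len(non_null)), random_state=0)
    try:
        pd.to_datetime(sample)
        return True
    except (ValueError, TypeError):
        return False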
gh-pages branch.
Snowflake has a new method to efficiently export to pandas:
https://docs.snowflake.net/manuals/user-guide/python-connector-pandas.html#
A test with 8463105 rows of data shows a ~20x speed up on Snowflake.
Based on the following release:
https://community.snowflake.com/s/article/4-2-Release-Notes-January-27-2020
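A hedged sketch of what using the connector's pandas export could look like. This assumes the pandas extras of snowflake-connector-python are installed and that the underlying cursor is reachable as .cursor; locopy's to_dataframe could defer to this instead of building the frame from fetched rows:

import locopy
import snowflake.connector

with locopy.Snowflake(dbapi=snowflake.connector, config_yaml="creds.yml") as sf:
    sf.execute("SELECT * FROM my_schema.my_table")  # table name is illustrative
    df = sf.cursor.fetch_pandas_all()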
Just to align better, we should allow the OVERWRITE flag to be set in upload_to_internal.
Would be nice to set up a pre-commit hook for using black.
Please be sure to add trusted reviewers to your codeowners file
Right now if you want to just use the S3 functionality you need to provide some redshift credentials.
This isn't ideal behaviour and we should maybe look into refactoring the code a bit to decouple some of this functionality so that we can interact with S3 independently.
This relates to the Cmd and S3 class.
I think moving to GitHub Actions might make sense. I've been noticing that Travis' support for open source projects has led to long queuing times. GitHub Actions sort of replaces Travis for our purposes.
I'd like to move to a develop / master workflow and update the docs a bit to reflect the proper release instructions. I think it is important to set this up to ensure more structure and better practices for the future. Right now the package is relatively small, so it's not a huge deal in its current state.
@theianrobertson, thoughts on this?
Would be nice to maybe expand the usage.
Maybe even just regular Postgres too?
Just want to have the discussion here vs in the PR.
Proposal: add a flag use_write_pandas to insert_dataframe_to_table which will defer to the write_pandas method rather than running the INSERT INTO statements. This would give people both options.
Basically:
if use_write_pandas:
    # run self.cur.write_pandas(......)
    pass
else:
    insert_query = """INSERT INTO {table_name} {columns} VALUES {values}""".format(
        table_name=table_name, columns=column_sql, values=string_join
    )
We can keep the table creation / metadata part in this scenario.
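For reference, a hedged sketch of what the write_pandas branch could call: the Snowflake connector exposes write_pandas as a module-level helper (snowflake.connector.pandas_tools.write_pandas) that takes the connection rather than the cursor, so assuming the connection object is available it could look like:

from snowflake.connector.pandas_tools import write_pandas

def insert_via_write_pandas(conn, df, table_name):
    # loads the dataframe into an existing table via a temporary stage + COPY
    success, nchunks, nrows, _ = write_pandas(conn, df, table_name)
    return nrows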
Given what locopy is doing, we need a bit of an explanation and warning on SQL injections, particularly around COPY/UNLOAD statements, as these can't be bound via parameterization the way you would for, say, a WHERE clause.
I tried to connect to a local SQLite3 database using locopy and it threw the following error:
ValueError: parameters are of unsupported type
After doing some digging, it looks like the error is being thrown by the default argument for params in locopy.database.Database.execute(). Once I started using params=(), it started working.
e.g.
import sqlite3
import locopy
with locopy.Database(dbapi=sqlite3, database=':memory:') as cmd:
    cmd.execute('''CREATE TABLE stocks (date text, qty real)''', params=())
I tested params=() with locopy.snowflake.Snowflake as well and it worked.
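A minimal sketch (not locopy's actual internals) of how execute could sidestep the unsupported default: only forward params to the DBAPI cursor when the caller actually supplied some.

def execute(cursor, sql, params=None):
    # sqlite3 rejects a non-sequence params value, so omit it unless provided
    if params:
        cursor.execute(sql, params)
    else:
        cursor.execute(sql)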
NULL values in Snowflake are loaded into Python as None. When using Database.to_dataframe, these values remain as None rather than being converted to numpy.nan. As a result, any column containing None is forced to an "object" data type.
This makes it difficult to validate our data, since the data type has already changed. It also necessitates an extra step for type conversion.
I haven't tested this fully, but this issue could be fixed by changing this line:
fetched = [tuple(column for column in row) for row in fetched]
to something like:
fetched = [
    tuple(column if column is not None else np.nan for column in row)
    for row in fetched
]
np.nan seems to be the de facto null value to use with pandas, as it doesn't mess up one's dtypes. If Database.to_dataframe is meant to be a convenient way of porting data into pandas, I think it makes sense for the method to be aware of the issue with None and to handle nulls more gracefully.
Please refer to the documentation requirements to update the SPDX header.
I think it would make sense to change the pinning of the boto3 version to a minimum version.
Sometimes I have connection issues when uploading data via load_and_copy, for example:
botocore.exceptions.ConnectionClosedError:
Connection was closed before we received a valid response from endpoint URL: "https://MY-S3-FILE".
I found a suggestion to use a bigger connection_timeout in this thread. I'm also interested in the retries option (it's zero by default, I guess). Both are attributes of botocore.config.Config.
Can I somehow pass a Config to a Redshift object?
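A minimal sketch of the kind of configuration being asked about, using plain boto3; how (or whether) this can be threaded into locopy's Redshift/S3 objects is the open question here:

import boto3
from botocore.config import Config

config = Config(connect_timeout=60, retries={"max_attempts": 5})
s3_client = boto3.client("s3", config=config)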
I think switching to loguru might clean up the logging portion and make things a bit simpler. Been testing it internally and so far very happy with it.
This could lead to a cleaner code base, remove some of the complexity, and reduce maintenance.
Needs some investigating to scope out the workload etc.
Version: 0.3.7
After downloading data from Redshift via S3, values in boolean columns become t or f instead of True or False. Therefore I can't read a dataframe with (nullable) boolean dtype columns properly.
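A hedged workaround sketch (column and file names are illustrative): map Redshift's 't'/'f' strings back to pandas' nullable boolean dtype after reading the unloaded file.

import pandas as pd

df = pd.read_csv("unloaded_data.csv")
df["flag_col"] = df["flag_col"].map({"t": True, "f": False}).astype("boolean")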
This might be some nice functionality which we can build out.
I can see a bunch of people wrangling with Pandas and then wanting to get this into Redshift.
locopy, since Redshift is just a derivative of it.
Python 3.7.3
locopy==0.3.6
When locopy.Redshift().load_and_copy() is called with both splits=N and copy_options=["IGNOREHEADER AS 1"], N-1 data rows are lost, because load_and_copy doesn't recreate the CSV file's header in the chunks: the first chunk still has the header and is copied correctly, while the first row of each remaining chunk is ignored because of copy_options=["IGNOREHEADER AS 1"].
There's a workaround: remove the CSV file's header and call load_and_copy() without copy_options=["IGNOREHEADER AS 1"].
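A minimal sketch of that workaround (file names are illustrative): rewrite the CSV without its header row so IGNOREHEADER is no longer needed when splits > 1.

import pandas as pd

df = pd.read_csv("data_with_header.csv")
df.to_csv("data_no_header.csv", index=False, header=False)
# then call load_and_copy on "data_no_header.csv" with splits=N and no IGNOREHEADER option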
YAML parser and emitter for Python
Library home page: https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz
Path to dependency file: Data-Load-and-Copy-using-Python
Path to vulnerable library: Data-Load-and-Copy-using-Python,Data-Load-and-Copy-using-Python/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 0b930613055b1f748f7ca0422981cd4c9d47bb5b
A vulnerability was discovered in the PyYAML library in all versions, where it is susceptible to arbitrary code execution when it processes untrusted YAML files through the full_load method or with the FullLoader loader. .load() defaults to using FullLoader and FullLoader is still vulnerable to RCE when run on untrusted input. Applications that use the library to process untrusted input may be vulnerable to this flaw. An attacker could use this flaw to execute arbitrary code on the system by abusing the python/object/new constructor.
The fix for CVE-2020-1747 was not enough to fix this issue.
Publish Date: 2020-07-21
URL: CVE-2020-14343
Base Score Metrics:
There's a strict dependency on a specific version of pyyaml, but sometimes I'm trying to use this alongside other packages that may have a range. For example, if I install pre-commit into my environment first, it installs the latest pyyaml:
https://github.com/pre-commit/pre-commit/blob/master/setup.cfg#L32
What are your thoughts on having a minimum version for pyyaml but leaving the max version out, since this doesn't rely on a huge amount of the yaml package?
The error msg is:
.conda/envs/py37/lib/python3.7/site-packages/asn1crypto/keys.py", line 1065, in unwrap
    'asn1crypto.keys.PublicKeyInfo().unwrap() has been removed, '
asn1crypto._errors.APIException: asn1crypto.keys.PublicKeyInfo().unwrap() has been removed, please use oscrypto.asymmetric.PublicKey().unwrap() instead
which is probably triggered by the new asn1crypto release. We need to add asn1crypto==0.24.0 as a dependency.