Comments (10)
Can you please post the code you are using?
from locopy.
import os
import glob
import shutil
import locopy
import pandas as pd
from typing import Optional, Sequence
def db_to_csv(
query: str,
data_dir: Optional[str] = None,
separator: str = ",",
bucket: str = "analytics",
bucket_dir: str = "dwh",
) -> Sequence[str]:
creds = get_creds()
if data_dir and not os.path.exists(data_dir):
os.makedirs(data_dir)
with locopy.Redshift(
dbapi=psycopg2,
host=creds.get("host"),
port=creds.get("port"),
dbname=creds.get("dbname"),
user=creds.get("username"),
password=creds.get("password"),
) as redshift_locopy:
redshift_locopy.unload_and_copy(
query=query,
s3_bucket=bucket,
s3_folder=bucket_dir,
export_path=False,
raw_unload_path=data_dir,
delimiter=separator,
delete_s3_after=True,
parallel_off=False,
unload_options=["csv header gzip parallel on allowoverwrite"],
)
filenames = glob.glob(os.path.join(data_dir or os.getcwd(), "*part_00.gz"))
return filenames
query = "SELECT * FROM stats"
filenames = db_to_csv(query, data_dir="temp", bucket="foo", bucket_dir="bar")
chunks = [
pd.read_csv(
filename,
na_values=[""],
keep_default_na=False,
parse_dates=["current_dt"],
dtype={"some_boolean_column_name": "boolean"},
)
for filename in filenames
]
data = pd.concat(chunks)
from locopy.
So Redshift is unloading a boolean column as t
\ f
?
Just want to make sure I'm understanding this properly.
I'm not sure we can do much about that if it is unloading it like so.
You might need to do some post processing on the data to get into Pandas the way you like.
from locopy.
Yes, you're correct.
As for pandas, while using read_sql
I haven't got any problems with boolean (it directly treats it as True
or False
). I'm just guessing that explicit post processing can by omitted if that transformation is performed on locopy's side
from locopy.
Let me think about this a bit more in terms of locopy
supporting this natively.
Some quick solutions:
- use
CASE
statements and output the stringsTrue
/False
. This should allow pandas to read it I think. - Some form of post processing before data goes into Pandas or in Pandas.
@jborchma @dasjyao @theianrobertson any thoughts?
from locopy.
If we are talking about Redshift -> Pandas, I believe the utility function to_dataframe()
does not suffer from this issue.
Sample code to read redshift table to pandas dataframe directly
with locopy.Redshift(dbapi=pg8000, **redshift_cred) as cmd:
cmd.execute(query)
data = cmd.to_dataframe()
from locopy.
Sample code to read redshift table to pandas dataframe directly
Would be nice if we had Redshift -> S3 -> Pandas option
from locopy.
Would be nice if we had Redshift -> S3 -> Pandas option
I guess I would like to understand more on the use case for this option. @lisovskey can you please provide some specific examples where you would find this option to be useful? Current version of locopy does provide functionalities to transfer data from Redshift -> S3 and Redshift -> Pandas separately.
from locopy.
I guess I would like to understand more on the use case for this option. @lisovskey can you please provide some specific examples where you would find this option to be useful? Current version of locopy does provide functionalities to transfer data from Redshift -> S3 and Redshift -> Pandas separately.
I use Redshift -> S3 -> CSV option to reduce download time of huge data amount. And I have to do additional CSV -> Pandas step. It's not the point of the issue but maybe skipping of CSV export with direct download from S3 can fix the problem with True/False.
from locopy.
@lisovskey Sorry for the delay around this. This is stale issue, so closing this for now. If you want to discuss further please feel free to reopen it.
from locopy.
Related Issues (20)
- CVE-2020-14343 (High) detected in PyYAML-5.3.1.tar.gz
- Need to add an else clause when parsing column data type HOT 8
- Move to github actions from travis
- Max retries and connection timeout arguments HOT 3
- Drop Python 3.5 support
- Add support for edgetest (config)
- CVE-2021-41496 (High) detected in numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- CVE-2021-34141 (Medium) detected in numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Move to main and dev for branch names
- fix build package action
- numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl: 1 vulnerabilities (highest severity is: 5.3)
- Don't log error if no files unloaded HOT 1
- update to use pyproject.toml
- find_column_type incorrectly identifying certain date related columns HOT 8
- edgetest action is failing
- Unload don't create files while concurrency scaling enabled HOT 8
- CVE-2020-11022 (Medium) detected in jquery-3.2.1.js HOT 1
- Any one know how to update multiple rows to a snowflake table from python (preferably pandas). Has data around 1 lakh records that needs to be updated to snowflake tables HOT 4
- Codeowners file
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from locopy.