Giter VIP home page Giter VIP logo

Comments (10)

jborchma avatar jborchma commented on June 3, 2024

Can you please post the code you are using?

from locopy.

lisovskey avatar lisovskey commented on June 3, 2024
import os
import glob
import shutil
import locopy
import pandas as pd
from typing import Optional, Sequence


def db_to_csv(
    query: str,
    data_dir: Optional[str] = None,
    separator: str = ",",
    bucket: str = "analytics",
    bucket_dir: str = "dwh",
) -> Sequence[str]:
    creds = get_creds()
    if data_dir and not os.path.exists(data_dir):
        os.makedirs(data_dir)
    with locopy.Redshift(
            dbapi=psycopg2,
            host=creds.get("host"),
            port=creds.get("port"),
            dbname=creds.get("dbname"),
            user=creds.get("username"),
            password=creds.get("password"),
    ) as redshift_locopy:
        redshift_locopy.unload_and_copy(
            query=query,
            s3_bucket=bucket,
            s3_folder=bucket_dir,
            export_path=False,
            raw_unload_path=data_dir,
            delimiter=separator,
            delete_s3_after=True,
            parallel_off=False,
            unload_options=["csv header gzip parallel on allowoverwrite"],
        )
    filenames = glob.glob(os.path.join(data_dir or os.getcwd(), "*part_00.gz"))
    return filenames


query = "SELECT * FROM stats"
filenames = db_to_csv(query, data_dir="temp", bucket="foo", bucket_dir="bar")
chunks = [
    pd.read_csv(
        filename,
        na_values=[""],
        keep_default_na=False,
        parse_dates=["current_dt"],
        dtype={"some_boolean_column_name": "boolean"},
    )
    for filename in filenames
]
data = pd.concat(chunks)

from locopy.

fdosani avatar fdosani commented on June 3, 2024

So Redshift is unloading a boolean column as t \ f?
Just want to make sure I'm understanding this properly.

I'm not sure we can do much about that if it is unloading it like so.
You might need to do some post processing on the data to get into Pandas the way you like.

from locopy.

lisovskey avatar lisovskey commented on June 3, 2024

Yes, you're correct.
As for pandas, while using read_sql I haven't got any problems with boolean (it directly treats it as True or False). I'm just guessing that explicit post processing can by omitted if that transformation is performed on locopy's side

from locopy.

fdosani avatar fdosani commented on June 3, 2024

Let me think about this a bit more in terms of locopy supporting this natively.

Some quick solutions:

  • use CASE statements and output the strings True / False. This should allow pandas to read it I think.
  • Some form of post processing before data goes into Pandas or in Pandas.

@jborchma @dasjyao @theianrobertson any thoughts?

from locopy.

dasjyao avatar dasjyao commented on June 3, 2024

If we are talking about Redshift -> Pandas, I believe the utility function to_dataframe() does not suffer from this issue.

Sample code to read redshift table to pandas dataframe directly

with locopy.Redshift(dbapi=pg8000, **redshift_cred) as cmd:
    cmd.execute(query)
    data = cmd.to_dataframe()

from locopy.

lisovskey avatar lisovskey commented on June 3, 2024

Sample code to read redshift table to pandas dataframe directly

Would be nice if we had Redshift -> S3 -> Pandas option

from locopy.

dasjyao avatar dasjyao commented on June 3, 2024

Would be nice if we had Redshift -> S3 -> Pandas option

I guess I would like to understand more on the use case for this option. @lisovskey can you please provide some specific examples where you would find this option to be useful? Current version of locopy does provide functionalities to transfer data from Redshift -> S3 and Redshift -> Pandas separately.

from locopy.

lisovskey avatar lisovskey commented on June 3, 2024

I guess I would like to understand more on the use case for this option. @lisovskey can you please provide some specific examples where you would find this option to be useful? Current version of locopy does provide functionalities to transfer data from Redshift -> S3 and Redshift -> Pandas separately.

I use Redshift -> S3 -> CSV option to reduce download time of huge data amount. And I have to do additional CSV -> Pandas step. It's not the point of the issue but maybe skipping of CSV export with direct download from S3 can fix the problem with True/False.

from locopy.

fdosani avatar fdosani commented on June 3, 2024

@lisovskey Sorry for the delay around this. This is stale issue, so closing this for now. If you want to discuss further please feel free to reopen it.

from locopy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.