Comments (8)

Tishj commented on September 26, 2024

I still think this is a pandas issue. I can't imagine it's intended that numpy arrays are written differently than native lists.

from duckdb.

Tishj commented on September 26, 2024

Performance
Arrays can be preallocated, and the same methods we use to populate a column can be used to populate the LIST/ARRAY conversion.

#10826

Tishj commented on September 26, 2024

I think that's a Pandas issue, no?

Dtenwolde commented on September 26, 2024

I don't understand how that'd be a Pandas issue. Printing the result of duckdb.sql("select [1,2,3]").df() gives the expected result. Doing a similar thing in Pandas:

import pandas as pd

data = [
    ["Alice", 25, ["Marketing", "Social Media"]],
    ["Bob", 30, ["Sales", "Business Development"]],
    ["Charlie", 28, ["Engineering", "Software Development"]]
]

df = pd.DataFrame(data, columns=["Name", "Age", "Department"])
df.to_csv("pandas.csv")

Results in the following file with correct output:

,Name,Age,Department
0,Alice,25,"['Marketing', 'Social Media']"
1,Bob,30,"['Sales', 'Business Development']"
2,Charlie,28,"['Engineering', 'Software Development']"

szarnyasg commented on September 26, 2024

@Dtenwolde thanks for raising the issue. @Tishj here's a reproducer which puts the pandas/DuckDB outputs in contrast.

import duckdb
import pandas as pd

df1 = pd.DataFrame([[[1, 2, 3]]], columns=["c1"])
print(df1)
df1.to_csv("df1.csv", index=False)

df2 = duckdb.sql("select [1,2,3] AS c1").df()
print(df2)
df2.to_csv("df2.csv", index=False)

outputs:

>>> print(df1)
          c1
0  [1, 2, 3]
>>> print(df2)
          c1
0  [1, 2, 3]

df1.csv:

c1
"[1, 2, 3]"

df2.csv:

c1
[1 2 3]

Tishj commented on September 26, 2024

Ah, interesting. I just meant that you are using the to_csv method of pandas's DataFrame class, so it's not our CSV writer that's being used here.

Perhaps the way we construct the lists that make up the produced DataFrame is different, which is causing their CSV writer to act up?


I had a quick look:

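from io import StringIO

import duckdb
import pandas as pd

# DataFrame 1: built directly in pandas from a native Python list
df1 = pd.DataFrame([[[1, 2, 3]]], columns=["c1"])
print("DataFrame 1:")
print(df1)

# Type of the cell value stored in the column
print(df1['c1'][0].__class__)

# Write DataFrame 1 to StringIO
stringio_df1 = StringIO()
df1.to_csv(stringio_df1, index=False)

# DataFrame 2: produced by DuckDB from a LIST value
df2 = duckdb.sql("select [1,2,3] AS c1").df()
print("\nDataFrame 2:")
print(df2)

# Type of the cell value stored in the column
print(df2['c1'][0].__class__)

outputs:

DataFrame 1: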
          c1
0  [1, 2, 3]
<class 'list'>

DataFrame 2:
          c1
0  [1, 2, 3]
<class 'numpy.ndarray'>

We create a numpy.ndarray for the list value, which pandas renders differently when writing the CSV.
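
To isolate the difference without DuckDB, here is a minimal sketch that puts a native list and a numpy.ndarray into otherwise identical one-cell DataFrames and writes both with to_csv. The helper name to_csv_text is made up for this illustration, and the hand-built ndarray cell is only assumed to stand in for what .df() produces.

```python
import io

import numpy as np
import pandas as pd


def to_csv_text(cell):
    """Write a one-row, one-column DataFrame holding `cell`; return the CSV text.

    Hypothetical helper for this illustration, not part of pandas or DuckDB.
    """
    df = pd.DataFrame({"c1": [cell]})
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    return buf.getvalue()


# Native Python list cell, as in df1
list_csv = to_csv_text([1, 2, 3])

# numpy.ndarray cell, standing in for the value DuckDB returns in df2
array_csv = to_csv_text(np.array([1, 2, 3]))

print(list_csv)
print(array_csv)

# The CSV text matches the plain string form of each cell:
# lists use ", " separators, arrays use spaces.
print(str([1, 2, 3]))
print(str(np.array([1, 2, 3])))
```

The list cell contains commas, so the CSV writer quotes it; the array's string form has no commas, so it is written unquoted.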

Dtenwolde commented on September 26, 2024

I found pandas-dev/pandas#48478, which seems related; if I understand correctly, this is their intentional way of writing arrays.
Conversely (and out of curiosity), what is the reason DuckDB converts lists to numpy.ndarray instead of list?

Tishj commented on September 26, 2024

Perhaps we can try creating a list out of the numpy array and see whether that comes with any severe performance impact.
If it doesn't, then I'm fine with restoring the old behavior that way.
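
Until something like that lands, a user-side workaround could look roughly like the sketch below: convert ndarray cells back to Python lists before calling to_csv. This is not the DuckDB-side fix being discussed; the function name ndarray_cells_to_lists is hypothetical, the DataFrame is built by hand to stand in for a .df() result, and the per-cell conversion cost has not been measured here.

```python
import io

import numpy as np
import pandas as pd


def ndarray_cells_to_lists(df, column):
    """Return a copy of `df` with numpy.ndarray cells in `column` turned into lists.

    Hypothetical helper for this illustration; other cell types are left as-is.
    """
    out = df.copy()
    out[column] = out[column].apply(
        lambda cell: cell.tolist() if isinstance(cell, np.ndarray) else cell
    )
    return out


# Hand-built stand-in for duckdb.sql("select [1,2,3] AS c1").df()
df2 = pd.DataFrame({"c1": [np.array([1, 2, 3])]})

fixed = ndarray_cells_to_lists(df2, "c1")

buf = io.StringIO()
fixed.to_csv(buf, index=False)
fixed_csv = buf.getvalue()
print(fixed_csv)
```

With the cells converted, the written value takes the same quoted, comma-separated form as the DataFrame built from native lists.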
