Comments (8)
I still think this is a pandas issue; I can't imagine it's intended that numpy arrays are written differently than native lists.
from duckdb.
Performance: arrays can be preallocated, and the same methods we use to populate a column can be used to populate the LIST/ARRAY conversion.
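A rough illustration of the preallocation pattern being referred to (this is a generic sketch, not DuckDB's actual conversion code): allocate the output buffer once at its final size, then fill it in place, rather than growing a Python list element by element.

```python
import numpy as np

n = 5
# Preallocate the output buffer once at its final size, then fill it
# in place; the same pattern works for a column or a LIST child array.
out = np.empty(n, dtype=np.int64)
for i in range(n):
    out[i] = i * i
print(out)  # squares of 0..4
```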
I think that's a Pandas issue, no?
I don't understand how that'd be a Pandas issue. Printing the result of duckdb.sql("select [1,2,3]").df() gives the expected result. Doing a similar thing in Pandas:
import pandas as pd
data = [
["Alice", 25, ["Marketing", "Social Media"]],
["Bob", 30, ["Sales", "Business Development"]],
["Charlie", 28, ["Engineering", "Software Development"]]
]
df = pd.DataFrame(data, columns=["Name", "Age", "Department"])
df.to_csv("pandas.csv")
Results in the following file with correct output:
,Name,Age,Department
0,Alice,25,"['Marketing', 'Social Media']"
1,Bob,30,"['Sales', 'Business Development']"
2,Charlie,28,"['Engineering', 'Software Development']"
@Dtenwolde thanks for raising the issue. @Tishj here's a reproducer which puts the pandas/DuckDB outputs in contrast.
import duckdb
import pandas as pd
df1 = pd.DataFrame([[[1, 2, 3]]], columns=["c1"])
print(df1)
df1.to_csv("df1.csv", index=False)
df2 = duckdb.sql("select [1,2,3] AS c1").df()
print(df2)
df2.to_csv("df2.csv", index=False)
outputs:
>>> print(df1)
c1
0 [1, 2, 3]
>>> print(df2)
c1
0 [1, 2, 3]
df1.csv:
c1
"[1, 2, 3]"
df2.csv:
c1
[1 2 3]
Ah, interesting. I just meant that you are using pandas's to_csv method of the DataFrame class, so it's not our CSV writer that's being used here.
Perhaps the way we construct the lists that make up the produced DataFrame is different, which is causing their CSV writer to act up?
I had a quick look:
from io import StringIO

import duckdb
import pandas as pd

# DataFrame 1
df1 = pd.DataFrame([[[1, 2, 3]]], columns=["c1"])
print("DataFrame 1:")
print(df1)
print(df1['c1'][0].__class__)
# Write DataFrame 1 to StringIO
stringio_df1 = StringIO()
df1.to_csv(stringio_df1, index=False)
# DataFrame 2
df2 = duckdb.sql("select [1,2,3] AS c1").df()
print("\nDataFrame 2:")
print(df2)
print(df2['c1'][0].__class__)
DataFrame 1:
c1
0 [1, 2, 3]
<class 'list'>
DataFrame 2:
c1
0 [1, 2, 3]
<class 'numpy.ndarray'>
We create a numpy.ndarray, which gets rendered differently by pandas.
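The rendering difference can be reproduced without DuckDB at all, just by comparing how pandas's to_csv stringifies a cell holding a plain list versus one holding an ndarray (a minimal sketch):

```python
import io

import numpy as np
import pandas as pd

# Same logical data; one cell holds a Python list, the other an ndarray.
df_list = pd.DataFrame({"c1": [[1, 2, 3]]})
df_arr = pd.DataFrame({"c1": [np.array([1, 2, 3])]})

buf_list, buf_arr = io.StringIO(), io.StringIO()
df_list.to_csv(buf_list, index=False)
df_arr.to_csv(buf_arr, index=False)

# str([1, 2, 3]) contains commas, so the CSV writer quotes it;
# str(np.array([1, 2, 3])) has no commas, so it is written bare.
print(buf_list.getvalue())  # second line: "[1, 2, 3]"
print(buf_arr.getvalue())   # second line: [1 2 3]
```

The CSV writer just calls str() on object cells, so the commas in the list's repr trigger quoting while the space-separated ndarray repr does not.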
I found pandas-dev/pandas#48478, which seems related; if I understand correctly, this is their intentional way of writing arrays.
Conversely (and out of curiosity), what is the reason DuckDB converts lists to numpy.ndarray instead of list?
Perhaps we can try creating a list out of the numpy array and see if that comes with any severe performance impact.
If it doesn't, then I'm fine restoring the old behavior that way.
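Until something changes on either side, a user-side workaround is to convert ndarray cells back to plain lists before writing the CSV (a sketch; the per-cell tolist() cost is assumed to be acceptable for typical frame sizes):

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the frame returned by duckdb.sql(...).df().
df = pd.DataFrame({"c1": [np.array([1, 2, 3])]})

# Convert each ndarray cell to a Python list so to_csv renders it
# as "[1, 2, 3]" instead of [1 2 3].
df["c1"] = df["c1"].map(
    lambda v: v.tolist() if isinstance(v, np.ndarray) else v
)

buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())  # second line: "[1, 2, 3]"
```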