Comments (8)
I still think this is a pandas issue; I can't imagine it's intended that numpy arrays are written differently than native lists.
from duckdb.
Performance: arrays can be preallocated, and the same methods we use to populate a column can be used to populate the LIST/ARRAY conversion.
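A rough illustration of the preallocation pattern being referred to (this is a generic sketch, not DuckDB's actual conversion code): allocate the output buffer once at its final size, then fill it in place, rather than growing a Python list element by element.

```python
import numpy as np

n = 5
# Preallocate the output buffer once at its final size, then fill it
# in place; the same pattern works for a column or a LIST child array.
out = np.empty(n, dtype=np.int64)
for i in range(n):
    out[i] = i * i
print(out)  # squares of 0..4
```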
I think that's a Pandas issue, no?
I don't understand how that'd be a Pandas issue. Printing the result of duckdb.sql("select [1,2,3]").df() gives the expected result. Doing a similar thing in Pandas:
import pandas as pd
data = [
["Alice", 25, ["Marketing", "Social Media"]],
["Bob", 30, ["Sales", "Business Development"]],
["Charlie", 28, ["Engineering", "Software Development"]]
]
df = pd.DataFrame(data, columns=["Name", "Age", "Department"])
df.to_csv("pandas.csv")
Results in the following file with correct output:
,Name,Age,Department
0,Alice,25,"['Marketing', 'Social Media']"
1,Bob,30,"['Sales', 'Business Development']"
2,Charlie,28,"['Engineering', 'Software Development']"
@Dtenwolde thanks for raising the issue. @Tishj here's a reproducer which puts the pandas/DuckDB outputs in contrast.
import duckdb
import pandas as pd
df1 = pd.DataFrame([[[1, 2, 3]]], columns=["c1"])
print(df1)
df1.to_csv("df1.csv", index=False)
df2 = duckdb.sql("select [1,2,3] AS c1").df()
print(df2)
df2.to_csv("df2.csv", index=False)
outputs:
>>> print(df1)
c1
0 [1, 2, 3]
>>> print(df2)
c1
0 [1, 2, 3]
df1.csv:
c1
"[1, 2, 3]"
df2.csv:
c1
[1 2 3]
Ah, interesting. I just meant that you are using pandas's to_csv method of the DataFrame class, so it's not our CSV writer that's being used here.
Perhaps the way we construct the lists that make up the produced DataFrame is different, which is causing their CSV writer to act up?
I had a quick look:
from io import StringIO

import duckdb
import pandas as pd

# DataFrame 1
df1 = pd.DataFrame([[[1, 2, 3]]], columns=["c1"])
print("DataFrame 1:")
print(df1)
print(df1['c1'][0].__class__)
# Write DataFrame 1 to StringIO
stringio_df1 = StringIO()
df1.to_csv(stringio_df1, index=False)
# DataFrame 2
df2 = duckdb.sql("select [1,2,3] AS c1").df()
print("\nDataFrame 2:")
print(df2)
print(df2['c1'][0].__class__)
DataFrame 1:
c1
0 [1, 2, 3]
<class 'list'>
DataFrame 2:
c1
0 [1, 2, 3]
<class 'numpy.ndarray'>
We create a numpy.ndarray, which gets rendered differently by pandas.
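The rendering difference can be reproduced without DuckDB at all, just by comparing how pandas's to_csv stringifies a cell holding a plain list versus one holding an ndarray (a minimal sketch):

```python
import io

import numpy as np
import pandas as pd

# Same logical data; one cell holds a Python list, the other an ndarray.
df_list = pd.DataFrame({"c1": [[1, 2, 3]]})
df_arr = pd.DataFrame({"c1": [np.array([1, 2, 3])]})

buf_list, buf_arr = io.StringIO(), io.StringIO()
df_list.to_csv(buf_list, index=False)
df_arr.to_csv(buf_arr, index=False)

# str([1, 2, 3]) contains commas, so the CSV writer quotes it;
# str(np.array([1, 2, 3])) has no commas, so it is written bare.
print(buf_list.getvalue())  # second line: "[1, 2, 3]"
print(buf_arr.getvalue())   # second line: [1 2 3]
```

The CSV writer just calls str() on object cells, so the commas in the list's repr trigger quoting while the space-separated ndarray repr does not.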
I found pandas-dev/pandas#48478, which seems related; if I understand correctly, this is their intentional way of writing arrays.
Conversely (and out of curiosity), what is the reason DuckDB converts lists to numpy.ndarray instead of list?
Perhaps we can try creating a list out of the numpy array and see if that comes with any severe performance impact.
If it doesn't, then I'm fine restoring the old behavior that way.
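Until something changes on either side, a user-side workaround is to convert ndarray cells back to plain lists before writing the CSV (a sketch; the per-cell tolist() cost is assumed to be acceptable for typical frame sizes):

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the frame returned by duckdb.sql(...).df().
df = pd.DataFrame({"c1": [np.array([1, 2, 3])]})

# Convert each ndarray cell to a Python list so to_csv renders it
# as "[1, 2, 3]" instead of [1 2 3].
df["c1"] = df["c1"].map(
    lambda v: v.tolist() if isinstance(v, np.ndarray) else v
)

buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())  # second line: "[1, 2, 3]"
```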