Comments (4)
OK, I've got a reproducer:
INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a UInt64, d UInt64', 1, 64, 2)
LIMIT 100000
Query id: b40c068e-10f5-4f77-b598-0cbfffb37943
Elapsed: 0.031 sec.
Received exception from server (version 24.2.1):
Code: 1002. DB::Exception: Received from localhost:9000. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. ()
If we have two columns with long strings, it works:
INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a String, d String', 1, 1024, 2)
LIMIT 100000
Query id: 5244e390-3f49-4ce6-9081-c25af8644146
Ok.
0 rows in set. Elapsed: 3.304 sec. Processed 100.19 thousand rows, 104.38 MB (30.32 thousand rows/s., 31.59 MB/s.)
Peak memory usage: 258.04 MiB.
If we have a single UInt64 column with enough rows, it fails:
INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a UInt64', 1, 1024, 2)
LIMIT 100000
Query id: 581138a6-f3e2-4f38-b114-8e2e45dfda34
Elapsed: 0.031 sec.
Received exception from server (version 24.2.1):
Code: 1002. DB::Exception: Received from localhost:9000. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. ()
But if we use fewer rows, it works:
INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a UInt64', 1, 1024, 2)
LIMIT 10000
Query id: 5059d82c-f15a-4a19-8123-f7c9cd5388af
Ok.
0 rows in set. Elapsed: 0.155 sec. Processed 10.00 thousand rows, 80.00 KB (64.49 thousand rows/s., 515.95 KB/s.)
Peak memory usage: 16.86 KiB.
It doesn't seem exclusive to UInt64 either: it looks like the compression breaks once a column block reaches some size (see the standalone zlib sketch after the failing example below for one possible cause). For example, with Int8 this works:
INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a Int8', 1, 1024, 2)
LIMIT 550000
Query id: df0dee41-9aa5-467f-8db1-58ff16b25ee1
Ok.
But if we insert 560k rows it fails:
INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a Int8', 1, 1024, 2)
LIMIT 560000
Query id: 22150098-bd2a-4fac-a5bf-f4cb04b1c977
Elapsed: 0.025 sec.
Received exception from server (version 24.2.1):
Code: 1002. DB::Exception: Received from localhost:9000. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. ()
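My guess (not verified against the Arrow writer code) is a buffer-sizing problem: generateRandom produces effectively incompressible bytes, and deflate output on incompressible input is slightly larger than the input, so a destination buffer sized at (or near) the raw column chunk size overflows once the chunk is big enough. Here's a standalone zlib sketch, purely illustrative and not ClickHouse/Arrow code, that reproduces the same class of error:

#include <zlib.h>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    // ~560 KB of random bytes, mimicking 560k random Int8 values.
    std::vector<unsigned char> src(560000);
    std::mt19937 gen(1);
    for (auto & b : src)
        b = static_cast<unsigned char>(gen());

    // Destination sized exactly to the input. Incompressible data grows
    // slightly under deflate (stored-block framing plus header/trailer),
    // so this buffer is too small and compress2 fails.
    std::vector<unsigned char> dst(src.size());
    uLongf dst_len = dst.size();
    int rc = compress2(dst.data(), &dst_len, src.data(), src.size(), Z_DEFAULT_COMPRESSION);
    std::printf("dst == input size:    rc=%d (Z_BUF_ERROR is %d)\n", rc, Z_BUF_ERROR);

    // Destination sized with zlib's own worst-case bound: succeeds.
    dst.resize(compressBound(src.size()));
    dst_len = dst.size();
    rc = compress2(dst.data(), &dst_len, src.data(), src.size(), Z_DEFAULT_COMPRESSION);
    std::printf("dst == compressBound: rc=%d, %lu bytes out\n", rc, static_cast<unsigned long>(dst_len));
}

Compiled with -lz, the first call returns Z_BUF_ERROR (-5) and the second succeeds with output a bit over 560000 bytes.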
So it seems there's some threshold at which it breaks. Notably, the failing fixed-width cases all have raw column chunks of at least ~560 KB (100k UInt64 ≈ 800 KB, 560k Int8 = 560 KB), while the passing ones stay below that (10k UInt64 = 80 KB, 550k Int8 = 550 KB). If we enable the custom encoder, it works even when inserting 1M rows:
INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip', output_format_parquet_use_custom_encoder = 1
SELECT *
FROM generateRandom('a Int8', 1, 1024, 2)
LIMIT 1000000
Query id: f1d40025-ed20-4508-87bc-11f2a18903d6
Ok.
The error seems to be coming from zlib, and there is very little information about it on the internet. Checking the code, it seems that zlib makes a wrong estimate of the output buffer size to allocate? The only thing I've been able to find is the bug report apache/arrow#2756, where the problem seems to be caused by what's described in apache/arrow#2756 (comment). Maybe it's a matter of upgrading?
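For reference, zlib exposes its own worst-case output estimates: compressBound() for one-shot compression and deflateBound() for a configured stream. Any writer sizing its buffer below these can hit Z_BUF_ERROR on incompressible data. A minimal sketch of the bound (again just an illustration, not the Arrow code):

#include <zlib.h>
#include <cstdio>

int main()
{
    z_stream strm{};
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;

    // Worst-case deflate output for this stream's actual parameters
    // (window bits, memory level), for a 560000-byte input.
    const uLong input_size = 560000;
    const uLong bound = deflateBound(&strm, input_size);
    std::printf("input=%lu, worst-case output=%lu (overhead=%lu bytes)\n",
                input_size, bound, bound - input_size);

    deflateEnd(&strm);
}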
@tb-fjurado, since the issue you're facing seems to be exclusively related to Arrow, have you tried enabling output_format_parquet_use_custom_encoder? Judging by the comments and changes here, it looks stable enough already.
Yes, I tried enabling it and it worked as expected. I see that #63210 has already been merged, so I understand the custom encoder is the way forward and we don't want to invest more time in Arrow, right?
Also, does anybody know starting from which CH version the custom encoder could be considered stable, even if it wasn't the default? I saw a comment here from Aug 2023 saying it should be good enough to enable, and I haven't found many (if any) changes to the encoder in later PRs. Just to know whether we can enable it right away on our current CH deployments or whether we need to upgrade first.
Thanks!