Comments (16)
@guoshuai2016, @lukatera, @MetalRex101, the memory leak is not fixed yet, but we have fixed #239 in the 0.9.1 release, so monitoring is no longer lost when the metrics exporter pod is restarted. The memory leak itself is the next priority.
from clickhouse-operator.
@alex-zaitsev After profiling the heap and goroutines with pprof, we found that the memory leak is caused by a goroutine leak; specifically, too many open sql.DB instances, because a new one was created for every query. Each open sql.DB keeps two goroutines alive, e.g.:

database/sql.(*DB).connectionResetter(0xc0000b66c0, 0x135d160, 0xc0000b2f80)
	/usr/local/go/src/database/sql/sql.go:1013 +0xfb
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:671 +0x194

goroutine 47 [select, 3528 minutes]:
database/sql.(*DB).connectionOpener(0xc0000b6600, 0x135d160, 0xc0000b2600)
	/usr/local/go/src/database/sql/sql.go:1000 +0xe8
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:670 +0x15e
Thus the quick fix would be to close the sql.DB after each query; however, the database/sql documentation advises against that:

// The returned DB is safe for concurrent use by multiple goroutines
// and maintains its own pool of idle connections. Thus, the Open
// function should be called just once. It is rarely necessary to
// close a DB.
func Open(driverName, dataSourceName string) (*DB, error) {

So instead I keep each sql.DB in a global map, which acts as a singleton per connection string. [#270 ]
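To make the singleton idea concrete, here is a minimal, self-contained sketch (getDB, the map layout, and the stub driver are illustrative, not the actual #270 code): one *sql.DB is cached per connection string, so repeated metric queries reuse the same pool instead of spawning a fresh connectionOpener/connectionResetter goroutine pair every time.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"sync"
)

// Cache one *sql.DB per connection string. sql.DB is itself a connection
// pool and is safe for concurrent use, so one instance per endpoint is enough.
var (
	poolMu sync.Mutex
	pool   = map[string]*sql.DB{}
)

// getDB returns the cached *sql.DB for dsn, opening it on first use.
func getDB(driverName, dsn string) (*sql.DB, error) {
	poolMu.Lock()
	defer poolMu.Unlock()
	if db, ok := pool[dsn]; ok {
		return db, nil // reuse: no new connectionOpener/connectionResetter goroutines
	}
	db, err := sql.Open(driverName, dsn)
	if err != nil {
		return nil, err
	}
	pool[dsn] = db
	return db, nil
}

// Minimal stub driver so the sketch runs without a real ClickHouse server.
type stubDriver struct{}
type stubConn struct{}

func (stubDriver) Open(string) (driver.Conn, error)  { return stubConn{}, nil }
func (stubConn) Prepare(string) (driver.Stmt, error) { return nil, errors.New("unused") }
func (stubConn) Close() error                        { return nil }
func (stubConn) Begin() (driver.Tx, error)           { return nil, errors.New("unused") }

func init() { sql.Register("stub", stubDriver{}) }

func main() {
	a, _ := getDB("stub", "tcp://chi-0:9000")
	b, _ := getDB("stub", "tcp://chi-0:9000")
	fmt.Println("cached:", a == b) // prints "cached: true"
}
```

Keying the map by DSN keeps one pool per ClickHouse host, which matches how the exporter fans out over replicas.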
Apart from the leak, we also found that clickHouseQueryScanRows may return partial data. After investigation, we found it is caused by sql.Rows being closed by the deadline context after the query completes but before the scan finishes. So we now cancel the deadline context only after the scan. [#270 ]
These changes have been tested in our staging and production environments, and the memory usage of the metrics exporter is much more stable now.
Please review the merge request.
Fixed in https://github.com/Altinity/clickhouse-operator/releases/tag/0.9.2
@MetalRex101 , could you share your Kubernetes version and ClickHouse cluster setup? We cannot reproduce the memory leak in our environment. Could you check whether memory grows in the 'clickhouse-operator' or the 'metrics-exporter' container? It can be checked with a command like:
# kubectl top pod <your_operator_pod_name> --containers=true -n <your_namespace>
Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.9-gke.15", GitCommit:"b48a8d693e191192e27c2f807daa51b54d0b0a61", GitTreeState:"clean", BuildDate:"2019-08-12T17:49:30Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
Operator pod top:
POD                                    NAME                  CPU(cores)   MEMORY(bytes)
clickhouse-operator-6b7f548688-m5857   clickhouse-operator   8m           36Mi
ClickHouse cluster configuration:
Name:         chi-clickhouse-db-common-configd
Namespace:    clickhouse
Labels:       clickhouse.altinity.com/app=chop
              clickhouse.altinity.com/chi=clickhouse-db
              clickhouse.altinity.com/chop=0.6.0
Annotations:  <none>
Data
====
remote_servers.xml:
----
<yandex>
    <remote_servers>
        <!-- User-specified clusters -->
        <default>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-0-1</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-1-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-1-1</host>
                    <port>9000</port>
                </replica>
            </shard>
        </default>
        <!-- Autogenerated clusters -->
        <all-replicated>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-0-1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-1-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-1-1</host>
                    <port>9000</port>
                </replica>
            </shard>
        </all-replicated>
        <all-sharded>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-0</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-1</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-1-0</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-1-1</host>
                    <port>9000</port>
                </replica>
            </shard>
        </all-sharded>
    </remote_servers>
</yandex>
zookeeper.xml:
----
<yandex>
    <zookeeper>
        <node>
            <host>zookeeper-0.zookeeper-headless.clickhouse</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-1.zookeeper-headless.clickhouse</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-2.zookeeper-headless.clickhouse</host>
            <port>2181</port>
        </node>
    </zookeeper>
    <distributed_ddl>
        <path>/clickhouse/clickhouse-db/task_queue/ddl</path>
    </distributed_ddl>
</yandex>
01-clickhouse-operator-listen.xml:
----
<yandex>
    <!-- Listen wildcard address to allow accepting connections from other containers and host network. -->
    <listen_host>::</listen_host>
    <listen_host>0.0.0.0</listen_host>
    <listen_try>1</listen_try>
</yandex>
02-clickhouse-operator-logger.xml:
----
<yandex>
    <logger>
        <console>1</console>
    </logger>
</yandex>
Thanks, @MetalRex101. Do you have the metrics-exporter pod running? If you upgraded from 0.5.0, you may be missing it. You can check your operator logs as well; if the metrics exporter is missing, the operator complains a lot. This is a bug that has already been fixed.
Also, please list your CHI spec, if possible.
We are going to release 0.7.0 later this week, so we are mainly testing that version. The memory issue may already be fixed there.
@alex-zaitsev, what is a CHI spec? How can I get that?
If the new release is coming soon, we can wait for it and check for memory leaks. If no leaks appear, we can close this issue. Is that OK for you?
@MetalRex101 , a CHI spec is your ClickHouseInstallation resource specification; in your example that is 'clickhouse-db'.
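For reference, a minimal CHI spec looks roughly like the sketch below (the layout values here are only inferred from the remote_servers config posted above, 2 shards x 2 replicas; they are illustrative, not copied from the actual resource):

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "clickhouse-db"
spec:
  configuration:
    clusters:
      - name: "default"
        layout:
          shardsCount: 2
          replicasCount: 2
```

You can dump the real one with `kubectl get chi clickhouse-db -o yaml` in the namespace where it is installed.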
We have a similar issue, but with version 0.8.0. Our operator pod is constantly evicted because of excessive resource usage. Now that metrics have been separated from the operator, we can see that it is the metrics pod that uses a lot of memory: over 2 GiB before it gets evicted, which seems far too much just for metrics.
@lukatera , what is the size of your cluster, and do you actually use monitoring (i.e., the Prometheus integration)?
The cluster is 8 shards, each replicated 2 times. We use Prometheus to collect metrics from clickhouse-metrics.
Here's the graph of container_memory_usage_bytes{container_name="metrics-exporter"} over the past day.
@lukatera , thanks. We are making some fixes to the metrics exporter now; we will look into the possible memory leak and provide a fix.
Any update here? We are also encountering the same issue.
Fix merged into 0.9.2, thanks @guoshuai2016.
Please take a look.