Comments (16)
@guoshuai2016, @lukatera, @MetalRex101, the memory leak is not fixed yet, but we have fixed #239 in the 0.9.1 release, so monitoring is no longer lost when the metrics exporter pod is restarted. The memory leak itself is the next priority.
from clickhouse-operator.
@alex-zaitsev After profiling the heap and goroutines with pprof, we found that the memory leak is caused by a goroutine leak; specifically, too many open sql.DB instances, because a new one was created for every query. Each open sql.DB keeps two goroutines alive, e.g.:

database/sql.(*DB).connectionResetter(0xc0000b66c0, 0x135d160, 0xc0000b2f80)
	/usr/local/go/src/database/sql/sql.go:1013 +0xfb
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:671 +0x194

goroutine 47 [select, 3528 minutes]:
database/sql.(*DB).connectionOpener(0xc0000b6600, 0x135d160, 0xc0000b2600)
	/usr/local/go/src/database/sql/sql.go:1000 +0xe8
created by database/sql.OpenDB
	/usr/local/go/src/database/sql/sql.go:670 +0x15e
Thus the quick fix would be to close the sql.DB after each query; however, the database/sql documentation advises against that:

// The returned DB is safe for concurrent use by multiple goroutines
// and maintains its own pool of idle connections. Thus, the Open
// function should be called just once. It is rarely necessary to
// close a DB.
func Open(driverName, dataSourceName string) (*DB, error) {

So instead I keep each sql.DB in a global map, which acts as a singleton per connection string. [#270 ]
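To make the singleton idea concrete, here is a minimal, self-contained sketch (getDB, the map layout, and the stub driver are illustrative, not the actual #270 code): one *sql.DB is cached per connection string, so repeated metric queries reuse the same pool instead of spawning a fresh connectionOpener/connectionResetter goroutine pair every time.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"sync"
)

// Cache one *sql.DB per connection string. sql.DB is itself a connection
// pool and is safe for concurrent use, so one instance per endpoint is enough.
var (
	poolMu sync.Mutex
	pool   = map[string]*sql.DB{}
)

// getDB returns the cached *sql.DB for dsn, opening it on first use.
func getDB(driverName, dsn string) (*sql.DB, error) {
	poolMu.Lock()
	defer poolMu.Unlock()
	if db, ok := pool[dsn]; ok {
		return db, nil // reuse: no new connectionOpener/connectionResetter goroutines
	}
	db, err := sql.Open(driverName, dsn)
	if err != nil {
		return nil, err
	}
	pool[dsn] = db
	return db, nil
}

// Minimal stub driver so the sketch runs without a real ClickHouse server.
type stubDriver struct{}
type stubConn struct{}

func (stubDriver) Open(string) (driver.Conn, error)  { return stubConn{}, nil }
func (stubConn) Prepare(string) (driver.Stmt, error) { return nil, errors.New("unused") }
func (stubConn) Close() error                        { return nil }
func (stubConn) Begin() (driver.Tx, error)           { return nil, errors.New("unused") }

func init() { sql.Register("stub", stubDriver{}) }

func main() {
	a, _ := getDB("stub", "tcp://chi-0:9000")
	b, _ := getDB("stub", "tcp://chi-0:9000")
	fmt.Println("cached:", a == b) // prints "cached: true"
}
```

Keying the map by DSN keeps one pool per ClickHouse host, which matches how the exporter fans out over replicas.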
Apart from the leak, we also found that clickHouseQueryScanRows may return partial data. After investigation, we found it is caused by sql.Rows being closed by the deadline context after the query completes but before the scan finishes. So we now cancel the deadline context only after the scan. [#270 ]
These changes have been tested in our staging and production environments, and the memory usage of the metrics exporter is much more stable now.
Please review the merge request.
Fixed in https://github.com/Altinity/clickhouse-operator/releases/tag/0.9.2
@MetalRex101 , could you share your Kubernetes version and ClickHouse cluster setup? We cannot reproduce the memory leak in our environment. Could you check whether memory grows in the 'clickhouse-operator' or the 'metrics-exporter' container? It can be checked with a command like:
# kubectl top pod <your_operator_pod_name> --containers=true -n <your_namespace>
Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.9-gke.15", GitCommit:"b48a8d693e191192e27c2f807daa51b54d0b0a61", GitTreeState:"clean", BuildDate:"2019-08-12T17:49:30Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
Operator pod top:
POD                                    NAME                  CPU(cores)   MEMORY(bytes)
clickhouse-operator-6b7f548688-m5857   clickhouse-operator   8m           36Mi
ClickHouse cluster configuration:
Name:         chi-clickhouse-db-common-configd
Namespace:    clickhouse
Labels:       clickhouse.altinity.com/app=chop
              clickhouse.altinity.com/chi=clickhouse-db
              clickhouse.altinity.com/chop=0.6.0
Annotations:  <none>
Data
====
remote_servers.xml:
----
<yandex>
    <remote_servers>
        <!-- User-specified clusters -->
        <default>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-0-1</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-1-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-1-1</host>
                    <port>9000</port>
                </replica>
            </shard>
        </default>
        <!-- Autogenerated clusters -->
        <all-replicated>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-0-1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-1-0</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chi-clickhouse-db-default-1-1</host>
                    <port>9000</port>
                </replica>
            </shard>
        </all-replicated>
        <all-sharded>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-0</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-0-1</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-1-0</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>chi-clickhouse-db-default-1-1</host>
                    <port>9000</port>
                </replica>
            </shard>
        </all-sharded>
    </remote_servers>
</yandex>
zookeeper.xml:
----
<yandex>
    <zookeeper>
        <node>
            <host>zookeeper-0.zookeeper-headless.clickhouse</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-1.zookeeper-headless.clickhouse</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-2.zookeeper-headless.clickhouse</host>
            <port>2181</port>
        </node>
    </zookeeper>
    <distributed_ddl>
        <path>/clickhouse/clickhouse-db/task_queue/ddl</path>
    </distributed_ddl>
</yandex>
01-clickhouse-operator-listen.xml:
----
<yandex>
    <!-- Listen wildcard address to allow accepting connections from other containers and host network. -->
    <listen_host>::</listen_host>
    <listen_host>0.0.0.0</listen_host>
    <listen_try>1</listen_try>
</yandex>
02-clickhouse-operator-logger.xml:
----
<yandex>
    <logger>
        <console>1</console>
    </logger>
</yandex>
Thanks, @MetalRex101. Do you have the metrics-exporter pod running? If you upgraded from 0.5.0, you may be missing it. You can check your operator logs as well; if the metrics exporter is missing, the operator complains a lot. This is a bug that has already been fixed.
Also, please list your CHI spec, if possible.
We are going to release 0.7.0 later this week, so we are mainly testing that version. The memory issue may already be fixed there.
@alex-zaitsev, what is a CHI spec? How can I get that?
If the new release is coming soon, we can wait for it and check for memory leaks. If no leaks appear, we can close this issue. Is that OK for you?
@MetalRex101 , a CHI spec is your ClickHouseInstallation resource specification; in your example that is 'clickhouse-db'.
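For reference, a minimal CHI spec looks roughly like the sketch below (the layout values here are only inferred from the remote_servers config posted above, 2 shards x 2 replicas; they are illustrative, not copied from the actual resource):

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "clickhouse-db"
spec:
  configuration:
    clusters:
      - name: "default"
        layout:
          shardsCount: 2
          replicasCount: 2
```

You can dump the real one with `kubectl get chi clickhouse-db -o yaml` in the namespace where it is installed.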
We have a similar issue, but with version 0.8.0. Our operator pod is constantly evicted because of excessive resource usage. Now that metrics have been separated from the operator, we can see that it is the metrics pod that uses a lot of memory: over 2 GiB before it gets evicted, which seems far too much just for metrics.
@lukatera , what is the size of your cluster, and do you actually use monitoring (i.e., the Prometheus integration)?
The cluster is 8 shards, each replicated 2 times. We use Prometheus to collect metrics from clickhouse-metrics.
Here's the graph of container_memory_usage_bytes{container_name="metrics-exporter"} over the past day.
@lukatera , thanks. We are making some fixes to the metrics exporter now; we will look into the possible memory leak and provide a fix.
Any update here? We are also encountering the same issue.
Fix merged into 0.9.2, thanks @guoshuai2016.
Please take a look.