
Comments (11)

kirankumar-grootan avatar kirankumar-grootan commented on August 15, 2024

Same here, but I'm using the official Grafana tempo-distributed Helm chart:

    cache:
      caches:
      - memcached:
          consistent_hash: true
          host: 'test-tempo-memcached'
          service: memcached-client
          timeout: 500ms
        roles:
        - parquet-footer
        - bloom
        - frontend-search
    compactor:
      compaction:
        block_retention: 72h
        compacted_block_retention: 1h
        compaction_cycle: 30s
        compaction_window: 1h
        max_block_bytes: 107374182400
        max_compaction_objects: 6000000
        max_time_per_tenant: 5m
        retention_concurrency: 10
        v2_in_buffer_bytes: 5242880
        v2_out_buffer_bytes: 20971520
        v2_prefetch_traces_count: 1000
      ring:
        kvstore:
          store: memberlist
    distributor:
      log_received_spans:
        enabled: true
        filter_by_status_error: true
        include_all_attributes: true
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
              max_recv_msg_size_mib: 4194304
            http:
              endpoint: 0.0.0.0:4318
      ring:
        kvstore:
          store: memberlist
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      bind_addr: []
      bind_port: 7946
      gossip_interval: 1s
      gossip_nodes: 2
      gossip_to_dead_nodes_time: 30s
      join_members:
      - dns+test-tempo-gossip-ring:7946
      leave_timeout: 5s
      left_ingesters_timeout: 5m
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
      node_name: ""
      packet_dial_timeout: 5s
      packet_write_timeout: 5s
      pull_push_interval: 30s
      randomize_node_name: true
      rejoin_interval: 0s
      retransmit_factor: 2
      stream_timeout: 10s
    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions: []
          enable_client_server_prefix: true
          enable_messaging_system_latency_histogram: true
          enable_virtual_node_label: true
          histogram_buckets: []
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimensions: []
          histogram_buckets: []
      registry:
        collection_interval: 15s
        external_labels:
          source: tempo
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/generator/wal
        remote_write:
        - headers:
            X-Scope-OrgID: test
          send_exemplars: true
          url: http://observability-mimir-distributor:8080/api/v1/push
        remote_write_add_org_id_header: false
        remote_write_flush_deadline: 1m
      traces_storage:
        path: /var/tempo/generator/traces
    multitenancy_enabled: true
    overrides:
      per_tenant_override_config: /runtime-config/overrides.yaml
    querier:
      frontend_worker:
        frontend_address: test-tempo-query-frontend-discovery:9095
        grpc_client_config:
          max_send_msg_size: 4194304
      max_concurrent_queries: 20
      search:
        external_backend: null
        external_endpoints: []
        external_hedge_requests_at: 8s
        external_hedge_requests_up_to: 2
        prefer_self: 10
        query_timeout: 30s
      trace_by_id:
        query_timeout: 10s
    query_frontend:
      max_outstanding_per_tenant: 2000
      max_retries: 2
      search:
        concurrent_jobs: 1000
        target_bytes_per_job: 104857600
      trace_by_id:
        query_shards: 50
    server:
      grpc_server_max_recv_msg_size: 4194304
      grpc_server_max_send_msg_size: 4194304
      http_listen_port: 3100
      http_server_read_timeout: 30s
      http_server_write_timeout: 30s
      log_format: logfmt
      log_level: info
    storage:
      trace:
        backend: s3
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        pool:
          max_workers: 400
          queue_depth: 20000
        s3:
          access_key: 123
          bucket: tempo
          endpoint: observability-minio:9000
          forcepathstyle: true
          insecure: false
          secret_key: 123
          tls_insecure_skip_verify: true
        wal:
          path: /var/tempo/wal
    usage_report:
      reporting_enabled: false


kirankumar-grootan avatar kirankumar-grootan commented on August 15, 2024

I'm not sure if this is a limitation of distributed mode. When I tried it with Docker Compose, it worked because the metrics generator and ingester read and write data to the same local path. However, in a Kubernetes environment under distributed mode, all components run as separate containers with separate volumes. I'm not sure if the metrics generator and ingester share the same path.

Perhaps someone from the Grafana team can enlighten us on this.


nc-gcz avatar nc-gcz commented on August 15, 2024

In our set-up, the ingester indeed has a persistent volume claim:

spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-grafana-tempo-ingester-0

But it doesn't seem to be mounted into the metrics-generator.
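
For comparison, mounting a dedicated volume into the metrics-generator would look roughly like the sketch below. This is only an illustration of the shape of such a mount; the claim name is made up, not taken from our manifests:

spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-grafana-tempo-metrics-generator-0 # hypothetical claim name
  containers:
    - name: metrics-generator
      volumeMounts:
        - name: data
          mountPath: /bitnami/grafana-tempo/data # the generator's data directory in our deployment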

Another thing we spotted when running the tests was the following behaviour:

level=info ts=2024-07-10T14:50:55.619034542Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=44556
level=info ts=2024-07-10T14:50:55.62145804Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=45901
level=info ts=2024-07-10T14:51:10.515918323Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-10T14:51:25.5159173Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0

At some point during the tests (which in our case generate traffic across 30+ microservices) it basically stopped generating metrics. After I noticed that the next morning and restarted the metrics-generator, it went quiet again after printing only the following log lines:

level=info ts=2024-07-11T12:42:06.162281707Z caller=module_service.go:82 msg=initialising module=metrics-generator
level=info ts=2024-07-11T12:42:06.162851639Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=1 to_join=2
level=info ts=2024-07-11T12:42:06.173897462Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=4 elapsed_time=11.048845ms
level=info ts=2024-07-11T12:42:06.173946198Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=test-grafana-tempo-gossip-ring
level=info ts=2024-07-11T12:42:06.17426984Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-6cc769bdd7-zh89p ring=metrics-generator
level=info ts=2024-07-11T12:42:06.174649825Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-11T12:42:06.185328903Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=11.382705ms

Nothing is printed about even attempting to connect to the Prometheus instance, which in our case is completely idle, not restarting, working as expected, and always at the same URL.


joe-elliott avatar joe-elliott commented on August 15, 2024

Hey all, can you review these steps:

https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/

This will help us identify where things are being dropped.

However, in a Kubernetes environment under distributed mode, all components run as separate containers with separate volumes. I'm not sure if the metrics generator and ingester share the same path.

Generators and ingesters do not need shared access to block storage. They are fine to run independently.
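
For the metrics path specifically, each generator replica only needs its own local WAL directory and a remote_write target. Roughly, that is this excerpt of the config posted above (path and URL taken from that config, not a recommendation):

metrics_generator:
  storage:
    path: /var/tempo/generator/wal # per-replica local WAL, no shared volume needed
    remote_write:
    - url: http://observability-mimir-distributor:8080/api/v1/push
      send_exemplars: true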


nc-gcz avatar nc-gcz commented on August 15, 2024

Hello @joe-elliott, thank you for reaching out.

As part of the troubleshooting, our Operations Team reverted from Grafana Tempo 2.5.0 to 2.1.1 (an older version we had run before). We had issues with the same behaviour on 2.5.0 and thought downgrading to 2.1.1 would fix them, but it doesn't seem to help; we see the same behaviour on both versions.

I will follow the matching troubleshooting guide from here: https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/

Testing the metrics from the similar issue #2514:

  • sum(increase(traces_service_graph_request_total[1m])) - empty result from last 12 hours, no graph visible
  • sum(increase(tempo_distributor_ingester_appends_total[1m])) - value of around 300-350 in last 12 hours

image

  • sum(increase(tempo_distributor_metrics_generator_pushes_total[1m])) - empty result from last 12 hours, no graph visible
  • sum(increase(tempo_distributor_metrics_generator_pushes_failures_total[1m])) - empty result from last 12 hours, no graph visible
  • rate(tempo_metrics_generator_registry_collections_failed_total[1m]) - empty result from last 12 hours, no graph visible

Metrics like prometheus_remote_storage_samples_total, prometheus_remote_storage_bytes_total, or prometheus_remote_storage_samples_failed_total are reporting empty results in Prometheus for the last 24 hours.

Troubleshooting guide

  • sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])) - empty result from last 12 hours, no graph visible
  • sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])) - as above,
  • sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason) - as above,
  • sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])) - as above,
  • All of: prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total - as above,
  • sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])) - as above,
  • sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])) - as above,
  • sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])) - as above.

It seems that the metrics-generator is not sending any metrics at all, yet it reports itself as healthy.

It used to work right after a reinstall of the Helm chart; after some time and tests, it stopped working, reporting active_series=0 for a prolonged time. Restarting it again did not help - it did not pick up and send any metrics to Prometheus. So it's not that it never works - it works for some time and then it stops :(

I've attached an example response body from the metrics-generator's metrics endpoint; it is missing all of the service graph, span metrics, and internal metrics listed above (the ones that are empty):

metrics-generator-example-metrics-endpoint-response-body.txt

Prometheus reports it as being scraped correctly:

image


nc-gcz avatar nc-gcz commented on August 15, 2024

Also, we've restarted Grafana Tempo again - a clean deployment of Helm chart version 2.1.1.

What we can see now is that the active_series= log lines are present and, most importantly, the following log line is present:

level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant

When the metrics-generator is in its idle state, the component's log ends before this line, so the last log lines are:

level=info ts=2024-07-12T10:05:50.298578151Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-589965b6b5-qdrvt ring=metrics-generator
level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms

Whereas for a working metrics-generator it looks like this:

level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms
level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant
ts=2024-07-12T10:05:50.481543325Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-0
ts=2024-07-12T10:05:50.481582922Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.481763202Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-1
ts=2024-07-12T10:05:50.481772492Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.483207769Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-0
ts=2024-07-12T10:05:50.483562004Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-1
level=info ts=2024-07-12T10:05:50.48629064Z caller=db.go:336 tenant=single-tenant component=wal msg="replaying WAL, this may take a while" dir=/bitnami/grafana-tempo/data/wal/single-tenant/wal
level=info ts=2024-07-12T10:05:50.486461227Z caller=db.go:387 tenant=single-tenant component=wal msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2024-07-12T10:06:05.514274932Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6074
ts=2024-07-12T10:06:05.551793586Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.068244585s
ts=2024-07-12T10:06:05.552795859Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.069655298s

On an idle metrics-generator, the /bitnami/grafana-tempo/data/wal path is empty. But when it starts correctly, it has data:

image

/bitnami/grafana-tempo/data is an emptyDir in the Kubernetes deployment, as installed from the Bitnami Helm chart.
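
In the rendered manifest that volume is declared roughly like this (a sketch, not copied from the cluster; the volume name is a guess):

  volumes:
    - name: data
      emptyDir: {} # contents do not survive a pod restart or reschedule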


nc-gcz avatar nc-gcz commented on August 15, 2024

It stopped again:

level=info ts=2024-07-12T10:43:33.390479119Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:43:48.393775049Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:03.390660555Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:18.379634162Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=0
level=info ts=2024-07-12T10:44:18.379708084Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-12T10:44:33.379406636Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0

As we had some metrics again for a short period of time, I'd like to go through the metrics from the troubleshooting guide again:

Troubleshooting guide

  • sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])):

Empty

  • sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])):

image

  • sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason):

image

  • sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])):

Empty

  • All of: prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total:

image

  • sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])):

image

  • sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])):

image

  • sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])):

image


nc-gcz avatar nc-gcz commented on August 15, 2024

By checking out the 2.1.1 code locally and looking at modules/generator/storage/instance.go, we can see where the WAL should be created:

// New creates a metrics WAL that remote writes its data.
func New(cfg *Config, tenant string, reg prometheus.Registerer, logger log.Logger) (Storage, error) {
	logger = log.With(logger, "tenant", tenant)
	reg = prometheus.WrapRegistererWith(prometheus.Labels{"tenant": tenant}, reg)

	walDir := filepath.Join(cfg.Path, tenant)

	level.Info(logger).Log("msg", "creating WAL", "dir", walDir) // This is never called

The "creating WAL" log line is never emitted - we cannot see it in the logs.

There are two places in modules/generator/generator.go where the code can return before the New method is called:

func (g *Generator) getOrCreateInstance(instanceID string) (*instance, error) {

here:

	inst, ok := g.getInstanceByID(instanceID)
	if ok {
		return inst, nil
	} // <<< HERE

	g.instancesMtx.Lock()
	defer g.instancesMtx.Unlock()

	inst, ok = g.instances[instanceID] // <<< OR HERE
	if ok {
		return inst, nil
	}

        // This is never reached apparently?
	inst, err := g.createInstance(instanceID)
	if err != nil {
		return nil, err
	}
	g.instances[instanceID] = inst
	return inst, nil
}

Why it's not being called, I am not able to tell :( Any help is appreciated. Thank you.


joe-elliott avatar joe-elliott commented on August 15, 2024

I would definitely not recommend upgrading from 2.1.x to 2.5.x. We aggressively rotate through block formats, and it's quite possible blocks created by 2.1 are not supported by 2.5. Also, if you're doing fresh installs, can you just install 2.5?

But I'm not sure if that is causing your issue or not. You're saying that on a fresh install of 2.1.1 it worked for a while and then quit? Those graphs suggest the distributors just stopped sending data to the generators. Can you check:

  • if the generators are accessible over the network from the distributors
  • ring health: check /metrics-generator/ring on the distributors

Why it's not being called, I am not able to tell :( Any help is appreciated. Thank you.

I believe it's not being called because if you don't send data to the generator, it will never create an instance; it creates a per-tenant instance on demand. The missing log line suggests it's just not receiving any data, which is consistent with your other graphs.

