stackdriver-tools's Introduction

Status

This project is no longer actively maintained and the repository has been archived.

stackdriver-tools release for BOSH

This release provides Cloud Foundry and BOSH integration with Google Cloud Platform's Stackdriver Logging and Monitoring.

Functionality is provided by three jobs in this release: the Stackdriver Nozzle (stackdriver-nozzle), the host logging agent (google-fluentd), and the host monitoring agent (stackdriver-agent).

Project Status

The following features are generally available:

  • Stackdriver Host Monitoring Agent (stackdriver-agent)
  • Stackdriver Host Logging Agent (google-fluentd)
  • Stackdriver Nozzle (stackdriver-nozzle)
    • Stackdriver Logging for Cloud Foundry Log Events (LogMessage, Error, HttpStartStop)
    • Stackdriver Monitoring for Cloud Foundry Metric Events (ContainerMetric, ValueMetric, CounterEvent)

The following feature is in beta:

  • Stackdriver Nozzle
    • Stackdriver Logging for Cloud Foundry Metric Events (ContainerMetric, ValueMetric, CounterEvent)

The project was developed in partnership between Google and Pivotal and was maintained by Google until the repository was archived.

Getting started

Enable Stackdriver APIs

Ensure the Stackdriver Logging and Stackdriver Monitoring APIs are enabled.

Quotas

Depending on the size of the Cloud Foundry deployment and which events the nozzle is forwarding, it can be quite easy to reach the default Stackdriver quotas.

Google quotas can be viewed and managed on the API Quotas page. An operator can raise the default quotas up to a limit; beyond that, use the contact links on the page to request higher quotas.

Create and configure service accounts

All of the jobs in this release authenticate to Stackdriver Logging and Monitoring via service accounts. Follow the GCP documentation to create a service account via gcloud with the following roles; a sketch of the commands follows the list:

  • roles/logging.logWriter
  • roles/logging.configWriter
  • roles/monitoring.metricWriter
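
A minimal sketch of the gcloud commands, assuming an illustrative project named my-project and an account named stackdriver-tools:

# Illustrative values; substitute your own project and account name.
export PROJECT_ID=my-project
export SA_NAME=stackdriver-tools
export SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud iam service-accounts create "${SA_NAME}" --project "${PROJECT_ID}" \
    --display-name "stackdriver-tools"

for role in roles/logging.logWriter roles/logging.configWriter roles/monitoring.metricWriter; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member "serviceAccount:${SA_EMAIL}" --role "${role}"
done

gcloud iam service-accounts keys create service_account.json --iam-account "${SA_EMAIL}"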

You can authenticate the job(s) either by specifying the service account in the cloud_properties of the resource pool running them, or by configuring credentials.application_default_credentials in the job spec.
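
For the second option, a minimal sketch of the relevant job properties (the key contents are elided; the structure mirrors the credentials job properties shown later in this document):

properties:
  credentials:
    application_default_credentials: |
      {
        "type": "service_account",
        ...
      }
  project_id: my-project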

You may also read the access control documentation for more general information about how authentication and authorization work for Stackdriver.

General usage

To use any of the jobs in this BOSH release, first upload it to your BOSH director:

bosh2 upload-release https://storage.googleapis.com/bosh-gcp/beta/stackdriver-tools/latest.tgz

The stackdriver-tools.yml sample BOSH 2.0 manifest illustrates how to use all 3 jobs in this release (nozzle, host logging, and host monitoring). You can deploy the sample with the following commands:

bosh2 upload-stemcell https://bosh.io/d/stemcells/bosh-google-kvm-ubuntu-trusty-go_agent

bosh2 update-cloud-config -n manifests/cloud-config-gcp.yml \
          -v zone=... \
          -v network=... \
          -v subnetwork=... \
          -v "tags=['stackdriver-nozzle']" \
          -v internal_cidr=... \
          -v internal_gw=... \
          -v "reserved=[10....-10....]"

bosh2 deploy manifests/stackdriver-tools.yml \
            -d stackdriver-nozzle \
            --var=firehose_endpoint=https://.. \
            --var=firehose_username=stackdriver_nozzle \
            --var=firehose_password=... \
            --var=skip_ssl=false \
            --var=gcp_project_id=... \
            --var-file=gcp_service_account_json=path/to/service_account.json

This will create a self-contained deployment that sends Cloud Foundry firehose data, host logs, and host metrics to Stackdriver.

Deploying each job individually is described in detail below.

Deploying the nozzle

Create a new deployment manifest for the nozzle. See the example manifest for a full deployment and the jobs.stackdriver-nozzle section for the nozzle.

To reduce message loss, operators should run a minimum of two instances. With two instances, updating stemcells and other destructive BOSH operations will still leave an instance draining logs.

The Loggregator system will round-robin messages across multiple instances. If the nozzle can't handle the load, consider scaling to more than two nozzle instances.
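
A minimal sketch of the corresponding instance group in a BOSH 2.0 manifest (values illustrative; see the sample manifest for the full set of properties):

instance_groups:
- name: stackdriver-nozzle
  instances: 2  # minimum recommended; one instance keeps draining logs during updates
  jobs:
  - name: stackdriver-nozzle
    release: stackdriver-tools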

The spec describes all the properties an operator should modify.

Stackdriver Error Reporting

Stackdriver can automatically detect and report errors from stack traces in logs. However, this does not automatically work with Loggregator because it sends each line from app output as a separate log message to the nozzle. To enable this feature of Stackdriver, apps will need to manually encode stacktraces on a single line so that the stackdriver-nozzle can send them as single messages to Stackdriver.

This is accomplished by replacing newlines in stacktraces with a unique token, which is set via the firehose.newline_token template variable in the nozzle; the nozzle converts the token back into newlines so that the stacktrace is reconstructed across multiple lines.

For example, if firehose.newline_token is set to ∴, a Go app would need to implement something like the following:

package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

const newlineToken = "∴"

func main() {
	// ...
	defer handlePanic()
	// ...
}

func handlePanic() {
	e := recover()
	if e == nil {
		return
	}

	// Capture the stack for all goroutines.
	stack := make([]byte, 1<<16)
	stackSize := runtime.Stack(stack, true)
	out := string(stack[:stackSize])

	// Log the panic on its own line, then emit the stacktrace as a single
	// line with newlines replaced by the token.
	fmt.Fprintf(os.Stderr, "panic: %v\n", e)
	fmt.Fprintln(os.Stderr, strings.Replace(out, "\n", newlineToken, -1))
	os.Exit(1)
}

This outputs the stacktrace separately from the panic so that the panic remains in the logs and the stacktrace is logged by itself. This allows Stackdriver to detect the stacktrace as an error.

For an example in Java, see this section of the Loggregator documentation.

Deploying host logging

The google-fluentd template uses Fluentd to send both syslog and template logs (assuming that template jobs are writing logs into /var/vcap/sys/log/*/*.log) to Stackdriver Logging.

To forward host logs from BOSH VMs to Stackdriver, co-locate the google-fluentd template with an existing job whose host logs should be forwarded.

Include the stackdriver-tools release in your existing deployment manifest:

releases:
  ...
  - name: stackdriver-tools
    version: latest
  ...

Add the google-fluentd template to your job:

jobs:
  ...
  - name: nats
    templates:
      - name: nats
        release: cf
      - name: metron_agent
        release: cf
      - name: google-fluentd
        release: stackdriver-tools
  ...

Deploying host monitoring

The stackdriver-agent template uses the Stackdriver Monitoring Agent to collect VM metrics and send them to Stackdriver Monitoring.

To forward host metrics from BOSH VMs to Stackdriver, co-locate the stackdriver-agent template with an existing job whose host metrics should be forwarded.

Include the stackdriver-tools release in your existing deployment manifest:

releases:
  ...
  - name: stackdriver-tools
    version: latest
  ...

Add the stackdriver-agent template to your job:

jobs:
  ...
  - name: nats
    templates:
      - name: nats
        release: cf
      - name: metron_agent
        release: cf
      - name: stackdriver-agent
        release: stackdriver-tools
  ...

Deploying as a BOSH addon

Specify the jobs as addons in your runtime config to deploy Stackdriver Monitoring and Logging agents on all instances in your deployment. Do not specify the jobs as part of your deployment manifest if you are using the runtime config.

# runtime.yml
---
releases:
  - name: stackdriver-tools
    version: latest

addons:
- name: stackdriver-tools
  jobs:
  - name: google-fluentd
    release: stackdriver-tools
  - name: stackdriver-agent
    release: stackdriver-tools

To update the runtime config:

bosh2 update-runtime-config runtime.yml

Then redeploy your manifest:

bosh2 deploy -d <your deployment> path/to/manifest.yml

Development

Updating google-fluentd

google-fluentd is versioned by the Gemfile in src/google-fluentd. To update fluentd:

  1. Update the version specifier in the Gemfile (if necessary)
  2. Update Gemfile.lock: bundle update
  3. Create a vendor cache from the Gemfile.lock: bundle package
  4. Tar and compress the vendor folder: tar zcvf google-fluentd-vendor-<VERSION>-plugin-<VERSION>.tgz vendor
  5. Update the vendor version in the google-fluentd package packaging and spec
  6. Add the vendored cache to the BOSH blobstore: bosh2 add-blob google-fluentd-vendor-<VERSION>-plugin-<VERSION>.tgz google-fluentd-vendor/google-fluentd-vendor-<VERSION>-plugin-<VERSION>.tgz
  7. Create a dev release and deploy it to verify that all of the above worked
  8. Update the BOSH blobstore: bosh2 upload-blobs
  9. Commit your changes

bosh-lite

Both the nozzle and the fluentd jobs can run on bosh-lite. To generate a working manifest, start from the bosh-lite-example-manifest. Note the application_default_credentials property, which should be filled in with the contents of a Google service account key.

Contributing

For details on how to contribute to this project - including filing bug reports and contributing code changes - please see CONTRIBUTING.md.

Copyright

Copyright (c) 2016 Ferran Rodenas. See LICENSE for details.

stackdriver-tools's People

Contributors

chentom88, cholick, erjohnso, evandbrown, fluffle, frodenas, garimasharma, johnsonj, kejadlen, knyar, mattysweeps, nrxus, pivotal-jwynne, pivotalsquid, sarahwalther, stonish


stackdriver-tools's Issues

Deploy the Spinner with the Tile

Push the Spinner cf app as part of the tile. Re-use the service account used by the nozzle. This may require adding a GOOGLE_CREDENTIALS_JSON to the manifest and plumbing it through to the Stackdriver client (see root_service_account_json and client creation for an example).

Surface the configuration options (with friendly name/description/sane default):

  • SPINNER_COUNT
  • SPINNER_WAIT

Open questions:

  • Do we need to enable/disable it or just assume users want it?
  • Should we deploy multiple instances?

/cc @hustons @sahilm @GarimaSharma (+tom)

Nozzle sends more than 200 time series in a single request

With a nozzle built from the develop branch and the default metrics_buffer_size set to 200, I am seeing the following errors from Stackdriver:

code = InvalidArgument desc = Field timeSeries had an invalid value: A maximum of 200 TimeSeries can be written in a single request.

I believe this is caused by the fact that metrics_buffer_size limits the number of MetricEvents passed to the metric adapter; however, each MetricEvent can have multiple metrics (which share the same labels). As a result, 200 MetricEvents can produce more than 200 TimeSeries in a CreateTimeSeriesRequest.

There should probably be a stricter check that ensures that CreateTimeSeriesRequest always conforms to API restrictions and does not have more than 200 time series.
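
One way to enforce the limit (a sketch assuming the adapter flattens MetricEvents into monitoringpb.TimeSeries values before writing; not the project's actual code):

package nozzle // illustrative

import (
	monitoringpb "google.golang.org/genproto/googleapis/monitoring/v3"
)

// The Stackdriver API accepts at most 200 TimeSeries per request.
const maxTimeSeriesPerRequest = 200

// batchTimeSeries splits the flattened TimeSeries (which may outnumber the
// MetricEvents they came from) into API-compliant request batches.
func batchTimeSeries(series []*monitoringpb.TimeSeries) [][]*monitoringpb.TimeSeries {
	var batches [][]*monitoringpb.TimeSeries
	for len(series) > maxTimeSeriesPerRequest {
		batches = append(batches, series[:maxTimeSeriesPerRequest])
		series = series[maxTimeSeriesPerRequest:]
	}
	if len(series) > 0 {
		batches = append(batches, series)
	}
	return batches
}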

Provide Bosh 2.0 manifest.

It would be convenient to have a Bosh 2.0 manifest available for deployments.

This would remove the need for the BOSH 1 CLI (which supports ERB), and would allow users to take advantage of vars files and the vars store.

/cc @apoydence

rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 2

Hello,

I am using the nozzle in combination with google cloud stackdriver. In our logs we keep seeing the following errors:

rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 2

our configuration looks like this:

    export FIREHOSE_ENDPOINT=https://api.our.domain.io
    export FIREHOSE_USERNAME=firehose
    export FIREHOSE_PASSWORD=password
    export FIREHOSE_EVENTS=LogMessage,Error,HttpStartStop,CounterEvent,ValueMetric,ContainerMetric
    export FIREHOSE_SKIP_SSL=false
    export FIREHOSE_SUBSCRIPTION_ID=stackdriver-nozzle
    export FIREHOSE_NEWLINE_TOKEN=

    export DEBUG_NOZZLE=true
    export RESOLVE_APP_METADATA=true

could you help with this problem?

Many thanks,
Claudio

Deduplicate process-level metrics

Stackdriver has a limit of 500 custom metrics per project, and the latest build from the develop branch already attempts to create more. As a result, SD API requests fail with the following error message:

rpc error: code = ResourceExhausted desc = Your metric descriptor quota has been exhausted

Note that #139 increased the number of metrics by prepending origin to the metric name. While that is the right thing to do in general, several metrics seem to be created for multiple processes and mean the same thing for all of them:

  • memoryStats.lastGCPauseTimeNS
  • memoryStats.numBytesAllocated
  • memoryStats.numBytesAllocatedHeap
  • memoryStats.numBytesAllocatedStack
  • memoryStats.numFrees
  • memoryStats.numMallocs
  • numCPUS
  • numGoRoutines

In our test PCF instance the 8 metrics listed above repeat 26 times each, so deduplicating them (by not prepending origin to the metric name) would decrease the total number of metrics by 182. This seems like a quick, easy win, but I suspect in the future we might also want to add a metric blacklist/whitelist to give users better control of the number of metrics created by the nozzle.
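
A sketch of the deduplication idea (illustrative, not the project's code):

package nozzle // illustrative

// sharedMetrics lists process-level metrics that mean the same thing for
// every origin; skipping the origin prefix for these lets them deduplicate
// into a single metric descriptor.
var sharedMetrics = map[string]bool{
	"memoryStats.lastGCPauseTimeNS":      true,
	"memoryStats.numBytesAllocated":      true,
	"memoryStats.numBytesAllocatedHeap":  true,
	"memoryStats.numBytesAllocatedStack": true,
	"memoryStats.numFrees":               true,
	"memoryStats.numMallocs":             true,
	"numCPUS":                            true,
	"numGoRoutines":                      true,
}

func metricName(origin, name string) string {
	if sharedMetrics[name] {
		return name
	}
	return origin + "." + name
}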

@johnsonj, what do you think?

Emit Spinner results to Stackdriver Monitoring

The spinner emits a log entry that describes the outcome of its log-loss test.
The user should be able to emit a custom metric with this result to Stackdriver Monitoring.

possible metrics:

  • stackdriver-spinner/logs.sent - cumulative total of log messages sent to loggregator
  • stackdriver-spinner/logs.received - cumulative total of logs received by the probe
  • stackdriver-spinner/logs.lost - cumulative total of logs never received

labels:

  • director - corresponds to the same director value for Stackdriver Nozzle. Used when multiple PCF instances are logging to a single Stackdriver project. Make an ENV variable for the app?
  • index - index of the Cloud Foundry app (in case the user is running multiple copies)

/cc cloud-ops for implementation/collab @hustons @sahilm @GarimaSharma (+tom)
/cc cre for metrics guidance/awareness @fluffle @knyar

Downloading blobs requires credentials

It seems that 2baa699 broke the tile building script (scripts/build-custom-tile-docker.sh started in Docker by scripts/custom-tile) because downloading blobs requires credentials now:

Step 5/5 : RUN scripts/build-custom-tile-docker.sh
 ---> Running in 4a23e3392c29

Blob download 'golang/go1.9.linux-amd64.tar.gz' (id: 9389191f-2e77-4df2-6ccd-d4a1639ed201) failed
Blob download 'google-fluentd-vendor/google-fluentd-vendor-0.12-plugin-0.5.3.tgz' (id: da306bf9-a23c-46cc-69e7-7e86a54e7bbb) failed
Blob download 'google-fluentd-vendor/google-fluentd-vendor-0.14.tgz' (id: ad88f6c7-fd8d-41ae-6109-65e3f604e76f) failed
Blob download 'libtool/libtool-2.4.2.tar.gz' (id: a72448bd-9ae1-482a-54f3-a73f6820c2c4) failed
Blob download 'libyajl/yajl-2.1.0.tar.gz' (id: 0a77c971-4860-4059-5bb7-8b3e0049f2ff) failed
Downloading blobs:
  - Getting blob '9389191f-2e77-4df2-6ccd-d4a1639ed201' for path 'golang/go1.9.linux-amd64.tar.gz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob 'da306bf9-a23c-46cc-69e7-7e86a54e7bbb' for path 'google-fluentd-vendor/google-fluentd-vendor-0.12-plugin-0.5.3.tgz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob 'ad88f6c7-fd8d-41ae-6109-65e3f604e76f' for path 'google-fluentd-vendor/google-fluentd-vendor-0.14.tgz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob 'a72448bd-9ae1-482a-54f3-a73f6820c2c4' for path 'libtool/libtool-2.4.2.tar.gz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob '0a77c971-4860-4059-5bb7-8b3e0049f2ff' for path 'libyajl/yajl-2.1.0.tar.gz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Exit code 1

@johnsonj, is this something that can be easily fixed by changing permissions on the GCS bucket?

Shipping multiline stack traces from google-fluentd job for use with Stackdriver Error Reporting

I'd like to make use of the Stackdriver Error reporting functionality - https://cloud.google.com/error-reporting/docs/viewing - to track errors that CF platform components are logging.

As a concrete example; the CF UAA component logs verbose multiline Java stacktrace errors to /var/vcap/sys/log/uaa/uaa.log

For example:

[2016-12-20 18:50:08.070] uaa - 10605 [localhost-startStop-1] .... FATAL --- RecognizeFailureDispatcherServlet: Unable to start UAA application.
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.cloudfoundry.identity.uaa.security.web.SecurityFilterChainPostProcessor#0' defined in ServletContext resource [/WEB-INF/spring-servlet.xml]: Cannot resolve reference to bean 'identityZoneResolvingFilter' while setting bean property 'additionalFilters' with key [TypedStringValue: value [#{T(org.cloudfoundry.identity.uaa.security.web.SecurityFilterChainPostProcessor.FilterPosition).position(2)}], target type [null]]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'identityZoneResolvingFilter' defined in ServletContext resource [/WEB-INF/spring-servlet.xml]: Cannot resolve reference to bean 'identityZoneProvisioning' while setting bean property 'identityZoneProvisioning'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'identityZoneProvisioning' defined in ServletContext resource [/WEB-INF/spring/multitenant-endpoints.xml]: Cannot resolve reference to bean 'jdbcTemplate' while setting constructor argument; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'flyway' defined in class path resource [spring/data-source.xml]: Invocation of init method failed; nested exception is org.flywaydb.core.api.FlywayException: Unable to obtain Jdbc connection from DataSource
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:359)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:108)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveManagedMap(BeanDefinitionValueResolver.java:407)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:165)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1481)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1226)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:543)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:482)
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:306)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
        ...snip...
        at org.springframework.web.servlet.FrameworkServlet.initWebApplicationContext(FrameworkServlet.java:553)
        at org.springframework.web.servlet.FrameworkServlet.initServletBean(FrameworkServlet.java:494)
        at org.springframework.web.servlet.HttpServletBean.init(HttpServletBean.java:136)
        at javax.servlet.GenericServlet.init(GenericServlet.java:158)
        at org.cloudfoundry.identity.uaa.web.RecognizeFailureDispatcherServlet.init(RecognizeFailureDispatcherServlet.java:56)
        at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1227)
        at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1140)
        at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1027)
        at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:5038)
...snip...

My understanding of Stackdriver Error Reporting is that if we can capture all the stacktrace information in a single log message, then we "automagically" get a host of Stackdriver Error Reporting goodness.

My concrete question is how to ship multiline Java stacktrace log messages; but it's part of a more generic question about where the logic for understanding the log format should live.

One place I can see the logic going is the jobs/google-fluentd/templates/vcap.conf config. This could potentially be extended with more detailed in_tail config for specific log files - in the case of the uaa.log file, some custom multiline config that recognises Java stack traces.
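
Something like the following in_tail stanza might work (a sketch only; the path, pos_file, tag, and regexes are illustrative and untested):

<source>
  @type tail
  path /var/vcap/sys/log/uaa/uaa.log
  pos_file /var/vcap/data/google-fluentd/uaa.log.pos
  tag vcap.uaa
  format multiline
  format_firstline /^\[\d{4}-\d{2}-\d{2}/
  format1 /^(?<message>.*)/
</source>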

However, before submitting a PR for the above; I'd like to find out if there is a better place to put such logic.

Thanks!

Instance index should be a label not a metric

Currently the container metric's "instance index" is being sent as a separate metric type. It should really be a label on the other container metrics (e.g. "this is the CPU utilization for this instance of this app").
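
A sketch of the proposed change (names are illustrative, not the project's code):

package nozzle // illustrative

import (
	"strconv"

	"github.com/cloudfoundry/sonde-go/events"
)

// containerMetricLabels sketches the proposal: the instance index becomes a
// label shared by the container's CPU/memory/disk metrics instead of being a
// metric type of its own.
func containerMetricLabels(appID string, cm *events.ContainerMetric) map[string]string {
	return map[string]string{
		"applicationId": appID,
		"instanceIndex": strconv.Itoa(int(cm.GetInstanceIndex())),
	}
}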

tile generates invalid ops manager manifest in develop

workaround:

diff --git a/tile.yml.erb b/tile.yml.erb
index d82806d..5313730 100644
--- a/tile.yml.erb
+++ b/tile.yml.erb
@@ -58,9 +58,7 @@ forms:
     type: string
     default: HttpStartStop,LogMessage,Error
     label: Whitelist for Stackdriver Logging
-    description: |
-      Comma separated list without spaces consisting of any or all of HttpStartStop,LogMessage,Error. 
-      The following events are in beta can also be used: ValueMetric,CounterEvent,ContainerMetric
+    description: Comma separated list without spaces consisting of any or all of HttpStartStop,LogMessage,Error. The following events are in beta can also be used, ValueMetric,CounterEvent,ContainerMetric
   - name: firehose_events_to_stackdriver_monitoring
     type: string
     default: CounterEvent,ValueMetric,ContainerMetric
@@ -106,6 +104,7 @@ forms:
     default: 1000
     label: Logging Batch Count
     description: Batch size for log messages being sent to Stackdriver
+    type: integer
   - name: metric_path_prefix
     type: string
     default: firehose
@@ -120,4 +119,4 @@ forms:
     type: boolean
     default: false
     label: Nozzle Debugging
-    description: Enable Nozzle Debugging Features. With this enabled each Stackdriver Nozzle instance will host a web server on 0.0.0.0:6060 that exposes debug information such as a heap dump and running threads.
\ No newline at end of file
+    description: Enable Nozzle Debugging Features. With this enabled each Stackdriver Nozzle instance will host a web server on port 6060 that exposes debug information such as a heap dump and running threads.
\ No newline at end of file

Add monitoring roles to documentation

The currently suggested roles do not allow the stackdriver-agent to set up/write metrics. Today it needs the 'Editor' role to write metrics. Additionally, it seems the agent performs configuration on first start that requires the 'Owner' role.

Provide a snippet to create a service account and add the appropriate roles, similar to the example app.

Clarify Project Status

The project status is outdated and does not reflect the current state. The overall nozzle is stable and the addons (host agents) are also stable.

  • Remove general beta warning in the README
  • Add beta verbiage around sending Metrics to Stackdriver Logging

Possible memory leak due to quota exhaustion

Hi,

We are using Stackdriver nozzle tile 1.0.3 with PCF 1.11.6

On our stackdriver-nozzle VMs, within a few minutes the memory utilization increases to 98% and the service crashes/restarts.

What we suspect the problem to be:

  1. We are exceeding the metric descriptor quota for the service account used. However, it is not clear precisely which quota needs to be increased.
  2. Memory should not spike when the quota is reached; rather, the in-memory content should be flushed on this error. This is causing intermittent failures of our deployments.

What we expect out of this issue:

  1. The installation docs of the stackdriver-nozzle tile for PCF should mention which quota needs to be increased and to what (if possible).
  2. If the quota is reached, memory should not leak.

Logs:

  1. During the time when memory utilization is increasing:
{"timestamp":"1503403764.702794075","source":"stackdriver-nozzle","message":"stackdriver-nozzle.metricsBuffer","log_level":2,"data":{"error":"rpc error: code = 8 desc = Your metric descriptor quota has been exhausted"}}
{"timestamp":"1503403764.708293676","source":"stackdriver-nozzle","message":"stackdriver-nozzle.metricsBuffer","log_level":2,"data":{"error":"rpc error: code = 8 desc = Your metric descriptor quota has been exhausted"}}
  2. When the service crashes:

{"timestamp":"1503566611.790120125","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":3,"data":{"cleanup":"The metrics buffer was successfully flushed before shutdown","error":"read tcp 10.0.0.51:48572-\u003e130.211.228.210:443: read: connection reset by peer","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal(0xc4201842a0, 0x95f4e4, 0x8, 0xc157e0, 0xc5174fa820, 0xc5176e4648, 0x1, 0x1)\n\t/var/vcap/data/compile/stackdriver-nozzle/go/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:131 +0xc7\nmain.main()\n\t/var/vcap/data/compile/stackdriver-nozzle/go/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:59 +0x43d\n"}}
{"timestamp":"1503566611.792421341","source":"stackdriver-nozzle","message":"stackdriver-nozzle.heartbeater","log_level":1,"data":{"debug":"Stopping heartbeater"}}
{"timestamp":"1503566619.110969305","source":"stackdriver-nozzle","message":"stackdriver-nozzle.version","log_level":1,"data":{"name":"cf-stackdriver-nozzle","release":"1.0.3","user_agent":"cf-stackdriver-nozzle/1.0.3"}}
{"timestamp":"1503566619.120277643","source":"stackdriver-nozzle","message":"stackdriver-nozzle.arguments","log_level":1,"data":{"APIEndpoint":"https://api.gcp.trackerred.com","BatchCount":10,"BatchDuration":1,"DebugNozzle":false,"Events":"CounterEvent,Error,HttpStartStop,LogMessage,ValueMetric,ContainerMetric","HeartbeatRate":30,"NewlineToken":"","Password":"\u003credacted\u003e","ProjectID":"<project-id>","ResolveAppMetadata":true,"SkipSSL":false,"SubscriptionID":"stackdriver-nozzle","Username":"<username>"}}
{"timestamp":"1503566619.138031244","source":"stackdriver-nozzle","message":"stackdriver-nozzle.heartbeater","log_level":1,"data":{"debug":"Starting heartbeater"}}
{"timestamp":"1503566619.802249670","source":"stackdriver-nozzle","message":"stackdriver-nozzle.heartbeater","log_level":1,"data":{"debug":"Starting heartbeater"}}
  3. Monit status during the last few minutes before the crash: (screenshot omitted)

  4. Memory utilization for one stackdriver-nozzle VM; the rest of the VMs look similar: (screenshot omitted)

  5. Service logs:

# tail -f stackdriver-nozzle-ctl.err.log
[2017-08-24 01:48:25+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 01:48:24 UTC 2017 --------------
[2017-08-24 06:16:30+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 06:16:30 UTC 2017 --------------
[2017-08-24 08:03:45+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 08:03:45 UTC 2017 --------------
[2017-08-24 09:23:38+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 09:23:38 UTC 2017 --------------
# tail -f stackdriver-nozzle-ctl.log
[2017-08-24 08:03:45+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 08:03:45 UTC 2017 --------------
[2017-08-24 08:03:45+0000] Removing stale pidfile...
[2017-08-24 09:23:38+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 09:23:38 UTC 2017 --------------
[2017-08-24 09:23:38+0000] Removing stale pidfile...

heartbeat.Increment blocks when it's not being drained

During internet connection loss the heartbeater blocks when performing Increment while waiting for the channel to be ready to send.

The nozzle runs go heartbeater.Increment() in several places because we want to limit the effect of telemetry on the hot path. When the call blocks, we can end up with an unbounded number of these goroutines:

goroutine 369 [chan send]:
github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/heartbeat.(*heartbeater).IncrementBy(0xc420224300, 0xb60b50, 0x15, 0xea60)
	/usr/local/google/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/heartbeat/heartbeater.go:130 +0x6c
created by github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle.(*nozzle).Start.func2
	/usr/local/google/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle/nozzle.go:86 +0x6d
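
One possible mitigation (a sketch with illustrative type and field names, not the project's code) is to make the send non-blocking, dropping counts instead of parking goroutines when nothing is draining the channel:

package heartbeat // illustrative

type counterEvent struct {
	name  string
	count uint
}

type heartbeater struct {
	counters chan counterEvent
}

// IncrementBy sketches a non-blocking send: if nothing is draining the
// channel (e.g. during connection loss), drop the count instead of parking
// an unbounded number of goroutines on a channel send.
func (h *heartbeater) IncrementBy(name string, count uint) {
	select {
	case h.counters <- counterEvent{name: name, count: count}:
	default:
		// Dropped; a real implementation might track drops in a counter.
	}
}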

Repro

  • Apply this patch
diff --git a/src/stackdriver-nozzle/cloudfoundry/firehose.go b/src/stackdriver-nozzle/cloudfoundry/firehose.go
index ac60fc2..13caed1 100644
--- a/src/stackdriver-nozzle/cloudfoundry/firehose.go
+++ b/src/stackdriver-nozzle/cloudfoundry/firehose.go
@@ -53,6 +53,8 @@ func (c *firehose) Connect() (<-chan *events.Envelope, <-chan error) {
 	refresher := cfClientTokenRefresh{cfClient: c.cfClient}
 	cfConsumer.SetIdleTimeout(time.Duration(30) * time.Second)
 	cfConsumer.RefreshTokenFrom(&refresher)
+	// DO NOT CHECK IN
+	cfConsumer.SetMaxRetryCount(1)
 	return cfConsumer.Firehose(c.subscriptionID, "")
 }
  • Run stackdriver-nozzle
  • Disconnect from the internet

Use Native Google Cloud Storage for Release Blobs

This BOSH release uses Google Cloud Storage (GCS) for storing release blobs in S3 compatibility mode and should be migrated to native GCS. This enables service account support and better support for large file uploads.

The latest version of bosh2 (2.0.28-cb77557-2017-07-11T23:04:21Z) supports native GCS as a blobstore (see: cloudfoundry/bosh-cli#238).

Migration Plan

  1. Ensure project and developers are using the latest bosh2 (>= 2.0.28-cb77557-2017-07-11T23:04:21Z). This is needed for CI pipelines and wherever releases are built.

  2. Sync blobs locally with BOSH v2:

    bosh2 sync-blobs
  3. Remove object_ids from config/blobs.yml:

    sed -i '/object_id/d' config/blobs.yml
  4. Update config/final.yml:

    ---
    final_name: <<unchanged>>
    blobstore:
      provider: gcs
      options:
        bucket_name: <<unchanged>>
        # remove: host, endpoint, use_ssl
  5. Update config/private.yml (secrets for developers and CI, do not check in)

    blobstore:
      options:
        json_key: <<service account key>>

    To generate a new service account/key:

    export project_id=my-gcp-project          # project hosting your GCS bucket
    export bucket_name=my-bosh-release-blobs  # GCS bucket name
    
    export service_account_name=${bucket_name}-blobs
    export service_account_email=${service_account_name}@${project_id}.iam.gserviceaccount.com
    credentials_file=$(mktemp)
    
    gcloud config set project ${project_id}
    gcloud iam service-accounts create ${service_account_name} --display-name "BOSH-CLI access for ${bucket_name}"
    gsutil iam ch serviceAccount:${service_account_email}:objectCreator,objectViewer gs://${bucket_name}
    gcloud iam service-accounts keys create ${credentials_file} --iam-account ${service_account_email}
    
    echo "$(cat ${credentials_file})"
  6. Re-upload the blobs to confirm everything works and reassign IDs:

    bosh2 upload-blobs

Error filling in template 'event_filters.json.erb'

Fail in CI:

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'stackdriver-nozzle'. Errors are:
     - Unable to render templates for job 'stackdriver-nozzle'. Errors are:
       - Error filling in template 'event_filters.json.erb' (line 4: Can't find property '["nozzle.event_filters.blacklist"]')

Use the gce_instance monitored resource type

Using gce_instance as the monitored resource type (rather than the global monitored resource) increases throughput as nozzle instances are added, since the Stackdriver API shards based on the gce_instance's instance_id label.

Enable Nozzle Debugging for Tile Users

The nozzle has the ability to send errors/crashes that it generates to Stackdriver Monitoring. This is done today when 'DEBUG_NOZZLE' is turned on.

These reports will be useful for operators managing the nozzle. Let's expose it as a property on the tile.

http2Client.notifyError got notified that the client transport was broken EOF

Hi,

We are using Stackdriver nozzle tile 1.0.6 with PCF 1.11.6

We are not able to see logs in our Stackdriver/logging project. The error in Stackdriver-nozzle vm is:
http2Client.notifyError got notified that the client transport was broken EOF

We started seeing this issue exactly after installing 1.0.6 of Stackdriver nozzle. We upgraded directly from 1.0.3 to 1.0.6.

Is there any way to debug this?

HttpStartStop does not format requestID properly

Actual:

httpStartStop: {
   startTimestamp: 1480442028497669600     
   peerType: "Server"     
   requestId: {
    low: 15730922093683923000      
    high: 13584426878641205000      
   }
   method: "GET"     
   stopTimestamp: 1480442028506655500     
   statusCode: 200     
   contentLength: 42     
   uri: "https://api.cf.jrjohnsondev.cloudnativeapp.com/v2/syslog_drain_urls"     
   userAgent: "Go-http-client/1.1"     
   remoteAddress: "104.198.9.208:34258"     
  }

Expected: A proper GUID (abcdef-01020..)
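
A sketch of the expected conversion (the dropsonde UUID's low/high uint64 halves are the little-endian byte runs of a standard GUID):

package nozzle // illustrative

import (
	"encoding/binary"
	"fmt"

	"github.com/cloudfoundry/sonde-go/events"
)

// formatUUID renders the dropsonde low/high pair as a standard GUID string.
func formatUUID(id *events.UUID) string {
	var b [16]byte
	binary.LittleEndian.PutUint64(b[0:8], id.GetLow())
	binary.LittleEndian.PutUint64(b[8:16], id.GetHigh())
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}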

Fatal error from Firehose causes nozzle to spin

The firehose is hitting a fatal error and seems to try to shut down the nozzle, but the nozzle process does not exit; it just spins idle.

Example error:

{"timestamp":"1505114919.113666058","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":3,"data":{"cleanup":"The metrics buffer was successfully flushed before shutdown","error":"websocket: close 1008 (policy violation): Client did not respond to ping before keep-alive timeout expired.","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal(0xc420152240, 0xa37f5f, 0x8, 0xd94ac0, 0xc4203618a0, 0xc420145370, 0x1, 0x1)\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:132 +0xca\nmain.main()\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:61 +0x620\n"}}

The app needs to either retry the connection or exit. This is related to #107, which has the symptom of no metrics/logs being reported.

Nozzle can't recover from an expired refresh token

We've seen a case of a refresh token used by the nozzle expiring, which resulted in the nozzle process never being able to reconnect to Firehose when it disconnects. Relevant log messages (human-readable timestamp in UTC prepended to each log message):

2018-01-02T11:37:03.646067 {"timestamp":"1514893023.646067142","source":"stackdriver-nozzle","message":"stackdriver-nozzle.arguments","log_level":1,"data":{...}}
...nozzle started, working fine for a while. Then disconnect happens...
2018-01-03T20:13:53.360623 {"timestamp":"1515010433.360623360","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"read tcp [redacted]:51612-\u003e[redacted]:443: i/o timeout"}}
2018-01-03T20:13:53.886122 {"timestamp":"1515010433.886122227","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:54.922373 {"timestamp":"1515010434.922372818","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:56.946103 {"timestamp":"1515010436.946102858","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}

The refresh token (which I redacted) in this case had an issue time of 1514893023 (Jan 2 11:37:03 UTC), so it was the same refresh token that was issued when the nozzle process started. I don't yet have a good understanding of how refresh tokens are supposed to be refreshed, but it clearly did not happen here.

The nasty part is that the nozzle remains in this broken state indefinitely and needs to be restarted manually.

Two possible workarounds come to mind:

  • Like suggested in cloudfoundry/go-cfclient#34, recreate the cfclient from scratch when cfClient.GetToken() fails. This will probably require moving cfclient creation closer to firehose.go (which might be tricky, since the same client is also used in AppInfoRepository).
  • Just panic in cfClientTokenRefresh.RefreshAuthToken() if a token cannot be refreshed several times in a row, making sure the process is restarted and all tokens are refreshed (a sketch of this option follows the list).
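
A sketch of the second option (the failure-count field and threshold are illustrative, not the project's code):

package cloudfoundry // illustrative

import (
	"fmt"

	cfclient "github.com/cloudfoundry-community/go-cfclient"
)

type cfClientTokenRefresh struct {
	cfClient            *cfclient.Client
	consecutiveFailures int
}

// RefreshAuthToken panics after several consecutive failures so monit
// restarts the process and all tokens are re-issued.
func (ct *cfClientTokenRefresh) RefreshAuthToken() (string, error) {
	token, err := ct.cfClient.GetToken()
	if err != nil {
		ct.consecutiveFailures++
		if ct.consecutiveFailures >= 3 {
			panic(fmt.Sprintf("RefreshAuthToken failed %d times in a row: %v", ct.consecutiveFailures, err))
		}
		return "", err
	}
	ct.consecutiveFailures = 0
	return token, nil
}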

@johnsonj, any thoughts?

Break requirement for credentials job in tile

Background

The tile is an easy way for PCF users to deploy the nozzle. With a tile uploaded, it's possible to deploy the agents (google-fluentd, stackdriver-agent) as addons:

- name: stackdriver-tools
  version: latest 

addons:
- name: stackdriver-agents
  jobs:
  - name: credentials
    release: stackdriver-tools
  - name: google-fluentd
    release: stackdriver-tools
  - name: stackdriver-agent
    release: stackdriver-tools
  properties:
    credentials:
      application_default_credentials: |
        ...
    project_id: ...

Problem

With this addon co-located we can no longer deploy the nozzle because it also deploys the credentials job.

Error

Director task 152
  Started preparing deployment > Preparing deployment. Failed: Colocated job 'credentials' is already added to the instance group 'stackdriver-nozzle'. (00:00:00)

Error 100: Colocated job 'credentials' is already added to the instance group 'stackdriver-nozzle'.

Task 152 error

Proposal

Allow the user to pass in service account JSON and use this for the tile. Create a credentials file as part of the stackdriver-nozzle job and export GOOGLE_APPLICATION_CREDENTIALS.

Stackdriver-nozzle repeat panic (every 30 seconds)

Panic.txt
Good morning,

We are running CF-Deployment 1.12.0, Stackdriver-tools 1.0.2 on GCP.

We continually see the following panic message (attached), which results in the stackdriver process failing; monit continually restarts it. We still see metrics and logs flowing into Stackdriver itself, so we are not 100% sure whether we are experiencing any data loss at the moment. Note this has been occurring for some time (it is not linked to any specific version of cf-deployment).

Opt-in to Alpha Labels/Metrics

The labels and metric names in develop have changed significantly (#136, #138, #144) and may continue to adapt as we iterate on the nozzle.

In order to allow fast iteration on the labels/metrics while releasing reliability improvements, we should add a toggle to enable the new behavior. The toggle should select either the master-branch behavior as it is in v1.0.5, or the anything-goes alpha behavior. This will help us transition to a 2.0 release where we can break users' dashboards.

I believe we can accomplish this relatively easily:

  • Restore the original labelMaker as legacyLabelMaker
  • Add a flag to config to EnableAlphaMetrics. Plumb through job spec/tile UI.
  • Inject the correct labelMaker during App construction
  • Add a conditional to the metric prefix assignment (or perhaps refactor this into a service object)

We will drop the opt-in and legacy code paths with the v2.0.0 release.

/cc @fluffle @knyar

PCF 2.0 Support

  • use dynamic_ips instead of static_ips in tile
  • use bosh2 to create releases: #111
  • create release with --sha2

Bad release: v1.0.6

This release is broken. Under heavy load the nozzle is susceptible to hanging. Manual validation did not pick this up, possibly because the nozzle logs plenty of its own metrics, or because conflicting nozzle versions were writing to the same project.

  • Pull release from PivNet
  • Remove release from GitHub. Tag will remain for historical reasons but binary is misleading.
  • Add/re-introduce buffer

Related issue: #107

The good news is this exposed several edge case bugs around hangs/shutdown.

Create GCP project for CI

Currently the deploy stage of the CI pipeline pushes the release to a CF installation running in a random project. We should provision a new project specific to stackdriver-tools with a minimal CF installation for this specific purpose.

Support exporting metrics to Stackdriver Logging

Today we split events from the Loggregator into Stackdriver Logging/Monitoring. It is desirable to have the metric data available in Stackdriver Logging so users can use exports for further analysis. An example would be exporting to BigQuery to perform custom calculations on metrics not available in Stackdriver Monitoring.

Design Considerations:

  • Should it be 'all or nothing' to log to both Stackdriver Logging and Monitoring, or should each endpoint have its own list of events?
  • Should we perform the same culling for metrics destined for Stackdriver Logging?

Throttle TimeSeries requests to Stackdriver metrics API

The Stackdriver API recommends sending at most one value per TimeSeries every 30s (a batch request can contain 200 individual TimeSeries values). The nozzle should keep a map of TimeSeries values and write them to the API on a 30s interval.
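
A sketch of the suggested buffering (illustrative, not the project's code; a real key would include label values as well as the metric type, and flush would apply the 200-TimeSeries-per-request batching):

package nozzle // illustrative

import (
	"time"

	monitoringpb "google.golang.org/genproto/googleapis/monitoring/v3"
)

// throttle keeps only the newest point per time-series key and flushes once
// per 30s interval.
func throttle(incoming <-chan *monitoringpb.TimeSeries, flush func([]*monitoringpb.TimeSeries)) {
	buffer := make(map[string]*monitoringpb.TimeSeries)
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case ts := <-incoming:
			buffer[ts.GetMetric().GetType()] = ts // newest point wins within the window
		case <-ticker.C:
			batch := make([]*monitoringpb.TimeSeries, 0, len(buffer))
			for _, ts := range buffer {
				batch = append(batch, ts)
			}
			flush(batch)
			buffer = make(map[string]*monitoringpb.TimeSeries)
		}
	}
}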

Convert example to BOSH add-ons

With the new BOSH 2.0 features, the GCP tools can be colocated on every VM using add-ons instead of being specified manually in each deployment manifest.

Emit a metric for dropped metrics due to RST_STREAM

Why does RST_STREAM happen:

  1. A metric is sent out of order
  2. A metric is sent too frequently

In #72 we addressed the RST_STREAM error for the second case. We did not address the first case.

Loggregator does not guarantee order of event delivery, and it scales by sharding messages across multiple nozzle deployments. It is not reasonably possible to re-order these metrics and coordinate that across various nozzles, and it doesn't make sense to redefine the semantics of the system at the nozzle level.

We should gracefully handle the RST_STREAM error and emit a metric that is actionable for operators.
