stackdriver-tools's Introduction

Status

This project is no longer actively maintained and the repository has been archived.

stackdriver-tools release for BOSH

This release provides Cloud Foundry and BOSH integration with Google Cloud Platform's Stackdriver Logging and Monitoring.

Functionality is provided by three jobs in this release: the Stackdriver Nozzle (stackdriver-nozzle), the host logging agent (google-fluentd), and the host monitoring agent (stackdriver-agent).

Project Status

The following features are generally available:

  • Stackdriver Host Monitoring Agent (stackdriver-agent)
  • Stackdriver Host Logging Agent (google-fluentd)
  • Stackdriver Nozzle (stackdriver-nozzle)
    • Stackdriver Logging for Cloud Foundry Log Events (LogMessage, Error, HttpStartStop)
    • Stackdriver Monitoring for Cloud Foundry Metric Events (ContainerMetric, ValueMetric, CounterEvent)

The following feature is in beta:

  • Stackdriver Nozzle
    • Stackdriver Logging for Cloud Foundry Metric Events (ContainerMetric, ValueMetric, CounterEvent)

The project was developed in partnership between Google and Pivotal and was maintained by Google until the repository was archived.

Getting started

Enable Stackdriver APIs

Ensure the Stackdriver Logging and Stackdriver Monitoring APIs are enabled.

Quotas

Depending on the size of the Cloud Foundry deployment and which events the nozzle is forwarding, it can be quite easy to reach the default Stackdriver quotas.

Google quotas can be viewed and managed on the API Quotas page. An operator can raise the default quotas up to a limit; beyond that, use the contact links on the page to request higher quotas.

Create and configure service accounts

All of the jobs in this release authenticate to Stackdriver Logging and Monitoring via service accounts. Follow the GCP documentation to create a service account via gcloud with the following roles; a sketch of the commands follows the list:

  • roles/logging.logWriter
  • roles/logging.configWriter
  • roles/monitoring.metricWriter
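
A minimal sketch of the gcloud commands, assuming an illustrative project named my-project and an account named stackdriver-tools:

# Illustrative values; substitute your own project and account name.
export PROJECT_ID=my-project
export SA_NAME=stackdriver-tools
export SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud iam service-accounts create "${SA_NAME}" --project "${PROJECT_ID}" \
    --display-name "stackdriver-tools"

for role in roles/logging.logWriter roles/logging.configWriter roles/monitoring.metricWriter; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member "serviceAccount:${SA_EMAIL}" --role "${role}"
done

gcloud iam service-accounts keys create service_account.json --iam-account "${SA_EMAIL}"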

You can authenticate the job(s) either by specifying the service account in the cloud_properties of the resource pool running them, or by configuring credentials.application_default_credentials in the job spec.
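
For the second option, a minimal sketch of the relevant job properties (the key contents are elided; the structure mirrors the credentials job properties shown later in this document):

properties:
  credentials:
    application_default_credentials: |
      {
        "type": "service_account",
        ...
      }
  project_id: my-project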

You may also read the access control documentation for more general information about how authentication and authorization work for Stackdriver.

General usage

To use any of the jobs in this BOSH release, first upload it to your BOSH director:

bosh2 upload-release https://storage.googleapis.com/bosh-gcp/beta/stackdriver-tools/latest.tgz

The stackdriver-tools.yml sample BOSH 2.0 manifest illustrates how to use all 3 jobs in this release (nozzle, host logging, and host monitoring). You can deploy the sample with the following commands:

bosh2 upload-stemcell https://bosh.io/d/stemcells/bosh-google-kvm-ubuntu-trusty-go_agent

bosh2 update-cloud-config -n manifests/cloud-config-gcp.yml \
          -v zone=... \
          -v network=... \
          -v subnetwork=... \
          -v "tags=['stackdriver-nozzle']" \
          -v internal_cidr=... \
          -v internal_gw=... \
          -v "reserved=[10....-10....]"

bosh2 deploy manifests/stackdriver-tools.yml \
            -d stackdriver-nozzle \
            --var=firehose_endpoint=https://.. \
            --var=firehose_username=stackdriver_nozzle \
            --var=firehose_password=... \
            --var=skip_ssl=false \
            --var=gcp_project_id=... \
            --var-file=gcp_service_account_json=path/to/service_account.json

This will create a self-contained deployment that sends Cloud Foundry firehose data, host logs, and host metrics to Stackdriver.

Deploying each job individually is described in detail below.

Deploying the nozzle

Create a new deployment manifest for the nozzle. See the example manifest for a full deployment and the jobs.stackdriver-nozzle section for the nozzle.

To reduce message loss, operators should run a minimum of two instances. With two instances, updating stemcells and other destructive BOSH operations will still leave an instance draining logs.

The Loggregator system will round-robin messages across multiple instances. If the nozzle can't handle the load, consider scaling to more than two nozzle instances.
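
A minimal sketch of the corresponding instance group in a BOSH 2.0 manifest (values illustrative; see the sample manifest for the full set of properties):

instance_groups:
- name: stackdriver-nozzle
  instances: 2  # minimum recommended; one instance keeps draining logs during updates
  jobs:
  - name: stackdriver-nozzle
    release: stackdriver-tools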

The spec describes all the properties an operator should modify.

Stackdriver Error Reporting

Stackdriver can automatically detect and report errors from stack traces in logs. However, this does not automatically work with Loggregator because it sends each line from app output as a separate log message to the nozzle. To enable this feature of Stackdriver, apps will need to manually encode stacktraces on a single line so that the stackdriver-nozzle can send them as single messages to Stackdriver.

This is accomplished by replacing newlines in stacktraces with a unique token, which is set via the firehose.newline_token template variable in the nozzle; the nozzle converts the token back into newlines so that the stacktrace is reconstructed across multiple lines.

For example, if firehose.newline_token is set to ∴, a Go app would need to implement something like the following:

package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

const newlineToken = "∴"

func main() {
	// ...
	defer handlePanic()
	// ...
}

func handlePanic() {
	e := recover()
	if e == nil {
		return
	}

	// Capture the stack for all goroutines.
	stack := make([]byte, 1<<16)
	stackSize := runtime.Stack(stack, true)
	out := string(stack[:stackSize])

	// Log the panic on its own line, then emit the stacktrace as a single
	// line with newlines replaced by the token.
	fmt.Fprintf(os.Stderr, "panic: %v\n", e)
	fmt.Fprintln(os.Stderr, strings.Replace(out, "\n", newlineToken, -1))
	os.Exit(1)
}

This outputs the stacktrace separately from the panic so that the panic remains in the logs and the stacktrace is logged by itself. This allows Stackdriver to detect the stacktrace as an error.

For an example in Java, see this section of the Loggregator documentation.

Deploying host logging

The google-fluentd template uses Fluentd to send both syslog and template logs (assuming that template jobs are writing logs into /var/vcap/sys/log/*/*.log) to Stackdriver Logging.

To forward host logs from BOSH VMs to Stackdriver, co-locate the google-fluentd template with an existing job whose host logs should be forwarded.

Include the stackdriver-tools release in your existing deployment manifest:

releases:
  ...
  - name: stackdriver-tools
    version: latest
  ...

Add the google-fluentd template to your job:

jobs:
  ...
  - name: nats
    templates:
      - name: nats
        release: cf
      - name: metron_agent
        release: cf
      - name: google-fluentd
        release: stackdriver-tools
  ...

Deploying host monitoring

The stackdriver-agent template uses the Stackdriver Monitoring Agent to collect VM metrics and send them to Stackdriver Monitoring.

To forward host metrics from BOSH VMs to Stackdriver, co-locate the stackdriver-agent template with an existing job whose host metrics should be forwarded.

Include the stackdriver-tools release in your existing deployment manifest:

releases:
  ...
  - name: stackdriver-tools
    version: latest
  ...

Add the stackdriver-agent template to your job:

jobs:
  ...
  - name: nats
    templates:
      - name: nats
        release: cf
      - name: metron_agent
        release: cf
      - name: stackdriver-agent
        release: stackdriver-tools
  ...

Deploying as a BOSH addon

Specify the jobs as addons in your runtime config to deploy Stackdriver Monitoring and Logging agents on all instances in your deployment. Do not specify the jobs as part of your deployment manifest if you are using the runtime config.

# runtime.yml
---
releases:
  - name: stackdriver-tools
    version: latest

addons:
- name: stackdriver-tools
  jobs:
  - name: google-fluentd
    release: stackdriver-tools
  - name: stackdriver-agent
    release: stackdriver-tools

To update the runtime config:

bosh2 update-runtime-config runtime.yml

Then redeploy your manifest:

bosh2 deploy -d <your deployment> path/to/manifest.yml

Development

Updating google-fluentd

google-fluentd is versioned by the Gemfile in src/google-fluentd. To update fluentd:

  1. Update the version specifier in the Gemfile (if necessary)
  2. Update Gemfile.lock: bundle update
  3. Create a vendor cache from the Gemfile.lock: bundle package
  4. Tar and compress the vendor folder: tar zcvf google-fluentd-vendor-<VERSION>-plugin-<VERSION>.tgz vendor
  5. Update the vendor version in the google-fluentd package packaging and spec
  6. Add the vendored cache to the BOSH blobstore: bosh2 add-blob google-fluentd-vendor-<VERSION>-plugin-<VERSION>.tgz google-fluentd-vendor/google-fluentd-vendor-<VERSION>-plugin-<VERSION>.tgz
  7. Create a dev release and deploy it to verify that all of the above worked
  8. Update the BOSH blobstore: bosh2 upload-blobs
  9. Commit your changes

bosh-lite

Both the nozzle and the fluentd jobs can run on bosh-lite. To generate a working manifest, start from the bosh-lite-example-manifest. Note the application_default_credentials property, which should be filled in with the contents of a Google service account key.

Contributing

For details on how to contribute to this project - including filing bug reports and contributing code changes - please see CONTRIBUTING.md.

Copyright

Copyright (c) 2016 Ferran Rodenas. See LICENSE for details.

stackdriver-tools's People

Contributors

chentom88, cholick, erjohnso, evandbrown, fluffle, frodenas, garimasharma, johnsonj, kejadlen, knyar, mattysweeps, nrxus, pivotal-jwynne, pivotalsquid, sarahwalther, stonish


stackdriver-tools's Issues

Deploy the Spinner with the Tile

Push the Spinner cf app as part of the tile. Re-use the service account used by the nozzle. This may require adding a GOOGLE_CREDENTIALS_JSON to the manifest and plumbing it through to the Stackdriver client (see root_service_account_json and client creation for an example).

Surface the configuration options (with friendly name/description/sane default):

  • SPINNER_COUNT
  • SPINNER_WAIT

Open questions:

  • Do we need to enable/disable it or just assume users want it?
  • Should we deploy multiple instances?

/cc @hustons @sahilm @GarimaSharma (+tom)

Nozzle sends more than 200 time series in a single request

With a nozzle built from the develop branch and the default metrics_buffer_size set to 200, I am seeing the following errors from Stackdriver:

code = InvalidArgument desc = Field timeSeries had an invalid value: A maximum of 200 TimeSeries can be written in a single request.

I believe this is caused by the fact that metrics_buffer_size limits the number of MetricEvents passed to the metric adapter; however, each MetricEvent can have multiple metrics (which share the same labels). As a result, 200 MetricEvents can produce more than 200 TimeSeries in a CreateTimeSeriesRequest.

There should probably be a stricter check that ensures that CreateTimeSeriesRequest always conforms to API restrictions and does not have more than 200 time series.
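
One way to enforce the limit (a sketch assuming the adapter flattens MetricEvents into monitoringpb.TimeSeries values before writing; not the project's actual code):

package nozzle // illustrative

import (
	monitoringpb "google.golang.org/genproto/googleapis/monitoring/v3"
)

// The Stackdriver API accepts at most 200 TimeSeries per request.
const maxTimeSeriesPerRequest = 200

// batchTimeSeries splits the flattened TimeSeries (which may outnumber the
// MetricEvents they came from) into API-compliant request batches.
func batchTimeSeries(series []*monitoringpb.TimeSeries) [][]*monitoringpb.TimeSeries {
	var batches [][]*monitoringpb.TimeSeries
	for len(series) > maxTimeSeriesPerRequest {
		batches = append(batches, series[:maxTimeSeriesPerRequest])
		series = series[maxTimeSeriesPerRequest:]
	}
	if len(series) > 0 {
		batches = append(batches, series)
	}
	return batches
}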

Provide Bosh 2.0 manifest.

It would be convenient to have a Bosh 2.0 manifest available for deployments.

This would remove the need for the BOSH 1 CLI (which supports ERB), and would allow users to take advantage of vars files and the vars store.

/cc @apoydence

rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 2

Hello,

I am using the nozzle in combination with google cloud stackdriver. In our logs we keep seeing the following errors:

rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 2

our configuration looks like this:

    export FIREHOSE_ENDPOINT=https://api.our.domain.io
    export FIREHOSE_USERNAME=firehose
    export FIREHOSE_PASSWORD=password
    export FIREHOSE_EVENTS=LogMessage,Error,HttpStartStop,CounterEvent,ValueMetric,ContainerMetric
    export FIREHOSE_SKIP_SSL=false
    export FIREHOSE_SUBSCRIPTION_ID=stackdriver-nozzle
    export FIREHOSE_NEWLINE_TOKEN=

    export DEBUG_NOZZLE=true
    export RESOLVE_APP_METADATA=true

could you help with this problem?

Many thanks,
Claudio

Deduplicate process-level metrics

Stackdriver has a limit of 500 custom metrics per project, and the latest build from the develop branch already attempts to create more. As a result, SD API requests fail with the following error message:

rpc error: code = ResourceExhausted desc = Your metric descriptor quota has been exhausted

Note that #139 increased the number of metrics by prepending origin to the metric name. While that is the right thing to do in general, several metrics seem to be created for multiple processes and mean the same thing for all of them:

  • memoryStats.lastGCPauseTimeNS
  • memoryStats.numBytesAllocated
  • memoryStats.numBytesAllocatedHeap
  • memoryStats.numBytesAllocatedStack
  • memoryStats.numFrees
  • memoryStats.numMallocs
  • numCPUS
  • numGoRoutines

In our test PCF instance the 8 metrics listed above repeat 26 times each, so deduplicating them (by not prepending origin to the metric name) would decrease the total number of metrics by 182. This seems like a quick, easy win, but I suspect in the future we might also want to add a metric blacklist/whitelist to give users better control of the number of metrics created by the nozzle.
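
A sketch of the deduplication idea (illustrative, not the project's code):

package nozzle // illustrative

// sharedMetrics lists process-level metrics that mean the same thing for
// every origin; skipping the origin prefix for these lets them deduplicate
// into a single metric descriptor.
var sharedMetrics = map[string]bool{
	"memoryStats.lastGCPauseTimeNS":      true,
	"memoryStats.numBytesAllocated":      true,
	"memoryStats.numBytesAllocatedHeap":  true,
	"memoryStats.numBytesAllocatedStack": true,
	"memoryStats.numFrees":               true,
	"memoryStats.numMallocs":             true,
	"numCPUS":                            true,
	"numGoRoutines":                      true,
}

func metricName(origin, name string) string {
	if sharedMetrics[name] {
		return name
	}
	return origin + "." + name
}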

@johnsonj, what do you think?

Emit Spinner results to Stackdriver Monitoring

The spinner emits a log entry that describes the outcome of its log-loss test.
The user should be able to emit a custom metric with this result to Stackdriver Monitoring.

possible metrics:

  • stackdriver-spinner/logs.sent - cumulative total of log messages sent to loggregator
  • stackdriver-spinner/logs.received - cumulative total of logs received by the probe
  • stackdriver-spinner/logs.lost - cumulative total of logs never received

labels:

  • director - corresponds to the same director value for Stackdriver Nozzle. Used when multiple PCF instances are logging to a single Stackdriver project. Make an ENV variable for the app?
  • index - index of the Cloud Foundry app (in case the user is running multiple copies)

/cc cloud-ops for implementation/collab @hustons @sahilm @GarimaSharma (+tom)
/cc cre for metrics guidance/awareness @fluffle @knyar

Downloading blobs requires credentials

It seems that 2baa699 broke the tile building script (scripts/build-custom-tile-docker.sh started in Docker by scripts/custom-tile) because downloading blobs requires credentials now:

Step 5/5 : RUN scripts/build-custom-tile-docker.sh
 ---> Running in 4a23e3392c29

Blob download 'golang/go1.9.linux-amd64.tar.gz' (id: 9389191f-2e77-4df2-6ccd-d4a1639ed201) failed
Blob download 'google-fluentd-vendor/google-fluentd-vendor-0.12-plugin-0.5.3.tgz' (id: da306bf9-a23c-46cc-69e7-7e86a54e7bbb) failed
Blob download 'google-fluentd-vendor/google-fluentd-vendor-0.14.tgz' (id: ad88f6c7-fd8d-41ae-6109-65e3f604e76f) failed
Blob download 'libtool/libtool-2.4.2.tar.gz' (id: a72448bd-9ae1-482a-54f3-a73f6820c2c4) failed
Blob download 'libyajl/yajl-2.1.0.tar.gz' (id: 0a77c971-4860-4059-5bb7-8b3e0049f2ff) failed
Downloading blobs:
  - Getting blob '9389191f-2e77-4df2-6ccd-d4a1639ed201' for path 'golang/go1.9.linux-amd64.tar.gz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob 'da306bf9-a23c-46cc-69e7-7e86a54e7bbb' for path 'google-fluentd-vendor/google-fluentd-vendor-0.12-plugin-0.5.3.tgz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob 'ad88f6c7-fd8d-41ae-6109-65e3f604e76f' for path 'google-fluentd-vendor/google-fluentd-vendor-0.14.tgz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob 'a72448bd-9ae1-482a-54f3-a73f6820c2c4' for path 'libtool/libtool-2.4.2.tar.gz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
  - Getting blob '0a77c971-4860-4059-5bb7-8b3e0049f2ff' for path 'libyajl/yajl-2.1.0.tar.gz':
      Building client SDK:
        google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Exit code 1

@johnsonj, is this something that can be easily fixed by changing permissions on the GCS bucket?

Shipping multiline stack traces from google-fluentd job for use with Stackdriver Error Reporting

I'd like to make use of the Stackdriver Error reporting functionality - https://cloud.google.com/error-reporting/docs/viewing - to track errors that CF platform components are logging.

As a concrete example; the CF UAA component logs verbose multiline Java stacktrace errors to /var/vcap/sys/log/uaa/uaa.log

For example:

[2016-12-20 18:50:08.070] uaa - 10605 [localhost-startStop-1] .... FATAL --- RecognizeFailureDispatcherServlet: Unable to start UAA application.
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.cloudfoundry.identity.uaa.security.web.SecurityFilterChainPostProcessor#0' defined in ServletContext resource [/WEB-INF/spring-servlet.xml]: Cannot resolve reference to bean 'identityZoneResolvingFilter' while setting bean property 'additionalFilters' with key [TypedStringValue: value [#{T(org.cloudfoundry.identity.uaa.security.web.SecurityFilterChainPostProcessor.FilterPosition).position(2)}], target type [null]]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'identityZoneResolvingFilter' defined in ServletContext resource [/WEB-INF/spring-servlet.xml]: Cannot resolve reference to bean 'identityZoneProvisioning' while setting bean property 'identityZoneProvisioning'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'identityZoneProvisioning' defined in ServletContext resource [/WEB-INF/spring/multitenant-endpoints.xml]: Cannot resolve reference to bean 'jdbcTemplate' while setting constructor argument; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'flyway' defined in class path resource [spring/data-source.xml]: Invocation of init method failed; nested exception is org.flywaydb.core.api.FlywayException: Unable to obtain Jdbc connection from DataSource
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:359)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:108)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveManagedMap(BeanDefinitionValueResolver.java:407)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:165)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1481)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1226)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:543)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:482)
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:306)
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
        ...snip...
        at org.springframework.web.servlet.FrameworkServlet.initWebApplicationContext(FrameworkServlet.java:553)
        at org.springframework.web.servlet.FrameworkServlet.initServletBean(FrameworkServlet.java:494)
        at org.springframework.web.servlet.HttpServletBean.init(HttpServletBean.java:136)
        at javax.servlet.GenericServlet.init(GenericServlet.java:158)
        at org.cloudfoundry.identity.uaa.web.RecognizeFailureDispatcherServlet.init(RecognizeFailureDispatcherServlet.java:56)
        at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1227)
        at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1140)
        at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1027)
        at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:5038)
...snip...

My understanding of Stackdriver Error Reporting is that if we can capture all the stacktrace information in a single log message, then we "automagically" get a host of Stackdriver Error Reporting goodness.

My concrete question is how to ship multiline Java stacktrace log messages; but it's part of a more generic question about where the logic for understanding the log format should live.

One place I can see the logic going is the jobs/google-fluentd/templates/vcap.conf config. This could potentially be extended with more detailed in_tail config for specific log files - in the case of the uaa.log file, some custom multiline config that recognises Java stack traces.
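
Something like the following in_tail stanza might work (a sketch only; the path, pos_file, tag, and regexes are illustrative and untested):

<source>
  @type tail
  path /var/vcap/sys/log/uaa/uaa.log
  pos_file /var/vcap/data/google-fluentd/uaa.log.pos
  tag vcap.uaa
  format multiline
  format_firstline /^\[\d{4}-\d{2}-\d{2}/
  format1 /^(?<message>.*)/
</source>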

However, before submitting a PR for the above; I'd like to find out if there is a better place to put such logic.

Thanks!

Instance index should be a label not a metric

Currently the container metric's "instance index" is being sent as a separate metric type. It should really be a label on the other container metrics (e.g. "this is the CPU utilization for this instance of this app").
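
A sketch of the proposed change (names are illustrative, not the project's code):

package nozzle // illustrative

import (
	"strconv"

	"github.com/cloudfoundry/sonde-go/events"
)

// containerMetricLabels sketches the proposal: the instance index becomes a
// label shared by the container's CPU/memory/disk metrics instead of being a
// metric type of its own.
func containerMetricLabels(appID string, cm *events.ContainerMetric) map[string]string {
	return map[string]string{
		"applicationId": appID,
		"instanceIndex": strconv.Itoa(int(cm.GetInstanceIndex())),
	}
}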

tile generates invalid ops manager manifest in develop

workaround:

diff --git a/tile.yml.erb b/tile.yml.erb
index d82806d..5313730 100644
--- a/tile.yml.erb
+++ b/tile.yml.erb
@@ -58,9 +58,7 @@ forms:
     type: string
     default: HttpStartStop,LogMessage,Error
     label: Whitelist for Stackdriver Logging
-    description: |
-      Comma separated list without spaces consisting of any or all of HttpStartStop,LogMessage,Error. 
-      The following events are in beta can also be used: ValueMetric,CounterEvent,ContainerMetric
+    description: Comma separated list without spaces consisting of any or all of HttpStartStop,LogMessage,Error. The following events are in beta can also be used, ValueMetric,CounterEvent,ContainerMetric
   - name: firehose_events_to_stackdriver_monitoring
     type: string
     default: CounterEvent,ValueMetric,ContainerMetric
@@ -106,6 +104,7 @@ forms:
     default: 1000
     label: Logging Batch Count
     description: Batch size for log messages being sent to Stackdriver
+    type: integer
   - name: metric_path_prefix
     type: string
     default: firehose
@@ -120,4 +119,4 @@ forms:
     type: boolean
     default: false
     label: Nozzle Debugging
-    description: Enable Nozzle Debugging Features. With this enabled each Stackdriver Nozzle instance will host a web server on 0.0.0.0:6060 that exposes debug information such as a heap dump and running threads.
\ No newline at end of file
+    description: Enable Nozzle Debugging Features. With this enabled each Stackdriver Nozzle instance will host a web server on port 6060 that exposes debug information such as a heap dump and running threads.
\ No newline at end of file

Add monitoring roles to documentation

The currently suggested roles do not allow the stackdriver-agent to set up/write metrics. Today it needs the 'Editor' role to write metrics. Additionally, it seems the agent performs configuration on first start that requires the 'Owner' role.

Provide a snippet to create a service account and add the appropriate roles, similar to the example app.

Clarify Project Status

The project status is outdated and does not reflect the current state. The overall nozzle is stable and the addons (host agents) are also stable.

  • Remove general beta warning in the README
  • Add beta verbiage around sending Metrics to Stackdriver Logging

Possible memory leak due to quota exhaustion

Hi,

We are using Stackdriver nozzle tile 1.0.3 with PCF 1.11.6

On our stackdriver-nozzle VMs, within a few minutes the memory utilization increases to 98% and the service crashes/restarts.

What we suspect the problem to be:

  1. We are exceeding the metric descriptor quota for the service account used. However, it is not clear precisely which quota needs to be increased.
  2. Memory should not spike when the quota is reached; rather, the in-memory content should be flushed on this error. This is causing intermittent failures of our deployments.

What we expect out of this issue:

  1. The installation docs of the stackdriver-nozzle tile for PCF should mention which quota needs to be increased and to what (if possible).
  2. If the quota is reached, memory should not leak.

Logs:

  1. During the time when memory utilization is increasing:
{"timestamp":"1503403764.702794075","source":"stackdriver-nozzle","message":"stackdriver-nozzle.metricsBuffer","log_level":2,"data":{"error":"rpc error: code = 8 desc = Your metric descriptor quota has been exhausted"}}
{"timestamp":"1503403764.708293676","source":"stackdriver-nozzle","message":"stackdriver-nozzle.metricsBuffer","log_level":2,"data":{"error":"rpc error: code = 8 desc = Your metric descriptor quota has been exhausted"}}
  2. When the service crashes:

{"timestamp":"1503566611.790120125","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":3,"data":{"cleanup":"The metrics buffer was successfully flushed before shutdown","error":"read tcp 10.0.0.51:48572-\u003e130.211.228.210:443: read: connection reset by peer","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal(0xc4201842a0, 0x95f4e4, 0x8, 0xc157e0, 0xc5174fa820, 0xc5176e4648, 0x1, 0x1)\n\t/var/vcap/data/compile/stackdriver-nozzle/go/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:131 +0xc7\nmain.main()\n\t/var/vcap/data/compile/stackdriver-nozzle/go/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:59 +0x43d\n"}}
{"timestamp":"1503566611.792421341","source":"stackdriver-nozzle","message":"stackdriver-nozzle.heartbeater","log_level":1,"data":{"debug":"Stopping heartbeater"}}
{"timestamp":"1503566619.110969305","source":"stackdriver-nozzle","message":"stackdriver-nozzle.version","log_level":1,"data":{"name":"cf-stackdriver-nozzle","release":"1.0.3","user_agent":"cf-stackdriver-nozzle/1.0.3"}}
{"timestamp":"1503566619.120277643","source":"stackdriver-nozzle","message":"stackdriver-nozzle.arguments","log_level":1,"data":{"APIEndpoint":"https://api.gcp.trackerred.com","BatchCount":10,"BatchDuration":1,"DebugNozzle":false,"Events":"CounterEvent,Error,HttpStartStop,LogMessage,ValueMetric,ContainerMetric","HeartbeatRate":30,"NewlineToken":"","Password":"\u003credacted\u003e","ProjectID":"<project-id>","ResolveAppMetadata":true,"SkipSSL":false,"SubscriptionID":"stackdriver-nozzle","Username":"<username>"}}
{"timestamp":"1503566619.138031244","source":"stackdriver-nozzle","message":"stackdriver-nozzle.heartbeater","log_level":1,"data":{"debug":"Starting heartbeater"}}
{"timestamp":"1503566619.802249670","source":"stackdriver-nozzle","message":"stackdriver-nozzle.heartbeater","log_level":1,"data":{"debug":"Starting heartbeater"}}
  3. Monit status during the last few minutes before the crash: (screenshot omitted)

  4. Memory utilization for one stackdriver-nozzle VM; the rest of the VMs look similar: (screenshot omitted)

  5. Service logs:

# tail -f stackdriver-nozzle-ctl.err.log
[2017-08-24 01:48:25+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 01:48:24 UTC 2017 --------------
[2017-08-24 06:16:30+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 06:16:30 UTC 2017 --------------
[2017-08-24 08:03:45+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 08:03:45 UTC 2017 --------------
[2017-08-24 09:23:38+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 09:23:38 UTC 2017 --------------
# tail -f stackdriver-nozzle-ctl.log
[2017-08-24 08:03:45+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 08:03:45 UTC 2017 --------------
[2017-08-24 08:03:45+0000] Removing stale pidfile...
[2017-08-24 09:23:38+0000] ------------ STARTING stackdriver-nozzle-ctl at Thu Aug 24 09:23:38 UTC 2017 --------------
[2017-08-24 09:23:38+0000] Removing stale pidfile...

heartbeat.Increment blocks when it's not being drained

During internet connection loss the heartbeater blocks when performing Increment while waiting for the channel to be ready to send.

The nozzle runs go heartbeater.Increment() in several places because we want to limit the effect of telemetry on the hot path. When the call blocks, we can end up with an unbounded number of these goroutines:

goroutine 369 [chan send]:
github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/heartbeat.(*heartbeater).IncrementBy(0xc420224300, 0xb60b50, 0x15, 0xea60)
	/usr/local/google/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/heartbeat/heartbeater.go:130 +0x6c
created by github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle.(*nozzle).Start.func2
	/usr/local/google/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle/nozzle.go:86 +0x6d
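
One possible mitigation (a sketch with illustrative type and field names, not the project's code) is to make the send non-blocking, dropping counts instead of parking goroutines when nothing is draining the channel:

package heartbeat // illustrative

type counterEvent struct {
	name  string
	count uint
}

type heartbeater struct {
	counters chan counterEvent
}

// IncrementBy sketches a non-blocking send: if nothing is draining the
// channel (e.g. during connection loss), drop the count instead of parking
// an unbounded number of goroutines on a channel send.
func (h *heartbeater) IncrementBy(name string, count uint) {
	select {
	case h.counters <- counterEvent{name: name, count: count}:
	default:
		// Dropped; a real implementation might track drops in a counter.
	}
}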

Repro

  • Apply this patch
diff --git a/src/stackdriver-nozzle/cloudfoundry/firehose.go b/src/stackdriver-nozzle/cloudfoundry/firehose.go
index ac60fc2..13caed1 100644
--- a/src/stackdriver-nozzle/cloudfoundry/firehose.go
+++ b/src/stackdriver-nozzle/cloudfoundry/firehose.go
@@ -53,6 +53,8 @@ func (c *firehose) Connect() (<-chan *events.Envelope, <-chan error) {
 	refresher := cfClientTokenRefresh{cfClient: c.cfClient}
 	cfConsumer.SetIdleTimeout(time.Duration(30) * time.Second)
 	cfConsumer.RefreshTokenFrom(&refresher)
+	// DO NOT CHECK IN
+	cfConsumer.SetMaxRetryCount(1)
 	return cfConsumer.Firehose(c.subscriptionID, "")
 }
  • Run stackdriver-nozzle
  • Disconnect from the internet

Use Native Google Cloud Storage for Release Blobs

This BOSH release uses Google Cloud Storage (GCS) for storing release blobs in S3 compatibility mode and should be migrated to native GCS. This enables service account support and better support for large file uploads.

The latest version of bosh2 (2.0.28-cb77557-2017-07-11T23:04:21Z) supports native GCS as a blobstore (see: cloudfoundry/bosh-cli#238).

Migration Plan

  1. Ensure project and developers are using the latest bosh2 (>= 2.0.28-cb77557-2017-07-11T23:04:21Z). This is needed for CI pipelines and wherever releases are built.

  2. Sync blobs locally with BOSH v2:

    bosh2 sync-blobs
  3. Remove object_ids from config/blobs.yml:

    sed -i '/object_id/d' config/blobs.yml
  4. Update config/final.yml:

    ---
    final_name: <<unchanged>>
    blobstore:
      provider: gcs
      options:
        bucket_name: <<unchanged>>
        # remove: host, endpoint, use_ssl
  5. Update config/private.yml (secrets for developers and CI, do not check in)

    blobstore:
      options:
        json_key: <<service account key>>

    To generate a new service account/key:

    export project_id=my-gcp-project          # project hosting your GCS bucket
    export bucket_name=my-bosh-release-blobs  # GCS bucket name
    
    export service_account_name=${bucket_name}-blobs
    export service_account_email=${service_account_name}@${project_id}.iam.gserviceaccount.com
    credentials_file=$(mktemp)
    
    gcloud config set project ${project_id}
    gcloud iam service-accounts create ${service_account_name} --display-name "BOSH-CLI access for ${bucket_name}"
    gsutil iam ch serviceAccount:${service_account_email}:objectCreator,objectViewer gs://${bucket_name}
    gcloud iam service-accounts keys create ${credentials_file} --iam-account ${service_account_email}
    
    echo "$(cat ${credentials_file})"
  6. Re-upload the blobs to confirm everything works and reassign IDs:

    bosh2 upload-blobs

Error filling in template 'event_filters.json.erb'

Fail in CI:

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'stackdriver-nozzle'. Errors are:
     - Unable to render templates for job 'stackdriver-nozzle'. Errors are:
       - Error filling in template 'event_filters.json.erb' (line 4: Can't find property '["nozzle.event_filters.blacklist"]')

Use the gce_instance monitored resource type

Using gce_instance as the monitored resource type (rather than the global monitored resource) increases throughput as nozzle instances are added, since the Stackdriver API shards based on the gce_instance's instance_id label.

Enable Nozzle Debugging for Tile Users

The nozzle has the ability to send errors/crashes that it generates to Stackdriver Monitoring. This is done today when 'DEBUG_NOZZLE' is turned on.

These reports will be useful for operators managing the nozzle. Let's expose it as a property on the tile.

http2Client.notifyError got notified that the client transport was broken EOF

Hi,

We are using Stackdriver nozzle tile 1.0.6 with PCF 1.11.6

We are not able to see logs in our Stackdriver/logging project. The error in Stackdriver-nozzle vm is:
http2Client.notifyError got notified that the client transport was broken EOF

We started seeing this issue exactly after installing 1.0.6 of Stackdriver nozzle. We upgraded directly from 1.0.3 to 1.0.6.

Is there any way to debug this?

HttpStartStop does not format requestID properly

Actual:

httpStartStop: {
   startTimestamp: 1480442028497669600     
   peerType: "Server"     
   requestId: {
    low: 15730922093683923000      
    high: 13584426878641205000      
   }
   method: "GET"     
   stopTimestamp: 1480442028506655500     
   statusCode: 200     
   contentLength: 42     
   uri: "https://api.cf.jrjohnsondev.cloudnativeapp.com/v2/syslog_drain_urls"     
   userAgent: "Go-http-client/1.1"     
   remoteAddress: "104.198.9.208:34258"     
  }

Expected: A proper GUID (abcdef-01020..)
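
A sketch of the expected conversion (the dropsonde UUID's low/high uint64 halves are the little-endian byte runs of a standard GUID):

package nozzle // illustrative

import (
	"encoding/binary"
	"fmt"

	"github.com/cloudfoundry/sonde-go/events"
)

// formatUUID renders the dropsonde low/high pair as a standard GUID string.
func formatUUID(id *events.UUID) string {
	var b [16]byte
	binary.LittleEndian.PutUint64(b[0:8], id.GetLow())
	binary.LittleEndian.PutUint64(b[8:16], id.GetHigh())
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}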

Fatal error from Firehose causes nozzle to spin

The firehose is hitting a fatal error and seems to try to shut down the nozzle, but the nozzle process does not exit; it just spins idle.

Example error:

{"timestamp":"1505114919.113666058","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":3,"data":{"cleanup":"The metrics buffer was successfully flushed before shutdown","error":"websocket: close 1008 (policy violation): Client did not respond to ping before keep-alive timeout expired.","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal(0xc420152240, 0xa37f5f, 0x8, 0xd94ac0, 0xc4203618a0, 0xc420145370, 0x1, 0x1)\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:132 +0xca\nmain.main()\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:61 +0x620\n"}}

The app needs to either retry the connection or exit. This is related to #107, which has the symptom of no metrics/logs being reported.

Nozzle can't recover from an expired refresh token

We've seen a case of a refresh token used by the nozzle expiring, which resulted in the nozzle process never being able to reconnect to Firehose when it disconnects. Relevant log messages (human-readable timestamp in UTC prepended to each log message):

2018-01-02T11:37:03.646067 {"timestamp":"1514893023.646067142","source":"stackdriver-nozzle","message":"stackdriver-nozzle.arguments","log_level":1,"data":{...}}
...nozzle started, working fine for a while. Then disconnect happens...
2018-01-03T20:13:53.360623 {"timestamp":"1515010433.360623360","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"read tcp [redacted]:51612-\u003e[redacted]:443: i/o timeout"}}
2018-01-03T20:13:53.886122 {"timestamp":"1515010433.886122227","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:54.922373 {"timestamp":"1515010434.922372818","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:56.946103 {"timestamp":"1515010436.946102858","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}

The refresh token (which I redacted) in this case had an issue time of 1514893023 (Jan 2 11:37:03 UTC), so it was the same refresh token that was issued when the nozzle process started. I don't yet have a good understanding of how refresh tokens are supposed to be refreshed, but it clearly did not happen here.

The nasty part is that the nozzle remains in this broken state indefinitely and needs to be restarted manually.

Two possible workarounds come to mind:

  • Like suggested in cloudfoundry/go-cfclient#34, recreate the cfclient from scratch when cfClient.GetToken() fails. This will probably require moving cfclient creation closer to firehose.go (which might be tricky, since the same client is also used in AppInfoRepository).
  • Just panic in cfClientTokenRefresh.RefreshAuthToken() if a token cannot be refreshed several times in a row, making sure the process is restarted and all tokens are refreshed (a sketch of this option follows the list).
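
A sketch of the second option (the failure-count field and threshold are illustrative, not the project's code):

package cloudfoundry // illustrative

import (
	"fmt"

	cfclient "github.com/cloudfoundry-community/go-cfclient"
)

type cfClientTokenRefresh struct {
	cfClient            *cfclient.Client
	consecutiveFailures int
}

// RefreshAuthToken panics after several consecutive failures so monit
// restarts the process and all tokens are re-issued.
func (ct *cfClientTokenRefresh) RefreshAuthToken() (string, error) {
	token, err := ct.cfClient.GetToken()
	if err != nil {
		ct.consecutiveFailures++
		if ct.consecutiveFailures >= 3 {
			panic(fmt.Sprintf("RefreshAuthToken failed %d times in a row: %v", ct.consecutiveFailures, err))
		}
		return "", err
	}
	ct.consecutiveFailures = 0
	return token, nil
}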

@johnsonj, any thoughts?

Break requirement for credentials job in tile

Background

The tile is an easy way for PCF users to deploy the nozzle. With a tile uploaded, it's possible to deploy the agents (google-fluentd, stackdriver-agent) as addons:

- name: stackdriver-tools
  version: latest 

addons:
- name: stackdriver-agents
  jobs:
  - name: credentials
    release: stackdriver-tools
  - name: google-fluentd
    release: stackdriver-tools
  - name: stackdriver-agent
    release: stackdriver-tools
  properties:
    credentials:
      application_default_credentials: |
        ...
    project_id: ...

Problem

With this addon co-located we can no longer deploy the nozzle because it also deploys the credentials job.

Error

Director task 152
  Started preparing deployment > Preparing deployment. Failed: Colocated job 'credentials' is already added to the instance group 'stackdriver-nozzle'. (00:00:00)

Error 100: Colocated job 'credentials' is already added to the instance group 'stackdriver-nozzle'.

Task 152 error

Proposal

Allow the user to pass in service account JSON and use this for the tile. Create a credentials file as part of the stackdriver-nozzle job and export GOOGLE_APPLICATION_CREDENTIALS.

Stackdriver-nozzle repeat panic (every 30 seconds)

Panic.txt
Good morning,

We are running CF-Deployment 1.12.0, Stackdriver-tools 1.0.2 on GCP.

We continually see the following panic message (attached), which results in the stackdriver process failing; monit continually restarts it. We still see metrics and logs flowing into Stackdriver itself, so we are not 100% sure whether we are experiencing any data loss at the moment. Note this has been occurring for some time (it is not linked to any specific version of cf-deployment).

Opt-in to Alpha Labels/Metrics

The labels and metric names in develop have changed significantly (#136, #138, #144) and may continue to adapt as we iterate on the nozzle.

In order to allow fast iteration on the labels/metrics while releasing reliability improvements, we should add a toggle to enable the new behavior. The toggle should select either the master-branch behavior as it is in v1.0.5, or the anything-goes alpha behavior. This will help us transition to a 2.0 release where we can break users' dashboards.

I believe we can accomplish this relatively easily:

  • Restore the original labelMaker as legacyLabelMaker
  • Add a flag to config to EnableAlphaMetrics. Plumb through job spec/tile UI.
  • Inject the correct labelMaker during App construction
  • Add a conditional to the metric prefix assignment (or perhaps refactor this into a service object)

We will drop the opt-in and legacy code paths with the v2.0.0 release.

/cc @fluffle @knyar

PCF 2.0 Support

  • use dynamic_ips instead of static_ips in tile
  • use bosh2 to create releases: #111
  • create release with --sha2

Bad release: v1.0.6

This release is broken. Under heavy load the nozzle is susceptible to hanging. Manual validation did not pick this up, possibly because the nozzle logs plenty of its own metrics, or because conflicting nozzle versions were writing to the same project.

  • Pull release from PivNet
  • Remove release from GitHub. Tag will remain for historical reasons but binary is misleading.
  • Add/re-introduce buffer

Related issue: #107

The good news is this exposed several edge case bugs around hangs/shutdown.

Create GCP project for CI

Currently the deploy stage of the CI pipeline pushes the release to a CF installation running in a random project. We should provision a new project specific to stackdriver-tools with a minimal CF installation for this specific purpose.

Support exporting metrics to Stackdriver Logging

Today we split events from the Loggregator into Stackdriver Logging/Monitoring. It is desirable to have the metric data available in Stackdriver Logging so users can use exports for further analysis. An example would be exporting to BigQuery to perform custom calculations on metrics not available in Stackdriver Monitoring.

Design Considerations:

  • Should it be 'all or nothing' to log to both Stackdriver Logging and Monitoring, or should each endpoint have its own list of events?
  • Should we perform the same culling for metrics destined for Stackdriver Logging?

Throttle TimeSeries requests to Stackdriver metrics API

The Stackdriver API recommends sending at most one value per TimeSeries every 30s (a batch request can contain 200 individual TimeSeries values). The nozzle should keep a map of TimeSeries values and write them to the API on a 30s interval.
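
A sketch of the suggested buffering (illustrative, not the project's code; a real key would include label values as well as the metric type, and flush would apply the 200-TimeSeries-per-request batching):

package nozzle // illustrative

import (
	"time"

	monitoringpb "google.golang.org/genproto/googleapis/monitoring/v3"
)

// throttle keeps only the newest point per time-series key and flushes once
// per 30s interval.
func throttle(incoming <-chan *monitoringpb.TimeSeries, flush func([]*monitoringpb.TimeSeries)) {
	buffer := make(map[string]*monitoringpb.TimeSeries)
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case ts := <-incoming:
			buffer[ts.GetMetric().GetType()] = ts // newest point wins within the window
		case <-ticker.C:
			batch := make([]*monitoringpb.TimeSeries, 0, len(buffer))
			for _, ts := range buffer {
				batch = append(batch, ts)
			}
			flush(batch)
			buffer = make(map[string]*monitoringpb.TimeSeries)
		}
	}
}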

Convert example to BOSH add-ons

With the new BOSH 2.0 features, the GCP tools can be colocated on every VM using add-ons instead of being specified manually in each deployment manifest.

Emit a metric for dropped metrics due to RST_STREAM

Why does RST_STREAM happen:

  1. A metric is sent out of order
  2. A metric is sent too frequently

In #72 we addressed the RST_STREAM error for the second case. We did not address the first case.

Loggregator does not guarantee order of event delivery, and it scales by sharding messages across multiple nozzle deployments. It is not reasonably possible to re-order these metrics and coordinate that across various nozzles, and it doesn't make sense to redefine the semantics of the system at the nozzle level.

We should gracefully handle the RST_STREAM error and emit a metric that is actionable for operators.
