go-control-plane's Issues

Tag releases

Hello,
I've been using go-control-plane for a while now, and in that time there have been several breaking changes to the API. It's not a big deal, but it is unexpected work. My suggestion is to tag a release after each breaking change (v0.8, v0.9, v0.10, etc.) so it's easier to pin specific versions in Glide.

Filter.config_type causes protobuf errors for http_connection_manager since v1.9

The structure of the Filter message changed in v1.9 so that config became config_type. Now I see the following error in Envoy's logs when the log level is set to debug:

gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener fff: Proto constraint validation failed (HttpConnectionManagerValidationError.StatPrefix: ["value length must be at least " '\x01' " bytes"]):


even though StatPrefix in HttpConnectionManager is set properly.

Here is my http_connection_manager:

hcm := envoyhcm.HttpConnectionManager{
	CodecType:  envoyhcm.AUTO,
	StatPrefix: "BLABLABLA",

	RouteSpecifier: &envoyhcm.HttpConnectionManager_RouteConfig{
		RouteConfig: <<<MyRouteConfig>>>,
	},
}

Here is my filterChains slice:

anyHCM, err := types.MarshalAny(&hcm)
if err != nil {
	panic(err)
}

filterChains := []envoylistener.FilterChain{
	envoylistener.FilterChain{
		Filters: []envoylistener.Filter{
			{
				Name: util.HTTPConnectionManager,
				ConfigType: &envoylistener.Filter_TypedConfig{
					TypedConfig: anyHCM,
				},
			},
		},
	},
}

And here is my listener:

listener := api.Listener{
	Name:         "my_listener",
	Address:      *mkAddress("127.0.0.1", uint32(4589)),
	FilterChains: filterChains,
}

I am not sure whether this is a bug, but it only happens with v1.9+, so maybe http_connection_manager isn't being serialized properly before the response is sent.

Put a link in main readme

The little test program in pkg/test/main is very useful for understanding the Envoy control plane. Maybe we should put a link to it in this repo's README to give fresh users (like me) a good start.

Panic occurred when serializing OkResponse for ext_authz

Problem

Hi all

A panic occurred when my external authorization service marshaled an OkResponse.

Below is the stack trace (out of order; the first line should be the deepest frame):

github.com/envoyproxy/go-control-plane/envoy/api/v2/core.(*HeaderValueOption).MarshalTo(0xc0004c6d50, 0xc000e462ae, 0x4, 0x4, 0x16, 0x6ab, 0x0)
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /build/vendor/google.golang.org/grpc/server.go:802 +0x86
    /build/vendor/github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha/external_auth.pb.go:592 +0x15d

github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha.(*CheckResponse).MarshalTo(0xf2e460, 0xc000e45c00, 0x6b2, 0x6b2, 0x6b2, 0x6b2, 0xa3ca40)
github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha.(*CheckResponse_OkResponse).MarshalTo(0xf2b4b8, 0xc000e45c00, 0x6b2, 0x6b2, 0xc000795950, 0x440057, 0x700)
google.golang.org/grpc/encoding/proto.codec.Marshal(0xa3ca40, 0xf2e460, 0x1, 0xc000795a68, 0xc000795b88, 0x92e3d2, 0xc000795bc0)
    /build/vendor/github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha/external_auth.pb.go:624 +0xdf
github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha.(*OkHttpResponse).MarshalTo(0xf402e0, 0xc000e45c03, 0x6af, 0x6af, 0x6c1, 0x3, 0xc000152e00)
github.com/envoyproxy/go-control-plane/envoy/api/v2/core.(*HeaderValue).MarshalTo(0xc00049e900, 0xc000e462b0, 0x2, 0x2, 0x12, 0x2, 0x0)
    /build/vendor/github.com/envoyproxy/go-control-plane/envoy/api/v2/core/base.pb.go:1878 +0x27d
    /build/vendor/google.golang.org/grpc/server.go:1139 +0xd58
    /build/vendor/google.golang.org/grpc/server.go:681 +0xa1
google.golang.org/grpc.(*Server).sendResponse(0xc000001680, 0xafabc0, 0xc000097380, 0xc0004b2400, 0xa3ca40, 0xf2e460, 0x0, 0x0, 0xc000898827, 0x0, ...)
goroutine 2175 [running]:
    /build/vendor/github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha/external_auth.pb.go:553 +0x16d
    /build/vendor/github.com/envoyproxy/go-control-plane/envoy/api/v2/core/base.pb.go:1908 +0xe5
    /build/vendor/google.golang.org/grpc/server.go:683 +0x9f
panic: runtime error: index out of range
google.golang.org/grpc.(*Server).handleStream(0xc000001680, 0xafabc0, 0xc000097380, 0xc0004b2400, 0x0)
google.golang.org/grpc.encode(0x7f87ea759030, 0xf5dd70, 0xa3ca40, 0xf2e460, 0xf5dd70, 0x9ac4a0, 0xaefd40, 0x0, 0x0)
github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha.(*CheckResponse).Marshal(0xf2e460, 0xa3ca40, 0xf2e460, 0x7f87ea7590e8, 0xf2e460, 0x1)
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000898600, 0xc000001680, 0xafabc0, 0xc000097380, 0xc0004b2400)
    /build/vendor/google.golang.org/grpc/encoding/proto/proto.go:70 +0x19c
    /build/vendor/google.golang.org/grpc/server.go:957 +0x514
    /build/vendor/google.golang.org/grpc/rpc_util.go:511 +0x5e
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000001680, 0xafabc0, 0xc000097380, 0xc0004b2400, 0xc000150ae0, 0xf2ca90, 0x0, 0x0, 0x0)
    /build/vendor/github.com/envoyproxy/go-control-plane/envoy/service/auth/v2alpha/external_auth.pb.go:569 +0x7f

Vendor Info

I'm using Go modules for my dependency management.

require (
	github.com/dgrijalva/jwt-go v3.2.0+incompatible
	github.com/envoyproxy/go-control-plane v0.6.3
	github.com/fsnotify/fsnotify v1.4.7
	github.com/gogo/googleapis v1.1.0
	github.com/gogo/protobuf v1.1.1
	github.com/lyft/protoc-gen-validate v0.0.11 // indirect
	golang.org/x/net v0.0.0-20181114220301-adae6a3d119a // indirect
	google.golang.org/grpc v1.16.0
)

Golang version

1.11

Related Links

gogo/protobuf#523
gogo/protobuf#485

Envoy xDS not updating after 15-20 minutes.

I created an xDS gRPC v2 API for Envoy as suggested in go-control-plane/pkg/test/main/main.go. The callbacks, management server, and gateway are the same; the AccessLogServer is removed. It works perfectly, and CDS, RDS, and LDS are updated successfully.

The problem is with the envoyproxy (Docker) service. At first, on each snapshot change, the gRPC API sends a StreamResponse, then envoyproxy issues a StreamRequest and updates its xDS config. After ~15 minutes, on a snapshot change the gRPC API still sends a StreamResponse, but envoyproxy issues no StreamRequest and hence no xDS config is updated. If I then restart envoyproxy, the StreamRequest is issued and xDS is updated, but the problem reappears ~15 minutes after each restart.

Something I noticed:

  • In func (cb *callbacks) OnStreamResponse(id int64, req *v2.DiscoveryRequest, res *v2.DiscoveryResponse) {}, the id changes only when I restart envoyproxy. For one envoyproxy session it stays the same across any number of StreamResponse and StreamRequest calls. Is it meant to stay the same, or should it change for each request?

My Envoy config YAML:

admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
dynamic_resources:
  ads_config:
    api_type: GRPC
    refresh_delay: 30s
    cluster_names:
    - xds_cluster
  cds_config:
    ads: {}
  lds_config:
    ads: {}
node:
  cluster: my-cluster
  id: mystack
static_resources:
  clusters:
  - connect_timeout: 1s
    hosts:
    - socket_address:
        address: envoy-discovery-service # docker service name
        port_value: 18000
    http2_protocol_options: {}
    name: xds_cluster
    type: logical_dns
    dns_lookup_family: V4_ONLY
    dns_refresh_rate: 10s

Note: I am using one Docker stack with envoyproxy and envoy-discovery-service as two different services.

Ordering on xDS responses

In ADS mode, it is preferable to respond to xDS queries in the following order, to match the topological ordering of cross-references between xDS resources:

  1. CDS
  2. EDS
  3. LDS
  4. RDS

We should support sequencing the responses in this order. This is mostly applicable to ADS mode, since in xDS mode, snapshots may be partial (e.g. only RDS).
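
A minimal sketch of sequencing responses by type URL; the pending map and send callbacks are illustrative stand-ins for writing DiscoveryResponses onto the ADS stream, not part of this library's API:

// adsOrder lists the xDS type URLs in the preferred ADS response order,
// matching the cross-reference topology described above.
var adsOrder = []string{
	"type.googleapis.com/envoy.api.v2.Cluster",               // CDS
	"type.googleapis.com/envoy.api.v2.ClusterLoadAssignment", // EDS
	"type.googleapis.com/envoy.api.v2.Listener",              // LDS
	"type.googleapis.com/envoy.api.v2.RouteConfiguration",    // RDS
}

// respondInOrder drains pending responses, keyed by type URL, in that order.
func respondInOrder(pending map[string]func() error) error {
	for _, typeURL := range adsOrder {
		if send, ok := pending[typeURL]; ok {
			if err := send(); err != nil {
				return err
			}
			delete(pending, typeURL)
		}
	}
	return nil
}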

Is there a way that EDS IPAddress accepts docker hostname with EDS Cluster?

Is there a way we can use docker service eg webapp_dev_webapp.1.no9rqsfuz73k026uod3hsjk83 as core.SocketAddress.Address?

A Docker service name works well for a static cluster address defined in the Envoy config YAML and resolves fine, since the cluster type is logical_dns. For dynamic Envoy config the cluster type is EDS, and each cluster has its endpoints. If I provide a normal IP address and port for each endpoint, it works fine.
But each of my services is hosted in Docker, and there is an issue using Docker's internal IP addresses, which come from the network subnet. My option is to use the Docker services' hostnames; since Envoy itself is in the same Docker network, resolving the hostname should not be a problem.

I initialized the endpoint as:

endpoints := make([]cache.Resource, 1)
var eps []endpoint.LbEndpoint
eps = append(eps, endpoint.LbEndpoint{Endpoint: &endpoint.Endpoint{
	Address: &core.Address{
		Address: &core.Address_SocketAddress{
			SocketAddress: &core.SocketAddress{
				Protocol: core.TCP,
				Address: "webapp_dev_webapp.1.no9rqsfuz73k026uod3hsjk83",
				ResolverName: "LOGICAL_DNS",
				PortSpecifier: &core.SocketAddress_PortValue{
					PortValue: uint32(8080),
				},
			},
		},
	},
},
})

endpoints[0] = &v2.ClusterLoadAssignment{
	ClusterName: "my_cluster",
	Endpoints: []endpoint.LocalityLbEndpoints{{
		LbEndpoints: eps,
	}},
}

var edsSource *core.ConfigSource
edsSource = &core.ConfigSource{
	ConfigSourceSpecifier: &core.ConfigSource_Ads{
		Ads: &core.AggregatedConfigSource{},
	},
}
cluster := &v2.Cluster{
	Name:          "my_cluster",
	ConnectTimeout: 5 * time.Second,
	Type:           v2.Cluster_EDS,
	EdsClusterConfig: &v2.Cluster_EdsClusterConfig{
		EdsConfig: edsSource,
	},
}

// some listener and routes too

After generating and setting a snapshot with the above settings, Envoy updates LDS, RDS, and CDS but throws an error for EDS:
[warning][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment rejected: Unknown address resolver: LOGICAL_DNS. Consider setting resolver_name or setting cluster type to 'STRICT_DNS' or 'LOGICAL_DNS'.

Are Docker hostnames via an EDS cluster not supported yet, or am I missing something?

Group nodes to reduce snapshot cache size

Currently, we need two separate identities for proxies:

  • an actual ID used to track the state of the proxy on the server - detecting when the proxy goes away, monitoring accepted versions
  • a lookup ID used to select the snapshot to push to the proxy.

In the current implementation, they are treated as 1-1, which means there are as many snapshots as there are proxies. In some cases this is necessary, if all proxies are indeed different. But in many cases, proxies should receive the same configuration if grouped appropriately.

Caching xDS responses

The management server has to maintain a cache of the desired configurations for all proxies in the mesh. Since there could be many more proxies than distinct configurations, it seems we want to be able to group proxy nodes into buckets.

Let's start with a generic bucketing function:

func nodeGroup(proxy *api.Node) string

For example, we can use the cluster field in the node to group API responses if the cluster corresponds to the service cluster. Alternatively, we can use the generic metadata field in the node message to assign configuration to proxies (e.g. in Kubernetes, we may use pod labels to uniquely identify sidecars).

The cache is a map from the node group and response type to the desired response:

var cache map[NodeGroup]map[ResponseType]proto.Message
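
For reference, the snapshot cache already lets the library user supply such a bucketing function through the cache.NodeHash interface. Below is a minimal sketch of grouping by the node's cluster field; the type name and the fallback behaviour are illustrative assumptions, not part of the library:

package groupexample

import (
	"github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
	"github.com/envoyproxy/go-control-plane/pkg/cache"
)

// clusterHash groups proxies by their service cluster so that every node in
// the same cluster shares a single cached snapshot instead of one per node.
// Pass it as the NodeHash when constructing the snapshot cache, and key
// SetSnapshot by the same group string.
type clusterHash struct{}

// ID implements cache.NodeHash.
func (clusterHash) ID(node *core.Node) string {
	if node == nil {
		return "default"
	}
	if node.Cluster == "" {
		// Fall back to per-node snapshots when no cluster is set.
		return node.Id
	}
	return node.Cluster
}

var _ cache.NodeHash = clusterHash{}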

Build error when vendoring dependency

Bug

Revision: 39c73ae5ba014b648d4164a24340a78b6b761978

I am building a Docker image with an instance of the Aggregated Discovery Service that imports this library as a package. I am receiving a syntax error from cache.go:

Error

vendor/github.com/envoyproxy/go-control-plane/pkg/cache/cache.go:25: syntax error: unexpected = in type declaration
make: *** [build] Error 2

I'm not entirely sure why this is happening because line 25 is valid syntax.

Vendor Information

I am using dep for my dependency management.

[[projects]]
  branch = "master"
  name = "github.com/envoyproxy/go-control-plane"
  packages = [
    "envoy/api/v2",
    "envoy/api/v2/auth",
    "envoy/api/v2/cluster",
    "envoy/api/v2/core",
    "envoy/api/v2/endpoint",
    "envoy/api/v2/listener",
    "envoy/api/v2/ratelimit",
    "envoy/api/v2/route",
    "envoy/config/accesslog/v2",
    "envoy/config/bootstrap/v2",
    "envoy/config/filter/accesslog/v2",
    "envoy/config/filter/fault/v2",
    "envoy/config/filter/network/http_connection_manager/v2",
    "envoy/config/filter/network/tcp_proxy/v2",
    "envoy/config/metrics/v2",
    "envoy/config/ratelimit/v2",
    "envoy/config/trace/v2",
    "envoy/service/accesslog/v2",
    "envoy/service/discovery/v2",
    "envoy/type",
    "pkg/cache",
    "pkg/log",
    "pkg/server",
    "pkg/test",
    "pkg/test/resource",
    "pkg/util"
  ]
  revision = "39c73ae5ba014b648d4164a24340a78b6b761978"

Java Port

Most of our disco/metrics/etc. infrastructure is built around Java, so I've been working on a Java port of go-control-plane. Would there be interest in creating a new repo under the envoyproxy org for it?

I've basically followed the same design as go-control-plane, with just some minor tweaks to make it more idiomatic for the Java world.

Implement SDS

  • add stubs for SecretDiscoveryService
  • add secrets to snapshots
  • work out ADS semantics for secret fetching
  • tests and validation

Cache Comparison

Would it make sense to add a comparison method to the simple cache, where you can pass an old snapshot and compare it to the desired snapshot, comparing everything except the version?

With a timer-based updater (a naive approach), this would let us avoid pushing updates to proxies unless something actually changed. There is likely a better approach to this.
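
A minimal sketch of such a comparison, assuming cache.Resource satisfies gogo's proto.Message (as it does in this version of the library); the helper name and the resource-map inputs are illustrative:

package compareexample

import (
	"github.com/envoyproxy/go-control-plane/pkg/cache"
	"github.com/gogo/protobuf/proto"
)

// equalIgnoringVersion reports whether two resource sets contain the same
// resources by name and content, ignoring the snapshot version entirely.
func equalIgnoringVersion(old, desired map[string]cache.Resource) bool {
	if len(old) != len(desired) {
		return false
	}
	for name, oldRes := range old {
		desiredRes, ok := desired[name]
		if !ok || !proto.Equal(oldRes, desiredRes) {
			return false
		}
	}
	return true
}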

Relax consistency check for ADS

ADS mode requires a hold on the response until the entire set of snapshot names is requested (to simplify version tracking of partial name subsets). That means a Snapshot must be self-consistent when submitted to the cache; it should have as many cluster load assignments as there are clusters. However, the client of the library may want to use one pool of cluster load assignments for all proxies and all snapshots. We should relax the consistency check to validate that the set of clusters is a subset of the cluster load assignments, not a precise match.
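
A sketch of the relaxed check; for simplicity it assumes load assignments are keyed by the cluster name, whereas the real check would follow the EDS service names returned by GetResourceReferences:

package consistencyexample

import (
	"fmt"

	"github.com/envoyproxy/go-control-plane/pkg/cache"
)

// consistentSubset verifies that every cluster has a matching load assignment
// while allowing extra load assignments that no cluster currently references.
func consistentSubset(clusters, endpoints map[string]cache.Resource) error {
	for name := range clusters {
		if _, ok := endpoints[name]; !ok {
			return fmt.Errorf("missing cluster load assignment for cluster %q", name)
		}
	}
	return nil
}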

Update to go 1.10

That should help with developer experience since 1.10 manages build caches much better.

core.Healthcheck.{Timeout,Interval} use *types.Duration not *time.Duration

The duration-style fields in core.HealthCheck are defined using the gogo/types Duration struct:

type HealthCheck struct {
	// The time to wait for a health check response. If the timeout is reached the
	// health check attempt will be considered a failure.
	Timeout *google_protobuf3.Duration `protobuf:"bytes,1,opt,name=timeout" json:"timeout,omitempty"`
	// The interval between health checks.
	Interval *google_protobuf3.Duration `protobuf:"bytes,2,opt,name=interval" json:"interval,omitempty"`
	// ...
}

Elsewhere in the core package, the more convenient time.Duration type is used:

type ApiConfigSource struct {
        // ...
	RefreshDelay *time.Duration `protobuf:"bytes,3,opt,name=refresh_delay,json=refreshDelay,stdduration" json:"refresh_delay,omitempty"`
}

Why do the two definitions differ? Is it possible to change HealthCheck to use *time.Duration?
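
Until that changes (if it ever does), the gogo types package provides a converter, so a time.Duration can still be used at the call site. A minimal sketch:

package main

import (
	"time"

	"github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
	"github.com/gogo/protobuf/types"
)

func main() {
	// DurationProto converts a time.Duration into the *types.Duration
	// (a.k.a. google_protobuf3.Duration) that HealthCheck expects.
	hc := &core.HealthCheck{
		Timeout:  types.DurationProto(2 * time.Second),
		Interval: types.DurationProto(10 * time.Second),
	}
	_ = hc
}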

Generating a snapshot from a configuration file

This is a feature request.
To get straight to the point: how about adding a function to generate a snapshot from a YAML or JSON file?
Like this:

import "github.com/envoyproxy/go-control-plane/pkg/cache"

func NewSnapshotFromYAML(filePath string) cache.Snapshot

I think we need this abstraction if we want to manage the resources that become the xDS responses as YAML.
For example, if we want to review the data-plane configuration with multiple people, we need to keep it in YAML or JSON files.
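
A rough sketch of what such a helper could look like. It only handles clusters and listeners, assumes a hypothetical file layout with top-level version, clusters and listeners keys, and assumes the cache.NewSnapshot(version, endpoints, clusters, routes, listeners) signature of this version of the library; the ghodss/yaml dependency is an assumption as well:

package snapshotyaml

import (
	"bytes"
	"encoding/json"
	"io/ioutil"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	"github.com/envoyproxy/go-control-plane/pkg/cache"
	"github.com/ghodss/yaml"
	"github.com/gogo/protobuf/jsonpb"
)

// NewSnapshotFromYAML builds a snapshot from a YAML file. This is only a
// sketch of the proposal, not part of the library.
func NewSnapshotFromYAML(filePath string) (cache.Snapshot, error) {
	raw, err := ioutil.ReadFile(filePath)
	if err != nil {
		return cache.Snapshot{}, err
	}
	// Convert the whole document to JSON once so jsonpb can decode each entry.
	jsonBytes, err := yaml.YAMLToJSON(raw)
	if err != nil {
		return cache.Snapshot{}, err
	}
	var file struct {
		Version   string            `json:"version"`
		Clusters  []json.RawMessage `json:"clusters"`
		Listeners []json.RawMessage `json:"listeners"`
	}
	if err := json.Unmarshal(jsonBytes, &file); err != nil {
		return cache.Snapshot{}, err
	}

	var clusters, listeners []cache.Resource
	for _, rc := range file.Clusters {
		c := &v2.Cluster{}
		if err := jsonpb.Unmarshal(bytes.NewReader(rc), c); err != nil {
			return cache.Snapshot{}, err
		}
		clusters = append(clusters, c)
	}
	for _, rl := range file.Listeners {
		l := &v2.Listener{}
		if err := jsonpb.Unmarshal(bytes.NewReader(rl), l); err != nil {
			return cache.Snapshot{}, err
		}
		listeners = append(listeners, l)
	}
	return cache.NewSnapshot(file.Version, nil, clusters, nil, listeners), nil
}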

please cut a v0.3 tag

Hi @kyessenov,

We would like to use some of the more recent schema changes, but we would prefer a stable tag. Do you have time to cut one in the next few days?

Thanks!

Getting a build error, how can I fix it?

github.com\envoyproxy\go-control-plane\envoy\api\v2\cds.pb.go:2226:21: c.cc.NewStream undefined (type *grpc.ClientConn has no field or method NewStream)

Ordering on version info

It seems that we should optionally support an ordering operator on the versions, to make it easier to load balance xDS requests between control plane servers. If a server receives a request for a newer version, it should withhold replying until it catches up. For example, let's say we go through versions v1, v2, v3, and the state is:

  1. envoy at v2
  2. server1 at v1
  3. server3 at v3

If Envoy makes a request to server1, then server1 should withhold its reply until it receives v3, rather than replying with v1.

Merge / accept Rotor, in whole or in part

So with Turbine Labs going away, there's not going to be anybody around to work on Rotor. Rotor is a control plane based on go-control-plane that adds EDS integration with common service discovery registries like Kubernetes, Consul, and AWS (EC2 tags and ECS).

If the Envoy project wants it, it's yours.

Why It's Worth Adopting Rotor

The community needs an out-of-the-box control plane

Rotor exists based on Turbine Labs' experience trying to get people up and running with Envoy at scale, both as a support vendor and a community member (mostly Slack). Most people figure out quickly that they need a control plane, and they typically disappear for 2-4 weeks trying to write an EDS integration. They generally win the battle (producing a control plane) but lose the war (maintaining the control plane becomes a drag on successfully deploying Envoy everywhere).

The primary motivation for users is EDS integration with either Consul or Kubernetes. Rotor solves this without writing any code, which has been a huge win for a lot of folks.

In addition, the notion of a centralized control plane has been adopted broadly, and Rotor provides some facility for defining listeners and routes in a single place (a config file for Rotor, which is roughly the same format as Envoy LDS). Many users find this operationally preferable to managing and distributing static config files, as it gets them on a path to use xDS over gRPC quickly.

Considerations

Most of these can be overcome with some work, but I'll put all the cards on the table (to the best of my ability).

  1. Rotor used to be commercially backed. In particular, it has code to connect to the Turbine Labs API, which is no longer functional. It's easy enough to ignore, since that code path is only used when you pass an API key. But perhaps one would want to remove the CLI help text or the code soon after adopting the project.

  2. Rotor's documentation is partially hosted on the Turbine Labs blog. The blog isn't going away (thanks, Medium!), but one might want to fold the intro post or the big update post into the Rotor's README.

  3. Istio's Pilot has some of the same goals. Strategically, it may be more interesting to adopt Pilot and put work towards that. Rotor's primary advantage was always with folks with significant non-Kubernetes deployments, and there are a lot of those people out there. I suspect it will be a while before Pilot can be the blessed solution for typical users, because it's going to be a while before most people are 100% on Kubernetes. Even then, Istio's momentum is real, and having two suggested control planes may not be a win for users.

  4. Rotor is not a particularly robust solution for LDS and RDS. It's just another place to write RDS and LDS configuration. Turbine Labs felt that this was our commercial wheelhouse, so Rotor was never going to, e.g., read routes from Consul tags and set up routes based on that. We've seen users adopt Rotor as a way to configure LDS/RDS with just a config file instead of writing code against go-control-plane, but this area of functionality could be much more full-featured.

If you ask me, I think there's value in at least having Rotor brought under the Envoy / CNCF care as a companion project. There's a real need for a more standardized control plane story, whether that's Rotor or some other project.

Happy to discuss anything else in this issue, if there are other questions or concerns about the project!

ADS grpc stream get closed after each request

I have implemented an ADS server for a basic control plane using this package. I can send the data with proper versioning, but for a reason I cannot understand, Envoy keeps closing the gRPC stream, which gets logged as a warning (see below). This might be the expected behavior, but it feels somewhat wrong to me. My understanding of gRPC is that existing connections should be reused as much as possible.

I have experimented with KeepaliveEnforcementPolicy and KeepaliveParams on the gRPC server, but it doesn't change anything. As far as I know, everything is pretty much vanilla.

I understand this might be a problem in https://github.com/envoyproxy/envoy too, but I thought I would ask here first.

envoy_1  | [2019-01-08 18:46:49.320][000006][info][main] [source/server/server.cc:463] starting main dispatch loop
envoy_1  | [2019-01-08 18:46:49.325][000006][info][upstream] [source/common/upstream/cluster_manager_impl.cc:132] cm init: initializing cds
envoy_1  | [2019-01-08 18:46:49.567][000006][warning][upstream] [source/common/config/grpc_mux_impl.cc:268] gRPC config stream closed: 0,
envoy_1  | [2019-01-08 18:46:50.275][000006][warning][upstream] [source/common/config/grpc_mux_impl.cc:268] gRPC config stream closed: 0,

test framework for dynamic routing

See https://github.com/DecipherNow/fabric-experiments/tree/master/metrics-test

This has one HTTP client and one gRPC client, each talking to a 'store' service, which in turn talks to a 'bank' service. The clients send simple requests at random intervals. The servers add random amounts of latency.

To facilitate testing dynamic routing I propose to add:

  • multiple clients of each type
  • one or more clusters of multiple store services
  • one or more clusters of multiple bank services

This will prepare for the test platform which will have

  • an edge proxy routing client requests to members of the store cluster
  • a sidecar for each store service to manage dynamic configuration and routing
  • a sidecar for each bank service

Cache garbage collection

Maintaining a cache of responses requires garbage collection of responses for stale nodes.
Since the goal for now is a total cache (i.e. the entire response cache is loaded on demand or a priori), the right strategy seems to be one of, or a combination of, the following (a sketch of option 2 follows the list):

  1. let the client evict responses;
  2. evict responses after a certain duration threshold.
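
A hedged sketch of option 2: a janitor that clears snapshots for nodes that have not been seen within a TTL. The lastSeen bookkeeping is assumed to be maintained by the caller (for example, from the server callbacks); only ClearSnapshot is part of the library's SnapshotCache interface:

package janitor

import (
	"context"
	"sync"
	"time"

	"github.com/envoyproxy/go-control-plane/pkg/cache"
)

// EvictStale periodically clears snapshots for nodes that have not sent a
// request within ttl. lastSeen maps node ID -> last request time.
func EvictStale(ctx context.Context, snapshots cache.SnapshotCache, lastSeen *sync.Map, ttl time.Duration) {
	ticker := time.NewTicker(ttl / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			lastSeen.Range(func(key, value interface{}) bool {
				node, seen := key.(string), value.(time.Time)
				if time.Since(seen) > ttl {
					snapshots.ClearSnapshot(node) // drop the cached responses for the stale node
					lastSeen.Delete(node)
				}
				return true
			})
		}
	}
}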

Inconsistent usage of context package

The generated API uses "golang.org/x/net/context".Context; it should use the standard library "context".Context.

github.com/envoyproxy/go-control-plane/pkg/server

../github.com/envoyproxy/go-control-plane/pkg/server/server.go:65:41: cannot use server literal (type *server) as type Server in return argument:
*server does not implement Server (wrong type for FetchClusters method)
have FetchClusters("context".Context, "github.com/envoyproxy/go-control-plane/envoy/api/v2".DiscoveryRequest) ("github.com/envoyproxy/go-control-plane/envoy/api/v2".DiscoveryResponse, error)
want FetchClusters("golang.org/x/net/context".Context, "github.com/envoyproxy/go-control-plane/envoy/api/v2".DiscoveryRequest) ("github.com/envoyproxy/go-control-plane/envoy/api/v2".DiscoveryResponse, error)

Compilation finished with exit code 2

Switch to gogo from goproto

Is there interest in switching from golang/protobuf to gogo/protobuf?
The generated code looks a lot more pleasant to work with (oneofs are terrible in golang/protobuf), plus we may get some, probably insubstantial, performance improvements.

WDYT?

@davecheney @junr03 @rshriram

Set-up xDS protocol testing

We should make some basic ping-pong style tests for the server. That requires a client. Should we mock Envoy or just use the real thing? If we use a real Envoy, we can probably query its admin interface to validate that the config was applied.

Cache ADS Mode Snapshot Consistency Question

We've run into an issue when using Envoy with a Snapshot Cache configured for ADS mode and statically defined clusters that are discovered through EDS.

This comes from trying to use EDS to discover the endpoints for the tracing service. For instance, in our Envoy config we have:

tracing:
  http:
    name: envoy.zipkin
    config:
      collector_cluster: trace-collector
      collector_endpoint: /api/v1/spans

And then a static cluster configured for EDS:

static_resources:
  clusters:
  - name: trace-collector
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    type: EDS
    eds_cluster_config:
      eds_config:
        api_config_source:
          api_type: GRPC
          grpc_services:
            - envoy_grpc:
                cluster_name: xds_cluster

We've noticed that if we configure the rest of Envoy's CDS and EDS to use the ADS stream, we run into a deadlock on initial Envoy boot.

I've traced the request flow and it goes like this:

  • Envoy looks up static trace-collector endpoints through EDS.
  • Cache responds normally, Envoy moves on to dynamic CDS over ADS.
  • CDS over ADS response is served from the go-control-plane cache with a list of clusters including trace-collector.
  • Envoy performs EDS over ADS but does not request the trace-collector resource, since this was already discovered separately.
  • go-control-plane cache does not respond to the EDS over ADS request because the requested resource names are not a superset of the snapshot's resources; the logic is here: https://github.com/envoyproxy/go-control-plane/blob/master/pkg/cache/simple.go#L220

To work around this issue, we have to exclude trace-collector from the snapshot used for ADS.

My question is: what is the motivation behind ensuring the requested resources are a superset of the cached snapshot resources? My gut feeling is that the other way around makes more sense: the cached snapshot should be a superset of the requested resources.

Is Consistent() correct?

Hi. Thank you very much for the implementation. This helps me a lot.

I tried this but hit an error in cache.Consistent(). It compares against the number of endpoints, but GetResourceReferences (over the cluster items) returns the number of cluster configs, not the number of endpoints; it returns at most one for each cluster item.
https://github.com/envoyproxy/go-control-plane/blob/master/pkg/cache/snapshot.go#L74-L75
https://github.com/envoyproxy/go-control-plane/blob/master/pkg/cache/resource.go#L87-L95

Stupid question, but how to run this control plane

How do I run the control plane? I am able to run all the make commands.

I am not sure how to start this control plane; is some more work required?

Update:

I am now able to run it with: make integration.rest

--> building test binary
env XDS=rest build/integration.sh
INFO[0000] upstream listening HTTP/1.1                   port=18080
INFO[0000] access log server listening                   port=18090
INFO[0000] gateway listening HTTP/1.1                    port=18001
INFO[0000] waiting for the first request...        
INFO[0000] management server listening                   port=18000

I see output like this, so some endpoints are up that I can connect to. Still, some reference documentation would be really helpful: how can I run RDS with any platform, or are there any pointers on where I should look to figure this out?

Is this project outdated?

I tried using this project to write my own CDS; however, I found that the gRPC interface has changed.

In this project, it's:

func (*server) StreamClusters(v2.ClusterDiscoveryService_StreamClustersServer) error {
	// ...
}

but in data-plane-api it has changed to:

service ClusterDiscoveryService {
  rpc StreamClusters(stream DiscoveryRequest) returns (stream DiscoveryResponse) {
  }
}

So, has it changed?

ADS zero traffic-loss staged updates

The xDS protocol was designed with an eventual-consistency goal in mind. Under certain circumstances, there is a possibility of traffic loss in the middle of a configuration update. For example, a transition from a listener L1 using cluster C1 to a listener L2 using cluster C2, done as independent CDS and LDS updates, would cause a brief loss of traffic in the inconsistent state (L1, C2) or (L2, C1). The way around this problem is to sequence updates so that resources are not removed while in use. This requires using ADS to multiplex all resource updates over a single stream, as well as sequencing logic in the management server. In this scenario, the correct sequence is:

  1. update clusters to {C2, C1}
  2. update listeners to {L2}
  3. update clusters to {C2}

We can model this problem as batching atomic updates. The server receives a logical operation to update the desired configuration to {L2, C2}. The server keeps track of {L1, C1} applied remotely at the proxy. The server then creates a sequence of updates as above and streams the sequence of updates to CDS and LDS types through ADS.

For this to happen, the server has to take logical batched updates to multiple xDS resource types at a time as input. The server also has to maintain the remote state of applied xDS resources to correctly compute the diff and remove stale resources from the remote proxies.
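
A minimal sketch of the sequencing step, working on resource names only; a real implementation would carry the full resources and gate each step on the proxy's ACK before sending the next one. All names here are illustrative:

// step is one batched update of cluster and listener names pushed over ADS.
type step struct {
	Clusters  []string
	Listeners []string
}

// makeBeforeBreak returns the intermediate updates that move a proxy from the
// currently applied state to the desired state without removing a resource
// that may still be in use: add new clusters, switch listeners, then drop
// stale clusters.
func makeBeforeBreak(current, desired step) []step {
	merged := append([]string{}, current.Clusters...)
	for _, c := range desired.Clusters {
		if !contains(merged, c) {
			merged = append(merged, c)
		}
	}
	return []step{
		{Clusters: merged, Listeners: current.Listeners},           // 1. clusters {C1, C2}, listeners {L1}
		{Clusters: merged, Listeners: desired.Listeners},           // 2. clusters {C1, C2}, listeners {L2}
		{Clusters: desired.Clusters, Listeners: desired.Listeners}, // 3. clusters {C2}, listeners {L2}
	}
}

func contains(names []string, target string) bool {
	for _, name := range names {
		if name == target {
			return true
		}
	}
	return false
}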

Bootstrap Config Unmarshaling Issues?

Assuming the following valid Envoy JSON configuration file (validated by actually running an Envoy instance with it):

{
   "admin":{
      "access_log_path":"/dev/null"
   },
   "node":{
      "cluster":"example-service-1",
      "id":"id-1"
   },
   "static_resources":{
      "clusters":[
         {
            "connect_timeout":"2.00s",
            "lb_policy":1,
            "name":"cluster1"
         }
      ],
      "listeners":[
          {
             "address": {
                 "socket_address": {
                 "address": "0.0.0.0",
                 "port_value": 9090,
                 "ipv4_compat": true
               }
             },
             "name": "example-service",
             "filter_chains":[{
                "filters":[{
                  "name":"envoy.http_connection_manager",
                  "drain_timeout": "3000s",
                  "stat_prefix":"ingress_http",
                  "config":{
                     "access_log":[
                        {
                           "config":{
                              "path":"/dev/stdout"
                           },
                           "name":"envoy.file_access_log"
                        }
                     ],
                     "codec_type":"auto",
                     "http_filters":[
                        {
                           "name":"envoy.router"
                        }
                     ],
                     "route_config":{
                        "virtual_hosts":[
                           {
                              "domains":[
                                 "*"
                              ],
                              "name":"backend",
                              "routes":[
                                 {
                                    "match":{
                                       "prefix":"/"
                                    },
                                    "route":{
                                       "cluster":"cluster1",
                                       "timeout": "3000s",
                                       "prefix_rewrite": "/"
                                    }
                                 }
                              ]
                           }
                        ]
                     }
                  }
               }]
             }]
          }
       ]
   }
}

Why would the following unmarshaling code fail with this error:

	bootstrap := &v2.Bootstrap{}
	err := jsonpb.Unmarshal(r, bootstrap)
	if err != nil {
		return nil, stacktrace.Propagate(err, "failed in unmarshalling config bytes into bootstrap object")
	}

	return bootstrap, nil

failed in unmarshalling config bytes into bootstrap object
         --- at confutil/config.go:91 (unmarshal) ---
        Caused by: unknown field "drain_timeout" in listener.Filter

drain_timeout is a valid param in listener.Filter unless I am misunderstanding something. For clarification, we are using the package "github.com/gogo/protobuf/jsonpb" to handle the unmarshaling of JSON into the proto structs.

Missing channel init?

Where do these channels get initialized?

I see where they're "re-initialized" down here, but I'm not seeing where they're initially set. I think maybe some calls to s.cache.CreateWatch() might be missing up above?

Thanks!

Repo setup

Please use this issue to discuss general setup. I want to agree on a bunch of things before anyone starts pushing code:

  • CI setup. (CircleCI)
  • Build system. Bazel?
  • Dependency tracking. Glide/godep/something else?
  • Outline of project scope. What this is and what this isn't in README.

Can we please have a new tagged release

Hello,

Can we please have a new tagged release to pick up #102

If this tag could be in semver form that would be very helpful in letting me transition my project from Dep to Go 1.11's module system.

Thank you.

Integration test does not wait before closing HTTP Servers

I want to make some changes to go-control-plane locally. Before making any modifications, I decided to run the integration tests. Surprisingly, they failed:

make integration.docker 
--> building Linux test binary
--> building test docker image
Sending build context to Docker daemon  182.3MB
Step 1/5 : FROM envoyproxy/envoy:latest
 ---> cf2442a79328
Step 2/5 : ADD sample /sample
 ---> Using cache
 ---> 901328cc36a6
Step 3/5 : ADD build/integration.sh build/integration.sh
 ---> Using cache
 ---> e9f5d2931ab0
Step 4/5 : ADD bin/test-linux /bin/test
 ---> Using cache
 ---> 21abea4c5a7c
Step 5/5 : ENTRYPOINT build/integration.sh
 ---> Using cache
 ---> c430b556d974
Successfully built c430b556d974
Successfully tagged test:latest
docker run -it -e "XDS=ads" test -debug
INFO[0000] upstream listening HTTP/1.1                   port=18080
INFO[0000] waiting for the first request...             
INFO[0000] gateway listening HTTP/1.1                    port=18001
INFO[0000] access log server listening                   port=18090
INFO[0000] management server listening                   port=18000
ERRO[0000] http: Server closed                          
ERRO[0000] http: Server closed                          
DEBU[0000] stream 1 open for                            
DEBU[0000] open watch 1 for type.googleapis.com/envoy.api.v2.Cluster[] from nodeID "test-id", version "" 
INFO[0000] initial snapshot {Xds:ads Version: UpstreamPort:18080 BasePort:9000 NumClusters:4 NumHTTPListeners:2 NumTCPListeners:2} 
INFO[0000] executing sequence                            requests=5 updates=3
INFO[0000] update snapshot                               version=v0
DEBU[0000] respond open watch 1[] with new version "v0" 
DEBU[0000] respond type.googleapis.com/envoy.api.v2.Cluster[] version "" with version "v0" 
DEBU[0000] respond type.googleapis.com/envoy.api.v2.ClusterLoadAssignment[cluster-v0-3 cluster-v0-2 cluster-v0-1 cluster-v0-0] version "" with version "v0" 
DEBU[0000] open watch 2 for type.googleapis.com/envoy.api.v2.Cluster[] from nodeID "test-id", version "v0" 
INFO[0000] request batch                                 batch=0 failed=4 ok=0 pass=false
DEBU[0000] respond type.googleapis.com/envoy.api.v2.Listener[] version "" with version "v0" 
DEBU[0000] open watch 3 for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment[cluster-v0-3 cluster-v0-2 cluster-v0-1 cluster-v0-0] from nodeID "test-id", version "v0" 
DEBU[0000] respond type.googleapis.com/envoy.api.v2.RouteConfiguration[route-v0-1 route-v0-0] version "" with version "v0" 
DEBU[0000] open watch 4 for type.googleapis.com/envoy.api.v2.Listener[] from nodeID "test-id", version "v0" 
DEBU[0000] open watch 5 for type.googleapis.com/envoy.api.v2.RouteConfiguration[route-v0-1 route-v0-0] from nodeID "test-id", version "v0" 
INFO[0000] request batch                                 batch=1 failed=4 ok=0 pass=false
INFO[0001] request batch                                 batch=2 failed=4 ok=0 pass=false
INFO[0001] request batch                                 batch=3 failed=4 ok=0 pass=false
INFO[0002] request batch                                 batch=4 failed=4 ok=0 pass=false
INFO[0002] server callbacks                              fetches=0 requests=8
ERRO[0002] failed all requests in a run 0               
Envoy log: envoy.log
Makefile:89: recipe for target 'integration.docker' failed
make: *** [integration.docker] Error 1

Note these lines:

ERRO[0000] http: Server closed
ERRO[0000] http: Server closed

If you look at the code for one, you can see:

func RunHTTP(ctx context.Context, upstreamPort uint) {
	log.WithFields(log.Fields{"port": upstreamPort}).Info("upstream listening HTTP/1.1")
	server := &http.Server{Addr: fmt.Sprintf(":%d", upstreamPort), Handler: echo{}}
	go func() {
		if err := server.ListenAndServe(); err != nil {
			log.Error(err)
		}
	}()
	if err := server.Shutdown(ctx); err != nil {
		log.Error(err)
	}
}

It's not waiting for the goroutine to actually finish, so it just shuts the server down immediately. I'm guessing that on my laptop this happens almost always, whereas in CircleCI the timing is different?

I've fixed it locally by using ctx.Done() like the gRPC server methods.
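
A sketch of that fix, as a drop-in replacement for the RunHTTP shown above (same imports and echo handler assumed): block on ctx.Done() so the server is only shut down once the test context ends.

func RunHTTP(ctx context.Context, upstreamPort uint) {
	log.WithFields(log.Fields{"port": upstreamPort}).Info("upstream listening HTTP/1.1")
	server := &http.Server{Addr: fmt.Sprintf(":%d", upstreamPort), Handler: echo{}}
	go func() {
		if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Error(err)
		}
	}()
	// Wait for the test run to finish before shutting the server down.
	<-ctx.Done()
	if err := server.Shutdown(context.Background()); err != nil {
		log.Error(err)
	}
}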

I'm using go 1.11.
I tested this against e79e039 and revert_grpc_version_update

I'll submit a PR soon.

project doesn't build

The code in package server imports package context, but the generated code in package api references x/net/context. In addition, the code references a bunch of packages that are missing from the vendor directory.

On a clean checkout after running make depend.install, make test and make build fail (on go 1.8.3; Ubuntu). Output below the fold:

$ make test                 
--> running unit tests
# github.com/envoyproxy/go-control-plane/pkg/server
pkg/server/server.go:68: cannot use server literal (type *server) as type Server in return argument:
	*server does not implement Server (wrong type for FetchClusters method)
		have FetchClusters("context".Context, *api.DiscoveryRequest) (*api.DiscoveryResponse, error)
		want FetchClusters("github.com/envoyproxy/go-control-plane/vendor/golang.org/x/net/context".Context, *api.DiscoveryRequest) (*api.DiscoveryResponse, error)
# github.com/envoyproxy/go-control-plane/pkg/server
pkg/server/server.go:68: cannot use server literal (type *server) as type Server in return argument:
	*server does not implement Server (wrong type for FetchClusters method)
		have FetchClusters("context".Context, *api.DiscoveryRequest) (*api.DiscoveryResponse, error)
		want FetchClusters("github.com/envoyproxy/go-control-plane/vendor/golang.org/x/net/context".Context, *api.DiscoveryRequest) (*api.DiscoveryResponse, error)
[... snip successful tests in other packages ...]

and

$ make build
vendor/github.com/lyft/protoc-gen-validate/checker.go:13:2: cannot find package "github.com/lyft/protoc-gen-star" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/github.com/lyft/protoc-gen-star (vendor tree)
	/usr/lib/go-1.8/src/github.com/lyft/protoc-gen-star (from $GOROOT)
	$GOPATH/src/github.com/lyft/protoc-gen-star (from $GOPATH)
vendor/github.com/lyft/protoc-gen-validate/tests/harness/executor/harness.go:13:2: cannot find package "github.com/lyft/protoc-gen-validate/tests/harness" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/github.com/lyft/protoc-gen-validate/tests/harness (vendor tree)
	/usr/lib/go-1.8/src/github.com/lyft/protoc-gen-validate/tests/harness (from $GOROOT)
	$GOPATH/src/github.com/lyft/protoc-gen-validate/tests/harness (from $GOPATH)
vendor/github.com/lyft/protoc-gen-validate/tests/harness/executor/cases.go:14:2: cannot find package "github.com/lyft/protoc-gen-validate/tests/harness/cases/go" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/github.com/lyft/protoc-gen-validate/tests/harness/cases/go (vendor tree)
	/usr/lib/go-1.8/src/github.com/lyft/protoc-gen-validate/tests/harness/cases/go (from $GOROOT)
	$GOPATH/src/github.com/lyft/protoc-gen-validate/tests/harness/cases/go (from $GOPATH)
vendor/golang.org/x/net/http2/h2i/h2i.go:38:2: cannot find package "golang.org/x/crypto/ssh/terminal" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/golang.org/x/crypto/ssh/terminal (vendor tree)
	/usr/lib/go-1.8/src/golang.org/x/crypto/ssh/terminal (from $GOROOT)
	$GOPATH/src/golang.org/x/crypto/ssh/terminal (from $GOPATH)
vendor/google.golang.org/grpc/credentials/oauth/oauth.go:28:2: cannot find package "golang.org/x/oauth2" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/golang.org/x/oauth2 (vendor tree)
	/usr/lib/go-1.8/src/golang.org/x/oauth2 (from $GOROOT)
	$GOPATH/src/golang.org/x/oauth2 (from $GOPATH)
vendor/google.golang.org/grpc/credentials/oauth/oauth.go:29:2: cannot find package "golang.org/x/oauth2/google" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/golang.org/x/oauth2/google (vendor tree)
	/usr/lib/go-1.8/src/golang.org/x/oauth2/google (from $GOROOT)
	$GOPATH/src/golang.org/x/oauth2/google (from $GOPATH)
vendor/google.golang.org/grpc/credentials/oauth/oauth.go:30:2: cannot find package "golang.org/x/oauth2/jwt" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/golang.org/x/oauth2/jwt (vendor tree)
	/usr/lib/go-1.8/src/golang.org/x/oauth2/jwt (from $GOROOT)
	$GOPATH/src/golang.org/x/oauth2/jwt (from $GOPATH)
vendor/google.golang.org/grpc/examples/helloworld/mock_helloworld/hw_mock.go:7:2: cannot find package "github.com/golang/mock/gomock" in any of:
	$GOPATH/src/github.com/envoyproxy/go-control-plane/vendor/github.com/golang/mock/gomock (vendor tree)
	/usr/lib/go-1.8/src/github.com/golang/mock/gomock (from $GOROOT)
	$GOPATH/src/github.com/golang/mock/gomock (from $GOPATH)
Makefile:34: recipe for target 'build' failed
make: *** [build] Error 1

How to use listener.Filter.ConfigType?

Thank you for the package. It seems that recently listener.Filter.Config (https://godoc.org/github.com/envoyproxy/go-control-plane/envoy/api/v2/listener#Filter) was deprecated in favor of listener.Filter.ConfigType, which is presumably of type types.Any, where types is "github.com/gogo/protobuf/types". I tried converting an &envoyhcm.HttpConnectionManager{} using types.MarshalAny(), but it does not seem to work. My question is: how do I use an instance of HttpConnectionManager so that it is compatible with the new listener.Filter.ConfigType?
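
A hedged sketch of one way to do it with the gogo types package: marshal the manager into an Any and wrap it in the generated Filter_TypedConfig struct (ConfigType is a oneof, so it takes the wrapper rather than a bare *types.Any). The filter name string and the omitted HttpConnectionManager fields are illustrative:

package main

import (
	"github.com/envoyproxy/go-control-plane/envoy/api/v2/listener"
	hcm "github.com/envoyproxy/go-control-plane/envoy/config/filter/network/http_connection_manager/v2"
	"github.com/gogo/protobuf/types"
)

func main() {
	manager := &hcm.HttpConnectionManager{
		StatPrefix: "ingress_http",
		// ... RouteSpecifier, HttpFilters, etc. elided for brevity
	}

	// Pack the config into an Any and wrap it in the oneof struct.
	typedConfig, err := types.MarshalAny(manager)
	if err != nil {
		panic(err)
	}
	filter := listener.Filter{
		Name: "envoy.http_connection_manager",
		ConfigType: &listener.Filter_TypedConfig{
			TypedConfig: typedConfig,
		},
	}
	_ = filter
}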

Streaming gRPC server

At the basic level, the management server provides a streaming gRPC server that implements xDS APIs. This server accepts connections from Envoys, and streams configuration updates to all proxies.

The server has to keep track of (version_info, nonce) for each connected node, as well as the type of the requested resource (routes, listeners, clusters, or multiplexed through a single ADS stream). A dedicated goroutine will be used for each (node, resource_type) tuple. These goroutines can run in parallel. Each goroutine maintains (version_info, nonce) state through request and response cycles.

The job of each routine is to drive the desired configuration in the remote proxies. The routine uses version and nonce to determine whether the configuration is successfully applied. If it's not applied, the routine attempts to apply it with retries.

The desired configuration is an input to each routine or the pool of routines. This can be an individual resource instance (EDS, CDS, RDS, or LDS), or a batch of resources to permit staging coordinated updates (TBD in another issue).

As a first cut, we can stand up a server that hosts a pool of goroutines and manages them (by spinning one out per connection request, allocating state for each, and establishing channels for communication).
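
A sketch of the per-(node, resource type) bookkeeping such a goroutine would carry across request/response cycles; the type and field names are illustrative, not part of the library:

// streamState tracks what was last pushed on one (node, resource type) pair.
type streamState struct {
	node        string // node ID from the DiscoveryRequest
	typeURL     string // e.g. "type.googleapis.com/envoy.api.v2.Cluster"
	versionInfo string // version of the last pushed configuration
	nonce       string // nonce of the last response, used to match ACKs/NACKs
}

// applied reports whether the proxy acknowledged the last pushed version:
// the next request must echo both the nonce and the version that were sent.
func (s *streamState) applied(reqVersion, reqNonce string) bool {
	return reqNonce == s.nonce && reqVersion == s.versionInfo
}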

cc @rshriram @junr03

Avoid repeated push of rejected configuration

If Envoy rejects a config, the snapshot cache will attempt to push the same config immediately after Envoy requests it again. If Envoy does not limit its requests, the server will drive Envoy into a loop of request-receive-reject, potentially causing unnecessary CPU load.
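
A hedged sketch of one mitigation: remember, per (node, type URL), the version the proxy last NACKed and hold the watch open until a newer snapshot exists instead of re-pushing the same one. The bookkeeping type is illustrative and sits outside the library:

package nackexample

import "sync"

// rejected records the snapshot version each (node, type URL) pair last NACKed.
type rejected struct {
	mu   sync.Mutex
	last map[string]string // key: nodeID + "/" + typeURL -> rejected version
}

// MarkNACK records a rejection, detected when a DiscoveryRequest carries an
// error_detail while echoing the nonce of the previous response.
func (r *rejected) MarkNACK(nodeID, typeURL, version string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.last == nil {
		r.last = map[string]string{}
	}
	r.last[nodeID+"/"+typeURL] = version
}

// ShouldRespond returns false when the candidate version is the one the proxy
// already rejected; the caller keeps the watch open until the snapshot changes.
func (r *rejected) ShouldRespond(nodeID, typeURL, candidateVersion string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.last[nodeID+"/"+typeURL] != candidateVersion
}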
