
switchboard's Introduction

DEPRECATED: this repo has been merged into https://github.com/cloudfoundry/pxc-release

switchboard

A TCP router written in Go.

Developed to replace HAProxy as the proxy tier enabling high availability for the MySQL DBaaS for Cloud Foundry. Responsible for routing client connections to one node of a backend cluster at a time, and for failover when a cluster node fails. For more information, see the develop branch of cf-mysql-release/docs/proxy.md.

Why switchboard?

There are several other proxies out there: Nginx, HAProxy, and even MariaDB's MaxScale. None of them met a criterion that is critical for cluster performance when a database server becomes unhealthy but is still accessible. Switchboard detects this condition (via healthchecks) and severs the connections. This forces clients to reconnect, at which point they are routed to a healthy backend. From the client's perspective, it looks as if it were connected to a single backend that briefly disappeared and was immediately available again.

Development

Proxy

Install Go by following the directions found here

Running the tests requires Ginkgo:

go get github.com/onsi/ginkgo/ginkgo

Run the tests using the following command:

./bin/test

UI

Ensure phantomjs v2.0 or greater is installed.

To do this on OS X using Homebrew:

brew install phantomjs

Run the UI tests using the following command:

./bin/test-ui

Build UI assets:

./bin/build-ui


switchboard's Issues

Automatic switchboard process failure and restart

Our UAA service fluctuated for a minute due to a JDBC connection error. We traced the downtime to a restart of the MySQL proxy's switchboard process; the downtime occurred at the same time the process restarted. Based on the logs, can you tell us what could have caused the failure and restart? It has happened twice in 10 days.

Details about our deployment:
3 MySQL nodes and 2 proxy nodes
Release version: CF/287; MySQL version: cf-mysql/32

{"timestamp":"1552556151.317238569","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.lock.lost-lock","log_level":2,"data":{"error":"Unexpected response code: 500","key":"v1/locks/mysql_lock","session":"1","value":""}}
{"timestamp":"1552556151.317294359","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.lock.done","log_level":1,"data":{"key":"v1/locks/mysql_lock","session":"1","value":""}}
{"timestamp":"1552556151.317338943","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.registration-runner.poll-until-signaled.deregistering-service","log_level":1,"data":{"service":"mysql","session":"2.1","update-interval":"1.5s"}}
{"timestamp":"1552556152.952075720","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.registration-runner.poll-until-signaled.finished","log_level":1,"data":{"service":"mysql","session":"2.1","update-interval":"1.5s"}}
{"timestamp":"1552556152.952127218","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.registration-runner.finished","log_level":1,"data":{"service":"mysql","session":"2"}}
{"timestamp":"1552556152.952205896","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.Received signal","log_level":1,"data":{"signal":2}}
{"timestamp":"1552556152.952279568","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.Received signal","log_level":1,"data":{"signal":2}}
{"timestamp":"1552556152.952310562","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.Proxy runner has exited","log_level":1,"data":{}}
{"timestamp":"1552556152.952507496","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.Switchboard exited unexpectedly","log_level":3,"data":{"error":"Exit trace for group:\nlock exited with error: lock lost\nregistration exited with nil\nhealth exited with nil\nmonitor exited with nil\napi exited with nil\nbridge exited with nil\n","proxyConfig":{"Port":3306,"Backends":[{"Host":"10.3.5.19","Port":3306,"StatusPort":9200,"StatusEndpoint":"galera_status","Name":"backend-0"},{"Host":"10.3.6.19","Port":3306,"StatusPort":9200,"StatusEndpoint":"galera_status","Name":"backend-1"},{"Host":"10.3.7.19","Port":3306,"StatusPort":9200,"StatusEndpoint":"galera_status","Name":"backend-2"}],"HealthcheckTimeoutMillis":5000},"trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-incubator/switchboard/vendor/code.cloudfoundry.org/lager.(*logger).Fatal(0xc420115da0, 0x8bf522, 0x1f, 0xac1ce0, 0xc420585dc0, 0xc42049a858, 0x1, 0x1)\n\t/var/vcap/packages/switchboard/src/github.com/cloudfoundry-incubator/switchboard/vendor/code.cloudfoundry.org/lager/logger.go:131 +0xc7\nmain.main()\n\t/var/vcap/packages/switchboard/src/github.com/cloudfoundry-incubator/switchboard/main.go:151 +0xd13\n"}}
panic: Exit trace for group:
lock exited with error: lock lost
registration exited with nil
health exited with nil
monitor exited with nil
api exited with nil
bridge exited with nil

Please help us solve this issue; we have been hitting it many times recently.
Thanks in advance. :)

README pls

How would we use this for our own services?

What is switchboard?

I know we have Gorouter and Diego has some sort of routing logic. What role does switchboard fulfill?

Switchboard should expose a port for non-primary nodes

We have several larger CF installations (2500 spaces, 15000 apps), and periodic scraping of CAPI for metadata to power logging and telemetry causes significant performance impact due to the load on the primary DB node.

It would be great if I could configure a set of read-only CAPI nodes pointed at a separate port on switchboard which directs to the non-primary nodes. I could then register these CAPI nodes to a new route like reporting-api.<system domain> and use this for my telemetry queries.

Docs / Ops Guide

Hi

Is there really only one read-only command available (an API with stats), plus a dashboard?

As an operator I need more docs. For example:

  • how to mark a (healthy) Galera node as down, so it receives no Switchboard traffic
  • how to force Switchboard to route traffic to another node (some kind of switch), even if the current node is healthy

thanks

how to split/distribute connections to all the nodes in the cluster

While fighting to mitigate the popular "too many connections" issue (max_connections), I found that, in a cluster of 3 nodes, only one node was receiving all the connections.

The assumption was that if each node is configured to handle 100 connections, the cluster capacity would be ~300 connections.

I am therefore wondering whether there is a way to configure switchboard like HAProxy with balance roundrobin, or something else that could balance the connections among all the nodes. Otherwise, the current behavior is more like a failover, in which only 2 nodes are needed and there is no advantage to having a cluster with more than 2 nodes.

Discover master from etcd/consul?

I have a feeling that switchboard is designed only for master-master clusters, where switchboard itself is responsible for master selection.

In our PG cluster system we orchestrate master selection and promotion ourselves; we just want to communicate the new master/leader to the TCP routing layer and have it sever client connections to force them to reconnect. Is that possible?

README section: Why switchboard vs haproxy vs nginx

In the spirit of Hashicorp explaining why to use one of their projects versus existing solutions, could we add a section to the README about what Switchboard does, how it is better, and where to use it instead of haproxy/nginx/whatever?

Why health check an http endpoint rather than backend service's own TCP endpoint?

At Marco's suggestion, I'm looking at switchboard for a different service backend than cf-mysql-release.

I see the health check system is hard-coded to test an HTTP endpoint (which I assume is different from the service's TCP endpoint), but why?

What is wrong with the idea of health checking the backend service's active TCP connection?

Or how do you recommend that arbitrary backend services co-publish additional HTTP endpoints just for health checking? Is there an agent process that you are using that we could borrow?

I don't really want backend services to have to run an additional agent (it makes process monitoring hard inside Docker containers, for example), so I'm a little torn about this requirement for an additional HTTP endpoint for health checking.

go get - fails on main.go

$ go version
go version go1.4 darwin/amd64

$ go get github.com/cloudfoundry-incubator/switchboard
# github.com/cloudfoundry-incubator/switchboard
Projects/go/src/github.com/cloudfoundry-incubator/switchboard/main.go:28: undefined: cf_lager.AddFlags
Projects/go/src/github.com/cloudfoundry-incubator/switchboard/main.go:31: assignment count mismatch: 2 = 1

Are you doing some probing on the incoming connections?

Are you doing any probing on the incoming connections? If so, can I get a pointer to it in the code?
I am looking at different ways that proxy applications identify incoming connections.
How do they verify whether a connection is HTTP, TLS, or something else?

Thread Leak?

Hi all,

I'm running SCF, deployed with Helm, which all worked fine. After a while my UAA was unhappy, and I traced it back to the Switchboard process on the MySQL pod. I have no idea whether Switchboard was responsible or was just the first thing to fail due to a lack of file descriptors or some other resource.

It seems that there's some sort of resource leak?

{"timestamp":"1559839050.040677309","source":"/var/vcap/packages/switchboard/bin/switchboard","message":"/var/vcap/packages/switchboard/bin/switchboard.New active backend","log_level":1,"data":{"backend":{"host":"mysql-0.mysql-set.uaa.svc.cluster.local","port":13306,"status_port":9200,"healthy":true,"name":"backend-0","currentSessionCount":0}}}
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f559a35ef67 m=26 sigcode=18446744073709551610

goroutine 0 [idle]:

goroutine 2 [running]:
runtime.systemstack_switch()
        /var/vcap/packages/golang/src/runtime/asm_amd64.s:298 fp=0xc42004c788 sp=0xc42004c780 pc=0x45bde0
runtime.gcStart(0x0, 0x2, 0x1a7af62145b4, 0xc400000000)
        /var/vcap/packages/golang/src/runtime/mgc.go:1319 +0x2c1 fp=0xc42004c7a8 sp=0xc42004c788 pc=0x41a301
runtime.forcegchelper()
        /var/vcap/packages/golang/src/runtime/proc.go:251 +0x6d fp=0xc42004c7e0 sp=0xc42004c7a8 pc=0x42eedd
runtime.goexit()
        /var/vcap/packages/golang/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc42004c7e8 sp=0xc42004c7e0 pc=0x45ea11
created by runtime.init.4
        /var/vcap/packages/golang/src/runtime/proc.go:234 +0x35

I tried starting it manually, and got this:

# ./switchboard_ctl start
2019-06-06 18:10:17 +0000 ----- Starting switchboard...
------------ STARTING switchboard_ctl at Thu Jun  6 18:10:17 UTC 2019 --------------
------------ STARTING switchboard_ctl at Thu Jun  6 18:10:17 UTC 2019 --------------
su: failed to execute /bin/bash: Resource temporarily unavailable

Apparently there are no ulimits set:

# ulimit
unlimited

Recreating the pod didn't help, but recreating the whole node did. Does this suggest the culprit could have been anything on that node?
