mssql-server-ha's Introduction

This repository contains the source code of the Pacemaker resource agents that ship in the mssql-server-ha package.

This is a snapshot of our SQL Server-internal repository where the actual development takes place. As the commit histories are completely different, we cannot currently accept pull requests to this repository. This snapshot is provided so that users can see the source of the agents, make changes to suit any specific scenarios they have, write agents for other clustering systems following the same protocol, and so on. We intend to migrate development to this repository at a future date.

We're happy to receive bug reports, suggestions and feedback for the code in this repository as GitHub issues.

Kubernetes agents for monitoring SQL Server instances and Availability Groups are also coming, and will be added to this repository at a future date.

Availability Group resource agent ocf:mssql:ag

This is made up of a golang binary go/src/ag-helper and a shell script ag/ag. ag-helper can be built by running GOPATH=$PWD/go go install ag-helper

The agent can be installed by moving the files to these locations:

  • ag/ag to /usr/lib/ocf/resource.d/mssql/ag
  • ag/docs/* to /usr/lib/ocf/lib/mssql/*
  • go/bin/ag-helper to /usr/lib/ocf/lib/mssql/ag-helper

The shell script is the entry point for the resource agent and delegates to the helper binary for most tasks. The helper binary monitors the instance health by running sp_server_diagnostics and the AG health by querying sys.databases. It also implements the promote and demote actions by running the ALTER AVAILABILITY GROUP FAILOVER and ALTER AVAILABILITY GROUP SET (ROLE = SECONDARY) DDLs.
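As a rough illustration of the two DDL statements named above, here is a minimal Go sketch that builds them for a given availability group name. It is not taken from ag-helper: the function names quoteName, promoteStatement and demoteStatement are hypothetical, and bracket-quoting the name (doubling any closing bracket, as T-SQL's QUOTENAME does) is an assumption about how one might safely embed the identifier.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteName wraps an identifier in square brackets and doubles any closing
// bracket, mirroring T-SQL's QUOTENAME with its default delimiter.
// Hypothetical helper, not part of ag-helper.
func quoteName(name string) string {
	return "[" + strings.ReplaceAll(name, "]", "]]") + "]"
}

// promoteStatement returns the DDL the README names for the promote action.
func promoteStatement(agName string) string {
	return fmt.Sprintf("ALTER AVAILABILITY GROUP %s FAILOVER", quoteName(agName))
}

// demoteStatement returns the DDL the README names for the demote action.
func demoteStatement(agName string) string {
	return fmt.Sprintf("ALTER AVAILABILITY GROUP %s SET (ROLE = SECONDARY)", quoteName(agName))
}

func main() {
	// "ag1" is a placeholder availability group name.
	fmt.Println(promoteStatement("ag1"))
	fmt.Println(demoteStatement("ag1"))
}
```

In a real agent these strings would be sent over an existing database connection; the sketch stops at building the text so that it runs without a SQL Server instance.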

Major changes since SQL2017:

  • Provide hostname support for ag and fci
  • Install External Lease in Pacemaker
  • Introduce external write lease handling to Pacemaker AG resource agent
  • Do not wait for databases to come online during failover
  • Bring secondaries offline in post promote
  • Various Pacemaker AG agent fixes for more reliable failovers

Failover Cluster Instance resource agent ocf:mssql:fci

This is made up of a golang binary go/src/fci-helper and a shell script fci/fci. fci-helper can be built by running GOPATH=$PWD/go go install fci-helper

The agent can be installed by moving the files to these locations:

  • fci/fci to /usr/lib/ocf/resource.d/mssql/fci
  • fci/docs/* to /usr/lib/ocf/lib/mssql/*
  • go/bin/fci-helper to /usr/lib/ocf/lib/mssql/fci-helper

The shell script is the entry point for the resource agent and handles starting and stopping the sqlservr process. The script invokes the fci-helper binary to fix up the server name after starting the resource (if necessary), and to monitor the instance health by running sp_server_diagnostics.

Major changes since SQL2017:

  • Ensure ag-helper and fci-helper exit if the resource agent process is killed
  • ag-helper reattempts the connection if it times out during the monitor action

License

MIT

Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

mssql-server-ha's People

Contributors

arsing, microsoft-github-policy-service[bot], yunxijia

mssql-server-ha's Issues

Update this repo

I am attempting to configure the pacemaker AG resource on Ubuntu 20.04 and Pacemaker 2.0.3, with SQL Server 2019. I am running into issues and it would be helpful to have the source of the ag-helper script especially. Based on the behavior I'm seeing it is clear that ag-helper (and/or the libraries it uses) has changed significantly since the snapshot in this repo. The request is to update this repo with a new snapshot if possible.

ag-helper script puts AG into RESOLVING state

System details

SQL: SQL Server 2019 Enterprise
OS: Ubuntu 20.04
Pacemaker: 2.0.3

Issue

It appears the monitor operation in the ag-helper is continually taking the AG offline and putting it in RESOLVING state from a previous good state. I see the following output in pacemaker logs:

INFO: monitor: 2021/12/15 19:49:45 From RetryExecuteWithTimeout - Attempt 1 to connect to the instance at localhost:1433
INFO: monitor: 2021/12/15 19:49:45 Connected to the instance at localhost:1433
INFO: monitor: 2021/12/15 19:49:50 Monitor Caller is: monitor.
INFO: monitor: 2021/12/15 19:49:50 [DEBUG] AG Helper Monitor Role info: AVAILABILITY GROUP prod-dbcluster-2 on instance prod-dbcluster-2d
INFO: monitor: 2021/12/15 19:49:50 Replica is PRIMARY (1)
INFO: monitor: 2021/12/15 19:49:50 Offlining replica...
INFO: monitor: 2021/12/15 19:49:50 Replica is RESOLVING (0)
INFO: monitor: 2021/12/15 19:49:50 Instance name is prod-dbcluster-2d.
INFO: monitor: PROMOTION_SCORE: -INFINITY

It does not appear to give a reason why it is transitioning the state. I have confirmed that the primary replica AG is flipping from PRIMARY to RESOLVING when this happens. If I disable the monitor operation on the resource, the resource comes up in a good state and I can add databases to the AG. However, with monitor disabled failover is broken.
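When triaging logs like the one above, it can help to pull out only the role changes. The following Go sketch scans log text for lines of the form "Replica is ROLE (n)" and reports each change. The function name roleTransitions and the regular expression are assumptions based solely on the log lines quoted in this issue, not on the helper's source.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// roleLine matches helper output such as "Replica is PRIMARY (1)".
// The pattern is inferred from the log excerpt in this issue.
var roleLine = regexp.MustCompile(`Replica is ([A-Z_]+) \(\d+\)`)

// roleTransitions scans log text line by line and returns every change of
// replica role in the order seen, formatted as "OLD -> NEW".
func roleTransitions(log string) []string {
	var transitions []string
	prev := ""
	scanner := bufio.NewScanner(strings.NewReader(log))
	for scanner.Scan() {
		m := roleLine.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue // not a role line
		}
		role := m[1]
		if prev != "" && role != prev {
			transitions = append(transitions, prev+" -> "+role)
		}
		prev = role
	}
	return transitions
}

func main() {
	log := `INFO: monitor: 2021/12/15 19:49:50 Replica is PRIMARY (1)
INFO: monitor: 2021/12/15 19:49:50 Offlining replica...
INFO: monitor: 2021/12/15 19:49:50 Replica is RESOLVING (0)`
	for _, t := range roleTransitions(log) {
		fmt.Println(t)
	}
}
```

Run against the excerpt, this reports a single PRIMARY -> RESOLVING transition, which is the change that coincides with the "Offlining replica..." line.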

Submitting a PR

Hello, am I unable to submit Pull Requests for this repository?

ERROR: Permission to Microsoft/mssql-server-ha.git denied to zimmertr.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Anyway, you guys should list the dependencies on the README for installing the packages. For example consider adding the following to Line #9:

Be sure to properly set your GOPATH environment variable and clone down the required dependencies below before attempting to install `ag-helper` or `fci-helper`.

1) go get github.com/denisenkom/go-mssqldb  
2) go get cloud.google.com/go/civil  
3) go get golang.org/x/crypto/md4 

Otherwise, do you just expect people to crawl their way through dependency hell manually when they compile and install this?

ocf:mssql:fci doesn't restart after change of server name

Hello, I tried to test the ocf:mssql:fci agent on our Pacemaker cluster. I followed the official tutorial. When I tried to create an ocf:mssql:fci resource, it failed, and when I started it with the debug-start command, it threw this error:

[root@virt-537 ~]# pcs resource debug-start mssql-server
crm_resource: Error performing operation: OK
Operation start for mssql-server (ocf:mssql:fci) returned: 'invalid parameter' (2)
58932 58924
 
Jan 04 11:17:32 INFO: mssql_validate
Jan 04 11:17:32 INFO: Resource agent invoked with: start
Jan 04 11:17:32 INFO: mssql_start
Jan 04 11:17:32 INFO: SQL Server started. PID: 58924; user: mssql; command: /opt/mssql/bin/sqlservr
Jan 04 11:17:33 INFO: start: 2022/01/04 11:17:33 fci-helper invoked with hostname [localhost]; port [1433]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-mssql-server-start]; connection-timeout [20]; health-threshold [3]; action [start]
Jan 04 11:17:33 INFO: start: 2022/01/04 11:17:33 fci-helper invoked with virtual-server-name [mssql-server]
Jan 04 11:17:33 INFO: start: 2022/01/04 11:17:33 From RetryExecute - Attempt 1 to connect to the instance at localhost:1433
Jan 04 11:17:33 INFO: start: 2022/01/04 11:17:33 Attempt 1 returned error: Unresponsive or down Unable to open tcp connection with host 'localhost:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
Jan 04 11:17:34 INFO: start: 2022/01/04 11:17:34 From RetryExecute - Attempt 2 to connect to the instance at localhost:1433
Jan 04 11:17:34 INFO: start: 2022/01/04 11:17:34 Attempt 2 returned error: Unresponsive or down Unable to open tcp connection with host 'localhost:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
Jan 04 11:17:35 INFO: start: 2022/01/04 11:17:35 From RetryExecute - Attempt 3 to connect to the instance at localhost:1433
Jan 04 11:17:35 INFO: start: 2022/01/04 11:17:35 Attempt 3 returned error: Unresponsive or down Unable to open tcp connection with host 'localhost:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
Jan 04 11:17:36 INFO: start: 2022/01/04 11:17:36 From RetryExecute - Attempt 4 to connect to the instance at localhost:1433
Jan 04 11:17:36 INFO: start: 2022/01/04 11:17:36 Connected to the instance at localhost:1433
Jan 04 11:17:41 INFO: start: 2022/01/04 11:17:41 Setting local server name to mssql-server...
Jan 04 11:17:41 INFO: start: 2022/01/04 11:17:41 Querying local server name...
Jan 04 11:17:41 INFO: start: 2022/01/04 11:17:41 Local server name is virt-537
Jan 04 11:17:41 INFO: start: ERROR: 2022/01/04 11:17:41 Expected local server name to be mssql-server but it was virt-537
ocf-exit-reason:2022/01/04 11:17:41 Expected local server name to be mssql-server but it was virt-537
Jan 04 11:17:41 INFO: mssql-server start : 2

Log from the 'pcs cluster status' command.

Full List of Resources:
  * fence-virt-535      (stonith:fence_xvm):     Started virt-535
  * fence-virt-537      (stonith:fence_xvm):     Started virt-537
  * mssql-server        (ocf::mssql:fci):        Stopped
 
Failed Resource Actions:
  * mssql-server_monitor_0 on virt-535 'invalid parameter' (2): call=17, status='complete', exitreason='2022/01/04 11:15:39 Expected local server name to be mssql-server but it was virt-535', last-rc-change='2022-01-04 11:15:33 +01:00', queued=0ms, exec=5169ms
  * mssql-server_monitor_0 on virt-537 'invalid parameter' (2): call=17, status='complete', exitreason='2022/01/04 11:15:39 Expected local server name to be mssql-server but it was virt-537', last-rc-change='2022-01-04 11:15:33 +01:00', queued=0ms, exec=5175ms

I think the problem is a missing restart after the local server name is set. It works when I remove the resource and create a new one with the same name. It also works if I restart SQL Server manually.

Monitor delay timeout changes are ignored

I have Pacemaker v.1.14, which manages a SQL Server 2017 cluster in AWS.
The SQL resource was added with the command:

sudo pcs resource create SQLInstance ocf:mssql:fci op defaults timeout=60s --group sql-rg

sudo pcs resource show SQLInstance
Resource: SQLInstance (class=ocf provider=mssql type=fci)
Operations: start interval=0s timeout=1000 (SQLInstance-start-interval-0s)
stop interval=0s timeout=20 (SQLInstance-stop-interval-0s)
monitor interval=10 timeout=30 (SQLInstance-monitor-interval-10)
defaults interval=0s timeout=60s (SQLInstance-defaults-interval-0s)

I noticed that the SQL engine is started as a cluster resource at least twice because the 20-second timeout is exceeded. I decided to change it to 60 seconds with the command:

sudo pcs resource update SQLInstance op monitor interval=10s timeout=60s

This did not change the monitor behaviour. The log still shows a 20-second timeout and the agent still initiates SQL stop requests. This is a real problem: under load, SQL Server stops and starts two to four times in my case.

OSS Development - Pending Tasks

As the README notes, currently this repository contains only snapshots of the source for each SQL Server release. This issue tracks everything that's blocking us from migrating development from our Microsoft-internal repo to this one.

  • Golang dependencies

    For legal reasons, we have fixed versions of all dependencies, including transitive dependencies. We cannot fetch any versions of dependencies that we don't have legal approval for, nor can we fetch any new dependencies. We already have legal approval for all dependencies that get fetched with go get for each of our binaries.

    Currently we have a tarball of deps hosted on an internal server. A build task downloads and unpacks it into $GOPATH/src/vendor

    Options:

    1. dep

      • Does not support Gopkg.toml in $GOPATH/src. So package manifest needs to be created separately for each binary.
      • Con: As an extension of the above, it also creates a per-package vendor directory. Dependencies are fetched over and over again for each binary package.
      • Specifying versions of transitive dependencies requires adding [[override]]s for each of them. Otherwise it always fetches the latest versions of transitive dependencies. The documentation recommends against [[override]] because it expects the first lockfile created to handle that, but does not say how one is supposed to migrate an existing transitive dep version requirement to a new lockfile.
      • ❌ Pulls in more packages than go get would pull in. This appears to be because it pulls in deps for unused binaries. For example, one of our packages ends up with a transitive dependency on petar/GoLLRB because that is one of the deps of this binary in google/btree, even though that binary is never built.
      • Creates dependency on github.com availability for internal builds. The tool does support aliasing the source URL, but seems to require that the internal repos support the go-import meta tag, which ours don't. Even if that wasn't an issue, it's not clear if the tool will fall back to github.com when run by users outside Microsoft who don't have access to the internal repos.
    2. glide

      • Please consider trying to migrate from Glide to dep. ... Glide will continue to be supported for some time but is considered to be in a state of support rather than active feature development.

      • ❌ Pulls in more packages than go get would pull in. Reason unclear.
    3. godep - Archived in favor of dep

    4. govendor

      • Does not reliably add transitive deps, but manual editing of the generated vendor.json was sufficient.
      • Requires listing every sub-package individually, which is both a pro (allows ignoring files that Legal did not approve) and a con (the file is 800 lines long for ~30 packages).
      • Con: Creates dependency on github.com availability.
    5. Check in and push the vendor directory with flat files and no .git directories - the same as what our tarball unpacks to.

      • Con: Bloats repo.
    6. Check in and push the vendor directory with submodules to the corresponding GH repos

      • Con: Creates dependency on github.com availability.
    7. Makefile with git clone commands.

      • Con: Creates dependency on github.com availability.
  • Golang compiler binaries

    We have a specific version of the golang compiler that we have legal approval for. Just like above, we have a tarball in the build server containing the linux_amd64 binaries for that version that gets downloaded and unpacked at build time.

    How do we migrate this workflow to this public repo? We don't want to download the version from golang.org for every build, and we don't want to rely on golang.org hosting this version forever anyway.

    Answer: Have the Makefile assume go is in $PATH and $GOROOT is set. Have internal builds download and extract the internal tarball like it does now.

  • golang:alpine Docker image

    The Kubernetes agents are compiled with a particular version of the golang:alpine Docker image that, again, is the only version we have legal approval for. Currently we source the image from a Microsoft-internal Docker repo.

    Can we rely on golang to keep old Docker tags around forever? Or should we publish the image on Docker hub under the microsoft owner?

    Answer: Have a Makefile variable for the Docker image name. Default it to golang:alpine but override it for internal builds to the internal repo's image.

  • master and sql2017 branches

    Currently the master branch of this repo is a misnomer, since it actually contains the sql2017 code. Our internal repo's master is targeted towards the next release. This is now fixed.
