open-contracting / deploy
Deployment configuration and scripts
Home Page: https://ocdsdeploy.readthedocs.io/en/latest/
License: Apache License 2.0
@robredpath has written in CRM-4925:
The images for Hetzner servers can be downloaded from https://download.hetzner.de/bootimages/ (user: hetzner, password: download), so, if we wanted, I think we could set up some kind of Vagrant+Salt server testing setup.
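For illustration, a minimal sketch of what such a Vagrant+Salt setup could look like. This is a hypothetical Vagrantfile: the box name, folder layout and masterless mode are assumptions, not our actual configuration.

```ruby
# Hypothetical Vagrantfile sketch for testing Salt states locally.
Vagrant.configure("2") do |config|
  config.vm.box = "debian/buster64"  # assumed box, not necessarily what we run

  # Share this repository's states and pillars with the guest.
  config.vm.synced_folder "salt/", "/srv/salt"
  config.vm.synced_folder "pillar/", "/srv/pillar"

  # Apply the highstate without a Salt master. A minion config file setting
  # file_client: local would also be needed for a real masterless run.
  config.vm.provision :salt do |salt|
    salt.masterless = true
    salt.run_highstate = true
  end
end
```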
https://github.com/open-contracting/deploy/blob/master/salt/apache/ocds-docs-live.conf.include#L249
{# This solves a problem: "Convert to Spreadsheet" was working on HTTPS but not HTTP. Fix this by forcing the header to be something the CSRF check likes. #}
{# When we change HTTPS to force, there will be no HTTP traffic and this block should be removed. #}
{% if testing %}
<Location /review>
Header add referer "https://testing.live.standard.open-contracting.org"
RequestHeader set referer "https://testing.live.standard.open-contracting.org"
</Location>
<Location /infrastructure/review>
Header add referer "https://testing.live.standard.open-contracting.org"
RequestHeader set referer "https://testing.live.standard.open-contracting.org"
</Location>
{% endif %}
{% if not testing %}
<Location /review>
Header add referer "https://standard.open-contracting.org"
RequestHeader set referer "https://standard.open-contracting.org"
</Location>
<Location /infrastructure/review>
Header add referer "https://standard.open-contracting.org"
RequestHeader set referer "https://standard.open-contracting.org"
</Location>
{% endif %}
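As the comment above notes, this block can go once there is no HTTP traffic. A hedged sketch of what forcing HTTPS might look like (the VirtualHost details are illustrative, not the live config):

```apache
<VirtualHost *:80>
    ServerName standard.open-contracting.org
    # Redirect all plain-HTTP traffic to HTTPS, after which the Referer
    # workaround for the CSRF check is no longer needed and can be removed.
    Redirect permanent / https://standard.open-contracting.org/
</VirtualHost>
```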
Document how to test changes to Salt against a virtual machine and in a separate branch, perhaps through a simple worked example.
https://standard.open-contracting.org/infrastructure/review/
This doesn't occur with the OCDS DRT: https://standard.open-contracting.org/review/
It would be helpful for analysts to know where the hosted Kingfisher views logs are.
pip==8.1.2
I figure we can at minimum use > instead of ==.
From #28
Question 1: Which box should this be on - a new box? A 1GB Bytemark box should be fine.
Question 2: For alerts, we need a way to send email. What shall we use? AWS has an email sending service, or are there other options?
https://ocdsdeploy.readthedocs.io/en/latest/server-monitoring.html
Document how to fully set up the agent on each server so that others can do this. Document how to set up a server, or more likely, link to existing documentation.
It seems to send a new message every few minutes. I thought we weren't using Icinga and Nagios anymore? I deleted all the mail messages up to now. Sample subjects on ocdskingfisher-new:
U 16 [email protected] Thu Oct 17 01:17 25/889 [RECOVERY] disk / on process.kingfisher.open-contracting.org is OK!
U 17 [email protected] Thu Oct 17 01:17 25/913 [RECOVERY] disk on process.kingfisher.open-contracting.org is OK!
U 18 [email protected] Thu Oct 17 01:19 25/874 [PROBLEM] procs on process.kingfisher.open-contracting.org is WARNING!
U 19 [email protected] Thu Oct 17 01:23 25/927 [PROBLEM] memory on process.kingfisher.open-contracting.org is UNKNOWN!
U 20 [email protected] Thu Oct 17 01:34 25/885 [PROBLEM] load on process.kingfisher.open-contracting.org is CRITICAL!
U 21 [email protected] Thu Oct 17 01:42 25/907 [PROBLEM] apt on process.kingfisher.open-contracting.org is WARNING!
Sample message:
Return-Path: <[email protected]>
X-Original-To: root@localhost
Delivered-To: root@localhost
Received: by process.kingfisher.open-contracting.org (Postfix, from userid 112)
id 188B05D00CD7; Tue, 23 Jul 2019 06:25:46 +0200 (CEST)
Subject: [PROBLEM] load on process.kingfisher.open-contracting.org is CRITICAL!
To: <root@localhost>
X-Mailer: mail (GNU Mailutils 3.4)
Message-Id: <20190723042548.188B05D00CD7@process.kingfisher.open-contracting.org>
Date: Tue, 23 Jul 2019 06:25:46 +0200 (CEST)
From: [email protected]
X-IMAPbase: 1571862900 21517
Status: O
X-UID: 5721
***** Service Monitoring on process *****
load on process.kingfisher.open-contracting.org is CRITICAL!
Info: CRITICAL - load average: 8.56, 7.87, 7.90
When: 2019-07-23 06:25:46 +0200
Service: load
Host: process.kingfisher.open-contracting.org
IPv4: 127.0.0.1
IPv6: ::1
robredpath (re: stopping and starting Scrapyd): under what circumstances might I want to do this? [from open-contracting/kingfisher-collect#103]
Add content to explain
To document:
To discuss and document:
Other:
The current process depends on making entries in Open Data Services Coop resources and using the Open Data Services Coop deploy token. Document that fact here, or discuss a new process so that OCP staff can make new servers and document that here.
https://ocdsdeploy.readthedocs.io/en/latest/making-changes.html
Discuss testing against a virtual machine or making changes against live servers. Discuss procedures for people who cannot deploy, for whatever reason, to make changes (e.g. always via pull request?). Have clearer guidelines about when it is and is not appropriate to commit straight to master. Document.
https://ocdsdeploy.readthedocs.io/en/latest/deploying.html
Work out how we’re going to make a deploy token work between Open Data Services and OCP, so that OCP staff can deploy. Document.
Invalid HTTP_HOST header: '46.43.2.235'. You may need to add '46.43.2.235' to ALLOWED_HOSTS.
Invalid HTTP_HOST header: '46.43.2.235:443'. You may need to add '46.43.2.235' to ALLOWED_HOSTS.
Invalid HTTP_HOST header: 'live.standard-search.opencontracting.uk0.bigv.io'. You may need to add 'live.standard-search.opencontracting.uk0.bigv.io' to ALLOWED_HOSTS.
Let's add these to the allowed hosts (the IP is for the standard-search server).
salt-ssh '*' pkg.autoremove list_only=True
returns a lot of packages that can be removed (some of which were installed for icinga2/nagios #45 (comment) but not all).
We tried to move this to the ocds-docs-live server, but something hadn't gone totally right: when we decommissioned an old server today, http://ocds.opendataservices.coop/standard/r/1__0__RC/en/standard/intro/ went offline.
But a message was received saying that's not an important site any more? Is that right?
If so, we can remove ocds-legacy from deploy repository to keep things clean.
If not, we should be able to put it back online.
I see that you need to first deploy with https='certonly', then with either 'yes' or 'force'.
Original issue title and description
{{ servername }}_acquire_certs will always error if certs not yet created
This state runs:
/etc/init.d/apache2 reload; letsencrypt certonly --non-interactive --no-self-upgrade --expand --email [email protected] --agree-tos --webroot --webroot-path /var/www/html/ {{ domainargs }}
When changing servername, this outputs:
stderr:
Job for apache2.service failed because the control process exited with error code.
See "systemctl status apache2.service" and "journalctl -xe" for details.
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator webroot, Installer None
Obtaining a new certificate
Performing the following challenges:
http-01 challenge for cove-live.oc4ids.opencontracting.uk0.bigv.io
http-01 challenge for master.cove-live.oc4ids.opencontracting.uk0.bigv.io
Using the webroot path /var/www/html for all unmatched domains.
Waiting for verification...
Cleaning up challenges
stdout:
Reloading apache2 configuration (via systemctl): apache2.service failed!
IMPORTANT NOTES:
- Congratulations! Your certificate and chain have been saved at:
/etc/letsencrypt/live/cove-live.oc4ids.opencontracting.uk0.bigv.io/fullchain.pem
Your key file has been saved at:
/etc/letsencrypt/live/cove-live.oc4ids.opencontracting.uk0.bigv.io/privkey.pem
Your cert will expire on 2020-01-27. To obtain a new or tweaked
version of this certificate in the future, simply run certbot
again. To non-interactively renew *all* of your certificates, run
"certbot renew"
- If you like Certbot, please consider supporting our work by:
Donating to ISRG / Let's Encrypt: https://letsencrypt.org/donate
Donating to EFF: https://eff.org/donate-le
systemctl status apache2.service -n 10
shows:
Oct 29 21:53:08 cove-live.oc4ids.opencontracting.uk0.bigv.io apachectl[32419]: AH00526: Syntax error on line 69 of /etc/apache2/sites-enabled/cove.conf:
Oct 29 21:53:08 cove-live.oc4ids.opencontracting.uk0.bigv.io apachectl[32419]: SSLCertificateFile: file '/etc/letsencrypt/live/cove-live.oc4ids.opencontracting.uk0.bigv.io/cert.pem' does not exist or is empty
Oct 29 21:53:08 cove-live.oc4ids.opencontracting.uk0.bigv.io apachectl[32419]: Action 'graceful' failed.
Oct 29 21:53:08 cove-live.oc4ids.opencontracting.uk0.bigv.io apachectl[32419]: The Apache error log may have more information.
Oct 29 21:53:08 cove-live.oc4ids.opencontracting.uk0.bigv.io systemd[1]: apache2.service: Control process exited, code=exited status=1
Oct 29 21:53:08 cove-live.oc4ids.opencontracting.uk0.bigv.io systemd[1]: Reload failed for The Apache HTTP Server.
As I understand, the cert.pem file won't exist until the letsencrypt certonly command is run. Should we instead reload Apache after letsencrypt certonly?
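One possible fix, sketched as a hypothetical Salt state (the IDs and arguments here are illustrative, not the repository's actual state): run letsencrypt certonly first, and make the Apache reload depend on it, so the certificate files exist before Apache re-reads its config.

```yaml
# Hypothetical sketch: acquire the certificate first, then reload Apache.
acquire_certs:
  cmd.run:
    - name: letsencrypt certonly --non-interactive --webroot --webroot-path /var/www/html/ {{ domainargs }}

reload_apache:
  service.running:
    - name: apache2
    - reload: True
    # Reload only after the certificate command has run, so that
    # SSLCertificateFile points at files that actually exist.
    - watch:
      - cmd: acquire_certs
```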
https://ocdsdeploy.readthedocs.io/en/latest/making-changes.html
Discuss procedures and document such that all staff can make changes to these.
OpenDataServices/opendataservices-deploy#81
Check if we need this fix in this repo (I suspect we do), and apply it if so.
I assume these are one-time-use utilities in cases where apache or uwsgi are removed from a server (though I don't know in what case that would occur).
Though we should be able to reload almost everything* from files on disk, it would take a while.
This came up in conversation with @kindly.
P.S. (*) The tiny exception is explained in open-contracting/kingfisher-process#122
To reduce duplication between Toucan, etc.
Please use @romifz's @cds.com.py address.
I think the only difference is the directory to which files are mirrored. This can be a configuration in .travis.yml (profiles already have partial control over the directory).
That way, we'll have only one deploy-docs.sh script which will be easier to maintain.
The script can check that all necessary environment variables are set.
After #8 is merged.
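The environment-variable check could be as simple as this sketch (the variable name and message are made up for illustration, not taken from the existing deploy-docs.sh):

```shell
# Hypothetical fragment for a shared deploy-docs.sh: fail fast if a required
# variable is unset, then use it as the mirror directory.
MIRROR_DIR="docs/en"  # in practice this would come from .travis.yml
: "${MIRROR_DIR:?MIRROR_DIR must be set}"
echo "Mirroring built docs into ${MIRROR_DIR}"
```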
"ocdsdeploy" seems free as a project URL?
The DSN for open-contracting-validator isn't the same as those in this repo. So, I assume there is another project that is the 'real' one for the OCDS and OC4IDS Data Review Tools.
Use .gitmodules for the private repos and add setup instructions to readme.
For updating the submodules, I prefer instructions in the readme to a shell script, but if the shell script is kept, it should use --rebase to avoid extra merge commits.
Nothing reports to it. Its DSN doesn't occur in this repository.
Discuss if this will be the same or different in future. Document the procedure here more fully.
Important for server maintenance reasons
From #28
The agents are at https://prometheus.io/docs/instrumenting/exporters/
I think we should be PR’ing changes to this repo.
'ocdskit-web' will become confusing over time, as we (and new users) will forget that it was the old name for Toucan.
https://github.com/open-contracting/kingfisher-views/pull/32/files removes requirements.txt
This unfortunately means we have a failed state:
ID: /home/ocdskfp/ocdskingfisherviews/.ve/
Function: virtualenv.managed
Result: False
Comment: An exception occurred in this state: Traceback (most recent call last):
..............
FileNotFoundError: [Errno 2] No such file or directory: '/home/ocdskfp/ocdskingfisherviews/requirements.txt'
Started: 09:53:07.347843
Duration: 369.936 ms
Changes:
When we wrote these scripts, Redash provided an Ubuntu install script.
Now they don't: they only provide a Docker version.
So this won't work on a fresh server (a step fails with a 404 error).
This also raises worries that on the next major upgrade we might get into difficulties, as the upgrade script we are using won't be the right thing to do any more. (We use our own version, for reasons explained at the top of https://github.com/open-contracting/deploy/blob/master/salt/redash/upgrade-nointeraction )
We would like to switch to Prometheus for OCP server and service monitoring. This is a very popular fully open source project that we now use on our servers - see https://prometheus.io/
Its model is that a small agent runs on the target as a service and makes an HTTP endpoint available. The server component then regularly "pulls" data from that endpoint. The endpoint returns a plain text file with a bunch of keys; keys can be fully defined by you. (This contrasts with a "push" model that others use, though you can do "push" stuff if you really want.)
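For illustration, a made-up fragment of what such an endpoint returns. The metric names are modelled on node_exporter's output, not captured from our servers:

```
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 7.90
# HELP node_filesystem_avail_bytes Filesystem space available.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{mountpoint="/"} 1.2e+10
```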
Data is then available in a nice web UI, with current status, historical graphs and alarms. Other dashboards can be hooked up if you want.
The machine exporter ( https://github.com/prometheus/node_exporter ) exports stats such as CPU, RAM and disk use. Historical data on this lets us answer questions about how much load a machine typically has; these questions have come up before in server planning. We set this up under its own user on each server, and in our experience it consumes minimal resources.
We also use https://github.com/prometheus/blackbox_exporter, which lets you monitor websites for good service. (I know we have UptimeRobot, but a little more doesn't hurt.)
We haven’t used these agents before, but https://prometheus.io/docs/instrumenting/exporters/ lists agents for both Redis and Postgres, which is good for Kingfisher.
The pull rather than push model makes for a much simpler setup. You can even have more than one server at a time pull data from an endpoint. This would allow us to set up all the monitors and alerts in ODSC’s infrastructure for now, but if OCP ever wanted the data to appear in their own server later, this would be easy to do. It also allows you to run a server on your laptop to test things, but still pull real data, which is quite nice.
You can write custom exporters, because the HTTP endpoint format is so simple. They even provide an official Python library so you can easily measure metrics inside your app: https://prometheus.io/docs/instrumenting/clientlibs/ We could use this in things like Kingfisher, to monitor the length of any queues, for instance, or to monitor views’ progress.
All configuration is done by files on disk, mostly YAML. This means that Salt can simply, immediately, and with no human intervention set up a fully working system. (This contrasts nicely with our current system, where Salt installs some stuff but then you have to go on by hand to set up the machine in the monitoring network with some awkward steps.)
We’d do the basics initially and then talk with you to see what else we wanted to add once it's up and running. But first we wanted to check in quickly to see what you thought of Prometheus?
(Copied from letsencrypt.sls)
The version of letsencrypt in the 16.04 repo is tragically old (0.4.1) and predates the renaming to certbot, the nice Apache support, etc. The version in the 18.04 repo is just an alias for certbot.
When we get rid of our last 16.04 servers, we can just switch to certbot.
https://ocdsdeploy.readthedocs.io/en/latest/salt.html
Work out the minimum version of Salt required and document here for people who want to install a suitable version. Link to install pages for common operating systems.
ocds-redash:
- Detected conflicting IDs, SLS IDs need to be globally unique.
The conflicting ID is 'restart-nignx' and is found in SLS 'base:ocds-redash' and SLS 'base:prometheus-client-nginx'
https://standard.open-contracting.org/profiles/eu/master/en/
https://standard.open-contracting.org/profiles/gpa/master/en/
Is this because they are on master? Can we make an exception?
Currently the standard site proxies traffic to the data tool via the FQDN "oc4ids.cove.live.opendataservices.coop"
https://github.com/open-contracting/deploy/blob/master/pillar/live_pillar.sls#L3
We should just change that to "cove.cove-live.oc4ids.opencontracting.uk0.bigv.io", like the other server. That will need to be changed on the Data Tool server too, and we should make sure the SSL cert is obtained correctly.
Process and Views (if open-contracting/kingfisher-summarize#34 is pursued) only interact at the level of the database, i.e. Views can work with any database whose schema is the same as Process.
As such, I don't see why both should always be deployed at the same time, as is currently the case.
If one of the two has a broken deployment, it shouldn't cause the other to fail, like in #24.