puppet's Issues

mesos.o.b.e nginx proxies only read DNS at startup, eventually proxy to wrong mesos master

When a leader election occurs, the DNS name leader.mesos changes. Unfortunately, nginx only resolves the name once, at startup, so it continues to proxy to a master which is no longer leading. That master then issues a redirect to the new leader, but with the internal (non-proxied) hostname and port, so everything breaks.

Seems like the commercial version of nginx supports re-resolving (boo), but there are some tricks online to get the open-source version to re-resolve, too.
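
One of those tricks: put the upstream name in a variable. nginx re-resolves variable-based proxy_pass targets through a configured resolver instead of caching the address at config load. A minimal sketch, assuming mesos-dns answers on localhost and the master listens on its default port 5050:

resolver 127.0.0.1 valid=10s;

server {
    listen 80;

    location / {
        # a variable target forces a fresh DNS lookup (per the resolver's
        # TTL) on each request, instead of the one-time startup resolution
        set $mesos_leader leader.mesos;
        proxy_pass http://$mesos_leader:5050;
    }
}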

sks-keyserver: dump keys to mirrors

there aren't that many high-quality sks keyserver keydumps available for starting new servers from. most of the existing ones are low-bandwidth, slow, or unreliable at best, so I think it would be valuable for us to offer a keydump. We can put it on mirrors and automatically get http/s, rsync, and ftp as well, at a cost of just ~12GB of disk space.

Here is my tentative proposal for how to do it, but I'm not sure it's a good or tractable design.

some constraints:

  1. The keyserver itself needs to be turned off during the dump (~3 minutes)
  2. The dump command (sks dump $count $output_directory $output_prefix) is a little too smart - it tries to split the dump into separate files (each with $count keys) and write them to the specified output directory itself, rather than dumping to a single file / stdout and letting us split/compress it ourselves afterwards.

my proposal would be something like the following:

write a sync-archive-style script in ocf_mirrors that uses a keytab like ocfbackups to log into the keyserver and (see the sketch after the list):

  1. rm -r /var/lib/sks/dump/*
  2. systemctl stop sks-keyserver
  3. docker run --rm -it -w /var/lib/sks -v /var/lib/sks:/var/lib/sks zhusj/sks sks dump 20000 /var/lib/sks/dump sks-dump
  4. systemctl start sks-keyserver
    then, back on the host, rsync -azH --delete something@pgp:/var/lib/sks/dump /opt/mirrors/ftp/sks-keydump/
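
a rough sketch of that script, assuming the pgp host alias and the ocfbackups principal (the trap makes sure the keyserver comes back up even if the dump fails):

#!/bin/bash
set -euo pipefail

ssh -K ocfbackups@pgp '
  set -e
  rm -rf /var/lib/sks/dump/*
  systemctl stop sks-keyserver
  # restart the keyserver no matter how the dump goes
  trap "systemctl start sks-keyserver" EXIT
  docker run --rm -w /var/lib/sks -v /var/lib/sks:/var/lib/sks \
    zhusj/sks sks dump 20000 /var/lib/sks/dump sks-dump
'

rsync -azH --delete ocfbackups@pgp:/var/lib/sks/dump/ /opt/mirrors/ftp/sks-keydump/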

what do y'all think?

nix jessie code

thanks to @jvperrin we're now an all-Stretch shop, minus the temporary old-death resurrection. We can now kill any conditionals targeting Jessie in the codebase, and start replacing them with conditionals targeting buster, lol.

Retry obtaining certs if domains don't match

One problem right now with the Let's Encrypt changes (in #337) is that if new domains are added in LDAP, they likely won't be in DNS yet, so the request for a new cert will fail and won't be retried, since the cert has been refreshed once and that's it.

If the domains requested don't match the cert, or perhaps if the previous cert request failed, we should retry the request, since giving up isn't a good option. I'm not sure what the best option is here. One quick and dirty trick that comes to mind is a lockfile of sorts: if a cert fails to be obtained, a future run can check whether the file exists and schedule a retry. I'd hope there's something cleaner than this, maybe a custom fact that shows the domains in each cert by parsing openssl x509 -in cert.pem -text output?
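
A minimal sketch of such a fact, assuming certs live somewhere like /etc/ssl/lets-encrypt (the path and fact name are made up):

#!/bin/bash
# external fact: emit the SANs on each cert so puppet can compare them
# against the domains in LDAP; path and fact name are illustrative
set -euo pipefail
shopt -s nullglob

for cert in /etc/ssl/lets-encrypt/*.crt; do
  name=$(basename "$cert" .crt)
  domains=$(openssl x509 -in "$cert" -noout -text |
    grep -A1 'Subject Alternative Name' | tail -n1 |
    sed -e 's/DNS://g' -e 's/ //g')
  echo "cert_domains_${name}=${domains}"
done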

mysql root password preseed doesn't work

the preseed file we include for debconf (https://github.com/ocf/puppet/blob/master/modules/ocf_mysql/manifests/init.pp#L6) has two problems. First, we're now on mariadb-server-10.1, so the package name for debconf needs to be updated. More importantly, the mariadb-server-10.1 package's debconf templates don't appear to support the mysql-server/root_password and mysql-server/root_password_again options at all:

root@e5a1d73f257e:/# sudo debconf-show mariadb-server-10.1                                                                                                                               
  mariadb-server-10.1/nis_warning:
  mariadb-server-10.1/old_data_directory_saved:
  mariadb-server-10.1/postrm_remove_databases: false

unless I'm interpreting this incorrectly, we'll need to change the way we preseed the root password and do initial grants.
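
If so, one possible replacement is to skip debconf entirely and set the password with plain SQL after installation; a sketch, assuming a fresh Debian mariadb install lets root in over the unix socket without a password (the actual password would be templated in by puppet):

mysql --user=root <<'EOF'
-- MariaDB 10.1 syntax; newer versions could use ALTER USER instead
SET PASSWORD FOR 'root'@'localhost' = PASSWORD('some-root-password');
FLUSH PRIVILEGES;
EOF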

Augment mirrors monitoring

We should add multiple upstreams to healthcheck against and take the max of the last_updated timestamp (or the analogue for other distros). This will prevent the situation where both we and our upstream go out of date, and thus we don't get any alerts from the healthcheck.
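
A sketch of the multi-upstream check, assuming each upstream publishes an epoch timestamp in a last_updated file (the URLs are placeholders):

#!/bin/bash
# keep the freshest timestamp seen across several upstreams, so one stale
# upstream can't make us think we're up to date
set -euo pipefail

newest=0
for upstream in \
    https://mirror1.example.org/distro/last_updated \
    https://mirror2.example.org/distro/last_updated; do
  ts=$(curl -fsS "$upstream" || echo 0)
  if [ "$ts" -gt "$newest" ]; then
    newest=$ts
  fi
done

echo "freshest upstream timestamp: $newest"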

We should also add Prometheus monitoring to our systemd units that perform the syncing. If any unit goes into the failed state we can proactively fix our sync scripts before getting the alert that we are days out of date. This will also help our "out-of-date" averages if we want to apply to be a tier 1 mirror.

Switching a desktop from staff-only to not staff-only doesn't change the wrong password message

We have a custom message for staff-only desktops that informs users who are denied entry that the desktop is staff-only. However, if a desktop was previously set to staff-only in puppet and then was set to not be staff-only, that message doesn't change.

The command that is supposed to recompile lightdm-greeter to include the new message is here, but since the subscribe points to the filename, lightdm-greeter will only be recompiled if the actual contents of the message template change, instead of when $staff_only changes as intended.

To fix this, you could write both the staff-only and not-staff-only greeter error messages to a single file on the desktop, and have the subscribe point to that file.
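
A rough puppet sketch of that approach; the paths, messages, and recompile command are all illustrative:

# render whichever message applies into one managed file...
file { '/etc/lightdm/greeter-error-message':
  content => $staff_only ? {
    true    => 'This desktop is staff-only.',
    default => 'Incorrect password.',
  },
}

# ...so the recompile reruns whenever $staff_only flips the file contents
exec { 'recompile-lightdm-greeter':
  command     => '/usr/local/bin/recompile-lightdm-greeter',
  refreshonly => true,
  subscribe   => File['/etc/lightdm/greeter-error-message'],
}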

Fix docker socket group override

We currently set the Docker socket group by adding a systemd override for docker.socket, but it seems that lately dockerd no longer uses systemd to manage its socket. Hence, our override needs to be updated.

The way to go is probably to override the docker.service command to pass the -G flag with the appropriate group. (Unless someone knows of a better way.)
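
For example, a drop-in at /etc/systemd/system/docker.service.d/override.conf along these lines; the group name is a guess, and the empty ExecStart= clears the packaged command before overriding it:

[Service]
ExecStart=
# same command the packaged unit runs, plus -G; check the real unit's
# ExecStart before copying this
ExecStart=/usr/bin/dockerd -G ocfdocker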

disaggregate postgres backups into per-database files

@ja5087 worked on getting postgres databases into our backups in #366 / #377, but we should move to having backups that are per-database instead of a single dump of the entire server's state. this is better for recovery and will help make our backups more efficient as well by taking advantage of rsnapshot's hardlinking features.
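
a sketch of the per-database loop, assuming a backups role with read access (the role name matches the one proposed in the ocf_backups issue below) and an illustrative output path:

#!/bin/bash
# dump each non-template database to its own plain-text file; dumps of
# unchanged databases should come out byte-identical, which is what lets
# rsnapshot hardlink them
set -euo pipefail

dbs=$(psql -U ocfpgbackups -d postgres -At -c \
  "SELECT datname FROM pg_database WHERE NOT datistemplate;")

for db in $dbs; do
  pg_dump -U ocfpgbackups "$db" > "/opt/backups/postgres/$db.sql"
done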

configure 802.3ad LACP on hypervisors and switch

last week I reorganized all the 10GbE DAC cables for the hypervisors, such that each hypervisor has both ports on the SolarFlare NIC plugged into the Arista 7050 switch, in consecutive interfaces. e.g. riptide is plugged into Ethernet 1 and Ethernet 2 on the switch, which, physically, are the leftmost top and bottom ports on the switch. Each hypervisor is organized this way, taking a top and bottom port, and @dongkyunc has helpfully added labels to the underside of the switch indicating which server each group corresponds to.

Now, we need to activate that second interface on each hypervisor, and preferably put both interfaces into an LACP channel group. This means that while both links are up, the virtual LACP bond interface will have an aggregate bandwidth of 20Gb/s, but if one of the links fails, the interface will not drop entirely but instead merely drop to 10Gb/s. This gives us an element of fault tolerance while doubling the existing bandwidth each server can utilize. Configuring 802.3ad will require configuration on both the hosts and the switch. @gpl and I have experimented with configuring the switch to support LACP while @cg505 and I have experimented with configuring the hosts to support bond0 as the LACP interface. Some work still needs to happen to get everything working, but I was able to configure the bond interface on dev-fallingrocks into active-passive mode before accidentally locking myself out of the machine when trying to configure the switch into LACP mode.

We will need to modify the configuration in https://github.com/ocf/puppet/blob/master/modules/ocf/manifests/networking.pp to bring up the bond interface correctly and to bridge VMs onto it via br0 as well. Doing it would likely make for an interesting blog post of sorts too, since much of the documentation online for doing this is rather out of date.
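
A sketch of the host side in interfaces(5) syntax, assuming the ifenslave package and the SolarFlare port names mentioned above; the addresses are placeholders, and the switch-side channel-group configuration is separate:

auto bond0
iface bond0 inet manual
    # 802.3ad is the LACP mode; miimon polls link state so a dead member
    # drops out of the aggregate instead of blackholing traffic
    bond-slaves ens5f0np0 ens5f1np1
    bond-mode 802.3ad
    bond-miimon 100

auto br0
iface br0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    # bridge the VMs onto the bond instead of a single physical port
    bridge_ports bond0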

Purge any files in /etc/sudoers.d (except the README)

We want any added root users to go through usudo in puppet instead of being added manually, so we should manage the /etc/sudoers.d directory and make sure that it is empty (except for the README file that is provided in it, since apparently it needs to have some file in the directory to work). An excerpt from the /etc/sudoers.d/README file:

# Note that there must be at least one file in the sudoers.d directory (this
# one will do), and all files in this directory should be mode 0440.
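
A minimal puppet sketch: purge plus recurse deletes anything unmanaged, while the README survives because it's a managed file (mode per the excerpt above):

file { '/etc/sudoers.d':
  ensure  => directory,
  recurse => true,
  purge   => true,
}

file { '/etc/sudoers.d/README':
  ensure => file,
  mode   => '0440',
}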

lets-encrypt-update silently failing

for some reason the lets-encrypt-update script we use to acquire new and updated certs for death has been silently failing to run for some time (rt#7937, rt#7901, etc.)

there's nothing relevant in the logs, as far as I can tell, and we aren't getting any cron emails from failures. Perhaps we could turn up the verbosity / have it report everything for a bit to see if we can trigger a failure?
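
one low-effort way to do that, assuming the script runs from cron at the path below, is a wrapper that ships every line of output plus the exit status to syslog:

{
  /usr/local/bin/lets-encrypt-update
  echo "lets-encrypt-update exited $?"
} 2>&1 | logger -t lets-encrypt-update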

Printing improvements

  • Printing 2 copies of a double-sided document with an odd number of pages causes the second copy to start on the back side of the last page of the first copy. Suggested solution: tell CUPS not to duplicate pages and figure out how to add blank pages and copies manually in GhostScript (a sketch follows this list).
  • Some .docx files converted to PDF by Google Docs cause PostScript errors on the printer. So far they've printed fine after passing through pdf2ps first.
  • Figure out a way to print non-letter-size jobs. Sample PDF.
  • Implement an alerting system for desktops, including notifications for jobs received by the printer, completed jobs, and pages remaining. I think @jvperrin is making some progress on this using dbus, but alternative solutions would be welcome.
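
For the first item, a sketch of padding a job to an even page count before duplexing, assuming poppler-utils for the page count and a pre-made single blank page in blank.pdf:

# append a blank page when the count is odd, so the next copy starts on
# a fresh sheet
pages=$(pdfinfo in.pdf | awk '/^Pages:/ {print $2}')
if [ $((pages % 2)) -eq 1 ]; then
  gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
    -sOutputFile=padded.pdf in.pdf blank.pdf
else
  cp in.pdf padded.pdf
fi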

add vsftpd and rsyncd transfer stats to mirrors statistics

/var/log/vsftpd.log contains statistics on things downloaded from mirrors over FTP, and syslog contains log lines from rsyncd ('rsync on' and 'total size' are useful search keys). We might be able to extract this information and add it to the mirrors statistics.
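
a rough sketch of pulling numbers out of those logs; the patterns are approximations and would need checking against the real log formats:

# count of FTP downloads logged by vsftpd
grep -c 'OK DOWNLOAD' /var/log/vsftpd.log

# sum the 'total size' figures from rsyncd's syslog lines
grep 'rsyncd' /var/log/syslog |
  grep -o 'total size [0-9]*' |
  awk '{sum += $3} END {print sum}'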

back up VM XML

we should start dumping the VM XML files and including them in backups if possible. We can have a script that logs in to the hypervisors as ocfbackups, parses the output of virsh list --all (or some more CLI-friendly variant), and then runs something like ssh -K ocfbackups@$hypervisor virsh dumpxml $domain > $hypervisor-$domain.xml for each domain.
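
a sketch of that loop; the hypervisor list and the output directory are placeholders, and virsh list --all --name is the CLI-friendly variant that prints bare domain names:

#!/bin/bash
set -euo pipefail

for hypervisor in hypervisor1 hypervisor2; do
  for domain in $(ssh -K "ocfbackups@$hypervisor" virsh list --all --name); do
    ssh -K "ocfbackups@$hypervisor" virsh dumpxml "$domain" \
      > "/opt/backups/vm-xml/$hypervisor-$domain.xml"
  done
done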

separate out firewall syslog and more aggressively compress/logrotate

logs coming from the bsecure firewall are massive (at our scale). Even after disabling logging from mirrors and death, it's still upwards of 2 GB a day.

I have some provisional rsyslog and logrotate conf changes that try to redirect firewall log entries from /var/log/remote/ to /var/log/external_firewall/<rule name>/<rule name>.log and then compress the living hell out of them to save space. Luckily they compress well, to ~10% of original size.
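
the logrotate side could look something like this (retention and the xz level are guesses):

/var/log/external_firewall/*/*.log {
    daily
    rotate 30
    missingok
    compress
    compresscmd /usr/bin/xz
    compressext .xz
    compressoptions -9
}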

/etc/sudoers is broken if staffvm and no owner

The /etc/sudoers file becomes broken if the hiera file to add an owner is not added quickly enough, since it contains the following line (notice the single space at the front):

 ALL=(ALL) NOPASSWD: ALL

This appears to be because the hiera lookup for an owner returns an empty string in the usudo array, leading to a line being added by the /etc/sudoers template, but without an actual user. Probably we should just filter out empty or invalid users in the auth.pp manifest?
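
The filter could be as small as this (variable and structure per the issue; exact names are assumptions):

# drop empty entries before they reach the sudoers template
$usudo_users = $usudo.filter |$user| { !empty($user) }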

Get http://rt/ to work from inside the lab

Right now that URL redirects to fluffy (instead of RT as it used to) because the nginx running on marathon-lb doesn't understand vhost aliases or something like that. We should fix that.

don't try to guess the correct interface in ocf::networking

currently we do

$br_iface = grep($ifaces_array, 'en.+')[0]

this only works if we have a single active interface connected to our machines, and if the cable is plugged into the first port. This is not a safe assumption now that we have 10GbE cards in our servers. For example, this code picks enp4s0f0 as the 'active' interface on corruption, as this is the first of the interfaces built into corruption's motherboard, but the new SolarFlare cards we have live at ens5, and the correct interface we should be writing to /etc/network/interfaces is ens5f1np1.

Perhaps a custom fact like

#!/bin/bash
# external fact: emit the ethernet interfaces that actually have link,
# in facter's key=value format (the fact name is illustrative)
set -euo pipefail
shopt -s nullglob

active=''
for i in /sys/class/net/en*; do
  iface=$(basename "$i")
  if ethtool "$iface" | grep -q 'Link detected: yes'; then
    active="${active:+$active,}$iface"
  fi
done

echo "active_ifaces=$active"

would be better for selecting the correct interface(s) that are actually active and plugged into the machines.

ocfbroker group exists on the desktops

seems like this shouldn't be the case:

abizer@blackout:~$ check abizer
abizer:*:35934:1000:Abizer Lokhandwala:/home/a/ab/abizer:/bin/zsh
Created on: 2015-10-03
Member of group(s): ocfbroker ocfstaff ocfroot ocfdev ocfofficers

Upgrade remaining machines from jessie to stretch

2018-10-23 03:13:07 -0700 //death.ocf.berkeley.edu/Facter (err): error while processing "/opt/puppetlabs/puppet/cache/facts.d/iface-linked" for external facts: child process returned non-zero exit status (1).

File in question is here: puppet/modules/ocf/facts.d/iface-linked on line 8

death, tsunami, and werewolves are the remaining machines currently on jessie. We should tie up the loose ends here before transitioning to buster.

move desktop files in Xsession

Currently the .desktop files in ~/Desktop are written from the Xsession file directly. It would make a lot more sense to just put them in /etc/skel/Desktop as actual files.

add postgres to ocf_backups

I've done some experimental work on creating an ocfpgbackups postgres role with GRANT SELECT on all databases for backups, but the actual implementation could probably be done by a new staff member to cut their teeth on working on the infra. We'd want to add a rule to rsnapshot.conf in the ocf_backups module, probably using pg_dump or pg_dumpall.
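
The rsnapshot side could then be a single backup_script line pointing at a wrapper that runs the dumps (fields are tab-separated; the script path is hypothetical):

backup_script	/usr/local/bin/ocf-postgres-dump	postgres/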

Move off `validate_re`

The validate_re function is deprecated in recent versions of stdlib, and should be replaced by using abstract data types.
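
For example, a regex check on a parameter becomes a type on the parameter itself:

# before: deprecated stdlib function inside the class body
validate_re($ensure, '^(present|absent)$')

# after: an abstract data type enforced at the parameter declaration
Enum['present', 'absent'] $ensure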
