ocf / puppet
Puppet config for OCF servers and lab machines
Home Page: https://www.ocf.berkeley.edu/
When a leader election occurs, the DNS name leader.mesos changes. Unfortunately, nginx only resolves the name once, at startup, so it continues to proxy to a master that is no longer leading (which then issues a redirect to the new leader, but with the internal, non-proxied hostname and port, so nothing works).
It seems the commercial version of nginx supports re-resolving (boo), but there are some tricks online to get the open-source version to re-resolve too.
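One such trick, as a hedged sketch: give nginx a resolver and put the upstream name in a variable. When the proxy_pass target contains a variable, nginx skips its start-time resolution and re-resolves at runtime, honoring the valid= interval. The resolver address and master port here are assumptions:

```nginx
# assumes a local DNS resolver (e.g. Mesos-DNS) on 127.0.0.1
# and Mesos masters serving on :5050
resolver 127.0.0.1 valid=10s;

server {
    listen 443 ssl;
    location / {
        # using a variable forces nginx to re-resolve leader.mesos at runtime
        set $leader leader.mesos;
        proxy_pass http://$leader:5050;
    }
}
```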
there aren't that many high-quality sks keyserver keydumps available for starting new servers from. most of the existing ones are low-bandwidth, slow, or unreliable at best, so I think it would be valuable for us to offer a keydump. We can put it on mirrors and automatically get http/s, rsync, and ftp as well, at a cost of just ~12GB of disk space.
Here is my tentative proposal for how to do it, but I'm not sure it's a good or tractable design.
some constraints: the built-in dump command (sks dump $count $output_directory $output_prefix) is a little too smart - it tries to split the dump into separate files (each with $count keys) and write them to the specified output directory itself, rather than dumping to a single file / stdout and letting us split/compress it ourselves afterwards.
my proposal would be something like the following:
write a sync-archive-style script in ocf_mirrors that uses a keytab like ocfbackups to log into the keyserver and:
rm -r /var/lib/sks/dump/*
systemctl stop sks-keyserver
docker run --rm -it -w /var/lib/sks -v /var/lib/sks:/var/lib/sks zhusj/sks sks dump 20000 /var/lib/sks/dump sks-dump
systemctl start sks-keyserver
rsync -azH --delete something@pgp:/var/lib/sks/dump /opt/mirrors/ftp/sks-keydump/
what do y'all think?
As suggested by @chriskuehl, have notify use ocflib functions rather than parsing the output of paper.
Note: This is also a good starter issue if anyone's interested.
box.com is deprecating DAV, see https://community.box.com/t5/Box-Product-News/Deprecation-WebDAV-Support/ba-p/55684
rclone has good support for this, but it could be difficult to puppet, since the version currently packaged in Debian doesn't support box.com. We'll have to figure out how to get the latest version.
thanks to @jvperrin we're now an all-Stretch shop, minus the temporary old-death resurrection. We can now kill any conditionals targeting Jessie in the codebase, and start replacing them with conditionals targeting buster, lol.
@jvperrin mentioned this some time ago but I thought I'd make a ticket for it
Migrated from ocf/ocflib#7
Box.com has announced that they are deprecating WebDAV support on January 31st, 2019. Indiana University has a nice page on alternative ways to connect to Box.com. I'm thinking that connecting to box over the FTPS protocol will be the best replacement of those listed.
One problem right now with the Let's Encrypt changes (in #337): if new domains are added in LDAP, they likely won't be in DNS yet, so the request for a new cert will fail and won't be retried, since the cert is only refreshed once.
If the domains requested don't match the cert, or if the previous cert request failed, we should retry the request; giving up isn't a good option. I'm not sure what the best option is here. One quick and dirty trick that comes to mind is a lockfile of sorts: if the cert fails to be obtained, a future run can check whether the file exists and schedule another attempt. I'd hope there's something cleaner than this, maybe a custom fact that shows the domains in each cert by parsing openssl x509 -in cert.pem -text output?
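A hedged sketch of what that fact could look like: a helper that pulls the SAN names out of a cert with openssl, so a puppet run could compare them against the LDAP vhost list and schedule a retry on mismatch. The function name and cert path are made up for illustration:

```shell
# hypothetical helper for an external fact: print the SAN domains of a
# cert, comma-separated, so puppet can diff them against LDAP
le_domains() {
    local cert="$1"
    # silently report nothing if the cert doesn't exist yet
    [ -r "$cert" ] || return 0
    openssl x509 -in "$cert" -noout -text \
        | grep -o 'DNS:[^, ]*' | cut -d: -f2 | paste -sd,
}
```

e.g. `le_domains /services/http/ssl/vhost.crt` (path hypothetical) would print something like `foo.berkeley.edu,bar.berkeley.edu`.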
i just had my ears blasted out
It's supposed to be adjusted here, but volume is still set to max upon login.
among other things, ocf-netboot starts by deleting everything in /opt/tftpd, but it can't recover those things if mirrors are down.
the preseed file we include for debconf (https://github.com/ocf/puppet/blob/master/modules/ocf_mysql/manifests/init.pp#L6) has two problems. First, we're now on mariadb-server-10.1, so the package name for debconf needs to be updated. More importantly, the mariadb-server-10.1 package templates don't appear to support the mysql-server/root_password and root_password_again debconf options at all:
root@e5a1d73f257e:/# sudo debconf-show mariadb-server-10.1
mariadb-server-10.1/nis_warning:
mariadb-server-10.1/old_data_directory_saved:
mariadb-server-10.1/postrm_remove_databases: false
unless I'm interpreting this incorrectly, we'll need to change the way we preseed the root password and do initial grants.
We should add multiple upstreams to healthcheck against and take the max of the last_updated timestamp (or the analogue for other distros). This will prevent the situation where both we and our upstream mirror go out of date and we don't get any alerts from the healthcheck.
We should also add Prometheus monitoring to the systemd units that perform the syncing. If any unit goes into the failed state, we can proactively fix our sync scripts before getting the alert that we are days out of date. This will help our "out-of-date" averages if we want to apply to be a tier 1 mirror.
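For the systemd part, a sketch of an alert rule, assuming we run node_exporter with its systemd collector enabled (the alert name, unit-name pattern, and hold time are placeholders):

```yaml
groups:
  - name: mirrors
    rules:
      - alert: MirrorSyncUnitFailed
        # node_systemd_unit_state comes from node_exporter's systemd collector
        expr: node_systemd_unit_state{state="failed", name=~".*sync.*"} == 1
        for: 15m
        annotations:
          summary: "{{ $labels.name }} has failed on {{ $labels.instance }}"
```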
The check made for the ocf_nfs fact doesn't work on stretch, since the mount output there has one slash instead of two:
supernova (stretch):
$ mount | grep home
services:/home on /home type nfs4 [...]
tsunami (jessie):
$ mount | grep home
services://home on /home type nfs4 [...]
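A hedged sketch of a check that tolerates both forms (the function name is made up; the real fact presumably greps mount output similarly):

```shell
# reads mount(8) output on stdin; succeeds if /home is NFS-mounted,
# whether the source is written services:/home or services://home
is_nfs_home() {
    grep -Eq '^[^[:space:]]+://?home on /home type nfs'
}
```

e.g. `mount | is_nfs_home` would now succeed on both jessie and stretch.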
We have a custom message for staff-only desktops that informs users when they're denied entry that it is staff-only. However, if a desktop was previously set to staff-only in puppet and then was set to not be staff-only, that message doesn't change.
The command that is supposed to recompile lightdm-greeter to include the new message is here, but since the subscribe points to the filename, lightdm-greeter will only be recompiled if the actual contents of the message template change, rather than when $staff_only changes as intended.
To fix this, you could write both the staff-only and non-staff-only greeter error messages to a single file on the desktop, and have the subscribe point to that file.
We currently set the Docker socket group by adding a systemd override for docker.socket, but it seems that lately dockerd no longer uses systemd to manage its socket, so our override needs to be updated. The way to go is probably to override the docker.service command to pass the -G flag with the appropriate group. (Unless someone knows of a better way.)
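A minimal sketch of that drop-in, assuming the stock Debian ExecStart and that the desired group is docker:

```ini
# /etc/systemd/system/docker.service.d/docker-group.conf
[Service]
# clear the packaged ExecStart, then restate it with -G added
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -G docker
```

The empty ExecStart= line is required: for Type=oneshot-unlike services, systemd refuses a second ExecStart unless the first is cleared.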
@ja5087 worked on getting postgres databases into our backups in #366 / #377, but we should move to having backups that are per-database instead of a single dump of the entire server's state. this is better for recovery and will help make our backups more efficient as well by taking advantage of rsnapshot's hardlinking features.
To the browser_homepage value in common.yaml
last week I reorganized all the 10GbE DAC cables for the hypervisors, such that each hypervisor has both ports on the SolarFlare NIC plugged into the Arista 7050 switch, in consecutive interfaces. e.g. riptide is plugged into Ethernet 1 and Ethernet 2 on the switch, which, physically, are the leftmost top and bottom ports. Each hypervisor is organized this way, taking a top and bottom port, and @dongkyunc has helpfully added labels to the underside of the switch indicating which server each group corresponds to.
Now, we need to activate that second interface on each hypervisor, and preferably put both interfaces into an LACP channel group. While both links are up, the virtual LACP bond interface will have an aggregate bandwidth of 20Gb/s, but if one of the links fails, the interface will not drop entirely, merely fall back to 10Gb/s. This gives us fault tolerance while doubling the bandwidth each server can use. Configuring 802.3ad requires changes on both the hosts and the switch. @gpl and I have experimented with configuring the switch to support LACP, while @cg505 and I have experimented with configuring the hosts to use bond0 as the LACP interface. Some work still needs to happen to get everything working, but I was able to configure the bond interface on dev-fallingrocks into active-passive mode before accidentally locking myself out of the machine while trying to configure the switch into LACP mode.
We will need to modify the configuration in https://github.com/ocf/puppet/blob/master/modules/ocf/manifests/networking.pp to bring up the bond interface correctly and bridge it to br0 for VMs as well. Doing so would likely make for an interesting blog post, since much of the documentation online for this is rather out of date.
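For the host side, a sketch of the /etc/network/interfaces stanza (ifenslave-style). Only ens5f1np1 is taken from above; the first slave name and the bridge addressing are assumptions:

```
auto bond0
iface bond0 inet manual
    # slave names are guesses; check `ip link` on each hypervisor
    bond-slaves ens5f0np0 ens5f1np1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4

auto br0
iface br0 inet dhcp
    bridge_ports bond0
```

The switch side needs a matching port-channel in LACP active mode on both member interfaces.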
We want any added root users to go through usudo in puppet instead of being added manually, so we should manage the /etc/sudoers.d directory and make sure it is empty (except for the README file provided in it, since apparently the directory needs to contain at least one file to work). An excerpt from the /etc/sudoers.d/README file:
# Note that there must be at least one file in the sudoers.d directory (this
# one will do), and all files in this directory should be mode 0440.
I don't want to type "walpurgisnacht" ever again
currently it forks and systemd loses track of it
for some reason the lets-encrypt-update script we use to acquire new and updated certs for death has been silently failing to run for some time (rt#7937, rt#7901, etc.)
there's nothing relevant in the logs, as far as I can tell, and we aren't getting any cron emails from failures, so perhaps there's some way for us to turn up the verbosity / have it report everything for a bit to see if we can trigger a failure?
These support HTTPS both for messages between each other and for the HTTP APIs/UIs.
This is a nice-to-have, since the public endpoints that staff use are already HTTPS. The only unencrypted communication is over our own network.
It seems like apt.dockerproject.org hasn't updated since May 5 even though it was getting monthly updates before. We might consider switching to download.docker.com, which is apparently the official source now.
pdf2ps first.
/var/log/vsftpd.log contains statistics on things downloaded from mirrors over FTP, and syslog contains log lines from rsyncd, with rsync on and `total size' as search keys. we might be able to extract this information and add it to the mirrors statistics.
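A hedged sketch of the rsync half of the extraction, reading syslog on stdin and assuming rsyncd's standard "sent X bytes received Y bytes total size Z" summary lines:

```shell
# sums the `total size' field across rsyncd log lines on stdin
mirror_rsync_bytes() {
    awk '/total size/ {
        # find the word "size" and add the number that follows it
        for (i = 1; i <= NF; i++)
            if ($i == "size") sum += $(i + 1)
    } END { print sum + 0 }'
}
```

e.g. `grep rsyncd /var/log/syslog | mirror_rsync_bytes` would give total bytes served over rsync for the day.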
we now have servers that can boot over efi, like jaws, riptide, corruption, and dataloss
the firewall was replaced by a Palo Alto 5260 (bsecure) and the switch was replaced with an Arista 7050SX, so we should probably either fix rancid to work with the Arista device or get rid of it entirely since we don't really need it anymore
we should start dumping the VM XML files and including them in backups if possible. we can have a script that logs in to the hypervisors as ocfbackups, parses the output of virsh list --all or some more cli-friendly thing, and then runs something like ssh -K ocfbackups@$hypervisor virsh dumpxml $domain > $hypervisor-$domain.xml
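A sketch of that script as a function. virsh list --all --name is the more cli-friendly form (one bare domain name per line); the optional runner argument exists only so the logic can be exercised without real hypervisors, and the default assumes the ocfbackups keytab login works:

```shell
# dump_vm_xml HYPERVISOR OUTDIR [RUNNER]
# RUNNER defaults to kerberized ssh as ocfbackups (left unquoted below
# on purpose so it word-splits into a command plus its flags)
dump_vm_xml() {
    local hv=$1 outdir=$2
    local run=${3:-"ssh -K ocfbackups@$hv"}
    mkdir -p "$outdir"
    $run virsh list --all --name | while read -r domain; do
        [ -n "$domain" ] || continue
        $run virsh dumpxml "$domain" > "$outdir/$hv-$domain.xml"
    done
}
```

A wrapper would then loop over the hypervisor list and call this once per host before rsnapshot picks up the output directory.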
we can probably use a lot of the logic from pam_mkhomedir on desktops
logs coming from the bsecure firewall are massive (at our scale); even after disabling logging from mirrors and death, it's still around 2GB a day.
I have some provisional rsyslog and logrotate conf changes that try to redirect firewall log entries from /var/log/remote/ to /var/log/external_firewall/<rule name>/<rule name>.log and then compress the living hell out of them to save space. Luckily they compress well, to ~10% of original size.
The /etc/sudoers file becomes broken if the hiera file to add an owner is not added quickly enough, since it ends up containing the following line (notice the single space at the front):
 ALL=(ALL) NOPASSWD: ALL
This appears to be because the hiera lookup for an owner returns an empty string in the usudo array, leading to a line being added to the /etc/sudoers template, but not one with an actual user. We should probably just filter out empty or invalid users in the auth.pp manifest?
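A hedged sketch of that filter in puppet; the variable name and the username pattern are guesses at what auth.pp actually uses:

```puppet
# drop empty or malformed entries before they reach the sudoers template,
# so a missing hiera value can't produce a bare ` ALL=(ALL) ...` line
$usudo_users = $raw_usudo.filter |$user| {
  $user =~ /\A[a-z][a-z0-9]*\z/
}
```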
https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security
probably also includeSubDomains and preload.
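In nginx terms this is one line; the one-year max-age here is the common choice, an assumption rather than a decided value:

```nginx
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" always;
```

Note that preload is effectively irreversible once submitted to the browser preload lists, so it's worth being sure all subdomains serve HTTPS first.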
As suggested by @chriskuehl in #193, it would be much nicer to have something to identify dev hosts rather than checking for dev- at the start of the hostname every time.
accounts probably should not have both of these specified for long periods of time. It may make sense for us to periodically confirm that no associations have been left by accident.
i stupidly put some kubernetes files in /etc/kubernetes. this is managed by kubeadm... so far it's been fine, but it's better to take these files out of there.
Right now that URL redirects to fluffy (instead of RT as it used to) because the nginx running on marathon-lb doesn't understand vhost aliases or something like that. We should fix that.
currently we do
this only works if we have a single active interface connected to our machines, and if the cable is plugged into the first port. This is not a safe assumption now that we have 10GbE cards in our servers. For example, this code picks enp4s0f0 as the 'active' interface on corruption, as it is the first of the interfaces built into corruption's motherboard, but the new SolarFlare cards live at ens5, and the correct interface we should be writing to /etc/network/interfaces is ens5f1np1.
Perhaps a custom fact like
#!/bin/bash
set -euo pipefail
for i in /sys/class/net/en*; do
    iface=$(basename "$i")
    if ethtool "$iface" | grep -q "Link detected: yes"; then
        echo "$iface"
    fi
done
would be better for selecting the correct interface(s) that are actually active and plugged into the machines.
This has something to do with tagmail config, according to Dara.
seems like this shouldn't be the case:
abizer@blackout:~$ check abizer
abizer:*:35934:1000:Abizer Lokhandwala:/home/a/ab/abizer:/bin/zsh
Created on: 2015-10-03
Member of group(s): ocfbroker ocfstaff ocfroot ocfdev ocfofficers
2018-10-23 03:13:07 -0700 //death.ocf.berkeley.edu/Facter (err): error while processing "/opt/puppetlabs/puppet/cache/facts.d/iface-linked" for external facts: child process returned non-zero exit status (1).
File in question is here: puppet/modules/ocf/facts.d/iface-linked on line 8
death, tsunami, and werewolves are the remaining machines currently on jessie. We should tie up the loose ends here before transitioning to buster.
Currently the .desktop files in ~/Desktop are written from the Xsession file directly. It would make a lot more sense to just put them in /etc/skel/Desktop as actual files.
I've done some experimental work on creating an ocfpgbackups postgres role with GRANT SELECT on all databases for backups, but the actual implementation could probably be done by a new staff member to cut their teeth working on the infra. We'd want to add a rule to rsnapshot.conf in the ocf_backups module, probably using pg_dump or pg_dumpall.
The validate_re function is deprecated in recent versions of stdlib and should be replaced by abstract data types.
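For example (the class and parameter names are illustrative, not from our manifests):

```puppet
# before, with the deprecated stdlib function:
validate_re($mode, '^(enabled|disabled)$')

# after: express the same constraint as a data type on the parameter,
# so puppet validates it at catalog compile time
class ocf_example (
  Enum['enabled', 'disabled'] $mode = 'enabled',
) { }
```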