Comments (16)
Ideally the CI server that does the auditing should run on infrastructure that is independent of the infrastructure that builds the image. For instance, we could use bitbucket + circle ci for the auditing CI service while we keep using github + travis ci to build the images.
from manylinux.
A few thoughts:
I'm not seeing how this hashing idea would work -- we don't have deterministic builds, so two identical builds of the same docker image will generally have different hashes for most binaries, due to things like embedded timestamps. (Deterministic builds are really hard -- ref 1, ref 2. Not really worth trying given the archaic toolchains we're stuck with, IMO.) Maybe I'm not understanding exactly what's being proposed?
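To make the timestamp point concrete, here is a minimal sketch. It uses a gzip stream as a stand-in for a build artifact, which is an assumption for illustration only (the real binaries are not gzip files), and the payload and timestamps are made up. The gzip header carries a modification time, so the same payload compressed at two different moments hashes differently, while pinning the timestamp makes the hashes agree.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the output of a build step (hypothetical content).
payload = b"identical build output\n"

# Two "builds" of the same payload, stamped one minute apart.
build_a = gzip.compress(payload, mtime=1_000_000_000)
build_b = gzip.compress(payload, mtime=1_000_000_060)

# The same payload with the embedded timestamp pinned to a fixed value.
pinned_a = gzip.compress(payload, mtime=0)
pinned_b = gzip.compress(payload, mtime=0)

print(sha256_hex(build_a) == sha256_hex(build_b))    # False
print(sha256_hex(pinned_a) == sha256_hex(pinned_b))  # True
```

Both archives decompress to the same bytes; only the embedded metadata differs, which is enough to change the digest.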
I'm also not sure what we should worry about exactly (or in security jargon, "what's our threat model"). Empirically, compromises of software distribution sites are very rare (not sure why), and practically speaking it doesn't make sense for us to worry about being more secure than, say, pip or pypi. We definitely should have the conversation about security, because it's a bit of a kick to realize how large the exposure is in things like this, but I think it's better to start by thinking about what specifically are the worst risks and what kind of practical things we can do to mitigate them.
Here's a quick-and-dirty attempt to enumerate our trusted base (i.e., list of things where compromising them would let someone trojan all manylinux binaries):
- github
- and in particular, all of our github accounts (anyone with write access to the repo can push changes directly or extract the quay.io credentials, probably without anyone noticing for a long time -- currently this is @njsmith @ogrisel @rmcgibbo @matthew-brett @dstufft and anyone else who has admin-level access to the pypa org, which ironically I don't have permissions to check. Security!)
- quay.io
- and in particular @njsmith and @rmcgibbo's quay.io accounts
- plus the special deploy key we use on travis
- this stuff is particularly tricky because if you ever docker login from your laptop then now your laptop has a password-equivalent access token stored in plaintext on disk, when encrypting a deploy key for travis it's easy to accidentally leave behind a plaintext version on your laptop, etc.
- dockerhub (for the base image)
- travis-ci
- python.org
- everyone in the pip/wheel trust chain (we fetch and run https://bootstrap.pypa.org/get-pip.py, then pip install wheel)
- including the relevant pypi and github accounts
- everyone in the auditwheel trust chain (we do pip install auditwheel)
- including the relevant pypi and github accounts
- the CA/TLS infrastructure involved in connecting to all of the above
A lot of this stuff, for me, falls into "not worth worrying about". All else being equal it'd be nice if this list were shorter, but realistically if someone compromises github or quay.io or dockerhub or travis-ci or python.org or pip or pypi or the global certificate authority infrastructure, then really the manylinux docker images are the least of our concerns.
The things that jump out at me as perhaps worth worrying about are:
- The github and quay.io credentials for our various accounts
- the auditwheel trust chain (though ATM this appears to be basically the above list + @rmcgibbo's pypi credentials)
- being mindful about trying to minimize adding new items to the above list :-). (it's noticeably shorter than it would have been before the extra sha-256 checks were added in #44)
Regarding github: enabling 2FA is probably a good idea (I just did :-)), but hardly sufficient -- I know for me, if someone got access to my laptop or phone then they could cause all kinds of havoc with my logged-in browsers and ssh keys. In particular they could silently push changes directly to the master branch of projects like this one. Not sure what to do about this :-(. Really what I want is a way to set it up so that accounts with "write" access have the ability to click the green merge button but not to push directly -- this way someone who stole my credentials could post a PR and then immediately merge it, which is still a risk, but it would be very obvious (lots of notifications sent out etc.), so someone would notice. AFAIK though GH doesn't have any way to do this -- if you can merge, you can also do secretive pushes. Maybe it's possible to do something with the protected branch feature? (Though then you'd still have the problem of a compromised account being used to secretly turn off branch protection... unless it sends a notification when that happens? I haven't checked.)
Regarding quay.io: it turned out I had a stray credential stored in /root/.docker/config.json, which I deleted... but in general this is rather annoying -- they don't even offer 2FA. Fortunately, unlike github, I basically never actually need to log into the site now that things are set up, so I guess I'll make sure that the only copies of the password are stored securely (e.g. not in my browser password store), and also disable their github-based login mechanism, and then make sure that I stay logged out on my browser... ugh.
Maybe we should put up a little wiki page or a note in the README about this? (notes on securing accounts that get access, notes on reviewing changes for their effect on the trusted base)
Trying to think of folks to CC who have security background and are interested in the manylinux stuff... maybe @dstufft @alex?
Actually, I missed a piece in the list above: it looks like there's a bit of a mess around the CentOS version of the devtools 2 release. Apparently the toolchain that everyone's using to build generic linux binaries for distribution (not just us, but also the super-popular holy build box, and probably others as well) is a bunch of unsigned RPMs fetched over insecure-http from someone's personal account at people.centos.org.
(Notice in the readme: "Known issues: (0) unsigned packages.")
AFAICT though this is currently the only available version of this toolchain that doesn't require a RH subscription.
This is kinda suboptimal from an internet public health standpoint. Maybe someone at Redhat can/should take an interest? @ncoghlan might know who to ping?
Ouch - I'd forgotten that one of the downsides of using CentOS 5 as the baseline was not being able to use the softwarecollections.org infrastructure (since that only supports CentOS 6+).
@lhawthorn, @kbsingh, any ideas? Context is https://www.python.org/dev/peps/pep-0513/ which relies on CentOS 5 and Developer Toolset 2 as a "lowest common denominator" build environment for cross-distro Linux binaries.
A possible alternative approach would be to use CERN's devtoolset 2 binaries for Scientific Linux rather than the people.centos.org ones: http://linuxsoft.cern.ch/cern/devtoolset/slc5-devtoolset.repo
More info on CERN's setup: http://linux.web.cern.ch/linux/devtoolset/#install
Tru's stack should be the best devtools-2 setup for now; I can work with him to make sure it's revalidated and put onto the mirror/cdn instead.
However, it's worth keeping in mind that EL5 overall is now well into its wind-down days and we are working with folks still running it to move off (EOL date is Q1 2017).
W.r.t. the SLC devtools-2, that was also only ever a test release, never meant to go into prod for any role, and was never maintained.
Within those 2 limitations, if you feel it's still a route worth adopting, I'll work with Tru and get the devtools-2 stack in a better home, revalidated and signed.
@ncoghlan: Interesting, I failed to find that. Looking at http://linuxsoft.cern.ch/cern/devtoolset/slc5-devtoolset.repo , it looks like they probably do provide signed packages, so if we can figure out how to use them + load the relevant key + make sure that rpm is configured to reject unsigned packages, then it would close the main threat vectors. Unfortunately I am a Debian guy and have no idea how to do that :-)
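One piece of that can be checked mechanically. yum repo files are INI-style, and a repo section with gpgcheck=1 tells yum to verify package signatures, with gpgkey naming the key to trust. The sketch below flags sections that do not explicitly enable it. The repo names, URLs and key path are hypothetical placeholders, and treating a missing gpgcheck as "not enforced" is a conservative assumption of this sketch (a global setting in yum.conf can also apply).

```python
import configparser


def unsigned_repos(repo_text: str) -> list[str]:
    """Return names of repo sections that do not explicitly enforce signatures."""
    # interpolation=None so that '%' in URLs is taken literally.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(repo_text)
    flagged = []
    for name in parser.sections():
        section = parser[name]
        checks_enabled = section.get("gpgcheck", "0").strip() == "1"
        has_key = bool(section.get("gpgkey", "").strip())
        if not (checks_enabled and has_key):
            flagged.append(name)
    return flagged


# Hypothetical repo definitions, for illustration only.
EXAMPLE = """\
[devtoolset-example]
name=Hypothetical signed repo
baseurl=https://mirror.example.invalid/devtoolset/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-example

[unsigned-example]
name=Hypothetical unsigned repo
baseurl=http://mirror.example.invalid/unsigned/
gpgcheck=0
"""

print(unsigned_repos(EXAMPLE))  # ['unsigned-example']
```

This only inspects configuration; it does not verify any package, and it says nothing about whether the key itself was obtained over a trustworthy channel.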
@kbsingh: I'm hoping manylinux2 will be able to use CentOS 6 + devtoolset-3 from softwarecollections.org as a cross-distro binary baseline, but at the moment EL5 is still too widespread in academia to ignore (plus it's the established baseline that folks like Enthought, Continuum Analytics and Phusion have demonstrated works well in practice)
@kbsingh: oh, thanks for the update. never mind about the SLC devtools then :-)
I know EL5 is running down, but unfortunately it's still the de facto baseline that everyone seems to be using for "I need to build a binary that will run on ~all systems". (Fortunately this doesn't require actually using it for anything besides running make and then copying the binaries off to an actually useful system, but...) Hopefully we'll get to move off it next year after it goes off support, but it's one of those things where we'd rather not be the first to try... in particular in Python-land we have an actual spec mandating its use for all binaries that are allowed onto the main distribution channel. ...Basically what @ncoghlan just said. I see I am slow at typing today :-).
So, if there's a reasonable way to make the devtools-2 available in a more robust way, that would be much appreciated.
(Worst case it would probably be fine to just provide a tarball that could be dumped into /opt/rh somewhere along with its hash... we all know that there will be no more devtools-2 releases :-))
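The tarball-plus-hash fallback could look roughly like this: refuse to unpack unless the archive's SHA-256 matches a pinned value. The tarball here is built in memory as a stand-in and the member path is a made-up example; in practice the pinned digest would be published separately from the download.

```python
import hashlib
import hmac
import io
import tarfile


def verify_sha256(data: bytes, expected_hex: str) -> None:
    """Raise ValueError unless data hashes to the pinned SHA-256 digest."""
    actual = hashlib.sha256(data).hexdigest()
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise ValueError(f"SHA-256 mismatch: expected {expected_hex}, got {actual}")


def make_demo_tarball() -> bytes:
    """Build a tiny in-memory tarball standing in for the toolchain archive."""
    content = b"#!/bin/sh\necho stand-in toolchain\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="opt/rh/demo/bin/tool")  # hypothetical path
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


tarball = make_demo_tarball()
# Computed here only for the demo; a real digest is pinned out of band.
pinned = hashlib.sha256(tarball).hexdigest()

verify_sha256(tarball, pinned)  # passes silently
with tarfile.open(fileobj=io.BytesIO(tarball)) as tar:
    print(tar.getnames())  # ['opt/rh/demo/bin/tool']
```

Checking the digest before opening the archive means a tampered download is rejected without any of its contents being parsed.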
CC'ing @FooBarWidget too, since the HBB would probably also benefit from having a secure source for devtoolset-2, and they might have comments.
Thanks for CC'ing me. Yeah having devtools-2 available in a more robust/secure way would be great. I don't mind that CentOS 5 is being deprecated as long as existing stuff keeps working in the future.
Hi, I published the devtools for CentOS-5 because I am using it. At that time, there was little to no feedback or interest from the community, and CentOS-6 was getting most of the traction. As @kbsingh said, we can definitely work out a solution.
@truatpasteurdotfr: they're certainly very much appreciated! The python wheels built with this repo have already been downloaded ~120,000 times, and that number will probably go up by an order of magnitude within the next month or two as packages like numpy and scipy start publishing builds. I also know that both Continuum and Enthought have been using them for parts of their Python distributions, plus there are lots of folks using HBB for I have no idea what. So there's definitely lots of interest, it turns out -- I guess it just took a while :-). So thank you!
@njsmith about reproducible builds, it's not as bad as one might have thought:
The hashes for a locally built patchelf (and gcc from devtools):
[~/code/manylinux (master)]$ docker run --rm -ti ogrisel/manylinux bash
[root@eeabeb2b02d9 /]# sha256sum `which patchelf`
f251b57091fe8fa746f3f61ac4470529b60133cef4877cae1d32704b319c3929 /usr/local/bin/patchelf
[root@eeabeb2b02d9 /]# sha256sum `which gcc`
759df5b696dde0b7cc8ed272e98ccfd79daaa3d76fa74f95673caa3b24a28d9f /opt/rh/devtoolset-2/root/usr/bin/gcc
match with the image built by our CI:
(py35)0 [~]$ docker pull quay.io/pypa/manylinux1_x86_64
b01c2ad1-4619-4449-af85-2e16bd306064-n1: Pulling quay.io/pypa/manylinux1_x86_64:latest... : downloaded
(py35)0 [~]$ docker run -ti --rm quay.io/pypa/manylinux1_x86_64 bash
[root@5c36ff2d8ed6 /]# sha256sum `which patchelf`
f251b57091fe8fa746f3f61ac4470529b60133cef4877cae1d32704b319c3929 /usr/local/bin/patchelf
[root@5c36ff2d8ed6 /]# sha256sum `which gcc`
759df5b696dde0b7cc8ed272e98ccfd79daaa3d76fa74f95673caa3b24a28d9f /opt/rh/devtoolset-2/root/usr/bin/gcc
But the global hashes for the docker images themselves do not match:
(py35)0 [~]$ docker images | grep manylinux
ogrisel/manylinux latest sha256:f3a8b 7 minutes ago 1.74 GB
quay.io/pypa/manylinux1_x86_64 latest sha256:92b6f 34 hours ago 1.74 GB
probably build timestamps.
https://reproducible-builds.org/ is a very interesting resource though. In particular the tools they provide might be useful if we want to guarantee reproducible builds for manylinux1 images:
https://reproducible-builds.org/tools/
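Since the per-file hashes are more informative than the image-level one, the comparison above can be sketched as a manifest diff: hash every file under two trees and list the paths that differ. The two trees here are tiny stand-ins written to a temporary directory (hypothetical contents, not real image layers), and the function names are made up for this sketch.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_tree(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def diff_manifests(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """Return paths that are missing on one side or whose digests differ."""
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


with tempfile.TemporaryDirectory() as tmp:
    local, ci = Path(tmp, "local"), Path(tmp, "ci")
    # One file is byte-identical in both trees, the other embeds a build time.
    for root, stamped in ((local, b"built at 10:00"), (ci, b"built at 10:07")):
        (root / "bin").mkdir(parents=True)
        (root / "bin" / "patchelf").write_bytes(b"same bytes in both trees")
        (root / "bin" / "python").write_bytes(stamped)
    mismatches = diff_manifests(hash_tree(local), hash_tree(ci))

print(mismatches)  # ['bin/python']
```

A report like this narrows the audit to the files that actually vary between builds, which is where reproducibility work would have to focus.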
The hashes for the python binaries for instance do not match.
If/when you switch to CentOS 6, seriously consider just using devtoolset-4 if it is an option. It has a very recent C++ compiler with full C++14 support. Just something to think about.