Comments (13)
I dug into http://build.ros.org/job/Ibin_uT32__rail_grasp_collection__ubuntu_trusty_i386__binary/ since it's relatively quick to reproduce. Running it locally I cannot reproduce the memory exhaustion, and it looks like the maximum memory usage is about 5.5% of my 15GB of RAM, which is less than 1GB, so I don't know of any limitations at that level. Here are some plots of my system's memory usage just before, during, and after the critical object around 50% of the build.
I reproduced the build locally with:
mkdir /tmp/release_job
generate_release_script.py https://raw.githubusercontent.com/ros-infrastructure/ros_buildfarm_config/production/index.yaml indigo default rail_grasp_collection ubuntu trusty amd64 > /tmp/release_job/release_job_indigo_roscpp.sh
cd /tmp/release_job
sh release_job_indigo_roscpp.sh
Maybe this could be tried on a build executor to get the right kernel etc.
from buildfarm_deployment.
I had a couple of false starts, but I have confirmed that this issue is related to the kernel version bump rather than the Docker version bump, which means that the attempted cure for ros-infrastructure/ros_buildfarm#535 is worse than the original issue.
I've still been reproducing with a run of the full release script and haven't yet dug into the exact cause. It's not raw memory usage: running stress --vm 2 --vm-bytes 3.99G
works just fine, and stress --vm 2 --vm-bytes 4G fails with a different error message. So it's got to be something particular to how or what's being allocated by the build process. I wonder if there are glibc interface changes between the trusty libc and the 4.15 kernel?
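For reference, the essential behavior of `stress --vm` can be approximated with a short sketch (a hypothetical helper, not part of the build tooling): allocate a buffer and write one byte per page, so the kernel actually commits the memory rather than just reserving address space.

```python
def touch_pages(nbytes, page=4096):
    """Allocate a buffer and write one byte per page, forcing the
    kernel to commit physical memory (roughly what stress --vm does
    with its allocated region). Returns the number of pages touched."""
    buf = bytearray(nbytes)
    touched = 0
    for off in range(0, nbytes, page):
        buf[off] = 1
        touched += 1
    return touched

# 1 MiB at a 4 KiB page size is 256 pages.
print(touch_pages(1 << 20))  # → 256
```

Scaling `nbytes` up toward the 4G boundary observed above (in a 32-bit i386 container, where the address space itself tops out near 4G) would be one way to distinguish an address-space limit from a physical-memory limit.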
I've been trying to whittle this down to a minimal case. I've been using the rail_grasp_collection package as a test sample because it had a short successful build time according to Jenkins. The failure comes when make is running
/usr/bin/i686-linux-gnu-g++ -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"rail_grasp_collection\" -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -DNDEBUG -D_FORTIFY_SOURCE=2 -I/tmp/binarydeb/ros-indigo-rail-grasp-collection-1.1.9/include -I/opt/ros/indigo/include -I/usr/include/eigen3 -o MakeFiles/rail_grasp_collection.dir/src/GraspCollector.cpp.o -c /tmp/binarydeb/ros-indigo-rail-grasp-collection-1.1.9/src/GraspCollector.cpp
But interestingly, running that line on its own doesn't seem to cause issues.
If I strace -f the make process, I eventually get the following mmap failure, which leads to the build failure:
[pid 1511] mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 EPERM (Operation not permitted)
But if I compare the possible EPERM reasons from mmap(2):
EPERM: The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec.
EPERM: The operation was prevented by a file seal; see fcntl(2).
Neither fits, as there is no file backing the mmap. I tried just invoking mmap repeatedly in a separate container and at one point was able to reproduce the issue, but now just get Cannot allocate memory after ~4k allocations, since I'm not munmapping anything.
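The repeated-mmap experiment can be scripted. Below is a hedged sketch using Python's mmap module, which issues the same MAP_PRIVATE|MAP_ANONYMOUS mapping as the failing mmap2 call; where it eventually fails (and with which errno) will depend on ulimits, vm.max_map_count, and overcommit settings inside the container, so the function name and defaults here are illustrative only.

```python
import mmap

def probe_anonymous_mmaps(size=8192, attempts=1000):
    """Create anonymous private mappings without unmapping them, like
    the failing mmap2(NULL, 8192, ..., MAP_PRIVATE|MAP_ANONYMOUS, -1, 0).
    Returns (successful_mappings, errno_of_first_failure_or_None)."""
    maps = []
    err = None
    try:
        for _ in range(attempts):
            # fileno=-1 requests an anonymous mapping (not file-backed)
            maps.append(mmap.mmap(-1, size))
    except OSError as e:
        err = e.errno
    finally:
        for m in maps:
            m.close()
    return len(maps), err

count, err = probe_anonymous_mmaps(attempts=100)
```

Running this inside the failing container versus a working one (or comparing 4.4 and 4.15 kernels) should show whether the EPERM is reproducible outside the build itself.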
I've got strace output from a successful run via shell and from a failed run via a minimal makefile (contains the command invocation hard-coded, no other targets or variables).
There's not really much I can identify, except that the successful run has more munmap calls immediately after mmap2 calls than the run via make does. But nothing is definitively identifiable.
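To make the mmap2/munmap comparison quantitative rather than eyeballed, the two strace logs could be tallied with a small script (a hypothetical helper, not something in the buildfarm tooling):

```python
import re

SYSCALL_RE = re.compile(r'\b(mmap2?|munmap)\(')

def syscall_counts(strace_output):
    """Count mmap/mmap2 and munmap calls in `strace -f` output,
    skipping '<... resumed>' continuation lines so interrupted calls
    are not double-counted."""
    counts = {"mmap": 0, "munmap": 0}
    for line in strace_output.splitlines():
        if "resumed>" in line:
            continue  # already counted when the call started
        m = SYSCALL_RE.search(line)
        if m:
            counts["munmap" if m.group(1) == "munmap" else "mmap"] += 1
    return counts

sample = """\
[pid 1511] mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f00000
[pid 1511] munmap(0xb7f00000, 8192) = 0
[pid 1511] mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 EPERM (Operation not permitted)
"""
print(syscall_counts(sample))  # → {'mmap': 2, 'munmap': 1}
```

Diffing the two resulting counts (shell run vs. make run) would confirm or rule out the "fewer munmaps under make" observation.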
Would it make sense to open a ticket on the Docker tracker?
Lots more eyes over there.
Would it make sense to open a ticket on the Docker tracker?
I don't really see this as a Docker issue. It happens with both Docker 17.05 and 18.03/18.05; it's changing the kernel that causes grief. If anything, I might post on the Docker forums and/or reply on https://www.reddit.com/r/docker/comments/8l539q/docker_virtual_memory_running_out/
I'd also really like to have a minimal example that I can share but I suppose in the context of Docker just publishing an image that exhibits the problem is sufficient even if it's heavyweight.
Oh, I didn't see a kernel revert mentioned, so I assumed that was not affecting anything.
I'd also really like to have a minimal example that I can share but I suppose in the context of Docker just publishing an image that exhibits the problem is sufficient even if it's heavyweight.
True.
Oh, I didn't see a kernel revert mentioned, so I assumed that was not affecting anything.
My apologies. I had a writeup that ended up getting deleted when I found a flaw in my methods.
The issue doesn't occur with the default Xenial kernel. I brought up a machine using our agent AMI, partitioned off from the main buildfarm network, and downgraded its kernel back to the default linux-aws 4.4 kernel, and the issue was resolved.
I also tried downgrading docker back to the 17.05 version which was previously used but that had no effect on the issue (it still occurred with the 4.15 kernel and did not occur using 4.4).
However, the 4.4 kernel's Spectre and Meltdown mitigations causing performance issues in trusty containers was the reason we rolled out the 4.15 kernel in the first place. So if we cannot resolve this issue (and some issues using libkmod in kinetic builds), we'll have to explore other options for mitigating or living with the performance hit.
However, the 4.4 kernel's Spectre and Meltdown mitigations causing performance issues in trusty containers was the reason we rolled out the 4.15 kernel in the first place. So if we cannot resolve this issue (and some issues using libkmod in kinetic builds), we'll have to explore other options for mitigating or living with the performance hit.
Or disable the mitigations?
Not much private info going around on the farm?
Or are there opportunities for leaking data that I'm not aware of?
Provided it's really those mitigations that are causing this, of course.
Provided it's really those mitigations that are causing this, of course.
Disabling the mitigations is another way to resolve ros-infrastructure/ros_buildfarm#535. I haven't tested whether 4.4 with mitigations disabled exhibits this same issue, but that's a valid test to perform.
I am reluctant to run without mitigations on the public farm, primarily because I lack the expertise to be confident that we would not be opening any significant new attack vectors in doing so.
I'm not an expert either, so this is just speculation, but perhaps someone could register a repository for dev jobs with a malicious package that exploits either vulnerability to retrieve the Jenkins admin password?
Seems far-fetched, but that is basically the end-of-the-world scenario that started this whole mess.
Yeah, it looks like maybe finding ways to live with the performance hits on Trusty will be better.
The kernel upgrade is also causing regressions in the realsense driver and downstream packages: IntelRealSense/realsense-ros#388
We did end up reverting the kernel change which precipitated this.