
Comments (13)

tfoote commented on July 1, 2024

I dug into http://build.ros.org/job/Ibin_uT32__rail_grasp_collection__ubuntu_trusty_i386__binary/ since it's relatively quick to reproduce. Running it locally I cannot reproduce the memory exhaustion, and the maximum memory usage is about 5.5% of my 15GB of RAM, which is less than 1GB, so I don't know of any limitations at that level. Here are some plots of my system's memory usage just before, during and after the critical build object at around 50%.

[Four screenshots of system memory usage plots, captured 2018-05-25 between 14:34:21 and 14:34:48]

I reproduced the build locally with:

mkdir /tmp/release_job
generate_release_script.py https://raw.githubusercontent.com/ros-infrastructure/ros_buildfarm_config/production/index.yaml indigo default rail_grasp_collection ubuntu trusty amd64 > /tmp/release_job/release_job_indigo_roscpp.sh
cd /tmp/release_job
sh release_job_indigo_roscpp.sh

Maybe this could be tried on a build executor to get the right kernel etc.

from buildfarm_deployment.

nuclearsandwich commented on July 1, 2024

I had a couple of false starts, but I have confirmed that this issue is related to the kernel version bump rather than the Docker version bump, which means that the attempted cure for ros-infrastructure/ros_buildfarm#535 is worse than the original issue.

I've still been reproducing with a run of the full release script and haven't yet pinned down the exact cause. It's not memory usage: running stress --vm 2 --vm-bytes 3.99G works just fine, and stress --vm 2 --vm-bytes 4G fails with a different error message. So it's got to be something particular to how or what's being allocated by the build process. I wonder if there are glibc interface changes between the Trusty libc and the 4.15 kernel?
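As a quick sanity check on where allocation actually stops inside the container, something like the following minimal probe could help (an illustrative sketch, not part of the build scripts; the 64 MiB chunk size and 8 GiB cap are arbitrary choices). It maps anonymous memory until mmap fails and reports the errno, which should separate a plain 32-bit address-space limit from some other restriction:

/* probe_total.c -- hypothetical probe, not part of the buildfarm tooling.
 * Maps anonymous memory in 64 MiB chunks (without touching it) until mmap
 * fails or an arbitrary 8 GiB cap is reached, then reports the total mapped
 * and the errno of the failure.
 * Build with: gcc probe_total.c -o probe_total
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const size_t chunk = 64UL * 1024 * 1024;                   /* 64 MiB per mapping */
    const unsigned long long cap = 8ULL * 1024 * 1024 * 1024;  /* stop eventually on 64-bit hosts */
    unsigned long long total = 0;

    while (total < cap) {
        void *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            printf("mmap failed after %llu MiB: errno=%d (%s)\n",
                   total / (1024 * 1024), errno, strerror(errno));
            return 0;
        }
        total += chunk;
    }
    printf("reached %llu MiB without an mmap failure\n", total / (1024 * 1024));
    return 0;
}

Running it both inside the trusty i386 container on the 4.15 kernel and on a 4.4 host should show whether the failure point or the errno changes with the kernel.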

from buildfarm_deployment.

nuclearsandwich commented on July 1, 2024

I've been trying to whittle this down to a minimal case. I've been using the rail_grasp_collection package as a test sample because it had a short successful build time according to Jenkins. The failure comes when make is running:

/usr/bin/i686-linux-gnu-g++ -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"rail_grasp_collection\" -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -DNDEBUG -D_FORTIFY_SOURCE=2 -I/tmp/binarydeb/ros-indigo-rail-grasp-collection-1.1.9/include -I/opt/ros/indigo/include -I/usr/include/eigen3 -o CMakeFiles/rail_grasp_collection.dir/src/GraspCollector.cpp.o -c /tmp/binarydeb/ros-indigo-rail-grasp-collection-1.1.9/src/GraspCollector.cpp

But interestingly, running that line on its own doesn't seem to cause issues.

If I strace -f the make process I eventually get the following mmap failure, which leads to the build failure:

[pid  1511] mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 EPERM (Operation not permitted)

But if I compare the possible EPERM reasons from mmap(2):

EPERM: The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec.

EPERM: The operation was prevented by a file seal; see fcntl(2).

Neither fits, as there is no file backing the mmap. I tried just invoking mmap a handful of times in a separate container and at one point was able to reproduce the issue, but now I just get Cannot allocate memory after ~4k allocations, as I'm not munmapping anything.
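For reference, that throwaway probe was roughly along these lines (an illustrative reconstruction rather than the exact code; the 8 KiB size matches the failing mmap2 call above and the iteration cap is arbitrary):

/* mmap_loop.c -- illustrative reconstruction of the ad-hoc probe described
 * above: repeatedly mmap 8 KiB of anonymous memory, never munmap, and report
 * the errno when a call finally fails, so EPERM ("Operation not permitted")
 * can be told apart from ENOMEM ("Cannot allocate memory").
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const unsigned long max_calls = 1000000UL;  /* arbitrary cap so the loop terminates */
    unsigned long i;

    for (i = 0; i < max_calls; i++) {
        void *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            printf("mmap call %lu failed: errno=%d (%s)\n",
                   i + 1, errno, strerror(errno));
            return 1;
        }
    }
    printf("all %lu mmap calls succeeded\n", max_calls);
    return 0;
}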

from buildfarm_deployment.

nuclearsandwich commented on July 1, 2024

I've got strace output from a successful run via the shell and from a failed run via a minimal makefile (which contains the command invocation hard-coded, with no other targets or variables).

There's not much I can identify except that the successful run has more munmaps immediately after mmap2 calls than the run via make does, but nothing is definitively identifiable.

from buildfarm_deployment.

gavanderhoorn commented on July 1, 2024

Would it make sense to open a ticket on the Docker tracker?

Lots more eyes over there.

from buildfarm_deployment.

nuclearsandwich commented on July 1, 2024

Would it make sense to open a ticket on the Docker tracker?

I don't really see this as a Docker issue. It happens with both Docker 17.05 and 18.03/18.05; it's changing the kernel that causes grief. If anything I might post on the Docker forums and/or reply on https://www.reddit.com/r/docker/comments/8l539q/docker_virtual_memory_running_out/

I'd also really like to have a minimal example that I can share but I suppose in the context of Docker just publishing an image that exhibits the problem is sufficient even if it's heavyweight.

from buildfarm_deployment.

gavanderhoorn commented on July 1, 2024

Oh, I didn't see a kernel revert mentioned, so I assumed that was not affecting anything.

I'd also really like to have a minimal example that I can share but I suppose in the context of Docker just publishing an image that exhibits the problem is sufficient even if it's heavyweight.

True.

from buildfarm_deployment.

nuclearsandwich commented on July 1, 2024

Oh, I didn't see a kernel revert mentioned, so I assumed that was not affecting anything.

My apologies. I had a writeup that ended up getting deleted when I found a flaw in my methods.

The issue doesn't exhibit with the default Xenial kernel. I brought up a machine using our agent AMI, partitioned off from the main buildfarm network, downgraded its kernel back to the default linux-aws 4.4 kernel, and the issue was resolved.

I also tried downgrading Docker back to the previously used 17.05 version, but that had no effect on the issue (it still occurred with the 4.15 kernel and did not occur with 4.4).

However, the performance issues that the 4.4 kernel's Spectre and Meltdown mitigations caused in Trusty containers were the reason we rolled out the 4.15 kernel in the first place. So if we cannot resolve this issue (and some issues using libkmod in Kinetic builds), we'll have to explore other options for mitigating or living with the performance hit.

from buildfarm_deployment.

gavanderhoorn commented on July 1, 2024

However, the performance issues that the 4.4 kernel's Spectre and Meltdown mitigations caused in Trusty containers were the reason we rolled out the 4.15 kernel in the first place. So if we cannot resolve this issue (and some issues using libkmod in Kinetic builds), we'll have to explore other options for mitigating or living with the performance hit.

Or disable the mitigations?

Not much private info going around on the farm?

Or are there opportunities for leaking data that I'm not aware of?

Provided it's really those mitigations that are causing this, of course.

from buildfarm_deployment.

nuclearsandwich commented on July 1, 2024

Provided it's really those mitigations that are causing this, of course.

Disabling the mitigations is another way to resolve ros-infrastructure/ros_buildfarm#535. I haven't tested whether 4.4 with mitigations disabled exhibits this same issue, but that's a valid test to perform.

I am reluctant to run without mitigations on the public farm, primarily because I lack the expertise to be confident that we would not be opening any significant new attack vectors in doing so.

from buildfarm_deployment.

gavanderhoorn commented on July 1, 2024

I'm not an expert either, so this is just speculation, but perhaps registering a repository for dev jobs with a malicious package that exploits either vulnerability to retrieve the Jenkins admin password?

Seems far-fetched, but that is basically the end-of-the-world scenario that started this whole mess.

from buildfarm_deployment.

tfoote commented on July 1, 2024

Yeah, it looks like maybe finding ways to live with the performance hits on Trusty will be better.

The kernel upgrade is also causing regressions in the realsense driver and downstream packages: IntelRealSense/realsense-ros#388

from buildfarm_deployment.

nuclearsandwich commented on July 1, 2024

We did end up reverting the kernel change which precipitated this.

from buildfarm_deployment.
