
Comments (15)

traversaro avatar traversaro commented on August 21, 2024 2

I will open a bug upstream in GCC for P2.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111413

The issue was fixed upstream for GCC14, see:

The patch is huge, but ignoring the indentation changes it boils down to a single-line change, which is better suited for a backport because it reduces the risk of patch conflicts.

from ctng-compilers-feedstock.

h-vetinari avatar h-vetinari commented on August 21, 2024 1

OK, sorry about that. I followed your "downstream issue" a bit, that's why I got to ipopt. If this happens purely with python+libgomp, then I'm more stumped (I thought it was something about setting/parsing the OMP_* options). I fail to imagine how the commit you referenced would touch the ABI, but perhaps that's the case. Might be interesting to rebuild python with gcc 13 to see if that changes anything?


traversaro avatar traversaro commented on August 21, 2024 1

I found another issue that reports a segfault in libgomp's initialize_env(): weechat/weechat#2009. If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.

The issue here seems to happen even with PHP. I wonder if it happens in general when using dlopen.

I tested with casadi, and the issue did not happen when using a simple C++ example (I tested https://github.com/casadi/casadi/blob/main/docs/examples/cplusplus/ipopt_nl.cpp).


traversaro avatar traversaro commented on August 21, 2024 1

OK, I think this is the combination of two different behaviours/problems:

  • P1: constructors of a shared library opened via dlopen with RTLD_DEEPBIND from Python on conda-forge/Debian see environ==NULL
    • I am not sure why this happens, or whether it is expected behaviour or a bug
  • P2: libgomp >= 13 segfaults if environ==NULL
    • I think this behaviour is a bug in libgomp, as environ==NULL is a valid state, caused for example by calling clearenv() on Linux.

P2 can be reproduced easily on libgomp >= 13 with this MWE:

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* On glibc, clearenv() sets environ to NULL. */
    clearenv();
    /* With libgomp >= 13 the library constructor segfaults during this call. */
    void *handle = dlopen("libgomp.so.1", RTLD_NOW);

    if (handle) {
        fprintf(stderr, "dlopen of libgomp.so.1 done correctly.\n");
        return EXIT_SUCCESS;
    } else {
        fprintf(stderr, "dlopen of libgomp.so.1 failed with error: %s.\n", dlerror());
        return EXIT_FAILURE;
    }
}

to run:

gcc test_gomp_segfault.c -o test_gomp_segfault -ldl
./test_gomp_segfault

I will open a bug upstream in GCC for P2.


traversaro avatar traversaro commented on August 21, 2024

libgomp <= 12 is used

Indeed, it seems that the problematic piece of code was only introduced in libgomp 13: gcc-mirror/gcc@9f2fca5.


h-vetinari avatar h-vetinari commented on August 21, 2024

Based on the patch you found, it seems to have something to do with parsing OMP_* environment variables?

I saw that the ipopt-feedstock sets

  # Environment variables needed by spral
  # See https://github.com/ralna/spral#usage-at-a-glance
  export OMP_CANCELLATION=TRUE
  export OMP_PROC_BIND=TRUE

In particular, from the commit you linked that introduced the new facility for host vs. device, it seems to me that:

  • The parsing for OMP_PROC_BIND changed substantially (as opposed to OMP_CANCELLATION)
  • While things clearly shouldn't break, the code does warn for invalid values, so trying to rebuild the affected stack against libgomp 13.x would probably be a good idea.
  • Out of all the test cases in that commit, none of them has a value of TRUE for OMP_PROC_BIND, but rather things like: "spread", "close", "spread,spread", "spread,close"


traversaro avatar traversaro commented on August 21, 2024

I am not sure this is related to ipopt/spral. The environment in which this happens, reported in #114 (comment), is created with mamba create -n testsegfault libgomp python, and in that environment no OMP_* variables are defined.

While things clearly shouldn't break, the code does warn for invalid values, so trying to rebuild the affected stack against libgomp 13.x would probably be a good idea.

Just to understand, which stack? The problem occurs just by combining libgomp and python, and I do not think that python depends on libgomp.


traversaro avatar traversaro commented on August 21, 2024

I found another issue that reports a segfault in libgomp's initialize_env(): weechat/weechat#2009. If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.


traversaro avatar traversaro commented on August 21, 2024

I reproduced the issue on Debian and Ubuntu releases whose apt packages ship gomp 13, while earlier releases with gomp 12 all pass: https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6172933871. On the other hand, Fedora 38 has gomp 13.2.0 but does not reproduce the error; similarly, the latest Arch does not reproduce the problem either.


S-Dafarra avatar S-Dafarra commented on August 21, 2024

I found another issue that reports a segfault in libgomp's initialize_env(): weechat/weechat#2009. If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.

The issue here seems to happen even with PHP. I wonder if it happens in general when using dlopen.


traversaro avatar traversaro commented on August 21, 2024

I found another issue that reports a segfault in libgomp's initialize_env(): weechat/weechat#2009. If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.

The issue here seems to happen even with PHP. I wonder if it happens in general when using dlopen.

I tested with casadi, and the issue did not happen when using a simple C++ example (I tested https://github.com/casadi/casadi/blob/main/docs/examples/cplusplus/ipopt_nl.cpp).

Just to be sure, I created a minimal C-based test, and indeed the issue does not appear to happen with that, see https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6174144211 and https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/blob/main/test.c.


traversaro avatar traversaro commented on August 21, 2024

I was able to reproduce the problem without libgomp, just with a manually coded shared lib, i.e. testso.c:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h> // Include this header for environ

extern char **environ; // Declare extern environ

// Runs automatically when the shared library is loaded.
static void __attribute__((constructor))
initialize_env (void)
{
    char **env;
    fprintf(stderr, "Print debug\n");
    env = environ;
    fprintf(stderr, "environ %p env %p\n", (void *)environ, (void *)env);
    // Deliberately no NULL check: this loop crashes if environ is NULL.
    for (env = environ; *env != 0; env++)
    {
        fprintf(stderr, "%s\n", *env);
    }
    return;
}

gcc -shared -fPIC testso.c -o testso.so
(testsegfault) traversaro@IITICUBLAP257:~/test_ipopt_dir$ python -c "import ctypes; import os; ctypes._dlopen('./testso.so', os.RTLD_DEEPBIND)"
Print debug
environ (nil) env (nil)
Segmentation fault

While in normal use:

Trying to load with RTLD_LAZY|RTLD_DEEPBIND ./testso.so
Print debug
environ 0x7ffcf93cefc0 env 0x7ffcf93cefc0

For some reason the environ global variable is set to 0/NULL.

So perhaps we should move the issue to Python feedstock?


traversaro avatar traversaro commented on August 21, 2024

I will open a bug upstream in GCC for P2.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111413


h-vetinari avatar h-vetinari commented on August 21, 2024

Great job!

The patch is huge, but ignoring the indentation changes it boils down to a single-line change

Proof of that statement, using Github's UI.


traversaro avatar traversaro commented on August 21, 2024

P1: constructors of a shared library opened via dlopen with RTLD_DEEPBIND from Python on conda-forge/Debian see environ==NULL

* I am not sure why this happens, or whether it is expected behaviour or a bug

It turns out that this also worked fine with gomp <= 12 and does not work with gomp 13, so I opened an issue for that as well: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111556. However, to be honest, I am not sure whether this is a problem in libgomp, in glibc, or simply a consequence of how ELF and the POSIX spec interact.

