Reproducer: <div class="highlight highlight-text-shell-session notranslate positio

More info: the nesting works if done in two steps: <div class="highlight highlight

flux-job: `client.c:432: pty_read_cb: read: Input/output error` when nesting jobs with interactive ptys about flux-core HOT 4 CLOSED

grondo commented on September 9, 2024

flux-job: `client.c:432: pty_read_cb: read: Input/output error` when nesting jobs with interactive ptys

from flux-core.

Comments (4)

grondo commented on September 9, 2024

More info: the nesting works if done in two steps:

grondo@pi3:~$ flux alloc -N1
grondo@pi0:~$ flux alloc -N1
grondo@pi0:~$ flux getattr instance-level
2

from flux-core.

grondo commented on September 9, 2024

Hm, this "fixes" the issue, but I need to understand why flux-job isn't in the foreground process group in the failing case.

diff --git a/src/common/libterminus/client.c b/src/common/libterminus/client.c
index 3bccd55ad..8e3d743da 100644
--- a/src/common/libterminus/client.c
+++ b/src/common/libterminus/client.c
@@ -432,6 +432,12 @@ static void pty_read_cb (flux_reactor_t *r,
             llog_fatal (c, "read: %s", strerror (errno));
             pty_client_exit (c, NULL);
         }
+        else if (errno == EIO) {
+            /*  Force ourselves in foreground
+            */
+            if (tcgetpgrp (STDIN_FILENO) != getpgrp ())
+                tcsetpgrp (STDIN_FILENO, getpgrp());
+        }
         return;
     }
     if (len == 0) {

from flux-core.

grondo commented on September 9, 2024

Ah, the difference in behavior here is described by this comment:

flux-core/src/broker/runat.c

Lines 468 to 474 in b0c565f

 /* For shell commands run the target cmdline in a separate process 

  * group so that any processes spawned by the shell will be signaled 

  * in runat_abort(). This does not work for an interactive shell 

  * (seems to disable access to the pty), so this flag is not set in 

  * runat_push_shell(). 

  */ 

 cmd->flags |= FLUX_SUBPROCESS_FLAGS_SETPGRP;

When a new instance is launched with -o pty.interactive, then the flux-broker process becomes the process group leader of the foreground process group. When it launches a non-interactive-shell initial program (in the failing case here flux-alloc->flux job attach), it puts the initial program into a new process group, which therefore cannot access the tty (a background process group).

Possibly, a solution here is to disable setpgrp whenever stdin is a tty, but I wonder if that will have any other fallout. Another solution might be to use the pre-exec subprocess hook to force the initial program as the foreground process group before it is executed. I'll try both and see which works out (probably the first will be simplest)

from flux-core.

grondo commented on September 9, 2024

To sum up more succinctly, the problem here is that the broker wants to create a new process group for the initial program (except when it launches a plain interactive shell) so that it can signal any processes spawned from rc2 as a group. However, this disables access to the tty for those processes.

For example, even this simple test fails:

$ flux start vim
Dec 16 08:18:31.079432 broker.err[0]: rc2: Killing stopped non-interactive process
Dec 16 08:18:31.080101 broker.err[0]: rc2.0: vim Killed (rc=137) 0.0s

Here vim is stopped with SIGTTIN when it tries to write to the terminal, and to avoid a hang this workaround is implemented in the broker currently:

flux-core/src/broker/runat.c

Lines 196 to 208 in b0c565f

 case FLUX_SUBPROCESS_STOPPED: 

 /* If this process is non-interactive, and stdin was a tty, 

  * then likely the subprocess was stopped due to SIGTTIN/TTOU. 

  * To avoid what would be a permanent hang, log an error and 

  * kill the child process. 

  */ 

 if (!entry->interactive && isatty (STDIN_FILENO)) { 

 flux_log (r->h, LOG_ERR, 

 "%s: Killing stopped non-interactive process", 

 entry->name); 

 flux_subprocess_kill (p, SIGKILL); 

 } 

 break;

(flux job attach blocks SIGTTIN for another issue, which is why it gets EIO when reading from the terminal instead of SIGTTIN like vim)

There's probably two approaches to fix this problem:

Do not run processes under a separate process group when stdin is a tty. This could affect cleanup of initial programs that are shell scripts or that invoke multiple processes, but is the simplest.
Have the broker act more like a shell, since it is trying to run groups of processes and manage them like a shell does. Run rc2 in its own process group all the time, and make the process group the foreground process group with tcsetpgrp(3) if the broker is currently in the foreground process group. This one would need more thought and some experimentation to see if there is any fallout.

from flux-core.

flux-job: `client.c:432: pty_read_cb: read: Input/output error` when nesting jobs with interactive ptys about flux-core HOT 4 CLOSED

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	/* For shell commands run the target cmdline in a separate process
	* group so that any processes spawned by the shell will be signaled
	* in runat_abort(). This does not work for an interactive shell
	* (seems to disable access to the pty), so this flag is not set in
	* runat_push_shell().
	*/
	cmd->flags \|= FLUX_SUBPROCESS_FLAGS_SETPGRP;

	case FLUX_SUBPROCESS_STOPPED:
	/* If this process is non-interactive, and stdin was a tty,
	* then likely the subprocess was stopped due to SIGTTIN/TTOU.
	* To avoid what would be a permanent hang, log an error and
	* kill the child process.
	*/
	if (!entry->interactive && isatty (STDIN_FILENO)) {
	flux_log (r->h, LOG_ERR,
	"%s: Killing stopped non-interactive process",
	entry->name);
	flux_subprocess_kill (p, SIGKILL);
	}
	break;