Comments (29)
I just tried compiling Zed on Linux with the rustc_codegen_cranelift backend. It crashes reliably with:
fatal runtime error: IO Safety violation: owned file descriptor already closed
Thread 1 "Zed" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x00007ffff786fe8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x00007ffff7820fb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff780b472 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000555562b7e5d1 in std::sys::pal::unix::abort_internal () at library/std/src/sys/pal/unix/mod.rs:366
#5 0x0000555560e37664 in std::sys::pal::unix::fs::debug_assert_fd_is_open () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/std/src/rt.rs:43
#6 0x0000555560dffad6 in std::os::fd::owned::{impl#7}::drop () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/std/src/os/fd/owned.rs:181
#7 0x0000555560e38069 in core::ptr::drop_in_place<std::os::fd::owned::OwnedFd> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#8 0x000055555b0f8dde in core::ptr::drop_in_place<std::sys::pal::unix::fd::FileDesc> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#9 0x000055555b0f928c in core::ptr::drop_in_place<std::sys::pal::unix::process::process_common::Stdio> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#10 0x000055555b0f8856 in core::ptr::drop_in_place<core::option::Option<std::sys::pal::unix::process::process_common::Stdio>> ()
at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#11 0x000055555b0f930f in core::ptr::drop_in_place<std::sys::pal::unix::process::process_common::Command> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#12 0x000055555b0f8d67 in core::ptr::drop_in_place<std::process::Command> () at /home/bjorn/Projects/cg_clif2/build/stdlib/library/core/src/ptr/mod.rs:542
#13 0x000055555b10be12 in alacritty_terminal::tty::unix::new () at /home/bjorn/.cargo/registry/src/index.crates.io-6f17d22bba15001f/alacritty_terminal-0.23.0/src/tty/unix.rs:285
#14 0x000055555afdc160 in terminal::{impl#7}::new () at crates/terminal/src/terminal.rs:377
#15 0x000055555a49c2fb in project::terminals::{impl#0}::create_terminal () at crates/project/src/terminals.rs:167
$ dist/rustc-clif -vV
rustc 1.80.0-nightly (b1ec1bd65 2024-05-18)
binary: rustc
commit-hash: b1ec1bd65f89c1375d2cf2fb733a87ef390276d3
commit-date: 2024-05-18
host: x86_64-unknown-linux-gnu
release: 1.80.0-nightly
Cranelift version: 0.108.0
Upon further investigation, I found alacritty/alacritty#7996, which fixes the crash I found here. I think it is very likely that the bug this fixes is responsible for the panic described in this issue too.
Nvm, I just read the attached issue. I'll pull that into Zed and see if it fixes our problems.
We've been running the fixed alacritty in preview for a week, and on stable for a few days, and have seen 0 instances of this - this either means we've been unusually lucky, or this is now solved.
Thank you very much to @the8472 for adding the debug asserts, and to @jakobhellermann for figuring out the problem.
Thank you for sharing!
I am not necessarily opposed to this change, I just felt it would be useful to know a bit more context.
Hmm. I wonder if, in another thread that is not bound by the executor, but does wait until after the worktree code has started up, destroying one of the directories in question would do it? That would make sense, and would explain why we get such a lovely "error" from the OS. That would be a classic case of "force majeure" that we typically do want to ignore, yes.
I will simply trust @the8472 that EBADF is actually a highly specific error, in this context at least
Too much trust 😅. It's entirely possible that the usual suspects (FUSE, network filesystems, bugs in filesystem impls) do something crazy. But there's hope, let it at least interact with reality before it gets crushed again.
At least libfuse docs say this about the release operation:
The filesystem may reply with an error, but error values are not returned to close() or munmap() which triggered the release.
Error codes really ought to include ENFS and EFUSE.
- ENFS: no more-specific error available, just some hot garbage from NFS.
- EFUSE: no more-specific error available, just some hot garbage from FUSE.
@ConradIrwin Can you discuss the tracebacks for this, or does your telemetry lack sufficiently sophisticated crash reporting to make that possible?
I expect that the standard library should not panic on situations that do happen in reality.
I know you meant something slightly aside from the literal reading of this, but I do expect the standard library to panic on indexing a slice with an index at or beyond its length. And people write code like that every day. So clearly the criterion has to be something more like
- whether it is reasonable to proceed anyways (arguable?)
- whether it is fixable (arguably not, if we get an error as joyously "descriptive" as EBADF)
Fair: I do expect panics on programmer error or in a situation where it's unsafe to continue. I am relatively (but it's obviously hard to be 100%) sure this is not programmer error, and I'm convinced that it'd be just fine if the program kept running (cf. how this error is ignored when dropping a file descriptor).
The full backtrace is below. All line numbers in Zed code are relative to v0.131.6, and Rust is 1.77.1. I'm not sure why the top frame's line number is missing from symbolication, but the panic message narrows it down to this assertion.
I believe the reason this panics on line 4018 of worktree.rs in Zed's code is that child_paths goes out of scope, which was retaining a ReadDir returned from std::fs::read_dir() (via smol) here. I can't see any shenanigans on this codepath that would lead us to invalidate the file descriptor, which leads me to believe that "this just happens sometimes" when interacting with the filesystem. (Though we are holding this across async/await, so it seems possible that something else is affecting the system's state, but it's very unclear what would cause this kind of problem.)
Panic `unexpected error during closedir: Os { code: 9, kind: Uncategorized, message: "Bad file descriptor" }` on thread 38 (com.apple.root.user-initiated-qos)
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/std/src/sys/pal/unix/fs.rs:0: _$LT$std..sys..pal..unix..fs..Dir$u20$as$u20$core..ops..drop..Drop$GT$::drop::h84b244fb776f88f3
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507: core::ptr::drop_in_place::<std::path::PathBuf>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507: core::ptr::drop_in_place::<std::sys::pal::unix::fs::InnerReadDir>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/alloc/src/sync.rs:1756: <alloc::sync::Arc<std::sys::pal::unix::fs::InnerReadDir>>::drop_slow
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/mem/mod.rs:394: core::mem::size_of_val_raw::<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/alloc/layout.rs:201: <core::alloc::layout::Layout>::for_value_raw::<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/alloc/src/boxed.rs:1241: <alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send> as core::ops::drop::Drop>::drop
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507: core::ptr::drop_in_place::<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>>
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:507: core::ptr::drop_in_place::<core::pin::Pin<alloc::boxed::Box<dyn futures_core::stream::Stream<Item = core::result::Result<std::path::PathBuf, anyhow::Error>> + core::marker::Send>>>
crates/worktree/src/worktree.rs:4018: <worktree::BackgroundScanner>::scan_dir::{closure#0}
crates/worktree/src/worktree.rs:3754: <worktree::BackgroundScanner>::scan_dirs::{closure#0}::{closure#0}::{closure#0}
crates/gpui/src/executor.rs:457: <gpui::executor::Scope>::spawn::<<worktree::BackgroundScanner>::scan_dirs::{closure#0}::{closure#0}::{closure#0}>::{closure#0}
/rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/task/poll.rs:51: <core::task::poll::Poll<()>>::map::<core::result::Result<(), alloc::boxed::Box<dyn core::any::Any + core::marker::Send>>, core::result::Result<(), alloc::boxed::Box<dyn core::any::Any + core::marker::Send>>::Ok>
/Users/administrator/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.7.0/src/raw.rs:557: <async_task::raw::RawTask<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>, (), <gpui::executor::BackgroundExecutor>::spawn_internal<()>::{closure#0}, ()>>::run
libdispatch.dylib _dispatch_client_callout
libdispatch.dylib _dispatch_root_queue_drain
libdispatch.dylib _dispatch_worker_thread2
libsystem_pthread.dylib _pthread_wqthread
libsystem_pthread.dylib start_wqthread
For comparison a failure to close a file during the drop of a file descriptor is silently ignored as I would expect.
This is different because closing regular files can emit IO errors for which there is no good recovery strategy, because it is unclear after calling close whether the FD is already closed successfully or not. The recommended strategy for error handling in those cases is to call sync_all on files first, before closing them.
EBADF is very different. It indicates that something closed a file descriptor it did not own, which is a violation of IO safety. Unless you have some evidence that this is just macOS weirdness, I would expect this to indicate a very serious bug in userspace, which can lead to data corruption and UB due to file descriptor confusion.
This error should not be suppressed.
Arguably we should also not silently ignore EBADF for OwnedFd and only suppress other errors such as EIO.
And maybe we should add a fn close(self) -> Result<()> API to allow non-panicking error handling, though even then EBADF shouldn't be silently ignored, and the application should be aborted to prevent data from being accidentally clobbered.
Caveat: closedir operates on a DIR*, not on an fd directly. So it's possible that the C library returns EBADF for something that's not an invalid file descriptor but some other invalid state in that wrapping struct.
Is the macOS C library open source? Then we can check there. If not, it'd require reproducing the issue and tracing syscalls to tell the difference.
It is, though as with everything Apple, it's a mess.
Errors can only occur from the close call; the rest of closedir just does some reclaiming. Most of the kernel is open source as well, so one could try to figure out whether anything except the fd lookup could cause the EBADF, but I'm not motivated enough to do that.
Thanks for looking it up. I think that is enough evidence that there's probably some library-UB going on in zed and we shouldn't try to silence it.
The only place in zed itself which calls as_raw_fd is this code for getting the PID of the child process running in the integrated terminal. While it's definitely very fishy (it's not tied to the lifetime of the PTY), it doesn't close anything, so I doubt that this issue is caused there.
As for dependencies, it might help to get a complete backtrace to see which ReadDir::drop specifically is causing this.
It's not necessarily readdir that's causing this; it could be the victim of something else going rogue and closing the wrong fd numbers.
It must be... Though I'm hoping that there is enough locality between the close and the read_dir to point us in the right direction.
I have an idea for how to harden std types against these kinds of things, but it would be fairly expensive and imo only makes sense for debug_assert builds: randomization of file descriptor numbers. But I'm not sure if people would be on board with it.
I think that is a very interesting idea and would support merging it as a debug assertion if it finds a bug in Zed.
I am not yet convinced this is a bug in Zed.
Yeah, without a root cause or even a reproducer I'm not claiming certainty.
Another possibility that I see is some race in the Arc leading to a double-drop.
@ConradIrwin I don't think this is a "this just happens sometimes" thing. File descriptors should not randomly vanish from a process. File descriptors are densely allocated resources holding critical data. If they were to get misnumbered under the nose of a process, things could go seriously wrong, including data corruption. And if Arc is broken, that'd also be serious.
So there could be a serious bug somewhere, and that imo warrants further investigation rather than silencing it in the standard library. That it is some benign (although surprising) macOS behavior is also a possibility, but this should be proven before we silence it.
I have opened #124130 to abort on EBADF in OwnedFd. That should provide a signal to tell whether it's specific to ReadDir or file descriptors are getting clobbered in general. But you'll have to build Zed with a custom rustc build for that, or nightly if it gets merged.
Arguably we should also not silently ignore EBADF for OwnedFd and only suppress other errors such as EIO.
Old MacDonald had a farm...
Yes, I'm in favor of us, at least for as long as we can tolerate it, silencing specifically whatever errors we expect to have happen and for us to be able to do nothing about, instead of "any error at all". I will simply trust @the8472 that EBADF is actually a highly specific error, in this context at least, and not merely the surface description of "something bad happened lol".
Thanks for all the input here, and particularly the reassurance that "this shouldn't happen".
It seems like the next steps are to try and narrow down:
- Is this a logic bug in Zed or its dependencies?
- Is this a filesystem bug in macOS (or FUSE, or Samba, which I know we have some users using)?
For the in-process case it seems like a "bad file descriptor" means "one that the OS doesn't think you have", so either:
- Something is generating arbitrary numbers and sneaking them into this Dir object (seems very unlikely).
- Something is closing arbitrary numbers as file descriptors, and sometimes this happens to close an open Dir object (seems very unlikely).
- Something is closing a file descriptor twice, and between the first and second close the descriptor is re-used by the Dir open, so by the time we come to drop the Dir we have lost it (seems most plausible).
- (maybe) Something is closing all file descriptors on app shutdown, but Rust code continues to run after that?
- ...your idea here...
Although most of our code is Rust, we do link to the Cocoa runtime and a few other non-Rust libraries, so the search space is rather large. Ideas for narrowing it down would be very appreciated! (In Rust, I think we just need to audit all our crates for AsRawFd/IntoRawFd, and assuming no Rust bugs, everything else should be "safe" according to https://rust-lang.github.io/rfcs/3128-io-safety.html?)
For the filesystem-bug case, I haven't yet tried reproducing this on other filesystems. I know we have users using Samba at least, and FUSE/sshfs seems quite probable too, so that might be a fruitful path.
If anyone's interested in this kind of spelunking, I'd love a second brain while working through this: https://calendly.com/conradirwin/pairing, otherwise any interesting hypothesis that we can test for how to get a reproduction would be awesome.
If this does turn out to be a bug on the filesystem side, then I think we do need to not panic in Rust. If it turns out to be a bug on the user-space side, then we'll be glad to have tracked it down, and the current behavior seems ok (until we do find a filesystem bug, at which point we'll be back to square one).
In parallel we might want to explore ideas from #98338 on how we can handle fallible drops for things like this.
(In rust, I think we just need to audit all our crates for AsRawFd/IntoRawFd, and assuming no rust bugs, everything else should be "safe" according to https://rust-lang.github.io/rfcs/3128-io-safety.html?)
Yes, that's the intent.
I'm currently writing a FUSE testcase on Linux to see if I can get an EBADF propagated from one of its replies to a close().
Thanks! I should be able to do something similar for macFUSE / Samba next week.
strace excerpt:
openat(AT_FDCWD, "./fuse-test/foo", O_RDONLY|O_CLOEXEC) = 5
close(5) = -1 EBADF (Bad file descriptor)
Hope has been canceled once again.
Test program
use std::{
    fs::File,
    thread,
    time::{Duration, UNIX_EPOCH},
};

use fuser::{FileAttr, FileType};

struct TestFs {}

const DUMMY_ATTR: FileAttr = FileAttr {
    ino: 2,
    size: 13,
    blocks: 1,
    atime: UNIX_EPOCH,
    mtime: UNIX_EPOCH,
    ctime: UNIX_EPOCH,
    crtime: UNIX_EPOCH,
    kind: FileType::RegularFile,
    perm: 0o644,
    nlink: 1,
    uid: 501,
    gid: 20,
    rdev: 0,
    flags: 0,
    blksize: 512,
};

impl fuser::Filesystem for TestFs {
    fn init(&mut self, _req: &fuser::Request<'_>, _config: &mut fuser::KernelConfig) -> Result<(), libc::c_int> {
        Ok(())
    }

    fn lookup(&mut self, _req: &fuser::Request<'_>, _parent: u64, _name: &std::ffi::OsStr, reply: fuser::ReplyEntry) {
        let d = Duration::from_secs(1);
        reply.entry(&d, &DUMMY_ATTR, 0);
    }

    fn open(&mut self, _req: &fuser::Request<'_>, _ino: u64, _flags: i32, reply: fuser::ReplyOpen) {
        reply.opened(1, 0);
    }

    // Reply to FLUSH with EBADF; the kernel propagates it to the close() caller.
    fn flush(&mut self, _req: &fuser::Request<'_>, _ino: u64, _fh: u64, _lock_owner: u64, reply: fuser::ReplyEmpty) {
        reply.error(libc::EBADF);
    }
}

fn main() {
    let fs = TestFs {};
    let t = thread::spawn(move || {
        //let options = [MountOption::RO, MountOption::AutoUnmount, MountOption::AllowRoot];
        fuser::mount2(fs, "fuse-test", &[]).expect("mount failed");
    });
    thread::sleep(Duration::from_millis(500));
    let f = File::open("./fuse-test/foo").unwrap();
    drop(f);
    t.join().unwrap();
}
If this replicates on macOS and users have some janky FUSE drivers, that'd be a likely suspect. Though those drivers should have a bug filed against them.
A quick (though inconclusive) update: although I'm not certain it doesn't reproduce, I was unable to reproduce with the example program (it fails before it gets to the end), and in my testing with Zed itself I could only reproduce libgit2/libgit2#6796, not this error, with sshfs.
Still to try: updating the example program to run to completion, and setting up a Samba server to test that one.
If you can reproduce this reliably, I’d love to debug it together - do you have time to talk the first week of July? If so please reach out to [email protected]
In my mind there’s quite a chance this is a bug in Zed or its dependencies, but it’s also possible that this is a failure introduced by the filesystem.
Things I’d like to try:
- Adding logging to open/close fds to see if we can find a double close.
- Knowing more about the filesystem you’re using.
I can confirm that using the current master branch of alacritty-terminal fixes the IO safety crash with rustc_codegen_cranelift. It does still crash, but that is because of some unimplemented things in rustc_codegen_cranelift (rust-lang/rustc_codegen_cranelift#1492), not a bug in Zed or any of its dependencies.
diff --git a/crates/terminal/Cargo.toml b/crates/terminal/Cargo.toml
index ef6260730..3ce869597 100644
--- a/crates/terminal/Cargo.toml
+++ b/crates/terminal/Cargo.toml
@@ -14,7 +14,7 @@ doctest = false
[dependencies]
-alacritty_terminal = "0.23"
+alacritty_terminal = { git = "https://github.com/alacritty/alacritty", version = "0.24.1-dev" }
anyhow.workspace = true
collections.workspace = true
dirs = "4.0.0"
alacritty/alacritty#7996 hasn't been published to crates.io yet, so I had to use a git dependency.