
Comments (9)

Zygo commented on August 16, 2024

ipos 15650891 is not a multiple of 4096, so something is weird there. Is this near the end of a file, especially a file that is currently or recently being modified?
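For illustration only, the alignment complaint can be verified directly from the logged value; 15650891 is indeed not a multiple of btrfs' 4 KiB block size:

```python
# Check the logged ipos value against btrfs' 4 KiB block size.
ipos = 15650891
block = 4096

remainder = ipos % block    # 75: ipos is 75 bytes into a block, not aligned
block_index = ipos // block # 3821: the 4 KiB block containing ipos
print(remainder, block_index)
```

A block-aligned offset would give a remainder of 0, so an ipos like this usually points at a partial tail block, e.g. near the end of a file.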

There are sometimes glitches where something changes on the filesystem and one source of btrfs metadata disagrees with another because they're capturing data from the filesystem at different times (sometimes several minutes apart). That tends to get discovered by an exception handler when the discrepancy breaks something that doesn't like the filesystem changing underneath it (like the binary search algorithm in ExtentWalker).

It's usually harmless, and bees will visit the extents again on the next scan generation. If the file keeps getting modified all the time, bees will keep triggering the exception (patches to implement a glob- or regexp-based path exclusion framework welcome ;); however, if the file really is being modified all the time, then dedup on that file will be ineffective anyway.

from bees.

kakra commented on August 16, 2024

Nothing accesses the file. It's actually game data from the Windows Steam version running within Wine. But Wine (and thus Steam) is not currently running, and hasn't been for days.

Maybe it results from a bad interaction with autodefrag? Though I don't see why that should happen, because nothing is writing directly to the file. Even bees seems to refuse to work on the file due to the constraint-check exceptions above.

There seem to be at least two files of the same game data that keep coming up again and again in a loop, so bees never finishes with them.

More details:

Sep 19 21:35:10 jupiter beesd[7483]: crawl: BeesAddress(fd 384 /#187708 (deleted) offset 0x1000)
Sep 19 21:35:10 jupiter beesd[7483]: crawl: Found matching range: BeesRangePair: 4K src[0x1000..0x2000] dst[0x2995000..0x2996000]
Sep 19 21:35:10 jupiter beesd[7483]: crawl: src = 384 /#187708 (deleted)
Sep 19 21:35:10 jupiter beesd[7483]: crawl: dst = 383 /home/kakra/apps/winsteam/steamapps/common/BioShock 2 Remastered/ContentBaked/pc/BulkContent/BulkChunk1_77.blk
Sep 19 21:35:10 jupiter beesd[7483]: crawl: creating brp (4K [0x1000..0x2000] fd = 384 '/#187708 (deleted)', 4K [0x2995000..0x2996000] fid = 262:5149095 fd = 383 '/home/kakra/apps/winsteam/steamapps/common/BioShock 2 Remastered/ContentBaked/pc/BulkContent/BulkChunk1_
Sep 19 21:35:10 jupiter beesd[7483]: crawl: Opening dst bfr 4K [0x2995000..0x2996000] fid = 262:5149095 fd = 383 '/home/kakra/apps/winsteam/steamapps/common/BioShock 2 Remastered/ContentBaked/pc/BulkContent/BulkChunk1_77.blk'
Sep 19 21:35:10 jupiter beesd[7483]: crawl: chase_extent_ref ino BtrfsInodeOffsetRoot { .m_inum = 5149095, .m_offset = 0x2995000, .m_root = 262 } bbd BeesBlockData { 4K 0x1000 fd = 384 '/#187708 (deleted)', data[4096] = 'A\x08\x00\x00tttt\x00\x00\x00\x00...' }
Sep 19 21:35:10 jupiter beesd[7483]: crawl:
Sep 19 21:35:10 jupiter beesd[7483]: crawl: *** EXCEPTION ***
Sep 19 21:35:10 jupiter beesd[7483]: crawl:         exception type std::runtime_error: ipos = 15650891, new_vec.rbegin()->m_end = 15650820 failed constraint check (ipos <= new_vec.rbegin()->m_end)
Sep 19 21:35:10 jupiter beesd[7483]: crawl: ***
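For what it's worth, the failed constraint can be reproduced from the logged numbers alone; ipos overshoots the recorded extent end by 71 bytes:

```python
# Values copied from the exception message above.
ipos = 15650891
m_end = 15650820            # new_vec.rbegin()->m_end from the log

# The constraint bees checks is (ipos <= m_end); here it does not hold.
assert not (ipos <= m_end)
overshoot = ipos - m_end    # 71 bytes past the recorded extent end
print(overshoot)
```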

$ ls -alh BulkChunk1_77.blk
-rw-r--r-- 1 kakra kakra 57M 10. Jan 2017  BulkChunk1_77.blk

It's not even big...


Zygo commented on August 16, 2024

That is sounding more like a bug in the binary search algorithm previously mentioned. :-P

Could you run fiemap on the offending file? That will show the extent metadata fields as bees sees them. Maybe there's something anomalous in there that is confusing BtrfsExtentWalker.

For that matter, fiewalk should fail in the same way as bees does, but it's a bit more controlled, so it's easier to understand what's going on (be sure to change the #if 0 to #if 1 so both walking directions get tested).


kakra commented on August 16, 2024

There you go:
https://gist.github.com/kakra/eb0f2bddde592e47d478f936683f3fc6


kakra commented on August 16, 2024

Meanwhile, my btrfs crashed with some refcount issues, some doubly linked extents, and some orphaned extents. btrfsck wasn't able to fix this because compression was in use and the extent length differed from what was expected.

I was able to fix this with btrfs-zero-log, then deleting the affected inodes, then letting btrfsck fix the rest (it needed 3 runs to fix everything). Finally, I restored good copies of the deleted files from the backup.

The problem no longer seems to occur, so this error may be an indicator of an already broken filesystem.

But I think this error was introduced by using bees, which of course is probably not bees' fault but still a bug in btrfs that is already documented and tracked on the mailing list (object already exists, errno=-17). Running bees can inject this problem into the filesystem; the system then freezes and needs a hard reboot. I had a few of these, and it eventually results in the problem I had.


Zygo commented on August 16, 2024

s/introduced/triggered/ by bees. It wouldn't be the first bug that bees found in the kernel.

Which kernel version are you running?


kakra commented on August 16, 2024

Ah yes, "triggered"... I was struggling to find the right word. ;-)

It's been 4.12.13, now 4.12.14, with the ck patchset, using the bfq scheduler. As a vague guess, bfq might be involved in triggering it. I had similar errors some months back when I tried the out-of-tree bfq patches.

Since bees' initial scan finished, the system has run stable (without freezes), except for that last time when it presented that errno=-17 to me upon cold reboot.

BTW: After investigating logs since last boot, I still see such messages:

Sep 29 22:58:35 jupiter beesd[742]: crawl: *** EXCEPTION ***
Sep 29 22:58:35 jupiter beesd[742]: crawl:         exception type std::runtime_error: ipos = 10321965, new_vec.rbegin()->m_end = 10321928 failed constraint check (ipos <= new_vec.rbegin()->m_end)
Sep 29 22:58:35 jupiter beesd[742]: crawl: ***
Sep 29 22:58:35 jupiter beesd[742]: crawl: scan_forward 128K [0x660000..0x680000] fid = 259:9922554 fd = 270 '/gentoo/rootfs/usr/lib64/libnvidia-ptxjitcompiler.so.384.90'
Sep 29 22:58:35 jupiter beesd[742]: crawl: ---  END  TRACE --- exception ---

So either something in my btrfs is still toast, or bees cannot handle this.

It again repeats a lot of those messages for the same file (with long runs of identical ipos values, changing over time). Over time, it switches to a different file to throw this exception on.

According to the package manager, the above file is still pristine. So maybe nothing to worry too much about.


Zygo commented on August 16, 2024

I've heard about bad experiences involving bfq and btrfs from multiple people, not so much from bfq itself as from the multiqueue IO layer it depends on. I'd test that a lot more thoroughly on VMs before turning it on in production.

I wouldn't worry too much about the bees exception. I hit that a few times myself.

My guess is that there's a bug in the ExtentWalker binary-search code that is triggered by the specific details of the layout of these files on your filesystem. Maybe there's a hole that is exactly the wrong size and gets missed during search window expansion.

The other possibility is that there's a duplicate or overlapping extent ref in the filesystem, but in that case it should have shown up in the fiemap output too, and it didn't.

I haven't had a chance to verify this yet (and my TODO list includes rewriting ExtentWalker anyway...).


kakra commented on August 16, 2024

Well, I don't consider my home system a real production system. Yes, I use it as my main system for production-like work (development, documentation, and games, if you want to call games "production") with daily backups, but I also use it as a testbed for pre-production environments, i.e. preparing and testing containers before pushing things to the production environment. Bfq is one of the things I would currently not deploy to those production environments. It works well enough here (and improves perceived performance a lot), but I fear it would create fatal problems in a 24/7 production environment, where it isn't that suitable as an IO scheduler anyway because that environment runs server containers. I'm running deadline/noop there, because the underlying storage is a SAS HBA with BBU and SSD cache.

If you're going to rewrite ExtentWalker, I'm fine with not working on non-fatal bugs in the current implementation, especially if they're hard to find.

