
Comments (8)

chogata commented on June 7, 2024

You wrote that you have a mix of versions. I will need more details on this. Basically, if you have a non-recommended setup (like a master in an older version than a chunk server), bad things can happen and you should never keep your system like that. A mix of versions should only be a temporary state while you upgrade your MooseFS instance; modules should be upgraded in the proper order (metalogger, master, chunk servers, clients) and it should never be a permanent solution. If you provide more details, I can try to guess what might have happened to your system.


chogata commented on June 7, 2024

And one more, very important question: what filesystem do you use to store chunks on chunk servers?


deltabweb commented on June 7, 2024

Oh, you might be onto something!

I thought I was running the same version of MooseFS for all the nodes but I have just realised that one of my chunkservers is running 3.0.115 - all the rest of my cluster is running 3.0.116. I will upgrade this node and report back.

Also, the client I am using to run mfsfileinfo -c <file> is running 3.0.117. Am I right in assuming that the client version doesn't matter as much as long as it is a newer version?

Regarding the filesystem, all the nodes are running ext4 except one which is running xfs - my cluster is all over the place because I've been experimenting with it 😢


deltabweb commented on June 7, 2024

Some good news: Problem 2 is mostly solved after upgrading MooseFS on the node that was running an older version! 🎉 Thanks!

With that being said, I was able to find 2 more files that are still showing the issue:

	chunk 0: 0000000001AD6B88_00000004 / (id:28142472 ver:4)
		copy 1: 192.168.1.11:9422 (status:VALID ; blocks: 88 ; checksum digest: 5400356471FE35681C301E0C28319014)
		copy 2: 192.168.1.12:9422 (status:VALID ; blocks: 88 ; checksum digest: 9F3D8C53D652858629F55C2EF667E867)
		copies have different checksums !!!

	chunk 0: 00000000010DB33B_00000003 / (id:17675067 ver:3)
		copy 1: 192.168.1.11:9422 (status:VALID ; blocks: 72 ; checksum digest: A2BAD0CF5F2F09211E32340CB2EFF70A)
		copy 2: 192.168.1.12:9422 (status:VALID ; blocks: 72 ; checksum digest: 86306AC23B480F3A5DB1FA9EA69DF4D9)
		copies have different checksums !!!

The chunkserver running on 192.168.1.12 happens to be the only node running xfs, if that is important.
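
In case it is useful, here is a minimal sketch for sweeping a directory tree for more files like these. It assumes mfsfileinfo is on PATH and keys off the "copies have different checksums" marker visible in the output above; the root path and the marker string are assumptions based on this thread, not a polished tool.

    #!/usr/bin/env python3
    # Sketch: walk a MooseFS mount and flag files whose chunk copies disagree.
    # Assumes "mfsfileinfo" is on PATH; the marker string is taken from the
    # output above and may differ between MooseFS versions.
    import os
    import subprocess
    import sys

    MARKER = "copies have different checksums"

    def scan(root):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                result = subprocess.run(["mfsfileinfo", "-c", path],
                                        capture_output=True, text=True, check=False)
                if MARKER in result.stdout:
                    print(path)

    if __name__ == "__main__":
        scan(sys.argv[1] if len(sys.argv) > 1 else ".")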

Also, Problem 1 is still present. I have recovered most files by changing storage classes to delete the corrupt chunks and then re-create them, but I am still very interested in understanding what happened. Corrupt chunks that aren't detected are probably the scariest form of failure.


chogata commented on June 7, 2024

Regarding versions:

  • ideal situation (recommended) - all modules in the same version :)
  • slightly less ideal, but still totally acceptable: master/metalogger and chunk servers in the same version, clients in older version(s)
  • should still work fine, but we don't recommend it other than during an upgrade: master/metalogger in the same version, chunk servers in older version(s) than the master, clients in older version(s) than both master and chunk servers
  • DO NOT USE: master older than chunk servers, master or chunk servers older than clients

Regarding silent deterioration: MooseFS has mechanisms to guard against it, but not everything will be detected instantly. Chunk servers are constantly checking all chunks (in one big, long loop): they read the whole chunk and check whether the content matches the checksums; if it doesn't, the chunk is marked as invalid and duplicated from another copy. But this is a slow check; depending on the number of chunks on your chunk server, it can take up to a month to check all the chunks (this is also a reason not to cram as many disks as possible into one chunk server, but to divide them among more chunk servers). The check is also performed on any chunk copy that is needed for i/o, so if a copy is corrupted, i/o is not performed on it. But this covers internal deterioration on a single chunk server.
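
To make the idea concrete, the background check amounts to something like the sketch below. The on-disk layout and the digest algorithm here are purely illustrative assumptions; the real MooseFS chunk format and checksum scheme are internal to the chunk server.

    # Sketch of a background integrity loop over chunk copies.
    # Illustrative only: assume each chunk copy is available as
    # (chunk_id, data, stored_digest); the real MooseFS on-disk layout differs.
    import hashlib
    import time

    def verify_copy(data: bytes, stored_digest: str) -> bool:
        # Recompute the digest of the whole copy and compare with what was stored.
        return hashlib.md5(data).hexdigest() == stored_digest

    def background_check(chunks, pause=1.0):
        """chunks: iterable of (chunk_id, data, stored_digest) tuples."""
        for chunk_id, data, stored_digest in chunks:
            if not verify_copy(data, stored_digest):
                # In MooseFS terms: mark the copy invalid so a healthy copy
                # is replicated instead of this one being served.
                print(f"chunk {chunk_id}: checksum mismatch, marking copy invalid")
            # Deliberately throttled so it does not compete with real i/o,
            # which is why a full pass can take weeks on a large chunk server.
            time.sleep(pause)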

In case of a chunk server machine failure (especially after a power failure, but basically after every failure), we always recommend checking: first the filesystem, with whatever tool is appropriate for what you are using, and then all the chunks with mfschunktool, to check against deterioration (basically doing manually, and all at once, the same check that the background loop performs slowly during normal chunk server operation). Any invalid chunks should be removed from the chunk folders (not deleted, but copied somewhere "to the side" until all checks are complete, then deleted). Then the chunk server process can be allowed to run and any undergoal chunks will be replicated from other copies.
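
For the "copy to the side" step, something as simple as the sketch below is enough. The list of invalid chunk paths would come from mfschunktool or your own check; the list file name and the quarantine directory are example names, not MooseFS conventions.

    # Sketch: move suspect chunk files out of the chunk folders into a
    # quarantine directory instead of deleting them right away.
    import os
    import shutil

    QUARANTINE = "/var/lib/mfs-quarantine"   # example location

    def quarantine_chunks(list_file="invalid_chunks.txt"):
        os.makedirs(QUARANTINE, exist_ok=True)
        with open(list_file) as fh:
            for line in fh:
                chunk_path = line.strip()
                if not chunk_path:
                    continue
                # Keep the original file name so the chunk can be identified
                # (and possibly restored) later.
                shutil.move(chunk_path,
                            os.path.join(QUARANTINE, os.path.basename(chunk_path)))

    if __name__ == "__main__":
        quarantine_chunks()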

In case of multiple machine failures we recommend you do exactly the same, one machine at a time. But if you are unlucky and had more machine failures than the goal of your files, you can end up with some chunks that have no valid copies. Then you have two choices: restore those files from another source, if that is possible; if not, you can try to manually check the content of the chunks you copied "to the side" and see if you can find one that, to your knowledge, is valid. Then you use mfschunktool to repair the checksums on this copy, stop one chunk server, add this chunk to one of its drives, remove the .chunkdb file from this drive (DON'T FORGET THIS STEP!) and start the chunk server. The manually restored copy will be read and other copies will be made, up to the goal defined in the storage class.

But one more thing can happen in case of a machine failure (and practically only then; unlike silent deterioration of single bytes, it should not happen during normal operation of your operating system): the filesystem that you are using can do a "rollback" on a file that was written during the crash. Then the copy of the chunk will "look valid", but will not match the copies on other chunk servers. This is why we:

  • recommend using the simplest filesystem possible (we'd rather miss one copy and have it replicated from another than have a copy that is older than the system thinks it is)
  • always say that we do not guarantee the integrity of files that were written during or immediately before a crash, especially one involving a power failure, and recommend verifying the files involved manually

Theoretically, because we write data first and checksums after, if a rollback happens immediately after a crash, it should only roll back the last operation, making the data not match the checksums (and this you can detect with mfschunktool). But that's theory... in our experience, we've seen filesystems do strange things. The most "nasty" one here, in our experience, is ZFS; ext4 and xfs are what we have had the best experiences with so far. But never say never, especially if lots of data was being written to disk when the crash happened. Your problems are most probably explainable by a rollback during fsck (or similar), or by some other "strange" behaviour of the underlying filesystem after the crash that resulted in "rolling back" some chunks. That's why the step "check the files that were written/modified during and directly before the crash" is also always necessary.
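
As a toy model of the "data first, checksum after" ordering and of what a rollback does to it (illustrative only, not the real chunk server code): if only the final checksum update is lost, a local check catches the mismatch; if the filesystem rolls the whole file back to an older but internally consistent state, the copy looks valid locally and only a comparison with the copies on other chunk servers reveals the problem.

    # Toy model: "write data first, checksum after" vs. a whole-file rollback.
    # Purely illustrative; not the real MooseFS on-disk format.
    import hashlib

    def write_block(chunk, data):
        chunk["data"] = data                                 # step 1: data
        chunk["digest"] = hashlib.md5(data).hexdigest()      # step 2: checksum

    def self_consistent(chunk):
        return hashlib.md5(chunk["data"]).hexdigest() == chunk["digest"]

    old = {"data": b"old"}
    old["digest"] = hashlib.md5(old["data"]).hexdigest()
    new = dict(old)
    write_block(new, b"new")

    # Crash between step 1 and step 2: data/checksum mismatch, detectable locally
    # (the kind of damage mfschunktool would flag).
    partial = {"data": b"new", "digest": old["digest"]}
    print(self_consistent(partial))                # False

    # Whole-file rollback to the old state: locally consistent, but stale,
    # so it no longer matches the copies on other chunk servers.
    rolled_back = dict(old)
    print(self_consistent(rolled_back))            # True  (looks valid)
    print(rolled_back["digest"] == new["digest"])  # False (differs from other copies)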


deltabweb commented on June 7, 2024

Hi @chogata, thanks for the very detailed answer.
I think this explains the last 2 files corrupted with Problem 2, so we can consider that problem closed.

But I do not think it explains Problem 1 (= empty chunks):
The chunks with 0 blocks are all on the host with IP 192.168.1.10; this host runs both mfsmaster and mfschunkserver, and it has never crashed, yet it has empty chunks that appear valid according to MooseFS.

Also, all these chunks have the same checksum, D41D8CD98F00B204E9800998ECF8427E, which makes it look like nothing was written and then this checksum was saved to disk?

How is that possible? I still can't understand that part.
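
For what it's worth, that value is exactly the MD5 of zero bytes of input (assuming the checksum digest reported by mfsfileinfo is an MD5 over the copy's data), which is why it looks to me like the copy was written as empty:

    # Quick check: D41D8CD98F00B204E9800998ECF8427E is the MD5 of empty input.
    import hashlib
    print(hashlib.md5(b"").hexdigest().upper())   # D41D8CD98F00B204E9800998ECF8427E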


chogata commented on June 7, 2024

Sorry for the late reply, I did not get a notification about a new message in this thread.

Empty chunks also have a checksum, so empty chunks with a valid checksum can exist. This in itself is not strange. What is strange is that you have different copies of the same chunk - one with data, one empty. If there is no hardware failure, this situation should absolutely NOT happen.

But... This reminds me of a problem one of our clients had some time ago. They also had many files that had valid (from their point of view) copies of chunks on some servers, and second copies of those chunks that were too short (some of them zero length, some longer, but all shorter than the copies on other servers). But they never had a crash of any of the machines in their system AND all the "too short" chunks were located on one machine. Exactly like in your case: the chunks with data are spread over different machines, the zero-length ones are all on one machine.

With this client, we did an investigation on a live system and it turned out their machine had a hardware problem. This is a very nasty thing that can happen if a machine has problems with its processor cache, and we have observed it only twice in our whole history: once quite a long time ago with yet another client, and the second time more recently with this client with the "too short" chunks.

Basically, the scenario is like this (at machine code level already, so we are not talking about variables in code but about specific memory addresses): one thread writes a value, let's say X, to memory; this part of memory also happens to be stored in the processor's cache. A second thread reads the value and gets X. Then the first thread writes something else, for example Y, and immediately afterwards (the switch between threads happens at this exact moment) the second thread tries to read that value... and still gets X.
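
Schematically, the access pattern looks like the sketch below (a minimal model only: on healthy hardware the second read must return Y, and a software test like this cannot actually reproduce the hardware fault; the thread roles and events are just for illustration).

    # Schematic of the access pattern described above. On healthy hardware the
    # second read returns Y; the faulty cache described here still returned X.
    import threading

    shared = {"value": None}
    wrote_x = threading.Event()
    read_x = threading.Event()
    wrote_y = threading.Event()

    def writer():
        shared["value"] = "X"      # first write
        wrote_x.set()
        read_x.wait()              # let the reader observe X first
        shared["value"] = "Y"      # second write
        wrote_y.set()

    def reader():
        wrote_x.wait()
        first = shared["value"]    # expected: X
        read_x.set()
        wrote_y.wait()
        second = shared["value"]   # expected: Y; the faulty cache still served X
        print(first, second)       # healthy hardware prints: X Y

    t1 = threading.Thread(target=writer)
    t2 = threading.Thread(target=reader)
    t1.start(); t2.start()
    t1.join(); t2.join()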

The first time we encountered this problem, it took us two days of tests to really believe what we were seeing, but the proof was undeniable. The fault lay with bad hardware: a two-processor configuration with problems syncing the caches of the two processors (thread one ran on a different processor than thread two, and while the write and the read referred to the same logical memory address, the two threads physically accessed data from different processor caches, and the caches sometimes did not manage to sync the data "fast enough"). The client switched to another machine and the problem disappeared.

The second time it was a single-processor configuration, but at our insistence the client checked their machine, and it turned out they had alerts that the temperature inside the chassis was "a bit higher than normal". They found that one of the processor fans was faulty. They replaced it, the temperature went back to normal and... chunks were again always recorded with the correct length.

So the only thing that comes to mind is to check your 192.168.1.10 machine, because it is very possible it has a RAM/processor cache problem, maybe because of some faulty hardware, maybe because it is not cooled well enough.


deltabweb commented on June 7, 2024

Interesting, I am almost sure that these empty chunks got created when some of my chunkservers were down and the whole cluster was writing a lot of large files.

192.168.1.10 being both the master and a chunkserver, it would have been under high load, and I wouldn't be surprised if the CPU temperature got pretty high at that time. I don't have the temperature history on hand, unfortunately. It is also running standard (non-ECC) RAM, so we can't exclude a memory issue either.

Threads getting different values from the cache is pretty unexpected stuff indeed. Good job on debugging that!

This explanation sounds plausible, so I'll monitor this machine closely, and we can consider this issue closed.
Thanks again for your help!

