Comments (29)

ajkr commented on August 20, 2024

Perhaps it can be migrated away by rewriting the SSTs in all non-singleton "compaction units" so that the user key doesn't straddle any more.

I don't think that's enough. We'd also need to migrate all files whose range tombstones extend outside the file's key range. For example, in the #27 case we had:

1: a-c
3: c-d d-e

Here the L1 file contained [a, e)@1. We need to migrate it to have the range tombstone [a, c)@1. Otherwise the subsequent parts of the test case will still encounter the same failure mode.

My fear is that the migration logic will end up needing to truncate tombstones at compaction unit boundaries, effectively doing the same thing we're talking about doing in compaction. So maybe just stick with the original plan.

petermattis commented on August 20, 2024

Nice tests. Feel free to open a PR, even if the tests are broken. We can either skip them or comment them out until they are fixed.

petermattis commented on August 20, 2024

For iterator seek, it should be enough to limit the seek over shadowed data to the sstable boundaries.

I've been thinking a bit about how to do this. I don't have it completely worked out, but I agree it seems doable.
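Roughly, the idea might look like the following sketch (hypothetical helper, not pebble's actual iterator code, and the details here aren't worked out either): the tombstone justifies skipping over shadowed data, but only up to the boundary of the sstable it was read from, since its extent can't be trusted past that point.

package main

import (
    "bytes"
    "fmt"
)

// skipTarget returns how far a seek may skip over data shadowed by a
// range tombstone ending at tombstoneEnd: no further than the largest
// key of the sstable the tombstone was read from.
func skipTarget(tombstoneEnd, fileLargest []byte) []byte {
    if bytes.Compare(tombstoneEnd, fileLargest) > 0 {
        return fileLargest
    }
    return tombstoneEnd
}

func main() {
    // A tombstone ending at e, read from an sstable whose largest key
    // is c: the skip must stop at c.
    fmt.Printf("%s\n", skipTarget([]byte("e"), []byte("c"))) // c
}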

For writing compaction outputs it is a bit more difficult. We need to truncate tombstones before writing them but can't simply use the input sstables' boundaries due to cases like facebook/rocksdb#4432 (comment).

Do you mind expanding on the problem here? My thinking (and the current pebble code) is to not truncate the tombstones during compaction. There are test cases in testdata/compaction_iter which cover this. For example:

define
a.SET.1:a
b.RANGEDEL.2:d
c.SET.4:d
----

iter
first
next
tombstones c
tombstones
----
a#1,1:a
c#4,1:d
b-d#2
.
c-d#2
.

The call to tombstones c simulates ending an sstable at c and flushing the active tombstones. Notice that the tombstone b-d#2 is not truncated in the first sstable, though it is truncated on the start key in the second.
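A simplified model of what these cases exercise (sketch types, not pebble's actual writer code): when the sstable is cut at a split key, the full tombstone goes into the current table, and only the fragment carried into the next table has its start key advanced.

package main

import (
    "bytes"
    "fmt"
)

// tombstone is a simplified range deletion [start, end) at a seqnum.
type tombstone struct {
    start, end []byte
    seqnum     uint64
}

// flushAt models the test above: when an sstable is cut at key split,
// the active tombstone is written to the current table untruncated; if
// it extends past the split point, the next table starts with the
// start-truncated remainder.
func flushAt(t tombstone, split []byte) (cur tombstone, rest *tombstone) {
    cur = t // first sstable keeps the full tombstone
    if bytes.Compare(split, t.end) < 0 {
        r := t
        if bytes.Compare(split, r.start) > 0 {
            r.start = split // truncate only the start key
        }
        rest = &r
    }
    return cur, rest
}

func main() {
    t := tombstone{start: []byte("b"), end: []byte("d"), seqnum: 2}
    cur, rest := flushAt(t, []byte("c"))
    fmt.Printf("first sstable:  %s-%s#%d\n", cur.start, cur.end, cur.seqnum)    // b-d#2
    fmt.Printf("second sstable: %s-%s#%d\n", rest.start, rest.end, rest.seqnum) // c-d#2
}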

ajkr commented on August 20, 2024

The problem I'm thinking of happens if the tombstones that are untruncated during compaction can also extend the file boundaries. In the second test case, notice how the compact a-b L1 step takes as input a 1: a-c file and produces a 2: a-e file. That causes the range tombstone contained in the file to delete a wider range after compaction than it deleted before the compaction. A problem occurs if a key in (c, e] that is newer than the range tombstone gets compacted to L3+ before the L1->L2 compaction happens. Then the newer key will be shadowed by the L2 file after compaction happens. The test case provides one example.

petermattis commented on August 20, 2024

Ah, got it. Missed the detail that the second compaction expanded the table boundaries. I agree that truncating to compaction unit boundaries will need to be done. I think it also implies that if a user-key straddles two tables in a level, we definitely need to compact both tables as a unit. I was previously thinking we could compact the right table separately because it contains older keys than the left table, but untruncated range tombstones foul this up.

petermattis commented on August 20, 2024

Was this fixed by #62, or is there something still to be done about compactions?

ajkr commented on August 20, 2024

The compaction issue still exists. Let me update #27 so it exposes the current known problem.

ajkr commented on August 20, 2024

OK it is up to date.

ajkr commented on August 20, 2024

@petermattis I remember we discussed a while ago whether it's necessary for multiple versions of a user key to be spread across multiple sstables. IIRC we concluded probably not, but Pebble will still need to read files like that because RocksDB generates them. But does Pebble generate them too? If not, this problem is a bit easier to solve, because in compaction we can limit a range tombstone's extension to the first key in the next file. In RocksDB we couldn't do that, because the first key in the next file might be the same as the last key in the current file, in which case limiting the range tombstone in this way would lose info about whether the current file's last key is covered.

petermattis commented on August 20, 2024

But does Pebble generate them too?

I'm pretty sure Pebble generates sstables like that currently, but it doesn't seem difficult to prohibit that. The inner compaction loop is https://github.com/petermattis/pebble/blob/master/compaction.go#L847-L866. Keeping track of the previous key added to an sstable in that loop is slightly redundant because the sstable writer already does that. Perhaps we can expose that. See sstable.Writer.block.curKey.
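A sketch of such a guard (assumed names, not the actual loop): only finish the current output table once it is full and the upcoming user key differs from the one just written.

package main

import (
    "bytes"
    "fmt"
)

// shouldFinishOutput sketches the proposed rule for the inner
// compaction loop: an output sstable may only be finished at a point
// where the next key's user key differs from the last user key
// written, so all versions of one user key stay in a single sstable.
func shouldFinishOutput(estimatedSize, targetSize int, lastUserKey, nextUserKey []byte) bool {
    if estimatedSize < targetSize {
        return false // table not full yet
    }
    // Never split between two versions of the same user key.
    return !bytes.Equal(lastUserKey, nextUserKey)
}

func main() {
    // Full table, but c@40 and c@20 share a user key: keep writing.
    fmt.Println(shouldFinishOutput(64<<20, 64<<20, []byte("c"), []byte("c"))) // false
    // Full table and the user key changes: safe to cut here.
    fmt.Println(shouldFinishOutput(64<<20, 64<<20, []byte("c"), []byte("d"))) // true
}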

I recall thinking that it didn't make sense to split multiple versions of a user key across sstables. Doing so leads to "atomic compaction units" which seem no better than simply allowing a larger sstable. Is my memory of this correct?

ajkr commented on August 20, 2024

I recall thinking that it didn't make sense to split multiple versions of a user key across sstables. Doing so leads to "atomic compaction units" which seem no better than simply allowing a larger sstable. Is my memory of this correct?

Yes! With this change an "atomic compaction unit" will always be a single sstable, and then it'll be a lot easier to truncate range tombstones at atomic compaction unit boundaries during compaction.

petermattis commented on August 20, 2024

SGTM. Do you foresee any difficulties in supporting RocksDB generated databases which wouldn't have this invariant?

ajkr commented on August 20, 2024

Will draw a few examples to recall whether we need to truncate according to the input files' compaction unit boundaries or the output files' compaction unit boundaries. If it's the input there'll be a problem.

ajkr commented on August 20, 2024

OK, reviewed it a bit, and I think we need to truncate based on the input files' atomic compaction unit boundaries. That's because both RocksDB and Pebble currently write out tombstones that are not truncated to the output file's compaction unit boundaries. And that's to be expected: when we finalize a file's tombstones, we don't yet know where that file's compaction unit's upper bound will be. (Although it'd be quite convenient if the upper bound were always the first key in the next file, so yeah, I guess this is necessary for RocksDB compatibility.)
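For concreteness, computing an input file's atomic compaction unit could look roughly like this (hypothetical metadata struct; pebble's real file metadata differs): adjacent files are pulled in while they share a boundary user key.

package main

import (
    "bytes"
    "fmt"
)

// fileMeta is a stand-in for per-sstable metadata: the user keys of
// the file's smallest and largest entries.
type fileMeta struct {
    smallest, largest []byte
}

// atomicUnit returns the [lo, hi] index range of the atomic compaction
// unit containing files[i] within a sorted level: neighbors are pulled
// in while boundary user keys match.
func atomicUnit(files []fileMeta, i int) (lo, hi int) {
    lo, hi = i, i
    for lo > 0 && bytes.Equal(files[lo-1].largest, files[lo].smallest) {
        lo--
    }
    for hi+1 < len(files) && bytes.Equal(files[hi].largest, files[hi+1].smallest) {
        hi++
    }
    return lo, hi
}

func main() {
    // The first file ends at user key c and the second begins at user
    // key c (e.g. c@40 and c@30), so they form one atomic unit.
    level := []fileMeta{
        {smallest: []byte("a"), largest: []byte("c")},
        {smallest: []byte("c"), largest: []byte("e")},
        {smallest: []byte("f"), largest: []byte("g")},
    }
    lo, hi := atomicUnit(level, 0)
    fmt.Println(lo, hi) // 0 1
}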

petermattis commented on August 20, 2024

Do you mind explaining that again with a diagram? Probably because it is evening here, but that explanation went over my head.

ajkr commented on August 20, 2024

Sure, let me try again, my previous explanation was too lazy. There are like three parts to this "proof". It is still not very clear, in my opinion -- sorry for that.

  1. RocksDB truncates range tombstones to the atomic compaction unit boundaries at read time.
  2. We need to truncate range tombstones such that they don't extend outside their containing atomic compaction unit in order to fix the test case in #27. I bet we could achieve this either by the write-time mechanism above (making an "atomic compaction unit" always a single sstable and truncating to the next file's first key), or at read time like RocksDB.
  3. Since RocksDB does not do the truncation at write time, we cannot assume our inputs are safely truncated to atomic compaction unit boundaries. So we must do it at read time (sketched after the example below).

Here's a bit of guessing why 1) is the case. Consider:

  • L1 has SST1 with a#20 and c#40. L1 also has SST2 with c#30 and e#50. There is a range tombstone [b, d)#60 spanning both SSTs.
  • L2 has SST3 with b#10.
  • Let's say the DB has enough snapshots such that no deleted keys are dropped during the compaction.
  • L1->L2 compaction happens. SST1 and SST2 are both picked since they're an atomic unit. SST3 is also picked because it overlaps.
    • The first file written gets point keys a#20, b#10, and c#40.
    • The compaction iterator is positioned at c#30.
    • We need to write out the range tombstone so we can close out the first file, but where should we truncate it? If we truncate at c we lose the knowledge that it covers c#40. We could look ahead on the compaction iterator until we find the next distinct user key and use that as a truncation point, but that wasn't implemented. So the best we can do with the knowledge we have is dump the tombstone [b, d)#60 in its entirety, not knowing whether it'll end up spanning multiple atomic compaction units or not.
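Roughly, the read-time truncation in 1) amounts to something like the following (simplified types; this treats the unit's upper bound as exclusive, glossing over the inclusive-largest-key handling a real implementation needs):

package main

import (
    "bytes"
    "fmt"
)

// tombstone is a simplified range deletion [start, end) at a seqnum.
type tombstone struct {
    start, end []byte
    seqnum     uint64
}

// truncateToUnit clips a tombstone to the boundaries of the atomic
// compaction unit it was read from, so the tombstone cannot delete
// anything outside that unit.
func truncateToUnit(t tombstone, unitLower, unitUpper []byte) tombstone {
    if bytes.Compare(t.start, unitLower) < 0 {
        t.start = unitLower
    }
    if bytes.Compare(t.end, unitUpper) > 0 {
        t.end = unitUpper
    }
    return t
}

func main() {
    // A tombstone [b, g)#60 read from a unit spanning a-e is clipped
    // to [b, e)#60 before it is applied.
    t := tombstone{start: []byte("b"), end: []byte("g"), seqnum: 60}
    tt := truncateToUnit(t, []byte("a"), []byte("e"))
    fmt.Printf("[%s, %s)#%d\n", tt.start, tt.end, tt.seqnum) // [b, e)#60
}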

ajkr commented on August 20, 2024

Here is some more illustration on how truncating tombstones can solve #27 (step 2. above).

The first interesting step is here:

compact a-b
----
1: a-c c-e
2: d-d

With the current approach, the first L1 file a-c holds the full range tombstone [a, e). If we did truncation at write time, the tombstone could be truncated to [a, c). Note the point key for c is in the second L1 file c-e, so it should be fine to leave c out of the first file's tombstone's range.

Omitting a couple steps, this is the next interesting one:

compact a-b L1
----
2: a-e
3: c-d d-e

Now, the file we wrote to L2 got expanded due to the tombstone [a, e). If we had done the write-time truncation mentioned above, it wouldn't have happened -- we'd just get a file a-c with tombstone [a, c). Even if we hadn't done that and had done read-time truncation instead, we still would have written tombstone [a, c) to an L2 file a-c, because the input file's compaction unit was a-c.

The verification step fails because it finds that the range tombstone in L2 covers the point key c in L3, which it shouldn't. If the L2 file had been a-c instead, this would not have happened.

This example shows we need to do truncation, but IMO it doesn't quite prove the truncation has to be on atomic compaction unit boundaries. For that there are even more painfully complicated cases in our discussion in facebook/rocksdb#4432 (comment).

petermattis commented on August 20, 2024

Thanks for the explanations. I agree with your conclusion that we need to do the truncation at read time for compatibility with RocksDB generated sstables. We should try and do this soonish while the details of this problem are fresh in both of our minds.

tbg commented on August 20, 2024

Just spectating, what's a "compaction unit"? From sleuthing, I think I gather that some SSTables have to be jointly compacted because they contain overlapping user keys (i.e. one SST has right boundary c@30 while the next one has c@40 or even c@20). Is that all there is to it?

petermattis commented on August 20, 2024

Just spectating, what's a "compaction unit"? From sleuthing, I think I gather that some SSTables have to be jointly compacted because they contain overlapping user keys (i.e. one SST has right boundary c@30 while the next one has c@40). Is that all there is to it?

Yes, that is all there is to it. Prior to range tombstones, it was fine to compact such sstables independently (I think). As we were discussing above, it would make sense to simply include all of the versions for a particular user key in a single sstable, but there are now compatibility issues to think about.

tbg commented on August 20, 2024

but there are now compatibility issues to think about.

Maybe I'm naive, but one would hope (?) that the situation in which RocksDB actually spreads user keys across SSTs is pretty rare. Perhaps it can be migrated away by rewriting the SSTs in all non-singleton "compaction units" so that the user key doesn't straddle any more. Hopefully this is usually a no-op, and this issue would stop forcing its way into pebble's design.

ajkr commented on August 20, 2024

Let me think if that can work. If we can migrate away from non-singleton compaction units while opening the DB, that'd be great. But there might be cases where the existing data cannot be safely rewritten without this truncation logic. Will think. (Also note pebble currently can spread versions of the same user key across sstables, so that'd need to change too.)

tbg commented on August 20, 2024

For my edification, how do user keys end up straddling ssts anyway? We write c@20 in memtable 1, which gets flushed and c@20 happens to be the smallest key, then later we write c@40 into memtable 2, and it happens to be the largest key and so we'll end up with ssts [*, c@40] [c@20,*]? Isn't bending over backwards necessary to avoid this?

ajkr commented on August 20, 2024

It happens during compaction, not really during flush. Let's say the input files have key c@40 and key c@20. So a compaction could be writing the first output SST, and it just so happens to fill up after it writes c@40. Then the next file has to start with c@20 (assuming there is a snapshot between seqnums 20 and 39 that prevents us from simply dropping c@20).

ajkr commented on August 20, 2024

To clarify, overlapping endpoints could also happen in flush in the way you described. But we don't see range tombstone related problems there, at least not yet. The problems happen in L1+ files due to their special characteristic where the range tombstone endpoints may extend outside the file boundaries.

Using the example just above, imagine there were a range tombstone [b, d)@50 involved in that compaction. We write the first output file that ends at c@40 and then need to populate its range tombstone meta-block. We can't truncate [b, d)@50 to [b, c)@50 because then we lose the info that c is covered. So we just write out the full tombstone even though that file's upper-bound is still user key c.

EDIT: sorry about @50, will use @60 or something else in my examples next time.

ajkr commented on August 20, 2024

Or we could fail to open DBs containing files that (1) have overlapping endpoints with an adjacent file and (2) contain range tombstones extending outside a file satisfying (1). In the rare (?) case that this happens, users can go back to RocksDB and run a full compaction. Then we're guaranteed no files have overlapping keys, because snapshots must've been released when they restarted their DB, and compaction to the bottom level eliminates any duplicate versions of keys.
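The open-time check could look roughly like this (hypothetical metadata; the flag for condition (2) would have to be derived from each file's range tombstones):

package main

import (
    "bytes"
    "errors"
    "fmt"
)

// fileMeta is hypothetical per-sstable metadata: boundary user keys
// plus whether any contained range tombstone extends past them.
type fileMeta struct {
    smallest, largest      []byte
    tombstoneExceedsBounds bool
}

// checkLevel rejects a level if two adjacent files share a boundary
// user key (a non-singleton compaction unit) and either file carries a
// range tombstone extending outside its own boundaries.
func checkLevel(files []fileMeta) error {
    for i := 1; i < len(files); i++ {
        if bytes.Equal(files[i-1].largest, files[i].smallest) &&
            (files[i-1].tombstoneExceedsBounds || files[i].tombstoneExceedsBounds) {
            return errors.New("non-singleton compaction unit with untruncated range tombstone; run a full compaction in RocksDB first")
        }
    }
    return nil
}

func main() {
    // Two files overlapping at user key c, the first with a tombstone
    // extending past its bounds: the DB would be rejected.
    level := []fileMeta{
        {smallest: []byte("a"), largest: []byte("c"), tombstoneExceedsBounds: true},
        {smallest: []byte("c"), largest: []byte("e")},
    }
    fmt.Println(checkLevel(level))
}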

The reason I'm bringing this up again is because, after attempting to solve it in Pebble, I don't think there is a good way to make it any simpler than it is in RocksDB.

petermattis commented on August 20, 2024

Or we could fail to open DBs containing files that (1) have overlapping endpoints with an adjacent file and (2) contain range tombstones extending outside a file satisfying (1).

I wonder how often these cases occur. Running a full compaction on RocksDB would make for a much more difficult upgrade cycle, as a full compaction can take a very long time. Perhaps we can call into RocksDB just to compact the offending files. Is that what you were suggesting?

The reason I'm bringing this up again is because, after attempting to solve it in Pebble, I don't think there is a good way to make it any simpler than it is in RocksDB.

Can you link to the code in RocksDB which does this? I recall reviewing it at some point, but don't recall exactly where it is.

ajkr commented on August 20, 2024

Perhaps we can call into RocksDB just to compact the offending files. Is that what you were suggesting?

I didn't think of that, but it seems that'd also make sure all compaction units are singletons, as any duplicate key versions should be eliminated. Maybe there are some edge cases with merge, but we can think through those later if we want to go that route. I shouldn't give up on the original plan so easily. Will still keep trying to submit a PR.

Can you link to the code in RocksDB which does this?

It was part of this PR: facebook/rocksdb#4432

petermattis commented on August 20, 2024

Would it be sufficient to ensure that the boundaries for a compaction never exceed the boundaries of the input sstables? In the scenario in #27, the problematic step is when we compact a-b in L1:

1: a#2,1-c#72057594037927935,15
3: c#3,1-d#72057594037927935,15 d#0,15-e#72057594037927935,15

(I'm including the sequence numbers in the keys for clarity). After the compaction we have:

2: a#2,1-e#72057594037927935,15
3: c#3,1-d#72057594037927935,15 d#0,15-e#72057594037927935,15

But we should have the invariant that the output range for a compaction can never exceed the input range. That invariant sounds straightforward to maintain and I think it would fix the bug highlighted in #27. Are there other scenarios for which this wouldn't be a sufficient fix?
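Maintaining that invariant when finishing an output file could be as simple as clamping its boundaries (a sketch; real boundaries are internal keys with seqnums and kinds, not bare user keys):

package main

import (
    "bytes"
    "fmt"
)

// clampBounds shrinks a compaction output file's boundaries so they
// never exceed the overall boundaries of the compaction's inputs,
// preventing an untruncated tombstone from widening the output.
func clampBounds(outSmallest, outLargest, inSmallest, inLargest []byte) (smallest, largest []byte) {
    smallest, largest = outSmallest, outLargest
    if bytes.Compare(smallest, inSmallest) < 0 {
        smallest = inSmallest
    }
    if bytes.Compare(largest, inLargest) > 0 {
        largest = inLargest
    }
    return smallest, largest
}

func main() {
    // The tombstone would have produced an a-e output, but the inputs
    // only span a-c, so the output is clamped back to a-c.
    s, l := clampBounds([]byte("a"), []byte("e"), []byte("a"), []byte("c"))
    fmt.Printf("%s-%s\n", s, l) // a-c
}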
