
Comments (10)

Xorlev commented on August 31, 2024

Hi Yong,

No, it's not possible that indexing changed your files. The whole operation is read only. Indexing reads your sstable index files and writes new "index index" files.

The "split" step doesn't change the files, it uses the index index files to find offsets into them and passes that along to the mappers as InputSplits. That being said, it is possible that the code is skipping over data somewhere, so we're very interested to figure out where that's coming from.
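
Roughly speaking, split generation walks the sorted row offsets recovered from the index and closes a split whenever the accumulated span exceeds the target split size. Here is a minimal sketch of that idea (hypothetical names only, not the project's actual code):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: cutting (start, end) split ranges from sorted row offsets.
    public class SplitSketch {
        public static List<long[]> splitsFor(long[] rowOffsets, long targetSplitSize) {
            List<long[]> splits = new ArrayList<long[]>();
            long start = rowOffsets[0];
            for (int i = 1; i < rowOffsets.length; i++) {
                if (rowOffsets[i] - start > targetSplitSize) {
                    // A split ends at the offset of its last row; the next split starts
                    // at the following row's offset, which is why gaps appear between splits.
                    splits.add(new long[]{start, rowOffsets[i - 1]});
                    start = rowOffsets[i];
                }
            }
            splits.add(new long[]{start, rowOffsets[rowOffsets.length - 1]});
            return splits;
        }
    }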

Do you have any idea if those 9.2M records were different in any way?

java8964 commented on August 31, 2024

I can reproduce the count difference using a single SSTable, so I'd like your help tracking down this issue.

Our data uses composite keys and composite column names. I picked one example SSTable whose -Data.db file is about 1,190,323,388 bytes, along with its -CompressionInfo.db, -Index.db, -Filter.db, -Summary.db, and -Statistics.db files.

Parsing the -Data.db file with sstable2json reports 195238 row keys. Iterating the columns of those rows yields 46167243 columns.

Now, I built the index for the above data set. After that, I wrote the following unit test code:

    public void testSSTableRowInputFormat() throws Exception {
        long keyCnt = 0;
        long recordCnt = 0;
        Properties props = new Properties();
        props.load(this.getClass().getClassLoader().getResourceAsStream("t.properties"));
        Configuration conf = new Configuration();
        conf.set(HadoopSSTableConstants.HADOOP_SSTABLE_CQL, props.getProperty("cassandra.table.ddl"));
        Job job = new Job(conf);
        SSTableRowInputFormat ssTableRowInputFormat = new SSTableRowInputFormat();
        ssTableRowInputFormat.addInputPath(job, new Path("/folder/"));
        // Walk every split, counting row keys and the columns within each row.
        for (InputSplit inputSplit : ssTableRowInputFormat.getSplits(job)) {
            SSTableSplit sstableSplit = (SSTableSplit) inputSplit;
            TaskAttemptContext context = new TaskAttemptContext(conf, TaskAttemptID.forName("attempt_200707121733_0001_m_000000_0"));
            RecordReader<ByteBuffer, SSTableIdentityIterator> recordReader = ssTableRowInputFormat.createRecordReader(inputSplit, context);
            recordReader.initialize(inputSplit, context);
            while (recordReader.nextKeyValue()) {
                keyCnt++;
                SSTableIdentityIterator sii = recordReader.getCurrentValue();
                while (sii.hasNext()) {
                    recordCnt++;
                    sii.next();
                }
            }
        }
        // These should match sstable2json: 195238 row keys and 46167243 columns.
        System.out.println("keyCnt = " + keyCnt);
        System.out.println("recordCnt = " + recordCnt);
    }

The output is
keyCnt = 195234
recordCnt = 46167221

So with hadoop-sstable it looks like I lost 4 row keys and 22 columns. I have the -Index.db and -Index.db.Index files, but I'm not sure how you use the index files internally to generate the splits and parse the -Data.db file. Any help debugging this issue?

Thanks

java8964 commented on August 31, 2024

I printed split.getStart() + ":" + split.getEnd() in the split loop; here is the output:

0:1073769762
1073771138:2147788988
2147789179:3222478600
3222478734:3485271831

It looks like the file is divided into 4 splits. Here are my questions:

  1. The -Data.db file itself is only 1,190,323,388 bytes long, so why do the split offsets reach 3,485,271,831?
  2. There are gaps between consecutive splits. Is that normal? For example, the first split ends at 1,073,769,762, but the second split starts at 1,073,771,138.

Thanks

Yong

bvanberg commented on August 31, 2024

Hi Yong,

Good question.

This has to do with compression. -Data.db is a compressed file. -Index.db is an index into the uncompressed data. Splits are generated from the -Index.db file.

Because the splits are indices into the uncompressed data, it follows that the data must be decompressed to leverage the splits. This is where the -CompressionInfo.db file comes in. This file contains information about the compressed blocks in the -Data.db file. This allows us to read the compressed data file as if it were uncompressed. Clear as mud? Fortunately we don't have to worry about these details as the C* i/o code handles the decompression for us and we just read the files as if they were uncompressed.
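
For intuition, -CompressionInfo.db boils down to a list of compressed-chunk offsets plus a fixed uncompressed chunk length, so mapping an uncompressed position to the chunk that holds it is simple arithmetic. A conceptual sketch (hypothetical names; in practice the C* reader code does this for you):

    // Conceptual sketch of the chunk lookup behind -CompressionInfo.db (hypothetical names).
    public class ChunkLookupSketch {
        // Offset in -Data.db of the compressed chunk containing the uncompressed position.
        public static long compressedChunkOffset(long uncompressedPos, long[] chunkOffsets, int chunkLength) {
            int chunkIndex = (int) (uncompressedPos / chunkLength);
            return chunkOffsets[chunkIndex];
        }

        // Where the requested byte lives inside the decompressed chunk.
        public static int offsetWithinChunk(long uncompressedPos, int chunkLength) {
            return (int) (uncompressedPos % chunkLength);
        }
    }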

Given all of that, your splits should map to valid indices found in your -Index.db files. If you suspect that the -Index.db.index files are somehow incorrect, you can validate against the -Index.db directly, but not against the -Data.db file.

bvanberg commented on August 31, 2024

Additionally, the gap is normal. This is because we are generating splits from the Index.db which has a bunch of offsets into the data. If you inspect the Index.db you'll find that the splits account for all of the offsets contained within.

java8964 commented on August 31, 2024

Do you have any hint as to why those 4 row keys are not returned by the SSTableRowInputFormat, or what additional steps I can take to see why these 4 row keys are missed?

Thanks

bvanberg commented on August 31, 2024

Given that you are short 4 row keys and you generated 4 splits, there could be an issue there. You should be able to validate that your splits fully cover your Index.db offsets, i.e. that every offset contained within Index.db is accounted for by the split ranges.
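
One way to run that check, as a rough sketch (it assumes you have already dumped the row offsets from -Index.db into an array, and the split ranges into [start, end] pairs):

    import java.util.List;

    // Sketch: verify every -Index.db row offset falls inside some split's [start, end] range.
    public class SplitCoverageCheck {
        public static boolean splitsCoverAllOffsets(long[] indexOffsets, List<long[]> splits) {
            for (long offset : indexOffsets) {
                boolean covered = false;
                for (long[] split : splits) {
                    if (offset >= split[0] && offset <= split[1]) {
                        covered = true;
                        break;
                    }
                }
                if (!covered) {
                    System.out.println("Offset not covered by any split: " + offset);
                    return false;
                }
            }
            return true;
        }
    }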

java8964 commented on August 31, 2024

Hi,

After some debugging, I think I have identified the bug.

In the file
https://github.com/fullcontact/hadoop-sstable/blob/master/sstable-core/src/main/java/com/fullcontact/sstable/hadoop/mapreduce/SSTableRecordReader.java

On line 115 of that file, it should be

    protected boolean hasMore() {
        return reader.getFilePointer() <= split.getEnd();
    }

instead of

    protected boolean hasMore() {
        return reader.getFilePointer() < split.getEnd();
    }

The reason is that when the code generates the splits, the gaps between split boundaries are fine, but there is one row key whose data sits between a split's end and the next split's start. With '<' instead of '<=' in the hasMore() logic, that row key is lost at the end of each split.

My example data uses the default 1G split size, which produces these splits:
0:1073769762
1073771138:2147788988
2147789179:3222478600
3222478734:3485271831

When the file pointer reaches 1073769762, hasMore() with '<' returns false, so we lose the row key that sits exactly between 1073769762 and 1073771138.
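
To make the boundary condition concrete, here is a toy check using the first split's end offset from the output above (the real comparison lives in SSTableRecordReader.hasMore()):

    // Toy illustration of the off-by-one at a split boundary.
    public class BoundarySketch {
        public static void main(String[] args) {
            long splitEnd = 1073769762L;   // end of the first split
            long rowOffset = 1073769762L;  // file pointer at the start of that split's last row

            System.out.println("hasMore with '<'  : " + (rowOffset < splitEnd));   // false -> row skipped
            System.out.println("hasMore with '<=' : " + (rowOffset <= splitEnd));  // true  -> row read
        }
    }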
I want to write a unit test for the SSTableRecordReader class, and I am happy to submit a pull request with it, if you can give me the CQL for the test data SSTable files under /data/Keyspace1-Standard1-ic-0-xxx.

Thanks

bvanberg commented on August 31, 2024

This is great, thanks. Feel free to PR this at your leisure.

bvanberg commented on August 31, 2024

I applied this fix in a recent PR. Thanks again!

