I'm running a Pig job over a couple of sets of identical data, with the only differenc

different results with and without index about hadoop-lzo HOT 11 CLOSED

twitter commented on August 12, 2024

different results with and without index

from hadoop-lzo.

Comments (11)

kevinweil commented on August 12, 2024

We've seen this too just recently. I'll bet that you have 30 splits for your file -- can you verify that's true? I'm almost positive that this is an off-by-one end of split regression. We're on it. Please verify though so we can be sure it's the same bug.

Also, are you using the Pig LZO-based loaders? If so, it's likely an issue with elephant-bird. LMK and I can close this and re-open it there.

jakeo commented on August 12, 2024

Yes, I am using the elephant-bird LZO loaders for Pig. Thanks for the code btw, and sorry for logging the bug in the wrong place.

The file actually has 33 splits.

dvryaboy commented on August 12, 2024

This is an Elephant Bird bug, in LzoLineRecordReader. I pushed a fix to my fork at http://github.com/dvryaboy/elephant-bird . Jake, can you try that build and let us know if that fixes the issue you've observed?

jakeo commented on August 12, 2024

Hi Dmitriy, the fix didn't fix this particular issue, still returning 17214 records. I wish I could share my dataset with you, but I can't. Is there anything else I can provide to help?

dvryaboy commented on August 12, 2024

Huh. Ok that's weird. Just to make sure -- if you run hadoop cat | lzop -x - | wc -l on the compressed file, what number do you get?

jakeo commented on August 12, 2024

143313687

splunk-cwanek commented on August 12, 2024

What is the status of this issue? I believe I am running up against this while using hive.

I have a small, partitioned hive table. I have a single file inside each partition. I am running a simple "select count(1)" query against a single partition. Done without the index, I get the correct result. When I add the index, the count increases by one.

When I look at the actual differences in the output from a "select *" query, the extra result is mostly NULLs, and the first value is binary garbage.

I am using an older version of hadoop-lzo. I plan to upgrade to pick up another bug fix that's important to us, but thought I'd ask about this bug directly.

Thanks,
Charlie

dvryaboy commented on August 12, 2024

Charlie,
This really sounds like some LZO corruption going on. Please try upgrading to the latest version and let us know if the problem persists.

splunk-cwanek commented on August 12, 2024

No luck. I am having the same issue with hadoop-lzo 0.4.9. My hive query returns an incorrect result with the lzo index, and a correct result without.

The hive table/file is actually quite small at 744 lines and 350KB uncompressed. Block size is 64MB. It compresses to <17K. When I index the lzo file, select count(1) returns 745.

Am I even posting to the correct thread? It seems very similar to the behavior I'm seeing...

Thanks again,
Charlie

dvryaboy commented on August 12, 2024

Hi Charlie,
It sounds like there is some edge case that has to do with corrupt data being written to the end of a file (or possibly an uncompressed LZO block -- lzo does this thing where it won't compress data if it determines that compressed output would be bigger than uncompressed). It would be extremely helpful if you could share the data / queries needed to reproduce this error. Can you email me, my first name at Twitter.com?

splunk-cwanek commented on August 12, 2024

Thanks for helping me get this sorted. For the record, my table was created with a vanilla TextInputFormat, and I was using an ALTER TABLE command to change to DeprecatedLzoTextInputFormat. But this doesn't affect the preexisting partitions I had in my table, so the index file was getting processed as part of the table data through TextInputFormat, instead of being used to calculate splits as DeprecatedLzoTextInputFormat would have.
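
For anyone who lands here with the same symptom, the off-by-one count can be illustrated with a small Python sketch. It is only a rough model, not the actual Hive or hadoop-lzo code: it assumes the index file is a short run of 8-byte binary offsets with no newline in it, and that a plain text reader yields one record per newline plus one for any trailing unterminated bytes. The offsets, the row contents, and the helper names are all made up for illustration.

```python
import struct


def build_index(block_offsets):
    """Pack block start offsets as consecutive 8-byte big-endian integers.

    Assumption: this mimics the general shape of an LZO index file.
    """
    return b"".join(struct.pack(">q", off) for off in block_offsets)


def count_text_records(data):
    """Count records like a simple newline-delimited reader:
    one per newline, plus one for a trailing unterminated fragment."""
    if not data:
        return 0
    records = data.count(b"\n")
    if not data.endswith(b"\n"):
        records += 1
    return records


# Hypothetical stand-in for the decompressed table data: 744 text lines.
table_data = b"".join(b"row %d\tvalue\n" % i for i in range(744))

# Hypothetical block offsets; the real values depend on the file.
index_bytes = build_index([38, 9000])

rows_from_data = count_text_records(table_data)
rows_from_index = count_text_records(index_bytes)

# The index bytes contain no newline, so a text reader sees them as one
# extra "row" of binary garbage next to the real data.
print(rows_from_data, rows_from_data + rows_from_index)
```

Under those assumptions the index contributes exactly one extra record, which lines up with a count of 745 instead of 744 and with the extra row showing binary garbage followed by NULLs.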
