
Comments (11)

kevinweil commented on August 12, 2024

We've seen this too just recently. I'll bet that you have 30 splits for your file -- can you verify that's true? I'm almost positive that this is an off-by-one end-of-split regression. We're on it. Please verify, though, so we can be sure it's the same bug.

Also, are you using the Pig LZO-based loaders? If so, it's likely an issue with elephant-bird. LMK and I can close this and re-open it there.

from hadoop-lzo.

jakeo commented on August 12, 2024

Yes, I am using the elephant-bird LZO loaders for Pig. Thanks for the code btw, and sorry for logging the bug in the wrong place.

The file actually has 33 splits.

dvryaboy commented on August 12, 2024

This is an Elephant Bird bug, in LzoLineRecordReader. I pushed a fix to my fork at http://github.com/dvryaboy/elephant-bird . Jake, can you try that build and let us know if that fixes the issue you've observed?

jakeo commented on August 12, 2024

Hi Dmitriy, the fix didn't resolve this particular issue; it is still returning 17214 records. I wish I could share my dataset with you, but I can't. Is there anything else I can provide to help?

dvryaboy commented on August 12, 2024

Huh. OK, that's weird. Just to make sure -- if you run hadoop fs -cat <file> | lzop -x - | wc -l on the compressed file, what number do you get?

jakeo commented on August 12, 2024

143313687

splunk-cwanek commented on August 12, 2024

What is the status of this issue? I believe I am running up against this while using hive.

I have a small, partitioned hive table. I have a single file inside each partition. I am running a simple "select count(1)" query against a single partition. Done without the index, I get the correct result. When I add the index, the count increases by one.

When I look at the actual differences in the output from a "select *" query, the extra result seems to be mostly NULLs, and its first value is binary garbage.

I am using an older version of hadoop-lzo. I plan to upgrade to pick up another bug fix that's important to us, but thought I'd ask about this bug directly.

Thanks,
Charlie

dvryaboy commented on August 12, 2024

Charlie,
This really sounds like some LZO corruption is going on. Please try upgrading to the latest version and let us know if the problem persists.

splunk-cwanek commented on August 12, 2024

No luck. I am having the same issue with hadoop-lzo 0.4.9. My hive query returns an incorrect result with the lzo index, and a correct result without.

The hive table/file is actually quite small at 744 lines and 350KB uncompressed. Block size is 64MB. It compresses to <17K. When I index the lzo file, select count(1) returns 745.

Am I even posting to the correct thread? It seems very similar to the behavior I'm seeing...

Thanks again,
Charlie

dvryaboy commented on August 12, 2024

Hi Charlie,
It sounds like there is some edge case that has to do with corrupt data being written to the end of a file (or possibly an uncompressed LZO block -- lzo does this thing where it won't compress data if it determines that compressed output would be bigger than uncompressed). It would be extremely helpful if you could share the data / queries needed to reproduce this error. Can you email me, my first name at Twitter.com?

splunk-cwanek commented on August 12, 2024

Thanks for helping me get this sorted. For the record, my table was created with a vanilla TextInputFormat, and I was using an ALTER TABLE command to change to DeprecatedLzoTextInputFormat. But this doesn't affect the preexisting partitions I had in my table, so the index file was getting processed as part of the table data through TextInputFormat, instead of being used to calculate splits as DeprecatedLzoTextInputFormat would have.
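
For anyone landing here with the same symptom, the effect described above can be imitated with a toy model in Python. This is only a sketch, not code from hadoop-lzo or Hive: the function count_text_records, the file names, and the offsets 38 and 9000 are all made up for illustration, and it assumes the index file is a short run of 8-byte big-endian block offsets with no newline bytes in it. A plain line reader then sees the index as one extra "line" of binary garbage, which would explain a count of 745 instead of 744.

```python
import struct


def count_text_records(files, skip_index=False):
    """Count newline-delimited records across files, the way a plain
    text reader would. `files` maps file name -> raw bytes.

    Hypothetical helper for illustration; not part of any Hadoop API.
    """
    total = 0
    for name, data in files.items():
        if skip_index and name.endswith(".index"):
            # An LZO-aware input format uses the index only to compute
            # splits; it never treats it as table data.
            continue
        total += len(data.splitlines())
    return total


# Decompressed contents of the single data file: 744 rows.
rows = b"".join(b"row%d\tvalue\n" % i for i in range(744))

# Assumed index layout: two 8-byte big-endian offsets (made-up values).
index = struct.pack(">2q", 38, 9000)

partition = {"part-00000.lzo": rows, "part-00000.lzo.index": index}

as_plain_text = count_text_records(partition)                  # index read as data
as_lzo_aware = count_text_records(partition, skip_index=True)  # index used only for splits
print(as_plain_text, as_lzo_aware)
```

The single extra record comes from the index file having no line terminator at all, so its entire contents become one row whose first column is raw bytes and whose remaining columns are missing.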
