Comments (11)
We've seen this too just recently. I'll bet that you have 30 splits for your file -- can you verify that's true? I'm almost positive that this is an off-by-one end of split regression. We're on it. Please verify though so we can be sure it's the same bug.
Also, are you using the Pig LZO-based loaders? If so, it's likely an issue with elephant-bird. LMK and I can close this and re-open it there.
from hadoop-lzo.
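For readers unfamiliar with the failure class suspected above, here is a minimal sketch of how an off-by-one at the end of a split can drop records. It is not the hadoop-lzo or Elephant Bird code: it works on plain uncompressed bytes, the `read_split` function and its `inclusive_end` flag are hypothetical, and it assumes the usual line-reader convention where a split that does not start at offset 0 discards its first line and each split keeps reading any line that begins at or before its end offset.

```python
def read_split(data, start, end, inclusive_end=True):
    """Return the lines owned by the split [start, end] of `data`."""
    pos = start
    if start != 0:
        # Assumed convention: the previous split owns the line that
        # straddles (or starts exactly at) `start`, so skip past it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1

    records = []
    # A line is owned by this split if it *starts* at or before `end`.
    # Using a strict `<` here is the off-by-one being illustrated.
    while pos < len(data) and (pos <= end if inclusive_end else pos < end):
        newline = data.find(b"\n", pos)
        next_pos = len(data) if newline == -1 else newline + 1
        records.append(data[pos:next_pos].rstrip(b"\n"))
        pos = next_pos
    return records


data = b"aaa\nbbb\nccc\n"
splits = [(0, 4), (4, 8), (8, 12)]  # boundaries fall exactly on line starts

correct = [r for s, e in splits for r in read_split(data, s, e)]
buggy = [r for s, e in splits
         for r in read_split(data, s, e, inclusive_end=False)]

print(len(correct), len(buggy))
```

With the inclusive comparison every line is read exactly once. With the strict comparison, a line that begins exactly on a split boundary is skipped by the next split and never read by the previous one, so the record count comes up short.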
Yes, I am using the elephant-bird LZO loaders for Pig. Thanks for the code btw, and sorry for logging the bug in the wrong place.
The file actually has 33 splits.
This is an Elephant Bird bug, in LzoLineRecordReader. I pushed a fix to my fork at http://github.com/dvryaboy/elephant-bird . Jake, can you try that build and let us know if that fixes the issue you've observed?
Hi Dmitriy, the fix didn't fix this particular issue, still returning 17214 records. I wish I could share my dataset with you, but I can't. Is there anything else I can provide to help?
Huh. Ok that's weird. Just to make sure -- if you run hadoop cat | lzop -x - | wc -l on the compressed file, what number do you get?
143313687
What is the status of this issue? I believe I am running up against this while using hive.
I have a small, partitioned hive table. I have a single file inside each partition. I am running a simple "select count(1)" query against a single partition. Done without the index, I get the correct result. When I add the index, the count increases by one.
When I look at the actual differences in the output from a "select *" query, the extra result seems to be mostly NULLs, and the first value is binary garbage.
I am using an older version of hadoop-lzo. I plan to upgrade to pick up another bug fix that's important to us, but thought I'd ask about this bug directly.
Thanks,
Charlie
Charlie,
It really sounds like there is some LZO corruption going on. Please try upgrading to the latest version and let us know if the problem persists.
No luck. I am having the same issue with hadoop-lzo 0.4.9. My hive query returns an incorrect result with the lzo index, and a correct result without.
The hive table/file is actually quite small at 744 lines and 350KB uncompressed. Block size is 64MB. It compresses to <17K. When I index the lzo file, select count(1) returns 745.
Am I even posting to the correct thread? It seems very similar to the behavior I'm seeing...
Thanks again,
Charlie
Hi Charlie,
It sounds like there is some edge case that has to do with corrupt data being written to the end of a file (or possibly an uncompressed LZO block -- lzo does this thing where it won't compress data if it determines that compressed output would be bigger than uncompressed). It would be extremely helpful if you could share the data / queries needed to reproduce this error. Can you email me, my first name at Twitter.com?
Thanks for helping me get this sorted. For the record, my table was created with a vanilla TextInputFormat, and I then used an ALTER TABLE command to change it to DeprecatedLzoTextInputFormat. But that change doesn't affect the partitions that already existed in my table, so for those partitions the index file was being processed as part of the table data through TextInputFormat, instead of being used to calculate splits as DeprecatedLzoTextInputFormat would have done.
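To make the arithmetic of that explanation concrete, here is a small simulation of why an index file read as table data can add exactly one row. Several things in it are assumptions rather than facts from this thread: the file names are made up, the `.lzo` file is a plain-text stand-in for the decompressed data, and the index is modeled as a single 8-byte big-endian offset with an arbitrary value (42) that happens to contain no newline byte.

```python
import os
import struct
import tempfile


def count_lines(path):
    """Count newline-terminated records, plus a trailing unterminated one."""
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        return 0
    return data.count(b"\n") + (0 if data.endswith(b"\n") else 1)


def count_partition(directory, skip_index):
    """Count rows across a partition directory.

    skip_index=False mimics a plain text reader that treats every file
    as data; skip_index=True mimics an LZO-aware reader that uses the
    .index file only for split planning (hypothetical simplification).
    """
    total = 0
    for name in sorted(os.listdir(directory)):
        if skip_index and name.endswith(".index"):
            continue
        total += count_lines(os.path.join(directory, name))
    return total


with tempfile.TemporaryDirectory() as partition:
    # Stand-in for the 744-line data file mentioned in the thread.
    with open(os.path.join(partition, "part-00000.lzo"), "wb") as f:
        f.write(b"".join(b"row %d\n" % i for i in range(744)))
    # Stand-in index: one 8-byte offset, i.e. binary bytes with no newline.
    with open(os.path.join(partition, "part-00000.lzo.index"), "wb") as f:
        f.write(struct.pack(">q", 42))

    naive = count_partition(partition, skip_index=False)
    lzo_aware = count_partition(partition, skip_index=True)

print(naive, lzo_aware)
```

Under these assumptions the naive count is one higher than the LZO-aware count, and the extra "row" is the raw index bytes, which is consistent with the binary garbage followed by NULLs reported earlier in the thread.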
Related Issues (20)
- create a public email group?
- sc.textFile doesn't seem to use LzoTextInputFormat when hadoop-lzo is installed
- where is /build.properties generated
- No output when using index file
- Hadoop LZO does not take non-default queue
- mvn clean test doesn't build jar
- lzo with gradle
- New maven version with AArch64 binary
- JNI issue in LzoDecompressor_decompressBytesDirect
- Build Failure on Ubuntu
- maven.twttr.com has been down for over a day
- Could not find artifact com.hadoop.gplcompression:hadoop-lzo:jar:0.4.16 in Twitter public Maven repo (http://maven.twttr.com)
- pom.xml may have an incorrect license
- Compression Level is ignored.
- support fileglobs when index files
- Full build instructions for windows 10
- maven.twttr.com outage - 503 errors - breaks builds of downstream projects
- changes to continuous integration
- How to decompress LZO file using hadoop-lzo
- LZO codec not working for graviton instances