
hadoop-sstable's Introduction

Hadoop SSTable: Splittable Input Format for Reading Cassandra SSTables Directly

Hadoop SSTable is an InputFormat implementation that supports reading and splitting Cassandra SSTables. Leveraging this input format, MapReduce jobs can make use of Cassandra data for offline data analysis.

Cassandra Version Support

Supported

  • Cassandra 1.2
  • Cassandra 2.0.x

Currently Cassandra 2.1.x is not supported.

Getting Started

See a full example to get a feel for how to write your own jobs leveraging hadoop-sstable.

https://github.com/fullcontact/hadoop-sstable/wiki/Getting-Started
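
For orientation, here is a minimal driver-and-mapper sketch. It is an illustration only: it assumes the SSTableRowInputFormat class and the (ByteBuffer, SSTableIdentityIterator) mapper signature that appear elsewhere on this page; class names such as MySSTableJob and MyMapper are placeholders, and the wiki example above is the authoritative reference.

import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.cassandra.io.sstable.SSTableIdentityIterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.fullcontact.sstable.hadoop.mapreduce.SSTableRowInputFormat;

public class MySSTableJob {

    // Each input record is one SSTable row: the partition key plus an iterator
    // over its on-disk atoms (columns, tombstones, ...).
    public static class MyMapper
            extends Mapper<ByteBuffer, SSTableIdentityIterator, Text, Text> {
        @Override
        protected void map(ByteBuffer key, SSTableIdentityIterator value, Context context)
                throws IOException, InterruptedException {
            // Decode the key, iterate the row's columns, and emit output here.
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Required: the CREATE TABLE statement used to resolve table metadata.
        conf.set("hadoop.sstable.cql", "CREATE TABLE foo ...");

        Job job = new Job(conf, "sstable-example");
        job.setJarByClass(MySSTableJob.class);
        job.setInputFormatClass(SSTableRowInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input is a directory of indexed *-Data.db files; output is any HDFS path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}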

Configuration

Required:

Cassandra Create Table Statement (used for table metadata)

hadoop.sstable.cql="CREATE TABLE foo..."

Recommendation:

The Compressed SSTable Reader uses off-heap memory which can accumulate when task JVMs are reused.

mapred.job.reuse.jvm.num.tasks=1

Additionally, each MapReduce job written using this input format will have its own set of constraints. We currently tune the following settings when running our jobs (an illustrative invocation follows the list):

io.sort.mb
io.sort.factor
mapred.reduce.tasks
hadoop.sstable.split.mb
mapred.child.java.opts
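
For illustration, these settings are typically passed as -D options when launching a job. The jar, class, and path names below are placeholders, and the values are taken from one of the job invocations reported in the issues further down this page, so treat them as starting points rather than recommendations:

hadoop jar my-sstable-job.jar com.example.MySSTableJob \
    -D hadoop.sstable.cql="CREATE TABLE foo ..." \
    -D mapred.job.reuse.jvm.num.tasks=1 \
    -D io.sort.mb=1000 \
    -D io.sort.factor=100 \
    -D mapred.reduce.tasks=512 \
    -D hadoop.sstable.split.mb=1024 \
    -D mapred.child.java.opts=-Xmx2G \
    /path/to/sstables /path/to/output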

Binaries

Binaries and dependency information for Maven, Ivy, Gradle, and others can be found at http://search.maven.org.

Example for Gradle:

compile 'com.fullcontact:hadoop-sstable:x.y.z'

Build

To build:

$ git clone git@github.com:fullcontact/hadoop-sstable.git
$ cd hadoop-sstable
$ ./gradlew build

Bugs and Feedback

For bugs, questions, and discussions please use GitHub Issues.

LICENSE

Copyright 2014 FullContact, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

hadoop-sstable's People

Contributors

bvanberg, xorlev


hadoop-sstable's Issues

Missing records from the SSTable files

I am testing with the 0.1.2 release. Here is the problem; I am not sure where it comes from.

I tested with 0.1.2. We have an old implementation that parses one SSTable file per mapper. In my test case, from one set of production data, we have 375 SSTable files. My old implementation, which uses the same Cassandra SSTable exporter logic, generates 161,313,791,210 records from these SSTables. Using hadoop-sstable, with a new mapper that receives the (key, SSTableIdentityIterator) pairs from hadoop-sstable, I only got 161,304,497,154 total records from the same 375 SSTable files, so 9,294,056 records are missing. The old implementation should be bug-free on the total record count, as it was verified against the Cassandra sstable2json tool. So, is it possible that indexing and splitting the SSTable files could lose records? I am going to test one individual SSTable file as the next step, but I just want to know if you have any suggestions about this case.

Thanks

Handling deletion inside collection types in Cassandra 2

Hi, first of all, thanks for this great implementation. One question about using Hadoop SSTable on Cassandra 2 with collection types.
Let's say one of the columns is a map type with data {1: 'yes', 2: 'no', 3: 'true'}.
Now if {2: 'no'} is deleted from the map, the incremental sstables give me output like {"", "d"}, as if the entire column were deleted, which is not the case here. How should this be handled? Or how have you handled this case?

IllegalArgumentException - java.nio.Buffer.limit(Buffer.java:267)

java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:267)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:51)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getString(AbstractCompositeType.java:226)
at com.fullcontact.sstable.example.SimpleExampleMapper.map(SimpleExampleMapper.java:42)
at com.fullcontact.sstable.example.SimpleExampleMapper.map(SimpleExampleMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
14/11/04 15:16:16 INFO mapred.JobClient: Job complete: job_local_0001
14/11/04 15:16:16 INFO mapred.JobClient: Counters: 0
14/11/04 15:16:16 INFO example.SimpleExample: Total runtime: 2s

The ongoing road map

Hi, hadoop-sstable is a great idea for processing C* sstable files efficiently, but I am starting to think this is a dead end for future C* versions. In our environment we have lots of datasets stored in C*, and I have tried forking your code to keep supporting new types and new versions of C*. Here is some output from at least my effort:

  1. C* doesn't have a clean and easy internal API to help us parse collection-type data out of the SSTable in the C* 2.x codebase. I have already given up on this path; instead I use Spark to load the data from C* into HDFS for small/medium datasets, and force the C* 2.0/2.1 schema to support CDC on our end.
  2. C* 2.1 also causes trouble for us now, as the internal C* API delegates random access to the SSTable files to the JDK. This makes random access to the SSTable files on HDFS extremely difficult. This is maybe one of the reasons you cannot support C* 2.1 yet.

I wonder what your opinion is about this. What do you think about C* 2.1 or even 3.0 support in hadoop-sstable, and especially all the new types coming in future versions?

Thanks

Yong

EOF if all columns not iterated

If you don't walk through all the columns, you get an exception:

java.io.EOFException
at com.fullcontact.cassandra.io.util.RandomAccessReader.readFully(RandomAccessReader.java:259)
at com.fullcontact.cassandra.io.util.RandomAccessReader.readFully(RandomAccessReader.java:250)
at com.fullcontact.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:481)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392)
at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:371)
at com.fullcontact.sstable.hadoop.mapreduce.SSTableRowRecordReader.nextKeyValue(SSTableRowRecordReader.java:43)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

It's easy to work around with:

        // Drain any remaining columns in the row before moving on.
        while (value.hasNext()) {
            OnDiskAtom atom = value.next();
        }

It's kind of a corner case to handle, so this is pretty low priority, but the reader could possibly detect that it hasn't reached the end of the row and jump to the end itself.

S3 as data input path.

Hi, I am trying to read data directly from S3 and it fails. It does not give any error or throw an exception, but simply generates an index file of 0 KB first. Also, all the JSON part files generated are empty (0 KB).
I need help reading input (sstables) directly from S3.

Double-quoted field in CQL causes JsonColumnParser NPE

Hi, I use Hadoop 2.4.1 and Cassandra 2.0.15. Running SSTableIndexIndexer is OK, but running SimpleExample has a problem:

15/11/07 17:28:41 INFO example.SimpleExample: Setting initial input paths to /user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity
15/11/07 17:28:45 INFO example.SimpleExample: Setting initial output paths to /user/qihuang.zheng/velocity_test
15/11/07 17:28:47 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
15/11/07 17:28:49 INFO input.FileInputFormat: Total input paths to process : 70
15/11/07 17:28:50 INFO mapreduce.JobSubmitter: number of splits:10
15/11/07 17:28:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1446657831952_0012
15/11/07 17:28:51 INFO impl.YarnClientImpl: Submitted application application_1446657831952_0012
15/11/07 17:28:51 INFO mapreduce.Job: The url to track the job: http://spark047216:23188/proxy/application_1446657831952_0012/
15/11/07 17:28:51 INFO mapreduce.Job: Running job: job_1446657831952_0012
15/11/07 17:29:01 INFO mapreduce.Job: Job job_1446657831952_0012 running in uber mode : false
15/11/07 17:29:01 INFO mapreduce.Job:  map 0% reduce 0%
15/11/07 17:29:01 INFO mapreduce.Job: Job job_1446657831952_0012 failed with state FAILED due to: Application application_1446657831952_0012 failed 2 times due to AM Container for appattempt_1446657831952_0012_000002 exited with  exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Container exited with a non-zero exit code 1
.Failing this attempt.. Failing the application.
15/11/07 17:29:01 INFO mapreduce.Job: Counters: 0
15/11/07 17:29:01 INFO example.SimpleExample: Total runtime: 21s

My Hadoop environment is OK, because running examples like wordcount works fine.
The log indicates map 0% and reduce 0%, which means the mapper is not being called at all. But why? I really don't know.
When I run the command without -D hadoop.sstable.cql, the "Failed CQL create statement empty" exception does not happen, although it should, since SimpleExampleMapper contains this check:

if (cql == null || cql.trim().isEmpty()) {
    throw new RuntimeException("Failed CQL create statement empty");
}

And debugging the code, I find that reading the input via SSTableRowInputFormat appears normal:

15/11/07 17:43:43 DEBUG ipc.ProtobufRpcEngine: Call: getListing took 5ms
15/11/07 17:43:43 DEBUG input.FileInputFormat: Time taken to get FileStatuses: 136
15/11/07 17:43:43 INFO input.FileInputFormat: Total input paths to process : 70
15/11/07 17:43:43 DEBUG mapreduce.SSTableInputFormat: Initial file list: 70 [LocatedFileStatus{path=hdfs://tdhdfs/user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity/forseti-velocity-jb-102234-CompressionInfo.db; isDirectory=false; length=87211; replication=3; blocksize=134217728; modification_time=1446861739953; access_time=1446861739918; owner=qihuang.zheng; group=supergroup; permission=rw-r--r--; isSymlink=false},
15/11/07 17:43:43 DEBUG mapreduce.SSTableInputFormat: Removing non-sstable file: hdfs://tdhdfs/user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity/forseti-velocity-jb-102234-CompressionInfo.db
15/11/07 17:43:43 DEBUG mapreduce.SSTableInputFormat: Reading index file for sstable file: hdfs://tdhdfs/user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity/forseti-velocity-jb-102234-Data.db
15/11/07 17:43:43 DEBUG mapreduce.SSTableInputFormat: Reading index file: hdfs://tdhdfs/user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity/forseti-velocity-jb-102234-Index.db
15/11/07 17:43:43 DEBUG mapreduce.SSTableInputFormat: Final file list: 10 [LocatedFileStatus{path=hdfs://tdhdfs/user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity/forseti-velocity-jb-102234-Data.db; isDirectory=false; length=172873282; replication=3; blocksize=134217728;
15/11/07 17:43:43 DEBUG mapreduce.SSTableInputFormat: Splits calculated: 10 [SSTableSplit{dataStart=0, dataEnd=0, idxStart=0, length=8472466, idxEnd=8472466, dataFile=hdfs://tdhdfs/user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity/forseti-velocity-jb-102234-Data.db,

PS: to handle the classpath problem of running together with Cassandra, I export the classpath and then run hadoop jar:

export HADOOP_CLASSPATH=/usr/install/cassandra/lib/*:cassandra-all-2.0.15.jar:$HADOOP_CLASSPATH

/usr/install/hadoop/bin/hadoop jar hadoop-sstable-2.0.0.jar com.fullcontact.sstable.example.SimpleExample \
    -D hadoop.sstable.cql="CREATE TABLE velocity (attribute text,partner_code text,app_name text,type text,"timestamp" bigint,event text,sequence_id text,PRIMARY KEY ((attribute), partner_code, app_name, type, "timestamp")) WITH compression={'sstable_compression': 'LZ4Compressor'}" \
    -D mapred.task.timeout=21600000 \
    -D mapred.map.tasks.speculative.execution=false \
    -D mapred.job.reuse.jvm.num.tasks=1 \
    -D io.sort.mb=1000 \
    -D io.sort.factor=100 \
    -D mapred.reduce.tasks=512 \
    -D hadoop.sstable.split.mb=1024 \
    -D mapred.child.java.opts="-Xmx2G -XX:MaxPermSize=256m" \
    /user/qihuang.zheng/velocity_backup_1107/226_1105/1/forseti/velocity /user/qihuang.zheng/velocity_test

Problems to run the SimpleExample MR job

I tried to run the SimpleExample as an MR job, locally.

Here is what I passed in as parameters:

-fs local -jt local path_to_hadoop-sstable/sstable-core/src/test/resources/data output_path

First, here is the error I got:
Exception in thread "main" java.io.FileNotFoundException: File file:/hadoop-sstable/sstable-core/src/test/resources/data/Keyspace1-Standard1-ic-0-Index.db.Index does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:125)

Not sure why the code is looking for an index file named like "*-Index.db.Index"; I know the index file in Cassandra is not named like that. So, after reading the code, I changed SSTABLE_INDEX_SUFFIX to an empty string instead of ".Index".

Now I got the new error:
Exception in thread "main" java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at com.fullcontact.sstable.index.SSTableIndexIndex.readIndex(SSTableIndexIndex.java:63)
at com.fullcontact.sstable.hadoop.mapreduce.SSTableInputFormat.listStatus(SSTableInputFormat.java:85)
at com.fullcontact.sstable.hadoop.mapreduce.SSTableInputFormat.getSplits(SSTableInputFormat.java:139)

The code fails in readLong while reading the index file InputStream through to the end.

Am I trying to do the wrong thing here entirely? Or what is the correct way to test SimpleExample as a local MR job, using the test data that comes with it?

Thanks

Yong

Support Cassandra 2.0.x

I have one dataset already upgraded to Cassandra 2.0. I am trying to see if "hadoop-sstable" can be changed to support Cassandra 2.0.

One problem I found is that in Cassandra 2.0, the SSTableIdentityIterator constructor taking a "DataInput" was changed to a private constructor; see this link:

https://github.com/apache/cassandra/blob/cassandra-2.0.10/src/java/org/apache/cassandra/io/sstable/SSTableIdentityIterator.java

On line 81.

This makes it hard to create an SSTableIdentityIterator instance, as the public constructors only take Cassandra's RandomAccessReader, not fullcontact's RandomAccessReader.

I wonder what a good solution could be in this case. Do you plan anything for supporting Cassandra 2.x?

Thanks

The Indexer job

This is not really an issue, but more of a question.
It looks like the index-building part runs within the driver, multithreaded. It works for my test data. What I want to know is how indexing the data performs in your production environment. Does it make sense to use MR jobs to build the index? Our production system has over 10T of SSTable files, and I worry that running indexing within one driver could be a bottleneck in this case. What is your experience?

Thanks

Too many map tasks: same number as Data.db files

Recently we wanted to migrate data from C* to HDFS. Here is the mapper:

        protected void map(ByteBuffer key, SSTableIdentityIterator value, Context context) throws IOException, InterruptedException {
            final ByteBuffer newBuffer = key.slice();
            String partitionKey = UTF8Type.instance.getString(newBuffer);

            StringBuffer sb = null;
            int i=0;
            while (value.hasNext()) {
                OnDiskAtom atom = value.next();
                if (atom instanceof Column) {
                    Column column = (Column) atom;
                    String cn = CQLUtil.getColumnName(column.name(), columnNameConverter);
                    String cv = CQLUtil.byteBufferToString(column.value());

                    if(i%3==0 && "".equals(cv)){
                        sb = new StringBuffer(partitionKey);
                        sb.append(":").append(cn);
                    }else if(i%3==1 && cn.substring(cn.lastIndexOf(":")+1).equals("event")){
                        sb.append(cv);
                    }else if(i%3==2 && cn.substring(cn.lastIndexOf(":")+1).equals("sequence_id")){
                        sb.append(":" + cv);
                        context.write(new Text(sb.toString()), null);
                    }
                }
                i++;
            }
        }

Because we have 2 regular columns (event and sequence_id), the mapper outputs columns like this:

PartitionKey
        cluster-key-values: 
        cluster-key-values:regularColumn1 
        cluster-key-values:regularColumn2

And we aggregate one row as: PartitionKey:cluster-key-values:regularColumn1Value:regularColumn2Value.
In this way, one row looks more like a CQL result or a DBMS row.

Each of our SSTable files from Cassandra is almost 160 MB, put onto HDFS (block size = 128 MB).

We have 1674 Data.db files (almost 300 GB of data from C*):

[qihuang.zheng@spark047219 ~]$ /usr/install/hadoop/bin/hadoop fs -ls -R /user/qihuang.zheng/velocity_backup_1107/226_1105 | grep "Data" | wc -l
1674

After running on the cluster (11 nodes), I see the number of map tasks is the same as the number of Data.db files: 1674 tasks.

And of course this job takes a long time.
I set -D mapred.map.tasks=180, but the map task count still stays at 1674.
I guess the map task count can't be assigned directly, as it is derived from the HDFS input splits.

Is there any way to run the MR job more quickly?
Or would running it as a YARN application decrease the running time?

Q: Cass columns in multiple sstables

So, a question I have is how hadoop-sstable deals with Cassandra spreading columns over multiple SSTables. When you query Cassandra, it does the work of finding the ranges you are querying, streaming the SSTables into memtables to give you the "latest" data or deal with tombstones, and then provides the result. Are you doing a full compaction to avoid needing to look in multiple tables? (It didn't sound like it, unless Priam does so during backup of your ring.)

Cheers

SequenceFile doesn't work with GzipCodec without native-hadoop code!

java.lang.IllegalArgumentException: SequenceFile doesn't work with GzipCodec without native-hadoop code!
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:386)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:569)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:638)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)

Is anyone familiar with this exception?
I am running it on macOS.

Running into com.fullcontact.cassandra.io.sstable.CorruptBlockException

I am trying to decode some sstable files with the following two steps.

Indexing step:
hadoop jar hadoop-sstable-0.1.4.jar com.fullcontact.sstable.index.SSTableIndexIndexer /data//cassandra-data

Decoding step, which needs the create table statement:
hadoop jar hadoop-sstable-0.1.4.jar com.fullcontact.sstable.example.SimpleExample -D hadoop.sstable.cql="CREATE TABLE analytics_counters_daily (access_attribute varchar,matrix_type varchar,time_bucket timestamp,event_type varchar,platform varchar,position varchar,attribute_key varchar,medium varchar,count counter,PRIMARY KEY ( (access_attribute, matrix_type), time_bucket, event_type, platform, position, attribute_key, medium)) WITH CLUSTERING ORDER BY (time_bucket DESC);" /data//cassandra-data /data//cassandra-data/decoded

But I get a bunch of corrupt sstable exceptions, and the MR job to decode the Cassandra data fails. Any suggestions on what may be wrong?

15/10/28 05:25:23 INFO mapred.JobClient: Task Id : attempt_201506251917_1703_m_000080_1, Status : FAILED
org.apache.cassandra.io.sstable.CorruptSSTableException: com.fullcontact.cassandra.io.sstable.CorruptBlockException: (hdfs://watson-batch-hbase-namenode-prod/data/gjoshi/cassandra-data/analytics_counters_daily-1f77832036fd11e5a007a7e74a491f29/analytics_counters_backfill-analytics_counters_daily-ka-1305-Data.db): corruption detected, chunk at 1450382757 of length 7744.
at com.fullcontact.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:86)
at com.fullcontact.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:421)
at com.fullcontact.sstable.hadoop.mapreduce.SSTableRecordReader.initialize(SSTableRecordReader.java:67)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:479)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
a

Does it work with hadoop 2.0.0?

Hey, it's not an issue; I just wanted to know if you are working on making it run on Hadoop 2.0.0.
Or is there a minor change that can make it run with Hadoop 2.0.0?

Not working with column of type Set<String>

When parsing a column of type set, it returns the column name with a colon (":") appended, and it is not able to parse that column's data, returning empty. Do you have any idea about this? It is able to recognize the type as org.apache.cassandra.db.marshal.SetType(org.apache.cassandra.db.marshal.UTF8Type) but is unable to do the getString conversion.

Runtime Exception while running SimpleExample MR job

14/11/03 15:20:22 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: Error configuring SSTable reader. Cannot proceed
at com.fullcontact.sstable.hadoop.mapreduce.SSTableRecordReader.getCreateColumnFamilyStatement(SSTableRecordReader.java:144)
at com.fullcontact.sstable.hadoop.mapreduce.SSTableRecordReader.initializeCfMetaData(SSTableRecordReader.java:122)
at com.fullcontact.sstable.hadoop.mapreduce.SSTableRecordReader.initialize(SSTableRecordReader.java:69)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.cassandra.exceptions.SyntaxException: line 1:13 no viable alternative at input '.'
at org.apache.cassandra.cql3.CqlParser.throwLastRecognitionError(CqlParser.java:220)
at org.apache.cassandra.cql3.QueryProcessor.parseStatement(QueryProcessor.java:261)
at com.fullcontact.sstable.hadoop.mapreduce.SSTableRecordReader.getCreateColumnFamilyStatement(SSTableRecordReader.java:141)
... 6 more
14/11/03 15:20:23 INFO mapred.JobClient: Job complete: job_local_0001
14/11/03 15:20:23 INFO mapred.JobClient: Counters: 0
14/11/03 15:20:23 INFO example.SimpleExample: Total runtime: 2s

Job tuning -- memory

How much memory are you guys giving to each map task? Our use case involves multiple 20-40GB SSTables and we can't seem to get around the Java heap space error.

How to work with it?

Hi,

First of all, great work!
We want to POC your code, but I can't find any example or documentation on how to define the job, or the map job's key and value types.
Can you please give a generic code example of how to define the input format and map class?

Thanks.

FileSystem.get() calls should pass a URI in order to be easily portable to other systems

I've been working on porting some of this code up to AWS's Elastic MapReduce framework and have found a bug in the way we are setting paths. Instead of calling FileSystem.get(job.getConfiguration()), we should pass the optional URI parameter, as in FileSystem.get(inputPath.toUri(), job.getConfiguration()), to be more robust to other FileSystems (local, S3, HDFS, etc.).

If you agree that this is worthwhile, I'm happy to submit a PR with the change.
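
For illustration, a sketch of the suggested change; the helper class and method names here are made up and not part of the repo:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class FileSystems {
    private FileSystems() {}

    // Resolve the FileSystem from the path's own URI rather than from
    // fs.defaultFS, so local, HDFS, and S3 paths all resolve correctly.
    public static FileSystem forPath(Path path, Configuration conf) throws IOException {
        URI uri = path.toUri();
        return FileSystem.get(uri, conf);
        // The existing code calls FileSystem.get(conf), which always binds to
        // the cluster's default filesystem regardless of the path's scheme.
    }
}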

Problems with readIndex(FileSystem, Path) method of SSTableIndexIndex

I have been using this code to create an MR job to run on AWS's Elastic MapReduce framework, and it seems there might be a bug in the readIndex(final FileSystem fileSystem, final Path sstablePath) method. When we open the index using the NativeS3FileSystem, whenever we call inputStream.available() the response is 0. I think the problem is due to the implementation of these InputStream objects, and not necessarily a problem with this repo's code itself. I have managed to fix the issue by moving the code into a while(true) loop and breaking on an EOFException, which, though very hacky, seems to work.

I'm not sure if there is a better solution to the problem, or if it's really an artifact of a bug upstream, but thought I'd mention it here so others are aware.
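
For illustration, a minimal sketch of that workaround, assuming the index file is read as a sequence of longs (as the readIndex stack trace in an earlier issue suggests); the class and method names are hypothetical:

import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class IndexOffsets {
    private IndexOffsets() {}

    // Read long offsets until EOF instead of looping on inputStream.available(),
    // which NativeS3FileSystem reports as 0 even when data remains.
    public static List<Long> read(FileSystem fs, Path indexPath) throws IOException {
        List<Long> offsets = new ArrayList<Long>();
        FSDataInputStream in = fs.open(indexPath);
        try {
            while (true) {
                try {
                    offsets.add(in.readLong());
                } catch (EOFException eof) {
                    break; // reached the end of the index file
                }
            }
        } finally {
            in.close();
        }
        return offsets;
    }
}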
