
feathers's People

Contributors

andrix, brunovianarezende, dangra, klbostee


feathers's Issues

add RawFileOutputFormat.java to dumbo

RawFileOutputFormat is an output format that writes files of raw bytes (with no TypedBytes wrapping). We've been using it and it has proved useful for creating tar files and Tokyo Cabinet stores in dumbo. I couldn't find a way to attach a file to the issue, so I'll just copy and paste the class here:

package fm.last.feathers.output;

import java.io.ByteArrayInputStream;
import java.io.DataOutputStream;
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.util.Progressable;

public class RawFileOutputFormat<K, V> extends FileOutputFormat<K, V> {

    protected static class RawFileRecordWriter<K, V>
            implements RecordWriter<K, V> {
        protected DataOutputStream out;

        public RawFileRecordWriter(DataOutputStream out) {
            this.out = out;
        }

        // Strips the typed-bytes framing (type code + 4-byte length) from the value.
        private byte[] readRawBytes(BytesWritable bv) throws IOException {
            // getBytes() returns the backing buffer, which may be longer than
            // the valid data, so bound the stream with getLength().
            ByteArrayInputStream bais =
                    new ByteArrayInputStream(bv.getBytes(), 0, bv.getLength());
            DataInputStream in = new DataInputStream(bais);
            try {
                in.readUnsignedByte(); // read the type code and discard it
            } catch (Exception eof) {
                return null; // empty value: nothing to write
            }
            int length = in.readInt();
            byte[] bytes = new byte[length];
            in.readFully(bytes);
            return bytes;
        }

        public synchronized void write(K key, V value) throws IOException {
            boolean nullValue = (value == null) || value instanceof NullWritable;
            if (!nullValue) {
                byte[] raw = readRawBytes((BytesWritable) value);
                if (raw != null) { // avoid passing null to out.write()
                    out.write(raw);
                }
            }
        }

        public synchronized void close(Reporter reporter) throws IOException {
            out.close();
        }
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new RawFileRecordWriter<K, V>(fileOut);
    }
}
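
To make the framing that readRawBytes strips easier to see, here is a minimal Python sketch of the same steps. It assumes the value is a single typed-bytes byte string: one type-code byte, then a 4-byte big-endian length, then the payload. The function name read_raw_bytes and the sample record are made up for illustration and are not part of feathers or dumbo.

```python
import struct


def read_raw_bytes(typed):
    """Strip the typed-bytes framing from one serialized byte-string record.

    Assumed layout: 1 type-code byte, 4-byte big-endian length, payload.
    """
    if not typed:
        return None  # empty buffer, mirrors the EOF branch of the Java code
    # Byte 0 is the type code and is discarded; bytes 1..4 hold the length.
    (length,) = struct.unpack_from(">i", typed, 1)
    if length < 0:
        raise ValueError("negative length in record header")
    payload = typed[5:5 + length]
    if len(payload) != length:
        raise EOFError("record is shorter than its declared length")
    return payload


# Hypothetical record: type code 0, length 3, payload b"tar".
record = b"\x00" + struct.pack(">i", 3) + b"tar"
print(read_raw_bytes(record))  # b'tar'
```

Bytes that follow the declared length are ignored, which is why padding at the end of the buffer does not end up in the output file.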

Is it possible to use MultipleTextOutputs without the key appearing in the result?

For example, in a simple job like this:

from dumbo import opt, run

@opt("getpath", "yes")
def mapper(key, value):
    path = value[0:5]
    yield (path, None), value

if __name__ == "__main__":
    run(mapper)

started with:

dumbo start example.py -input /x -output /y -hadoop /hadoop -libjar /home/zoltanctoth/env-dumbo/feathers/feathers.jar -outputformat text -numReduceTasks 0

Is it possible to get an output file whose rows are in plain <value> format rather than <key>\t<value>?

I tried "" and <backspace> instead of None in yield (path, None), value, but neither seems to work.

build on aws emr cluster fails

Hi,
I'm new to Java, Hadoop and AWS, so I'm posting this issue after a lot of searching on Google without success.
I get the error below when running build.sh on an EMR cluster:
src/map/Words.java:24: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.*;
^
src/map/Words.java:25: error: package org.apache.hadoop.mapred does not exist
import org.apache.hadoop.mapred.*;
^
src/map/Words.java:28: error: cannot find symbol
public class Words extends MapReduceBase

Seems to require adding the streaming jar to the Hadoop classpath

In order to use this jar with hadoop streaming on a standalone installation of Cloudera CDH3 on Ubuntu Linux, I had to do two things:

  1. change the ant file to add:

+++ b/build.xml
@@ -21,6 +21,8 @@
+    <fileset dir="${hadoop.home}"
+             includes="contrib/streaming/hadoop-streaming-*.jar" />

  2. add the selected hadoop streaming jar to the HADOOP_CLASSPATH.

In dumbo/backends/streaming.py, I added:

if addedopts['libjarstreaming'] and addedopts['libjarstreaming'][0] != 'no':
    addedopts['libjar'].append(streamingjar)

which seemed to be required to get it to work.
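
As a rough illustration of what that check does, here is a self-contained sketch. It assumes addedopts behaves like a mapping from option name to a list of string values, and uses a defaultdict as a stand-in. The function name add_streaming_jar and the jar path are hypothetical; only the 'libjarstreaming' and 'libjar' keys come from the snippet above.

```python
from collections import defaultdict


def add_streaming_jar(addedopts, streamingjar):
    """Append the streaming jar to 'libjar' unless 'libjarstreaming' is unset or 'no'."""
    values = addedopts['libjarstreaming']
    if values and values[0] != 'no':
        addedopts['libjar'].append(streamingjar)
    return addedopts['libjar']


# Hypothetical jar path, for illustration only.
JAR = "/path/to/hadoop-streaming.jar"

enabled = defaultdict(list, {'libjarstreaming': ['yes']})
disabled = defaultdict(list, {'libjarstreaming': ['no']})
unset = defaultdict(list)

print(add_streaming_jar(enabled, JAR))   # ['/path/to/hadoop-streaming.jar']
print(add_streaming_jar(disabled, JAR))  # []
print(add_streaming_jar(unset, JAR))     # []
```

With this logic the jar is only shipped when the option is explicitly set to something other than 'no', so existing jobs that never set it are unaffected.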

Without this, I always got an error saying that org.apache.hadoop.typedbytes.TypedBytesWritable could not be found for the partition function:

Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/typedbytes/TypedBytesWritable
at fm.last.feathers.partition.Prefix.(Unknown Source)

After that, I was able to use the partition/Prefix class successfully.
