The P stands for {Profile,Pivoted,Parametrized,Predictive} analytics (choose whichever you feel comfortable with). The project is about entity-based computations over unstructured, changing data. One example is session- or cookie-based computations in advertising, but it can be anything entity-based, such as fraud detection, gaming, hardware fault analysis, conversion funnel analysis, etc.
Download Crunch from http://crunch.apache.org
Crunch needs to be compiled with the crunch.platform=2 flag to run properly in the mr1 mode on Hadoop:
> mvn install -Dcrunch.platform=2 -DskipTests
> mvn clean package -DskipTests -P DEPS,JOB
This builds all target jars.
As an option, you can create the DEP and JOB files in the root directory to avoid typing -P DEPS,JOB each time:
> touch DEP JOB
Unless you are working on a significant feature, you can save time by skipping the tests by default. To do so, set the environment variable:
> export MAVEN_OPTS=-DskipTests
Use the p-analytics.jar for Hive and Pig. Some libraries (Avro, for example) need to have their dependencies in the same jar, so the p-analytics-jar-with-dependencies.jar should be used in this case. Use the p-analytics-job.jar to run Crunch jobs from the command line:
> hadoop jar target/p-analytics-job.jar command input(s) output
For example, to run the conversion to Avro that can be loaded into Hive or Pig:
> hadoop jar target/p-analytics-job.jar avro data/hd/attr.txt data/hd/event.txt <output-dir>
To add compression (or add any other flag), you may do:
> hadoop jar target/p-analytics-job.jar avro -Dmapred.output.compress=true data/hd/attr.txt data/hd/event.txt <output-dir>
To generate the javadocs, run:
> mvn javadoc:javadoc
The javadocs will be in target/site/apidocs/index.html.
Sometimes the executable requires additional libraries that are not in the default Hadoop set. To generate the classpath for all dependencies, run
> mvn -f pom.xml dependency:build-classpath
The resulting classpath needs to be added to HADOOP_CLASSPATH.
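One way to wire the generated classpath into HADOOP_CLASSPATH is sketched below. It assumes the classpath was written to a file with the dependency plugin's -Dmdep.outputFile option; the function name add_to_hadoop_classpath and the file name target/classpath.txt are illustrative and not part of the project.

```shell
# Prepend the contents of a classpath file to HADOOP_CLASSPATH.
# $1: file produced by
#     mvn -f pom.xml dependency:build-classpath -Dmdep.outputFile=<file>
# (the function name and file location are illustrative assumptions)
add_to_hadoop_classpath() {
  cp_file=$1
  if [ ! -r "$cp_file" ]; then
    echo "cannot read classpath file: $cp_file" >&2
    return 1
  fi
  deps=$(cat "$cp_file")
  if [ -n "${HADOOP_CLASSPATH:-}" ]; then
    # Keep whatever was already on the classpath after the new entries.
    HADOOP_CLASSPATH="$deps:$HADOOP_CLASSPATH"
  else
    HADOOP_CLASSPATH="$deps"
  fi
  export HADOOP_CLASSPATH
}

# Typical use (not executed here):
#   mvn -f pom.xml dependency:build-classpath -Dmdep.outputFile=target/classpath.txt
#   add_to_hadoop_classpath target/classpath.txt
```

Writing the classpath to a file avoids having to strip the [INFO] lines that Maven prints around it on the console.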
Look at the src/main/{pig,hive} directories for the Pig and Hive scripts. Pig and Hive need to be installed separately.
To build and package the sources and data into one zipped file, run:
> mvn clean assembly:assembly -DskipTests -P DEPS,JOB
Read the Maven "Getting Started" guide: http://maven.apache.org/guides/getting-started/index.html
> mvn archetype:generate -DgroupId=com.cloudera.fts -DartifactId=P-Analytics -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
> mvn -Declipse.workspace=<path-to-eclipse-workspace> eclipse:configure-workspace
> mvn -DdownloadSources=true -DdownloadJavadocs=true eclipse:clean eclipse:eclipse
Modify pom.xml (or edit the dependencies later using Eclipse if you trust it).
To install the project in the local repo:
> mvn install -DskipTests
Eclipse integration (install the m2e Eclipse plugin from http://eclipse.org/m2e/download/)
This will set up your Eclipse environment:
> mvn -DdownloadSources=true -DdownloadJavadocs=true eclipse:clean eclipse:eclipse
It will download a lot of data, so have a fast Internet connection when doing it for the first time. Then, in Eclipse do the following:
- File->Import...
- General->Existing projects into workspace
- select the "Next" button
- select the project top-level directory
- select the "Finish" button
Each time you modify pom.xml outside of Eclipse, you need to run 'Update Maven Project' from the Eclipse project menu.
To generate the Protobuf and Avro code (the Protobuf definition files are in the src/main/proto directory and the Avro schema file is in the src/main/avro directory), run:
> mvn generate-sources
Alternatively, you can create and execute it as a target within Eclipse. The generated Java code can be found in target/generated-sources. You might need to add the directory to the Eclipse Java build path as follows (though the paranamer-maven-plugin usually does it for you):
- Go to Project Explorer view
- Right click on a project
- Go to "Build Path" -> "Configure Build Path..." -> "Add Folder..."
- add the target/generated-sources folder to the build path
You also need to add the conf directory and the p-analytics-job.jar to the classpath if you want to run the executable from within Eclipse.
To analyze the Maven dependencies, run
> mvn dependency:tree -Dverbose
Avoid multiple versions of the same jar: different versions of a library might conflict (Avro, for example).
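To spot such duplicates quickly, the verbose tree can be filtered for artifacts that appear with more than one version. The helper below is a rough sketch, not part of the project: the name find_version_conflicts is made up, it assumes the usual groupId:artifactId:jar:version:scope coordinates in the tree output, and it ignores artifacts with a classifier.

```shell
# Read `mvn dependency:tree -Dverbose` output on stdin and print every
# groupId:artifactId that shows up with more than one version.
# Assumes coordinates of the form groupId:artifactId:jar:version:scope;
# entries with a classifier are not matched by this pattern.
find_version_conflicts() {
  grep -oE '[A-Za-z0-9_.-]+:[A-Za-z0-9_.-]+:jar:[A-Za-z0-9_.-]+:[a-z]+' |
    awk -F: '
      {
        key = $1 ":" $2
        if (!((key, $4) in seen)) {
          # First time this version is seen for this artifact.
          seen[key, $4] = 1
          versions[key] = (count[key] > 0) ? versions[key] " " $4 : $4
          count[key]++
        }
      }
      END {
        for (k in count)
          if (count[k] > 1) print k ": " versions[k]
      }' |
    sort
}

# Typical use (not executed here):
#   mvn dependency:tree -Dverbose | find_version_conflicts
```

Each output line names one artifact followed by the versions found, in the order they appear in the tree. An empty result means no artifact was seen with two different versions.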