Can chemically dependent points become productive members of a text society through a twelve-step program?
The output is already available as PDFs in the image_files folder.
The bash script run_me.sh fires up twelve tools, one for each of the twelve steps:
- A Python program encodes bitmap characters into JSON, deliberately introducing numerous flaws
- Hadoop Streaming discards the invalid JSON records
- A Hive table stores the points before they are transferred to MySQL with Sqoop
- A MySQL script copies the points into another table, removing nulls along the way
- Sqoop transfers the points back to Hive, where the X and Y values are split apart
- Hive generates Avro files
- Pig reads both files and joins them on the serial number
- R removes bitmap outliers
- Two Hadoop MapReduce runs remove all the noise points, leaving only character information
- Mahout performs K-means clustering to create one cluster per character
- The points and their cluster IDs are written out to a SequenceFile
- Spark (with Scala) opens the SequenceFile and decodes each cluster into text
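The first step, the Python encoder, can be sketched roughly as below. The glyph bitmap, the record fields (`serial`, `x`, `y`), and the `corrupt` helper are illustrative assumptions, not the repo's actual format:

```python
import json

# Hypothetical 3x5 bitmap for the character "T" (1 = inked pixel);
# the project's real glyphs and dimensions may differ.
GLYPHS = {
    "T": [
        [1, 1, 1],
        [0, 1, 0],
        [0, 1, 0],
        [0, 1, 0],
        [0, 1, 0],
    ],
}

def encode_char(ch, serial_start=0):
    """Emit one JSON point record per inked pixel of the bitmap."""
    records = []
    serial = serial_start
    for y, row in enumerate(GLYPHS[ch]):
        for x, bit in enumerate(row):
            if bit:
                records.append(json.dumps({"serial": serial, "x": x, "y": y}))
                serial += 1
    return records

def corrupt(records, every=3):
    """Deliberately truncate every n-th record so later steps have flaws to clean."""
    return [r if i % every else r[:-1] for i, r in enumerate(records)]
```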
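The Hadoop Streaming filter in step 2 boils down to a mapper that drops lines which fail to parse. A minimal Python sketch, with the record field names assumed for illustration:

```python
import json
import sys

REQUIRED = ("serial", "x", "y")  # assumed field names

def keep_valid(lines):
    """Yield only lines that parse as JSON objects carrying the expected fields."""
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:  # JSONDecodeError is a ValueError subclass
            continue        # the bad record is simply dropped
        if isinstance(rec, dict) and all(k in rec for k in REQUIRED):
            yield line.rstrip("\n")

if __name__ == "__main__":
    # As a streaming mapper: valid records pass through stdin -> stdout unchanged.
    for good in keep_valid(sys.stdin):
        print(good)
```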
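The Pig step behaves like an inner join of the two split files on the serial number; a Python sketch of the idea, with the record layout assumed:

```python
def join_on_serial(xs, ys):
    """Inner join: pair each (serial, x) with its (serial, y) partner,
    much as Pig's JOIN ... BY serial would."""
    y_by_serial = dict(ys)
    return [(serial, x, y_by_serial[serial])
            for serial, x in xs if serial in y_by_serial]
```

Serials missing from either side simply drop out, matching inner-join semantics.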
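The noise-removal MapReduce runs can be approximated by a density filter: a point with too few nearby neighbors is noise. A single-process Python sketch (the radius and threshold are made-up parameters, not the project's):

```python
def remove_noise(points, radius=1, min_neighbors=2):
    """Keep only points with enough nearby neighbors; isolated noise points go."""
    pts = set(points)

    def neighbor_count(p):
        x, y = p
        return sum((x + dx, y + dy) in pts
                   for dx in range(-radius, radius + 1)
                   for dy in range(-radius, radius + 1)
                   if (dx, dy) != (0, 0))

    return {p for p in pts if neighbor_count(p) >= min_neighbors}
```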
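The Mahout step is plain K-means; a small Lloyd's-algorithm sketch in Python. The deterministic initialization is a simplification for the example; Mahout picks its initial centroids differently:

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternate assigning points to the nearest centroid
    and recomputing each centroid as its bucket's mean."""
    centroids = list(points[:k])  # deterministic init, a simplification
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: (p[0] - centroids[j][0]) ** 2
                                      + (p[1] - centroids[j][1]) ** 2)
            buckets[nearest].append(p)
        centroids = [
            (sum(x for x, _ in b) / len(b), sum(y for _, y in b) / len(b))
            if b else centroids[j]
            for j, b in enumerate(buckets)
        ]
    return centroids
```

With one well-separated blob of points per character, each centroid settles on one character's blob.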
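The final Spark/Scala decode can be illustrated in Python: shift a cluster's points to the origin and match the resulting shape against known glyphs. The glyph table (here as a set of inked pixels) and the exact-match rule are assumptions for illustration:

```python
# Hypothetical glyph table: each character maps to its set of inked (x, y) pixels.
GLYPHS = {
    "T": {(0, 0), (1, 0), (2, 0), (1, 1), (1, 2), (1, 3), (1, 4)},
}

def decode_cluster(points):
    """Shift a cluster's points to the origin and look the shape up."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    shape = {(x - min_x, y - min_y) for x, y in points}
    for ch, glyph in GLYPHS.items():
        if shape == glyph:
            return ch
    return "?"  # unrecognized shape
```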
Verified with Mac OS X 10.10.1, Java 1.7, Hadoop 1.2.1, Hive 0.13.1, Pig 0.12, Mahout 0.5, Spark 1.2.1, Scala 2.11.6
Also verified on CentOS and Ubuntu VMs