dataapps / chlorine-hadoop
MapReduce program to detect and mask sensitive data in Hadoop
License: Apache License 2.0
Enable an option to specify incremental scans
Record the current timestamp in an HDFS location so that subsequent scans only cover data from that timestamp onward.
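A minimal sketch of the idea, not the project's implementation: the class and method names are hypothetical, and it uses java.nio.file on a local directory as a stand-in for HDFS. An actual version would presumably go through the Hadoop FileSystem API and compare file modification times there.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper; local files stand in for HDFS paths.
class IncrementalScanSketch {

    // Reads the timestamp (epoch millis) saved by the previous scan.
    // A missing or empty marker yields 0, meaning "scan everything".
    static long readLastScanTime(Path marker) throws IOException {
        if (!Files.exists(marker)) {
            return 0L;
        }
        String text = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8).trim();
        return text.isEmpty() ? 0L : Long.parseLong(text);
    }

    // Overwrites the marker with the given scan time.
    static void writeLastScanTime(Path marker, long epochMillis) throws IOException {
        Files.write(marker, Long.toString(epochMillis).getBytes(StandardCharsets.UTF_8));
    }

    // Returns regular files under root modified strictly after the timestamp.
    static List<Path> filesModifiedAfter(Path root, long epochMillis) throws IOException {
        List<Path> result = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p)
                        && Files.getLastModifiedTime(p).toMillis() > epochMillis) {
                    result.add(p);
                }
            }
        }
        return result;
    }
}
```

A scan would read the marker, process only the files returned by filesModifiedAfter, and then write the time at which the scan started (not the time it ended), so that files modified while the scan was running are picked up next time.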
Hi,
I am trying to run your program, but I am getting the following error. Would you please help me understand what is going on?
I am using the HDP Docker image, version 2.6.1.
2017-09-22 10:19:47,008 INFO [main] hadoop.DeepScanPipeline (DeepScanPipeline.java:run(82)) - Hostname = 21290
2017-09-22 10:19:47,031 WARN [main] fs.FSInputChecker (ChecksumFileSystem.java:<init>(165)) - Problem opening checksum file: file:/root/sensitive_out/_temp/attempt_local1444930170_0001_m_000000_0. Ignoring exception:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:155)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:348)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
at org.apache.hadoop.fs.FileUtil.copyMerge(FileUtil.java:401)
at io.dataapps.chlorine.hadoop.DeepScanPipeline.run(DeepScanPipeline.java:91)
at io.dataapps.chlorine.hadoop.Scan.main(Scan.java:160)
2017-09-22 10:19:47,043 INFO [main] fs.FSInputChecker (FSInputChecker.java:readChecksumChunk(285)) - Found checksum error: b[3072, 3072]=
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/sensitive_out/_temp/attempt_local1444930170_0001_m_000001_0 at 523264
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:261)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:228)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:93)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:61)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:121)
at org.apache.hadoop.fs.FileUtil.copyMerge(FileUtil.java:403)
at io.dataapps.chlorine.hadoop.DeepScanPipeline.run(DeepScanPipeline.java:91)
at io.dataapps.chlorine.hadoop.Scan.main(Scan.java:160)
2017-09-22 10:19:47,046 ERROR [main] hadoop.DeepScanPipeline (DeepScanPipeline.java:run(102)) - org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/sensitive_out/_temp/attempt_local1444930170_0001_m_000001_0 at 523264
Add a --fast option that takes samples of the inputs from the dataset and performs a quick scan on those samples instead of the full data.
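One way the sampling behind such a --fast option could work is reservoir sampling (Algorithm R), which keeps a uniform random sample of a fixed size from input of unknown length. This is only a sketch under that assumption: the request does not say which sampling strategy is intended, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper illustrating one possible sampling strategy.
class FastScanSampler {

    // Algorithm R: returns a uniform random sample of at most k records
    // (k >= 0) from an iterator whose length is not known in advance.
    static <T> List<T> sample(Iterator<T> records, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k records.
                reservoir.add(record);
            } else {
                // Pick a slot in [0, seen); the record is kept with probability k / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }
}
```

Only the sampled records would then be passed to the detection step, which trades completeness for speed: a sensitive value that appears in few records may be missed.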
Chlorine-finder has implemented the masking feature.
It would be useful to add the same masking ability to chlorine-hadoop so that it can mask values in HDFS datasets.
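As a rough illustration of what masking a record could look like, the sketch below replaces every regex match in a line with asterisks of the same length. It does not use the Chlorine-finder API; the class name, the mask method, and the simplified email pattern are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the Chlorine-finder masking API.
class MaskingSketch {

    // Deliberately simple email pattern, used only for illustration.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Replaces each match with '*' characters of the same length,
    // so the overall length of the line is preserved.
    static String mask(String line, Pattern pattern) {
        Matcher m = pattern.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String stars = new String(new char[m.end() - m.start()]).replace('\0', '*');
            m.appendReplacement(out, Matcher.quoteReplacement(stars));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In a MapReduce job, a method like this would typically be applied to each input line in the map step, with the masked lines written to a separate output location rather than over the original data.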