A crawl system built around Heritrix3.
See the ukwa-documentation for an overview.
The dockerized ensemble of services that run most of the UKWA crawl and ingest processes.
Home Page: https://github.com/ukwa/ukwa-documentation
License: Apache License 2.0
Rather than perform the main copy in Luigi, we could run Apache Flume when pushing files up to HDFS. It's an extremely well-established and mature tool that's built to do exactly this.
We could use a Spooling Directory Source for *.warc.gz files, and a Taildir Source to continuously process crawl log files. These would be sent to a suitable HDFS Sink.
The directory sources should be set to not delete files (see deletePolicy), and to put the file path in the fileHeader so we can use ${file} to reconstruct the path on HDFS (see rough outline here).
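To make the path reconstruction concrete, here is a minimal Python sketch of the mapping a sink would need to perform: take the local file path carried in the event header and re-root it under an HDFS directory. The function name and both root directories are hypothetical, chosen only for illustration, and this is not Flume configuration.

```python
import posixpath


def hdfs_path_for(local_path: str, local_root: str, hdfs_root: str) -> str:
    """Map a local file path onto the equivalent path under an HDFS root.

    All three arguments are illustrative: local_root is the spooled
    directory and hdfs_root is wherever the files should land on HDFS.
    """
    relative = posixpath.relpath(local_path, start=local_root)
    # Refuse anything that resolves to a location outside the local root.
    if relative == ".." or relative.startswith("../"):
        raise ValueError(f"{local_path} is not under {local_root}")
    return posixpath.join(hdfs_root, relative)


if __name__ == "__main__":
    print(hdfs_path_for("/data/output/warcs/a.warc.gz",
                        "/data/output", "/heritrix/output"))
```

Keeping the directory layout identical on both sides means a later clean-up job can derive the HDFS location of any local file without a lookup table.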
A second Luigi-powered process would check for files and delete the local copies once the checksums of both the local and HDFS copies had been verified to match.
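The verify-then-delete step could look roughly like the sketch below. It is not tied to Luigi or to any particular HDFS client: the remote_checksum callable is a hypothetical stand-in for whatever mechanism returns the digest of the HDFS copy, and SHA-512 is an assumed choice of hash.

```python
import hashlib
import os
from typing import Callable, Optional


def sha512_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-512 digest of a local file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_if_verified(
    local_path: str,
    remote_checksum: Callable[[str], Optional[str]],
) -> bool:
    """Delete local_path only if the remote copy has the same checksum.

    remote_checksum is a hypothetical callable that returns the hex digest
    of the HDFS copy, or None if no copy exists yet.
    Returns True if the local file was deleted.
    """
    remote = remote_checksum(local_path)
    if remote is None or remote != sha512_of(local_path):
        # Missing or mismatched copy: keep the local file for a later retry.
        return False
    os.remove(local_path)
    return True
```

Because the function leaves the file in place on any mismatch, it is safe to run repeatedly over the same directory.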
Should be pretty easy to Dockerize, and use a Java 8 base image rather than something like this.
Modern versions of HDFS support an iNotify API. Here's an example: https://github.com/onefoursix/hdfs-inotify-example
We could wrap this daemon up as e.g. a Storm Spout (although this is being considered already and there are issues around scaling). Alternatively, we could just run it as a plain daemon that publishes to e.g. Kafka.
Note that the overall design would have to assume this would sometimes miss things, and make it possible to (re-)run any processing via occasional hadoop fs -lsr jobs.
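A catch-up job of this kind could be sketched as follows: parse a recursive listing and report any files the notification stream never delivered. The sketch assumes the listing uses the usual eight-column layout (permissions, replication, owner, group, size, date, time, path), and the set of already-seen paths is a hypothetical input that would come from wherever the daemon records its events.

```python
from typing import Iterable, List


def files_in_listing(listing: str) -> List[str]:
    """Extract file paths from recursive HDFS listing output.

    Assumes eight whitespace-separated columns with the path last.
    Directories and summary lines such as "Found N items" are skipped.
    """
    paths = []
    for line in listing.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[0].startswith("-"):
            continue
        paths.append(fields[7])
    return paths


def missed_files(listing: str, seen: Iterable[str]) -> List[str]:
    """Return listed files that are absent from the already-seen paths."""
    seen_set = set(seen)
    return [path for path in files_in_listing(listing) if path not in seen_set]
```

Anything returned by missed_files can then be fed back into the same processing path as a live notification.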