uscdatascience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Home Page: http://irds.usc.edu/sparkler/

License: Apache License 2.0

Scala 34.60% Shell 4.84% Java 38.15% JavaScript 11.55% Python 8.25% HTML 0.68% Dockerfile 1.50% CSS 0.26% Mustache 0.19%
solr web-crawler spark nutch tika big-data information-retrieval search-engine search distributed-systems

sparkler's Introduction


A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and pf4j. Sparkler is an extensible, highly scalable, and high-performance web crawler that is an evolution of Apache Nutch and runs on Apache Spark Cluster.

NOTE:

Sparkler is being proposed to the Apache Incubator. Review the proposal document and provide your suggestions here. (Will be done later, eventually!)

Notable features of Sparkler:

  • Higher performance and fault tolerance: The crawl pipeline has been redesigned to take advantage of the caching and fault-tolerance capabilities of Apache Spark.
  • Complex and near-real-time analytics: The internal data structure is an indexed store powered by Apache Lucene that can answer complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) exposes the crawler analytics via an HTTP API. These analytics can be visualized with intuitive charts in the admin dashboard (coming soon).
  • Real-time content streaming: Optionally, Apache Kafka can be configured to stream the fetched content out as soon as it becomes available.
  • JavaScript rendering: Executes the JavaScript code in web pages to produce the final state of the page. The setup is easy and painless, and it scales by distributing the work across Spark. Sessions and cookies are preserved for subsequent requests to the same host.
  • Extensible plugin framework: Sparkler is designed to be modular. It supports plugins to extend and customize its runtime behaviour.
  • Universal parser: Apache Tika, the popular content-detection and analysis toolkit that can deal with thousands of file formats, is used to discover links to outgoing web resources and to analyze fetched resources.
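To make the plugin idea concrete, here is a generic extension-point sketch in Python. It is illustrative only: Sparkler's real plugins (e.g. urlfilter-regex) implement Java/Scala interfaces that pf4j discovers and loads, and the class and function names below are invented for this sketch.

```python
import re

class UrlFilter:
    """Illustrative extension point: decide whether a URL should be crawled."""
    def accept(self, url: str) -> bool:
        raise NotImplementedError

class RegexUrlFilter(UrlFilter):
    """Illustrative plugin: reject URLs matching a deny pattern."""
    def __init__(self, deny_pattern: str):
        self._deny = re.compile(deny_pattern)

    def accept(self, url: str) -> bool:
        return self._deny.search(url) is None

def apply_filters(filters, urls):
    """Host-side hook: keep only URLs accepted by every registered plugin."""
    return [u for u in urls if all(f.accept(u) for f in filters)]
```

The point of the pattern is that the crawler core only knows the `UrlFilter` contract; concrete behaviour (regex filtering, robots rules, etc.) is supplied by plugins at runtime.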

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get the image
docker pull ghcr.io/uscdatascience/sparkler/sparkler:main
# Step 1. Create a volume for elastic
docker volume create elastic
# Step 2. Inject seed urls
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job (job id "myid", top 100 URLs per iteration, 2 iterations)
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main crawl -id myid -tn 100 -i 2

Running Sparkler with seed urls file:

1. Follow Steps 0-1
2. Create a file named seed-urls.txt using the Emacs editor as follows:
       a. emacs sparkler/bin/seed-urls.txt
       b. copy-paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use the Vim or Nano editors, or simply run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job.

To crawl until all new URLs are exhausted, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
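Conceptually, -i -1 keeps iterating until an iteration finds no URLs left in status NEW. A minimal Python sketch of that stopping condition (the dict frontier and the function names stand in for the real Solr-backed crawldb; they are not Sparkler APIs):

```python
def crawl_until_done(frontier, fetch, max_iterations=None):
    """Run generate/fetch iterations until no URL is left in status NEW.

    frontier maps url -> status ('NEW' or 'FETCHED'); fetch(url) returns outlinks.
    max_iterations=None mimics the -i -1 behaviour (run until exhausted).
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        new_urls = [u for u, status in frontier.items() if status == "NEW"]
        if not new_urls:
            break  # nothing left to crawl: the -i -1 stopping condition
        for url in new_urls:
            for outlink in fetch(url):
                frontier.setdefault(outlink, "NEW")  # discovered URLs join the frontier
            frontier[url] = "FETCHED"
        iterations += 1
    return iterations
```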

Making Contributions:

Contact Us

Questions and suggestions are welcome on our mailing list: [email protected]. Alternatively, you may use the Slack channel to get help: http://irds.usc.edu/sparkler/#slack

sparkler's People

Contributors

amirhosf, berkarcan, buggtb, chrismattmann, dependabot[bot], felixloesing, gitter-badger, giuseppetotaro, karanjeets, kefaun2601, kyan2601, ldaume, lewismc, mattvryan-github, nhandyal, prenastro, prowave, rahulpalamuttam, rohithyeravothula, ryanstonebraker, sk-s-hub, slhsxcmy, smadha, sujen1412, thammegowda


sparkler's Issues

Finish Juju charm

Some high level remaining tasks:

  • Add solr relation
  • Pick up spark details from relation
  • Pick up solr details from relation
  • Finish write to configuration
  • Compile and add resource
  • Add actions for remote execution of crawler
  • Add Kafka relation
  • Return id in action output
  • Store last ingest id in kv so crawl can pick it up without user looking it up
  • Sort out solr cloud
  • Finish Docs
  • Add charm push to CI for beta branch

Escape metachars in solr queries

java.lang.RuntimeException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: org.apache.solr.search.SyntaxError: Cannot parse 'group:': Encountered "" at line 1, column 6.
Was expecting one of:
    "(" ...
    "*" ...
    "[" ...
    "{" ...
    "filter(" ...
    ...

    at edu.usc.irds.sparkler.util.SolrResultIterator.getNextBean(SolrResultIterator.scala:72)
    at edu.usc.irds.sparkler.util.SolrResultIterator.<init>(SolrResultIterator.scala:57)
    at edu.usc.irds.sparkler.CrawlDbRDD.compute(CrawlDbRDD.scala:55)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
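The fix being asked for is to backslash-escape Lucene/Solr query metacharacters before interpolating raw values such as group:. A minimal sketch, assuming the standard Lucene special-character set (the helper name is illustrative, not Sparkler's actual code):

```python
# Characters with special meaning in the Lucene query syntax.
# (&& and || are two-character operators; escaping each character
# individually is a safe over-approximation.)
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')

def escape_solr_term(term: str) -> str:
    """Backslash-escape Lucene/Solr query metacharacters in a raw term."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)
```

With this, a raw value like group: becomes group\: and no longer triggers the SyntaxError above.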

Job hangs for a minute when Kafka is not configured

When the Kafka server is not configured or not running, the crawl job makes repeated attempts to establish a connection, which adds a noticeable delay.

By default, the Kafka feature should be disabled in the configuration.

@karanjeets Thoughts? Can you have a look at this?
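The requested behaviour amounts to bounding the connection retries instead of retrying indefinitely. A hedged Python sketch of that pattern (the connect callable and the limits are illustrative; this is not Sparkler's actual Kafka client code):

```python
import time

def connect_with_retries(connect, max_attempts=3, backoff_s=0.0):
    """Attempt a connection a bounded number of times, then give up.

    Returns the connection on success, or None after max_attempts failures,
    so the caller can disable the feature instead of hanging the whole job.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt < max_attempts:
                time.sleep(backoff_s)  # brief pause before retrying
    return None
```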

Java null pointer error in fetch()

Hi!

I am encountering some errors: the program crashes after about 10 crawls with the errors below (I marked the key parts in bold). Can you help me figure out why?

Best,

1st

2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8) org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'** System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

2nd

2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'**
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

3rd

ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2 java.lang.NullPointerException
at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6)
java.util.NoSuchElementException: key not found: 6**
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

...

2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

Process finished with exit code 1

Sparkler Build Failing

The build is failing due to reference to a deprecated (removed) module "sparkler-plugins-active".

Working on it...

Logs:

[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ sparkler-api ---
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-api/target/sparkler-api-0.1-SNAPSHOT.jar to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.jar
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-api/pom.xml to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.pom
[INFO] 
[INFO] --- maven-bundle-plugin:2.5.0:install (default-install) @ sparkler-api ---
[INFO] Installing edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.jar
[INFO] Writing OBR metadata
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building sparkler-plugins 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ sparkler-plugins ---
[INFO] 
[INFO] --- maven-install-plugin:2.4:install (default-install) @ sparkler-plugins ---
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-plugins/pom.xml to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/plugin/sparkler-plugins/0.1-SNAPSHOT/sparkler-plugins-0.1-SNAPSHOT.pom
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building sparkler 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] sparkler-parent .................................... SUCCESS [  0.117 s]
[INFO] sparkler-api ....................................... SUCCESS [  2.089 s]
[INFO] sparkler-plugins ................................... SUCCESS [  0.005 s]
[INFO] sparkler ........................................... FAILURE [  0.495 s]
[INFO] urlfilter-regex .................................... SKIPPED
[INFO] fetcher-jbrowser ................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.265 s
[INFO] Finished at: 2016-10-28T17:54:21-05:00
[INFO] Final Memory: 43M/1451M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project sparkler-app: Could not resolve dependencies for project edu.usc.irds.sparkler:sparkler-app:jar:0.1-SNAPSHOT: Could not find artifact edu.usc.irds.sparkler.plugin:sparkler-plugins:jar:0.1-SNAPSHOT -> [Help 1]

Debugging crawl in Sparkler

URL Partitioner

Input: Query Solr for the URLs to be generated

status:NEW

Output: Files with a list of URLs partitioned by host (group) such that every file corresponds to one host
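The grouping step above can be sketched as follows (plain URL strings stand in for the Solr documents the real partitioner reads):

```python
from collections import defaultdict
from urllib.parse import urlparse

def partition_by_host(urls):
    """Group URLs by hostname so each partition corresponds to one host."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)
```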

Fetch

Input: URL to fetch
Output: Request and Response Headers written in a file

Parse

Input: URL (which will be fetched and parsed) OR the fetched content
Output: Extracted Content

Fair Fetcher

Input: List of URLs. Uses Crawl policy
Output: fetched and/or parsed content in separate files under a directory
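The host-fairness idea can be sketched as a round-robin interleaving of the per-host partitions (the real FairFetcher also applies a crawl delay and interleaves fetching with parsing; this shows only the ordering):

```python
from itertools import zip_longest

def fair_order(partitions):
    """Interleave per-host URL lists round-robin so no host is hit in a burst."""
    ordered = []
    for round_urls in zip_longest(*partitions.values()):
        ordered.extend(u for u in round_urls if u is not None)
    return ordered
```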

Setup unit tests and integration tests for Sparkler

To begin with, write unit tests for :

  • test RDD functions
  • Test core functionality of plugins

Tasks

  • Setup a web server for testing
    • Bind it to junit to auto start and stop while running the tests
  • Setup a solr instance for testing
    • Bind it to junit to auto start and stop
  • Test Default Fetcher
  • Test Javascript Engine functionality
  • Test URL filters
  • Test Fetch Function
  • Test Parse Function
  • Test Seed Injection
  • Test URL Normalizer
  • Test HDFS persistence
  • Test Kafka Output

PS:
The current progress can be tracked on https://github.com/USCDataScience/sparkler/tree/unittests branch

Maven packaging problem

~/git_workspace/sparkler/target$ java -classpath sparkler-0.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.sparkler.pipeline.Crawler -m "local" -j "sparkler-job-1465179374801" -i 1
2016-06-05 19:26:41 WARN NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-06-05 19:26:42 ERROR SparkContext:95 [main] - Error initializing SparkContext.
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:169)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:505)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(U

Build failing

http://pastebin.com/xPbxHNhM

...
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=36 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=39 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=41 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=44 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=46 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=48 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=File must end with newline character
warning file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SolrResultIterator.scala message=Avoid using null line=56 column=43
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=26 column=2
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=30 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=35 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=37 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=45 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=52 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=60 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=File must end with newline character
Saving to outputFile=/Users/madhav/Documents/workspace/sparkler/sparkler-app/target/scalastyle-output.xml
Processed 23 file(s)
Found 29 errors
Found 9 warnings
....

Seems like a config issue

log4j

[hadoop@NameNode target]$ java -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt
log4j:WARN No appenders could be found for logger (edu.usc.irds.sparkler.service.Injector$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

jobId = sjob-1473483131504

So, I solved the problem with:
java -Dlog4j.configuration=file:///$SPARKLER_HOME/sparkler-app/src/main/resources/log4j.properties -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt

and get jobId=sjob-1473484528794
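The warnings appear because log4j 1.2 finds no configuration file on the classpath. For reference, a minimal log4j.properties that logs INFO and above to the console could look like this (an illustrative sketch, not necessarily the project's bundled file):

```properties
# Minimal log4j 1.2 setup (illustrative): INFO and above to stderr/console.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```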

But when I run the crawl:
java -jar sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1473484528794 -m yarn-client -i 2
the error below occurs.

My environment is Hadoop 2.4.0, Spark 1.6.1, Nutch 1.11, Solr 6.0.1, JDK 1.8.0u92, and Scala 2.11.8, and everything else works well.

How can I fix it?

[hadoop@NameNode target]$ java -jar sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1473484528794 -m yarn-client -i 2
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/10 14:30:56 INFO SparkContext: Running Spark version 1.6.1
16/09/10 14:30:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/10 14:30:56 INFO SecurityManager: Changing view acls to: hadoop
16/09/10 14:30:56 INFO SecurityManager: Changing modify acls to: hadoop
16/09/10 14:30:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/09/10 14:30:56 INFO Utils: Successfully started service 'sparkDriver' on port 49943.
16/09/10 14:30:57 INFO Slf4jLogger: Slf4jLogger started
16/09/10 14:30:57 INFO Remoting: Starting remoting
16/09/10 14:30:57 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:36988]
16/09/10 14:30:57 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 36988.
16/09/10 14:30:57 INFO SparkEnv: Registering MapOutputTracker
16/09/10 14:30:57 INFO SparkEnv: Registering BlockManagerMaster
16/09/10 14:30:57 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b32ee736-ef54-4f7a-83ff-6c8f6ab3d442
16/09/10 14:30:57 INFO MemoryStore: MemoryStore started with capacity 723.0 MB
16/09/10 14:30:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:411)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2118)
at org.apache.spark.storage.BlockManager.(BlockManager.scala:105)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.(SparkContext.scala:457)
at edu.usc.irds.sparkler.pipeline.Crawler.init(Crawler.scala:94)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:108)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:201)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:395)
... 21 more
16/09/10 14:30:57 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:411)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2118)
at org.apache.spark.storage.BlockManager.(BlockManager.scala:105)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.(SparkContext.scala:457)
at edu.usc.irds.sparkler.pipeline.Crawler.init(Crawler.scala:94)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:108)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:201)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:395)
... 21 more
16/09/10 14:30:57 INFO DiskBlockManager: Shutdown hook called
16/09/10 14:30:57 INFO ShutdownHookManager: Shutdown hook called

Solr Cloud - solrj.SolrServerException: No live SolrServers available to handle this request

When Solr Cloud is enabled as the backend, we get this:

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
	at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.0.11:8983/solr/crawldb_shard1_replica1, http://192.168.0.11:8984/solr/crawldb_shard1_replica2]
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
	at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
	at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
	at edu.usc.irds.sparkler.CrawlDbRDD.getPartitions(CrawlDbRDD.scala:72)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:641)
	at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:153)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:145)
	at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
	at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:45)
	at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:236)
	at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
	... 6 more

Guide for sparkler and hdfs

The guide says nothing about how to connect HDFS with Sparkler. The notes for Apache Nutch users and developers say:

Note 2: Crawled content
Sparkler can produce the segments on HDFS, trying to keep it compatible with the Nutch content format.

Please share the steps. How?

Update Solr Schema

  • Change jobId to crawl_id for better understanding and consistency

  • Add content_type field

  • Add crawler field for document uniqueness

  • Change plainText to extracted_text

  • Change lastFetchedAt to fetch_timestamp

  • Change indexedAt to indexed_at for consistency

  • Add fetch_status_code field to record the response code

  • Add hostname field, since group may have a different definition in the future

  • Change numTries to retries_since_fetch for better understanding and consistency

  • Add signature field to store the hash of page's content

  • Add version field which defines the schema version

  • Add outlinks field

  • Change depth to crawler_discover_depth

  • Add relative_path field to record the file path when dumped

  • Add parent field for the document's parent ID

  • Generate document ID as:

    • Seed: SHA256(crawl_id-url-ingestion_timestamp)
    • Other: SHA256(crawl_id-url-parent_fetch_timestamp)
  • Also linked to #49
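The ID scheme above can be sketched in Java. The hex encoding of the SHA-256 digest and the literal "-" separator are assumptions about the representation, and the crawl ID, URL, and timestamp below are placeholder values:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DocIdSketch {
    // Build "crawl_id-url-timestamp" and return its hex-encoded SHA-256 digest.
    static String docId(String crawlId, String url, long timestamp) throws Exception {
        String key = crawlId + "-" + url + "-" + timestamp;
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Seed document: ID from crawl_id, url, and ingestion timestamp.
        System.out.println(docId("sjob-1473484528794", "http://example.com/", 1473484528794L));
    }
}
```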

Minor Issues

  • Add core.properties to the Solr schema configuration; this will help auto-deploy the core.
  • Improve Sparkler setup guide and add missing links.

Working with remote spark.

Has anyone tried this with a non-local Spark?

I ask because when I try to run against a remote Spark, I get class-mismatch errors:

java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -5447855329526097695, local class serialVersionUID = -2221986757032131007

But when I check the versions: you use Spark 1.6.1, whose prebuilt binaries target Scala 2.10.x, while the project builds with Scala 2.11.x, and if you try to downgrade to 2.10 it doesn't compile.

No fetched content is written

Due to an incorrect validation check, all fetched URLs are filtered out and none are written to disk:

rdd.filter(_.fetchedData.getResource.getStatus == FETCHED)

URL filter regex

Hi,

Am I missing the URL filter? How can I tell the Sparkler app to apply URL-filter rules, in general or per domain?

Thanks
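Sparkler bundles a regex URL-filter plugin (urlfilter.regex shows up in the plugin logs elsewhere in these issues). Assuming it follows Nutch-style rules, a filter file looks like the sketch below, where rules are evaluated top to bottom, "+" keeps a matching URL, and "-" drops it:

```text
# skip URLs with common binary/media suffixes
-\.(gif|jpg|png|zip|gz|exe)$
# restrict the crawl to one domain
+^https?://([a-z0-9-]+\.)*example\.com/
# drop everything else
-.
```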

crawl hanging..

Hi,

I injected the seed list and started:
java -jar target/sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1483360200720 -i 2 -tg 10000 -tn 1000 -m local[*]

After 1-2 minutes the crawl hangs and no more crawling happens.

Any suggestions?


17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.ortadogugazetesi.net/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_71 stored as values in memory (estimated size 61.0 KB, free 12.9 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_71 in memory on localhost:58926 (size: 61.0 KB, free: 757.0 MB)
17/01/02 15:09:39 INFO Executor: Finished task 51.0 in stage 1.0 (TID 127). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 51.0 in stage 1.0 (TID 127) in 3599 ms on localhost (69/76)
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.kibrisgazetesi.com/
17/01/02 15:09:39 INFO Executor: Finished task 71.0 in stage 1.0 (TID 147). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 71.0 in stage 1.0 (TID 147) in 1000 ms on localhost (70/76)
17/01/02 15:09:39 WARN ParseFunction$: PARSING-CONTENT-ERROR http://www.kibrisgazetesi.com/
17/01/02 15:09:39 WARN ParseFunction$: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at org.apache.tika.parser.html.HtmlHandler.characters(HtmlHandler.java:258)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:994)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:582)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
at edu.usc.irds.sparkler.pipeline.ParseFunction$.apply(ParseFunction.scala:62)
at edu.usc.irds.sparkler.pipeline.ParseFunction$.apply(ParseFunction.scala:34)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:57)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.evrensel.net/
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.yeniakit.com/
17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.otohaber.com.tr/
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.bilimtarihi.org/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_65 stored as values in memory (estimated size 184.0 KB, free 13.0 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_65 in memory on localhost:58926 (size: 184.0 KB, free: 756.8 MB)
17/01/02 15:09:39 INFO Executor: Finished task 65.0 in stage 1.0 (TID 141). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 65.0 in stage 1.0 (TID 141) in 1849 ms on localhost (71/76)
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.otohaber.com.tr/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_54 stored as values in memory (estimated size 491.2 KB, free 13.5 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_54 in memory on localhost:58926 (size: 491.2 KB, free: 756.4 MB)
17/01/02 15:09:40 INFO Executor: Finished task 54.0 in stage 1.0 (TID 130). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 54.0 in stage 1.0 (TID 130) in 3830 ms on localhost (72/76)
17/01/02 15:09:40 INFO ParseFunction$: PARSING http://www.evrensel.net/
17/01/02 15:09:40 INFO MemoryStore: Block rdd_3_66 stored as values in memory (estimated size 458.8 KB, free 14.0 MB)
17/01/02 15:09:40 INFO BlockManagerInfo: Added rdd_3_66 in memory on localhost:58926 (size: 458.8 KB, free: 755.9 MB)
17/01/02 15:09:40 INFO Executor: Finished task 66.0 in stage 1.0 (TID 142). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 66.0 in stage 1.0 (TID 142) in 2243 ms on localhost (73/76)
17/01/02 15:09:40 INFO ParseFunction$: PARSING http://www.ortadogugazetesi.net/
17/01/02 15:09:40 INFO MemoryStore: Block rdd_3_75 stored as values in memory (estimated size 117.4 KB, free 14.1 MB)
17/01/02 15:09:40 INFO BlockManagerInfo: Added rdd_3_75 in memory on localhost:58926 (size: 117.4 KB, free: 755.8 MB)
17/01/02 15:09:40 INFO Executor: Finished task 75.0 in stage 1.0 (TID 151). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 75.0 in stage 1.0 (TID 151) in 1204 ms on localhost (74/76)


e.u.i.s.model.Resource.<init>(Resource.java:46) java.net.MalformedURLException: Stream handler unavailable due to: For input string: "0x6"

17/01/29 12:43:30 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64585 in memory (size: 2.8 KB, free: 2.4 GB)
17/01/29 12:43:31 ERROR Executor: Exception in task 7.0 in stage 4.0 (TID 67)
java.net.MalformedURLException: Stream handler unavailable due to: For input string: "0x6"
	at java.net.URL.<init>(URL.java:627)
	at java.net.URL.<init>(URL.java:490)
	at java.net.URL.<init>(URL.java:439)
	at edu.usc.irds.sparkler.model.Resource.<init>(Resource.java:46)
	at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunc$$anonfun$apply$5.apply(Crawler.scala:204)
	at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunc$$anonfun$apply$5.apply(Crawler.scala:204)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
	at edu.usc.irds.sparkler.service.SolrProxy.addResources(SolrProxy.scala:44)
	at edu.usc.irds.sparkler.solr.SolrUpsert.apply(SolrUpsert.scala:43)
	at edu.usc.irds.sparkler.solr.SolrUpsert.apply(SolrUpsert.scala:34)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Stream handler unavailable due to: For input string: "0x6"
	at org.apache.felix.framework.URLHandlersStreamHandlerProxy.parseURL(URLHandlersStreamHandlerProxy.java:429)
	at java.net.URL.<init>(URL.java:622)
	... 18 more
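The trace shows Resource's constructor calling new URL(...) on a raw outlink that the URL machinery rejects. A defensive sketch (a hypothetical helper, not Sparkler's actual code) drops such outlinks before constructing resources:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class OutlinkSanitizer {
    // Keep only outlinks that parse as well-formed http(s) URLs.
    static List<String> validOutlinks(List<String> outlinks) {
        List<String> valid = new ArrayList<>();
        for (String link : outlinks) {
            try {
                URL url = new URL(link);
                String proto = url.getProtocol();
                if (proto.equals("http") || proto.equals("https")) {
                    valid.add(link);
                }
            } catch (MalformedURLException | RuntimeException e) {
                // Malformed or handler-less URLs (e.g. the "0x6" case) are skipped.
            }
        }
        return valid;
    }
}
```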

Error message "Could not launch browser" when starting the crawler from the quickstart guide

17/01/28 15:02:03 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
17/01/28 15:02:03 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
17/01/28 15:02:03 INFO CacheManager: Partition rdd_3_0 not found, computing it
17/01/28 15:02:03 INFO CacheManager: Partition rdd_3_1 not found, computing it
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
17/01/28 15:02:03 INFO PluginService$: Felix Configuration loaded successfully
17/01/28 15:02:03 INFO FetcherJBrowserActivator: Activating FetcherJBrowser Plugin
17/01/28 15:02:03 INFO RegexURLFilterActivator: Activating RegexURL Plugin
Bundle Found: org.apache.felix.framework
Bundle Found: fetcher.jbrowser
Bundle Found: urlfilter.regex
[2017-01-28T15:02:04.134] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.134] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.135] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.135] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.135] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.135] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.135] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.135] ... 1 more
[2017-01-28T15:02:04.359] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.359] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.359] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.359] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.359] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.359] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.359] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.359] ... 1 more
[2017-01-28T15:02:04.368] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.369] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.369] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.369] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.369] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.369] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.369] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.369] ... 1 more
17/01/28 15:02:04 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.openqa.selenium.WebDriverException: Could not launch browser.
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: 'osboxes', ip: '192.168.178.134', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-514.el7.x86_64', java.version: '1.8.0_121'
Driver info: driver.version: JBrowserDriver
