uscdatascience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Home Page: http://irds.usc.edu/sparkler/

License: Apache License 2.0

Scala 34.60% Shell 4.84% Java 38.15% JavaScript 11.55% Python 8.25% HTML 0.68% Dockerfile 1.50% CSS 0.26% Mustache 0.19%
solr web-crawler spark nutch tika big-data information-retrieval search-engine search distributed-systems

sparkler's Introduction


A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and pf4j. Sparkler is an extensible, highly scalable, and high-performance web crawler that is an evolution of Apache Nutch and runs on Apache Spark Cluster.

NOTE:

Sparkler is being proposed to the Apache Incubator. Review the proposal document and provide your suggestions here. (Will be done later, eventually!)

Notable features of Sparkler:

  • Higher performance and fault tolerance: The crawl pipeline has been redesigned to take advantage of the caching and fault-tolerance capabilities of Apache Spark.
  • Complex and near-real-time analytics: The internal data structure is an indexed store powered by Apache Lucene that can answer complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) exposes the crawler analytics via an HTTP API. These analytics can be visualized with intuitive charts in the admin dashboard (coming soon).
  • Real-time content streaming: Optionally, Apache Kafka can be configured to stream the fetched content out as soon as it becomes available.
  • JavaScript rendering: Executes the JavaScript code in web pages to produce the final state of the page. The setup is easy and painless, and it scales by distributing the work across Spark. Sessions and cookies are preserved for subsequent requests to the same host.
  • Extensible plugin framework: Sparkler is designed to be modular. It supports plugins to extend and customize its runtime behaviour.
  • Universal parser: Apache Tika, the popular content-detection and analysis toolkit that can deal with thousands of file formats, is used to discover links to outgoing web resources and to analyze fetched resources.
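To make the plugin idea concrete, here is a generic extension-point sketch in Python. It is illustrative only: Sparkler's real plugins (e.g. urlfilter-regex) implement Java/Scala interfaces that pf4j discovers and loads, and the class and function names below are invented for this sketch.

```python
import re

class UrlFilter:
    """Illustrative extension point: decide whether a URL should be crawled."""
    def accept(self, url: str) -> bool:
        raise NotImplementedError

class RegexUrlFilter(UrlFilter):
    """Illustrative plugin: reject URLs matching a deny pattern."""
    def __init__(self, deny_pattern: str):
        self._deny = re.compile(deny_pattern)

    def accept(self, url: str) -> bool:
        return self._deny.search(url) is None

def apply_filters(filters, urls):
    """Host-side hook: keep only URLs accepted by every registered plugin."""
    return [u for u in urls if all(f.accept(u) for f in filters)]
```

The point of the pattern is that the crawler core only knows the `UrlFilter` contract; concrete behaviour (regex filtering, robots rules, etc.) is supplied by plugins at runtime.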

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get the image
docker pull ghcr.io/uscdatascience/sparkler/sparkler:main
# Step 1. Create a volume for elastic
docker volume create elastic
# Step 2. Inject seed urls
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job (job id "myid", top 100 URLs per iteration, 2 iterations)
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main crawl -id myid -tn 100 -i 2

Running Sparkler with seed urls file:

1. Follow Steps 0-1
2. Create a file named seed-urls.txt using the Emacs editor as follows:
       a. emacs sparkler/bin/seed-urls.txt
       b. copy-paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use the Vim or Nano editors, or simply run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job.

To crawl until all new URLs are exhausted, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
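Conceptually, -i -1 keeps iterating until an iteration finds no URLs left in status NEW. A minimal Python sketch of that stopping condition (the dict frontier and the function names stand in for the real Solr-backed crawldb; they are not Sparkler APIs):

```python
def crawl_until_done(frontier, fetch, max_iterations=None):
    """Run generate/fetch iterations until no URL is left in status NEW.

    frontier maps url -> status ('NEW' or 'FETCHED'); fetch(url) returns outlinks.
    max_iterations=None mimics the -i -1 behaviour (run until exhausted).
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        new_urls = [u for u, status in frontier.items() if status == "NEW"]
        if not new_urls:
            break  # nothing left to crawl: the -i -1 stopping condition
        for url in new_urls:
            for outlink in fetch(url):
                frontier.setdefault(outlink, "NEW")  # discovered URLs join the frontier
            frontier[url] = "FETCHED"
        iterations += 1
    return iterations
```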

Making Contributions:

Contact Us

Questions and suggestions are welcome on our mailing list: [email protected]. Alternatively, you may use the Slack channel to get help: http://irds.usc.edu/sparkler/#slack

sparkler's People

Contributors

amirhosf, berkarcan, buggtb, chrismattmann, dependabot[bot], felixloesing, gitter-badger, giuseppetotaro, karanjeets, kefaun2601, kyan2601, ldaume, lewismc, mattvryan-github, nhandyal, prenastro, prowave, rahulpalamuttam, rohithyeravothula, ryanstonebraker, sk-s-hub, slhsxcmy, smadha, sujen1412, thammegowda


sparkler's Issues

Finish Juju charm

Some high level remaining tasks:

  • Add solr relation
  • Pick up spark details from relation
  • Pick up solr details from relation
  • Finish write to configuration
  • Compile and add resource
  • Add actions for remote execution of crawler
  • Add Kafka relation
  • Return id in action output
  • Store last ingest id in kv so crawl can pick it up without user looking it up
  • Sort out solr cloud
  • Finish Docs
  • Add charm push to CI for beta branch

Escape metachars in solr queries

java.lang.RuntimeException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: org.apache.solr.search.SyntaxError: Cannot parse 'group:': Encountered "" at line 1, column 6.
Was expecting one of:
    "(" ...
    "*" ...
    "[" ...
    "{" ...
    "filter(" ...
    ...

    at edu.usc.irds.sparkler.util.SolrResultIterator.getNextBean(SolrResultIterator.scala:72)
    at edu.usc.irds.sparkler.util.SolrResultIterator.<init>(SolrResultIterator.scala:57)
    at edu.usc.irds.sparkler.CrawlDbRDD.compute(CrawlDbRDD.scala:55)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
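The fix being asked for is to backslash-escape Lucene/Solr query metacharacters before interpolating raw values such as group:. A minimal sketch, assuming the standard Lucene special-character set (the helper name is illustrative, not Sparkler's actual code):

```python
# Characters with special meaning in the Lucene query syntax.
# (&& and || are two-character operators; escaping each character
# individually is a safe over-approximation.)
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')

def escape_solr_term(term: str) -> str:
    """Backslash-escape Lucene/Solr query metacharacters in a raw term."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)
```

With this, a raw value like group: becomes group\: and no longer triggers the SyntaxError above.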

Job hangs for a minute when Kafka is not configured

When the Kafka server is not configured or not running, the crawl job makes repeated attempts to establish a connection, which adds a noticeable delay.

By default, the Kafka feature should be disabled in the configuration.

@karanjeets Thoughts? Can you have a look at this?
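The requested behaviour amounts to bounding the connection retries instead of retrying indefinitely. A hedged Python sketch of that pattern (the connect callable and the limits are illustrative; this is not Sparkler's actual Kafka client code):

```python
import time

def connect_with_retries(connect, max_attempts=3, backoff_s=0.0):
    """Attempt a connection a bounded number of times, then give up.

    Returns the connection on success, or None after max_attempts failures,
    so the caller can disable the feature instead of hanging the whole job.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt < max_attempts:
                time.sleep(backoff_s)  # brief pause before retrying
    return None
```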

Java null pointer error in fetch()

Hi!

I am encountering some errors: the program crashes after about 10 crawls with the errors below (I marked the key parts in bold). Can you help me figure out why?

Best,

1st

2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8) org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'** System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

2nd

2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'**
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

3rd

ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2 java.lang.NullPointerException
at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6)
java.util.NoSuchElementException: key not found: 6**
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

...

2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

Process finished with exit code 1

Sparkler Build Failing

The build is failing due to reference to a deprecated (removed) module "sparkler-plugins-active".

Working on it...

Logs:

[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ sparkler-api ---
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-api/target/sparkler-api-0.1-SNAPSHOT.jar to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.jar
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-api/pom.xml to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.pom
[INFO] 
[INFO] --- maven-bundle-plugin:2.5.0:install (default-install) @ sparkler-api ---
[INFO] Installing edu/usc/irds/sparkler/sparkler-api/0.1-SNAPSHOT/sparkler-api-0.1-SNAPSHOT.jar
[INFO] Writing OBR metadata
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building sparkler-plugins 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ sparkler-plugins ---
[INFO] 
[INFO] --- maven-install-plugin:2.4:install (default-install) @ sparkler-plugins ---
[INFO] Installing /gpfs/flash/users/tg830544/sparkler/sparkler-plugins/pom.xml to /home/03755/tg830544/.m2/repository/edu/usc/irds/sparkler/plugin/sparkler-plugins/0.1-SNAPSHOT/sparkler-plugins-0.1-SNAPSHOT.pom
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building sparkler 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] sparkler-parent .................................... SUCCESS [  0.117 s]
[INFO] sparkler-api ....................................... SUCCESS [  2.089 s]
[INFO] sparkler-plugins ................................... SUCCESS [  0.005 s]
[INFO] sparkler ........................................... FAILURE [  0.495 s]
[INFO] urlfilter-regex .................................... SKIPPED
[INFO] fetcher-jbrowser ................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.265 s
[INFO] Finished at: 2016-10-28T17:54:21-05:00
[INFO] Final Memory: 43M/1451M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project sparkler-app: Could not resolve dependencies for project edu.usc.irds.sparkler:sparkler-app:jar:0.1-SNAPSHOT: Could not find artifact edu.usc.irds.sparkler.plugin:sparkler-plugins:jar:0.1-SNAPSHOT -> [Help 1]

Debugging crawl in Sparkler

URL Partitioner

Input: Query Solr for the URLs to be generated

status:NEW

Output: Files with a list of URLs partitioned by host (group) such that every file corresponds to one host
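The grouping step above can be sketched as follows (plain URL strings stand in for the Solr documents the real partitioner reads):

```python
from collections import defaultdict
from urllib.parse import urlparse

def partition_by_host(urls):
    """Group URLs by hostname so each partition corresponds to one host."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)
```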

Fetch

Input: URL to fetch
Output: Request and Response Headers written in a file

Parse

Input: URL (which will be fetched and parsed) OR the fetched content
Output: Extracted Content

Fair Fetcher

Input: List of URLs. Uses Crawl policy
Output: fetched and/or parsed content in separate files under a directory
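The host-fairness idea can be sketched as a round-robin interleaving of the per-host partitions (the real FairFetcher also applies a crawl delay and interleaves fetching with parsing; this shows only the ordering):

```python
from itertools import zip_longest

def fair_order(partitions):
    """Interleave per-host URL lists round-robin so no host is hit in a burst."""
    ordered = []
    for round_urls in zip_longest(*partitions.values()):
        ordered.extend(u for u in round_urls if u is not None)
    return ordered
```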

Setup unit tests and integration tests for Sparkler

To begin with, write unit tests for :

  • test RDD functions
  • Test core functionality of plugins

Tasks

  • Setup a web server for testing
    • Bind it to junit to auto start and stop while running the tests
  • Setup a solr instance for testing
    • Bind it to junit to auto start and stop
  • Test Default Fetcher
  • Test Javascript Engine functionality
  • Test URL filters
  • Test Fetch Function
  • Test Parse Function
  • Test Seed Injection
  • Test URL Normalizer
  • Test HDFS persistence
  • Test Kafka Output

PS:
The current progress can be tracked on https://github.com/USCDataScience/sparkler/tree/unittests branch

Maven packaging problem

~/git_workspace/sparkler/target$ java -classpath sparkler-0.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.sparkler.pipeline.Crawler -m "local" -j "sparkler-job-1465179374801" -i 1
2016-06-05 19:26:41 WARN NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-06-05 19:26:42 ERROR SparkContext:95 [main] - Error initializing SparkContext.
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:169)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:505)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(U

Build failing

http://pastebin.com/xPbxHNhM

...
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=36 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=39 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=41 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=44 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=46 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=Whitespace at end of line line=48 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/Constants.scala message=File must end with newline character
warning file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SolrResultIterator.scala message=Avoid using null line=56 column=43
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=26 column=2
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=30 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=35 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=37 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=45 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=52 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=Whitespace at end of line line=60 column=0
error file=/Users/madhav/Documents/workspace/sparkler/sparkler-app/src/main/scala/edu/usc/irds/sparkler/util/SparklerConfiguration.scala message=File must end with newline character
Saving to outputFile=/Users/madhav/Documents/workspace/sparkler/sparkler-app/target/scalastyle-output.xml
Processed 23 file(s)
Found 29 errors
Found 9 warnings
....

Seems like a config issue

log4j

[hadoop@NameNode target]$ java -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt
log4j:WARN No appenders could be found for logger (edu.usc.irds.sparkler.service.Injector$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

jobId = sjob-1473483131504

So, I solved the problem with:
java -Dlog4j.configuration=file:///$SPARKLER_HOME/sparkler-app/src/main/resources/log4j.properties -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt

and get jobId=sjob-1473484528794
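The warnings appear because log4j 1.2 finds no configuration file on the classpath. For reference, a minimal log4j.properties that logs INFO and above to the console could look like this (an illustrative sketch, not necessarily the project's bundled file):

```properties
# Minimal log4j 1.2 setup (illustrative): INFO and above to stderr/console.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```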

But when I run the crawl:
java -jar sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1473484528794 -m yarn-client -i 2
the error below occurs.

My environment is Hadoop 2.4.0, Spark 1.6.1, Nutch 1.11, Solr 6.0.1, JDK 1.8.0u92, and Scala 2.11.8, and everything else works well.

How can I fix it?

[hadoop@NameNode target]$ java -jar sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1473484528794 -m yarn-client -i 2
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/10 14:30:56 INFO SparkContext: Running Spark version 1.6.1
16/09/10 14:30:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/10 14:30:56 INFO SecurityManager: Changing view acls to: hadoop
16/09/10 14:30:56 INFO SecurityManager: Changing modify acls to: hadoop
16/09/10 14:30:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/09/10 14:30:56 INFO Utils: Successfully started service 'sparkDriver' on port 49943.
16/09/10 14:30:57 INFO Slf4jLogger: Slf4jLogger started
16/09/10 14:30:57 INFO Remoting: Starting remoting
16/09/10 14:30:57 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:36988]
16/09/10 14:30:57 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 36988.
16/09/10 14:30:57 INFO SparkEnv: Registering MapOutputTracker
16/09/10 14:30:57 INFO SparkEnv: Registering BlockManagerMaster
16/09/10 14:30:57 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b32ee736-ef54-4f7a-83ff-6c8f6ab3d442
16/09/10 14:30:57 INFO MemoryStore: MemoryStore started with capacity 723.0 MB
16/09/10 14:30:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:411)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2118)
at org.apache.spark.storage.BlockManager.(BlockManager.scala:105)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.(SparkContext.scala:457)
at edu.usc.irds.sparkler.pipeline.Crawler.init(Crawler.scala:94)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:108)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:201)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:395)
... 21 more
16/09/10 14:30:57 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:394)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:411)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2118)
at org.apache.spark.storage.BlockManager.(BlockManager.scala:105)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.(SparkContext.scala:457)
at edu.usc.irds.sparkler.pipeline.Crawler.init(Crawler.scala:94)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:108)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:201)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:395)
... 21 more
16/09/10 14:30:57 INFO DiskBlockManager: Shutdown hook called
16/09/10 14:30:57 INFO ShutdownHookManager: Shutdown hook called

Solr Cloud - solrj.SolrServerException: No live SolrServers available to handle this request

When Solr Cloud is enabled as the backend, we get this:

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
	at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.0.11:8983/solr/crawldb_shard1_replica1, http://192.168.0.11:8984/solr/crawldb_shard1_replica2]
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
	at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
	at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
	at edu.usc.irds.sparkler.CrawlDbRDD.getPartitions(CrawlDbRDD.scala:72)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:641)
	at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:153)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:145)
	at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
	at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:45)
	at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:236)
	at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
	... 6 more

Guide for sparkler and hdfs

The guide says nothing about how to connect HDFS with Sparkler. The notes for Apache Nutch users and developers say:

Note 2: Crawled content
Sparkler can produce the segments on HDFS, trying to keep it compatible with the Nutch content format.

Please share the steps. How?

Update Solr Schema

  • Change jobId to crawl_id for better understanding and consistency

  • Add content_type field

  • Add crawler field for document uniqueness

  • Change plainText to extracted_text

  • Change lastFetchedAt to fetch_timestamp

  • Change indexedAt to indexed_at for consistency

  • Add fetch_status_code field to record the response code

  • Add hostname field, since group may have a different definition in the future

  • Change numTries to retries_since_fetch for better understanding and consistency

  • Add signature field to store the hash of page's content

  • Add version field which defines the schema version

  • Add outlinks field

  • Change depth to crawler_discover_depth

  • Add relative_path field to record the file path when dumped

  • Add parent field for the document's parent ID

  • Generate document ID as:

    • Seed: SHA256(crawl_id-url-ingestion_timestamp)
    • Other: SHA256(crawl_id-url-parent_fetch_timestamp)
  • Also linked to #49
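The ID scheme above can be sketched in Java. The hex encoding of the SHA-256 digest and the literal "-" separator are assumptions about the representation, and the crawl ID, URL, and timestamp below are placeholder values:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DocIdSketch {
    // Build "crawl_id-url-timestamp" and return its hex-encoded SHA-256 digest.
    static String docId(String crawlId, String url, long timestamp) throws Exception {
        String key = crawlId + "-" + url + "-" + timestamp;
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Seed document: ID from crawl_id, url, and ingestion timestamp.
        System.out.println(docId("sjob-1473484528794", "http://example.com/", 1473484528794L));
    }
}
```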

Minor Issues

  • Add core.properties to the Solr schema configuration; this will help auto-deploy the core.
  • Improve Sparkler setup guide and add missing links.

Working with remote spark.

Has anyone tried this with a non-local Spark?

I ask because when I try to run against a remote Spark, I get class-mismatch errors:

java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -5447855329526097695, local class serialVersionUID = -2221986757032131007

But when I check the versions: you use Spark 1.6.1, whose prebuilt binaries target Scala 2.10.x, while the project builds with Scala 2.11.x, and if you try to downgrade to 2.10 it doesn't compile.

No fetched content is written

Due to an incorrect validation check, all fetched URLs are filtered out and none are written to disk:

rdd.filter(_.fetchedData.getResource.getStatus == FETCHED)

URL filter regex

Hi,

Am I missing the URL filter? How can I tell the Sparkler app to apply URL-filter rules, in general or per domain?

Thanks
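Sparkler bundles a regex URL-filter plugin (urlfilter.regex shows up in the plugin logs elsewhere in these issues). Assuming it follows Nutch-style rules, a filter file looks like the sketch below, where rules are evaluated top to bottom, "+" keeps a matching URL, and "-" drops it:

```text
# skip URLs with common binary/media suffixes
-\.(gif|jpg|png|zip|gz|exe)$
# restrict the crawl to one domain
+^https?://([a-z0-9-]+\.)*example\.com/
# drop everything else
-.
```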

crawl hanging..

Hi,

I injected the seed list and started:
java -jar target/sparkler-app-0.1-SNAPSHOT.jar crawl -id sjob-1483360200720 -i 2 -tg 10000 -tn 1000 -m local[*]

After 1-2 minutes the crawl hangs and no more crawling happens.

Any suggestions?


17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.ortadogugazetesi.net/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_71 stored as values in memory (estimated size 61.0 KB, free 12.9 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_71 in memory on localhost:58926 (size: 61.0 KB, free: 757.0 MB)
17/01/02 15:09:39 INFO Executor: Finished task 51.0 in stage 1.0 (TID 127). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 51.0 in stage 1.0 (TID 127) in 3599 ms on localhost (69/76)
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.kibrisgazetesi.com/
17/01/02 15:09:39 INFO Executor: Finished task 71.0 in stage 1.0 (TID 147). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 71.0 in stage 1.0 (TID 147) in 1000 ms on localhost (70/76)
17/01/02 15:09:39 WARN ParseFunction$: PARSING-CONTENT-ERROR http://www.kibrisgazetesi.com/
17/01/02 15:09:39 WARN ParseFunction$: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at org.apache.tika.parser.html.HtmlHandler.characters(HtmlHandler.java:258)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:994)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:582)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
at edu.usc.irds.sparkler.pipeline.ParseFunction$.apply(ParseFunction.scala:62)
at edu.usc.irds.sparkler.pipeline.ParseFunction$.apply(ParseFunction.scala:34)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:57)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.evrensel.net/
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.yeniakit.com/
17/01/02 15:09:39 INFO FetchFunction$: Using Default Fetcher
17/01/02 15:09:39 INFO FetcherDefault: DEFAULT FETCHER http://www.otohaber.com.tr/
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.bilimtarihi.org/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_65 stored as values in memory (estimated size 184.0 KB, free 13.0 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_65 in memory on localhost:58926 (size: 184.0 KB, free: 756.8 MB)
17/01/02 15:09:39 INFO Executor: Finished task 65.0 in stage 1.0 (TID 141). 1664 bytes result sent to driver
17/01/02 15:09:39 INFO TaskSetManager: Finished task 65.0 in stage 1.0 (TID 141) in 1849 ms on localhost (71/76)
17/01/02 15:09:39 INFO ParseFunction$: PARSING http://www.otohaber.com.tr/
17/01/02 15:09:39 INFO MemoryStore: Block rdd_3_54 stored as values in memory (estimated size 491.2 KB, free 13.5 MB)
17/01/02 15:09:39 INFO BlockManagerInfo: Added rdd_3_54 in memory on localhost:58926 (size: 491.2 KB, free: 756.4 MB)
17/01/02 15:09:40 INFO Executor: Finished task 54.0 in stage 1.0 (TID 130). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 54.0 in stage 1.0 (TID 130) in 3830 ms on localhost (72/76)
17/01/02 15:09:40 INFO ParseFunction$: PARSING http://www.evrensel.net/
17/01/02 15:09:40 INFO MemoryStore: Block rdd_3_66 stored as values in memory (estimated size 458.8 KB, free 14.0 MB)
17/01/02 15:09:40 INFO BlockManagerInfo: Added rdd_3_66 in memory on localhost:58926 (size: 458.8 KB, free: 755.9 MB)
17/01/02 15:09:40 INFO Executor: Finished task 66.0 in stage 1.0 (TID 142). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 66.0 in stage 1.0 (TID 142) in 2243 ms on localhost (73/76)
17/01/02 15:09:40 INFO ParseFunction$: PARSING http://www.ortadogugazetesi.net/
17/01/02 15:09:40 INFO MemoryStore: Block rdd_3_75 stored as values in memory (estimated size 117.4 KB, free 14.1 MB)
17/01/02 15:09:40 INFO BlockManagerInfo: Added rdd_3_75 in memory on localhost:58926 (size: 117.4 KB, free: 755.8 MB)
17/01/02 15:09:40 INFO Executor: Finished task 75.0 in stage 1.0 (TID 151). 1664 bytes result sent to driver
17/01/02 15:09:40 INFO TaskSetManager: Finished task 75.0 in stage 1.0 (TID 151) in 1204 ms on localhost (74/76)


e.u.i.s.model.Resource.<init>(Resource.java:46) java.net.MalformedURLException: Stream handler unavailable due to: For input string: "0x6"

17/01/29 12:43:30 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:64585 in memory (size: 2.8 KB, free: 2.4 GB)
17/01/29 12:43:31 ERROR Executor: Exception in task 7.0 in stage 4.0 (TID 67)
java.net.MalformedURLException: Stream handler unavailable due to: For input string: "0x6"
	at java.net.URL.<init>(URL.java:627)
	at java.net.URL.<init>(URL.java:490)
	at java.net.URL.<init>(URL.java:439)
	at edu.usc.irds.sparkler.model.Resource.<init>(Resource.java:46)
	at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunc$$anonfun$apply$5.apply(Crawler.scala:204)
	at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunc$$anonfun$apply$5.apply(Crawler.scala:204)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
	at edu.usc.irds.sparkler.service.SolrProxy.addResources(SolrProxy.scala:44)
	at edu.usc.irds.sparkler.solr.SolrUpsert.apply(SolrUpsert.scala:43)
	at edu.usc.irds.sparkler.solr.SolrUpsert.apply(SolrUpsert.scala:34)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Stream handler unavailable due to: For input string: "0x6"
	at org.apache.felix.framework.URLHandlersStreamHandlerProxy.parseURL(URLHandlersStreamHandlerProxy.java:429)
	at java.net.URL.<init>(URL.java:622)
	... 18 more
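The trace shows Resource's constructor calling new URL(...) on a raw outlink that the URL machinery rejects. A defensive sketch (a hypothetical helper, not Sparkler's actual code) drops such outlinks before constructing resources:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class OutlinkSanitizer {
    // Keep only outlinks that parse as well-formed http(s) URLs.
    static List<String> validOutlinks(List<String> outlinks) {
        List<String> valid = new ArrayList<>();
        for (String link : outlinks) {
            try {
                URL url = new URL(link);
                String proto = url.getProtocol();
                if (proto.equals("http") || proto.equals("https")) {
                    valid.add(link);
                }
            } catch (MalformedURLException | RuntimeException e) {
                // Malformed or handler-less URLs (e.g. the "0x6" case) are skipped.
            }
        }
        return valid;
    }
}
```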

Error message "Could not launch browser" when starting the crawler from the quickstart guide

17/01/28 15:02:03 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
17/01/28 15:02:03 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
17/01/28 15:02:03 INFO CacheManager: Partition rdd_3_0 not found, computing it
17/01/28 15:02:03 INFO CacheManager: Partition rdd_3_1 not found, computing it
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
17/01/28 15:02:03 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
17/01/28 15:02:03 INFO PluginService$: Felix Configuration loaded successfully
17/01/28 15:02:03 INFO FetcherJBrowserActivator: Activating FetcherJBrowser Plugin
17/01/28 15:02:03 INFO RegexURLFilterActivator: Activating RegexURL Plugin
Bundle Found: org.apache.felix.framework
Bundle Found: fetcher.jbrowser
Bundle Found: urlfilter.regex
[2017-01-28T15:02:04.134] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.134] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.135] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.135] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.135] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.135] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.135] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.135] ... 1 more
[2017-01-28T15:02:04.359] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.359] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.359] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.359] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.359] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.359] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.359] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.359] ... 1 more
[2017-01-28T15:02:04.368] java.lang.NoClassDefFoundError: com/sun/webkit/network/CookieManager
[2017-01-28T15:02:04.369] at com.machinepublishers.jbrowserdriver.JBrowserDriverServer.main(JBrowserDriverServer.java:74)
[2017-01-28T15:02:04.369] Caused by: java.lang.ClassNotFoundException: com.sun.webkit.network.CookieManager
[2017-01-28T15:02:04.369] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2017-01-28T15:02:04.369] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2017-01-28T15:02:04.369] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
[2017-01-28T15:02:04.369] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2017-01-28T15:02:04.369] ... 1 more
17/01/28 15:02:04 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.openqa.selenium.WebDriverException: Could not launch browser.
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: 'osboxes', ip: '192.168.178.134', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-514.el7.x86_64', java.version: '1.8.0_121'
Driver info: driver.version: JBrowserDriver
