
docker-aut's Introduction

docker-aut


Attention

The main branch aligns with the main branch of The Archives Unleashed Toolkit. It can be unstable at times. Stable branches are available for each AUT release.

Introduction

This is the Docker image for the Archives Unleashed Toolkit (AUT). AUT documentation can be found here. If you need a hand installing Docker, check out our Docker Install Instructions, and if you want a quick tutorial, check out our Toolkit Lesson.

The Archives Unleashed Toolkit is part of the broader Archives Unleashed Project.

Requirements

Install the following dependencies:

  1. Docker

Use

Build and Run

You can build and run this Docker image locally with the following steps:

  1. git clone https://github.com/archivesunleashed/docker-aut.git
  2. cd docker-aut
  3. docker build -t aut .
  4. docker run --rm -it aut
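To analyze your own web archives rather than the bundled sample data, you can mount a local directory into the container with Docker's -v flag. The following is a minimal sketch: /path/to/your/data is a placeholder for a directory of ARC/WARC files on your machine, and /data is an arbitrary mount point inside the container.

docker run --rm -it -v "/path/to/your/data:/data" aut

Files mounted this way can then be loaded from /data inside the container, for example with RecordLoader.loadArchives("/data/*.gz", sc) in the Spark Shell.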

Overrides

You can pass any additional Spark flags to the run command if you need to. For example:

docker run --rm -it aut /spark/bin/spark-shell --packages "io.archivesunleashed:aut:1.2.1-SNAPSHOT" --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s
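Most standard spark-shell flags can be supplied the same way. As a further hedged sketch, the run below also raises the driver memory (the 4g value is an assumption; adjust it to your machine's resources):

docker run --rm -it aut /spark/bin/spark-shell --driver-memory 4g --packages "io.archivesunleashed:aut:1.2.1-SNAPSHOT"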

Once the Spark Shell starts up, you should see:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/11/01 17:27:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://5f477f5dcab5:4040
Spark context available as 'sc' (master = local[*], app id = local-1635787667490).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.13)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

PySpark

It is also possible to start an interactive PySpark console. This requires specifying the Python bindings and the aut package, both of which are included in the Docker image under /aut/target.

To launch an interactive PySpark console:

docker run --rm -it aut /spark/bin/pyspark --py-files /aut/target/aut.zip --jars /aut/target/aut-1.2.1-SNAPSHOT-fatjar.jar
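As with the Spark Shell, you can mount local data into the container when launching PySpark. A hedged sketch, again assuming /path/to/your/data holds your ARC/WARC files:

docker run --rm -it -v "/path/to/your/data:/data" aut /spark/bin/pyspark --py-files /aut/target/aut.zip --jars /aut/target/aut-1.2.1-SNAPSHOT-fatjar.jar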

Once the PySpark console starts up, you should see:

Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/11/01 17:41:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.9.2 (default, Feb 28 2021 17:03:44)
Spark context Web UI available at http://d03127085be4:4040
Spark context available as 'sc' (master = local[*], app id = local-1635788517329).
SparkSession available as 'spark'.
>>> 

Example

Spark Shell (Scala)

When the image is running, you will be brought to the Spark Shell interface. Try running the following command.

Type

:paste

And then paste the following script in:

import io.archivesunleashed._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc).webgraph().show(10)

Press Ctrl+D in order to execute the script. You should then see the following:

+--------------+--------------------+--------------------+------+               
|    crawl_date|                 src|                dest|anchor|
+--------------+--------------------+--------------------+------+
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
+--------------+--------------------+--------------------+------+
only showing top 10 rows

In this case, things are working! Try substituting your own data (mounted using the command above).

To quit the Spark Shell, you can exit using Ctrl+C.

PySpark

When the image is running, you will be brought to the PySpark interface. Try running the following commands:

from aut import *
WebArchive(sc, sqlContext, "/aut-resources/Sample-Data/*.gz").webgraph().show(10)

You should then see the following:

+--------------+--------------------+--------------------+------+               
|    crawl_date|                 src|                dest|anchor|
+--------------+--------------------+--------------------+------+
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
+--------------+--------------------+--------------------+------+
only showing top 10 rows

In this case, things are working! Try substituting your own data (mounted using the command above).

To quit the PySpark console, you can exit using Ctrl+C.

Resources

This build also includes the aut resources repository, which contains NER libraries as well as sample data from the University of Toronto (located in /aut-resources).
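If you want to check what is bundled before starting an analysis, you can list the directory by overriding the image's default command (a minimal sketch, following the same pattern as the docker run commands above):

docker run --rm -it aut ls /aut-resources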

The ARC and WARC files are drawn from the Canadian Political Parties & Political Interest Groups Archive-It Collection, collected by the University of Toronto. We are grateful that they've provided this material to us.

If you use their material, please cite it appropriately.

You can find more information about this collection at WebArchives.ca.

Acknowledgements

This work is primarily supported by the Andrew W. Mellon Foundation. Additional funding for the Toolkit has come from the U.S. National Science Foundation, Columbia University Library's Mellon-funded Web Archiving Incentive Award, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, and the Ontario Ministry of Research and Innovation's Early Researcher Award program. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.

docker-aut's People

Contributors

ianmilligan1, ruebot, samfritz, sepastian


docker-aut's Issues

Mac OS: build fails with out-of-memory error

Describe the bug

On Mac OS, docker build -t aut . fails with java.lang.OutOfMemoryError: Java heap space.

On Linux, the build succeeds.

To Reproduce

On Mac OS, run docker build -t aut .

Expected behavior

Build the Docker image.

Screenshots

n/a

Desktop/Laptop (please complete the following information):

$ uname -a
Darwin C02F37HLML7H 21.3.0 Darwin Kernel Version 21.3.0: Wed Jan  5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_X86_64 x86_64

Smartphone (please complete the following information):

n/a

Additional context

See the log.txt file attached.

Unable to run docker-aut:0.18.0

I'm unable to run the docker container for version 0.18.0.
docker run --rm -it archivesunleashed/docker-aut:0.18.0 results in the following error:

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: com.github.archivesunleashed.tika#tika-parsers;1.22: not found

		:: com.github.netarchivesuite#language-detector;language-detector-0.6a: not found

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.github.archivesunleashed.tika#tika-parsers;1.22: not found, unresolved dependency: com.github.netarchivesuite#language-detector;language-detector-0.6a: not found]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1306)
	at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Add Azure Provider

Well, guess I should do this now. 😉

Update the Vagrantfile to support Azure provisioning, once we get up and running.

Update to use 0.10.0 release

This also requires using a new version of Spark Notebook, which uses a different way to load external libraries. The :cp command is no longer available.

Build error

On OS X 10.11.3:

ianmilligan1@Ians-MBP:~/dropbox/git/warcbase_workshop_vagrant$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'ubuntu/trusty64' could not be found. Attempting to find and install...
    default: Box Provider: virtualbox
    default: Box Version: >= 0
==> default: Loading metadata for box 'ubuntu/trusty64'
    default: URL: https://atlas.hashicorp.com/ubuntu/trusty64
==> default: Adding box 'ubuntu/trusty64' (v20160314.0.2) for provider: virtualbox
    default: Downloading: https://atlas.hashicorp.com/ubuntu/boxes/trusty64/versions/20160314.0.2/providers/virtualbox.box
==> default: Successfully added box 'ubuntu/trusty64' (v20160314.0.2) for 'virtualbox'!
==> default: Importing base box 'ubuntu/trusty64'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'ubuntu/trusty64' is up to date...
==> default: Setting the name of the VM: Warcbase workshop VM
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 8080 (guest) => 9000 (host) (adapter 1)
    default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
The guest machine entered an invalid state while waiting for it
to boot. Valid states are 'starting, running'. The machine is in the
'poweroff' state. Please verify everything is configured
properly and try again.

If the provider you're using has a GUI that comes with it,
it is often helpful to open that and watch the machine, since the
GUI often has more helpful error messages than Vagrant can retrieve.
For example, if you're using VirtualBox, run `vagrant up` while the
VirtualBox GUI is open.

The primary issue for this error is that the provider you're using
is not properly configured. This is very rarely a Vagrant issue.

Will look into this.

warcbase won't build

I've jumped through a lot of hoops trying to get warcbase to build as part of the vagrant build, and it just doesn't want to happen.

You can shell in (vagrant ssh) after the vagrant build and cd /home/vagrant/project/warcbase && sudo mvn clean package appassembler:assemble -DskipTests, and it builds fine.

See: lintool/warcbase#206

aut build fails on master

Working on updating everything here, and I noticed aut is failing to build on the master branch in Docker build process.

Here is the output of the error:

2017-12-07 23:14:13,556 [main-ScalaTest-running-CountableRDDTest] INFO  SparkUI - Stopped Spark web UI at http://172.17.0.2:4040
2017-12-07 23:14:13,558 [dispatcher-event-loop-2] INFO  MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
2017-12-07 23:14:13,562 [main-ScalaTest-running-CountableRDDTest] INFO  MemoryStore - MemoryStore cleared
2017-12-07 23:14:13,562 [main-ScalaTest-running-CountableRDDTest] INFO  BlockManager - BlockManager stopped
2017-12-07 23:14:13,564 [main-ScalaTest-running-CountableRDDTest] INFO  BlockManagerMaster - BlockManagerMaster stopped
2017-12-07 23:14:13,571 [dispatcher-event-loop-1] INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2017-12-07 23:14:13,573 [main-ScalaTest-running-CountableRDDTest] INFO  SparkContext - Successfully stopped SparkContext
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.711 sec - in io.archivesunleashed.spark.rdd.CountableRDDTest
Running io.archivesunleashed.io.ArcRecordWritableTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.231 sec - in io.archivesunleashed.io.ArcRecordWritableTest
Running io.archivesunleashed.io.GenericArchiveRecordWritableTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.382 sec - in io.archivesunleashed.io.GenericArchiveRecordWritableTest
Running io.archivesunleashed.io.WarcRecordWritableTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.344 sec - in io.archivesunleashed.io.WarcRecordWritableTest
Running io.archivesunleashed.ingest.WacArcLoaderTest
2017-12-07 23:14:14,679 [main] INFO  WacArcLoaderTest - 300 records read!
2017-12-07 23:14:14,860 [main] INFO  WacArcLoaderTest - 300 records read!
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.322 sec - in io.archivesunleashed.ingest.WacArcLoaderTest
Running io.archivesunleashed.ingest.WacWarcLoaderTest
2017-12-07 23:14:15,246 [main] INFO  WacWarcLoaderTest - 822 records read!
2017-12-07 23:14:15,623 [main] INFO  WacWarcLoaderTest - 822 records read!
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.762 sec - in io.archivesunleashed.ingest.WacWarcLoaderTest
Running io.archivesunleashed.mapreduce.WacWarcInputFormatTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.244 sec - in io.archivesunleashed.mapreduce.WacWarcInputFormatTest
Running io.archivesunleashed.mapreduce.WacArcInputFormatTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec - in io.archivesunleashed.mapreduce.WacArcInputFormatTest
Running io.archivesunleashed.mapreduce.WacGenericInputFormatTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.351 sec - in io.archivesunleashed.mapreduce.WacGenericInputFormatTest
2017-12-07 23:14:16,340 [Thread-1] INFO  ShutdownHookManager - Shutdown hook called
2017-12-07 23:14:16,341 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-40f43281-67db-4a4e-843c-8cbe042ff68e

Results :

Tests in error: 
  ExtractPopularImagesTest.run:32->org$scalatest$BeforeAndAfter$$super$run:32->FunSuite.org$scalatest$FunSuiteLike$$super$run:1560->FunSuite.runTests:1560->runTest:32->org$scalatest$BeforeAndAfter$$super$runTest:32->FunSuite.withFixture:1560->FunSuite.newAssertionFailedException:1560 ? TestFailed

Tests run: 75, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:02 min
[INFO] Finished at: 2017-12-07T23:14:16+00:00
[INFO] Final Memory: 70M/554M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project aut: There are test failures.
[ERROR] 
[ERROR] Please refer to /aut/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
The command '/bin/sh -c git clone https://github.com/archivesunleashed/aut.git /aut     && cd /aut && mvn clean install' returned a non-zero code: 1

org.apache.hadoop#hadoop-core;0.20.2-cdh3u4: not found

:: problems summary ::
:::: WARNINGS
		module not found: org.apache.hadoop#hadoop-core;0.20.2-cdh3u4

	==== local-m2-cache: tried

	  file:/root/.m2/repository/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.pom

	  -- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:

	  file:/root/.m2/repository/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.jar

	==== local-ivy-cache: tried

	  /root/.ivy2/local/org.apache.hadoop/hadoop-core/0.20.2-cdh3u4/ivys/ivy.xml

	  -- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:

	  /root/.ivy2/local/org.apache.hadoop/hadoop-core/0.20.2-cdh3u4/jars/hadoop-core.jar

	==== central: tried

	  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.pom

	  -- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:

	  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.jar

	==== spark-packages: tried

	  http://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.pom

	  -- artifact org.apache.hadoop#hadoop-core;0.20.2-cdh3u4!hadoop-core.jar:

	  http://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-core/0.20.2-cdh3u4/hadoop-core-0.20.2-cdh3u4.jar

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: org.apache.hadoop#hadoop-core;0.20.2-cdh3u4: not found

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hadoop#hadoop-core;0.20.2-cdh3u4: not found]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1083)
	at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:296)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:160)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I'm working on a 0.11.0 Docker build, but ran into this. @ianmilligan1 @lintool, are you fine with me cutting a 0.11.1 release that resolves the issue?

N.B. At this point I'd prefer to build the Docker image with --packages as opposed to --jars, because it is surfacing a lot of dependency issues that I fear have remained hidden for a long time.

unable to run docker image

Unable to get the Docker container running. It throws the following error:

docker run --rm -it aut
...
:: problems summary ::
:::: WARNINGS
		[NOT FOUND  ] com.thoughtworks.paranamer#paranamer;2.8!paranamer.jar(bundle) (0ms)

	==== local-m2-cache: tried

	  file:/root/.m2/repository/com/thoughtworks/paranamer/paranamer/2.8/paranamer-2.8.jar

		::::::::::::::::::::::::::::::::::::::::::::::

		::              FAILED DOWNLOADS            ::

		:: ^ see resolution messages for details  ^ ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: com.thoughtworks.paranamer#paranamer;2.8!paranamer.jar(bundle)

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [download failed: com.thoughtworks.paranamer#paranamer;2.8!paranamer.jar(bundle)]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1083)
	at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:296)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:160)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Spark Notebook - 0.11.0

I'm working on creating a 0.11.0 version, and looking at the documentation we have now, there are no Spark Notebook examples. It appears to be all Spark Shell. Should I remove Spark Notebook from the build process and the README instructions?

Update dockerhub image to 0.90.5

The README.md references /aut/target/aut-0.90.5-SNAPSHOT-fatjar.jar:

docker run --rm -it \
  archivesunleashed/docker-aut \
  /spark/bin/pyspark \
  --py-files /aut/target/aut.zip \
  --jars /aut/target/aut-0.90.5-SNAPSHOT-fatjar.jar

but the Docker image on dockerhub archivesunleashed/docker-aut:latest contains aut-0.90.3-SNAPSHOT-fatjar.jar:

$ docker pull archivesunleashed/docker-aut:latest
Using default tag: latest
latest: Pulling from archivesunleashed/docker-aut
Digest: sha256:cbaabbd3bf2783ec3af1956fefb44ce20e10b6c6321cd5c837dd52e3128a2012
Status: Downloaded newer image for archivesunleashed/docker-aut:latest
docker.io/archivesunleashed/docker-aut:latest
$ docker run --rm -it archivesunleashed/docker-aut:latest ls /aut/target
:
aut-0.90.3-SNAPSHOT-fatjar.jar
:

Push the most recent build of archivesunleashed/docker-aut to dockerhub.
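A minimal sketch of what that push might look like, assuming push access to the archivesunleashed organization on Docker Hub (the tag name is an assumption):

docker build -t archivesunleashed/docker-aut:latest .
docker push archivesunleashed/docker-aut:latest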

Spark Notebook crashes when loading warcbase

The Spark Notebook works on http://127.0.0.1:9000/# as directed in the walkthrough, but when you load the fatjar the browser hangs. The terminal displays the following errors and we can't continue.

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.net.URI.<init>(URI.java:588)
    at akka.actor.ActorPathExtractor$.unapply(Address.scala:154)
    at akka.remote.RemoteActorRefProvider.resolveActorRefWithLocalAddress(RemoteActorRefProvider.scala:347)
    at akka.remote.transport.AkkaPduProtobufCodec$.decodeMessage(AkkaPduCodec.scala:191)
    at akka.remote.EndpointReader.akka$remote$EndpointReader$$tryDecodeMessageAndAck(Endpoint.scala:993)
    at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:926)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Uncaught error from thread [Remote-akka.remote.default-remote-dispatcher-7] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.jar.Attributes.read(Attributes.java:394)
    at java.util.jar.Manifest.read(Manifest.java:199)
    at java.util.jar.Manifest.<init>(Manifest.java:69)
    at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
    at java.util.jar.JarFile.getManifest(JarFile.java:180)
    at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at scala.concurrent.Future$class.foreach(Future.scala:204)
    at scala.concurrent.impl.Promise$DefaultPromise.foreach(Promise.scala:153)
    at akka.remote.transport.netty.NettyTransport$.gracefulClose(NettyTransport.scala:222)
    at akka.remote.transport.netty.TcpAssociationHandle.disassociate(TcpSupport.scala:94)
    at akka.remote.transport.ProtocolStateActor$$anonfun$1.applyOrElse(AkkaProtocolTransport.scala:516)
    at akka.remote.transport.ProtocolStateActor$$anonfun$1.applyOrElse(AkkaProtocolTransport.scala:480)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at akka.actor.FSM$class.terminate(FSM.scala:672)
    at akka.actor.FSM$class.applyState(FSM.scala:617)
    at akka.remote.transport.ProtocolStateActor.applyState(AkkaProtocolTransport.scala:269)
    at akka.actor.FSM$class.processEvent(FSM.scala:609)
    at akka.remote.transport.ProtocolStateActor.processEvent(AkkaProtocolTransport.scala:269)
    at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:598)
    at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:592)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
