sansa-template-maven-spark's Introduction

SANSA-Stack

This project comprises the whole Semantic Analytics Stack (SANSA). At a glance, it features the following functionality:

  • Ingesting RDF and OWL data in various formats into RDDs
  • Operators for working with RDDs and data frames of RDF data at various levels (triples, bindings, graphs, etc.)
  • Transformation of RDDs to data frames and partitioning of RDDs into R2RML-mapped data frames
  • Distributed SPARQL querying over R2RML-mapped data frame partitions using RDB2RDF engines (Sparqlify & Ontop)
  • Enrichment of RDDs with inferences
  • Application of machine learning algorithms

For a detailed description of SANSA, please visit http://sansa-stack.net.

Layers

The SANSA project is structured in the following five layers, developed in their respective sub-folders:

  • RDF (sansa-rdf)
  • OWL (sansa-owl)
  • Query (sansa-query)
  • Inference (sansa-inference)
  • ML (sansa-ml)

Release Cycle

A SANSA stack release is made every six months and consists of the latest stable version of each layer at that time. This repository is used for organising those joint releases.

Usage

Spark

Requirements

We currently require a Spark 3.x setup with Scala 2.12. A Spark 2.x version can be built from source based on the spark2 branch.
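As a minimal sketch of how a project might pin these versions in its POM (the property names and the Spark patch version are illustrative assumptions, not taken from the SANSA build — align them with your actual cluster):

```xml
<properties>
   <!-- Illustrative only: use the Spark/Scala versions your cluster runs -->
   <scala.binary.version>2.12</scala.binary.version>
   <spark.version>3.0.1</spark.version>
</properties>
```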

Release Version

Some of our dependencies are not in Maven Central (yet), so you need to add the following Maven repository to the repositories section of your project's POM file:

<repository>
   <id>maven.aksw.internal</id>
   <name>AKSW Release Repository</name>
   <url>http://maven.aksw.org/archiva/repository/internal</url>
   <releases>
      <enabled>true</enabled>
   </releases>
   <snapshots>
      <enabled>false</enabled>
   </snapshots>
</repository>

If you want to import the full SANSA Stack, please add the following Maven dependency to your project POM file:

<!-- SANSA Stack -->
<dependency>
   <groupId>net.sansa-stack</groupId>
   <artifactId>sansa-stack-spark_2.12</artifactId>
   <version>$LATEST_RELEASE_VERSION$</version>
</dependency>

If you only want to use particular layers, just replace $LAYER_NAME$ with the name of the corresponding layer:

<!-- SANSA $LAYER_NAME$ layer -->
<dependency>
   <groupId>net.sansa-stack</groupId>
   <artifactId>sansa-$LAYER_NAME$-spark_2.12</artifactId>
   <version>$LATEST_RELEASE_VERSION$</version>
</dependency>
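For example, assuming rdf is the layer name (the artifact sansa-rdf-spark also appears in the build logs quoted in the issues below), the dependency becomes:

```xml
<!-- SANSA RDF layer (example; the version placeholder is kept as above) -->
<dependency>
   <groupId>net.sansa-stack</groupId>
   <artifactId>sansa-rdf-spark_2.12</artifactId>
   <version>$LATEST_RELEASE_VERSION$</version>
</dependency>
```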

SNAPSHOT Version

While the release versions are available on Maven Central, the latest SNAPSHOT versions have to be installed from source:

git clone https://github.com/SANSA-Stack/SANSA-Stack.git
cd SANSA-Stack

Then, to build and install the full SANSA Spark stack, run

./dev/mvn_install_stack_spark.sh 

or, for a single layer $LAYER_NAME$, run

mvn -am -DskipTests -pl :sansa-$LAYER_NAME$-spark_2.12 clean install 

Alternatively, you can add the following Maven repository to the repositories section of your project's POM file:

<repository>
   <id>maven.aksw.snapshots</id>
   <name>AKSW Snapshot Repository</name>
   <url>http://maven.aksw.org/archiva/repository/snapshots</url>
   <releases>
      <enabled>false</enabled>
   </releases>
   <snapshots>
      <enabled>true</enabled>
   </snapshots>
</repository>

Then do the same as for the release version and add the dependency:

<!-- SANSA Stack -->
<dependency>
   <groupId>net.sansa-stack</groupId>
   <artifactId>sansa-stack-spark_2.12</artifactId>
   <version>$LATEST_SNAPSHOT_VERSION$</version>
</dependency>

How to Contribute

We always welcome new contributors to the project! Please see our contribution guide for more details on how to get started contributing to SANSA.

sansa-template-maven-spark's People

Contributors

gezimsejdiu, lorenzbuehmann

sansa-template-maven-spark's Issues

Error executing maven package

Hi everyone,

I'm getting the following error when running "mvn clean package" (it seems to be different from the previous maven-related issue):

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.024 s
[INFO] Finished at: 2019-06-01T15:53:19-04:00
[INFO] Final Memory: 22M/217M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project SANSA-Template-Maven-Spark: Could not resolve dependencies for project net.sansa-stack:SANSA-Template-Maven-Spark:jar:0.5.1-SNAPSHOT: Failed to collect dependencies at net.sansa-stack:sansa-rdf-spark_2.11:jar:0.5.1-SNAPSHOT: Failed to read artifact descriptor for net.sansa-stack:sansa-rdf-spark_2.11:jar:0.5.1-SNAPSHOT: Could not transfer artifact net.sansa-stack:sansa-rdf-spark_2.11:pom:0.5.1-SNAPSHOT from/to maven.aksw.snapshots (http://maven.aksw.org/archiva/repository/snapshots): sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

I'm eager to try out Sansa so it'd be great if I could resolve this maven issue! :-)

The POM for com.ibm.sparktc.sparkbench:sparkbench:jar:2.3.0_0.4.0 is missing, no dependency information available

C:\Users\hakim>git clone https://github.com/SANSA-Stack/SANSA-Template-Maven-Spark.git
Cloning into 'SANSA-Template-Maven-Spark'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 150 (delta 0), reused 0 (delta 0), pack-reused 147
Receiving objects: 100% (150/150), 40.01 KiB | 379.00 KiB/s, done.
Resolving deltas: 100% (56/56), done.

C:\Users\hakim>cd SANSA-Template-Maven-Spark

C:\Users\hakim\SANSA-Template-Maven-Spark>mvn clean package
[INFO] Scanning for projects...
[INFO]
[INFO] -------------< net.sansa-stack:SANSA-Template-Maven-Spark >-------------
[INFO] Building SANSA-Template-Maven-Spark 0.7.2-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-clean-plugin/2.5/maven-clean-plugin-2.5.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-clean-plugin/2.5/maven-clean-plugin-2.5.pom (3.9 kB at 6.8 kB/s)
[WARNING] The POM for com.ibm.sparktc.sparkbench:sparkbench:jar:2.3.0_0.4.0 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.489 s
[INFO] Finished at: 2020-12-06T15:56:37+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project SANSA-Template-Maven-Spark: Could not resolve dependencies for project net.sansa-stack:SANSA-Template-Maven-Spark:jar:0.7.2-SNAPSHOT: Failure to find com.ibm.sparktc.sparkbench:sparkbench:jar:2.3.0_0.4.0 in https://oss.sonatype.org/content/repositories/snapshots/ was cached in the local repository, resolution will not be reattempted until the update interval of oss-sonatype has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

C:\Users\hakim\SANSA-Template-Maven-Spark>

java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib

Hello,

When running the example on a Spark cluster using 'spark-submit', the following error is encountered. Any ideas what might be causing this?

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib
	at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$1.apply(NTripleReader.scala:135)
	at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$1.apply(NTripleReader.scala:118)
	at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.instance$lzycompute(NTripleReader.scala:207)
	at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.instance(NTripleReader.scala:207)
	at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.get(NTripleReader.scala:209)
	at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$load$1.apply(NTripleReader.scala:148)
	at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$load$1.apply(NTripleReader.scala:140)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Issue with build using maven

Hi!
This is probably a naive question, since I'm not familiar with Scala. I'm following the steps from the downloads and usage page, and I keep getting this error when I run mvn clean package:

[ERROR] Failed to execute goal on project SANSA-Template-Maven-Spark: Could not resolve dependencies for project net.sansa-stack:SANSA-Template-Maven-Spark:jar:0.7.2-SNAPSHOT: Failed to collect dependencies at net.sansa-stack:sansa-inference-spark_2.11:jar:0.7.2-SNAPSHOT -> net.sansa-stack:sansa-inference-common_2.11:jar:0.7.2-SNAPSHOT -> org.gephi:gephi-toolkit:jar:0.9.2 -> org.netbeans.modules:org-netbeans-modules-masterfs:jar:RELEASE82: Failed to read artifact descriptor for org.netbeans.modules:org-netbeans-modules-masterfs:jar:RELEASE82: Could not transfer artifact org.netbeans.modules:org-netbeans-modules-masterfs:pom:RELEASE82 from/to netbeans (http://bits.netbeans.org/nexus/content/groups/netbeans/): Not authorized , ReasonPhrase:Repository decommissioned. Please refer to https://netbeans.apache.org/about/oracle-transition.html for more information..

Build error while trying to execute "mvn package"

I'm trying to setup SANSA according to getting started page, but I'm getting the following error after running mvn clean package:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22:20 min
[INFO] Finished at: 2018-09-26T14:17:40+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project SANSA-Template-Maven-Spark: Could not resolve dependencies for project net.sansa-stack:SANSA-Template-Maven-Spark:jar:0.4.1-SNAPSHOT: Could not find artifact commons-codec:commons-codec:jar:2.0-SNAPSHOT in maven.aksw.snapshots (http://maven.aksw.org/archiva/repository/snapshots) -> [Help 1]

How can I fix that?

ConfigException in rdf_loader.config

The following exception is thrown when trying to read an NT file using the following line

triplesDF = spark.read.rdf(lang)(input)

The exception is as follows:

Exception in thread "main" com.typesafe.config.ConfigException$Parse: rdf_loader.conf

The workaround was to upgrade the SANSA snapshot version from 0.3.0 to 0.3.1. It seems the version just needs to be bumped in the POM to resolve this issue.

java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib

Hi,
I downloaded the Spark template and tried to run the sample code below, but I am getting java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib.
The build was successful with Java 1.8 and Scala 2.11.11.
Spark version: 2.4.3

$ ./spark-shell --jars /SANSA-Template-Maven-Spark/target/SANSA-Template-Maven-Spark-0.6.0.jar

import org.apache.jena.riot.Lang
import net.sansa_stack.rdf.spark.io._
import net.sansa_stack.query.spark.query._
val input = "file:/SANSA-Examples/sansa-examples-spark/src/main/resources/rdf.nt"
val lang = Lang.NTRIPLES
val triples = spark.rdf(lang)(input)
val sparqlQuery = """SELECT ?s ?p ?o
WHERE {?s ?p ?o }
LIMIT 10"""
val result = triples.sparql(sparqlQuery)
result.show(100,false)


java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib
at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$1.apply(NTripleReader.scala:135)
at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$1.apply(NTripleReader.scala:118)
at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.instance$lzycompute(NTripleReader.scala:207)
at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.instance(NTripleReader.scala:207)
at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.get(NTripleReader.scala:209)
at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$load$1.apply(NTripleReader.scala:148)
at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$load$1.apply(NTripleReader.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/03/09 15:53:04 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib
	... (stack trace identical to the one above)
