
sparkwdsub's Introduction


sparkwdsub

Spark processing of Wikidata subsets using Shape Expressions.

This repo contains an example script that processes Wikidata subsets using Shape Expressions.

The algorithm used to validate schemas in parallel has been ported to its own repo, pschema.

Building and running

The system requires sbt to be built. Once you have downloaded and installed it, you only need to run:

sbt assembly

which will generate a fat jar with all dependencies, called sparkwdsub.jar, in the folder target/scala-2.12/.

Using a local cluster

In order to run the system locally, you need to download and install Apache Spark and have the spark-submit executable available on your PATH.

Once Spark is installed and sparkwdsub.jar has been generated, you can run the following example:

spark-submit --class "es.weso.wdsub.spark.Main" --master local[4] target/scala-2.12/sparkwdsub.jar -d examples/sample-dump1.json.gz  -m cluster -n testCities -s examples/cities.shex -k -o target/cities

which will generate a folder target/cities with information about the extracted dump.

Using AWS

On AWS, you can follow these steps:

  • Upload a Wikidata dump to Amazon S3. Dumps from Wikidata are available here.
  • Create a cluster in Amazon EMR. Inside the cluster, configure two steps: one to run spark-submit and another that fails after the run finishes. The step that runs sparkwdsub has the following configuration in our system:
spark-submit --deploy-mode client --class es.weso.wdsub.spark.Main s3://weso/projects/wdsub/sparkwdsub.jar --mode cluster --name sparkwdsubLabra01 --dump s3://weso/datasets/wikidata/dump_20151116.json --schema s3://weso/projects/wdsub/author.shex --out s3://weso/projects/wdsub/

Using Google Cloud

Instructions pending...

Command line

Usage: sparkwdsub dump --schema <file> [--out <file>] [--site <string>] [--maxIterations <integer>] [--verbose] [--loggingLevel <string>] <dumpFile>
 Process example dump file.
 Options and flags:
     --help
         Display this help text.
     --schema <file>, -s <file>
         schema path
     --out <file>, -o <file>
         output path
     --site <string>
         Base url, default = http://www.wikidata.org/entity
     --maxIterations <integer>
         Max iterations for Pregel algorithm, default = 20
     --verbose
         Verbose mode
     --loggingLevel <string>
         Logging level (ERROR, WARN, INFO), default = ERROR

Example:

sparkwdsub dump -s examples/cities.shex examples/6lines.json

sparkwdsub's People

Contributors

labra, thewillyhuman, andrawaag, brett--anderson

Watchers

Mark Iantorno, Pablo Menéndez ◔_◔

sparkwdsub's Issues

java.io.UTFDataFormatException when processing 2014 dump

When the program processes the 2014 dump, it fails with the following exception:

java.io.UTFDataFormatException
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTFSpan(ObjectInputStream.java:3708)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3633)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:3437)
	at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2032)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:629)
	at java.base/java.net.URI.readObject(URI.java:1778)
	at java.base/jdk.internal.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at java.base/java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1175)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2325)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
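
One avenue to explore (unverified) is to avoid Java serialization for the cached and shuffled data by switching Spark to the Kryo serializer, which does not go through ObjectInputStream.readUTF. A minimal sketch, assuming the SparkConf is built in Main; the same setting can also be passed on the command line with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch (assumption, not the project's current configuration):
// use Kryo instead of the default Java serializer when building the context.
val conf = new SparkConf()
  .setAppName("sparkwdsub")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)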

Assembly doesn't find some required cats libraries

Although the project seems to work locally and the tests pass, launching it with spark-submit raises an error because it cannot find some cats libraries.

We were using decline to parse command-line arguments, and it failed because it could not find the cats methods used by decline, so we changed it to use plain arguments. With that change the program starts, but it crashes once it begins the actual work.

At this moment, this is the output of the crash:

21/08/31 14:05:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36909.
21/08/31 14:05:37 INFO NettyBlockTransferService: Server created on 192.168.1.134:36909
21/08/31 14:05:37 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/31 14:05:37 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.134:36909 with 434.4 MiB RAM, BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.134, 36909, None)
java.lang.NoSuchMethodError: cats.kernel.CommutativeSemigroup.$init$(Lcats/kernel/CommutativeSemigroup;)V
	at cats.UnorderedFoldable$$anon$1.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<clinit>(UnorderedFoldable.scala)
	at fs2.internal.Scope.$anonfun$open$1(Scope.scala:138)
	at cats.effect.IOFiber.runLoop(IOFiber.scala:381)
	at cats.effect.IOFiber.execR(IOFiber.scala:1151)
	at cats.effect.IOFiber.run(IOFiber.scala:128)
	at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:359)
Exception in thread "main" java.lang.NoSuchMethodError: cats.kernel.CommutativeSemigroup.$init$(Lcats/kernel/CommutativeSemigroup;)V
	at cats.UnorderedFoldable$$anon$1.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<clinit>(UnorderedFoldable.scala)
	at fs2.internal.Scope.$anonfun$open$1(Scope.scala:138)
	at cats.effect.IOFiber.runLoop(IOFiber.scala:381)
	at cats.effect.IOFiber.execR(IOFiber.scala:1151)
	at cats.effect.IOFiber.run(IOFiber.scala:128)
	at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:359)

My feeling is that the merge strategy we use to generate the assembly is failing, either because it discards some file required to locate those classes or because it picks an old version of cats.

These are the build.sbt lines that declare the assembly merge strategy:

 ThisBuild / assemblyMergeStrategy := {
     case PathList("META-INF", xs @ _*) => MergeStrategy.discard
     case x => MergeStrategy.first
    },
. . .

and when assembling, the system says:

[info] compiling 1 Scala source to /home/labra/src/wikidata/sparkwdsub/target/scala-2.12/classes ...
[info] Strategy 'discard' was applied to 663 files (Run the task at debug level to see details)
[info] Strategy 'first' was applied to 176 files (Run the task at debug level to see details)

One intuition is that some required META-INF file is discarded...

The sbt-assembly documentation describes several merge strategies that we may need to consider.
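
Since the error is a NoSuchMethodError on cats.kernel.CommutativeSemigroup, two hypotheses worth testing are that the fat jar mixes incompatible cats versions, and that a needed META-INF/services entry is being discarded. A minimal sketch of build.sbt settings to try; the merge patterns and version numbers are assumptions, not a verified fix:

// Keep META-INF/services entries (used for runtime service discovery)
// instead of discarding the whole META-INF directory.
ThisBuild / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}

// Force a single cats version on the classpath (the version is illustrative).
ThisBuild / dependencyOverrides ++= Seq(
  "org.typelevel" %% "cats-core"   % "2.6.1",
  "org.typelevel" %% "cats-kernel" % "2.6.1"
)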

2014 dumps contain an empty field

Apparently, the 2014 dumps contain some empty fields that produce the following exception when they are processed:

com.fasterxml.jackson.databind.exc.ValueInstantiationException: Cannot construct instance of `org.wikidata.wdtk.datamodel.implementation.TermImpl`, problem: A text has to be provided to create a MonolingualTextValue
 at [Source: UNKNOWN; line: 1, column: 269] (through reference chain: org.wikidata.wdtk.datamodel.implementation.ItemDocumentImpl["labels"]->java.util.LinkedHashMap["zh-hant"])
 at com.fasterxml.jackson.databind.exc.ValueInstantiationException.from(ValueInstantiationException.java:47)
 at com.fasterxml.jackson.databind.DeserializationContext.instantiationException(DeserializationContext.java:1732)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.wrapAsJsonMappingException(StdValueInstantiator.java:491)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.rewrapCtorProblem(StdValueInstantiator.java:514)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:285)
 at com.fasterxml.jackson.databind.deser.ValueInstantiator.createFromObjectWith(ValueInstantiator.java:229)
 at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:198)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:488)
 at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1287)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:326)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:159)
 at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringKeyMap(MapDeserializer.java:527)
 at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:364)
 at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:29)
 at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:530)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeWithErrorWrapping(BeanDeserializer.java:528)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:417)
 at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1287)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:326)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeOther(BeanDeserializer.java:194)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:161)
 at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:130)
 at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:97)
 at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:254)
 at com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:68)
 at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1719)
 at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1261)
 at org.wikidata.wdtk.datamodel.helpers.JsonDeserializer.deserializeEntityDocument(JsonDeserializer.java:124)
 at es.weso.wdsub.spark.wbmodel.LineParser.line2EntityStatements(LineParser.scala:33)
 at es.weso.wdsub.spark.wbmodel.LineParser.$anonfun$dumpRDD2Graph$2(LineParser.scala:66)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at org.apache.spark.graphx.EdgeRDD$.$anonfun$fromEdges$1(EdgeRDD.scala:107)
 at org.apache.spark.graphx.EdgeRDD$.$anonfun$fromEdges$1$adapted(EdgeRDD.scala:105)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
 at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
 at org.apache.spark.scheduler.Task.run(Task.scala:131)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException: A text has to be provided to create a MonolingualTextValue
 at java.base/java.util.Objects.requireNonNull(Objects.java:347)
 at org.apache.commons.lang3.Validate.notNull(Validate.java:225)
 at org.wikidata.wdtk.datamodel.implementation.TermImpl.<init>(TermImpl.java:70)
 at jdk.internal.reflect.GeneratedConstructorAccessor38.newInstance(Unknown Source)
 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
 at com.fasterxml.jackson.databind.introspect.AnnotatedConstructor.call(AnnotatedConstructor.java:124)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:283)

Line 33 of LineParser.scala is: val entityDocument = jsonDeserializer.deserializeEntityDocument(line).

And the line from the dump that produces the exception is:

{"id":"Q1246959","type":"item","aliases":{"zh":[{"language":"zh","value":"\u03f3"}],"it":[{"language":"it","value":"\u05d9"}]},"labels":{"la":{"language":"la","value":"Yot"},"zh-hans":{"language":"zh-hans","value":"\u037f"},"zh-hant":{"language":"zh-hant","removed":""},"zh-hk":{"language":"zh-hk","removed":""},"it":{"language":"it","value":"Jod"},"zh":{"language":"zh","value":"\u037f"},"fr":{"language":"fr","value":"Yot"}},"sitelinks":{"lawiki":{"site":"lawiki","title":"Yot","badges":[]},"zhwiki":{"site":"zhwiki","title":"\u037f","badges":[]},"itwiki":{"site":"itwiki","title":"Jod","badges":[]},"frwiki":{"site":"frwiki","title":"Yot","badges":[]}}},

Experiments on AWS EMR

AWS EMR is the Amazon Web Services Elastic MapReduce service, which allows creating Spark clusters. Our aim was to run these experiments on the WESO Spark cluster, but unfortunately we ran out of RAM. This issue will track the history of the experiments on AWS EMR.

Check repeated properties

After fixing issue #15, the test with repeated properties failed. We have commented it out for now, but we should dig into it to check why it fails.

ValueSetValues with entities in triple constraints

At this moment, a ShEx like the following:

prefix : <http://www.wikidata.org/entity/>

<City> {
 :P31 [ :Q515 ]
}

would be handled incorrectly: it would be treated as a local triple constraint and would not match any city.

A solution is to rewrite the ShEx as:

prefix : <http://www.wikidata.org/entity/>

<City> {
 :P31 @<CityCode>
}

<CityCode> [ :Q515 ]

Nevertheless, we should rewrite the ShEx converter to handle that pattern correctly, since it is very common.
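
A minimal sketch of the rewrite idea, using a hypothetical simplified AST (the project's real ShEx AST is richer): liftValueSet turns a value-set triple constraint such as :P31 [ :Q515 ] into a shape reference plus a fresh value-set shape, which is the form the validator already handles:

// Hypothetical, simplified AST types for illustration only.
sealed trait ValueExpr
final case class ValueSet(values: List[String]) extends ValueExpr
final case class ShapeRef(label: String)        extends ValueExpr
final case class TripleConstraint(pred: String, value: ValueExpr)

// Rewrite `:P31 [ :Q515 ]` into `:P31 @<gen1>` plus a new shape `<gen1> [ :Q515 ]`.
def liftValueSet(tc: TripleConstraint, freshLabel: () => String)
    : (TripleConstraint, Option[(String, ValueSet)]) =
  tc.value match {
    case vs: ValueSet =>
      val label = freshLabel()
      (tc.copy(value = ShapeRef(label)), Some(label -> vs))
    case _ => (tc, None)
  }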

Merge branch spark-cluster-works with master

During the tests to make sparkwdsub run on AWS, @thewillyhuman created a branch called thewillyhuman/spark-cluster-works which works on AWS because it doesn't use decline and ensures that the dependencies use the right versions.

At the same time, @labra kept adding features to the master branch, so at this moment, whenever we need to deploy on AWS or run the system, we go through a manual process to integrate the latest changes from master into the other branch.

Most of the changes are quite easy, as the main change is precisely the Main code, which uses scallop instead of decline to handle command-line options. Anyway, I think it is worth merging both branches and continuing to work on a unified one.

Cluster executors throw a java.io.InvalidClassException at scala.collection.mutable.WrappedArray

The Apache Spark cluster executors throw the following exception:

java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible: stream classdesc serialVersionUID = 3456489343829468865, local class serialVersionUID = 1028182004549731694
    at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
    at java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
    at java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
    at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
    at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
    at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
    at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
    at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
    at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$deserialize$2(NettyRpcEnv.scala:299)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:352)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$deserialize$1(NettyRpcEnv.scala:298)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:298)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
    at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
    at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:382)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:87)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:421)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 12 more
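
This kind of serialVersionUID mismatch on a Scala collection class usually indicates that the assembly was built against a different Scala (or Spark) version than the one running on the executors. A minimal sketch of build.sbt settings to double-check; the version numbers are illustrative assumptions, not the project's actual ones:

// Keep the Scala version aligned with the cluster's Spark distribution and
// do not bundle Spark itself into the fat jar (mark it "provided").
ThisBuild / scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-graphx" % "3.1.2" % "provided"
)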

Some nodes validate when they should not

The following shape:

PREFIX :    <http://www.wikidata.org/entity/>

start=@<publication>

<publication> {
 :P50 @<author> +;
}

<author> EXTRA :P31 {
  :P31 @<elf> ;
  :P27 @<country_value> 
}

<elf> [ :Q174396 ] 
<country_value> [ :Q183 ]

This shape should not validate any node, because it asks for publications written by elves; however, it returns all authors, ignoring the values.

We have reproduced it with this simple test case.

It seems to detect all the failing nodes, but it ignores their values and asserts that the nodes conform.


Extraction of nodes lacking the required properties

An example is worth a thousand words:

PREFIX :  <http://www.wikidata.org/entity/>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

start=@<publication>

<publication> EXTRA :P31 {
    	:P50 @<author> +;
}

<author> EXTRA :P31 {
    	:P31 @<human> ;
	:P27 @<country_type> + ;
}

<country> EXTRA :P31 {
	:P31 @<country_type> ;
}

<human>  [:Q5]
<country_type>  [:Q29]

Running this shape, it returns, as the author of a Korean publication (one of many), an "Academy of Korean Studies" that lacks both P27 and P31, the properties that define our author.

This means that it is able to discern when an entity's property has the value we indicate, but if the entity does not have that property at all, it considers it valid.
