
sparkwdsub's Introduction


sparkwdsub

Spark processing of Wikidata subsets using Shape Expressions.

This repo contains an example script that processes Wikidata subsets using Shape Expressions.

The algorithm used to validate schemas in parallel has been ported to its own repo, pschema.

Building and running

The system requires sbt to be built. Once you have downloaded and installed it, you only need to run:

sbt assembly

which will generate a fat jar with all dependencies, called sparkwdsub.jar, in the folder target/scala-2.12/.

Using a local cluster

In order to run the system locally, you need to download and install Apache Spark and have the spark-submit executable available on your PATH.

Once Spark is installed and sparkwdsub.jar has been generated, you can run the following example:

spark-submit --class "es.weso.wdsub.spark.Main" --master local[4] target/scala-2.12/sparkwdsub.jar -d examples/sample-dump1.json.gz  -m cluster -n testCities -s examples/cities.shex -k -o target/cities

which will generate a folder target/cities with information about the extracted dump.

Using AWS

On AWS, you can follow these steps:

  • Upload a Wikidata dump to Amazon S3. Dumps from Wikidata are available here.
  • Create a cluster in Amazon EMR. Inside the cluster, configure two steps: one to run spark-submit and another that fails after the run finishes. The step that runs sparkwdsub has the following configuration in our system:
spark-submit --deploy-mode client --class es.weso.wdsub.spark.Main s3://weso/projects/wdsub/sparkwdsub.jar --mode cluster --name sparkwdsubLabra01 --dump s3://weso/datasets/wikidata/dump_20151116.json --schema s3://weso/projects/wdsub/author.shex --out s3://weso/projects/wdsub/

Using Google Cloud

Instructions pending...

Command line

Usage: sparkwdsub dump --schema <file> [--out <file>] [--site <string>] [--maxIterations <integer>] [--verbose] [--loggingLevel <string>] <dumpFile>
 Process example dump file.
 Options and flags:
     --help
         Display this help text.
     --schema <file>, -s <file>
         schema path
     --out <file>, -o <file>
         output path
     --site <string>
         Base url, default = http://www.wikidata.org/entity
     --maxIterations <integer>
         Max iterations for Pregel algorithm, default = 20
     --verbose
         Verbose mode
     --loggingLevel <string>
         Logging level (ERROR, WARN, INFO), default = ERROR

Example:

sparkwdsub dump -s examples/cities.shex examples/6lines.json

sparkwdsub's People

Contributors

labra, thewillyhuman, andrawaag, brett--anderson

Watchers

Mark Iantorno, Pablo Menéndez ◔_◔

sparkwdsub's Issues

java.io.UTFDataFormatException when processing 2014 dump

When the program processes the 2014 dump, it fails with the following exception:

java.io.UTFDataFormatException
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTFSpan(ObjectInputStream.java:3708)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3633)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:3437)
	at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2032)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:629)
	at java.base/java.net.URI.readObject(URI.java:1778)
	at java.base/jdk.internal.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at java.base/java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1175)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2325)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
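
One avenue to explore (unverified) is to avoid Java serialization for the cached and shuffled data by switching Spark to the Kryo serializer, which does not go through ObjectInputStream.readUTF. A minimal sketch, assuming the SparkConf is built in Main; the same setting can also be passed on the command line with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch (assumption, not the project's current configuration):
// use Kryo instead of the default Java serializer when building the context.
val conf = new SparkConf()
  .setAppName("sparkwdsub")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)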

Assembly doesn't find some required cats libraries

Although the project seems to work locally and the tests pass, launching it with spark-submit raises an error because it cannot find some cats libraries.

We were using decline to parse command-line arguments, and it failed because it could not find the cats methods used by decline, so we changed it to use plain arguments. With that change the program starts, but it crashes once it begins the actual work.

At this moment, this is the output of the crash:

21/08/31 14:05:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36909.
21/08/31 14:05:37 INFO NettyBlockTransferService: Server created on 192.168.1.134:36909
21/08/31 14:05:37 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/31 14:05:37 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.134:36909 with 434.4 MiB RAM, BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.134, 36909, None)
java.lang.NoSuchMethodError: cats.kernel.CommutativeSemigroup.$init$(Lcats/kernel/CommutativeSemigroup;)V
	at cats.UnorderedFoldable$$anon$1.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<clinit>(UnorderedFoldable.scala)
	at fs2.internal.Scope.$anonfun$open$1(Scope.scala:138)
	at cats.effect.IOFiber.runLoop(IOFiber.scala:381)
	at cats.effect.IOFiber.execR(IOFiber.scala:1151)
	at cats.effect.IOFiber.run(IOFiber.scala:128)
	at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:359)
Exception in thread "main" java.lang.NoSuchMethodError: cats.kernel.CommutativeSemigroup.$init$(Lcats/kernel/CommutativeSemigroup;)V
	at cats.UnorderedFoldable$$anon$1.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<init>(UnorderedFoldable.scala:81)
	at cats.UnorderedFoldable$.<clinit>(UnorderedFoldable.scala)
	at fs2.internal.Scope.$anonfun$open$1(Scope.scala:138)
	at cats.effect.IOFiber.runLoop(IOFiber.scala:381)
	at cats.effect.IOFiber.execR(IOFiber.scala:1151)
	at cats.effect.IOFiber.run(IOFiber.scala:128)
	at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:359)

My feeling is that the merge strategy we use to generate the assembly is failing, either because it discards some file required to locate those classes or because it picks an old version of cats.

These are the build.sbt lines that declare the assembly merge strategy:

 ThisBuild / assemblyMergeStrategy := {
     case PathList("META-INF", xs @ _*) => MergeStrategy.discard
     case x => MergeStrategy.first
    },
. . .

and when assembling, the system says:

[info] compiling 1 Scala source to /home/labra/src/wikidata/sparkwdsub/target/scala-2.12/classes ...
[info] Strategy 'discard' was applied to 663 files (Run the task at debug level to see details)
[info] Strategy 'first' was applied to 176 files (Run the task at debug level to see details)

One intuition is that some required META-INF file is discarded...

The sbt-assembly documentation describes several merge strategies that we may need to consider.
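
Since the error is a NoSuchMethodError on cats.kernel.CommutativeSemigroup, two hypotheses worth testing are that the fat jar mixes incompatible cats versions, and that a needed META-INF/services entry is being discarded. A minimal sketch of build.sbt settings to try; the merge patterns and version numbers are assumptions, not a verified fix:

// Keep META-INF/services entries (used for runtime service discovery)
// instead of discarding the whole META-INF directory.
ThisBuild / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}

// Force a single cats version on the classpath (the version is illustrative).
ThisBuild / dependencyOverrides ++= Seq(
  "org.typelevel" %% "cats-core"   % "2.6.1",
  "org.typelevel" %% "cats-kernel" % "2.6.1"
)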

2014 dumps contain an empty field

Apparently, the 2014 dumps contain some empty fields that produce the following exception when they are processed:

com.fasterxml.jackson.databind.exc.ValueInstantiationException: Cannot construct instance of `org.wikidata.wdtk.datamodel.implementation.TermImpl`, problem: A text has to be provided to create a MonolingualTextValue
 at [Source: UNKNOWN; line: 1, column: 269] (through reference chain: org.wikidata.wdtk.datamodel.implementation.ItemDocumentImpl["labels"]->java.util.LinkedHashMap["zh-hant"])
 at com.fasterxml.jackson.databind.exc.ValueInstantiationException.from(ValueInstantiationException.java:47)
 at com.fasterxml.jackson.databind.DeserializationContext.instantiationException(DeserializationContext.java:1732)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.wrapAsJsonMappingException(StdValueInstantiator.java:491)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.rewrapCtorProblem(StdValueInstantiator.java:514)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:285)
 at com.fasterxml.jackson.databind.deser.ValueInstantiator.createFromObjectWith(ValueInstantiator.java:229)
 at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:198)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:488)
 at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1287)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:326)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:159)
 at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringKeyMap(MapDeserializer.java:527)
 at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:364)
 at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:29)
 at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:530)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeWithErrorWrapping(BeanDeserializer.java:528)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:417)
 at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1287)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:326)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeOther(BeanDeserializer.java:194)
 at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:161)
 at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:130)
 at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:97)
 at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:254)
 at com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:68)
 at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1719)
 at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1261)
 at org.wikidata.wdtk.datamodel.helpers.JsonDeserializer.deserializeEntityDocument(JsonDeserializer.java:124)
 at es.weso.wdsub.spark.wbmodel.LineParser.line2EntityStatements(LineParser.scala:33)
 at es.weso.wdsub.spark.wbmodel.LineParser.$anonfun$dumpRDD2Graph$2(LineParser.scala:66)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at org.apache.spark.graphx.EdgeRDD$.$anonfun$fromEdges$1(EdgeRDD.scala:107)
 at org.apache.spark.graphx.EdgeRDD$.$anonfun$fromEdges$1$adapted(EdgeRDD.scala:105)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
 at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
 at org.apache.spark.scheduler.Task.run(Task.scala:131)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException: A text has to be provided to create a MonolingualTextValue
 at java.base/java.util.Objects.requireNonNull(Objects.java:347)
 at org.apache.commons.lang3.Validate.notNull(Validate.java:225)
 at org.wikidata.wdtk.datamodel.implementation.TermImpl.<init>(TermImpl.java:70)
 at jdk.internal.reflect.GeneratedConstructorAccessor38.newInstance(Unknown Source)
 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
 at com.fasterxml.jackson.databind.introspect.AnnotatedConstructor.call(AnnotatedConstructor.java:124)
 at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:283)

Line 33 of LineParser.scala is: val entityDocument = jsonDeserializer.deserializeEntityDocument(line).

And the line from the dump that produces the exception is:

{"id":"Q1246959","type":"item","aliases":{"zh":[{"language":"zh","value":"\u03f3"}],"it":[{"language":"it","value":"\u05d9"}]},"labels":{"la":{"language":"la","value":"Yot"},"zh-hans":{"language":"zh-hans","value":"\u037f"},"zh-hant":{"language":"zh-hant","removed":""},"zh-hk":{"language":"zh-hk","removed":""},"it":{"language":"it","value":"Jod"},"zh":{"language":"zh","value":"\u037f"},"fr":{"language":"fr","value":"Yot"}},"sitelinks":{"lawiki":{"site":"lawiki","title":"Yot","badges":[]},"zhwiki":{"site":"zhwiki","title":"\u037f","badges":[]},"itwiki":{"site":"itwiki","title":"Jod","badges":[]},"frwiki":{"site":"frwiki","title":"Yot","badges":[]}}},

Experiments on AWS EMR

AWS EMR is the Amazon Web Services Elastic MapReduce service, which allows creating Spark clusters. Our aim was to run these experiments on the WESO Spark cluster, but unfortunately we ran out of RAM. This issue will track the history of the experiments on AWS EMR.

Check repeated properties

After fixing issue #15, the test with repeated properties failed. We have commented it out for now, but we should dig into it to check why it fails.

ValueSetValues with entities in triple constraints

At this moment, a ShEx like the following:

prefix : <http://www.wikidata.org/entity/>

<City> {
 :P31 [ :Q515 ]
}

would be handled incorrectly: it would be treated as a local triple constraint and would not match any city.

A solution is to rewrite the ShEx as:

prefix : <http://www.wikidata.org/entity/>

<City> {
 :P31 @<CityCode>
}

<CityCode> [ :Q515 ]

Nevertheless, we should rewrite the ShEx converter to handle that pattern correctly, since it is very common.
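
A minimal sketch of the rewrite idea, using a hypothetical simplified AST (the project's real ShEx AST is richer): liftValueSet turns a value-set triple constraint such as :P31 [ :Q515 ] into a shape reference plus a fresh value-set shape, which is the form the validator already handles:

// Hypothetical, simplified AST types for illustration only.
sealed trait ValueExpr
final case class ValueSet(values: List[String]) extends ValueExpr
final case class ShapeRef(label: String)        extends ValueExpr
final case class TripleConstraint(pred: String, value: ValueExpr)

// Rewrite `:P31 [ :Q515 ]` into `:P31 @<gen1>` plus a new shape `<gen1> [ :Q515 ]`.
def liftValueSet(tc: TripleConstraint, freshLabel: () => String)
    : (TripleConstraint, Option[(String, ValueSet)]) =
  tc.value match {
    case vs: ValueSet =>
      val label = freshLabel()
      (tc.copy(value = ShapeRef(label)), Some(label -> vs))
    case _ => (tc, None)
  }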

Merge branch spark-cluster-works with master

During the tests to make sparkwdsub run on AWS, @thewillyhuman created a branch called thewillyhuman/spark-cluster-works which works on AWS because it doesn't use decline and ensures that the dependencies use the right versions.

At the same time, @labra kept adding features to the master branch, so at this moment, whenever we need to deploy on AWS or run the system, we go through a manual process to integrate the latest changes from master into the other branch.

Most of the changes are quite easy, as the main change is precisely the Main code, which uses scallop instead of decline to handle command-line options. Anyway, I think it is worth merging both branches and continuing to work on a unified one.

Cluster executors throw a java.io.InvalidClassException at scala.collection.mutable.WrappedArray

The Apache Spark cluster executors throw the following exception:

java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible: stream classdesc serialVersionUID = 3456489343829468865, local class serialVersionUID = 1028182004549731694
    at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
    at java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
    at java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
    at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
    at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
    at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
    at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
    at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
    at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$deserialize$2(NettyRpcEnv.scala:299)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:352)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$deserialize$1(NettyRpcEnv.scala:298)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:298)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
    at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
    at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:382)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:87)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:421)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 12 more
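
This kind of serialVersionUID mismatch on a Scala collection class usually indicates that the assembly was built against a different Scala (or Spark) version than the one running on the executors. A minimal sketch of build.sbt settings to double-check; the version numbers are illustrative assumptions, not the project's actual ones:

// Keep the Scala version aligned with the cluster's Spark distribution and
// do not bundle Spark itself into the fat jar (mark it "provided").
ThisBuild / scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-graphx" % "3.1.2" % "provided"
)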

Some nodes validate when they should not

The following shape:

PREFIX :    <http://www.wikidata.org/entity/>

start=@<publication>

<publication> {
 :P50 @<author> +;
}

<author> EXTRA :P31 {
  :P31 @<elf> ;
  :P27 @<country_value> 
}

<elf> [ :Q174396 ] 
<country_value> [ :Q183 ]

This shape should not validate any node, because it asks for publications written by elves; however, it returns all authors, ignoring the values.

We have reproduced it with this simple test case.

It seems to detect all the failing nodes, but it ignores their values and asserts that the nodes conform.


Extraction of nodes lacking the required properties

An example is worth a thousand words:

PREFIX :  <http://www.wikidata.org/entity/>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

start=@<publication>

<publication> EXTRA :P31 {
    	:P50 @<author> +;
}

<author> EXTRA :P31 {
    	:P31 @<human> ;
	:P27 @<country_type> + ;
}

<country> EXTRA :P31 {
	:P31 @<country_type> ;
}

<human>  [:Q5]
<country_type>  [:Q29]

Running this shape, it returns, as the author of a Korean publication (one of many), an "Academy of Korean Studies" that lacks both P27 and P31, the properties that define our author.

This means that it is able to discern when an entity's property has the value we indicate, but if the entity does not have that property at all, it considers it valid.
