apache / sedona Goto Github PK

A cluster computing framework for processing large-scale geospatial data

License: Apache License 2.0

Java 46.44% Scala 29.31% JavaScript 0.11% Jupyter Notebook 6.05% Shell 0.14% Python 13.89% HTML 0.09% R 2.59% C 1.32% Dockerfile 0.06% Makefile 0.01%

cluster-computing spatial-sql geospatial spatial-analysis spatial-query scala java python

sedona's Introduction

Download statistics	Maven	PyPI	Conda-forge	CRAN	DockerHub
Apache Sedona	225k/month
Archived GeoSpark releases	10k/month

What is Apache Sedona?
Features
When to use Sedona?
Building Sedona
Documentation
Join the community

Join the community

Follow Sedona on Twitter for fresh news: Sedona@Twitter

Join the Sedona Discord community:

Join the Sedona monthly community office hour: Google Calendar, Tuesdays from 8 AM to 9 AM Pacific Time, every 4 weeks

Sedona JIRA: Bugs, Pull Requests, and other similar issues

Sedona Mailing Lists: [email protected]: project development, general questions or tutorials.

Please first subscribe and then post emails. To subscribe, please send an email (leave the subject and content blank) to [email protected]

What is Apache Sedona?

Apache Sedona™ is a spatial computing engine that enables developers to easily process spatial data at any scale within modern cluster computing systems such as Apache Spark and Apache Flink. Sedona developers can express their spatial data processing tasks in Spatial SQL, Spatial Python or Spatial R. Internally, Sedona provides spatial data loading, indexing, partitioning, and query processing/optimization functionality that enable users to efficiently analyze spatial data at any scale.

Features

Some of the key features of Apache Sedona include:

Support for a wide range of geospatial data formats, including GeoJSON, WKT, and ESRI Shapefile.
Scalable distributed processing of large vector and raster datasets.
Tools for spatial indexing, spatial querying, and spatial join operations.
Integration with popular geospatial python tools such as GeoPandas.
Integration with popular big data tools, such as Spark, Hadoop, Hive, and Flink for data storage and querying.
A user-friendly API for working with geospatial data in the SQL, Python, Scala and Java languages.
Flexible deployment options, including standalone, local, and cluster modes.

These are some of the key features of Apache Sedona, but it may offer additional capabilities depending on the specific version and configuration.

Click and play the interactive Sedona Python Jupyter Notebook immediately!

When to use Sedona?

Use Cases:

Apache Sedona is a widely used framework for working with spatial data, and it has many different use cases and applications. Some of the main use cases for Apache Sedona include:

Automotive data analytics: Apache Sedona is widely used in geospatial analytics applications, where it is used to perform spatial analysis and data mining on large and complex datasets collected from fleets.
Urban planning and development: Apache Sedona is commonly used in urban planning and development applications to analyze and visualize spatial data sets related to urban environments, such as land use, transportation networks, and population density.
Location-based services: Apache Sedona is often used in location-based services, such as mapping and navigation applications, where it is used to process and analyze spatial data to provide location-based information and services to users.
Environmental modeling and analysis: Apache Sedona is used in many different environmental modeling and analysis applications, where it is used to process and analyze spatial data related to environmental factors, such as air quality, water quality, and weather patterns.
Disaster response and management: Apache Sedona is used in disaster response and management applications to process and analyze spatial data related to disasters, such as floods, earthquakes, and other natural disasters, in order to support emergency response and recovery efforts.

Code Example:

This example loads NYC taxi trip records and taxi zone information stored as .CSV files on AWS S3 into Sedona spatial dataframes. It then performs spatial SQL query on the taxi trip datasets to filter out all records except those within the Manhattan area of New York. The example also shows a spatial join operation that matches taxi trip records to zones based on whether the taxi trip lies within the geographical extents of the zone. Finally, the last code snippet integrates the output of Sedona with GeoPandas and plots the spatial distribution of both datasets.

Load NYC taxi trips and taxi zones data from CSV Files Stored on AWS S3

taxidf = sedona.read.format('csv').option("header","true").option("delimiter", ",").load("s3a://your-directory/data/nyc-taxi-data.csv")
taxidf = taxidf.selectExpr('ST_Point(CAST(Start_Lon AS Decimal(24,20)), CAST(Start_Lat AS Decimal(24,20))) AS pickup', 'Trip_Pickup_DateTime', 'Payment_Type', 'Fare_Amt')

zoneDf = sedona.read.format('csv').option("delimiter", ",").load("s3a://your-directory/data/TIGER2018_ZCTA5.csv")
zoneDf = zoneDf.selectExpr('ST_GeomFromWKT(_c0) as zone', '_c1 as zipcode')

Spatial SQL query to only return Taxi trips in Manhattan

taxidf_mhtn = taxidf.where('ST_Contains(ST_PolygonFromEnvelope(-74.01,40.73,-73.93,40.79), pickup)')

Spatial Join between Taxi Dataframe and Zone Dataframe to Find taxis in each zone

taxiVsZone = sedona.sql('SELECT zone, zipcode, pickup, Fare_Amt FROM zoneDf, taxiDf WHERE ST_Contains(zone, pickup)')

Show a map of the loaded Spatial Dataframes using GeoPandas

zoneGpd = gpd.GeoDataFrame(zoneDf.toPandas(), geometry="zone")
taxiGpd = gpd.GeoDataFrame(taxidf.toPandas(), geometry="pickup")

zone = zoneGpd.plot(color='yellow', edgecolor='black', zorder=1)
zone.set_xlabel('Longitude (degrees)')
zone.set_ylabel('Latitude (degrees)')

zone.set_xlim(-74.1, -73.8)
zone.set_ylim(40.65, 40.9)

taxi = taxiGpd.plot(ax=zone, alpha=0.01, color='red', zorder=3)

Docker image

We provide a Docker image for Apache Sedona with Python JupyterLab and a single-node cluster. The images are available on DockerHub

Building Sedona

To install the Python package:
```
pip install apache-sedona
```
To compile the source code, please refer to Sedona website
Modules in the source code

Name	API	Introduction
common	Java	Core geometric operation logics, serialization, index
spark	Spark RDD/DataFrame Scala/Java/SQL	Distributed geospatial data processing on Apache Spark
flink	Flink DataStream/Table in Scala/Java/SQL	Distributed geospatial data processing on Apache Flink
snowflake	Snowflake SQL	Distributed geospatial data processing on Snowflake
spark-shaded	No source code	shaded jar for Sedona Spark
flink-shaded	No source code	shaded jar for Sedona Flink
snowflake-tester	Java	tester program for Sedona Snowflake
python	Spark RDD/DataFrame Python	Distributed geospatial data processing on Apache Spark
R	Spark RDD/DataFrame in R	R wrapper for Sedona
Zeppelin	Apache Zeppelin	Plugin for Apache Zeppelin 0.8.1+

Documentation

Please visit Apache Sedona website for detailed information

Powered by

sedona's People

Contributors

Stargazers

Watchers

Forkers

manueltimita manasdebashiskar nexys-consulting caporossi giserh jayapriya90 supergis justvarshney nubiofs n0mer ambling zouzias sabman zjmeixinyanzhi excelwang errolk theseusyang andreazanotti15 barrycug gischen anukat2015 jaharshkotha dhanashreea vrjdev sriram-nappa rajd2012 anuragit19 justnisar schachte seamusdu purukaushik skardile ajinxpatil abirsaha anuranduttaroy nilam603 hmaskai vreuter kavgan openarrow erhkamal omkarkaptan adityatogani94 shrutiocean1906 bhaktibohara cgore1 dhawal9035 yeskarthik aakash-chaurasia rangarajkaushiksundar sujayv chronicnexus safegraphinc ajkulkarni xieyufish openkbs rajananusha ignasva lakshmisagar drsnowbird wxl2github pzz2011 snowch lnagel tordsk jfuentes agile-lab-dev gitter-badger jay-zeng invadergir john-min bertolinux-machine milad-d hwbehrens monkeydunkey ukeerthi prnvshrn dharanisakary foolyoungfly ironman-jason sharvilshah1994 laveeshr bhavint surbhi-singh mithilarun narendrakumar92 l0n3sh4rk sdjcse saumya1510 abakane1 wenma sanjosealbacete jackzhanggiser kanghq jassonhua vamsikg tizemrane geoheil coyotey wongnaret

sedona's Issues

New Method setGrid

Hi guys,
I want to create a PointRDD and a PolygonRDD from a rdd of Point and a rdd of Polygon. I am using the method
/**
* Initialize one SpatialRDD with one existing SpatialRDD
* @param rawPointRDD One existing rawPointRDD
*/
public PointRDD(JavaRDD rawPointRDD) {
this.setRawPointRDD(rawPointRDD);
}

But when I do this the grids of my PointRDD and PolygonRDD are not created and I can't make a match between my PointRDD and Polygon. If i didn't misunderstand the grid is only created if you create your PointRDD from a csv file. Can you make a new method that set the grid too from a rdd? That's would be very useful for me.

Add a new type of spatial join query that returns the intersections directly

Since several users are asking this feature, GeoSpark will officially add the support of this function soon.

The description of this function is as follows:
Given two polygon datasets RDD1 and RDD2, find the polygonal intersection between every polygon in RDD1 and every polygon in RDD2.

Currently GeoSpark solution before the upcoming release:
Step 1. Do spatial join query between RDD1 and RDD2 and get such a PairRDD

<PolygonFromRDD1, IntersectedPolygonFromRDD2, IntersectedPolygonFromRDD2, IntersectedPolygonFromRDD2...>
<PolygonFromRDD1, IntersectedPolygonFromRDD2, IntersectedPolygonFromRDD2, IntersectedPolygonFromRDD2...>
<PolygonFromRDD1, IntersectedPolygonFromRDD2, IntersectedPolygonFromRDD2, IntersectedPolygonFromRDD2...>
...
...

Step 2. Do a Map on the PairRDD. This Map will do a transformation on each tuple in PairRDD. The transformation pseudo is as follows:

List polygonalIntersections = new ArrayList();
for(Polygon intersectedPolygon : List Of intersectedPolygons from RDD2)
{
 Polygon polygonalIntersection = PolygonFromRDD1.intersection(intersectedPolygon);
 polygonalIntersections.add(polygonalIntersection);
}
return new Tuple2(PolygonFromRDD1, polygonalIntersections);

NPE while loading small input location csv

Hi! Thank you for this useful tool!
I've run into NPE when my locations file was small.
`55.7588875598332,37.61572718349549

55.77040457045658,37.5973594161615

55.76053305623841,37.581566569481815

55.76875949238983,37.64902949062438`

mapPartitions in takeOrdered method in RDD.scala returns empty-iterator.

I was running this type of query:
val kNNQueryPoint = geometryFactory.createPoint(new Coordinate( 55.751244, 37.618423)) val result: util.List[Point] = KNNQuery.SpatialKnnQuery(objectRDD, kNNQueryPoint, 3, false)

issues loading geoJSON

Hi, thank you for building such a useful tool!

I'm trying hard to use geospark in a simple project. I want to do a spatial join on some point data (which I can load fine from a set of CSVs) and US Census Block Group data.

I am having trouble loading the polygon data from the block groups into geospark.

I first use ogr2ogr to convert the available block group shapefiles. The original shp files were downloaded from here.

convert to geojson:

ogr2ogr -f "GEOJSON" bg_geo.json cb_2016_25_bg_500k.shp

try to load using geospark:

import org.datasyslab.geospark.enums.FileDataSplitter
import org.datasyslab.geospark.spatialRDD.PolygonRDD
import org.apache.spark.storage.StorageLevel
val storageLevel = new StorageLevel()
val acs_rdd = new PolygonRDD(sc, "s3://mixing/raw/acs_bg_geo.json", FileDataSplitter.GEOJSON, false, 5, storageLevel);

Expected Behavior
The geoJSON is loaded into the polygon RDD properly.

Actual
I get the following error. I am able to load this geojson in other tools. I am pretty new to scala/spark/geospark, so perhaps I'm making a simple mistake? Any help is appreciated!

java.lang.RuntimeException: com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: {; line: 1, column: 0])
 at [Source: {; line: 1, column: 3]
	at org.wololo.geojson.GeoJSONFactory.create(GeoJSONFactory.java:31)
	at org.wololo.jts2geojson.GeoJSONReader.read(GeoJSONReader.java:16)
	at org.datasyslab.geospark.formatMapper.PolygonFormatMapper.call(PolygonFormatMapper.java:102)
	at org.datasyslab.geospark.formatMapper.PolygonFormatMapper.call(PolygonFormatMapper.java:31)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:363)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: {; line: 1, column: 0])
 at [Source: {; line: 1, column: 3]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:470)
	at com.fasterxml.jackson.core.base.ParserBase._handleEOF(ParserBase.java:501)
	at com.fasterxml.jackson.core.base.ParserBase._eofAsNextChar(ParserBase.java:509)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2018)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextFieldName(ReaderBasedJsonParser.java:743)
	at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:208)
	at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:69)
	at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3736)
	at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2271)
	at org.wololo.geojson.GeoJSONFactory.create(GeoJSONFactory.java:21)
	... 21 more

Jts shading geotools integration

Could you add shading for the company.vividsolution.jts used in geospark?
I plan to ist geospark and geotools in a project and their respective transitive dependencies collide for jts.

http://stackoverflow.com/questions/13620281/what-is-the-maven-shade-plugin-used-for-and-why-would-you-want-to-relocate-java provide a good overview why to shade such closely coupled internal dependencies.

NPE crash in KNN Query

I have some simple KNN test code based on the example scala code provided on github (https://gist.github.com/jiayuasu/e3571e982c518bb522e6c6c962207255), and it works fine for K=1, but for K>1 I get an NullPointerException in geospark code.

Here is the spark-shell session. This is in spark 1.5.0, using geospark-0.5.0-spark-1.x.jar.

Note that my test file in hdfs looks like this, it is a square of size 1 where the bottom left corner is at the origin. The spatial data starts at index 3:


    0,A,A,0.0,0.0
    1,B,B,0.0,1.0
    2,C,C,1.0,0.0
    3,D,D,1.0,1.0

The imports I used in the shell session were:

    import com.vividsolutions.jts.geom.Coordinate;
    import com.vividsolutions.jts.geom.Envelope;
    import com.vividsolutions.jts.geom.GeometryFactory;
    import com.vividsolutions.jts.geom.Point;
    import org.datasyslab.geospark.enums.FileDataSplitter;
    import org.datasyslab.geospark.enums.IndexType;
    import org.datasyslab.geospark.spatialOperator.KNNQuery;
    import org.datasyslab.geospark.spatialOperator.RangeQuery; 
    import org.datasyslab.geospark.spatialRDD.PointRDD;

The rest of the shell session:

scala>   val testCSVBoundaries = "/user/me/test/testpoints-1-1-boundaries.csv"
testCSVBoundaries: String = /user/me/test/testpoints-1-1-boundaries.csv

scala>   val fact = new GeometryFactory()
fact: com.vividsolutions.jts.geom.GeometryFactory = com.vividsolutions.jts.geom.GeometryFactory@3b92d44f

scala>   val queryPoint = fact.createPoint(new Coordinate(0.5, 0.5))
queryPoint: com.vividsolutions.jts.geom.Point = POINT (0.5 0.5)

scala>   val objectRDD = new PointRDD(sc, testCSVBoundaries, 3, FileDataSplitter.CSV, false)
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@47a24c46

scala>   val result = KNNQuery.SpatialKnnQuery(objectRDD, queryPoint, 4, false)
17/02/03 18:10:24 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 10, cs1hdo1n05.cars.com): java.lang.NullPointerException
        at org.datasyslab.geospark.knnJudgement.GeometryDistanceComparator.compare(GeometryDistanceComparator.java:39)
        at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
        at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
        at org.spark-project.guava.collect.Ordering.max(Ordering.java:551)
        at org.spark-project.guava.collect.Ordering.leastOf(Ordering.java:667)
        at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1360)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1357)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

17/02/03 18:10:24 ERROR TaskSetManager: Task 1 in stage 5.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 15, cs1hdo1n05.cars.com): java.lang.NullPointerException
        at org.datasyslab.geospark.knnJudgement.GeometryDistanceComparator.compare(GeometryDistanceComparator.java:39)
        at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
        at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
        at org.spark-project.guava.collect.Ordering.max(Ordering.java:551)
        at org.spark-project.guava.collect.Ordering.leastOf(Ordering.java:667)
        at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1360)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1357)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1366)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1353)
        at org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike.scala:611)
        at org.apache.spark.api.java.AbstractJavaRDDLike.takeOrdered(JavaRDDLike.scala:47)
        at org.datasyslab.geospark.spatialOperator.KNNQuery.SpatialKnnQuery(KNNQuery.java:68)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
        at $iwC$$iwC$$iwC.<init>(<console>:51)
        at $iwC$$iwC.<init>(<console>:53)
        at $iwC.<init>(<console>:55)
        at <init>(<console>:57)
        at .<init>(<console>:61)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
        at org.datasyslab.geospark.knnJudgement.GeometryDistanceComparator.compare(GeometryDistanceComparator.java:39)
        at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
        at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
        at org.spark-project.guava.collect.Ordering.max(Ordering.java:551)
        at org.spark-project.guava.collect.Ordering.leastOf(Ordering.java:667)
        at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1360)
        at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1357)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Precompiled jar for geospark

The repository has a maven project for GeoSpark. Is there a pre-compiled jar of this project available in the repository or anywhere else that can be consumed directly? Otherwise the project will have to be built manually before every use.

Not sure of the order of arguments to be passed to new Envelope

GeoSpark fails on Apache Spark 2.X version

Since Apache Spark 2.X is compiled by Scala 2.11, GeoSpark fails on Spatial KNN query and Join Query due to incompatible issues. GeoSpark works well with Apache Spark 1.X versions which are compiled by Scala 2.10. We are working on this issue.

LineStringRDD Constructor refer's to the wrong geometric type for raw JavaRDD

Raw constructor expects JavaRDD<Polygon>.
It should be expecting JavaRDD<LineString>

Radius units and ordering by distance in JoinQuery.DistanceJoinQuery

Hi!
Unfortunately I haven't found units for radius parameter in CircleRDD constructor.
I seems that it's not in km.
Second question: Wouldn't it be a good idea to return joined points sorted by distance(not just HashSet)?

KNNQuery for two PointRDD

Could you make a KNNQuery function which would take two PointRDD in parameters and would do a basic KNNQuery for every points of one RDD with the other PointRDD?
For the moment we can only do a KNNQuery between one point and one PointRDD. I would like to do the same operation with a lot of Points on the same PointRDD. For every point of the first PointRDD I want to find the KNN points from the second PointRDD. I can do a loop of course but I would like to find a optimal way to do it because I am dealing with big datas.

unit of radius in circle RDD

what is the unit for radius in circle RDD?

Conflicting property-based creators, JSON LineString mapping

I believe the version of com.fasterxml.jackson compiled with geospark is out of date, and carries a bug when reading LineString data from JSON.

java.lang.RuntimeException: java.lang.RuntimeException: com.fasterxml.jackson.databind.JsonMappingException: Conflicting property-based creators: already had explicitly marked [constructor for org.wololo.geojson.LineString, annotations: {interface com.fasterxml.jackson.annotation.JsonCreator=@com.fasterxml.jackson.annotation.JsonCreator(mode=DEFAULT)}], encountered [constructor for org.wololo.geojson.LineString, annotations: {interface com.fasterxml.jackson.annotation.JsonCreator=@com.fasterxml.jackson.annotation.JsonCreator(mode=DEFAULT)}] at GeoExample$CustomGeoJsonMapper.call(GeoExample.java:262) at GeoExample$CustomGeoJsonMapper.call(GeoExample.java:232) at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1009) at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980) at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

Please refer to Conflicting property-based creators #8. You may just need update the version you are using.

This is when GeoJSONFactory.create(String json) is used. I didn't think it to be a geospark issue, but seeing as how this issue has been resolved already in the main package, it must mean that geospark is carrying an older version that introduces the bug.

WKT parsing problem in 0.6.1 version of geospark

With the latest updates from 0.6.1-snapshot to 0.6.1 release a bug was introduced regarding line string parsing. The error is: LinearRing do not form a closed linestring

To reproduce you can use the data from the Babylon showcase https://github.com/geoHeil/geoSparkScalaSample/tree/master/src/test/resources and run

git clone https://github.com/geoHeil/geoSparkScalaSample.git
cd geoSparkScalaSample
sbt run
# when propmpted for multiple main classes select 2

SPARK 2.0.1 java.lang.NoSuchMethodError

Hi everyone, I am using Spark 2.0.1, I added geospark to my project but when i run i get this error java.lang.NoSuchMethodError in every tests using parquet. Do you know why?

Questions - EPSG conversion

Hello Everybody,

I have a simple table generated by createOrReplaceTempView of the processing in Zeppelin 0.7.2 and Spark 2.1.1. It contains X, Y and Value. The X and Y are coordinate in EPSG 32722. When I want to convert such coordinate into EPSG 4326, I usually export the table into CSV file, use a tool to convert the coordinate and then import it back into a new table.

It looks like I can do that process directly in Zeppelin and Spark using GeoSpark, especially after the update on 0.8.1 for the particular function that I am after, However, until now I cannot figure out how to implement that using GeoSpark, after a lot of trying following the examples.

I think my problem is very basic. That is I still don't understand the syntax of the functions for that. Does anybody have any hints please? Any example codes in using existing RDD perhaps?

Thanks a lot in advance.

Kind regards,

Anto

shape file import

Is it possible to import shape files? They seem to be rather common.

Does GeoSpark have the function of changing the attribute value and adding attribute fields of each feature in Shapefile, in addition to spatial coordinates

Tutorial: Source and data for tree example

Can we have access to the complete example of the tutorial one the website? I am having trouble trying to replicate the example https://gist.github.com/jiayuasu/667c466f313dd4f2fc1b

Especially mixing spatial and non-spatial features

Support Different Spatial/Geometrical Data Format

GeoSpark need to support different input and output data format:

GML(Geography Markup Language)
WKT (Well-Known Text)
WKB (Well-Known Binary)

handling a large number of polygons

I have polygons which look like this https://gis.stackexchange.com/questions/242710/polygonize-raster-and-shrink-number-of-polygons and there are lots of them.

Usually a spatial join will lookup the bounding box and then for each matching /filtered pair validate the within constraint. As I have a large number of polygons and so far could not find this 2 step process in geospark's code I wonder if this could be integrated easily in order to increase performance as discussed in #91

So far I could only see that objects are

spatially partitioned
looked up from the R tree

I wonder if this pruning is already implemented (lookup only of the envelopes, then validate within constraint) or if it could easily be added.

If numOfPartitions = 0 a divide by 0 error is thrown

Using incompatible JavaRDD API for Spark 1.X (in particular, < 1.6)

In testing the Spatial Join Query example code (scala), I've found that there is a call to JavaRDD.getNumPartitions() in the 0.5.0-spark-1.x version of the library. (See below for the stacktrace.) JavaRDD.getNumPartitions() is not supported until Spark 1.6, and I'm using 1.5.0.

A simple change will fix it. I have a pull request ready to go against the "GeoSpark-for-Spark-1.X" branch. See here: #54

scala> val objectRDD = new PointRDD(sc, arealmCSV, 0, FileDataSplitter.CSV, false); 
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@47f03b18

scala>   val rectangleRDD = new RectangleRDD(sc, zcCSV, 0, FileDataSplitter.CSV, false); 
rectangleRDD: org.datasyslab.geospark.spatialRDD.RectangleRDD = org.datasyslab.geospark.spatialRDD.RectangleRDD@284463cb

scala>   objectRDD.spatialPartitioning(GridType.RTREE);
java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaRDD.getNumPartitions()I
        at org.datasyslab.geospark.spatialRDD.SpatialRDD.spatialPartitioning(SpatialRDD.java:86)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:66)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:78)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:88)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:90)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:92)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:94)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:96)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:98)
        at $iwC$$iwC$$iwC.<init>(<console>:100)
        at $iwC$$iwC.<init>(<console>:102)
        at $iwC.<init>(<console>:104)
        at <init>(<console>:106)
        at .<init>(<console>:110)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


scala>

cannot save image on hdfs

Hi all!
Compliments for excellent geospark and babylon libraries.
I have problems saving image file to HDFS: when using local mode, no problems, when launching it as cluster mode, the save doesn't give errors but I cannot find files.
I tried using many format (hdfs://, etc...).
Could you provide me a very simple example to export images to HDFS, please?
I'm using 0.5.1 forma spark 2
Thank you
Best
Roberto

Keep non-spatial attributes during SRDD computation

User should have the freedom to decide whether keep the non-spatial attributes during SRDD computation instead of dropping the non-spatial attributes directly. It is hard for users to recover the dropped attributes afterwards.

Load Shapefile from HDFS

I have a little problem : I have a NYC borough boundaries shapefile in hdfs in nycbb folder :

geo_export_4541ec0d-b900-4d35-b380-85c5c6300376.dbf
geo_export_4541ec0d-b900-4d35-b380-85c5c6300376.shp
geo_export_4541ec0d-b900-4d35-b380-85c5c6300376.shx

I want to do a SpatialJoin with a set of points. So I try to load the shape file first.

When I try :

val  shapefileRDD = new ShapefileRDD(sc,"hdfs://namenode_ip:port/nycbb");

And :

shapefileRDD.getShapeRDD().collect().size()

I get this error :

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 37, 172.16.2.83, executor 2): java.lang.NegativeArraySizeException
	at org.datasyslab.geospark.formatMapper.shapefileParser.parseUtils.shp.ShpFileParser.parseRecordPrimitiveContent(ShpFileParser.java:81)
	at org.datasyslab.geospark.formatMapper.shapefileParser.shapes.ShapeFileReader.nextKeyValue(ShapeFileReader.java:48)
	at org.datasyslab.geospark.formatMapper.shapefileParser.shapes.CombineShapeReader.nextKeyValue(CombineShapeReader.java:90)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
  at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
  at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
  ... 46 elided
Caused by: java.lang.NegativeArraySizeException
  at org.datasyslab.geospark.formatMapper.shapefileParser.parseUtils.shp.ShpFileParser.parseRecordPrimitiveContent(ShpFileParser.java:81)
  at org.datasyslab.geospark.formatMapper.shapefileParser.shapes.ShapeFileReader.nextKeyValue(ShapeFileReader.java:48)
  at org.datasyslab.geospark.formatMapper.shapefileParser.shapes.CombineShapeReader.nextKeyValue(CombineShapeReader.java:90)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at scala.collection.AbstractIterator.to(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  ... 1 more

Did I miss something ?

I tried to access the shapefiles stored in hdfs using sc.textFile for exemple and the file exists and is not empty

Borough Boundaries.zip
.
I use spark 2.1 and GeoSpark 0.8.1.

Rectangle Join Test

I am trying to test the Rectangle Join Query but getting this errors here.

1: List<Tuple2<Polygon, HashSet>> result =
JoinQuery.SpatialJoinQuery(spatialRDD,queryRDD,false,true).collect();

The method SpatialJoinQuery(RectangleRDD, RectangleRDD, boolean) in the type JoinQuery is not applicable for the arguments (RectangleRDD, RectangleRDD, boolean, boolean)

2: When I remove one of the booleans:
Type mismatch: cannot convert from List<Tuple2<Envelope,HashSet>> to List<Tuple2<Polygon,HashSet>>

Support LineStringRDD

A LineString is a one-dimensional object representing a sequence of points and the line segments connecting them.

Using Multipolygons for more efficient joins

I have polygons which look like https://i.stack.imgur.com/w6yup.png and there are lots of them which lead to a gigantic consumption of RAM #91 as well as a very cpu intensive task for https://github.com/DataSystemsLab/GeoSpark/blob/master/core/src/main/java/org/datasyslab/geospark/spatialOperator/JoinQuery.java#L87

Description of my case: as you see in the link an object will emit 2-3k polygons and I have roughly speaking 100000 objects. For each polygon there is an ID which links back to the object and a value. When aggregating the polygons to multi-polygons (one per object per value) their number and thus the number of lookups should shrink considerably. Is there a possibility to implement this in geospark?

custom input format

I have some ascii text files which contain meta data

    key value
    some other
    x 123
    y 456

and matrix of values (heatmap)

X and y specify bottom left corner of the object in WSG64 coordinate system, the matrix bottom left value starts there as well and is evenly spaced in a grid of lets say 10 meters.

Currently gdal and ogr2ogr are used to create a more readable csv
filename_metadataKey_value
which contains a MULTIPOLYGON of the matrix.

How could I use your custom input format to map the matrix to a multi polygon?

Support insert into the RectangleRDD and the R-tree index

It seems that the current implementation doesn't allow to insert new elements (envelopes) into the RectangleRDD and into the built R-tree index. Unfortunately, I have a use case which requires just that as the envelopes are consumed from a dynamic stream of data coming into Spark using Spark streaming from a Kafka topic.

EPSG coordinate system missing

I follow your example of using the coordinate system, but get the following error: "EPSG:4326" from authority "EPSG" found for object of type "EngineeringCRS".

"Node capacity must be greater than 1"

I noticed when you want to create a PointRDD with a rtree grid from a rdd with 100 units
new PointRDD(rddPoint,"rtree",1)
you get an error "Node capacity must be greater than 1"
totalNumberOfRecords = 100
sampleNumberOfRecords =totalNumberOfRecords/100 = 1
nodecapacity = sampleNumberOfRecords / NumPartitions = 1
I think it will be better if the user can choose the percentage of data used to create the grid (new parameter) or if totalNumberOfRecords/100 = 1 sampleNumberOfRecords is equal to 0 (like it is the case if totalNumberOfRecords/100 < 1).
I know you don't really need a rtree for a dataset with 100 rows, it just that I don't want that my system crash if there is a shitty dataset in my workspace.
thanks for your consideration.

jar pushed to maven : too fat ?

Hi folks,
Thank you for adding geospark to maven repo.
I think you may have a problem with the jar you build and push to maven, though.
Your jar includes way too much classes from dependencies (one can found scala std lib classes, py4j classes, etc.). This may lead to classpath conflict, or may be the cause of issue 22 for instance.
Mathieu

Save spatial index for further usage

I've gone through the code and I'm not sure whether I understand it right or not. It seems to me that the spatial index is not supported to be saved for further reuse at this stage. Currently, I would like to save the index to HDFS integrated with the dataset so that I don't need to generate the spatial index every time I try to use the dataset. Could you please give me some suggestions or probably I misunderstand the function.

Thanks.

Babylon unit test failure wrong path

Babylon unit test fail, as the path to
Caused by: java.io.IOException: Mkdirs failed to create file:/Users/jiayu/GitHub/GeoSpark-jiayu-repo/target/scatterplot/PointRDD-parallel-raster/_temporary/0/_temporary/attempt_20170423172254_0047_m_000003_280
is hard coded.

Spatial join does not find all points within polygon

The spatial join does not find all points within a set of polygons.
I take the unmatched points, and subsequently join them with the same polygons, and find more (incomplete) matches.

How to provide semantics on data?

For the most part (except for one bug I found) I'm able to do spatial queries and calculate nearest neighbors, but I am unable to determine how to add semantics to the location data. In other words, if each point is named, how can we get the names when doing a query? For an example of what I mean, say we have a CSV with points that represent locations A, B, and C:

    A,0.0,0.0
    B,1.0,0.0
    C,2.0,0.0

When one does a KNN query, let's say the library returns the point (1.0, 0.0). How are we to associate the point returned with the location name A, B, or C? A lookup against an RDD or Dataframe on the same location dataset would be possible, but you would have to select the row using a key of the double values (X, Y). This is probably okay, but doing that seems fragile. For example, if the KNN query ever returned a calculated value for some reason rather than the actual bitwise representation stored in the dataset. Also this assumes the dataset / spark / whatever uses the same bitwise representation.

Are we expected to do the lookup using the coordinates? Or is there another method I'm not aware of?

Thanks...

Load and save SpatialRDD

Hi guys,
I would like to know what is the best way to save and load a PointRDD, PolygonRDD, RectangleRDD? I have big datasets containing geopoints and I don't want to create a PointRDD and index it everytime I need to do a match with another dataset. I want to create a PointRDD and save it in my HDFS. Something as
val p = new PointRDD()
p.save("path")

Babylon distributed image compile error

I get the following compile error in case Babylon distributed images are used

overloaded method value SaveAsFile with alternatives:
[error]   (x$1: java.util.List[String],x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean <and>
[error]   (x$1: java.awt.image.BufferedImage,x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean <and>
[error]   (x$1: org.apache.spark.api.java.JavaPairRDD,x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean
[error]  cannot be applied to (org.apache.spark.api.java.JavaPairRDD[Integer,String], String, org.datasyslab.babylon.utils.ImageType)
[error]     imageGenerator.SaveAsFile(visualizationOperator.distributedVectorImage, outputPath, ImageType.SVG)

e.g. the following using simply vector images compiles fine:

def buildScatterPlot(outputPath: String, spatialRDD: SpatialRDD): Boolean = {
    val envelope = spatialRDD.boundaryEnvelope
    val s = spatialRDD.getRawSpatialRDD.rdd.sparkContext
    val visualizationOperator = new ScatterPlot(7000, 4900, envelope, false, -1, -1, true, true)
    visualizationOperator.CustomizeColor(255, 255, 255, 255, Color.GREEN, true)
    visualizationOperator.Visualize(s, spatialRDD)
    import org.datasyslab.babylon.utils.ImageType
    imageGenerator.SaveAsFile(visualizationOperator.vectorImage, outputPath, ImageType.SVG)
  }

@jiayuasu please could you explain a bit if distributedvevctorimage is required in case the boolean option for parallel image rendering / filter was selected.

Setting StorageLevel to MEMORY_AND_DISK

I've noticed that at the moment storage level of SpatialRDDs is hardcoded to MEMORY_ONLY.

Question: Is there a reason for this and are there any potential pitfalls of setting it to MEMORY_AND_DISK_SER?

Details:
I'm planning to use GeoSpark to query a large dataset (>30Tb), which does not fit in memory.

If the storage level is set to MEMORY_ONLY, this would mean that partitions would have to be recomputed as soon as the memory fills up. This is a problem because I'm doing a significant amount of preprocessing before creating SpatialRDDs. This also means that if I have an index set, this index will also have to be recomputed when I run out of memory, which contradicts the whole point of setting an index. Please correct me if I'm wrong.

A (naive) solution to this issue would be simply creating a custom build of GeoSpark where the storage level is replaced to MEMORY_AND_DISK, however, I wanted to make sure that this will not cause any trouble. Ideally, user should be able to set the storage level as a parameter.

Thanks!

Add native support of Spark DataFrame and Spark SQL

Currently, GeoSpark manipulates the original Spark RDD. Some users suggest that adding the native support of Spark DataFrame may help people. We will add the native support in the upcoming versions.

For now, an easy workaround to get DataFrame rdd is that:

To get a spatial partitioned DataFrame, call GeoSpark myPointRDD. gridPointRDD / myRectangelRD.gridRectangelRDD / myPolygonRDD.gridPolygonRDD to get the spatial partitioned Spark RDD and then convert it to DataFrame.
To get a DataFrame from Range query result or any Spatial RDDs , call myPointRDD.getRawPointRDD() / myRectangelRDD.getRawRectangelRDD() / myPolygonRDD.getRawPolygonRDD() and then convert it to DataFrame.

Support 3D points

The Coordinate structure appears to support more then two dimensions. However, I saw RDD objects like PointRDD, and indexes works only in a two dimension world. Is it possible to use this package to 3D environment or would you like to recommend an alternative package?

My problem is to find the neighbor points for each point and then apply some custom function. But, there, points have 3 dimensions.

SpatialJoinQuery self join performance

Hi,
I am doing a benchmark on this feature.
I have nearly 1Million Polygons in my RDD.
I tried to self join to find the overlaping polygons.
I tested on AWS EMR with 4 nodes. each node has 4vCPU and 30GB memory.
The total Execution time is around 1 hour.

geoPRDD.spatialPartitioning(GridType.RTREE);
geoPRDD.buildIndex(IndexType.RTREE, true);
geoPRDD.indexedRDD.persist(StorageLevel.MEMORY_ONLY());
geoPRDD.spatialPartitionedRDD.persist(StorageLevel.MEMORY_ONLY());
JoinedRDD = JoinQuery.SpatialJoinQuery(geoPRDD, geoPRDD, true, true);

But from your paper, the spatial join query should be much faster than expected. Even though your paper used points.

Is that normal costing such a long time for computing and is there any way to boost the performance?

K-Nearest Neighbor (kNN) Functionality

Find the K-nearest neighbors to a specific point or a set of points

Correlation Methods and Java Examples

Hi,

I was reading through the functionality provided by GeoSpark and I am really interested in what is on offer. I was especially interested in the "Spatial aggregation, spatial autocorrelation and spatial co-location" methods. But could not find any details on how these methods have been implemented.

There appeared to be reference to these already being implemented as examples in "app" folder but I was unable to locate this.

Any information would be greatly appreciated.

Sam

Replace String format input with Enum type

Currently, GeoSpark takes some String type inputs as parameters which is not convenient enough. We will use Enum type to replace the String type inputs. This may lead to some API behavior changes.

JoinQuery

Hi guys,
I would like to make a join between two dataframes which contain geospatial data. One of them has a column which contains Points and the other one has a columns which contains Polygons. I want to make a join between theses two dataframes using these two columns. The problem is that i can transform the columns in PointRDD and PolygonRDD and then make the query but it will only return to me a rdd containing Polygon and a list of Point. It is not what I am looking for, I am looking for the index.
Imagine I got
df1 = ((Row_Id,Point),(1,P1),(2,P2),(3,P3),(4,P4))
df1 = ((Row_Id,Polygon),(1,Po1),(2,Po2),(3,Po3),(4,Po4),(5,Po5))
I want
1 5
2 3
2 4
with P1 is in Po5, P2 is in Po3 and Po4.

The best would be if I can do like a join in spark
val matchedRowsDF = rfqDF.joinWith(rfsDF, rfqDF(m.rfqColName) === rfsDF(m.rfsColName))
but using your algorithm.

Thanks for any advices.

Performance of Range Query

For testing performance of GeoSpark, I created a polygon of 7 points(got coordinates from google maps). I am reading those points from KML file and converted Polygon to JavaRDD. Then I indexed PolygonRDD using "rtree".
To find out whether a particular point is inside given polygon or not I am using RangeQuery.
On my local machine, its taking 32ms(approx) to give me result. My Polygon won't change for lifetime so I cached JavaRDD but still it took approx 32ms when I ran it several times.
So when I run RangeQuery, does spark build the tree at that particular time and then try to find whether given envelope is in range of that polygon or range query itself is taking this much time?
Note: I have only one polygon in my PolygonRDD