Comments (7)

jiayuasu commented on July 24, 2024

@robertnagy1 I guess you need to use something like abfss://XXX. This path must be available to all workers in the cluster.

This should apply to all other Spark native data sources as well.
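For illustration, a minimal sketch of what a fully qualified read could look like; the workspace and lakehouse names and the OneLake host below are assumptions for a Fabric setup, not taken from this thread:

# Minimal sketch: load GeoParquet through a fully qualified ABFSS path
# that every worker can resolve. Workspace/lakehouse names are placeholders.
from sedona.spark import SedonaContext

sedona = SedonaContext.create(spark)  # assumes an existing SparkSession `spark`

path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Files/example-1.0.0-beta.1.parquet"
)
sedona.read.format("geoparquet").load(path).show()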

robertnagy1 commented on July 24, 2024

That's the thing: Fabric mounts the DFS to the Lakehouse by default, so there is internal logic that propagates the mount to the other cluster workers. The DFS path is never exposed to the end user, only the alias. I don't know what Sedona expects when it reads Parquet, or how it tries to reach the file. As mentioned, in Azure Synapse you would find the path to write to using mssparkutils.fs.ls(""), while in Fabric you would find it using os.listdir("").
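For reference, a hedged sketch of the two discovery approaches mentioned above; the listed paths are placeholders, and mssparkutils availability depends on the notebook runtime:

# Sketch of the two path-discovery approaches described above.
# In Synapse (and Fabric notebooks), mssparkutils lists the distributed FS:
from notebookutils import mssparkutils  # runtime-provided in Synapse/Fabric

for info in mssparkutils.fs.ls("Files/"):
    print(info.path)

# In Fabric, the default lakehouse is also mounted locally, so plain os works:
import os
print(os.listdir("/lakehouse/default/Files"))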

jiayuasu commented on July 24, 2024

@robertnagy1 Just tried. This works for me:

sedona.read.format("geoparquet").load("Files/example-1.0.0-beta.1.parquet").show()

My file location (choose the relative path for Spark): [screenshot of the Lakehouse file browser]

robertnagy1 commented on July 24, 2024

Replaced the path. It seems it is enough to refer to "Tables"; you don't need to refer to the full mounted path /lakehouse/default/Tables. But I got a second error. I am using an osm_buildings dataset that I converted to Parquet.

Py4JJavaError Traceback (most recent call last)
Cell In[64], line 1
----> 1 df = sedona.read.format("geoparquet").load("Files/samples/parquet/buildings_parquet.parquet").show()

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py:899, in DataFrame.show(self, n, truncate, vertical)
893 raise PySparkTypeError(
894 error_class="NOT_BOOL",
895 message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
896 )
898 if isinstance(truncate, bool) and truncate:
--> 899 print(self._jdf.showString(n, 20, vertical))
900 else:
901 try:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317     self.command_header +\
1318     args_command +\
1319     proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):

File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:169, in capture_sql_exception.<locals>.deco(*a, **kw)
167 def deco(*a: Any, **kw: Any) -> Any:
168 try:
--> 169 return f(*a, **kw)
170 except Py4JJavaError as e:
171 converted = convert_exception(e.java_exception)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o6258.showString.
: java.lang.NoSuchMethodError: 'boolean org.apache.spark.sql.internal.SQLConf.parquetFilterPushDownStringStartWith()'
at org.apache.spark.sql.execution.datasources.parquet.GeoParquetFileFormat.buildReaderWithPartitionValues(GeoParquetFileFormat.scala:213)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:569)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:558)
at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:588)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:282)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:279)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:227)
at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:531)
at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:459)
at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:458)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:502)
at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:52)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:282)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:279)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:227)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:400)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:534)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:519)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4203)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3174)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4193)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:642)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4191)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:120)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:209)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:105)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:67)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4191)
at org.apache.spark.sql.Dataset.head(Dataset.scala:3174)
at org.apache.spark.sql.Dataset.take(Dataset.scala:3395)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:297)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:336)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)

jiayuasu commented on July 24, 2024

@robertnagy1 This is likely caused by a Spark + Sedona version mismatch:

Spark 3.0 - 3.3 will use sedona-spark-shaded-3.0_2.12
Spark 3.4 will use sedona-spark-shaded-3.4_2.12
Spark 3.5 will use sedona-spark-shaded-3.5_2.12
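For example, a sketch of pinning the matching artifact when building the session yourself; the Sedona release number and the geotools-wrapper coordinate below are illustrative, not prescriptive:

# Sketch: match the shaded Sedona artifact to the cluster's Spark version.
# Versions shown (Sedona 1.5.1, geotools-wrapper 1.5.1-28.2) are examples.
from pyspark.sql import SparkSession
from sedona.spark import SedonaContext

spark = (
    SparkSession.builder
    .appName("sedona-geoparquet")
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.1,"  # Spark 3.4.x
        "org.datasyncup:geotools-wrapper:1.5.1-28.2",
    )
    .getOrCreate()
)

print(spark.version)  # should agree with the 3.4 suffix in the artifact name
sedona = SedonaContext.create(spark)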

robertnagy1 commented on July 24, 2024

You are totally right. I apologize for not paying attention.

jiayuasu commented on July 24, 2024

No problem. Glad that you solved the problem! I will close the ticket.
