qwshen / spark-flight-connector
A Spark Connector that reads data from / writes data to Arrow-Flight end-points with Arrow-Flight and Flight-SQL
License: GNU General Public License v3.0
Hi Wayne
I have a parquet file in Dremio and used the query below to read it from my Spark setup.
val df = spark.read
  .option("host", "[SPARK_IP]")
  .option("port", 32010)
  .option("tls.enabled", false)
  .option("tls.verifyServer", false)
  .option("user", "user")
  .option("password", "password")
  .option("partition.size", 320)
  .option("partition.byColumn", "COLUMN1")
  .flight(""""dremio_space"."file"""")
  .filter("COLUMN == 22")
After the above query, I executed df.count: 320 Spark jobs ran and the count was 15. When I ran df.count a second time, 640 jobs ran and the count was 30; a third time, 960 jobs ran and the count was 45.
Do you have any idea why the number of jobs and the count keep increasing with each execution?
The same issue happens when I run a GROUP BY on the dataframe.
Please let me know what I should do to fix this issue.
Thanks
Nagaraja M M
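
A diagnostic sketch, not a confirmed fix: assuming each action re-reads the Flight endpoint and the duplication happens on re-read, caching the dataframe materializes it once, so repeated counts should at least be stable. The host, credentials, and table names are the placeholders from the query above.

val df = spark.read
  .option("host", "[SPARK_IP]")
  .option("port", 32010)
  .option("user", "user")
  .option("password", "password")
  .flight(""""dremio_space"."file"""")

// Cache so that repeated actions reuse the same materialized partitions
// instead of re-reading the Flight endpoint each time.
df.cache()
df.count() // first action scans the endpoint and populates the cache
df.count() // subsequent actions should now return the same value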
Hi, it seems that when I add a filter or where condition, the push-down optimization fails if the base query also contains a where condition.
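
A minimal sketch of the scenario as I read it, assuming the connector accepts a SQL statement as the flight source (which the mention of a "base query" suggests); the host, credentials, and column names are placeholders:

val df = spark.read
  .option("host", "dremio-host")
  .option("port", 32010)
  .option("user", "user")
  .option("password", "password")
  .flight("""select * from "space"."table" where COL_A > 0""")

// The extra condition below is the one whose push-down appears to fail
// when the base query itself already contains a where clause.
df.filter("COL_B = 1").show()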
Hi
I set up Dremio v23.0.1 with a master and an executor, and Spark v3.2.2, on Linux machines; all nodes are on different VMs.
I logged in to Dremio and it is working fine.
I tried connecting to the Apache Arrow Flight endpoint from a Python script; it works and is able to fetch data.
I built spark-flight-connector into a jar and used the command below to run Spark.
./spark-shell --master local[*] \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2 \
  --jars ./spark-flight-connector_3.2.1-1.0.1.jar \
  --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg_catalog.type=hadoop \
  --conf spark.sql.catalog.iceberg_catalog.warehouse=file:///home/name/tem/data
But when I tried the Spark code below to read data, it throws an UNAUTHENTICATED error.
spark.read.format("flight").option("host", "").option("port", "32010").option("user", "test").option("password", "test").option("table", """"name"."table"""").load
Please help me figure out how to read data from the Arrow Flight endpoint using Spark.
Thanks
Nagaraj M M
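
A hedged sketch of what I would check first: the host option in the snippet above is empty, which by itself can make the connection fail, and the TLS options here are assumptions carried over from the options shown in the first post, not confirmed settings for this server.

spark.read.format("flight")
  .option("host", "dremio-coordinator-host") // must point at the Dremio coordinator; it is empty above
  .option("port", "32010")
  .option("tls.enabled", "false")            // match the server: plain gRPC, as in the first post
  .option("user", "test")
  .option("password", "test")
  .option("table", """"name"."table"""")
  .load()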