qwshen / spark-flight-connector
A Spark Connector that reads data from / writes data to Arrow-Flight end-points with Arrow-Flight and Flight-SQL
License: GNU General Public License v3.0
Hi Wayne
I have a parquet file in Dremio and used the query below to read it from my Spark setup.
val df = spark.read
  .option("host", "[SPARK_IP]")
  .option("port", 32010)
  .option("tls.enabled", false)
  .option("tls.verifyServer", false)
  .option("user", "user")
  .option("password", "password")
  .option("partition.size", 320)
  .option("partition.byColumn", "COLUMN1")
  .flight(""""dremio_space"."file"""")
  .filter("COLUMN == 22")
After the above query, I executed df.count: 320 Spark jobs ran and the count was 15. When I ran df.count a second time, 640 jobs ran and the count was 30; a third time, 960 jobs ran and the count was 45.
Do you have any idea why the number of jobs and the count keep increasing with each execution?
The same issue happens when I run a GROUP BY on the dataframe.
Please let me know what I should do to fix this issue.
Thanks
Nagaraja M M
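
A diagnostic sketch, not a confirmed fix: assuming each action re-reads the Flight endpoint and the duplication happens on re-read, caching the dataframe materializes it once, so repeated counts should at least be stable. The host, credentials, and table names are the placeholders from the query above.

val df = spark.read
  .option("host", "[SPARK_IP]")
  .option("port", 32010)
  .option("user", "user")
  .option("password", "password")
  .flight(""""dremio_space"."file"""")

// Cache so that repeated actions reuse the same materialized partitions
// instead of re-reading the Flight endpoint each time.
df.cache()
df.count() // first action scans the endpoint and populates the cache
df.count() // subsequent actions should now return the same value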
Hi, it seems that when I add a filter or where condition, the push-down optimization fails if the base query also contains a where condition.
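
A minimal sketch of the scenario as I read it, assuming the connector accepts a SQL statement as the flight source (which the mention of a "base query" suggests); the host, credentials, and column names are placeholders:

val df = spark.read
  .option("host", "dremio-host")
  .option("port", 32010)
  .option("user", "user")
  .option("password", "password")
  .flight("""select * from "space"."table" where COL_A > 0""")

// The extra condition below is the one whose push-down appears to fail
// when the base query itself already contains a where clause.
df.filter("COL_B = 1").show()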
Hi
I set up Dremio v23.0.1 with a master and an executor, and Spark v3.2.2, on Linux machines; all nodes are on different VMs.
I logged in to Dremio and it is working fine.
I tried connecting to the Apache Arrow Flight endpoint from a Python script; it works and is able to fetch data.
I built spark-flight-connector into a jar and used the command below to run Spark.
./spark-shell --master local[*] \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2 \
  --jars ./spark-flight-connector_3.2.1-1.0.1.jar \
  --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg_catalog.type=hadoop \
  --conf spark.sql.catalog.iceberg_catalog.warehouse=file:///home/name/tem/data
But when I tried the Spark code below to read data, it throws an UNAUTHENTICATED error.
spark.read.format("flight").option("host", "").option("port", "32010").option("user", "test").option("password", "test").option("table", """"name"."table"""").load
Please help me figure out how to read data from the Arrow Flight endpoint using Spark.
Thanks
Nagaraj M M
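
A hedged sketch of what I would check first: the host option in the snippet above is empty, which by itself can make the connection fail, and the TLS options here are assumptions carried over from the options shown in the first post, not confirmed settings for this server.

spark.read.format("flight")
  .option("host", "dremio-coordinator-host") // must point at the Dremio coordinator; it is empty above
  .option("port", "32010")
  .option("tls.enabled", "false")            // match the server: plain gRPC, as in the first post
  .option("user", "test")
  .option("password", "test")
  .option("table", """"name"."table"""")
  .load()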