sjrusso8 / spark-connect-rs
Apache Spark Connect Client for Rust
Home Page: https://docs.rs/spark-connect-rs
License: Apache License 2.0
Create similar bindings as with Rust but available in server-side JS (node, deno, bun, ...). The SDK should closely resemble the Rust one, and only deviate when necessary due to napi limitations or when the Rust approach is unidiomatic in JS.
napi.rs seems to be a good crate to leverage and is relatively easy to use.
The branch feat/napi contains a super quick pass at creating the bindings. The experiment only covers these areas: .sql, select and filter, count(), and show().
There is a lot of use of clone(), and some not-great implementations that create a new empty dataframe to satisfy the napi requirements. The polars JS interop is a good example of how the bindings might function.
The collect() call panics when returning a large result; the arrow-ipc StreamReader is not parsing the data correctly.
Example:
use spark_connect_rs;
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut spark: SparkSession =
        SparkSessionBuilder::remote("sc://127.0.0.1:15002/;user_id=example_rs")
            .build()
            .await?;

    spark
        .clone()
        .range(None, 100000, 1, Some(1))
        .collect()
        .await
        .unwrap();

    Ok(())
}
This results in a panic:
thread 'main' panicked at /home/sjrusso/Documents/code/projects/rust-projects/spark-connect-rs/src/session.rs:191:30:
called `Result::unwrap()` on an `Err` value: IpcError("Not expecting a schema when messages are read")
stack backtrace:
0: rust_begin_unwind
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
2: core::result::unwrap_failed
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
3: core::result::Result<T,E>::unwrap
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1073:23
4: spark_connect_rs::session::SparkSession::consume_plan::{{closure}}
at ./src/session.rs:191:12
5: spark_connect_rs::dataframe::DataFrame::collect::{{closure}}
at ./src/dataframe.rs:99:14
6: sql::main::{{closure}}
at ./examples/sql.rs:21:10
7: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/park.rs:282:63
8: tokio::runtime::coop::with_budget
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:107:5
9: tokio::runtime::coop::budget
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:73:5
10: tokio::runtime::park::CachedParkThread::block_on
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/park.rs:282:31
11: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/blocking.rs:66:9
12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
13: tokio::runtime::context::runtime::enter_runtime
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/runtime.rs:65:16
14: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
15: tokio::runtime::runtime::Runtime::block_on
at /home/sjrusso/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/runtime.rs:349:45
16: sql::main
at ./examples/sql.rs:46:5
17: core::ops::function::FnOnce::call_once
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Implement the ability to use positional/keyword args with sql. Because of the differences between Python and Rust, the function arguments need to be clearly implemented.
The PySpark process for sql allows literals and dataframes to be passed in one argument. However, Rust probably won't take too kindly to that input arg. If a user passes in a DataFrame, it will need to be handled with a SubqueryAlias, and if it's a literal it will be passed in as either a positional or a keyword argument.
We might want to only allow two input parameters, both as Options. Something like this?
sql<T: ToLiteral>(self, sql_query: &str, col_args: Option<HashMap<String, T>>, df_args: Option<HashMap<String, DataFrame>>) -> DataFrame
This could allow a user to do these variations.
spark.sql("SELECT * FROM table", None, None).await?;
let df = spark.range(...);
// create the hashmap
spark.sql("SELECT * FROM {df}" None, Some(df_hashmap)).await?;
let col = "name";
// create the hashmap
spark.sql("SELECT {col} FROM {df}", Some(col_hashmap), Some(df_hashmap)).await?;
Or should positional SQL be a completely different method altogether, like sql_params? That way a user doesn't need to fuss with adding None x2 to all their sql statements.
Create the DataFrameStatFunctions object and implement the remaining methods for approxQuantile, corr, cov, crossTab, freqItems, and sampleBy.
There are many functions for Spark, and most of them are created via a macro. However, not all of them have unit test coverage. Create additional unit tests based on similar test conditions from the existing Spark API test cases.
I have been mirroring the docstring tests from the PySpark API for reference.
Implement the missing methods for checkpoint and localCheckpoint on the DataFrame.
The examples currently use paths that are for the Docker workflow.
It would be cool if the examples could also work with non-Docker setups (e.g. when I manually spin up Spark Connect on localhost).
Perhaps we can check all those data files into this repo, so these examples work out of the box with both Docker and a localhost Spark Connect setup.
This issue will be the organizing issue for all the remaining Spark functions to implement as methods.
Based on the readme, here is the list.
Create the DataFrameNaFunctions object and implement the remaining methods for drop, fill, and replace.
Spark 4.0 implements changes to the connect proto. We will need to analyze the spec and identify what has changed.
Additionally, we will need to support both a client for Spark 3.5 and Spark 4.0. There should be a feature flag on the client for 3_5 or 4_0.
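One possible shape for the version gating, assuming hypothetical feature names spark_3_5 / spark_4_0 and a build.rs that writes the generated proto code into per-version subdirectories of OUT_DIR (both names and paths are illustrative, not decided):

```rust
// Hypothetical feature names; only one generated proto module is compiled
// into the client per build.
#[cfg(feature = "spark_3_5")]
pub mod proto {
    include!(concat!(env!("OUT_DIR"), "/spark3_5/spark.connect.rs"));
}

#[cfg(feature = "spark_4_0")]
pub mod proto {
    include!(concat!(env!("OUT_DIR"), "/spark4_0/spark.connect.rs"));
}
```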
Being able to compile the Rust bindings into wasm32-unknown-unknown and/or wasm32-wasi would be interesting. This could allow some interesting interactions between a browser and Spark. A wasm32-wasi target would allow Spark programs to run on any runtime.
A feature flag under the core bindings for wasm already exists and does compile successfully to the targets mentioned above. The issue arises when trying to send an HTTP/1.1 request with grpc-web to the Spark Connect server, which only accepts normal HTTP/2 gRPC requests. There are ways of standing up a proxy server with Envoy to forward the gRPC browser request to the backend server, but this feels like a lot of effort to push onto the client.
The branch feat/wasm contains the early experiment of trying to run the wasm build with wasmtime. Issues arise with using async code in wasm. There is probably a way to code it correctly, but I don't have time to finish the experiment.
Implement the methods for createTable and createExternalTable on the Catalog.
Commands I ran (I am a total n00b):
- rustup
- cargo new spark-connect-gist
- updated Cargo.toml to include the spark-connect-rs and tokio dependencies:
  [dependencies]
  spark-connect-rs = "0.0.1-beta.3"
  tokio = { version = "1", features = ["full"] }
- arch -arm64 brew install cmake
- brew install protobuf
- main.rs: https://gist.github.com/sjrusso8/2b4e43af462367a15f91db5a33627449
- cargo run
Here is the error message:
Compiling spark-connect-gist v0.1.0 (/Users/matthew.powers/Documents/code/my_apps/spark-connect-gist)
Finished dev [unoptimized + debuginfo] target(s) in 1.00s
Running `target/debug/spark-connect-gist`
Error: tonic::transport::Error(Transport, hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 61, kind: ConnectionRefused, message: "Connection refused" })))
The overall documentation needs to be reviewed and matched against the Spark Core Classes and Functions. For instance, the README should be accurate about which functions and methods are currently implemented compared to the existing Spark API.
However, there are probably a few misses that are either currently implemented but marked as open, or were accidentally excluded. Might consider adding a few sections for other classes like StreamingQueryManager, DataFrameNaFunctions, DataFrameStatFunctions, etc.
Implement the initial methods to read and write .csv, .json, .orc, .parquet, and .text.
Consider creating a ConfigOpts trait and having a custom struct represent the options for each of those file types. The user creates the options object, modifies it, and passes it into the method.
let mut opts = CsvOptions::new();
opts.header = true;
opts.delimiter = b'|';

let df = spark.read().csv(path, opts);
Example of what the function signature might look like:
impl DataFrameReader {
    ....
    pub fn csv<C: ConfigOpts>(path: &str, opts: Option<C>)
}
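To make the idea more concrete, a rough sketch of the trait/struct pairing; the ConfigOpts and CsvOptions names come from the snippets above, while the to_options method and the exact fields are assumptions for illustration only:

```rust
use std::collections::HashMap;

// Each file format gets its own typed options struct; the trait flattens the
// struct into the string key/value map the read/write request expects.
pub trait ConfigOpts {
    fn to_options(&self) -> HashMap<String, String>;
}

#[derive(Default)]
pub struct CsvOptions {
    pub header: bool,
    pub delimiter: u8,
}

impl CsvOptions {
    pub fn new() -> Self {
        Self::default()
    }
}

impl ConfigOpts for CsvOptions {
    fn to_options(&self) -> HashMap<String, String> {
        let mut opts = HashMap::new();
        opts.insert("header".to_string(), self.header.to_string());
        opts.insert("delimiter".to_string(), (self.delimiter as char).to_string());
        opts
    }
}
```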
The initial DataFrameWriter is created, but there is also another way to write data via DataFrameWriterV2. This writer has a slightly different implementation and leverages a different proto command message.
I think the methods should mirror the ones found on the Spark API guide, and a new method should be added onto the DataFrame for writeTo.
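A possible shape for that entry point, purely as a sketch; the write_to name mirrors Spark's writeTo, and the DataFrameWriterV2 constructor shown here is an assumption, not the crate's current API:

```rust
impl DataFrame {
    // Returns the v2 writer builder targeting the named table,
    // mirroring PySpark's DataFrame.writeTo(table).
    pub fn write_to(self, table: &str) -> DataFrameWriterV2 {
        DataFrameWriterV2::new(self, table)
    }
}
```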
When creating a dataframe from spark.sql and then using select, an error is thrown. Expected to be able to select the column from the data.
Git submodules are annoying, and the current submodule is only ever checked out at a specific release tag. It would be easier to just have the folder containing the copied protobufs.
I think it might look something like this in the repo:
├── core            <- core implementation in Rust
│   ├── spark4_0    <- protobuf for spark 4.0.0
│   └── spark3_5    <- protobuf for spark 3.5.1
Related to #61
There is an error when the release-to-cargo pipeline is run, so I have been running it manually.
Error run 9025621050
I already have a fix in mind for this issue, and I can make a PR if this issue is reviewed and approved, thanks!
The current spark-connect-rs library configures all endpoints to the https scheme. When the tls feature is enabled, a connection cannot be made to a server without TLS configured, for example, when connecting to a Spark cluster set up on localhost.
When the tls feature is enabled in the spark-connect-rs crate, connections to servers both with and without TLS configured should be successful.
Default the endpoint scheme to http, and set it to https only when use_ssl=true is specified in the connection string.
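Roughly the behavior being proposed, as a standalone sketch; the helper name and the way the connection string is split are assumptions, only the use_ssl=true parameter comes from the Spark Connect connection-string format:

```rust
// Picks the endpoint scheme from the connection string, defaulting to http.
fn endpoint_scheme(conn_str: &str) -> &'static str {
    // e.g. "sc://localhost:15002/;use_ssl=true;user_id=example"
    let use_ssl = conn_str
        .split(';')
        .any(|kv| kv.trim().eq_ignore_ascii_case("use_ssl=true"));

    if use_ssl { "https" } else { "http" }
}
```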
Hi, thanks a lot for this nice crate!
I'd like to report a deadlock issue when a spark session is cloned and used concurrently.
#46 demonstrates a possible workflow leading to a deadlock.
The gist is: everywhere #[allow(clippy::await_holding_lock)] is used poses a possibility of a deadlock when a spark session is cloned and used concurrently. When a task is suspended while holding a lock, another task will wait for the lock to be released without yielding the executor. This is a very common "dangerous" asynchronous programming pattern that should always be avoided at all cost.
Therefore, I would suggest that we either remove the clone method from SparkSession, or replace the RwLock with an asynchronous lock. What do you think?
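For illustration only (not the crate's actual code), a minimal sketch of the hazard and of the async-lock alternative mentioned above:

```rust
use std::sync::{Arc, RwLock};

async fn some_io() { /* stands in for an awaited gRPC call */ }

// Dangerous: the std RwLock guard is held across an .await, so a task parked
// here blocks every other task that needs the lock without yielding progress.
async fn dangerous(shared: Arc<RwLock<Vec<u8>>>) {
    let _guard = shared.write().unwrap();
    some_io().await;
}

// Safer: tokio's RwLock is async-aware, so waiting on the lock (and holding
// it across .await) cooperates with the executor instead of deadlocking it.
async fn safer(shared: Arc<tokio::sync::RwLock<Vec<u8>>>) {
    let _guard = shared.write().await;
    some_io().await;
}
```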
Hi @sjrusso8, can you tell me the rationale behind exposing a camel-cased API?
It seems like we should be fine just using idiomatic Rust snake casing.
On a separate note, do you have an IM channel of some kind to chat about this project? I would love to iterate more rapidly with you.
Lots of other programming languages can access gRPC and Arrow, which means many different languages will be recreating the core client logic to handle the client requests and response handling.
The idea is that there could be one kernel client that all other programming languages use, and then each specific language is left to implement the specifics of the core Spark objects.
- Move client.rs into core, and move all other Rust-specific implementations (the parts of client.rs that are only for the rust library) into rust
- Update client.rs to create a new ConnectClientError error type (it currently leverages SparkError)
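A sketch of what a dedicated ConnectClientError might look like, assuming thiserror is available; the variants shown are illustrative only, not a proposed final set:

```rust
#[derive(Debug, thiserror::Error)]
pub enum ConnectClientError {
    // failures establishing the gRPC channel
    #[error("transport error: {0}")]
    Transport(#[from] tonic::transport::Error),
    // non-OK statuses returned by the Spark Connect service
    #[error("rpc status: {0}")]
    Status(#[from] tonic::Status),
    // failures decoding the arrow IPC payloads in responses
    #[error("arrow error: {0}")]
    Arrow(#[from] arrow::error::ArrowError),
}
```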