
Comments (22)

dbeavon commented on May 16, 2024

For those of you who are azure-databricks customers and are loading data into azure-sql, would you please contact tech support at Microsoft?

There is no doubt that this is a breaking change for anyone who must upgrade to the azure-databricks runtime 7.x. At the very least they could provide a warning for us in the release notes.

For some reason the azure-databricks team needs a bit of encouragement from us before they'll prioritize a fix in this connector. They don't seem to consider it a priority to support the fast, bulk-insert connector for SQL. Currently they consider this a "third-party" interface. That same opinion seems to be expressed by both the "azure-databricks" team and the "databricks" team. It's odd that they don't really understand the requirement to be able to bulk insert from spark dataframes... All you need to do is google "spark sql bulk insert".

Bulk insert technology in SQL Server has been around for decades, and Spark has a significant need for it. Otherwise we run into some silly and unnecessary bottlenecks on individual record insertions.
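To make the difference concrete, here is a minimal, hypothetical sketch contrasting a generic JDBC write with a bulk-copy write through this connector (url, user, and password are placeholders; tableLock is one of the bulk options documented in the connector README):

    // Generic JDBC sink: rows go in as batched INSERT statements.
    df.write.format("jdbc")
      .option("url", url).option("dbtable", "dbo.events")
      .option("user", user).option("password", password)
      .mode("append").save()

    // This connector: SQL Server bulk copy under the hood.
    df.write.format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", url).option("dbtable", "dbo.events")
      .option("user", user).option("password", password)
      .option("tableLock", "true") // take a table lock for the bulk load
      .mode("append").save()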


rajmera3 commented on May 16, 2024

@ravikd744 No, not yet. There is a PR in the works for Spark 3.0. Once it has been validated, we will update the repository, build, and README with the new support statement.


tkasu commented on May 16, 2024

Any update regarding this? This is a major blocker for us.


dbeavon commented on May 16, 2024

@traberc
To see the necessary code changes you can go look at the PR (#30). There are a few lines of changes.

There is no real issue other than regression testing (aka "necessary validation").

The only substantial programming change is to target a newer version of scala.

In order to get this connector working you need to download the code, open it in IntelliJ, remove the tests, edit the sbt build to target the correct version of Scala, and rebuild. Once this is done, you will have your own private copy of the module that should work fine, but you will have nobody else to support it. This is where I landed after many conversations with folks at databricks, azure-databricks, and conversations here in the connector project.
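For anyone attempting that rebuild, a rough sketch of the sbt changes (the version numbers here are illustrative; PR #30 pins the real ones):

    // build.sbt (excerpt): Spark 3.x is built against Scala 2.12
    scalaVersion := "2.12.11"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1" % "provided"

Then build the fat jar, skipping the tests:

    sbt 'set test in assembly := {}' assembly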

I think what Rahul is saying is that databricks is not in his wheelhouse. I think it is fair to say that this community will start to care more about the topic (spark 3.0 support) once SSBDC is ready to adopt spark 3.0, and not before. You can read more at https://github.com/microsoft/sql-spark-connector

It is frustrating how hard it is for Microsoft to acknowledge that their "azure databricks" needs to properly interoperate with "azure SQL". IMHO this should not be a months-long debate. Another thing that Microsoft won't acknowledge is that this is a regression (as you pointed out). By definition, this is a regression in azure-databricks since we had a bulk-load spark connector in 2.4 and after upgrading to 3.0 we do not.

Things seem especially dysfunctional because there are three separate parties involved and everybody is dodging responsibility. The formal reasoning why databricks is dodging is because this is considered a "third-party" library. In addition to databricks itself, there is also another large team at Microsoft called "azure-databricks" and they do a bit of the software development to ensure databricks can be called a "first-party" service in azure. They build the "glue" that holds databricks in place within the azure cloud. They are also responsible for taking support calls. If these two teams ("databricks" and "azure-databricks") weren't enough, there is yet another team here in the community that is responsible for this connector. And this community project seems to be much more interested in SSBDC than in databricks. I've spent several months being bounced back and forth between these three different sets of folks. I strongly suggest you just be patient and wait for SSBDC to mature a bit more. Otherwise you are likely to waste as much time on the topic as I have.

In addition to waiting for SSBDC to mature, I am eagerly looking forward to seeing how "Synapse Workspaces" will support the interaction between spark and SQL. I can't imagine they won't have a bulk load connector. And they can't really avoid offering full support (like we are seeing with azure-databricks). Moreover it is very possible that whatever connector they create will be compatible with spark 3.0 (in databricks), so you will have an avenue to get support when you get in a pinch.


wboleksii commented on May 16, 2024

@rajmera3 Azure Databricks 6.6 (the last one with Spark 2.x) is set for EOL on Nov 26. This is a very critical issue at this point.


MrWhiteABEX commented on May 16, 2024

Again, I can only recommend just compiling it yourself from the PR and testing it. It is not difficult using sbt. The CI build fails due to the broken pipeline, but the connector works just fine for me. I have a streaming application running in production for about a month on DBR 7.3 that continuously ingests data without issues. At least for the sink with default options, I am quite confident that if there were a major issue I would have hit it. But you have to test it in your dev/qa environment anyway.
The automated tests of this connector are lackluster. They would not detect any incompatibilities in Scala or Spark versions. They hardly scratch the surface.
Spark 3 is a major improvement. For my workload, moving to DBR 7.3 (from 6.5) allowed me to reduce my job cluster size and save about $1,500 (about 30-40%) per month. At least for me it was worth the additional effort.
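If you do build it yourself, a minimal write-then-read-back smoke test along these lines (connection details and table name are placeholders) is a cheap sanity check before promoting the jar:

    val in = spark.range(10000).toDF("id")

    in.write.format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", url).option("dbtable", "dbo.smoke_test")
      .option("user", user).option("password", password)
      .mode("overwrite").save()

    val back = spark.read.format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", url).option("dbtable", "dbo.smoke_test")
      .option("user", user).option("password", password)
      .load()

    assert(back.count() == in.count()) // the rows survived the round trip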


rajmera3 commented on May 16, 2024

On initial inspection, the issue with Spark 3.0 support seems to be a logging class in the connector. If it is replaced, the connector should function.
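The comment above doesn't name the class, so purely as an illustration of a fix of that shape, here is a hypothetical sketch that swaps a Spark-internal logging trait for plain SLF4J (the class and method names are invented):

    import org.slf4j.LoggerFactory

    // was (hypothetical): class BulkWriter extends org.apache.spark.internal.Logging
    class BulkWriter {
      private val log = LoggerFactory.getLogger(getClass)

      def write(): Unit = {
        log.info("starting bulk write") // was: logInfo("starting bulk write")
      }
    }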


ravikd744 commented on May 16, 2024

Hi Rahul, does the latest release support spark 3.0.0?


dbeavon commented on May 16, 2024

Thanks for working on this. We are eager to start using Spark 3 (in Databricks 7). There are lots of factors pushing us in that direction, and the lack of a SQL connector seems to be the only holdup at this time.


shivsood commented on May 16, 2024

PR #30 is addressing this.


briancuster commented on May 16, 2024

It would be really great if this connector supported 3.0. We are currently locked into using 3.0 but would like to use this connector.


sl2bigdata commented on May 16, 2024

Would be really nice to have the upgrade! Blocker for us too... Thx guys


dbeavon commented on May 16, 2024

Sorry to state the obvious, but my understanding is that this issue is being delayed. It won't get much attention until "SQL Server Big Data Clusters" (SSBDC) is ready to adopt spark 3.0.

I don't know much about it... can someone please point me to a roadmap for SQL Server Big Data Clusters? Am I right that it does not support spark 3.0 yet? How long until its customers will be ready to use spark 3.0?

As far as azure-databricks goes, those guys don't seem to care much about this connector... or at least they are not in a position to ask for a connector which is compatible with spark 3.0. So azure-databricks customers are forced to wait for SSBDC to catch up... hopefully that won't be very much longer!


rajmera3 commented on May 16, 2024

Hi all,

Thanks for the comments; your feedback has been received.

Currently we do not have the necessary validation to confirm Spark 3.0 support. Before adding the functionality and creating a new version of the connector (a dedicated 3.0 version), we plan to do performance testing, runtime compatibility checks, etc.

At this time we have no strict timeline for Spark 3.0 support. There is an open PR and fork that allows the connector to work with 3.0, as reported by a few customers, but we will refrain from officially moving it into the main branch until we have tested it thoroughly.

We hear your feedback and hope to address it sooner rather than later.


traberc commented on May 16, 2024

What is the issue with Spark 3.0 support? I see comments complaining about Databricks, but is the issue with Databricks itself or with Spark 3.0? This being a Microsoft connector, it seems that the onus lies with Microsoft to update it rather than with Databricks. Maybe someone can help me understand the technical issues with Spark 3.0 support.

Now that the old "azure-sqldb-spark" connector is out of support, this "sql-spark-connector" is basically the only option going forward, but without Spark 3.0 support, it's basically dead in the water too.

We really want to leverage the new performance features of Spark 3.0, like adaptive query execution (AQE), but are being held back by both of the SQL Server connector options Microsoft provides.


gmdiana-hershey commented on May 16, 2024

I'm not an expert, so hopefully you'll all forgive me for asking a basic question. What's unclear to me is what "necessary validation" means. It sounded like a number of customers have been building the existing PR and using it successfully. Are there specific test cases that the PR doesn't pass? If so, what is causing the delay in resolving those failures and completing the testing work?

As an Azure Databricks customer, it's been very frustrating that Microsoft has built a connector that is incompatible with the current major release of Spark. On one hand, they're offering two products, SQL Server and Databricks (with runtime 7.0+). Both of these are allegedly "Azure" cloud services that Microsoft endorses, and one would think that endorsement would cover the runtime releases of both products. On the other hand, they've failed to provide a connector that lets you use the two products together. The lack of movement here has prompted me to begin exploring alternative databases.


B4PJS commented on May 16, 2024

@dbeavon

In addition to waiting for SSBDC to mature, I am eagerly looking forward to seeing how "Synapse Workspaces" will support the interaction between spark and SQL. I can't imagine they won't have a bulk load connector. And they can't really avoid offering full support (like we are seeing with azure-databricks). Moreover it is very possible that whatever connector they create will be compatible with spark 3.0 (in databricks), so you will have an avenue to get support when you get in a pinch.

Synapse workspaces currently only support Scala for connecting to Synapse SQL, and only allow loading into a new table. The connector uses PolyBase under the hood, as opposed to bulk copy, so that will not help out here.

The engineering team has been given feedback about this, and they hope to have both points fixed at some point...


pmooij commented on May 16, 2024

@rajmera3 Azure Databricks 6.6 (the last one with Spark 2.x) is set for EOL on Nov 26. This is a very critical issue at this point.

So it's high prio now! Looking forward to running this on the latest DBR, as 7.4 has sooo many improvements over 6.6.


dazfuller commented on May 16, 2024

Spark 3 is critical, but it's worth noting that Databricks Runtime 6.4, which uses Spark 2.4.5, goes EOL on April 1st, 2021 (poor choice of date).

Azure Databricks Runtimes


pmooij commented on May 16, 2024

I've made the move to build the (fat) JAR myself as well; it was actually easier than expected with the following commands:

  1. choco install intellijidea-community
  2. choco install sbt
  3. sbt assembly

This has been running smoothly on Databricks Runtime 7.4 | Spark 3.1 over the last few days.

Since #30 was already opened in July and improvements have taken place in master since then (like the computed column fix we rely on), I created a new branch based on master and just pasted in the build.sbt file from #30. With that, I have the best of both.
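A rough sketch of that branch-and-patch flow (the local branch names are hypothetical; pull/30/head is GitHub's standard ref for PR #30):

    git clone https://github.com/microsoft/sql-spark-connector.git
    cd sql-spark-connector
    git checkout -b spark3-local master
    git fetch origin pull/30/head:pr-30
    git checkout pr-30 -- build.sbt   # take only the PR's build definition
    sbt assembly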

Thanks for the tip @MrWhiteABEX


ZMon3y commented on May 16, 2024


Thanks @pmooij
This worked for me as well with one minor change to src/test/scala/com/microsoft/sqlserver/jdbc/spark/bulkwrite/DataSourceTest.scala

which can be seen here: master...dovijoel:spark-3.0

Basically just changing SharedSQLContext to SharedSparkSession
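For reference, a minimal sketch of that change, assuming Spark 3's test helpers are on the classpath (the class body here is illustrative):

    import org.apache.spark.sql.QueryTest
    // Spark 2.4 (removed): import org.apache.spark.sql.test.SharedSQLContext
    import org.apache.spark.sql.test.SharedSparkSession

    // was: class DataSourceTest extends QueryTest with SharedSQLContext
    class DataSourceTest extends QueryTest with SharedSparkSession {
      // test bodies are unchanged; the trait still provides the `spark` session
    }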

I'm having success in Databricks Runtime 7.3 LTS | Spark 3.0.1 | Scala 2.12


rajmera3 commented on May 16, 2024

Hi all,

Thanks for your patience as we worked on supporting Spark 3.0.
We have released a preview version of an Apache Spark 3.0 compatible connector on Maven!
The README has more information, but the connector is available at the coordinates com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha.
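To try the preview, the coordinate can be pulled in the usual way for a Maven package, e.g.:

    spark-shell --packages com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha

or attached as a Maven library on a Databricks cluster.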

If you notice any bugs or have any feedback, please file an issue!

