Comments (22)
For those of you who are azure-databricks customers loading data into azure-sql: would you please contact tech support at Microsoft?
There is no doubt that this is a breaking change for anyone who must upgrade to the azure-databricks runtime 7.x. At the very least they could have provided a warning for us in the release notes.
For some reason the azure-databricks team needs a bit of encouragement from us before they'll prioritize a fix in this connector. They don't seem to consider it a priority to support the fast, bulk-insert connector for SQL. Currently they consider this a "third-party" interface, and that opinion seems to be shared by both the "azure-databricks" team and the "databricks" team. It's odd that they don't really understand the requirement to be able to bulk insert from spark dataframes. All you need to do is google "spark sql bulk insert".
Bulk insert technology in SQL Server has been around for decades, and Spark has a significant need for it. Otherwise we run into some silly and unnecessary bottlenecks on individual record insertions.
from sql-spark-connector.
@ravikd744 No, not yet. There is a PR in the works for Spark 3.0. Once it has been validated, we will update the repository, build, and README with the new support statement.
Any update regarding this? This is a major blocker for us.
@traberc
To see the necessary code changes you can look at the PR (#30). There are only a few lines of changes.
There is no real issue other than regression testing (aka the "necessary validation").
The only substantial programming change is to target a newer version of Scala.
To get this connector working you need to download the code, open it in IntelliJ, remove the tests, edit the sbt file to target the correct version of Scala, and rebuild. Once this is done, you will have your own private copy of the module that should work fine, but you will have nobody else to support it. This is where I landed after many conversations with folks at databricks, azure-databricks, and here in the connector project.
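For what it's worth, the sbt edit being described comes down to a couple of lines in build.sbt. A hedged sketch only; the exact version numbers below are assumptions, so check the project's actual build file:

```scala
// build.sbt -- illustrative fragment, not the full file.
// Spark 3.0 requires Scala 2.12, so both lines change together.
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  // Bump the Spark dependency from 2.4.x to 3.0.x; marked "provided"
  // because the Databricks runtime supplies Spark at execution time.
  "org.apache.spark" %% "spark-sql" % "3.0.1" % "provided"
)
```

After that, `sbt assembly` should produce a fat JAR you can attach to your cluster.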
I think what Rahul is saying is that databricks is not in his wheelhouse. I think it is fair to say that this community will start to care more about the topic (spark 3.0 support) once SQL Server Big Data Clusters (SSBDC) is ready to adopt spark 3.0, and not before. You can read more at https://github.com/microsoft/sql-spark-connector
It is frustrating how hard it is for Microsoft to acknowledge that their "azure databricks" needs to properly interoperate with "azure SQL". IMHO this should not be a months-long debate. Another thing that Microsoft won't acknowledge is that this is a regression (as you pointed out). By definition, this is a regression in azure-databricks since we had a bulk-load spark connector in 2.4 and after upgrading to 3.0 we do not.
Things seem especially dysfunctional because there are three separate parties involved and everybody is dodging responsibility. The formal reasoning why databricks is dodging is because this is considered a "third-party" library. In addition to databricks itself, there is also another large team at Microsoft called "azure-databricks" and they do a bit of the software development to ensure databricks can be called a "first-party" service in azure. They build the "glue" that holds databricks in place within the azure cloud. They are also responsible for taking support calls. If these two teams ("databricks" and "azure-databricks") weren't enough, there is yet another team here in the community that is responsible for this connector. And this community project seems to be much more interested in SSBDC than in databricks. I've spent several months being bounced back and forth between these three different sets of folks. I strongly suggest you just be patient and wait for SSBDC to mature a bit more. Otherwise you are likely to waste as much time on the topic as I have.
In addition to waiting for SSBDC to mature, I am eagerly looking forward to seeing how "Synapse Workspaces" will support the interaction between spark and SQL. I can't imagine they won't have a bulk load connector, and they can't really avoid offering full support (unlike what we are seeing with azure-databricks). Moreover, it is very possible that whatever connector they create will be compatible with spark 3.0 (in databricks), so you will have an avenue to get support when you get in a pinch.
@rajmera3 Azure Databricks 6.6 (the last runtime with Spark 2.x) is set for EOL on Nov 26. This is a very critical issue at this point.
Again, I can only recommend that you compile it yourself from the PR and test it. It is not difficult using sbt. The CI build fails due to the broken pipeline, but the connector works just fine for me. I have a streaming application running in production for about a month on DBR 7.3 that continuously ingests data without issues. At least for the sink with default options, I am quite confident that if there were a major issue I would have hit it. But you have to test it in your dev/qa environment anyway.
The automated tests of this connector are lackluster. They would not detect any incompatibilities in Scala or Spark versions. They hardly scratch the surface.
Spark 3 is a major improvement. For my workload, using DBR 7.3 (coming from 6.5) allowed me to reduce my job cluster size and save about $1,500 (about 30-40%) per month. At least for me it was worth the additional effort.
On initial inspection, the issue with Spark 3.0 support seems to be a logging class in the connector. If it is replaced, the connector should function.
Hi Rahul, does the latest release support spark 3.0.0?
Thanks for working on this. We are eager to start using Spark 3 (in Databricks 7). There are lots of factors pushing us in that direction, and the lack of a SQL connector seems to be the only holdup at this time.
PR #30 is addressing this.
It would be really great if this connector supported 3.0. We are currently locked into Spark 3.0 but would like to use this connector.
It would be really nice to have the upgrade! It's a blocker for us too. Thanks, guys!
Sorry to state the obvious, but my understanding is that this issue is being delayed. It won't get much attention until "SQL Server Big Data Clusters" (SSBDC) is ready to adopt spark 3.0.
I don't know much about it... can someone please point me to a roadmap for SQL Server Big Data Clusters? Am I right that it does not support spark 3.0 yet? How long until its customers will be ready to use spark 3.0?
As far as azure-databricks goes, those guys don't seem to care much about this connector, or at least they are not in a position to ask for a connector which is compatible with spark 3.0. So azure-databricks customers are forced to wait for SSBDC to catch up. Hopefully that won't be very much longer!
Hi all,
Thanks for the comments; your feedback has been received.
Currently we do not have the necessary validation to confirm Spark 3.0 support. Before adding the functionality and creating a new version of the connector (a dedicated 3.0 version), we want to complete performance testing, runtime compatibility checks, etc.
At this time we have no strict timeline for Spark 3.0 support. There is an open PR and fork that allows the connector to work with 3.0 as reported by a few customers, but we will refrain from officially moving it into the main branch until we have tested it thoroughly.
We hear your feedback and hope to address it sooner rather than later.
What is the issue with Spark 3.0 support? I see comments complaining about Databricks, but is the issue with Databricks itself or Spark 3.0? This being a Microsoft connector, it seems that the onus lies with Microsoft to update the connector rather than with Databricks. Maybe someone can help me understand the technical issues with Spark 3.0 support.
Now that the old "azure-sqldb-spark" connector is out of support, this "sql-spark-connector" is basically the only option going forward, but without Spark 3.0 support it's dead in the water too.
We really want to leverage the new performance features of Spark 3.0, like adaptive query execution (AQE), but are being held back by both of the SQL Server connector options provided by Microsoft.
I'm not an expert, so hopefully you'll all forgive me for asking a basic question: what does the "necessary validation" actually mean? It sounds like a number of customers have been building the existing PR and using it successfully. Are there specific test cases that the PR doesn't pass? If so, what is causing the delay in resolving those failures and completing the testing work?
As an Azure Databricks customer, it's been very frustrating that Microsoft has built a connector that is incompatible with the current major release of Spark. On one hand, they're offering two products, SQL Server and Databricks (with runtime 7.0+). Both of these are allegedly "Azure" cloud services that Microsoft endorses, and one would think that endorsement would cover the current runtime releases of both products. On the other hand, they've failed to provide a connector that lets you use the two products together. The lack of movement here has prompted me to begin exploring alternative databases.
In addition to waiting for SSBDC to mature, I am eagerly looking forward to seeing how "Synapse Workspaces" will support the interaction between spark and SQL. I can't imagine they won't have a bulk load connector, and they can't really avoid offering full support (unlike what we are seeing with azure-databricks). Moreover, it is very possible that whatever connector they create will be compatible with spark 3.0 (in databricks), so you will have an avenue to get support when you get in a pinch.
Synapse workspaces currently only support Scala for connecting to Synapse SQL, and they only allow loading into a new table. The connector uses PolyBase under the hood as opposed to bulk copy, so it will not help out here.
The engineering team has been given feedback about this, and they hope to have both points fixed at some point.
@rajmera3 Azure Databricks 6.6 (the last runtime with Spark 2.x) is set for EOL on Nov 26. This is a very critical issue at this point.
So it's high priority now! Looking forward to running this on the latest DBR, as 7.4 has so many improvements over 6.6.
Spark 3 is critical, but it's worth noting that Databricks runtime 6.4, which uses Spark 2.4.5, goes EOL April 1st, 2021 (a poor choice of date).
I've made the move to build the (fat) JAR myself as well. It was actually easier than expected with the following commands:
- choco install intellijidea-community
- choco install sbt
- sbt assembly
This has been running smoothly on Databricks Runtime 7.4 | Spark 3.1 over the last few days.
Since #30 was already opened in July and improvements have taken place in master since then (like the computed column fix we rely on), I had to create a new branch based on master and just pasted in the build.sbt file from #30. With that, I have the best of both.
Thanks for the tip @MrWhiteABEX
I've made the move to build the (fat) JAR myself as well. It was actually easier than expected with the following commands:
- choco install intellijidea-community
- choco install sbt
- sbt assembly
This has been running smoothly on Databricks Runtime 7.4 | Spark 3.1 over the last few days.
Since #30 was already opened in July and improvements have taken place in master since then (like the computed column fix we rely on), I had to create a new branch based on master and just pasted in the build.sbt file from #30. With that, I have the best of both.
Thanks @pmooij
This worked for me as well with one minor change to src/test/scala/com/microsoft/sqlserver/jdbc/spark/bulkwrite/DataSourceTest.scala
which can be seen here: master...dovijoel:spark-3.0
Basically it is just a matter of changing SharedSQLContext to SharedSparkSession.
I'm having success in Databricks Runtime 7.3 LTS | Spark 3.0.1 | Scala 2.12
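For anyone wondering what that change looks like: Spark 3 replaced the SharedSQLContext test trait with SharedSparkSession, so the test class swaps one mixin for the other. A sketch only; aside from SharedSparkSession and the file's DataSourceTest name, the details below are assumptions:

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Before (Spark 2.4): class DataSourceTest extends QueryTest with SharedSQLContext
// After  (Spark 3.0): same class, different shared-session trait.
class DataSourceTest extends QueryTest with SharedSparkSession {
  // test bodies unchanged
}
```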
Hi all,
Thanks for the patience as we worked on supporting Spark 3.0.
We have released a preview version of an Apache Spark 3.0 compatible connector on Maven!
The readme has more information, but the connector is available at the coordinates com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha.
If you notice any bugs or have any feedback, please file an issue!
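For anyone trying the preview, a minimal write sketch, assuming the connector's documented data source name com.microsoft.sqlserver.jdbc.spark; the server, database, table, and credentials below are placeholders:

```scala
// Attach the preview package to the cluster first, e.g.:
//   com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha
val url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>"

df.write
  .format("com.microsoft.sqlserver.jdbc.spark") // this connector's data source
  .mode("append")                               // append to an existing table
  .option("url", url)
  .option("dbtable", "dbo.MyTable")
  .option("user", "<user>")
  .option("password", "<password>")
  .save()
```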