
synapse's Introduction

page_type: sample
languages: csharp
products: dotnet
description: Samples for Azure Synapse Analytics
urlFragment: update-this-to-unique-url-stub

Samples for Azure Synapse Analytics

Resources

Contents

Scenario-based Samples

Tweet Analysis

Shows .NET for Apache Spark and the shared metadata experience between Spark-created tables and SQL.

ADF to Synapse Migration Tool

The ADF to Synapse Migration Tool (currently PowerShell scripts) enables you to migrate Azure Data Factory pipelines, datasets, linked services, integration runtimes, and triggers to a Synapse Analytics workspace.

Contributing

This project welcomes contributions and suggestions. See the Contributor's guide.

synapse's People

Contributors

bamurtaugh, charithcaldera, chugugrace, coolswatish, fonsecasergio, hristinajilova, jocapc, jovanpop-msft, kaiyuezhou, krutikasheth1029, laserljy, lijing29, matt1883, microsoftopensource, mikerys, mlevin19, nelgson, niharikadutta, nirav2, rapoth, roalexan, rodrigossz, ruixinxu, saveenr-msft, shunderpooch, silanwang, snehagunda, tomtal, yaelschuster, yifansongms


synapse's Issues

ModuleNotFoundError: No module named 'org.apache.spark.sql.SqlAnalyticsConnector'

Hi there,
I was following the notebook.

Code:


%%pyspark 

import org.apache.spark.sql.SqlAnalyticsConnector._
import com.microsoft.spark.sqlanalytics.utils.Constants
spark_read = spark.read.sqlanalytics("Built-in.dbo.LogisticsPostalAddress")
spark_read.show(5, truncate = false)

Output:
ModuleNotFoundError: No module named 'org.apache.spark.sql.SqlAnalyticsConnector' (screenshot)

Note: I am using Azure Synapse Notebooks. Shouldn't this module already be installed in the notebook, and if not, how can I install it?
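
For context, import org.apache.spark.sql.SqlAnalyticsConnector._ is Scala syntax; inside a %%pyspark cell Python parses it as an import of a Python module named org.apache..., which is why the ModuleNotFoundError appears. A minimal sketch of the Python-side alternative, assuming a Spark 3 pool where the pre-installed dedicated SQL pool connector exposes synapsesql() to PySpark (the table name is reused from the snippet above):

%%pyspark
# Sketch, assuming a Synapse Spark 3 pool: the pre-installed connector patches
# DataFrameReader with synapsesql(), so no Scala-style import is needed.
df = spark.read.synapsesql("Built-in.dbo.LogisticsPostalAddress")
df.show(5, truncate=False)  # Python False, not Scala's false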

Spark Job in Synapse cannot be viewed in monitoring portal - Error Message is Fetching Failed

It is rare and intermittent, but there are times when the monitoring portal in Azure Synapse misbehaves and will not show the details of a completed Spark job. Instead, it displays an error message that says "Fetching Failed" (screenshot).


I have not yet found a pattern or explanation. I reported the problem to CSS support but they are not yet familiar with the error. I suspect it is a timeout on an internal resource, like a spark history server or something like that.

I realize that some parts of the Synapse platform are proprietary, but it borrows significantly from OSS Spark. Does anyone have an idea what might take so long when retrieving the UI for a completed Livy batch? Is it Azure storage accounts performing badly, or a Spark history server? Is there any reason why they wouldn't wait indefinitely (e.g. ten minutes) for a response? Whenever this happens, the UI seems to fail after a short period of time (only ~60 seconds or so). I haven't found any other patterns. As you can see above, the error message is nothing more than a small tooltip shown in the upper right of the screen; when I shared it with CSS they weren't able to provide any additional guidance or explanation. So I'm hoping there are Synapse users on Stack Overflow who have encountered this.

Side note: when things are working properly, the Spark job is presented with the related jobs/stages/tasks/logs (screenshot).

Write to existing Synapse table

From Synapse's Apache Spark pool, is it possible to write to an existing internal/external table? All the examples are about creating a new table and loading data into it. Even though the Synapse pipeline runs on Spark, how does it manage to select/update? To get the same set of features through PySpark/Scala, do we need to switch to Databricks instead of the Apache Spark pool that comes with Synapse?
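
For what it's worth, appending to an existing table is possible from a Synapse Spark pool with the standard Spark writer; a minimal sketch, with hypothetical table and pool names (the synapsesql path additionally assumes the Spark 3 dedicated SQL pool connector, which supports Append and Overwrite save modes):

# Assuming an existing DataFrame, e.g.:
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Append to an existing Spark metastore (internal) table.
df.write.mode("append").saveAsTable("mydb.existing_table")

# Append to an existing dedicated SQL pool table (assumes the pre-installed
# Spark 3 connector, which exposes synapsesql() to PySpark).
df.write.mode("append").synapsesql("mypool.dbo.existing_table")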

File Format "NativeParquet" is never used in SampleDB.sql

NativeParquet is implemented as a file format, but when the view parquet.YellowTaxi is created, it refers to FORMAT='PARQUET'.

There does not seem to be a reason for it to exist, and it confuses readers who are trying to get up to speed with the platform.
Is there any benefit to creating a custom file format rather than referring to 'PARQUET'?

Secure string parameter is taking the default value `**********` when running the Fabric Notebook from pipeline

The secure string parameter takes the masked value ********** during the pipeline run, even though a proper default value was given in the pipeline parameters section.

See the parameter length in the screenshot.

Thus, the notebook also receives the same masked value.

When the value is re-entered at pipeline start, the original value is used.

The same scenario works fine in a normal Synapse notebook and pipeline.

Creating Workspace in DevOps with SPN leaves workspace inaccessible

I have created ARM template/parameter files that deploy the workspace. After deploying with an SPN in DevOps, I get the error "You need permission to access workspace" when trying to access the workspace. I am configured as the Active Directory admin on the resource and as Owner of the resource in Access Control. Firewall rules allow all IP addresses.

Deploying the exact same ARM template/parameter files through the Azure portal leaves the workspace in the correct state, and I am able to access the studio as expected.

The SPN that deploys the ARM template has been granted the Subscription Owner role.

I cannot detect any difference in what has been deployed.

Is there a known issue with deploying Synapse with DevOps / SPN?

Pipeline migration not working

When running the importADFtoSynapseTool.ps1 script, all of the resources except the pipelines get migrated into the Synapse workspace. The script reports that all four pipelines were successfully migrated, but they are nowhere to be found in the workspace. I also get the warning below, but I'm unsure whether it actually affects the migration.

Self-Hosted (Linked) Integration Runtime with the following name will be filtered and will NOT be migrated: corporatesynergiIR

ADF to Synapse Pipeline migration powershell bug

The PowerShell script for ADF to Synapse pipeline migration has a problem with garbled characters when resource definitions contain non-English multibyte characters.

https://github.com/Azure-Samples/Synapse/blob/main/Pipelines/ImportADFtoSynapse/importADFtoSynapseTool.ps1

Please make the following modifications to avoid the JSON character encoding issue.

$uri = "$destUri/$resourceType/$($_.name)?api-version=$($config.SynapseWorkspace.apiVersion)";
$jsonBody = ConvertTo-Json $_ -Depth 30
$jsonBody = [Text.Encoding]::UTF8.GetBytes($jsonBody)
$name = $_.name

Please add this line, which passes the JSON body as UTF-8 bytes so the REST call does not re-encode it with the default character set:

$jsonBody = [Text.Encoding]::UTF8.GetBytes($jsonBody)

Thanks.

This request is not authorized to perform this operation using this permission

Hi,

I'm trying to load the data into Spark DataFrames but hit the following issue (screenshot):

Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD

I've added my AAD account as Contributor on the ADLS Gen2 : NOK
I've added the Synapse workspace as Contributor on the ADLS Gen2 : NOK
I've added my account as Contributor on the subscription : NOK
I've added the Synapse workspace as Contributor on the subscription : NOK

Many thanks for your help!

ONNX conversion in notebook not working

The ONNX conversion in the notebook tutorial-predict-nyc-taxi-tips-onnx.ipynb is not working. I get the error "ValueError: You passed in an iterable attribute but I cannot figure out its applicable type."
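
For reference, this ValueError usually points to a version mismatch between the sklearn-to-ONNX converter and the installed onnx package (an assumption here, since the notebook may use a different converter). A minimal standalone conversion sketch using skl2onnx, with a hypothetical model and input shape:

# Hypothetical minimal sklearn -> ONNX conversion; if this raises the same
# ValueError, pin skl2onnx and onnx to mutually compatible versions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X = np.random.rand(20, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X, y)

onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())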

[Scala][Azure Synapse]: Using an MSI access token, unable to retrieve data from an Azure SQL Server database table

The linked service is configured for the SQL database through a system-assigned identity, as shown in the screenshot.

Below is the notebook code (in Scala) in Azure Synapse:

import com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
import java.util.Properties

val jdbcHostname = ".sql.azuresynapse.net"
val jdbcPort = 1433
val jdbcDatabase = ""

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()

// Driver that can also be observed in the log when using the 'native' Synapse SQL way.
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)

// Get an access token for the linked service ("samplesqllink"), which authenticates with the workspace Managed Identity
connectionProperties.setProperty("accessToken", mssparkutils.credentials.getConnectionStringOrCreds("samplesqllink"))

// Define your query
val pushdown_query = "(select top 10 ID from dbo.tblEmployees) data_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)

Running this, I get the following error:

com.microsoft.sqlserver.jdbc.SQLServerException: Login failed for user ''. ClientConnectionId:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/

But with a user name and password it works fine.

Please help; did I miss anything?
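
For comparison, a PySpark sketch of the same pattern (server and database names are placeholders; the linked service name is reused from the Scala snippet above). One hedged observation: "Login failed for user ''" often means the workspace managed identity has not been created as a user in the target database, e.g. with CREATE USER [<workspace-name>] FROM EXTERNAL PROVIDER, which is worth checking first.

# PySpark equivalent sketch; accessToken is passed as a JDBC connection property.
from notebookutils import mssparkutils

token = mssparkutils.credentials.getConnectionStringOrCreds("samplesqllink")
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>")
      .option("dbtable", "(select top 10 ID from dbo.tblEmployees) data_alias")
      .option("accessToken", token)
      .load())
df.show()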

Datetime on pyspark.pandas not working correctly

ps.date_range(start='1/1/2018', periods=5, freq='M') returns the months out of order (it looks as though it is skipping a month):

DatetimeIndex(['2018-01-31', '2018-03-31', '2018-05-31', '2018-02-28', '2018-04-30'], dtype='datetime64[ns]', freq=None)
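
One workaround sketch, under the assumption that pandas-on-Spark computes the range across partitions and therefore does not guarantee pandas' ordering: build the index with plain pandas and convert it afterwards.

# Build the range with plain pandas (ordered), then convert to pandas-on-Spark.
import pandas as pd
import pyspark.pandas as ps

idx = ps.from_pandas(pd.date_range(start='1/1/2018', periods=5, freq='M'))
print(idx)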

[Synapse - Apache Spark job definition] - Python sample to copy local files to WASB

Hi, it would be nice to have sample code for copying files from local directories to WASB.

I am developing a Python script in a Spark job definition, and I would like to copy a local file already created in the temp directory to WASB.

I am trying to use mssparkutils (Maybe this is my error):

local_file = '/tmp/a.txt'
azure_file = 'wasbs://<container>@<storage_account>.blob.core.windows.net/outputs/a.txt'
from notebookutils import mssparkutils
mssparkutils.fs.cp(local_file, azure_file, True)

But, I get error:

Current content of Temp folder:
   /tmp/eea_discodata_task-9c3df798c1ff11ebaedc000d3ab6dc78.log.json
   /tmp/ca-certificates.tmp.3cqtU8
Traceback (most recent call last):
  File "datapipelineapp.py", line 179, in <module>
    main()
  File "datapipelineapp.py", line 158, in main
    mssparkutils.fs.mv(file_rs, dataset_fn, True)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/notebookutils/mssparkutils/fs.py", line 12, in mv
    return fs.mv(src, dest, create_path, overwrite)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/notebookutils/mssparkutils/handlers/fsHandler.py", line 56, in mv
    return self.fsutils.mv(src, dest, create_path, overwrite)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:mssparkutils.fs.mv.
: java.io.FileNotFoundException: /tmp/eea_discodata_task-9c3df798c1ff11ebaedc000d3ab6dc78.log.json
	at com.microsoft.spark.notebook.msutils.impl.MSFsUtilsImpl.mv(MSFsUtilsImpl.scala:228)
	at mssparkutils.fs$.mv(fs.scala:20)
	at mssparkutils.fs.mv(fs.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

I see that the dbutils package is not available in this context; it is used in Databricks environments.
Thanks in advance.
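
One thing worth checking, as a sketch under the assumption that mssparkutils.fs resolves bare paths against the default distributed filesystem rather than the driver's local disk: prefix the local path with the file: scheme.

# Copy a driver-local file to blob storage; a bare path like /tmp/a.txt is
# resolved against the default ABFS/WASB filesystem, so use file: explicitly.
from notebookutils import mssparkutils

local_file = "file:/tmp/a.txt"
azure_file = "wasbs://<container>@<storage_account>.blob.core.windows.net/outputs/a.txt"
mssparkutils.fs.cp(local_file, azure_file, True)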

Malformed URL Links - https://github.com/Azure-Samples/Synapse/tree/master/Notebooks/PySpark/Synapse%20Link%20for%20Cosmos%20DB%20samples/IoT

01-CosmosDBSynapseStreamIngestion: Ingest streaming data into Azure Cosmos DB collection using Structured Streaming - 404
02-CosmosDBSynapseBatchIngestion: Ingest Batch data into Azure Cosmos DB collection using Azure Synapse Spark - 404
03-CosmosDBSynapseJoins: Perform Joins and aggregations across Azure Cosmos DB collections using Azure Synapse Link - 404
04-CosmosDBSynapseML: Perform Anomaly Detection using Azure Synapse Link and Azure Cognitive Services on Synapse Spark (MMLSpark) - 404

Correct URL links are on this page: https://github.com/Azure-Samples/Synapse/tree/master/Notebooks/PySpark/Synapse%20Link%20for%20Cosmos%20DB%20samples

Synapse Intelligent Cache is not working

Environment:
- Spark 3.2
- Synapse premium
- Intelligent caching enabled (screenshot)

Initial content of the CSV file (screenshot).

Read successful (screenshot).

Added more content to the file later (screenshot).

The same result was shown even after running df1.take(100) several times, instead of the newly added records (screenshot).

The issue was resolved by rerunning spark.read.
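
To make the report easier to reproduce, a minimal sketch of the sequence described above (storage path and schema are hypothetical):

# Repro sketch: the first read warms the cache; a later take() on the same
# DataFrame kept serving the old snapshot in the behavior reported above.
path = "abfss://data@<storage_account>.dfs.core.windows.net/input.csv"
df1 = spark.read.option("header", True).csv(path)
df1.take(100)                                       # initial rows

# ...more rows are appended to input.csv externally...

df1.take(100)                                       # still the old snapshot
df1 = spark.read.option("header", True).csv(path)   # re-reading refreshes it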

It would be nice to get up-to-date, step-by-step instructions for the .NET example

Just downloaded and unpacked "sample files for dotnet.zip" from under Synapse/Spark/DotNET/.

I had a slight hope, on the off chance, that it might contain something like instructions and a working solution file for Visual Studio (the Community edition would work perfectly well).

Hope this is a good enough description of the issue. Please let me know; I would be happy to describe it in more detail.

The information on environment construction is incorrect

Hello,

The content of the "Let's get the environment ready" section in the following document appears to be incorrect.

Creating requirements.txt as instructed and applying it to the Spark pool does not install pymongo. Also, pymongo does not appear in the library list shown in the document's output. (A quick check is sketched below.)
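
For what it's worth, a quick way to verify from a notebook cell whether the pool actually picked up the package:

# An ImportError here confirms pymongo was not installed on the pool.
import pymongo
print(pymongo.version)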

I'm asking a similar question on Microsoft Q&A and am currently investigating the cause.

Please confirm and investigate.

Thanks,
