solliancenet / azure-synapse-analytics-day Goto Github PK

Jupyter Notebook 74.15% TSQL 25.85%

azure-synapse-analytics-day's Introduction

Azure Synapse Analytics in a Day Lab

Wide World Importers

Wide World Importers (WWI) is a wholesale novelty goods importer and distributor operating from the San Francisco bay area.

As a wholesaler, WWI's customers are mostly companies who resell to individuals. WWI sells to retail customers across the United States, including specialty stores, supermarkets, computing stores, tourist attraction shops, and some individuals. WWI sells to other wholesalers via a network of agents who promote the products on WWI's behalf. While all of WWI's customers are currently based in the United States, the company intends to expand into other countries.

WWI buys goods from suppliers, including novelty and toy manufacturers, and other novelty wholesalers. They stock the goods in their WWI warehouse and reorder from suppliers as needed to fulfill customer orders. They also purchase large volumes of packaging materials and sell these in smaller quantities as a convenience for the customers.

Recently WWI started to sell a variety of edible novelties such as chili chocolates. The company previously did not have to handle chilled items. To meet food handling requirements, they must monitor the temperature in their chiller room and any of their trucks that have chiller sections.

Lab context

Wide World Importers is designing and implementing a Proof of Concept (PoC) for a unified data analytics platform. Their soft goal is to bring siloed teams to work together on a single platform.

In this lab, you will play the role of various persona: a data engineer, a business analyst, and a data scientist. The workspace is already set up to focus on some of the core development capabilities of Azure Synapse Analytics.

By the end of this lab, you will have performed a non-exhaustive list of operations that combine the strength of Big Data and SQL analytics into a single platform.

How to get started with a provided lab environment

If you are using a hosted lab environment, please follow the steps below to get started:

Select the Lab Environment tab above the lab guide to copy the Azure credentials used for the lab. Make note of the UniqueId value. This value may be referenced at different points during the lab.
Select Lab Resources under Lab Environment to start the Virtual Machine (VM) provided for this lab. However, you do not need to use the VM to complete the lab. It is there for your convenience to make it easier to sign into Azure if you have an existing account and do not want to log out of it.

Solution architecture

The diagram below provides a unified view of the exercises in the lab and their estimated times for completion.

Exercise 1 - Explore the data lake with Azure Synapse Serverless SQL Pool and Azure Synapse Spark

In this exercise, you will explore data using the engine of your choice (SQL or Spark).

Understanding data through data exploration is one of the core challenges faced today by data engineers and data scientists as well. Depending on the data's underlying structure and the specific requirements of the exploration process, different data processing engines will offer varying degrees of performance, complexity, and flexibility.

In Azure Synapse Analytics, you have the possibility of using either the SQL Serverless engine, the big-data Spark engine, or both.

Exercise 2 - Working with Azure Synapse Pipelines

In this exercise, you will use a pipeline with parallel activities to bring data into the Data Lake, transform it, and load it into the Azure Synapse SQL Pool. You will also monitor the progress of the associated tasks.

Once data is properly understood and interpreted, moving it to the various destinations where processing steps occur is the next big task. Any modern data platform must provide a seamless experience for all the typical data wrangling actions like extractions, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering.

Azure Synapse Analytics provides two significant categories of features - data flows and data orchestrations (implemented as pipelines). They cover the whole range of needs, from design and development to triggering, execution, and monitoring.

Exercise 3 - High Performance Analysis with Azure Synapse Dedicated SQL Pools

In this exercise, you will try to understand customer details using a query and chart visualizations. You will also explore the performance of various queries.

SQL data warehouses have been for a long time the center of gravity in data platforms. Current data warehouses are capable of providing high performance, distributed, and governed workloads, regardless of the data volumes at hand.

The Azure Synapse SQL Pools in Azure Synapse Analytics is the new incarnation of the former Azure SQL Data Warehouse. It provides all the state-of-the-art SQL data warehousing features while benefiting from the advanced integration with all the other Synapse services.

Exercise 4 - Lake Databases and Database templates

In this exercise, you will explore the concept of a lake database and you will learn how to use readily available database templates for lake databases.

The lake database in Azure Synapse Analytics enables you to bring together database design, meta information about the data that is stored and a possibility to describe how and where the data should be stored. Lake database addresses the challenge of today's data lakes where it is hard to understand how data is structured.

Exercise 5 - Log and telemetry analytics

In this exercise, you will explore the capabilities of the newly integrared Data Explorer runtime in Synapse Analytics.

Azure Synapse data explorer provides you with a dedicated query engine optimized and built for log and time series data workloads. With this new capability now part of Azure Synapse's unified analytics platform, you can easily access your machine and user data to surface insights that can directly improve business decisions. To complement the existing SQL and Apache Spark analytical runtimes, Azure Synapse data explorer is optimized for efficient log analytics, using powerful indexing technology to automatically index structured, semi-structured, and free-text data commonly found in telemetry data.

Exercise 6 - Data governance with Azure Purview

In this exercise, you will use several of the capabilities provided by the integration between Azure Synapse Analytics and Azure Purview workspaces.

Azure Purview is a unified data governance solution that helps you manage and govern your on-premises, multicloud, and software-as-a-service (SaaS) data. Purview enables you to easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage. It also enables data consumers to find valuable, trustworthy data.

Azure Synapse Analytics and Azure Purview workspaces are tightly integrated, enabling seamless data discovery and lineage.

Exercise 7 - Power BI integration

In this exercise, you will build a Power BI report in Azure Synapse Analytics.

The visual approach in data exploration, analysis, and interpretation is one of the essential tools for both technical users (data engineers, data scientists) and business users. Having a highly flexible and performant data presentation layer is a must for any modern data platform.

Azure Synapse Analytics integrates natively with Power BI, a proven and highly successful data presentation and exploration platform. The Power BI experience is available inside Synapse Studio.

Extension module

The exercise is accompanied by a Power BI extension module with four additional (optional) exercises.

Exercise 8 - Data Science with Spark (optional)

In this exercise, you will leverage a model trained with Azure Machine Learning AutoML using Spark compute to make predictions using the T-SQL PREDICT statement in an Azure Synapse Analytics dedicated SQL pool.

Azure Synapse Analytics provides support for using trained models (in ONNX format) directly from dedicated SQL pools. What this means in practice, is that your data engineers can write T-SQL queries that use those models to make predictions against tabular data stored in a SQL Pool database table.

The model is trained and registered by Azure Machine Learning automated ML (AutoML) using the compute resources provided by a Synapse Analytics Spark pool (the main requirement is that the model format must be supported by ONNX). Using the integration of the Azure Machine Learning experience into Synapse Analytics Studio, the trained model is deployed to the dedicated SQL pool where it is used for inference via the T-SQL PREDICT statement.

azure-synapse-analytics-day's People

Contributors

Stargazers

Watchers

Forkers

givenscj euangms mallik-g samratbhatnagar masayukiozawa haroon921 mycousin liqsword a605052 mrtuncay praveenanil jackyc092020 rrios042 kowsi-lixar prsayeen ctesta-oneillmsft spektrasystems charleyhanania prlangad misterreally amy-chue ievsantillan loyaltnow abhaypsingh msukkarca msshada madhuri777289 annt-nguyen irishun malikamalik roseams mridul96 francop1967 chandsuma debaratisj vikrambala shadowshades-one bassampourco ravit73 dineshaggarwal amjadmkhan alexanderdeleon alan-at-microsoft srushti1421 agh10 srushti-714 yap-alan pemcaeiro aelai carlosariash direkshan-digital lavanyareddyalavala eml2varun nohanaga calingoconsulting b2sdev pabloalegria romainbruneau ahmedalouini1 ron2467 mbalija sytang112 jazer313 juanfperezr akmeghdad radhakrishnang komak-it vinnywiz msdz gvwoods ozgurgul peasevad vak365 trphan aravindragunathan tashbettridge chandramoulipathanjali paladique kanhaiyaorg kanhaiyasingh arevaloisabel korkridakedemoorg cmftall iakiyama4d gripaca yiyangabc mathiasekz h-mywk asener1 michaeltschreiber aronunlabeled azuresy oneclicktest kmlmlk jjrobertson14 ashrafib khanhtd2 mykovalenko cozenc luoanabecher

azure-synapse-analytics-day's Issues

Missing ipynb files from artifacts folder

Hello,
The following files, referenced in the 00-setup-workspace.md files, are missing from the artifacts folder :
Exercise 1 - Read with Spark.ipynb
Exercise 2 - Ingest Sales Data.ipynb
Thanks for your work,
Regards,
Romain

Exercise 3 : Update

Task 2 -> important section step 4
Under Authentication Method select OAuth2 and select Sign In.

There is an update and option instead of OAuth2 is Microsoft Account

Power BI exercise Step flow (Currently #4)

Issue submitted as referred by Spektra Support and direction that we submit it here

Based on past experience running these workshops - Tasks and Steps within The Power Bi Exercise would flow better if we used the published link to open Workspace and edit Authentication Creds (to Fix OAUTH) Task 2 using the settings option using the Dataset row item on the screen. (instead of the menu since its quite floaty and keeps running away) and then getting them to open/create the Power BI report within Synapse Studio

This would avoid much of the back and forth switching between environments and browser tabs. It also minimises the impression of the 'problem' i.e. OAUTH cannot be selected during Power BI desktop publish.

It will be also useful to ensure that the Browser defaults to the new Edge (currently IE) to ensure that clicking the publish button opens the correct browser.

Explicitly call out how to authenticate to https://portal.azure.com

It would benefit delivery to explicitly call out how to authenticate to https://portal.azure.com in Task 1 and Point 1 of 01-explore-data-lake.md
The Lab Environment tab and desktop file provide the credentials. However, many attendees asked the question and became frustrated immediately on starting.

Wrong folder name in instruction

https://github.com/solliancenet/azure-synapse-analytics-day/blob/master/00-setup-workspace.md ,on this page task 2 , the folder name needs to be bronze and not dev

Update changes in the lab guide

Exercise 1 :task 1 step 10
https://asadatalake02.dfs.core.windows.net/wwi/factsale-parquet/2012/Q1/*/*
to
Update to your datalake storage account name instead of asadatalake02

Exercise 2
task 1 19th step - Update SELECT TOP 1000 rows to SELECT TOP 100 rows
task 2 step 5 -UI change

task 2 step 10- UI change

task 4- To look closer at any individual stage, you can use the Job IDs drop-down to select the stage number. In this case, select the drop-down and choose 5 from the list.
Can we remove the instruction to select 5 particularly as few jobs may not be having job 5.

Exercise 5 - Can we add a note as
Note: An explicit whitelisting of the Synapse workspace is required to run PREDICT function queries. As part of the lab, you will not be able to run through this exercise. Please check with instructor to see if this is enabled in your lab.

Exercise 2 - Ingest Sales Data.ipynb - File is missing

Hi,

There is no file for Exercise 2 - Ingest Sales Data.ipynb

Could you please load it into the repository?

Thanks!

Jan

Exercise 1 Task 1 Step 6

Task 1 and Point 6 of 01-explore-data-lake.md. Mention you have to navigate to the "Linked" tab before seeing your storage accounts.

Exercise 1 Task 1 Step 8

Double-click the InvoiceDateKey=2012-01-01 folder, then right-click the Parquet file before choosing "SELECT TOP 100".

Attendees are right-clicking on the InvoiceDateKey=2012-01-01 folder and not seeing the "SELECT TOP 100" option.

Setting up the environment problem

The file azure-synapse-analytics-day/artifacts/05/Exercise 5 - Create Sample Data for Predict.sql
has two CREATE TABLE [wwi].[SampleData]. Should there only be one or should the other have a different table name?

Setup: Import datasets, data flows, and pipelines, connection error

Hello all,
I am following the steps for setting up.
However, I met connection error when I run the pipeline to copy data.
I have tested the linked services and found that the linked service to SQLPool got connection failed by testing.
I have set up the firewall option and allowed access of Azure components. However, the error still exists.

Would appreciate getting hints on these steps to set up the environment. Thanks a lot in advance!

Lab files / steps - missing

Hi team,

Seems some of the lab files are missing. I'm not able execute code from Task 6. for example,
'https://<primary_storage>.dfs.core.windows.net/wwi/factsale-csv/2012/Q1//' i don't see any steps to upload sample file to above path or codes to covers this part. Totally struck with workshop. Kindly help.

Potential Instructions Update

Extracted from #9

The notebook has many values to be replaced which will already be done as a part of setup ,can you remove it from the instruction?

00-Setupworkspace

Exercise 2 Task 2 Step 26

Issue extracted from #13

Screenshot update

Exercise 1 Task 2 Step 10

https://github.com/solliancenet/azure-synapse-analytics-day/blob/master/01-explore-data-lake.md#task-2---explore-the-data-lake-with-azure-synapse-spark

The screenshot in the repo;

The one in the actual environment

Exercise sequence improvement - swap #3 & #4

the Power BI exercise #3 is better suited to be #4 considering -

Ex 4 (to do with Performance) is easier to follow through if its was Ex 3 using the same tooling in Ex 1 + 2
Power BI has some complexities (addressed on separate issue) which may delay the lab. Note: this Lab hosted by Spektra currently has only 4 hours.
Users who are new to Azure and Synapse realistically can take up to 90-120mins for the Ex 1 + 2 to finish satisfactorily
Content slide packs cover them towards the end hence better alignment
Improvement in flow would impress a much better experience for newbies coming in to experience Synapse

Exercise 4 - High Performance Analysis with Azure Synapse SQL Pools

The SQL Query takes execution time greater than 15 secs for wwi_perf.FactSale_Fast table and wwi_perf.FactSale_Slow .

And sometimes wwi_perf.FactSale_Fast table takes more execution time compared to wwi_perf.FactSale_Slow table

Is it something related to SQLPool DW size?

Issue in Exercise 2

In Exercise 2, Step 17, we have the following:

Observe the results in the output pane, and see how easy it was to use Spark notebooks to write data from Blob Storage into Azure Synapse Analytics.

It should be Observe the results in the output pane, and see how easy it was to use Spark notebooks to read data from Blob Storage into Azure Synapse Analytics.

@daronyondem can you please check on this and let me know if I'm mistaken

Power BI exercise Task 2 - Open instead of Create

Issue submitted as referred by Spektra Support and direction that we submit it here

It may be better to allow users to open the existing report that which is published by Power BI Desktop steps (task 1). Creating a new one seems odds to users who are currently quite used to or familiar with, although illustrating that a new one can be created by simply pointing out the option is useful.

If the Auth creds are sorted out beforehand (as submitted on separate issue) then the opening the existing report would save time too

This would greatly improve the Power BI integration experience, esp for currently proficient Power BI report developers

Exercise 2 Task 3 Step 9

The screenshot in the repo;

The screenshot from test:

Typos in Exercise 4 Task 3 Step 4

Written as:
"to stard the Map Data tool. If this is the first time you are doing this, you might pe"

should be

to start the Map Data tool. If this is the first time you are doing this, you might be

Exercise 2 Bonus Challenge Step 2 Wording Update

Extracted from #15

The instruction of 1000 rows is in Bonus challenge step 2 under "To observe the data being written into the table:

Exercise 1 Task 2 Step 9

EX1 task 2 step 9 : There are no options like x axis and y axis and we have new options like key values,aggregations etc

Exercise 2 task 1 step 21-Screenshot update
Select 1000 rows->100 rows

Exercise 2 task 2 step 26
Screenshot update

Instruction and Screenshot update.

Exercise2 Task3 step3

Orchestrate needs to be updated to Integration.

Under Integration, select Pipeline runs.

Exercise2 Task4 step3

Submit time needs to be updated to Local time

On the Apache Spark applications page, select the Local time value and observe the available options for limiting the time range for Spark applications that are displayed in the list. In this case, you are looking at the current run, so ensure Last 24 hours is selected and then select OK.

dp-203-setup-Part01.ps1 is looking for #UserId, but does n to seem to capture it anywhere.

dp-203-setup-Part01.ps1 throws the error below:

Performing post-deployment configuration...
Granting [email protected] admin permissions...
Assign-SynapseRole : Cannot bind argument to parameter 'PrincipalId' because it is an empty string.
At C:\dp-203\data-engineering-ilt-deployment\Allfiles\00\artifacts\environment-setup\automation\dp-20
3-setup-Part01.ps1:198 char:110

... d "6e4bf58a-b8e1-4cc3-bbf9-d73143322b78" -PrincipalId $userId # Work ...
```
                                                      ~~~~~~~
```
- CategoryInfo : InvalidData: (:) [Assign-SynapseRole], ParameterBindingValidationExcep
  tion
- FullyQualifiedErrorId : ParameterArgumentValidationErrorEmptyStringNotAllowed,Assign-SynapseRo
  le

Assign-SynapseRole : Cannot bind argument to parameter 'PrincipalId' because it is an empty string.
At C:\dp-203\data-engineering-ilt-deployment\Allfiles\00\artifacts\environment-setup\automation\dp-20
3-setup-Part01.ps1:199 char:110

... d "7af0c69a-a548-47d6-aea3-d00e69bd83aa" -PrincipalId $userId # SQL ...
```
                                                      ~~~~~~~
```
- CategoryInfo : InvalidData: (:) [Assign-SynapseRole], ParameterBindingValidationExcep
  tion
- FullyQualifiedErrorId : ParameterArgumentValidationErrorEmptyStringNotAllowed,Assign-SynapseRo
  le

... d "c3a6d2f1-a26f-4810-9b0f-591308d5cbf1" -PrincipalId $userId # Apac ...
```
                                                      ~~~~~~~
```
- CategoryInfo : InvalidData: (:) [Assign-SynapseRole], ParameterBindingValidationExcep
  tion
- FullyQualifiedErrorId : ParameterArgumentValidationErrorEmptyStringNotAllowed,Assign-SynapseRo
  Le

The scripts is looking for #UserId, but does n to seem to capture it anywhere.

Add the current userto Admin roles

Write-Host "Granting $userName admin permissions..."
Assign-SynapseRole -WorkspaceName $workspaceName -RoleId "6e4bf58a-b8e1-4cc3-bbf9-d73143322b78" -PrincipalId $userId # Workspace Admin
Assign-SynapseRole -WorkspaceName $workspaceName -RoleId "7af0c69a-a548-47d6-aea3-d00e69bd83aa" -PrincipalId $userId # SQL Admin
Assign-SynapseRole -WorkspaceName $workspaceName -RoleId "c3a6d2f1-a26f-4810-9b0f-591308d5cbf1" -PrincipalId $userId # Apache Spark Admin

What am I missing?

Thanks,

Michael

Getting Error - Synapse Predict

Hi Team,

Getting below error while running predict query on Azure Synapse,

Failed to execute query. Error: Parse error at line: 6, column: 14: Incorrect syntax near 'MODEL'.
Total execution time: 00:00:01.050
https://www.screencast.com/t/sB5nZAtDWav

Kindly help.

Exercise 4 correction

The mention wwi_perf.FactSale_Perf on the 3rd-to-last paragraph should be wwi_perf.FactSale_Fast.

Also, I think the statement "When our query needs to consolidate each customer's data, a lot of data movement will occur between the distributions." should be expanded on. Why is there data movement? Maybe mention because the fact table is joining with the dimension table and go into details. But I'm not sure I understand why a fact table that is set to round robin would be slower when it's joining a replicated dimension table?

Setup steps: broken link to "Exercise 3 - Task 1"

In setup steps, Task 10 - Configure Power BI. The link to "Exercise 3 - Task 1" is broken.
which actions should be done to configure PBI?

Naming of PowerBI dataset-Exercise 3 task 1

If prompted to save your changes, select Save and provide a unique name having the format user_dataset_UNIQUEID (where UNIQUEID is the unique identifier which you retrieved from your username, as described above)

Users might be confused with different datasetname in task1 and 2 so can you please
Update the name to wwifactsales in Exercise 3 task 1 step 11 as we are using the same name in ex3 task 2

Exercise 5: Data science with spark

Table is created with the name wwi.SampleData according to the setup but in Exercise 5 of the guide the query is pointing to the table [wwi_ml].[SampleData]

And can you please check on the error related to model:

Thanks,
Srushti

Exercise 2 Task 1 Step 21

Issue extracted from #13

Exercise 2 task 1 step 21-Screenshot update
Select 1000 rows->100 rows

missing notebooks: Exercise 2 - Ingest Sales Data and Exercise 1 - Read with Spark

Hello,

At Task 9, Exercise 2 - Ingest Sales Data and Exercise 1 - Read with Spark are missing and can`t be imported, please add them. I get a page not found when trying to access them. Thanks!

Issue in Exercise 5 Task 3

In Exercise 5 Task 3 step 7 we are not able to select the existing cognitive service, even though cognitive service and linked service are present. And also the UI got updated.

@daronyondem can you please have a look at this?

PowerBI report is not showing any data

After creating the PowerBI dataset in Exercise 3 task 1
In exercise 3 task 2 the Powerbi report is not showing up any data
Can you please check on this as soon as possible?

Exercise 2 and exercise 3 updates

Exercise 1 task 1 step 2
Screenshot update

Exercise 2:
task 1 19th step - Update SELECT TOP 1000 rows to SELECT TOP 100 rows as 1000 rows is no longer supported.

Exercise 3 -update screenshots
Task 1 step 2 -

Task 1 step 6

Task 1 step 12

Task 1 step 14

Spellcheck

There is a good amount of spelling mistakes on the instructions. Worth parsing it through a spellcheck engine and correcting them.