exercise-ev-databricks's Issues
[Content] Create Batch Processing Exercise - Gold 2
[Content] Create Production Code example
read_from_stream function should use input_df from the parameters
This function in Cmd 10:

def read_from_stream(input_df: DataFrame) -> DataFrame:
    ### YOUR CODE HERE
    raw_stream_data = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 10)
        .load()
    )
    ###
    # This is just data setup, not part of the exercise
    return raw_stream_data \
        .join(mock_data_df, raw_stream_data.value == mock_data_df.index, "left") \
        .drop("timestamp") \
        .drop("index")
df = read_from_stream(mock_data_df)

The function should use input_df instead of mock_data_df. However, if the input parameter is used, the tests fail.
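The pattern behind this bug (a function reading a captured notebook-level variable instead of its parameter) can be sketched Spark-free in plain Python; the names below are illustrative, not from the notebook:

```python
# Stand-in for the bug above: the "broken" function ignores its parameter and
# reads a module-level variable, so callers (like tests) cannot inject data.
mock_data = [1, 2, 3]

def broken(input_data):
    # Bug: uses the captured global `mock_data`, not the parameter.
    return [x * 10 for x in mock_data]

def fixed(input_data):
    # Fix: uses the parameter, so the caller controls the input.
    return [x * 10 for x in input_data]

print(broken([7]))  # [10, 20, 30] regardless of the argument
print(fixed([7]))   # [70]
```

This is also why the tests fail only once input_df is actually used: the tests pass in their own mock data, while the original body silently joins against the notebook-level mock_data_df.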
[Content] Create Batch Processing Exercise - Silver
[Content] Create Streaming Exercise (Stateless)
Flattening, etc. We might skip this one in favour of the Stateful exercise
My First Dataset - Final Charge - use inner join instead of left join (Find Matching Start Transaction Requests (left join))
[Content] Create new Streaming exercise
[Content] Update Batch Silver and Gold exercises with standard dataset
The batch silver and gold exercises depended on completing the previous exercises. This does not favour people who are a bit slower. We want to encourage people to learn instead of needing to pass gates. At the beginning of each of the exercises, import the output from the previous exercise (the output dirs in the github repo exercise-ev-databricks).
Catch joining errors early from CMD 17, fails in CMD 45
Tests pass all the way down to the e2e tests, which then fail if there is a small mistake in a command far above. That is really hard to catch, because the unit tests pass. When someone messes up CMD 17 (join_with_start_transaction_responses) and joins input_df into join_df, the error only surfaces in CMD 45:
AssertionError: Expected [2.17, 1.58, 1.0], but got [2.17, 1.58, 1.25]
The correct solution for CMD 17 is:

return input_df \
    .join(join_df, input_df.transaction_id == join_df.transaction_id, "left") \
    .select(
        join_df.charge_point_id,
        join_df.transaction_id,
        join_df.meter_start,
        input_df.meter_stop.alias("meter_stop"),
        join_df.start_timestamp,
        input_df.timestamp.alias("stop_timestamp")
    )
A wrong solution example is:

return join_df \
    .join(input_df, join_df.transaction_id == input_df.transaction_id, "left") \
    .select(
        join_df.charge_point_id,
        join_df.transaction_id,
        join_df.meter_start,
        input_df.meter_stop,
        join_df.start_timestamp,
        input_df.timestamp.alias("stop_timestamp")
    )
Changing the "left" join to a "right" join in the wrong version makes everything work again. If a wrong approach changes the data (beyond ordering), this should be tested early, because there is no way to find the error later on.
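Why the join direction changes the data (and not just the order) can be illustrated Spark-free with a dict-based left join; the sample values are hypothetical:

```python
# A left join keeps every row of the *left* side. If join_df contains
# transactions with no match in input_df (e.g. started but never stopped),
# driving the left join from join_df keeps those extra rows with nulls,
# which changes downstream aggregates.
stops = {1: 100, 2: 200}         # transaction_id -> meter_stop (input_df stand-in)
starts = {1: 10, 2: 20, 3: 30}   # transaction_id -> meter_start (join_df stand-in); 3 never stopped

def left_join(left, right):
    # Keep every key of `left`; missing right-side values become None.
    return {k: (left[k], right.get(k)) for k in left}

correct = left_join(stops, starts)  # driven by stops: only completed transactions
wrong = left_join(starts, stops)    # driven by starts: keeps transaction 3 with meter_stop=None

print(correct)  # {1: (100, 10), 2: (200, 20)}
print(wrong)    # {1: (10, 100), 2: (20, 200), 3: (30, None)}
```

This also explains why switching the wrong version to a "right" join "fixes" it: the driving side flips back to the stop transactions.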
[Investigate] Why does writing MeterValues Request take 32 seconds and others take 2 seconds?
In the Batch Processing Silver layer exercise, writing MeterValues request to parquet (single partition) takes 32 seconds... whereas writing all of the other DFs for other actions takes sub 2 seconds. Why is that?
NOTE: to run this exercise, please run the Bronze exercise first - it outputs some parquet files which the Silver exercise requires.
Add sorting to test to increase repeatability
E2E Test beneath "Join with Target DataFrame" (Final Charge) needs to sort the input_df so that the test doesn't flap due to re-ordering of the DF
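One way to make the comparison order-insensitive, sketched in plain Python (in PySpark one would typically sort both the expected rows and the collected rows before asserting):

```python
# Row order out of a Spark job depends on partitioning, so an order-sensitive
# equality check is flaky. Sorting both sides before comparing removes that.
expected = [("AL1000", 1.0), ("BX7000", 2.0)]  # hypothetical rows
actual = [("BX7000", 2.0), ("AL1000", 1.0)]    # same rows, shuffled order

assert actual != expected                  # order-sensitive check would flap
assert sorted(actual) == sorted(expected)  # order-insensitive check is stable
```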
Add answers to Batch Processing Bronze Reflect Questions
At the end of the exercise, there is a "Reflect" section
- Add Spark UI screenshots of where people can see how many in-memory partitions have been created (and explanation) as a result of the repartition
- Add Spark UI screenshots of how much time the write took (and explanation) - this is due to parquet taking a while to write
Get ZOptimise working in DeltaLake Exercise
Create baseline Streaming exercise for Syed
stop_transaction_body_schema implementation method in Final Charge notebook does not conform to the explanation in the suggested link for the exercise
The from_json function expects a Column-type argument, but the reference link in the same exercise suggests a DataFrame method. Here is the link to the suggestion in the exercise:
https://sparkbyexamples.com/pyspark/pyspark-json-functions-with-examples/
Participants end up implementing the solution as per the suggested link, and the test cases fail. So either the suggested document should change or the tests should become more inclusive. Brought to notice by Sultan Ahmad - Dec 2023 Tour.
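The distinction at issue can be sketched with a stdlib analogy (the rows below are hypothetical): from_json is a function applied to a column's value together with a schema, not a method called on the DataFrame itself.

```python
import json

# Hypothetical StopTransaction "body" strings, standing in for a string column.
bodies = [
    '{"transaction_id": 1, "meter_stop": 26795}',
    '{"transaction_id": 2, "meter_stop": 32539}',
]

# Function-applied-to-a-value style, analogous to
# from_json(col("body"), stop_transaction_body_schema) in PySpark:
parsed = [json.loads(b) for b in bodies]
print([p["transaction_id"] for p in parsed])  # [1, 2]
```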
[Content] Add Databricks Visualisations to Visualisation Exercise
[Content] Update static input data for Visualisation
Exercise Silver missing schema field for MeterValues Request
Exercise Silver
Cmd 92: the Target Schema is missing connector_id.
Final Charge Test - sorting causes unexpected failures
From @petershaw:
The final test still demands a certain order of the data. If you put in some kind of sorting it fails. I do not see any reason for that. Sorting is not needed, but also not wrong. I can understand why a lot of people put it in. So improving the flexibility would be a win.
Shorten Medallion Arch Exercises
There has been a lot of feedback that the Batch Processing exercises have been too repetitive. We don't want to fill in the answers completely because there is a possibility that people will simply run the code without thinking through the problem. This issue describes a compromise solution.
- Update the Bronze Exercise with filled in solutions but leave one line per exercise as a to-do.
- Update the Silver Exercise with filled in solutions but leave one line per exercise as a to-do.
[Content] Create Batch Processing Exercise - Gold 1
Make Batch Processing Gold exercise less repetitive
Bronze Processing - Reflect Question 3 - How many minutes were spent rendering in the notebook as opposed to the actual write?
Change the question to: "How many minutes were spent showing the response in the notebook as opposed to the actual write?"
[Content] Create Batch Processing Exercise - Bronze
Increase Verbosity in the Final Charge Exercise
From @petershaw:
A stumbling block was CMD 56. A lot of the groups tried to get charge_point_id from input_df and not from join_df. There is no way of debugging it inside the command itself, because you are getting the already transformed data from your previous steps, not the original data that the tests produce. I do like the intention and the reason for this, but I had to give a lot of debugging help: what is the printSchema of your data, and what is the test data's schema? This also happens further down. I am thinking of a way to give a hint but still let them solve it on their own... maybe a little "hey, try printing out the schema two commands above..." The Python error is misleading if you assume that every command is atomic. I hope my description is clear.
Sometimes it would help to see the mock data in the description of the tests. The most spoken sentence today: "no, it is in your code, because the tests are correct" ;) It helps a lot with learning to debug, but actually seeing the mock data schema in the test description could be useful to avoid a frustration cycle.
- Add some helper text for command 56: "Does your test fail? Sometimes mocked data in a unit test does not have all of the properties that you are expecting. Try printSchema() and compare the output of the command with the output of the test. See any differences?"
- Do this both for the Solutions and Exercise
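The suggested printSchema() comparison boils down to diffing field names; here is a Spark-free sketch (the field names are illustrative):

```python
# Compare the fields your command produced against the fields the test's mock
# data provides; the set difference pinpoints the missing property.
command_fields = {"charge_point_id", "transaction_id", "meter_start", "meter_stop"}
mock_fields = {"transaction_id", "meter_start", "meter_stop"}

print(command_fields - mock_fields)  # fields you rely on that the mock lacks
print(mock_fields - command_fields)  # fields the mock has that you ignore
```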
[Content] Create content for My First Dataset which represents Data Wrangling as a sole contributor
[Content] Create Streaming Exercise (Stateful)
In this Stateful Streaming exercise, we'll answer the question of what was the latest Status of the charger in the last 5 minutes.
This base notebook contains the baseline content. It streams data by leveraging the "rate" format. From there, what's left is to unpack the "status" and window it.
- Upload completed exercise to the exercise-ev-databricks repo
- Copy the README from another exercise in the exercise-ev-databricks repo (update the instructions for links to the exercise)
- Link from the appropriate page in data-derp.github.io to the repo's README
Final Charge Exercise - errors at top not caught and causing problems later
From @petershaw:
The other problem I personally have with this excellent notebook is that it is way too long to provide quick help. If there is an error at the top (somewhere around CMD 50), everything seems fine, but the test at the end fails. To find the problem you either have to use Run All, or scroll up and down all the time, because Databricks has a bug in the toast message: it says all good, even if a command fails when running "all commands below". With one run taking 1 minute and 45 seconds, it is very stressful if you have multiple participants with different problems at the same time. 1:45 again and again, to find out where the problems are in multiple spots in the notebook, caused me some hair loss today.
My suggestion:
- Test, in the earlier commands, for more of the problems that can only surface later -> provide hints
- Do not test the order of the data (if not needed)
- Split it up into two notebooks so it is easier to handle and runs faster
I saved one of the problematic notebooks and will try to find a solution for the testing when I find time. I have to dig into the problems and properly understand them first.
Add "Stop Stream" to the Delta Lake Domain Exercise
https://data-derp.github.io/docs/2.0/making-big-data-work/exercise-delta-lake-domain
- Does the content make sense?
- Silver Notebook: Add a new Markdown section at the end to say "don't forget to stop your stream. Go back to the end of the Bronze notebook and run the last code block" - add that to both Solutions and Workbook
- Bronze Notebook: Add stop stream to both Solutions and Workbook (with a note to say: "this stream should remain running while you're working on the Silver notebook. Come back here when you're done with the Silver exercise to turn off the stream, or we'll incur lots of costs")