exercise-ev-databricks's Issues
[Content] Create Batch Processing Exercise - Gold 2
[Content] Create Production Code example
read_from_stream function should use input_df from the parameters
This function in Cmd 10:

def read_from_stream(input_df: DataFrame) -> DataFrame:
    ### YOUR CODE HERE
    raw_stream_data = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 10)
        .load()
    )
    ###
    # This is just data setup, not part of the exercise
    return raw_stream_data \
        .join(mock_data_df, raw_stream_data.value == mock_data_df.index, "left") \
        .drop("timestamp") \
        .drop("index")
df = read_from_stream(mock_data_df)

The function should use input_df instead of mock_data_df. However, if the input parameter is used, the tests fail.
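The pattern behind this bug (a function reading a captured notebook-level variable instead of its parameter) can be sketched Spark-free in plain Python; the names below are illustrative, not from the notebook:

```python
# Stand-in for the bug above: the "broken" function ignores its parameter and
# reads a module-level variable, so callers (like tests) cannot inject data.
mock_data = [1, 2, 3]

def broken(input_data):
    # Bug: uses the captured global `mock_data`, not the parameter.
    return [x * 10 for x in mock_data]

def fixed(input_data):
    # Fix: uses the parameter, so the caller controls the input.
    return [x * 10 for x in input_data]

print(broken([7]))  # [10, 20, 30] regardless of the argument
print(fixed([7]))   # [70]
```

This is also why the tests fail only once input_df is actually used: the tests pass in their own mock data, while the original body silently joins against the notebook-level mock_data_df.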
[Content] Create Batch Processing Exercise - Silver
[Content] Create Streaming Exercise (Stateless)
Flattening, etc. We might skip this one in favour of the Stateful exercise
My First Dataset - Final Charge - use inner join instead of left join (Find Matching Start Transaction Requests (left join))
[Content] Create new Streaming exercise
[Content] Update Batch Silver and Gold exercises with standard dataset
The batch silver and gold exercises depended on completing the previous exercises. This does not favour people who are a bit slower. We want to encourage people to learn instead of needing to pass gates. At the beginning of each of the exercises, import the output from the previous exercise (the output dirs in the github repo exercise-ev-databricks).
Catch joining errors early from CMD 17, fails in CMD 45
Tests pass all the way down to the e2e tests, which then fail if there is a small mistake in a command far above. That is really hard to catch, because the unit tests pass. When someone messes up CMD 17 (join_with_start_transaction_responses) and joins input_df into join_df, the error only surfaces in CMD 45:
AssertionError: Expected [2.17, 1.58, 1.0], but got [2.17, 1.58, 1.25]
The correct solution for CMD 17 is:

return input_df \
    .join(join_df, input_df.transaction_id == join_df.transaction_id, "left") \
    .select(
        join_df.charge_point_id,
        join_df.transaction_id,
        join_df.meter_start,
        input_df.meter_stop.alias("meter_stop"),
        join_df.start_timestamp,
        input_df.timestamp.alias("stop_timestamp")
    )
A wrong solution example is:

return join_df \
    .join(input_df, join_df.transaction_id == input_df.transaction_id, "left") \
    .select(
        join_df.charge_point_id,
        join_df.transaction_id,
        join_df.meter_start,
        input_df.meter_stop,
        join_df.start_timestamp,
        input_df.timestamp.alias("stop_timestamp")
    )
Changing the "left" join to a "right" join in the wrong version makes everything work again. If a wrong approach changes the data (beyond ordering), this should be tested early, because there is no way to find the error later on.
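Why the join direction changes the data (and not just the order) can be illustrated Spark-free with a dict-based left join; the sample values are hypothetical:

```python
# A left join keeps every row of the *left* side. If join_df contains
# transactions with no match in input_df (e.g. started but never stopped),
# driving the left join from join_df keeps those extra rows with nulls,
# which changes downstream aggregates.
stops = {1: 100, 2: 200}         # transaction_id -> meter_stop (input_df stand-in)
starts = {1: 10, 2: 20, 3: 30}   # transaction_id -> meter_start (join_df stand-in); 3 never stopped

def left_join(left, right):
    # Keep every key of `left`; missing right-side values become None.
    return {k: (left[k], right.get(k)) for k in left}

correct = left_join(stops, starts)  # driven by stops: only completed transactions
wrong = left_join(starts, stops)    # driven by starts: keeps transaction 3 with meter_stop=None

print(correct)  # {1: (100, 10), 2: (200, 20)}
print(wrong)    # {1: (10, 100), 2: (20, 200), 3: (30, None)}
```

This also explains why switching the wrong version to a "right" join "fixes" it: the driving side flips back to the stop transactions.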
[Investigate] Why does writing MeterValues Request take 32 seconds and others take 2 seconds?
In the Batch Processing Silver layer exercise, writing MeterValues request to parquet (single partition) takes 32 seconds... whereas writing all of the other DFs for other actions takes sub 2 seconds. Why is that?
NOTE: to run this exercise, please run the Bronze exercise first - it outputs some parquet files which the Silver exercise requires.
Add sorting to test to increase repeatability
E2E Test beneath "Join with Target DataFrame" (Final Charge) needs to sort the input_df so that the test doesn't flap due to re-ordering of the DF
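One way to make the comparison order-insensitive, sketched in plain Python (in PySpark one would typically sort both the expected rows and the collected rows before asserting):

```python
# Row order out of a Spark job depends on partitioning, so an order-sensitive
# equality check is flaky. Sorting both sides before comparing removes that.
expected = [("AL1000", 1.0), ("BX7000", 2.0)]  # hypothetical rows
actual = [("BX7000", 2.0), ("AL1000", 1.0)]    # same rows, shuffled order

assert actual != expected                  # order-sensitive check would flap
assert sorted(actual) == sorted(expected)  # order-insensitive check is stable
```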
Add answers to Batch Processing Bronze Reflect Questions
At the end of the exercise, there is a "Reflect" section
- Add Spark UI screenshots of where people can see how many in-memory partitions have been created (and explanation) as a result of the repartition
- Add Spark UI screenshots of how much time the write took (and explanation) - this is due to parquet taking a while to write
Get ZOptimise working in DeltaLake Exercise
Create baseline Streaming exercise for Syed
stop_transaction_body_schema implementation method in Final Charge notebook does not conform to the explanation in the suggested link for the exercise
The from_json function expects a Column-type argument, but the reference link in the same exercise suggests a DataFrame method. Here is the link to the suggestion in the exercise:
https://sparkbyexamples.com/pyspark/pyspark-json-functions-with-examples/
Participants end up implementing the solution as per the suggested link, and the test cases fail. So either the suggested document should change or the tests should become more inclusive. Brought to notice by Sultan Ahmad - Dec 2023 Tour.
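The distinction at issue can be sketched with a stdlib analogy (the rows below are hypothetical): from_json is a function applied to a column's value together with a schema, not a method called on the DataFrame itself.

```python
import json

# Hypothetical StopTransaction "body" strings, standing in for a string column.
bodies = [
    '{"transaction_id": 1, "meter_stop": 26795}',
    '{"transaction_id": 2, "meter_stop": 32539}',
]

# Function-applied-to-a-value style, analogous to
# from_json(col("body"), stop_transaction_body_schema) in PySpark:
parsed = [json.loads(b) for b in bodies]
print([p["transaction_id"] for p in parsed])  # [1, 2]
```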
[Content] Add Databricks Visualisations to Visualisation Exercise
[Content] Update static input data for Visualisation
Exercise Silver missing schema field for MeterValues Request
Exercise Silver
Cmd 92: the Target Schema is missing connector_id.
Final Charge Test - sorting causes unexpected failures
From @petershaw:
The final test still demands a certain order of the data. If you put in some kind of sorting it fails. I do not see any reason for that. Sorting is not needed, but also not wrong. I can understand why a lot of people put it in. So improving the flexibility would be a win.
Shorten Medallion Arch Exercises
There has been a lot of feedback that the Batch Processing exercises have been too repetitive. We don't want to fill in the answers completely because there is a possibility that people will simply run the code without thinking through the problem. This issue describes a compromise solution.
- Update the Bronze Exercise with filled in solutions but leave one line per exercise as a to-do.
- Update the Silver Exercise with filled in solutions but leave one line per exercise as a to-do.
[Content] Create Batch Processing Exercise - Gold 1
Make Batch Processing Gold exercise less repetitive
Bronze Processing - Reflect Question 3 - How many minutes were spent rendering in the notebook as opposed to the actual write?
Change the question to: "How many minutes were spent showing the response in the notebook as opposed to the actual write?"
[Content] Create Batch Processing Exercise - Bronze
Increase Verbosity in the Final Charge Exercise
From @petershaw:
A stumbling block was CMD 56. A lot of the groups tried to get charge_point_id from input_df and not from join_df. There is no way of debugging it inside the command itself, because you are getting the already transformed data from your previous steps, not the original data that the tests produce. I do like the intention and the reason for this, but I had to give a lot of debugging help: what is the printSchema of your data, and what is the test data's schema? This also happens further down. I am thinking of a way to give a hint but still let them solve it on their own... maybe a little "hey, try printing out the schema two commands above..." The Python error is misleading if you assume that every command is atomic. I hope my description is clear.
Sometimes it would help to see the mock data in the description of the tests. The most spoken sentence today: "no, it is in your code, because the tests are correct" ;) It helps a lot with learning to debug, but actually seeing the mock data schema in the test description could be useful to avoid a frustration cycle.
- Add some helper text for command 56: "Does your test fail? Sometimes mocked data in a unit test does not have all of the properties that you are expecting. Try printSchema() and compare the output of the command with the output of the test. See any differences?"
- Do this both for the Solutions and Exercise
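The suggested printSchema() comparison boils down to diffing field names; here is a Spark-free sketch (the field names are illustrative):

```python
# Compare the fields your command produced against the fields the test's mock
# data provides; the set difference pinpoints the missing property.
command_fields = {"charge_point_id", "transaction_id", "meter_start", "meter_stop"}
mock_fields = {"transaction_id", "meter_start", "meter_stop"}

print(command_fields - mock_fields)  # fields you rely on that the mock lacks
print(mock_fields - command_fields)  # fields the mock has that you ignore
```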
[Content] Create content for My First Dataset which represents Data Wrangling as a sole contributor
[Content] Create Streaming Exercise (Stateful)
In this Stateful Streaming exercise, we'll answer the question of what was the latest Status of the charger in the last 5 minutes.
This base notebook contains the baseline content. It streams data by leveraging the "rate" format. From there, what's left is to unpack the "status" and window it.
- Upload completed exercise to the exercise-ev-databricks repo
- Copy the README from another exercise in the exercise-ev-databricks repo (update the instructions for links to the exercise)
- Link from the appropriate page in data-derp.github.io to the repo's README
Final Charge Exercise - errors at top not caught and causing problems later
From @petershaw:
The other problem I personally have with this excellent notebook is that it is way too long to provide quick help. If there is an error at the top (somewhere around CMD 50), everything seems fine, but the test at the end fails. To find the problem you either have to use Run All, or scroll up and down all the time, because Databricks has a bug in the toast message: it says all good, even if a command fails when running "all commands below". With one run taking 1 minute and 45 seconds, it is very stressful if you have multiple participants with different problems at the same time. 1:45 again and again, to find out where the problems are in multiple spots in the notebook, caused me some hair loss today.
My suggestion:
- Test, in the earlier commands, for more of the problems that can only surface later -> provide hints
- Do not test the order of the data (if not needed)
- Split it up into two notebooks so it is easier to handle and runs faster
I saved one of the problematic notebooks and will try to find a solution for the testing when I find time. I have to dig into the problems and properly understand them first.
Add "Stop Stream" to the Delta Lake Domain Exercise
https://data-derp.github.io/docs/2.0/making-big-data-work/exercise-delta-lake-domain
- Does the content make sense?
- Silver Notebook: Add a new Markdown section at the end to say "don't forget to stop your stream. Go back to the end of the Bronze notebook and run the last code block" - add that to both Solutions and Workbook
- Bronze Notebook: Add stop stream to both Solutions and Workbook (with a note to say: "this stream should remain running while you're working on the Silver notebook. Come back here when you're done with the Silver exercise to turn off the stream, or we'll incur lots of costs")