
exercise-ev-databricks's People

Contributors

kbeyer-tw, kelseymok, krissimon, masroorrizvi, paulrinaldi, syed-tw


exercise-ev-databricks's Issues

read_from_stream function should use input_df from the parameters

This function in CMD 10:

def read_from_stream(input_df: DataFrame) -> DataFrame:
    ### YOUR CODE HERE
    raw_stream_data = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 10)
        .load()
    )
    ###


    # This is just data setup, not part of the exercise
    return raw_stream_data.\
        join(mock_data_df, raw_stream_data.value == mock_data_df.index, 'left').\
        drop("timestamp").\
        drop("index")


df = read_from_stream(mock_data_df)

The join should use input_df instead of the global mock_data_df.
At the moment, if the input parameter is actually used, the tests fail, so the tests need updating alongside the function.

[Content] Update Batch Silver and Gold exercises with standard dataset

The batch Silver and Gold exercises depended on completing the previous exercises, which disadvantages people who move a bit more slowly. We want to encourage people to learn instead of needing to pass gates. At the beginning of each exercise, import the output of the previous exercise (the output dirs in the GitHub repo exercise-ev-databricks).

Catch joining errors early: a mistake in CMD 17 only fails in CMD 45

The tests pass all the way down to the e2e tests, which then fail because of a small mistake in a command much further up.

That is really hard to catch, because the unit tests all pass.

If someone gets CMD 17 (join_with_start_transaction_responses) wrong and joins the input_df into join_df instead of the other way around, the error only surfaces in CMD 45:

AssertionError: Expected [2.17, 1.58, 1.0], but got [2.17, 1.58, 1.25]

The correct solution for CMD 17 is:

        return input_df. \
            join(join_df, input_df.transaction_id == join_df.transaction_id, "left"). \
            select(
                join_df.charge_point_id,
                join_df.transaction_id,
                join_df.meter_start,
                input_df.meter_stop.alias("meter_stop"),
                join_df.start_timestamp,
                input_df.timestamp.alias("stop_timestamp")
            )

An example of a wrong solution is:

        return join_df. \
            join(input_df, join_df.transaction_id == input_df.transaction_id, "left"). \
            select(
                join_df.charge_point_id, 
                join_df.transaction_id, 
                join_df.meter_start, 
                input_df.meter_stop, 
                join_df.start_timestamp, 
                input_df.timestamp.alias("stop_timestamp")
            )

Changing the "left" join to a "right" join in the wrong solution makes everything work again.

If the wrong approach changes the data (beyond just its order), it should be caught by an earlier test, because there is no way to find the error later on.
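The essence of the bug is which side of a left join is preserved; swapping the sides changes the rows themselves, not just their order. A minimal plain-Python illustration (hypothetical data, not the exercise's API):

```python
# Hypothetical data: stop events and start events, keyed by transaction id.
stops = {"t1": "stop1", "t2": "stop2"}
starts = {"t1": "start1", "t2": "start2", "t3": "start3"}

def left_join(keep, other):
    # A left join preserves every key of `keep` and looks up matches in `other`.
    return {k: (v, other.get(k)) for k, v in keep.items()}

# Joining from the stops side keeps only completed transactions:
assert left_join(stops, starts) == {"t1": ("stop1", "start1"), "t2": ("stop2", "start2")}

# Joining from the starts side also keeps t3, which never stopped, so any
# aggregate computed later sees an extra row with missing values:
assert left_join(starts, stops) == {
    "t1": ("start1", "stop1"),
    "t2": ("start2", "stop2"),
    "t3": ("start3", None),
}
```

That extra row is exactly the kind of difference that changes a downstream average from 1.0 to 1.25 while every earlier unit test still passes.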

Add answers to Batch Processing Bronze Reflect Questions

At the end of the exercise, there is a "Reflect" section

(screenshot in original issue)

  • Add Spark UI screenshots showing how many in-memory partitions were created as a result of the repartition, with an explanation
  • Add Spark UI screenshots showing how much time the write took, with an explanation (parquet takes a while to write)

stop_transaction_body_schema implementation method in the Final Charge notebook does not conform to the explanation in the link suggested in the exercise

(screenshot in original issue)

The from_json function expects a Column-typed argument, but the reference link in the same exercise demonstrates a DataFrame-method style. Here is the link suggested in the exercise:

https://sparkbyexamples.com/pyspark/pyspark-json-functions-with-examples/

Participants end up implementing the solution the linked article suggests, and the test case then fails. Either the suggested document should change or the tests should become more inclusive. Reported by Sultan Ahmad, Dec 2023 Tour.

Final Charge Test - sorting causes unexpected failures

From @petershaw:

The final test still demands a certain order of the data; if you add any kind of sorting, it fails. I don't see any reason for that. Sorting is not needed, but it is also not wrong, and I can understand why a lot of people add it. Making the test more flexible would be a win.
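One way to make such an assertion order-insensitive (a generic sketch, not the course's actual test helper) is to compare the collected values as sorted lists:

```python
def assert_same_rows(actual, expected):
    # Compare as sorted lists so a participant-added orderBy does not fail the test.
    assert sorted(actual) == sorted(expected), f"{actual} != {expected}"

# Passes regardless of the order the rows arrive in:
assert_same_rows([2.17, 1.58, 1.0], [1.0, 2.17, 1.58])
```

Values that differ (not merely reordered) still fail, so the test keeps its strength.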

Shorten Medallion Arch Exercises

There has been a lot of feedback that the Batch Processing exercises are too repetitive. We don't want to fill in the answers completely, because people might simply run the code without thinking through the problem. This issue describes a compromise.

  • Update the Bronze Exercise with filled-in solutions but leave one line per exercise as a to-do.
  • Update the Silver Exercise with filled-in solutions but leave one line per exercise as a to-do.

Increase Verbosity in the Final Charge Exercise

From @petershaw:

A stumbling block was CMD 56. A lot of groups tried to get charge_point_id from input_df instead of join_df. There is no way of debugging it inside the command itself, because you are working with the already-transformed data from your previous steps, not the original data that the tests produce. I do like the intention and the reasoning behind this, but I had to give a lot of debugging help: what is the printSchema of your data, and what is the test data schema? This also happens further down. I am thinking of a way to give a hint while still letting them solve it on their own... maybe a little "hey, try printing out the schema two commands above...". The Python error is misleading if you think every command is atomic. I hope my description is clear.
Sometimes it would help to see the mock data in the description of the tests. The most-spoken sentence today: "no, it is in your code, because the tests are correct" ;) It helps a lot with learning to debug, but actually seeing the mock data schema in the test description could help avoid a frustration cycle.

  • Add some helper text for command 56: "Does your test fail? Sometimes mocked data in a unit test does not have all of the properties you are expecting. Try printSchema() and compare the output of the command with the output of the test. See any differences?"
  • Do this in both the Solutions and the Exercise notebooks

[Content] Create Streaming Exercise (Stateful)

In this Stateful Streaming exercise, we'll answer the question: what was the latest status of each charger in the last 5 minutes?

This base notebook contains the baseline content. It streams data by leveraging the "rate" format. From there, what's left is to unpack the "status" field and window it.

  • Upload completed exercise to the exercise-ev-databricks repo
  • Copy the README from another exercise in the exercise-ev-databricks repo (update the instructions for links to the exercise)
  • Link from the appropriate page in data-derp.github.io to the repo's README

Final Charge Exercise - errors at the top are not caught and cause problems later

From @petershaw:

The other problem I personally have with this otherwise excellent notebook is that it is far too long to provide quick help. If there is an error near the top (somewhere around CMD 50), everything seems fine, but the test at the end fails. To find the problem you either have to Run All or scroll up and down constantly, because Databricks has a bug in the toast message: it says all good even if a command fails when running "all commands below". Since one run takes 1 minute and 45 seconds, it is very stressful when multiple participants have different problems at the same time. Waiting 1:45 again and again to find out where the problems are in multiple spots in the notebook caused me some hair loss today.

My suggestion:

  • Test, in the earlier commands, for problems that would otherwise only surface later, and provide hints
  • Do not test the order of the data (when order is not needed)
  • Split the notebook in two so it is easier to handle and runs faster
    I saved one of the problematic notebooks and will try to find a solution for the testing when I find time. I first have to dig into the problems and understand them.

Add "Stop Stream" to the Delta Lake Domain Exercise

https://data-derp.github.io/docs/2.0/making-big-data-work/exercise-delta-lake-domain

  • Does the content make sense?
  • Silver Notebook: Add a new Markdown section at the end saying "don't forget to stop your stream; go back to the end of the Bronze notebook and run the last code block". Add this to both the Solutions and the Workbook.
  • Bronze Notebook: Add a stop-stream code block to both the Solutions and the Workbook, with a note saying: "this stream should remain running while you're working on the Silver notebook. Come back here when you're done with the Silver exercise to turn off the stream (or we'll incur lots of costs)".
