danielbeach / data-engineering-practice
Data Engineering Practice Problems
Hey Daniel, I've been loving this so far, thanks for putting it together! I finally made it to Exercise 6, but when I run "docker-compose up run" I get the output below and the Docker container won't start. I've never used PySpark before, so I have no idea how to troubleshoot this.
s\GitHub\data-engineering-practice\Exercises\Exercise-6> docker-compose up run
[+] Running 1/0
- Container exercise-6-run-1 Created 0.0s
Attaching to exercise-6-run-1
exercise-6-run-1 | WARNING: An illegal reflective access operation has occurred
exercise-6-run-1 | WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
exercise-6-run-1 | WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
exercise-6-run-1 | WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
exercise-6-run-1 | WARNING: All illegal access operations will be denied in a future release
exercise-6-run-1 | 22/10/13 01:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
exercise-6-run-1 | Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkContext: Running Spark version 3.0.1
exercise-6-run-1 | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1 | 22/10/13 01:03:55 INFO ResourceUtils: Resources for spark.driver:
exercise-6-run-1 |
exercise-6-run-1 | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkContext: Submitted application: Exercise6
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls to: root
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls to: root
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls groups to:
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls groups to:
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
exercise-6-run-1 | 22/10/13 01:03:55 INFO Utils: Successfully started service 'sparkDriver' on port 36347.
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering MapOutputTracker
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMaster
exercise-6-run-1 | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
exercise-6-run-1 | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
exercise-6-run-1 | 22/10/13 01:03:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-72b58301-91c5-4d58-b06a-c81c8deea2bc
exercise-6-run-1 | 22/10/13 01:03:55 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering OutputCommitCoordinator
exercise-6-run-1 | 22/10/13 01:03:56 INFO Utils: Successfully started service 'SparkUI' on port 4040.
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://06fa963d87c7:4040
exercise-6-run-1 | 22/10/13 01:03:56 INFO Executor: Starting executor ID driver on host 06fa963d87c7
exercise-6-run-1 | 22/10/13 01:03:56 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37915.
exercise-6-run-1 | 22/10/13 01:03:56 INFO NettyBlockTransferService: Server created on 06fa963d87c7:37915
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMasterEndpoint: Registering block manager 06fa963d87c7:37915 with 434.4 MiB RAM, BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/app/spark-warehouse').
exercise-6-run-1 | 22/10/13 01:03:56 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
exercise-6-run-1 | ['Divvy_Trips_2019_Q4.zip', 'Divvy_Trips_2020_Q1.zip']
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkContext: Invoking stop() from shutdown hook
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkUI: Stopped Spark web UI at http://06fa963d87c7:4040
exercise-6-run-1 | 22/10/13 01:03:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
exercise-6-run-1 | 22/10/13 01:03:56 INFO MemoryStore: MemoryStore cleared
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManager: BlockManager stopped
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMaster: BlockManagerMaster stopped
exercise-6-run-1 | 22/10/13 01:03:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkContext: Successfully stopped SparkContext
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Shutdown hook called
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0/pyspark-12e6df8c-60f3-4687-8d09-651ece254651
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-7da8b58f-f65f-4ac9-beb2-c4128e23bb07
exercise-6-run-1 exited with code 0
PS E:\Documents\GitHub\data-engineering-practice\Exercises\Exercise-6>
Hi,
maybe it would also be a good idea to add a section on data orchestrators (Dagster, Prefect, Mage, Airflow, etc.).
Understanding and debugging an orchestrated data pipeline is a crucial part of data engineering.
What do you think?
Riccardo
Instructions: "You are looking for the file that was Last Modified on 2022-02-07 14:03; you can't cheat and look up the file number yourself."
I am planning on having the code identify the proper file, but even when checking manually, there are 102 files with this same Last-Modified timestamp.
Is this intended as part of the exercise?
Also would be nice to have some sort of solutions/answers available to check if we completed the exercise properly
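For what it's worth, here is a minimal sketch of how the "match files by listed timestamp" step could look. It assumes an Apache-style directory listing where each row has an anchor tag followed by a "YYYY-MM-DD HH:MM" timestamp; the function name and the regex are illustrative, not taken from the exercise code, and the pattern would need adjusting to the real page's HTML.

```python
import re

def find_files_modified_at(listing_html, timestamp):
    # Each row is assumed to look like:
    #   <a href="file.csv">file.csv</a>  2022-02-07 14:03  123K
    # Capture the href and the timestamp, then keep rows whose
    # timestamp matches exactly.
    pattern = re.compile(
        r'href="([^"]+)">[^<]*</a>\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2})'
    )
    return [name for name, ts in pattern.findall(listing_html)
            if ts == timestamp]
```

If 102 files really do share the timestamp, a function like this would return all of them, which suggests the instruction needs either a finer-grained timestamp or a tie-breaking rule.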
Hi,
there is an issue in this URI (the year "2220" looks like a typo), please fix it:
https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2220_Q1.zip
When I try to list the files in the S3 bucket, the console gives me "botocore.exceptions.NoCredentialsError: Unable to locate credentials".
Here is my code:
import boto3

def main():
    s3 = boto3.client("s3")
    # download_file writes straight to the local path and returns None,
    # so there is no response object to read afterwards
    s3.download_file("commoncrawl", "crawl-data/CC-MAIN-2022-05/wet.paths.gz", "wet.paths.gz")

if __name__ == "__main__":
    main()
Hey everyone, I think the page might have changed: I can't find a file modified in 2022 at the link provided. Could you confirm? I assume the web scraping is only supposed to be done on that page; I guess I could search all the links, but is that the objective?
Thanks!
When I try to download the links in the first exercise, the links return a 400 status.
On Exercise 2, when I try to run "docker build --tag=exercise-2 ." it gets to "Building wheel for pandas (pyproject.toml)..." and then hangs for 20+ minutes. Is this expected?
I tried running "pip install --upgrade pip" before building the image, as suggested here, but no luck.
I cancelled and then attempted again so you can see my terminal, but I had left it running for 20+ minutes previously.
I get a 403 error saying I don't have permission when sending a request to https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/.
How can I get permission? Has the URL changed?
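One guess worth trying, not confirmed as the cause: some servers return 403 when the request has no (or a script-like) User-Agent header. A browser-like header often gets past that. The helper names below are illustrative.

```python
import urllib.request

URL = "https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/"

def make_request(url):
    # Send a browser-like User-Agent; bare library defaults
    # (e.g. "Python-urllib/3.x") are sometimes rejected with 403.
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

def fetch(url):
    # Returns the raw response body as bytes.
    with urllib.request.urlopen(make_request(url)) as resp:
        return resp.read()
```

If the header makes no difference, the 403 is more likely server-side (the path moved or access was restricted), and the URL in the exercise would need updating.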