danielbeach / data-engineering-practice
Data Engineering Practice Problems
Hey Daniel, I've been loving this so far, thanks for putting it together! I finally made it to Exercise 6, but when I run "docker-compose up run" I get the output below and the Docker container won't start. I've never used PySpark before, so I have no idea how to troubleshoot this.
s\GitHub\data-engineering-practice\Exercises\Exercise-6> docker-compose up run
[+] Running 1/0
- Container exercise-6-run-1 Created 0.0s
Attaching to exercise-6-run-1
exercise-6-run-1 | WARNING: An illegal reflective access operation has occurred
exercise-6-run-1 | WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
exercise-6-run-1 | WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
exercise-6-run-1 | WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
exercise-6-run-1 | WARNING: All illegal access operations will be denied in a future release
exercise-6-run-1 | 22/10/13 01:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
exercise-6-run-1 | Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkContext: Running Spark version 3.0.1
exercise-6-run-1 | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1 | 22/10/13 01:03:55 INFO ResourceUtils: Resources for spark.driver:
exercise-6-run-1 |
exercise-6-run-1 | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkContext: Submitted application: Exercise6
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls to: root
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls to: root
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls groups to:
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls groups to:
exercise-6-run-1 | 22/10/13 01:03:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
exercise-6-run-1 | 22/10/13 01:03:55 INFO Utils: Successfully started service 'sparkDriver' on port 36347.
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering MapOutputTracker
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMaster
exercise-6-run-1 | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
exercise-6-run-1 | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
exercise-6-run-1 | 22/10/13 01:03:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-72b58301-91c5-4d58-b06a-c81c8deea2bc
exercise-6-run-1 | 22/10/13 01:03:55 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
exercise-6-run-1 | 22/10/13 01:03:55 INFO SparkEnv: Registering OutputCommitCoordinator
exercise-6-run-1 | 22/10/13 01:03:56 INFO Utils: Successfully started service 'SparkUI' on port 4040.
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://06fa963d87c7:4040
exercise-6-run-1 | 22/10/13 01:03:56 INFO Executor: Starting executor ID driver on host 06fa963d87c7
exercise-6-run-1 | 22/10/13 01:03:56 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37915.
exercise-6-run-1 | 22/10/13 01:03:56 INFO NettyBlockTransferService: Server created on 06fa963d87c7:37915
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMasterEndpoint: Registering block manager 06fa963d87c7:37915 with 434.4 MiB RAM, BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1 | 22/10/13 01:03:56 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/app/spark-warehouse').
exercise-6-run-1 | 22/10/13 01:03:56 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
exercise-6-run-1 | ['Divvy_Trips_2019_Q4.zip', 'Divvy_Trips_2020_Q1.zip']
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkContext: Invoking stop() from shutdown hook
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkUI: Stopped Spark web UI at http://06fa963d87c7:4040
exercise-6-run-1 | 22/10/13 01:03:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
exercise-6-run-1 | 22/10/13 01:03:56 INFO MemoryStore: MemoryStore cleared
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManager: BlockManager stopped
exercise-6-run-1 | 22/10/13 01:03:56 INFO BlockManagerMaster: BlockManagerMaster stopped
exercise-6-run-1 | 22/10/13 01:03:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
exercise-6-run-1 | 22/10/13 01:03:56 INFO SparkContext: Successfully stopped SparkContext
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Shutdown hook called
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0/pyspark-12e6df8c-60f3-4687-8d09-651ece254651
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0
exercise-6-run-1 | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-7da8b58f-f65f-4ac9-beb2-c4128e23bb07
exercise-6-run-1 exited with code 0
PS E:\Documents\GitHub\data-engineering-practice\Exercises\Exercise-6>
Hi,
maybe it would also be a good idea to add a section on data orchestrators (Dagster, Prefect, Mage, Airflow, etc.).
Understanding and debugging an orchestrated data pipeline is a crucial part of data engineering.
What do you think?
Riccardo
Instructions: "You are looking for the file that was Last Modified on 2022-02-07 14:03; you can't cheat and look up the file number yourself."
I am planning on having the code identify the proper file, but even when checking manually, there are 102 files with this same Last-Modified timestamp.
Is this intended as part of the exercise?
Also would be nice to have some sort of solutions/answers available to check if we completed the exercise properly
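For what it's worth, here is a minimal sketch of how the "match files by listed timestamp" step could look. It assumes an Apache-style directory listing where each row has an anchor tag followed by a "YYYY-MM-DD HH:MM" timestamp; the function name and the regex are illustrative, not taken from the exercise code, and the pattern would need adjusting to the real page's HTML.

```python
import re

def find_files_modified_at(listing_html, timestamp):
    # Each row is assumed to look like:
    #   <a href="file.csv">file.csv</a>  2022-02-07 14:03  123K
    # Capture the href and the timestamp, then keep rows whose
    # timestamp matches exactly.
    pattern = re.compile(
        r'href="([^"]+)">[^<]*</a>\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2})'
    )
    return [name for name, ts in pattern.findall(listing_html)
            if ts == timestamp]
```

If 102 files really do share the timestamp, a function like this would return all of them, which suggests the instruction needs either a finer-grained timestamp or a tie-breaking rule.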
Hi,
there is an issue in this URI (the year "2220" looks like a typo), please fix it:
https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2220_Q1.zip
When I try to list the files in the S3 bucket, the console gives me "botocore.exceptions.NoCredentialsError: Unable to locate credentials".
Here is my code:
import boto3

def main():
    s3 = boto3.client("s3")
    # download_file writes straight to the local path and returns None,
    # so there is no response object to read afterwards
    s3.download_file("commoncrawl", "crawl-data/CC-MAIN-2022-05/wet.paths.gz", "wet.paths.gz")

if __name__ == "__main__":
    main()
Hey everyone, I think the page might have changed: I can't find a file modified in 2022 at the link provided. Could you confirm? I assume the web scraping is only supposed to be done on that page; I guess I could search all the links, but is that the objective?
Thanks!
When I try to download the links in the first exercise, the links return a 400 status.
On Exercise 2, when I try to run "docker build --tag=exercise-2 ." it gets to "Building wheel for pandas (pyproject.toml)..." and then hangs for 20+ minutes. Is this expected?
I tried running "pip install --upgrade pip" before building the image, as suggested here, but no luck.
I cancelled and then attempted again so you can see my terminal, but I had left it running for 20+ minutes previously.
I get a 403 error saying I don't have permission when sending a request to https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/.
How can I get permission? Has the URL changed?
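One guess worth trying, not confirmed as the cause: some servers return 403 when the request has no (or a script-like) User-Agent header. A browser-like header often gets past that. The helper names below are illustrative.

```python
import urllib.request

URL = "https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/"

def make_request(url):
    # Send a browser-like User-Agent; bare library defaults
    # (e.g. "Python-urllib/3.x") are sometimes rejected with 403.
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

def fetch(url):
    # Returns the raw response body as bytes.
    with urllib.request.urlopen(make_request(url)) as resp:
        return resp.read()
```

If the header makes no difference, the 403 is more likely server-side (the path moved or access was restricted), and the URL in the exercise would need updating.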