data-engineering-practice's People

Contributors

cclauss, danielbeach

data-engineering-practice's Issues

Exercise-6 docker image won't start

Hey Daniel, I've been loving this so far, thanks for putting it together! I finally made it to exercise 6 but when I run "docker-compose up run" I get this error message (see below) and the docker container won't start. I've never used pyspark before, so I have no idea how I could troubleshoot this.

s\GitHub\data-engineering-practice\Exercises\Exercise-6> docker-compose up run
[+] Running 1/0
 - Container exercise-6-run-1  Created                                                                             0.0s
Attaching to exercise-6-run-1
exercise-6-run-1  | WARNING: An illegal reflective access operation has occurred
exercise-6-run-1  | WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
exercise-6-run-1  | WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
exercise-6-run-1  | WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
exercise-6-run-1  | WARNING: All illegal access operations will be denied in a future release
exercise-6-run-1  | 22/10/13 01:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
exercise-6-run-1  | Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkContext: Running Spark version 3.0.1
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: Resources for spark.driver:
exercise-6-run-1  |
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkContext: Submitted application: Exercise6
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls to: root
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls to: root
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls groups to:
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls groups to:
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
exercise-6-run-1  | 22/10/13 01:03:55 INFO Utils: Successfully started service 'sparkDriver' on port 36347.
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering MapOutputTracker
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMaster
exercise-6-run-1  | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
exercise-6-run-1  | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
exercise-6-run-1  | 22/10/13 01:03:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-72b58301-91c5-4d58-b06a-c81c8deea2bc
exercise-6-run-1  | 22/10/13 01:03:55 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering OutputCommitCoordinator
exercise-6-run-1  | 22/10/13 01:03:56 INFO Utils: Successfully started service 'SparkUI' on port 4040.
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://06fa963d87c7:4040
exercise-6-run-1  | 22/10/13 01:03:56 INFO Executor: Starting executor ID driver on host 06fa963d87c7
exercise-6-run-1  | 22/10/13 01:03:56 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37915.
exercise-6-run-1  | 22/10/13 01:03:56 INFO NettyBlockTransferService: Server created on 06fa963d87c7:37915
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMasterEndpoint: Registering block manager 06fa963d87c7:37915 with 434.4 MiB RAM, BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/app/spark-warehouse').
exercise-6-run-1  | 22/10/13 01:03:56 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
exercise-6-run-1  | ['Divvy_Trips_2019_Q4.zip', 'Divvy_Trips_2020_Q1.zip']
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkContext: Invoking stop() from shutdown hook
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkUI: Stopped Spark web UI at http://06fa963d87c7:4040
exercise-6-run-1  | 22/10/13 01:03:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
exercise-6-run-1  | 22/10/13 01:03:56 INFO MemoryStore: MemoryStore cleared
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: BlockManager stopped
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: BlockManagerMaster stopped
exercise-6-run-1  | 22/10/13 01:03:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkContext: Successfully stopped SparkContext
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Shutdown hook called
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0/pyspark-12e6df8c-60f3-4687-8d09-651ece254651
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-7da8b58f-f65f-4ac9-beb2-c4128e23bb07
exercise-6-run-1 exited with code 0
PS E:\Documents\GitHub\data-engineering-practice\Exercises\Exercise-6>
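
For what it's worth, the log above ends with "exited with code 0" and the script's own output (the list of the two zip files) appears just before Spark shuts down, which suggests the container did start and the script ran to completion. A minimal sketch of a possible next step, assuming the archives contain CSV files (the helper name `csv_headers_in_zip` is hypothetical, and this uses only the standard library rather than Spark):

```python
import csv
import io
import zipfile


def csv_headers_in_zip(zip_source):
    """Return {member_name: header_row} for each .csv file inside a zip archive.

    zip_source may be a path or a file-like object.
    """
    headers = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV member
            with archive.open(name) as raw:
                # Wrap the binary stream so csv.reader receives text
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                headers[name] = next(csv.reader(text), [])
    return headers
```

Calling it on each file name printed in the log would show the column names, which is usually the first thing needed before building a Spark DataFrame from the data.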

Practice with Orchestrator

Hi

Maybe it would be a good idea to also add a section on data orchestrators (Dagster, Prefect, Mage, Airflow, etc.).
Understanding and debugging a data pipeline is a crucial part of a Data Engineer's job.

What do you think?

Riccardo

Exercise 2 - 102 files matching last updated ts of `2022-02-07 14:03`

Instructions: You are looking for the file that was Last Modified on 2022-02-07 14:03, you can't cheat and lookup the file number yourself.

I am planning on having the code identify the proper file, but even when checking manually, there are 102 files with this same last updated timestamp:
[image]

Is this intended as part of the exercise?

It would also be nice to have some solutions or answers available, so we can check whether we completed the exercise properly.
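
A minimal sketch of how the code could collect every file that shares the target timestamp, instead of assuming there is only one. It assumes the page is an HTML table in which each row holds a link in one cell and the last-modified time in another. The names `ListingParser` and `files_modified_at` are hypothetical, and the real page layout may differ:

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect (href, [cell texts]) for every table row that contains a link."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None   # cell texts of the row being parsed
        self._href = None    # first link found in the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")
        elif tag == "a" and self._in_cell and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            if self._href:
                self.rows.append((self._href, [c.strip() for c in self._cells]))
            self._cells = None
            self._in_cell = False


def files_modified_at(html, timestamp):
    """Return the hrefs of all rows that have a cell equal to the timestamp."""
    parser = ListingParser()
    parser.feed(html)
    return [href for href, cells in parser.rows if timestamp in cells]
```

If this returns more than one name, as in the 102-file case described above, the caller still has to decide which match to use; the sketch only makes the ambiguity visible.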

Unable to connect to the Postgres DB

I ran the commands `docker build --tag=exercise-5 .` and `docker-compose up run` and I get this error:
[Screenshot: Screen Shot 2022-03-04 at 2 23 10 pm]

I noticed when I check the logs for the exercise-5-postgres-1 container it says this:

[Screenshot: Screen Shot 2022-03-04 at 2 24 37 pm]

Exercise-3 "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

When I try to print the list of files from the S3 bucket, the console reports "botocore.exceptions.NoCredentialsError: Unable to locate credentials".

Here is my code:

import boto3

def main():
    s3 = boto3.client('s3')
    paquete = s3.download_file('commoncrawl', 'crawl-data/CC-MAIN-2022-05/wet.paths.gz', 'wet.paths.gz')
    paquete.content

if __name__ == "__main__":
    main()
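
`NoCredentialsError` is raised when boto3 cannot find any AWS credentials. One of the places it looks is the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, and when the code runs inside Docker those presumably have to be passed into the container. Separately, `download_file` writes to the local path and returns `None`, so `paquete.content` would fail even with credentials in place. A minimal sketch under those assumptions (the helper names are hypothetical, and `download_and_peek` is not run here because it needs boto3, credentials, and network access):

```python
import gzip
import os


def has_env_credentials(env=None):
    """Return True if both standard AWS credential variables are set."""
    env = os.environ if env is None else env
    return bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))


def first_line_of_gzip(path):
    """Return the first line of a gzip-compressed text file, without the newline."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")


def download_and_peek():
    """Download the paths file from the issue and print its first entry."""
    import boto3  # imported here so the helpers above work without boto3

    if not has_env_credentials():
        # boto3 also checks other locations, such as ~/.aws/credentials
        print("No AWS credentials found in the environment.")
    s3 = boto3.client("s3")
    # download_file saves the object to the local path; it returns None
    s3.download_file("commoncrawl", "crawl-data/CC-MAIN-2022-05/wet.paths.gz", "wet.paths.gz")
    print(first_line_of_gzip("wet.paths.gz"))
```

Calling `download_and_peek()` from `main()` would replace the `paquete.content` line; the downloaded file is read from disk instead of from a return value.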

Exercise 2-- No files last modified in 2022

Hey everyone, I think the page might have changed: I can't find a file modified in 2022 at the link provided. Could you confirm? I assume the web scraping is only supposed to be done on that link. I guess I could search all the links, but would that be the objective?
Thanks!

Exercise-2 docker build hanging on building wheel for pandas (pyproject.toml)

On exercise 2 when I try to run docker build --tag=exercise-2 . it gets to Building wheel for pandas (pyproject.toml)... then hangs for 20+ minutes. Is this expected?

I tried upgrading pip (pip install --upgrade pip) before building the image, as suggested here, but no luck.

I cancelled and then attempted again so you can see my terminal, but I left it to run for 20+ minutes previously.
[image]
