LearningSparkScala

This repository is for learning purposes.

To keep Spark jobs from running out of memory:
  • Find the config template at /usr/local/Cellar/apache-spark/2.2.1/libexec/conf/spark-defaults.conf.template.
  • Rename it to spark-defaults.conf.
  • Uncomment spark.driver.memory and change its value as needed.
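
As a rough illustration of the steps above, the relevant line of the renamed spark-defaults.conf might look like the fragment below. The value 4g is only a placeholder assumption, not a setting taken from this repository; pick a size that fits your machine and data.

```properties
# spark-defaults.conf (renamed from spark-defaults.conf.template)
# Each setting is a key and a value separated by whitespace.
# 4g is a hypothetical value used for illustration only.
spark.driver.memory              4g
```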

📁 src/hello_world

Contains practice files for simple tasks such as counting words or finding a maximum value.

📄 WordCountBible.scala downloads the King James Version of the Bible from Project Gutenberg and counts every word except stop words (listed in the english_stop_words.txt file). It then prints the top 20 words as a markdown table:

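| word | count |
| --- | --- |
| lord | 7830 |
| god | 4442 |
| said | 3999 |
| upon | 2748 |
| man | 2613 |
| israel | 2565 |
| son | 2370 |
| king | 2257 |
| people | 2139 |
| came | 2093 |
| house | 2024 |
| come | 1971 |
| one | 1967 |
| children | 1802 |
| also | 1769 |
| day | 1734 |
| land | 1718 |
| men | 1653 |
| let | 1511 |
| go | 1492 |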
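
The counting step described for WordCountBible.scala can be sketched roughly as follows. This is not the repository's code: it uses plain Scala collections instead of Spark so that it runs without a cluster, and the names WordCountSketch, topWords, and toMarkdown are hypothetical. It also assumes words are lowercased and split on non-word characters, which the description does not state.

```scala
object WordCountSketch {
  // Lowercase each line, split on non-word characters (an assumption),
  // drop stop words, and return the n most frequent words with counts.
  // Ties are broken alphabetically so the order is deterministic.
  def topWords(lines: Seq[String], stopWords: Set[String], n: Int): Seq[(String, Int)] =
    lines
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) }
      .take(n)

  // Render the (word, count) rows as a markdown table.
  def toMarkdown(rows: Seq[(String, Int)]): String =
    (Seq("| word | count |", "| --- | --- |") ++
      rows.map { case (word, count) => s"| $word | $count |" }).mkString("\n")
}
```

On a Spark RDD the same shape would typically be expressed with flatMap, filter, and reduceByKey; the grouping above is the local-collection stand-in for that reduction.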

📄 Precipitation.scala reads data from the 1800.csv file (not included in the repository), which looks like this:

| station_id | date (YYYYMMDD) | data type | amount | ? |
| --- | --- | --- | --- | --- |
| ITE00100554 | 18000101 | TMAX | -75 | E |
| ITE00100554 | 18000101 | TMIN | -148 | E |
| GM000010962 | 18000101 | PRCP | 0 | E |
| EZE00100082 | 18000101 | TMAX | -86 | E |

It then finds the maximum precipitation amount and prints the weather station ID together with its precipitation value.

GM000010962 max precipitation: 305.00
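
A minimal sketch of that lookup, written with plain Scala collections rather than Spark. It assumes comma-separated lines with the station ID in the first field, the data type in the third, and the amount in the fourth, as in the sample rows above. PrecipitationSketch and its methods are hypothetical names, and the amount is printed as read, without any unit conversion.

```scala
object PrecipitationSketch {
  // Parse one line into (stationId, amount) if it is a PRCP entry.
  // Lines that are too short or have a non-numeric amount are skipped.
  def parsePrcp(line: String): Option[(String, Double)] = {
    val fields = line.split(",")
    if (fields.length >= 4 && fields(2) == "PRCP")
      scala.util.Try(fields(3).trim.toDouble).toOption.map(amount => (fields(0), amount))
    else None
  }

  // Keep the PRCP entry with the largest amount, if there is any.
  def maxPrecipitation(lines: Seq[String]): Option[(String, Double)] =
    lines
      .flatMap(line => parsePrcp(line))
      .reduceOption((a, b) => if (a._2 >= b._2) a else b)

  // Format the result the way the sample output above is formatted.
  def describe(result: (String, Double)): String =
    "%s max precipitation: %.2f".formatLocal(java.util.Locale.ROOT, result._1, result._2)
}
```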

📄 CustomerOrders.scala reads data from the customer-orders.csv file:

| customer_id | product_id | amount spent |
| --- | --- | --- |
| 44 | 8602 | 37.19 |

It then sums the amount spent by each customer and prints each customer with their total.

Customer 45 spent 3309.3804
Customer 79 spent 3790.5698
Customer 96 spent 3924.23
...
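
The per-customer aggregation can be sketched along these lines, again with plain Scala collections instead of Spark. It assumes comma-separated lines with exactly three fields and no header row. CustomerOrdersSketch and totalsByCustomer are hypothetical names, and sorting by ascending total is an assumption suggested by the sample output. The sketch uses Double, so printed totals may differ slightly in the last digits from the output above.

```scala
object CustomerOrdersSketch {
  // Sum the third field (amount spent) per customer ID (first field).
  // Lines that do not have exactly three fields are ignored.
  def totalsByCustomer(lines: Seq[String]): Seq[(Int, Double)] =
    lines
      .map(_.split(","))
      .collect { case Array(customer, _, amount) =>
        (customer.trim.toInt, amount.trim.toDouble)
      }
      .groupBy(_._1)
      .map { case (customer, orders) => (customer, orders.map(_._2).sum) }
      .toSeq
      .sortBy(_._2) // ascending by total spent (assumed ordering)
}
```

Each resulting pair could then be printed with something like `println(s"Customer $id spent $total")`.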

📁 src/MovieRecommendation

Contains files that work with the MovieLens data sets.

📄 PopularMovies.scala counts how many times each movie was rated, treating the number of ratings as a measure of popularity. It reads the ratings.csv file and prints the top 20 movies by ID only, without movie names.

📄 PopularMoviesNameMapping.scala does the same as PopularMovies.scala but maps movie IDs to their titles in the results. It prints the top 20 movies.

Result (using ml-latest dataset):

| Movie | Number of Ratings |
| --- | --- |
| Forrest Gump (1994) | 91921 |
| Shawshank Redemption, The (1994) | 91082 |
| Pulp Fiction (1994) | 87901 |
| Silence of the Lambs, The (1991) | 84078 |
| Matrix, The (1999) | 77960 |
| Star Wars: Episode IV - A New Hope (1977) | 77045 |
| Jurassic Park (1993) | 74355 |
| Schindler's List (1993) | 67662 |
| Braveheart (1995) | 66512 |
| Toy Story (1995) | 66008 |
| Star Wars: Episode VI - Return of the Jedi (1983) | 62714 |
| Terminator 2: Judgment Day (1991) | 61836 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | 61672 |
| Fight Club (1999) | 60024 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | 59693 |
| Usual Suspects, The (1995) | 59271 |
| American Beauty (1999) | 57879 |
| Apollo 13 (1995) | 57416 |
| Independence Day (a.k.a. ID4) (1996) | 57232 |
| Godfather, The (1972) | 57070 |
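
Counting ratings per movie and then attaching titles can be sketched as below. This is a local-collections illustration, not the repository's Spark code. PopularMoviesSketch, ratingCounts, and topMovies are hypothetical names, and ratings are assumed to be already parsed into (userId, movieId, rating) tuples.

```scala
object PopularMoviesSketch {
  // Count how many ratings each movie ID received.
  def ratingCounts(ratings: Seq[(Int, Int, Double)]): Map[Int, Int] =
    ratings.groupBy(_._2).map { case (movieId, rs) => (movieId, rs.size) }

  // Take the n most-rated movies and replace each ID with its title.
  // IDs missing from the title map fall back to a placeholder label.
  def topMovies(
      ratings: Seq[(Int, Int, Double)],
      titles: Map[Int, String],
      n: Int
  ): Seq[(String, Int)] =
    ratingCounts(ratings).toSeq
      .sortBy { case (movieId, count) => (-count, movieId) }
      .take(n)
      .map { case (movieId, count) => (titles.getOrElse(movieId, s"movie $movieId"), count) }
}
```

In a Spark job, a small ID-to-title map like this is commonly shared with the executors as a broadcast variable; whether the repository does so is not stated here.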

📄 MovieSimilarities.scala finds similar movies in the MovieLens data set. It uses cosine similarity to score how alike two movies are, based on the ratings users gave them. Result:

Top 10 similar movies for Life Is Beautiful (La Vita è bella) (1997)

| Movie | score | strength |
| --- | --- | --- |
| Good Will Hunting (1997) | 0.976235418147676 | 53 |
| Shawshank Redemption, The (1994) | 0.9749373116159256 | 69 |
| Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) | 0.9738613674727673 | 51 |
| Shrek (2001) | 0.9716221218860066 | 52 |
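
For two movies A and B, cosine similarity over the users who rated both is the sum of the products of their ratings divided by the product of the two rating-vector norms. A minimal sketch of that formula follows. CosineSimilaritySketch and cosine are hypothetical names, and reading "strength" as the number of users who rated both movies is an assumption about the table above, not something the README states.

```scala
object CosineSimilaritySketch {
  // ratingPairs holds, for each user who rated both movies,
  // the pair (rating for movie A, rating for movie B).
  // Returns (score, strength), where strength is the number of pairs.
  def cosine(ratingPairs: Seq[(Double, Double)]): (Double, Int) = {
    val dot = ratingPairs.map { case (a, b) => a * b }.sum
    val normA = math.sqrt(ratingPairs.map { case (a, _) => a * a }.sum)
    val normB = math.sqrt(ratingPairs.map { case (_, b) => b * b }.sum)
    val denominator = normA * normB
    // Guard against division by zero when there are no pairs or all ratings are 0.
    val score = if (denominator == 0.0) 0.0 else dot / denominator
    (score, ratingPairs.size)
  }
}
```

Because ratings are non-negative, scores fall between 0 and 1, and pairs backed by only a few shared raters can score deceptively high, which is why a strength threshold is often applied alongside the score.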

📁 src/MarvelSuperheroSocial

Works with the Marvel data sets from the course "Apache Spark with Scala" by Frank Kane.
