
coding-challenge-pyspark's Introduction

Coding Challenge Solution - Pyspark

These challenges assume Spark is already installed. They use the findspark module, listed in the requirements.txt file, to locate that installation.
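As a minimal sketch of how findspark is typically used with an existing Spark installation: the helper name `get_spark` and the application name are hypothetical and do not come from the repository.

```python
def get_spark(app_name="coding-challenge-pyspark"):
    """Return a SparkSession backed by the locally installed Spark.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Imported inside the function so this module can still be loaded
    # on a machine where the packages are not installed yet.
    import findspark

    # Locates the Spark installation (for example via SPARK_HOME)
    # and adds its pyspark package to sys.path.
    findspark.init()

    # pyspark is only importable after findspark.init() has run.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()
```

A job would then start with `spark = get_spark()` and stop the session with `spark.stop()` when it finishes.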

dictionary.py

This job reads the whole dataset folder and builds an index of unique words, each with an associated ID. The resulting dictionary is saved as a Parquet file.
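The job described above could be sketched as follows. This is not the repository's code: the function name, the column names `word` and `wordId`, the tokenization rule, and the use of `monotonically_increasing_id` for the IDs are all assumptions made for illustration.

```python
# Assumed tokenization rule: anything that is not a lowercase letter or a
# digit separates two words. The real job may split text differently.
WORD_SEPARATOR = r"[^a-z0-9]+"


def build_dictionary(spark, input_dir, output_path):
    """Write a (word, wordId) dictionary for every file under input_dir."""
    # Imported here so the module loads even where pyspark is missing.
    from pyspark.sql import functions as F

    # One row per line of text, in a single column named "value".
    lines = spark.read.text(input_dir)

    words = (
        lines.select(
            F.explode(
                F.split(F.lower(F.col("value")), WORD_SEPARATOR)
            ).alias("word")
        )
        .filter(F.col("word") != "")  # split() can produce empty strings
        .distinct()                   # keep each word only once
    )

    # IDs are unique but not consecutive (assumption: that is acceptable).
    dictionary = words.withColumn("wordId", F.monotonically_increasing_id())
    dictionary.write.mode("overwrite").parquet(output_path)

    # Read back what was written, so callers see the stored IDs.
    return spark.read.parquet(output_path)
```

If consecutive IDs are needed, `row_number()` over an ordered window is a common alternative, at the cost of moving all words through a single partition.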

index.py

This job creates a dataframe of words and the ID of the document where each one appears. The words in this dataframe are not unique: repetitions are kept on purpose, because they are needed to recover the full list of documents that contain each word.

After the full read is done, we join this dataframe with the dictionary from the previous job to obtain a list of (wordId, docId) pairs.

Finally, we group those pairs by wordId and collect the docId values into a list, so that each word has a single entry together with the full list of documents where it appears.
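The three steps above (read, join, collect) might look like the sketch below. It is an illustration under assumptions, not the repository's implementation: the function name, the column names, the tokenization rule, and the use of the input file path as the docId are all hypothetical.

```python
# Assumed tokenization rule; it must match the one used for the dictionary.
WORD_SEPARATOR = r"[^a-z0-9]+"


def build_inverted_index(spark, input_dir, dictionary_path):
    """Return one row per wordId with the documents that contain it."""
    # Imported here so the module loads even where pyspark is missing.
    from pyspark.sql import functions as F

    # Step 1: (word, docId) rows, repetitions included.
    # Assumption: the source file path serves as the document ID.
    lines = spark.read.text(input_dir).withColumn("docId", F.input_file_name())
    words = lines.select(
        "docId",
        F.explode(
            F.split(F.lower(F.col("value")), WORD_SEPARATOR)
        ).alias("word"),
    ).filter(F.col("word") != "")

    # Step 2: join with the dictionary to replace each word by its ID.
    dictionary = spark.read.parquet(dictionary_path)
    pairs = words.join(dictionary, on="word", how="inner").select(
        "wordId", "docId"
    )

    # Step 3: one row per wordId. collect_set drops duplicate docIds;
    # collect_list would keep one entry per occurrence instead.
    return pairs.groupBy("wordId").agg(
        F.sort_array(F.collect_set("docId")).alias("docIds")
    )
```

Choosing `collect_set` here is a judgment call: it gives each document at most once per word, which matches the goal of unique entries, while `collect_list` would preserve every repetition.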

coding-challenge-pyspark's People

Contributors

rubenwap
