
beware-the-for-loop

a comparison of sequential vs parallel group-by data processing

Writing “loops” is a common programming practice. A loop repeats a block of code, typically once per item in a collection. Such a tool has many applications. I find myself frequently using for-loops, and in my job I see many customers using them. There are a lot of for-loops out there in the world.

Beware: there is an “insidious” type of for-loop, one that iterates through subsets of rows in a dataframe and independently processes each subset. For example, suppose one column in a dataframe is “geography”, which indicates the various locations of a retail company. A common use of a for-loop would be to iterate through each geography and process the data for each geography separately. There are many applications for such an approach. For example, we may want to train machine learning models that are specific to each geography.
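To make the pattern concrete, here is a minimal sketch of that kind of loop in Pandas. The dataframe, its column names, and the `summarize` function are all made up for illustration; `summarize` simply stands in for heavier per-geography work such as training a model.

```python
import pandas as pd

# Hypothetical retail data: column names and values are illustrative only.
sales = pd.DataFrame({
    "geography": ["east", "east", "west", "west", "west", "north"],
    "units": [10, 20, 5, 15, 25, 40],
})


def summarize(subset: pd.DataFrame) -> dict:
    """Stand-in for per-geography work such as training a model."""
    return {"rows": len(subset), "mean_units": float(subset["units"].mean())}


results = {}
# Each pass filters the full dataframe, and the next pass cannot start
# until the current one has finished.
for geo in sales["geography"].unique():
    subset = sales[sales["geography"] == geo]
    results[geo] = summarize(subset)

print(results)
```

Each subset is processed independently of the others, which is exactly what makes this kind of loop a good candidate for parallel processing.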

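But here’s the problem: a for-loop is a serial (not parallel) process. Each iteration waits for the previous one to finish, so the total run time grows with the number of subsets, and the loop itself cannot spread those subsets across additional cores or machines. Why does that matter? In our brave new world of Big Data, it’s safe to say that parallel processing is paramount.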

There is a common objection that I’ve heard to the idea of converting existing non-parallelized processing into something more “Sparkified”. When the customer or colleague is using Pandas, and knows that Pandas is not a distributed-computing package, they have little appetite for rewriting their existing code from Python + Pandas into Python + Spark (sans Pandas). Rest assured, using Pandas does not stand in the way of parallelizing your process. This is demonstrated in the accompanying code.

The accompanying code demonstrates the insidious type of for-loop (one that iterates through subsets of the data and independently processes each subset), while also demonstrating much faster alternative approaches, thanks to the parallel processing power of Spark.
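As a rough illustration of how existing Pandas logic might be kept while Spark handles the group-by, the sketch below wraps the per-geography work in a function and hands it to PySpark's `applyInPandas`. This is one possible approach under stated assumptions, not necessarily what the accompanying code does: the data, the column names, and the `process_group` and `run_with_spark` functions are hypothetical, and the Spark part needs `pyspark` and `pyarrow` installed.

```python
import pandas as pd


def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Plain Pandas logic for one geography; returns a one-row dataframe."""
    return pd.DataFrame({
        "geography": [pdf["geography"].iloc[0]],
        "mean_units": [float(pdf["units"].mean())],
    })


def run_with_spark(pandas_df: pd.DataFrame) -> pd.DataFrame:
    """Run process_group once per geography on Spark workers.

    Assumes pyspark and pyarrow are installed. The import is done lazily
    so that the Pandas-only part of this sketch runs without Spark.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(pandas_df)
    out = sdf.groupBy("geography").applyInPandas(
        process_group, schema="geography string, mean_units double"
    )
    return out.toPandas()


# Hypothetical retail data: column names and values are illustrative only.
sales = pd.DataFrame({
    "geography": ["east", "east", "west", "west", "west", "north"],
    "units": [10, 20, 5, 15, 25, 40],
})

# Local check without Spark: the same function applied group by group.
# This part is still serial; it only confirms the function's output shape.
local = pd.concat(
    [process_group(group) for _, group in sales.groupby("geography")],
    ignore_index=True,
)
print(local)
```

The point of the design is that `process_group` stays ordinary Pandas code. Only the thin wrapper that decides where each group runs changes, so calling `run_with_spark(sales)` on a machine with Spark available would apply the same function to each geography in parallel.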

Contributor: ogdendc
