- Spark is a platform for cluster computing: it lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer)
- Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data
- As each node works on its own subset of the total data, it also carries out a part of the total calculations, so both data processing and computation are performed in parallel across the nodes in the cluster
- Parallel computation can make certain types of programming tasks much faster
- However, with greater computing power comes greater complexity
- Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:
  - Is my data too big to work with on a single machine?
  - Can my calculations be easily parallelized?
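The second question is the key one: a calculation parallelizes easily when the data can be split into partitions, each partition processed independently, and the partial results combined at the end. The sketch below illustrates that pattern locally with Python's `concurrent.futures` as an analogy for what Spark does across cluster nodes; it is not Spark code, and the function names (`partial_sum`, `chunked`) are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each "node" computes a partial result over its own partition of the data
    return sum(x * x for x in chunk)

def chunked(data, n):
    # Split the data into n roughly equal partitions
    k, m = divmod(len(data), n)
    start = 0
    for i in range(n):
        size = k + (1 if i < m else 0)
        yield data[start:start + size]
        start += size

if __name__ == "__main__":
    data = list(range(1, 101))
    # Four workers stand in for four cluster nodes, each handling one partition
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(partial_sum, chunked(data, 4)))
    # Combine the partial results into the final answer
    total = sum(partials)
    print(total)
```

In Spark the same split/compute/combine structure is expressed through its own API, with the framework handling the partitioning and distribution for you; a sum with a sequential dependency between steps, by contrast, would not split cleanly like this.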
This repository (isaacmwendwa/big-data-with-pyspark) contains the materials (code & theory) I compiled while undertaking DataCamp's Big Data with PySpark Learning Track