read-big-file-with-python

The first in what was originally going to be a three-part case study (since extended to five parts) on reading a big (21 GB) text file using C, Python, PySpark, Spark-Scala and Athena/Glue. You can see the other parts here:

part2
part3
part4
part5

This part compares a C program with a Python program.

A lot of the work I do involves dealing with largish data files from external data providers and trying to shoehorn their data into our system. One such file comprises company ownership data that we download from the Standard & Poor's CapitalIQ data service. We use the file to calculate free floats for stocks, and this in turn feeds into whether or not a stock can become a member of the various equity indexes that we calculate.

The data file is about 21 gigabytes in size and holds approximately 366 million pipe-separated records. The first 10 records are shown below:

18511|1|2587198|2004-03-31|0|100000|0|1.97|0.49988|100000||||
18511|2|2587198|2004-06-30|0|160000|0|3.2|0.79669|60000|60|||
18511|3|2587198|2004-09-30|0|160000|0|2.17|0.79279|0|0|||
18511|4|2587198|2004-09-30|0|160000|0|1.72|0.79118|0|0|||
18511|5|2587198|2005-03-31|0|0|0|0|0|-160000|-100|||19
18511|6|2587940|2004-03-31|0|240000|0|0.78|0.27327|240000||||
18511|7|2587940|2004-06-30|0|560000|0|1.59|0.63576|320000|133.33||24|
18511|8|2587940|2004-09-30|0|560000|0|1.13|0.50704|0|0|||
18511|9|2587940|2004-09-30|0|560000|0|0.96|0.50704|0|0|||
18511|10|2587940|2005-03-31|0|0|0|0|0|-560000|-100|||14
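
Only the second field of each record matters for the split described next, so a record never has to be parsed in full. The snippet below is a minimal sketch of that idea, not the original program; `period_of` is a name made up here for illustration, and it assumes every record has at least two pipe-separated fields.

```python
# The first three sample records from the listing above.
SAMPLE = [
    "18511|1|2587198|2004-03-31|0|100000|0|1.97|0.49988|100000||||",
    "18511|2|2587198|2004-06-30|0|160000|0|3.2|0.79669|60000|60|||",
    "18511|3|2587198|2004-09-30|0|160000|0|2.17|0.79279|0|0|||",
]


def period_of(record: str) -> int:
    """Return the second pipe-separated field of a record as an integer."""
    # maxsplit=2 stops splitting after the second delimiter, so the
    # remaining fields are left untouched as one trailing string.
    return int(record.split("|", 2)[1])


periods = [period_of(r) for r in SAMPLE]
print(periods)  # [1, 2, 3]
```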

The second field in the above file can range between 1 and 56, and the goal was to split up the original file so that all the records with the same value for the second field would be grouped together in the same file, i.e. we would end up with 56 separate files, period1.txt, period2.txt ... period56.txt, each containing approximately 6 million records.

Performance is quite critical, so I wrote a C program to run on our HP OpenVMS Alpha server (consisting of dual 1.33 GHz Itanium processors and 32 GB RAM). I'm not a C expert, but I was a bit shocked to discover the program was taking about 54 minutes to run on a quiet system.

Anyhow, some time later I started to get into Python and, as we all know, Python is slow, isn't it? So there wasn't much point in trying to rewrite my C code in Python and run it on my desktop PC, was there? Well, at a loose end one day I decided to give it a go, more as a learning exercise for me than with any expectation of producing something that would be faster than my C code, albeit on a different platform. Needless to say, I was amazed when my Python run came in at 1033 seconds elapsed time. That's just over 17 minutes, or roughly two thirds less than the 54 minutes (3240 seconds) my C program took. I know we're not quite comparing like with like, but make of it what you want.
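
For readers who want to try something similar, here is a minimal Python sketch of the split described above. It is not the program that was timed; the function name `split_by_period`, its parameters, and the 1 MB buffer size are assumptions made for illustration. It reads the file in binary mode to skip text decoding, keeps one buffered output file open per distinct second-field value, and assumes every line has at least two pipe-separated fields.

```python
import os
from contextlib import ExitStack


def split_by_period(src_path, out_dir, buffer_size=1 << 20):
    """Copy each record to out_dir/period<N>.txt, where N is its second field.

    Returns a dict mapping each period number to its record count.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs = {}  # raw second field (bytes) -> open output file
    counts = {}   # raw second field (bytes) -> records written so far
    # ExitStack closes the source and every output file on exit,
    # including when an exception is raised part-way through.
    with ExitStack() as stack:
        src = stack.enter_context(open(src_path, "rb", buffering=buffer_size))
        for line in src:
            # Split off only the first two fields; the rest stays unparsed.
            key = line.split(b"|", 2)[1]
            out = outputs.get(key)
            if out is None:
                # First record seen for this period: open its file lazily.
                name = "period" + key.decode("ascii") + ".txt"
                out = stack.enter_context(
                    open(os.path.join(out_dir, name), "wb", buffering=buffer_size)
                )
                outputs[key] = out
                counts[key] = 0
            out.write(line)
            counts[key] += 1
    return {int(k): v for k, v in counts.items()}
```

A call such as `split_by_period("ownership.txt", "out")` (both names hypothetical) would leave one file per period in the `out` directory. With only 56 distinct values, keeping every output file open at once stays well within typical operating system limits on open files.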

PS. Out of interest, I downloaded a copy of Visual Studio and coded a simplified version of my C program, just using fgets to read the file line by line and sscanf to get at the second field. I ran the program on the same data set and on the same PC that my Python program ran on, and it took just shy of 1 hour to process the file. Replacing the fgets and sscanf calls with a single call to fscanf made no difference to the run time either.

Contributors

taupirho
