
2018_Metz

Context

You have a big file and you want to extract information from it and correlate that information with third-party services. You get a new file every 5 minutes.

Processing all of that in a single process will take too much time, so the file has to be split into pieces that can be processed in parallel.

The file is plain text, so you can read it easily, but its content is made of multiline blocks.

Use the validate.sh script to make sure the files you generate are the same as the source files.

Step 0

Clone this repo:

git clone https://github.com/Rafiot/2018_Metz.git

Get the dataset:

./get_dataset.sh

Step 1

Figure out a separator, then write a program that splits the file into 10 independent files of roughly the same size.

Tools required:

  • vim (look at the file -> find a separator)
  • grep (figure out how many entries we have)
  • wc (count the amount of blocks)
  • bc (compute things -> amount of blocks per file)

Write some code to do that.
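
For example, a first pass could look like the sketch below. The file name, the separator and the output names are assumptions: use the separator you actually found with vim, and check the counts against what grep, wc and bc gave you.

    separator = '--------\n'            # hypothetical separator found with vim

    f = open('dataset.txt')             # hypothetical name of the source file
    blocks = f.read().split(separator)
    f.close()

    blocks_per_file = len(blocks) // 10 + 1   # same-ish size, 10 files

    for i in range(10):
        start = i * blocks_per_file
        end = (i + 1) * blocks_per_file
        out = open('out_' + str(i) + '.txt', 'w')
        out.write(separator.join(blocks[start:end]))
        if end < len(blocks):
            out.write(separator)   # keep the boundary separator, or validate.sh will fail
        out.close()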

Step 2

Rewrite it, but better:

  • function with parameters (source_file_name, separator, output_name)
  • make it a script (see __main__, __name__)
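
A sketch using the parameter names from the list above (the values passed in the __main__ block are still placeholders):

    def file_split(source_file_name, separator, output_name):
        f = open(source_file_name)
        blocks = f.read().split(separator)
        f.close()
        blocks_per_file = len(blocks) // 10 + 1
        for i in range(10):
            start = i * blocks_per_file
            end = (i + 1) * blocks_per_file
            out = open(output_name + '_' + str(i) + '.txt', 'w')
            out.write(separator.join(blocks[start:end]))
            if end < len(blocks):
                out.write(separator)   # boundary separator, for validate.sh
            out.close()


    if __name__ == '__main__':
        # this block runs when the file is executed as a script,
        # not when it is imported as a module
        file_split('dataset.txt', '--------\n', 'out')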

Step 3

What if the file gets a lot bigger? Or its size fluctuates? (i.e. we need to dynamically figure out how many blocks we want in each file)

Or what if we want to split it into more or fewer files? (i.e. we have more CPUs at hand and can process more files at once)

Python modules

  • re (regex, replaces grep)

Built-in functions:

  • len (replaces wc)

Method:

  1. Count the total amount of blocks (in a separate function)
  2. Divide it by the number of files
  3. Update the file_split method accordingly
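
A possible sketch of the counting part, reusing the hypothetical separator and file name from above:

    import re


    def count_blocks(source_file_name, separator):
        f = open(source_file_name)
        content = f.read()
        f.close()
        # re.findall replaces grep, len replaces wc; counting separators
        # is close enough to counting blocks (off by at most one)
        return len(re.findall(re.escape(separator), content))


    total = count_blocks('dataset.txt', '--------\n')
    blocks_per_file = total // 10 + 1   # replaces the value computed by hand with bc
    # ...then file_split uses blocks_per_file instead of a hard-coded number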

Step 4

Do we care about the number of entries? Or the number of files?

===> Update your code so that the number of files can be passed as a parameter
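
Concretely, only the signature and the arithmetic change; a sketch, reusing count_blocks from the previous step:

    def file_split(source_file_name, separator, output_name, number_of_files=10):
        total = count_blocks(source_file_name, separator)
        entries_per_file = total // number_of_files + 1
        # the rest of the function is unchanged, with number_of_files and
        # entries_per_file replacing the hard-coded 10 and blocks_per_file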

Step 5

We're getting there. Let's do some refactoring now to make the code more Pythonic.

  • Use the with open(...) as ...: syntax when possible
  • Use format instead of concatenating text
  • Use round on entries_per_file
  • Add some logging (see the logging module)
  • Use argparse to make the script more flexible
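
Putting all five points together, the refactored script could take a shape like this sketch (the default separator and output prefix are still placeholders):

    import argparse
    import logging
    import re

    logger = logging.getLogger('file_split')


    def count_blocks(source_file_name, separator):
        with open(source_file_name) as f:
            # re.findall replaces grep, len replaces wc
            return len(re.findall(re.escape(separator), f.read()))


    def file_split(source_file_name, separator, output_name, number_of_files):
        total = count_blocks(source_file_name, separator)
        # round as suggested above; max(1, ...) guards against tiny files
        entries_per_file = max(1, round(total / number_of_files))
        logger.info('{} blocks, about {} per file'.format(total, entries_per_file))
        with open(source_file_name) as f:
            blocks = f.read().split(separator)
        for i in range(number_of_files):
            start = i * entries_per_file
            # the last file takes whatever the rounding left over
            end = None if i == number_of_files - 1 else (i + 1) * entries_per_file
            with open('{}_{}.txt'.format(output_name, i), 'w') as out:
                out.write(separator.join(blocks[start:end]))
                if end is not None and end < len(blocks):
                    out.write(separator)   # boundary separator, for validate.sh


    if __name__ == '__main__':
        parser = argparse.ArgumentParser(description='Split a file of blocks.')
        parser.add_argument('-s', '--source', required=True, help='File to split.')
        parser.add_argument('--separator', default='--------\n', help='Block separator.')
        parser.add_argument('-o', '--output', default='out', help='Output files prefix.')
        parser.add_argument('-n', '--number', type=int, default=10, help='Number of output files.')
        args = parser.parse_args()
        logging.basicConfig(level=logging.INFO)
        file_split(args.source, args.separator, args.output, args.number)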

Step 6

Let's think a bit about how we can make this code more efficient.

Why do we compute the amount of entries? Do we need that? What about using the size of the file instead?

Methods:

  • file.seek
  • file.tell
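
A sketch of the idea: seek and tell give us the size of the file without reading it, and from that the target size of each chunk.

    def file_size(source_file_name):
        with open(source_file_name) as f:
            f.seek(0, 2)       # 2 is os.SEEK_END: jump to the end of the file
            return f.tell()    # the current position is the total size


    chunk_size = file_size('dataset.txt') // 10
    # idea: seek to i * chunk_size, then scan forward to the next separator,
    # so that every output file still starts on a block boundary

The remaining problem, cutting on a block boundary rather than mid-block, becomes much easier once the file is opened in binary mode, which is exactly what step 7 does.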

Step 7

Let's make it better:

  • Only open the source file once
  • Open as binary file
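
Combined with the size-based approach from step 6, this could look like the following sketch (the separator is now bytes, since the file is opened in binary mode):

    SEPARATOR = b'--------\n'   # hypothetical separator, as bytes


    def file_split(source_file_name, output_name, number_of_files=10):
        with open(source_file_name, 'rb') as f:   # single open, binary mode
            f.seek(0, 2)
            size = f.tell()
            f.seek(0)
            approx = size // number_of_files
            for i in range(number_of_files):
                if i == number_of_files - 1:
                    data = f.read()        # last file: everything that is left
                else:
                    data = f.read(approx)
                    # read on (byte by byte, for clarity) until the next
                    # separator, so the cut stays block-aligned
                    while data and not data.endswith(SEPARATOR):
                        byte = f.read(1)
                        if not byte:
                            break
                        data += byte
                with open('{}_{}.txt'.format(output_name, i), 'wb') as out:
                    out.write(data)

Since every byte read is written out in order, concatenating the output files reproduces the source exactly, which is what validate.sh checks.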

Step 8

Step 9 ++

If you're fast and bored:

  • Make it a class (with comments)
  • Yield pseudo files (BytesIO) instead of writing the files to disk
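
One possible shape; the class name and interface are assumptions, not the reference solution:

    from io import BytesIO


    class FileSplitter:
        """Split a file of separator-delimited blocks into pseudo files."""

        def __init__(self, source_file_name, separator=b'--------\n', number_of_files=10):
            self.source_file_name = source_file_name
            self.separator = separator
            self.number_of_files = number_of_files

        def split(self):
            # generator: yields one BytesIO per chunk, nothing touches the disk
            with open(self.source_file_name, 'rb') as f:
                f.seek(0, 2)
                approx = f.tell() // self.number_of_files
                f.seek(0)
                for i in range(self.number_of_files):
                    if i == self.number_of_files - 1:
                        data = f.read()
                    else:
                        data = f.read(approx)
                        while data and not data.endswith(self.separator):
                            byte = f.read(1)
                            if not byte:
                                break
                            data += byte
                    yield BytesIO(data)


    # usage: each pseudo file behaves like an open binary file
    # for pseudo_file in FileSplitter('dataset.txt').split():
    #     print(pseudo_file.read(80))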
