You have a big file, you want to extract information from it, and you need to correlate that information with 3rd-party services. You get a new file every 5 minutes.
Processing all of that in one single process will take too much time.
The file is text, so you can read it easily, but the content is made of multiline blocks.
Use the validate.sh script to make sure the files you generate are the same as the source files.
Clone this repo:
git clone https://github.com/Rafiot/2018_Metz.git
Get the dataset:
./get_dataset.sh
Figure out a separator, then write a program that splits the file into 10 independent files of roughly the same size.
Tools required:
- vim (look at the file -> find a separator)
- grep (figure out how many entries we have)
- wc (count the amount of blocks)
- bc (compute things -> amount of blocks per file)
Write some code to do that.
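A first, throwaway version might look like this. The "TIME:" separator and all the file names are assumptions; it also builds a tiny fake source file so the snippet is self-contained:

```python
# First, naive version: everything hardcoded, always 10 output files.
# 'TIME:' as separator is an assumption -- use the one you found with vim.
separator = 'TIME:'

# tiny fake source so the snippet runs on its own; use the real dataset instead
source = open('source.txt', 'w')
source.write('TIME: 1\nA\n\nTIME: 2\nB\n\nTIME: 3\nC\n')
source.close()

content = open('source.txt').read()
# split() drops the separator, so glue it back onto each block
blocks = [separator + b for b in content.split(separator) if b]
entries_per_file = len(blocks) // 10 + 1

for i in range(10):
    out = open('out_{}.txt'.format(i), 'w')
    out.write(''.join(blocks[i * entries_per_file:(i + 1) * entries_per_file]))
    out.close()
```

Concatenating the 10 output files should give back the source file exactly, which is what validate.sh checks.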
Rewrite it, but better:
- make it a function with parameters (source_file_name, separator, output_name)
- make it a script (see __main__, __name__)
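One possible shape for that rewrite (a sketch, not the reference solution -- the usage guard around sys.argv is just there to keep the snippet import-friendly):

```python
import sys


def file_split(source_file_name, separator, output_name):
    """Split source_file_name on separator into 10 files named output_name_<i>.txt."""
    with open(source_file_name) as f:
        blocks = [separator + b for b in f.read().split(separator) if b]
    entries_per_file = len(blocks) // 10 + 1
    for i in range(10):
        with open('{}_{}.txt'.format(output_name, i), 'w') as out:
            out.write(''.join(blocks[i * entries_per_file:(i + 1) * entries_per_file]))


if __name__ == '__main__':
    # only runs when executed as a script, not when the module is imported
    if len(sys.argv) == 4:
        file_split(sys.argv[1], sys.argv[2], sys.argv[3])
    else:
        print('usage: split.py source_file_name separator output_name')
```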
What if the file gets a lot bigger? Or its size fluctuates? (i.e. we need to dynamically figure out how many blocks we want in each file)
Or what if we want to split it into more/fewer files? (i.e. we have more CPUs at hand and can process more files at once)
Python modules: re (regex, replaces grep)
Method: len (replaces wc)
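A quick demo of the re/len combination replacing grep and wc (the "TIME:" separator and the sample text are made up):

```python
import re

# stand-in for the real file content
text = 'TIME: 1\nA\n\nTIME: 2\nB\n\nTIME: 3\nC\n'

# re.findall replaces grep, len replaces wc -l:
# with re.MULTILINE, '^' matches at the start of every line
matches = re.findall('^TIME:', text, re.MULTILINE)
print(len(matches))  # -> 3
```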
- count the total amount of blocks (in another method)
- Divide it by the number of files
- Update the file_split method accordingly
Do we care about the number of entries? Or the number of files?
===> Update your code to be able to pass the number of files as a parameter
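The counting part could be factored out like this (names and the integer-division rounding are suggestions); file_split then takes number_of_files as an extra parameter and calls entries_per_file instead of hardcoding 10:

```python
import re


def count_blocks(source_file_name, separator):
    """Total number of blocks in the file (re.findall replaces grep, len replaces wc)."""
    with open(source_file_name) as f:
        return len(re.findall(separator, f.read()))


def entries_per_file(source_file_name, separator, number_of_files):
    """Blocks per output file, computed dynamically from the file itself."""
    return count_blocks(source_file_name, separator) // number_of_files + 1
```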
We're getting there. Let's do some refactoring now to make the code more Pythonic.
- Use the "with open(...) as ...:" syntax when possible
- Use format instead of concatenating text
- Use round on entries_per_file
- Add some logging (see the logging module)
- Use argparse to make the script more flexible
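Putting all of those refactorings together might look like the sketch below. Note one wrinkle with round: it can round down, so the last output file takes the remainder to make sure no block is lost. (The argument names and the len(sys.argv) guard, which keeps the snippet import-friendly, are my choices, not part of the exercise.)

```python
import argparse
import logging
import sys

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('file_split')


def file_split(source_file_name, separator, output_name, number_of_files):
    with open(source_file_name) as f:
        blocks = [separator + b for b in f.read().split(separator) if b]
    entries_per_file = round(len(blocks) / number_of_files)
    logger.info('%s blocks, ~%s per output file', len(blocks), entries_per_file)
    for i in range(number_of_files):
        # the last file takes the remainder so no block is lost to rounding
        end = None if i == number_of_files - 1 else (i + 1) * entries_per_file
        out_name = '{}_{}.txt'.format(output_name, i)
        with open(out_name, 'w') as out:
            out.write(''.join(blocks[i * entries_per_file:end]))
        logger.info('wrote %s', out_name)


def main():
    parser = argparse.ArgumentParser(description='Split a file of multiline blocks.')
    parser.add_argument('source', help='file to split')
    parser.add_argument('separator', help='string that starts each block')
    parser.add_argument('-o', '--output', default='out', help='output name prefix')
    parser.add_argument('-n', '--number-of-files', type=int, default=10)
    args = parser.parse_args()
    file_split(args.source, args.separator, args.output, args.number_of_files)


if __name__ == '__main__' and len(sys.argv) > 1:
    main()
```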
Let's think a bit about how we can make this code more efficient.
Why do we compute the amount of entries? Do we need that? What about using the size of the file instead?
Methods:
file.seek
file.tell
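The idea: instead of counting entries, jump to approximate byte offsets with seek(), then scan forward to the next separator so blocks stay intact. A minimal demo of the two methods (the sample content is made up; it holds two 11-byte blocks):

```python
import os

# small binary sample: two 11-byte blocks
with open('sample.txt', 'wb') as f:
    f.write(b'TIME: 1\nA\n\nTIME: 2\nB\n\n')

with open('sample.txt', 'rb') as f:
    f.seek(0, os.SEEK_END)   # jump to the end of the file...
    size = f.tell()          # ...where tell() gives the offset -> the file size
    f.seek(size // 2)        # jump straight to the middle: no entry counting
    rest = f.read()          # from here, scan forward to the next separator
print(size, rest)
```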
Let's make it better:
- Only open the source file once
- Open as binary file
- Fetch new files when there is something available
- Use the library to generate text files:
- https://bitbucket.org/ripencc/bgpdump/downloads/ (Installation details: https://bitbucket.org/ripencc/bgpdump/wiki/Home.wiki#!building)
sh ./bootstrap.sh
make
./bgpdump -T
./bgpdump -O ../data/latest-bview.txt ../data/original/latest-bview.gz
If you're fast and bored:
- Make it a class (with comments)
- Yield pseudo files (BytesIO) instead of writing the files on the disk
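The BytesIO variant could be a generator along these lines (a sketch with made-up sample data; the real version would read the source file in binary mode):

```python
import io


def split_blocks(data, separator, number_of_files):
    """Yield number_of_files pseudo files (io.BytesIO) instead of touching the disk."""
    blocks = [separator + b for b in data.split(separator) if b]
    per_file = len(blocks) // number_of_files + 1
    for i in range(number_of_files):
        yield io.BytesIO(b''.join(blocks[i * per_file:(i + 1) * per_file]))


# each yielded object behaves like an open binary file
for pseudo_file in split_blocks(b'S1\nS2\nS3\n', b'S', 2):
    print(pseudo_file.read())
```

Consumers can then pass each pseudo file to anything expecting a file object, with no disk I/O in between.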