bigfile's Introduction

bigfile

A reproducible massively parallel IO library for large, hierarchical datasets.

Python 2 and 3 bindings are available via pip.

bigfile was originally developed for the BlueTides simulation on BlueWaters at NCSA. The library is currently under investigation as part of the BW-PAID program with NCSA.

The current implementation works on a true POSIX-compliant file system, e.g. Lustre. BigFile makes two assumptions:

  1. mkdir() is durable -- it shall propagate the new directory to all clients.
  2. Non-overlapping writes from different clients are allowed. This is less strict than POSIX.

Be aware that NFS is not a truly POSIX-compliant file system.

To cite bigfile, use the DOI badge linked from the repository.


Install

Usually one only needs the Python binding in order to read a BigFile.

To install the Python binding:

pip install [--user] bigfile

The C API of bigfile can be embedded into a project by dropping in four files: bigfile.c, bigfile-mpi.c, bigfile.h, and bigfile-mpi.h.

However, if installation is preferred, the library and executables can be compiled and installed using CMake:

mkdir build
cd build
cmake ..
make install

This installs the project to the default prefix (probably /usr/local). To select an alternative installation destination, replace the cmake call with:

cmake -DCMAKE_INSTALL_PREFIX:PATH=<PREFIX> ..

where <PREFIX> is the desired destination.

Compilation is also possible using the legacy build system:

make install

However, you may need to manually override CC, MPICC, and PREFIX as needed. Taking a look at the Makefile is always recommended.

Description

bigfile provides a hierarchical structure of data columns via File, Dataset and Column.

A Column stores a two-dimensional table of nmemb columns and size rows. Numerically typed columns are supported.

Attributes can be attached to a Column. Numerical attributes and string attributes are supported.

Type casting is performed on the fly if a read/write operation requests a different data type than the one stored in the file.
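
Below is a minimal sketch of these concepts using the Python binding. It assumes that File(..., create=True), File.create(), the write() method and attrs behave as in recent releases of the binding; the file name, column name and sizes are purely illustrative.

import numpy
import bigfile

f = bigfile.File('example-bigfile', create=True)

# a Column with nmemb=3 and 1024 rows, stored as float32 on disk
col = f.create('Position', dtype=('f4', 3), size=1024)

# numerical and string attributes attached to the Column
col.attrs['BoxSize'] = 1000.0
col.attrs['Unit'] = 'Mpc/h'

# writing from a float64 buffer; the cast to the stored float32
# is performed on the fly, as described above
col.write(0, numpy.zeros((1024, 3), dtype='f8'))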

bigfile.Dataset works with dask.array.from_array.
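
For example, a Dataset can be wrapped as a lazy dask array. This is a sketch only; the file and column names follow the example later in this README, and the chunk size is arbitrary.

import bigfile
import dask.array as da

f = bigfile.File('PART_018')
data = bigfile.Dataset(f["0/"], ['Position', 'Velocity'])

# wrap the on-disk Dataset as a lazy, chunked dask array
arr = da.from_array(data, chunks=1024 * 1024)

# slicing only reads the requested rows when computed
print(arr[10:30].compute())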

The Anatomy of a BigFile

A BigFile maps to a directory hierarchy on the file system.

This is the directory structure of an example file:

/scratch1/scratchdirs/sd/yfeng1/example-bigfile
  block0
    header
    attrs-v2
    000000
  group1
    block1.1
      header
      attrs-v2
      000000
      000001
    block1.2
      header
      attrs-v2
      000000
      000001
  group2
    block2.1
      header
      attrs-v2
      000000
      000001

A BigFile consists of blocks (BigBlock) and groups of blocks. Files, groups and blocks are mapped to directories of the hosting file system.

A BigBlock consists of two special plain text files and a sequence of binary data files.

  • Text file header, which stores the data type and size of the block;
  • Text file attrs-v2, which stores the attributes attached to the block;
  • Binary files 000000, 000001, ..., which store the binary data of the block. The format of the data (endianness, data type, vector length per row) is described in header. The number of files used by a block, as well as the size (number of rows) of the block, is fixed at the creation of the block.

Because bigfile stripes data across files explicitly, its performance is insulated from the configuration of the Lustre file system.
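
A sketch of inspecting this layout through the Python binding is shown below. It assumes that File supports indexing by block name and that the block object exposes dtype, size and attrs; the path and block name are taken from the example tree above.

import bigfile

f = bigfile.File('/scratch1/scratchdirs/sd/yfeng1/example-bigfile')
print(f.blocks)                  # e.g. ['block0', 'group1/block1.1', ...]

block = f['block0']
print(block.dtype, block.size)   # data type and number of rows, from header
print(block.attrs)               # attributes stored in attrs-v2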

Comparison with HDF5

Good

  • bigfile is simpler. The core library of bigfile consists of 2 source files, 2 header files, and 1 Makefile, a total of fewer than 3000 lines of code that can easily be maintained by one person or dropped into a project. HDF5 is much more complicated.
  • bigfile is closer to the data. The raw data on disk is stored as binary files that can be directly accessed by any application. The metadata (block descriptions and attributes) is stored in plain text, easily understood by humans. In a sense, the bigfile library is no more than a helper for reading and writing these files under the bigfile protocol. In contrast, once your data goes into HDF5 it is trapped: the HDF5 library is required to make sense of the data from that point on.

Bad

  • bigfile is limited -- for example, bigfile has no API for output streaming, and only 2-dimensional tables are supported. HDF5 is much richer in functionality and more powerful in data description. The designated use-case of bigfile is to store a large amount of static / near-immutable column-wise table data.
  • bigfile is incomplete. Bugs have yet to be identified and fixed. In contrast, HDF5 has been a funded research program under development for more than 20 years.

API Reference

The documentation needs to be written.

The core library is C. Refer to bigfile.h and bigfile-mpi.h for the API interface.

There are Python bindings for Python 2 and 3.

The Python binding under MPI issues more metadata queries to the file system than we would like, though for small-scale applications (thousands of cores) it is usually adequate.

Examples

# This example consumes the BlueTides Simulation data.

import bigfile

f = bigfile.File('PART_018')

print(f.blocks)
# Position and Velocity of GAS particles
data = bigfile.Dataset(f["0/"], ['Position', 'Velocity'])

print(data.size)
print(data.dtype)
# just read a few particles, because there are 700 billion of them.
print(data[10:30])

Shell

We provide the following shell commands for inspecting a bigfile:

  • bigfile-cat
  • bigfile-create
  • bigfile-repartition
  • bigfile-ls
  • bigfile-get-attr
  • bigfile-set-attr

Rejected Poster for SC17

We submitted a poster describing bigfile to SC17. Although the poster was rejected, we post it here as it contains a description of the design and some benchmarks of bigfile.

https://github.com/rainwoodman/bigfile/tree/documents

Yu Feng

bigfile's People

Contributors

eiffl, rainwoodman, sbird


bigfile's Issues

Benchmark matrix

Two data set sizes:

  1. Big (1T rows)

  2. Small (1G rows)

Fix BytesPerFile: 1GB (per BlobFile)

Varying number of nodes as a fraction of BlueWaters:
possibly just two settings are enough, 10% and 90%.

Always use as many ranks as cores. (Threading is not very relevant -- the number of writer subcomms is what matters.)

Varying number of writers per BlobFile (e.g. 1, 2, 4, 8, ...)

@dadamsncsa Does this capture what we discussed?

Writing into single bigfile from multiple ranks

Hi!

I am using nbodykit to manipulate some data and then writing it with the following command:

density = density_fourier.paint(mode='real')
ArrayMesh(density,BoxSize=Lbox).save("density.bigfile", mode='real', dataset='Field')

into a bigfile.

The version of my bigfile module was 0.1.49 and nbodykit is 0.3.15.

I saw that there was a mention of the feature that allows writing into a single file from multiple ranks in the README.md, so I thought that perhaps I just need to update my bigfile, so I installed from scratch, specifying a local directory:
cmake -DCMAKE_INSTALL_PREFIX:PATH=/home/boryanah/installs/ ..
which installed correctly.

However, I get the following error when calling:

>>> bigfile.File
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'bigfile' has no attribute 'File'

Cheers,
Boryana

Converting to hdf5

@rainwoodman and @sbird
I see you guys have some code to convert from hdf5 to bigfile, but do you have any code to do the reverse? I have some analysis code that takes hdf5 as input that I want to run on outputs from MP-Gadget, where the outputs are only a few GB. Thanks :)

Attributes on a BigFile.

Need to refactor big_block__attr__ functions. They shall have versions that deal with an Attrset.

Add per-writer sharding support in MPI mode.

The MPI write function shall be refactored such that it can create one blob file per writer task and dedicate that blob file to the writer task.

We shall also switch the client code to do this by default -- most file systems support this sort of concurrent IO.

This also opens the door for adding on-the-fly data compression -- since we can then assert that the data in the same blob file is written sequentially.

RecordType and DataSet support in C.

These changes will allow us to use bigfile to do journals (e.g. record blackhole details per step).

  1. RecordType = [ ( column name, dtype ) ]

    big_record_set(rt, void* record_buf, icol, const void * value)
    big_record_get(rt, const void* record_buf, icol, void * value)

  2. BigDataset(RecordType, File).
    big_dataset_read(bd, offset, size, buf), big_dataset_write(bd, offset, size, buf)
    Use BigArray to view the buf from the correct strides, and big_block_read / big_block_write to do the writing on each block in the data set.
    big_dataset_read_mpi, big_dataset_write_mpi,
    same as above. may need to plumb some flags.
    big_dataset_grow(bd, size)
    big_dataset_grow_mpi(bd, size)

Concern 1: may need to open and close 2 x Ncolumn physical files per read/write.
Is that fast enough for a PM step? Probably fine.

Concern 2: what if some blocks in the record type exist and some don't? Perhaps they need to be created and grown on the fly.

Offline RLE compression

Offline RLE compression on a single Column

RLEBLOB files are compressed with run-length encoding. The encoding shall be type aware.
There shall be some kind of dynamic range compensation for efficient encoding of clustered data.

The compression algorithm shall automatically identify, from the linear sequence, clusters of contiguous samples that can be represented by a centroid plus a sequence of 'fluctuations' of a much lower dynamic range.
For integral types this shall be loss-less; for floating-point types it can be lossy.

For spatial positions this can be very effective. Notice that the centroid shall have the same shape as the row member, and the dynamic range of the fluctuations can differ per member.
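
A rough numpy sketch of the centroid-plus-fluctuations idea for a one-dimensional column follows; the chunk size, fluctuation dtype and quantization step are illustrative only and do not describe the on-disk format.

import numpy

def encode(column, chunk=4096, fluct_dtype='i2', scale=1e-3):
    # split the column into contiguous chunks; store each chunk as a
    # float centroid plus small-integer fluctuations (lossy for floats)
    centroids, flucts = [], []
    for start in range(0, len(column), chunk):
        part = column[start:start + chunk]
        centroid = part.mean()
        centroids.append(centroid)
        flucts.append(numpy.round((part - centroid) / scale).astype(fluct_dtype))
    return numpy.array(centroids), flucts, scale

def decode(centroids, flucts, scale):
    # reconstruct the column from centroids and quantized fluctuations
    return numpy.concatenate([c + f * scale for c, f in zip(centroids, flucts)])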

[ ] design the file format

[ ] tool to compress and decompress a single BLOB file.

[ ] add interface in bigfile.c to randomly seek and read from RLE BLOB files.

Installation Error with pip

I am trying to install bigfile in Python. I tried pip install bigfile (in both Python 3.8 and 3.10) as suggested in the README.md, but I am consistently getting a ModuleNotFoundError: No module named 'Cython' error. I have Cython installed, which I further verified as follows:

>>> from Cython.Build import cythonize
>>> print(dir(cythonize))
['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

The complete error log is as follows:

Collecting bigfile
  Using cached bigfile-0.1.51.tar.gz (221 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [17 lines of output]
      Traceback (most recent call last):
        File "c:\users\ekser\documents\research-works\astrophysics\bigfile\build\venv\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "c:\users\ekser\documents\research-works\astrophysics\bigfile\build\venv\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "c:\users\ekser\documents\research-works\astrophysics\bigfile\build\venv\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "C:\Users\ekser\AppData\Local\Temp\pip-build-env-a5wu1t5e\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "C:\Users\ekser\AppData\Local\Temp\pip-build-env-a5wu1t5e\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "C:\Users\ekser\AppData\Local\Temp\pip-build-env-a5wu1t5e\overlay\Lib\site-packages\setuptools\build_meta.py", line 480, in run_setup
          exec(code, locals())
        File "<string>", line 2, in <module>
      ModuleNotFoundError: No module named 'Cython'

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

I thought the error was Windows-only, so I tried installing on Linux (22.04), but the error is exactly the same.

Too many file accesses using the Python interface

@rainwoodman

Hi Yu,
Admins on TACC have raised an issue with our code using the bigfile Python interface. Apparently, it invokes too many queries to the file system, similar to your comment in the README file:

The Python binding under MPI invoked more meta-data queries to the file system than we would like to be, though for small scale applications (thousands of cores) it is usually adequate

Not quite sure if this #43 fixes this though, since my issue is with reading not writing files.

Any starter clues, like which source file I need to dive into to fix this?

bigfile-mpi shall die on NFS

bigfile shall provide a function to detect whether an NFS backend is used. The client is then free to die if multiple nodes are issuing writes. Later this can be enhanced to better work around NFS multi-writer issues.
