[Hdf-forum] Fwd: HDF5 : how to get good performance ?
robl at mcs.anl.gov
Fri Aug 8 10:26:54 CDT 2014
On 08/08/2014 03:27 AM, houssen wrote:
> In short : are there things to know / make sure of / be aware of to get
> good performance with P-HDF5 ?
- turn on collective I/O. it's not enabled by default
- HDF5 metadata might be a factor if you have very many small datasets,
but for most applications it's not important
- consult your MPI library for any file-system specific tuning you might
be able to do. For example, Intel-MPI needs you to set an environment
variable before it will use any of the GPFS or Panasas optimizations.
- be mindful of type conversions: if your data in memory are 4-byte
floats but are stored as 8-byte doubles on disk, HDF5 will "break
collective" and do that I/O independently.
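One way to sidestep the type-conversion case is to make the memory type match the file type before the write. A minimal sketch, assuming the dataset is stored as 8-byte doubles while the application holds 4-byte floats. widen_to_double is a hypothetical helper, not part of HDF5, and the HDF5 calls named in the trailing comment only indicate where the buffer would be used:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not an HDF5 function): copy n floats into a newly
 * allocated array of doubles so the in-memory type matches a dataset of
 * 8-byte doubles on disk. Returns NULL if n is 0 or allocation fails;
 * the caller frees the result. */
double *widen_to_double(const float *src, size_t n)
{
    if (n == 0)
        return NULL;
    assert(src != NULL);

    double *dst = malloc(n * sizeof *dst);
    if (dst == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++)
        dst[i] = (double)src[i];
    return dst;
}

/* Intended use (an assumption, not compiled here): pass the returned buffer
 * to H5Dwrite() with H5T_NATIVE_DOUBLE as the memory type, using a transfer
 * property list on which H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE) has
 * been called. */
```

The copy costs extra memory. If the application does not need double precision on disk, storing the dataset as 4-byte floats avoids both the copy and the conversion.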
> To test this I wrote a MPI code. ... I expected to get better
> performance with MPI-IO and P-HDF5 than with the sequential approach.
> The spirit of this test code is very simple / basic (each MPI process
> writes his own block of data in the same file, or, in separate files in
> the sequential approach).
> Note : in each case (sequential, MPI-IO, P-HDF5), when I say "write data
> in file", I mean writing big blocks / bunch of data at once (I do not
> write data one by one - I write the biggest block of data, but smaller
> than 2Gb, that is possible to write).
> Note : I tried with N = 1, 2, 4, 8, 16.
in 2014, 16 processes is not very parallel. serial I/O has many benefits
at modest levels of parallelism: caching, mostly.
> Note : I generated files (MPI-IO, P-HDF5) whose size scaled from 1Gb to
> 16 Gb (which looks like a "very big" file to me).
that's adequate, yes
> Note : I followed the P-HDF5 documentation (use H5P_FILE_ACCESS and
> H5P_DATASET_XFER property list + use hyperslab "by chunks")
> Note : the file system is "GPFS" (it has been installed by the cluster
> vendor : this is supposed to be ready to get performance out of P-HDF5 -
> I am an "application" guy that try to use HDF5, I am not an "admin sys"
> that would be familiar with complex related stuffs related to the file
Now we are getting somewhere.
> Note : I compiled the HDF5 package like this "./configure
> Note : I use CentOS + GNU compilers (for both HDF5 package and my test
> code) + hdf5-1.8.13
> Note : I use mpic++ (not h5pxx compilers - actually I didn't get why
> HDF5 provides compilers) to compile my test code, is this a problem ?
The wrappers just make it easier to pick up any libraries needed. I don't
use the wrappers either, which means sometimes I need to figure out what
new library (like -ldl) HDF5 needs.
> Any relevant clue / information would be appreciated. If what I observe
> is logical I would just understand why, and, how / when it is possible
> to get performance out of P-HDF5. I just would like to get some logic
> out of this.
If you are using GPFS, there is one optimization that goes a long way
towards improving performance: aligning writes to file system block
boundaries. See this email from a few weeks ago:
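The arithmetic behind block alignment is simple enough to sketch. The following assumes a hypothetical 4 MiB GPFS block size (the real value is site-specific and should be queried from the file system) and rounds a file offset up to the next block boundary. align_up is an illustrative helper, not an HDF5 call. In HDF5 the corresponding setting is usually passed through H5Pset_alignment on the file access property list, though whether the referenced email used that exact call is an assumption here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block size, for illustration only; ask the file system
 * (or the administrators) for the real value. */
#define FS_BLOCK_SIZE ((uint64_t)4 * 1024 * 1024)

/* Round offset up to the next multiple of block. An offset that already
 * sits on a boundary is returned unchanged. block must be greater than 0. */
uint64_t align_up(uint64_t offset, uint64_t block)
{
    assert(block > 0);
    uint64_t rem = offset % block;
    return rem == 0 ? offset : offset + (block - rem);
}
```

For example, with a 4 MiB block, an object that would otherwise start at byte 1 is placed at byte 4194304, so the write starts on a block boundary rather than straddling one.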
> Thanks for help,
> PS : I can give more information and the code, if needed (?)
Mathematics and Computer Science Division
Argonne National Lab, IL USA