
PureHDF's Introduction

PureHDF


A pure C# library without native dependencies that makes reading and writing HDF5 files (groups, datasets, attributes, ...) very easy.

The minimum supported target framework is .NET Standard 2.0 which includes

  • .NET Framework 4.6.1+
  • .NET Core (all versions)
  • .NET 5+

This library runs on all platforms (ARM, x86, x64) and operating systems (Linux, Windows, macOS, Raspbian, etc.) that are supported by the .NET ecosystem, without special configuration.

The implementation follows the HDF5 File Format Specification (HDF5 1.10).

Please read the docs for samples and API documentation.

Version 2 changes

To keep the code base clean, version 2 of PureHDF supports only actively supported .NET versions, which are .NET 6 and .NET 8 as of June 2024.

Version 1 of PureHDF supports all .NET versions starting with .NET 4.7.2 and continues to receive bug fixes. Features will be backported upon request if feasible.

Installation

dotnet add package PureHDF

Quick Start

Reading

// root group
var file = H5File.OpenRead("path/to/file.h5");

// sub group
var group = file.Group("path/to/group");

// attribute
var attribute = group.Attribute("my-attribute");
var attributeData = attribute.Read<int>();

// dataset
var dataset = group.Dataset("my-dataset");
var datasetData = dataset.Read<double>();

See the docs to learn more about data types, multidimensional arrays, chunks, compression, slicing and more.

Writing

The first step is to create a new H5File instance:

var file = new H5File();

An H5File derives from the H5Group type because it represents the root group. H5Group implements the IDictionary interface, where the keys represent the links in an HDF5 file and the values determine the type of each link: either another H5Group or an H5Dataset.

You can create an empty group like this:

var group = new H5Group();

If the group should contain datasets, add them using the dictionary collection initializer, just like with a normal dictionary:

var group = new H5Group()
{
    ["numerical-dataset"] = new double[] { 2.0, 3.1, 4.2 },
    ["string-dataset"] = new string[] { "One", "Two", "Three" }
};

Datasets and attributes can both be created either by instantiating their specific class (H5Dataset, H5Attribute) or by just providing some kind of data. This data can be nearly anything: arrays, scalars, numerical values, strings, anonymous types, enums, complex objects, structs, bool values, etc. However, whenever you want to provide more details, like the dimensionality of the attribute or dataset, the chunk layout, or the filters to be applied to a dataset, you need to instantiate the appropriate class, as the sketch below shows.
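For example, to control the chunk layout you wrap the data in an H5Dataset instead of assigning it directly. The following is a minimal sketch of that pattern; the chunks parameter name is an assumption here and should be verified against the current API documentation:

// Minimal sketch: plain data vs. an explicit H5Dataset with a chunk layout.
// The `chunks` parameter name is an assumption, verify it against the docs.
var data = new double[100];

var group = new H5Group()
{
    // plain data: PureHDF derives the dataspace and layout automatically
    ["auto-dataset"] = data,

    // explicit H5Dataset: use this when details like chunking matter
    ["chunked-dataset"] = new H5Dataset(data, chunks: new[] { 10U })
};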

Now let's see how to add attributes. Attributes cannot be added directly using the dictionary collection initializer because that is reserved for datasets. However, every H5Group has an Attributes property which accepts our attributes:

var group = new H5Group()
{
    Attributes = new()
    {
        ["numerical-attribute"] = new double[] { 2.0, 3.1, 4.2 },
        ["string-attribute"] = new string[] { "One", "Two", "Three" }
    }
};

The full example with the root group, a subgroup, two datasets and two attributes looks like this:

var file = new H5File()
{
    ["my-group"] = new H5Group()
    {
        ["numerical-dataset"] = new double[] { 2.0, 3.1, 4.2 },
        ["string-dataset"] = new string[] { "One", "Two", "Three" },
        Attributes = new()
        {
            ["numerical-attribute"] = new double[] { 2.0, 3.1, 4.2 },
            ["string-attribute"] = new string[] { "One", "Two", "Three" }
        }
    }
};

The last step is to write the defined file to the drive:

file.Write("path/to/file.h5");

See the docs to learn more about data types, multidimensional arrays, chunks, compression, slicing and more.

Development

The tests of PureHDF are executed against .NET 6 and .NET 7, so these two runtimes are required. Please note that, for a currently unknown reason, the writing tests cannot run in parallel with other tests: some unrelated temp files remain in use although they should not be, and so they cannot be accessed by the unit tests.

If you are using Visual Studio Code as your IDE, you can simply execute one of the predefined test tasks by selecting Run Tasks from the global menu (Ctrl+Shift+P). The following test tasks are predefined:

  • tests: common
  • tests: writing
  • tests: filters
  • tests: HSDS

The HSDS tests require a Python installation with the venv package available on the system.

Comparison Table

Overwhelmed by the number of different HDF5 libraries? Here is a comparison table:

Note: the following table considers only projects listed on NuGet.org.

| Name | Arch | Platform | Kind | Mode | Version | License | Maintainer | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **v1.10** | | | | | | | | |
| PureHDF | all | all | managed | rw | 1.10.* | MIT | Apollo3zehn | |
| HDF5-CSharp | x86, x64 | Win, Lin, Mac | HL | rw | 1.10.6 | MIT | LiorBanai | |
| SciSharp.Keras.HDF5 | x86, x64 | Win, Lin, Mac | HL | rw | 1.10.5 | MIT | SciSharp | fork of HDF-CSharp |
| ILNumerics.IO.HDF5 | x64 | Win, Lin | HL | rw | ? | proprietary | IL_Numerics_GmbH | probably 1.10 |
| LiteHDF | x86, x64 | Win, Lin, Mac | HL | ro | 1.10.5 | MIT | silkfire | |
| hdflib | x86, x64 | Windows | HL | wo | 1.10.6 | MIT | bdebree | |
| Mbc.Hdf5Utils | x86, x64 | Win, Lin, Mac | HL | rw | 1.10.6 | Apache-2.0 | bqstony | |
| HDF.PInvoke | x86, x64 | Windows | bindings | rw | 1.8, 1.10.6 | HDF5 | hdf, gheber | |
| HDF.PInvoke.1.10 | x86, x64 | Win, Lin, Mac | bindings | rw | 1.10.6 | HDF5 | hdf, Apollo3zehn | |
| HDF.PInvoke.NETStandard | x86, x64 | Win, Lin, Mac | bindings | rw | 1.10.5 | HDF5 | surban | |
| **v1.8** | | | | | | | | |
| HDF5DotNet.x64 | x64 | Windows | HL | rw | 1.8 | HDF5 | thieum | |
| HDF5DotNet.x86 | x86 | Windows | HL | rw | 1.8 | HDF5 | thieum | |
| sharpHDF | x64 | Windows | HL | rw | 1.8 | MIT | bengecko | |
| HDF.PInvoke | x86, x64 | Windows | bindings | rw | 1.8, 1.10.6 | HDF5 | hdf, gheber | |
| hdf5-v120-complete | x86, x64 | Windows | native | rw | 1.8 | HDF5 | daniel.gracia | |
| hdf5-v120 | x86, x64 | Windows | native | rw | 1.8 | HDF5 | keen | |

Abbreviations:

| Term | .NET API | Native dependencies |
| --- | --- | --- |
| managed | high-level | none |
| HL | high-level | C-library |
| bindings | low-level | C-library |
| native | none | C-library |

PureHDF's People

Contributors

apollo3zehn, blackclaws, marklam


PureHDF's Issues

Add IQueryable interface to build hyperslabs?

Edit (2023-01-03)

An experimental IQueryable support has been implemented as dataset.AsQueryable(). Stream support could be implemented similarly, in the form of dataset.AsStream().

However, this is still an option:

dataset.Read().Execute()
dataset.Read().Skip(1).Take(2).Execute()
dataset.Read().AsStream()

The advantage is that dataset.Read().Execute() is also a query, so there are only queries and streams. But it is not quite clean to create an IQueryable first just to finally get a stream.

Stream and query mostly make sense for 1-dimensional data. Stream cannot be implemented for multidimensional data since the data are not written linearly into memory.
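For illustration, here is a rough sketch of how a query provider could translate Skip/Take on a 1-dimensional dataset into a hyperslab (a hypothetical helper, not part of the library):

// Hypothetical translation of LINQ-style Skip/Take into a 1D hyperslab.
// 'skip' and 'take' would be extracted from the LINQ expression tree.
static HyperslabSelection SkipTakeToHyperslab(ulong skip, ulong take)
{
    return new HyperslabSelection(
        rank: 1,
        starts: new[] { skip },     // Skip(n) -> start
        strides: new ulong[] { 1 }, // no striding
        counts: new[] { take },     // Take(n) -> count
        blocks: new ulong[] { 1 }   // single-element blocks
    );
}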

Original Issue

LINQ to HDF?

// this could build a netCDF hyperslab (start, stop, stride)
dataset
   .Skip([]) // = start
   .Take(ulong[]) // = stop - start
   .Where((x, n) => n % nth == 0) // stride
   .Read<int>();
// this could build an HDF5 hyperslab (start, stride, count, block)
dataset
   .Skip([]) // = start
   .Where((x, n) => n % nth == 0) // stride
   .Repeat(y) // count (https://fuqua.io/Rx.NET/ix-docs/html/M_System_Linq_QueryableEx_Repeat__1_3.htm)
   .Take(ulong[]) // block
   .Read<int>();

https://jacopretorius.net/2010/01/implementing-a-custom-linq-provider.html

LINQ Part 3: An Introduction to IQueryable - CodeProject
https://www.codeproject.com/Articles/1240553/LINQ-Part-An-Introduction-to-IQueryable

Returning IEnumerable vs. IQueryable - Stack Overflow
https://stackoverflow.com/questions/2876616/returning-ienumerablet-vs-iqueryablet

Unify reading and writing API

  • caching
  • remove endianness support?
  • restore commented out code parts
  • restore fill value
  • repair all tests
  • At least throw an exception that async is not supported in native code paths, except for datasets
  • repair multi threading (see benchmark)

Hyperslab visualizer

Like this, but extended (actual_rs = "actual_resized"):

C#

var aa = actual.ToArray();
var bb = expected.ToArray();

var sb1 = new StringBuilder();
var sb2 = new StringBuilder();

for (int i = 0; i < expected.Length; i++)
{
    sb1.Append($"{aa[i]},");
    sb2.Append($"{bb[i]},");
}

var sb1f = sb1.ToString();
var sb2f = sb2.ToString();

Matlab

close all

% reshape into C-Order
intermediate_rs = permute(reshape(intermediate, 4, 25, 25), [3 2 1]);
actual_rs       = reshape(actual, 25, 75).';
expected_rs     = reshape(expected, 25, 75).';

sourceDim1      = size(intermediate_rs, 1);
sourceDim2      = size(intermediate_rs, 2);
sourceDim3      = size(intermediate_rs, 3);
targetDim1      = size(actual_rs, 1);
targetDim2      = size(actual_rs, 2);

% source selection
figure
title('source selection (rank = 3)')

for i = 1 : sourceDim1
    for j = 1 : sourceDim2
        for k = 1 : sourceDim3          
            text(...
                (j - 1) / sourceDim2, ...
                1 - ((i - 1) / sourceDim1), ...
                    -(k - 1) / sourceDim3, ...
                num2str(intermediate_rs(i, j, k)), ...
                'FontSize', 8 ...
             )
        end
    end
end

% target selection (actual)
figure
title('target selection (actual, rank = 2)')

for i = 1 : targetDim1
    for j = 1 : targetDim2
        text(...
                 (j - 1) / targetDim2, ...
            1 - ((i - 1) / targetDim1),...
            num2str(actual_rs(i, j)), ...
            'FontSize', 8 ...
        )
    end
end

% target selection (expected)
figure
title('target selection (expected, rank = 2)')

for i = 1 : targetDim1
    for j = 1 : targetDim2
        if (actual_rs(i, j) ~= expected_rs(i, j))
            color = 'r';
        else
            color = 'k';
        end
        
        text(...
                 (j - 1) / targetDim2, ...
            1 - ((i - 1) / targetDim1), ...
            num2str(expected_rs(i, j)), ...
            'FontSize', 8, 'Color', color)
    end
end

Chunk cache problem

When a chunk is larger than the chunk cache's maximum size, the chunk never becomes part of the cache ... and so it is not written to the file.
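A sketch of the behavior one would expect instead (all identifiers are hypothetical, this is not the library's internal code):

// Hypothetical write-through sketch: a chunk larger than the cache limit
// must not be dropped silently; it bypasses the cache and goes straight
// to the file.
void StoreChunk(ulong chunkIndex, Memory<byte> chunk)
{
    if ((ulong)chunk.Length <= _maxCachedChunkBytes)
        _chunkCache.Store(chunkIndex, chunk);  // fits: cache as usual
    else
        WriteChunkDirectly(chunkIndex, chunk); // too large: write through
}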

Error in reading HDF5 file

I am trying to read a file but I am getting the following error:

1. Solution exception:H5F.open
File "recorder.h5"  failed to open with status -1

I have been able to open the file with HDFView, so I am pretty sure the file is not corrupted.
The file is created by FEA software that uses "HDF5 library version: 1.10.1".

I have attached the file in case someone wants to try to help

https://drive.google.com/file/d/1SAKkZf0VGHRfbdPKabyiEPzpEXie4VzC/view?usp=sharing

Code that I have used

import HDF5DotNet

from HDF5DotNet import *

import System
from System import Array, Double, Int64

print('\nInitializing HDF5 library\n')
status = H5.Open()
print('HDF5 ', H5.Version.Major, '.', H5.Version.Minor, '.', H5.Version.Release)

h5file = H5F.open('recorder.hdf5', H5F.OpenMode.ACC_RDONLY)
H5F.close(h5file)
print('\nShutting down HDF5 library\n')
status = H5.Close()

Can you help?

hyperslab: merge subset definitions or pass a "list" of offsets?

Hi all,

Does anyone know how to merge different hyperslab definitions? I am working with 2D data (1028 channels × 3×10^6 points sampled along time) which are chunked (n = 200). I can read 1 channel at a time and, thanks to effort from Apollo3zehn and the use of threads, we can read this within a reasonable time (4 s instead of 30-40 s). However, I'd like to read a user-defined subset of these channels. Is it possible with the current implementation of hyperslabs within HDF5.NET?

Right now, the only way I see to use hyperslabs is to define blocks of contiguous channels (changing "count" and/or "block"), or channels which are regularly spaced (changing "stride"). As a result, it means reading 1 channel at a time. Is it possible to define a "list" of channels?

Sorry if I missed something... And many thanks for any help (and your patience).
Fred

Here is a figure of what I would like to achieve. Right now, I see how to read one row (yellow). I'd like to be able to read at once all green rows as well.
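Until point selections are available (they appear on the roadmap further down this page), one workaround is to issue one hyperslab read per requested channel. A sketch that reuses the API from the code in the threads below:

// Sketch: read an arbitrary list of channels, one hyperslab read each.
// 'dataset' is 2D (channels x points); the names mirror the threads below.
ushort[][] ReadChannels(H5Dataset dataset, int[] channels, ulong nbdatapoints)
{
    var result = new ushort[channels.Length][];

    for (int c = 0; c < channels.Length; c++)
    {
        var selection = new HyperslabSelection(
            rank: 2,
            starts: new[] { (ulong)channels[c], 0UL }, // row = channel
            strides: new ulong[] { 1, 1 },
            counts: new ulong[] { 1, nbdatapoints },   // one full row
            blocks: new ulong[] { 1, 1 }
        );

        result[c] = dataset.Read<ushort>(fileSelection: selection);
    }

    return result;
}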

Replace SpanExtensions with this approach?

// https://docs.microsoft.com/en-us/dotnet/api/system.array?view=netcore-3.1
// max array length is 0X7FEFFFFF = int.MaxValue - 1024^2 bytes
// max multi dim array length seems to be 0X7FEFFFFF x 2, but no confirmation found
private unsafe T ReadCompactMultiDim<T>()
{
    // Maybe just add another Read<T> method (e.g. ReadMultiDim) which
    // unfortunately has no generic constraint, but where T is first checked
    // with IsArray. Both methods then define a lambda to create a buffer of
    // the appropriate size; this buffer is filled and can be returned by the
    // respective method with the correct type.
    //
    // Or `T[,] = Read2D<T>()`, `T[,,] = Read3D<T>()`, etc., then the generic
    // constraint would be possible again.
    // Or: use an implicit cast operator for multi dim arrays?
    // http://dontcodetired.com/blog/post/Writing-Implicit-and-Explicit-C-Conversion-Operators

    //var a = ReadCompactMultiDim<T[,,]>();
    var type = typeof(T);

    var lengths = new int[] { 100, 200, 10 };
    var elementCount = lengths.Aggregate(1L, (x, y) => x * y);
    var elementSize = Marshal.SizeOf(type.GetElementType()!);
    object[] args = lengths.Cast<object>().ToArray();

    var buffer = (T)Activator.CreateInstance(type, args);

    var handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    try
    {
        // the span length is in bytes, so account for the element size
        var span = new Span<byte>(handle.AddrOfPinnedObject().ToPointer(), (int)(elementCount * elementSize));
        span.Fill(0x25);
        return buffer;
    }
    finally
    {
        handle.Free();
    }
}

hyperslab and threads: what could go wrong?

Hi all,

I am trying to read electrophysiology data from an H5 file, whereby data are stored as 200 (data along time) × 1028 (channels) ushort chunks which are compressed. There is an enormous number of these chunks (about 3,000,000) (the chunk size is certainly not optimal, but this is what I get) and reading one channel of such data takes ages (28-30 s from a regular disk).

While I can read data which are stored as chunks and compressed using a direct approach or by reading each chunk in turn (which then takes 50 s), I thought that I could use multithreading: since the computer I work with has 12 cores, I hoped to gain on the time necessary to decompress the data.

However, when doing so, I get errors such as groups not found, after a variable number of loops (typically 5 to 20). Any guess why this might occur?

Thank you for any help or clue,
Fred

Here is the code that fails (it works, however, if I replace the Parallel.For loop with a regular for loop):

public ushort[] ReadAll_OneElectrodeAsIntParallel(ElectrodeProperties electrodeProperties)
{
    H5Group group = Root.Group("/");
    H5Dataset dataset = group.Dataset("sig");
    var nbdatapoints = dataset.Space.Dimensions[1]; // any size
    const ulong chunkSizePerChannel = 200;
    var result = new ushort[nbdatapoints];
    var nchunks = (long)(nbdatapoints / chunkSizePerChannel);

    int ndimensions = dataset.Space.Rank;
    if (ndimensions != 2)
        return null;

    Parallel.For(0, nchunks, i =>
    {
        var istart = (ulong)i * chunkSizePerChannel;
        var iend = istart + chunkSizePerChannel - 1;
        if (iend > nbdatapoints)
            iend = nbdatapoints - 1;
        var chunkresult = Read_OneElectrodeDataAsInt(group, dataset, electrodeProperties.Channel, istart, iend);
        Array.Copy(chunkresult, 0, result, (int)istart, (int)(iend - istart + 1));
    });

    return result;
}

Here is the code that works:
public ushort[] ReadAll_OneElectrodeAsInt(ElectrodeProperties electrodeProperties)
{
    H5Group group = Root.Group("/");
    H5Dataset dataset = group.Dataset("sig");
    int ndimensions = dataset.Space.Rank;

    if (ndimensions != 2)
        return null;

    var nbdatapoints = dataset.Space.Dimensions[1]; // any size
    return Read_OneElectrodeDataAsInt(group, dataset, electrodeProperties.Channel, 0, nbdatapoints - 1);
}

Here is the function called by both routines:
public ushort[] Read_OneElectrodeDataAsInt(H5Group group, H5Dataset dataset, int channel, ulong startsAt, ulong endsAt)
{
    var nbPointsRequested = endsAt - startsAt + 1;

    //Trace.WriteLine($"startsAt: {startsAt} endsAt: {endsAt} nbPointsRequested={nbPointsRequested}");

    var datasetSelection = new HyperslabSelection(
        rank: 2,
        starts: new[] { (ulong)channel, startsAt },         // start at row ElectrodeNumber, column 0
        strides: new ulong[] { 1, 1 },                      // don't skip anything
        counts: new ulong[] { 1, nbPointsRequested },       // read 1 row, ndatapoints columns
        blocks: new ulong[] { 1, 1 }                        // blocks are single elements
    );

    var memorySelection = new HyperslabSelection(
        rank: 1,
        starts: new ulong[] { 0 },
        strides: new ulong[] { 1 },
        counts: new[] { nbPointsRequested },
        blocks: new ulong[] { 1 }
    );

    var memoryDims = new[] { nbPointsRequested };

    var result = dataset
        .Read<ushort>(
            fileSelection: datasetSelection,
            memorySelection: memorySelection,
            memoryDims: memoryDims
        );

    return result;
}
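One likely culprit (an assumption, not confirmed here): the shared H5File instance and its chunk cache are not thread-safe, which matches the SimpleChunkCache and multithreading notes elsewhere on this page. A sketch of a workaround is to give each worker thread its own file handle:

// Sketch: one H5File per worker thread, so no reader state is shared.
// 'filePath' is assumed; the per-chunk work mirrors the failing code above.
Parallel.For(0L, nchunks,
    localInit: () => H5File.OpenRead(filePath),
    body: (i, state, file) =>
    {
        var group = file.Group("/");
        var dataset = group.Dataset("sig");
        // ... compute istart/iend and call Read_OneElectrodeDataAsInt ...
        return file;
    },
    localFinally: file => file.Dispose());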

File path can only be const - Bindings

Discussed in #40

Originally posted by FranciscoG001 August 29, 2023
Hello again, just want to know if its possible to change the type of filePath on: [H5SourceGenerator(filePath: HDF5Read.FILE_PATH)] internal partial class MyGeneratedH5Bindings { }; , because only can be const, or if its possible to pass another path to the .h5 file, because I also repair that just accept the hardcode path like "C:\user\..." and I like to use the path from a local folder on my project without the hardcode path.

The length of the limits parameter must match this hyperslab's rank.

System.RankException
HResult=0x80131517
Message=The length of the limits parameter must match this hyperslab's rank.
Source=HDF5.NET
StackTrace:
at HDF5.NET.HyperslabSelection.d__18.MoveNext()
at HDF5.NET.SelectionUtils.d__0.MoveNext()
at HDF5.NET.SelectionUtils.d__31.MoveNext()
at HDF5.NET.H5Dataset.<ReadAsync>d__502.MoveNext()
at HDF5.NET.H5Dataset.d__23.MoveNext()
at HDF5.ConsoleTest.Program.d__14.MoveNext() in C:\Users\jiede\source\repos\SQLiteLib\src\SQLiteLib\Tests\HDF5.ConsoleTest\Program.cs: line 412

This exception was originally thrown at this call stack:
[External Code]
HDF5.ConsoleTest.Program.QueryData() (in Program.cs)

using var h5file = H5File.OpenRead(file);
var h5group = h5file.Group(tableName);
var h5dataset = h5group.Dataset($"PARA_{j}_OBJECT");

var datasetSelection = new HyperslabSelection(
    rank: 3,
    starts: new ulong[] { 20, 33, 0 },
    strides: new ulong[] { 1, 1, 1 },
    counts: new ulong[] { 1, 1, 1 },
    blocks: new ulong[] { 1, 1, 1 }
);

var dataValues = await h5dataset.ReadStringAsync(datasetSelection);

I want to read the data at multiple start positions through HyperslabSelection in one call instead of one position at a time. How can I do that?

Hdf5_file_7Z.zip
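This API version has no multi-start selection (point selections are listed on the roadmap below), so a workaround sketch is to loop over the start positions and collect the results:

// Sketch: one HyperslabSelection per start position (example origins).
var starts = new List<ulong[]>
{
    new ulong[] { 20, 33, 0 },
    new ulong[] { 40, 12, 0 } // second origin is just an example
};

var dataValues = new List<string>();

foreach (var start in starts)
{
    var selection = new HyperslabSelection(
        rank: 3,
        starts: start,
        strides: new ulong[] { 1, 1, 1 },
        counts: new ulong[] { 1, 1, 1 },
        blocks: new ulong[] { 1, 1, 1 }
    );

    dataValues.AddRange(await h5dataset.ReadStringAsync(selection));
}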

Super issue

before alpha release

  • Repair Little/Big Endian conversion
  • Check if generic reading requires changes to hyperslab selections
  • Ensure that buffer size is checked before copy
  • Improve File.Open signature
  • Add missing API (many properties etc)
  • Parallel testing support or run tests sequentially
  • Package Icon

before beta release

  • Step as readonly record struct
  • new reading API: Read for everything but that means double array = Read<double[]>()
  • Variable length data type (non-string)
  • region references, attribute references
  • Use records whenever possible. They have many advantages. Use "in" parameter (not ref) and ref return in combination with records. See article "Write safe and efficient C# code".
  • Point selections, virtual dataset support
  • Async works but it is not yet fully thread-safe. See AsyncBenchmark for more details. Problematic are the task-based and the multi-threaded benchmarks.
  • doc string on public members + update github pages
  • ReadOnlySpan2D Microsoft.Toolkit.HighPerformance
  • Add HDF5 comparison table to README and contact LiorBanai (see table below)
  • add netstandard2.0 backward compatibility
  • reenable Blosc test, make Blosc2 PInvoke package run on .NET Framework
  • rename Exists to LinkExists? Otherwise something like this could happen: file.Exists(..), which is very similar to File.Exists().
  • Add support for shared messages (i.e. make OneDAS HDF5 test work). See H5Oshared.c.
  • Why struct constraint instead of unmanaged? Struct allows string properties while unmanaged does not. Answer: Trying to MemoryMarshal such types results in the runtime exception Only value types without pointers or references are supported. So it is better to directly use the unmanaged constraint.
  • Complete missing API (many properties etc)
  • IQueryable Support (see #2)

before release

  • reading API: also read completely unknown types (most of the code is already implemented), only the API is missing
  • H5Z_FLAG_OPTIONAL
  • SourceGenerator could also generate source for data types, so that users only need to call .Read() to get a strongly typed result
  • Strongly typed fill value should not only be used in virtual datasets but everywhere (e.g. chunk). So in the end all dataset layouts are becoming generic?
  • CommitedDataType should make type information available?
  • Make SimpleChunkCache thread-safe so it can be reused by different threads. Update README.md with the new SimpleChunkCache and remove the SimpleChunkCache warning. Create pretty sample using Pipelines.
  • ulong to uint in Read method?
  • HyperslabSelection with unlimited dims
  • Direct chunk read? Make some code parts replaceable? Interfaces?
  • Bypassing filters
  • https://devblogs.microsoft.com/dotnet/file-io-improvements-in-dotnet-6/
  • Attribute value preview for debugger?
  • Read unknown Enum + docs (return value is string array)
  • Reduce number of H5BinaryReader variables, use BinaryReader instead, especially for local byte arrays. Did not work in the first attempt because some APIs expect main file stream but also local data streams
  • Space & type tostring(), h5py, or debuggerdisplay
  • new C# features: Inlinearrays + Collection Expressions: dotnet/docs#36356
  • Improve allocation of target array: https://learn.microsoft.com/en-us/dotnet/api/system.gc.allocateuninitializedarray?view=net-7.0
  • HSDS: reenable HSDS tests
  • HSDS: h5path queries: https://github.com/HDFGroup/hsds/blob/18be7801091608ecb119a7bbe29100a7c12a1313/docs/design/query/md_query.md?plain=1#L35 Example http://hsdshdflab.hdfgroup.org/?domain=/shared/tall.h5&h5path=/g1/g1.1/dset1.1.1
  • Checksum support: see jHDF/ChecksumUtils.java
  • Superblock01 DriverInfoBlock: NCSAmulti vs NCSAfami
  • IH5DataProvider ((File)-Driver)
  • hook into filter pipeline
  • ObjectHeader Cache
  • h5coro: Azure vs AWS

Performance optimizations

allocation alternative? ReadOnlySequence to reduce allocations: https://docs.microsoft.com/en-us/dotnet/standard/io/buffers

microsoft/Microsoft.IO.RecyclableMemoryStream: A library to provide pooling for .NET MemoryStream objects to improve application performance.: https://github.com/Microsoft/Microsoft.IO.RecyclableMemoryStream

missing tests

  • read single chunk (compressed / filtered)
  • skip filter
  • Test DelegateSelection + Documentation
  • do not filter edge chunks
  • add tests with max dims != dims
  • Automatically test against publicly available H5 files
  • test thread-safety of Intel filter Helper

backlog


System.OverflowException

Hi,

I encounter this error :
System.OverflowException at (wrapper managed-to-native) System.Object.__icall_wrapper_ves_icall_array_new_specific(intptr,int)
at PureHDF.VFD.H5StreamDriver.ReadBytes (System.Int32 count) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VFD/H5StreamDriver.cs:87
at PureHDF.VOL.Native.HeaderMessage..ctor (PureHDF.NativeContext context, System.Byte version, PureHDF.VOL.Native.ObjectHeader objectHeader, System.Boolean withCreationOrder) [0x003a2] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/HeaderMessage.cs:74
at PureHDF.VOL.Native.ObjectHeader.ReadHeaderMessages (PureHDF.NativeContext context, System.UInt64 objectHeaderSize, System.Byte version, System.Boolean withCreationOrder) [0x0003d] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/ObjectHeader.cs:116
at PureHDF.VOL.Native.ObjectHeader1..ctor (PureHDF.NativeContext context, System.Byte version) [0x00068] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/ObjectHeader1.cs:36
at PureHDF.VOL.Native.ObjectHeader.Construct (PureHDF.NativeContext context) [0x00055] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/ObjectHeader.cs:81
at PureHDF.NativeNamedReference.Dereference () [0x00058] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/Core/NativeNamedReference.cs:57
at PureHDF.VOL.Native.NativeGroup.Get (System.String path, PureHDF.VOL.Native.H5LinkAccess linkAccess) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/Core/NativeGroup.cs:90
at PureHDF.VOL.Native.NativeGroup.Get (System.String path) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/Core/NativeGroup.cs:80
at PureHDF.IH5GroupExtensions.Group (PureHDF.IH5Group group, System.String path) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/API/IH5GroupExtensions.cs:82
at HDF5Reader.Group (System.String groupPath) [0x00001] in C:\Users\CRJE160\Documents\Git\DragonflyPlayer-ADS\Assets\Scripts\HDF5.NET\HDF5Reader.cs:82

The H5 group I'm reading has a size that exceeds the maximum value of an Int32 (10936 × 5174 × 64 = 3,621,303,296). I think the problem is in H5StreamDriver's ReadBytes method: it takes an Int32 count, not a long/Int64.

Adding writing support?

Would this be feasible in the near future? I love this package, but now my project requires creating nwb and hdf5 files too, instead of just reading them.

Too slow reading multiple datasets/groups

Hello, I'm trying to read a nested group structure in my HDF5 file, e.g. Group1/Group2/Group3/Datasets, with sizes, for example: Group1 = 8, Group2 = 6, Group3 = 400, dataset = 2 double values. Since there is no function that reads everything below a group directly, I have to write code that walks through all the groups, and this usually takes about 20-30 seconds to load everything, because I read 2 structs of that type, which makes a total of 76,800 values. Is there any way/function in the library I can use to reduce the time this reading takes? My code is attached as a screenshot.
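One way to cut the per-path lookups is to walk the tree once and read every dataset encountered. A sketch; the Children() enumeration call is an assumption about the current API (older versions expose a Children property) and should be verified:

// Sketch: walk the group tree once instead of resolving each full path.
void ReadAllDatasets(IH5Group group, List<double[]> results)
{
    foreach (var child in group.Children()) // assumption: Children() exists
    {
        if (child is IH5Dataset dataset)
            results.Add(dataset.Read<double[]>());
        else if (child is IH5Group subGroup)
            ReadAllDatasets(subGroup, results);
    }
}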

H5Group.Read<T>() fails with "Filter pipeline failed" in 1.0.0-alpha.21

My apologies if the following report is a little vague. I'm not sure exactly what info is needed to replicate the issue, but from my perspective, all existing files that have been read without issue for some time are now consistently failing, and it isn't immediately clear to me why. It feels like perhaps the new async capability is leading to corrupt reading of byte streams, but I could be way off. I'm logging this info early, before I roll up my sleeves and attempt to work out what is going on myself, in case it is obvious to anybody else and/or a fix can be more rapidly forthcoming.

Let me know what other info I can provide to help diagnose the problem. Thanks.

--

In 1.0.0-alpha.20 the following works fine. In 1.0.0-alpha.21 it fails.

var hdf = H5File.OpenRead(path);
var result = hdf.Dataset("key").Read<double>();

Exception:

System.Exception: 'Filter pipeline failed.'
Inner Exception: InvalidDataException: The archive entry was compressed using an unsupported compression method.

Stacktrace:

   at HDF5.NET.H5Filter.ExecutePipeline(List`1 pipeline, UInt32 filterMask, H5FilterFlags flags, Memory`1 filterBuffer, Memory`1 resultBuffer)
   at HDF5.NET.H5D_Chunk.<ReadChunkAsync>d__60`1.MoveNext()
   at HDF5.NET.H5D_Chunk.<ReadChunkAsync>d__59`1.MoveNext()
   at HDF5.NET.SimpleChunkCache.<GetChunkAsync>d__15.MoveNext()
   at HDF5.NET.SelectionUtils.<CopyMemoryAsync>d__2`1.MoveNext()
   at HDF5.NET.H5Dataset.<ReadAsync>d__50`2.MoveNext()
   at HDF5.NET.H5Dataset.Read[T](Selection fileSelection, Selection memorySelection, UInt64[] memoryDims, H5DatasetAccess datasetAccess)
   at ...

Unable to read variable-length attribute

Hello,
reading a variable-length attribute causes the following exception: Variable-length sequence data can only be decoded as array (incompatible type: System.String).
The exception says that the type System.String is incompatible even though the type string[] has been used.
The code below successfully reads the variable-length dataset, but reading the variable-length attribute fails.

var dataset_NUTS_keys = nativeFile.Dataset("/NUTS_keys");

foreach (var attribute in dataset_NUTS_keys.Attributes())
{
    var typeClass = attribute.Type.Class; // attribute.Type.Class is VariableLength

    // This attribute has the name DIMENSION_LIST and should contain the text 'NUTS'
    string[] attributeArray = attribute.Read<string[]>(); // Exception
}

var typeClass_NUTS_keys = dataset_NUTS_keys.Type.Class; // Type.Class is VariableLength

string[] NUTS_keys = dataset_NUTS_keys.Read<string[]>(); // Successful

A copy of the HDF5 file used is here

If attribute.Read<string[][]>() is used the exception is: Bitfield data can only be decoded as NativeObjectReference1 (incompatible type: System.String)
If NativeObjectReference1 is used the exception is: Unable to decode a reference type as value type.

Any help would be greatly appreciated.

Can't read file from pandas library

When I try to read the file from pandas in Python, it returns nothing. Could this be related to the schema version of HDF5?

Thank you

import numpy as np
import pandas as pd
#%pip install tables -U
import warnings
import os
import time
from tables import NaturalNameWarning
warnings.filterwarnings('ignore', category=NaturalNameWarning)
filePath =r"file.h5"
store = pd.HDFStore(filePath)
store.open()
group  = store.groups()
group

This is the test in C#:

[Test]
public void TestSaveHdf()
{
    var file = new H5File()
    {
        ["my-group"] = new H5Group()
        {
            ["numerical-dataset"] = new double[] { 2.0, 3.1, 4.2 },
            ["string-dataset"] = new string[] { "One", "Two", "Three" },
            Attributes = new()
            {
                ["numerical-attribute"] = new double[] { 2.0, 3.1, 4.2 },
                ["string-attribute"] = new string[] { "One", "Two", "Three" }
            }
        }
    };

    file.Write("file.h5");
}

[Test]
public void TestReadHdf()
{
    // root group
    var file = H5File.OpenRead("file.h5");

    // sub group
    var group = file.Group("my-group");

    // dataset
    var dataset = group.Dataset("numerical-dataset");
    var datasetData = dataset.Read<double[]>();

    foreach (var item in datasetData)
    {
        Console.WriteLine(item);
    }
}

Field "_dataType" present in the type "HDF5.NET.H5DataType" can be exposed?

Hello,

I have been trying to use HDF5.NET library for reading HDF5 files. I would say that HDF5.NET library is very .NET developer friendly.

We use compound types in our applications: one application generates HDF5 files and another application needs to read them. As far as I could tell from the documentation, the HDF5.NET library requires explicit types that describe the schema for compound types. But in our case, it is not possible to have a common API for all our applications just to keep the schema types in sync.

In the HDF.PInvoke library, there is a possibility to get member names and datatypes and to read the binary data of a compound type, letting the application handle the reading of the compound type itself, which works for our scenario.

If there is a similar option in the HDF5.NET library, it would be helpful for us. The only thing stopping us is that the field "_dataType" present in the type "HDF5.NET.H5DataType" is not exposed. Is it possible to expose the field "_dataType"?

Missing features

  • IH5DataProvider ((File)-Driver)
  • Multithreading (Cache, H5FileReader)
  • Automatically test against publicly available H5 files
  • ObjectHeader Cache
  • ExpandoObject

Filter pipeline improvements v2

Linked to: #33

The pipeline has been improved, but memory cannot be rented yet. The problem is that it is impossible to say whether the returned memory is a sliced version of the previously rented memory or an independent one. So we cannot simply free the rented memory and all other memories when the method returns; we have to wait until the pipeline has finished, and that might cause large and useless memory consumption.

Unable to read attributes in a generic way

Hi,
I've got a request to read all attributes of an h5 file during iteration of it in my own library (LiorBanai/HDF5-CSharp#163).

I was thinking about leveraging your library, since it is the better implementation, but when I try to read the attributes I get the following exception:
"The fill value message is missing."

Other attributes fail with errors such as 'Non-negative number required. (Parameter 'count')'. (Screenshots attached.)

the file is
hdf5_test.zip

Is there a better way to read all attributes of a file?

[Question] reading dynamic compound dataset

Hi,
I was wondering how to read back a compound dataset from an existing H5 file without knowing its structure (even reading it as a dictionary of <string, object> is OK), or how to read the columns separately.

Do you have an example in your library by any chance?

Filter pipeline improvements

1.) Filters like deflate do not know the uncompressed size. But if they are the last filter in the pipeline, the size is known from the chunk size.

Also, if it is the second-to-last filter, the following filter might be shuffle, which does not change the buffer size. So for deflate it is always better to use the chunk size as a guess than to use the input buffer size.

2.) All filters should use MemoryPool instead of new byte[].

3.) The last filter can directly write to the resulting array, saving one copy operation.

To enable this, there should be a FilterInfo structure with the currently present fields plus a MemoryProvider.

This structure would have a method GetResultMemory() which normally provides MemoryPool memory. But if the filter is the last one in the pipeline, it returns the sliced result buffer.
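A minimal sketch of that idea (only FilterInfo, the MemoryProvider field and GetResultMemory() are from the proposal above; everything else is illustrative):

using System.Buffers;

// GetResultMemory() hands out pooled memory for intermediate filter stages,
// but a slice of the final result buffer for the last stage, saving a copy.
// Note: the rented IMemoryOwner must be tracked for disposal; that ownership
// question is exactly the open problem in "Filter pipeline improvements v2".
readonly struct FilterInfo
{
    public bool IsLastFilter { get; init; }
    public Memory<byte> ResultBuffer { get; init; }
    public MemoryPool<byte> MemoryProvider { get; init; }

    public Memory<byte> GetResultMemory(int length)
    {
        return IsLastFilter
            ? ResultBuffer.Slice(0, length)                        // direct write
            : MemoryProvider.Rent(length).Memory.Slice(0, length); // pooled
    }
}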

ToArray1D: useful?

Hi,
I am new to C# (but practice Java and C++ for image analysis and electrophysiology) and I am trying to read H5 files storing recordings from 1024 micro-electrodes. These data are stored within an array of int with 1028 channels × many points (for example 300,000) corresponding to the duration of the observation (data are sampled at 20 kHz).
I would like to extract 1 channel at a time from this array using the hyperslab approach.

It looks to me that it would be useful to add a ToArray1D as follows:

public static unsafe T[] ToArray1D<T>(this T[] data, long dim0)
    where T : unmanaged
{
    var dims = new long[] { dim0 };
    ArrayExtensions.ValidateInputData(data, dims);
    var output = new T[dims[0]];

    fixed (void* ptr = output)
    {
        ArrayExtensions.CopyData(data, ptr);
    }

    return output;
}

Is there any other way to do this? Right now, I have resorted to using ToArray2D, but it seems awkward.

Thank you for any help,

Frederic

PS: By the way, for a beginner like me, HDF5.NET is the most understandable library, and it made it easy to read all the other fields in the H5 file I am working on. Thumbs up for your work!

Unable to resolve dependency

Hi there, I made an LSTM with Keras.NET.

But since I couldn't set the seed, I decided to use TensorFlow.NET. The problem is that when I tried to install TensorFlow.Keras, it showed me the following error:

Unable to resolve dependency 'PureHDF'. Source(s) used: 'nuget.org', 'Microsoft Visual Studio Offline Packages'.

Does anybody have an idea how to solve this issue?

NBit filter?

Hi,

I am trying to read data that was compressed with Gzip, but I am getting this error:

Exception: The filter 'Nbit' is not yet supported by HDF5.NET.
HDF5.NET.H5Filter.NbitFilterFunc (HDF5.NET.H5FilterFlags flags, System.UInt32[] parameters, System.Memory`1[T] buffer) (at <fb37093370234856bd3792f3c203654a>:0)

Could you help me/guide me towards getting this solved?

Dataset.Read<T>() performance issue

Hello,

I'm creating this "issue" to try and find out how to improve read performance for a complete dataset.

At the moment, I sometimes have to load fairly large files (around 800 MB), and it can take up to twenty minutes to read a complete dataset, even when trying to tweak buffer and chunk sizes in PureHDF.

Do you have any other ideas on how I can improve performance? I can't use multi-threading or asynchrony (my project uses Unity3D and therefore .NET Standard 2.1).

Thanks in advance!
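If the dataset is chunked and compressed, one lever is to enlarge the chunk cache so every chunk is decompressed only once. A sketch; ChunkCacheFactory and the SimpleChunkCache parameter names reflect the v1/HDF5.NET API surface as I understand it and should be checked against your PureHDF version:

// Sketch: read with a larger chunk cache (parameter names are assumptions).
var datasetAccess = new H5DatasetAccess()
{
    ChunkCacheFactory = () => new SimpleChunkCache(
        chunkSlotCount: 4096,
        byteCount: 256UL * 1024 * 1024) // 256 MiB
};

var data = dataset.Read<float>(datasetAccess: datasetAccess);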

❓ File Extension

Is it possible to extend a file, such that writer.Dispose() can be called and the data written to the file, and then continue writing to the same file using chunking?

Reading hdf5 HELP

Hi @Apollo3zehn

I would like to use your library to read my .hdf5 file, but I am having issues understanding the "group", "dataset" and "attribute" concepts.

Can you provide a quick example that references my file?

Thanks,
Marco

recorder.zip

Index out of range when trying to read dataset of VariableLength

Hi, I am trying to read an HDF5 file and came across an issue when trying to read a dataset of the VariableLength type.
To read the dataset, I just use the code:
var data = dataset.ReadString();

but it fails with the following exception:

 System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection. 
 (Parameter 'index')
    at PureHDF.H5ReadUtils.ReadString(H5Context context, DatatypeMessage datatype, Span`1 data, String[] result) in 
 /home/runner/work/PureHDF/PureHDF/src/PureHDF/Utils/H5ReadUtils.cs:line 293
    at PureHDF.H5Dataset.ReadString(Selection fileSelection, Selection memorySelection, UInt64[] memoryDims, 
 H5DatasetAccess datasetAccess) in /home/runner/work/PureHDF/PureHDF/src/PureHDF/API/H5Dataset.cs:line 270

It works fine if I read a dataset of the String type.
May I know what else I need to specify to be able to read the VariableLength data?
The attached screenshot shows an example of the dataset properties.

Thank you.

[Question] Reading arrays in compound dataset?

Hi! I'm wondering if there are any examples or support for reading arrays in a compound dataset?

Something like this:

internal struct QUAD_CN
{
    public int ID;
    public string TERM; // length: 8
    public int[] GRID; // length: 5
    public float[] FD1; // length: 5
}

Write support

  • null value handling
  • proper variable length support:
    • GetTypeInfoForVariableLengthSequence: baseEncode(ref memory, item); make use of the return value! This may be important for unmanaged data
    • reference type test (DataspaceMessage) is not yet working
  • IsReferenceOrContainsReferences -> default to true? Otherwise this will cause problems on .NET Standard 2.0
  • check if T[,] should also be supported
  • what about T[,], T[][] and T[,,x], T[][][x]?
  • what is the difference between char[,] and string[]? see String Bit Field Description. It is possible to use UTF-8 strings here since long strings will be truncated and small strings will be padded. Then support for char[,] is not needed? Or does the user decide if it is a fixed-length or variable-length string?
  • Support for new Half datatype and others
  • serializer options for top level string array: set length = fixed size array
  • MemoryMarshal.GetArrayDataElement for 2D extensions
  • Use static attribute message sizes and properly implement the free space manager
  • Support for datasets
    • Filter pipelines
    • Chunking
    • hyperslabs
    • Deferred writing
    • Fill Value (also for attributes??)
    • Add support for Stream -> will also be converted to memory and then written to file.
    • Add support for ReadOnlyMemory (can be casted to memory easily: MemoryMarshal.AsMemory(writeRequest.Data))

backlog
