
Comments (20)

rimmartin avatar rimmartin commented on July 22, 2024

I'll experiment. Have to see if it works with the slash on 'dataset'

from hdf5.node.

rimmartin avatar rimmartin commented on July 22, 2024

Should work without the slash; just the name


jacoscaz avatar jacoscaz commented on July 22, 2024

Slash or no slash, I keep getting the same error when I increase the array's length from 10000 to 100000. I'll try to bisect until I find the exact length that triggers this behaviour.


jacoscaz avatar jacoscaz commented on July 22, 2024

It breaks when going from a length of 73901 to 73902.

Also, when I examine the file with h5dump -d /datasetName, I'm getting the JSON representation of the whole array as the first point in the dataset.

EDIT: I was wrong, apologies. It looks like JSON but it's not JSON.

This is the header for a Uint16 dataset within the same file:

DATASET "/station_id" {
   DATATYPE  H5T_STD_U16LE
   DATASPACE  SIMPLE { ( 100 ) / ( 100 ) }
   ATTRIBUTE "type" {
      DATATYPE  H5T_STD_U32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
}

This is the header for the string dataset:

DATASET "/station_name" {
   DATATYPE  H5T_ARRAY { [100] H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } }
   DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
}

My code is based on the tutorial for variable length strings here: http://hdf-ni.github.io/hdf5.node/tut/dataset-tutorial.html


rimmartin avatar rimmartin commented on July 22, 2024

What are you filling the Array entries with?
I suppose for a test a random string generator could be used. Or find a text document with over 80,000 lines...
Testing now.
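For instance, a throwaway filler along these lines would do; this is plain JS with made-up names, no dependency on shortid or on the hdf5.node API:

```javascript
// Hypothetical test filler: pseudo-random lowercase/digit strings of
// varying length, so we don't need shortid or an 80,000-line text file.
function randomString(maxLen) {
  var chars = 'abcdefghijklmnopqrstuvwxyz0123456789';
  var len = 1 + Math.floor(Math.random() * maxLen);
  var out = '';
  for (var i = 0; i < len; i++) {
    out += chars.charAt(Math.floor(Math.random() * chars.length));
  }
  return out;
}

var dataset = new Array(80000);
for (var i = 0; i < dataset.length; i++) {
  dataset[i] = randomString(12);
}
```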


jacoscaz avatar jacoscaz commented on July 22, 2024

The following code

var fs = require('fs');
var hdf5 = require('../common/hdf5').hdf5;
var h5lt = require('../common/hdf5').h5lt;
var h5gl = require('../common/hdf5').h5gl;
var path = require('path');
var shortid = require('shortid');
var filePath = path.join(__dirname, 'test-hdf5.h5');
var file = new hdf5.File(filePath, h5gl.Access.ACC_TRUNC);
var length = 10;
var dataset = new Array(length);
for (var i = 0; i < length; i++) {
  dataset[i] = shortid.generate();
}
h5lt.makeDataset(file.id, 'test', dataset);
file.close();

produces a file that, when examined via h5dump -d /test --stride 1 --start 0 --count 1 products/test-hdf5.h5, shows the following:

HDF5 "products/test-hdf5.h5" {
DATASET "/test" {
   DATATYPE  H5T_ARRAY { [10] H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } }
   DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   SUBSET {
      START ( 0 );
      STRIDE ( 1 );
      COUNT ( 1 );
      BLOCK ( 1 );
      DATA {
      (0): [ "r1_oscv0", "SkeOjocDR", "Sy-doo5DC", "SJfujjcwR", "Bkmuio9PA", "BJNuoo9D0", "rkBdsjqPA", "ry8OssqvR", "ryv_ii5wR", "ryd_ojqw0" ]
      }
   }
}
}

This is what I was referring to before: it looks like the entire array of strings is being stored as the first point in the dataset rather than each string being treated as a separate point.


rimmartin avatar rimmartin commented on July 22, 2024

I got a test case set up by reading in a PDB of the rat liver molecule from https://pdb101.rcsb.org/motm/114
It's close to a million lines and cuts out between 70000 and 80000.

So I'm able to reproduce and test


rimmartin avatar rimmartin commented on July 22, 2024

It might have to do with some handle limit on Linux.


rimmartin avatar rimmartin commented on July 22, 2024

For example on my ubuntu

cat /proc/sys/fs/file-max
808097


jacoscaz avatar jacoscaz commented on July 22, 2024

I guess there are two sides to this: the cut-out at large lengths, and the dataset being typed as a single array of strings rather than as a dataset of strings.
Happy to contribute in any way I can. Feel free to send tests my way. I'll check the fs limit as soon as I get back home.


rimmartin avatar rimmartin commented on July 22, 2024

    filename = '/home/jacopo/data-backend/products/gistemp/gistemp.h5', file descriptor = 12, errno = 14, error message = 'Bad address', buf = 0x55c61fcac378, total write size = 422496, bytes this sub-write = 422496, bytes actually written = 18446744073709551615, offset = 1179648
    filename = './roothaan.h5', file descriptor = 9, errno = 14, error message = 'Bad address', buf = 0x487f858, total write size = 98400, bytes this sub-write = 98400, bytes actually written = 18446744073709551615, offset = 1183744

The 'bytes actually written' value is nonsense in both your test and mine (18446744073709551615 is (size_t)-1, i.e. a failed write), but the reported error in both cases is 'Bad address' (errno 14)


jacoscaz avatar jacoscaz commented on July 22, 2024

With my test as-is, i.e. using shortid.generate(), I can go up to a length of 73862. A length of 73863 breaks roughly every other run, and 73864 always breaks.

However, switching to the following filler loop only got me up to 73820, breaking on all runs from 73821 upward.

for (var i = 0; i < length; i++) {
  dataset[i] = 'hello ' + i;
}

Lengthening the string to 'helloworldhelloworld ' + i still got me up to 73820. Curiously enough, inverting the order to i + ' hello' got me to a different number, 73746.

There must be a pattern but I can't see it ATM. Perhaps we're hitting some kind of limit on how big an array of strings can be within an array of strings-typed dataset (even though we shouldn't be getting an array of strings-typed dataset in the first place).

PS: My file-max is 200676.
PPS: Can I store fixed-length strings using hdf5.node?
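One way to probe whether the limit is on element count or on total bytes is to tally the payload for each filler at its first breaking length. This is only a back-of-the-envelope sketch; the +1 per string assumes a NUL terminator, as suggested by the H5T_STR_NULLTERM in the dump above, not anything documented about hdf5.node internals:

```javascript
// Total payload in bytes for `length` strings produced by `fill`,
// assuming one NUL terminator per string (an assumption, not a fact
// about how hdf5.node lays out the data).
function totalBytes(length, fill) {
  var sum = 0;
  for (var i = 0; i < length; i++) {
    sum += fill(i).length + 1;
  }
  return sum;
}

var a = totalBytes(73821, function (i) { return 'hello ' + i; }); // first breaking length
var b = totalBytes(73747, function (i) { return i + ' hello'; }); // first breaking length
```

Under this count both fillers break at roughly the same total payload (874742 vs 873854 bytes, about 874 KB), which at least hints at a byte-size limit rather than an element-count one.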


rimmartin avatar rimmartin commented on July 22, 2024

Yea I was testing with

   dataset[i] = 'hello ' + '\0';

It feels like some limit is being hit; a heap or stack, something. I may put the question to the HDF Group after I search their mailing list.

Yes, fixed-length was done for table columns. Let me test some; to make it clean I may add an option:

   h5lt.makeDataset(file.id, '/dataset', dataset, {fixed_width: 7});

for example

Will continue to look at large sizes of everything to look for breaks in the system


jacoscaz avatar jacoscaz commented on July 22, 2024

That'd be lovely. Happy to test any solution you come up with.


rimmartin avatar rimmartin commented on July 22, 2024

fixed width is coming. Need to test and work on reading back to JavaScript.

For writing there is no need to fix the length of the strings; just know the maximum length of them all. If this is too short for one string entry in the Array, an exception will be thrown from the native side to ensure data doesn't get corrupted

h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});

should commit this evening
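For anyone following along, computing the width to pass is a one-liner over the array. The sample data here is made up for illustration, and the makeDataset call is left commented out since it needs the native addon:

```javascript
// Find the widest string in the array before asking for fixed-width storage.
var lines = ['hello', 'helloworld', 'hi'];
var maxLength = lines.reduce(function (max, line) {
  return Math.max(max, line.length);
}, 0);

// As described above, a string longer than fixed_width would make the
// native side throw rather than silently truncate:
// h5lt.makeDataset(group.id, 'dataset', lines, {fixed_width: maxLength});
```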


jacoscaz avatar jacoscaz commented on July 22, 2024

Wonderful, wonderful, wonderful.


rimmartin avatar rimmartin commented on July 22, 2024

Hi, sorry for the delay.

    h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});

now saves nearly 1 million lines from a text file for the rat liver pdb chemistry model. The fixed width is 80 in this case.

Still need to test reading back to JavaScript


rimmartin avatar rimmartin commented on July 22, 2024

I'm building their C examples and extending them to work with large data. Otherwise I've mirrored these examples in this project. Their docs don't say chunking is necessary, but it may be needed.


rimmartin avatar rimmartin commented on July 22, 2024

fixed width is now working. Tested on about a million entries and a ~74 MB h5 file

    h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
    var readArray=h5lt.readDataset(group.id, "Rat Liver");

where the array is filled from a text file read and split on "\n"

    const lineArr = ratLiver.trim().split("\n");
    var lines = new Array(lineArr.length);
    var index = 0;
    var maxLength = 0;
    /* Loop over every line. */
    lineArr.forEach(function (line) {
        if (index < lines.length) {
            lines[index] = line;
            if (maxLength < line.length) maxLength = line.length;
        }
        index++;
    });

Relooking at variable length
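A quick way to sanity-check the round trip, assuming (a guess on my part, not documented behaviour) that fixed-width storage may pad the read-back strings with trailing NULs:

```javascript
// Pure-JS stand-in for verifying readDataset output against the input;
// strips trailing NULs that fixed-width storage might add (assumption).
function roundTripOk(original, readBack) {
  if (original.length !== readBack.length) return false;
  for (var i = 0; i < original.length; i++) {
    if (readBack[i].replace(/\0+$/, '') !== original[i]) return false;
  }
  return true;
}
```

e.g. roundTripOk(lines, h5lt.readDataset(group.id, "Rat Liver")) after the write above.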


rimmartin avatar rimmartin commented on July 22, 2024

variable length I/O is now working

