
Comments (25)

rimmartin commented on July 22, 2024

Ok, good. I'll publish again by Monday

rimmartin commented on July 22, 2024

Is it a gig of char data?

I'll make a test that creates an h5 file, pushes that much data in, and reads it back, and get back to you. There shouldn't be a speed issue, since it uses the HDF5 C library, but I'll look for the bottleneck.
Or is it an array of strings of inchikeys?

rimmartin commented on July 22, 2024

I made a 25,000,000-element Array of variable-length strings based on the https://github.com/HDF-NI/hdf5.node/blob/master/test/test_h5lt.js#L709 test and wrote and read it back within 30 seconds. This produced an h5 file of about 1.2 GB.

It could definitely be faster, and it isn't efficient with RAM. I'll be looking into the RAM efficiency; I have only 6 GB of RAM on my laptop, and if I bump the array much larger it can max out and start swapping to disk, which takes a long time. I also have to check whether this data type works with reading regions or chunks of the array.

        // requires, as used elsewhere in this thread
        const hdf5 = require('hdf5').hdf5;
        const h5lt = require('hdf5').h5lt;
        const Access = require('hdf5/lib/globals').Access;
        require('chai').should();  // one way to enable the should-style assertion used below

        let file = new hdf5.File('./roothaan.h5', Access.ACC_TRUNC);
        let group = file.createGroup('pmcservices/Huge Quotes');
        const quotes = new Array(7);
        quotes[0] = "Never put off till tomorrow what may be done day after tomorrow just as well.";
        quotes[1] = "I have never let my schooling interfere with my education";
        quotes[2] = "Reader, suppose you were an idiot. And suppose you were a member of Congress. But I repeat myself.";
        quotes[3] = "Substitute 'damn' every time you're inclined to write 'very;' your editor will delete it and the writing will be just as it should be.";
        quotes[4] = "Don’t go around saying the world owes you a living. The world owes you nothing. It was here first.";
        quotes[5] = "Loyalty to country ALWAYS. Loyalty to government, when it deserves it.";
        quotes[6] = "What would men be without women? Scarce, sir...mighty scarce.";

        const hugeQuotes = new Array(25000000);
        // for...in over a sparse Array visits no indices, so use a counted loop
        for (let i = 0; i < hugeQuotes.length; i++) {
            hugeQuotes[i] = quotes[i % 7];
        }
        h5lt.makeDataset(group.id, "Mark Twain", hugeQuotes);
        group.close();
        file.close();

        file = new hdf5.File('./roothaan.h5', Access.ACC_RDWR);
        group = file.openGroup('pmcservices/Huge Quotes');
        console.dir("read back..");
        const array = h5lt.readDataset(group.id, 'Mark Twain');
        console.dir(array.length);
        array.length.should.equal(25000000);
        group.close();
        file.close();

oguitart commented on July 22, 2024

Hi Rim,

Thank you for checking this out. The content of the file is an array of strings (inchikeys). I could send you the file, but it is pretty big. Let me know if I can do something to help. It would be nice if we could get performance similar to Python's.
Regards,

oriol

rimmartin commented on July 22, 2024

Hi, I believe the difference is in copying the data; it gets put into new strings. Do you need all the data at once? I'll be testing bringing it in by chunks.

I'll also be looking into the scoping to see if the same data can be assigned instead of copied. This particular data structure is less conducive to assigning, but I will study it.

What final data structure do you want the data in?

oguitart commented on July 22, 2024

Hi,

We just want to read a particular position in the array of strings. We don't need to read all the data.

rimmartin commented on July 22, 2024

Ok, I'll test tonight after work. There is a way: chunks can be fed into a stream or web socket.

And I will still look to speed it up.

rimmartin commented on July 22, 2024

I'm starting to be able to read back a region of an Array of variable-length strings. I will clean up and test more before committing, and test that I haven't broken any other functionality.

const array=h5lt.readDataset(group.id, 'Mark Twain', {start: [1000], stride: [1], count: [21]});

Typically it will work this way; stride will usually be 1 for Arrays. If start and count extend outside the actual data, it should return the intersection of values.
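For example (a minimal sketch, assuming the 25,000,000-entry 'Mark Twain' dataset from the test above and the intersection behavior just described):

    // this selection runs 90 entries past the end of the 25,000,000-entry dataset
    const tail = h5lt.readDataset(group.id, 'Mark Twain', {start: [24999990], stride: [1], count: [100]});
    console.dir(tail.length); // expected: 10, the intersection with the actual data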

Should be available by Tuesday

oguitart commented on July 22, 2024

I'm away the whole week with limited access to the internet, so I don't know if I'll be able to test it before I come back.
I installed hdf5 through npm install. How should I get your modified version? By the way, could you explain what the "start" and "count" options mean for readDataset?
Thanks

rimmartin commented on July 22, 2024

Oh, what I meant was that I'll publish to npm by Tuesday.

"start" and "count" are glued directly through to H5Sselect_hyperslab (https://support.hdfgroup.org/HDF5/doc1.8/RM/RM_H5S.html#Dataspace-SelectHyperslab). Each is an array of length up to the rank of the data.

For example, for 3-D data, start is the i,j,k corner of the block to read/write, and count is the length of the run in i,j,k.

In your case I believe the data is stored as rank=1, so it is a linear chunk starting at the index given by start and running for count entries from there.
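So reading, say, 1,000 entries starting at index 5,000 would look something like this (just a sketch; the dataset name "inchikeys" and the numbers are placeholders for your data):

    // rank-1 selection: start is the first index, count is how many entries to read
    const slice = h5lt.readDataset(file.id, "inchikeys", {start: [5000], stride: [1], count: [1000]});
    console.dir(slice.length); // 1000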

And that sounds great having limited access to the internet:-)

rimmartin commented on July 22, 2024

Here is a 2-D example, though that one goes through a Buffer:
http://hdf-ni.github.io/hdf5.node/tut/subset_tutorial.html

This is going to blend back into a single interface now that Buffer is no longer a separate thing under nodejs; see #20.

Also, I'm experimenting with direct streaming; we shall see if it works.
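In the meantime, chunked region reads can already be fed into a stream from user code. Here is a minimal sketch under the current {start, stride, count} options; the datasetStream helper, the chunk size, and the total argument are my own, not part of hdf5.node:

    const { Readable } = require('stream');

    // Wrap chunked region reads of a rank-1 string dataset in an object-mode Readable.
    function datasetStream(id, name, total, chunk = 10000) {
        let offset = 0;
        return new Readable({
            objectMode: true,
            read() {
                if (offset >= total) { this.push(null); return; }
                const count = Math.min(chunk, total - offset);
                const part = h5lt.readDataset(id, name, {start: [offset], stride: [1], count: [count]});
                offset += count;
                this.push(part);  // one Array of strings per chunk
            }
        });
    }

Piping that into a web socket or any other Writable would then be ordinary Node stream plumbing.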

oguitart commented on July 22, 2024

Thanks for all this information.
Looking forward to testing the new version.

oguitart commented on July 22, 2024

I have just tested the new version and it works fine if I want to access a specific element of the dataset, because with the start parameter it is now very easy and fast. However, if I want to load the whole dataset (which is an array of strings) into a JavaScript array, it is still very slow. For instance, with my dataset (an array of 1M strings):

1) This works well:
var dataset2 = h5lt.readDataset(file.id, "inchikeys",{start: [11221], stride: [1], count: [1]});
I can access element 11221 very fast.

2) But this is still very slow:
var dataset2 = h5lt.readDataset(file.id, "inchikeys",{start: [0], stride: [1], count: [712231]});
If I want to read the whole dataset into a variable so I can access the data often, it is still very slow.

We are going to test it more in our application; the current implementation could probably work for us. But I wanted to let you know because the Python library also handles option 2 very fast.
Thanks a lot.

rimmartin commented on July 22, 2024

var dataset2 = h5lt.readDataset(file.id, "inchikeys",{start: [0], stride: [1], count: [10000]});
var dataset3 = h5lt.readDataset(file.id, "inchikeys",{start: [10000], stride: [1], count: [10000]});
or chunk it that way in a loop (see the sketch below).
When you want the whole thing, just:
var dataset4 = h5lt.readDataset(file.id, "inchikeys");
However, holding the whole thing in memory hits your RAM and can push it into swapping to disk.
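A minimal sketch of that chunked loop (the readAllInChunks helper, the 10,000 chunk size, and the total argument are mine, not part of the library):

    // Read a rank-1 string dataset in fixed-size chunks and concatenate the results.
    function readAllInChunks(id, name, total, chunk) {
        let all = [];
        for (let start = 0; start < total; start += chunk) {
            const count = Math.min(chunk, total - start);
            const part = h5lt.readDataset(id, name, {start: [start], stride: [1], count: [count]});
            all = all.concat(part);
        }
        return all;
    }

    const inchikeys = readAllInChunks(file.id, "inchikeys", 712231, 10000);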

oguitart commented on July 22, 2024

I don't think it is a memory problem, because I'm watching the memory and it never comes close to full. The same thing in Python works fine on the same computer at the same time.

oguitart commented on July 22, 2024

Doing it in a loop is still too slow. A dataset of 1M elements would require 100 readDataset calls, which is too many, and the process would be too slow.

oguitart commented on July 22, 2024

Ok, it seems to be a problem with the way the h5 files are created. I read my dataset in nodejs and saved it in a variable, then created a new .h5 file with hdf5 nodejs. Finally, I read the new file in nodejs and it was very fast. So there is an issue with reading files created with h5py. I'll test more and let you know.
Thanks

rimmartin commented on July 22, 2024

Oh, very interesting; I'll be curious whether it is some data typing or custom settings I need to learn about. I appreciate the sleuthing.

oguitart commented on July 22, 2024

Well, after several tests we have noticed the following issue. We create a .h5 file through Python: we query our database and get 1M inchikeys (e.g. QITFVXKKWLWTBN-UHFFFAOYSA-N), then create a .h5 file with a dataset containing this array of 1M inchikeys using h5py. We have created two files by this same process.
Then we read the dataset in nodejs with the following code:

var hdf5 = require('hdf5').hdf5;
var h5lt = require('hdf5').h5lt;
var Access = require('hdf5/lib/globals').Access;
var file = new hdf5.File('/home/oguitart/temp/inchies_1.h5', Access.ACC_RDONLY);
console.time('readhdf5');
var dataset2 = h5lt.readDataset(file.id, "inchies",{start: [0], stride: [1], count: [1000000000]});
console.timeEnd('readhdf5');

We have done that with two different files created the same way and with the same size (attached). The time to read the two datasets is surprisingly different: one takes 6 seconds, the other 89 seconds. The two files are attached so that you can test it too.

inchies_files.tar.gz

rimmartin commented on July 22, 2024

Thank you, I'll do some testing tonight

rimmartin commented on July 22, 2024

and compare with Java reading it in HDFView

rimmartin commented on July 22, 2024

The issue is repeatable; I'll look at the internal steps and code to narrow it down. I reproduced it with nodejs v8.4.0 and hdf5 1.10.1.

rimmartin commented on July 22, 2024

Ok, this was a bug I had had in there for a long time!

The dataset is a fixed-length set of strings stored contiguously as an array; it reads in from the h5 in a blip, and then I check each string's length, taking the minimum of it and 27. Using strlen for that is dangerous because there might not be a null byte for a while.

I changed it to a simple check, so that a JavaScript string doesn't end up with a length of 27 but instead, for example, 4 for the 'None' entries.

Both of these inchie h5's now read into the JavaScript array in 0 s, 451.7131 ms on my machine.
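A quick way to check this from the JavaScript side (a sketch against the attached files, using the "inchies" dataset name from the code above and assuming the file is opened as in that snippet):

    // After the fix, a short fixed-length entry such as 'None' comes back with
    // length 4 instead of being padded out to the full 27 characters.
    const inchies = h5lt.readDataset(file.id, "inchies");
    console.dir(inchies.filter((s) => s === "None").length); // count of exact 4-character 'None' strings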

rimmartin commented on July 22, 2024

Before publishing again I'm surveying for any other potential error of this kind.

Can you test by using npm to pull from this repository? In your package dependencies section, use

    "hdf5": "HDF-NI/hdf5.node"

instead of

    "hdf5": "0.3.2"

If you need it published for this to work, I will publish tomorrow; I just want to minimize hitting users with multiple publishes to the npm repository.

Thank you for helping find this bug.

oguitart commented on July 22, 2024

I tested it and it works great. Much faster. I would prefer to use it through the normal repository but I'm not in a hurry.
Thanks!
