
Comments (13)

Cotton-Ben commented:
  1. Assuming either way, as long as the address is known, is it possible
     for an external process to also read the same data? (But not write to
     it.)

True IPC-style writes (in addition to reads) to an OpenHFT HHM will also be
supported, correct?

  1. I saw the synchronized keyword in the implementation, so I assume it
     supports concurrent reads and writes. Let me know if my assumption is
     correct.

To do this with true IPC capability will necessitate that your HHM locks'
operations/operands also be maintained 100% off-heap, correct?


peter-lawrey commented:
  1. It can be much larger than the heap. However, since hash maps assume
     random access and arrange data randomly, you may see a significant
     drop in performance if you start swapping. Huge Collections is
     designed to:
     • compact the data so it uses less memory in the first place;
     • make full use of the available memory with minimal GC impact.
  2. It would, but it wouldn't be any faster than a swapping hash map.
     Using memory-mapped files might save you from running out of swap
     space, or from having to rebuild such a large collection on restart.
  3. Using memory-mapped files, sharing between processes is possible. The
     current implementation can't do this, but such an implementation is
     planned.
  4. The collection supports concurrent reads and writes. There are 128
     segments by default, and each one has its own lock, so up to 128 CPUs
     can access the collection at once (though you might want to increase
     the number of segments). You cannot have concurrent access to the same
     key/value; however, all operations take around a microsecond or less,
     so waits should be short. A sketch of the segment-per-lock idea
     follows below.
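
A minimal sketch of that segment-per-lock idea (illustrative only, not the
actual HugeCollections internals; the class and method names are invented,
and a HashMap stands in for the off-heap storage to keep it self-contained):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.locks.ReentrantLock;

    // Keys are spread across N segments, each guarded by its own lock, so
    // threads contend only when their keys hash to the same segment.
    public class SegmentedStore<K, V> {
        private static final int SEGMENTS = 128; // default count mentioned above
        private final ReentrantLock[] locks = new ReentrantLock[SEGMENTS];
        private final Map<K, V>[] maps;

        @SuppressWarnings("unchecked")
        public SegmentedStore() {
            maps = (Map<K, V>[]) new Map[SEGMENTS];
            for (int i = 0; i < SEGMENTS; i++) {
                locks[i] = new ReentrantLock();
                maps[i] = new HashMap<>();
            }
        }

        private int segmentFor(K key) {
            int h = key.hashCode();
            return (h ^ (h >>> 16)) & (SEGMENTS - 1); // mix high bits, mask to 0..127
        }

        public V put(K key, V value) {
            int s = segmentFor(key);
            locks[s].lock();
            try {
                return maps[s].put(key, value);
            } finally {
                locks[s].unlock();
            }
        }

        public V get(K key) {
            int s = segmentFor(key);
            locks[s].lock();
            try {
                return maps[s].get(key);
            } finally {
                locks[s].unlock();
            }
        }
    }

With independent segment locks, two writers block each other only when their
keys land in the same segment, which is why raising the segment count helps
once you have more concurrent writers than segments.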

BTW, I have now created a group at
https://groups.google.com/d/forum/openhft-huge-collections, but feel free to
ask questions as issues, given the lack of documentation.

On 14 January 2014 13:47, flxthomaslo [email protected] wrote:

Peter, thanks again for such amazing work on high-performance development.
Since there is no Google group for HugeCollection, I want to ask you a few
questions about HugeCollection via a GitHub issue:

  1. Looking at the code, HugeCollection is based on off-heap memory
     storage. Do you know if it will be suitable for a collection that
     could potentially be bigger than the available memory?
  2. If not, would a memory-mapped-file-backed HugeCollection be able to
     support that?
  3. Assuming either way, as long as the address is known, is it possible
     for an external process to also read the same data? (But not write to
     it.)
  4. I saw the synchronized keyword in the implementation, so I assume it
     supports concurrent reads and writes. Let me know if my assumption is
     correct.

Super thanks in advance


peter-lawrey commented:

Correct on both counts. Some thought will need to go into what assumptions
you can make. I am also considering a shared-memory allocator which could be
used across processes. Such a thing could be used to build a shared HHM.


flxthomaslo commented:

Peter, I will stick with using the GitHub issue for this one, since I want
to keep the conversation in one place.

The reason I was thinking about a memory-mapped-file backing is mainly to
put lookup-style reference data on it. If sized carefully, I should be able
to avoid page faults; at the same time, an external process can load the
data quickly while the main application process accesses it quickly and
seamlessly, and the system will still work even if the lookup data grows
larger than main memory.

Having said that, if it is backed by a memory-mapped file, I wonder what the
best way would be to handle concurrent reads and writes, even though in this
case the data does not change frequently.

Apologies for bringing such a specific use case to HugeCollection.


peter-lawrey commented:

Java-Lang, which Huge Collections is built on, supports locking across
shared memory. It has a sample program in which two processes toggle values
in an array of records: one flips a flag to false when it holds the lock,
and the other flips it back to true. It gets about 5 million toggles per
second. Using compare-and-swap would be faster, but I wanted to test that
the locking worked. A simplified sketch of the idea follows below.
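
A simplified sketch of that two-process toggle. This is not the original
Java-Lang demo (which used Java-Lang's own off-heap lock rather than raw
CAS); it uses JDK 9+ VarHandles over a memory-mapped file, and the file path
is arbitrary. Run one copy with argument true and another with false:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Two processes share a 4-byte flag in a memory-mapped file and toggle
    // it back and forth with compare-and-swap. A MappedByteBuffer is a
    // direct buffer, so the VarHandle's atomic access modes work on it.
    public class ToggleMain {
        public static void main(String[] args) throws Exception {
            boolean owner = Boolean.parseBoolean(args[0]);
            try (FileChannel ch = FileChannel.open(Paths.get("/tmp/toggle.dat"),
                    StandardOpenOption.CREATE, StandardOpenOption.READ,
                    StandardOpenOption.WRITE)) {
                ByteBuffer bb = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4);
                VarHandle vh = MethodHandles.byteBufferViewVarHandle(
                        int[].class, ByteOrder.nativeOrder());
                int from = owner ? 0 : 1, to = owner ? 1 : 0;
                for (int i = 0; i < 10_000_000; i++) {
                    // spin until the other process flips the flag, then flip it back
                    while (!vh.compareAndSet(bb, 0, from, to))
                        Thread.onSpinWait();
                }
            }
        }
    }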

flxthomaslo commented:

Interesting, I will take a look at Java-Lang.


peter-lawrey commented:

In particular, have a look at
OpenHFT/Java-Lang/lang/src/test/java/net/openhft/lang/io/LockingViaMMapMain.java.
This toggled once every 28 nanoseconds (with two processes, so the average
latency was 56 ns) over 10 million toggles.


flxthomaslo commented:

Excellent, will take a look.


peter-lawrey commented:

When you consider off-heap storage, messaging, or persistence, you have to
watch out for serialization. Java-Lang makes low-latency (sub-microsecond),
GC-free deserialization a key requirement. This matters because you can lose
more latency to deserialization than to the network itself. With a
low-latency network card you can send data from one Java process to another
machine in under 10 microseconds, so you want serialization to cost much
less than that, not much more.


flxthomaslo commented:

What we are planning is to have no serialization at all. We are currently
prototyping a fixed-length binary encoding to represent the data object,
with a flyweight Java accessor class that positions itself by offset and
reads the data directly from memory. With that, we theoretically need only
one Java accessor object (keeping the heap footprint minimal), and we create
no new Java objects at all when reading data from memory. We are evaluating
how fast we can read the data if we carefully control page faults (when
using memory-mapped-file access).

When we actually need to send the data out over the network, we just take
the same binary data and put it on the wire; the receiving end takes the
binary data off the network and copies it back into the proper memory slot,
so there is no serialization/deserialization overhead.

I think the key question is how fast we can read the data from the
fixed-length binary encoding versus from a native Java object. We suspect it
will be a bit slower, but we should gain deterministic performance from the
minimal GC activity. What are your thoughts? Obviously, in this case we
might not be able to use HugeCollection, because it sounds like
HugeCollection might modify the content to shrink the size of the data? A
sketch of the flyweight accessor idea is below.
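
A sketch of such a flyweight accessor (the field names, offsets, and record
layout are invented for illustration):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // One reusable accessor reads fixed-offset fields straight out of the
    // buffer, so iterating over records allocates no per-record objects.
    public class TradeFlyweight {
        // fixed-length layout: id (8 bytes) | price (8 bytes) | quantity (4 bytes)
        private static final int ID_OFFSET = 0, PRICE_OFFSET = 8, QTY_OFFSET = 16;
        private static final int RECORD_SIZE = 20;

        private ByteBuffer buffer;
        private int base; // byte offset of the current record

        public TradeFlyweight wrap(ByteBuffer buffer, int recordIndex) {
            this.buffer = buffer.order(ByteOrder.nativeOrder());
            this.base = recordIndex * RECORD_SIZE;
            return this;
        }

        public long id()      { return buffer.getLong(base + ID_OFFSET); }
        public double price() { return buffer.getDouble(base + PRICE_OFFSET); }
        public int quantity() { return buffer.getInt(base + QTY_OFFSET); }
    }

Reading a million records repositions the same accessor a million times
(fly.wrap(buf, i).price()) and allocates nothing, which is the zero-GC
property described above.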


peter-lawrey commented:

Huge Collection supports both variable-length encoding and fixed-length
encoding (with offsets).

BTW, I use plain Java objects (recycled) in my GC-free demo programs. Just
because you use Java objects doesn't mean you have to create garbage. The
library also supports dynamically generated off-heap data types from an
interface of getters and setters, i.e. give it an interface and it will
implement an on-heap and/or off-heap implementation for you; see the sketch
below.

One of the problems with avoiding copying the data is that you have to hold
the lock on the underlying segment; otherwise another thread could modify
the data while you are using it.
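
The interface-driven approach would look roughly like this. Only the idea of
handing the library an interface of getters and setters comes from the
comment above; the Order interface itself is an invented example, and the
factory API that would hand back an instance is not shown because its name
is not given here:

    // The user writes only this interface; the library generates an
    // on-heap and/or off-heap (flyweight) implementation of it at runtime.
    public interface Order {
        void setOrderId(long orderId);
        long getOrderId();

        void setPrice(double price);
        double getPrice();

        void setQuantity(int quantity);
        int getQuantity();
    }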


flxthomaslo commented:

Using an encoding instead of POJOs is not only about GC, but more about
solving the serialization/deserialization problem in general. On top of
that, I assume that with POJOs, every read of the object requires
serializing/deserializing from the encoding. I totally agree that with an
encoding we will need to hold a lock, hence the original question on
concurrent access.

We are evaluating the possibility of using FileLock from Java 7, since it
supports locking a region of a file with a shared or exclusive lock. Since
writing is typically pretty fast, I don't think the locking effect would be
much of a problem. And for reading, if the cost of acquiring the shared lock
is negligible relative to reading the data, the solution might work. We have
never used FileLock, so we don't know how well it will turn out. If FileLock
does not work, then what we can potentially do is allocate a byte array, and
when we need to read the data we just copy the memory into the byte array
and use the same encoding (decoding, in this case) to read the fields; then
we don't need to worry about the data changing underneath us. A sketch of
the FileLock region idea is below.
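
A sketch of that FileLock approach (note that java.nio FileLock locks a
region of the file rather than memory, is advisory, and is held per process,
not per thread; the path and region size here are arbitrary):

    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // A reader takes a shared lock on the region it is about to read;
    // writers would take an exclusive lock (shared = false) on the same
    // region and block until all readers release it.
    public class RegionLockExample {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get("/tmp/lookup.dat"),
                    StandardOpenOption.CREATE, StandardOpenOption.READ,
                    StandardOpenOption.WRITE)) {
                long offset = 0, length = 4096; // region covering the records of interest
                boolean shared = true;          // shared lock: many readers at once
                try (FileLock lock = ch.lock(offset, length, shared)) {
                    // read the region here; a cooperating writer cannot take
                    // its exclusive lock while this shared lock is held
                }
            }
        }
    }

As the numbers in the next comment show, each lock/unlock is a system call,
so this is orders of magnitude slower than an off-heap lock word in shared
memory.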


peter-lawrey commented:

In Java-Lang I compare using FileLock and using shared-memory locks. The
file lock took 4,800 ns on average; the shared-memory lock took 56 ns on
average. Since FileLock involves a system call, its overhead should make the
cost of serialization/deserialization comparatively less important.

I would use an off-heap lock for sharing between processes, or an on-heap
lock for simplicity. I don't think you will get faster than that.

