datasketches-memory's Introduction

DataSketches Java Memory Component

This Memory component is general purpose, has no external runtime dependencies and can be used in any application that needs to manage data structures inside or outside the Java heap.

The goal of this component of the DataSketches library is to provide a high-performance API for accessing four different types of memory resources. Each of the four resource types is accessed using different API methods in the Memory component.

  • Heap: Contiguous bytes on the Java Heap constructed by, e.g., WritableMemory.writableWrap(byte[]) or using the WritableMemory.allocate(int) method. For purposes of this document this includes on-heap ByteBuffers constructed using ByteBuffer.allocate(int).

  • DirectByteBuffer: Contiguous bytes of a ByteBuffer constructed by, e.g., WritableMemory.writableWrap(ByteBuffer) where the ByteBuffer was previously constructed using ByteBuffer.allocateDirect(int); or, is a slice() thereof.

  • Direct: Contiguous bytes off the Java Heap constructed by, e.g., the WritableMemory.allocateDirect(long) method.

  • Memory-Mapped Files: Contiguous bytes of a file represented in off-heap memory and created using, e.g., the WritableMemory.writableMap(File) method.
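
As a rough orientation, the JDK-level resources behind three of these four types can be sketched with java.nio alone. This is not the Memory API: the class name ResourceKinds and its methods are hypothetical, and the "Direct" type has no counterpart here because it has no public JDK equivalent for long-sized allocations.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: shows the JDK resources only, not the Memory component.
final class ResourceKinds {

  /** Heap: contiguous bytes backed by a byte[] on the Java heap. */
  static ByteBuffer heap(int capacity) {
    return ByteBuffer.allocate(capacity);
  }

  /** DirectByteBuffer: contiguous bytes allocated off the Java heap by the JDK. */
  static ByteBuffer direct(int capacity) {
    return ByteBuffer.allocateDirect(capacity);
  }

  /** Memory-mapped file: file bytes represented in off-heap memory. */
  static MappedByteBuffer mapped(Path file, long size) throws IOException {
    try (FileChannel ch = FileChannel.open(file,
        StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      // The mapping stays valid after the channel is closed.
      return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("resource-kinds", ".bin");
    tmp.toFile().deleteOnExit();
    Files.write(tmp, new byte[16]);
    System.out.println("heap is direct:   " + heap(16).isDirect());
    System.out.println("direct is direct: " + direct(16).isDirect());
    System.out.println("mapped is direct: " + mapped(tmp, 16).isDirect());
  }
}
```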

Please visit the main DataSketches website for more information.

If you are interested in making contributions to this Memory component please see our Community page.

Release 2.0.0+

Starting with release datasketches-memory-2.0.0, this Memory component supports Java 8 through Java 13. Providing access to the four contiguous byte resources (mentioned above) in Java 8 only requires reflection. However, Java 9 introduced the Java Platform Module System (JPMS) where access to these internal classes requires starting up the JVM with special JPMS arguments. The actual JVM arguments required will depend on how the user intends to use the Memory API, the Java version used to run the user's application and whether the user's application is a JPMS application or not.

Also see the usage examples for more information.

USE AS A LIBRARY (using jars from Maven Central)

In this environment, the user is using the Jars from Maven Central as a library dependency and not attempting to build the Memory component from the source code or run the Memory component tests.

  • If you are running Java 8, no extra JVM arguments are required.
  • If you are running Java 11-13 and only using the Heap related API, no extra JVM arguments are required.

Otherwise, if you are running Java 11-13 and ...

  • If your application is not a JPMS module, use the following table. Choose the columns that describe your use of the Memory API. If any of the columns contain a Yes, then the JVM argument in the first column of the row containing a Yes will be required. If you are not sure of the extent of the Memory API being used, there is no harm in specifying all 4 JVM arguments. Note: do not embed any spaces in the full argument.
JVM Arguments for non-JPMS Applications Direct ByteBuffer Direct MemoryMapped Files
--add-exports java.base/jdk.internal.misc=ALL-UNNAMED Yes
--add-exports java.base/jdk.internal.ref=ALL-UNNAMED Yes Yes
--add-opens java.base/java.nio=ALL-UNNAMED Yes Yes
--add-opens java.base/sun.nio.ch=ALL-UNNAMED Yes
  • If your application is a JPMS module, use the following table. Choose the columns that describe your use of the Memory API. If any of the columns contain a Yes, then the JVM argument in the first column of the row containing a Yes will be required. If you are not sure of the extent of the Memory API being used, there is no harm in specifying all 4 JVM arguments. Note: do not embed any spaces in the full argument.
JVM Arguments for JPMS Applications Direct ByteBuffer Direct MemoryMapped Files
--add-exports java.base/jdk.internal.misc=org.apache.datasketches.memory Yes
--add-exports java.base/jdk.internal.ref=org.apache.datasketches.memory Yes Yes
--add-opens java.base/java.nio=org.apache.datasketches.memory Yes Yes
--add-opens java.base/sun.nio.ch=org.apache.datasketches.memory Yes Yes
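
For the "specify all 4 arguments" case, the full argument list can be assembled programmatically, for example when launching a child JVM. The sketch below is only an illustration: the class JpmsArgs and its method are hypothetical, and the package names and targets are copied from the two tables above. Each flag and its value are emitted as separate tokens, and the value itself contains no spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; package names and targets are taken from the tables above.
final class JpmsArgs {
  private static final String[] EXPORTS = {
    "java.base/jdk.internal.misc", "java.base/jdk.internal.ref"
  };
  private static final String[] OPENS = {
    "java.base/java.nio", "java.base/sun.nio.ch"
  };

  /** Returns all four JVM arguments for a JPMS or non-JPMS application. */
  static List<String> allArgs(boolean jpmsApplication) {
    String target = jpmsApplication ? "org.apache.datasketches.memory" : "ALL-UNNAMED";
    List<String> args = new ArrayList<>();
    for (String pkg : EXPORTS) {
      args.add("--add-exports");
      args.add(pkg + "=" + target); // no space around '='
    }
    for (String pkg : OPENS) {
      args.add("--add-opens");
      args.add(pkg + "=" + target);
    }
    return args;
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ", allArgs(false)));
  }
}
```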

DEVELOPER USAGE

In this environment the developer needs to build the Memory component from source and run the Memory Component tests. There are two use cases. The first is for a System Developer that needs to build and test their own Jar from source for a specific Java version. The second use case is for a Memory Component Developer and Contributor.

  • System Developer

    • Compile, test and create a Jar for a specific Java version
      • use the provided script for this purpose
  • Memory Component Developer / Contributor

    • Compile & test the library from source code using:
      • Eclipse (version)
      • IntelliJ (version)
      • Maven (version)
      • Command-line or scripts
    • The developer must have installed in their development system at least JDK versions 8 and 11.
    • Unless building with the provided script, the developer must have a valid Maven toolchain configuration.

Build Instructions

NOTES:

  1. This component accesses resource files for testing. As a result, the directory elements of the full absolute path of the target installation directory must qualify as Java identifiers. In other words, the directory elements must not have any space characters (or non-Java identifier characters) in any of the path elements. This is required by the Oracle Java Specification in order to ensure location-independent access to resources: See Oracle Location-Independent Access to Resources

Dependencies

There are no run-time dependencies. See the pom.xml file for test dependencies.

Maven build instructions

The Maven build requires the following JDKs to compile:

  • JDK8/Hotspot
  • JDK11/Hotspot

Before building, first ensure that your local environment has been configured according to the Maven Toolchains Configuration.

There are three types of tests: normal unit tests, tests run by the strict profile, and continuous integration (CI) tests. The CI tests target the Multi-Release (MR) JAR and run the entire test suite using a specific version of Java. Running the CI test command also runs the default unit tests.

To run normal unit tests:

mvn clean test

To run the strict profile tests (only supported in Java 8):

mvn clean test -P strict

To run javadoc on this multi-module project, use:

mvn clean javadoc:javadoc -DskipTests=true

To build the multi-release JAR, use:

mvn clean package

To run the eclipse plugin on this multi-module project, use:

mvn clean eclipse:eclipse -DskipTests=true

To install jars built from the downloaded source:

mvn clean install -DskipTests=true

This will create the following Jars:

  • datasketches-memory-X.Y.Z.jar The compiled main class files.
  • datasketches-memory-X.Y.Z-tests.jar The compiled test class files.
  • datasketches-memory-X.Y.Z-sources.jar The main source files.
  • datasketches-memory-X.Y.Z-test-sources.jar The test source files.
  • datasketches-memory-X.Y.Z-javadoc.jar The compressed Javadocs.

Building for a specific Java version

A build script named package-single-release-jar.sh has been provided to package a JAR for a specific Java version. This is necessary in cases where a developer is unable to install all the JDK versions required by the Maven build.

The build script performs the following steps:

  1. Sets up staging directories under target/ for the package files
  2. Uses git commands to gather information about the current Git commit and branch
  3. Compiles java source tree
  4. Packages a JAR containing compiled sources together with the Manifest, License and Notice files
  5. Checks and tests the assembled JAR by using the API to access four different resource types

The build script is located in the tools/scripts/ directory and requires the following arguments:

  • JDK Home Directory - The first argument is the absolute path of JDK home directory e.g. $JAVA_HOME
  • Git Version Tag - The second argument is the Git Version Tag for this deployment e.g. 1.0.0-SNAPSHOT, 1.0.0-RC1, 1.0.0 etc.
  • Project Directory - The third argument is the absolute path of project.basedir e.g. /src/apache-datasketches-memory

For example, if the project base directory is /src/datasketches-memory:

To run the script for a release version:

./tools/scripts/package-single-release-jar.sh $JAVA_HOME 2.1.0 /src/datasketches-memory

To run the script for a snapshot version:

./tools/scripts/package-single-release-jar.sh $JAVA_HOME 2.2.0-SNAPSHOT /src/datasketches-memory

To run the script for an RC version:

./tools/scripts/package-single-release-jar.sh $JAVA_HOME 2.1.0-RC1 /src/datasketches-memory

Note that the script does not use the Git Version Tag to adjust the working copy to a remote tag - it is expected that the user has a pristine copy of the desired branch/tag available before using the script.


Further documentation for contributors

For more information on the project configuration, the following topics are discussed in more detail:

In order to build and contribute to this project, please read the relevant IDE documentation:

For releasing to AppNexus, please use the sign-deploy-jar.sh script in the scripts directory. See the documentation within the script for usage instructions.

datasketches-memory's Issues

Thoughts on "originMemory"

Currently when asBuffer() is called it keeps a copy of the calling WritableMemoryImpl in the target WritableBufferImpl as a BaseWritableMemoryImpl originMemory. This seems a bit odd.

  • I think this field should be moved to BaseWritableBufferImpl.

  • Also, why don't we also do the reverse? If asMemory is called from a Buffer, keep a reference to the Buffer in the Memory impl. This would provide very fast switching back and forth between Buffer and Memory and it would allow us to focus Buffer strictly on the positional methods only, significantly reducing its API.

Thoughts?

Memory.map(file, off, cap) should not extend the file

When off + cap is more than the file size, the file should not be extended regardless of the map mode (read-only or read-write) and the read-only status of the file. Instead, an exception should be thrown. This behaviour should be specified in the javadocs of those methods.
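
A minimal sketch of the requested behaviour, written against java.nio.channels.FileChannel rather than the Memory implementation: the class StrictMap and its methods are hypothetical, and the example assumes a capacity small enough for a single MappedByteBuffer.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: refuse to map beyond the end of the file instead of extending it.
final class StrictMap {

  /** Throws if [offset, offset + capacity) is not fully inside a file of fileSize bytes. */
  static void checkWithinFile(long offset, long capacity, long fileSize) {
    // Written as offset > fileSize - capacity to avoid overflow in offset + capacity.
    if (offset < 0 || capacity < 0 || capacity > fileSize || offset > fileSize - capacity) {
      throw new IllegalArgumentException("Requested region [" + offset + ", +" + capacity
          + ") exceeds file size " + fileSize);
    }
  }

  /** Maps a read-only region after verifying that it lies within the file. */
  static MappedByteBuffer mapReadOnly(Path file, long offset, long capacity) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      checkWithinFile(offset, capacity, ch.size());
      return ch.map(FileChannel.MapMode.READ_ONLY, offset, capacity);
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("strict-map", ".bin");
    tmp.toFile().deleteOnExit();
    Files.write(tmp, new byte[32]);
    System.out.println(mapReadOnly(tmp, 8, 16).capacity());
    try {
      mapReadOnly(tmp, 24, 16); // off + cap is beyond the 32-byte file
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```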

WritableMemory.putBytes(Memory)

As a user of the Memory API, I was surprised not to find such a method. I spent several minutes before I found Memory.copyTo(WritableMemory).

I suggest adding a method WritableMemory.putBytes(long, Memory, long, long) that delegates to Memory.copyTo().
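
The delegation could look roughly like the sketch below. Mem and WMem are hypothetical byte[]-backed stand-ins for Memory and WritableMemory, and the parameter order (destination offset, source, source offset, length) is an assumption read off the proposed signature.

```java
// Hypothetical stand-ins for Memory / WritableMemory, backed by a byte[].
final class PutBytesSketch {

  /** Stand-in for Memory: readable bytes plus copyTo(). */
  static class Mem {
    final byte[] bytes;

    Mem(byte[] bytes) {
      this.bytes = bytes;
    }

    void copyTo(long srcOffset, WMem dst, long dstOffset, long length) {
      System.arraycopy(bytes, (int) srcOffset, dst.bytes, (int) dstOffset, (int) length);
    }
  }

  /** Stand-in for WritableMemory: adds the proposed putBytes(). */
  static final class WMem extends Mem {
    WMem(byte[] bytes) {
      super(bytes);
    }

    /** Proposed convenience method; it simply delegates to the source's copyTo(). */
    void putBytes(long dstOffset, Mem src, long srcOffset, long length) {
      src.copyTo(srcOffset, this, dstOffset, length);
    }
  }

  public static void main(String[] args) {
    Mem src = new Mem(new byte[] {1, 2, 3, 4});
    WMem dst = new WMem(new byte[4]);
    dst.putBytes(1, src, 0, 2);
    System.out.println(java.util.Arrays.toString(dst.bytes));
  }
}
```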

WritableMemoryDirectHandler allows only one allocate() call

Using the default direct memory handler, an application can make only a single call to allocate(), because the WritableMemory object is never associated with a handler:
https://github.com/DataSketches/memory/blob/master/src/main/java/com/yahoo/memory/WritableMemoryDirectHandler.java#L57-L59

As an example of trying to increase the size of some buffer, say I have this simple method:

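private WritableMemory growBuffer(WritableMemory srcMem, int currentSize, int newSize) {
  // Ask the source memory for its request callback; null means no handler is attached.
  MemoryRequest mr = srcMem.getMemoryRequest();
  if (mr == null) { throw new IllegalStateException("No MemoryRequest available"); }
  WritableMemory newMem = mr.request(newSize);
  // Copy the existing contents into the larger allocation.
  srcMem.copyTo(0, newMem, 0, currentSize);
  return newMem;
}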

If I replace the old memory reference with the new one (and it would be strange to need to hold on to the original Memory object for the life of the object), things work as expected on the first call to growBuffer(), but on the next call getMemoryRequest() will return null.

isResourceReadOnly doesn't make much sense

The isResourceReadOnly() method doesn't make much sense, if not to say that it's misleading and dangerous: it's very easy to think that this method designates the Memory's or Buffer's read-onlyness, but it doesn't. See #69.

I suggest removing this method and replacing it with isReadOnly(), which returns localReadOnly instead of the ResourceState's read-only flag.
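
The distinction between a read-only view and a read-only resource can be illustrated with plain java.nio.ByteBuffer, which is used here only as an analogy. The class ReadOnlyView is hypothetical and does not touch the Memory API.

```java
import java.nio.ByteBuffer;

// Analogy only: a read-only view over a resource that is itself writable.
final class ReadOnlyView {

  /** Returns a view that rejects writes even though the backing array is writable. */
  static ByteBuffer readOnlyViewOf(byte[] resource) {
    return ByteBuffer.wrap(resource).asReadOnlyBuffer();
  }

  public static void main(String[] args) {
    byte[] resource = new byte[4];
    ByteBuffer view = readOnlyViewOf(resource);
    resource[0] = 7; // the resource can still change underneath the view
    System.out.println("view.isReadOnly() = " + view.isReadOnly());
    System.out.println("view.get(0)       = " + view.get(0));
  }
}
```

What a caller usually wants to know is whether writes through this particular object will be rejected, which is the view-level answer.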

Buffer / Memory consistency

Currently Buffer does not have a wrap(byte[] ... BO) parallel to Memory. Do you need this?

Another thought I had was to eliminate all the static factory methods in Buffer and WritableBuffer and require that the user always create a Buffer from Memory. This would make your suggestion of keeping a Memory reference in Buffer make even more sense and would simplify the code and tests quite a bit. Thoughts? (Sketches don't use Buffer so this is fine with me.)

Duplicate check and assert code

Why do we need duplicate code here and here when we could make these static methods that could also be leveraged by other static methods that do the same checks, e.g., here?

I would suggest consolidating all these check and assert methods together in one place, like Util.

Support newer Java version than 8

Java 8 is long in the tooth. Newer versions will pose challenges for this library in particular, but the project risks being left behind over time due to incompatibility with modern JDKs.

getCharsFromUtf8() asymmetry with putCharsToUtf8()

@leventov

This is more of a comment than a real issue, but here FYI.

getCharsFromUtf8() originally returned void. But understanding how many characters were actually decoded can be valuable to know, especially if the user needs to increment a pointer into the destination Appendable. But it is not possible for the method to know how many characters already existed in the destination before appending.

I have updated the method to return an integer of the number of characters decoded in the current PR.

The asymmetry is that putCharsToUtf8() returns the position in the WritableMemory after the last byte has been written from the encoding process instead of the number of bytes encoded.

A question for you is should we change the putCharsToUtf8() to return the number of bytes encoded? This would make the two methods more symmetric.

We had no tests for the case where the destination Appendable already had some content so I added a test for that.

Support for java 9

Currently when using this library I get an error "JDK must be either 7 or 8".

Direct Memory statistics

Since Memory is not a ByteBuffer, it is not covered by the JMX statistics for direct byte buffer allocation. It would be feasible to add at least basic atomic counters, such as total direct memory allocated and total direct memory mapped. I suggest exposing them as static methods in Memory: Memory.totalDirectMemoryAllocated(), etc.
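
A minimal sketch of such counters follows. The class DirectMemoryStats and the onAllocate/onFree/onMap/onUnmap hooks are hypothetical; only the name totalDirectMemoryAllocated() comes from the suggestion above, and the real allocation and mapping code would have to call the hooks.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counters; allocation and mapping code would call the on*() hooks.
final class DirectMemoryStats {
  private static final AtomicLong ALLOCATED = new AtomicLong();
  private static final AtomicLong MAPPED = new AtomicLong();

  static void onAllocate(long bytes) { ALLOCATED.addAndGet(bytes); }
  static void onFree(long bytes) { ALLOCATED.addAndGet(-bytes); }
  static void onMap(long bytes) { MAPPED.addAndGet(bytes); }
  static void onUnmap(long bytes) { MAPPED.addAndGet(-bytes); }

  /** Bytes of direct memory currently allocated. */
  static long totalDirectMemoryAllocated() { return ALLOCATED.get(); }

  /** Bytes of file content currently mapped. */
  static long totalDirectMemoryMapped() { return MAPPED.get(); }

  public static void main(String[] args) {
    onAllocate(1024);
    onMap(4096);
    onFree(1024);
    System.out.println(totalDirectMemoryAllocated() + " allocated, "
        + totalDirectMemoryMapped() + " mapped");
  }
}
```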

Slim Memory; remove hasByteBuffer() and getByteBuffer() methods

Because of apache/druid#5335 (comment) (see the last part of the message), Memory objects are going to need to carry minimal overhead over the memory bytes themselves, at least in comparison with ByteBuffer. (In the linked message, an example API uses ByteBuffer because that is what is currently used in Druid, but it is really going to be replaced with Memory.)

In order to achieve this, I think separate ResourceState objects need to be dropped. A Memory is either its own "resource state" (for Memory objects obtained from wrap(), allocate(), or map() calls), or it references another Memory as its "resource state" (for Memory objects obtained from region() calls). Buffer also references some Memory as its resource state.

Memory has the functionality of ResourceState as package-private methods.

StepBooleans should be "inlined" to be just volatile boolean fields. (And isReadOnly shouldn't be volatile; it should be just final, because it doesn't change during the Memory's lifetime.)

Heap and off-heap Memory should also probably be separated, because off-heap needs a lot of fields that are not needed by on-heap Memory: handle, valid (we change the validity state only when we unmap or deallocate direct memory, so it is not needed for on-heap), reference to ByteBuffer, etc. Off-heap Memory could be much beefier than on-heap, because we are not going to map or allocate small off-heap Memories.

Because of this, I think hasByteBuffer() method should be removed, because we cannot consistently support it, when wrapping on-heap ByteBuffers. Also, I think this method is useless.

Class hierarchy could look like this:

Memory
|
WritableMemory
|
BaseWritableMemoryImpl------
|                           |
WritableMemoryImpl          NonNativeEndianWritableMemoryImpl
|                           |
OffHeapWritableMemoryImpl   OffHeapNonNativeEndianWritableMemoryImpl

OffHeapWritableMemoryImpl and OffHeapNonNativeEndianWritableMemoryImpl have pretty much the same bodies, but I don't see how to avoid this repetition.

ByteBuffer views of Memory

Previously I turned this idea down, but I have now given up on the idea of making all libraries compatible with Memory from the bottom up (like RoaringBitmap). Those libraries, if they currently use just the vanilla ByteBuffer API and no Unsafe or other ugly hacks, won't agree to migrate to Memory unless it supports Java 9, now Java 10, and an Unsafe-free mode of operation, as well as non-OpenJDK operation (e.g. Eclipse OpenJ9, which is known to break often when people try to access private fields of OpenJDK class implementations). Adding support for all of those things in Memory would be very hard and a lot of work.

Previously I was concerned that wrapping direct memory as a DirectByteBuffer is unsafe, because if the original memory is freed, the buffer could still be used. Well, yes, it's unsafe (we could even name those methods something like unsafeByteBufferView()), but there is no other way. Those methods should be used carefully and only in specific situations. We could also make it a little less dangerous by installing a reference to the Memory object into the ByteBuffer's att field.

utf8 PR

@leventov

I made some style and cosmetic changes to your PR which I pulled into the utf8 branch here.
I had to add finals, make some methods static that could have been, etc. tools/SketchesCheckstyle should have caught those.

Question 1: You did not create a parallel get/put utf8 for buffer. Why?

Question 2: Have you done any testing that shows that adding this utf8 capability into Memory improves performance over what could have been done external to Memory?

Unaligned access

I'm not sure to what extent we want to support platforms that don't support unaligned memory access, such as old ARMs, SPARC, etc. Currently an attempt to do unaligned reads and writes with Memory will lead to a SIGBUS crash, which is worse than throwing an exception.

jdk.internal.misc.Unsafe includes intrinsics for doing unaligned reads and writes, but sun.misc.Unsafe unfortunately doesn't.
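
One way to turn the crash into an exception is an explicit alignment check before each multi-byte access. The sketch below is only an illustration: the class AlignmentCheck and its method are hypothetical, and it assumes access sizes that are powers of two.

```java
// Hypothetical pre-access check: fail with an exception instead of a SIGBUS.
final class AlignmentCheck {

  /**
   * Throws if address is not a multiple of accessSize.
   * accessSize is assumed to be a power of two (1, 2, 4 or 8).
   */
  static void checkAligned(long address, int accessSize) {
    if ((address & (accessSize - 1)) != 0) {
      throw new IllegalArgumentException(
          "Address " + address + " is not aligned to " + accessSize + " bytes");
    }
  }

  public static void main(String[] args) {
    checkAligned(16, 8); // passes
    try {
      checkAligned(12, 8);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Such a check adds a branch to every access, so it would probably only be enabled on platforms that actually need it.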

Removing stateful information from abstract classes

@leventov

Last year you stated an objective of not having stateful information in Memory or Buffer (or BaseBuffer). This leaves open the possibility of creating implementations that are independent of Unsafe.

By adding ResourceState to BaseBuffer, this is now broken. But it is not really needed there. Shall I move it back to WritableBufferImpl ?

Support for efficient write to a file

Memory should add method(s) for efficient (i.e. without an extra data copy, if possible) writing of its contents (or a subrange) to a file. WritableByteChannel has the method write(ByteBuffer). Hence when we create a large direct Memory (potentially bigger than 2 GB) we should create, e.g., 1 GB direct ByteBuffer "blocks" under the hood anyway, and use them only for such file write operations.

Heap byte[] or ByteBuffer-based memory is OK already. (A byte[] will need to be wrapped in a ByteBuffer before writing, i.e. it creates a little garbage, but that is unavoidable and OK. ByteBuffer-based memory is no better: it will need to create a duplicate() for the write anyway, to avoid concurrency issues with ByteBuffer offsets.)

Non-byte[] array-based memory (short[], int[], etc.) couldn't be written to a file without an extra data copy one way or another (unless you want to create a JNI implementation and deal with file descriptors yourself :) ). But that is probably OK, because Memory that is created this way probably shouldn't often be written to a file directly. At least there is no such need in Druid. Druid specifically is going to need to write only large direct Memory into a file.
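
The "blocks" idea from the first paragraph can be sketched with FileChannel, which implements WritableByteChannel. The class BlockWriter is hypothetical, and the list of direct ByteBuffers stands in for whatever blocks a large direct Memory would keep internally. Each block is written through a duplicate() so the original position and limit are left untouched.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: write a sequence of direct ByteBuffer "blocks" to a file.
final class BlockWriter {

  /** Writes the remaining bytes of every block, in order; returns the byte count. */
  static long writeBlocks(List<ByteBuffer> blocks, Path file) throws IOException {
    long written = 0;
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      for (ByteBuffer block : blocks) {
        // duplicate() shares the bytes but has its own position and limit.
        ByteBuffer view = block.duplicate();
        while (view.hasRemaining()) {
          written += ch.write(view);
        }
      }
    }
    return written;
  }

  public static void main(String[] args) throws IOException {
    Path out = Files.createTempFile("blocks", ".bin");
    List<ByteBuffer> blocks =
        Arrays.asList(ByteBuffer.allocateDirect(1024), ByteBuffer.allocateDirect(512));
    System.out.println(writeBlocks(blocks, out) + " bytes written");
    Files.delete(out);
  }
}
```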

Issues with overlap in Memory.copyTo() / array get / array put methods

  1. Why does Memory.copyTo() prohibit overlap of copied regions? Why doesn't it handle it properly? Well, it might be a source of bugs, I agree. Can we add a method copyToMaybeOverlapping()?
  2. WritableMemoryImpl.copyTo() checks this == destination, but it should really compare srcParent and dstParent: https://github.com/DataSketches/memory/blob/3878b82529f4e6de531ffbff6b3b60206f27476f/src/main/java/com/yahoo/memory/WritableMemoryImpl.java#L305-L312
    Because the same array could be wrapped two times.
  3. The same applies to all bulk array get and put methods, because the source or destination array could be the same array that was previously wrapped to produce the Memory object.
  4. And the same for Buffer.
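
For the heap case, both points can be sketched on plain arrays: overlap has to be decided on the backing array rather than on the wrapper object, and System.arraycopy already copies overlapping ranges of the same array as if through a temporary copy. The class OverlapSketch and its methods are hypothetical, with copyMaybeOverlapping named after the suggestion in item 1.

```java
import java.util.Arrays;

// Hypothetical sketch for byte[]-backed regions only.
final class OverlapSketch {

  /** True if both regions live in the same array and their byte ranges intersect. */
  static boolean overlaps(byte[] src, long srcOff, byte[] dst, long dstOff, long len) {
    return src == dst && len > 0 && Math.abs(srcOff - dstOff) < len;
  }

  /** Copies len bytes; safe even when src and dst are the same array. */
  static void copyMaybeOverlapping(byte[] src, int srcOff, byte[] dst, int dstOff, int len) {
    System.arraycopy(src, srcOff, dst, dstOff, len);
  }

  public static void main(String[] args) {
    byte[] arr = {1, 2, 3, 4, 5};
    System.out.println(overlaps(arr, 0, arr, 1, 4));
    copyMaybeOverlapping(arr, 0, arr, 1, 4);
    System.out.println(Arrays.toString(arr));
  }
}
```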

Copyright

Copyright headers in the project say Copyright <year>, Yahoo! Inc. that is not correct, because I didn't sign any CLA with Yahoo.

It's suggested to change all headers to Copyright <year> Memory contributors and add a NOTICE file that lists copyright holders: Yahoo Inc. and Metamarkets Group Inc.

Different CheckStyle?

@leventov

Somewhere you mentioned using a different CheckStyle template. Where was the comment? Or what was the recommendation and why?

Reachability issues

The usage of direct Memory is currently fundamentally prone to use-after-free if the direct memory object is garbage-collected at the moment when one of its methods is called or when it is compared with another Memory, e.g. in equals(). See the Reference.reachabilityFence() Javadoc for details.

Now DirectByteBuffer's code looks like this: http://hg.openjdk.java.net/jdk/jdk/file/a6ede2dabe20/src/java.base/share/classes/java/nio/Direct-X-Buffer.java.template#l269, those try { ... } finally { Reference.reachabilityFence(this); } are in every DirectByteBuffer's method.

Reference.reachabilityFence() method is Java 9+, but informally it's known that in Hotspot JVM any method call with the object as the argument has the same effect as Reference.reachabilityFence(), e. g. Objects.requireNonNull(): http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-February/051312.html
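
The try/finally pattern described above looks roughly like this on Java 9+. The class FencedCells is hypothetical, and a long[] stands in for the off-heap address so that the sketch runs without Unsafe. In real code the fence matters because a cleaner could free the native memory once the owning object becomes unreachable.

```java
import java.lang.ref.Reference;

// Hypothetical stand-in: a long[] plays the role of the off-heap resource.
final class FencedCells {
  private final long[] cells;

  FencedCells(int size) {
    this.cells = new long[size];
  }

  long get(int index) {
    try {
      return cells[index];
    } finally {
      // Keeps 'this' strongly reachable until the access above has completed.
      Reference.reachabilityFence(this);
    }
  }

  void put(int index, long value) {
    try {
      cells[index] = value;
    } finally {
      Reference.reachabilityFence(this);
    }
  }

  public static void main(String[] args) {
    FencedCells c = new FencedCells(4);
    c.put(2, 42L);
    System.out.println(c.get(2));
  }
}
```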

Incorrect starting offset while wrapping Memory onto byte array

Memory.java#L198, permalink seems incorrect.
It initializes the memory with the offset 0, instead of the supplied offset into the array. I might be unclear on the method's contract, but this seems counterintuitive to what I was expecting - an offset of x with length l as an argument means that the memory would be a window into the [x..x+l] bytes of the provided byte array. This is what ByteBuffer.wrap() does.

Is this the expected behavior of the method or is it a bug?
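
For comparison, this is how the JDK side behaves. ByteBuffer.wrap(array, offset, length) keeps the capacity of the whole array and only sets position and limit, and a following slice() produces a zero-based window over exactly those bytes. The class WrapWindow is hypothetical and illustrates only ByteBuffer, not the contract of Memory.wrap().

```java
import java.nio.ByteBuffer;

// JDK comparison only; says nothing about the intended contract of Memory.wrap().
final class WrapWindow {

  /** Returns a zero-based window over array[offset .. offset + length). */
  static ByteBuffer window(byte[] array, int offset, int length) {
    return ByteBuffer.wrap(array, offset, length).slice();
  }

  public static void main(String[] args) {
    byte[] arr = {10, 20, 30, 40, 50};
    ByteBuffer wrapped = ByteBuffer.wrap(arr, 1, 3);
    System.out.println("wrapped: position=" + wrapped.position()
        + " limit=" + wrapped.limit() + " capacity=" + wrapped.capacity());
    ByteBuffer win = window(arr, 1, 3);
    System.out.println("window:  capacity=" + win.capacity() + " first=" + win.get(0));
  }
}
```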

Thanks!

Subtleties of ZERO_SIZE_MEMORY/BUFFER

@leventov

You created two degenerate objects, ZERO_SIZE_MEMORY and ZERO_SIZE_BUFFER, presumably for performance. I normally like this idea, however, I want to point out some subtle unintended consequences that you might not be aware of.

The ResourceState associated with each object contains information about the resource that the object is mapped to and the user can query some of this information via methods like hasByteBuffer(). The degenerate objects you created are specifically zero byte array resources.

Suppose the user's code does something like this:

ByteBuffer bb = ByteBuffer.allocateDirect(0); //accidentally
//.... much later and somewhere else ...
WritableMemory wmem = WritableMemory.wrap(bb);
assert wmem.hasByteBuffer();  //Fails!
assert wmem.isDirect(); //Fails!

This might be very misleading for the user trying to debug why his ByteBuffer derived WritableMemory suddenly lost its mapping to a direct ByteBuffer, not realizing that under the covers we did a switcheroo on him because it had zero capacity.

It would be helpful if you could explain what you thought was the most likely scenario that would impact performance. Do you expect to see lots of zero-sized ByteBuffers or zero-sized arrays that you want to wrap? Does this occur frequently in Druid?

I do suggest we change the name of the current one to ZERO_SIZED_ARRAY_MEMORY to make it really clear how it was created.

Public checkValidAndBounds()

I changed the public checkBounds() to checkValidAndBounds().

Nonetheless, do we really need this? Since all of our public methods do this check why would a user need this?

Change WritableHandle Interface

@leventov

This has bothered me for a while, and since we are making other API changes, I am thinking we may as well make this change as well...

WritableHandle cannot extend Handle because both interfaces define the method get(). But if the method in WritableHandle were renamed to getWritable(), WritableHandle could extend Handle, allowing simple casting between the two.
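
The proposed shape can be sketched with minimal interfaces. All types below are hypothetical stand-ins nested in HandleSketch and are reduced to a single method each; only the names get() and getWritable() come from the proposal.

```java
// Hypothetical, heavily reduced stand-ins for the real interfaces.
final class HandleSketch {

  interface Memory {
    long getCapacity();
  }

  interface WritableMemory extends Memory { }

  interface Handle {
    Memory get();
  }

  /** After the rename, the writable handle can extend the read-only one. */
  interface WritableHandle extends Handle {
    WritableMemory getWritable();
  }

  /** Builds a trivial handle whose memory reports the given capacity. */
  static WritableHandle handleOf(int capacity) {
    WritableMemory mem = () -> capacity;
    return new WritableHandle() {
      @Override public Memory get() { return mem; }
      @Override public WritableMemory getWritable() { return mem; }
    };
  }

  public static void main(String[] args) {
    WritableHandle wh = handleOf(8);
    Handle h = wh; // simple upcast, no adapter needed
    System.out.println(h.get().getCapacity());
  }
}
```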

This will likely affect a bunch of our code, but I haven't tried it to see how much it affects. Other users would need a warning and documentation of this change.

Here are two images of the before and after Handle Hierarchy:

[Image: Handle hierarchy, before]

[Image: Handle hierarchy, after]

Thoughts?

Java 9 compatibility

@leventov

Would you mind taking a look at this to see if this is the right way to do this?

I am somewhat concerned whether Java 9 will allow us to wrap ByteBuffer using reflection like we are doing now.
