Comments (18)
I don't know why TimSort is preferred to DPQ for objects.
Actually, I think it's because DPQ is not stable. Stability is useless for primitives on the default comparator. It only matters under the following conditions for primitives:
- Using a custom comparator for primitives
- The comparator must claim two different primitive values are equal.
The conditions are a bit unusual, and it's unclear how important stability would be under those conditions. An example for such a comparator (and a good test case): All evens are equals; all odd are equal; evens are less than odds. Implemented as a one-liner: Integer.compare(a & 1, b & 1)
from eclipse-collections.
Could you share the stack trace you saw?
from eclipse-collections.
I was able to cause the stack overflow issue with 100K element IntList that is in order.
@Test
public void bigIntArrayList()
{
MutableIntList list = IntInterval.oneTo(100_000).toList();
// list.shuffleThis();
list.sortThis(new IntComparator() {
@Override
public int compare(int value1, int value2)
{
return value2 - value1;
}
});
System.out.println(list.get(0));
System.out.println(list.get(10));
}
Shuffling the list or reversing the value2
and value1
in the Comparator
will cause the code to run fine.
Stack trace:
java.lang.StackOverflowError
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
at org.eclipse.collections.impl.utility.primitive.IntQuickSort.sort(IntQuickSort.java:116)
from eclipse-collections.
sortThis(...)
and sortThisBy(...)
methods ultimately delegate to IntQuickSort.sort(...)
for IntLists (and so on for all primitives). IntQuickSort
is basically a classic quicksort with some tweaks so it is recursive. It may be a good idea to move to TimSort as the safer (and stable!) sort algorithm.
from eclipse-collections.
from eclipse-collections.
@javafanboy - thank you for the detailed write up (and for opening the issue).
A couple of thoughts:
- We can make the sorting algorithm pluggable. I see three options for the scope: globally, per collection type (easiest to implement I think), per
sort
call. What do folks think - any of that makes sense? - Separately, replacing the current sort implementation with TimSort seems like a good idea - see the reasons in my perv comment. And if we do the pluggable approach, it can just be the default option. I don't think we should be taking ChatGPT authored code, but a clean room TimSort implementation shouldn't be a problem.
from eclipse-collections.
from eclipse-collections.
Just use Arrays.sort()
from eclipse-collections.
[Primitive]ArrayList.sortThis()
does use Arrays.sort()
for ascending sorting by list value order. You need something else if your requirement is to sort based on a comparator so [Primitive]ArrayList.sortThis(<comparator>)
variants use a bespoke sort implementation.
from eclipse-collections.
Just use
Arrays.sort()
Arrays.sort()
does not accept a Comparator
equivalent for primitive types.
from eclipse-collections.
If the code is taking too much much time to execute, then there can be two possible solutions -
1.Tune the MIN_RUN size : you might find that increasing or decreasing MIN_RUN can yield better performance. Larger values reduce the number of runs, leading to fewer merges but more work during the initial sorting phase.
2.Optimize the merging process.
Here's the optimized merging method
private static void merge(int[] array, int left, int mid, int right, IntComparator comparator) {
if (comparator.compare(array[mid], array[mid + 1]) <= 0) {
// The arrays are already sorted, no need to merge
return;
}
int len1 = mid - left + 1;
int len2 = right - mid;
int[] leftArray = new int[len1];
int[] rightArray = new int[len2];
System.arraycopy(array, left, leftArray, 0, len1);
System.arraycopy(array, mid + 1, rightArray, 0, len2);
int i = 0, j = 0, k = left;
while (i < len1 && j < len2) {
if (comparator.compare(leftArray[i], rightArray[j]) <= 0) {
array[k++] = leftArray[i++];
} else {
array[k++] = rightArray[j++];
}
}
while (i < len1) {
array[k++] = leftArray[i++];
}
while (j < len2) {
array[k++] = rightArray[j++];
}
}
from eclipse-collections.
If the code is taking too much much time to execute, then there can be two possible solutions...
Thanks for your reply and sorry if I was unclear - with TimSort the performance is really good for all the cases I have tried with even with the algorithm I posted above (with fixed MIN_RUN etc.). It was the built in (quicksort based) method that executed really slowly on my already partly sorted data.
I have not seen your proposed optimization in any TimSort implementation but it seems intuitively correct and my test cases still passed with it - the performance improvement was just a precent or two (i.e. quite rare that the sub-arrays are already FULLY sorted and this must be out weighted by an additional comparison) for my data but every improvement counts!
from eclipse-collections.
It also seems like one can replace the two last loops in the merge method with System.arraycopy - my tests still run and it actually shaved of a quite significant ~5% or so with my large arrays.
The code now looks like this:
import org.eclipse.collections.api.block.comparator.primitive.IntComparator;
public class IterativeTimSort {
private static final int MIN_RUN = 32;
public static void sort(int[] array, int length, IntComparator comparator) {
if (length < 0 || length > array.length) {
throw new IllegalArgumentException("Illegal length!");
}
for (int i = 0; i < length; i += MIN_RUN) {
insertionSort(array, i, Math.min(i + MIN_RUN - 1, length - 1), comparator);
}
for (int size = MIN_RUN; size < length; size = 2 * size) {
for (int left = 0; left < length; left += 2 * size) {
final int mid = Math.min(left + size - 1, length - 1);
final int right = Math.min(left + 2 * size - 1, length - 1);
if (mid < right) {
merge(array, left, mid, right, comparator);
}
}
}
}
private static void insertionSort(int[] array, int left, int right, IntComparator comparator) {
for (int i = left + 1; i <= right; i++) {
final int temp = array[i];
int j = i - 1;
while (j >= left && comparator.compare(array[j], temp) > 0) {
array[j + 1] = array[j];
j--;
}
array[j + 1] = temp;
}
}
private static void merge(int[] array, int left, int mid, int right, IntComparator comparator) {
// If the sorted array halves are in the right order and non-overlapping, there is no need to merge
if (comparator.compare(array[mid], array[mid + 1]) <= 0) return;
final int len1 = mid - left + 1;
final int len2 = right - mid;
final int[] leftArray = new int[len1];
final int[] rightArray = new int[len2];
System.arraycopy(array, left, leftArray, 0, len1);
System.arraycopy(array, mid + 1, rightArray, 0, len2);
int i = 0, j = 0, k = left;
while (i < len1 && j < len2) {
if (comparator.compare(leftArray[i], rightArray[j]) <= 0) {
array[k++] = leftArray[i++];
} else {
array[k++] = rightArray[j++];
}
}
if (i < len1) System.arraycopy(leftArray, i, array, k, len1 - i);
if (j < len2) System.arraycopy(rightArray, j, array, k, len2 - j);
}
}
For data that is not partly sorted (i.e. created some random int arrays) Arrays.sort is about 30% faster than the above TimSort so it is probably a better DEFAULT choice than TimSort. This also avoids altering sorting performance for existing code using the Eclipse collections today. The JDK did after all replace TimSort with the current QuickSort derivate already back in version 6 or something and that was probably for a good reason...
from eclipse-collections.
for further performance improvement , can't we use -
private static final int MIN_RUN = 64; ??
from eclipse-collections.
for further performance improvement , can't we use - private static final int MIN_RUN = 64; ??
With my test (sorting 10 million mostly sorted records) going to 64 actually reduces performance with a few percent.
from eclipse-collections.
I had a closer look at this.
TLDR
- There is a genuine bug in our quick sort implementation. It tries to find a better pivot so Don's sorted case doesn't take long, but something is wrong with that code.
- The state of the art general sorting algorithm is probably Java's dual pivot quicksort (dpq). There are various algo's claiming "fastest", but curiously, they don't run benchmarks with dpq.
- There are only two algorithms in the ones I found that are short enough to easily convert to Java and have the right licenses:
- Latest version of timsort: https://github.com/timsort/cpp-TimSort
- More of a long shot, piposort: https://github.com/scandum/piposort
Long Version
Top contenders other than dpq:
- pdq (pattern defeating quicksort): https://github.com/orlp/pdqsort used in Go, but apparently slower than dpq golang/go#61027 (read the issue for the drama about how Go devs don't want to replace it)
- blip: https://github.com/RedBedHed/blipsort no comparison with dpq. Might be portable to Java, but has some code that's very suspect/unportable (use of comparator returned value).
- flux: https://github.com/scandum/fluxsort no comparison with dqp. Would be hard to port to Java, just because of sheer length. Large code is not JIT friendly.
- quad: https://github.com/scandum/quadsort flux depends on this, so this might be more doable, but still very large
Various benchmarks:
https://www.reddit.com/r/cpp/comments/fgxfqa/c_benchmark_timsort_vs_pdqsort_vs_quadsort_vs/
https://github.com/scandum/fluxsort
https://github.com/scandum/quadsort
from eclipse-collections.
@mohrezaei - thank you for looking into this. A couple of questions:
- Any idea why JDK
Arrays.sort(..,[comparator])
methods for Objects use TimSort, whileArrays.sort(...)
for primitive types use DPQ? - Any reason not to try to just go with TimSort as a replacement for the existing
sort()
primitive with comparator implementation - should be an improvement (performance, stability, fixes this issue)?
from eclipse-collections.
The solution here should probably start with a bunch of test cases (random, sorted, reverse sorted, saw ascending, saw descending).
From there, quickest (pun!) might be to fix our quicksort. With a bit more effort, TimSort is probably the next best option.
As a project for someone learning the ropes, doing a comparison with various sort algorithms could be good.
I don't know why TimSort is preferred to DPQ for objects. (Sometimes that's because some internals depend on magnitude, like radix sort, but I don't think that's the case with DPQ)
from eclipse-collections.
Related Issues (20)
- eclipse-collections v12 JDK compatibility support HOT 4
- Interval types do not support reverse ranges for certain helper methods HOT 2
- Add groupByEachUniqueKey to RichIterable HOT 6
- [OSGi] Opting in to Service Loader Mediator HOT 3
- can not found the GMF HOT 1
- Migrate tests from Junit4 to Junit5
- Perf Issue: toImmutable[List/Set/Bag] methods should be overridden on LazyIterable HOT 4
- Upgrade Maven version for CodeQL GitHub Action build to 3.9.6
- Fix performance problem in MutableList.subList() / implement similar optimization as ArrayList.subList() HOT 5
- Implement ArrayStack distinct HOT 1
- Maven install failure: java.util.ConcurrentModificationException in Eclipse Collections OSGi Bundle HOT 1
- Optimize `clear` method in `SubList` class
- Optimize ImmutableArrayList.takeWhile() and dropWhile() for small lists. HOT 1
- Help me find the folder I need on Github HOT 1
- SortedSet#intersect does not always return elements from this HOT 2
- Compact() method for maps doesn't check if a rehash is needed HOT 5
- Custom hashing strategy for primitive maps with object as key would be a very nice new feature!
- Set that return the existing value? HOT 4
- Make the Collectors returned by Collectors2AdditionalTest.sumBy*() methods thread-safe. HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from eclipse-collections.