Running this code:
import fj.data.List;
import fj.data.TreeMap;
public class GroupByPerf
{
public static void main(String[] args)
{
System.out.println("warmup");
for (int i = 0; i < 10; i++)
{
testIt(5000);
}
System.out.println("real run");
for (int i = 0; i < 100000; i++)
{
testIt(50000);
}
}
public static void testIt(int size)
{
List<Shrubbery> sh = List.range(0, size).map(x -> new Shrubbery(x%20, String.valueOf(x)));
long now = System.currentTimeMillis();
TreeMap<Long, List<Shrubbery>> grouped = sh.groupBy(x -> x.id);
System.out.println("duration " + (System.currentTimeMillis()-now));
}
public static class Shrubbery
{
public final long id;
public final String name;
public Shrubbery(long id, String name)
{
this.id = id;
this.name = name;
}
}
, yields ~350 millisecond run time for each iteration post the warmup phase, on my machine.
I did a flight recording with jmc, a large portion of the time is spent rebalancing the underlying set, or otherwise in the set.
I took a shot at a naive hashmap implementation, of course the semantic of the method is changed, but they are still grouped:
private static <A, B> HashMap<A, List<B>> groupBy(List<B> list, F<B,A> keyF)
{
HashMap<A, List.Buffer<B>> res = HashMap.hashMap();
list.foreachDoEffect(x-> {
A key = keyF.f(x);
Option<List.Buffer<B>> entry = res.get(key);
if(entry.isSome())
{
entry.some().snoc(x);
}
else
{
res.set(key, List.Buffer.<B>empty().snoc(x));
}
});
return res.map(x->x, x->x.toList());
}
The performance of the above code is roughly on par with javautil Stream. (1-2 millis per run)
Do you think should be at least a variant of the groupBy method which does not use a red-black tree? I am sure there are good reasons the signature is as it is, but also I think 300ms for 50000 elements is not something one will use in production.
Running this code:
, yields ~350 millisecond run time for each iteration post the warmup phase, on my machine.
I did a flight recording with jmc, a large portion of the time is spent rebalancing the underlying set, or otherwise in the set.
I took a shot at a naive hashmap implementation, of course the semantic of the method is changed, but they are still grouped:
The performance of the above code is roughly on par with javautil Stream. (1-2 millis per run)
Do you think should be at least a variant of the groupBy method which does not use a red-black tree? I am sure there are good reasons the signature is as it is, but also I think 300ms for 50000 elements is not something one will use in production.