Skip to content

Poor groupBy performance #288

@dimitarg

Description

@dimitarg

Running this code:

import fj.data.List;
import fj.data.TreeMap;

public class GroupByPerf
{
    public static void main(String[] args)
    {
        System.out.println("warmup");
        for (int i = 0; i < 10; i++)
        {
            testIt(5000);

        }

        System.out.println("real run");
        for (int i = 0; i < 100000; i++)
        {
            testIt(50000);

        }
    }

    public static void testIt(int size)
    {
        List<Shrubbery> sh = List.range(0, size).map(x -> new Shrubbery(x%20, String.valueOf(x)));
        long now = System.currentTimeMillis();
        TreeMap<Long, List<Shrubbery>> grouped = sh.groupBy(x -> x.id);
        System.out.println("duration " + (System.currentTimeMillis()-now));
    }

    public static class Shrubbery
    {
        public final long id;
        public final String name;

        public Shrubbery(long id, String name)
        {
            this.id = id;
            this.name = name;
        }
    }

, yields ~350 millisecond run time for each iteration post the warmup phase, on my machine.

I did a flight recording with jmc, a large portion of the time is spent rebalancing the underlying set, or otherwise in the set.

I took a shot at a naive hashmap implementation, of course the semantic of the method is changed, but they are still grouped:

private static <A, B> HashMap<A, List<B>> groupBy(List<B> list, F<B,A> keyF)
    {
        HashMap<A, List.Buffer<B>> res = HashMap.hashMap();
        list.foreachDoEffect(x-> {
            A key = keyF.f(x);
            Option<List.Buffer<B>> entry = res.get(key);
            if(entry.isSome())
            {
                entry.some().snoc(x);
            }
            else
            {
                res.set(key, List.Buffer.<B>empty().snoc(x));
            }
        });

        return res.map(x->x, x->x.toList());
    }

The performance of the above code is roughly on par with javautil Stream. (1-2 millis per run)

Do you think should be at least a variant of the groupBy method which does not use a red-black tree? I am sure there are good reasons the signature is as it is, but also I think 300ms for 50000 elements is not something one will use in production.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions