A few weeks ago an interesting question was asked on the StackOverflow. Someone experimented with a recursion in Groovy and stepped into Closure.trampoline() [1]. It quickly turned out that using TrampolineClosure makes a recursive execution slower. Is this a valid behavior, or do we do something wrong?

Why Closure.trampoline()?

Recursive algorithms in Groovy (or Java in general) are quite tricky. Every recursive call adds a frame to the call stack. If we make too many recursive calls, we can hit the call stack limit and cause StackOverflowError. We used to use recursive method calls, but the same problem applies to recursive closure calls. Consider the following example.

Listing 1. Recursive factorial function defined as a closure
def factorial
factorial = { n, acc = 1G -> 1 >= n ? acc : factorial(n - 1, acc * n ) } (1)

assert factorial(13) == 6227020800

In this simple Groovy snippet we define a recursive closure call to compute the factorial of a given number. This recursive closure allows me to compute factorial of 706, but it hits the StackOverflowError for any number starting from 707.

This is where TrampolineClosure comes with a help.

Closures are wrapped in a TrampolineClosure. Upon calling, a trampolined Closure will call the original Closure waiting for its result. If the outcome of the call is another instance of a TrampolineClosure, created perhaps as a result to a call to the trampoline() method, the Closure will again be invoked. This repetitive invocation of returned trampolined Closures instances will continue until a value other than a trampolined Closure is returned. That value will become the final result of the trampoline. That way, calls are made serially, rather than filling the stack.


Source: http://groovy-lang.org/closures.html#_trampoline

The above example refactored to the trampoline version may look like this:

def factorial
factorial = { n, acc = 1G -> 1 >= n ? acc : factorial.trampoline(n - 1, acc * n ) }.trampoline()

assert factorial(13) == 6227020800

The TrampolineClosure makes all recursive calls serial, so in this case computing factorial of numbers larger than 707 is doable.

The performance

What about the performance? Running a quick JMH benchmark shows that:

  • computing factorial(500) without the trampoline takes 0.111 ms (in average) on my computer

  • computing factorial(500) with the trampoline takes 0.196 ms (in average) on my computer

For a single call the difference is not huge. However, in the question asked on the StackOverflow we had to deal with a slightly different scenario. The person who asked this question compared the execution time of computing factorial 100,000 times for a pseudo-random number between 0 and 39.

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.

The benchmark

I created a benchmark that simulates a very similar scenario. In this test we benchmark the execution of calculating factorial 100,000 time using pseudo-random numbers from range <30,34>. This way we can eliminate anomalies (e.g. calculating factorial(0) or factorial(1)) and prevent JIT compiler from inlining method call with fixed argument.

You can clone the JMH benchmark test source code from: https://github.com/wololock/groovy-closure-trampoline-example

The benchmark tests the following scenarios:

  • calculating factorial using recursive closure call,

  • calculating factorial using recursive trampoline closure call,

  • calculating factorial using tail recursive method.

Each scenario is tested in the dynamic and statically compiled variant.

Listing 2. JMH benchmark test of a factorial function defined with closures
package bench

import groovy.transform.CompileStatic
import groovy.transform.TailRecursive
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

@State(Scope.Benchmark)
class FactorialBench {

    static final Random random = new Random()
    static final int limit = 100000

    @Benchmark
    void a_standard() {
        def factorial
        factorial = { n, acc = 1G -> 1 >= n ? acc : factorial(n - 1, acc * n ) }

        for (int i = 0; i < limit; i++) {
            factorial(30 + random.nextInt(5))
        }
    }

    @Benchmark
    @CompileStatic
    void a_standard_sc() {
        Closure factorial
        factorial = { int n, acc = 1G -> 1 >= n ? acc : factorial(n - 1, ((BigInteger) acc) * BigInteger.valueOf((long) n)) }

        for (int i = 0; i < limit; i++) {
            factorial(30 + random.nextInt(5))
        }
    }

    @Benchmark
    void b_trampoline() {
        def factorial
        factorial = { n, acc = 1G -> 1 >= n ? acc : factorial.trampoline(n - 1, acc * n ) }.trampoline()

        for (int i = 0; i < limit; i++) {
            factorial(30 + random.nextInt(5))
        }
    }

    @Benchmark
    @CompileStatic
    void b_trampoline_sc() {
        Closure factorial
        factorial = { int n, acc = 1G -> 1 >= n ? acc : factorial.trampoline(n - 1, ((BigInteger) acc) * BigInteger.valueOf((long) n)) }.trampoline()

        for (int i = 0; i < limit; i++) {
            factorial(30 + random.nextInt(5))
        }
    }

    @Benchmark
    void c_tailRecursive() {
        for (int i = 0; i < limit; i++) {
            factorialTailRecursive(30 + random.nextInt(5))
        }
    }

    @Benchmark
    @CompileStatic
    void c_tailRecursive_sc() {
        for (int i = 0; i < limit; i++) {
            factorialTailRecursiveSC(30 + random.nextInt(5))
        }
    }


    @TailRecursive
    static factorialTailRecursive(n, acc = 1G) {
        1 >= n ? acc : factorialTailRecursive(n - 1, n * acc)
    }

    @TailRecursive
    @CompileStatic
    static BigInteger factorialTailRecursiveSC(int n, BigInteger acc = 1G) {
        1 >= n ? acc : factorialTailRecursiveSC(n - 1, acc * BigInteger.valueOf((long) n))
    }
}

I use JMH Gradle plugin, so executing the following benchmark looks like this:

$ ./gradlew clean jmh --no-daemon

The results

The execution of all benchmark scenarios takes about 8-9 minutes. I run it twice for Groovy 2.5.7 and for Groovy 3.0.0-beta-1 just to check if the upcoming version 3 introduces any performance improvements.

Laptop specs: JDK 1.8.0_201 (Java HotSpot™ 64-Bit Server VM, 25.201-b09), Groovy 2.5.7, Intel® Core™ i7-4900MQ CPU @ 2.80GHz (4 cores, cache size 8192 KB), 16 GB RAM, OS: Fedora 29 (64 bit)

1) Standard recursive closure

Benchmark                                 Mode  Cnt     Score   Error  Units
FactorialBench.a_standard                 avgt   42   472,114 ± 1,315  ms/op
FactorialBench.a_standard_sc              avgt   42   428,717 ± 1,063  ms/op

This is our starting point. We can see that running statically compiled code is approximately 9% faster.

And here is the stack profiler output.

....[Thread state: RUNNABLE]........................................................................
 43,6%  43,6% org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.pickClosureMethod
 21,6%  21,6% sun.reflect.DelegatingMethodAccessorImpl.invoke
 21,2%  21,2% org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrap.invoke
  5,5%   5,6% groovy.lang.MetaMethod.doMethodInvoke
  3,6%   3,6% groovy.lang.MetaClassImpl.invokeMethod
  3,1%   3,1% sun.reflect.GeneratedMethodAccessor2.invoke
  0,4%   0,4% java.math.BigInteger.multiply
  0,3%   0,3% org.codehaus.groovy.runtime.ArrayUtil.createArray
  0,2%   0,2% java.util.Arrays.copyOf
  0,2%   0,2% java.util.Arrays.copyOfRange
  0,4%   0,4% <other>

2) Trampoline closure

Benchmark                                 Mode  Cnt     Score   Error  Units
FactorialBench.b_trampoline               avgt   42  1054,703 ± 3,805  ms/op
FactorialBench.b_trampoline_sc            avgt   42   683,293 ± 2,191  ms/op

The benchmark test shows that the (dynamic) trampoline variant of the factorial function is about 2.23 times slower compared to the standard recursive closure approach.

Here is the stack profiler output for the trampoline closure variant:

....[Thread state: RUNNABLE]........................................................................
 22,6%  22,6% org.codehaus.groovy.reflection.ParameterTypes.isValidMethod
 21,8%  21,8% org.codehaus.groovy.reflection.ParameterTypes.coerceArgumentsToClasses
 15,5%  15,5% org.codehaus.groovy.reflection.ParameterTypes.correctArguments
 11,5%  11,5% groovy.lang.Closure.<init>
  9,4%   9,4% sun.reflect.DelegatingMethodAccessorImpl.invoke
  8,5%   8,5% groovy.lang.TrampolineClosure.<init>
  7,1%   7,1% org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.pickClosureMethod
  0,4%   0,4% groovy.lang.MetaClassImpl.invokeMethod
  0,3%   0,3% org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrap.invoke
  0,3%   0,3% java.math.BigInteger.multiply
  2,6%   2,6% <other>

There are two major differences between trampoline and the standard recursive closure:

  • Creating TrampolineClosure objects for each recursive call comes with a cost. In this specific case it took 210.8 ms (20% of the total time).

  • Calling Closure.trampoline(args) requires arguments coercion, which takes ~393 ms (37.3% of the total time).

Static compilation helps a bit - it executes in 683.293 ms average time. If we take a look at the call stack from the profiler, we see that static compilation removed the need of the arguments coercion. The main additional cost comes with the TrampolineClosure objects creation - it takes 185.172 ms (27,1% of the total time).

....[Thread state: RUNNABLE].....................................................................
 29,2%  29,2% org.codehaus.groovy.reflection.ParameterTypes.isValidMethod
 24,5%  24,5% java.math.BigInteger.multiply
 16,7%  16,7% groovy.lang.Closure.<init>
 10,4%  10,4% groovy.lang.TrampolineClosure.<init>
  9,8%   9,8% org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.pickClosureMethod
  4,0%   4,0% org.codehaus.groovy.runtime.metaclass.MetaMethodIndex.getMethods
  2,4%   2,4% org.codehaus.groovy.reflection.ParameterTypes.correctArguments
  0,7%   0,7% sun.reflect.GeneratedMethodAccessor1.invoke
  0,5%   0,5% groovy.lang.MetaClassImpl.invokeMethod
  0,3%   0,3% java.util.Arrays.copyOf
  1,3%   1,3% <other>

3) Tail recursive method

Benchmark                                 Mode  Cnt     Score   Error  Units
FactorialBench.c_tailRecursive            avgt   42   322,019 ± 1,409  ms/op
FactorialBench.c_tailRecursive_sc         avgt   42   104,385 ± 1,380  ms/op

The best performance comes with a tail recursive method. And it shouldn’t be a surprise - Groovy compiler compiles a method annotated with @TailRecursive to a bytecode that uses while-loop instead of the recursive calls. (Read more about in the Tail-recursive methods in Groovy post.) That is why the call stack is dominated by two operations:

  • java.math.BigInteger object initialization,

  • and java.math.BigInteger.multiply() method call.

....[Thread state: RUNNABLE]........................................................................
 41,5%  41,5% java.math.BigInteger.<init>
 26,8%  26,8% java.math.BigInteger.multiply
 15,9%  15,9% java.lang.Integer.getChars
 14,2%  14,2% java.lang.Integer.stringSize
  0,5%   0,5% java.math.BigInteger.multiplyByInt
  0,3%   0,3% bench.FactorialBench.factorialTailRecursive
  0,2%   0,2% java.lang.Integer.toString
  0,2%   0,2% org.codehaus.groovy.runtime.typehandling.NumberMath.toBigInteger
  0,2%   0,2% java.util.Arrays.copyOfRange
  0,0%   0,0% org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call

The statically compiled variant of the tail recursive method does even better. In this case it spends 97.1% of the time calling NumberMath.multiply.

....[Thread state: RUNNABLE]........................................................................
 97,1%  97,2% org.codehaus.groovy.runtime.typehandling.NumberMath.multiply
  1,1%   1,1% bench.FactorialBench.factorialTailRecursiveSC
  0,8%   0,8% java.math.BigInteger.multiplyByInt
  0,2%   0,2% java.math.BigInteger.valueOf
  0,2%   0,2% bench.FactorialBench.c_tailRecursive_sc
  0,2%   0,2% java.util.Arrays.copyOfRange
  0,1%   0,1% java.lang.Thread.isInterrupted
  0,1%   0,1% java.math.BigInteger.<init>
  0,1%   0,1% java.lang.System.nanoTime
  0,0%   0,0% sun.misc.Unsafe.unpark

Does Groovy 3.0.0-beta-1 do better?

Short answer - no. The results comparable. Even if Groovy 2.5.7 of 3.0.0-beta-1 does a little bit better in one variant or another, it doesn’t prove anything. The tendencies are the same in both cases.

groovy trampoline 01

Here you can find the full console output from the benchmark tests I use:

Conclusion

The final question is - should we avoid using Closure.trampoline() then? Absolutely not. If you use recursive calls in the closure, you should consider using the TrampolineClosure to avoid hitting the call stack size limit. The cost of using TrampolineClosure in the statically compiled Groovy code becomes a trouble only when you need to handle hundreds of thousands calls that invoke recursive closure. However, in this case you can also consider refactoring to the tail recursive method call for the best performance. But remember what Donald Knuth said: "Premature optimization is the root of all evil." [2] Use whatever programming construction that works for you (and your team) best, and solve performance problems when they start occurring.


2. Donald Knuth, "Computer Programming as an Art (1974)", p. 671

Szymon Stepniak

Groovista, Upwork's Top Rated freelancer, Toruń Java User Group founder, open source contributor, Stack Overflow addict, bedroom guitar player. I walk through e.printStackTrace() so you don't have to.