math/big: remove Direct Sqrt computation
The Float.Sqrt method switches (for performance reasons) between
direct (uses Quo) and inverse (doesn't) computation, depending on the
precision, with threshold 128.
Unfortunately the implementation of recursive division in CL 172018
made Quo slightly slower exactly in the range around and below the
threshold Sqrt is using, so this strategy is no longer profitable.
The new division algorithm allocates more, and this has increased the
amount of allocations performed by Sqrt when using the direct method;
on low precisions the computation is fast, so additional allocations
have an negative impact on performance.
Interestingly, only using the inverse method doesn't just reverse the
effects of the Quo algorithm change, but it seems to make performances
better overall for small precisions:
name old time/op new time/op delta
FloatSqrt/64-4 643ns ± 1% 635ns ± 1% -1.24% (p=0.000 n=10+10)
FloatSqrt/128-4 1.44µs ± 1% 1.02µs ± 1% -29.25% (p=0.000 n=10+10)
FloatSqrt/256-4 1.49µs ± 1% 1.49µs ± 1% ~ (p=0.752 n=10+10)
FloatSqrt/1000-4 3.71µs ± 1% 3.74µs ± 1% +0.87% (p=0.001 n=10+10)
FloatSqrt/10000-4 35.3µs ± 1% 35.6µs ± 1% +0.82% (p=0.002 n=10+9)
FloatSqrt/100000-4 844µs ± 1% 844µs ± 0% ~ (p=0.549 n=10+9)
FloatSqrt/
1000000-4 69.5ms ± 0% 69.6ms ± 0% ~ (p=0.222 n=9+9)
name old alloc/op new alloc/op delta
FloatSqrt/64-4 280B ± 0% 200B ± 0% -28.57% (p=0.000 n=10+10)
FloatSqrt/128-4 504B ± 0% 248B ± 0% -50.79% (p=0.000 n=10+10)
FloatSqrt/256-4 344B ± 0% 344B ± 0% ~ (all equal)
FloatSqrt/1000-4 1.30kB ± 0% 1.30kB ± 0% ~ (all equal)
FloatSqrt/10000-4 13.5kB ± 0% 13.5kB ± 0% ~ (p=0.237 n=10+10)
FloatSqrt/100000-4 123kB ± 0% 123kB ± 0% ~ (p=0.247 n=10+10)
FloatSqrt/
1000000-4 1.83MB ± 1% 1.83MB ± 3% ~ (p=0.779 n=8+10)
name old allocs/op new allocs/op delta
FloatSqrt/64-4 8.00 ± 0% 5.00 ± 0% -37.50% (p=0.000 n=10+10)
FloatSqrt/128-4 11.0 ± 0% 5.0 ± 0% -54.55% (p=0.000 n=10+10)
FloatSqrt/256-4 5.00 ± 0% 5.00 ± 0% ~ (all equal)
FloatSqrt/1000-4 6.00 ± 0% 6.00 ± 0% ~ (all equal)
FloatSqrt/10000-4 6.00 ± 0% 6.00 ± 0% ~ (all equal)
FloatSqrt/100000-4 6.00 ± 0% 6.00 ± 0% ~ (all equal)
FloatSqrt/
1000000-4 10.3 ±13% 10.3 ±13% ~ (p=1.000 n=10+10)
For example, 1.02µs for FloatSqrt/128 is actually better than what I
was getting on the same machine before the Quo changes.
The .8% slowdown on /1000 and /10000 appears to be real and it is
quite baffling (that codepath was not touched at all); it may be
caused by code alignment changes.
Change-Id: Ib03761cdc1055674bc7526d4f3a23d7a25094029
Reviewed-on: https://go-review.googlesource.com/c/go/+/228062
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>