]> Cypherpunks repositories - gostls13.git/commit
cmd/compile: schedule carry chain arithmetic disjointly
authorPaul E. Murphy <murp@ibm.com>
Tue, 15 Jun 2021 21:57:23 +0000 (16:57 -0500)
committerKeith Randall <khr@golang.org>
Sun, 8 May 2022 20:12:35 +0000 (20:12 +0000)
commitc386269ed8746304b219d5be7d673539ae1e2643
treeaa5a8293f0e5ec249d8e056942f08139be9ef5d1
parent1efe38750a1f29c946fde694738c4ba2ed48ff98
cmd/compile: schedule carry chain arithmetic disjointly

This results in a 1.7-2.4x improvement in native go crypto/elliptic
multiplication operations on PPC64, and similar improvements might
be possible on other architectures which use flags or similar to
represent the carry bit in SSA form.

If it is possible, schedule carry chains independently of each
other to avoid clobbering the carry flag. This is very expensive.

This is done by:

1. Identifying carry bit using, but not creating ops, and lowering
   their priority below all other ops which do not need to be
   placed at the top of a block. This effectively ensures only
   one carry chain will be placed at a time in most important
   cases (crypto/elliptic/internal/fiat contains most of them).

2. Raising the priority of carry bit generating ops to schedule
   later in a block to ensure they are placed as soon as they
   are ready.

Likewise, tuple ops which separate carrying ops are scored
similar to 2 above. This prevents unrelated ops from being
scheduled between carry-dependent operations. This occurs
when unrelated ops are ready to schedule alongside such
tuple ops. This reduces the chances a flag clobbering op
might be placed between two carry-dependent operations.

With PPC64 Add64/Sub64 lowering into SSA and this patch, the net
performance difference in crypto/elliptic benchmarks on P9/ppc64le
are:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.3µs ± 0%    46.9µs ± 0%   +1.34%
ScalarBaseMult/P224                    356µs ± 0%     209µs ± 0%  -41.14%
ScalarBaseMult/P384                   1.20ms ± 0%    0.57ms ± 0%  -52.14%
ScalarBaseMult/P521                   3.38ms ± 0%    1.44ms ± 0%  -57.27%
ScalarMult/P256                        199µs ± 0%     199µs ± 0%   -0.17%
ScalarMult/P224                        357µs ± 0%     212µs ± 0%  -40.56%
ScalarMult/P384                       1.20ms ± 0%    0.58ms ± 0%  -51.86%
ScalarMult/P521                       3.37ms ± 0%    1.44ms ± 0%  -57.32%
MarshalUnmarshal/P256/Uncompressed    2.59µs ± 0%    2.52µs ± 0%   -2.63%
MarshalUnmarshal/P256/Compressed      2.58µs ± 0%    2.52µs ± 0%   -2.06%
MarshalUnmarshal/P224/Uncompressed    1.54µs ± 0%    1.40µs ± 0%   -9.42%
MarshalUnmarshal/P224/Compressed      1.54µs ± 0%    1.39µs ± 0%   -9.87%
MarshalUnmarshal/P384/Uncompressed    2.40µs ± 0%    1.80µs ± 0%  -24.93%
MarshalUnmarshal/P384/Compressed      2.35µs ± 0%    1.81µs ± 0%  -23.03%
MarshalUnmarshal/P521/Uncompressed    3.79µs ± 0%    2.58µs ± 0%  -31.81%
MarshalUnmarshal/P521/Compressed      3.80µs ± 0%    2.60µs ± 0%  -31.67%

Note, P256 uses an asm implementation, thus, little variation is expected.

Updates #40171

Change-Id: I810850e8ff429505424c92d6fe37f99aaa0c6e84
Reviewed-on: https://go-review.googlesource.com/c/go/+/393656
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Filippo Valsorda <valsorda@google.com>
src/cmd/compile/internal/ssa/schedule.go