This results in a 1.7-2.4x improvement in native Go crypto/elliptic
multiplication operations on PPC64, and similar improvements might
be possible on other architectures which use flags or similar to
represent the carry bit in SSA form.
Where possible, schedule carry chains independently of each other
to avoid clobbering the carry flag. Clobbering it is very expensive.
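For illustration, a minimal sketch of what a carry chain looks like
in Go source, using math/bits.Add64. The helper add128 is
hypothetical and not part of this change; it only shows two chains
that have no data dependency on each other, so a scheduler is free
to interleave them or keep them apart.

```go
package main

import (
	"fmt"
	"math/bits"
)

// add128 is a hypothetical helper that adds two 128-bit values held
// as (hi, lo) pairs of uint64 words. The two bits.Add64 calls form
// one carry chain: the second consumes the carry the first produces.
func add128(aHi, aLo, bHi, bLo uint64) (hi, lo uint64) {
	var c uint64
	lo, c = bits.Add64(aLo, bLo, 0) // creates a carry, uses none
	hi, _ = bits.Add64(aHi, bHi, c) // uses the carry; carry out is discarded
	return hi, lo
}

func main() {
	// Two independent carry chains. If their ops were interleaved,
	// each chain's carry would have to survive the other chain's
	// flag-setting ops.
	h1, l1 := add128(0, ^uint64(0), 0, 1) // low word overflows into hi
	h2, l2 := add128(1, 2, 3, 4)
	fmt.Println(h1, l1, h2, l2) // prints: 1 0 4 6
}
```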
This is done by:
1. Identifying ops which use, but do not create, a carry bit, and
lowering their priority below all other ops which do not need to
be placed at the top of a block. In the most important cases
(crypto/elliptic/internal/fiat contains most of them), this
effectively ensures only one carry chain is placed at a time.
2. Raising the priority of carry bit generating ops to schedule
later in a block to ensure they are placed as soon as they
are ready.
Likewise, tuple ops which separate carrying ops are scored
similarly to 2 above. This prevents unrelated ops, which may be
ready to schedule alongside such tuple ops, from being scheduled
between carry-dependent operations, and so reduces the chance
that a flag-clobbering op is placed between them.
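As a rough illustration of the distinction drawn in 1 and 2 above,
the sketch below sorts ops into the two carry-related classes. The
carryOp type, the class names, and classify are all hypothetical and
are not the compiler's data structures; how each class maps to an
actual scheduling score is left out.

```go
package main

import "fmt"

// carryOp is a hypothetical, simplified model of an arithmetic op.
type carryOp struct {
	name       string
	usesCarry  bool // consumes a carry bit
	makesCarry bool // produces a carry bit
}

type class int

const (
	classDefault       class = iota // unrelated to carry chains
	classCarryProducer              // creates a carry bit (item 2)
	classCarryTail                  // uses a carry bit, creates none (item 1)
)

func (c class) String() string {
	switch c {
	case classCarryProducer:
		return "carry producer"
	case classCarryTail:
		return "carry chain tail"
	default:
		return "default"
	}
}

// classify checks makesCarry first, so an op in the middle of a
// chain (carry in and carry out) counts as a producer.
func classify(o carryOp) class {
	switch {
	case o.makesCarry:
		return classCarryProducer
	case o.usesCarry:
		return classCarryTail
	default:
		return classDefault
	}
}

func main() {
	ops := []carryOp{
		{name: "add low word", makesCarry: true},
		{name: "add middle word", usesCarry: true, makesCarry: true},
		{name: "add high word", usesCarry: true},
		{name: "unrelated load"},
	}
	for _, o := range ops {
		fmt.Printf("%-16s %s\n", o.name, classify(o))
	}
}
```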
With PPC64 Add64/Sub64 lowering into SSA and this patch, the net
performance difference in crypto/elliptic benchmarks on P9/ppc64le
is: