cmd/internal/obj/arm64: make sure prologue and epilogue are pattern matched for small frames

CL 379075 implemented function prologue/epilogue with STP/LDP.
To fix issue #53374, CL 412474 reverted the prologue STP change for
small frames, but the LDP in the epilogue was kept. As a result, small
frames currently save LR and FP with two separate stores in the
prologue and restore them with a single LDP in the epilogue. There are
two problems with this:
1) The prologue and epilogue do not use the same pattern (either
STR-LDR or STP-LDP).
2) The Go internal ABI defines that R30 is saved at 0(RSP) and R29 is
saved at -8(RSP), so we cannot use a single STP.W/LDP.P to save/restore
LR and FP and adjust SP. Changing the ABI would add too much
complexity, and the benefit is not that big.
This patch reverts the small-frame epilogue change from CL 379075 by
converting the LDP in the epilogue to LDR-LDR. Another solution is to
re-apply the STP change in the prologue, which requires fixing #53609.
Reverting the epilogue seems the easier and safer solution in the
meantime. After this patch, small frames use STR-STR in the prologue
and LDR-LDR in the epilogue.
The mismatched STR-STR/LDP pattern may cause store-forwarding
performance issues on micro-architectures like AmpereOne. If a function
call in the middle of such code is short enough that the prologue's
stores are still in the store buffer when the LDP executes, the LDP may
wait longer to get its data from two separate stores than it would from
a single STP.
Store-Forwarding aims to improve the efficiency of the processor by
allowing data to be forwarded directly from a store operation to a
subsequent load operation when certain conditions are met. See the
paper: "Memory Barriers: a Hardware View for Software Hackers"
(chapter 3.2: Store Forwarding).
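As a rough illustration of the kind of code this affects, the sketch
below shows a small non-leaf function that makes a very short call, so
its epilogue reloads LR and FP soon after its prologue stored them. The
names add, caller and run are hypothetical and are not taken from this
CL; whether the compiler actually emits a small frame for caller depends
on the toolchain version, and the timing loop is only a crude stand-in
for a proper go test -bench run.

```go
package main

import (
	"fmt"
	"time"
)

// add is a tiny leaf function (hypothetical example). It makes no calls,
// so it is expected to need no LR/FP save of its own. The noinline
// directive keeps it as a real call instead of being inlined.
//
//go:noinline
func add(a, b int) int {
	return a + b
}

// caller is a small non-leaf function (hypothetical example). Because it
// calls add, it is expected to get a small frame: its prologue stores
// LR/FP and its epilogue reloads them shortly afterwards.
//
//go:noinline
func caller(x int) int {
	return add(x, 1) + 1
}

// run calls caller n times, feeding each result into the next call so the
// calls cannot be optimized away. Each call adds 2 to the running value.
func run(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s = caller(s)
	}
	return s
}

func main() {
	const n = 10_000_000
	start := time.Now()
	s := run(n)
	elapsed := time.Since(start)
	// Crude per-call timing; not a substitute for a real benchmark.
	fmt.Printf("result=%d, about %.2f ns per call\n",
		s, float64(elapsed.Nanoseconds())/float64(n))
}
```

The generated prologue and epilogue for caller can be inspected with
go tool objdump on the built binary to see which pattern a given
toolchain emits.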
The performance of the following ARM64 Linux servers was tested:
1) AmpereOne (ARM v8.6+) from Ampere Computing.
2) Ampere Altra (ARM Neoverse N1) from Ampere Computing.
3) Graviton2 (ARM Neoverse N1) from AWS.
The effect of this change depends on the hardware implementation of
store-forwarding. It can clearly improve performance on AmpereOne,
especially for small functions that are called frequently and return
quickly.
E.g., JSON Marshal/Unmarshal benchmarks on AmpereOne: