Cypherpunks repositories - gostls13.git/commit

cmd/internal/obj/arm64: make sure prologue and epilogue are pattern matched for small frames

CL 379075 implemented function prologue/epilogue with STP/LDP.
To fix issue #53374, CL 412474 reverted the prologue STP change for
small frames, and the LDP in epilogue was kept. The current instructions
are:

  prologue:
    MOVD.W R30, -offset(RSP)
    MOVD R29, -8(RSP)
  epilogue:
    LDP -8(RSP), (R29, R30)
    ADD $offset, RSP, RSP

It seems a bit strange, as:

1) The prolog and epilogue are not in the same pattern (either STR-LDR,
   or STP-LDP).
2) Go Internal ABI defines that R30 is saved at 0(RSP) and R29 is saved
   at -8(RSP), so we can not use a single STP.W/LDP.P to save/restore
   LR&FP and adjust SP. Changing the ABI causes too much complexity,
   and the benefit is not that big.

This patch reverts the small frames' epilogue change in CL 379075. It
converts LDP in the epilogue to LDR-LDR. Another solution is to re-apply
the STP change in prologue, which requires to fix #53609. This seems the
easier and safer solution in the mean time. The new instructions are:

  prologue:
    MOVD.W R30, -offset(RSP)
    MOVD R29, -8(RSP)
  epilogue:
    MOVD -8(RSP), R29
    MOVD.P offset(RSP), R30

The current pattern may cause performance issues in Store-Forwarding on
micro-architectures like AmpereOne. Assuming a function call in the
middle of such code is short enough that the stores are still around,
then the LDP executes and it may wait longer to get the results from
separated stores in Store Buffers other than single STP.

Store-Forwarding aims to improve the efficiency of the processor by
allowing data to be forwarded directly from a store operation to a
subsequent load operation when certain conditions are met. See the
paper: "Memory Barriers: a Hardware View for Software Hackers"
(chapter 3.2: Store Forwarding).

The performance of following ARM64 Linux servers were tested:

1) AmpereOne (ARM v8.6+) from Ampere Computing.
2) Ampere Altra (ARM Neoverse N1) from Ampere Computing.
3) Graviton2 (ARM Neoverse N1) from AWS.

The effect of this change depends the hardware implementation of
store-forwarding. It can obviously improve AmpereOne, especially for
small functions that are frequently called and returned quickly.
E.g., JSON Marshal/Unmarshal benchmarks on AmpereOne:

    goos: linux
    goarch: arm64
    pkg: encoding/json
                             │ ampere-one.base │            ampere-one.new            │
                             │     sec/op      │    sec/op     vs base                │
    CodeMarshal-8                 882.1µ ±  1%   779.6µ ±  1%  -11.62% (p=0.000 n=10)
    CodeMarshalError-8            961.5µ ±  0%   855.7µ ±  1%  -11.01% (p=0.000 n=10)
    MarshalBytes/32-8             207.6n ±  1%   187.8n ±  0%   -9.52% (p=0.000 n=10)
    MarshalBytes/256-8            501.0n ±  1%   482.6n ±  1%   -3.68% (p=0.000 n=10)
    MarshalBytes/4096-8           5.336µ ±  1%   5.074µ ±  1%   -4.92% (p=0.000 n=10)
    MarshalBytesError/32-8        242.3µ ±  2%   205.7µ ±  3%  -15.08% (p=0.000 n=10)
    MarshalBytesError/256-8       242.4µ ±  1%   205.2µ ±  2%  -15.35% (p=0.000 n=10)
    MarshalBytesError/4096-8      247.9µ ±  0%   210.1µ ±  1%  -15.24% (p=0.000 n=10)
    MarshalMap-8                  150.8n ±  1%   145.7n ±  0%   -3.35% (p=0.000 n=10)
    EncodeMarshaler-8             50.30n ± 26%   54.48n ±  6%        ~ (p=0.739 n=10)
    CodeUnmarshal-8               4.796m ±  2%   4.055m ±  1%  -15.45% (p=0.000 n=10)
    CodeUnmarshalReuse-8          4.260m ±  1%   3.496m ±  1%  -17.94% (p=0.000 n=10)
    UnmarshalString-8             73.89n ±  1%   65.83n ±  1%  -10.91% (p=0.000 n=10)
    UnmarshalFloat64-8            60.63n ±  1%   58.66n ± 25%        ~ (p=0.143 n=10)
    UnmarshalInt64-8              55.62n ±  1%   53.25n ± 22%        ~ (p=0.468 n=10)
    UnmarshalMap-8                255.3n ±  1%   230.3n ±  1%   -9.77% (p=0.000 n=10)
    UnmarshalNumber-8             467.2n ±  1%   367.0n ±  0%  -21.43% (p=0.000 n=10)
    geomean                       6.224µ         5.605µ         -9.94%

Other ARM64 micro-architectures may be not affected so much by such
issue. E.g., benchmarks on Ampere Altra and Graviton2 show slight
improvements:

                             │ altra.base  │             altra.new              │
                             │   sec/op    │   sec/op     vs base               │
    CodeMarshal-8              980.1µ ± 1%   977.3µ ± 1%       ~ (p=0.912 n=10)
    CodeMarshalError-8         1.109m ± 3%   1.096m ± 5%       ~ (p=0.971 n=10)
    MarshalBytes/32-8          246.8n ± 1%   245.4n ± 0%  -0.55% (p=0.002 n=10)
    MarshalBytes/256-8         590.9n ± 1%   606.6n ± 1%  +2.67% (p=0.000 n=10)
    MarshalBytes/4096-8        6.351µ ± 1%   6.376µ ± 1%       ~ (p=0.183 n=10)
    MarshalBytesError/32-8     245.3µ ± 2%   246.1µ ± 2%       ~ (p=0.684 n=10)
    MarshalBytesError/256-8    245.5µ ± 1%   248.7µ ± 2%       ~ (p=0.218 n=10)
    MarshalBytesError/4096-8   254.2µ ± 1%   254.9µ ± 1%       ~ (p=0.481 n=10)
    MarshalMap-8               152.7n ± 2%   151.5n ± 3%       ~ (p=0.782 n=10)
    EncodeMarshaler-8          45.95n ± 7%   42.88n ± 5%  -6.70% (p=0.014 n=10)
    CodeUnmarshal-8            5.121m ± 4%   5.125m ± 3%       ~ (p=0.579 n=10)
    CodeUnmarshalReuse-8       4.616m ± 3%   4.634m ± 2%       ~ (p=0.529 n=10)
    UnmarshalString-8          72.12n ± 2%   72.20n ± 2%       ~ (p=0.912 n=10)
    UnmarshalFloat64-8         64.44n ± 5%   63.20n ± 4%       ~ (p=0.393 n=10)
    UnmarshalInt64-8           61.49n ± 2%   58.14n ± 4%  -5.45% (p=0.002 n=10)
    UnmarshalMap-8             263.6n ± 2%   266.2n ± 1%       ~ (p=0.196 n=10)
    UnmarshalNumber-8          464.7n ± 1%   464.0n ± 0%       ~ (p=0.566 n=10)
    geomean                    6.617µ        6.575µ       -0.64%

                             │ graviton2.base │            graviton2.new            │
                             │     sec/op     │    sec/op     vs base               │
    CodeMarshal-8                1.122m ±  0%   1.118m ±  1%       ~ (p=0.052 n=10)
    CodeMarshalError-8           1.216m ±  1%   1.214m ±  0%       ~ (p=0.631 n=10)
    MarshalBytes/32-8            289.9n ±  0%   280.8n ±  0%  -3.17% (p=0.000 n=10)
    MarshalBytes/256-8           675.9n ±  0%   664.7n ±  0%  -1.66% (p=0.000 n=10)
    MarshalBytes/4096-8          6.884µ ±  0%   6.885µ ±  0%       ~ (p=0.565 n=10)
    MarshalBytesError/32-8       293.1µ ±  2%   288.9µ ±  2%       ~ (p=0.123 n=10)
    MarshalBytesError/256-8      296.0µ ±  3%   289.0µ ±  1%  -2.36% (p=0.019 n=10)
    MarshalBytesError/4096-8     300.4µ ±  1%   295.6µ ±  0%  -1.60% (p=0.000 n=10)
    MarshalMap-8                 168.8n ±  1%   168.8n ±  1%       ~ (p=1.000 n=10)
    EncodeMarshaler-8            53.77n ±  8%   50.05n ± 12%       ~ (p=0.579 n=10)
    CodeUnmarshal-8              5.875m ±  2%   5.882m ±  1%       ~ (p=0.796 n=10)
    CodeUnmarshalReuse-8         5.383m ±  1%   5.366m ±  0%       ~ (p=0.631 n=10)
    UnmarshalString-8            74.59n ±  1%   73.99n ±  0%  -0.80% (p=0.001 n=10)
    UnmarshalFloat64-8           68.52n ±  7%   64.19n ± 18%       ~ (p=0.868 n=10)
    UnmarshalInt64-8             65.32n ± 13%   62.24n ±  8%       ~ (p=0.138 n=10)
    UnmarshalMap-8               290.1n ±  0%   291.3n ±  0%  +0.43% (p=0.010 n=10)
    UnmarshalNumber-8            514.4n ±  0%   499.4n ±  0%  -2.93% (p=0.000 n=10)
    geomean                      7.459µ         7.317µ        -1.91%

Change-Id: If27386fc5f514b76bdaf2012c2ce86cc65f7ca5b
Reviewed-on: https://go-review.googlesource.com/c/go/+/621775
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

author	Hao Liu <hliu@amperecomputing.com>
	Tue, 22 Oct 2024 08:14:48 +0000 (01:14 -0700)
committer	Cherry Mui <cherryyz@google.com>
	Thu, 31 Oct 2024 01:28:37 +0000 (01:28 +0000)
commit	6d39245514c675cdea5c7fd7f778e97bf0728dd5
tree	4eccdbd12652aee3f792acd09bedb4aedcea36a3	tree
parent	4efd5191653293d98ba2a89ea9a5667bbf9abd15	commit \| diff