The LA464 micro-architecture is very sensitive to loop alignment, so
the performance of a linked binary can vary wildly depending on where
certain performance-critical loops happen to land. Now that PCALIGN is
available on loong64, use it to manually align some assembly loops to
16 bytes.

The functions were identified from perf records of go1 benchmark cases
that regress easily. FmtFprintfPrefixedInt, RegexpMatchEasy0_1K and
Revcomp are particularly sensitive: even optimizations that purely
reduce dynamic instruction counts can regress those cases by 6% to 12%,
making the numbers almost useless.
Benchmark results on Loongson 3A5000 (which is an LA464 implementation):
goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 416154  │               this CL                │
                      │   sec/op    │   sec/op     vs base                 │
BinaryTree17             14.10 ± 1%    14.10 ± 1%         ~ (p=1.000 n=10)
Fannkuch11               3.672 ± 0%    3.579 ± 0%    -2.53% (p=0.000 n=10)
FmtFprintfEmpty         94.72n ± 0%   94.73n ± 0%    +0.01% (p=0.000 n=10)
FmtFprintfString        149.9n ± 0%   151.9n ± 0%    +1.33% (p=0.000 n=10)
FmtFprintfInt           154.1n ± 0%   158.3n ± 0%    +2.73% (p=0.000 n=10)
FmtFprintfIntInt        236.2n ± 0%   241.4n ± 0%    +2.20% (p=0.000 n=10)
FmtFprintfPrefixedInt   314.2n ± 0%   320.2n ± 0%    +1.91% (p=0.000 n=10)
FmtFprintfFloat         405.0n ± 0%   414.3n ± 0%    +2.30% (p=0.000 n=10)
FmtManyArgs             933.6n ± 0%   949.9n ± 0%    +1.75% (p=0.000 n=10)
GobDecode               15.51m ± 1%   15.24m ± 0%    -1.77% (p=0.000 n=10)
GobEncode               18.42m ± 4%   18.10m ± 2%         ~ (p=0.631 n=10)
Gzip                    423.6m ± 0%   429.9m ± 0%    +1.49% (p=0.000 n=10)
Gunzip                  88.75m ± 0%   88.31m ± 0%    -0.50% (p=0.000 n=10)
HTTPClientServer        85.44µ ± 0%   85.71µ ± 0%    +0.31% (p=0.035 n=10)
JSONEncode              18.65m ± 0%   19.74m ± 0%    +5.81% (p=0.000 n=10)
JSONDecode              77.75m ± 0%   78.60m ± 1%    +1.09% (p=0.000 n=10)
Mandelbrot200           7.214m ± 0%   7.208m ± 0%         ~ (p=0.481 n=10)
GoParse                 7.616m ± 2%   7.616m ± 1%         ~ (p=0.739 n=10)
RegexpMatchEasy0_32     142.9n ± 0%   133.0n ± 0%    -6.93% (p=0.000 n=10)
RegexpMatchEasy0_1K     1.535µ ± 0%   1.362µ ± 0%   -11.27% (p=0.000 n=10)
RegexpMatchEasy1_32     161.8n ± 0%   161.8n ± 0%         ~ (p=0.628 n=10)
RegexpMatchEasy1_1K     1.635µ ± 0%   1.497µ ± 0%    -8.41% (p=0.000 n=10)
RegexpMatchMedium_32    1.429µ ± 0%   1.420µ ± 0%    -0.63% (p=0.000 n=10)
RegexpMatchMedium_1K    41.86µ ± 0%   42.25µ ± 0%    +0.93% (p=0.000 n=10)
RegexpMatchHard_32      2.144µ ± 0%   2.108µ ± 0%    -1.68% (p=0.000 n=10)
RegexpMatchHard_1K      63.83µ ± 0%   62.65µ ± 0%    -1.86% (p=0.000 n=10)
Revcomp                  1.337 ± 0%    1.192 ± 0%   -10.89% (p=0.000 n=10)
Template                116.4m ± 1%   115.6m ± 2%         ~ (p=0.579 n=10)
TimeParse               421.4n ± 2%   418.1n ± 1%    -0.78% (p=0.001 n=10)
TimeFormat              515.1n ± 0%   517.9n ± 0%    +0.54% (p=0.001 n=10)
geomean                 104.5µ        103.5µ         -0.99%

                      │  CL 416154   │                this CL                │
                      │     B/s      │     B/s       vs base                 │
GobDecode               47.19Mi ± 1%   48.04Mi ± 0%    +1.80% (p=0.000 n=10)
GobEncode               39.73Mi ± 4%   40.44Mi ± 2%         ~ (p=0.631 n=10)
Gzip                    43.68Mi ± 0%   43.04Mi ± 0%    -1.47% (p=0.000 n=10)
Gunzip                  208.5Mi ± 0%   209.6Mi ± 0%    +0.50% (p=0.000 n=10)
JSONEncode              99.21Mi ± 0%   93.76Mi ± 0%    -5.49% (p=0.000 n=10)
JSONDecode              23.80Mi ± 0%   23.55Mi ± 1%    -1.08% (p=0.000 n=10)
GoParse                 7.253Mi ± 2%   7.253Mi ± 1%         ~ (p=0.810 n=10)
RegexpMatchEasy0_32     213.6Mi ± 0%   229.4Mi ± 0%    +7.41% (p=0.000 n=10)
RegexpMatchEasy0_1K     636.3Mi ± 0%   717.3Mi ± 0%   +12.73% (p=0.000 n=10)
RegexpMatchEasy1_32     188.6Mi ± 0%   188.6Mi ± 0%         ~ (p=0.810 n=10)
RegexpMatchEasy1_1K     597.4Mi ± 0%   652.2Mi ± 0%    +9.17% (p=0.000 n=10)
RegexpMatchMedium_32    21.35Mi ± 0%   21.49Mi ± 0%    +0.63% (p=0.000 n=10)
RegexpMatchMedium_1K    23.33Mi ± 0%   23.11Mi ± 0%    -0.94% (p=0.000 n=10)
RegexpMatchHard_32      14.24Mi ± 0%   14.48Mi ± 0%    +1.67% (p=0.000 n=10)
RegexpMatchHard_1K      15.30Mi ± 0%   15.59Mi ± 0%    +1.93% (p=0.000 n=10)
Revcomp                 181.3Mi ± 0%   203.4Mi ± 0%   +12.21% (p=0.000 n=10)
Template                15.89Mi ± 1%   16.00Mi ± 2%         ~ (p=0.542 n=10)
geomean                 59.33Mi        60.72Mi         +2.33%
Change-Id: I9ac28d936e03d21c46bb19fa100018f61ace6b42
Reviewed-on: https://go-review.googlesource.com/c/go/+/479816
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: WANG Xuerui <git@xen0n.name>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
AND $7, R15
BNE R0, R15, byte_loop
+ PCALIGN $16
chunk16_loop:
BEQ R0, R14, byte_loop
MOVV (R6), R8
BEQ R4, R5, eq
MOVV size+16(FP), R6
ADDV R4, R6, R7
+ PCALIGN $16
loop:
BNE R4, R7, test
MOVV $1, R4
ADDV R4, R5 // end
ADDV $-1, R4
+ PCALIGN $16
loop:
ADDV $1, R4
BEQ R4, R5, notfound
ADDV R4, R5 // end
ADDV $-1, R4
+ PCALIGN $16
loop:
ADDV $1, R4
BEQ R4, R5, notfound
// do 8 bytes at a time if there is room
ADDV $-7, R4, R7
+ PCALIGN $16
SGTU R7, R6, R8
BEQ R8, out
MOVV R0, (R6)
// do 8 bytes at a time if there is room
ADDV $-7, R9, R6 // R6 is end pointer-7
+ PCALIGN $16
SGTU R6, R4, R8
BEQ R8, out
MOVV (R5), R7
// do 8 bytes at a time if there is room
ADDV $7, R4, R6 // R6 is start pointer+7
+ PCALIGN $16
SGTU R9, R6, R8
BEQ R8, out1
ADDV $-8, R5