runtime, cmd/internal/obj/arm: improve arm function prologue
When stack growth is not needed, as it usually is not,
execute only a single conditional branch
rather than three conditional instructions.
This adds 4 bytes to every function,
but might speed up execution in the common case.
Sample disassembly for

func f() {
	_ = [128]byte{}
}
Before:
TEXT main.f(SB) x.go
	x.go:3	0x2000	e59a1008	MOVW 0x8(R10), R1
	x.go:3	0x2004	e59fb028	MOVW 0x28(R15), R11
	x.go:3	0x2008	e08d200b	ADD R11, R13, R2
	x.go:3	0x200c	e1520001	CMP R1, R2
	x.go:3	0x2010	91a0300e	MOVW.LS R14, R3
	x.go:3	0x2014	9b0118a9	BL.LS runtime.morestack_noctxt(SB)
	x.go:3	0x2018	9afffff8	B.LS main.f(SB)
	x.go:3	0x201c	e52de084	MOVW.W R14, -0x84(R13)
	x.go:4	0x2020	e28d1004	ADD $4, R13, R1
	x.go:4	0x2024	e3a00000	MOVW $0, R0
	x.go:4	0x2028	eb012255	BL 0x4a984
	x.go:5	0x202c	e49df084	RET #132
	x.go:5	0x2030	eafffffe	B 0x2030
	x.go:5	0x2034	ffffff7c	?
After:
TEXT main.f(SB) x.go
	x.go:3	0x2000	e59a1008	MOVW 0x8(R10), R1
	x.go:3	0x2004	e59fb02c	MOVW 0x2c(R15), R11
	x.go:3	0x2008	e08d200b	ADD R11, R13, R2
	x.go:3	0x200c	e1520001	CMP R1, R2
	x.go:3	0x2010	9a000004	B.LS 0x2028
	x.go:3	0x2014	e52de084	MOVW.W R14, -0x84(R13)
	x.go:4	0x2018	e28d1004	ADD $4, R13, R1
	x.go:4	0x201c	e3a00000	MOVW $0, R0
	x.go:4	0x2020	eb0124dc	BL 0x4b398
	x.go:5	0x2024	e49df084	RET #132
	x.go:5	0x2028	e1a0300e	MOVW R14, R3
	x.go:5	0x202c	eb011b0d	BL runtime.morestack_noctxt(SB)
	x.go:5	0x2030	eafffff2	B main.f(SB)
	x.go:5	0x2034	eafffffe	B 0x2034
	x.go:5	0x2038	ffffff7c	?
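
For readers who want to check the listings by hand, here is a minimal sketch that decodes the branch and literal-load encodings shown above. It assumes standard A32 encoding, where PC reads as the instruction address plus 8. The helper names (branchTarget, literalAddr) are illustrative only and are not part of the toolchain.

```go
package main

import "fmt"

// branchTarget decodes an A32 B/BL instruction word located at addr and
// returns the address it branches to. (Hypothetical helper, for illustration.)
func branchTarget(addr, word uint32) uint32 {
	// The low 24 bits are a signed offset counted in words. Shifting left
	// by 8 and then arithmetically right by 6 sign-extends the field and
	// multiplies it by 4 in one step.
	off := int32(word<<8) >> 6
	// In ARM state, PC reads as the instruction address plus 8.
	return addr + 8 + uint32(off)
}

// literalAddr returns the address read by a PC-relative load such as
// MOVW imm(R15), Rd located at addr. (Hypothetical helper, for illustration.)
func literalAddr(addr, imm uint32) uint32 {
	return addr + 8 + imm
}

func main() {
	// After: B.LS at 0x2010 jumps to the out-of-line morestack call.
	fmt.Printf("%#x\n", branchTarget(0x2010, 0x9a000004)) // 0x2028
	// After: B at 0x2030 restarts the function once the stack has grown.
	fmt.Printf("%#x\n", branchTarget(0x2030, 0xeafffff2)) // 0x2000
	// Before: B.LS at 0x2018 did the same restart, conditionally.
	fmt.Printf("%#x\n", branchTarget(0x2018, 0x9afffff8)) // 0x2000

	// The literal moved from 0x2034 to 0x2038, hence 0x28 becoming 0x2c.
	fmt.Printf("%#x\n", literalAddr(0x2004, 0x28)) // 0x2034
	fmt.Printf("%#x\n", literalAddr(0x2004, 0x2c)) // 0x2038

	// The trailing word is the negated frame size loaded into R11.
	var lit uint32 = 0xffffff7c
	fmt.Println(int32(lit)) // -132
}
```

The decoded targets match the listings: the single B.LS in the new prologue lands on the morestack sequence placed after RET, and the literal load offset grows by 4 because the function is one instruction longer.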
Updates #10587.
package sort benchmarks on an iPhone 6:
name            old time/op  new time/op  delta
SortString1K     569µs ± 0%   565µs ± 1%  -0.75%  (p=0.000 n=23+24)
StableString1K   872µs ± 1%   870µs ± 1%  -0.16%  (p=0.009 n=23+24)
SortInt1K        317µs ± 2%   316µs ± 2%    ~     (p=0.410 n=26+26)
StableInt1K      343µs ± 1%   339µs ± 1%  -1.07%  (p=0.000 n=22+23)
SortInt64K      30.0ms ± 1%  30.0ms ± 1%    ~     (p=0.091 n=25+24)
StableInt64K    30.2ms ± 0%  30.0ms ± 0%  -0.69%  (p=0.000 n=22+22)
Sort1e2          147µs ± 1%   146µs ± 0%  -0.48%  (p=0.000 n=25+24)
Stable1e2        290µs ± 1%   286µs ± 1%  -1.30%  (p=0.000 n=23+24)
Sort1e4         29.5ms ± 2%  29.7ms ± 1%  +0.71%  (p=0.000 n=23+23)
Stable1e4       88.7ms ± 4%  88.6ms ± 8%  -0.07%  (p=0.022 n=26+26)
Sort1e6          4.81s ± 7%   4.83s ± 7%    ~     (p=0.192 n=26+26)
Stable1e6        18.3s ± 1%   18.1s ± 1%  -0.76%  (p=0.000 n=25+23)
SearchWrappers   318ns ± 1%   344ns ± 1%  +8.14%  (p=0.000 n=23+26)
package sort benchmarks on a first generation rpi:
name            old time/op  new time/op  delta
SearchWrappers  4.13µs ± 0%  3.95µs ± 0%   -4.42%  (p=0.000 n=15+13)
SortString1K    5.81ms ± 1%  5.82ms ± 2%     ~     (p=0.400 n=14+15)
StableString1K  9.69ms ± 1%  9.73ms ± 0%     ~     (p=0.121 n=15+11)
SortInt1K       3.30ms ± 2%  3.66ms ±19%  +10.82%  (p=0.000 n=15+14)
StableInt1K     5.97ms ±15%  4.17ms ± 8%  -30.05%  (p=0.000 n=15+15)
SortInt64K       319ms ± 1%   295ms ± 1%   -7.65%  (p=0.000 n=15+15)
StableInt64K     343ms ± 0%   332ms ± 0%   -3.26%  (p=0.000 n=12+13)
Sort1e2         3.36ms ± 2%  3.22ms ± 4%   -4.10%  (p=0.000 n=15+15)
Stable1e2       6.74ms ± 1%  6.43ms ± 2%   -4.67%  (p=0.000 n=15+15)
Sort1e4          247ms ± 1%   247ms ± 1%     ~     (p=0.331 n=15+14)
Stable1e4        864ms ± 0%   820ms ± 0%   -5.15%  (p=0.000 n=14+15)
Sort1e6          41.2s ± 0%   41.2s ± 0%   +0.15%  (p=0.000 n=13+14)
Stable1e6         192s ± 0%    182s ± 0%   -5.07%  (p=0.000 n=14+14)
Change-Id: I8a9db77e1d4ea1956575895893bc9d04bd81204b
Reviewed-on: https://go-review.googlesource.com/10497
Reviewed-by: Russ Cox <rsc@golang.org>