internal/bytealg: rewrite indexbytebody on PPC64
Use P8 instructions throughout to remain backwards compatible
without otherwise impeding performance. Use overlapping loads
where possible, and prioritize larger checks over smaller checks.
However, some newer instructions can be used surgically when
targeting a newer GOPPC64. These can lead to noticeable
performance improvements with minimal impact to readability.
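The overlapping-load idea can be illustrated in portable Go. This is
only a sketch of the general technique, not the PPC64 assembly in this
change: the helper names (firstMatch, indexByte8to16) are hypothetical,
and a SWAR byte compare stands in for the vector compare instructions.
For a buffer of 8 to 16 bytes, two 8-byte loads (one from the start,
one ending at the last byte) cover every byte without a byte-wise tail
loop.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	ones  = 0x0101010101010101
	highs = 0x8080808080808080
)

// firstMatch returns the index (0-7) of the lowest-addressed byte in w
// equal to c, or -1 if none matches. w must have been loaded
// little-endian so that byte 0 of the buffer is the low byte of w.
func firstMatch(w uint64, c byte) int {
	x := w ^ (ones * uint64(c))  // matching bytes become zero
	m := (x - ones) &^ x & highs // high bit set at the first zero byte
	if m == 0 {
		return -1
	}
	return bits.TrailingZeros64(m) / 8
}

// indexByte8to16 (hypothetical helper) handles 8 <= len(b) <= 16 using
// two loads that may overlap in the middle of the buffer.
func indexByte8to16(b []byte, c byte) int {
	n := len(b)
	if i := firstMatch(binary.LittleEndian.Uint64(b), c); i >= 0 {
		return i
	}
	// The second load ends at the last byte. Bytes it shares with the
	// first load are already known not to match.
	if i := firstMatch(binary.LittleEndian.Uint64(b[n-8:]), c); i >= 0 {
		return n - 8 + i
	}
	return -1
}

func main() {
	b := []byte("hello, world") // 12 bytes
	fmt.Println(indexByte8to16(b, 'o'), indexByte8to16(b, 'd'), indexByte8to16(b, 'z'))
}
```

Checking the larger (first) load before the tail mirrors the
"larger checks first" ordering described above.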
All tests below were run on a Power10/ppc64le, and use a small
modification to BenchmarkIndexByte to ensure the IndexByte
wrapper call is inlined (as it likely is under realistic usage).
This wrapper adds substantial overhead if not inlined.
Previous implementation (power9 path, GOPPC64=power8) vs. this change with GOPPC64=power8:
IndexByte/1 3.81ns ± 8% 3.11ns ± 5% -18.39%
IndexByte/2 3.82ns ± 3% 3.20ns ± 6% -16.23%
IndexByte/3 3.61ns ± 4% 3.25ns ± 6% -10.13%
IndexByte/4 3.66ns ± 5% 3.08ns ± 1% -15.91%
IndexByte/5 3.82ns ± 0% 3.75ns ± 2% -1.94%
IndexByte/6 3.83ns ± 0% 3.87ns ± 4% +1.04%
IndexByte/7 3.83ns ± 0% 3.82ns ± 0% -0.27%
IndexByte/8 3.82ns ± 0% 2.92ns ±11% -23.70%
IndexByte/9 3.70ns ± 2% 3.08ns ± 2% -16.87%
IndexByte/10 3.74ns ± 2% 3.04ns ± 0% -18.75%
IndexByte/11 3.75ns ± 0% 3.31ns ± 8% -11.79%
IndexByte/12 3.74ns ± 0% 3.04ns ± 0% -18.86%
IndexByte/13 3.83ns ± 4% 3.04ns ± 0% -20.64%
IndexByte/14 3.80ns ± 1% 3.30ns ± 8% -13.18%
IndexByte/15 3.77ns ± 1% 3.04ns ± 0% -19.33%
IndexByte/16 3.81ns ± 0% 2.78ns ± 7% -26.88%
IndexByte/17 4.12ns ± 0% 3.04ns ± 1% -26.11%
IndexByte/18 4.27ns ± 6% 3.05ns ± 0% -28.64%
IndexByte/19 4.30ns ± 4% 3.02ns ± 2% -29.65%
IndexByte/20 4.43ns ± 7% 3.45ns ± 7% -22.15%
IndexByte/21 4.12ns ± 0% 3.03ns ± 1% -26.35%
IndexByte/22 4.40ns ± 6% 3.05ns ± 0% -30.82%
IndexByte/23 4.40ns ± 6% 3.01ns ± 2% -31.48%
IndexByte/24 4.32ns ± 5% 3.07ns ± 0% -28.98%
IndexByte/25 4.76ns ± 2% 3.04ns ± 1% -36.11%
IndexByte/26 4.82ns ± 0% 3.05ns ± 0% -36.66%
IndexByte/27 4.82ns ± 0% 2.97ns ± 3% -38.39%
IndexByte/28 4.82ns ± 0% 2.96ns ± 3% -38.57%
IndexByte/29 4.82ns ± 0% 3.34ns ± 9% -30.71%
IndexByte/30 4.82ns ± 0% 3.05ns ± 0% -36.77%
IndexByte/31 4.81ns ± 0% 3.05ns ± 0% -36.70%
IndexByte/32 3.52ns ± 0% 3.44ns ± 1% -2.15%
IndexByte/33 4.77ns ± 1% 3.35ns ± 0% -29.81%
IndexByte/34 5.01ns ± 5% 3.35ns ± 0% -33.15%
IndexByte/35 4.92ns ± 9% 3.35ns ± 0% -31.89%
IndexByte/36 4.81ns ± 5% 3.35ns ± 0% -30.37%
IndexByte/37 4.99ns ± 6% 3.35ns ± 0% -32.86%
IndexByte/38 5.06ns ± 5% 3.35ns ± 0% -33.84%
IndexByte/39 5.02ns ± 5% 3.48ns ± 9% -30.58%
IndexByte/40 5.21ns ± 9% 3.55ns ± 4% -31.82%
IndexByte/41 5.18ns ± 0% 3.42ns ± 2% -33.98%
IndexByte/42 5.19ns ± 0% 3.55ns ±11% -31.56%
IndexByte/43 5.18ns ± 0% 3.45ns ± 5% -33.46%
IndexByte/44 5.18ns ± 0% 3.39ns ± 0% -34.56%
IndexByte/45 5.18ns ± 0% 3.43ns ± 4% -33.74%
IndexByte/46 5.18ns ± 0% 3.47ns ± 1% -33.03%
IndexByte/47 5.18ns ± 0% 3.44ns ± 2% -33.54%
IndexByte/48 5.18ns ± 0% 3.39ns ± 0% -34.52%
IndexByte/49 5.69ns ± 0% 3.79ns ± 0% -33.45%
IndexByte/50 5.70ns ± 0% 3.70ns ± 3% -34.98%
IndexByte/51 5.70ns ± 0% 3.70ns ± 2% -35.05%
IndexByte/52 5.69ns ± 0% 3.80ns ± 1% -33.35%
IndexByte/53 5.69ns ± 0% 3.78ns ± 0% -33.54%
IndexByte/54 5.69ns ± 0% 3.78ns ± 1% -33.51%
IndexByte/55 5.69ns ± 0% 3.78ns ± 0% -33.61%
IndexByte/56 5.69ns ± 0% 3.81ns ± 3% -33.12%
IndexByte/57 6.20ns ± 0% 3.79ns ± 4% -38.89%
IndexByte/58 6.20ns ± 0% 3.74ns ± 2% -39.58%
IndexByte/59 6.20ns ± 0% 3.69ns ± 2% -40.47%
IndexByte/60 6.20ns ± 0% 3.79ns ± 1% -38.81%
IndexByte/61 6.20ns ± 0% 3.77ns ± 1% -39.23%
IndexByte/62 6.20ns ± 0% 3.79ns ± 0% -38.89%
IndexByte/63 6.20ns ± 0% 3.79ns ± 0% -38.90%
IndexByte/64 4.17ns ± 0% 3.47ns ± 3% -16.70%
IndexByte/65 5.38ns ± 0% 4.21ns ± 0% -21.59%
IndexByte/66 5.38ns ± 0% 4.21ns ± 0% -21.58%
IndexByte/67 5.38ns ± 0% 4.22ns ± 0% -21.58%
IndexByte/68 5.38ns ± 0% 4.22ns ± 0% -21.59%
IndexByte/69 5.38ns ± 0% 4.22ns ± 0% -21.56%
IndexByte/70 5.38ns ± 0% 4.21ns ± 0% -21.59%
IndexByte/71 5.37ns ± 0% 4.21ns ± 0% -21.51%
IndexByte/72 5.37ns ± 0% 4.22ns ± 0% -21.46%
IndexByte/73 5.71ns ± 0% 4.22ns ± 0% -26.20%
IndexByte/74 5.71ns ± 0% 4.21ns ± 0% -26.21%
IndexByte/75 5.71ns ± 0% 4.21ns ± 0% -26.17%
IndexByte/76 5.71ns ± 0% 4.22ns ± 0% -26.22%
IndexByte/77 5.71ns ± 0% 4.22ns ± 0% -26.22%
IndexByte/78 5.71ns ± 0% 4.21ns ± 0% -26.22%
IndexByte/79 5.71ns ± 0% 4.22ns ± 0% -26.21%
IndexByte/80 5.71ns ± 0% 4.21ns ± 0% -26.19%
IndexByte/81 6.20ns ± 0% 4.39ns ± 0% -29.13%
IndexByte/82 6.20ns ± 0% 4.36ns ± 0% -29.67%
IndexByte/83 6.20ns ± 0% 4.36ns ± 0% -29.63%
IndexByte/84 6.20ns ± 0% 4.39ns ± 0% -29.21%
IndexByte/85 6.20ns ± 0% 4.36ns ± 0% -29.64%
IndexByte/86 6.20ns ± 0% 4.36ns ± 0% -29.63%
IndexByte/87 6.20ns ± 0% 4.39ns ± 0% -29.21%
IndexByte/88 6.20ns ± 0% 4.36ns ± 0% -29.65%
IndexByte/89 6.74ns ± 0% 4.36ns ± 0% -35.33%
IndexByte/90 6.75ns ± 0% 4.37ns ± 0% -35.22%
IndexByte/91 6.74ns ± 0% 4.36ns ± 0% -35.30%
IndexByte/92 6.74ns ± 0% 4.36ns ± 0% -35.34%
IndexByte/93 6.74ns ± 0% 4.37ns ± 0% -35.20%
IndexByte/94 6.74ns ± 0% 4.36ns ± 0% -35.33%
IndexByte/95 6.75ns ± 0% 4.36ns ± 0% -35.32%
IndexByte/96 4.83ns ± 0% 4.34ns ± 2% -10.24%
IndexByte/97 5.91ns ± 0% 4.65ns ± 0% -21.24%
IndexByte/98 5.91ns ± 0% 4.65ns ± 0% -21.24%
IndexByte/99 5.91ns ± 0% 4.65ns ± 0% -21.23%
IndexByte/100 5.90ns ± 0% 4.65ns ± 0% -21.21%
IndexByte/101 5.90ns ± 0% 4.65ns ± 0% -21.22%
IndexByte/102 5.90ns ± 0% 4.65ns ± 0% -21.23%
IndexByte/103 5.91ns ± 0% 4.65ns ± 0% -21.23%
IndexByte/104 5.91ns ± 0% 4.65ns ± 0% -21.24%
IndexByte/105 6.25ns ± 0% 4.65ns ± 0% -25.59%
IndexByte/106 6.25ns ± 0% 4.65ns ± 0% -25.59%
IndexByte/107 6.25ns ± 0% 4.65ns ± 0% -25.60%
IndexByte/108 6.25ns ± 0% 4.65ns ± 0% -25.58%
IndexByte/109 6.24ns ± 0% 4.65ns ± 0% -25.50%
IndexByte/110 6.25ns ± 0% 4.65ns ± 0% -25.56%
IndexByte/111 6.25ns ± 0% 4.65ns ± 0% -25.60%
IndexByte/112 6.25ns ± 0% 4.65ns ± 0% -25.59%
IndexByte/113 6.76ns ± 0% 5.05ns ± 0% -25.37%
IndexByte/114 6.76ns ± 0% 5.05ns ± 0% -25.31%
IndexByte/115 6.76ns ± 0% 5.05ns ± 0% -25.38%
IndexByte/116 6.76ns ± 0% 5.05ns ± 0% -25.31%
IndexByte/117 6.76ns ± 0% 5.05ns ± 0% -25.38%
IndexByte/118 6.76ns ± 0% 5.05ns ± 0% -25.31%
IndexByte/119 6.76ns ± 0% 5.05ns ± 0% -25.38%
IndexByte/120 6.76ns ± 0% 5.05ns ± 0% -25.36%
IndexByte/121 7.35ns ± 0% 5.05ns ± 0% -31.33%
IndexByte/122 7.36ns ± 0% 5.05ns ± 0% -31.42%
IndexByte/123 7.38ns ± 0% 5.05ns ± 0% -31.60%
IndexByte/124 7.38ns ± 0% 5.05ns ± 0% -31.59%
IndexByte/125 7.38ns ± 0% 5.05ns ± 0% -31.60%
IndexByte/126 7.38ns ± 0% 5.05ns ± 0% -31.58%
IndexByte/128 5.28ns ± 0% 5.10ns ± 0% -3.41%
IndexByte/256 7.27ns ± 0% 7.28ns ± 2% +0.13%
IndexByte/512 12.1ns ± 0% 11.8ns ± 0% -2.51%
IndexByte/1K 23.1ns ± 3% 22.0ns ± 0% -4.66%
IndexByte/2K 42.6ns ± 0% 42.4ns ± 0% -0.41%
IndexByte/4K 90.3ns ± 0% 89.4ns ± 0% -0.98%
IndexByte/8K 170ns ± 0% 170ns ± 0% -0.59%
IndexByte/16K 331ns ± 0% 330ns ± 0% -0.27%
IndexByte/32K 660ns ± 0% 660ns ± 0% -0.08%
IndexByte/64K 1.30µs ± 0% 1.30µs ± 0% -0.08%
IndexByte/128K 2.58µs ± 0% 2.58µs ± 0% -0.04%
IndexByte/256K 5.15µs ± 0% 5.15µs ± 0% -0.04%
IndexByte/512K 10.3µs ± 0% 10.3µs ± 0% -0.03%
IndexByte/1M 20.6µs ± 0% 20.5µs ± 0% -0.03%
IndexByte/2M 41.1µs ± 0% 41.1µs ± 0% -0.03%
IndexByte/4M 82.2µs ± 0% 82.1µs ± 0% -0.02%
IndexByte/8M 164µs ± 0% 164µs ± 0% -0.01%
IndexByte/16M 328µs ± 0% 328µs ± 0% -0.01%
IndexByte/32M 657µs ± 0% 657µs ± 0% -0.00%
GOPPC64=power8 vs GOPPC64=power9. The improvement is
most noticeable between 16 and 64B, and goes away around
128B.
IndexByte/16 2.78ns ± 7% 2.65ns ±15% -4.74%
IndexByte/17 3.04ns ± 1% 2.80ns ± 3% -7.85%
IndexByte/18 3.05ns ± 0% 2.71ns ± 4% -11.00%
IndexByte/19 3.02ns ± 2% 2.76ns ±10% -8.74%
IndexByte/20 3.45ns ± 7% 2.91ns ± 0% -15.46%
IndexByte/21 3.03ns ± 1% 2.84ns ± 9% -6.33%
IndexByte/22 3.05ns ± 0% 2.67ns ± 1% -12.38%
IndexByte/23 3.01ns ± 2% 2.67ns ± 1% -11.24%
IndexByte/24 3.07ns ± 0% 2.92ns ±12% -4.79%
IndexByte/25 3.04ns ± 1% 3.15ns ±15% +3.63%
IndexByte/26 3.05ns ± 0% 2.83ns ±13% -7.33%
IndexByte/27 2.97ns ± 3% 2.98ns ±10% +0.56%
IndexByte/28 2.96ns ± 3% 2.96ns ± 9% -0.05%
IndexByte/29 3.34ns ± 9% 3.03ns ±12% -9.33%
IndexByte/30 3.05ns ± 0% 2.68ns ± 1% -12.05%
IndexByte/31 3.05ns ± 0% 2.83ns ±12% -7.27%
IndexByte/32 3.44ns ± 1% 3.21ns ±10% -6.78%
IndexByte/33 3.35ns ± 0% 3.41ns ± 2% +1.95%
IndexByte/34 3.35ns ± 0% 3.13ns ± 0% -6.53%
IndexByte/35 3.35ns ± 0% 3.13ns ± 0% -6.54%
IndexByte/36 3.35ns ± 0% 3.13ns ± 0% -6.52%
IndexByte/37 3.35ns ± 0% 3.13ns ± 0% -6.52%
IndexByte/38 3.35ns ± 0% 3.24ns ± 4% -3.30%
IndexByte/39 3.48ns ± 9% 3.44ns ± 2% -1.19%
IndexByte/40 3.55ns ± 4% 3.46ns ± 2% -2.44%
IndexByte/41 3.42ns ± 2% 3.39ns ± 4% -0.86%
IndexByte/42 3.55ns ±11% 3.46ns ± 1% -2.65%
IndexByte/43 3.45ns ± 5% 3.44ns ± 2% -0.31%
IndexByte/44 3.39ns ± 0% 3.43ns ± 3% +1.23%
IndexByte/45 3.43ns ± 4% 3.50ns ± 1% +2.07%
IndexByte/46 3.47ns ± 1% 3.46ns ± 2% -0.31%
IndexByte/47 3.44ns ± 2% 3.47ns ± 1% +0.78%
IndexByte/48 3.39ns ± 0% 3.46ns ± 2% +1.96%
IndexByte/49 3.79ns ± 0% 3.47ns ± 0% -8.41%
IndexByte/50 3.70ns ± 3% 3.64ns ± 5% -1.66%
IndexByte/51 3.70ns ± 2% 3.75ns ± 0% +1.40%
IndexByte/52 3.80ns ± 1% 3.77ns ± 0% -0.70%
IndexByte/53 3.78ns ± 0% 3.77ns ± 0% -0.46%
IndexByte/54 3.78ns ± 1% 3.53ns ± 7% -6.74%
IndexByte/55 3.78ns ± 0% 3.47ns ± 0% -8.17%
IndexByte/56 3.81ns ± 3% 3.45ns ± 0% -9.43%
IndexByte/57 3.79ns ± 4% 3.47ns ± 0% -8.45%
IndexByte/58 3.74ns ± 2% 3.55ns ± 4% -5.16%
IndexByte/59 3.69ns ± 2% 3.61ns ± 4% -2.01%
IndexByte/60 3.79ns ± 1% 3.45ns ± 0% -9.09%
IndexByte/61 3.77ns ± 1% 3.47ns ± 0% -7.93%
IndexByte/62 3.79ns ± 0% 3.45ns ± 0% -8.97%
IndexByte/63 3.79ns ± 0% 3.47ns ± 0% -8.44%
IndexByte/64 3.47ns ± 3% 3.18ns ± 0% -8.41%
GOPPC64=power9 vs GOPPC64=power10. Only sizes <16 will
show meaningful changes.
IndexByte/1 3.27ns ± 8% 2.36ns ± 2% -27.58%
IndexByte/2 3.06ns ± 4% 2.34ns ± 1% -23.42%
IndexByte/3 3.77ns ±11% 2.48ns ± 7% -34.03%
IndexByte/4 3.18ns ± 8% 2.33ns ± 1% -26.69%
IndexByte/5 3.18ns ± 5% 2.34ns ± 4% -26.26%
IndexByte/6 3.13ns ± 3% 2.35ns ± 1% -24.97%
IndexByte/7 3.25ns ± 1% 2.33ns ± 1% -28.22%
IndexByte/8 2.79ns ± 2% 2.36ns ± 1% -15.32%
IndexByte/9 2.90ns ± 0% 2.34ns ± 2% -19.36%
IndexByte/10 2.99ns ± 3% 2.31ns ± 1% -22.70%
IndexByte/11 3.13ns ± 7% 2.31ns ± 0% -26.08%
IndexByte/12 3.01ns ± 4% 2.32ns ± 1% -22.91%
IndexByte/13 2.98ns ± 3% 2.31ns ± 1% -22.72%
IndexByte/14 2.92ns ± 2% 2.61ns ±16% -10.58%
IndexByte/15 3.02ns ± 5% 2.69ns ± 7% -10.90%
IndexByte/16 2.65ns ±15% 2.29ns ± 1% -13.61%
Change-Id: I4482f762d25eabf60def4981a0b2bc0c10ccf50c
Reviewed-on: https://go-review.googlesource.com/c/go/+/478656
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Archana Ravindar <aravind5@in.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>