Cypherpunks repositories - gostls13.git/commit
internal/bytealg: rewrite indexbytebody on PPC64
author Paul E. Murphy <murp@ibm.com>
Mon, 6 Mar 2023 22:51:31 +0000 (16:51 -0600)
committer Paul Murphy <murp@ibm.com>
Fri, 21 Apr 2023 16:10:29 +0000 (16:10 +0000)
commit db32aba508e86a1c016319d12f5b573bc2b13c48
tree f6ea921a40ec5f1b8a38468920a237d6c5089add
parent ccb8db88c5c11be65343732ef61d9d1052e6838a

Use P8 instructions throughout so the code stays backwards
compatible without otherwise impeding performance. Use overlapping
loads where possible, and prioritize larger checks over smaller
checks.
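
The overlapping-load idea can be illustrated in portable Go. This is only a sketch of the general technique, not the PPC64 assembly from this change: the names indexByteSketch and firstMatch are hypothetical, it works on 8-byte words rather than vector registers, and it uses a well-known SWAR zero-byte test in place of vector compares. The larger size class is tested first, and a 1..7 byte tail is handled by reloading the last 8 bytes of the buffer instead of looping byte by byte.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	ones  = 0x0101010101010101 // 0x01 in every byte
	highs = 0x8080808080808080 // 0x80 in every byte
)

// firstMatch returns the position (0..7) of the first byte of the
// little-endian word w that equals c, or -1 if no byte matches.
func firstMatch(w uint64, c byte) int {
	x := w ^ (ones * uint64(c))  // matching bytes become zero
	m := (x - ones) &^ x & highs // lowest set bit marks the first zero byte
	if m == 0 {
		return -1
	}
	return bits.TrailingZeros64(m) / 8
}

// indexByteSketch is a hypothetical stand-in for an IndexByte body.
func indexByteSketch(b []byte, c byte) int {
	n := len(b)
	if n >= 8 { // check the larger size class first
		i := 0
		for ; i+8 <= n; i += 8 {
			if j := firstMatch(binary.LittleEndian.Uint64(b[i:]), c); j >= 0 {
				return i + j
			}
		}
		if i < n {
			// Tail of 1..7 bytes: reload the final 8 bytes. The load
			// overlaps bytes already scanned, which held no match, so
			// any hit here lies in the unscanned tail.
			if j := firstMatch(binary.LittleEndian.Uint64(b[n-8:]), c); j >= 0 {
				return n - 8 + j
			}
		}
		return -1
	}
	for i, v := range b { // short inputs: plain byte loop
		if v == c {
			return i
		}
	}
	return -1
}

func main() {
	// Cross-check against the standard library for every length and position.
	for n := 0; n <= 40; n++ {
		buf := bytes.Repeat([]byte{'a'}, n)
		if indexByteSketch(buf, 'x') != -1 {
			panic(fmt.Sprintf("false match at length %d", n))
		}
		for pos := 0; pos < n; pos++ {
			buf[pos] = 'x'
			if got, want := indexByteSketch(buf, 'x'), bytes.IndexByte(buf, 'x'); got != want {
				panic(fmt.Sprintf("len %d pos %d: got %d, want %d", n, pos, got, want))
			}
			buf[pos] = 'a'
		}
	}
	fmt.Println("sketch agrees with bytes.IndexByte for lengths 0..40")
}
```

The overlapping reload trades a few redundant byte comparisons for the removal of a data-dependent tail loop, which is the same trade the paragraph above describes.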

However, some newer instructions can be used surgically when
targeting a newer GOPPC64. These can lead to noticeable
performance improvements with minimal impact to readability.

All tests below were run on a Power10/ppc64le machine, with a
small modification to BenchmarkIndexByte to ensure the IndexByte
wrapper call is inlined (as it likely is under realistic usage).
This wrapper adds substantial overhead if not inlined.

Previous implementation (power9 path, GOPPC64=power8) vs. this
change with GOPPC64=power8 (columns: old time, new time, delta):

IndexByte/1       3.81ns ± 8%     3.11ns ± 5%  -18.39%
IndexByte/2       3.82ns ± 3%     3.20ns ± 6%  -16.23%
IndexByte/3       3.61ns ± 4%     3.25ns ± 6%  -10.13%
IndexByte/4       3.66ns ± 5%     3.08ns ± 1%  -15.91%
IndexByte/5       3.82ns ± 0%     3.75ns ± 2%   -1.94%
IndexByte/6       3.83ns ± 0%     3.87ns ± 4%   +1.04%
IndexByte/7       3.83ns ± 0%     3.82ns ± 0%   -0.27%
IndexByte/8       3.82ns ± 0%     2.92ns ±11%  -23.70%
IndexByte/9       3.70ns ± 2%     3.08ns ± 2%  -16.87%
IndexByte/10      3.74ns ± 2%     3.04ns ± 0%  -18.75%
IndexByte/11      3.75ns ± 0%     3.31ns ± 8%  -11.79%
IndexByte/12      3.74ns ± 0%     3.04ns ± 0%  -18.86%
IndexByte/13      3.83ns ± 4%     3.04ns ± 0%  -20.64%
IndexByte/14      3.80ns ± 1%     3.30ns ± 8%  -13.18%
IndexByte/15      3.77ns ± 1%     3.04ns ± 0%  -19.33%
IndexByte/16      3.81ns ± 0%     2.78ns ± 7%  -26.88%
IndexByte/17      4.12ns ± 0%     3.04ns ± 1%  -26.11%
IndexByte/18      4.27ns ± 6%     3.05ns ± 0%  -28.64%
IndexByte/19      4.30ns ± 4%     3.02ns ± 2%  -29.65%
IndexByte/20      4.43ns ± 7%     3.45ns ± 7%  -22.15%
IndexByte/21      4.12ns ± 0%     3.03ns ± 1%  -26.35%
IndexByte/22      4.40ns ± 6%     3.05ns ± 0%  -30.82%
IndexByte/23      4.40ns ± 6%     3.01ns ± 2%  -31.48%
IndexByte/24      4.32ns ± 5%     3.07ns ± 0%  -28.98%
IndexByte/25      4.76ns ± 2%     3.04ns ± 1%  -36.11%
IndexByte/26      4.82ns ± 0%     3.05ns ± 0%  -36.66%
IndexByte/27      4.82ns ± 0%     2.97ns ± 3%  -38.39%
IndexByte/28      4.82ns ± 0%     2.96ns ± 3%  -38.57%
IndexByte/29      4.82ns ± 0%     3.34ns ± 9%  -30.71%
IndexByte/30      4.82ns ± 0%     3.05ns ± 0%  -36.77%
IndexByte/31      4.81ns ± 0%     3.05ns ± 0%  -36.70%
IndexByte/32      3.52ns ± 0%     3.44ns ± 1%   -2.15%
IndexByte/33      4.77ns ± 1%     3.35ns ± 0%  -29.81%
IndexByte/34      5.01ns ± 5%     3.35ns ± 0%  -33.15%
IndexByte/35      4.92ns ± 9%     3.35ns ± 0%  -31.89%
IndexByte/36      4.81ns ± 5%     3.35ns ± 0%  -30.37%
IndexByte/37      4.99ns ± 6%     3.35ns ± 0%  -32.86%
IndexByte/38      5.06ns ± 5%     3.35ns ± 0%  -33.84%
IndexByte/39      5.02ns ± 5%     3.48ns ± 9%  -30.58%
IndexByte/40      5.21ns ± 9%     3.55ns ± 4%  -31.82%
IndexByte/41      5.18ns ± 0%     3.42ns ± 2%  -33.98%
IndexByte/42      5.19ns ± 0%     3.55ns ±11%  -31.56%
IndexByte/43      5.18ns ± 0%     3.45ns ± 5%  -33.46%
IndexByte/44      5.18ns ± 0%     3.39ns ± 0%  -34.56%
IndexByte/45      5.18ns ± 0%     3.43ns ± 4%  -33.74%
IndexByte/46      5.18ns ± 0%     3.47ns ± 1%  -33.03%
IndexByte/47      5.18ns ± 0%     3.44ns ± 2%  -33.54%
IndexByte/48      5.18ns ± 0%     3.39ns ± 0%  -34.52%
IndexByte/49      5.69ns ± 0%     3.79ns ± 0%  -33.45%
IndexByte/50      5.70ns ± 0%     3.70ns ± 3%  -34.98%
IndexByte/51      5.70ns ± 0%     3.70ns ± 2%  -35.05%
IndexByte/52      5.69ns ± 0%     3.80ns ± 1%  -33.35%
IndexByte/53      5.69ns ± 0%     3.78ns ± 0%  -33.54%
IndexByte/54      5.69ns ± 0%     3.78ns ± 1%  -33.51%
IndexByte/55      5.69ns ± 0%     3.78ns ± 0%  -33.61%
IndexByte/56      5.69ns ± 0%     3.81ns ± 3%  -33.12%
IndexByte/57      6.20ns ± 0%     3.79ns ± 4%  -38.89%
IndexByte/58      6.20ns ± 0%     3.74ns ± 2%  -39.58%
IndexByte/59      6.20ns ± 0%     3.69ns ± 2%  -40.47%
IndexByte/60      6.20ns ± 0%     3.79ns ± 1%  -38.81%
IndexByte/61      6.20ns ± 0%     3.77ns ± 1%  -39.23%
IndexByte/62      6.20ns ± 0%     3.79ns ± 0%  -38.89%
IndexByte/63      6.20ns ± 0%     3.79ns ± 0%  -38.90%
IndexByte/64      4.17ns ± 0%     3.47ns ± 3%  -16.70%
IndexByte/65      5.38ns ± 0%     4.21ns ± 0%  -21.59%
IndexByte/66      5.38ns ± 0%     4.21ns ± 0%  -21.58%
IndexByte/67      5.38ns ± 0%     4.22ns ± 0%  -21.58%
IndexByte/68      5.38ns ± 0%     4.22ns ± 0%  -21.59%
IndexByte/69      5.38ns ± 0%     4.22ns ± 0%  -21.56%
IndexByte/70      5.38ns ± 0%     4.21ns ± 0%  -21.59%
IndexByte/71      5.37ns ± 0%     4.21ns ± 0%  -21.51%
IndexByte/72      5.37ns ± 0%     4.22ns ± 0%  -21.46%
IndexByte/73      5.71ns ± 0%     4.22ns ± 0%  -26.20%
IndexByte/74      5.71ns ± 0%     4.21ns ± 0%  -26.21%
IndexByte/75      5.71ns ± 0%     4.21ns ± 0%  -26.17%
IndexByte/76      5.71ns ± 0%     4.22ns ± 0%  -26.22%
IndexByte/77      5.71ns ± 0%     4.22ns ± 0%  -26.22%
IndexByte/78      5.71ns ± 0%     4.21ns ± 0%  -26.22%
IndexByte/79      5.71ns ± 0%     4.22ns ± 0%  -26.21%
IndexByte/80      5.71ns ± 0%     4.21ns ± 0%  -26.19%
IndexByte/81      6.20ns ± 0%     4.39ns ± 0%  -29.13%
IndexByte/82      6.20ns ± 0%     4.36ns ± 0%  -29.67%
IndexByte/83      6.20ns ± 0%     4.36ns ± 0%  -29.63%
IndexByte/84      6.20ns ± 0%     4.39ns ± 0%  -29.21%
IndexByte/85      6.20ns ± 0%     4.36ns ± 0%  -29.64%
IndexByte/86      6.20ns ± 0%     4.36ns ± 0%  -29.63%
IndexByte/87      6.20ns ± 0%     4.39ns ± 0%  -29.21%
IndexByte/88      6.20ns ± 0%     4.36ns ± 0%  -29.65%
IndexByte/89      6.74ns ± 0%     4.36ns ± 0%  -35.33%
IndexByte/90      6.75ns ± 0%     4.37ns ± 0%  -35.22%
IndexByte/91      6.74ns ± 0%     4.36ns ± 0%  -35.30%
IndexByte/92      6.74ns ± 0%     4.36ns ± 0%  -35.34%
IndexByte/93      6.74ns ± 0%     4.37ns ± 0%  -35.20%
IndexByte/94      6.74ns ± 0%     4.36ns ± 0%  -35.33%
IndexByte/95      6.75ns ± 0%     4.36ns ± 0%  -35.32%
IndexByte/96      4.83ns ± 0%     4.34ns ± 2%  -10.24%
IndexByte/97      5.91ns ± 0%     4.65ns ± 0%  -21.24%
IndexByte/98      5.91ns ± 0%     4.65ns ± 0%  -21.24%
IndexByte/99      5.91ns ± 0%     4.65ns ± 0%  -21.23%
IndexByte/100     5.90ns ± 0%     4.65ns ± 0%  -21.21%
IndexByte/101     5.90ns ± 0%     4.65ns ± 0%  -21.22%
IndexByte/102     5.90ns ± 0%     4.65ns ± 0%  -21.23%
IndexByte/103     5.91ns ± 0%     4.65ns ± 0%  -21.23%
IndexByte/104     5.91ns ± 0%     4.65ns ± 0%  -21.24%
IndexByte/105     6.25ns ± 0%     4.65ns ± 0%  -25.59%
IndexByte/106     6.25ns ± 0%     4.65ns ± 0%  -25.59%
IndexByte/107     6.25ns ± 0%     4.65ns ± 0%  -25.60%
IndexByte/108     6.25ns ± 0%     4.65ns ± 0%  -25.58%
IndexByte/109     6.24ns ± 0%     4.65ns ± 0%  -25.50%
IndexByte/110     6.25ns ± 0%     4.65ns ± 0%  -25.56%
IndexByte/111     6.25ns ± 0%     4.65ns ± 0%  -25.60%
IndexByte/112     6.25ns ± 0%     4.65ns ± 0%  -25.59%
IndexByte/113     6.76ns ± 0%     5.05ns ± 0%  -25.37%
IndexByte/114     6.76ns ± 0%     5.05ns ± 0%  -25.31%
IndexByte/115     6.76ns ± 0%     5.05ns ± 0%  -25.38%
IndexByte/116     6.76ns ± 0%     5.05ns ± 0%  -25.31%
IndexByte/117     6.76ns ± 0%     5.05ns ± 0%  -25.38%
IndexByte/118     6.76ns ± 0%     5.05ns ± 0%  -25.31%
IndexByte/119     6.76ns ± 0%     5.05ns ± 0%  -25.38%
IndexByte/120     6.76ns ± 0%     5.05ns ± 0%  -25.36%
IndexByte/121     7.35ns ± 0%     5.05ns ± 0%  -31.33%
IndexByte/122     7.36ns ± 0%     5.05ns ± 0%  -31.42%
IndexByte/123     7.38ns ± 0%     5.05ns ± 0%  -31.60%
IndexByte/124     7.38ns ± 0%     5.05ns ± 0%  -31.59%
IndexByte/125     7.38ns ± 0%     5.05ns ± 0%  -31.60%
IndexByte/126     7.38ns ± 0%     5.05ns ± 0%  -31.58%
IndexByte/128     5.28ns ± 0%     5.10ns ± 0%   -3.41%
IndexByte/256     7.27ns ± 0%     7.28ns ± 2%   +0.13%
IndexByte/512     12.1ns ± 0%     11.8ns ± 0%   -2.51%
IndexByte/1K      23.1ns ± 3%     22.0ns ± 0%   -4.66%
IndexByte/2K      42.6ns ± 0%     42.4ns ± 0%   -0.41%
IndexByte/4K      90.3ns ± 0%     89.4ns ± 0%   -0.98%
IndexByte/8K       170ns ± 0%      170ns ± 0%   -0.59%
IndexByte/16K      331ns ± 0%      330ns ± 0%   -0.27%
IndexByte/32K      660ns ± 0%      660ns ± 0%   -0.08%
IndexByte/64K     1.30µs ± 0%     1.30µs ± 0%   -0.08%
IndexByte/128K    2.58µs ± 0%     2.58µs ± 0%   -0.04%
IndexByte/256K    5.15µs ± 0%     5.15µs ± 0%   -0.04%
IndexByte/512K    10.3µs ± 0%     10.3µs ± 0%   -0.03%
IndexByte/1M      20.6µs ± 0%     20.5µs ± 0%   -0.03%
IndexByte/2M      41.1µs ± 0%     41.1µs ± 0%   -0.03%
IndexByte/4M      82.2µs ± 0%     82.1µs ± 0%   -0.02%
IndexByte/8M       164µs ± 0%      164µs ± 0%   -0.01%
IndexByte/16M      328µs ± 0%      328µs ± 0%   -0.01%
IndexByte/32M      657µs ± 0%      657µs ± 0%   -0.00%

GOPPC64=power8 vs. GOPPC64=power9. The improvement is most
noticeable between 16 and 64B, and disappears around 128B.

IndexByte/16      2.78ns ± 7%     2.65ns ±15%   -4.74%
IndexByte/17      3.04ns ± 1%     2.80ns ± 3%   -7.85%
IndexByte/18      3.05ns ± 0%     2.71ns ± 4%  -11.00%
IndexByte/19      3.02ns ± 2%     2.76ns ±10%   -8.74%
IndexByte/20      3.45ns ± 7%     2.91ns ± 0%  -15.46%
IndexByte/21      3.03ns ± 1%     2.84ns ± 9%   -6.33%
IndexByte/22      3.05ns ± 0%     2.67ns ± 1%  -12.38%
IndexByte/23      3.01ns ± 2%     2.67ns ± 1%  -11.24%
IndexByte/24      3.07ns ± 0%     2.92ns ±12%   -4.79%
IndexByte/25      3.04ns ± 1%     3.15ns ±15%   +3.63%
IndexByte/26      3.05ns ± 0%     2.83ns ±13%   -7.33%
IndexByte/27      2.97ns ± 3%     2.98ns ±10%   +0.56%
IndexByte/28      2.96ns ± 3%     2.96ns ± 9%   -0.05%
IndexByte/29      3.34ns ± 9%     3.03ns ±12%   -9.33%
IndexByte/30      3.05ns ± 0%     2.68ns ± 1%  -12.05%
IndexByte/31      3.05ns ± 0%     2.83ns ±12%   -7.27%
IndexByte/32      3.44ns ± 1%     3.21ns ±10%   -6.78%
IndexByte/33      3.35ns ± 0%     3.41ns ± 2%   +1.95%
IndexByte/34      3.35ns ± 0%     3.13ns ± 0%   -6.53%
IndexByte/35      3.35ns ± 0%     3.13ns ± 0%   -6.54%
IndexByte/36      3.35ns ± 0%     3.13ns ± 0%   -6.52%
IndexByte/37      3.35ns ± 0%     3.13ns ± 0%   -6.52%
IndexByte/38      3.35ns ± 0%     3.24ns ± 4%   -3.30%
IndexByte/39      3.48ns ± 9%     3.44ns ± 2%   -1.19%
IndexByte/40      3.55ns ± 4%     3.46ns ± 2%   -2.44%
IndexByte/41      3.42ns ± 2%     3.39ns ± 4%   -0.86%
IndexByte/42      3.55ns ±11%     3.46ns ± 1%   -2.65%
IndexByte/43      3.45ns ± 5%     3.44ns ± 2%   -0.31%
IndexByte/44      3.39ns ± 0%     3.43ns ± 3%   +1.23%
IndexByte/45      3.43ns ± 4%     3.50ns ± 1%   +2.07%
IndexByte/46      3.47ns ± 1%     3.46ns ± 2%   -0.31%
IndexByte/47      3.44ns ± 2%     3.47ns ± 1%   +0.78%
IndexByte/48      3.39ns ± 0%     3.46ns ± 2%   +1.96%
IndexByte/49      3.79ns ± 0%     3.47ns ± 0%   -8.41%
IndexByte/50      3.70ns ± 3%     3.64ns ± 5%   -1.66%
IndexByte/51      3.70ns ± 2%     3.75ns ± 0%   +1.40%
IndexByte/52      3.80ns ± 1%     3.77ns ± 0%   -0.70%
IndexByte/53      3.78ns ± 0%     3.77ns ± 0%   -0.46%
IndexByte/54      3.78ns ± 1%     3.53ns ± 7%   -6.74%
IndexByte/55      3.78ns ± 0%     3.47ns ± 0%   -8.17%
IndexByte/56      3.81ns ± 3%     3.45ns ± 0%   -9.43%
IndexByte/57      3.79ns ± 4%     3.47ns ± 0%   -8.45%
IndexByte/58      3.74ns ± 2%     3.55ns ± 4%   -5.16%
IndexByte/59      3.69ns ± 2%     3.61ns ± 4%   -2.01%
IndexByte/60      3.79ns ± 1%     3.45ns ± 0%   -9.09%
IndexByte/61      3.77ns ± 1%     3.47ns ± 0%   -7.93%
IndexByte/62      3.79ns ± 0%     3.45ns ± 0%   -8.97%
IndexByte/63      3.79ns ± 0%     3.47ns ± 0%   -8.44%
IndexByte/64      3.47ns ± 3%     3.18ns ± 0%   -8.41%

GOPPC64=power9 vs GOPPC64=power10. Only sizes <16 will
show meaningful changes.

IndexByte/1       3.27ns ± 8%     2.36ns ± 2%  -27.58%
IndexByte/2       3.06ns ± 4%     2.34ns ± 1%  -23.42%
IndexByte/3       3.77ns ±11%     2.48ns ± 7%  -34.03%
IndexByte/4       3.18ns ± 8%     2.33ns ± 1%  -26.69%
IndexByte/5       3.18ns ± 5%     2.34ns ± 4%  -26.26%
IndexByte/6       3.13ns ± 3%     2.35ns ± 1%  -24.97%
IndexByte/7       3.25ns ± 1%     2.33ns ± 1%  -28.22%
IndexByte/8       2.79ns ± 2%     2.36ns ± 1%  -15.32%
IndexByte/9       2.90ns ± 0%     2.34ns ± 2%  -19.36%
IndexByte/10      2.99ns ± 3%     2.31ns ± 1%  -22.70%
IndexByte/11      3.13ns ± 7%     2.31ns ± 0%  -26.08%
IndexByte/12      3.01ns ± 4%     2.32ns ± 1%  -22.91%
IndexByte/13      2.98ns ± 3%     2.31ns ± 1%  -22.72%
IndexByte/14      2.92ns ± 2%     2.61ns ±16%  -10.58%
IndexByte/15      3.02ns ± 5%     2.69ns ± 7%  -10.90%
IndexByte/16      2.65ns ±15%     2.29ns ± 1%  -13.61%

Change-Id: I4482f762d25eabf60def4981a0b2bc0c10ccf50c
Reviewed-on: https://go-review.googlesource.com/c/go/+/478656
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Archana Ravindar <aravind5@in.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
src/internal/bytealg/indexbyte_ppc64x.s