internal/bytealg: improve performance of IndexByte for ppc64x

Use addi+lvx instruction fusion and remove register dependencies in
the main loop to improve performance.

benchmark                      old ns/op     new ns/op     delta
BenchmarkIndexByte/10-192      9.86          9.75          -1.12%
BenchmarkIndexByte/32-192      15.6          11.2          -28.21%
BenchmarkIndexByte/4K-192      155           97.6          -37.03%
BenchmarkIndexByte/4M-192      171790        129650        -24.53%
BenchmarkIndexByte/64M-192     6530982       5018424       -23.16%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkIndexByte/10-192      1013.72      1025.76      1.01x
BenchmarkIndexByte/32-192      2049.47      2868.01      1.40x
BenchmarkIndexByte/4K-192      26422.69     41975.67     1.59x
BenchmarkIndexByte/4M-192      24415.17     32350.74     1.33x
BenchmarkIndexByte/64M-192     10275.46     13372.50     1.30x
Change-Id: Iedf17f01f374d58e85dcd6a972209bfcb7eb6063
Reviewed-on: https://go-review.googlesource.com/137415
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>