internal/bytealg: optimize Index/IndexString in amd64
Now with PCALIGN available on amd64 we can start optimizing some routines that benefit from instruction alignment.
```
│ sec/op │ sec/op vs base │
IndexByte/4K-16 69.89n ± ∞ ¹ 45.88n ± ∞ ¹ -34.35% (p=0.008 n=5)
IndexByte/4M-16 65.36µ ± ∞ ¹ 47.32µ ± ∞ ¹ -27.60% (p=0.008 n=5)
IndexByte/64M-16 1.435m ± ∞ ¹ 1.140m ± ∞ ¹ -20.57% (p=0.008 n=5)
│ B/s │ B/s vs base │
IndexByte/4K-16 54.58Gi ± ∞ ¹ 83.14Gi ± ∞ ¹ +52.32% (p=0.008 n=5)
IndexByte/4M-16 59.76Gi ± ∞ ¹ 82.54Gi ± ∞ ¹ +38.12% (p=0.008 n=5)
IndexByte/64M-16 43.56Gi ± ∞ ¹ 54.84Gi ± ∞ ¹ +25.89% (p=0.008 n=5)
```
Change-Id: Iff3dfd542c55e7569242be81f38b2887b9e04e87
GitHub-Last-Rev:
f309f898b13ad8fdf88a21f2f105382db9ada2f5
GitHub-Pull-Request: golang/go#61792
Reviewed-on: https://go-review.googlesource.com/c/go/+/516435
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>