internal/bytealg: process two AVX2 lanes per Count loop
The branch taken by the bytealg.Count algorithm used to process a single
32 bytes block per loop iteration. Throughput of the algorithm can be
improved by unrolling two iterations per loop: the lack of data
dependencies between each iteration allows for better utilization of the
CPU pipeline. The improvement is most significant on medium size payloads
that fit in the L1 cache; beyond the L1 cache size, memory bandwidth is
likely the bottleneck and the change does not show any measurable
improvements.
Change-Id: I1098228c726a2ee814806dcb438b7e92febf4370
Reviewed-on: https://go-review.googlesource.com/c/go/+/532457 Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com> Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>