[dev.simd] simd: use new data movement instructions to do "fast" transposes
This serves as a test, an example, and a performance comparison.
The generated code still contains a lot of checking that we may
be able to optimize away.
$b/go test -bench=B -benchtime=5x .
goos: linux
goarch: amd64
pkg: simd/internal/simd_test
cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
BenchmarkPlainTranspose-88       5  3143116414 ns/op
BenchmarkTiled4Transpose-88      5  1127457328 ns/op
BenchmarkTiled8Transpose-88      5   671788993 ns/op
Benchmark2BlockedTranspose-88    5  1665429657 ns/op
Benchmark3BlockedTranspose-88    5  1208767441 ns/op
Benchmark4BlockedTranspose-88    5   910212696 ns/op
Benchmark5aBlockedTranspose-88   5   939205670 ns/op
Benchmark5bBlockedTranspose-88   5  1018286871 ns/op
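
For readers unfamiliar with the terms in the benchmark names, here is a
minimal scalar sketch of the difference between a plain and a tiled
in-place transpose. It is not taken from this CL and uses no SIMD
instructions. The names plainTranspose and tiledTranspose, the
[][]int32 layout, and the tile parameter are all assumptions made for
illustration.

```go
package main

import "fmt"

// plainTranspose (hypothetical name) transposes a square matrix in place
// by swapping every element below the diagonal with its mirror image.
func plainTranspose(m [][]int32) {
	for i := range m {
		for j := 0; j < i; j++ {
			m[i][j], m[j][i] = m[j][i], m[i][j]
		}
	}
}

// tiledTranspose (hypothetical name) does the same swaps, but walks the
// lower triangle in tile x tile blocks so that each block and its mirror
// block are touched together. tile must be positive.
func tiledTranspose(m [][]int32, tile int) {
	n := len(m)
	for bi := 0; bi < n; bi += tile {
		iEnd := bi + tile
		if iEnd > n {
			iEnd = n
		}
		for bj := 0; bj <= bi; bj += tile {
			for i := bi; i < iEnd; i++ {
				jEnd := bj + tile
				if jEnd > n {
					jEnd = n
				}
				// In a tile on the diagonal, stop before the diagonal
				// so that no pair is swapped twice.
				if bi == bj {
					jEnd = i
				}
				for j := bj; j < jEnd; j++ {
					m[i][j], m[j][i] = m[j][i], m[i][j]
				}
			}
		}
	}
}

func main() {
	m := [][]int32{
		{1, 2, 3},
		{4, 5, 6},
		{7, 8, 9},
	}
	tiledTranspose(m, 2)
	fmt.Println(m)
}
```

Both functions perform the same set of swaps and differ only in
traversal order, so any speed difference between them in a sketch like
this comes from memory access patterns rather than from the amount of
work done.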
Change-Id: I78bae0fd2ff4f511dac4291b898bbb79b0114741
Reviewed-on: https://go-review.googlesource.com/c/go/+/700695
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>