This change mirrors the current amd64 implementation, and provides optimal performance
on a range of arm64 processors including Centriq 2400 and Apple A12. By and large it is
implicitly tested by the robustness of the already existing amd64 implementation.
The implementation interleaves GHASH with CTR mode to achieve the highest possible
throughput, it also aggregates GHASH with a factor of 8, to decrease the cost of the
reduction step.
Even thought there is a significant amount of assembly, the code reuses the go
code for the amd64 implementation, so there is little additional go code.
Since AES-GCM is critical for performance of all web servers, this change is
required to level the playfield for arm64 CPUs, where amd64 currently enjoys an
unfair advantage.
Ideally both amd64 and arm64 codepaths could be replaced by hypothetical AES and
CLMUL intrinsics, with a few additional vector instructions.