This is mostly cleanup and simplification. It removes
many unneeded register moves, loads, and bit-twiddling
operations that were holdovers from porting the code
from the amd64 version.
The updated code loads each block once per iteration
instead of once per round. The logical operations also
now match the original MD5 specification (RFC 1321).
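For reference, a minimal Go sketch of the four round
functions as written in RFC 1321, next to two equivalent
rewrites that are common in optimized implementations.
Whether the previous assembly used exactly these rewrites
is an assumption; fAlt and gAlt are illustrative names,
not identifiers from this change.

```go
package main

import "fmt"

// Round functions as written in RFC 1321, section 3.4.
func F(x, y, z uint32) uint32 { return (x & y) | (^x & z) }
func G(x, y, z uint32) uint32 { return (x & z) | (y & ^z) }
func H(x, y, z uint32) uint32 { return x ^ y ^ z }
func I(x, y, z uint32) uint32 { return y ^ (x | ^z) }

// fAlt and gAlt (hypothetical names) are bitwise-equivalent
// rewrites of F and G that use one fewer operation each.
func fAlt(x, y, z uint32) uint32 { return ((y ^ z) & x) ^ z }
func gAlt(x, y, z uint32) uint32 { return ((x ^ y) & z) ^ y }

func main() {
	// The low 8 bits of these constants enumerate all eight
	// (x, y, z) bit combinations, so a single comparison checks
	// the full truth table of each bitwise function.
	x, y, z := uint32(0xF0), uint32(0xCC), uint32(0xAA)
	fmt.Println(F(x, y, z) == fAlt(x, y, z))
	fmt.Println(G(x, y, z) == gAlt(x, y, z))
}
```

Both comparisons hold because the functions operate on
each bit position independently.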
Also add extra sizes to the benchmark test to give more
data points on how the implementation scales with input
size.
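A rough sketch of measuring throughput across several
input sizes, using testing.Benchmark outside of "go test".
The sizes and the benchSum helper are hypothetical and are
not the ones added by this change.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"testing"
)

// sizes is a hypothetical list of input sizes in bytes; the
// sizes actually added by this change are not listed here.
var sizes = []int{8, 64, 1024, 8192}

// benchSum (hypothetical helper) measures md5.Sum over a
// zeroed buffer of the given size.
func benchSum(size int) testing.BenchmarkResult {
	buf := make([]byte, size)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(size))
		for i := 0; i < b.N; i++ {
			md5.Sum(buf)
		}
	})
}

func main() {
	for _, size := range sizes {
		r := benchSum(size)
		// Bytes per iteration times iterations, over elapsed time.
		mbps := float64(r.Bytes) * float64(r.N) / 1e6 / r.T.Seconds()
		fmt.Printf("%6d bytes: %8.2f MB/s\n", size, mbps)
	}
}
```

Small sizes are dominated by fixed per-call overhead, while
large sizes reflect the speed of the block function itself.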
All in all, this is roughly a 20% improvement on ppc64le
code running on POWER9 (POWER8 is similar, but around
16%):