This change uses library functions such as bits.RotateLeft32 to
reduce the amount of code needed in the generic implementation.
Since the code is now shorter I've also removed the option to
generate a non-unrolled version of the code.
I've also tried to remove bounds checks where possible to make
the new version performant, however that is not the primary goal
of this change since most architectures have assembly
implementations already.