This loads the keys once per call, not once per block. This
has the effect of unrolling the inner loop too. This allows
decryption to scale better with available hardware.
Noteably, encryption serializes crypto ops, thus no
performance improvement is seen, but neither is it reduced.
Care is also taken to explicitly clear keys from registers
as was done implicitly in the prior version.
Also, fix a couple of typos from copying the asm used to
load ESPERM.
Performance delta on POWER9:
name old time/op new time/op delta
AESCBCEncrypt1K 1.10µs ± 0% 1.10µs ± 0% +0.55%
AESCBCDecrypt1K 793ns ± 0% 415ns ± 0% -47.70%