bytes: improve performance for bytes.Compare on ppc64x
This improves the performance of bytes.Compare by rewriting
the cmpbody function in runtime/asm_ppc64x.s. The previous code
had a simple loop which loaded a pair of bytes and compared them,
which is inefficient for long buffers. The updated function checks
for 8- or 32-byte chunks and then loads and compares doublewords where
possible.
Because the bytes.Compare result indicates greater or less than,
the doubleword loads must take endianness into account, using a
byte-reversed load in the little-endian case.
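
The endianness point can be illustrated in Go rather than assembly. This is
only a sketch of the idea, not the cmpbody code itself: the function names
compareReversed, compareNative and order are hypothetical, and the
little-endian machine load is modelled with encoding/binary. An 8-byte chunk
compares correctly as a single integer only when the first byte ends up in
the most significant position, which on little endian requires a byte
reversal after the load.

```go
package main

import (
	"encoding/binary"
	"math/bits"
)

// order returns -1, 0 or +1 for two unsigned doublewords.
func order(x, y uint64) int {
	switch {
	case x < y:
		return -1
	case x > y:
		return 1
	}
	return 0
}

// compareNative models a plain doubleword load on a little-endian
// machine (hypothetical helper). The first byte in memory lands in the
// least significant position, so the integer order does not match the
// lexicographic byte order.
func compareNative(a, b []byte) int {
	return order(binary.LittleEndian.Uint64(a), binary.LittleEndian.Uint64(b))
}

// compareReversed models a byte-reversed load on a little-endian
// machine (hypothetical helper). After the reversal the first byte in
// memory is the most significant, so one integer comparison gives the
// same answer as comparing the 8 bytes one at a time.
func compareReversed(a, b []byte) int {
	x := bits.ReverseBytes64(binary.LittleEndian.Uint64(a))
	y := bits.ReverseBytes64(binary.LittleEndian.Uint64(b))
	return order(x, y)
}
```

For a = {1,0,0,0,0,0,0,2} and b = {0,0,0,0,0,0,0,3}, the first byte decides
the result, so a sorts after b. compareReversed agrees with that ordering,
while compareNative reports the opposite because the last byte dominates the
integer value.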