crypto/elliptic: fix incomplete addition used in CombinedMult on s390x
This applies the amd64-specific changes from CL 42611 to the s390x P256
implementation. The s390x implementation was disabled in CL 62292 and
this CL re-enables it.
Adam Langley's commit message from CL 42611:
The optimised P-256 includes a CombinedMult function, which doesn't do
dual-scalar multiplication, but does avoid an affine conversion for
ECDSA verification.
However, it currently uses an assembly point addition function that
doesn't handle exceptional cases.