crypto/md5: faster inner loop, 3x faster overall
The speedup is a combination of unrolling/specializing
the actual code and also making the compiler generate better code.
Go 1.0.1 (size: 1239 code + 320 data = 1559 total)
md5.BenchmarkHash1K
1000000 7178 ns/op 142.64 MB/s
md5.BenchmarkHash8K 200000 56834 ns/op 144.14 MB/s
Partial unroll (size: 1115 code + 256 data = 1371 total)
md5.BenchmarkHash1K
5000000 2513 ns/op 407.37 MB/s
md5.BenchmarkHash8K 500000 19406 ns/op 422.13 MB/s
Complete unroll (size: 1900 code + 0 data = 1900 code)
md5.BenchmarkHash1K
5000000 2442 ns/op 419.18 MB/s
md5.BenchmarkHash8K 500000 18957 ns/op 432.13 MB/s
Comparing Go 1.0.1 and the complete unroll (this CL):
benchmark old MB/s new MB/s speedup
md5.BenchmarkHash1K 142.64 419.18 2.94x
md5.BenchmarkHash8K 144.14 432.13 3.00x
On the same machine, 'openssl speed md5' reports 441 MB/s
and 531 MB/s for our two cases, so this CL is at 90% and 80% of
those speeds, which is at least in the right ballpark.
OpenSSL is using carefully engineered assembly, so we are
unlikely to catch up completely.
Measurements on a Mid-2010 MacPro5,1.
R=golang-dev, bradfitz, agl
CC=golang-dev
https://golang.org/cl/
6220046