runtime: speed up amd64 memmove
MOV with SSE registers seems faster than REP MOVSQ if the
size being copied is less than about 2K. Previously we
didn't use MOV if the memory region is larger than 256
byte. This patch improves the performance of 257 ~ 2048
byte non-overlapping copy by using MOV.
Here is the benchmark result on Intel Xeon 3.5GHz (Nehalem).
benchmark old ns/op new ns/op delta
BenchmarkMemmove16 4 4 +0.42%
BenchmarkMemmove32 5 5 -0.20%
BenchmarkMemmove64 6 6 -0.81%
BenchmarkMemmove128 7 7 -0.82%
BenchmarkMemmove256 10 10 +1.92%
BenchmarkMemmove512 29 16 -44.90%
BenchmarkMemmove1024 37 25 -31.55%
BenchmarkMemmove2048 55 44 -19.46%
BenchmarkMemmove4096 92 91 -0.76%
benchmark old MB/s new MB/s speedup
BenchmarkMemmove16 3370.61 3356.88 1.00x
BenchmarkMemmove32 6368.68 6386.99 1.00x
BenchmarkMemmove64 10367.37 10462.62 1.01x
BenchmarkMemmove128 17551.16 17713.48 1.01x
BenchmarkMemmove256 24692.81 24142.99 0.98x
BenchmarkMemmove512 17428.70 31687.72 1.82x
BenchmarkMemmove1024 27401.82 40009.45 1.46x
BenchmarkMemmove2048 36884.86 45766.98 1.24x
BenchmarkMemmove4096 44295.91 44627.86 1.01x
LGTM=khr
R=golang-codereviews, gobot, khr
CC=golang-codereviews
https://golang.org/cl/
90500043