The GC requires each zeroed pointer-sized word to become visible to the
memory subsystem as a whole.
Although the implementation of Enhanced REP STOSB tries to use the most
efficient stores possible, e.g. writing a whole cache line at once rather
than byte by byte, it does not guarantee this, so we should use REP STOSQ
to meet the GC's requirement.
Performance is not affected.
Change-Id: I1b0fd1444a40bfbb661541291ab96eba11bcc762
Reviewed-on: https://go-review.googlesource.com/c/go/+/405274
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
MOVQ AX, DI // DI = ptr
XORQ AX, AX
- // MOVOU seems always faster than REP STOSQ.
+ // MOVOU seems always faster than REP STOSQ when Enhanced REP STOSQ is not available.
tail:
// BSR+branch table make almost all memmove/memclr benchmarks worse. Not worth doing.
TESTQ BX, BX
JAE loop_preheader_avx2_huge
loop_erms:
+ // STOSQ is used to guarantee that the whole zeroed pointer-sized word is visible
+ // for a memory subsystem as the GC requires this.
MOVQ BX, CX
- REP; STOSB
- RET
+ SHRQ $3, CX
+ ANDQ $7, BX
+ REP; STOSQ
+ JMP tail
loop_preheader_avx2_huge:
// Align to 32 byte boundary