Apparently, the signal handling code path in darwin kernel leaves
the upper bits of Y registers in a dirty state, which causes many
SSE operations (128-bit and narrower) become much slower. Clear
the upper bits to get to a clean state.
We do it at the entry of asyncPreempt, which is immediately
following exiting from the kernel's signal handling code, if we
actually injected a call. It does not cover other exits where we
don't inject a call, e.g. failed preemption, profiling signal, or
other async signals. But it does cover an important use case of
async signals, preempting a tight numerical loop, which we
introduced in this cycle.
Running the benchmark in issue #37174:
name old time/op new time/op delta
Fast-8 90.0ns ± 1% 46.8ns ± 3% -47.97% (p=0.000 n=10+10)
Slow-8 188ns ± 5% 49ns ± 1% -73.82% (p=0.000 n=10+9)
There is no more slowdown due to preemption signals.
For #37174.
Change-Id: I8b83d083fade1cabbda09b4bc25ccbadafaf7605
Reviewed-on: https://go-review.googlesource.com/c/go/+/219131
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
// TODO: MXCSR register?
+ // Apparently, the signal handling code path in darwin kernel leaves
+ // the upper bits of Y registers in a dirty state, which causes
+ // many SSE operations (128-bit and narrower) become much slower.
+ // Clear the upper bits to get to a clean state. See issue #37174.
+ // It is safe here as Go code don't use the upper bits of Y registers.
+ p("#ifdef GOOS_darwin")
+ p("VZEROUPPER")
+ p("#endif")
+
p("PUSHQ BP")
p("MOVQ SP, BP")
p("// Save flags before clobbering them")
#include "textflag.h"
TEXT ·asyncPreempt(SB),NOSPLIT|NOFRAME,$0-0
+ #ifdef GOOS_darwin
+ VZEROUPPER
+ #endif
PUSHQ BP
MOVQ SP, BP
// Save flags before clobbering them