runtime: zero upper bit of Y registers in asyncPreempt on darwin/amd64
Apparently, the signal handling code path in darwin kernel leaves
the upper bits of Y registers in a dirty state, which causes many
SSE operations (128-bit and narrower) become much slower. Clear
the upper bits to get to a clean state.
We do it at the entry of asyncPreempt, which is immediately
following exiting from the kernel's signal handling code, if we
actually injected a call. It does not cover other exits where we
don't inject a call, e.g. failed preemption, profiling signal, or
other async signals. But it does cover an important use case of
async signals, preempting a tight numerical loop, which we
introduced in this cycle.
Running the benchmark in issue #37174:
name old time/op new time/op delta
Fast-8 90.0ns ± 1% 46.8ns ± 3% -47.97% (p=0.000 n=10+10)
Slow-8 188ns ± 5% 49ns ± 1% -73.82% (p=0.000 n=10+9)
There is no more slowdown due to preemption signals.