runtime: don't track scheduling latency for _Grunning <-> _Gsyscall
The current logic causes much more tracking than necessary, when really
_Grunning and _Gsyscall are both sort of "running" from the perspective
of tracking scheduling latency.
This makes cgo calls and syscalls a little faster in the single-threaded
case, and shows much larger improvement in the multi-threaded case
by removing updates of shared variables (though this parallel
microbenchmark is a little unrealistic, so don't ascribe too much weight
to it).