Two GC-related functions, scang and casgstatus, wait in an active spin loop.
Active spinning is never a good idea in user-space. Once we have waited several
times longer than the expected wait time, something unexpected is happening
(e.g. the thread we are waiting for is descheduled or handling a page fault)
and we need to yield to the OS scheduler. Moreover, the expected wait time is
very high for these functions: scang wait time can be tens of milliseconds,
casgstatus wait time can be hundreds of microseconds. It does not make sense
to spin actively even for that long.
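For illustration only (this is not the runtime code, just a self-contained
sketch of the pattern): a waiter that spins on a shared status word with no
bound burns CPU for as long as the writer is delayed, e.g. descheduled.

package main

import (
	"sync/atomic"
	"time"
)

func main() {
	var status uint32 // 0 = busy, 1 = done

	// Writer: flips the status only after a delay, emulating a goroutine
	// that is slow to reach a safe point (or is descheduled entirely).
	go func() {
		time.Sleep(2 * time.Millisecond)
		atomic.StoreUint32(&status, 1)
	}()

	// Waiter: pure active spin. Every iteration costs CPU, and nothing
	// tells the OS scheduler that this thread has no useful work to do.
	for atomic.LoadUint32(&status) != 1 {
		// spin
	}
}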
A profile of go install -a std on a 4-core machine shows that 11% of the time
is spent in the active spin in scang:
6.12% compile compile [.] runtime.scang
3.27% compile compile [.] runtime.readgstatus
1.72% compile compile [.] runtime/internal/atomic.Load
The active spinning also increases tail latency under even slight
oversubscription: GC goroutines spend their whole quantum in the loop instead
of executing user code.
Here is the scang wait time histogram (in nanoseconds) during go install -a std:
13707.0000 - 1815442.7667 [ 118]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎...
1815442.7667 - 3617178.5333 [ 9]: ∎∎∎∎∎∎∎∎∎
3617178.5333 - 5418914.3000 [ 11]: ∎∎∎∎∎∎∎∎∎∎∎
5418914.3000 - 7220650.0667 [ 5]: ∎∎∎∎∎
7220650.0667 - 9022385.8333 [ 12]: ∎∎∎∎∎∎∎∎∎∎∎∎
9022385.8333 - 10824121.6000 [ 13]: ∎∎∎∎∎∎∎∎∎∎∎∎∎
10824121.6000 - 12625857.3667 [ 15]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
12625857.3667 - 14427593.1333 [ 18]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
14427593.1333 - 16229328.9000 [ 18]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
16229328.9000 - 18031064.6667 [ 32]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
18031064.6667 - 19832800.4333 [ 28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
19832800.4333 - 21634536.2000 [ 6]: ∎∎∎∎∎∎
21634536.2000 - 23436271.9667 [ 15]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
23436271.9667 - 25238007.7333 [ 11]: ∎∎∎∎∎∎∎∎∎∎∎
25238007.7333 - 27039743.5000 [ 27]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
27039743.5000 - 28841479.2667 [ 20]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
28841479.2667 - 30643215.0333 [ 10]: ∎∎∎∎∎∎∎∎∎∎
30643215.0333 - 32444950.8000 [ 7]: ∎∎∎∎∎∎∎
32444950.8000 - 34246686.5667 [ 4]: ∎∎∎∎
34246686.5667 - 36048422.3333 [ 4]: ∎∎∎∎
36048422.3333 - 37850158.1000 [ 1]: ∎
37850158.1000 - 39651893.8667 [ 5]: ∎∎∎∎∎
39651893.8667 - 41453629.6333 [ 2]: ∎∎
41453629.6333 - 43255365.4000 [ 2]: ∎∎
43255365.4000 - 45057101.1667 [ 2]: ∎∎
45057101.1667 - 46858836.9333 [ 1]: ∎
46858836.9333 - 48660572.7000 [ 2]: ∎∎
48660572.7000 - 50462308.4667 [ 3]: ∎∎∎
50462308.4667 - 52264044.2333 [ 2]: ∎∎
52264044.2333 - 54065780.0000 [ 2]: ∎∎
and the zoomed-in first part:
13707.0000 - 19916.7667 [ 2]: ∎∎
19916.7667 - 26126.5333 [ 2]: ∎∎
26126.5333 - 32336.3000 [ 9]: ∎∎∎∎∎∎∎∎∎
32336.3000 - 38546.0667 [ 8]: ∎∎∎∎∎∎∎∎
38546.0667 - 44755.8333 [ 12]: ∎∎∎∎∎∎∎∎∎∎∎∎
44755.8333 - 50965.6000 [ 10]: ∎∎∎∎∎∎∎∎∎∎
50965.6000 - 57175.3667 [ 5]: ∎∎∎∎∎
57175.3667 - 63385.1333 [ 6]: ∎∎∎∎∎∎
63385.1333 - 69594.9000 [ 5]: ∎∎∎∎∎
69594.9000 - 75804.6667 [ 6]: ∎∎∎∎∎∎
75804.6667 - 82014.4333 [ 6]: ∎∎∎∎∎∎
82014.4333 - 88224.2000 [ 4]: ∎∎∎∎
88224.2000 - 94433.9667 [ 1]: ∎
94433.9667 - 100643.7333 [ 1]: ∎
100643.7333 - 106853.5000 [ 2]: ∎∎
106853.5000 - 113063.2667 [ 0]:
113063.2667 - 119273.0333 [ 2]: ∎∎
119273.0333 - 125482.8000 [ 2]: ∎∎
125482.8000 - 131692.5667 [ 1]: ∎
131692.5667 - 137902.3333 [ 1]: ∎
137902.3333 - 144112.1000 [ 0]:
144112.1000 - 150321.8667 [ 2]: ∎∎
150321.8667 - 156531.6333 [ 1]: ∎
156531.6333 - 162741.4000 [ 1]: ∎
162741.4000 - 168951.1667 [ 0]:
168951.1667 - 175160.9333 [ 0]:
175160.9333 - 181370.7000 [ 1]: ∎
181370.7000 - 187580.4667 [ 1]: ∎
187580.4667 - 193790.2333 [ 2]: ∎∎
193790.2333 - 200000.0000 [ 0]:
Here is the casgstatus wait time histogram (in nanoseconds):
631.0000 - 5276.6333 [ 3]: ∎∎∎
5276.6333 - 9922.2667 [ 5]: ∎∎∎∎∎
9922.2667 - 14567.9000 [ 2]: ∎∎
14567.9000 - 19213.5333 [ 6]: ∎∎∎∎∎∎
19213.5333 - 23859.1667 [ 5]: ∎∎∎∎∎
23859.1667 - 28504.8000 [ 6]: ∎∎∎∎∎∎
28504.8000 - 33150.4333 [ 6]: ∎∎∎∎∎∎
33150.4333 - 37796.0667 [ 2]: ∎∎
37796.0667 - 42441.7000 [ 1]: ∎
42441.7000 - 47087.3333 [ 3]: ∎∎∎
47087.3333 - 51732.9667 [ 0]:
51732.9667 - 56378.6000 [ 1]: ∎
56378.6000 - 61024.2333 [ 0]:
61024.2333 - 65669.8667 [ 0]:
65669.8667 - 70315.5000 [ 0]:
70315.5000 - 74961.1333 [ 1]: ∎
74961.1333 - 79606.7667 [ 0]:
79606.7667 - 84252.4000 [ 0]:
84252.4000 - 88898.0333 [ 0]:
88898.0333 - 93543.6667 [ 0]:
93543.6667 - 98189.3000 [ 0]:
98189.3000 - 102834.9333 [ 0]:
102834.9333 - 107480.5667 [ 1]: ∎
107480.5667 - 112126.2000 [ 0]:
112126.2000 - 116771.8333 [ 0]:
116771.8333 - 121417.4667 [ 0]:
121417.4667 - 126063.1000 [ 0]:
126063.1000 - 130708.7333 [ 0]:
130708.7333 - 135354.3667 [ 0]:
135354.3667 - 140000.0000 [ 1]: ∎
Ideally we would eliminate the waiting by switching to an async state machine
for GC, but for now we just yield to the OS scheduler after a reasonable wait
time.
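A minimal sketch of that spin-then-yield idea in portable Go (the real change
lives inside the runtime and uses internal helpers such as procyield, osyield
and nanotime; here time.Now and runtime.Gosched are stand-ins, and
spinThenYield is a hypothetical name):

package spinwait

import (
	"runtime"
	"time"
)

// spinThenYield busy-waits on cond, but only for yieldDelay; after that it
// keeps checking while yielding, so a descheduled or page-faulting peer
// does not cost us a whole scheduler quantum.
func spinThenYield(cond func() bool, yieldDelay time.Duration) {
	deadline := time.Now().Add(yieldDelay)
	for !cond() {
		if time.Now().Before(deadline) {
			// Brief active spin: cheap when the wait is short.
			continue
		}
		// Expected wait time exceeded: give the CPU back so the
		// goroutine/thread we are waiting for can make progress.
		runtime.Gosched()
	}
}

Spinning briefly keeps the common short waits cheap, while yielding afterwards
caps the damage when the wait turns out to be long.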
To choose the yield parameters I measured golang.org/x/benchmarks/http tail
latencies with different yield delays and oversubscription levels.
With no oversubscription (to the degree possible):
scang yield delay = 1, casgstatus yield delay = 1
Latency-50 1.41ms ±15% 1.41ms ± 5% ~ (p=0.611 n=13+12)
Latency-95 5.21ms ± 2% 5.15ms ± 2% -1.15% (p=0.012 n=13+13)
Latency-99 7.16ms ± 2% 7.05ms ± 2% -1.54% (p=0.002 n=13+13)
Latency-999 10.7ms ± 9% 10.2ms ±10% -5.46% (p=0.004 n=12+13)
scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50 1.41ms ±15% 1.41ms ± 8% ~ (p=0.511 n=13+13)
Latency-95 5.21ms ± 2% 5.14ms ± 2% -1.23% (p=0.006 n=13+13)
Latency-99 7.16ms ± 2% 7.02ms ± 2% -1.94% (p=0.000 n=13+13)
Latency-999 10.7ms ± 9% 10.1ms ± 8% -6.14% (p=0.000 n=12+13)
scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50 1.41ms ±15% 1.45ms ± 6% ~ (p=0.724 n=13+13)
Latency-95 5.21ms ± 2% 5.18ms ± 1% ~ (p=0.287 n=13+13)
Latency-99 7.16ms ± 2% 7.05ms ± 2% -1.64% (p=0.002 n=13+13)
Latency-999 10.7ms ± 9% 10.0ms ± 5% -6.72% (p=0.000 n=12+13)
scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50 1.41ms ±15% 1.51ms ± 7% +6.57% (p=0.002 n=13+13)
Latency-95 5.21ms ± 2% 5.21ms ± 2% ~ (p=0.960 n=13+13)
Latency-99 7.16ms ± 2% 7.06ms ± 2% -1.50% (p=0.012 n=13+13)
Latency-999 10.7ms ± 9% 10.0ms ± 6% -6.49% (p=0.000 n=12+13)
scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50 1.41ms ±15% 1.53ms ± 6% +8.48% (p=0.000 n=13+12)
Latency-95 5.21ms ± 2% 5.23ms ± 2% ~ (p=0.287 n=13+13)
Latency-99 7.16ms ± 2% 7.08ms ± 2% -1.21% (p=0.004 n=13+13)
Latency-999 10.7ms ± 9% 9.9ms ± 3% -7.99% (p=0.000 n=12+12)
scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50 1.41ms ±15% 1.47ms ± 5% ~ (p=0.072 n=13+13)
Latency-95 5.21ms ± 2% 5.17ms ± 2% ~ (p=0.091 n=13+13)
Latency-99 7.16ms ± 2% 7.02ms ± 2% -1.99% (p=0.000 n=13+13)
Latency-999 10.7ms ± 9% 9.9ms ± 5% -7.86% (p=0.000 n=12+13)
With slight oversubscription (another instance of the http benchmark was
running in the background with reduced GOMAXPROCS):
scang yield delay = 1, casgstatus yield delay = 1
Latency-50 840µs ± 3% 804µs ± 3% -4.37% (p=0.000 n=15+18)
Latency-95 6.52ms ± 4% 6.03ms ± 4% -7.51% (p=0.000 n=18+18)
Latency-99 10.8ms ± 7% 10.0ms ± 4% -7.33% (p=0.000 n=18+14)
Latency-999 18.0ms ± 9% 16.8ms ± 7% -6.84% (p=0.000 n=18+18)
scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50 840µs ± 3% 809µs ± 3% -3.71% (p=0.000 n=15+17)
Latency-95 6.52ms ± 4% 6.11ms ± 4% -6.29% (p=0.000 n=18+18)
Latency-99 10.8ms ± 7% 9.9ms ± 6% -7.55% (p=0.000 n=18+18)
Latency-999 18.0ms ± 9% 16.5ms ±11% -8.49% (p=0.000 n=18+18)
scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50 840µs ± 3% 823µs ± 5% -2.06% (p=0.002 n=15+18)
Latency-95 6.52ms ± 4% 6.32ms ± 3% -3.05% (p=0.000 n=18+18)
Latency-99 10.8ms ± 7% 10.2ms ± 4% -5.22% (p=0.000 n=18+18)
Latency-999 18.0ms ± 9% 16.7ms ±10% -7.09% (p=0.000 n=18+18)
scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50 840µs ± 3% 836µs ± 5% ~ (p=0.442 n=15+18)
Latency-95 6.52ms ± 4% 6.39ms ± 3% -2.00% (p=0.000 n=18+18)
Latency-99 10.8ms ± 7% 10.2ms ± 6% -5.15% (p=0.000 n=18+17)
Latency-999 18.0ms ± 9% 16.6ms ± 8% -7.48% (p=0.000 n=18+18)
scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50 840µs ± 3% 836µs ± 6% ~ (p=0.401 n=15+18)
Latency-95 6.52ms ± 4% 6.40ms ± 4% -1.79% (p=0.010 n=18+18)
Latency-99 10.8ms ± 7% 10.2ms ± 5% -4.95% (p=0.000 n=18+18)
Latency-999 18.0ms ± 9% 16.5ms ±14% -8.17% (p=0.000 n=18+18)
scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50 840µs ± 3% 828µs ± 2% -1.49% (p=0.001 n=15+17)
Latency-95 6.52ms ± 4% 6.38ms ± 4% -2.04% (p=0.001 n=18+18)
Latency-99 10.8ms ± 7% 10.2ms ± 4% -4.77% (p=0.000 n=18+18)
Latency-999 18.0ms ± 9% 16.9ms ± 9% -6.23% (p=0.000 n=18+18)
With significant oversubscription (the background http benchmark was running
with full GOMAXPROCS):
scang yield delay = 1, casgstatus yield delay = 1
Latency-50 1.32ms ±12% 1.30ms ±13% ~ (p=0.454 n=14+14)
Latency-95 16.3ms ±10% 15.3ms ± 7% -6.29% (p=0.001 n=14+14)
Latency-99 29.4ms ±10% 27.9ms ± 5% -5.04% (p=0.001 n=14+12)
Latency-999 49.9ms ±19% 45.9ms ± 5% -8.00% (p=0.008 n=14+13)
scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50 1.32ms ±12% 1.29ms ± 9% ~ (p=0.227 n=14+14)
Latency-95 16.3ms ±10% 15.4ms ± 5% -5.27% (p=0.002 n=14+14)
Latency-99 29.4ms ±10% 27.9ms ± 6% -5.16% (p=0.001 n=14+14)
Latency-999 49.9ms ±19% 46.8ms ± 8% -6.21% (p=0.050 n=14+14)
scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50 1.32ms ±12% 1.35ms ± 9% ~ (p=0.401 n=14+14)
Latency-95 16.3ms ±10% 15.0ms ± 4% -7.67% (p=0.000 n=14+14)
Latency-99 29.4ms ±10% 27.4ms ± 5% -6.98% (p=0.000 n=14+14)
Latency-999 49.9ms ±19% 44.7ms ± 5% -10.56% (p=0.000 n=14+11)
scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50 1.32ms ±12% 1.36ms ±10% ~ (p=0.246 n=14+14)
Latency-95 16.3ms ±10% 14.9ms ± 5% -8.31% (p=0.000 n=14+14)
Latency-99 29.4ms ±10% 27.4ms ± 7% -6.70% (p=0.000 n=14+14)
Latency-999 49.9ms ±19% 44.9ms ±15% -10.13% (p=0.003 n=14+14)
scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50 1.32ms ±12% 1.41ms ± 9% +6.37% (p=0.008 n=14+13)
Latency-95 16.3ms ±10% 15.1ms ± 8% -7.45% (p=0.000 n=14+14)
Latency-99 29.4ms ±10% 27.5ms ±12% -6.67% (p=0.002 n=14+14)
Latency-999 49.9ms ±19% 45.9ms ±16% -8.06% (p=0.019 n=14+14)
scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50 1.32ms ±12% 1.42ms ±10% +7.21% (p=0.003 n=14+14)
Latency-95 16.3ms ±10% 15.0ms ± 7% -7.59% (p=0.000 n=14+14)
Latency-99 29.4ms ±10% 27.3ms ± 8% -7.20% (p=0.000 n=14+14)
Latency-999 49.9ms ±19% 44.8ms ± 8% -10.21% (p=0.001 n=14+13)
All numbers are on 8 cores and with GOGC=10 (the http benchmark has a tiny
heap, few goroutines, and a low allocation rate, so by default GC barely
affects tail latency).
10us/5us yield delays seem to provide a reasonable compromise and give a
5-10% tail latency reduction. That is what is used in this change.
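For concreteness, those delays map onto the spinThenYield sketch above roughly
like this (hypothetical constants and call site, continuing the same
illustrative package, not the runtime's actual code):

package spinwait

import (
	"sync/atomic"
	"time"
)

// Hypothetical delays mirroring the values chosen in this change:
// scang yields after ~10µs of spinning, casgstatus after ~5µs.
const (
	scangYieldDelay      = 10 * time.Microsecond
	casgstatusYieldDelay = 5 * time.Microsecond
)

// waitForScanDone is an illustrative caller, not the runtime's scang:
// it spins briefly on a completion flag, then starts yielding.
func waitForScanDone(done *uint32) {
	spinThenYield(func() bool { return atomic.LoadUint32(done) == 1 }, scangYieldDelay)
}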
go install -a std results on a 4-core machine:
name old time/op new time/op delta
Time 8.39s ± 2% 7.94s ± 2% -5.34% (p=0.000 n=47+49)
UserTime 24.6s ± 2% 22.9s ± 2% -6.76% (p=0.000 n=49+49)
SysTime 1.77s ± 9% 1.89s ±11% +7.00% (p=0.000 n=49+49)
CpuLoad 315ns ± 2% 313ns ± 1% -0.59% (p=0.000 n=49+48) # %CPU
MaxRSS 97.1ms ± 4% 97.5ms ± 9% ~ (p=0.838 n=46+49) # bytes
Update #14396
Update #14189
Change-Id: I3f4109bf8f7fd79b39c466576690a778232055a2
Reviewed-on: https://go-review.googlesource.com/21503
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Austin Clements <austin@google.com>