From: Austin Clements Date: Sat, 28 May 2016 01:04:40 +0000 (-0400) Subject: runtime: bound scanobject to ~100 µs X-Git-Tag: go1.8beta1~1476 X-Git-Url: http://www.git.cypherpunks.su/?a=commitdiff_plain;h=cf4f1d07a189125a8774a923a3259126599e942b;p=gostls13.git runtime: bound scanobject to ~100 µs Currently the time spent in scanobject is proportional to the size of the object being scanned. Since scanobject is non-preemptible, large objects can cause significant goroutine (and even whole application) delays through several means: 1. If a GC assist picks up a large object, the allocating goroutine is blocked for the whole scan, even if that scan well exceeds that goroutine's debt. 2. Since the scheduler does not run on the P performing a large object scan, goroutines in that P's run queue do not run unless they are stolen by another P (which can take some time). If there are a few large objects, all of the Ps may get tied up so the scheduler doesn't run anywhere. 3. Even if a large object is scanned by a background worker and other Ps are still running the scheduler, the large object scan doesn't flush background credit until the whole scan is done. This can easily cause all allocations to block in assists, waiting for credit, causing an effective STW. Fix this by splitting large objects into 128 KB "oblets" and scanning at most one oblet at a time. Since we can scan 1–2 MB/ms, this equates to bounding scanobject at roughly 100 µs. This improves assist behavior both because assists can no longer get "unlucky" and be stuck scanning a large object, and because it causes the background worker to flush credit and unblock assists more frequently when scanning large objects. This also improves GC parallelism if the heap consists primarily of a small number of very large objects by letting multiple workers scan a large objects in parallel. Fixes #10345. Fixes #16293. This substantially improves goroutine latency in the benchmark from issue #16293, which exercises several forms of very large objects: name old max-latency new max-latency delta SliceNoPointer-12 154µs ± 1% 155µs ± 2% ~ (p=0.087 n=13+12) SlicePointer-12 314ms ± 1% 5.94ms ±138% -98.11% (p=0.000 n=19+20) SliceLivePointer-12 1148ms ± 0% 4.72ms ±167% -99.59% (p=0.000 n=19+20) MapNoPointer-12 72509µs ± 1% 408µs ±325% -99.44% (p=0.000 n=19+18) ChanPointer-12 313ms ± 0% 4.74ms ±140% -98.49% (p=0.000 n=18+20) ChanLivePointer-12 1147ms ± 0% 3.30ms ±149% -99.71% (p=0.000 n=19+20) name old P99.9-latency new P99.9-latency delta SliceNoPointer-12 113µs ±25% 107µs ±12% ~ (p=0.153 n=20+18) SlicePointer-12 309450µs ± 0% 133µs ±23% -99.96% (p=0.000 n=20+20) SliceLivePointer-12 961ms ± 0% 1.35ms ±27% -99.86% (p=0.000 n=20+20) MapNoPointer-12 448µs ±288% 119µs ±18% -73.34% (p=0.000 n=18+20) ChanPointer-12 309450µs ± 0% 134µs ±23% -99.96% (p=0.000 n=20+19) ChanLivePointer-12 961ms ± 0% 1.35ms ±27% -99.86% (p=0.000 n=20+20) This has negligible effect on all metrics from the garbage, JSON, and HTTP x/benchmarks. It shows slight improvement on some of the go1 benchmarks, particularly Revcomp, which uses some multi-megabyte buffers: name old time/op new time/op delta BinaryTree17-12 2.46s ± 1% 2.47s ± 1% +0.32% (p=0.012 n=20+20) Fannkuch11-12 2.82s ± 0% 2.81s ± 0% -0.61% (p=0.000 n=17+20) FmtFprintfEmpty-12 50.8ns ± 5% 50.5ns ± 2% ~ (p=0.197 n=17+19) FmtFprintfString-12 131ns ± 1% 132ns ± 0% +0.57% (p=0.000 n=20+16) FmtFprintfInt-12 117ns ± 0% 116ns ± 0% -0.47% (p=0.000 n=15+20) FmtFprintfIntInt-12 180ns ± 0% 179ns ± 1% -0.78% (p=0.000 n=16+20) FmtFprintfPrefixedInt-12 186ns ± 1% 185ns ± 1% -0.55% (p=0.000 n=19+20) FmtFprintfFloat-12 263ns ± 1% 271ns ± 0% +2.84% (p=0.000 n=18+20) FmtManyArgs-12 741ns ± 1% 742ns ± 1% ~ (p=0.190 n=19+19) GobDecode-12 7.44ms ± 0% 7.35ms ± 1% -1.21% (p=0.000 n=20+20) GobEncode-12 6.22ms ± 1% 6.21ms ± 1% ~ (p=0.336 n=20+19) Gzip-12 220ms ± 1% 219ms ± 1% ~ (p=0.130 n=19+19) Gunzip-12 37.9ms ± 0% 37.9ms ± 1% ~ (p=1.000 n=20+19) HTTPClientServer-12 82.5µs ± 3% 82.6µs ± 3% ~ (p=0.776 n=20+19) JSONEncode-12 16.4ms ± 1% 16.5ms ± 2% +0.49% (p=0.003 n=18+19) JSONDecode-12 53.7ms ± 1% 54.1ms ± 1% +0.71% (p=0.000 n=19+18) Mandelbrot200-12 4.19ms ± 1% 4.20ms ± 1% ~ (p=0.452 n=19+19) GoParse-12 3.38ms ± 1% 3.37ms ± 1% ~ (p=0.123 n=19+19) RegexpMatchEasy0_32-12 72.1ns ± 1% 71.8ns ± 1% ~ (p=0.397 n=19+17) RegexpMatchEasy0_1K-12 242ns ± 0% 242ns ± 0% ~ (p=0.168 n=17+20) RegexpMatchEasy1_32-12 72.1ns ± 1% 72.1ns ± 1% ~ (p=0.538 n=18+19) RegexpMatchEasy1_1K-12 385ns ± 1% 384ns ± 1% ~ (p=0.388 n=20+20) RegexpMatchMedium_32-12 112ns ± 1% 112ns ± 3% ~ (p=0.539 n=20+20) RegexpMatchMedium_1K-12 34.4µs ± 2% 34.4µs ± 2% ~ (p=0.628 n=18+18) RegexpMatchHard_32-12 1.80µs ± 1% 1.80µs ± 1% ~ (p=0.522 n=18+19) RegexpMatchHard_1K-12 54.0µs ± 1% 54.1µs ± 1% ~ (p=0.647 n=20+19) Revcomp-12 387ms ± 1% 369ms ± 5% -4.89% (p=0.000 n=17+19) Template-12 62.3ms ± 1% 62.0ms ± 0% -0.48% (p=0.002 n=20+17) TimeParse-12 314ns ± 1% 314ns ± 0% ~ (p=1.011 n=20+13) TimeFormat-12 358ns ± 0% 354ns ± 0% -1.12% (p=0.000 n=17+20) [Geo mean] 53.5µs 53.3µs -0.23% Change-Id: I2a0a179d1d6bf7875dd054b7693dd12d2a340132 Reviewed-on: https://go-review.googlesource.com/23540 Run-TryBot: Austin Clements TryBot-Result: Gobot Gobot Reviewed-by: Rick Hudson --- diff --git a/src/runtime/mgc.go b/src/runtime/mgc.go index f184d81b23..ce7ac63083 100644 --- a/src/runtime/mgc.go +++ b/src/runtime/mgc.go @@ -122,6 +122,15 @@ // proportion to the allocation cost. Adjusting GOGC just changes the linear constant // (and also the amount of extra memory used). +// Oblets +// +// In order to prevent long pauses while scanning large objects and to +// improve parallelism, the garbage collector breaks up scan jobs for +// objects larger than maxObletBytes into "oblets" of at most +// maxObletBytes. When scanning encounters the beginning of a large +// object, it scans only the first oblet and enqueues the remaining +// oblets as new scan jobs. + package runtime import ( diff --git a/src/runtime/mgcmark.go b/src/runtime/mgcmark.go index 0c624d2cbc..a4f25ac48f 100644 --- a/src/runtime/mgcmark.go +++ b/src/runtime/mgcmark.go @@ -25,6 +25,15 @@ const ( // rootBlockSpans is the number of spans to scan per span // root. rootBlockSpans = 8 * 1024 // 64MB worth of spans + + // maxObletBytes is the maximum bytes of an object to scan at + // once. Larger objects will be split up into "oblets" of at + // most this size. Since we can scan 1–2 MB/ms, 128 KB bounds + // scan preemption at ~100 µs. + // + // This must be > _MaxSmallSize so that the object base is the + // span base. + maxObletBytes = 128 << 10 ) // gcMarkRootPrepare queues root scanning jobs (stacks, globals, and @@ -1113,9 +1122,10 @@ func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) { } // scanobject scans the object starting at b, adding pointers to gcw. -// b must point to the beginning of a heap object; scanobject consults -// the GC bitmap for the pointer mask and the spans for the size of the -// object. +// b must point to the beginning of a heap object or an oblet. +// scanobject consults the GC bitmap for the pointer mask and the +// spans for the size of the object. +// //go:nowritebarrier func scanobject(b uintptr, gcw *gcWork) { // Note that arena_used may change concurrently during @@ -1130,9 +1140,11 @@ func scanobject(b uintptr, gcw *gcWork) { arena_start := mheap_.arena_start arena_used := mheap_.arena_used - // Find bits of the beginning of the object. - // b must point to the beginning of a heap object, so - // we can get its bits and span directly. + // Find the bits for b and the size of the object at b. + // + // b is either the beginning of an object, in which case this + // is the size of the object to scan, or it points to an + // oblet, in which case we compute the size to scan below. hbits := heapBitsForAddr(b) s := spanOfUnchecked(b) n := s.elemsize @@ -1140,6 +1152,42 @@ func scanobject(b uintptr, gcw *gcWork) { throw("scanobject n == 0") } + if n > maxObletBytes { + // Large object. Break into oblets for better + // parallelism and lower latency. + if b == s.base() { + // It's possible this is a noscan object (not + // from greyobject, but from other code + // paths), in which case we must *not* enqueue + // oblets since their bitmaps will be + // uninitialized. + if !hbits.hasPointers(n) { + // Bypass the whole scan. + gcw.bytesMarked += uint64(n) + return + } + + // Enqueue the other oblets to scan later. + // Some oblets may be in b's scalar tail, but + // these will be marked as "no more pointers", + // so we'll drop out immediately when we go to + // scan those. + for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes { + if !gcw.putFast(oblet) { + gcw.put(oblet) + } + } + } + + // Compute the size of the oblet. Since this object + // must be a large object, s.base() is the beginning + // of the object. + n = s.base() + s.elemsize - b + if n > maxObletBytes { + n = maxObletBytes + } + } + var i uintptr for i = 0; i < n; i += sys.PtrSize { // Find bits for this word. diff --git a/src/runtime/mgcwork.go b/src/runtime/mgcwork.go index d04840b686..0c1c482827 100644 --- a/src/runtime/mgcwork.go +++ b/src/runtime/mgcwork.go @@ -94,7 +94,7 @@ func (w *gcWork) init() { } // put enqueues a pointer for the garbage collector to trace. -// obj must point to the beginning of a heap object. +// obj must point to the beginning of a heap object or an oblet. //go:nowritebarrier func (w *gcWork) put(obj uintptr) { wbuf := w.wbuf1.ptr()