Brad Fitzpatrick [Sun, 1 May 2016 02:11:42 +0000 (21:11 -0500)]
net/http: add Transport.IdleConnTimeout
Don't keep idle HTTP client connections open forever. Add a new knob,
Transport.IdleConnTimeout, and make the default be 90 seconds. I
figure 90 seconds is more than a minute, and less than infinite, and I
figure enough code has things waking up once a minute polling APIs.
This also removes the Transport's idleCount field which was unused and
redundant with the size of the idleLRU map (which was actually used).
The B/op number is effectively meaningless. There
is a surprisingly large one-time cost that gets
divided by the number of iterations that your
machine can get through in a second.
This CL discards the first run, which helps.
It is not a panacea. Running with -benchtime=10s
will allow the sync.Pool to be emptied,
which brings the problem back.
However, since there are more iterations to divide
the cost through, it’s not quite as bad,
and running with a high benchtime is rare.
This CL changes the meaning of the B/op number,
which is unfortunate, since it won’t have the
same order of magnitude as previous Go versions.
But it wasn’t really comparable before anyway,
since it didn’t have any reliable meaning at all.
Austin Clements [Sun, 1 May 2016 01:47:30 +0000 (21:47 -0400)]
runtime: update some comments
This updates some comments that became out of date when we moved the
mark bit out of the heap bitmap and started using the high bit for the
first word as a scan/dead bit.
Change-Id: I4a572d16db6114cadff006825466c1f18359f2db
Reviewed-on: https://go-review.googlesource.com/22662 Reviewed-by: Rick Hudson <rlh@golang.org>
MIPS N64 ABI passes arguments in registers R4-R11, return value in R2.
R16-R23, R28, R30 and F24-F31 are callee-save. gcc PIC code expects
to be called with indirect call through R25.
Change-Id: I24f582b4b58e1891ba9fd606509990f95cca8051
Reviewed-on: https://go-review.googlesource.com/19805 Reviewed-by: Minux Ma <minux@golang.org>
cmd/compile: Improve readability of HTML produced by GOSSAFUNC
Factor out the Aux/AuxInt handling in (*Value).LongString() and
use it in (*Value).LongHTML() as well.
This especially improves readability of auxFloat32, auxFloat64,
and auxSymValAndOff values which would otherwise be printed as
opaque integers.
This change also makes LongString() slightly less verbose by
eliding offsets that are zero (as is very often the case).
Additionally, ensure the HTML is interpreted as UTF-8 so that
non-ASCII characters (especially the "middle dots" in some symbols)
show up correctly.
cmd/internal/obj/mips et al.: introduce SB register on mips64x
SB register (R28) is introduced for access external addresses with shorter
instruction sequences. It is loaded at entry points. External data within
2G of SB can be accessed this way.
cmd/internal/obj: relocaltion R_ADDRMIPS is split into two relocations
R_ADDRMIPS and R_ADDRMIPSU, handling the low 16 bits and the "upper" 16
bits of external addresses, respectively, since the instructios may not
be adjacent. It might be better if relocation Variant could be used.
cmd/link/internal/mips64: support new relocations.
cmd/compile/internal/mips64: reserve SB register.
runtime: initialize SB register at entry points.
Change-Id: I5f34868f88c5a9698c042a8a1f12f76806c187b9
Reviewed-on: https://go-review.googlesource.com/19802 Reviewed-by: Minux Ma <minux@golang.org>
Kevin Burke [Sat, 23 Apr 2016 18:00:05 +0000 (11:00 -0700)]
database/sql: clone data for named []byte types
Previously named byte types like json.RawMessage could get dirty
database memory from a call to Scan. These types would activate a
code path that didn't clone the byte data coming from the database
before assigning it. Another thread could then overwrite the byte
array in src, which has unexpected consequences.
Originally reported by Jason Moiron; the patch and test are his
suggestions. Fixes #13905.
With the switch to separate mark bitmaps, the scan/dead bit for the
first word of each object is now unused. Reclaim this bit and use it
as a scan/dead bit, just like words three and on. The second word is
still used for checkmark.
This dramatically simplifies heapBitsSetTypeNoScan and hasPointers,
since they no longer need different cases for 1, 2, and 3+ word
objects. They can instead just manipulate the heap bitmap for the
first word and be done with it.
In order to enable this, we change heapBitsSetType and runGCProg to
always set the scan/dead bit to scan for the first word on every code
path. Since these functions only apply to types that have pointers,
there's no need to do this conditionally: it's *always* necessary to
set the scan bit in the first word.
We also change every place that scans an object and checks if there
are more pointers. Rather than only checking morePointers if the word
is >= 2, we now check morePointers if word != 1 (since that's the
checkmark word).
Looking forward, we should probably reclaim the checkmark bit, too,
but that's going to be quite a bit more work.
Tested by setting doubleCheck in heapBitsSetType and running all.bash
on both linux/amd64 and linux/386, and by running GOGC=10 all.bash.
This particularly improves the FmtFprintf* go1 benchmarks, since they
do a large amount of noscan allocation.
runtime: avoid conditional execution in morePointers and isPointer
heapBits.bits is carefully written to produce good machine code. Use
it in heapBits.morePointers and heapBits.isPointer to get good machine
code there, too.
Change-Id: Ia75945ef563956613bf88bbe57800a96455c265d
Reviewed-on: https://go-review.googlesource.com/22661 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Matthew Dempsky [Fri, 29 Apr 2016 22:56:57 +0000 (15:56 -0700)]
root: remove dev.garbage file
Change-Id: I99b2ca52824341d986090f5c78ab4f396594bcdf
Reviewed-on: https://go-review.googlesource.com/22660 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Ian Lance Taylor [Wed, 27 Apr 2016 21:18:29 +0000 (14:18 -0700)]
cmd/cgo, runtime, runtime/cgo: use cgo context function
Add support for the context function set by runtime.SetCgoTraceback.
The context function was added in CL 17761, without support.
This CL is the support.
This CL has not been tested for real C code, as a working context
function for C code requires unwind support that does not seem to exist.
I wanted to get the CL out before the freeze.
I apologize for the length of this CL. It's mostly plumbing, but
unfortunately the plumbing is processor-specific.
Change-Id: I8ce11a0de9b3dafcc29efd2649d776e93bff0e90
Reviewed-on: https://go-review.googlesource.com/22508 Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Michael Munday [Mon, 18 Apr 2016 01:26:23 +0000 (21:26 -0400)]
crypto/cipher, crypto/aes: add s390x implementation of AES-CTR
This commit adds the new 'ctrAble' interface to the crypto/cipher
package. The role of ctrAble is the same as gcmAble but for CTR
instead of GCM. It allows block ciphers to provide optimized CTR
implementations.
The primary benefit of adding CTR support to the s390x AES
implementation is that it allows us to encrypt the counter values
in bulk, giving the cipher message instruction a larger chunk of
data to work on per invocation.
The xorBytes assembly is necessary because xorBytes becomes a
bottleneck when CTR is done in this way. Hopefully it will be
possible to remove this once s390x has migrated to the ssa
backend.
name old speed new speed delta
AESCTR1K 160MB/s ± 6% 867MB/s ± 0% +442.42% (p=0.000 n=9+10)
Change-Id: I1ae16b0ce0e2641d2bdc7d7eabc94dd35f6e9318
Reviewed-on: https://go-review.googlesource.com/22195 Reviewed-by: Adam Langley <agl@golang.org>
Michael Munday [Tue, 26 Apr 2016 01:46:02 +0000 (21:46 -0400)]
crypto/cipher, crypto/aes: add s390x implementation of AES-CBC
This commit adds the cbcEncAble and cbcDecAble interfaces that
can be implemented by block ciphers that support an optimized
implementation of CBC. This is similar to what is done for GCM
with the gcmAble interface.
The cbcEncAble, cbcDecAble and gcmAble interfaces all now have
tests to ensure they are detected correctly in the cipher
package.
name old speed new speed delta
AESCBCEncrypt1K 152MB/s ± 1% 1362MB/s ± 0% +795.59% (p=0.000 n=10+9)
AESCBCDecrypt1K 143MB/s ± 1% 1362MB/s ± 0% +853.00% (p=0.000 n=10+9)
Change-Id: I715f686ab3686b189a3dac02f86001178fa60580
Reviewed-on: https://go-review.googlesource.com/22523
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Adam Langley <agl@golang.org>
Rick Hudson [Fri, 29 Apr 2016 17:49:18 +0000 (13:49 -0400)]
Merge remote-tracking branch 'origin/dev.garbage'
This commit moves the GC from free list allocation to
bit mark allocation. Instead of using the bitmaps
generated during the mark phases to generate free
list and then using the free lists for allocation we
allocate directly from the bitmaps.
The change in the garbage benchmark
name old time/op new time/op delta
XBenchGarbage-12 2.22ms ± 1% 2.13ms ± 1% -3.90% (p=0.000 n=18+18)
Rick Hudson [Fri, 29 Apr 2016 16:09:36 +0000 (12:09 -0400)]
[dev.garbage] runtime: simplify nextFreeFast so it is inlined
nextFreeFast is currently not inlined by the compiler due
to its size and complexity. This CL simplifies
nextFreeFast by letting the slow path handle (nextFree)
handle a corner cases.
sweep used to skip mcental.freeSpan (and its locking) if it didn't
find any new free objects. We lost that optimization when the
freed-object counting changed in dad83f7 to count total free objects
instead of newly freed objects.
The previous commit brings back counting of newly freed objects, so we
can easily revive this optimization by checking that count (like we
used to) instead of the total free objects count.
Commit 8dda1c4 changed the meaning of "nfree" in sweep from the number
of newly freed objects to the total number of free objects in the
span, but didn't update where sweep added nfree to c.local_nsmallfree.
Hence, we're over-accounting the number of frees. This is causing
TestArrayHash to fail with "too many allocs NNN - hash not balanced".
Fix this by computing the number of newly freed objects and adding
that to c.local_nsmallfree, so it behaves like it used to. Computing
this requires a small tweak to mallocgc: apparently we've never set
s.allocCount when allocating a large object; fix this by setting it to
1 so sweep doesn't get confused.
We broke tracing of freed objects in GODEBUG=allocfreetrace=1 mode
when we removed the sweep over the mark bitmap. Fix it by
re-introducing the sweep over the bitmap specifically if we're in
allocfreetrace mode. This doesn't have to be even remotely efficient,
since the overhead of allocfreetrace is huge anyway, so we can keep
the code for this down to just a few lines.
Currently we always zero objects when we allocate them. We used to
have an optimization that would not zero objects that had not been
allocated since the whole span was last zeroed (either by getting it
from the system or by getting it from the heap, which does a bulk
zero), but this depended on the sweeper clobbering the first two words
of each object. Hence, we lost this optimization when the bitmap
sweeper went away.
Re-introduce this optimization using a different mechanism. Each span
already keeps a flag indicating that it just came from the OS or was
just bulk zeroed by the mheap. We can simply use this flag to know
when we don't need to zero an object. This is slightly less efficient
than the old optimization: if a span gets allocated and partially
used, then GC happens and the span gets returned to the mcentral, then
the span gets re-acquired, the old optimization knew that it only had
to re-zero the objects that had been reclaimed, whereas this
optimization will re-zero everything. However, in this case, you're
already paying for the garbage collection, and you've only wasted one
zeroing of the span, so in practice there seems to be little
difference. (If we did want to revive the full optimization, each span
could keep track of a frontier beyond which all free slots are zeroed.
I prototyped this and it didn't obvious do any better than the much
simpler approach in this commit.)
This significantly improves BinaryTree17, which is allocation-heavy
(and runs first, so most pages are already zeroed), and slightly
improves everything else.
name old time/op new time/op delta
XBenchGarbage-12 2.15ms ± 1% 2.14ms ± 1% -0.80% (p=0.000 n=17+17)
Nigel Tao [Fri, 29 Apr 2016 07:17:44 +0000 (17:17 +1000)]
compress/flate: use a constant hash table size for Best Speed.
This makes compress/flate's version of Snappy diverge from the upstream
golang/snappy version, but the latter has a goal of matching C++ snappy
output byte-for-byte. Both C++ and the asm version of golang/snappy can
use a smaller N for the O(N) zero-initialization of the hash table when
the input is small, even if the pure Go golang/snappy algorithm cannot:
"var table [tableSize]uint16" zeroes all tableSize elements.
For this package, we don't have the match-C++-snappy goal, so we can use
a different (constant) hash table size.
This is a small win, in terms of throughput and output size, but it also
enables us to re-use the (constant size) hash table between
encodeBestSpeed calls, avoiding the cost of zero-initializing the hash
table altogether. This will be implemented in follow-up commits.
Dave Cheney [Fri, 29 Apr 2016 04:17:04 +0000 (14:17 +1000)]
cmd/compile/internal/gc: bv.go cleanup
Drive by gardening of bv.go.
- Unexport the Bvec type, it is not used outside internal/gc.
(machine translated with gofmt -r)
- Removed unused constants and functions.
(driven by cmd/unused)
Nigel Tao [Tue, 26 Apr 2016 09:16:30 +0000 (19:16 +1000)]
compress/flate: replace "Best Speed" with specialized version
This encoding algorithm, which prioritizes speed over output size, is
based on Snappy's LZ77-style encoder: github.com/golang/snappy
This commit keeps the diff between this package's encodeBestSpeed
function and and Snappy's encodeBlock function as small as possible (see
the diff below). Follow-up commits will improve this package's
performance and output size.
Output size is larger. In the table below, the first column is the input
size, the second column is the output size prior to this commit, the
third column is the output size after this commit.
The diff between github.com/golang/snappy's encodeBlock function and
this commit's encodeBestSpeed function:
1c1,7
< func encodeBlock(dst, src []byte) (d int) {
---
> func encodeBestSpeed(dst []token, src []byte) []token {
> // This check isn't in the Snappy implementation, but there, the caller
> // instead of the callee handles this case.
> if len(src) < minNonLiteralBlockSize {
> return emitLiteral(dst, src)
> }
>
4c10
< // and len(src) <= maxBlockSize and maxBlockSize == 65536.
---
> // and len(src) <= maxStoreBlockSize and maxStoreBlockSize == 65535.
65c71
< if load32(src, s) == load32(src, candidate) {
---
> if s-candidate < maxOffset && load32(src, s) == load32(src, candidate) {
73c79
< d += emitLiteral(dst[d:], src[nextEmit:s])
---
> dst = emitLiteral(dst, src[nextEmit:s])
90c96
< // This is an inlined version of:
---
> // This is an inlined version of Snappy's:
93c99,103
< for i := candidate + 4; s < len(src) && src[i] == src[s]; i, s = i+1, s+1 {
---
> s1 := base + maxMatchLength
> if s1 > len(src) {
> s1 = len(src)
> }
> for i := candidate + 4; s < s1 && src[i] == src[s]; i, s = i+1, s+1 {
96c106,107
< d += emitCopy(dst[d:], base-candidate, s-base)
---
> // matchToken is flate's equivalent of Snappy's emitCopy.
> dst = append(dst, matchToken(uint32(s-base-3), uint32(base-candidate-minOffsetSize))) 114c125
< if uint32(x>>8) != load32(src, candidate) {
---
> if s-candidate >= maxOffset || uint32(x>>8) != load32(src, candidate) { 124c135
< d += emitLiteral(dst[d:], src[nextEmit:])
---
> dst = emitLiteral(dst, src[nextEmit:]) 126c137
< return d
---
> return dst
This change is based on https://go-review.googlesource.com/#/c/21021/ by
Klaus Post, but it is a separate changelist as cl/21021 seems to have
stalled in code review, and the Go 1.7 feature freeze approaches.
Golang-dev discussion:
https://groups.google.com/d/topic/golang-dev/XYgHX9p8IOk/discussion and
of course cl/21021.
[dev.garbage] runtime: use s.base() everywhere it makes sense
Currently we have lots of (s.start << _PageShift) and variants. We now
have an s.base() function that returns this. It's faster and more
readable, so use it.
Alex Brainman [Thu, 28 Apr 2016 05:19:11 +0000 (15:19 +1000)]
debug/pe: .bss section must contain only zeros
.bss section has no data stored in PE file. But when .bss section data
is used by the linker it is assumed that its every byte is set to zero.
(*Section).Data returns garbage at this moment. Change (*Section).Data
so it returns slice filled with 0s.
Updates #15345
Change-Id: I1fa5138244a9447e1d59dec24178b1dd0fd4c5d7
Reviewed-on: https://go-review.googlesource.com/22544 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Alex Brainman <alex.brainman@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Rick Hudson [Thu, 31 Mar 2016 14:45:36 +0000 (10:45 -0400)]
[dev.garbage] runtime: use sys.Ctz64 intrinsic
Our compilers now provides instrinsics including
sys.Ctz64 that support CTZ (count trailing zero)
instructions. This CL replaces the Go versions
of CTZ with the compiler intrinsic.
Count trailing zeros CTZ finds the least
significant 1 in a word and returns the number
of less significant 0s in the word.
Allocation uses the bitmap created by the garbage
collector to locate an unmarked object. The logic
takes a word of the bitmap, complements, and then
caches it. It then uses CTZ to locate an available
unmarked object. It then shifts marked bits out of
the bitmap word preparing it for the next search.
Once all the unmarked objects are used in the
cached work the bitmap gets another word and
repeats the process.
Rick Hudson [Mon, 14 Mar 2016 16:17:48 +0000 (12:17 -0400)]
[dev.garbage] runtime: restructure alloc and mark bits
Two changes are included here that are dependent on the other.
The first is that allocBits and gcamrkBits are changed to
a *uint8 which points to the first byte of that span's
mark and alloc bits. Several places were altered to
perform pointer arithmetic to locate the byte corresponding
to an object in the span. The actual bit corresponding
to an object is indexed in the byte by using the lower three
bits of the objects index.
The second change avoids the redundant calculation of an
object's index. The index is returned from heapBitsForObject
and then used by the functions indexing allocBits
and gcmarkBits.
Finally we no longer allocate the gc bits in the span
structures. Instead we use an arena based allocation scheme
that allows for a more compact bit map as well as recycling
and bulk clearing of the mark bits.
Michael Hudson-Doyle [Thu, 28 Apr 2016 22:37:37 +0000 (10:37 +1200)]
cmd/link: fix -no-pie / -race check
golang.org/cl/22453 was supposed to pass -no-pie to the linker when linking a
race-enabled binary if the host toolchain supports it. But I bungled the
supported check as I forgot to pass -c to the host compiler so it tried to
compile a 0 byte .c file into an executable, which will never work. Fix it to
pass -c as it should have all along.
Change-Id: I4801345c7a29cb18d5f22cec5337ce535f92135d
Reviewed-on: https://go-review.googlesource.com/22587
Run-TryBot: Michael Hudson-Doyle <michael.hudson@canonical.com>
TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Minux Ma <minux@golang.org>
The _SigUnblock flag was appended to SIGSYS slot of runtime signal table
for Linux in https://go-review.googlesource.com/22202, but there is
still no concrete opinion on whether SIGSYS must be an unblocked signal
for runtime.
This change removes _SigUnblock flag from SIGSYS on Linux for
consistency in runtime signal handling and adds a reference to #15204 to
runtime signal table for FreeBSD.
Updates #15204.
Change-Id: I42992b1d852c2ab5dd37d6dbb481dba46929f665
Reviewed-on: https://go-review.googlesource.com/22537 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Matthew Dempsky [Thu, 28 Apr 2016 20:44:15 +0000 (13:44 -0700)]
net: remove unneeded tags from dnsRR structs
DNS packing and unpacking uses hand-coded struct walking functions
rather than reflection, so these tags are unneeded and just contribute
to their runtime reflect metadata size.
Matthew Dempsky [Thu, 28 Apr 2016 20:26:14 +0000 (13:26 -0700)]
net: remove internal support for obsolete DNS record types
There are no real world use cases for HINFO, MINFO, MB, MG, or MR
records, and package net's exposed APIs don't provide any way to
access them even if there were. If a use ever does show up, we can
revive them. In the mean time, this is just effectively-dead code that
sticks around because of rr_mk.
Robert Griesemer [Thu, 28 Apr 2016 19:43:12 +0000 (12:43 -0700)]
cmd/compile: use delta encoding for filenames in export data position info
This reduces the export data size significantly (15%-25%) for some packages,
especially where the paths are very long or if there are many files involved.
Slight (2%) reduction on average, with virtually no increases in export data
size.
Selected export data sizes for packages with |delta %| > 3%:
Matthew Dempsky [Thu, 28 Apr 2016 18:31:59 +0000 (11:31 -0700)]
net: ensure dnsConfig search list is rooted
Avoids some extra work and string concatenation at query time.
benchmark old allocs new allocs delta
BenchmarkGoLookupIP-32 154 150 -2.60%
BenchmarkGoLookupIPNoSuchHost-32 446 442 -0.90%
BenchmarkGoLookupIPWithBrokenNameServer-32 564 568 +0.71%
benchmark old bytes new bytes delta
BenchmarkGoLookupIP-32 10824 10704 -1.11%
BenchmarkGoLookupIPNoSuchHost-32 43140 42992 -0.34%
BenchmarkGoLookupIPWithBrokenNameServer-32 46616 46680 +0.14%
BenchmarkGoLookupIPWithBrokenNameServer's regression appears to be
because it's actually only performing 1 LookupIP call, so the extra
work done parsing the DNS config file doesn't amortize as well as for
BenchmarkGoLookupIP or BenchmarkGoLOokupIPNoSuchHost, which perform
2000+ LookupIP calls per run.
Adam Langley [Tue, 26 Apr 2016 17:45:35 +0000 (10:45 -0700)]
crypto/tls: allow renegotiation to be handled by a client.
This change adds Config.Renegotiation which controls whether a TLS
client will accept renegotiation requests from a server. This is used,
for example, by some web servers that wish to “add” a client certificate
to an HTTPS connection.
This is disabled by default because it significantly complicates the
state machine.
Originally, handshakeMutex was taken before locking either Conn.in or
Conn.out. However, if renegotiation is permitted then a handshake may
be triggered during a Read() call. If Conn.in were unlocked before
taking handshakeMutex then a concurrent Read() call could see an
intermediate state and trigger an error. Thus handshakeMutex is now
locked after Conn.in and the handshake functions assume that Conn.in is
locked for the duration of the handshake.
Additionally, handshakeMutex used to protect Conn.out also. With the
possibility of renegotiation that's no longer viable and so
writeRecordLocked has been split off.
Robert Griesemer [Thu, 28 Apr 2016 05:04:49 +0000 (22:04 -0700)]
cmd/compile: have all or no parameter named in exported signatures
Binary export format only.
Make sure we don't accidentally export an unnamed parameter
in signatures which expect all named parameters; otherwise
we crash during import. Appears to happen for _ (blank)
parameter names, as observed in method signatures such as
the one at: x/tools/godoc/analysis/analysis.go:76.
Fixes #15470.
TBR=mdempsky
Change-Id: I1b1184bf08c4c09d8a46946539c4b8c341acdb84
Reviewed-on: https://go-review.googlesource.com/22543 Reviewed-by: Robert Griesemer <gri@golang.org>
Robert Griesemer [Thu, 28 Apr 2016 00:30:20 +0000 (17:30 -0700)]
cmd/compile: use correct (field/method) node for position info
Position info for fields and methods was based on the wrong node
in the new export format, leading to position info for empty
file names and 0 line numbers. Use correct node now.
Due to compact delta encoding, there is no difference in export
format size. In fact, because encoding of "no line changed" is
uncommon and thus a bit more expensive, in many cases the data
is now slightly shorter.
Stats for export data size (pachage, before, after, delta%):
net: clarify DialContext's use of its provided context
Fixes #15325
Change-Id: I60137ecf27e236e97734b1730ce29ab23e9fe07f
Reviewed-on: https://go-review.googlesource.com/22509 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Change-Id: I5408ec22932a05e85c210c0faa434bd19dce5650
Reviewed-on: https://go-review.googlesource.com/22532 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Michael Hudson-Doyle [Mon, 28 Mar 2016 09:27:36 +0000 (22:27 +1300)]
cmd/compile: de-dup the gclocals symbols in compiler too
These symbols are de-duplicated in the linker but the compiler generates quite
many duplicates too: 2425 of 13769 total symbols for runtime.a for example.
De-duplicating them in the compiler saves the linker a bit of work.
Fixes #14983
Change-Id: I5f18e5f9743563c795aad8f0a22d17a7ed147711
Reviewed-on: https://go-review.googlesource.com/22293 Reviewed-by: David Crawshaw <crawshaw@golang.org>
Rick Hudson [Mon, 14 Mar 2016 16:17:48 +0000 (12:17 -0400)]
[dev.garbage] runtime: add gc work buffer tryGet and put fast paths
The complexity of the GC work buffers put and tryGet
prevented them from being inlined. This CL simplifies
the fast path thus enabling inlining. If the fast
path does not succeed the previous put and tryGet
functions are called.
Rick Hudson [Mon, 14 Mar 2016 16:02:02 +0000 (12:02 -0400)]
[dev.garbage] runtime: cleanup and optimize span.base()
Prior to this CL the base of a span was calculated in various
places using shifts or calls to base(). This CL now
always calls base() which has been optimized to calculate the
base of the span when the span is initialized and store that
value in the span structure.
Rick Hudson [Wed, 2 Mar 2016 17:15:02 +0000 (12:15 -0500)]
[dev.garbage] runtime: remove heapBitsSweepSpan
Prior to this CL the sweep phase was responsible for locating
all objects that were about to be freed and calling a function
to process the object. This was done by the function
heapBitsSweepSpan. Part of processing included calls to
tracefree and msanfree as well as counting how many objects
were freed.
The calls to tracefree and msanfree have been moved into the
gcmalloc routine and called when the object is about to be
reallocated. The counting of free objects has been optimized
using an array based popcnt algorithm and if all the objects
in a span are free then span is freed.
Similarly the code to locate the next free object has been
optimized to use an array based ctz (count trailing zero).
Various hot paths in the allocation logic have been optimized.
At this point the garbage benchmark is within 3% of the 1.6
release.
Rick Hudson [Wed, 24 Feb 2016 19:36:30 +0000 (14:36 -0500)]
[dev.garbage] runtime: add bit and cache ctz64 (count trailing zero)
Add to each span a 64 bit cache (allocCache) of the allocBits
at freeindex. allocCache is shifted such that the lowest bit
corresponds to the bit freeindex. allocBits uses a 0 to
indicate an object is free, on the other hand allocCache
uses a 1 to indicate an object is free. This facilitates
ctz64 (count trailing zero) which counts the number of 0s
trailing the least significant 1. This is also the index of
the least significant 1.
Each span maintains a freeindex indicating the boundary
between allocated objects and unallocated objects. allocCache
is shifted as freeindex is incremented such that the low bit
in allocCache corresponds to the bit a freeindex in the
allocBits array.
Currently ctz64 is written in Go using a for loop so it is
not very efficient. Use of the hardware instruction will
follow. With this in mind comparisons of the garbage
benchmark are as follows.
Rick Hudson [Wed, 17 Feb 2016 16:27:52 +0000 (11:27 -0500)]
[dev.garbage] runtime: logic that uses count trailing zero (ctz)
Most (all?) processors that Go supports supply a hardware
instruction that takes a byte and returns the number
of zeros trailing the first 1 encountered, or 8
if no ones are found. This is the index within the
byte of the first 1 encountered. CTZ should improve the
performance of the nextFreeIndex function.
Since nextFreeIndex wants the next unmarked (0) bit
a bit-wise complement is needed before calling ctz.
Furthermore unmarked bits associated with previously
allocated objects need to be ignored. Instead of writing
a 1 as we allocate the code masks all bits less than the
freeindex after loading the byte.
While this CL does not actual execute a CTZ instruction
it supplies a ctz function with the appropiate signature
along with the logic to execute it.
Rick Hudson [Tue, 16 Feb 2016 22:16:43 +0000 (17:16 -0500)]
[dev.garbage] runtime: replace ref with allocCount
This is a renaming of the field ref to the
more appropriate allocCount. The field
holds the number of objects in the span
that are currently allocated. Some throws
strings were adjusted to more accurately
convey the meaning of allocCount.
Rick Hudson [Thu, 11 Feb 2016 18:57:58 +0000 (13:57 -0500)]
[dev.garbage] runtime: allocate directly from GC mark bits
Instead of building a freelist from the mark bits generated
by the GC this CL allocates directly from the mark bits.
The approach moves the mark bits from the pointer/no pointer
heap structures into their own per span data structures. The
mark/allocation vectors consist of a single mark bit per
object. Two vectors are maintained, one for allocation and
one for the GC's mark phase. During the GC cycle's sweep
phase the interpretation of the vectors is swapped. The
mark vector becomes the allocation vector and the old
allocation vector is cleared and becomes the mark vector that
the next GC cycle will use.
Marked entries in the allocation vector indicate that the
object is not free. Each allocation vector maintains a boundary
between areas of the span already allocated from and areas
not yet allocated from. As objects are allocated this boundary
is moved until it reaches the end of the span. At this point
further allocations will be done from another span.
Since we no longer sweep a span inspecting each freed object
the responsibility for maintaining pointer/scalar bits in
the heapBitMap containing is now the responsibility of the
the routines doing the actual allocation.
This CL is functionally complete and ready for performance
tuning.
The gcmarkBits is a bit vector used by the GC to mark
reachable objects. Once a GC cycle is complete the gcmarkBits
swap places with the allocBits. allocBits is then used directly
by malloc to locate free objects, thus avoiding the
construction of a linked free list. This CL introduces a set
of helper functions for manipulating gcmarkBits and allocBits
that will be used by later CLs to realize the actual
algorithm. Minimal attempts have been made to optimize these
helper routines.
Rick Hudson [Mon, 8 Feb 2016 14:53:14 +0000 (09:53 -0500)]
[dev.garbage] runtime: add stackfreelist
The freelist for normal objects and the freelist
for stacks share the same mspan field for holding
the list head but are operated on by different code
sequences. This overloading complicates the use of bit
vectors for allocation of normal objects. This change
refactors the use of the stackfreelist out from the
use of freelist.
Rick Hudson [Thu, 4 Feb 2016 16:41:48 +0000 (11:41 -0500)]
[dev.garbage] runtime: bitmap allocation data structs
The bitmap allocation data structure prototypes. Before
this is released these underlying data structures need
to be more performant but the signatures of helper
functions utilizing these structures will remain stable.
Austin Clements [Fri, 18 Mar 2016 15:27:59 +0000 (11:27 -0400)]
runtime: don't rescan globals
Currently the runtime rescans globals during mark 2 and mark
termination. This costs as much as 500µs/MB in STW time, which is
enough to surpass the 10ms STW limit with only 20MB of globals.
It's also basically unnecessary. The compiler already generates write
barriers for global -> heap pointer updates and the regular write
barrier doesn't check whether the slot is a global or in the heap.
Some less common write barriers do cause problems.
heapBitsBulkBarrier, which is used by typedmemmove and related
functions, currently depends on having access to the pointer bitmap
and as a result ignores writes to globals. Likewise, the
reflect-related write barriers reflect_typedmemmovepartial and
callwritebarrier ignore non-heap destinations; though it appears they
can never be called with global pointers anyway.
This commit makes heapBitsBulkBarrier issue write barriers for writes
to global pointers using the data and BSS pointer bitmaps, removes the
inheap checks from the reflection write barriers, and eliminates the
rescans during mark 2 and mark termination. It also adds a test that
writes to globals have write barriers.
Programs with large data+BSS segments (with pointers) aren't common,
but for programs that do have large data+BSS segments, this
significantly reduces pause time:
name \ 95%ile-time/markTerm old new delta
LargeBSS/bss:1GB/gomaxprocs:4 148200µs ± 6% 302µs ±52% -99.80% (p=0.008 n=5+5)
David Crawshaw [Wed, 27 Apr 2016 17:10:49 +0000 (13:10 -0400)]
reflect: fix strings of SliceOf-created types
The new type was inheriting the tflagExtraStar from its prototype.
Fixes #15467
Change-Id: Ic22c2a55cee7580cb59228d52b97e1c0a1e60220
Reviewed-on: https://go-review.googlesource.com/22501 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>