Dave Cheney [Sun, 2 Feb 2014 05:05:07 +0000 (16:05 +1100)]
time: use an alternative method of yielding during Overflow timer test
Fixes #6874.
Use runtime.GC() as a stronger version of runtime.Gosched() which tends to bias the running goroutine in an otherwise idle system. This appears to reduce the worst case number of spins from 600 down to 30 on my 2 core system under high load.
Ian Lance Taylor [Thu, 30 Jan 2014 17:25:47 +0000 (09:25 -0800)]
cmd/ld: fix bug with "runtime/cgo" in external link mode
In external link mode the linker explicitly adds the string
constant "runtime/cgo". It adds the string constant using the
same symbol name as the compiler, but a different format. The
compiler assumes that the string data immediately follows the
string header, but the linker puts the two in different
sections. The result is bad string data when the compiler
sees "runtime/cgo" used as a string constant.
The compiler assumption is in datastring in [568]g/gobj.c.
The linker layout is in addstrdata in ld/data.c. The compiler
assumption is valid for string literals. The linker is not
creating a string literal, so its assumption is also valid.
There are a few ways to avoid this problem. This patch fixes
it by only doing the fake import of runtime/cgo if necessary,
and by only creating the string symbol if necessary.
Dmitriy Vyukov [Thu, 30 Jan 2014 09:28:19 +0000 (13:28 +0400)]
runtime: increase page size to 8K
Tcmalloc uses 8K, 32K and 64K pages, and in custom setups 256K pages.
Only Chromium uses 4K pages today (in "slow but small" configuration).
The general tendency is to increase page size, because it reduces
metadata size and DTLB pressure.
This change reduces GC pause by ~10% and slightly improves other metrics.
This is the exact reincarnation of already LGTMed:
https://golang.org/cl/45770044
which must not break darwin/freebsd after:
https://golang.org/cl/56630043
TBR=iant
Brad Fitzpatrick [Thu, 30 Jan 2014 08:57:04 +0000 (09:57 +0100)]
net/http: use a struct instead of a string in transport conn cache key
The Transport's idle connection cache is keyed by a string,
for pre-Go 1.0 reasons. Ever since Go has been able to use
structs as map keys, there's been a TODO in the code to use
structs instead of allocating strings. This change does that.
Saves 3 allocations and ~100 bytes of garbage per client
request. But because string hashing is so fast these days
(thanks, Keith), the performance is a wash: what we gain
on GC and not allocating, we lose in slower hashing. (hashing
structs of strings is slower than 1 string)
This seems a bit faster usually, but I've also seen it be a
bit slower. But at least it's how I've wanted it now, and
the allocation improvements are consistent.
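The difference between the two kinds of key can be sketched as follows. This is an illustration only: the type connKey, its fields, and the helper stringKey are hypothetical names, not the identifiers used in net/http.

```go
package main

import "fmt"

// connKey is a hypothetical stand-in for a connection cache key.
// A struct whose fields are all comparable can be used as a map key.
type connKey struct {
	proxy, scheme, addr string
}

// stringKey builds an old-style key. The concatenation allocates a
// new string every time it is called.
func stringKey(proxy, scheme, addr string) string {
	return proxy + "|" + scheme + "|" + addr
}

func main() {
	byString := map[string]int{}
	byStruct := map[connKey]int{}

	byString[stringKey("", "https", "example.com:443")]++

	// Building the struct key only copies three string headers.
	byStruct[connKey{"", "https", "example.com:443"}]++
	byStruct[connKey{scheme: "https", addr: "example.com:443"}]++

	fmt.Println(byString[stringKey("", "https", "example.com:443")])
	fmt.Println(byStruct[connKey{"", "https", "example.com:443"}])
}
```

Two struct values with equal fields are the same map key, so the second increment above lands on the same entry as the first.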
Rob Pike [Thu, 30 Jan 2014 00:14:45 +0000 (16:14 -0800)]
cmd/8g: don't crash if Prog->u.branch is nil
The code is copied from cmd/6g.
Empirically, all branch targets are nil in this code so
something is still wrong, but at least this stops 8g -S
from crashing.
Update #7178
LGTM=dave, iant
R=iant, dave
CC=golang-codereviews
https://golang.org/cl/58400043
Brad Fitzpatrick [Wed, 29 Jan 2014 12:44:21 +0000 (13:44 +0100)]
net/http: read as much as possible (including EOF) during chunked reads
This is the chunked half of https://golang.org/cl/49570044 .
We want full reads to return EOF as early as possible, when we
know we're at the end, so http.Transport client connections are eagerly
re-used in the common case, even if no Read or Close follows.
To do this, make the chunkedReader.Read fill up its argument p []byte
buffer as much as possible, as long as that doesn't involve doing
any more blocking reads to read chunk headers. That means if we
have a chunk EOF ("0\r\n") sitting in the incoming bufio.Reader,
we see it and set EOF on our final Read.
Brad Fitzpatrick [Wed, 29 Jan 2014 10:23:45 +0000 (11:23 +0100)]
net/http: reuse client connections earlier when Content-Length is set
Set EOF on the final Read of a body with a Content-Length, which
will cause clients to recycle their connection immediately upon
the final Read, rather than waiting for another Read or Close
(neither of which might come). This happens often when client
code is simply something like:
Then there's usually no subsequent Read. Even if the client
calls Close (which they should): in Go 1.1, the body was
slurped to EOF, but in Go 1.2, that was then treated as a
Close-before-EOF and the underlying connection was closed.
But that's assuming the user even calls Close. Many don't.
Reading to EOF also causes a connection to be reused. Now the EOF
arrives earlier.
This CL only addresses the Content-Length case. A future CL
will address the chunked case.
Dmitriy Vyukov [Tue, 28 Jan 2014 18:34:32 +0000 (22:34 +0400)]
runtime: adjust malloc race instrumentation for tiny allocs
A tiny alloc memory block is shared by different goroutines running on the same thread.
We call racemalloc after enabling preemption in mallocgc;
as a result, another goroutine can act on a not-yet-race-cleared tiny block.
Call racemalloc before enabling preemption.
Fixes #7224.
LGTM=dave
R=golang-codereviews, dave
CC=golang-codereviews
https://golang.org/cl/57730043
Michael Hudson-Doyle [Tue, 28 Jan 2014 05:47:09 +0000 (16:47 +1100)]
cmd/go: When linking with gccgo pass .a files in the order they are discovered
Under some circumstances linking a test binary with gccgo can fail, because
the installed version of the library ends up before the version built for the
test on the linker command line.
This admittedly slightly hackish fix fixes this by putting the library archives
on the linker command line in the order that a pre-order depth first traversal
of the dependencies gives them, which has the side effect of always putting the
version of the library built for the test first.
Fixes #6768
LGTM=rsc
R=golang-codereviews, minux.ma, gobot, rsc, dave
CC=golang-codereviews
https://golang.org/cl/28050043
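The ordering described above can be sketched as a pre-order depth first walk that records each package the first time it is reached. The pkg type and preorder function are hypothetical and much simpler than what cmd/go uses.

```go
package main

import "fmt"

// pkg is a hypothetical, simplified package with its direct imports.
type pkg struct {
	name string
	deps []*pkg
}

// preorder returns package names in pre-order: a package is listed
// before its dependencies, and only on its first visit.
func preorder(root *pkg) []string {
	var order []string
	seen := map[*pkg]bool{}
	var walk func(p *pkg)
	walk = func(p *pkg) {
		if seen[p] {
			return
		}
		seen[p] = true
		order = append(order, p.name)
		for _, d := range p.deps {
			walk(d)
		}
	}
	walk(root)
	return order
}

func main() {
	c := &pkg{name: "c"}
	b := &pkg{name: "b", deps: []*pkg{c}}
	a := &pkg{name: "a", deps: []*pkg{b, c}}
	fmt.Println(preorder(a)) // [a b c]
}
```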
Vincent Vanackere [Mon, 27 Jan 2014 22:00:00 +0000 (14:00 -0800)]
runtime/debug: fix incorrect Stack output if package path contains a dot
Although debug.Stack is deprecated, it should still return the correct result.
Output before this CL (using a trivial library in $GOPATH/test.com/a):
/home/vince/src/test.com/a/lib.go:9 (0x42311e)
com/a.ShowStack: os.Stdout.Write(debug.Stack())
Output with this CL applied:
/home/vince/src/test.com/a/lib.go:9 (0x42311e)
ShowStack: os.Stdout.Write(debug.Stack())
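One way to get the second output is to drop the import path before looking for the dot that separates package and function. The helper names below are hypothetical and this is a sketch of the idea, not the code in runtime/debug.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveFuncName cuts at the first dot, which breaks when the import
// path itself contains a dot (as in test.com/a).
func naiveFuncName(name string) string {
	if i := strings.Index(name, "."); i >= 0 {
		name = name[i+1:]
	}
	return name
}

// shortFuncName removes everything up to the last slash first, then
// cuts at the first remaining dot.
func shortFuncName(name string) string {
	if i := strings.LastIndex(name, "/"); i >= 0 {
		name = name[i+1:]
	}
	if i := strings.Index(name, "."); i >= 0 {
		name = name[i+1:]
	}
	return name
}

func main() {
	fmt.Println(naiveFuncName("test.com/a.ShowStack")) // com/a.ShowStack
	fmt.Println(shortFuncName("test.com/a.ShowStack")) // ShowStack
}
```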
Dmitriy Vyukov [Mon, 27 Jan 2014 20:26:56 +0000 (00:26 +0400)]
runtime: fix windows build
Currently windows crashes because early allocs in schedinit
try to allocate tiny memory blocks, but m->p is not yet set up.
I've considered calling procresize(1) earlier in schedinit,
but this refactoring is better and should fix the issue as well.
Fixes #7218.
R=golang-codereviews, r
CC=golang-codereviews
https://golang.org/cl/54570045
Dmitriy Vyukov [Mon, 27 Jan 2014 19:17:46 +0000 (23:17 +0400)]
runtime: tune P retake logic
When GOMAXPROCS>1 the last P in syscall is never retaken
(because there are already idle P's -- npidle>0).
This prevents sysmon thread from sleeping.
On a darwin machine the program from issue 6673 constantly
consumes ~0.2% CPU. With this change it stably consumes 0.0% CPU.
Fixes #6673.
R=golang-codereviews, r
CC=bradfitz, golang-codereviews, iant, khr
https://golang.org/cl/56990045
Ian Lance Taylor [Mon, 27 Jan 2014 18:18:22 +0000 (10:18 -0800)]
debug/dwarf, debug/elf: add support for reading DWARF 4 type info
In DWARF 4 the debug info for large types is put into
.debug_type sections, so that the linker can discard duplicate
info. This change adds support for reading type units.
Another small change included here is that DWARF 3 supports
storing the byte offset of a struct field as a formData rather
than a formDwarfBlock.
R=golang-codereviews, r
CC=golang-codereviews
https://golang.org/cl/56300043
Dmitriy Vyukov [Mon, 27 Jan 2014 16:29:21 +0000 (20:29 +0400)]
runtime: fix buffer overflow in stringtoslicerune
On 32-bit systems n*sizeof(r[0]) can overflow.
Or it can become 1<<32-eps, and mallocgc will "successfully"
allocate 0 pages for it. There are no checks downstream,
and MHeap_Grow just does:
npage = (npage+15)&~15;
ask = npage<<PageShift;
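The multiplication overflow can be reproduced with explicit 32-bit arithmetic. The function names are hypothetical and the check shown is one common way to guard such a multiplication, not necessarily the one used in the runtime.

```go
package main

import "fmt"

const runeSize = 4 // size of one rune in bytes

// allocSize32 mimics the unchecked 32-bit computation n*sizeof(r[0]).
func allocSize32(n uint32) uint32 {
	return n * runeSize // wraps around silently on overflow
}

// checkedAllocSize32 reports false when n*runeSize does not fit in
// 32 bits, so the caller can refuse the allocation.
func checkedAllocSize32(n uint32) (uint32, bool) {
	if n > ^uint32(0)/runeSize {
		return 0, false
	}
	return n * runeSize, true
}

func main() {
	n := uint32(1<<30 + 1)
	fmt.Println(allocSize32(n)) // 4: the request wrapped around
	_, ok := checkedAllocSize32(n)
	fmt.Println(ok) // false
}
```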
Dmitriy Vyukov [Mon, 27 Jan 2014 11:11:12 +0000 (15:11 +0400)]
runtime: smarter slice grow
When growing slice take into account size of the allocated memory block.
Also apply the same optimization to string->[]byte conversion.
Fixes #6307.
benchmark                    old ns/op    new ns/op    delta
BenchmarkAppendGrowByte        4541036      4434108   -2.35%
BenchmarkAppendGrowString     59885673     44813604  -25.17%
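A rough sketch of the idea: if the allocator hands out blocks in fixed size classes, the new capacity can be rounded up to fill the whole block. The size class table and both function names are hypothetical and far smaller than the real allocator's.

```go
package main

import "fmt"

// sizeClasses is a hypothetical, shortened table of block sizes.
var sizeClasses = []int{8, 16, 32, 48, 64, 80, 96, 112, 128}

// roundUpSize returns the smallest block size that holds n bytes.
// Requests larger than the table are returned unchanged.
func roundUpSize(n int) int {
	for _, c := range sizeClasses {
		if n <= c {
			return c
		}
	}
	return n
}

// growCap returns the capacity, in elements, obtained by filling the
// block that would be allocated for want elements of elemSize bytes.
func growCap(want, elemSize int) int {
	return roundUpSize(want*elemSize) / elemSize
}

func main() {
	// 5 elements of 8 bytes need 40 bytes, which lands in a 48-byte
	// block, so the slice can have capacity 6 at no extra cost.
	fmt.Println(growCap(5, 8)) // 6
}
```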
Dmitriy Vyukov [Fri, 24 Jan 2014 18:35:11 +0000 (22:35 +0400)]
runtime: combine small NoScan allocations
Combine NoScan allocations < 16 bytes into a single memory block.
Reduces number of allocations on json/garbage benchmarks by 10+%.
Dmitriy Vyukov [Fri, 24 Jan 2014 18:29:53 +0000 (22:29 +0400)]
sync: scalable Pool
Introduce fixed-size P-local caches.
When local caches overflow/underflow a batch of items
is transferred to/from global mutex-protected cache.
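The overflow and underflow handling can be sketched like this. All names and sizes are hypothetical, items are plain ints, and the local cache is assumed to be used by a single worker at a time, which is the role P plays in the runtime.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	localSize = 4             // capacity of a local cache (hypothetical)
	batch     = localSize / 2 // items moved per transfer
)

// global is the shared cache, protected by a mutex.
type global struct {
	mu    sync.Mutex
	items []int
}

// local is a fixed-size cache owned by one worker, so it needs no lock.
type local struct {
	buf [localSize]int
	n   int
	g   *global
}

func (l *local) put(x int) {
	if l.n == localSize { // overflow: move a batch to the global cache
		l.g.mu.Lock()
		l.g.items = append(l.g.items, l.buf[localSize-batch:]...)
		l.g.mu.Unlock()
		l.n -= batch
	}
	l.buf[l.n] = x
	l.n++
}

func (l *local) get() (int, bool) {
	if l.n == 0 { // underflow: take up to a batch from the global cache
		l.g.mu.Lock()
		k := batch
		if k > len(l.g.items) {
			k = len(l.g.items)
		}
		l.n = copy(l.buf[:], l.g.items[len(l.g.items)-k:])
		l.g.items = l.g.items[:len(l.g.items)-k]
		l.g.mu.Unlock()
		if l.n == 0 {
			return 0, false
		}
	}
	l.n--
	return l.buf[l.n], true
}

func main() {
	g := &global{}
	l := &local{g: g}
	for i := 1; i <= 5; i++ {
		l.put(i)
	}
	fmt.Println(l.n, len(g.items)) // 3 2
}
```

The mutex is taken once per batch rather than once per item, which is where the scalability comes from in this sketch.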
Dmitriy Vyukov [Fri, 24 Jan 2014 18:29:01 +0000 (22:29 +0400)]
runtime: do not zero terminate strings
On top of "tiny allocator" (cl/38750047), reduces number of allocs by 1% on json.
No code must rely on zero termination. So this will also make debugging simpler,
by uncovering issues earlier.
json-1
allocated     7949686      7915766   -0.43%
allocs          93778        92790   -1.05%
time        100957795     97250949   -3.67%
rest of the metrics are too noisy.
Russ Cox [Fri, 24 Jan 2014 04:11:04 +0000 (23:11 -0500)]
cmd/gc: add zeroing to enable precise stack accounting
There is more zeroing than I would like right now -
temporaries used for the new map and channel runtime
calls need to be eliminated - but it will do for now.
This CL only has an effect if you are building with
GOEXPERIMENT=precisestack ./all.bash
(or make.bash). It costs about 5% in the overall time
spent in all.bash. That number will come down before
we make it on by default, but this should be enough for
Keith to try using the precise maps for copying stacks.
amd64 only (and it's not really great generated code).
Russ Cox [Fri, 24 Jan 2014 03:51:39 +0000 (22:51 -0500)]
liblink, runtime: fix cgo on arm
The addition of TLS to ARM rewrote the MRC instruction
differently depending on whether we were using internal
or external linking mode. That's clearly not okay, since we
don't know that during compilation, which is when we now
generate the code. Also, because the change did not introduce
a real MRC instruction but instead just macro-expanded it
in the assembler, liblink is rewriting a WORD instruction that
may actually be looking for that specific constant, which would
lead to very unexpected results. It was also using one value
that happened to be 8 where a different value that also
happened to be 8 belonged. So the code was correct for those
values but not correct in general, and very confusing.
Throw it all away.
Replace with the following. There is a linker-provided symbol
runtime.tlsgm with a value (address) set to the offset from the
hardware-provided TLS base register to the g and m storage.
Any reference to that name emits an appropriate TLS relocation
to be resolved by either the internal linker or the external linker,
depending on the link mode. The relocation has exactly the
semantics of the R_ARM_TLS_LE32 relocation, which is what
the external linker provides.
This symbol is only used in two routines, runtime.load_gm and
runtime.save_gm. In both cases it is now used like this:
MRC 15, 0, R0, C13, C0, 3 // fetch TLS base pointer
MOVW $runtime·tlsgm(SB), R2
ADD R2, R0 // now R0 points at thread-local g+m storage
It is likely that this change breaks the generation of shared libraries
on ARM, because the MOVW needs to be rewritten to use the global
offset table and a different relocation type. But let's get the supported
functionality working again before we worry about unsupported
functionality.
LGTM=dave, iant
R=iant, dave
CC=golang-codereviews
https://golang.org/cl/56120043
Dmitriy Vyukov [Thu, 23 Jan 2014 20:13:21 +0000 (15:13 -0500)]
bufio: fix benchmarks behavior
Currently the benchmarks lie to testing package by doing O(N)
work under StopTimer. And that hidden O(N) actually constitutes
the bulk of benchmark work (e.g. includes GC per iteration).
This behavior accounts for windows-amd64-race builder hangs.
««« original CL description
runtime: increase page size to 8K
Tcmalloc uses 8K, 32K and 64K pages, and in custom setups 256K pages.
Only Chromium uses 4K pages today (in "slow but small" configuration).
The general tendency is to increase page size, because it reduces
metadata size and DTLB pressure.
This change reduces GC pause by ~10% and slightly improves other metrics.
Dmitriy Vyukov [Thu, 23 Jan 2014 14:59:43 +0000 (18:59 +0400)]
runtime: increase page size to 8K
Tcmalloc uses 8K, 32K and 64K pages, and in custom setups 256K pages.
Only Chromium uses 4K pages today (in "slow but small" configuration).
The general tendency is to increase page size, because it reduces
metadata size and DTLB pressure.
This change reduces GC pause by ~10% and slightly improves other metrics.
Gautham Thambidorai [Wed, 22 Jan 2014 23:24:03 +0000 (18:24 -0500)]
crypto/tls: Client side support for TLS session resumption.
Adam (agl@) had already done an initial review of this CL in a branch.
Added ClientSessionState to Config which now allows clients to keep state
required to resume a TLS session with a server. A client handshake will try
and use the SessionTicket/MasterSecret in this cached state if the server
acknowledged resumption.
We also added support for caching ClientSessionState objects in Config; they are
looked up by server remote address during the handshake.
Russ Cox [Wed, 22 Jan 2014 21:39:39 +0000 (16:39 -0500)]
runtime: fix typo in ARM code
The typo was introduced by one of Dmitriy's CLs this morning.
The fix makes the ARM build compile again; it still won't pass
its tests, but one thing at a time.
Jeff Sickel [Wed, 22 Jan 2014 21:21:53 +0000 (22:21 +0100)]
net: plan9 changes for default net directory
This change includes updates to the probeIPv4Stack
and probeIPv6Stack to ensure that one or both
protocols are supported by ip(3).
The addition of fdMutex to netFD fixes the
TestTCPConcurrentAccept failures.
Additional changes add support for keepalive.
Brad Fitzpatrick [Wed, 22 Jan 2014 18:35:41 +0000 (10:35 -0800)]
syscall: use unsafe.Pointer in BSD kevent
Doesn't really matter for the most part, since the runtime-integrated
network poller uses its own kevent implementation, but for people using
the syscall directly, we should use an unsafe.Pointer for the precise GC
to retain the pointer arguments.
Also push down unsafe.Pointer a bit further in exec_linux.go, not
that there are any GC preemption points in the middle and sys
is still live anyway.
Dmitriy Vyukov [Wed, 22 Jan 2014 07:27:16 +0000 (11:27 +0400)]
runtime: remove locks from netpoll hotpaths
Introduces two-phase goroutine parking mechanism -- prepare to park, commit park.
This mechanism does not require backing mutex to protect wait predicate.
Use it in netpoll. See comment in netpoll.goc for details.
This slightly reduces contention between reader, writer and read/write io notifications;
and just eliminates a bunch of mutex operations from hotpaths, thus making them faster.
Dmitriy Vyukov [Wed, 22 Jan 2014 06:30:10 +0000 (10:30 +0400)]
runtime: fix and improve CPU profiling
- do not lose profiling signals when we have no mcache (possible for syscalls/cgo)
- do not lose any profiling signals on windows
- fix profiling of cgo programs on windows (they had no m->thread setup)
- properly setup tls in cgo programs on windows
- check _beginthread return value
Russ Cox [Wed, 22 Jan 2014 00:46:34 +0000 (19:46 -0500)]
liblink: remove use of linkmode on ARM
Now that liblink is compiled into the compilers and assemblers,
it must not refer to the "linkmode", since that is not known until
link time. This CL makes the ARM support no longer use linkmode,
which fixes a bug with cgo binaries that contain their own TLS
variables.
The x86 code must also remove linkmode; that is issue 7164.
Russ Cox [Tue, 21 Jan 2014 18:31:34 +0000 (13:31 -0500)]
cmd/gc: do not follow uintptr passed as function argument
The escape analysis works by tracing assignment paths from
variables that start with pointer type, or addresses of variables
(addresses are always pointers). It does allow non-pointers
in the path, so that in this code it sees x's value escape into y:
var x *[10]int
y := (*int)(unsafe.Pointer(uintptr(unsafe.Pointer(x))+32))
It must allow uintptr in order to see through this kind of
"pointer arithmetic".
It also traces such values if they end up as uintptrs passed to
functions. This used to be important because packages like
encoding/gob passed around uintptrs holding real pointers.
The introduction of precise collection of stacks has forced
code to be more honest about which declared stack variables
hold pointers and which do not. In particular, the garbage
collector no longer sees pointers stored in uintptr variables.
Because of this, packages like encoding/gob have been fixed.
There is not much point in the escape analysis accepting
uintptrs as holding pointers at call boundaries if the garbage
collector does not.
Excluding uintptr-valued arguments brings the escape
analysis in line with the garbage collector and has the
useful side effect of making arguments to syscall.Syscall
not appear to escape.
That is, this CL should yield the same benefits as
CL 45930043 (rolled back in CL 53870043), but it does
so by making uintptrs less special, not more.
Dmitriy Vyukov [Tue, 21 Jan 2014 09:06:57 +0000 (13:06 +0400)]
runtime: do not collect GC roots explicitly
Currently we collect (add) all roots into a global array in a single-threaded GC phase.
This hinders parallelism.
With this change we just kick off a parallel for with number_of_goroutines+5 iterations.
The parallel for callback then decides whether it needs to scan the stack of a goroutine,
scan the data segment, scan finalizers, etc. This eliminates the single-threaded phase entirely.
This requires storing all goroutines in an array instead of a linked list
(to allow direct indexing).
This CL also removes DebugScan functionality. It is broken because it uses
unbounded stack, so it cannot run on g0. Even when it was working, I found
it unhelpful for debugging issues because the two algorithms are too different now.
This change would require updating the DebugScan, so it's simpler to just delete it.
With 8 threads this change reduces GC pause by ~6%, while keeping cputime roughly the same.
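The index-based dispatch can be sketched with an ordinary parallel for. Everything here is hypothetical: the function names, the labels for the five fixed roots, and the use of goroutines in place of GC worker threads.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const numFixed = 5 // non-goroutine roots (data segment, finalizers, ...)

// parallelFor runs body(i) for every i in [0, n). Workers claim
// indices from a shared atomic counter, so no list is built up front.
func parallelFor(n, workers int, body func(i int)) {
	var next int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				i := int(atomic.AddInt64(&next, 1)) - 1
				if i >= n {
					return
				}
				body(i)
			}
		}()
	}
	wg.Wait()
}

// scanRoots decides from the index alone what each iteration scans.
// Indexing goroutines directly is why they must live in an array.
func scanRoots(numG, workers int) []string {
	out := make([]string, numG+numFixed)
	parallelFor(len(out), workers, func(i int) {
		if i < numFixed {
			out[i] = fmt.Sprintf("fixed root %d", i)
		} else {
			out[i] = fmt.Sprintf("stack of goroutine %d", i-numFixed)
		}
	})
	return out
}

func main() {
	for _, s := range scanRoots(3, 4) {
		fmt.Println(s)
	}
}
```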