Brad Fitzpatrick [Wed, 29 Jan 2014 12:44:21 +0000 (13:44 +0100)]
net/http: read as much as possible (including EOF) during chunked reads
This is the chunked half of https://golang.org/cl/49570044 .
We want full reads to return EOF as early as possible, when we
know we're at the end, so http.Transport client connections are eagerly
re-used in the common case, even if no Read or Close follows.
To do this, make the chunkedReader.Read fill up its argument p []byte
buffer as much as possible, as long as that doesn't involve doing
any more blocking reads to read chunk headers. That means if we
have a chunk EOF ("0\r\n") sitting in the incoming bufio.Reader,
we see it and set EOF on our final Read.
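A minimal sketch of that check, assuming a hypothetical helper (this is
illustrative Go, not the actual chunkedReader code):

    package chunked // illustrative

    import (
        "bufio"
        "bytes"
    )

    // eofChunkBuffered reports whether the final chunk header ("0\r\n") is
    // already sitting in the bufio.Reader's buffer, so a Read can report EOF
    // without doing another blocking read.
    func eofChunkBuffered(br *bufio.Reader) bool {
        n := br.Buffered() // bytes available without blocking
        if n < 3 {
            return false
        }
        buffered, err := br.Peek(n) // does not block: the data is already buffered
        if err != nil {
            return false
        }
        return bytes.HasPrefix(buffered, []byte("0\r\n"))
    }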
Brad Fitzpatrick [Wed, 29 Jan 2014 10:23:45 +0000 (11:23 +0100)]
net/http: reuse client connections earlier when Content-Length is set
Set EOF on the final Read of a body with a Content-Length, which
will cause clients to recycle their connection immediately upon
the final Read, rather than waiting for another Read or Close
(neither of which might come). This happens often when client
code is simply something like:
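(An illustrative reconstruction of the pattern, not the original snippet;
fetchBody is a made-up name.)

    package example

    import (
        "io/ioutil"
        "net/http"
    )

    // fetchBody reads the whole response body and then does nothing else with
    // it: no further Read, and often no Close either.
    func fetchBody(url string) ([]byte, error) {
        res, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        return ioutil.ReadAll(res.Body) // final Read now returns the data plus EOF
    }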
Then there's usually no subsequent Read. Even if the client
calls Close (which they should): in Go 1.1, the body was
slurped to EOF, but in Go 1.2, that was then treated as a
Close-before-EOF and the underlying connection was closed.
But that's assuming the user even calls Close. Many don't.
Reading to EOF also causes a connection to be reused. Now the EOF
arrives earlier.
This CL only addresses the Content-Length case. A future CL
will address the chunked case.
Dmitriy Vyukov [Tue, 28 Jan 2014 18:34:32 +0000 (22:34 +0400)]
runtime: adjust malloc race instrumentation for tiny allocs
A tiny-alloc memory block is shared by different goroutines running on the same thread.
We call racemalloc after enabling preemption in mallocgc;
as a result, another goroutine can act on a tiny block that has not yet been race-cleared.
Call racemalloc before enabling preemption.
Fixes #7224.
LGTM=dave
R=golang-codereviews, dave
CC=golang-codereviews
https://golang.org/cl/57730043
Michael Hudson-Doyle [Tue, 28 Jan 2014 05:47:09 +0000 (16:47 +1100)]
cmd/go: When linking with gccgo pass .a files in the order they are discovered
Under some circumstances linking a test binary with gccgo can fail, because
the installed version of the library ends up before the version built for the
test on the linker command line.
This admittedly slightly hackish change fixes the problem by putting the library archives
on the linker command line in the order that a pre-order depth first traversal
of the dependencies gives them, which has the side effect of always putting the
version of the library built for the test first.
Fixes #6768
LGTM=rsc
R=golang-codereviews, minux.ma, gobot, rsc, dave
CC=golang-codereviews
https://golang.org/cl/28050043
Vincent Vanackere [Mon, 27 Jan 2014 22:00:00 +0000 (14:00 -0800)]
runtime/debug: fix incorrect Stack output if package path contains a dot
Although debug.Stack is deprecated, it should still return the correct result.
Output before this CL (using a trivial library in $GOPATH/test.com/a):
    /home/vince/src/test.com/a/lib.go:9 (0x42311e)
        com/a.ShowStack: os.Stdout.Write(debug.Stack())
Output with this CL applied:
    /home/vince/src/test.com/a/lib.go:9 (0x42311e)
        ShowStack: os.Stdout.Write(debug.Stack())
Dmitriy Vyukov [Mon, 27 Jan 2014 20:26:56 +0000 (00:26 +0400)]
runtime: fix windows build
Currently Windows crashes because early allocations in schedinit
try to allocate tiny memory blocks, but m->p is not yet set up.
I've considered calling procresize(1) earlier in schedinit,
but this refactoring is better and should fix the issue as well.
Fixes #7218.
R=golang-codereviews, r
CC=golang-codereviews
https://golang.org/cl/54570045
Dmitriy Vyukov [Mon, 27 Jan 2014 19:17:46 +0000 (23:17 +0400)]
runtime: tune P retake logic
When GOMAXPROCS>1 the last P in syscall is never retaken
(because there are already idle P's -- npidle>0).
This prevents the sysmon thread from sleeping.
On a darwin machine the program from issue 6673 constantly
consumes ~0.2% CPU; with this change it consistently consumes 0.0% CPU.
Fixes #6673.
R=golang-codereviews, r
CC=bradfitz, golang-codereviews, iant, khr
https://golang.org/cl/56990045
Ian Lance Taylor [Mon, 27 Jan 2014 18:18:22 +0000 (10:18 -0800)]
debug/dwarf, debug/elf: add support for reading DWARF 4 type info
In DWARF 4 the debug info for large types is put into
.debug_types sections, so that the linker can discard duplicate
info. This change adds support for reading type units.
Another small change included here is that DWARF 3 supports
storing the byte offset of a struct field as a formData rather
than a formDwarfBlock.
R=golang-codereviews, r
CC=golang-codereviews
https://golang.org/cl/56300043
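A minimal sketch (not part of this CL) of reading DWARF info from an ELF
binary with debug/elf and debug/dwarf; the path "a.out" is illustrative:

    package main

    import (
        "debug/dwarf"
        "debug/elf"
        "fmt"
        "log"
    )

    func main() {
        f, err := elf.Open("a.out")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        // With this change, binaries whose large types live in DWARF 4
        // .debug_types type units can be read as well.
        d, err := f.DWARF()
        if err != nil {
            log.Fatal(err)
        }
        r := d.Reader()
        for {
            e, err := r.Next()
            if err != nil {
                log.Fatal(err)
            }
            if e == nil {
                break // end of the DWARF entries
            }
            if name, ok := e.Val(dwarf.AttrName).(string); ok {
                fmt.Println(e.Tag, name)
            }
        }
    }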
Dmitriy Vyukov [Mon, 27 Jan 2014 16:29:21 +0000 (20:29 +0400)]
runtime: fix buffer overflow in stringtoslicerune
On 32-bit systems n*sizeof(r[0]) can overflow,
or it can become 1<<32-eps, and mallocgc will "successfully"
allocate 0 pages for it; there are no checks downstream,
and MHeap_Grow just does:

    npage = (npage+15)&~15;
    ask = npage<<PageShift;
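A hedged sketch of the kind of overflow check this calls for, written in Go
for illustration (the actual fix is in the C runtime; mulNoOverflow is made up):

    package overflow

    // mulNoOverflow computes n*elemSize, reporting failure when the product
    // would wrap around and masquerade as a tiny allocation.
    func mulNoOverflow(n, elemSize uintptr) (uintptr, bool) {
        if elemSize != 0 && n > ^uintptr(0)/elemSize {
            return 0, false // n*elemSize would overflow uintptr
        }
        return n * elemSize, true
    }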
Dmitriy Vyukov [Mon, 27 Jan 2014 11:11:12 +0000 (15:11 +0400)]
runtime: smarter slice grow
When growing a slice, take into account the size of the allocated memory block.
Also apply the same optimization to string->[]byte conversion.
Fixes #6307.
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppendGrowByte         4541036      4434108    -2.35%
BenchmarkAppendGrowString      59885673     44813604   -25.17%
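For illustration only (not part of the CL), the user-visible behavior being
tuned is the capacity growth chosen by the runtime during append:

    package main

    import "fmt"

    func main() {
        var s []byte
        last := cap(s)
        for i := 0; i < 1<<20; i++ {
            s = append(s, byte(i))
            if c := cap(s); c != last {
                fmt.Println("len", len(s), "grew cap to", c) // capacity jumps
                last = c
            }
        }
    }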
Dmitriy Vyukov [Fri, 24 Jan 2014 18:35:11 +0000 (22:35 +0400)]
runtime: combine small NoScan allocations
Combine NoScan allocations < 16 bytes into a single memory block.
Reduces number of allocations on json/garbage benchmarks by 10+%.
Dmitriy Vyukov [Fri, 24 Jan 2014 18:29:53 +0000 (22:29 +0400)]
sync: scalable Pool
Introduce fixed-size P-local caches.
When a local cache overflows/underflows, a batch of items
is transferred to/from a global mutex-protected cache.
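A typical usage sketch (the buffer pool itself is illustrative, not part of
this CL); Get/Put now hit the fixed-size per-P cache in the common case:

    package pool

    import (
        "bytes"
        "sync"
    )

    // bufPool hands out reusable buffers; only overflow/underflow of the
    // per-P cache falls back to the global mutex-protected cache.
    var bufPool = sync.Pool{
        New: func() interface{} { return new(bytes.Buffer) },
    }

    func withBuffer(f func(*bytes.Buffer)) {
        b := bufPool.Get().(*bytes.Buffer)
        b.Reset()
        f(b)
        bufPool.Put(b)
    }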
Dmitriy Vyukov [Fri, 24 Jan 2014 18:29:01 +0000 (22:29 +0400)]
runtime: do not zero terminate strings
On top of "tiny allocator" (cl/38750047), reduces number of allocs by 1% on json.
No code must rely on zero termination. So will also make debugging simpler,
by uncovering issues earlier.
json-1
allocated      7949686      7915766    -0.43%
allocs           93778        92790    -1.05%
time         100957795     97250949    -3.67%
rest of the metrics are too noisy.
Russ Cox [Fri, 24 Jan 2014 04:11:04 +0000 (23:11 -0500)]
cmd/gc: add zeroing to enable precise stack accounting
There is more zeroing than I would like right now -
temporaries used for the new map and channel runtime
calls need to be eliminated - but it will do for now.
This CL only has an effect if you are building with
GOEXPERIMENT=precisestack ./all.bash
(or make.bash). It costs about 5% in the overall time
spent in all.bash. That number will come down before
we make it on by default, but this should be enough for
Keith to try using the precise maps for copying stacks.
amd64 only (and it's not really great generated code).
Russ Cox [Fri, 24 Jan 2014 03:51:39 +0000 (22:51 -0500)]
liblink, runtime: fix cgo on arm
The addition of TLS to ARM rewrote the MRC instruction
differently depending on whether we were using internal
or external linking mode. That's clearly not okay, since we
don't know that during compilation, which is when we now
generate the code. Also, because the change did not introduce
a real MRC instruction but instead just macro-expanded it
in the assembler, liblink is rewriting a WORD instruction that
may actually be looking for that specific constant, which would
lead to very unexpected results. It was also using one value
that happened to be 8 where a different value that also
happened to be 8 belonged. So the code was correct for those
values but not correct in general, and very confusing.
Throw it all away.
Replace with the following. There is a linker-provided symbol
runtime.tlsgm with a value (address) set to the offset from the
hardware-provided TLS base register to the g and m storage.
Any reference to that name emits an appropriate TLS relocation
to be resolved by either the internal linker or the external linker,
depending on the link mode. The relocation has exactly the
semantics of the R_ARM_TLS_LE32 relocation, which is what
the external linker provides.
This symbol is only used in two routines, runtime.load_gm and
runtime.save_gm. In both cases it is now used like this:
    MRC 15, 0, R0, C13, C0, 3 // fetch TLS base pointer
    MOVW $runtime·tlsgm(SB), R2
    ADD R2, R0 // now R0 points at thread-local g+m storage
It is likely that this change breaks the generation of shared libraries
on ARM, because the MOVW needs to be rewritten to use the global
offset table and a different relocation type. But let's get the supported
functionality working again before we worry about unsupported
functionality.
LGTM=dave, iant
R=iant, dave
CC=golang-codereviews
https://golang.org/cl/56120043
Dmitriy Vyukov [Thu, 23 Jan 2014 20:13:21 +0000 (15:13 -0500)]
bufio: fix benchmarks behavior
Currently the benchmarks lie to the testing package by doing O(N)
work under StopTimer, and that hidden O(N) work actually constitutes
the bulk of the benchmark (e.g. it includes GC per iteration).
This behavior accounts for the windows-amd64-race builder hangs.
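An illustrative sketch (not one of the bufio benchmarks) of the pattern and
its fix:

    package bench

    import (
        "bytes"
        "testing"
    )

    // BenchmarkHiddenWork shows the problem: O(N) setup hidden under
    // StopTimer is still real work, including the GC it triggers, on
    // every iteration.
    func BenchmarkHiddenWork(b *testing.B) {
        for i := 0; i < b.N; i++ {
            b.StopTimer()
            data := bytes.Repeat([]byte("x"), 1<<20) // untimed but expensive
            b.StartTimer()
            _ = bytes.Count(data, []byte("x")) // the work actually measured
        }
    }

    // BenchmarkSetupOnce does the setup once, outside the timed loop.
    func BenchmarkSetupOnce(b *testing.B) {
        data := bytes.Repeat([]byte("x"), 1<<20)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            _ = bytes.Count(data, []byte("x"))
        }
    }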
Dmitriy Vyukov [Thu, 23 Jan 2014 14:59:43 +0000 (18:59 +0400)]
runtime: increase page size to 8K
Tcmalloc uses 8K, 32K and 64K pages, and in custom setups 256K pages.
Only Chromium uses 4K pages today (in "slow but small" configuration).
The general tendency is to increase page size, because it reduces
metadata size and DTLB pressure.
This change reduces GC pause by ~10% and slightly improves other metrics.
Gautham Thambidorai [Wed, 22 Jan 2014 23:24:03 +0000 (18:24 -0500)]
crypto/tls: Client side support for TLS session resumption.
Adam (agl@) had already done an initial review of this CL in a branch.
Added ClientSessionState to Config, which now allows clients to keep the state
required to resume a TLS session with a server. A client handshake will try
to use the SessionTicket/MasterSecret in this cached state if the server
acknowledges resumption.
We also added support for caching the ClientSessionState object in the Config,
where it is looked up by the server's remote address during the handshake.
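A minimal client-side sketch, assuming the Config.ClientSessionCache hook and
the LRU session-cache helper that crypto/tls ended up with; the address is
illustrative:

    package main

    import (
        "crypto/tls"
        "log"
    )

    func main() {
        conf := &tls.Config{
            // Cache ClientSessionState objects so later handshakes to the
            // same server can attempt resumption.
            ClientSessionCache: tls.NewLRUClientSessionCache(64),
        }
        conn, err := tls.Dial("tcp", "example.com:443", conf)
        if err != nil {
            log.Fatal(err)
        }
        conn.Close()
    }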
Russ Cox [Wed, 22 Jan 2014 21:39:39 +0000 (16:39 -0500)]
runtime: fix typo in ARM code
The typo was introduced by one of Dmitriy's CLs this morning.
The fix makes the ARM build compile again; it still won't pass
its tests, but one thing at a time.
Jeff Sickel [Wed, 22 Jan 2014 21:21:53 +0000 (22:21 +0100)]
net: plan9 changes for default net directory
This change includes updates to the probeIPv4Stack
and probeIPv6Stack to ensure that one or both
protocols are supported by ip(3).
The addition of fdMutex to netFD fixes the
TestTCPConcurrentAccept failures.
Additional changes add support for keepalive.
Brad Fitzpatrick [Wed, 22 Jan 2014 18:35:41 +0000 (10:35 -0800)]
syscall: use unsafe.Pointer in BSD kevent
Doesn't really matter for the most part, since the runtime-integrated
network poller uses its own kevent implementation, but for people using
the syscall directly, we should use an unsafe.Pointer for the precise GC
to retain the pointer arguments.
Also push down unsafe.Pointer a bit further in exec_linux.go (not
that there are any GC preemption points in the middle, and sys is
still live anyway).
Dmitriy Vyukov [Wed, 22 Jan 2014 07:27:16 +0000 (11:27 +0400)]
runtime: remove locks from netpoll hotpaths
Introduces two-phase goroutine parking mechanism -- prepare to park, commit park.
This mechanism does not require backing mutex to protect wait predicate.
Use it in netpoll. See comment in netpoll.goc for details.
This slightly reduces contention between readers, writers and read/write IO notifications,
and eliminates a bunch of mutex operations from the hot paths, thus making them faster.
Dmitriy Vyukov [Wed, 22 Jan 2014 06:30:10 +0000 (10:30 +0400)]
runtime: fix and improve CPU profiling
- do not lose profiling signals when we have no mcache (possible for syscalls/cgo)
- do not lose any profiling signals on windows
- fix profiling of cgo programs on windows (they had no m->thread setup)
- properly setup tls in cgo programs on windows
- check _beginthread return value
Russ Cox [Wed, 22 Jan 2014 00:46:34 +0000 (19:46 -0500)]
liblink: remove use of linkmode on ARM
Now that liblink is compiled into the compilers and assemblers,
it must not refer to the "linkmode", since that is not known until
link time. This CL makes the ARM support no longer use linkmode,
which fixes a bug with cgo binaries that contain their own TLS
variables.
The x86 code must also remove linkmode; that is issue 7164.
Russ Cox [Tue, 21 Jan 2014 18:31:34 +0000 (13:31 -0500)]
cmd/gc: do not follow uintptr passed as function argument
The escape analysis works by tracing assignment paths from
variables that start with pointer type, or addresses of variables
(addresses are always pointers). It does allow non-pointers
in the path, so that in this code it sees x's value escape into y:
    var x *[10]int
    y := (*int)(unsafe.Pointer(uintptr(unsafe.Pointer(x))+32))
It must allow uintptr in order to see through this kind of
"pointer arithmetic".
It also traces such values if they end up as uintptrs passed to
functions. This used to be important because packages like
encoding/gob passed around uintptrs holding real pointers.
The introduction of precise collection of stacks has forced
code to be more honest about which declared stack variables
hold pointers and which do not. In particular, the garbage
collector no longer sees pointers stored in uintptr variables.
Because of this, packages like encoding/gob have been fixed.
There is not much point in the escape analysis accepting
uintptrs as holding pointers at call boundaries if the garbage
collector does not.
Excluding uintptr-valued arguments brings the escape
analysis in line with the garbage collector and has the
useful side effect of making arguments to syscall.Syscall
not appear to escape.
That is, this CL should yield the same benefits as
CL 45930043 (rolled back in CL 53870043), but it does
so by making uintptrs less special, not more.
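A hedged sketch of the affected call pattern (Linux-specific, readSome is made
up): with this CL the uintptr argument no longer makes buf appear to escape.

    package escape

    import (
        "syscall"
        "unsafe"
    )

    // readSome passes a pointer to syscall.Syscall as a uintptr, converted
    // inside the call expression.
    func readSome(fd int) ([]byte, error) {
        buf := make([]byte, 64)
        n, _, errno := syscall.Syscall(syscall.SYS_READ, uintptr(fd),
            uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
        if errno != 0 {
            return nil, errno
        }
        return buf[:n], nil
    }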
Dmitriy Vyukov [Tue, 21 Jan 2014 09:06:57 +0000 (13:06 +0400)]
runtime: do not collect GC roots explicitly
Currently we collect (add) all roots into a global array in a single-threaded GC phase.
This hinders parallelism.
With this change we just kick off a parallel for over number_of_goroutines+5 iterations.
The parallel-for callback then decides whether it needs to scan the stack of a goroutine,
scan the data segment, scan finalizers, etc. This eliminates the single-threaded phase entirely.
This requires storing all goroutines in an array instead of a linked list
(to allow direct indexing).
This CL also removes the DebugScan functionality. It is broken because it uses
an unbounded stack, so it cannot run on g0. When it was working, I found
it of little help for debugging issues because the two algorithms are too different now.
Keeping it would require updating DebugScan, so it's simpler to just delete it.
With 8 threads this change reduces GC pause by ~6%, while keeping cputime roughly the same.
Dmitriy Vyukov [Tue, 21 Jan 2014 07:20:23 +0000 (11:20 +0400)]
runtime: per-P defer pool
Instead of a per-goroutine stack of defers for all sizes,
introduce a per-P defer pool for argument sizes 8, 24, 40, 56, 72 bytes.
For a program that starts 1e6 goroutines and then joins them:
old: rss=6.6g virtmem=10.2g time=4.85s
new: rss=4.5g virtmem= 8.2g time=3.48s
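A sketch of the kind of program behind the numbers above (illustrative, not
the exact test):

    package main

    import "sync"

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 1000000; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done() // one small defer per goroutine
            }()
        }
        wg.Wait()
    }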
Dmitriy Vyukov [Tue, 21 Jan 2014 06:53:51 +0000 (10:53 +0400)]
runtime: zero 2-word memory blocks in-place
Currently for 2-word blocks we set the flag to clear the flag. Makes no sense.
In particular, on 32-bit systems we always call memclr.
Dmitriy Vyukov [Tue, 21 Jan 2014 06:24:42 +0000 (10:24 +0400)]
runtime: ensure fair scheduling during frequent GCs
What was happening is as follows:
Each writer goroutine always triggers GC during its scheduling quantum.
After GC, goroutines are shuffled so that the timer goroutine is always second in the queue.
This repeats indefinitely, causing timer goroutine starvation.
Fixes #7126.
Keith Randall [Fri, 17 Jan 2014 22:48:45 +0000 (14:48 -0800)]
runtime, cmd/gc: Get rid of vararg channel calls.
Vararg C calls present a problem for the GC because the
argument types are not derivable from the signature. Remove
them by passing pointers to channel elements instead of the
channel elements directly.
The compiler change is an ugly hack.
We can do better.
Rob Pike [Fri, 17 Jan 2014 21:19:00 +0000 (13:19 -0800)]
syscall: allocate 64 bits of "basep" for Getdirentries
Recent crashes on 386 Darwin appear to be caused by this system call
smashing the stack. Phenomenology shows that allocating more data
here addresses the problem.
The guess is that since the actual system call is getdirentries64, 64 is
what we should allocate.
Dmitriy Vyukov [Fri, 17 Jan 2014 16:18:37 +0000 (20:18 +0400)]
syscall: mark arguments to Syscall as noescape
Heap arguments to "async" syscalls will break when/if we have moving GC anyway.
With this change it must not break until we have a moving GC, because a user must
reference the object in Go to preserve liveness. Otherwise the code is broken already.
Reduces number of leaked params from 125 to 36 on linux.
Luke Curley [Fri, 17 Jan 2014 16:07:04 +0000 (11:07 -0500)]
crypto/cipher: improved cbc performance
decrypt: reduced the number of copy calls from 2n to 1.
encrypt: reduced the number of copy calls from n to 1.
Encryption is straightforward: use dst instead of tmp when
xoring the block with the iv.
Decryption now loops backwards through the blocks, exploiting the
fact that the previous block's ciphertext (src) is the iv. This
means we don't need to copy the iv every time, in addition to
using dst instead of tmp as in encryption.
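For context, a minimal usage sketch of the API whose internals this CL
optimizes (not part of the CL; key/IV generation and padding kept trivial):

    package main

    import (
        "crypto/aes"
        "crypto/cipher"
        "crypto/rand"
        "fmt"
        "io"
        "log"
    )

    func main() {
        key := make([]byte, 16)
        iv := make([]byte, aes.BlockSize)
        if _, err := io.ReadFull(rand.Reader, key); err != nil {
            log.Fatal(err)
        }
        if _, err := io.ReadFull(rand.Reader, iv); err != nil {
            log.Fatal(err)
        }
        block, err := aes.NewCipher(key)
        if err != nil {
            log.Fatal(err)
        }
        plaintext := []byte("exactly sixteen.") // one AES block, already padded
        ciphertext := make([]byte, len(plaintext))
        cipher.NewCBCEncrypter(block, iv).CryptBlocks(ciphertext, plaintext)
        decrypted := make([]byte, len(ciphertext))
        cipher.NewCBCDecrypter(block, iv).CryptBlocks(decrypted, ciphertext)
        fmt.Printf("%s\n", decrypted)
    }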