Move currently uses mov instructions directly up to 31 bytes and then
switches to duffcopy. Moving 31 bytes is 4 instructions corresponding to
two loads and two stores, (or 6 if !useSSE) depending on the usage,
duffcopy is five (one or two mov, two or three lea, one call).
This adds direct mov instructions for Move's of size 32, 48, and 64 with
sse and for only size 32 without.
With useSSE:
- 32 is 4 instructions (byte +/- comparison below)
- 33 thru 48 is 6
- 49 thru 64 is 8
Without:
- 32 is 8
Note that the only platform with useSSE set to false is plan 9. I have
built three projects based off tip and tip with this patch and the
project's byte size is equal to or less than they were prior.
The basis of this change is that copying data with instructions directly
is nearly free, whereas calling into duffcopy adds a bit of overhead.
This is most noticeable in range statements where elements are 32+
bytes. For code with the following pattern:
func Benchmark32Range(b *testing.B) {
var f s32
for _, count := range []int{10, 100, 1000, 10000} {
name := strconv.Itoa(count)
b.Run(name, func(b *testing.B) {
base := make([]s32, count)
for i := 0; i < b.N; i++ {
for _, v := range base {
f = v
}
}
})
}
_ = f
}
Function calls are not benefitted as much due how they are compiled, but
other benchmarks I ran show that calling function with 64 byte elements
is marginally improved.
The main downside with this change is that it may increase binary sizes
depending on the size of the copy, but this change also decreases
binaries for moves of 48 bytes or less.
For the following code:
package main
type size [32]byte
//go:noinline
func use(t size) {
}
//go:noinline
func get() size {
var z size
return z
}
func main() {
var a size
use(a)
}
Changing size around gives the following assembly leading up to the call
(the initialization and actual call are removed):
So, at best we save 9B, at worst we gain 17. I do not think that copying
around 65+B sized types is common enough to bloat program sizes. Using
bincmp on the go binary itself shows a zero byte difference; there are
gains and losses all over. One of the largest gains in binary size comes
from cmd/go/internal/cache.(*Cache).Get, which passes around a 64 byte
sized type -- this is one of the cases I would expect to be benefitted
by this change.
I think that this marginal improvement in struct copying for 64 byte
structs is worth it: most data structs / work items I use in my programs
are small, but few are smaller than 32 bytes: with one slice, the budget
is up. The 32 rule alone would allow another 16 bytes, the 48 and 64
rules allow another 32 and 48.