The first biggest offender was crypto/des.init at ~1%. It's
cryptographically broken and the init function is relatively expensive,
which is unfortunate as both crypto/tls and crypto/x509 (and by
extension, cmd/go) import it. Hide the work behind sync.Once.
The second biggest offender was flag.sortFlags at just under 1%, used by
the Visit flagset methods. It allocated two slices, which made a
difference as cmd/go iterates over multiple flagsets during init.
Use a single slice with a direct sort.Interface implementation.
Another big offender is initializing global maps. Reducing this work in
cmd/go/internal/imports and net/textproto gives us close to another
whole 1% in saved work. The former can use map literals, and the latter
can hide the work behind sync.Once.
Finally, compress/flate used newHuffmanBitWriter as part of init, which
allocates many objects and slices. Yet it only used one of the slice
fields. Allocating just that slice saves a surprising ~0.3%, since we
generated a lot of unnecessary garbage.
All in all, these little pieces amount to just over 3% saved CPU time.
name old time/op new time/op delta
ExecGoEnv-8 3.61ms ± 1% 3.50ms ± 0% -3.02% (p=0.000 n=10+10)