cmd/ld: align function entry on arch-specific boundary
16 seems pretty standard on x86 for function entry.
I don't know if ARM would benefit, so I used just 4
(single instruction alignment).
This has a minor absolute effect on the current timings.
The main hope is that it will make them more consistent from
run to run.
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17
4222117400 4140739800 -1.93%
BenchmarkFannkuch11
3462631800 3259914400 -5.85%
BenchmarkGobDecode
20887622 20620222 -1.28%
BenchmarkGobEncode
9548772 9384886 -1.72%
BenchmarkGzip 151687 150333 -0.89%
BenchmarkGunzip 8742 8741 -0.01%
BenchmarkJSONEncode
62730560 65210990 +3.95%
BenchmarkJSONDecode
252569180 249394860 -1.26%
BenchmarkMandelbrot200
5267599 5273394 +0.11%
BenchmarkRevcomp25M
980813500 996013800 +1.55%
BenchmarkTemplate
361259100 360620840 -0.18%
R=ken2
CC=golang-dev
https://golang.org/cl/
6244066