runtime: implement transfer cache
Currently we do the following dance after sweeping a span:
1. lock mcentral
2. remove the span from a list
3. unlock mcentral
4. unmark span
5. lock mheap
6. insert the span into heap
7. unlock mheap
8. lock mcentral
9. observe empty list
10. unlock mcentral
11. lock mheap
12. grab the span
13. unlock mheap
14. mark span
15. lock mcentral
16. insert the span into empty list
17. unlock mcentral
This change short-circuits this sequence to nothing,
that is, we just cache and use the span after sweeping.
This gives us functionality similar (even better) to tcmalloc's transfer cache.
benchmark old ns/op new ns/op delta
BenchmarkMalloc8 22.2 19.5 -12.16%
BenchmarkMalloc16 31.0 26.6 -14.19%
LGTM=khr
R=golang-codereviews, khr
CC=golang-codereviews, rlh, rsc
https://golang.org/cl/
119550043