deductSweepCredit expects the size in bytes of the span being
allocated, but mCentral_CacheSpan passes the size of a single object
in the span. As a result, we don't sweep enough on that call and when
mCentral_CacheSpan later calls reimburseSweepCredit, it's very likely
to underflow mheap_.spanBytesAlloc, which causes the next call to
deductSweepCredit to think it owes a huge number of pages and finish
off the whole sweep.
In addition to causing the occasional allocation that triggers the
full sweep to be potentially extremely expensive relative to other
allocations, this can indirectly slow down many other allocations.
deductSweepCredit uses sweepone to sweep spans, which returns
fully-unused spans to the heap, where these spans are freed and
coalesced with neighboring free spans. On the other hand, when
mCentral_CacheSpan sweeps a span, it does so with the intent to
immediately reuse that span and, as a result, will not return the span
to the heap even if it is fully unused. This saves on the cost of
locking the heap, finding a span, and initializing that span. For
example, before this change, with GOMAXPROCS=1 (or the background
sweeper disabled) BinaryTree17 returned roughly 220K spans to the heap
and allocated new spans from the heap roughly 232K times. After this
change, it returns 1.3K spans to the heap and allocates new spans from
the heap 39K times. (With background sweeping these numbers are
effectively unchanged because the background sweeper sweeps almost all
of the spans with sweepone; however, parallel sweeping saves more than
the cost of allocating spans from the heap.)