runtime: use OnesCount64 to count allocated objects in a span
This change modifies the implementation of (*mspan).countAlloc by
using OnesCount64 (which on many systems is intrinsified). It does so by
using an unsafe pointer cast, but in this case we don't care about
endianness because we're just counting bits set.
This change means we no longer need the popcnt table which was redundant
in the runtime anyway. We can also simplify the logic here significantly
by observing that mark bits allocations are always 8-byte aligned, so we
don't need to handle any edge-cases due to the fact that OnesCount64
operates on 64 bits at a time: all irrelevant bits will be zero.
Overall, this implementation is significantly faster than the old one on
amd64, and should be similarly faster (or better!) on other systems
which support the intrinsic. On systems which do not, it should be
roughly the same performance because OnesCount64 is implemented using a
table in the general case.