runtime: add pcvalue cache to improve stack scan speed
The cost of scanning large stacks is currently dominated by the time
spent looking up and decoding the pcvalue table. However, large stacks
are usually large not because they contain calls to many different
functions, but because they contain many calls to the same, small set
of recursive functions. Hence, walking large stacks tends to make the
same pcvalue queries many times.
Based on this observation, this commit adds a small, very simple, and
fast cache in front of pcvalue lookup. We thread this cache down from
operations that make many pcvalue calls, such as gentraceback, stack
scanning, and stack adjusting.
This simple cache works well because it has minimal overhead when it's
not effective. I also tried a hashed direct-map cache, CLOCK-based
replacement, round-robin replacement, and round-robin with lookups
disabled until there had been at least 16 probes, but none of these
approaches had obvious wins over the random replacement policy in this
commit.
This nearly doubles the overall performance of the deep stack test
program from issue #10898:
name old time/op new time/op delta
Issue10898 16.5s ±12% 9.2s ±12% -44.37% (p=0.008 n=5+5)
It's a very slight win on the garbage benchmark:
name old time/op new time/op delta
XBenchGarbage-12 4.92ms ± 1% 4.89ms ± 1% -0.75% (p=0.000 n=18+19)
It's a wash (but doesn't harm performance) on the go1 benchmarks,
which don't have particularly deep stacks: