I was surprised to see readvarint show up in a CPU profile.
Use a few simple optimizations to speed up stack copying:
* Avoid making a copy of the cache.entries array or any of its elements.
* Use a shift instead of a signed division in stackmapdata.
* Change readvarint to return the number of bytes consumed
rather than an updated slice.
* Make some minor optimizations to readvarint to help the compiler.
* Avoid calling readvarint when the value fits in a single byte.
The first and last optimizations are the most significant,
although they all contribute a little.
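As a rough illustration of the readvarint items above, here is a minimal sketch, not the actual runtime code. It assumes a 7-bits-per-byte little-endian varint encoding; the readvarint body is a guess at the shape of the change, and decodeAt is a hypothetical caller added only to show the single-byte fast path.

```go
package main

import "fmt"

// readvarint decodes an unsigned varint from the start of p and returns
// the number of bytes consumed along with the value. Returning a count
// instead of an updated slice leaves the caller with a single integer
// to track. Sketch only; the encoding is an assumption.
func readvarint(p []byte) (read int, val uint32) {
	var v, shift uint32
	var n int
	for {
		b := p[n]
		n++
		// Masking the shift keeps it provably below 32, which can spare
		// the compiler an out-of-range shift check.
		v |= uint32(b&0x7F) << (shift & 31)
		if b&0x80 == 0 {
			break
		}
		shift += 7
	}
	return n, v
}

// decodeAt is a hypothetical caller. A byte below 0x80 is a complete
// one-byte varint, so the call to readvarint is skipped entirely.
func decodeAt(p []byte) (int, uint32) {
	if b := p[0]; b < 0x80 {
		return 1, uint32(b)
	}
	return readvarint(p)
}

func main() {
	buf := []byte{0x05, 0xAC, 0x02}
	n, v := decodeAt(buf)
	fmt.Println(n, v) // 1 5
	n, v = decodeAt(buf[n:])
	fmt.Println(n, v) // 2 300
}
```

The caller advances its own offset by the returned count, so no new slice header is built per call.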
Add a benchmark for stack copying that includes lots of different
functions in a recursive loop, to bust the cache.
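The idea behind such a benchmark can be sketched as follows. This is an assumption-laden outline, not the benchmark added by this change: countA, countB, countC and growOnFreshGoroutine are hypothetical names, and a real cache-busting benchmark would need many more distinct functions than the three shown here.

```go
package main

import "fmt"

// Three mutually recursive functions. Each has its own function
// metadata, so a deep call chain alternates between different
// functions' frames when the stack is copied.
func countA(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + countB(n-1)
}

func countB(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + countC(n-1)
}

func countC(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + countA(n-1)
}

// growOnFreshGoroutine runs the recursion on a new goroutine, which
// starts with a small stack and therefore has to grow (and be copied)
// repeatedly as the recursion deepens.
func growOnFreshGoroutine(depth int) int {
	done := make(chan int)
	go func() { done <- countA(depth) }()
	return <-done
}

func main() {
	fmt.Println(growOnFreshGoroutine(100000)) // 100000
}
```

Wrapping the call to growOnFreshGoroutine in a loop over b.N inside a testing.B benchmark would give a timing comparable in spirit to the numbers reported here.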
This might speed up other runtime operations as well;
I only benchmarked stack copying.
name                old time/op  new time/op  delta
StackCopy-8         96.4ms ± 2%  82.7ms ± 1%  -14.24%  (p=0.000 n=20+19)
StackCopyNoCache-8   167ms ± 1%   131ms ± 1%  -21.58%  (p=0.000 n=20+20)