For less predictable data, such as loop-iteration-dependent sizes,
the branch table still shows 2-5% worse results.
It also makes the code slightly more complicated.
This CL removes the TODO note that directly suggests trying this
optimization, since it encourages people to spend their time
on a rather hopeless endeavour.
The code used to implement the branch table was a 32/64-entry table
of pointers to TEXT blocks, each implementing the work for its
associated label. Most of the trailing entries point to "loop" code,
which acts as a fallthrough for all sizes that do not map to a
specialized routine. The only inefficiency is the extra MOVL/MOVQ
required to fetch the table pointer itself, since
MOVL $sym<>(SB)(AX*4) is not valid in Go asm (it works in other
assemblers):
TEXT ·memclrNew(SB), NOSPLIT, $0-8
MOVL ptr+0(FP), DI
MOVL n+4(FP), BX
// Handle 0 separately.
TESTL BX, BX
JEQ _0
LEAL -1(BX), CX // n-1
BSRL CX, CX
// AX or X0 zeroed inside every text block.
MOVL $memclrTable<>(SB), AX
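// Indirect jump through the table: each 4-byte entry holds the
// address of the TEXT block for the size class in CX.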
JMP (AX)(CX*4)
_0:
RET
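For reference, the table itself can be declared with DATA/GLOBL
directives. A minimal sketch of the 32-entry (4 bytes per pointer)
variant, assuming hypothetical per-size-class TEXT blocks named
memclr_c0<>, memclr_c1<>, ... (the real experiment's names may
differ), could look like:
// Entry i is taken for sizes with BSRL(n-1) == i.
// Names are illustrative; requires #include "textflag.h" for RODATA.
DATA memclrTable<>+0x00(SB)/4, $memclr_c0<>(SB)
DATA memclrTable<>+0x04(SB)/4, $memclr_c1<>(SB)
DATA memclrTable<>+0x08(SB)/4, $memclr_c2<>(SB)
DATA memclrTable<>+0x0c(SB)/4, $memclr_c3<>(SB)
// ... remaining entries point at the generic "loop" fallthrough.
GLOBL memclrTable<>(SB), RODATA, $128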