It wasn't actually testing what it says it was testing.
A random permutation isn't cyclic. It only probably hits a few
elements before entering a cycle.
Use an algorithm that generates a random cyclic permutation instead.
Fixing the test makes the previous CL look less good. But it still helps.
(Theory: Fixing the test makes it less cache friendly, so there are
more misses all around. That makes the benchmark slower, suppressing
the differences seen. Also fixing the benchmark makes the loop
iteration count less predictable, which hurts the raw loop
implementation somewhat.)