Use one comparison to detect underflow and overflow simultaneously.
Use a shift, bitwise complement and uint8 type conversion to handle
clamping to upper and lower bound without additional branching.
Overall, the new code is faster across a mix of
common-case (in-range), underflow and overflow inputs.
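
The clamping described above can be sketched roughly as follows. This is
an illustration only, not necessarily the exact code in this change: the
helper name clamp8 and the 16.16 fixed-point layout of its argument are
assumptions made for the example, and the input is assumed to stay well
within the int32 range.

```go
package main

import "fmt"

// clamp8 is a hypothetical helper (not taken from the change itself).
// v is assumed to hold a 16.16 fixed-point value, so the wanted 8-bit
// result is v>>16 clamped to the range [0, 0xff].
func clamp8(v int32) uint8 {
	if uint32(v)&0xff000000 == 0 {
		// Top byte clear: v is non-negative and v>>16 fits in 8 bits.
		// This single comparison rules out both underflow and overflow.
		v >>= 16
	} else {
		// Out of range. The arithmetic shift v>>31 yields -1 for a
		// negative v (underflow) and 0 for a positive v (overflow).
		// The complement turns that into 0 or -1 respectively, and the
		// uint8 conversion below maps -1 to 0xff.
		v = ^(v >> 31)
	}
	return uint8(v)
}

func main() {
	// In range, underflow, overflow.
	fmt.Println(clamp8(128<<16), clamp8(-5<<16), clamp8(300<<16))
}
```

The only branch is the range check itself; picking the lower or upper
bound on the out-of-range path needs no further comparison.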

name     old time/op  new time/op  delta
YCbCr-2  1.12ms ± 0%  0.64ms ± 0%  -43.01%  (p=0.000 n=48+47)