introduce rolor() function to subsitute (a << c) | (a >> (bits(a) - c))
with (a <<< c) where <<< is cyclic rotation and c is constant.
this almost doubles the speed of chacha encryption of 386 and amd64.
the peephole optimizer used to stop when it hit a shift or rol
instruction when attempting to eleminate moves by register
substitution. but we do not have to as long as the shift count
operand is not CX (which cannot be substituted) and CX is not
a subject for substitution.
convert:
x = B || W
MOVxLZX a, r; MOVxQZX r, b -> MOVxQZX a, r; MOVQ r, b
MOVxLSX a, r; MOVxQSX r, r -> MOVxQSX a, r; MOVQ r, r
the MOVQ can then be eleminated by copy propagation.
improve subprop() by accepting other mov and lea
instructions as the source op.
Without an explicit signal for a truncation, copy propagation will
sometimes propagate a 32-bit truncation and end up overwriting uses of
the original 64-bit value.
This was independently discovered and fixed in Go. See:
http://golang.org/issue/1315https://codereview.appspot.com/6002043/
Thanks Charles Forsyth for tips and advice.
the shift instructions does not change the zero flag
when the shift count is 0, so we cannot remove the
compare instruction in this case.
this fixes oggdec under 386.
when the previous instruction sets the zero flag,
we can remove the CMPL/CMPQ instruction.
this removes compares for zero/non zero tests only.
it only looks at the previous non-nop instruction
to see if it sets our compare value register.