Cassy "SnowGirl"
2006-10-06 00:24:44 UTC
So, someone mentioned to me that Intel has come out with a new set of SSE
instructions (SSSE3), and finally there's one that could let us do VPERM more
efficiently than dumping everything out to memory and doing it scalar-wise.
The instruction is PSHUFB, and it takes two byte vectors as arguments: A =
(a_0, a_1, ..., a_15) and B = (b_0, b_1, ..., b_15). It replaces A with
(a_(b_0), a_(b_1), ..., a_(b_15)), using only the low four bits of each b_n
as the index; if the top bit of b_n is set, that result byte is zeroed
instead.
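For reference, here's a tiny intrinsics demo of those semantics (the values
are arbitrary; compile with something like gcc -mssse3):

#include <stdio.h>
#include <tmmintrin.h>   /* _mm_shuffle_epi8 (SSSE3) */

int main(void)
{
    __m128i a = _mm_setr_epi8(10, 11, 12, 13, 14, 15, 16, 17,
                              18, 19, 20, 21, 22, 23, 24, 25);
    /* Reverse the bytes; the last index has its top bit set, so that
       lane is zeroed rather than fetched from a. */
    __m128i b = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                              7, 6, 5, 4, 3, 2, 1, (char)0x80);
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(a, b));
    for (int i = 0; i < 16; i++)
        printf("%u ", out[i]);  /* prints: 25 24 ... 11 0 */
    printf("\n");
    return 0;
}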
With this, we could translate vperm vrD, vrA, vrB, vrC to:
MOVDQA XMMA, [vrA]    # map XMMA -> vrA
MOVDQA XMMB, [vrB]    # map XMMB -> vrB
MOVDQA XMMC, [vrC]    # map XMMC -> vrC
PSHUFB XMMA, XMMC     # invalidate XMMA mapping
PXOR XMMC, [vrSIGN]   # vrSIGN = 16 bytes of 0x80; invalidate XMMC mapping
PSHUFB XMMB, XMMC     # invalidate XMMB mapping
POR XMMA, XMMB        # map XMMA -> vrD
I think that would run a bit faster than having to dump the registers out to
memory, load offsets byte by byte, and so on. Of course, there's still the
problem that we're accessing in the wrong order (AltiVec numbers bytes from
the most significant end, so with the registers stored byte-reversed on the
little-endian host the indices point the wrong way), and the raw indices
don't put the "select vrB" condition into PSHUFB's zeroing bit either. Both
should just take a fixup step after loading XMMC: subtract each index byte
from zero, add 15, and AND with 0x8F. That maps index b to (15 - b) & 0x8F:
for b in 0..15 it selects byte 15 - b of vrA, and for b in 16..31 it wraps
into 0x80..0x8F, which zeroes the lane from vrA and, after the XOR below,
selects byte 31 - b of vrB. So, the final solution would look something like
this:
MOVDQA XMMA, [vrA]        # map XMMA -> vrA
MOVDQA XMMB, [vrB]        # map XMMB -> vrB
PXOR XMMC, XMMC
PSUBB XMMC, [vrC]         # XMMC = -vrC (unless someone knows a faster way to
                          # negate; MOVDQA XMMC, [FIFTEENS] / PSUBB XMMC, [vrC]
                          # would fold the negate and the add into two ops)
PADDB XMMC, [FIFTEENS]    # FIFTEENS = 16 bytes of 0x0F; XMMC = 15 - vrC
PAND XMMC, [PERM_MASKS]   # PERM_MASKS = 16 bytes of 0x8F
PSHUFB XMMA, XMMC         # invalidate XMMA mapping
PXOR XMMC, [vrSIGN]       # flip the zeroing bit to pick from the other source
PSHUFB XMMB, XMMC         # invalidate XMMB mapping
POR XMMA, XMMB            # map XMMA -> vrD
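For testing the index arithmetic outside the recompiler, the same sequence is
easy to write with intrinsics; this is just a sketch, assuming the emulated
registers are kept byte-reversed in memory as above (the names are
placeholders):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <tmmintrin.h>   /* _mm_shuffle_epi8 (SSSE3) */

static __m128i emulate_vperm(__m128i vrA, __m128i vrB, __m128i vrC)
{
    const __m128i fifteens  = _mm_set1_epi8(0x0F);
    const __m128i perm_mask = _mm_set1_epi8((char)0x8F);
    const __m128i sign      = _mm_set1_epi8((char)0x80);

    /* idx = (15 - c) & 0x8F, same as the PXOR/PSUBB/PADDB/PAND above */
    __m128i idx = _mm_and_si128(_mm_sub_epi8(fifteens, vrC), perm_mask);

    __m128i from_a = _mm_shuffle_epi8(vrA, idx);            /* indices 0..15  */
    __m128i from_b = _mm_shuffle_epi8(vrB,
                         _mm_xor_si128(idx, sign));         /* indices 16..31 */
    return _mm_or_si128(from_a, from_b);
}

(One caveat either way: vperm only looks at the low five bits of each index
byte, so if vrC can carry garbage in bits 5 and 6, it would need an extra
PAND with 0x1F first.)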
Of course, we would also need CPUID support to detect SSSE3, and it will only
be available on very new processors (though the same argument put off many
SSE2 implementations for a long time, too).
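For what it's worth, SSSE3 is reported in bit 9 of ECX for CPUID leaf 1, so
the check is simple; here's a sketch using GCC's cpuid.h (the helper name is
just an example):

#include <cpuid.h>

/* Returns nonzero if the CPU reports SSSE3 (CPUID.01H:ECX[9]). */
static int have_ssse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* CPUID leaf 1 not supported */
    return (ecx >> 9) & 1;
}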
It also looks like there are a few interesting instructions that could help
with VMHRADDSHS, which actually does get called often under OS X.
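One obvious candidate is PMULHRSW, which computes (a * b + 0x4000) >> 15 per
signed halfword; that is exactly the multiply-high-with-rounding step of
VMHRADDSHS, and a saturating PADDSW supplies the "+ vC" part. A rough
intrinsics sketch (the function name is made up; note the one divergence:
when both inputs are 0x8000, PMULHRSW wraps to -0x8000 where the PPC
intermediate would be +0x8000 before the add):

#include <emmintrin.h>   /* _mm_adds_epi16 (SSE2) */
#include <tmmintrin.h>   /* _mm_mulhrs_epi16 (SSSE3) */

/* Approximate vmhraddshs vD,vA,vB,vC: per signed halfword,
   saturate(((a * b + 0x4000) >> 15) + c). */
static __m128i emulate_vmhraddshs(__m128i a, __m128i b, __m128i c)
{
    __m128i prod = _mm_mulhrs_epi16(a, b);  /* (a*b + 0x4000) >> 15 */
    return _mm_adds_epi16(prod, c);         /* saturating add of c */
}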
--
Cassy