### SIMD within a Register

I've been posting a little here and there on bit hacks. A while back I stumbled on a page titled

"The basic concept of SWAR, SIMD Within A Register, is that operations on word-length registers can be used to speed-up computations by performing SIMD parallel operations on n k/n-bit field values."

In other words, in the absence of MMX/SSE/etc., which SIMD operations can be done in an ordinary register using bit operations? It's an intriguing puzzle to ponder, especially as 64-bit systems continue to gain ground.

I'm sure it's been done before, but one idea coming to mind is a parallel average. The obvious way to average two unsigned values is (A + B) / 2. Actually, if you want to round up, (A + B + 1) / 2, but what happens if (A + B) overflows? How about adding A/2 and B/2, and if the low bit is set in either A or B, round up by one:

average = (A>>1) + (B>>1) + (A&1 OR B&1);

If we want to average two bytes in parallel using this method, we should be able to do it in a 16-bit word. The only additional thing we need to do is AND with 0xFEFE to mask off least significant bits before the shifft so they aren't shifted into the most significant bit of adjacent bytes:

average_x_2 = ((A&0xFEFE)>>1) + ((B&0xFEFE)>>1) + (A&0x0101 OR B&0x0101);

Of course, this idea readily extends to 32 bits to handle 4 bytes in parallel:

average_x_4 = ((A&0xFEFEFEFE)>>1) + ((B&0xFEFEFEFE)>>1) + (A&0x01010101 OR B&0x01010101);

Likewise, it can be extended to 64-bit for processing 8 bytes in parallel.

*Technical Summary: SWAR Technology*by Hank Dietz (1997):"The basic concept of SWAR, SIMD Within A Register, is that operations on word-length registers can be used to speed-up computations by performing SIMD parallel operations on n k/n-bit field values."

In other words, in the absence of MMX/SSE/etc., which SIMD operations can be done in an ordinary register using bit operations? It's an intriguing puzzle to ponder, especially as 64-bit systems continue to gain ground.

I'm sure it's been done before, but one idea coming to mind is a parallel average. The obvious way to average two unsigned values is (A + B) / 2. Actually, if you want to round up, (A + B + 1) / 2, but what happens if (A + B) overflows? How about adding A/2 and B/2, and if the low bit is set in either A or B, round up by one:

average = (A>>1) + (B>>1) + (A&1 OR B&1);

If we want to average two bytes in parallel using this method, we should be able to do it in a 16-bit word. The only additional thing we need to do is AND with 0xFEFE to mask off least significant bits before the shifft so they aren't shifted into the most significant bit of adjacent bytes:

average_x_2 = ((A&0xFEFE)>>1) + ((B&0xFEFE)>>1) + (A&0x0101 OR B&0x0101);

Of course, this idea readily extends to 32 bits to handle 4 bytes in parallel:

average_x_4 = ((A&0xFEFEFEFE)>>1) + ((B&0xFEFEFEFE)>>1) + (A&0x01010101 OR B&0x01010101);

Likewise, it can be extended to 64-bit for processing 8 bytes in parallel.

*Note: I am using OR to indicate a BITWISE-OR because Blogger nukes vertical bars.*
## 0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home