So much for that exercise. New SSE2 code:
53000 microseconds spent scoring hands /
( 128662 relevant hands *
13 possible arrangements to score ) =
32 ns per score calculation =
64 clock cycles
Old branching code:
58000 microseconds spent scoring hands / ( 128662 * 13 scores ) = 35 ns per score calculation = 69 clock cycles.
Unfortunately, scoring accounts for only about one-third to one-half of the total time spent evaluating the set of undominated hand arrangements.
In the example above, 184300 microseconds evaluating 13 arrangements vs. 10M hands =
53000 microseconds scoring hands + 131300 microseconds comparing cards for overlap.
131300 microseconds / 10M hands =
13 ns per comparison =
26 clock cycles (including loop)
Check: 26 cycles * 10M hands + 64 cycles * 13 arrangements * 128662 relevant hands = 367 * 10^6 clocks = 184 ms.
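The arithmetic above can be reproduced in a few lines of C. This is only a sketch: the helper names are made up for illustration, and the 2 cycles per ns conversion is inferred from the 32 ns = 64 cycles figure rather than measured.

```c
#include <assert.h>

/* Assumption: 2 clock cycles per nanosecond, inferred from 32 ns = 64 cycles. */
#define CYCLES_PER_NS 2.0

/* Cycles per operation, given a total time in microseconds and an operation count. */
static double cycles_per_op(double total_us, double ops)
{
    assert(ops > 0.0);
    return total_us * 1000.0 / ops * CYCLES_PER_NS;   /* us -> ns -> cycles */
}

/* Predicted total time in milliseconds from the two per-operation cycle costs. */
static double predicted_ms(double hands, double overlap_cycles,
                           double scores, double score_cycles)
{
    double clocks = hands * overlap_cycles + scores * score_cycles;
    return clocks / CYCLES_PER_NS / 1e6;              /* cycles -> ns -> ms */
}
```

Calling cycles_per_op(53000, 128662 * 13.0) gives a bit over 63 cycles, and predicted_ms(1e7, 26, 128662 * 13.0, 64) lands between 183 and 184 ms, which agrees with the measured 184300 microseconds.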
I can probably shave off a few cycles here, but there are no huge gains left to be had. 26 cycles seems like a lot, but we're thrashing both the L1 and L2 caches with 10M hands, so I think memory bandwidth becomes the main issue.
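For reference, the overlap loop being timed here might look roughly like the sketch below. It assumes each hand is stored as a 64-bit mask with one bit per card, and that a "relevant" hand is one sharing no card with mine. Neither assumption is spelled out above, and card_mask and count_disjoint are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical representation: one bit per card, 52 of the 64 bits used. */
typedef uint64_t card_mask;

/* Counts the hands that share no card with `mine`.
   In a real evaluator the scoring call would go where the counter is bumped. */
static size_t count_disjoint(const card_mask *hands, size_t n, card_mask mine)
{
    size_t relevant = 0;
    assert(hands != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((hands[i] & mine) == 0)   /* no common bit means no shared card */
            relevant++;
    }
    return relevant;
}
```

Under that representation, 10M hands occupy 80 MB, so every pass streams the whole array from main memory, which fits the guess that bandwidth rather than arithmetic dominates the 26 cycles.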
Perhaps I can make more aggressive use of SIMD. We could look for overlap in two hands at once in a 128-bit register. We could also perhaps score four possible arrangements at once. Comparing front, middle, and back in parallel and then converting that into a score turned out to be a big pain, but doing front, middle, and back serially for four hands at once should be much more straightforward.
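The two-hands-at-once overlap test could be sketched as follows. This is not the code from the evaluator: it assumes hands are 64-bit card masks stored in adjacent pairs, and overlap2 is a hypothetical name. SSE2 has no 64-bit integer compare, so the sketch compares 32-bit halves against zero and combines them through the byte mask. A scalar fallback is included for builds without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns bit 0 set if pair[0] shares a card with `mine`,
   and bit 1 set if pair[1] shares a card with `mine`. */
static int overlap2(const uint64_t pair[2], uint64_t mine)
{
    assert(pair != 0);
#if defined(__SSE2__)
    __m128i h = _mm_loadu_si128((const __m128i *)pair);  /* two hands        */
    __m128i m = _mm_set1_epi64x((long long)mine);        /* my cards, twice  */
    __m128i common = _mm_and_si128(h, m);                /* shared cards     */
    /* All-ones in each 32-bit half that has no shared card. */
    __m128i is_zero = _mm_cmpeq_epi32(common, _mm_setzero_si128());
    int bits = _mm_movemask_epi8(is_zero);               /* one bit per byte */
    int lo_overlap = (bits & 0x00FF) != 0x00FF;          /* low 64-bit lane  */
    int hi_overlap = (bits & 0xFF00) != 0xFF00;          /* high 64-bit lane */
    return lo_overlap | (hi_overlap << 1);
#else
    /* Scalar fallback with the same result encoding. */
    return ((pair[0] & mine) != 0) | (((pair[1] & mine) != 0) << 1);
#endif
}
```

Whether this beats the scalar loop is an open question: if the loop is already limited by memory bandwidth, halving the instruction count may not help much.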