I did not put any extra effort into making the batched hands use memory efficiently, because I figured if there was a significant effect, it would be visible despite the batch being scattered throughout memory.
Let's do the math.
It takes about 140-160 milliseconds to evaluate one hand vs 1M opponent hands. That means each opponent hand is considered in 0.14-0.16 microseconds on average. This is significantly longer than memory latency, which we'd expect to be about 0.04 microseconds using 100MHz DDR SDRAM.
But, most hands are rejected quickly because they overlap. For these the memory latency might be an issue. But others are scored against each possible arrangement. 1 cycle = 5*10^-10 second = 5*10^-4 microseconds, so 0.14-0.16 us = 280-320 cycles per opponent hand. But if I count just the 13000 or so relevant (non-overlapping) hands, then the cost is actually 11-12 us, or 22000-24000 cycles, per relevant hand.
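The arithmetic above can be checked with a few lines of Python. This is only a sketch: the 2 GHz clock is inferred from the 5*10^-10 second cycle time, the 13000 count is the rough figure quoted above, and the names are made up for illustration.

```python
# Back-of-the-envelope check of the per-hand figures.
# Assumption: a 2 GHz clock, implied by 1 cycle = 5*10^-10 second.
CYCLE_US = 5e-4            # one CPU cycle, in microseconds
TOTAL_HANDS = 1_000_000    # opponent hands per evaluation
RELEVANT_HANDS = 13_000    # approximate number of non-overlapping hands

def per_hand_cost(total_ms, hands):
    """Return (microseconds, cycles) spent per hand, on average."""
    us = total_ms * 1000.0 / hands
    return us, us / CYCLE_US

for total_ms in (140, 160):
    all_us, all_cycles = per_hand_cost(total_ms, TOTAL_HANDS)
    rel_us, rel_cycles = per_hand_cost(total_ms, RELEVANT_HANDS)
    print(f"{total_ms} ms: {all_us:.2f} us ({all_cycles:.0f} cycles) per hand, "
          f"{rel_us:.1f} us ({rel_cycles:.0f} cycles) per relevant hand")
```

The unrounded results land slightly outside the rounded 11-12 us range (roughly 10.8 to 12.3 us), which is close enough for this kind of estimate.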
The truth is probably that the overlapping hands take much less time (just a few CPU cycles for a compare and branch, plus perhaps 80 cycles of memory latency) while non-overlapping hands take more.
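To put rough numbers on that split, here is a sketch that charges every overlapping hand a fixed cost and gives whatever time is left to the non-overlapping hands. The 85-cycle rejection cost (about 5 cycles for the compare and branch plus 80 for a memory access) is a guess, not a measurement, and it assumes the worst case where every rejected hand pays the full memory latency.

```python
# Rough model: fixed cost per rejected hand, remainder goes to scoring.
CYCLE_US = 5e-4            # one CPU cycle in microseconds (2 GHz assumed)
TOTAL_HANDS = 1_000_000    # opponent hands per evaluation
RELEVANT_HANDS = 13_000    # approximate number of non-overlapping hands
REJECT_CYCLES = 85         # assumed: ~5 cycles compare/branch + ~80 cycles latency

def split_cost(total_ms):
    """Return (ms spent rejecting overlaps, cycles left per relevant hand)."""
    total_cycles = total_ms * 1000.0 / CYCLE_US
    reject_cycles = (TOTAL_HANDS - RELEVANT_HANDS) * REJECT_CYCLES
    scoring_cycles = total_cycles - reject_cycles
    reject_ms = reject_cycles * CYCLE_US / 1000.0
    return reject_ms, scoring_cycles / RELEVANT_HANDS

for total_ms in (140, 160):
    reject_ms, cycles_per_relevant = split_cost(total_ms)
    print(f"{total_ms} ms total: {reject_ms:.1f} ms rejecting overlaps, "
          f"{cycles_per_relevant:.0f} cycles per relevant hand")
```

Under these assumptions, rejecting overlapping hands costs about 42 ms per evaluation, so scoring still accounts for most of the 140-160 ms.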
So, it doesn't make sense to concentrate on getting through overlapping hands faster. I need to focus on getting the score calculation down to a minimum.
But, I still can't explain why the 2-CPU box is slower.