My brain has finally kicked up a possible explanation. Since evaluating a hand means looping over a big array of opponent hands, performance may be bound by memory access. The 2-CPU machine is a dual-core with separate L2 caches, but the working set is at least 128MB. So the bottleneck may actually be pulling cache lines in from main memory, which is no faster on a 2-CPU machine.
If that's the case, though, there ought to be a substantial performance improvement from batching the calculations for several hands. The L2 cache is 512KB, two-way set-associative, with 64-byte lines. Right now one opponent hand takes exactly two cache lines (128 bytes).
The test hand is not so well behaved. Even if it were rewritten to increase locality, we'd still want 52×1 bits to mark which cards are present (for easy lookup), 3×8 bytes to record the hand strength of each arrangement, and another 4 bytes to keep the running total of the score. That's 8 bytes of bitmap plus 32 bytes of per-arrangement state; call it 1KB per hand as a worst case.
Comparing 256 hands at a time would use 256KB of the L2 cache — one line from every set — leaving the other line in each set for opponent hands. (We'd also need some L2 cache lines for instructions.)
Also, I'm being an idiot about how I check whether the target hand and an opponent hand have overlapping cards. Instead of storing the target as a hash and the opponent as an array, I should store both as bitmaps and do a single bitwise AND.