I've been pondering whether it's possible to make a space/time tradeoff that will dramatically reduce that number. The current code consists of the following steps, where K is the number of cards being drawn at once:
1) Calculate the 'weight' of the current draw: K lookups and multiplications.
2) Copy the K cards of the current draw to a return object.
3) Advance to the next draw in lexicographic order. Maximum 2K loop iterations, with about five lookups and three increment/decrement operations per loop.
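To make the three steps concrete, here's a minimal sketch assuming a draw is stored as K ascending indices into the remaining deck; the names `weights`, `draw_weight`, and `next_draw`, and the deck size N, are my inventions for illustration, not taken from the actual code:

```c
#define K 5   /* cards per draw */
#define N 47  /* cards remaining in the deck (assumed) */

/* Step 1: the weight of a draw as the product of per-card weights
   (hypothetical weights[] table): K lookups and K multiplications. */
double draw_weight(const int idx[K], const double weights[N]) {
    double w = 1.0;
    for (int i = 0; i < K; i++)
        w *= weights[idx[i]];
    return w;
}

/* Step 3: advance idx[] to the next K-combination of {0..N-1} in
   lexicographic order. Returns 0 when enumeration is complete. */
int next_draw(int idx[K]) {
    int i = K - 1;
    while (i >= 0 && idx[i] == N - K + i)  /* rightmost index that can move */
        i--;
    if (i < 0)
        return 0;                          /* last combination reached */
    idx[i]++;
    for (int j = i + 1; j < K; j++)        /* reset the tail */
        idx[j] = idx[j - 1] + 1;
    return 1;
}
```

The backward scan in `next_draw` is where the "maximum 2K iterations" bound comes from: at most K steps to find the index to bump, and at most K more to reset the tail.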
But since we're never drawing more than 5 cards per hand, it should be possible to store every possible draw in big precomputed tables. Then we just have to calculate the weight of each possible draw as in step #1.
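A toy sketch of the idea, scaled down to 2-card draws from a 5-card deck so the table is small; the names and sizes are illustrative only:

```c
/* Precompute every K-card draw once into a flat table; afterwards each
   evaluation pass is a pure linear scan (step 1 only), with no
   enumeration logic in the hot loop. */
#define K 2
#define N 5
#define NUM_DRAWS 10            /* C(5,2) for this toy example */

static int draw_table[NUM_DRAWS][K];

void build_draw_table(void) {
    int n = 0;
    for (int a = 0; a < N; a++)
        for (int b = a + 1; b < N; b++) {
            draw_table[n][0] = a;
            draw_table[n][1] = b;
            n++;
        }
}

/* Total weight over all draws: NUM_DRAWS * K lookups and multiplies. */
double total_weight(const double weights[N]) {
    double total = 0.0;
    for (int n = 0; n < NUM_DRAWS; n++)
        total += weights[draw_table[n][0]] * weights[draw_table[n][1]];
    return total;
}
```

The real table for 5-card draws from a 47-card remainder would hold C(47,5) = 1,533,939 entries, which at 5 bytes per draw is still well under 10 MB.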
Also, the current design produces the possible draws one by one, and operates on each one individually. But I'd probably get better performance by doing all the draws at once, so that we can obtain some benefit from unrolling the loop and increased cache locality. (We could even maintain the same front-end interface and just generate bigger chunks inside the object implementation.) As things stand now there are a lot of CPU cycles and memory accesses between one draw and the next, though this obviously isn't reflected in my timing program.
Unfortunately there is still one heck of a data dependency if we are looking up 5 different array locations to evaluate the weight of a single draw. A couple ideas come to mind.
The array holding the number of cards per rank has only 13 entries, each with a value of 0 through 4. So we could store the entire array in a general-purpose register or two (only 13 * 3 = 39 bits are needed) if there is some way to extract the values that's less expensive than a data load from the 1st-level cache.
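A sketch of that packing, with 3 bits per rank so extraction is a shift and a mask; whether this actually beats an L1 hit is exactly the open question, and the function names are mine:

```c
#include <stdint.h>

/* Pack the 13 per-rank counts (each 0..4, so 3 bits) into one 64-bit
   word: 13 * 3 = 39 bits total. */
uint64_t pack_counts(const int counts[13]) {
    uint64_t packed = 0;
    for (int r = 0; r < 13; r++)
        packed |= (uint64_t)counts[r] << (3 * r);
    return packed;
}

/* Extract one rank's count: a shift and a mask instead of a load. */
static inline int count_of_rank(uint64_t packed, int rank) {
    return (int)((packed >> (3 * rank)) & 0x7);
}
```

On a 32-bit machine the word would have to be split across two registers, which is presumably why "or two" above.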
The other approach would be to iterate over the array of draws 13 times, once per card rank. We could do a very parallel compare (using an SSE instruction) to determine how many cards of that rank are present, and multiply the associated weight appropriately. But doing 13*N iterations of parallel load/compare/extract/multiply is almost certainly worse than 5*N iterations of indexed load/multiply.
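For reference, here is the 13-pass idea in scalar form; the inner equality test is the part an SSE byte-compare (e.g. `_mm_cmpeq_epi8`) would do 16 lanes at a time, and all names and sizes are illustrative:

```c
/* One pass per rank: count how many cards of that rank each draw
   contains, and fold the rank's weight in that many times. */
#define K 5
#define NUM_DRAWS 3           /* toy number of draws */

void weigh_by_rank(const unsigned char draws[NUM_DRAWS][K], /* rank per card */
                   const double rank_weight[13],
                   double out[NUM_DRAWS]) {
    for (int n = 0; n < NUM_DRAWS; n++)
        out[n] = 1.0;
    for (int r = 0; r < 13; r++) {            /* 13 passes over all draws */
        for (int n = 0; n < NUM_DRAWS; n++) {
            for (int i = 0; i < K; i++)       /* the vectorizable compare */
                if (draws[n][i] == r)
                    out[n] *= rank_weight[r];
        }
    }
}
```

Written out this way, the 13*N vs. 5*N cost comparison in the text is easy to see: every draw is touched on every one of the 13 passes, even though only 5 of those passes can match anything.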
My hand comparison routine needs to be optimized for the common case as well, but I have fewer ideas how to make that more efficient.