Mark Gritter (markgritter) wrote,

New CP2-7 inner loop still sucks

So much for that exercise. New SSE2 code:

53000 microseconds spent scoring hands /
( 128662 relevant hands *
13 possible arrangements to score ) =
32 ns per score calculation =
64 clock cycles

Old branching code:

58000 microseconds spent scoring hands /
( 128662 relevant hands *
13 possible arrangements to score ) =
35 ns per score calculation =
69 clock cycles

Also, unfortunately, the scoring is only taking about 1/3rd to 1/2 of the total time spent evaluating the set of undominated hand arrangements.

In the example above, 184300 microseconds evaluating 13 arrangements vs. 10M hands =
53000 microseconds scoring hands + 131300 microseconds comparing cards for overlap.

131300 microseconds / 10M hands =
13 ns per comparison =
26 clock cycles (including loop)
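For reference, here is a minimal scalar sketch of the kind of overlap comparison being timed. It assumes each hand is stored as a 64-bit mask with one bit per card, which the post does not actually state; `hand_mask`, `hands_overlap`, and `count_relevant` are hypothetical names, not the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a hand is a 64-bit mask with one bit set per card (52 bits used). */
typedef uint64_t hand_mask;

/* Two hands overlap when they share at least one card. */
static int hands_overlap(hand_mask a, hand_mask b) {
    return (a & b) != 0;
}

/* Count the hands in `others` that share no card with `mine`.
 * Only these "relevant" hands would go on to be scored. */
static size_t count_relevant(hand_mask mine, const hand_mask *others, size_t n) {
    assert(others != NULL || n == 0);
    size_t relevant = 0;
    for (size_t i = 0; i < n; i++) {
        relevant += !hands_overlap(mine, others[i]);
    }
    return relevant;
}
```

With this layout the test itself is a single AND and compare, so most of a 26-cycle budget would have to come from the loop and from fetching each hand from memory.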

Check: 26 cycles * 10M hands + 64 cycles * 13 arrangements * 128662 relevant hands = 367 * 10^6 clocks = 184 ms.
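The arithmetic in this post can be reproduced with a few helper functions. This is only a sketch: the 2 GHz clock is an assumption inferred from 32 ns being counted as 64 cycles, and the function names are made up for illustration.

```c
#include <assert.h>

/* Assumption: 2 GHz clock, implied by the post's 32 ns = 64 cycles. */
#define ASSUMED_CLOCK_GHZ 2.0

/* Average cost of one event in nanoseconds, given a total in microseconds. */
static double ns_per_event(double total_us, double events) {
    assert(events > 0.0);
    return total_us * 1000.0 / events;
}

/* Convert nanoseconds to clock cycles at the assumed clock rate. */
static double ns_to_cycles(double ns) {
    return ns * ASSUMED_CLOCK_GHZ;
}

/* Predicted total time in milliseconds: one comparison per hand,
 * plus one score per arrangement for each relevant hand. */
static double predicted_ms(double cmp_cycles, double hands,
                           double score_cycles, double arrangements,
                           double relevant_hands) {
    double clocks = cmp_cycles * hands
                  + score_cycles * arrangements * relevant_hands;
    return clocks / (ASSUMED_CLOCK_GHZ * 1e9) * 1e3;
}
```

Calling `predicted_ms(26, 1e7, 64, 13, 128662)` gives roughly 183.5 ms, which agrees with the 184 ms measured above.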

I can probably shave off a few cycles here, but there aren't huge gains to be had. 26 cycles seems like a lot, but we're thrashing both the L1 and L2 cache with 10M hands, so I think memory bandwidth becomes the main issue.

Perhaps I can make more aggressive use of SIMD. We could look for overlap in two hands at once in a 128-bit register. We also could maybe score four possible arrangements at once. Comparing front, middle, and back in parallel + converting that into a score turned out to be a big pain, but doing front, middle, and back serially for four hands at once should be much more straightforward.
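As a rough sketch of the "two hands at once" idea, the loop below tests two candidate hands per 128-bit register. It assumes hands are 64-bit card masks stored contiguously, which is not confirmed by anything above, and `count_relevant_x2` is a hypothetical name. It falls back to plain scalar code when SSE2 is not available. Whether it would actually be faster is untested; if memory bandwidth is the real limit, it may not help much.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumption: a hand is a 64-bit mask, one bit per card.
 * Counts the hands in `others` that share no card with `mine`. */
static size_t count_relevant_x2(uint64_t mine, const uint64_t *others, size_t n) {
    assert(others != NULL || n == 0);
    size_t relevant = 0;
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i mine2 = _mm_set1_epi64x((long long)mine); /* my hand in both lanes */
    const __m128i zero = _mm_setzero_si128();
    for (; i + 2 <= n; i += 2) {
        /* Load two candidate hands and AND both against my hand. */
        __m128i pair = _mm_loadu_si128((const __m128i *)(others + i));
        __m128i shared = _mm_and_si128(pair, mine2);
        /* SSE2 has no 64-bit equality compare, so compare the four
         * 32-bit pieces with zero and inspect the per-byte mask. */
        int m = _mm_movemask_epi8(_mm_cmpeq_epi32(shared, zero));
        relevant += ((m & 0x00FF) == 0x00FF); /* low lane: others[i]     */
        relevant += ((m & 0xFF00) == 0xFF00); /* high lane: others[i+1]  */
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        relevant += ((mine & others[i]) == 0);
    }
    return relevant;
}
```

A 64-bit lane counts as non-overlapping only when both of its 32-bit halves compare equal to zero, which is why each lane is checked against a full 8-bit group of the byte mask.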
Tags: chinese poker, code