The first major piece of surgery was moving constant tables into "shared memory" on the GPU, instead of device memory. The GPU's L1 cache does not serve as a read cache for device memory! At least, not in the version of the architecture I'm using, 3.2. Later revisions have a "read only" cache and a special intrinsic to use it. The "shared memory" is data that is available to all threads in a "block", and lives in the L1 memory, along with per-thread allocations and register spills.
After this, profiling (with the NVIDIA visual profiler) still showed that L2 cache throughput was maxed out, with a mix of both reads and writes. So the next step was to find a way to move the actual data being hashed into registers instead of memory. Shared memory would not be appropriate as each thread is working on a different input.
After both these steps, profiling shows that the performance is now bounded by compute, and that the kernel occupancy is 97.3%, so the GPU is almost fully utilized. The device is doing slightly more than 100 million MD5 hashes per second. (For this particular problem, only the first word of the hash is relevant.) Further improvements would probably require finding ways to hash with less integer arithmetic--- which is about 80% of the instructions. But, I have not looked at other GPU-based hash calculations and figured out what techniques they are using.