9C5 = 126
16C8 = 12,870
25C13 = 5,200,300
36C18 = 9,075,135,300
49C25 ≈ 6.3 × 10^13
64C32 ≈ 1.8 × 10^18
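The table above is easy to spot-check with Python's `math.comb` (the grid-size/tile-count pairs are taken straight from the table):

```python
from math import comb

# Spot-check the table: C(n*n, k) positions for each grid size.
for n, k in [(3, 5), (4, 8), (5, 13), (6, 18), (7, 25), (8, 32)]:
    print(f"{n * n}C{k} = {comb(n * n, k):,}")
```

The first line prints `9C5 = 126`, matching the table.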
What you need for the brute-force algorithm is basically one bit per tile arrangement (permutation), to record whether you've visited it already, plus a list of the "frontier" of newly generated arrangements:
    next = ∅
    for X in frontier:
        for P in permutations:
            if P.X not already visited:
                mark P.X visited
                next.add(P.X)
    frontier = next
So you only need sequential access to the queue of positions at distance D, in order to calculate D+1; that queue can safely be placed in a file.
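As a runnable sketch of that loop, assume positions are encoded as integers and a `moves(x)` function yields every position one step away (the "P.X" of the pseudocode). The toy move set here --- swapping adjacent bits of a 4-bit mask --- is purely illustrative, not the actual tile game:

```python
def moves(x, width=4):
    """Toy move set: swap any two adjacent, unequal bits of x."""
    for i in range(width - 1):
        if ((x >> i) & 1) != ((x >> (i + 1)) & 1):
            yield x ^ (0b11 << i)    # flipping both bits swaps them

def distances(start):
    """Map every reachable position to its distance from start."""
    visited = {start: 0}             # plays the role of the visited bits
    frontier, d = [start], 0
    while frontier:
        d += 1
        nxt = []
        for x in frontier:
            for y in moves(x):
                if y not in visited:
                    visited[y] = d
                    nxt.append(y)
        frontier = nxt
    return visited

dist = distances(0b0011)   # all C(4,2) = 6 two-bit masks are reachable
```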
9 billion bits is just over a gigabyte, so a "streaming" algorithm where frontier/next are files will scale at least that far. And could even be written in pure Python! The files for each distance can be stored as lists of numbers representing the positions (about 73 GB in total for this size problem at eight bytes per position, less with a packed encoding), or we can include some additional metadata if we want to remember the actual word.
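A sketch of that streaming variant, in pure Python: the visited set is an in-memory bitset (one bit per position), while the queue for each distance lives in a flat file of fixed-size records that is only ever read and written sequentially. The file naming, the 8-byte record format, and the toy move set are all illustrative choices:

```python
import os
import struct
import tempfile

def stream_bfs(start, moves, n_positions, workdir):
    """BFS where each generation's queue is a file of 8-byte records."""
    visited = bytearray((n_positions + 7) // 8)   # one bit per position

    def test_and_set(x):
        byte, bit = divmod(x, 8)
        seen = visited[byte] & (1 << bit)
        visited[byte] |= 1 << bit
        return bool(seen)

    test_and_set(start)
    path = os.path.join(workdir, "dist_0")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", start))
    d, wrote = 0, 1
    while wrote:                         # generation D -> generation D+1
        nxt_path = os.path.join(workdir, "dist_%d" % (d + 1))
        wrote = 0
        with open(path, "rb") as src, open(nxt_path, "wb") as dst:
            while (rec := src.read(8)):  # sequential read of distance D
                (x,) = struct.unpack("<Q", rec)
                for y in moves(x):
                    if not test_and_set(y):
                        dst.write(struct.pack("<Q", y))
                        wrote += 1
        path, d = nxt_path, d + 1
    return d - 1                         # distance of the farthest position

def swap_moves(x, width=4):
    """Toy move set: swap any two adjacent, unequal bits of x."""
    for i in range(width - 1):
        if ((x >> i) & 1) != ((x >> (i + 1)) & 1):
            yield x ^ (0b11 << i)

with tempfile.TemporaryDirectory() as tmp:
    eccentricity = stream_bfs(0b0011, swap_moves, 16, tmp)
```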
How much further could we go using the same basic algorithm? Well, 49C25 bits is about 7.2 terabytes, so in-memory on a single computer is out. But we could certainly use an array of flash drives to store the visited set. (A Tintri T540 has 2.4 TB of raw flash, but you could get 1 TB SSDs if you really want them.) Assuming sufficient parallelism to keep the drives busy, >=100,000 IOPS is feasible, but even if each position required only a single I/O, we're talking 20 years: 6.3 × 10^13 positions at 10^5 IOPS is 6.3 × 10^8 seconds. (We might be somewhat smarter about I/Os, but on the other hand the brute-force approach will visit the same position many times. Even a million IOPS is still 2 years.)
On the other hand, 7.2 TB of memory is only about 31 Amazon EC2 cr1.8xlarge instances (244 GB each), so storing a distributed hash table would be feasible and would solve the problem in a much shorter period of time. You would do this in a sort of map/reduce style, distributing the multi-TB list of positions in generation D to each node and having each independently produce its slice of generation D+1. Each node would categorize each P.X as "visited", "not visited", or "some other node's problem", and then all the "not visited" positions would be merged to form the list for the next iteration. Of course, you'd need at least half a petabyte to store the results, which doesn't come cheaply either. But you could throw money at the problem and it would still be doable within a reasonable amount of time. We're talking about 4 days just to stream half a petabyte at 10 Gbps, though, so "reasonable" is probably at least a couple of weeks.
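A toy, single-process simulation of that partitioning scheme makes the three-way classification concrete. Each "node" owns the positions whose hash falls in its slice and keeps only its own shard of the visited set; everything else is "some other node's problem" and gets forwarded to its owner. The modulo partition, the function names, and the toy move set are all illustrative:

```python
def owner(x, n_nodes):
    return x % n_nodes          # stand-in for a real hash partition

def next_generation(frontier, shards, moves, n_nodes):
    # "Map": every node expands its part of generation D,
    # routing each candidate to the node that owns it.
    outboxes = [[] for _ in range(n_nodes)]
    for x in frontier:
        for y in moves(x):
            outboxes[owner(y, n_nodes)].append(y)
    # "Reduce": each owner dedupes candidates against its visited shard;
    # the surviving "not visited" positions form generation D+1.
    nxt = []
    for node, inbox in enumerate(outboxes):
        for y in inbox:
            if y not in shards[node]:
                shards[node].add(y)
                nxt.append(y)
    return nxt

def swap_moves(x, width=4):
    """Toy move set: swap any two adjacent, unequal bits of x."""
    for i in range(width - 1):
        if ((x >> i) & 1) != ((x >> (i + 1)) & 1):
            yield x ^ (0b11 << i)

n_nodes, start = 3, 0b0011
shards = [set() for _ in range(n_nodes)]
shards[owner(start, n_nodes)].add(start)
frontier, generations = [start], 0
while frontier:
    frontier = next_generation(frontier, shards, swap_moves, n_nodes)
    generations += 1
```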
So what about 8x8? Well, now the hash table itself is 200 petabytes. Even SETI@home-style resources aren't adequate for this: you'd need almost 900,000 computers with 244 GB of memory each to implement the straightforward approach. 200 petabytes is about as large as a storage cluster gets today, or larger, I think--- most examples I can find are no more than half that. (The Internet Archive is at about 10 petabytes, Megaupload held about 30 petabytes, and Facebook's Hadoop cluster is 100 PB physical.) Instead--- assuming you can build out a multimillion-dollar exabyte storage solution for the intermediate steps of the solution--- you'd need to find ways to make CPU-vs-space tradeoffs. The algorithm would have to change at this point, given today's technology, and would probably consume a nontrivial portion of the Internet's communication capacity. (Partitioning a second time, by giving each computer only part of D and part of the visited set, doesn't scale, because you'd need N^2 passes to make sure you found all the previously visited positions in D+1.)
I asked earlier this year whether it's still true that computers can get just one "size" larger than humans--- in this example, I think computers can get 3 sizes larger, assuming the 4x4 grid is human-tractable and the 5x5 is not.