Suppose you've got a large set of boxes, each of which contains a varying number of marbles, each of which has a color.
If you're interested in the color distribution of the marbles, but can only pick boxes at random (not individual marbles), can you still do unbiased sampling? (There might be skew such that boxes with more marbles are predominantly blue, while boxes with fewer marbles are predominantly red and green.)
In the experiment I threw together, random sampling of boxes seems to return the correct distribution of colors, even with a high correlation between the number of marbles in a box and the color of the marbles in it. The sampling error is high, naturally, but the mean looks correct. Is this always true, or is there some adjustment that needs to be made? For example, should you pick just one marble from each selected box? (This seems like it would skew things even further.)
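To make the setup concrete, here is a minimal Python sketch of this kind of experiment. It is not the exact code behind the result above: the box sizes, color probabilities, and function names (`make_boxes`, `pooled_estimate`, `one_marble_estimate`) are all assumptions chosen for illustration. It compares pooling every marble from randomly chosen boxes against drawing a single marble from each chosen box.

```python
import random

COLORS = ["blue", "red", "green"]


def make_boxes(n_boxes, rng):
    """Build boxes where size and color are correlated (illustrative numbers)."""
    boxes = []
    for _ in range(n_boxes):
        if rng.random() < 0.5:
            size = rng.randint(50, 100)      # large box, mostly blue
            weights = [0.8, 0.1, 0.1]
        else:
            size = rng.randint(1, 10)        # small box, mostly red/green
            weights = [0.1, 0.45, 0.45]
        boxes.append(rng.choices(COLORS, weights=weights, k=size))
    return boxes


def true_blue_fraction(boxes):
    """Fraction of blue marbles over the whole population."""
    total = sum(len(box) for box in boxes)
    return sum(box.count("blue") for box in boxes) / total


def pooled_estimate(boxes, n_sample, rng):
    """Pick boxes uniformly at random and count every marble in them."""
    chosen = rng.sample(boxes, n_sample)
    total = sum(len(box) for box in chosen)
    return sum(box.count("blue") for box in chosen) / total


def one_marble_estimate(boxes, n_sample, rng):
    """Pick boxes uniformly at random and draw one marble from each."""
    chosen = rng.sample(boxes, n_sample)
    picks = [rng.choice(box) for box in chosen]
    return picks.count("blue") / len(picks)


def average(estimator, boxes, n_sample, rng, repeats):
    """Mean of an estimator over many repeated samples."""
    return sum(estimator(boxes, n_sample, rng) for _ in range(repeats)) / repeats


if __name__ == "__main__":
    rng = random.Random(0)
    boxes = make_boxes(2000, rng)
    print("true blue fraction:", true_blue_fraction(boxes))
    print("pooled marbles    :", average(pooled_estimate, boxes, 200, rng, 300))
    print("one marble per box:", average(one_marble_estimate, boxes, 200, rng, 300))
```

With these made-up parameters, the pooled estimate averages out close to the true blue fraction, while the one-marble-per-box estimate lands well below it, because it weights every box equally regardless of how many marbles it holds.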
I don't even know what keywords to use to figure this out...