This implementation of Mercy should be pure ANSI C; only the timing harness is Unix-specific. Code that does the same job on other operating systems would be welcome. The implementation achieves just under 9 cycles per byte on most systems I've tried.
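As a starting point for a portable harness, here is a minimal sketch that uses only the ANSI C `clock()` function. It is not the harness shipped with this code. The names `block_fn`, `time_calls`, `cycles_per_byte` and `dummy_xor` are hypothetical, and `dummy_xor` is only a stand-in workload, not Mercy. ANSI C has no way to query the processor's clock rate, so the caller has to supply it.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the routine under test:
   it processes len bytes of buf in place. */
typedef void (*block_fn)(unsigned char *buf, size_t len);

/* Convert measured CPU seconds into cycles per byte.
   clock_hz is an assumption supplied by the caller. */
double cycles_per_byte(double seconds, double clock_hz, double bytes)
{
    assert(bytes > 0.0);
    return seconds * clock_hz / bytes;
}

/* Call fn on buf reps times and return the processor time used,
   in seconds, as reported by clock(). */
double time_calls(block_fn fn, unsigned char *buf, size_t len,
                  unsigned long reps)
{
    clock_t start, end;
    unsigned long i;

    assert(fn != NULL && buf != NULL);
    start = clock();
    for (i = 0; i < reps; i++)
        fn(buf, len);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload so the sketch compiles on its own;
   replace it with the real encryption call. */
void dummy_xor(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] ^= 0x5a;
}
```

To use it, time a large number of repetitions with `time_calls`, then pass the result to `cycles_per_byte` together with the machine's clock rate in Hz and the total number of bytes processed (`len * reps`). The resolution of `clock()` is often coarse, so the run needs to last long enough for the measurement to mean anything.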
The version implemented here is the one described in the proceedings of FSE 2000; I refer to it as Mercy-6. I announced the first version of this algorithm on 17th August on sci.crypt.research. Unfortunately, the example source code I posted contained a bug that weakened the algorithm, so to avoid ambiguity I called the corrected version "Mercy-2". Various incremental revisions followed in Mercy-3 and Mercy-4. After Mercy-4, I decided the cipher needed to be much simpler and easier to analyse, and to carry more state between blocks, so I made several major revisions to create Mercy-5. When the paper was accepted I went back, realised I had missed some easy improvements, and made a couple of very minor tweaks to create Mercy-6.
Scott Fluhrer found a bug in the RC4 implementation used in the key schedule; this bug is fixed here.
Since Mercy-6 has been broken, there may be another revision, but I'm not sure if it'll be called Mercy.