If you're using GHC 7.2.1 and see a performance loss relative to an earlier GHC release, try playing around with the new stack options:
  -kc<size>  Sets the stack chunk size (default 32k)
  -kb<size>  Sets the stack chunk buffer size (default 1k)
The RTS in 7.2.1 allocates stack in separate chunks of 32k each, copying the last 1k of the previous chunk into the new chunk to avoid thrashing at the boundary.
If your program has very volatile stack usage, growing and then shrinking the stack often, then 7.2.1 will be allocating and freeing stack chunks a lot. Increasing the chunk size, e.g. with +RTS -kc1m, might help.
I'd really like to know whether it happens a lot in practice - I've found a couple of microbenchmarks that benefit from larger stack chunk sizes, but larger programs seem to be unaffected.
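For anyone who wants to try this, here is a minimal sketch of the kind of program that might be sensitive to the chunk size. It is not one of the microbenchmarks mentioned above; the names deepSum and churn and the round and depth counts are made up for illustration. It recurses deeply without tail calls and then unwinds, many times over, so the stack should cross chunk boundaries repeatedly.

```haskell
-- Hypothetical microbenchmark (not from the original post): repeatedly
-- grow the stack with a deep non-tail recursion, then let it unwind.
module Main (main) where

import Data.List (foldl')

-- Non-tail-recursive sum: the addition happens after the recursive call
-- returns, so each level keeps a stack frame alive until the base case.
deepSum :: Int -> Int
deepSum 0 = 0
deepSum n = n + deepSum (n - 1)

-- Run the deep recursion 'rounds' times. The depth varies slightly with
-- the round number so the calls are not all the same expression.
churn :: Int -> Int -> Int
churn rounds depth =
  foldl' (\acc i -> acc + deepSum (depth + (i `mod` 2))) 0 [1 .. rounds]

main :: IO ()
main = print (churn 2000 100000)
```

Assuming the file is saved as StackChurn.hs, something like the following should let you compare the default against a larger chunk size (-rtsopts enables the RTS flags, -s prints runtime statistics):

  ghc -O2 -rtsopts StackChurn.hs
  ./StackChurn +RTS -s
  ./StackChurn +RTS -kc1m -s

Whether this shows any difference will depend on your machine and GHC version.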