If you're using GHC 7.2.1 and find a performance loss relative to an earlier GHC, try playing around with the new stack options:

-kc<size> Sets the stack chunk size (default 32k)
-kb<size> Sets the stack chunk buffer size (default 1k)

The RTS in 7.2.1 allocates stack in separate chunks of 32k each, copying the last 1k of the previous chunk into the new chunk to avoid thrashing at the boundary.

If your program has very volatile stack usage, growing and then shrinking the stack often, then 7.2.1 will be re-allocating stack chunks a lot. Increasing the chunk size, e.g. with +RTS -kc1m, might help.
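
To check whether a program is sensitive to the chunk size, a small test case along these lines may help. This is only a sketch: the file name StackChurn.hs, the function sumTo, and the depth and repeat counts are made up for illustration and are not taken from any real benchmark. The idea is to grow the stack well past one 32k chunk and let it shrink again, many times over.

```haskell
-- StackChurn.hs: hypothetical microbenchmark with volatile stack usage.
-- All names and constants here are illustrative assumptions.
module Main (main) where

import Data.List (foldl')

-- Deliberately not tail-recursive: the addition happens after the
-- recursive call returns, so every call keeps a frame on the stack.
-- Stack depth grows in proportion to n and unwinds fully on return.
sumTo :: Int -> Int
sumTo 0 = 0
sumTo n = n + sumTo (n - 1)

main :: IO ()
main = do
  -- Repeat the grow-then-shrink cycle so that stack chunks are
  -- allocated and released over and over.
  let depths = [50000 + (i `mod` 7) | i <- [1 .. 200 :: Int]]
      total  = foldl' (\acc d -> acc + sumTo d) 0 depths
  print total
```

Build it with ghc -O2 -rtsopts StackChurn.hs (GHC 7.x needs -rtsopts at link time before it will accept most RTS flags), then compare ./StackChurn +RTS -s against ./StackChurn +RTS -kc1m -s. If the run times reported by -s differ noticeably, the program is probably churning stack chunks.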

I'd really like to know whether it happens a lot in practice - I've found a couple of microbenchmarks that benefit from larger stack chunk sizes, but larger programs seem to be unaffected.
We have a similar option in Mercury (stack segmentation). It's off by default because it can introduce a slight slowdown; whether it does largely depends on how tail-recursive the code is.

In parallel grades I like to turn stack segmentation on because it makes each context (what GHC calls a thread) smaller and cheaper to allocate. Therefore, programs that allocate many contexts benefit from stack segmentation.