We're moving from one local computational server with 2*Xeon X5650 to another one with 2*Opteron 4280... Today I tried to run my wonderful C programs on the new (AMD) machine and discovered a significant performance drop of more than 50%, with all possible parameters kept the same (even the seed for the random number generator). I started digging into this problem: googling "amd opteron 4200 compiler options" gave me a couple of suggestions, i.e., "flags" (options) for the GCC 4.6.3 compiler available to me. I played with these flags and summarized my findings in the plots below...

I'm wondering if anyone (coding folks) could comment on the subject. I'm especially interested in why "... -march=bdver1 -fprefetch-loop-arrays" and "... -fprefetch-loop-arrays -march=bdver1" yield different runtimes.
I'm also not sure whether, let's say, "-funroll-all-loops" is already included in "-O3" or "-Ofast"; if it is, why does adding this flag one more time make any difference at all?
And why do any additional flags make the performance even worse on the Intel processor (except "-ffast-math", which is kind of obvious because, as I understand it, it enables faster but less precise floating-point arithmetic)?

A bit more details about machines and my program:
The 2*Xeon X5650 machine is an Ubuntu Server with gcc 4.4.3. It is a 2 (CPUs on the motherboard) x 6 (real cores each) x 2 (HyperThreading) = 24-thread machine, and something else was running on it during my "experiments" or benchmarks...

The 2*Opteron 4280 machine is an Ubuntu Server with gcc 4.6.3. It is a 2 (CPUs on the motherboard) x 4 (Bulldozer modules each) x 2 (AMD Bulldozer whatever threading = kind of a core per module) = 16-thread machine, and I was the only one using it, solely for my wonderful "benchmarks"...

My benchmarking program is just a Monte Carlo simulation: it does some IO in the beginning and then runs ~10^5 Monte Carlo loops to give me the result. So I assume it mixes integer and floating-point calculations, looping every now and then and checking whether the randomly generated "result" is "good" enough for me or not... The program is single-threaded, and I launched it with the very same parameters for every benchmark (it is obvious, but I should mention it anyway), including the random generator seed (so the results were 100% identical)... The program IS NOT MEMORY INTENSIVE. The reported runtime is just the "user" time from the standard "/usr/bin/time" command.
(3 plots attached in the album.)