It covers some of the basics of optimizing C++ (especially with LLVM) and the challenges presented by this complex language. Hope folks enjoy.
Note that if you're really into optimizations or compilers, some of this may be a bit boring, but I think it was useful for the audience.
There seem to be a lot of things people don't realize about how P state selection works on Intel processors, and arguably the documentation is slightly confusing in this regard... and things have been changing from generation to generation.
First... why use the word "P state" and not "frequency"? This is important in terms of thinking about how this works.
"Clock frequency" is something that you measure over some period of time, basically an average of how fast a clock signal went up/down. It's something you can measure, but it's backwards looking. Intel CPUs expose two counters (aperf and mperf) via MSRs, and if you look at these two counters at two separate times (far enough apart to avoid rounding effects), the ratio of the deltas of these two counters gives you a very nice "average frequency" over your measurement interval. (The official SDM documentation has the exact formula for this.)
A P state is a number the OS tells the hardware regarding how much performance it would like to see on a certain (logical) cpu; a P state request is very much something forward looking.
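To make the backwards-looking side concrete, here is a minimal sketch of the aperf/mperf calculation. It assumes the commonly used form (base frequency multiplied by the ratio of the two deltas); the SDM remains the authoritative source for the exact formula. The function name, its parameters, and the idea of passing the base frequency in by hand are my own illustration, and counter wraparound is ignored.

```python
def average_frequency_hz(aperf_start, mperf_start, aperf_end, mperf_end, base_hz):
    """Average clock frequency between two samples of aperf and mperf.

    Assumption: both samples come from the same logical CPU, and base_hz is
    the nominal frequency that mperf counts at. Wraparound is not handled.
    """
    delta_aperf = aperf_end - aperf_start
    delta_mperf = mperf_end - mperf_start
    if delta_mperf <= 0:
        # Samples too close together (or out of order): no meaningful ratio.
        raise ValueError("mperf did not advance between the two samples")
    return base_hz * delta_aperf / delta_mperf


# Made-up sample values: aperf advanced half as much as mperf,
# so the average clock was half of a 2 GHz base frequency.
print(average_frequency_hz(0, 0, 1_500_000, 3_000_000, 2e9))
```

Note that this only tells you what already happened over the interval; nothing in it predicts the next interval.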
So how are these related?
In the ten-year-old, single core, no hyperthreading world, things were relatively simple. You could basically map a P state to some "frequency" that you'd get, and as the marketing folks told us, a higher frequency means more performance.
Today, things are much more complex in several key ways.
First of all, and this is important and different from 10 years ago... no matter which P state you ask for, when a logical processor is idle (C state), its frequency is typically 0. The exception to this "typically" is the lightest of the C states (C1), where the frequency is the lowest frequency the CPU supports, and not zero. (but going into C1 is pretty rare, and very short lived, so for this posting, I'm going to ignore C1).
A second important aspect is that of "coordination". For practical reasons, on current Intel processors, all the cores in a package share the same voltage. And because running at a lower frequency than possible at a certain voltage is inefficient, all the cores will also share the same clock frequency at any one time. Of course, except the cores that are idle, because their frequency is zero!
Because the OS will ask each individual logical processor for a separate P state, some reconciliation is needed between the different cores. This reconciliation is actually very simple: at any point in time, the frequency of all the cores is the maximum of what each of the individual cores wants. Of course, minus the idle cores. Their frequency is zero, and the maximum of "something" and "zero" is "something".
A simple example is appropriate here.
Let's take a two core system (core A and core B, that are initially both busy).
Core A would want to have a clock that ticks at 1 GHz, and Core B wants a clock that ticks at 2 GHz.
The maximum of 1 GHz and 2 GHz is... 2 GHz, so Core A and Core B will both run at 2 GHz, even though Core A only asked for 1 GHz.
But now at time X, Core B is going idle. Since an idle core has a frequency of zero, and the maximum of zero and 1 GHz is 1 GHz... Core A now runs with a clock of 1 GHz.
The key thing here is that Core A gets a very variable behavior, independent of what it asked for, due to what Core B is doing.
Or in other words, the forward predictive value of a P state selection on a logical CPU is rather limited.
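The two core example above can be sketched in a few lines. This is a toy model of the "maximum of the non-idle requests" rule only; the function and the core names are hypothetical, and real hardware (as noted next) is more complicated.

```python
def effective_frequency_ghz(requests_ghz, idle_cores):
    """Shared clock of the package: the maximum request among non-idle cores.

    requests_ghz: dict mapping core name -> requested frequency in GHz
    idle_cores:   set of core names currently idle (their frequency is zero)
    """
    active = [freq for core, freq in requests_ghz.items() if core not in idle_cores]
    # If every core is idle there is nothing to run, so report zero.
    return max(active, default=0.0)


requests = {"A": 1.0, "B": 2.0}

# Both cores busy: Core A runs at 2 GHz even though it asked for 1 GHz.
print(effective_frequency_ghz(requests, idle_cores=set()))   # 2.0

# Core B goes idle: Core A drops to the 1 GHz it asked for.
print(effective_frequency_ghz(requests, idle_cores={"B"}))   # 1.0
```

Core A's request never changed between the two calls; only Core B's state did.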
Sound complex? Now imagine that the GPU on die is in many ways like a CPU core.... and realize that what I described above is actually a simplification of reality.
Another development in the last few years has been that of "Turbo".
Some people call it "overclocking", but it isn't overclocking; it's all within the specs of the hardware. Turbo exists because in a multi-core system, it's possible to run a single core faster than the frequency that is on the label of the box when you buy the processor. This has to do with power budgets; when you buy a 35 Watt TDP CPU, the CPU isn't supposed to use more than 35 Watts. So if you have, say, 4 cores, that means each core by itself can use a little less than 9 Watts to fit that budget.
But if 3 of the 4 cores are idle... the one remaining core can use the whole 35 Watts. (Now add in that the GPU also counts into this 35 Watts as do several other shared resources, and it gets much more complex).
If this single core were limited to 9 Watts instead of the full 35W even when the others are idle, a lot of potential performance would be left on the table.
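The arithmetic behind this is just an even split of the budget over the cores that are actually running. A minimal sketch, assuming the whole TDP goes to the CPU cores (which, as mentioned, ignores the GPU and other shared resources); the function name is made up for illustration.

```python
def per_core_budget_watts(tdp_watts, active_cores):
    """Even split of the package power budget over the non-idle cores.

    Simplification: the GPU and other shared resources are not accounted for.
    """
    if active_cores < 1:
        raise ValueError("need at least one active core")
    return tdp_watts / active_cores


print(per_core_budget_watts(35.0, 4))  # 8.75, "a little less than 9 Watts"
print(per_core_budget_watts(35.0, 1))  # 35.0, one core gets the whole budget
```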
Now in the first processors that supported Turbo, the available "extra range" was limited, but this range has been growing and growing as core counts have gone up, power sensors have been added to the CPU and power levels have come down. (don't be surprised to see that your CPU has more levels in the turbo range than it has outside the turbo range)
What does this mean? Well, when the OS asks for a P state value that is in the "Turbo Range", it may not actually get the performance that maps to that level; the sum of the power in the system could exceed the allowed TDP value if that performance (clock frequency) were granted to all cores (remember from above that all running cores share a clock frequency).
What you do get at any one point in time depends on what other cores and the GPU etc are doing.... and this will vary over time as cores go idle or become active, or as the GPU finishes a frame or starts a new complex frame... and even with temperature.
Or in other words, what frequency you get is highly dependent on other things including the C state selection policy and the graphics subsystem.
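As a rough illustration of why a turbo request may not be granted, here is a toy model. Everything in it is an assumption for the sake of the example: the per-level power numbers are invented, the "walk down until it fits" policy is not how any particular CPU decides, and temperature is ignored entirely.

```python
# Hypothetical per-core power draw in Watts at each P state level
# (index 0 = slowest). These numbers are made up for illustration.
LEVEL_POWER_WATTS = [4.0, 6.0, 8.0, 12.0, 18.0, 26.0]


def granted_level(requested_level, active_cores, other_power_watts, tdp_watts=35.0):
    """Highest level at or below the request whose total power fits the TDP.

    other_power_watts stands in for the GPU and other shared resources.
    All active cores share one level, since they share one clock.
    """
    for level in range(requested_level, -1, -1):
        total = active_cores * LEVEL_POWER_WATTS[level] + other_power_watts
        if total <= tdp_watts:
            return level
    return 0  # nothing fits; fall back to the lowest level


# Same request (level 5) every time, different results depending on
# how many other cores happen to be busy:
print(granted_level(5, active_cores=1, other_power_watts=5.0))  # 5
print(granted_level(5, active_cores=2, other_power_watts=5.0))  # 3
print(granted_level(5, active_cores=4, other_power_watts=5.0))  # 1
```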
Another fun angle is that when a task is running completely memory bound, the performance of this task is basically independent of the clock frequency.... and some systems will detect this condition and temporarily lower the clock frequency to save power without reducing performance too much (all within the bounds of all the things I described above).
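A simple first-order way to see this: split the run time into a part that scales with the clock and a part spent waiting on memory that does not. This is a textbook-style toy model with made-up numbers, not a description of how any system detects the memory bound condition.

```python
def run_time_seconds(compute_cycles, memory_stall_seconds, frequency_hz):
    """Toy model: compute time scales with the clock, memory stalls do not."""
    return compute_cycles / frequency_hz + memory_stall_seconds


# A memory bound task: 1e8 cycles of compute, 1 second waiting on memory.
fast = run_time_seconds(1e8, 1.0, 2e9)  # about 1.05 s at 2 GHz
slow = run_time_seconds(1e8, 1.0, 1e9)  # about 1.10 s at 1 GHz
print(fast, slow)
```

In this made-up case, halving the clock costs roughly 5% in run time, which is the kind of trade that makes temporarily lowering the frequency attractive.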
If it wasn't clear yet, a lot of what I described above varies from generation to generation quite a bit... and it's going to change quite a bit more in the next few years.
In the 3.9 kernel we've introduced a new controller driver for the P states, simply because the previous, 10+ year old algorithm wasn't cutting it anymore; too much has changed. By making the driver CPU generation specific, we can now select and tune algorithms for each specific generation, and do significantly better (30%+) than when we used a very generic algorithm.
Another thing to realize from all of this is that while it's easy to talk and look at performance looking backwards (aperf/mperf allow us to do that), predicting performance going forward, even if you are very deliberately picking a P state value, is often near impossible since what you will actually get depends a LOT on what the other parts of the system are doing.
In there, hidden, is one picture of noticeably poor quality - but that was just too good not to post...