Simon Liu

Stream

Simon Liu

Real World OCaml is now finished! You can get it on Amazon, or just go to realworldocaml.org to read it for free. Note that the old requirement for a GitHub login has now been lifted.
1 comment on original post

Simon Liu

Prelude If there is something that people love as much as tweaking their editing configurations it’s probably the selection of color themes. A …

Simon Liu changed his profile photo.

2 comments
Hahaha, only so-so handsome..

Simon Liu

Hi, I continue my work on speeding up Go language. The next major release will feature faster parallel garbage collector, goroutine blocking profiler (enabled with -blockprofile flag) and a lot of oth...

Simon Liu

The video for my keynote at C++Now 2013 is up! =D

It covers some of the basics of optimizing C++ (especially with LLVM) and the challenges presented by this complex language. Hope folks enjoy.

Note that if you're really into optimizations or compilers, some of this may be a bit boring, but I think it was useful for the audience.
2 comments on original post

Simon Liu

Some basics on CPU P states on Intel processors

There seem to be a lot of things people don't realize about how P state selection works on Intel processors, and arguably the documentation is slightly confusing in this regard... and things have been changing from generation to generation.

First... why use the word "P state" and not "frequency"? This is important in terms of thinking about how this works.

"Clock frequency" is something that you measure over some period of time, basically an average of how fast a clock signal went up and down. It's something you can measure, but it's backward looking. Intel CPUs expose two counters (aperf and mperf) via MSRs, and if you read these two registers at two separate times (far enough apart to avoid rounding effects), the ratio of the deltas in these two registers gives you a very nice "average frequency" over your measurement interval. (The official SDM documentation has the exact formula for this.)

A P state is a number the OS tells the hardware regarding how much performance it would like to see on a certain (logical) cpu; a P state request is very much something forward looking.
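
The backward-looking measurement described above can be sketched in a few lines. This is a minimal sketch, not the SDM's exact text: it assumes the commonly used form, where the average frequency while the CPU was running is the base (non-turbo) frequency scaled by the ratio of the aperf delta to the mperf delta. The function name, the `base_hz` parameter, and the sample counter values are made up for illustration; reading the real MSRs needs privileged access and is left out.

```python
def average_frequency_hz(aperf_start, mperf_start, aperf_end, mperf_end, base_hz):
    """Average clock frequency over the interval between two counter samples.

    Assumption: mperf advances at the fixed base frequency and aperf at the
    actual frequency, both only while the logical CPU is not idle.
    """
    delta_aperf = aperf_end - aperf_start
    delta_mperf = mperf_end - mperf_start
    if delta_mperf <= 0:
        # Samples too close together, or a counter wrapped between reads.
        raise ValueError("need two samples with mperf strictly increasing")
    return base_hz * delta_aperf / delta_mperf


# Hypothetical readings on a part with a 2 GHz base frequency.
avg = average_frequency_hz(
    aperf_start=1_000, mperf_start=5_000,
    aperf_end=3_000_001_000, mperf_end=2_000_005_000,
    base_hz=2_000_000_000,
)
print(f"{avg / 1e9:.2f} GHz")  # prints 3.00 GHz
```

Under these assumptions, a result above the base frequency, as in this made-up sample, would mean the CPU spent the interval in the turbo range.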

So how are these related? 
In the ten year old, single core, no hyperthreading world, things were relatively simple. You could basically map a P state to some "frequency" that you'd get, and as the marketing folks told us, a higher frequency means more performance.

Today, things are much more complex in several key ways.

First of all, and this is important and different from 10 years ago... no matter which P state you ask for, when a logical processor is idle (C state), its frequency is typically 0. The exception to this "typically" is the lightest of the C states (C1), where the frequency is the lowest frequency the CPU supports, and not zero. (Going into C1 is pretty rare and very short-lived, though, so for this post I'm going to ignore C1.)

A second important aspect is that of "coordination". For practical reasons, on current Intel processors, all the cores in a package share the same voltage. And because running at a lower frequency than possible at a certain voltage is inefficient, all the cores will also share the same clock frequency at any one time. Except, of course, the cores that are idle, because their frequency is zero!
Because the OS will ask for a separate P state on each individual logical processor, some reconciliation is needed between the different cores. This reconciliation is actually very simple: at any point in time, the frequency of all the cores is the maximum of what each of the individual cores wants. Minus the idle cores, of course: their frequency is zero, and the maximum of "something" and "zero" is "something".

A simple example is appropriate here.
Let's take a two-core system (core A and core B, both initially busy).
Core A wants a clock that ticks at 1 GHz, and core B wants a clock that ticks at 2 GHz.
The maximum of 1 GHz and 2 GHz is... 2 GHz, so core A and core B will both run at 2 GHz, even though core A only asked for 1 GHz.
But now, at time X, core B goes idle. Since an idle core has a frequency of zero, and the maximum of zero and 1 GHz is 1 GHz... core A now runs with a clock of 1 GHz.

The key thing here is that Core A gets a very variable behavior, independent of what it asked for, due to what Core B is doing.
Or in other words, the forward predictive value of a P state selection on a logical CPU is rather limited.
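
The reconciliation rule and the two-core example above can be written out as a short sketch. It models only the simplified rule from this post (the maximum over the non-idle cores); the function name and the use of `None` to mark an idle core are illustrative choices, and real hardware also factors in the GPU, turbo, and power limits.

```python
def package_frequency_ghz(requests_ghz):
    """Frequency shared by all running cores in a package.

    requests_ghz maps a core name to the frequency it asked for, or to
    None when that core is idle. Idle cores count as zero, so they never
    win the maximum.
    """
    active = [freq for freq in requests_ghz.values() if freq is not None]
    return max(active) if active else 0.0


# Both cores busy: core A asked for 1 GHz but runs at core B's 2 GHz.
print(package_frequency_ghz({"A": 1.0, "B": 2.0}))   # prints 2.0

# Core B goes idle at time X: core A drops back to the 1 GHz it asked for.
print(package_frequency_ghz({"A": 1.0, "B": None}))  # prints 1.0
```

Core A's request never changed between the two calls, yet the frequency it got did.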

Sound complex? Now imagine that the GPU on die is in many ways like a CPU core.... and realize that what I described above is actually a simplification of reality.

Another development in the last few years has been that of "Turbo".
Some people call it "overclocking", but it isn't overclocking; it's all within the specs of the hardware. Turbo exists because in a multi-core system, it's possible to run a single core faster than the frequency that is on the label of the box when you buy the processor. This has to do with power budgets; when you buy a 35 Watt TDP CPU, the CPU isn't supposed to use more than 35 Watts. So if you have, say, 4 cores, that means each core by itself can use a little less than 9 Watts to fit that budget.
But if 3 of the 4 cores are idle... the one remaining core can use the whole 35 Watts. (Now add in that the GPU also counts toward this 35 Watts, as do several other shared resources, and it gets much more complex.)
If this single core were limited to 9 Watts instead of the full 35 Watts even when the others are idle, a lot of potential performance would be left on the table.
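
The power-budget arithmetic behind Turbo is easy to check. The sketch below is a deliberately naive model that splits the TDP evenly over the cores that are running; it ignores the GPU and the other shared resources that, as noted above, also draw from the same budget. The function name is hypothetical.

```python
def per_core_budget_watts(tdp_watts, active_cores):
    """Even split of the package power budget over the running cores.

    Naive model: ignores the GPU and other shared resources.
    """
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    return tdp_watts / active_cores


print(per_core_budget_watts(35, 4))  # prints 8.75, "a little less than 9 Watts"
print(per_core_budget_watts(35, 1))  # prints 35.0, one core gets the whole budget
```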

Now in the first processors that supported Turbo, the available "extra range" was limited, but this range has been growing and growing as core counts have gone up, power sensors have been added to the CPU, and power levels have come down. (Don't be surprised to see that your CPU has more levels in the turbo range than it has outside the turbo range.)

What does this mean? Well, when the OS asks for a P state value that is in the "Turbo Range", it may not actually get the performance that maps to that level; the sum of the power in the system could exceed the allowed TDP value if that performance (clock frequency) were granted to all cores (remember from above that all running cores share a clock frequency).
What you do get at any one point in time depends on what the other cores, the GPU, etc. are doing... and this will vary over time as cores go idle or become active, or as the GPU finishes a frame or starts a new complex frame... and even with temperature.
Or in other words, what frequency you get is highly dependent on other things, including the C state selection policy and the graphics subsystem.

Another fun angle is that when a task is running completely memory bound, the performance of this task is basically independent of the clock frequency.... and some systems will detect this condition and temporarily lower the clock frequency to save power without reducing performance too much (all within the bounds of all the things I described above).
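
A toy model makes the memory-bound case concrete. It is an assumption for illustration, not something taken from the hardware documentation: runtime is split into a part that scales with the core clock (`cpu_cycles`) and a part spent waiting on memory that does not (`memory_stall_seconds`). All names and numbers are made up.

```python
def runtime_seconds(cpu_cycles, memory_stall_seconds, freq_hz):
    """Toy runtime model: core work scales with the clock, memory waits do not."""
    return cpu_cycles / freq_hz + memory_stall_seconds


# Compute-bound task: halving the clock doubles the runtime.
print(runtime_seconds(2e9, 0.0, 2e9))  # prints 1.0
print(runtime_seconds(2e9, 0.0, 1e9))  # prints 2.0

# Memory-bound task: halving the clock costs only about 5% here.
fast = runtime_seconds(1e8, 1.0, 2e9)
slow = runtime_seconds(1e8, 1.0, 1e9)
print(f"{fast:.2f} s at 2 GHz, {slow:.2f} s at 1 GHz")
```

In this model, the lower clock in the memory-bound case saves power for a small loss in speed, which is the trade described above.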

If it wasn't clear yet, a lot of what I described above varies quite a bit from generation to generation... and it's going to change quite a bit more in the next few years.

In the 3.9 kernel we've introduced a new controller driver for the P states, simply because the previous, 10+ year old algorithm wasn't cutting it anymore; too much has changed. By making the driver CPU generation specific, we can now select and tune algorithms for each specific generation, and do significantly better (30%+) than when we used a very generic algorithm.

Another thing to realize from all of this is that while it's easy to talk and look at performance looking backwards (aperf/mperf allow us to do that), predicting performance going forward, even if you are very deliberately picking a P state value, is often near impossible since what you will actually get depends a LOT on what the other parts of the system are doing.
39 comments on original post

Simon Liu

So beautiful~~

Finally, with a reasonable internet connection during our layover in Narita, more and higher resolution pictures from Palau.
In there, hidden, is one picture of noticeably poor quality - but that was just too good not to post... 
26 comments on original post

People
In his circles
115 people
Have him in circles
164 people
黄蜻蜓's profile photo
邢大鹏's profile photo
張顥霖's profile photo
刘奎斌's profile photo
eJon Hao's profile photo
Kainy Guo's profile photo
Ye Qi's profile photo
Miracle Builder's profile photo
Kasteluo Z's profile photo
Basic Information
Gender
Male
Story
Tagline
There is no sadder sight than a young pessimist.
Introduction
do not fear to be eccentric in opinion, for every opinion now accepted was once eccentric. 
Links
YouTube
Contributor to
Simon Liu's +1's are the things they like, agree with, or want to recommend.
Accept-Encoding, It's Vary important. - MaxCDN Blog
blog.maxcdn.com

One of the best things about running BootstrapCDN are the new things I've learned about web performance. Today, I'd like to share a few insi

Twitter
twitter.com

Instantly connect to what's most important to you. Follow your friends, experts, favorite celebrities, and breaking news.

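30% off all JetBrains developer tools - OSChina (开源中国)
www.oschina.net

JetBrains personal edition product list. To purchase a company commercial license, please click here. AppCode Personal License: $ 99.0$ 70.0/439 RMB: Buy now. dotCover Personal License: $ 99.0$ 70.0/439 RMB: Buy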

Color Theming in Emacs: Reloaded - (think)
batsov.com

Prelude If there is something that people love as much as tweaking their editing configurations it’s probably the selection of color themes.

C++ and Beyond 2012: Herb Sutter - C++ Concurrency (Channel 9)
channel9.msdn.com

Herb Sutter presents C++ Concurrency. This was filmed at C++ and Beyond 2012. Get Herb's slides for this session. Herb says: I've spoken and

Combiner/Aggregator Synchronization Primitive | Intel® Developer Zone
software.intel.com

Combiner/Aggregator synchronization primitive provides mutual exclusion like a mutex, but can be significantly faster in some situations due

Google Guava
plus.google.com

java opensource library collections google concurrency

Writing Go in Emacs
dominik.honnef.co

Writing Go in Emacs. Using the right tools plays a big role in getting your job done efficiently. In the case of programming Go, most of you

LLVM Project Blog: Status of the C++11 Migrator
blog.llvm.org

Since the design document for cpp11-migrate, the C++11 migrator tool, was first proposed in early December 2012 development has been making

Netmap: A Novel Framework for High Speed Packet I/O
www.youtube.com

Google Tech Talk (more info below) August 8, 2011 Presented by Luigi Rizzo, Universita` di Pisa ABSTRACT Software packet processing at line

Mail Checker Plus for Google Mail™
chrome.google.com

Displays the number of unread messages in your Gmail and Google Apps inbox. Preview mail, read, delete, archive and mark as spam!

Emacs 24.1 released
lists.gnu.org

GNU Emacs 24.1 has been released. It is available on the GNU ftp site at ftp.gnu.org/gnu/emacs/. See http://www.gnu.org/order/ftp.html for a

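Sina Weibo login - Sina Weibo: share what's new around you, anytime, anywhere
weibo.com

Don't have a Sina Weibo account yet? Sign up now · Invite friends to join Weibo and win big prizes! >> Log in with another account: MSN | Tianyi | China Unicom | 360 · Use Sina Weibo on your phone · Weibo help, feedback, open platform, Weibo jobs, Sina site navigation · Report harmful content. Customer service: 400 096 0960 (individuals) 400 098 0980 (enterprise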

SPDY Review from Martin Nilsson on 2012-06-07 ( from April to June 2012)
lists.w3.org

W3C home > Mailing lists > Public > ietf-http-wg@w3.org > April to June 2012. SPDY Review. This message : [ Message body ] [ Respond ] [ Mor

MySQL Proxy - MySQL Forge Wiki
forge.mysql.com

MySQL Proxy. MySQL Proxy is a simple program that sits between your client and MySQL server(s) that can monitor, analyze or transform their

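[Qunar] Flight search, special-price tickets, discount air tickets - Qunar.com
flight.qunar.com

Qunar (Qunar.com), the world's largest Chinese-language travel search engine, aggregates and publishes flights, hotels, and travel routes, and provides a professional, real-time, trustworthy system for comparing travel product prices and services. It helps consumers easily make a well-informed choice, and is your best option for booking flights, hotels, and travel routes!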

Joshua Zhu’s Blog » Apache 2.4 Faster Than Nginx?
feedproxy.google.com

Some reports came out recently after Apache 2.4 was released, saying it's “as fast, and even faster than Nginx”. To check it out if it

C Programmers, Time To Try Ada
electronicdesign.com

Technology Editor Bill Wong takes another look Ada 2012 and thinks it is time for other C programmers to do the same. It has significant adv