One of the things I end up doing is a lot of performance profiling on core kernel code, particularly the VM and filesystem. 

And I tend to do it for the "good case" - when things are pretty much perfectly cached.  Because while I do care about IO, the loads I personally run tend to be things that cache well. For example, one of my main loads tends to be to do a full kernel build after most of the pulls I do, and it matters deeply to me how long that takes, because I don't want to do another pull until I've verified that the first one passes that basic sanity test.

Now, the kernel build system is actually pretty smart, so for a lot of driver and architecture pulls that didn't change some core header file, that "recompile the whole kernel" doesn't actually do a lot of building: most of what it does is check "ok, that file and the headers it depends on haven't changed, so nothing to do". 

But it does that for thousands of header files, and tens of thousands of C files, so it all does take a while. Even a fully built kernel ("allmodconfig", so a pretty full build) takes about half a minute on my normal desktop to say "I'm done, that pull changed nothing I could compile".
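
The "nothing to do" check described above comes down to comparing timestamps. A minimal sketch in Python, assuming a plain mtime comparison (GNU make's real logic has more cases, and `is_up_to_date` plus the file names are hypothetical):

```python
import os


def is_up_to_date(target, deps):
    """Return True if target exists and is at least as new as every dependency.

    Simplified model of a make-style timestamp check. It costs one stat()
    call per file, which is why a no-op build still touches every file.
    """
    try:
        target_mtime = os.stat(target).st_mtime_ns
    except FileNotFoundError:
        return False  # a missing target always needs a rebuild
    # A missing dependency raises FileNotFoundError here; this sketch
    # does not try to handle that case.
    return all(os.stat(dep).st_mtime_ns <= target_mtime for dep in deps)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(is_up_to_date("foo.o", ["foo.c", "foo.h"]))
```

Even when every answer is "up to date", each file still needs a pathname lookup and a stat, so the kernel work scales with the size of the tree rather than with the size of the change.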

Ok, so half a minute for an allmodconfig build isn't really all that much, but it's long enough that I end up waiting for it before I can do the next pull, and short enough that I can't just go take a coffee break.

Annoying, in other words.

So I profile that sh*t to death, and while about half of it is just "make" being slow, this is actually one of the few very kernel-intensive loads I see, because it's doing a lot of pathname lookups and spawns a fair number of small short-lived processes (small shell scripts, "make" just doing fork/exit, etc).

The main issue used to be the VFS pathname lookup, and that's still a big deal, but it's no longer the single most noticeable one.

Most noticeable single cost? Page fault handling by the CPU.

And I really mean that "by the CPU" part. The kernel VM does really well. It's literally the cost of the page fault itself, and (to a smaller degree) the cost of the "iret" returning from the page fault.

I wrote a small test-program to pinpoint this more exactly, and it's interesting. On my Haswell CPU, the cost of a single page fault seems to be about 715 cycles. The "iret" to return is 330 cycles. So just the page fault and return is about 1050 cycles. That cost might be off by some small amount, but it's close. On another test case, I got a number that was in the 1150 cycle range, but that had more noise, so 1050 seems to be the minimum cost.

Why is that interesting? It's interesting, because the kernel software overhead for looking up the page and putting it into the page tables is actually much lower. In my worst-case situation (admittedly a pretty made up case where we just end up mapping the fixed zero-page), those 1050 cycles are actually 80.7% of all the CPU time. That's the extreme case where neither kernel nor user space does much of anything other than fault pages, but on my actual kernel build, it's still 5% of all CPU time.

On an older 32-bit Core Duo, my test program says that the page fault overhead is "just" 58% instead of 80%, and it does seem to be because page faults have gotten slower (the cost on Core Duo seems to be "just" 700 + 240 cycles).

Another part of it is probably because Haswell is better at normal code (so the fault overhead is relatively more noticeable), but it was sad to see how this cost is going in the wrong direction.
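
The numbers above can be turned into a rough split between the hardware fault cost and everything else. A small worked sketch, assuming the quoted percentage is simply (fault + iret cycles) divided by total cycles per faulting iteration; that reading, and the helper name `implied_cycles`, are assumptions and not something the post states:

```python
def implied_cycles(fault_cycles, fault_share):
    """Split the cycles per faulting iteration into fault cost and the rest.

    fault_cycles: hardware cost of the fault plus the iret, in cycles.
    fault_share:  fraction of all CPU time attributed to that cost.
    Returns (total_cycles, non_fault_cycles).
    """
    total = fault_cycles / fault_share
    return total, total - fault_cycles


# Figures quoted in the post: ~1050 cycles at 80.7% on Haswell,
# 700 + 240 cycles at roughly 58% on Core Duo.
haswell_total, haswell_rest = implied_cycles(1050, 0.807)
core_duo_total, core_duo_rest = implied_cycles(700 + 240, 0.58)

print(f"Haswell:  total ~{haswell_total:.0f} cycles, non-fault ~{haswell_rest:.0f}")
print(f"Core Duo: total ~{core_duo_total:.0f} cycles, non-fault ~{core_duo_rest:.0f}")
```

Under that assumption the non-fault work per iteration is roughly 250 cycles on Haswell against roughly 680 on Core Duo. That fits both halves of the explanation: the fault itself got slower, and the code around it got faster.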

I'm talking to some Intel engineers, trying to see if this can be improved. 
Great you care about performance so much. This is why git is so extremely fast ;)
page faults: I'd guess some of it is the slowly increasing size of x86 state and the time it takes to load it back from cache/RAMs?
+Linus Torvalds At least the time taken for the build and make is a lot faster these days with the new and fancy processors Intel throws out these days. I'm stuck with a 2003 Intel Celeron and a 2009 Intel Atom for kernel compiling. Now while these are compiling, I could have a coffee banquet. 
Interesting how building a kernel is a great stress test for the kernel in general and the file system in particular. I guess to a man with a hammer ...
How about "compiling makefiles": take the Makefiles, figure out all the cross dependencies, and cache all that information for all the files in some format that can be quickly reloaded if the Makefiles haven't changed? What difference would that make to your build time, do you think?
Side note: when I say "cost of iret", I don't have much visibility into how much of that cost is actually the iret microcode, and how much of it is the restarting of the instruction the iret returns to. Some part of that 330 cycles is bound to be the TLB refill from the page table entry that just got filled, which may involve some microfault of its own etc.

So if you're an intel CPU engineer, and say "I can prove that iret takes just 250 cycles", you may well be right. I don't have access to some cycle-accurate simulator. But 330 cycles seems to be the actual cost of returning to the instruction.

Same goes for the page fault path. I'm sure there's a complex dance with TLB miss and instruction replay and microfaults with pipeline flushes etc. But that 700+ cycles, wherever it comes from, is too damn much.
We need drivers for Chromebook trackpads please please please.
+Linus Torvalds
What do you think about other build systems, especially TreeUP?

I do understand that you're more into micro-optimisations for your usecase here, however the standard kernel dev usually has only a narrow scope (i.e. change one file at a time, all day long, and recompiling) and this is where tup shines as it doesn't need to treewalk the whole tree.
I don't think Linus decides what init system distros use. :)
+Peter da Silva oh, it's absolutely true that 'make' is a pig, and does too much, and we don't exactly help the situation by using tons of GNU make features and complex variables and various random shell escapes etc etc.

So there's no question that some "makefile compiler" could optimize this all. But quite frankly, I've had my fill of random make replacements. imake, cmake, qmake, they all solve some problem, and they all have their own quirks and idiocies.

So while I'd love for 'make' to be super-efficient, at the same time I'd much rather optimize the kernel to do what make needs really well, and have CPU's that don't take too long either.

Because let's face it, even if the kernel build process was some super-efficient thing, real life isn't that anyway. I guarantee you that the "tons of small scripts etc" that the kernel build does is a real load somewhere totally unrelated. Optimizing page faults will help other loads.
I think the page fault cost will depend a lot on the page table level where it finally finds out that page is not present. So at pte level it will be the worst. And the presence of the relevant page tables in the L2 cache will play a big role. So it depends on how your experiment is set up. And there are counters to tell you how many page walks had to be done etc.
Once I managed to drive the cost of successful translation to 50 cycles on Intel and 100+ on AMD, and that was with page tables cached.
cmake and qmake are not make replacements, they generate makefiles. Or other stuff, e.g. ninja files (which is much faster than make by the way, especially when it comes to doing nothing -- as in, "ten times as fast" when there's nothing to do at all).
+Mike Grunewald there are ones that work... The haswell script the chrubuntu guys put together works with ubuntu 13.10 and 14.04, not sure about other versions.
Might be time to write an OS for making an OS.
+Jesper Dangaard Brouer for this case: do instruction level profiles to get percentage of cycles spent at fault site and first instruction of the kernel. Then use that, total cycles, and number of page faults to calculate the cost of one instance.
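
The calculation suggested above is a single line of arithmetic. A minimal sketch, where the function name and the sample numbers are hypothetical and only illustrate the formula:

```python
def cycles_per_fault(total_cycles, fault_fraction, fault_count):
    """Estimate the cost of one page fault from an instruction-level profile.

    total_cycles:   cycles spent in the whole run.
    fault_fraction: share of cycles attributed to the faulting instruction
                    plus the first instruction of the kernel fault handler.
    fault_count:    number of page faults taken during the run.
    """
    if fault_count <= 0:
        raise ValueError("fault_count must be positive")
    return total_cycles * fault_fraction / fault_count


# Made-up inputs: 2.6e9 cycles, 40% at the two sites, one million faults.
print(cycles_per_fault(2_600_000_000, 0.40, 1_000_000))
```

The estimate is only as good as the profile's attribution, since skid can smear samples onto neighbouring instructions.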
+Linus Torvalds Very interesting read, but why don't you set up a blog? It would be easier for some people to read, since not everyone uses g+.
What the hell r u guys talking abt! It sounds impressive.
I think that improving the build system itself would bring much bigger gains. Like replacing make with ninja, for example. Or completing the work to build with clang.
+arno robin werkman How is a G+ stream different from a blog...? People can read it even without a G+ account, and in many cases (e.g. Linus's stream), it even has an easy-to-remember URL....
Why don't you write a script (or something) that does a pull, then a compile, and upon success does another pull (rinse... repeat...)? Then you would have enough time for a coffee break.
+Michael Warburton Always seemed to me a less inefficient GNU make would solve a lot: for example, if Linux provided a batch stat call like the NT kernel does, GNU make could skip a lot of individual vsyscalls. Similarly if there were a batch async file open syscall with full aio syscall implementation rather than the user space implementation currently used by Linux, again that would help a lot. That said, I would imagine GNU make would need a very substantial refactor to be async file i/o capable sadly, but batching up file stats and opens ought to be doable.
Just some thoughts +Linus Torvalds that will either help or amuse you greatly.

Be interesting what results you would get with the Intel compiler, or another compiler, just to prove the issue more.

Also, it does seem possible that over time Intel may well have designed its chips to handle sloppy code better, at the expense of the less common well-written code.

Though comparing a 32 bit system to a 64 bit system with regards to page faults when one will be mapping twice the number of bits due to memory addressing then I could see how things will be slower.  I would also suspect a kernel page fault handling would be faster as the compiler would be better able to optimise the code, as well as laying out the code in a way that makes the magic optimizations inside the chip kick in more in branch predictions etc etc.

But nice to see you also performance profile on an older, lesser system. I've always sworn the best way to optimise something is to see it really struggle; it's also good for making your code more robust.
+Connor Blakey I'd rather have an older processor for compatibility (or support) for Linux. Clover trail atom processor doesn't seem to feed up any kind of kernel... And the reason to it is moving out from the bare built-in hardware that was on the market with it (basically being stuck on a single OS)
Will it make a difference if you use the Intel compiler?
+stan wong It should not, but if it does then that would highlight the issue in another area.  But any other compiler should at least be able to get the same results or very close, and if not, that again points to the issue not being entirely the CPU's fault.  But I suggested the Intel one as it would be the way to highlight an issue to Intel better.  Though I'm sure Linus has enough credentials for them to take his word for the Earth being flat if he said it was, us mere mortals would have to at least provide more data to get the same level of affection.
Bits, is bits, is bits ,,,I would shorten all paths to single letters, \a , \b , etc. Moving all pertaining builds to those directories, no use spending 64 bytes to get there, just saying
Man, are you asking for changes to the architecture of the microprocessors?? (deep^deep insights)
+Linus Torvalds Do transparent hugepages help? I'd expect them to decrease the cache miss cost of page table walking, though I don't know at what cost.
+arno robin werkman g+ is perfectly fine to throw out a thought and see if a golden nugget of information appears from the net. Blogs get freakishly messy in a heartbeat. 
+Don Lafontaine I agree. Imagine all the kids that would love to hack Linus Torvalds blog. He would be submitting more WordPress patches than kernel patches.
It really may be worth having someone look into a CMake build for the kernel. We just switched to it in WebKitGTK, and CMake + ninja is at least an order of magnitude faster for rebuilds. I'd definitely advise having someone else do it though, since CMake is nowhere near as easy to extend as Make is.
+Chris Snook transparent hugepages don't help for a very simple reason: most of the page faults are for code, not data.  So it's file mappings - and nobody sane thinks that hugepages are good for file caches (although insane people clearly do exist, and so clueless people have talked about it).

What does help is the new "fault-around" code we do for file mappings, and we opportunistically fill in the page tables around the faulting address with cached file data that we already have. That's actually quite noticeable, but the "5% CPU time on page faults" is with that optimization.
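
To make the "fault-around" idea concrete, here is a simplified sketch of choosing the window of pages around a faulting address. It is not the kernel's code: the 16-page window, the function names, and the set-based page cache are assumptions for illustration, and the real implementation has additional constraints (for example page-table boundaries) that this ignores.

```python
PAGE_SIZE = 4096


def fault_around_window(fault_addr, vma_start, vma_end,
                        window_bytes=16 * PAGE_SIZE):
    """Return (start, end) of the address range to map around fault_addr.

    The window is aligned down to window_bytes and clamped to the mapping
    [vma_start, vma_end). vma_start and vma_end are assumed page-aligned;
    end is exclusive.
    """
    if not vma_start <= fault_addr < vma_end:
        raise ValueError("fault address is outside the mapping")
    start = fault_addr - (fault_addr % window_bytes)  # align down
    end = start + window_bytes
    return max(start, vma_start), min(end, vma_end)


def pages_to_map(start, end, cached_pages):
    """Pick only the pages in [start, end) that are already in the cache.

    cached_pages is a set of page-aligned addresses standing in for the
    page cache; nothing is read from disk to satisfy fault-around.
    """
    return [addr for addr in range(start, end, PAGE_SIZE)
            if addr in cached_pages]


if __name__ == "__main__":
    lo, hi = fault_around_window(0x12345, 0x10000, 0x40000)
    print(hex(lo), hex(hi))
```

One fault then populates up to a whole window of page table entries, so neighbouring accesses no longer pay the hardware fault and iret cost individually.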
Is there a way to reduce the number of page faults? Prefetch or batch calls? 
+Linus Torvalds Good to know.

I've actually seen code that mremap()s itself to hugepages in userspace. It's nuts, but when some other nut has forced you to statically link everything into 2 GB binaries, and the binary is the primary workload on >20k servers, "insane" 5% boosts make a certain sort of sense.
+Chris Snook yeah. That's really the only place hugepages make sense. Not in generic code, but in specialized server-side stuff that can do stuff that would be insane and entirely inappropriate if that wasn't the only thing that code did.

It's the epitome of the kinds of loads I don't care about: tuning for one particular load. I realize some people care deeply, and I'm sure it's a huge sexy market for big iron, but it's just too specialized. 

It's also why I'm not interested in 'make' replacements. Part of the whole exercise here is to work well in general, not to optimize for one particular special case (the kernel build).
I recall in 1994 it took I think it was 3 hours to compile the kernel on a 386SX system that was my desktop. Back then my work systems beat my home system, now my home system beats anything I have at work.
Just out of curiosity, what is big iron? I routinely use servers with 256 GB of ram, but only 16 cores for databases.
+Linus Torvalds what are you going to do about the coffee problem?

Need me to make a sensor-based coaster that sends a signal to your coffee machine to brew when these two conditions are met: 1.) The cup is empty 2.) You have typed make in the terminal.
Speaking of lots of little forks, I used to have a fork-heavy script that ran almost 5 times faster in a virtual machine than it did natively on the same hardware, all disk buffers pre-cached in memory. I never narrowed it down sufficiently; it seems to have gone away of late.

Apropos of nothing, recent mainline has a funny where I/O slows to 1/jiffy as a virtual guest (XEN or KVM). No way to trigger it, nothing in any log, no bug reports filed. I've got nothing I can open a bug report with myself. Darn!
Good work and thanks for the share, any idea how do we make use of this information to reduce page faults 
Linus, wouldn't the new Intel quantum chipsets just do damage and have you bypass a lot of kernel rendering because of the extreme amount of information transparency and speed+efficiency?
I’m wondering, since this is so far an Intel-only discussion…
Is it the same for AMD CPUs?
+Andreas Thalhammer If you have an AMD CPU why don't you try to replicate his test process? I doubt anyone would be upset with you for providing them the data.
+Mark Johnson Yes, can someone point me to the site where I can get Linus Torvalds' test program?
And how to use it?
Honestly, I have no idea how to count CPU cycles…

Update: I found the issue on LKML, but not how to test this on my system.

Anyway, this is far too advanced stuff for me…
He explained his test procedure, but it was over my head.
You lost me at 'performance profiling'. Just an age thing I guess!
+Connor Blakey your solution for squeezing a coffee break off this situation is a lot more time efficient than all the testing
For the hardware aspect, the issue presented at the bottom of is probably related. It's awesome what modern CPUs are doing to improve instruction throughput, but as soon as something is not as expected (page fault, context switch, branch misprediction, ...) it's getting costly :(

For make ... I guess not using recursive makefiles would be a great start. At least the last time I checked, kernel makefiles did pretty often call make themselves.
Surrounded by family and friends put new innovations to improve oneself
This was worth reading for me: I was already convinced about the penalties of a VM page fault, but this experience reads to me as confirmation of the need for ultra-optimized kernel code. Great job, +Linus Torvalds.
What about real modules in C as opposed to #include files, and improving the build system to avoid forking a new gcc per file? Your post isn't clear about whether those parts are still the main culprit here...

I've seen WIP on both the gcc/make and C modules fronts, FWIW.
I want to respectfully congratulate Linus and his colleagues worldwide
for their professionalism and hard work over more than two decades in
creating  and maintaining Linux,  a truly wonderful operating system. Well done and keep up the good work !
My Dad was from radio valve testing to silicone painting circuits boards.  From telephone keypad oscillation checks to call up 2-meter FM repeaters while over competitive to navigate by doling sticks to find hidden transmissions on behalf of Amateur Radio Ham road shows; concludes, out the mobility of the hobby he must have had it all learned from forums and books in print or his telex printer and a QRT magazine subscription. Thank you everyone for bringing his genius back to our minds  for all of the world who knew him! A great multi international awarded ham who was a NBC TV manager.
my five year old red hat still sits in the drawer
Would you consider putting your test code online somewhere? It would help remove any questions that people have and also allow people to test it on their own.
I usually try to wait for the tick cycles for intel.  If I had to guess, I'd guess the TSX extensions to cache coherency added cycles to page faulting.
+Andreas Thalhammer +Jesper Dangaard Brouer "I have no idea how to count CPU cycles." One can use the "rdtsc" instruction to measure CPU cycles. If you set up a custom profiling system correctly, you can easily track how many CPU cycles it takes to get from a particular line of kernel code to a line of user-space code. I missed this instruction on ARM.
+Tony Lee I usually use both "rdtsc" (takes approx 8.25ns) and clock_gettime() (takes approx 26ns) see my code and usage example

At the same time I usually use the "perf stat" command. Given my program times a loop, and I know the loop count, I can divide the "perf stat" counters to see how many instructions it took.

I was looking for a method to use inside the kernel.
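
The "time a loop, then divide by the loop count" pattern described above can be sketched in user space. This example touches one byte per page of a fresh anonymous mapping and reports the time and minor-fault count per page. It uses Python's monotonic clock rather than rdtsc, so interpreter overhead dominates and the result is nowhere near cycle-accurate; the function name `touch_pages` is made up, and the `resource` module limits it to Unix-like systems.

```python
import mmap
import resource
import time


def touch_pages(n_pages, page_size=mmap.PAGESIZE):
    """Write one byte into each page of a new anonymous mapping.

    Returns (elapsed_ns, minor_faults) measured around the touching loop.
    """
    buf = mmap.mmap(-1, n_pages * page_size)  # anonymous, zero-filled memory
    faults_before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    start = time.perf_counter_ns()
    for offset in range(0, n_pages * page_size, page_size):
        buf[offset] = 1  # the first write to a page may take a page fault
    elapsed_ns = time.perf_counter_ns() - start
    faults_after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    buf.close()
    return elapsed_ns, faults_after - faults_before


if __name__ == "__main__":
    n = 10_000
    elapsed_ns, faults = touch_pages(n)
    print(f"{elapsed_ns / n:.1f} ns per page touched, {faults} minor faults")
```

Running the script under "perf stat" and dividing its counters by the page count gives the per-iteration figures mentioned above, with the same caveat about interpreter overhead.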
Modern build systems take around the same time to verify no build is necessary as it takes "git status" to verify that nothing changed.
If page faults are slow, don't do page faults.  Make the pages bigger.  It's not like Haswell doesn't have good support for 2 M pages or that we're short of real memory.
Hello Linus,
Hope you're having a great day.  Just wanted to say it would be great, despite the progress of Linux, if suspend and resume on a laptop would work.  This important function worked in the not too distant past ... what actually happened and will it ever be fixed?  The "Linux" forums have been littered with requests for help on how to fix this frustrating "bug" for at least 2 years...
+Lawrence Stewart If the seek time for a spinning disk is ~10ms, and it takes about ~10ms to read 2MB after seeking, is that the point that disk-backed huge pages cease being "insane" for the general case? And if the typical SSD has 100x faster seek time, but similar data transfer rates as a spinning disk, does that change the equation? How does one amortize the cost of a page fault seek vs. the size of the page fault?

Apparently what Linux is optimized for is the creation of itself (the implicit or explicit goal of most life forms), so I guess that's the workload that will determine the answer to these questions.
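
The seek-versus-transfer question in this comment can be worked through with its own numbers. A small sketch, assuming a fault costs one seek plus a linear transfer, that "2MB" means 2 MiB, and that the SSD has the same transfer rate as the disk (as the comment supposes); `fault_cost_ms` is a hypothetical helper:

```python
MIB = 1024 * 1024


def fault_cost_ms(page_bytes, seek_ms, bytes_per_ms):
    """Cost of one disk-backed fault: a seek plus the time to transfer the page."""
    return seek_ms + page_bytes / bytes_per_ms


# Transfer rate implied by the comment: 2 MiB in about 10 ms.
RATE = 2 * MIB / 10.0

for name, seek_ms in (("spinning disk", 10.0), ("SSD", 0.1)):
    small = fault_cost_ms(4096, seek_ms, RATE)
    huge = fault_cost_ms(2 * MIB, seek_ms, RATE)
    print(f"{name}: 4K fault {small:.3f} ms, 2M fault {huge:.3f} ms, "
          f"ratio {huge / small:.1f}x")
```

With these inputs a 2M fault on the spinning disk costs about twice a 4K fault while bringing in 512 times the data, whereas on the SSD it costs roughly 85 times as much. So in this simple model, faster seeks make large disk-backed pages less attractive, not more.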
it's the page's fault, take him out and have him flogged
Have you seen Gary Bernhardt's The Birth and Death of Javascript? It's not about JavaScript at all, but rather about killing virtual memory and running all code in a virtual machine (with the claim that one can replace the other, and that the performance overhead is about the same), in order to remove most build and compilation architecture from development. Even if you don't buy the argument, it's a really great presentation.
+Lawrence Stewart Using 4k pages is so... '90s. If most of the misses are for code (really? that's a huge working set unless it's the initial in-fault) then more RAM is indicated. "Virtual memory is for those that can't afford the very best." Also, I have to wonder how much hidden loss is due to TLB misses which get attributed to random instructions w/ many profiling systems.
+Linus Torvalds Apologies as I'm not too familiar with that kind of profiling, but if your tools are portable I'd be very curious to see numbers from a few different ARM variants as comparison. As you well know, lots of people are keen to use ARM in server environments, especially in a storage context. Most reports I've seen on the topic compare processing times on CPU-bound tasks - perhaps there's a performance advantage when you add the MMU back into the picture.
whats up with so many spam comments? got a little tired of reporting them...
How about ignoring the whole "check thousands of files for changes" step? Why not keep the dependencies in memory, and watch the filesystem for changes?
Trackpad drivers for Chromebooks built in the kernel so we can install any distro please 
Hi, Mr. Torvalds.
Can "I" re-share this post on Stephen Wolfram's questions&answers place?? (just to roll the dice)
How about a different page replacement algorithm other than the heuristic jumble LRU?
My job focus is on virtualization, and one of my frustrations has always been 'what is consuming all of the cycles when the hardware performs a vmexit??'. I wish I had access to the microcode! Given all that I've seen with vmenter/vmexit - going from core 2 duo to the latest cores - I was honestly waiting for someone to discover something similar with 'normal' exceptions. Good luck with that - even to those 'in the family', Intel is not that forthcoming with internal design details. (:
+Linus Torvalds You've mentioned after many suggestions to optimize the actual build that you care more about the general use case and how an optimization on the CPU's pagefault handling would help other loads.  I am curious if that 58% --> 80% is reflected in the general case.  You reported 5% of your CPU time is used on your current system, how does that compare to the Core Duo?
I'll hazard the armchair opinion that integration of editor and project compilation would go some ways to reduce the specific problem described.  Change a single source file, and yer editor would know what needs to be recompiled or linked.   This is somewhat opposed to the traditional UNIX philosophy of many-distinct-tools marshalled to perform some complex operation, which is what 'make' or shell scripts do.   Integration of a revision control system adds another layer of complexity. Something like this is on my planning horizon, but not an early goal of what I am not quite working on in my non-existent CFT.  If I get that far with it though, I'll let you know (as in a couple of years or so).

As for the problem with page faults, well hey, you just have to add more RAM...
Have you considered a test with caching disabled?  I've read certain scatter-gather-intensive loads can benefit from that.  What rates of TLB lookup misses are you observing?
If you have a page fault, might it not be possible that retrieving the page from ram/hdd might be well slower than only 1k cycles and you can't see the delay? Loading the page itself might be just under 1k cycles but the load time might be much more?
I guess the main point here is the trend to optimize performance for the long pipeline; I've seen that discussion before. All of the exception cases (misses, stalls, page faults) seem to be getting more expensive. I wonder if, when we start getting systems with dozens of CPUs, some of these problems could be mitigated by switching to a different core and "pausing" the current core, keeping the L1 (etc.) cache intact where possible... or even pre-emptive branching where, in place of prediction, multiple possible paths are loaded up on different cores. The loser could then be released back to the pool. I am thinking of an architecture more like Parallella (and some of the GPU / Intel's new stuff) rather than the current "mainstream" CPUs.
The low-end MIPS SoCs (think "router") have tiny TLBs relative to the amount of memory they address, and the people designing them were not expecting SVR4 PIC, where calling a library function touches ~3 pages, with TLB entries unsharable between processes. (The NEC Vr41xx chips were a lot faster with jumptable shlibs.) Today, on a 32M router I've been planning on wiring a 1M TLB entry into every process to dump stuff like libc and busybox which are always hot anyway. If I'm reading the sims right, these chips blow through the TLB before they start capacity misses on the caches...
There is a real question whether changing a few control register bits could cut those latencies by 2/3. If Linux uses one task per CPU, the cost of IRET that Linus mentions is closer to the numbers for the case where some control bit adds the cost of changing task than to the cost of just changing privilege level.
+Siegfried Kiermayer Linus is talking about the cost of the fault here, i.e. simply mapping a page in memory, assuming it's already cached somewhere (he mentions the zero page case). So the cost of the disk would come on top of that, but after two or three builds, the file system cache would take care of it and it would go away. When you create a new process, however, you still have to map the corresponding pages, and that takes time.
+Linus Torvalds Why don't you go into the CPU development business? This all sounds great and most importantly it sounds like you have more experience and knowledge about this than the current CPU engineers out there.
Wow, excellent analysis, and 80% is huge. And this is a good example of why %CPU alone can be misleading, until you study the type of cycles. Did you use perf for profiling? It would be interesting to make a flame graph from the perf data (eg,
+Linus Torvalds I realise this isn't just about your personal case of kernel builds, and I'm no expert in this area by a long shot, but from your description it looks like kernel builds would be a good candidate for multiprocessing if you have the hardware available.

By the way, some of the comments seem to assume you're talking about page faults to disk. I'm assuming you're talking about page faults from cache to memory, or similar. Perhaps a clarification from you would help.

P.S. Many thanks for your great contribution to Linux/Open Source computing.
Linux was written by a novice without a clear understanding of professional programming. It continues to be a hack to this day. It is useful only because it is free.
Ah - a flamewar troll.  Anyone insist on taking the bait?
very fascinating analysis.  I've found it interesting in engineer workstation deploys to see how older i5 and i7 processors could, in certain workloads (usually code builds), get badly beaten out by even older core2s.  The only thing we could pin it on at the time was the pure lack of l1 and l2 cache on the "newer" processors (those were 1st gen i-chips, and I know Intel has made many improvements on that line since...and very thankfully).
Just for grins, turn off hyperthreading and adjacent cacheline prefetch(L2 streamer) in BIOS, and then run your bench.
Hello Linus.
I am confused about why you write
 "*& _tmp.a"
(the definition of _tmp is:
 struct {long a,b} _tmp).

What made you use "*& _tmp.a" rather than "_tmp.a" directly?

I am reading Linux kernel 0.12, and I hope you can give me some feedback.
I look forward to your reply. Thank you!
Many thanks to you for doing so much over the years for Linux.... and keep going. merci l.g. silvan
+Linus Torvalds page fault is a precise exception which probably requires memory fence (and synchronization of all processor state) for page fault to be restartable. Memory fences have long latency on modern CPUs because of complex memory systems, and, moreover, implementations are usually conservative for simplicity. It would be interesting to compare page fault latency to memory fence latency on the same CPU.
I would like the linux kernel to have "batch all syscalls" functionality, so I can be adding to a queue of syscalls to run in my process, and the kernel (on a different thread) is taking syscalls off the queue, running them pseudo-sequentially, and putting all the return values into a result queue.
I apologise to all but must find what strongs tht my fault exactly.. N people this site knew the true is happend. Plss someone tell me why? 
Weird. Intel developers missed something pretty important. :(
What do you think about systemd? Maybe it has an impact on system performance.
> I'm talking to some Intel engineers, trying to see if this can be improved. 

I am very curious of what kind of reply you got.
can anyone plz help to deal with Page faults questions and also questions on virtual me