Shared publicly  - 
More bitwise tricks..

So my quest to calculate the hash and the length of a pathname component efficiently continues. I'm pretty happy with where I am now (some changes to the code have happened, it you actually want to see the current situation you need to check out the kernel mailing list post), but finding the number of bytes in the final mask bothers me.

Using an explicit loop is out - the branch mispredicts kill it. And while at least modern Intel CPU's do quite well with just using the bit scan instructions ("bsf") to find where the first NUL or '/' was in the word, that sucks on some older CPU's.

So I came up with the following trick to count the number of bytes set in the byte mask:

/* Low bits set in each byte we used as a mask */
mask &= ONEBYTES;
/* Add up "mask + (mask<<8) + (mask<<16) +... ":
same as a multiply */
mask *= ONEBYTES;
/* High byte now contains count of bits set */
len += mask >> 8*(sizeof(unsigned long)-1);

and I'm wondering if anybody can come up with something that avoids the need for that multiply (and again - conditionals don't work, the mispredict costs kill you).

Because that multiply isn't free either.
Eric Mewhort's profile photoali rahatloo's profile photoOrghici Alexandru's profile photoCoenraad Loubser's profile photo
there is nothing more efficient in this context (classical binary mathematics) than the shift operators, I marked this post to see if someone responds:)
What is wrong with len += mask >> ((sizeof(unsigned long)-1) << 3);
That is essentially free.
+Alexander Levitskiy: that's not the multiply I'm taking about. The multiply by 8 is free, since it's all about constants and can be done at compile time.

It's the "mask *= ONEBYTES" multiply that is expensive.

Edit: and by "expensive" I mean "four cycles or so". Compared to being able to just do it with an add or something, it's expensive. Compared to most everything else, it's cheap.
+Johan Euphrosine: yeah, that one has the "sequential shifts" algorithm, which works ok and is basically what I started with. But they're data dependent so even though they are each individually cheap, the end result is actually that the multiply is cheaper.

There may be microarchiectures out there that have a multplier that is so slow that doing it by hand with sequential shifts is worth it, but none that I can think of that matters.
if not a multiplication how about a "look up" table (storage vs computation) for slow CPU's large pre-computed values (hashed) naive?:) too large of a space (maybe in approximation to get to n amount of bits than multiplication) (conditional statement redux)
How much difference does the multiply make in your benchmarks? It seems that you put it in to avoid the explicit shift/sum cascade. And it's cute of course! But maybe it depends too much on the platform to decide this in general? As for finding a better way to compute that sum, I think you've just about exhausted them, especially if you're also considering lookup tables as +Andrew Milkowski suggested.
(un?)interesting fact: that was one of the questions I got asked when interviewing at Google, (added more stuff in my previous comment)
I think, Linus is looking for this:
int nbytes(unsigned int mask)
mask &= 0x01010101;
mask *= 0x01010101;
return (int)(mask >> sizeof(mask)*8-8);
}and thus not clz (bsf) may not be needed, i.e., len = ((clz mask)+8)/8'

Are MMX/XMM instructions allowed?
+Jan Achrenius: nope, can't use mmx/xmm/avx in the kernel due to overhead of saving state and handling CR0.TS. So regular integer only instructions. And they do need to be fast/available across all uarchs, so popcnt etc don't work (that's also why bsf is not great - it can be very slow)
If you don't think of anything, make it panic and printf "Sorry bro, you try and fix it."
wait for non-abelian operators [operands] (sorry bad joke:) sorry cant help in full, static auxiliary (non-computational) read-outs seem to be in order. btw smart great group of people here:)
+Linus Torvalds
I think only a variation of popcnt() can do it faster, e.g.,
mask &= 0x01010101;
mask += mask >> 16;
mask += mask >> 8;
return mask&0x3;

This still assumes the 1st lsb of each byte is set. If this assumption is relaxed it might be easier to optimize (1st lsb of each byte has to be set earlier).
Believe Mark is refering to this format of calculations known as Comba multiplication, and embedded in the JVM (it is well "linked" in google, hence references won't be provided) but worth while to explore further:)
Ok it's hard to type on on a ipod
+Linus Torvalds

I'm not sure any simpler algorithm to count bits can exist, mathematically: counting implies examining a relationship between independent bytes, which either requires a loop, or requires a data dependent group of linear operations (shifts and adds), or requires the same hardwired into hw (BSF), or requires a non-linear algebraic operation that connects the bytes and thus computes their relationship (a multiplication) - hardwired into the hw.

(edit: assuming the state space is large enough so that we don't want to precompute the results.)
+Mark Sobkow When writing and compiling code for a specific (family of) target processor(s) without reverting to assembly, one can either let the compiler do the task or help (a specific) compiler to cope with the code. We (at former Hybrid Graphics Ltd.) used to have a specific set of inline code performing bitwise operations for each compiler we used. This, in most cases, was more efficient than using inline assembly.
+Ingo Molnar: so the thing that makes me hopeful that it could be simpler is that the byte mask is not arbitrary. For 32-bit, we really have only four cases we can have:

no bytes: 00000000 -> 0
one byte: 000000ff -> 1
two bytes: 0000ffff -> 2
three bytes: 00ffffff -> 3

so it's not a generic "count number of bits" (or even a generic "count number of bytes"). So I was hoping somebody would come up with something simple like (x & mask1) + (x&mask2)>>shift that just got the four possibilities right.

The "mask and multiply by one bits" thing gets the right result, but it gets it for a much more generic case than what we actually need (arbitrary byte mask). And it's that simpler case that I wonder if we can optimize somehow.

Of course, the 64-bit case has 8, not four, combinations. The high byte is always zero (one byte must have contained the NUL byte or the slash), so you have the 0-7 bytes set cases.
+Linus Torvalds So how does

mask &= 0x0101010101010101; // (assume 1 LSB of each byte is set)

mask += mask >> 32;
mask += mask >> 16;mask += mask >> 8;
return mask&0xff;

compare to the original with multiply?
+Linus Torvalds

The way I see it is that while the examination we do is very specific, the input bytes are independent.

So I don't see how we can avoid connecting them: having to do either 4 branches, or 4-deep data dependent shifts/adds, or 4 branches computed by hardware, in parallel (BSF or a multiplication).

We might be able to use a small, 16-entry lookup table though, if we compacted the identities to 4 bits. 256 entries on 64-bit, if I got the numbers right :-)
+Jan Achrenius: so the simple shift+add version is all totally serialized and nothing can be done before the previous operation ends: as a result the three adds and three shifts will inevitably take 6 cycles (the original P4 had that double-pumped ALU, but not for shifts).

That's already slower than almost any multiply. There are exceptions, so the shift-and-add approach can be faster for some uarchs (I think the K7 had a 9-cycle multiply latency, for example), but I really think that to be worth it, we'd have to find some clever trick that takes the "reduced case" into account (see my comment to Ingo) and is able to do it with a few less shifts/adds.

For example, imul latency is just three cycles on Sandybridge and most AMD CPU's (I think AMD is 4 cycles for the 64-bit case). That's a hard target to beat, but the alternative doesn't have to really beat it, because 'imul' latency is something like 14 cycles on Atom and 10 on P4.

Now, Atom is crap, and I don't really care, and P4's are old and going away. So I don't really want to optimize for them, but there's room to do better. Without hopefully doing worse on SNB or Athlon.

I realize this doesn't really matter. One or two cycles isn't worth worrying about. By now it's all just "dammit, I started down this bitwise path, let's see how far I can drive it".
+Linus Torvalds I know, been there ("it does not really matter but it's still a cycle") n>1000 times. My mindset is still on ARM where my suggestion could be feasible... First, I was considering using modified De Bruijn multiplication, but it became pretty obvious it cannot be faster. I'll have a look at the x86 instruction set (for the first time since Willamette..)
+Linus Torvalds

The lookup table for a 32-bit hash would be something like:

char nr[16] = {
0, 1, 0, 2,
0, 1, 0, 3,
0, 1, 0, 3,
0, 1, 0, 4 }
+David Beaumont: yes, and always in the low bits only.

So you start out with a mask that looks like "(1<<(8*n))-1" and the exercise is to figure out 'n'.

(Where n = 0..3 or 0..7 depending on word size).
That's good you're researching this efficiently. "efficiently continues" :P Grammar nazi on the prowl.
Ooh is this for an enhancement on git rename tracking?
Unless I'm missing something, the search space is small enough that even a fairly simple search algorithm could try every possible "move" (set of instructions) which doesn't take longer than Linus' proposal. Any takers? I'm on a cell phone all weekend, so it won't be me.
+Linus Torvalds: I don't suppose something simple like this would do any better (only 32-bit case though)?

char lookup[8] = { 0, 1, 0, 2, 0, 0, 0, 3 };
n = lookup[(mask & 1) | (mask >> 14) | (mask >> 21)];
+Josef Grahn: any memory access will generally have a load-to-use latency of roughly three cycles (even if you hit in L1), so a pure table lookup would work really well. But since the table lookup needs that shift+add series just to shrink the input space, it starts out with a few cycles disadvantage to begin with, and the multiply will win.

After all, you can get the right result without any table on 32-bit using
a fairly obvious approach:

n = (mask & 1) + ((mask >> 8) & 1) + ((mask >>16) & 1);

and on 32-bit that stupid approach is possibly the best approach. On 64-bit, the manual masking gets more expensive, though, and you have to fold the upper bytes down and do the third byte in the low part of the word too, so..
Hi Linus,
For the 4 byte version, this works

a = mask - 0xffff;
b = a >> 23;
result = 2 + b - (a & 0x1);

But requires three registers and only works for the 4 byte case (I'm assuming the problem is counting the bytes that are 0xff, so it won't work for 0x00DEADBE for example)

I think your solution is brilliant however as it scales nicely and exploits pascal's triangle.
+Linus Torvalds Have a look at the x86 LEA instruction. In the past I have used it to do all sorts of neat hacks. The circuitry is different from MUL. It can multiply faster than mul because the mathmatics behind it are not general purpose. If you are comfortable with inline asm and x86 only compatibility this is worth the reasearch. This hack would work because you are multiplying by 8! something the LEA opcode was designed to do readily.
+Carl Chatfield: That's exactly the kind of trick I was hoping somebody would find.

That said, I have to admit that it's really the 64-bit case I was hoping for a clever trick for. Yours avoids a shift (good) but the "stupid" one is not all that much worse.
The *8 is not the expensive multiply, and will be probably automatically be optimised by the compiler anyway (whether that be a LEA or a shift)
The cool thing about the multiply is that it "entangles" all the bytes together so information can be extracted from a single point. Is there any other instruction that does this?
+Carl Chatfield 'probably' is not reliable thing. Coder's like things to be definite don't we? Please define 'entangles' in context. It does not at all sound logical more conceptual than fact.
By entangle I am referring to the fact that the multiplication trick works because every byte in the mask (the multiplicand) is touched by every byte in the multiplier.
on i386 MUL can be anywhere from 9-41 clocks where LEA is only 2 clocks.
What about this for the 64-bit version?
n = (mask & 0x01010101) + ((mask >> 32) & 0x010101)
n = (n & 0xffff) + (n >> 16)
n = (n & 0xff) + (n >> 8)
Another fun but admittedly not very practical once but perhaps a source of inspiration

mask += 1;
result += !!(mask & 0xffffffff00000000) << 2;
result += !!(mask & 0xffff0000ffff0000) << 1;
result += !!(mask & 0xff00ff00ff00ff00);
Adding one will cause an overflow into the msb, and this will be the only bit set. Then binary search. But I think ultimately he multiplication will be hard to beat
Just realised that inverting the mask removes the need for one of the not's

mask += 1;
result += !(mask & 0x00000000ffffffff) << 2;
result += !(mask & 0x0000ffff0000ffff) << 1;
result += !(mask & 0x00ff00ff00ff00ff);
This is very nice to see, share ideas like this, magnify every day to GNU / Linux. Thanks to people like you, I can enjoy this wonderful OS. =)
Greetings from Argentina.
+Carl Chatfield: I like your shift-by-23 version, but I think it could be improved a tiny bit. Just re-organizing it results in

int a = (mask - 0xffff) >> 23;
int b = (mask & 0x1);
return 1 + a + b;

which has the advantage that 'a' and 'b' can be done in parallel, and the 1+a+b can be done with single a "lea" instruction (iow, addition is a bit cheaper than subtraction simply because it's easier to embed).

In fact, that "mask - 0xffff" can be done with a lea too, but for gcc to do that I really do have to do "mask + (-0xffff)". Ugh - silly compiler. But with that I can make gcc generate

leal -65535(%eax), %edx
andl $1, %eax
sarl $23, %edx
leal 1(%eax,%edx), %eax

for the 32-bit case, which looks pretty damn optimal. Certainly better than the multiply, and without the worry of some microarchitectures doing badly on it.
If I understand the problem, checking for a \0 NUL byte in a string by testing a word at a time, then the mask &= ONEBYTES throws away vital information. All byte bits must be condensed into the lowest order bit first, like so:

const char str[] = "F\0EL";
return MpBitMath_HasNul(* ((unsigned int *) str));

bool MpBitMath_HasNul(register unsigned int n)
n = (n & 0x55555555UL) | ((n >> 1U) & 0x55555555UL);
n = (n & 0x11111111UL) | ((n >> 2U) & 0x11111111UL);
n = (n & 0x01010101UL) | ((n >> 4U) & 0x01010101UL);
/* at this point, each byte in n that is non-NUL
* is converted to 0x01.
n = (n & 0x00010001UL) + ((n >> 8U) & 0x00010001UL);
n = (n & 0x00000003UL) + ((n >> 16U) & 0x00000003UL);
/* ASSERTion: n <= 4 */
/* if n < 4, then NUL byte is embedded. if n == 4, then no
* embedded NULs
return n < 4UL;

Now is this data-dependent technique slower than a multiply...?
I read all that code, or no, I try it. Then all of a sudden it awakens to me: I am just a Linux user (certified system admin and network admin, for the fun I took those courses) and fanatic. Plus I promote it. I do not have to know all of that. I know my place. :-)
+Linus Torvalds I have to admit I got kind of lucky with that one, I'm not so well versed with assembly programming. The 1 + a + b -> LEA optimisation was a new one for me. Thanks.
Open source brainstorming at its best. Seeing such an educated and brainy discussion on an open "forum" like this Google+ post is a pleasure to the eye.
Although my input would probably be most suitable under the cat photo, I agree, its great to see. Computers aren't smart, people are; and that's why technology rocks..
gcc-4.6.2 produced the lea without the mask + (-0xffff) trick here.

on arm this produced
e2423cff sub r3, r2, #65280 ; 0xff00
e2022001 and r2, r2, #1
e2820001 add r0, r2, #1
e24330ff sub r3, r3, #255 ; 0xff
e0800bc3 add r0, r0, r3, asr #23
Tha'ts the best inline ASM I've ever learned in my life.
+Linus Torvalds I think it's impossible to get rid of both the multiplication and the log n shift-adds (on 8-byte case). I came up with this (LUT by multiplication):

mask += 1;
mask *= 0x0001020304050607;
return mask>>56;
... so you get the idea (assuming all LSB's are set as you gave an example).

Now, my minor optimization is that you can remove the addition by adjusting the LUT ("magic number"), i.e.,

return (mask*0x0001020304050608)>>56;
This works for 0-7 bytes and assumes the 8MSB are zeroes.
What's Hot is now officially broken.
+Mark Sobkow We used inline C functions, and in some cases inline assembly (esp. when carry bits are utilized).
+Carl Chatfield: One nice thing about your binary search trick is that it completely avoids 64-bit right shifts, which apparently are rather expensive on P4:s.
I once wrote an asm that could load a 23 mb flat csv text file with 20+ fields per entry in under 1.5 second on pentium d. The main data interface existed soley of arrays of data pointers as to make the loading and exporting with modifications extremely fast.

The load effective address instruction was the core component involved in harnessing the raw and speedy calculations needed to store, manage, refer to, and manipulate the in memory database.
I believe there was an order of around half a million entries.
There were also properties about each entry and field to maintain.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
-- Donald Knuth
and some projects are so advanced they don't have much choice but to break ground from the jump.

The project I spoke of was also my first x86 assembler project, as I was learning the fundamentals and concepts of machine code whilst coding. So it made sense to build something very complex and new so I could see "what this baby can do". I did however follow the unix philosophy true and build each part of the system as a seperate component. Right down to the context sensitive block heap memory management.

Admittedly I tried to re-invent the project in a high level language without such depth of managment. The results were more than horrendous. However the api exposed from the assembly afforded the C compatible high level language more robust features such as changing the data delimiter, refrencing field headers and a host of ofther useful features. The interface was appropriately seperated from the implementation.
After thinking about all this a little bit, i would not go for further bit twiddling hacks:
There are three major archs which can do unaligned access:
- newer ARM -> has count leading zero since v5
- PPC -> has count leading zeros
- X86 -> has bsf/bsr
now x86 is a bit of a problem because bsf can be slow. But: looking at Agner Fog instruction timings ( either the uArch is obsolete and a pain anyway (Pentium 3 or earlier, ATOM), or you have to beat a latency of around 7 or much better.
In this case i would simply say: go for the bsf, better for the i-cache and register pressure, it's not in the loop and the last once off instruction, if you feel like it you can use tzcount which is fall back compatible to bsf as long as you don't use the flags (really Intel, what where you thinking, what was wrong with the zero flag...).
I tested a couple of the suggested methods on an old Pentium D box I have running. On this platform, the multiplication trick takes the cake by a far margin. Doing one billion repetitions, I got the following timings (disclaimer: I haven't penetrated the assembly code that deeply, so please take this for what it is):

Binary search: 8.7 seconds
Multiplication: 3 seconds

I also tried doing the highly optimized 32-bit trick twice and combining the result for 64-bit. Basically something like this:

int a = (himask - 0xffff) >> 23;
int b = (himask & 0x1);
int h = 5 + a + b;
int c = (lomask - 0xffff) >> 23;
int d = (lomask & 0x1);
int l = 1 + c + d;
int s = lomask >> 24;
return (s & h) + (~s & l);

It clocked at 7.4 seconds, slightly faster than the binary search variant but still quite a bit slower than integer multiplication. I should mention that I never managed to convince gcc to do the subtraction using "lea", though, so one could possibly shave something off that time, but not down to 3 seconds.
Linus' original code is:

mask &= ONEBYTES;
mask *= ONEBYTES;
len += mask >> 8*(sizeof(unsigned long)-1);

We can eliminate the first line when we realize we are looking for a simple mapping:

For 32 bit:
0x00000000 --> 0
0x000000ff --> 1
0x0000ffff --> 2
0x00ffffff --> 3

Can we multiply mask by some magic constant such that we always get the correct mapping values (0,1,2,3) at the same bit offset? Turns out there are many such magic constants, and if we pick the bit offset of 30 (for 32 bit), we can eliminate the need for a final AND mask. With a program I found a "nice" magic constant that does the job, here's the 32 bit code:

// uint32_t mask;
const uint32_t MAGIC32 = 0x004080c1;

len += (mask*MAGIC32)>>30;

We can do the same thing with 64 bit, again I picked a "nice" magic constant that generates the mapping values (0,1,2,3,4,5,6,7) at bit offset 61:

// uint64_t mask;
const uint64_t MAGIC64 = 0x0020406080a0c0f0;

len += (mask*MAGIC64)>>61;
If (for a 32-bit system) you are looking for a function that maps:
x = 0x00000000 -> f(x) = 0,
x = 0x000000FF -> f(x) = 1,
x = 0x0000FFFF -> f(x) = 2,
x = 0x00FFFFFF -> f(x) = 3,
x = [anything else] -> f(x) = [undefined/don't care],
this would be my twopence:

unsigned int nbytes(unsigned int mask) {
unsigned int reg;
mask &= 0x0001FF01;
mask += 0x00000100;
reg = mask;
reg >>= 16;
reg += mask;
reg &= 0x000000FF;
return reg;

I guess that performs worse than the shift23 method, but maybe it inspires someone to find a snappy 64-bit solution.
Current finalists for best code (two separate cases):

/* Modified Carl Chatfield G+ version for 32-bit */
long a = (mask-256) >> 23;
long b = mask & 1;
return a + b + 1;

/* Jan Achrenius on G+ for 64-bit case */
return mask*0x0001020304050608 >> 56;

both of which take advantage of the special byte pattern of bits. Jan's version is my multiply, but avoiding the need for the mask. The 32-bit version is Carl's code but with a smaller constant that would be better on ARM, for example, and with the subtraction changed to addition as above.
This must be the most unobvious bit of code I've seen in a long time, make sure to add a nice comment or some distant time in the future some poor maintainer will jump out of the window after failing to figure out what it does ;-D
+Niklas Schnelle: there will be comments. And the name of the function will be "count_masked_bytes()", so it will not only have commentary about how it works, but also a higher level "what does it do" meaning.

That said, sometimes I think code should intrigue you. It's not all about just getting the thing done, a bit of flair can be a good idea too.
+Linus Torvalds Here you are:

a = (0x0ff4000+mask) >> 23;
return a&mask;

edit: 0xff8000 works ok as well, but still not good for ARM
+Jan Achrenius nice. im about to hop on a plane. ill see if what you have there cqn somehow be extended to 64 bit
len += mask%255;

Can replace this:
mask *= ONEBYTES;
len += mask >> 8*(sizeof(unsigned long)-1);

(As (((mask>>0)&255) + ((mask>>8)&255) + ...) is congruent to mask mod 255, and in this situation the sum will be < 256 and thus the exact sum (bit count) is obtained. At least until we have 256-byte integers.)

But I have doubts the % is cheaper than a * and >>.
For the fun of it, I extended my Pentium D benchmark with Jan's magic number multiplication method. It clocks reliably at 2.58 seconds, which is 14 % faster than Linus' original version (not considering loop overhead).
However, I'm not sure how well this simple test matches "real world" conditions regarding latency of the computed length. I suppose that would have the biggest impact on actual performance, using a multiplication instruction on this architecture.
+Jacob Post Modulo is generally something like an order of magnitude slower than multiplication.
this tread makes me feel all warm and fuzzy. to the people who were asking about where to learn how to do this stuff i have a recommendation (if you are ok with english).
Richard Buckland (of UNSW) has a few courses on youtube. one of which is an intro to computer science where he does 50% hardware type instruction and 50% C code based instruction.
he is the best professor i've ever seen. very good at getting the point accross and keeping me interested even in the dry parts.
also there is the TECS class which is a very well designed class that you can do by your self or if a local university or hackerspace offers it you can take it as a group (also almost the entire text book is free online from the site i linked) :)
as for ASM i'd actually like to know a good resource for learning that :P
Lin lin
hey amazing uh
Here's one with small constants (inspired by Jan's 0x0ff4000-version):

a = 0x1ff + (mask >> 15);
return (a >> 8) & mask;

How about this one I sent to Larry Page?
How about a 3D program for kids? To be able to look at mathematical
formulas, then the physics application and then to see in 3D the actual
molecules, elements, etc. in interaction. I think this would be one of the best applications for Google as it would enhance kid's creativity.

Don Dollton
+Timo Linna I don't know should we make portable code that also works well on ARM, but it would require 8-bit constants with even shift. With assembly, carry bits make things a bit easier, but it's highly unlikely that a C compiler can utilize them efficiently.
Smaller-constants -challenge accepted :)

a = 0xFF + ((mask+1) >> 16);
return (a >> 7) & mask;
+Timo Linna: the single-shift version wins in my book.

Shifters are relatively expensive. It's quite common to have microarchitectures with two (or three) ALU's, but only one of them has a shifter.

So the winner still is

a = (mask-256) >> 23;
b = mask & 1;
return a + b + 1;

which should work fairly well on 32-bit ARM too because the constant is one of the easily generated ones.

That said, at least older ARM's have horrible issues with the unaligned part, so worrying about ARM is somewhat premature.
In fact,

a = (0x0ff8000+mask) >> 23;
return a&mask;
should do better on ARM, even though the constant must be added in two parts.. something like

add a,mask,#0x8000
add a,a,#0xff0000
and a,mask,a shr 23

(Haven't coded on ARM since 2005 so the syntax may be incorrect).
+Jan Achrenius: yeah, that one should work fine, although nobody actually does traditional ARM any more (big, cumbersome, and the "every arithmetic instruction can do a shift" thing was just stupid). So in Thumb2 it won't be able to combine the and and shift, but the constant isn't that inconvenient, so I think it's the winner.

And on x86-32, gcc generates "lea" for me for the add, which it randomly didn't do for the subtract. That's just the compiler being odd, though, but it means that in practice your version is two instructions shorter due to just odd compiler interaction.

I wonder why the ff8000 value, though? Am I right that any value of the form 0xff** where (**) is in the range 0x0001..0xff00 will work?
This reminds me of working with BAL on the System/360. Yes, I am that old.
+Linus Torvalds Yes, anything between 0x0001 and 0xff00 works. If I ever start working as a programmer again I guess I need to have another look at Thumb2..
the new kernel 3.1 séries detect all my hardware of my netbook asus 201t :) Great work ;)
Error handling: the code we have been creating for this simple case of counting the bytes in the final mask, may be very efficient, but it doesn't handle errors gracefully. If for some reason "mask" is corrupted, the current "optimal" codes can generate very large "len" values, which may cause problems elsewhere. Wouldn't it be better to generate the "len" values of 0-3 for correct final mask values, and set "len" to zero for corrupted final mask values? Or generate an error condition for corrupted final mask values? I've often wondered how a robust operating system is created, what coding policies are implemented, which raises this question.
It must be a way to predict with a small accuracy the number of bytes or if its possible the right length in an byte mask.
+Chris Perleberg: if the byte mask is incorrect, I'd rather have the code fail spectacularly (kernel oops or similar due to unbounded pointer access etc) than subtly.

The code is be part of my new word-at-a-time pathname component comparison and hashing code, and if the byte-mask is ever anything but the four (or eight, on 64-bit) valid values, things are so seriously broken that blowing up is the best thing that can happen.
Linus, looks like many more people could contribute to open source, if presented with small isolated code fragments that need to be optimized, without requiring knowledge of the larger framework (e.g. linux kernel). I've enjoyed working on this code fragment, and would enjoy future opportunities, as I'm sure other people would. You could continue throwing code fragments our way, or perhaps something more general could be constructed, which could become a storehouse of highly optimzed fragments, that could be used by all.
+Chris Perleberg the "mask corrupted"-case is irrelevant, it means your CPU or RAM is physically broken, so let it crash and burn.
+Jeffrey Koebel There often are instructions for that, but they may be slow, and that is the problem on x86.
+Chris Perleberg: it was a fun code exercise, but realistically the few cycles don't really matter. This code is deep in the kernel, and most of the time it takes thousands of cycles to even get here.

That said, this is also an unusual case in that it happens to be really core functionality (looking up a path is very common) where we over the years have spent an insane amount of effort on making the name lookup caches work really really well. And it is one of the few places in the kernel where you actually can get a loop: looking up a pathname is quite often a small loop of looking up path components.

But even then the loop counts are not "hundreds" - usually the number of path components is in the single digits (and often that single digit is "one"). But some loads will do almost nothing but look up pathnames, so there are loops at higher levels (think web servers, but also think things like "make" on a big project - one of my test-cases was the Linux kernel "make -j" when the kernel was fully built: it spends a fair amount of time looking up object files and the files those depend on to check the timestamps to see that it is fully built).

Very seldom do we have cases where "tricks" help. But I agree that it was fun to work on this way - and in fact the reason I so like working in open source is that open source is all about communication in general (although usually about higher-level problems and bigger code sequences than this ;)
hello,linus,i'm a Chinese i start to read The Art of Linux Kernel Design. it's difficult for me to understand right now, but i know it really is very cool!
118 comments! I guess that's what happens when the alpha-coder asks for ideas G
Have a look at the ripples you're making +Linus Torvalds , i find it depressing that the fun and learning experience of these bitwise tricks didn't get more coverage =P (And tricks like this should be better documented since this seems to be a dying art with new programmers =/)
one linus question, and there you have it: "hacker meeting begins" :D
There is a tiny wizard inside my computer that tells me the length of my path string. The sparks in there keep him warm. I can only joke, because the only high level programming language I know IS a joke.
Linus Torvalds If you're looking to offer some speed where bit scan forward instruction doesnt exist might I suggest a fast bit-scan-forward table? Something along the lines of:
#ifdef HAS_BSF
size_t f;
_asm_ ("bsf %1,%0" :"=r"(f):"rm"(mask));
size_t f = ((unsigned char)mask)?table[(unsigned char)mask] : table[mask>>8]+8;
static const size_t table[256] = {

also to be noted, GCC has a builtin for bsf too (which fairs out well too)
size_t f = __builtin_ctz(mask);
Hi Linus, Good Job with the Kernel's deserters! I enjoyed ur comments, they don't understand what The Kernel means.. Sincerely.
Does this help at all:
len += (mask >> (sizeof(unsigned long)-1)) >> 3;
instead of
len += mask >> 8*(sizeof(unsigned long)-1);
Saludos desde Venezuela Linus gracias a ti tenemos Canaima 3.0
I admire you.
oi linus vc sabia que vce e coisa mais linda que jesus tem eu sou florianopolis
Hello Mr.Torvalds you are amazing and I have a request here goes I think it would be incredible if you did an interview on the Linux Action Show (on
Okay, so a little part of my brain is going to think about this every time I read a string on a new Linux box... :) Okay, make that a path name...
Add a comment...