| Author |
Message |
Dan Koren
Guest
|
Posted:
Tue Sep 27, 2005 10:04 pm Post subject:
Re: Implementation of pop count instruction |
|
|
"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:dhb5u7$8oh$1@gemini.csx.cam.ac.uk...
| Quote: |
And God help you if you have similar problems with the TLB :-(
|
Well, the TLB's are not much of a
problem if the architecture allows
multiple page sizes and one feels
comfortable modifying kernel code ;-)
dk |
|
| Back to top |
|
 |
Nick Maclaren
Guest
|
Posted:
Tue Sep 27, 2005 10:36 pm Post subject:
Re: Implementation of pop count instruction |
|
|
In article <43397b8d$1@news.meer.net>, Dan Koren <dankoren@yahoo.com> wrote:
| Quote: | "Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:dhb5u7$8oh$1@gemini.csx.cam.ac.uk...
And God help you if you have similar problems with the TLB :-(
Well, the TLB's are not much of a
problem if the architecture allows
multiple page sizes and one feels
comfortable modifying kernel code ;-)
|
Fine - in an embedded system. But most systems don't allow every
application to load its own kernel.
On almost every modern, general-purpose system, if you hit the TLB
hard enough, you can make an unprivileged program cause major system
problems. The cause of the problem is that TLB misses are taken at
a fixed (high) priority, irrespective of the priority of the calling
thread.
Most of them that have been around for a while have had enough hacks
put in that this is usually not too serious for systems with a few
CPUs, but it can cause real havoc on the larger ones.
Regards,
Nick Maclaren. |
|
| Back to top |
|
 |
Nick Maclaren
Guest
|
Posted:
Tue Sep 27, 2005 10:38 pm Post subject:
Re: Implementation of pop count instruction |
|
|
In article <dhbrcs$u0$1@news1nwk.SFbay.Sun.COM>,
Seongbae Park <Seongbae.Park@Sun.COM> wrote:
| Quote: |
It's not the code that computes the popc that slows down
the emulation by 1000x
but handling a trap that's causing 1000+ cycles overhead.
A user-level trap handler is a bit better
but it still has the overhead in the order of hundred cycles
and unfortunately, the illegal_instruction trap can not be
handled in the user-level trap handler.
|
As I have posted before, there is absolutely no reason why an
instruction emulation trap should cost more than a mispredicted
broanch plus a function call. Vastly higher costs are a sign of
someone being incompetent - and note that I am NOT saying that
this is necessarily an engineer.
Regards,
Nick Maclaren. |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:dhc03d$66a$1@gemini.csx.cam.ac.uk...
| Quote: | In article <dhbrcs$u0$1@news1nwk.SFbay.Sun.COM>,
Seongbae Park <Seongbae.Park@Sun.COM> wrote:
It's not the code that computes the popc that slows down
the emulation by 1000x
but handling a trap that's causing 1000+ cycles overhead.
A user-level trap handler is a bit better
but it still has the overhead in the order of hundred cycles
and unfortunately, the illegal_instruction trap can not be
handled in the user-level trap handler.
As I have posted before, there is absolutely no reason why an
instruction emulation trap should cost more than a mispredicted
broanch plus a function call.
|
Or less...
| Quote: | Vastly higher costs are a sign of someone being incompetent -
|
Not necessarily. They could also be over-competent. Or they
may have designed their implementation using UML and strictly
following OO design principles ;-)
| Quote: | and note that I am NOT saying that this is necessarily an
engineer.
|
It could certainly be a computer scientist! ;-)
dk |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3kkq5.qri8kz14y46vqN%nospam@ab-katrinedal.dk...
| Quote: | Dan Koren <dankoren@yahoo.com> wrote:
"Seongbae Park" <Seongbae.Park@Sun.COM> wrote in message
news:dhbrcs$u0$1@news1nwk.SFbay.Sun.COM...
Dan Koren <dankoren@yahoo.com> wrote:
...
Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz. Considering how fast these
run, I am not sure a pure hardware POPC
implemetation is justified.
I'm kind of suprised that a table-lookup based POPC
takes only 1.67ns - that's only 2 cycles on a 1.2GHz
processor!
Yep. I will be happy to send you the code to
test yourself. I think the key to it is that
the code is highly parallelizable. BTW I was
surprised too, especially since I had tested
all the other algorithms first.
Couldn't you just post it here. I am sure there
are others wondering what magic you do.
|
I don't do no magic!!! As others have already
noticed, I am a very dumb person who likes to
borrow algorithms invented by others smarter
than I can ever be.
A table lookup implementation of a 32-bit POPC
function is shown on:
http://www.hackersdelight.org/HDcode/pop.cc
as the pop6() function. This is consistently
as fast as, or faster than any of the other
implementations listed on the same page (at
least to the extent I have tested them).
Extending the code to handle 64 bit words
is pretty straightforward even for low
IQ types like myself.
dk |
|
| Back to top |
|
 |
Christian Bau
Guest
|
Posted:
Wed Sep 28, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
In article <dhbcsk$s8f$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
| Quote: | It is seldom that a single word popcount() is on the critical path, more
often you need to count an array of bits, right?
In that case it is almost ceratinly faster to start with a series of
bitslice full adders, this can reduce each set of 15 input words in to
just 4 result words which then has to be counted.
Since the four result words can be counted in parallel, it is at this
point faster to use one or more of the shift & mask algorithms, instead
of a byte-wide table lookup.
|
PowerPC with Altivec instructions can do this especially fast.
Read 128 bit into vector register.
Copy the vector register, shifted by four bits to the right.
Use a vec_perm instruction to translate lower four bits in
each of sixteen bytes to a counter.
Use another vec_perm register with the shifted register as input.
Add two vectors -> 16 byte counters with values 0..8
Repeat with up to 30 more sets of 128 bit vectors ->
16 byte counters with values 0..248.
At that point you need to combine the 8 bit counters into 16 bit
counters. The inner loop of this takes one read, one shift, two vector
permutes, and two vector adds per 128 bit. Once you reach L2 cache you
are memory bandwidth limited. |
|
| Back to top |
|
 |
Niels Jørgen Kruse
Guest
|
Posted:
Wed Sep 28, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren <dankoren@yahoo.com> wrote:
| Quote: | "Seongbae Park" <Seongbae.Park@Sun.COM> wrote in message
news:dhbrcs$u0$1@news1nwk.SFbay.Sun.COM...
Dan Koren <dankoren@yahoo.com> wrote:
...
Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz. Considering how fast these
run, I am not sure a pure hardware POPC
implemetation is justified.
I'm kind of suprised that a table-lookup based POPC
takes only 1.67ns - that's only 2 cycles on a 1.2GHz
processor!
Yep. I will be happy to send you the code to
test yourself. I think the key to it is that
the code is highly parallelizable. BTW I was
surprised too, especially since I had tested
all the other algorithms first.
|
Couldn't you just post it here. I am sure there are others wondering
what magic you do.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Sep 28, 2005 8:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
Seongbae Park wrote:
| Quote: | It's not the code that computes the popc that slows down
the emulation by 1000x
but handling a trap that's causing 1000+ cycles overhead.
A user-level trap handler is a bit better
but it still has the overhead in the order of hundred cycles
and unfortunately, the illegal_instruction trap can not be
handled in the user-level trap handler.
|
Which means that popc should be compiled with enough NOPs around it that
a trap handler can backpatch the opcode into a CALL doing the same job.
If you do this, then it also becomes easy to tell the compiler that popc
should only be used on one specific register, obviating the need for
either multiple emulation routines or much more complicated setup logic.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Sep 28, 2005 8:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren wrote:
| Quote: | "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
It is seldom that a single word popcount() is
on the critical path, more often you need to
count an array of bits, right?
Please don't make assumptions about code you
have not seen, or know what is supposed to do.
|
Dan, I am not assuming that, I simply guessed and asked for confirmation!
As long as a single popcount is indeed a critical operation, then table
lookup _is_ a good idea. As I noted I have indeed BTDT, in fact I have
also tested multiple versions of the lookup table algorithm, based on
varying table sizes:
For a relatively tight loop, with no other signficant table access, on
an x86 with 4-way L1 associativity, it was a win to use a 2048-entry
table which reduced a 32-bit popcount into two shifts, two masks, three
table lookups and two adds.
I also tested 4, 5, 6 and 7-bit table indices, but never found a case
where this was faster than a byte-wide table. Due to the availability of
byte instructions it is of course signficantly easier to split a
register into byte-sized chunks than any other widths.
| Quote: |
Your argument above reminds me of the old MIPS
party line that justified the lack of atomic
memory update instructions. It isn't always
the direct cost of doing (or not) something
that matters. In some cases, the indirect
costs can be just as important.
|
Yes, I do know that, in fact this is what most of the other people
posting have been saying as well. :-)
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 8:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:dhdbl8$rbl$1@osl016lin.hda.hydro.com...
| Quote: | Seongbae Park wrote:
It's not the code that computes the popc that slows down
the emulation by 1000x
but handling a trap that's causing 1000+ cycles overhead.
A user-level trap handler is a bit better
but it still has the overhead in the order of hundred cycles
and unfortunately, the illegal_instruction trap can not be
handled in the user-level trap handler.
Which means that popc should be compiled with enough NOPs around it that
a trap handler can backpatch the opcode into a CALL doing the same job.
If you do this, then it also becomes easy to tell the compiler that popc
should only be used on one specific register, obviating the need for
either multiple emulation routines or much more complicated setup logic.
|
Considering how fast the table lookup popc
function works I see no reason for a trap
handler. Why be fixated on an opcode? ;-)
dk |
|
| Back to top |
|
 |
Niels Jørgen Kruse
Guest
|
Posted:
Wed Sep 28, 2005 1:31 pm Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren <dankoren@yahoo.com> wrote:
| Quote: | I don't do no magic!!! As others have already
noticed, I am a very dumb person who likes to
borrow algorithms invented by others smarter
than I can ever be.
A table lookup implementation of a 32-bit POPC
function is shown on:
http://www.hackersdelight.org/HDcode/pop.cc
as the pop6() function.
|
Well, that is fairly trivial, but how do you make it go in 2 clocks?
Subtracting (too much) loop overhead?
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
| Back to top |
|
 |
Laurent Desnogues
Guest
|
Posted:
Wed Sep 28, 2005 3:44 pm Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren wrote:
| Quote: |
It may not be "obvious" to you since
you have not tested the possible
implementations.
|
Well I have been testing bit stuff for some time now.
| Quote: | I have. POPC is
not the "bottleneck" by itself,
but it is a key ingredient in a
solution that replaces the true
bottleneck (the memory allocator)
in my "application".
|
In one of my applications popcount using shifting and
masking is faster than byte lookup tables. Of course
if I benchmark the lookup method on its own, it is
faster. And that's why I asked you to provide some
more information about your claim:
I have recently
coded and compared all the POPC algorithms
presented there. Not surprisingly it turns
out that the fastest one is to do byte by
byte lookups through a 256-entry table,
which can be easily parallelized and
implemented in hardware.
One reading your claim may come to the wrong conclusion.
For information, my application is just an Othello
position enumerator. It uses 2x64 bit quantities to
represent the board. Score is computed using popcount.
If I disable hash table on my program, lookup table popc
is faster, but if I enable HT, it's not the case anymore.
(This is true on an Opteron and a Sun Blade 1500 both
running 64 bit OSes.)
| Quote: | And thanks for reminding me about testing
the full application. FYI I happen to be
person who cracked some very nasty scaling
problems with Oracle on RISC architectures
due to cache trashing (paper presented at
the IOUW in 1990).
|
Sorry Dan, I did not mean to put your skills in doubt, I
was just trying to get more information about a claim
that contradicts my observation.
And I don't want people to falsely believe that what is
true in one case is always true which your first message
could lead to believe :)
Laurent |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 4:04 pm Post subject:
Re: Implementation of pop count instruction |
|
|
""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lh40.rv487f13k047aN%nospam@ab-katrinedal.dk...
| Quote: | Dan Koren <dankoren@yahoo.com> wrote:
I don't do no magic!!! As others have already
noticed, I am a very dumb person who likes to
borrow algorithms invented by others smarter
than I can ever be.
A table lookup implementation of a 32-bit POPC
function is shown on:
http://www.hackersdelight.org/HDcode/pop.cc
as the pop6() function.
Well, that is fairly trivial, but
how do you make it go in 2 clocks?
|
Courtesy Sun Studio 10 C compiler at
optimization level O5 plus inlining.
;-)
dk |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 4:05 pm Post subject:
Re: Implementation of pop count instruction |
|
|
"Dan Koren" <dankoren@yahoo.com> wrote in message
news:433a78aa$1@news.meer.net...
| Quote: | ""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lh40.rv487f13k047aN%nospam@ab-katrinedal.dk...
Dan Koren <dankoren@yahoo.com> wrote:
I don't do no magic!!! As others have already
noticed, I am a very dumb person who likes to
borrow algorithms invented by others smarter
than I can ever be.
A table lookup implementation of a 32-bit POPC
function is shown on:
http://www.hackersdelight.org/HDcode/pop.cc
as the pop6() function.
Well, that is fairly trivial, but
how do you make it go in 2 clocks?
Courtesy Sun Studio 10 C compiler at
optimization level O5 plus inlining.
|
If memory serves, the Ultra SPARC III+
can issue up to four instructions per
clock cycle -- can someone confirm or
refute this notion?
Thx,
dk |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Sep 28, 2005 4:15 pm Post subject:
Re: Implementation of pop count instruction |
|
|
Niels Jørgen Kruse wrote:
| Quote: | Dan Koren <dankoren@yahoo.com> wrote:
""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
Well, that is fairly trivial, but
how do you make it go in 2 clocks?
Courtesy Sun Studio 10 C compiler at
optimization level O5 plus inlining.
Byte at a time table lookup on a 8 byte operand would seem to require 8
load instructions. The USIII can hardly do 4 loads per clock!
Show us the timing loop. There must be opportunities for the compiler to
make things evaporate.
|
Since even L1 cache won't allow more than 2 loads/cycle, it is pretty
obvious that a byte-table based popc can't actually run any faster than
8/2=4 cycles, and in fact it has to take at least two more to add the
last two counts into the accumulated value.
With an infinitely wide (>= 8 simultaneous load ports) L1 cache, you
could conceivably do all 8 loads during the first cycle, then add pairs
of results (8 -> 4, 4 -> 2 and 2 -> 1) during the next three cycles, but
this is the absolute 'speed of light' limit for this algorithm.
It still disregards the time needed to break the 64-bit value into 8
separate bytes!
So what is the real minimum, assuming three-operand instructions,
single-cycle shifts, masks and adds, as well as single-cycle loads with
no address generation interlock:
Splitting 64-bits into 8 parts requires 7 shifts and 7 masks, for a
total of 14 instructions, unless you have byte extract opcodes, in which
case you can get away with just 8.
The 8 loads are another 8 instructions.
The merging of 8 table lookup values into a single sum must always take
7 instructions, but many can be scheduled in parallel, as shown above.
Anyway, that still gives an absolute limit of 23 or 30 instructions
which must take 6-8 clock cycles, and in fact will use more!
With a maximum of 4 instructions/cycle, of any kind, the fastest
sequence I can come up with is:
(input value in n)
1: t1 = n >> 8; t2 = n >> 16; t3 = n >> 24; n &= 255;
2: count = table[n]; t4 = t1 >> 24; t1 &= 255; t2 &= 255;
3: tc1 = table[t1]; tc2 = table[t2]; t5 = t3 >> 16; t3 &= 255;
4: tc3 = table[t3]; t6 = t5 >> 8; t7 = t5 >> 16; t4 &= 255;
5: tc4 = table[t4]; t5 &= 255; t6 &= 255; tc7 = table[t7];
6: tc5 = table[t5]; tc6 = table[t6]; count += tc1; tc2 += tc3;
7: tc4 += tc5; tc6 += tc7; count += tc2;
8: count += tc4;
9: count += tc6;
It might be possible to tweak this to get down to 8, but I believe you
also have to schedule at least one cycle between address generation and
the load operation.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
|
|
|
|