Implementation of pop count instruction
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
 FAQFAQ   MemberlistMemberlist     RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Google
 
Web castalk.com
Implementation of pop count instruction
Goto page Previous  1, 2, 3, 4, 5  Next
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture
Author Message
Laurent Desnogues
Guest





Posted: Tue Sep 27, 2005 1:15 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Dan Koren wrote:
[...]
Quote:

Take a look at http://aggregate.org/MAGIC and
www.hackersdelight.org for various ideas on
how to implement POPC using bit shifts and
bitwise logical operations. I have recently
coded and compared all the POPC algorithms
presented there. Not surprisingly it turns
out that the fastest one is to do byte by
byte lookups through a 256-entry table,
which can be easily parallelized and
implemented in hardware. Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz.

You really should take results measured in tight loops with
a grain of salt when using table lookups. It often happens
that when your code is put in real code (as opposed to benchmark
loop code) the result is very different due to cache trashing,
even when the table is small.


Laurent
Back to top
Grumble
Guest





Posted: Tue Sep 27, 2005 1:34 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Stephen Fuld wrote:
Quote:
I have been thinking about how a pop count instruction would be
implemented in hardware. I have come up with several possibilities,
but none of them seem "optimal".

As far as Itanium 2 is concerned, popcnt (I9) is considered a
multimedia instruction and runs in 2 cycles (add an extra cycle
if the consumer is not another multimedia instruction).

Perhaps Dale (Morris) can comment on the implementation? ;-)
Back to top
Nick Maclaren
Guest





Posted: Tue Sep 27, 2005 2:07 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

In article <dhav2r$lrf$1@home.itg.ti.com>,
Laurent Desnogues <l-desnogues_delme@ti.com> writes:
|>
|> You really should take results measured in tight loops with
|> a grain of salt when using table lookups. It often happens
|> that when your code is put in real code (as opposed to benchmark
|> loop code) the result is very different due to cache trashing,
|> even when the table is small.

Indeed :-(


Regards,
Nick Maclaren.
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 2:13 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Laurent Desnogues" <l-desnogues_delme@ti.com> wrote in message
news:dhav2r$lrf$1@home.itg.ti.com...
Quote:
Dan Koren wrote:
[...]

Take a look at http://aggregate.org/MAGIC and
www.hackersdelight.org for various ideas on
how to implement POPC using bit shifts and
bitwise logical operations. I have recently
coded and compared all the POPC algorithms
presented there. Not surprisingly it turns
out that the fastest one is to do byte by
byte lookups through a 256-entry table,
which can be easily parallelized and
implemented in hardware. Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz.

You really should take results measured in tight loops with
a grain of salt when using table lookups. It often happens
that when your code is put in real code (as opposed to benchmark
loop code) the result is very different due to cache trashing,
even when the table is small.



Hhmmm.... Do you realize the
entire table is 256 bytes and
fits completely in one or two
cache lines?



dk
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 2:25 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Dan Koren" <dankoren@yahoo.com> wrote in message
news:43390d28$1@news.meer.net...
Quote:
"Laurent Desnogues" <l-desnogues_delme@ti.com> wrote in message
news:dhav2r$lrf$1@home.itg.ti.com...
Dan Koren wrote:
[...]

Take a look at http://aggregate.org/MAGIC and
www.hackersdelight.org for various ideas on
how to implement POPC using bit shifts and
bitwise logical operations. I have recently
coded and compared all the POPC algorithms
presented there. Not surprisingly it turns
out that the fastest one is to do byte by
byte lookups through a 256-entry table,
which can be easily parallelized and
implemented in hardware. Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz.

You really should take results measured in tight loops with
a grain of salt when using table lookups. It often happens
that when your code is put in real code (as opposed to benchmark
loop code) the result is very different due to cache trashing,
even when the table is small.

Hhmmm.... Do you realize the
entire table is 256 bytes and
fits completely in one or two
cache lines?



Not to mention that this code is
executed so frequently by the
library routines that rely on
it that it stays in the cache.

When did you last check cache
sizes for large server systems?

The HP system on which I test
my libraries has 3 MB L1 cache
per chip (1.5 MB per core) and
32 MB L2 cache (shared).



dk
Back to top
Laurent Desnogues
Guest





Posted: Tue Sep 27, 2005 2:46 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Dan Koren wrote:
[...]
Quote:

Not to mention that this code is
executed so frequently by the
library routines that rely on
it that it stays in the cache.

When did you last check cache
sizes for large server systems?

The HP system on which I test
my libraries has 3 MB L1 cache
per chip (1.5 MB per core) and
32 MB L2 cache (shared).

Due to my work, I know pretty well current cache sizes :)

Now, unless popcount is really the bottleneck of your
application or your application completely fits in L1
cache, it is just not obvious using table lookups, even
small, is a win. And for sure, just benchmarking popc
in a loop won't tell you much about the behavior of popc
in a real application.

I am not telling that you are wrong, I just think you
did not provide enough evidence you are right ;-)


Laurent
Back to top
Nick Maclaren
Guest





Posted: Tue Sep 27, 2005 3:12 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

In article <43391010@news.meer.net>,
"Dan Koren" <dankoren@yahoo.com> writes:
|> > "Laurent Desnogues" <l-desnogues_delme@ti.com> wrote in message
|> > news:dhav2r$lrf$1@home.itg.ti.com...
|> >>
|> >> You really should take results measured in tight loops with
|> >> a grain of salt when using table lookups. It often happens
|> >> that when your code is put in real code (as opposed to benchmark
|> >> loop code) the result is very different due to cache trashing,
|> >> even when the table is small.
|> >
|> > Hhmmm.... Do you realize the
|> > entire table is 256 bytes and
|> > fits completely in one or two
|> > cache lines?
|>
|> Not to mention that this code is
|> executed so frequently by the
|> library routines that rely on
|> it that it stays in the cache.

Sigh. I can confirm Laurent's comment from actual measurements.
You have missed the point.

The effect of introducing such a table is effectively to reduce
the associativity by one for the range of lines covered by the
table. This can be harmless, or can cause meltdown. It will
fairly rarely cause major problems with 4-way associativity, but
will quite commonly do so with 2-way and very commonly with
direct mapped.

And God help you if you have similar problems with the TLB :-(


Regards,
Nick Maclaren.
Back to top
Terje Mathisen
Guest





Posted: Tue Sep 27, 2005 4:15 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Dan Koren wrote:
Quote:
Not to mention that this code is
executed so frequently by the
library routines that rely on
it that it stays in the cache.

Which lib routines would that be?

Do they count single words, or arrays?

Quote:
When did you last check cache
sizes for large server systems?

I used to do absolutely everything via (sometimes very tricky) lookup
tables, but lately I've been trying to avoid them if possible, because
they can more often be beaten by other approaches.

OTOH, my currently fastest word count algorithm still uses a 64K initial
lookup table! :-)
Quote:

The HP system on which I test
my libraries has 3 MB L1 cache
per chip (1.5 MB per core) and
32 MB L2 cache (shared).

Nice. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Terje Mathisen
Guest





Posted: Tue Sep 27, 2005 4:15 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Dan Koren wrote:
Quote:
Take a look at http://aggregate.org/MAGIC and
www.hackersdelight.org for various ideas on
how to implement POPC using bit shifts and
bitwise logical operations. I have recently
coded and compared all the POPC algorithms

OK, I've BTDT, multiple times...

Quote:
presented there. Not surprisingly it turns
out that the fastest one is to do byte by
byte lookups through a 256-entry table,
which can be easily parallelized and
implemented in hardware. Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz. Considering how fast these
run, I am not sure a pure hardware POPC
implemetation is justified.

It is seldom that a single word popcount() is on the critical path, more
often you need to count an array of bits, right?

In that case it is almost ceratinly faster to start with a series of
bitslice full adders, this can reduce each set of 15 input words in to
just 4 result words which then has to be counted.

Since the four result words can be counted in parallel, it is at this
point faster to use one or more of the shift & mask algorithms, instead
of a byte-wide table lookup.
Quote:

I was forced through this exercise after
finding out that SPARC v9 POPC was not
implemented in hardware but emulated by
a trap handler that is about 1000x slower
than the obvious table lookup algorithm
(wondering who wrote it....).

Ouch!

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Seongbae Park
Guest





Posted: Tue Sep 27, 2005 4:18 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Dan Koren <dankoren@yahoo.com> wrote:
....
Quote:
Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz. Considering how fast these
run, I am not sure a pure hardware POPC
implemetation is justified.

I'm kind of suprised that a table-lookup based POPC
takes only 1.67ns - that's only 2 cycles on a 1.2GHz processor!
I don't know whether that number is a latency or throughput,
but either case it seems incredible that you can do that.
Even assuming that you can schedule instructions perfectly
to hide all load latencies,
2 cycles is enough for 2 loads and 4 ALU on US3,
and I'm not sure that's enough to compute 8-byte POPC
(or am I not fully awake yet and got the numbers totally wrong ?)

Just released UltraSPARC-IV+ is the first SPARC from Sun
that implements POPC in hardware.
IIRC, the latency is 6 cycles and throughput is one every 4 cycles.

Quote:
I was forced through this exercise after
finding out that SPARC v9 POPC was not
implemented in hardware but emulated by
a trap handler that is about 1000x slower
than the obvious table lookup algorithm
(wondering who wrote it....).

It's not the code that computes the popc that slows down
the emulation by 1000x
but handling a trap that's causing 1000+ cycles overhead.
A user-level trap handler is a bit better
but it still has the overhead in the order of hundred cycles
and unfortunately, the illegal_instruction trap can not be
handled in the user-level trap handler.

Quote:
Hope this helps.

dk
--

#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 9:21 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Laurent Desnogues" <l-desnogues_delme@ti.com> wrote in message
news:dhb4d3$osb$1@home.itg.ti.com...
Quote:
Dan Koren wrote:
[...]

Not to mention that this code is
executed so frequently by the
library routines that rely on
it that it stays in the cache.

When did you last check cache
sizes for large server systems?

The HP system on which I test
my libraries has 3 MB L1 cache
per chip (1.5 MB per core) and
32 MB L2 cache (shared).

Due to my work, I know pretty well current cache sizes :)

Now, unless popcount is really the bottleneck of your
application or your application completely fits in L1
cache, it is just not obvious using table lookups, even
small, is a win.


It may not be "obvious" to you since
you have not tested the possible
implementations. I have. POPC is
not the "bottleneck" by itself,
but it is a key ingredient in a
solution that replaces the true
bottleneck (the memory allocator)
in my "application".


Quote:
And for sure, just benchmarking popc
in a loop won't tell you much about the behavior of popc in a real
application.


And what makes you think that I only
tested it in a loop?


Quote:
I am not telling that you are wrong, I just think you did not provide
enough evidence you are right ;-)


I wasn't aware this was a design review.

And thanks for reminding me about testing
the full application. FYI I happen to be
person who cracked some very nasty scaling
problems with Oracle on RISC architectures
due to cache trashing (paper presented at
the IOUW in 1990).



dk
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 9:23 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:dhb5u7$8oh$1@gemini.csx.cam.ac.uk...
Quote:

In article <43391010@news.meer.net>,
"Dan Koren" <dankoren@yahoo.com> writes:
|> > "Laurent Desnogues" <l-desnogues_delme@ti.com> wrote in message
|> > news:dhav2r$lrf$1@home.itg.ti.com...
|
|> >> You really should take results measured in tight loops with
|> >> a grain of salt when using table lookups. It often happens
|> >> that when your code is put in real code (as opposed to benchmark
|> >> loop code) the result is very different due to cache trashing,
|> >> even when the table is small.
|
|> > Hhmmm.... Do you realize the
|> > entire table is 256 bytes and
|> > fits completely in one or two
|> > cache lines?
|
|> Not to mention that this code is
|> executed so frequently by the
|> library routines that rely on
|> it that it stays in the cache.

Sigh. I can confirm Laurent's comment from actual measurements.

"Actual measurements" of my code ?!?

Quote:
You have missed the point.

Not any more than you did.

Quote:
The effect of introducing such a table is effectively to reduce
the associativity by one for the range of lines covered by the
table. This can be harmless, or can cause meltdown. It will
fairly rarely cause major problems with 4-way associativity, but
will quite commonly do so with 2-way and very commonly with
direct mapped.

And God help you if you have similar problems with the TLB :-(


God help HP ;-)



dk
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 9:24 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:dhbd3u$sf6$1@osl016lin.hda.hydro.com...
Quote:
Dan Koren wrote:
Not to mention that this code is
executed so frequently by the
library routines that rely on
it that it stays in the cache.

Which lib routines would that be?

Do they count single words, or arrays?



One word only.



dk
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 9:27 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:dhbcsk$s8f$1@osl016lin.hda.hydro.com...
Quote:
Dan Koren wrote:
Take a look at http://aggregate.org/MAGIC and
www.hackersdelight.org for various ideas on
how to implement POPC using bit shifts and
bitwise logical operations. I have recently
coded and compared all the POPC algorithms

OK, I've BTDT, multiple times...

presented there. Not surprisingly it turns
out that the fastest one is to do byte by
byte lookups through a 256-entry table,
which can be easily parallelized and
implemented in hardware. Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz. Considering how fast these
run, I am not sure a pure hardware POPC
implemetation is justified.

It is seldom that a single word popcount() is
on the critical path, more often you need to
count an array of bits, right?



Please don't make assumptions about code you
have not seen, or know what is supposed to do.

Your argument above reminds me of the old MIPS
party line that justified the lack of atomic
memory update instructions. It isn't always
the direct cost of doing (or not) something
that matters. In some cases, the indirect
costs can be just as important.



dk
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 10:02 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Seongbae Park" <Seongbae.Park@Sun.COM> wrote in message
news:dhbrcs$u0$1@news1nwk.SFbay.Sun.COM...
Quote:
Dan Koren <dankoren@yahoo.com> wrote:
...
Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz. Considering how fast these
run, I am not sure a pure hardware POPC
implemetation is justified.

I'm kind of suprised that a table-lookup based POPC
takes only 1.67ns - that's only 2 cycles on a 1.2GHz
processor!


Yep. I will be happy to send you the code to
test yourself. I think the key to it is that
the code is highly parallelizable. BTW I was
surprised too, especially since I had tested
all the other algorithms first.


Quote:
I don't know whether that number is a latency
or throughput, but either case it seems incredible
that you can do that.


Let's see.... the unit is ns... it must be
latency.... then maybe not... let's say we
get one result every 4 cycles, still pretty
good.... ;-)

Quote:
Even assuming that you can schedule instructions
perfectly to hide all load latencies, 2 cycles is
enough for 2 loads and 4 ALU on US3,


Thank to you, my friend, and to the Sun compiler
team! The Sun Studio 9/10 compilers rock! (or at
least for my code). Pop the cork!


Quote:
and I'm not sure that's enough to compute 8-byte
POPC (or am I not fully awake yet and got the
numbers totally wrong ?)

Just released UltraSPARC-IV+ is the first SPARC
from Sun that implements POPC in hardware. IIRC,
the latency is 6 cycles and throughput is one
every 4 cycles.


Arrange for me a remote login on a IV+ for 24
hours and I will be happy to re-test all the
POPC implementations and publish the results
for the benefit of all mankind (after I get
back from vacation on October 16th).

Quote:

I was forced through this exercise after
finding out that SPARC v9 POPC was not
implemented in hardware but emulated by
a trap handler that is about 1000x slower
than the obvious table lookup algorithm
(wondering who wrote it....).

It's not the code that computes the popc that
slows down the emulation by 1000x but handling
a trap that's causing 1000+ cycles overhead.


Understood. IMHO I see no justification for a 1000+
cycle overhead to call a trap handler. There must be
a more efficient way to write trap handlers.


Quote:
A user-level trap handler is a bit better but it
still has the overhead in the order of hundred
cycles and unfortunately, the illegal_instruction
trap can not be handled in the user-level trap
handler.


Well, one doesn't have to handle it as an illegal
instruction. Just put it in a library routine that
calls an assigned registered user level trap handler.... ;-)



dk
Back to top
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture All times are GMT
Goto page Previous  1, 2, 3, 4, 5  Next
Page 2 of 5

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




VoIP Electronics Powered by phpBB