Implementation of pop count instruction
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
 FAQFAQ   MemberlistMemberlist     RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Google
 
Web castalk.com
Implementation of pop count instruction
Goto page 1, 2, 3, 4, 5  Next
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture
Author Message
Stephen Fuld
Guest





Posted: Mon Sep 26, 2005 7:31 am    Post subject: Implementation of pop count instruction Reply with quote

I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".

One could have a tree of adders - so for a 64 bit register, 32 two bit
adders feeding 16 four bit adders, etc. This works, but takes six cycles,
which seems like a lot. (maybe it doesn't matter given how infrequently
such an instruction is executed.)

One could have a look up ROM, but it would have to be pretty big to save any
time. For example, with a 64K by 5 ROM, one could do it in 4 lookups and
pipeline all but the last the add, but this still takes five cycles. If you
could use a dual ported ROM (are there such things?), you could cut it down
to about 3 cycles.

You could just use a humongo hunk of random logic and get the answer in a
single cycle, but the amount of logic required seems to be way to large to
be practical.

I guess if you allowed analog circuitry, you could somehow input all the
signals into some kind of A to D converter, but that seems to way out.

So, for real CPUs that have this instruction, how is it implemented?

--
- Stephen Fuld
e-mail address disguised to prevent spam
Back to top
Derek Gladding
Guest





Posted: Mon Sep 26, 2005 8:12 am    Post subject: Re: Implementation of pop count instruction Reply with quote

Stephen Fuld wrote:

Quote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be
implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".

One could have a tree of adders - so for a 64 bit register, 32 two bit
adders feeding 16 four bit adders, etc. This works, but takes six cycles,
which seems like a lot. (maybe it doesn't matter given how infrequently
such an instruction is executed.)

One could have a look up ROM, but it would have to be pretty big to save
any
time. For example, with a 64K by 5 ROM, one could do it in 4 lookups and
pipeline all but the last the add, but this still takes five cycles. If
you could use a dual ported ROM (are there such things?), you could cut it
down to about 3 cycles.

You could just use a humongo hunk of random logic and get the answer in a
single cycle, but the amount of logic required seems to be way to large to
be practical.

I guess if you allowed analog circuitry, you could somehow input all the
signals into some kind of A to D converter, but that seems to way out.

So, for real CPUs that have this instruction, how is it implemented?


For the one that I have personal knowledge of: an initial layer of (4,2)
compressors then some fast adders in a tree as you describe.

32 bits -> 8 x (4,2) compressors -> 4 x 2 bit adders ...

- Derek
Back to top
Terje Mathisen
Guest





Posted: Mon Sep 26, 2005 8:15 am    Post subject: Re: Implementation of pop count instruction Reply with quote

Stephen Fuld wrote:
Quote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".

One could have a tree of adders - so for a 64 bit register, 32 two bit
adders feeding 16 four bit adders, etc. This works, but takes six cycles,
which seems like a lot. (maybe it doesn't matter given how infrequently
such an instruction is executed.)

The most obvious idea would be a set of carry-save adders, and a final
regular (carry-select maybe?) adder to merge the final result.

A full adder takes three inputs and generates two outputs, with
something like 3-4 (wild guess!) gate delays.

You can even use a tree structure of FAs to directly count the number of
bits:

10 full adders will reduce 30+2 input bits to 20+2, another 7 can take
21+1 down to 14+1. 5 more to get to 10 (9+1) bits. Next is 3 FAs to
generate 6+1, then 1 or 2 FAs to get the final result.

The fastest sw bitcount code I know about (Hi Rob!) uses Full-Add
operations on 64-bit wide registers (Alpha) to first merge 15*64 bits
down to 4*64 result registers, then a simultaneous binary merge of each
of these four result regs to get the final count.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Derek Gladding
Guest





Posted: Mon Sep 26, 2005 8:15 am    Post subject: Re: Implementation of pop count instruction Reply with quote

Stephen Fuld wrote:

Quote:
I am not a hardware guy, and this certainly isn't a homework problem, so

[snip]

Oops in previous post. (It was a machine from ten years ago, so memory
slipped and I forgot to sanity check until after posting).

I should have said 10 (3,2) compressors, one slightly degenerate compressor,
and then an adder tree.

- Derek
Back to top
Stephen Fuld
Guest





Posted: Mon Sep 26, 2005 4:15 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:dh84l7$4n8$1@osl016lin.hda.hydro.com...
Quote:
Stephen Fuld wrote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be
implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".

One could have a tree of adders - so for a 64 bit register, 32 two bit
adders feeding 16 four bit adders, etc. This works, but takes six
cycles,
which seems like a lot. (maybe it doesn't matter given how infrequently
such an instruction is executed.)

The most obvious idea would be a set of carry-save adders, and a final
regular (carry-select maybe?) adder to merge the final result.

A full adder takes three inputs and generates two outputs, with
something like 3-4 (wild guess!) gate delays.

You make a good point that I didn't catch. First let me correct my earlier
mistake and say that it looks like 5 (not six) levels in the adder tree for
a 54 bit register. Now you make the distinction between gate delays and
cycles. So the question is how many gate delays in a cycle? That is, if an
adder takes 4 gate delays and there are 8 gate delays per cycle, then the
instruction only takes three cycles instead of the five I had posited
earlier.



Quote:

You can even use a tree structure of FAs to directly count the number of
bits:

10 full adders will reduce 30+2 input bits to 20+2, another 7 can take
21+1 down to 14+1. 5 more to get to 10 (9+1) bits. Next is 3 FAs to
generate 6+1, then 1 or 2 FAs to get the final result.

Another good point I missed - the carry input. But in your second, and
later steps, don't you have to use "wider" adders than a single bit, since
the output of the first step of adders the bits have "place significance"?
Thus, for the second step, the inputs are 20 two bit numbers and the two
single bits "carries" left over from the first step. So don't you need ten
adders, not the seven you propose?

--
- Stephen Fuld
e-mail address disguised to prevent spam
Back to top
Stephen Fuld
Guest





Posted: Mon Sep 26, 2005 4:15 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"Derek Gladding" <derek-spammenot@ebollocks.net> wrote in message
news:sUJZe.1$OP6.0@fe84.usenetserver.com...
Quote:
Stephen Fuld wrote:

I am not a hardware guy, and this certainly isn't a homework problem, so

[snip]

Oops in previous post. (It was a machine from ten years ago, so memory
slipped and I forgot to sanity check until after posting).

I should have said 10 (3,2) compressors,

I am not familiar with the term compressor in this context. Is it
essentially a single bit wide adder? That is, it outputs, in two bits, the
number of one bits in its three, single bit inputs.

--
- Stephen Fuld
e-mail address disguised to prevent spam
Back to top
JJ
Guest





Posted: Mon Sep 26, 2005 4:15 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Stephen Fuld wrote:
Quote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".


snipping

This usually comes up in c.a.fpga for the newbies doing DSP correlation
etc

The most optimal way using full adders in VLSI but maybe not in FPGA is
to consider a recursive definition for 2^n-1 and finish off the last
bit later if one absolutely must have that last bit.

So pop(63) = pop(31)+pop(31)+pop(1)

eventually pop(3) is a single adder delay of 2 gates.

Going back up each levels adds precisely 4 gate delays since the
carries flow down a diagonal as the width and hight increase 1 each
rank.

Still for 2^n bits the total no of adders is 2^n-1.
n=3 need 1 ie 1 or 3-2 or 0+0+1
n=7 need 1+1+2 ie 4 or 7-3 or 1+1+2
n=15 need 4+4+3 ie 11 or 15-4 or 4+4+3
n=31 need 11+11+4 ie 26 or 31-5 or 11+11+4
n=63 need 26+26+5 ie 57 or 63-6 or 26+26+5

Last row puts the log n adder back in.
Draw it out and it lays out quite well in a twisted sense.

To mul 2 large nos in wallace fashion, one could use columns of these
pop trees to do column by col;umn etc. Even more twistyness.

In DSP hardware for signal correlation against some fixed pattern, it
helps a little to remove the last bit to save last rank of adders.



johnjakson at usa_dot com
transputer2 at yahoo ...
Back to top
Stephen Fuld
Guest





Posted: Mon Sep 26, 2005 9:24 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

"JJ" <johnjakson@gmail.com> wrote in message
news:1127742718.021476.79910@g44g2000cwa.googlegroups.com...
Quote:
Stephen Fuld wrote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be
implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".


snipping

This usually comes up in c.a.fpga for the newbies doing DSP correlation
etc

Makes sense.

Quote:
The most optimal way using full adders in VLSI but maybe not in FPGA is
to consider a recursive definition for 2^n-1 and finish off the last
bit later if one absolutely must have that last bit.

So pop(63) = pop(31)+pop(31)+pop(1)

eventually pop(3) is a single adder delay of 2 gates.

Going back up each levels adds precisely 4 gate delays since the
carries flow down a diagonal as the width and hight increase 1 each
rank.

Still for 2^n bits the total no of adders is 2^n-1.
n=3 need 1 ie 1 or 3-2 or 0+0+1
n=7 need 1+1+2 ie 4 or 7-3 or 1+1+2
n=15 need 4+4+3 ie 11 or 15-4 or 4+4+3
n=31 need 11+11+4 ie 26 or 31-5 or 11+11+4
n=63 need 26+26+5 ie 57 or 63-6 or 26+26+5

Last row puts the log n adder back in.
Draw it out and it lays out quite well in a twisted sense.

I'll chew on that for a while. Thanks. You indicated that this is the
optimal method using full adders. Does that mean that using full adders is
the optimal method over the other ideas I talked about?

--
- Stephen Fuld
e-mail address disguised to prevent spam
Back to top
JJ
Guest





Posted: Mon Sep 26, 2005 10:16 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Stephen Fuld wrote:
Quote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".

One could have a tree of adders - so for a 64 bit register, 32 two bit
adders feeding 16 four bit adders, etc. This works, but takes six cycles,
which seems like a lot. (maybe it doesn't matter given how infrequently
such an instruction is executed.)


Using wide adders in any fashion other than I explained before makes
little sense since it serializes itself through the carry chain. But if
this is done in firmware or microcode or a pipeline, it might not be so
bad sine the serializing is cut into shorter chains. Still the
recursive definition gives about 4gates log N time directly. For really
large N (maybe >256), though it must tend toward linear due to the
wiring delays with relatively sparse logic. Typically such structures
need to follow the shape of some other structure, ie a datapath or a
DSP block. Anyway it can usually be pipelined for really big N.

I was a bit careless about n though.

Quote:
One could have a look up ROM, but it would have to be pretty big to save any
time. For example, with a 64K by 5 ROM, one could do it in 4 lookups and
pipeline all but the last the add, but this still takes five cycles. If you
could use a dual ported ROM (are there such things?), you could cut it down
to about 3 cycles.

Dual ported RAMs which are relatively free in FPGAs can be used as DP
ROMs preloaded by the external Flash memory. But their best valure lies
in use for DP RAMs or computing register files rather than big ass
rarely used look up tables, but if you are drowning in them then use
them.

Quote:

You could just use a humongo hunk of random logic and get the answer in a
single cycle, but the amount of logic required seems to be way to large to
be practical.


4g log N delay with N-1 adder cells is all thats needed unless there is
some esoteric method I missed. Some folks in FPU design will go there
in order to exploit particular arithmetic properties for even less
delay. I assume here that an adder has 3 equal delays but thats not
really true. If you have a particular logic cell family, you could
simulate variations.

Quote:
I guess if you allowed analog circuitry, you could somehow input all the
signals into some kind of A to D converter, but that seems to way out.


Not necessarily, in a mixed signal or memory situation if you didn't
care about both precision and speed, a I summing network with suitable
references or a A/D converter could give the same result but way more
trouble than its worth.

Quote:
So, for real CPUs that have this instruction, how is it implemented?

--
- Stephen Fuld
e-mail address disguised to prevent spam
Back to top
Derek Gladding
Guest





Posted: Mon Sep 26, 2005 10:40 pm    Post subject: Re: Implementation of pop count instruction Reply with quote

Stephen Fuld wrote:

Quote:

"Derek Gladding" <derek-spammenot@ebollocks.net> wrote in message
news:sUJZe.1$OP6.0@fe84.usenetserver.com...
Stephen Fuld wrote:

I am not a hardware guy, and this certainly isn't a homework problem, so

[snip]

Oops in previous post. (It was a machine from ten years ago, so memory
slipped and I forgot to sanity check until after posting).

I should have said 10 (3,2) compressors,

I am not familiar with the term compressor in this context. Is it
essentially a single bit wide adder? That is, it outputs, in two bits,
the number of one bits in its three, single bit inputs.


Yes, just a full adder, but the carry input isn't being used as a carry.

http://www.geoffknagge.com/fyp/carrysave.shtml

An implementation with (5,3) compressors should also be possible, but would
need a little time with pencil and paper to work out all the plumbing.

- Derek
Back to top
Christian Bau
Guest





Posted: Tue Sep 27, 2005 12:15 am    Post subject: Re: Implementation of pop count instruction Reply with quote

In article
<o4JZe.316099$5N3.155074@bgtnsc05-news.ops.worldnet.att.net>,
"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote:

Quote:
So, for real CPUs that have this instruction, how is it implemented?

Since it isn't used very much, I would try to adapt existing hardware.
Assume you have a 64 x 64 bit integer multiplier, with two instructions
giving the most and the least significant 64 bits of the result.

Such a multiplier adds in bit position k the values (xi AND y(k-i)) for
0 <= i <= k using more or less brute force, for 0 <= k <= 126. Change
the inputs so that in bit position 64 the inputs are xi for 0 <= i <= 63
and in all other bit positions the inputs are 0. Let the existing
multiplier hardware do the rest.
Back to top
Dale Morris
Guest





Posted: Tue Sep 27, 2005 12:15 am    Post subject: Re: Implementation of pop count instruction Reply with quote

"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in message
news:o4JZe.316099$5N3.155074@bgtnsc05-news.ops.worldnet.att.net...
Quote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be
implemented in hardware. I have come up with several possibilities, but
none of them seem "optimal".

There are many different aspects of computing a population count that one
could optimize. Presumably you're thinking about latency, since you talked
about the circuit height, but as others point out, there's also throughput,
and the potential for leveraging other, already-existing circuits.

Further, you may want to step back and think about the bigger problem. Are
you really just wanting to popcount 64-bit values because that's the width
of the machine you're thinking about? If you're really wanting to popcount
longer strings than 64-bits, then there's not really any reason to carry
propagate on a 64-bit basis. You can more quickly compute a carry-save sum,
and even do a popcount-and-add (of a prior carry-save sum), and then put off
the propagation until you get to the end of the larger string you're wanting
to count over. (See US patent 5, 717, 616.)

So, I'd encourage you to step back from the potential implementation and
think about what problems you're wanting to solve that require population
counts.

- Dale Morris
Itanium processor architect
Hewlett Packard
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 8:15 am    Post subject: Re: Implementation of pop count instruction Reply with quote

"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:dh84l7$4n8$1@osl016lin.hda.hydro.com...
Quote:
Stephen Fuld wrote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be
implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".

One could have a tree of adders - so for a 64 bit register, 32 two bit
adders feeding 16 four bit adders, etc. This works, but takes six
cycles,
which seems like a lot. (maybe it doesn't matter given how infrequently
such an instruction is executed.)

The most obvious idea would be a set of carry-save adders, and a final
regular (carry-select maybe?) adder to merge the final result.

A full adder takes three inputs and generates two outputs, with
something like 3-4 (wild guess!) gate delays.

You can even use a tree structure of FAs to directly count the number of
bits:

10 full adders will reduce 30+2 input bits to 20+2, another 7 can take
21+1 down to 14+1. 5 more to get to 10 (9+1) bits. Next is 3 FAs to
generate 6+1, then 1 or 2 FAs to get the final result.

The fastest sw bitcount code I know about (Hi Rob!) uses Full-Add
operations on 64-bit wide registers (Alpha) to first merge 15*64 bits
down to 4*64 result registers, then a simultaneous binary merge of each
of these four result regs to get the final count.



Judging from the results of my tests of software POPC implementations on
SPARC, PA-RISC, Itanium and PowerPC I strongly suspect this is no faster
than the obvious 8 byte table lookup. Unfortunately I do not have access
to an Alpha based system to check this out.



dk
Back to top
Dan Koren
Guest





Posted: Tue Sep 27, 2005 8:15 am    Post subject: Re: Implementation of pop count instruction Reply with quote

"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in message
news:o4JZe.316099$5N3.155074@bgtnsc05-news.ops.worldnet.att.net...
Quote:
I am not a hardware guy, and this certainly isn't a homework problem, so
please be gentle with me. :-)

I have been thinking about how a pop count instruction would be
implemented
in hardware. I have come up with several possibilities, but none of them
seem "optimal".

One could have a tree of adders - so for a 64 bit register, 32 two bit
adders feeding 16 four bit adders, etc. This works, but takes six cycles,
which seems like a lot. (maybe it doesn't matter given how infrequently
such an instruction is executed.)

One could have a look up ROM, but it would have to be pretty big to save
any
time. For example, with a 64K by 5 ROM, one could do it in 4 lookups and
pipeline all but the last the add, but this still takes five cycles. If
you
could use a dual ported ROM (are there such things?), you could cut it
down
to about 3 cycles.

You could just use a humongo hunk of random logic and get the answer in a
single cycle, but the amount of logic required seems to be way to large to
be practical.

I guess if you allowed analog circuitry, you could somehow input all the
signals into some kind of A to D converter, but that seems to way out.

So, for real CPUs that have this instruction, how is it implemented?



Take a look at http://aggregate.org/MAGIC and
www.hackersdelight.org for various ideas on
how to implement POPC using bit shifts and
bitwise logical operations. I have recently
coded and compared all the POPC algorithms
presented there. Not surprisingly it turns
out that the fastest one is to do byte by
byte lookups through a 256-entry table,
which can be easily parallelized and
implemented in hardware. Just to get an
idea, a double word (8 byte) POPC takes
1.67 ns on a 1.2 GHz Ultra SPARC III. I
got similar results on PA-RISC: 2.08 ns
at 900 MHz. Considering how fast these
run, I am not sure a pure hardware POPC
implemetation is justified.

I was forced through this exercise after
finding out that SPARC v9 POPC was not
implemented in hardware but emulated by
a trap handler that is about 1000x slower
than the obvious table lookup algorithm
(wondering who wrote it....).

Hope this helps.



dk
Back to top
Terje Mathisen
Guest





Posted: Tue Sep 27, 2005 8:15 am    Post subject: Re: Implementation of pop count instruction Reply with quote

Dale Morris wrote:
Quote:
There are many different aspects of computing a population count that one
could optimize. Presumably you're thinking about latency, since you talked
about the circuit height, but as others point out, there's also throughput,
and the potential for leveraging other, already-existing circuits.

Further, you may want to step back and think about the bigger problem. Are
you really just wanting to popcount 64-bit values because that's the width
of the machine you're thinking about? If you're really wanting to popcount
longer strings than 64-bits, then there's not really any reason to carry
propagate on a 64-bit basis. You can more quickly compute a carry-save sum,
and even do a popcount-and-add (of a prior carry-save sum), and then put off
the propagation until you get to the end of the larger string you're wanting
to count over. (See US patent 5, 717, 616.)

Huh??? A US patent on this???

"The US Patent System: Broken by Design!" :-(

Quote:
So, I'd encourage you to step back from the potential implementation and
think about what problems you're wanting to solve that require population
counts.

In my experience, that is _always_ good advice, and in 95%+ of the cases
where I have been consulted/involved, it also turned out both that the
real problem/target was different than originally stated, and that the
code became much faster after taking advantage of this.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture All times are GMT
Goto page 1, 2, 3, 4, 5  Next
Page 1 of 5

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




VoIP Electronics Powered by phpBB