| Author |
Message |
Guest
|
Posted:
Thu Nov 17, 2005 12:22 am Post subject:
complexity and speed of arithmetic |
|
|
hi everyone,
i'm teaching a vlsi design class, and we're covering datapath this
week.
we use weste and harris' new edition of "cmos vlsi design". it's a
great book; however, it doesn't discuss actual delays/area costs for
adders, multipliers, dividers, etc., and the other sources i've looked
at tend to be focused on asymptotics (O(log N) type analyses).
does anyone know off the top of their heads what the actual delay/area
numbers are like on a modern processor for the following:
addition, multiplication, division
16/32/64 bit
number of pipeline stages for these implementations.
i'd also be interested in knowing correspondnig values for floating
point arithmetic, and for FPGA implementations.
suggestions for survey articles would be welcome (i know i could just
skim the JSSC for the past several years, but it's unlikely that i'll
be able to do that before the end of classes).
thanks in advance,
adnan |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Nov 17, 2005 1:15 am Post subject:
Re: complexity and speed of arithmetic |
|
|
adnan.aziz@gmail.com wrote:
| Quote: | hi everyone,
i'm teaching a vlsi design class, and we're covering datapath this
week.
we use weste and harris' new edition of "cmos vlsi design". it's a
great book; however, it doesn't discuss actual delays/area costs for
adders, multipliers, dividers, etc., and the other sources i've looked
at tend to be focused on asymptotics (O(log N) type analyses).
does anyone know off the top of their heads what the actual delay/area
numbers are like on a modern processor for the following:
addition, multiplication, division
16/32/64 bit
number of pipeline stages for these implementations.
|
Delay::
Excepting the Pentium IV, Addition is between 1/2 pipe stage and 1 pipe
stage for anything between 16 and 64 bits. Depending upon the
implementation technology and other data path constraints (such as the
number of wires over the adder) the 16-bits of addition will take at
least 7 gates of dealy, while 32-bit adds incurr a single additional
gate delay and 64 adds either 1 or 2 additional gates depending upon
wire load and target speeds.
Multiplication in VLSI has typically been performed with a 4:2
compressor technolory which is then followed with a full length
addition of those bits not fully resolved by the multiplication tree.
Each 4:2 compressor is 'about' 3 gates of delay (with a good circuit
designer). These are invariably preceeded with boothe encoders to
minimize the logic in the first stage of multiplicatio and to enable
efficient control over signed multiplications. A 16-bit multiplier will
take 4 layers of this 4:2 compression tree followed by a 16-bit*2 -
(4-1) bit adder = 29 bit adder. A 32 bit multiplier takes 5 layers and
a 60-bit adder. A 64-bit multiplier takes 6 layers and a 123-bit
adder.*
Division in most modern processors divides (pun intended) between SRT
radix divisors and quadradic convergance dividers. SRT dividers use a
thin 4:2 compressor tree and a PLA to control the subtrahends. QC
dividors use a look up table approximating 1/d and then uses a
newtonian itteration which simultaneously drives the numerator to the
quotient while driving the divisor twoards 1.0. Getting the rounding
correct is an interesting challenge in and of itself in QC divisors.
Under the assumption that a 64-bit integer adder and associated
forwarding logic and result bus driving logic fully occupies 1.0 pipe
stage in your design. A 64-bit integer multiplication tree occupies
about 1.3 clock stages, and the subsequent 123-bit adder occupies
around 0.8 clock stages. Not included in the multiplication time is the
fan-out load to drive the multiplicand and multiplier 'across' the
array.
[*] designers genearlly have 2**n bit adders lying around, so the
29-bit, 60-bit and 123-bit adders are generally just 16-bit, 32-bit and
128-bit adders with a few input bits wired to constant values.
Area:
This has a lot to do with the implementation technology, number of
'useful' wiring layers, and whether the tools used enable 'exotic'
wiring grids to be programmed in or whether the tools only allow rather
regular wring grids to be programmed 'easily'. This can change the size
of the multiplier array by more than 30%. Having to squish the
multiplier parallelogram into a rectangle, exasterbates the wiring
problems and aleviates area 'packing' problems.
Given that a 64-bit integer adder is 1.0 units of size, a
64-bit*64-bit->128-bit integer multiplier will be at least 5.0 units of
size and likely be 6.5 units of size. A 128-bit by 64-bit radix 4 SRT
divisor will be 3.7 units (unless it borrows a topologically adjacent
<possibly double sized> integer adder for finish up work.) |
|
| Back to top |
|
 |
Bernd Paysan
Guest
|
Posted:
Tue Nov 22, 2005 5:15 pm Post subject:
Re: complexity and speed of arithmetic |
|
|
MitchAlsup@aol.com wrote:
| Quote: | Multiplication in VLSI has typically been performed with a 4:2
compressor technolory which is then followed with a full length
addition of those bits not fully resolved by the multiplication tree.
Each 4:2 compressor is 'about' 3 gates of delay (with a good circuit
designer).
|
Hm, for custom designs, the regularity of the 4:2 compressor makes sense,
but for a standard cell approach, there are some tradeoffs. None of them
has an optimized 4:2 compressor cell, only 3:2 cells (full adders); and you
need two of them for the 4:2 compressor. I'd rather use 9:4 building blocks
for the first stage(s), since the 9:4 compressors are quite handy for
getting rid of the surplus partial products (sign corrections - a 32x32
multiplication has 18 partial products to sum together, so you have
18:8:4:2 in your tree with 9:4 compressors in the first stage).
| Quote: | These are invariably preceeded with boothe encoders to
minimize the logic in the first stage of multiplicatio
|
You replace a 4:2 compressor by two 2:1 MUXes and two XOR2s. A full adder is
slightly larger than a MUX and an XOR, but not that much. You pay by having
to add a partial product for error correction - but you pay that anyway
once you have to work with signs.
| Quote: | and to enable efficient control over signed multiplications.
|
The technique how to adjust a wallace tree to signed multiplication is quite
similar to the "sign tricks" applied for booth encoders. Either way, you
have to add one additional partial product, and a negation correction
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
|
| Back to top |
|
 |
|
|
|
|