| Author |
Message |
Zeb
Guest
|
Posted:
Fri Oct 21, 2005 8:15 am Post subject:
Massive cancellation FP stall |
|
|
The following is from the IBM 970FX User's manual:
"If an arithmetic operation involves an effective subtraction and the
two operands of the subtraction operation are very close in magnitude,
then the FPU is stalled for two cycles. More specifically, for the
operation AxC+B, if the result is 2^-16 or less times AxC, then there is
a two cycle stall; and if the result is 2^-80 or less times AxC, then
there is a four cycle stall. These stalls may sometimes occur for
results that are up to 8 times larger (i.e. 2^-13 times AxC)."
I have two questions.
1) How common are these stalls in other architectures? My own
experiments show that these stalls occur in older PowerPC scalar FP
units (where they are not documented), though not in the Altivec FP
units. Do they occur, for example, in x86 processors?
2) Why do these stalls occur?
Thanks for any info,
Zeb |
|
| Back to top |
|
 |
Colin Percival
Guest
|
Posted:
Fri Oct 21, 2005 8:15 am Post subject:
Re: Massive cancellation FP stall |
|
|
Zeb <zeb_138@yahoo.com> wrote:
| Quote: | "If an arithmetic operation involves an effective subtraction and the
two operands of the subtraction operation are very close in magnitude,
then the FPU is stalled for two cycles. More specifically, for the
operation AxC+B, if the result is 2^-16 or less times AxC, then there is
a two cycle stall; and if the result is 2^-80 or less times AxC, then
there is a four cycle stall. These stalls may sometimes occur for
results that are up to 8 times larger (i.e. 2^-13 times AxC)."
|
Interesting -- and potentially dangerous. Anyone want to implement a
side channel attack based on this? :-)
| Quote: | 1) How common are these stalls in other architectures? My own
experiments show that these stalls occur in older PowerPC scalar FP
units (where they are not documented), though not in the Altivec FP
units. Do they occur, for example, in x86 processors?
|
I've certainly never seen such stalls documented on x86 processors.
| Quote: | 2) Why do these stalls occur?
|
Two possibilities come to mind. The first is that when computing a
multiply-add operation, only the top half-plus-epsilon of the product
is computed; the lower bits are only computed if they turn out to be
necessary. The far less likely possibility is that the floating-point
renormalization logic can only shift a significand by a relatively small
number of bits per cycle.
Whatever the cause is, the presence of data-dependent instruction timings
on modern processors is something which I find worrying, and I'd be very
interested to hear about any more issues like this.
Colin Percival |
|
| Back to top |
|
 |
Guest
|
Posted:
Fri Oct 21, 2005 4:15 pm Post subject:
Re: Massive cancellation FP stall |
|
|
Zeb wrote:
| Quote: | The following is from the IBM 970FX User's manual:
"If an arithmetic operation involves an effective subtraction and the
two operands of the subtraction operation are very close in magnitude,
then the FPU is stalled for two cycles. More specifically, for the
operation AxC+B, if the result is 2^-16 or less times AxC, then there is
a two cycle stall; and if the result is 2^-80 or less times AxC, then
there is a four cycle stall. These stalls may sometimes occur for
results that are up to 8 times larger (i.e. 2^-13 times AxC)."
I have two questions.
1) Do they occur, for example, in x86 processors?
|
None that I know of, But MACs are not part of the x86 instruction set.
And this is a pure side effect of perrforming both multiplication and
addition without any rounding step in between.
| Quote: | 2) Why do these stalls occur?
|
The FPU probably does not compute all of the low order bits when doing
the multiplication and when there is massive cancelation with respect
to the augend, these bits need to get computed. So the architects of
this FPU decided to give the average MAC a lower latency at the cost of
some MACs taking pipe disruptiions. |
|
| Back to top |
|
 |
Andy Freeman
Guest
|
Posted:
Fri Oct 21, 2005 4:15 pm Post subject:
Re: Massive cancellation FP stall |
|
|
Colin Percival wrote:
| Quote: | Whatever the cause is, the presence of data-dependent instruction timings
on modern processors is something which I find worrying, and I'd be very
interested to hear about any more issues like this.
|
Aren't data-dependent instruction times pretty much the rule in IEEE FP
implementations, at least when denorms are enabled?
IIRC, Andy Glew suggested that we might start seeing data-dependent
instruction time for certain fixnum adds.
Of course, it may be difficult to observe these time variations in a
complex processor - for most of us, they'll likely just contribute to
"noise floor" due to the effect of system state on execution time. |
|
| Back to top |
|
 |
Colin Percival
Guest
|
Posted:
Fri Oct 21, 2005 4:16 pm Post subject:
Re: Massive cancellation FP stall |
|
|
Andy Freeman <anamax@earthlink.net> wrote:
| Quote: | Colin Percival wrote:
Whatever the cause is, the presence of data-dependent instruction timings
on modern processors is something which I find worrying, and I'd be very
interested to hear about any more issues like this.
Aren't data-dependent instruction times pretty much the rule in IEEE FP
implementations, at least when denorms are enabled?
|
Well... yes. But it's fairly easy to make sure that denormals never
occur in a computation, so you can get effectively data-independent
instructions times if they are the only problem.
Colin Percival |
|
| Back to top |
|
 |
Maynard Handley
Guest
|
Posted:
Sat Oct 22, 2005 12:15 am Post subject:
Re: Massive cancellation FP stall |
|
|
In article <1129907891.679500.200620@g47g2000cwa.googlegroups.com>,
"Andy Freeman" <anamax@earthlink.net> wrote:
| Quote: | Colin Percival wrote:
Whatever the cause is, the presence of data-dependent instruction timings
on modern processors is something which I find worrying, and I'd be very
interested to hear about any more issues like this.
Aren't data-dependent instruction times pretty much the rule in IEEE FP
implementations, at least when denorms are enabled?
IIRC, Andy Glew suggested that we might start seeing data-dependent
instruction time for certain fixnum adds.
Of course, it may be difficult to observe these time variations in a
complex processor - for most of us, they'll likely just contribute to
"noise floor" due to the effect of system state on execution time.
|
It's not like this is limited to FP.
I don't know about POWER4/970 but the 603, 604, 750, 7500 all had early
exit integer multiplies where you might save a cycle or more if the high
bits of one of your multiplicands were sign bits. |
|
| Back to top |
|
 |
Christian Bau
Guest
|
Posted:
Sat Oct 22, 2005 12:15 am Post subject:
Re: Massive cancellation FP stall |
|
|
In article <dj9plj$q3d$1@morgoth.sfu.ca>,
Colin Percival <cperciva@sfu.ca> wrote:
| Quote: | Zeb <zeb_138@yahoo.com> wrote:
"If an arithmetic operation involves an effective subtraction and the
two operands of the subtraction operation are very close in magnitude,
then the FPU is stalled for two cycles. More specifically, for the
operation AxC+B, if the result is 2^-16 or less times AxC, then there is
a two cycle stall; and if the result is 2^-80 or less times AxC, then
there is a four cycle stall. These stalls may sometimes occur for
results that are up to 8 times larger (i.e. 2^-13 times AxC)."
Interesting -- and potentially dangerous. Anyone want to implement a
side channel attack based on this? :-)
1) How common are these stalls in other architectures? My own
experiments show that these stalls occur in older PowerPC scalar FP
units (where they are not documented), though not in the Altivec FP
units. Do they occur, for example, in x86 processors?
I've certainly never seen such stalls documented on x86 processors.
2) Why do these stalls occur?
Two possibilities come to mind. The first is that when computing a
multiply-add operation, only the top half-plus-epsilon of the product
is computed; the lower bits are only computed if they turn out to be
necessary. The far less likely possibility is that the floating-point
renormalization logic can only shift a significand by a relatively small
number of bits per cycle.
Whatever the cause is, the presence of data-dependent instruction timings
on modern processors is something which I find worrying, and I'd be very
interested to hear about any more issues like this.
|
Much more likely that having a left shift over an enormous distance is
just too expensive to implement. Normalisation hardware detects say up
to 32 leading zeroes and shifts accordingly, but fused multiply-add can
produce over hundred leading zero bits.
Consider that a Pentium 4 needs four cycles for a plain integer shift
operation. |
|
| Back to top |
|
 |
Bradley J Lucier
Guest
|
Posted:
Sat Oct 22, 2005 8:15 am Post subject:
Re: Massive cancellation FP stall |
|
|
In article <ogailx502-BD404A.22361821102005@news.isp.giganews.com>,
Tim Olson <ogailx502@NOSPAMsneakemail.com> wrote:
| Quote: | In article <1129906611.137765.230970@o13g2000cwo.googlegroups.com>,
MitchAlsup@aol.com wrote:
| The FPU probably does not compute all of the low order bits when doing
| the multiplication and when there is massive cancelation with respect
| to the augend, these bits need to get computed.
The low order bits of the multiply always need to be computed.
|
Not always. The guard bit and the "sticky bit" need to be computed for correct
rounding, but these can often be computed without computing all of the
low-order bits of the multiply.
Brad Lucier |
|
| Back to top |
|
 |
Tim Olson
Guest
|
Posted:
Sat Oct 22, 2005 8:15 am Post subject:
Re: Massive cancellation FP stall |
|
|
In article <1129906611.137765.230970@o13g2000cwo.googlegroups.com>,
MitchAlsup@aol.com wrote:
| The FPU probably does not compute all of the low order bits when doing
| the multiplication and when there is massive cancelation with respect
| to the augend, these bits need to get computed.
The low order bits of the multiply always need to be computed. The
PowerPC fused multiply-add architecture is defined such that the
multiply generates all 106 bits of the product, which is then added
(with appropriate alignment) to the 53 bit augend in a 159-bit
adder/incrementer before rounding.
If massive-cancellation occurs, then a left shift of up to 106 bits is
required. That is typically broken up into a sequence of multiple,
smaller shifts.
| So the architects of
| this FPU decided to give the average MAC a lower latency at the cost of
| some MACs taking pipe disruptiions.
Correct (also smaller area and lower power).
-- Tim Olson |
|
| Back to top |
|
 |
David Kanter
Guest
|
Posted:
Sat Oct 22, 2005 1:28 pm Post subject:
Re: Massive cancellation FP stall |
|
|
Andy Freeman wrote:
| Quote: | Colin Percival wrote:
Whatever the cause is, the presence of data-dependent instruction timings
on modern processors is something which I find worrying, and I'd be very
interested to hear about any more issues like this.
Aren't data-dependent instruction times pretty much the rule in IEEE FP
implementations, at least when denorms are enabled?
IIRC, Andy Glew suggested that we might start seeing data-dependent
instruction time for certain fixnum adds.
|
Actually, that is already occurring. Intel's Yonah has a variable
length divide latency, the variable being a function of the relative
size of the divisor and the dividend (terminology check here?).
David |
|
| Back to top |
|
 |
Zeb
Guest
|
Posted:
Mon Oct 24, 2005 8:15 am Post subject:
Re: Massive cancellation FP stall |
|
|
In article <1129969733.383161.152070@f14g2000cwb.googlegroups.com>,
"David Kanter" <dkanter@gmail.com> wrote:
| Quote: | IIRC, Andy Glew suggested that we might start seeing data-dependent
instruction time for certain fixnum adds.
Actually, that is already occurring. Intel's Yonah has a variable
length divide latency, the variable being a function of the relative
size of the divisor and the dividend (terminology check here?).
|
Thanks! Is this in any public documents yet? |
|
| Back to top |
|
 |
Zeb
Guest
|
Posted:
Mon Oct 24, 2005 8:15 am Post subject:
Re: Massive cancellation FP stall |
|
|
In article <name99-4CF47F.16424121102005@localhost>,
Maynard Handley <name99@name99.org> wrote:
| Quote: | It's not like this is limited to FP.
I don't know about POWER4/970 but the 603, 604, 750, 7500 all had early
exit integer multiplies where you might save a cycle or more if the high
bits of one of your multiplicands were sign bits.
|
Yes (sort of). And it depends on which of the two multipliers has the
high bits.
This is more-or-less documented for the earlier chips. The FP
subtraction hazards were not documented until the 970. |
|
| Back to top |
|
 |
David Kanter
Guest
|
Posted:
Mon Oct 24, 2005 1:26 pm Post subject:
Re: Massive cancellation FP stall |
|
|
Zeb wrote:
| Quote: | In article <1129969733.383161.152070@f14g2000cwb.googlegroups.com>,
"David Kanter" <dkanter@gmail.com> wrote:
IIRC, Andy Glew suggested that we might start seeing data-dependent
instruction time for certain fixnum adds.
Actually, that is already occurring. Intel's Yonah has a variable
length divide latency, the variable being a function of the relative
size of the divisor and the dividend (terminology check here?).
Thanks! Is this in any public documents yet?
|
No, but I'll be writing an article about it soonish...maybe a week or
so.
DK |
|
| Back to top |
|
 |
|
|
|
|