Where do the gains of OOO architecture actually take effect?
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
 FAQFAQ   MemberlistMemberlist     RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Google
 
Web castalk.com
Where do the gains of OOO architecture actually take effect?
Goto page 1, 2  Next
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture
Author Message
Bart
Guest





Posted: Tue Jul 12, 2005 7:52 am    Post subject: Where do the gains of OOO architecture actually take effect? Reply with quote

Hi,

I have developed more than a passing curiosity about processor
architecture and have a question regarding out-of-order execution vs.
static scheduling (such as the approach taken by VLIW and Itanium.) I
hope someone here can clarify my understanding of OOO and precisely why
it improves performance.

If I understand correctly, OOO cores work by dispatching instructions
which have already been fetched when it is possible to do so; meaning
when there are no data dependencies or those dependencies can be
"eliminated" through register renaming. So for example, this
instruction sequence:

[1] add r0=r1,r2
[2] and r3=r0,0xFF
[3] xor r5=r6,r7
....
[6] ld r0=(r4)

Faced with this instruction sequence, an OOO processor could
theoretically execute instructions 1, 3, and 6 simultaneously, right?
Instruction 6 would require the "architectural register" R0 to be
assigned to a different hardware register than the one used by
instruction 1. Am I comprehending register renaming correctly?

If so, I can certainly see the advantage this provides. However, on an
architecture such as Itanium, I could simply reorder the instructions
myself and exploit the same level of parallelism. Yet, I realize the
Itanium doesn't always perform as well as an OOO machine, to say the
least.

I once wrote a Mandelbrot plotting program which, for those who aren't
familiar with the idea, calculates each pixel to plot using a tight
loop with virtually no parallelism to exploit. It's a very serial
sequence of logic. Using an HP TestDrive account, I tested the code on
the following:

- EV5 Alpha 500MHz (in-order)
- EV6 Alpha 466 or 500MHz (OOO)
- Itanium (~800MHz)

On the EV6, the performance almost doubled vs. the EV5. The Itanium did
very poorly even when I wrote hand-tuned assembly code (which
admittedly wasn't the best but did exploit a bit of natural parallelism
in the instruction sequence.) I realize the gains I saw on the EV6 must
have been due to its OOO execution. But why? There was no parallelism
to exploit in the main loop body.

I have an idea as to what the answer may be and this is the primary
point of this rather long winded (sorry!) post: To find out if I'm
thinking along the right path or if I've got it all fundamentally
wrong! Here it goes:

I figure that branch prediction will direct the processor to follow the
loop back to the top while still executing the loop body and at this
point, it begins the next iteration of the loop (facilitated by
register renaming.) Here is the loop in question:

for (i = 0; i < MAX_ITERATIONS; i++)
{
x = next_x * next_x - next_y * next_y + cx;
y = 2 * next_x * next_y + cy;

next_x = x;
next_y = y;

if ((x * x + y * y) > 4) // while |Zn| < 2
break;
}

The thing that puzzles me is if my understanding is correct, the
processor wouldn't be able to schedule instructions from the next
iteration until next_x and next_y have been assigned (unless it is
smart enough to execute the next_x = x instruction right after x is
first computed and then move on to the next iteration -- but can it
really look ahead that deep into the instruction stream?)

I can't think of how else it could be so much faster than an EV5. Or,
for that matter, a much higher clocked Itanium (~800MHz) where I
exploited the obvious parallelism fairly well.

Thanks in advance!
Back to top
Eric Gouriou
Guest





Posted: Tue Jul 12, 2005 8:15 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Bart wrote:
[...]
Quote:
I once wrote a Mandelbrot plotting program which, for those who aren't
familiar with the idea, calculates each pixel to plot using a tight
loop with virtually no parallelism to exploit. It's a very serial
sequence of logic. Using an HP TestDrive account, I tested the code on
the following:

- EV5 Alpha 500MHz (in-order)
- EV6 Alpha 466 or 500MHz (OOO)
- Itanium (~800MHz)

Itanium (1), aka Merced or Itanium2 ?

What OS/compiler did you use on those machines ?

Quote:
I can't think of how else it could be so much faster than an EV5. Or,
for that matter, a much higher clocked Itanium (~800MHz) where I
exploited the obvious parallelism fairly well.

Did you profile your code ? My first bet would be that you got into
denormals handling on Itanium, while you may have gotten "flush-to-zero"
on the Alpha systems.

If this was an HP-UX system with HP's compilers, try again with +FPD.

If your code is small enough I could profile it and give you more
information about what was going wrong.

Eric
Back to top
Eric Gouriou
Guest





Posted: Tue Jul 12, 2005 8:15 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Bart wrote:
Quote:
It was a Merced. I used gcc on all of the machines (all Linux boxes.)
The code is indeed small, I've uploaded the entire thing to
http://www.trzy.org/mandbrot.zip -- all that matters is m.c. I also
included my clunky quick IPF implementation under ipf/ if you want to
see it. It's really old and quickly hacked together code. I just wanted
to play around with the machines on TestDrive.

I didn't do any in-depth profiling. Function-level profiling wouldn't
have helped anyway -- it's obvious where the program actually spends
its time. It's a very simple piece of code.

If you'd like to look at it, feel free :) I just assumed the speed-up
from EV5 -> EV6 (even if EV6 has a wider execution core) was primarily
due to the fact that the EV6 microarchitecture is significantly
different (OOO) and I wanted to see if my understanding of OOO is
correct.

Some usage notes on m.c: You can pass the xres and yres of the image to
render on the command line (e.g., "m 640 480" will spit out a 640x480
BMP.) Obviously bigger images are more useful for benchmarking. The
code might be messy but the only function of any relevance is
Mandelbrot().

Thanks for sharing the code.

On a 1.6GHz Itanium2 system, I get:
./m 4096 4096: 12.102 Real / 12.000 User / 0.050 Sys (99% cpu)

I have no point of comparison to tell how good/bad this is compared
to the other systems you tried. This particular number was generated
with HP's C compiler at +O3 with profile feedback, but it's just 1s
faster than a basic -O. The serial nature of the code does not allow
miracles to occur with loop transformations.

A little bit of profiling with Caliper shows that:

+ the processor is stalled during 63% of the cycles, due to dependencies
between FP registers.

-----------------------------------------------
PLM
Event Name U..K TH Count
-----------------------------------------------
CPU_CYCLES x___ 0 19198290849
BACK_END_BUBBLE.ALL x___ 0 12254686091
BE_EXE_BUBBLE.ALL x___ 0 12162101510
BE_EXE_BUBBLE.FRALL x___ 0 11335330114
-----------------------------------------------
PLM: Privilege Level Mask
U/K = user/kernel levels (U: level 3, K: level 0)
The intermediate levels (1, 2) are unused on HP-UX or Linux
x : the metric is measured at the given level (_ : not measured)
TH: event THreshold, determines the event counter behavior,
TH == 0 : counter += event_count_in_cycle
TH > 0 : counter += (event_count_in_cycle >= threshold ? 1 : 0)
-----------------------------------------------

Basically this shows that almost all stalls are due to FR/FR or FR/Load
stalls. Since the code isn't generating floating point loads, we
conclude that we are seeing the effect of FP instruction latencies.

+ I was wrong regarding denormals, not a single denormal appears
to be generated.

Apologies for the wide format:

FP Summary Statistics
-------------------------------------------------------------------------------------------------------------------------------------
Finstr Flops Flops Flush SIR -----------FP Events/Sec-------------- --------FP Events/Fop-----
Stats Instr Per Per Per To
Retired Instr Instr Finstr Zero FOPS zero ---------SIR Event------- zero -----SIR Event-----
flush total stall trap flush total stall trap
-------------------------------------------------------------------------------------------------------------------------------------
MEAN 648389029 0.26 0.31 0.31 0 0 682389811 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
STDEV 251639562 0.10 0.13 0.13 0 0 272669480 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
90%CI 114797794 0.05 0.06 0.06 0 0 124391628 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
MINIMUM 545290673 0.00 0.00 0.00 0 0 6375 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
LOW90 533591235 0.21 0.25 0.25 0 0 557998183 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HIGH90 763186824 0.30 0.37 0.37 0 0 806781439 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
MAXIMUM 1280206969 0.30 0.36 0.36 0 0 796290608 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-------------------------------------------------------------------------------------------------------------------------------------

So we get ~800 MFlop/s during the best period and ~680 MFlop/s
over the whole execution.

+ As expected Mandelbrot is the hot function, followed
by fwrite

Load Module Summary
-----------------------------------------------
% Total Cumulat
IP % of IP
Samples Total Samples Load Module
-----------------------------------------------
93.41 93.41 35872 m
6.59 100.00 2531 libc.so.1
0.00 100.00 1 dld.so
-----------------------------------------------
100.00 100.00 38404 Total
-----------------------------------------------

Function Summary
-------------------------------------------------------------------------
% Total Cumulat
IP % of IP
Samples Total Samples Function File
-------------------------------------------------------------------------
93.06 93.06 35739 m::Mandelbrot m.c
5.98 99.04 2296 libc.so.1::fwrite fwrite.c
0.61 99.65 235 libc.so.1::memmove
0.35 100.00 133 m::OutBMP m.c
-------------------------------------------------------------------------

Function Details
---------------------------------------------------
% Total Line|
IP IP Slot| >Statement|
Samples Samples Col,Offset Instruction
---------------------------------------------------
93.06 [m::Mandelbrot, 0x4001510, m.c]
35739 ~45 Function Totals
------------------------------------------
[/home/egouriou/tmp/mbrot/m.c]
(4883) ~68 > step_y = (UPPER_Y - LOWER_Y) / yres;
~4,0x02e0:0 M adds r16=0,r8
:1 I adds r18=1024,r8
:2 I_ nop.i 0 ;;
100 ~4,0x02f0:0 M adds r8=0,r11
:1 M addl r17=0,r0
:2 F fmerge.s f15=f13,f13
~4,0x0300:0 M nop.m 0
:1 F fmerge.s f6=f9,f9
:2 I_ mov.i ar.lc=r11 ;;
4783 ~4,0x0310:0 M nop.m 0
:1 M nop.m 0
:2 F fma.d f32=f15,f15,f0
~88 > x = next_x * next_x - next_y * next_y + cx; // x^2 - y^2 + x
~3,0x0320:0 M nop.m 0
:1 M nop.m 0
:2 F_ fma.d f12=f6,f8,f0 ;;
(12943) ~89 > y = 2 * next_x * next_y + cy; // 2xy + y
6499 ~3,0x0330:0 M nop.m 0
:1 M nop.m 0
:2 F fms.d f32=f6,f6,f32
~3,0x0340:0 M nop.m 0
:1 M nop.m 0
:2 F_ fma.d f15=f12,f15,f13 ;;
6444 ~3,0x0350:0 M nop.m 0
:1 M nop.m 0
:2 F fma.d f12=f32,f1,f9
~88 *> x = next_x * next_x - next_y * next_y + cx; // x^2 - y^2 + x
~3,0x0360:0 M nop.m 0
:1 M nop.m 0
:2 F_ fma.d f32=f15,f15,f0 ;;
(16075) ~94 > if ((x * x + y * y) > 4) // while |Zn| < 2
6416 ~1,0x0370:0 M nop.m 0
:1 M nop.m 0
:2 F_ fma.d f32=f12,f12,f32 ;;
6483 ~1,0x0380:0 M nop.m 0
:1 M nop.m 0
:2 F_ fcmp.lt.unc p0,p6=f7,f32 ;;
3176 ~1,0x0390:0 M (p6) cmp.eq.unc p0,p7=r0,r8
:1 M (p6) adds r17=1,r17
:2 F_(p6) fmerge.s f6=f12,f12 ;;
(1656) ~96 > }
1656 ~1,0x03a0:0 M (p7) adds r8=-1,r8
:1 M nop.m 0
:2 B_(p7) br.dptk.many {self}+0x310 ;;
(156) ~99 > vidbuf[si++] = 0;
156 ~7,0x03b0:0 M lfetch.nt1 [r18],1
:1 M cmp4.eq.unc p8,p6=r14,r17
:2 I adds r15=1,r15
~81 *> for (sx = 0, cx = LOWER_X; sx < xres; sx++, cx += step_x)
~12,0x03c0:0 M cmp4.lt.unc p7,p0=r10,r42
:1 M adds r10=1,r10
:2 F_ fma.d f9=f14,f1,f9 ;;
(26) ~102 > vidbuf[si++] = i;
26 ~7,0x03d0:0 M (p6) st1 [r16]=r17
:1 M (p8) st1 [r16]=r0
:2 I adds r16=1,r16
(156) ~99 *> vidbuf[si++] = 0;
~7,0x03e0:0 M nop.m 0
:1 M nop.m 0
:2 B_(p7) br.dptk.many {self}+0x2f0 ;;

+ Branch prediction is not an issue at all
(output available on demand :-) ).

Eric
Back to top
Bart
Guest





Posted: Tue Jul 12, 2005 8:15 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

It was a Merced. I used gcc on all of the machines (all Linux boxes.)
The code is indeed small, I've uploaded the entire thing to
http://www.trzy.org/mandbrot.zip -- all that matters is m.c. I also
included my clunky quick IPF implementation under ipf/ if you want to
see it. It's really old and quickly hacked together code. I just wanted
to play around with the machines on TestDrive.

I didn't do any in-depth profiling. Function-level profiling wouldn't
have helped anyway -- it's obvious where the program actually spends
its time. It's a very simple piece of code.

If you'd like to look at it, feel free :) I just assumed the speed-up
from EV5 -> EV6 (even if EV6 has a wider execution core) was primarily
due to the fact that the EV6 microarchitecture is significantly
different (OOO) and I wanted to see if my understanding of OOO is
correct.


Some usage notes on m.c: You can pass the xres and yres of the image to
render on the command line (e.g., "m 640 480" will spit out a 640x480
BMP.) Obviously bigger images are more useful for benchmarking. The
code might be messy but the only function of any relevance is
Mandelbrot().
Back to top
Niels Jørgen Kruse
Guest





Posted: Tue Jul 12, 2005 4:15 pm    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Bart <bart_trzynadlowski@yahoo.com> wrote:

Quote:
I figure that branch prediction will direct the processor to follow the
loop back to the top while still executing the loop body and at this
point, it begins the next iteration of the loop (facilitated by
register renaming.) Here is the loop in question:

for (i = 0; i < MAX_ITERATIONS; i++)
{
x = next_x * next_x - next_y * next_y + cx;
y = 2 * next_x * next_y + cy;

next_x = x;
next_y = y;

if ((x * x + y * y) > 4) // while |Zn| < 2
break;
}

1) Unrolling twice, you could flip-flop between next_x and x and avoid
the copy.

2) Once the sequence leaves the bounding circle it is not going back, so
there is no need to test at *every* iteration. Stash the sequence into a
fifo and test (say) every 20 iterations, then backtrack if outside the
bound.

3) Calculate multiple pixels in parallel. Adjacent pixels are often
close in iteration count.

Sorry for not answering your question. I just couldn't ignore all the
performance left on the table.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Back to top
Tommy Thorn
Guest





Posted: Tue Jul 12, 2005 10:37 pm    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Eric Gouriou wrote:
Quote:
Thanks for sharing the code.

On a 1.6GHz Itanium2 system, I get:
./m 4096 4096: 12.102 Real / 12.000 User / 0.050 Sys (99% cpu)

I have no point of comparison to tell how good/bad this is compared
to the other systems you tried. This particular number was generated
with HP's C compiler at +O3 with profile feedback, but it's just 1s
faster than a basic -O. The serial nature of the code does not allow
miracles to occur with loop transformations.

A little bit of profiling with Caliper shows that:

+ the processor is stalled during 63% of the cycles, due to dependencies
between FP registers.

Thanks for the analysis Eric.

To address Bart's original question, runtime scheduling (OOO) can be
better or worse depending.

Statically scheduled code (VLIW) have the advantage of a vastely larger
scheduling window, but is limited by dynamic effects that at best it
must predict.

Dynamically scheduled code uses a much smaller fixed sized windows (~ 80
instructios on a P4 IIRC), but schedules the *actual* instruction
sequence. Effects such as

- latency of a load (which depends on many things, including the current
cache population),
- the outcome of a branch, and
- *actual* pointer aliases

can be accounted for more accurately. Notably that latter can be a big
win, thus recent pentiums have hardware disambiguation (~ alias
analysis) to enable reording loads and stores.

The OOO advantage here is well understood, but it does not come for
free. It's complicated and expensive and the potential for VLIW was
that eliminating all this complication should enable a shorter clock
cycle which should more than make up for this advantage.

Unfortunately, one cannot gauge the merit of OOO vs VLIW by looking at
current examples, such as Itanic and P4 because there are so many other
things going on here.

All In My Own Controvercial Opinion of course,
Tommy
Back to top
Guest






Posted: Wed Jul 13, 2005 12:15 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Bart
Niels

Niels> 1) Unrolling twice, you could flip-flop between next_x and x
Niels> and avoid the copy.

No need to unroll, since the compiler has all those temporaries
along the way:

do { // 21164: 21264:
e = a*a; // cycle 1 cycle 1
f = b*b; // cycle 2 cycle 2
g = a*b; // cycle 3 cycle 3
h = e-f; // cycle 6 cycle 6
i = e+f; // cycle 7 cycle 7
j = 2*g; // cycle 7 cycle 7
a = h+c; // cycle 10 cycle 10
m = (i>4.0); // cycle 11 cycle 12
b = j+d; // cycle 12 cycle 11
} until( m ); // cycle 14 cycle 14
//...
// e = a*a; // cycle 14 cycle 14
// f = b*b; // cycle 16 cycle 15

Niels> 2) Once the sequence leaves the bounding circle it is not
Niels> going back, so there is no need to test at *every* iteration.
Niels> Stash the sequence into a fifo and test (say) every 20
Niels> iterations, then backtrack if outside the bound.

I never thought of that!

.... it might help on something with bad branch prediction, but both
EV-5 and EV-6 shouldn't be having much trouble here. IIRC, the
average loop has dozens of iterations, so one or two mispredictions
each is no big deal and may be less than you'd get with a
backtracking scheme.

Niels> 3) Calculate multiple pixels in parallel. Adjacent pixels
Niels> are often close in iteration count.

On the R8000, this was good for something like 50% speedup.

I'm actually mystified why the 21264 would do the code in a lot
fewer cycles than the 21164. Both machines have independent FP
* and + pipes. Latency for both ops is 4 cycles on both machines.
There are no memory references in the loops (or, there shouldn't
be).

The only thing I could think of is this: if there is a few cycles
of latency getting the FP comparison result (that's "m" in the code
above) over to the integer side, then the 21264 probably speculates
farther past the branch than the 21164. This isn't really an
out-of-order thing, it's more of a register-renaming thing.

I know that moving FP values from the FP register file to the
integer register file required a store and a load on these machines.
I'd be suprised if that were true for condition codes, though.

The R8000 provided 8 independent, named floating-point condition
code bits so that the compiler could software pipeline loops that
had to send back FP compare information to a branch. The apparent
latency of FP compare -> branch was often over 10 cycles if the
processor was running with imprecise FP exceptions.
Back to top
Bart
Guest





Posted: Wed Jul 13, 2005 12:15 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Eric and Niels: Thanks for the profiling and the performance tips. So
the poor performance on Itanium can be attributed to the latency of the
FP instructions (as Eric said) and bubbles in the instruction stream
(it appears as though there really isn't much that can be done with the
scheduling judging from the output code shown.)

What about the performance gap between identically clocked EV5 and EV6
machines? I recall seeing a dramatic performance increase on EV6. Is it
because EV6 is a wider architecture or is it primarily because of its
OOO nature? The purpose of this post is to help me understand OOO first
and foremost. I'm just trying to use this program as an example (but
maybe it's a poor one.) What I'm getting at is: What benefits (if any)
would an OOO implementation of a processor give for code similar to
what I presented?

Tommy Thorn wrote:
Quote:
To address Bart's original question, runtime scheduling (OOO) can be
better or worse depending.

Statically scheduled code (VLIW) have the advantage of a vastely larger
scheduling window, but is limited by dynamic effects that at best it
must predict.

Dynamically scheduled code uses a much smaller fixed sized windows (~ 80
instructios on a P4 IIRC), but schedules the *actual* instruction
sequence. Effects such as

- latency of a load (which depends on many things, including the current
cache population),
- the outcome of a branch, and
- *actual* pointer aliases

can be accounted for more accurately. Notably that latter can be a big
win, thus recent pentiums have hardware disambiguation (~ alias
analysis) to enable reording loads and stores.

Is my basic understanding of OOO/dynamically correct based on my
description of it in my first post?

Quote:
Unfortunately, one cannot gauge the merit of OOO vs VLIW by looking at
current examples, such as Itanic and P4 because there are so many other
things going on here.

Right, which is why I brought up the EV5 and EV6. Of course, I don't
know the specifics of each architecture off-hand except that EV5 is
superscalar and in-order IIRC and EV6 has dynamic scheduling. I assumed
(perhaps incorrectly) that the large speed increase I saw on my
Mandelbrot test code had something to do with the dynamic scheduling as
opposed to a mere increase in execution resources. It doesn't appear as
though there is _that_ much parallelism for gcc to work with in the
inner-loop of my program so I was hoping to gain a better picture of
how the EV6 is executing the code -- particularly if it is the dynamic
scheduling that is making the difference.

The only thing I could think of is that dynamic scheduling allows for
the instruction stream to carry on past a branch that hasn't been taken
yet (branch prediction) which I believe is true, right? However, in my
Mandelbrot program, each loop iteration depends on the completion of
the previous loop. Perhaps there is still _something_ the processor can
execute if I can assume that it can look relatively far ahead into the
instruction stream.

Would it be correct to say that OOO and register renaming would benefit
code such as this:

for (i=0;i<256;i++)
{
c[i] = a[i] + b[i];
}

?

Where the loop body is small and each iteration is relatively
independent.
Back to top
Terje Mathisen
Guest





Posted: Wed Jul 13, 2005 12:53 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Niels Jørgen Kruse wrote:

Quote:
Bart <bart_trzynadlowski@yahoo.com> wrote:


I figure that branch prediction will direct the processor to follow the
loop back to the top while still executing the loop body and at this
point, it begins the next iteration of the loop (facilitated by
register renaming.) Here is the loop in question:

for (i = 0; i < MAX_ITERATIONS; i++)
{
x = next_x * next_x - next_y * next_y + cx;
y = 2 * next_x * next_y + cy;

next_x = x;
next_y = y;

if ((x * x + y * y) > 4) // while |Zn| < 2
break;
}

1) Unrolling twice, you could flip-flop between next_x and x and avoid
the copy.

Useful, but probably doesn't matter, since the copies can overlap with
other longer-latency operations.
Quote:

2) Once the sequence leaves the bounding circle it is not going back, so
there is no need to test at *every* iteration. Stash the sequence into a
fifo and test (say) every 20 iterations, then backtrack if outside the
bound.

Doing it every other iteration is good enough.
Quote:

3) Calculate multiple pixels in parallel. Adjacent pixels are often
close in iteration count.

Nice. :-)

To simplify the exit logic I'd leave the double loop as soon as the
first point is reached, then continue with only the remaining one.
Quote:

Sorry for not answering your question. I just couldn't ignore all the
performance left on the table.

:-)

Assuming we have a cpu which can issue two fp operations/cycle, with a
latency of 3 or 4 cycles, we can still do pretty well:

The key is to allow the exit condition to be determined up to two
iterations late, then correct afterwards: This removes the latency of
the exit test itself from the critical path:

for (i = 0, exit_condition = -1.0; i < MAX_ITERATIONS; i++) {
x2 = x * x; y2 = y * y;
x = next_x * next_x; t = next_y * next_y;
y = next_x * next_y;

next_x = x + cx; next_y = y * 2.0;
sq = x2 + y2;
if (exit_condition >= 0.0) break;

x -= t; y += cy;
exit_condition = sq - 4.0;
}

The inner loop should take about 9 cycles if fp latency is 3, this seems
like the minimum unless you overlap multiple iterations/pixels.

With FMAC available we can do a little better and/or tolerate a 4-cycle
fp ops latency.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Randy
Guest





Posted: Wed Jul 13, 2005 5:26 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Bart wrote:
Quote:
Eric and Niels: Thanks for the profiling and the performance tips. So
the poor performance on Itanium can be attributed to the latency of the
FP instructions (as Eric said) and bubbles in the instruction stream
(it appears as though there really isn't much that can be done with the
scheduling judging from the output code shown.)

What about the performance gap between identically clocked EV5 and EV6
machines? I recall seeing a dramatic performance increase on EV6. Is it
because EV6 is a wider architecture or is it primarily because of its
OOO nature? The purpose of this post is to help me understand OOO first
and foremost. I'm just trying to use this program as an example (but
maybe it's a poor one.) What I'm getting at is: What benefits (if any)
would an OOO implementation of a processor give for code similar to
what I presented?

Are you executing the exact same executable on each Alpha?

Conversely, did you specify the CPU's ID when you compiled for each CPU
in order to create binaries that are tuned by the compiler to
differences unique to each CPU?

I'd try both approaches, since they're likely to lead to different
results, however I'd prefer to compile explicitly for each CPU,
generating two different executables. That'd make the best use of the
CPUs' hardware and allow the fairest comparison, IMHO.

Also, if you profile each executable using a tool like Caliper or PAPI,
the source of the difference should be revealed through differences in
CPU event counts. Mapping the counts back to the source code isn't as
easy if you use PAPI, but with a tool like HPCToolkit
(http://www.hipersoft.rice.edu/hpctoolkit/), it should approach the
usability of Caliper.

Finally, if you really want to dig deep, you can use Simplescalar
(www.simplescalar.com) to run your executables and configure it to
simulate different Alphas by specifying different microarchitectural
parameters which reflect the differences in the EV5 and EV6. This
should identify which change in the microarchitecture is the root of the
runtime difference you are seeing.

Since all your data fits into registers, I'm inclined to think that data
access latencies and bandwidth should be irrelevant. As such, OOO can't
speed up the executable by hiding data latency. I don't see why OOO
would matter unless the compiler can't create an optimal static
instruction schedule in the first place (for one or both of the CPUs),
and thus, OOO is making the EV6 shine at runtime.

I also don't think the performance difference is due to branch
prediction, since there's only two jumps and they're independent and
probably execute very rarely (1/million?). Finally, I don't think
there's a meaningful difference in the ability of each CPU to execute
instructions concurrently. AFAIK, the number of functional units and
pipeline depths are essentially the same for the EV5 and EV6. They
should saturate their functional units equally well (create comparable
pipeline bubbles).

If the difference lies not in the CPUs' use of memory or the composition
of the microarchitectures, that pretty much leads me to the compiler's
(mis)use of the CPUs. I'll make a wild guess that there's a mismatch in
the scheduling of the binary's instruction stream onto one of the CPUs.

Or maybe there *is* an inherent difference in the CPUs' architectures,
like the EV5's 4 vs EV5's 6 instruction issue widths.

Here's a quick incomplete summary of the difference between the EV5 and EV6:

http://www.ece.rochester.edu/~albonesi/wced03/slides/kumar.ppt

These may also be useful:

http://en.wikipedia.org/wiki/DEC_Alpha
http://www.redhat.com/docs/manuals/linux/RHL-5.1-Manual/alpha/manual/doc030.html

Randy

Quote:

Tommy Thorn wrote:

To address Bart's original question, runtime scheduling (OOO) can be
better or worse depending.

Statically scheduled code (VLIW) have the advantage of a vastely larger
scheduling window, but is limited by dynamic effects that at best it
must predict.

Dynamically scheduled code uses a much smaller fixed sized windows (~ 80
instructios on a P4 IIRC), but schedules the *actual* instruction
sequence. Effects such as

- latency of a load (which depends on many things, including the current
cache population),
- the outcome of a branch, and
- *actual* pointer aliases

can be accounted for more accurately. Notably that latter can be a big
win, thus recent pentiums have hardware disambiguation (~ alias
analysis) to enable reording loads and stores.


Is my basic understanding of OOO/dynamically correct based on my
description of it in my first post?


Unfortunately, one cannot gauge the merit of OOO vs VLIW by looking at
current examples, such as Itanic and P4 because there are so many other
things going on here.


Right, which is why I brought up the EV5 and EV6. Of course, I don't
know the specifics of each architecture off-hand except that EV5 is
superscalar and in-order IIRC and EV6 has dynamic scheduling. I assumed
(perhaps incorrectly) that the large speed increase I saw on my
Mandelbrot test code had something to do with the dynamic scheduling as
opposed to a mere increase in execution resources. It doesn't appear as
though there is _that_ much parallelism for gcc to work with in the
inner-loop of my program so I was hoping to gain a better picture of
how the EV6 is executing the code -- particularly if it is the dynamic
scheduling that is making the difference.

The only thing I could think of is that dynamic scheduling allows for
the instruction stream to carry on past a branch that hasn't been taken
yet (branch prediction) which I believe is true, right? However, in my
Mandelbrot program, each loop iteration depends on the completion of
the previous loop. Perhaps there is still _something_ the processor can
execute if I can assume that it can look relatively far ahead into the
instruction stream.

Would it be correct to say that OOO and register renaming would benefit
code such as this:

for (i=0;i<256;i++)
{
c[i] = a[i] + b[i];
}

?

Where the loop body is small and each iteration is relatively
independent.


--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu
Back to top
Sander Vesik
Guest





Posted: Wed Jul 13, 2005 7:00 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Randy <joe@burgershack.com> wrote:
Quote:
Bart wrote:
Eric and Niels: Thanks for the profiling and the performance tips. So
the poor performance on Itanium can be attributed to the latency of the
FP instructions (as Eric said) and bubbles in the instruction stream
(it appears as though there really isn't much that can be done with the
scheduling judging from the output code shown.)

What about the performance gap between identically clocked EV5 and EV6
machines? I recall seeing a dramatic performance increase on EV6. Is it
because EV6 is a wider architecture or is it primarily because of its
OOO nature? The purpose of this post is to help me understand OOO first
and foremost. I'm just trying to use this program as an example (but
maybe it's a poor one.) What I'm getting at is: What benefits (if any)
would an OOO implementation of a processor give for code similar to
what I presented?

Are you executing the exact same executable on each Alpha?

Conversely, did you specify the CPU's ID when you compiled for each CPU
in order to create binaries that are tuned by the compiler to
differences unique to each CPU?

A similarily clocked EV6 would be faster on either binary.

Quote:
If the difference lies not in the CPUs' use of memory or the composition
of the microarchitectures, that pretty much leads me to the compiler's
(mis)use of the CPUs. I'll make a wild guess that there's a mismatch in
the scheduling of the binary's instruction stream onto one of the CPUs.

Except the differnce does lie in the microarchitecture of the CPUs.


--
Sander

+++ Out of cheese error +++
Back to top
Jan Vorbrüggen
Guest





Posted: Wed Jul 13, 2005 8:15 am    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

Quote:
The thing that puzzles me is if my understanding is correct, the
processor wouldn't be able to schedule instructions from the next
iteration until next_x and next_y have been assigned (unless it is
smart enough to execute the next_x = x instruction right after x is
first computed and then move on to the next iteration -- but can it
really look ahead that deep into the instruction stream?)

IIRC, the EV6 has a 20-cycle window, so that at most 80 instructions can
be "in flight" simultaneously. It also has more than twice as many physical
registers than architectural registers - 80 vs 32 for integer and 72 vs 32
for FP. With register renaming, it should be able to get several interations
of the loop into the pipe at the same time, with the bypasses feeding the
previous' iterations output into the next's input with only the functional-
unit latency.

Also, the EV5 has some very weird scheduling restrictions. There is some
sample code that shows that executing more instructions (most of those
being one form or other of a NOP) can substnatially speed up execution.
This is true of the EV6, albeit to a more limited degree.

Jan
Back to top
Niels Jørgen Kruse
Guest





Posted: Wed Jul 13, 2005 1:17 pm    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

<iain-3@truecircuits.com> wrote:

Quote:
Niels> 2) Once the sequence leaves the bounding circle it is not
Niels> going back, so there is no need to test at *every* iteration.
Niels> Stash the sequence into a fifo and test (say) every 20
Niels> iterations, then backtrack if outside the bound.

I never thought of that!

... it might help on something with bad branch prediction, but both
EV-5 and EV-6 shouldn't be having much trouble here. IIRC, the
average loop has dozens of iterations, so one or two mispredictions
each is no big deal and may be less than you'd get with a
backtracking scheme.

Expensive pixels have thousands of iterations. Images that only have
slow gradients are boring. You might want to use different code anyway
for easy pixels, using SP SIMD.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Back to top
dcw
Guest





Posted: Thu Jul 14, 2005 3:11 pm    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

In article <1121136759.721988.22130@z14g2000cwz.googlegroups.com>,
Bart <bart_trzynadlowski@yahoo.com> wrote:

Quote:
for (i = 0; i < MAX_ITERATIONS; i++)
{
x = next_x * next_x - next_y * next_y + cx;
y = 2 * next_x * next_y + cy;

next_x = x;
next_y = y;

if ((x * x + y * y) > 4) // while |Zn| < 2
break;
}

Not relevant to your question, but you're computing every value
of x*x and y*y twice, once in the condition and again (under a
different name) at the beginning of the next iteration. You
only need three multiplications per iteration.

David
Back to top
Tom Linden
Guest





Posted: Thu Jul 14, 2005 4:15 pm    Post subject: Re: Where do the gains of OOO architecture actually take eff Reply with quote

On Thu, 14 Jul 05 10:11:54 GMT, dcw <D.C.Wood@ukc.ac.uk> wrote:

Quote:
In article <1121136759.721988.22130@z14g2000cwz.googlegroups.com>,
Bart <bart_trzynadlowski@yahoo.com> wrote:

for (i = 0; i < MAX_ITERATIONS; i++)
{
x = next_x * next_x - next_y * next_y + cx;
y = 2 * next_x * next_y + cy;

next_x = x;
next_y = y;

if ((x * x + y * y) > 4) // while |Zn| < 2
break;
}

Not relevant to your question, but you're computing every value
of x*x and y*y twice, once in the condition and again (under a
different name) at the beginning of the next iteration. You
only need three multiplications per iteration.

I would imagine most compilers would pick it up while doing CSE.
Quote:

David




Back to top
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture All times are GMT
Goto page 1, 2  Next
Page 1 of 2

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




VoIP Electronics Powered by phpBB