Peak FLOPS for the PowerPC G4 7410?
Robin Bruce
Guest





Posted: Fri Oct 07, 2005 4:15 pm    Post subject: Peak FLOPS for the PowerPC G4 7410? Reply with quote

Does anyone have peak FLOPS and realistic sustained FLOPS figures for
this processor?

In fact, does anyone know a good resource that tells you the peak FLOPS
for a range of microprocessors? It seems like all I do is trawl the net
for this sort of info and find it hard to get these figures.

Not that peak FLOPS proves much anyway, but at least if a processor has
peak FLOPS of 5 GFLOPS in its marketing bumf and I KNOW I can sustain
6 GFLOPS with a design, then I know who's going to win!
Back to top
James Irwin
Guest





Posted: Fri Oct 07, 2005 4:15 pm    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

Kazushige Goto's BLAS download page used to include a few graphs of his
sustained performance from a few (now older) processor architectures:

http://images.google.com/images?client=safari&rls=en&q=flame%20blas%20goto&ie=UTF-8&oe=UTF-8&sa=N&tab=wi

In summary you can get 'most' of peak from most common architectures
including e.g. Intel Xeons, AMD Opterons that give ~2 x clock dgemm
FLOPS. E.g. I've measured ~11/12 GFLOPS of dgemm performance on dual
3.4/3.6 GHz Xeons and ~8.5 GFLOPS on dual 2.2/2.4 GHz Opteron systems with
slightly over half of that for single chip tests.
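
As a rough check, the figures above can be compared against the "2 x clock" peak. The sketch below assumes 2 double-precision FLOPs per cycle per chip, as stated in the post, and pairs the lower measurement with the lower clock; that pairing is a guess, since the post does not say which measurement belongs to which clock speed.

```python
# Rough efficiency check of the quoted dgemm figures against a "2 x clock" peak.
# Assumption: 2 double-precision FLOPs per cycle per chip, as stated in the post.

def peak_gflops(clock_ghz, flops_per_cycle=2.0, chips=1):
    """Theoretical peak in GFLOPS for `chips` processors at `clock_ghz`."""
    return clock_ghz * flops_per_cycle * chips

def efficiency(measured_gflops, clock_ghz, flops_per_cycle=2.0, chips=1):
    """Fraction of theoretical peak actually delivered."""
    return measured_gflops / peak_gflops(clock_ghz, flops_per_cycle, chips)

# (label, measured GFLOPS, clock in GHz); the measurement/clock pairing is assumed.
systems = [
    ("dual Xeon 3.4 GHz", 11.0, 3.4),
    ("dual Xeon 3.6 GHz", 12.0, 3.6),
    ("dual Opteron 2.2 GHz", 8.5, 2.2),
    ("dual Opteron 2.4 GHz", 8.5, 2.4),
]

for name, measured, clock in systems:
    print(f"{name}: {efficiency(measured, clock, chips=2):.0%} of peak")
```

Under these assumptions every system lands somewhere above 80% of peak, which is consistent with the claim that dgemm gets "most" of peak on common architectures.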

Going more exotic, something like an SX8 will dish out ~16GFLOPS of
dgemm and CSX600 (shameless plug) considerably more than that.

Please pay more attention to delivered sustained performance for the
application rather than peak otherwise I fear you may fall foul of the
gigantic FLOP theoretical capabilities of things like GPUs.



James Irwin
Back to top
Thomas Womack
Guest





Posted: Fri Oct 07, 2005 5:59 pm    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

In article <1128702378.238714.70930@f14g2000cwb.googlegroups.com>,
Robin Bruce <robin.bruce@gmail.com> wrote:

Quote:
The problem with what I'm trying to do is that I'm trying to make a
general comparison between commodity microprocessor-based
stored-program architectures and FPGA-based reconfigurable computers.
There is no specific application to compare. The uC people at MAPLD in
Washington last month were claiming that they were getting
as-near-as-damn-it peak FLOPS out of their Xeons using the icc compiler
(combined I imagine with some delicately designed microcode) across a
range of simple classic algorithms.

I'm prepared to believe them, for simple enough and classic enough
algorithms -- icc has pretty good pattern-matching of Standard Things
one might want to do, I'd not be at all surprised if it recognised
matrix multiplication and called out to Intel's very good
hand-optimised matrix libraries. I'm not _quite_ sure what you mean
by 'delicately designed microcode'; I don't think it's compulsory to
write in assembler for close-to-peak performance if what you're doing
is simple enough.

High-end x86 platforms nowadays have at peak 2 double-precision FLOPs
per cycle and 4 single-precision. Itanium has 4 DP since it has two
FMAC units; the same holds for Power4-like things (IBM high-end
equipment and Apple G5). Probably the easiest way to get the figures
is to get the Top500 list at
http://top500.org/lists/plists.php?Y=2005&M=06, find a machine with
the appropriate processor architecture, and divide R_peak by
N_cpus*GHz.
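
The division described above can be sketched as follows. The machine entries are made-up placeholders chosen to give round answers, not real Top500 rows; R_peak is taken in GFLOPS so that it matches a clock given in GHz.

```python
# Recover peak FLOPs per cycle per CPU from a Top500-style entry.
# The entries below are hypothetical placeholders, not real list data.

def flops_per_cycle(r_peak_gflops, n_cpus, clock_ghz):
    """R_peak / (N_cpus * GHz): peak FLOPs issued per cycle by one CPU."""
    return r_peak_gflops / (n_cpus * clock_ghz)

hypothetical_entries = [
    # (label, R_peak in GFLOPS, number of CPUs, clock in GHz)
    ("machine A (made up)", 6144.0, 1024, 1.5),
    ("machine B (made up)", 3072.0, 512, 3.0),
]

for label, r_peak, n_cpus, ghz in hypothetical_entries:
    print(f"{label}: {flops_per_cycle(r_peak, n_cpus, ghz):.1f} FLOPs/cycle")
```

A result of 4 would match the two-FMAC designs mentioned above, and a result of 2 would match the x86 double-precision figure.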

Tom
Back to top
Robin Bruce
Guest





Posted: Fri Oct 07, 2005 9:26 pm    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

I very much take your point about sustained performance. I'm aware of
the difference between peak and sustained FLOPS though. It's comical to
think that if you count GPUs the next playstation will have peak FLOPS
of 2TFLOPS! Chain a few of them together and you've got yourself a
supercomputer ;-)

The problem with what I'm trying to do is that I'm trying to make a
general comparison between commodity microprocessor-based
stored-program architectures and FPGA-based reconfigurable computers.
There is no specific application to compare. The uC people at MAPLD in
Washington last month were claiming that they were getting
as-near-as-damn-it peak FLOPS out of their Xeons using the icc compiler
(combined I imagine with some delicately designed microcode) across a
range of simple classic algorithms.

It's for this reason that I feel I need to imagine I'm competing with
general-purpose commodity microprocessors operating at peak FLOPS.
Otherwise the people in the high-performance embedded computing
community are going to say that I'm not making a fair comparison.
They're of the opinion that you can't design a circuit in hardware,
implement it on an FPGA and then try and compare it to the same
algorithm compiled using gcc targeted to x86 in general when icc
specifically targeting a Xeon with the right flags set might perform
better by an order of magnitude.

I don't pretend to be an expert in this domain though. I'm very new to
all this, so I'd appreciate critical feedback.

Robin
Back to top
Maynard Handley
Guest





Posted: Sat Oct 08, 2005 12:15 am    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

In article <1128688149.074046.231600@g49g2000cwa.googlegroups.com>,
"Robin Bruce" <robin.bruce@gmail.com> wrote:

Quote:
Does anyone have a peak FLOPS and realistic sustained FLOPS for this
processor.

In fact, does anyone know a good resource that tells you the peak FLOPS
for a range of microprocessors? It seems like all I do is trawl the net
for this sort of info and find it hard to get these figures.

Not that peak FLOPS proves much anyway, but at least if a processor has
peak FLOPS of 5 GFLOPS in its marketing bumf and I KNOW I can sustain
6 GFLOPS with a design, then I know who's going to win!

The peak flops is the MHz times 2 (one MAC/cycle) times a fudge factor.
The fudge factor is I think 5/6 but it might be 7/8.
The point is that a naive reading of the CPU docs will tell you that the
system can sustain one MAC/cycle, while a slightly more careful reading
will tell you that if there is a long succession of back-to-back FP
operations then every nth operation (where I think n is 6, but, as I
said, it may be 8; I think n is the length of the fp pipeline or the
length of the fp pipeline+1, which is why I think it is 6) there is a
cycle of basically dead-time which is necessary to deal with some
exception processing race condition or something.
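
The formula above can be worked through numerically. This is only a sketch: the 500 MHz clock is an illustrative assumption rather than a figure from this thread, and the fudge factor is modelled as one dead cycle in every n cycles, i.e. (n - 1) / n, which reproduces the 5/6 and 7/8 values mentioned.

```python
from fractions import Fraction

def scalar_peak_mflops(clock_mhz, n=6):
    """Scalar peak in MFLOPS: MHz x 2 (one MAC per cycle) x (n - 1) / n.

    `n` is the period of the dead cycle; the post suggests 6, possibly 8.
    """
    fudge = Fraction(n - 1, n)
    return float(clock_mhz * 2 * fudge)

CLOCK_MHZ = 500  # assumed clock, for illustration only

print(f"naive     : {CLOCK_MHZ * 2} MFLOPS")
print(f"fudge 5/6 : {scalar_peak_mflops(CLOCK_MHZ, n=6):.1f} MFLOPS")
print(f"fudge 7/8 : {scalar_peak_mflops(CLOCK_MHZ, n=8):.1f} MFLOPS")
```

At the assumed clock the two candidate fudge factors differ by only about 40 MFLOPS, so a careful measurement is needed to tell them apart.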

OK, so that's the theory. Is it accurate? Yes it is. The radix-8 512 or
64 point IFFT that I wrote for AAC decoding (back in the day, before
this sort of thing was off-loaded to AltiVec) uses the full set of 32 fp
registers, runs all loads/stores (in-cache) and address arithmetic in
parallel with MACs, and manages to hit exactly the expected
performance --- which is, of course, what got me to reading the docs
very carefully to understand why I was getting only 5/6th of my expected
performance.

At the time my personal machine was 7410 based, so I was very aware of
the issue. I'm afraid I couldn't tell you if you have the same sort of
fudge-factor on either 750s or 7450s.
The other issue is how many load-stores you plan to run in parallel with
the MACs. If it's not too many, you are OK, but limitations (like only 6
rename registers) can kick in if you try to aggressively run say one
load plus one MAC/cycle.

Now, of course, the 7410 is AltiVec enabled, meaning that you should, if
your code matches AltiVec capabilities, actually be able to get 4x the
scalar performance (and actually better because, without having to deal
with FP exceptions, the 5/6 fudge factor does not exist for AltiVec). Of
course you may lose some performance in shuffling your data into and out
of AltiVec alignment, but for many types of code you should still be
able to get at least 2x the scalar FLOPs.
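
To put rough numbers on that range, here is a sketch under stated assumptions: a 500 MHz clock (illustrative, not from this thread), the 5/6 scalar fudge factor from above, and four single-precision lanes per AltiVec MAC. AltiVec handles single precision only, so the comparison is against scalar code that could also run in single precision. Taking "2x scalar" relative to the fudge-adjusted scalar rate is an interpretation.

```python
def altivec_bounds_mflops(clock_mhz, scalar_fudge=5 / 6, lanes=4):
    """Return (scalar, vector_floor, vector_ideal) in MFLOPS.

    scalar       : one MAC (2 FLOPs) per cycle times the scalar fudge factor
    vector_floor : "at least 2x the scalar FLOPs"
    vector_ideal : one `lanes`-wide MAC per cycle with no fudge factor
    """
    scalar = clock_mhz * 2 * scalar_fudge
    vector_floor = 2 * scalar
    vector_ideal = clock_mhz * 2 * lanes
    return scalar, vector_floor, vector_ideal

scalar, floor, ideal = altivec_bounds_mflops(500)  # 500 MHz is an assumed clock
print(f"scalar        : {scalar:.0f} MFLOPS")
print(f"AltiVec floor : {floor:.0f} MFLOPS")
print(f"AltiVec ideal : {ideal:.0f} MFLOPS ({ideal / scalar:.1f}x scalar)")
```

The ideal ratio comes out at 4.8x rather than 4x, which is the "actually better" effect of the vector unit not paying the scalar fudge factor.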

Maynard
Back to top
Wes Felter
Guest





Posted: Sat Oct 08, 2005 5:20 am    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

On 2005-10-07 07:29:09 -0500, "Robin Bruce" <robin.bruce@gmail.com> said:

Quote:
Does anyone have a peak FLOPS and realistic sustained FLOPS for this
processor.

http://cr.yp.to/hardware/ppc.html

--
Wes Felter - wesley@felter.org - http://felter.org/wesley/
Back to top
Zak
Guest





Posted: Sat Oct 08, 2005 1:53 pm    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

Robin Bruce wrote:

Quote:
They're of the opinion that you can't design a circuit in hardware,
implement it on an FPGA and then try and compare it to the same
algorithm compiled using gcc targeted to x86 in general when icc
specifically targeting a Xeon with the right flags set might perform
better by an order of magnitude.

Why not? See both as a black box. Both have their bandwidth limit to the
outside and their internal limits. The x86 will have a lot more overhead
from serializing the algorithm into a program and then parallelizing that.

But hey, see it as a black box. If the solution fits in the box it will
perform similarly.
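
One way to make the black-box view concrete is to bound delivered performance by the smaller of the internal compute limit and what the external bandwidth can feed. This is one possible framing rather than anything spelled out in the post, and every number below is a hypothetical placeholder.

```python
def black_box_gflops(internal_peak_gflops, io_gbytes_per_s, flops_per_byte):
    """Upper bound on delivered GFLOPS for a box with two limits.

    internal_peak_gflops : what the box can compute once data is inside
    io_gbytes_per_s      : bandwidth between the box and the outside
    flops_per_byte       : how much arithmetic the algorithm does per byte moved
    """
    return min(internal_peak_gflops, io_gbytes_per_s * flops_per_byte)

# Hypothetical box: 6 GFLOPS internal peak, 3.2 GB/s to the outside.
for label, intensity in [("streaming kernel", 0.25), ("dense matrix kernel", 8.0)]:
    bound = black_box_gflops(6.0, 3.2, intensity)
    print(f"{label}: at most {bound:.1f} GFLOPS")
```

With these placeholder numbers the streaming kernel is limited by bandwidth and the dense kernel by the internal peak, regardless of whether the box holds a CPU or an FPGA.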

Now, power consumption, cost if you make fifteen million of them, etc,
are other things. Or perhaps running your algorithm in parallel: if you
need to run it on a hundred FPGAs and a hundred DIMMs, that may (once
you have the single chip running) be easier to achieve than 100 fold
widening of the general purpose computer.


Thomas
Back to top
Emery Davis
Guest





Posted: Sat Oct 08, 2005 4:15 pm    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

On Fri, 07 Oct 2005 23:30:39 GMT
Maynard Handley <name99@name99.org> wrote:

] In article <1128688149.074046.231600@g49g2000cwa.googlegroups.com>,
] "Robin Bruce" <robin.bruce@gmail.com> wrote:
]
] > Does anyone have a peak FLOPS and realistic sustained FLOPS for this
] > processor.
] >
][]
] Now, of course, the 7410 is AltiVec enabled, meaning that you should, if
] your code matches AltiVec capabilities, actually be able to get 4x the
] scalar performance (and actually better because, without having to deal
] with FP exceptions, the 5/6 fudge factor does not exist for AltiVec). Of
] course you may lose some performance in shuffling your data into and out
] of AltiVec alignment, but for many types of code you should still be
] able to get at least 2x the scalar FLOPs.
]

The fudge factor on the 7410 altivec unit is 1.5 clocks per mac instruction, not
counting any LSU activity. I've confirmed this empirically and also had it
confirmed by Motorola when the thing first came out. Can't remember whether
the old sim_g4 showed this or not. IIRC on the 745x the number went to 1.25.
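
Converting those clocks-per-MAC figures into throughput is straightforward. The sketch below assumes four single-precision lanes per AltiVec MAC and counts each lane's multiply-add as 2 FLOPs; the function name is made up for this example.

```python
def vector_flops_per_cycle(clocks_per_mac, lanes=4):
    """Sustained FLOPs per cycle when one `lanes`-wide MAC takes
    `clocks_per_mac` cycles on average (2 FLOPs per lane per MAC)."""
    return lanes * 2 / clocks_per_mac

IDEAL = vector_flops_per_cycle(1.0)  # one vector MAC every cycle

for chip, clocks in [("7410", 1.5), ("745x", 1.25)]:
    rate = vector_flops_per_cycle(clocks)
    print(f"{chip}: {rate:.2f} FLOPs/cycle ({rate / IDEAL:.0%} of one MAC per cycle)")
```

On these figures the 7410 vector unit would sustain about two thirds of the one-MAC-per-cycle rate, and the 745x about 80% of it.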

HTH,

-E
Back to top
Niels Jørgen Kruse
Guest





Posted: Sat Oct 08, 2005 11:48 pm    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

Emery Davis <notareal@address.com> wrote:

Quote:
On Fri, 07 Oct 2005 23:30:39 GMT
Maynard Handley <name99@name99.org> wrote:

] In article <1128688149.074046.231600@g49g2000cwa.googlegroups.com>,
] "Robin Bruce" <robin.bruce@gmail.com> wrote:
]
] > Does anyone have a peak FLOPS and realistic sustained FLOPS for this
] > processor.
]
][]
] Now, of course, the 7410 is AltiVec enabled, meaning that you should, if
] your code matches AltiVec capabilities, actually be able to get 4x the
] scalar performance (and actually better because, without having to deal
] with FP exceptions, the 5/6 fudge factor does not exist for AltiVec). Of
] course you may lose some performance in shuffling your data into and out
] of AltiVec alignment, but for many types of code you should still be
] able to get at least 2x the scalar FLOPs.
]

The fudge factor on the 7410 altivec unit is 1.5 clocks per mac instruction,
not counting any LSU activity. I've confirmed this empirically and also had
it confirmed by Motorola when the thing first came out. Can't remember
whether the old sim_g4 showed this or not. IIRC on the 745x the number went
to 1.25.

I don't have a G4, but on the PPC970 at least, it is possible to sustain
1 vector FMADD along with 2 scalar FMADDs per clock. Perhaps you just
need to tweak that loop of yours a little, even if the Users Manual
doesn't say anything about it making any difference. That is the case on
the 970.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Back to top
Greg Lindahl
Guest





Posted: Sun Oct 09, 2005 12:15 am    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

In article <1128699910.109281.236280@g44g2000cwa.googlegroups.com>,
James Irwin <irwinj@gmail.com> wrote:

Quote:
Kazushige Goto's BLAS download page used to include a few graphs of his
sustained performance from a few (now older) processor architectures:

This is the wrong definition of "sustained performance", unless your
application happens to be dominated by matrix-matrix multiply. As you
note, most architectures can get most of peak on this particular
computation. So it's not predictive at all of most computations, which
only get a small fraction of peak.

-- greg
Back to top
Emery Davis
Guest





Posted: Sun Oct 09, 2005 12:15 am    Post subject: Re: Peak FLOPS for the PowerPC G4 7410? Reply with quote

On Sat, 8 Oct 2005 20:48:52 +0200
nospam@ab-katrinedal.dk (Niels Jørgen Kruse) wrote:

] Emery Davis <notareal@address.com> wrote:
]
] > On Fri, 07 Oct 2005 23:30:39 GMT
] > Maynard Handley <name99@name99.org> wrote:
] >
] > ] In article <1128688149.074046.231600@g49g2000cwa.googlegroups.com>,
] > ] "Robin Bruce" <robin.bruce@gmail.com> wrote:
] > ]
] > ] > Does anyone have a peak FLOPS and realistic sustained FLOPS for this
] > ] > processor.
] > ] >
] > ][]
] > ] Now, of course, the 7410 is AltiVec enabled, meaning that you should, if
] > ] your code matches AltiVec capabilities, actually be able to get 4x the
] > ] scalar performance (and actually better because, without having to deal
] > ] with FP exceptions, the 5/6 fudge factor does not exist for AltiVec). Of
] > ] course you may lose some performance in shuffling your data into and out
] > ] of AltiVec alignment, but for many types of code you should still be
] > ] able to get at least 2x the scalar FLOPs.
] > ]
] >
] > The fudge factor on the 7410 altivec unit is 1.5 clocks per mac instruction,
] > not counting any LSU activity. I've confirmed this empirically and also had
] > it confirmed by Motorola when the thing first came out. Can't remember
] > whether the old sim_g4 showed this or not. IIRC on the 745x the number went
] > to 1.25.
]
] I don't have a G4, but on the PPC970 at least, it is possible to sustain
] 1 vector FMADD along with 2 scalar FMADDs per clock. Perhaps you just
] need to tweak that loop of yours a little, even if the Users Manual
] doesn't say anything about it making any difference. That is the case on
] the 970.

It seems spurious to point out that the micro-architecture of the 970
is substantially different than the 7410.

In any case such measurements are performed in very unrolled loops,
and the branch is taken into account. As I pointed out, the results were
confirmed by Motorola. In any case YMMV but that was what we found,
and I hold to it.

-E
Back to top
 