| Author |
Message |
Randy
Guest
|
Posted:
Thu Jul 28, 2005 8:15 am Post subject:
Re: Cluster computing drawbacks |
|
|
Greg Lindahl wrote:
| Quote: | In article <1kmFe.28938$0f.12173@tornado.texas.rr.com>,
Randy <joe@burgershack.com> wrote:
Interesting. I'd like to know more about the HW used to generate the
Pathscale 1.31 us number. The fact that it appears to beat the Cray
XD1's MicroArray implies a connection that exceeds the performance of
any available or next-gen Opteron bus.
No, that's not the implication. The Opteron bus is not a bottleneck in
this situation. The XD1 uses an FPGA and we use an ASIC, and we build
our interconnect software with the best Opteron compiler (our own).
From the XD1 dual-port bandwidth numbers it's clear that they're using
8-bit HT and we're using 16-bit HT, but that doesn't affect the Random
Ring latency much.
|
I'm not sure we're really disagreeing here, but I see a few more
interesting points worth making, so I'll continue...
I can see where higher performance clocks and data path might make an
appreciable difference in moving data quickly. But I'm not clear how
a C compiler is going to contribute appreciably to a more efficient
use of the hardware. Given the seeming simplicity of routing
algorithms, I'd have thought they would be implemented either in
assembly, or using such simple C primitives that any compiler could
generate a near-optimal instruction sequence.
| Quote: |
BTW, 1.31 us is 1310 ns. According to SGI, the Altix's cache line
latency (128 bytes) within a C brick is 145 ns, or 9 times faster than
1310 ns (0 bytes).
That's an awfully small system -- that 1.31 number is good up to 12
nodes/24 cpus. BTW, SGI's numbers are all optimistic and depend on the
fact that only one CPU is moving a line, not all the cpus in the
system. That's why Random Ring Latency is a good benchmark, it uses
all the cpus in all the nodes in a system.
|
Honestly, I think that's a better example of how MPI programs access
data than how SMP programs do it. I'm sure that some HPC codes direct
all processes to demand the same data at the same time, but I suspect
most do not, for obvious reasons. If that were not the case, then
shared nothing computers would be superior at executing all HPC
applications. And they ain't. (More on this below.)
| Quote: |
Worst case latency on a 512 PE system is 800 ns on
a NUMA Link 3, or about 640 ns on a NUMA Link 4, which still beats
1310 ns handily.
You're comparing apples and oranges. Try testing it with all the cpus
getting a line, and you'll find radically different numbers.
|
Sure, but that access pattern is not representative of real world
codes. If 95% of data accesses land within caches, then only 1
process in 20 should be requesting a line that crosses the
interconnect at any given moment. The same is true of clusters, of
course. But transparent access to data among multiple processes is
the raison d'etre of SMPs, and as long as the system will fully
support 5% of the processes doing this, SMPs have significant
programmatic advantages over clusters. And if their MPI performance
is no worse than clusters, SMPs offer the best of both worlds. Of
course, that comes at a price, that is, $$$.
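The arithmetic behind these comparisons can be checked with a minimal Python sketch. The latency figures are the ones quoted in this exchange; the 95% hit rate is the assumption stated in the paragraph above, and the function names are purely illustrative.

```python
# Latencies quoted in the thread, in nanoseconds.
PATHSCALE_RANDOM_RING_NS = 1310   # 1.31 us, 0-byte message
ALTIX_C_BRICK_NS = 145            # 128-byte cache line within a C brick
NUMALINK3_WORST_NS = 800          # worst case, 512 PE system
NUMALINK4_WORST_NS = 640          # approximate worst case


def speedup(slow_ns, fast_ns):
    """Return how many times faster fast_ns is than slow_ns."""
    return slow_ns / fast_ns


def remote_requesters(n_procs, hit_rate):
    """Expected number of processes crossing the interconnect at once.

    Assumes accesses are independent and evenly spread in time,
    which is an idealization, not a measured property of any code.
    """
    return n_procs * (1.0 - hit_rate)


print(round(speedup(PATHSCALE_RANDOM_RING_NS, ALTIX_C_BRICK_NS), 1))
print(round(speedup(PATHSCALE_RANDOM_RING_NS, NUMALINK3_WORST_NS), 2))
print(round(speedup(PATHSCALE_RANDOM_RING_NS, NUMALINK4_WORST_NS), 2))
print(round(remote_requesters(20, 0.95), 2))
```

The "1 process in 20" figure holds only under the independence assumption in `remote_requesters`; if all processes communicate in the same phase, the instantaneous count can be much higher than the average.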
| Quote: |
That introduces an interesting point... that commodity cluster nodes
and interconnects like Cray MicroArray and Pathscale's Infinipath may
soon converge the term SMP with cluster.
How could it, given that we don't provide shared memory semantics?
-- greg
|
In terms of shared memory HPC apps, I'd argue that cache coherence is
not all it's cracked up to be. The Cray T3E/T3D wasn't cache
coherent. It ran MPI, PVM, and Linda very nicely and its shmget/
shmput services supported some very handy programming conventions,
albeit proprietary ones.
In fact, I'd argue that cache coherence really isn't necessary for
most folks skilled in programming shared memory for HPC. Most shmem
apps partition data during writes such that system-management of
coherence is overkill. A single call to flush() by each process at
the end of a split loop is all that's needed to keep the newly written data
coherent.
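As a rough illustration of that pattern, here is a sketch in Python using threads. It only simulates the idea: `shared` stands in for globally visible memory, each worker's `local` list stands in for a private, possibly stale cache, and `flush()` is a hypothetical helper, not the call of any real shmem library.

```python
import threading

N_WORKERS = 4
CHUNK = 5
shared = [0] * (N_WORKERS * CHUNK)     # stands in for globally visible memory
barrier = threading.Barrier(N_WORKERS)
seen = [None] * N_WORKERS              # what each worker reads after the barrier


def flush(local, start):
    """Hypothetical flush(): publish a private copy to shared memory."""
    shared[start:start + len(local)] = local


def worker(rank):
    start = rank * CHUNK
    local = [0] * CHUNK                # private buffer, invisible to others
    for i in range(CHUNK):             # this worker's slice of the split loop
        local[i] = (start + i) ** 2
    flush(local, start)                # one flush per process, after the loop
    barrier.wait()                     # nobody reads until everyone has flushed
    seen[rank] = sum(shared)           # every worker now sees all the writes


threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)
```

Because the write sets are disjoint, no per-access coherence traffic is needed in this model; the flush plus a barrier at the end of the loop is the only synchronisation point.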
Yes, such machines are only of historical interest. But with the
improvements in latency made available by newer cluster interconnects
(the XD1 and InfiniPath), the use of clusters as non-coherent SMPs may
again be viable. It's an interesting possibility, anyway.
Randy
--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu |
|
Nick Maclaren
Guest
|
Posted:
Thu Jul 28, 2005 4:15 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <42e8d002$1@news.meer.net>,
lindahl@pbm.com (Greg Lindahl) writes:
|> In article <iM_Fe.38250$0f.35619@tornado.texas.rr.com>,
|> Randy <joe@burgershack.com> wrote:
|>
|> >In terms of shared memory HPC apps, I'd argue that cache coherence is
|> >not all it's cracked up to be. The Cray T3E/T3D wasn't cache
|> >coherent.
|>
|> Right, but that's not shared memory. People disagree on what term to
|> use, but they agree that it isn't shared memory. My favorite term is
|> "shared address, local consistency", but apparently there's some newer
|> term for it.
Grrk. Cache-coherent shared memory isn't the only form of shared
memory. It is perfectly reasonable to have no automatic cache
coherence and a synchronisation primitive, and for it still to be
shared memory. There were several successful incoherent shared
memory systems in the early 1980s.
However, I fully agree that a shared address space is NOT enough
to call something shared memory. One does need at least some sense
in which the actual memory is shared and not just the addressability.
I agree with him that it isn't what it is cracked up to be, not
least because of the problem of register/cache incoherence. If
you want that automatic, you are effectively denying all serious
optimisation. But most languages lack a synchronisation primitive
or even a synchronisation model across threads. OpenMP gets this
(largely) right, where POSIX threads doesn't.
Regards,
Nick Maclaren. |
|
Greg Lindahl
Guest
|
Posted:
Thu Jul 28, 2005 4:15 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <iM_Fe.38250$0f.35619@tornado.texas.rr.com>,
Randy <joe@burgershack.com> wrote:
| Quote: | I can see where higher performance clocks and data path might make an
appreciable difference in moving data quickly. But I'm not clear how
a C compiler is going to contribute appreciably to a more efficient
use of the hardware.
|
Passing an MPI message involves software, too.
| Quote: | Honestly, I think that's a better example of how MPI programs access
data than how SMP programs do it. I'm sure that some HPC codes direct
all processes to demand the same data at the same time, but I suspect
most do not, for obvious reasons.
|
Nobody said "the same data". Most HPC programs advance in lockstep,
which means that they all communicate at the same time.
| Quote: | Sure, but that access pattern is not representative of real world
codes. If 95% of data accesses land within caches, then only 1
process in 20 should be requesting a line that crosses the
interconnect at any given moment.
|
No. You speak like someone who's never tried to speed up an SMP
program.
| Quote: | But transparent access to data among multiple processes is
the raison d'etre of SMPs,
|
And that's their biggest scaling problem. But we already had
this discussion once before.
| Quote: | And if their MPI performance
is no worse than clusters, SMPs offer the best of both worlds.
|
Yes, but lose on price/performance. BTW, we've had some of our
customers tell us that we're even to slightly better than Altix on MPI
programs, at a much lower cost.
| Quote: | In terms of shared memory HPC apps, I'd argue that cache coherence is
not all it's cracked up to be. The Cray T3E/T3D wasn't cache
coherent.
|
Right, but that's not shared memory. People disagree on what term to
use, but they agree that it isn't shared memory. My favorite term is
"shared address, local consistency", but apparently there's some newer
term for it.
By the way, you're arguing against your other point, transparent
memory access. The T3E did not offer that: you had to know what
wasn't cache coherent and what was. So if transparent access is the
raison d'etre of shared memory...
| Quote: | Yes, such machines are only of historical interest.
|
Actually no. You might note that the department that you work at has a
large project working on a language that implements this kind of
model, which uses the same compiler infrastructure that my employer
does. But hey, if you prefer lecturing me about what I already know on
comp.arch, don't let me stop you ;-)
| Quote: | But with the
improvements in latency made available by newer cluster interconnects
(the XD1 and InfiniPath), the use of clusters as non-coherent SMPs may
again be viable. It's an interesting possibility, anyway.
|
That's not an SMP. It is an interesting possibility, but if you
continue to use the wrong term for it, you aren't going to have an
interesting discussion. Trust me on this.
-- greg
(working for, not speaking for, PathScale.) |
|
Colonel Forbin
Guest
|
Posted:
Thu Jul 28, 2005 9:42 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <dc8mmm$ddd$1@gemini.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
| Quote: |
Yes. A long time back, I got flamed for pointing out that postcards
were a perfectly good form of communication for the MOST embarrassingly
parallel applications, and had actually been used for the purpose!
'Tis true, sir ....
|
Yes, has worked for some groups of mathematicians working on problems
which didn't require a great deal of internode communications. Sure
worked for Craig Shergold, as well! :)
| Quote: | There is a gradation of requirements from there right up to the ones
that scale only if the interconnect latency is comparable to the
local memory latency.
|
Exactly. This is why I think the SGI Altix ccNUMA approach will
eventually supplant the current generation of MPI clusters for
most problems.
The problem I see with this debate is that there is a "cluster mindset"
that has evolved where people are fixated on the cluster instead of the
problem. The same is true to a lesser extent of the "SMP mindset." The
"cluster mindset" tends to cause people to view "HPC problems" as the
subset of problems that execute well on MPI clusters.
It kind of reminds me of the rather convoluted logic in the statement
that in general if you have two multiprocessor systems of the same
overall CPU power, the one with the slower CPUs will perform better on
IO bound applications. The problem with this statement is that it seems
to imply that CPU performance and IO performance are inversely related
when actually the reason is that having more nodes generally leads to
higher aggregate IO throughput.
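A toy calculation makes the point. The numbers below are hypothetical, chosen only for illustration, and the model assumes each node contributes one I/O channel of fixed throughput regardless of how fast its CPUs are.

```python
def aggregate_io(total_cpu_power, cpu_power_per_node, io_per_node_mb_s):
    """Aggregate I/O throughput when a fixed CPU budget is split into nodes.

    Illustrative model only: every node brings its own I/O channel,
    so throughput scales with the node count, not with CPU speed.
    """
    nodes = total_cpu_power // cpu_power_per_node
    return nodes * io_per_node_mb_s


# Hypothetical figures: same total CPU power, same per-node I/O.
print(aggregate_io(64, 8, 100))   # few nodes with fast CPUs
print(aggregate_io(64, 2, 100))   # many nodes with slow CPUs
```

The slower-CPU configuration wins only because it has four times as many nodes, which is the actual cause the statement obscures.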
The whole point of cache is that it is local. Hence, it is
obvious that optimizing a highly parallel application is going
to require avoiding assumptions about cache coherency for
nonlocal access and is going to require minimizing nonlocal memory
access in general.
OTOH, if you have the situation of a supercomputer as a shared
resource, the ccNUMA design is extremely attractive. This
permits optimal execution of applications whose nodal memory
requirements may vary asynchronously over time.
More importantly, the ccNUMA design lends itself to easier
management of workloads which may contain a number of different
applications by separating the allocation of memory, cpu, and
other resources rather than rigidly allocating them all by node.
Concepts such as thread/process affinity in scheduling and the
use by programmers of explicit MPI in optimizing *critical portions*
of applications maximizes flexibility and system resource optimization.
My personal opinion is that as the option of ccNUMA comes down in
price the disadvantages of current MPI clusters will become
more and more apparent, and that they will largely fall from favor.
The real issue is assigning resources to a problem in the most
effective and cost efficient fashion.
Thus, the optimal architecture for most problems is one where
memory, CPU power and other resources can be independently and
dynamically assigned to maximize resource utilization. This is
what makes SMPs attractive, and is what will likely make Altix
style ccNUMA attractive as it comes down in cost, not some
notion of optimal application design.
In this light, I see the whole "cluster vs. SMP" debate in the
same general perspective as "RISC vs. CISC." It's a market
bubble. |
|
Nick Maclaren
Guest
|
Posted:
Thu Jul 28, 2005 9:53 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <aW7Ge.34267$B52.14202@tornado.ohiordc.rr.com>,
forbin@dev.nul (Colonel Forbin) writes:
|>
|> The problem I see with this debate is that there is a "cluster mindset"
|> that has evolved where people are fixated on the cluster instead of the
|> problem. The same is true to a lesser extent of the "SMP mindset." The
|> "cluster mindset" tends to cause people to view "HPC problems" as the
|> subset of problems that execute well on MPI clusters.
Well, there is, but clusters vary a lot, too. I am currently
going spare writing a tender, and that is specifying both a
serious interconnect and reasonably serious management. On
the other hand, there are the roomfuls of PCs.
The difference between the best integrated and connected clusters
and the Altix is essentially JUST cache coherent shared memory
versus distributed. Managing and using the Hitachi SSR2201 and
an SGI Origin weren't very different.
|> More importantly, the ccNUMA design lends itself to easier
|> management of workloads which may contain a number of different
|> applications by separating the allocation of memory, cpu, and
|> other resources rather than rigidly allocating them all by node.
Not in an HPC context. We tend to go to great trouble to isolate
the applications on a SMP machine - it is EASIER on a MPI cluster.
|> My personal opinion is that as the option of ccNUMA comes down in
|> price the disadvantages of current MPI clusters will become
|> more and more apparent, and that they will largely fall from favor.
It never will at the top end. ccNUMA is inherently expensive
on a large scale, both in terms of money and performance.
|> The real issue is assigning resources to a problem in the most
|> effective and cost efficient fashion.
That is true.
|> In this light, I see the whole "cluster vs. SMP" debate in the
|> same general perspective as "RISC vs. CISC." It's a market
|> bubble.
Hmm. Perhaps. There is a bit more to it than that, but RISC
versus CISC wasn't and isn't entirely a market bubble, either.
Regards,
Nick Maclaren. |
|
Randy
Guest
|
Posted:
Fri Jul 29, 2005 12:15 am Post subject:
Re: Cluster computing drawbacks |
|
|
Greg Lindahl wrote:
| Quote: | In article <iM_Fe.38250$0f.35619@tornado.texas.rr.com>,
Randy <joe@burgershack.com> wrote:
....
Honestly, I think that's a better example of how MPI programs access
data than how SMP programs do it. I'm sure that some HPC codes direct
all processes to demand the same data at the same time, but I suspect
most do not, for obvious reasons.
Nobody said "the same data". Most HPC programs advance in lockstep,
which means that they all communicate at the same time.
|
But parallel processes often don't read the same data concurrently, and
they definitely don't write to the same data concurrently. That
difference is key and exploitable. And I suspect it's ignored by most
parallel benchmarks, to the advantage of clusters and detriment of SMPs.
A good example lies with data mining. Non cache-coherent SMPs are
trivial to program for such tasks. Also Monte Carlo sims. Also any
other embarrassingly parallel task (at which clusters shine). It's only
the tasks at which clusters suck that it'll be hard to program
noncoherent SMPs. Effectively, that should put them at parity with
clusters. Except that you *can* program them with less explicit memory
movement primitives, or at least without handshakes.
In the case of MPICH2's asynchronous memory movement primitives,
noncoherent SMPs may well outshine all comers. And they scale just as
well as clusters...
| Quote: |
Sure, but that access pattern is not representative of real world
codes. If 95% of data accesses land within caches, then only 1
process in 20 should be requesting a line that crosses the
interconnect at any given moment.
No. You speak like someone who's never tried to speed up an SMP
program.
|
Them's fightin' words.
Speeding up parallel programs on CC-SMPs is *entirely* about managing
cache line locality (and access interference).
| Quote: |
But transparent access to data among multiple processes is
the raison d'etre of SMPs,
And that's their biggest scaling problem. But we already had
this discussion once before.
|
Yup. But scaling to hundreds of processes is not the only or even the
primary measure of success in HPC. Doing more science per unit of time
is. If the tradeoff between scalability, system cost, programmer time,
dusty deck reuse, and shortening the programmer learning curve changes
as system architectures evolve, then the prepared mind is going to RUN
the hell away from MPI. ASAP. IMHO.
| Quote: |
And if their MPI performance
is no worse than clusters, SMPs offer the best of both worlds.
Yes, but lose on price/performance. BTW, we've had some of our
customers tell us that we're even to slightly better than Altix on MPI
programs, at a much lower cost.
|
I believe it. Likewise, if they were using smget and smput (or MPICH2's
equivalents), they might likewise appreciate your system's potential for
doing that.
I'm just saying that there are other fish to fry, and it's possible that
the time to explore alternative HPC programming models may be upon us.
| Quote: |
In terms of shared memory HPC apps, I'd argue that cache coherence is
not all it's cracked up to be. The Cray T3E/T3D wasn't cache
coherent.
Right, but that's not shared memory. People disagree on what term to
use, but they agree that it isn't shared memory. My favorite term is
"shared address, local consistency", but apparently there's some newer
term for it.
|
SMP has been a much abused term for a long time. NUMA vs CC-NUMA
illustrates a comparable historical hiccough, since most folks assume
NUMA to imply cache coherence, which it does not.
Non-Cache-Coherent Shared Memory Programming is a mouthful. NCC-SMP?
You're suggestion is as good as mine.
| Quote: |
By the way, you're arguing against your other point, transparent
memory access. The T3E did not offer that: you had to know what
wasn't cache coherent and what was. So if transparent access is the
raison d'etre of shared memory...
|
When programming in C or fortran, that's true. Can you imagine a
convergence of language and compiler that ensures cache coherency using
data dependence analysis and warning messages? I can. Combining KAP's
Guide (now an Intel product) and a modern parallelizing compiler should
come very close to what I'm proposing. That *would* offer transparent
parallel access to data on a non cache coherent SMP.
Better yet, let's reexamine the programming languages while we're at it.
C/C++/Fortran/HPF suck almost as much as MPI.
| Quote: |
Yes, such machines are only of historical interest.
Actually no. You might note that the department that you work at has a
large project working on a language that implements this kind of
model, which uses the same compiler infrastructure that my employer
does. But hey, if you prefer lecturing me about what I already know on
comp.arch, don't let me stop you ;-)
|
I assume you mean Telescoping Languages? http://telescoping.cs.rice.edu
(BTW, I'm outside the CS dept.) AFAIK, hardware is not part of their
formula. But I suspect non cache coherent shared memory could serve
their needs nicely.
As I tried to suggest at the top of my last reply, I'm not so much
trying to argue with you or split hairs as suggest that with the recent
drop in interconnect latency due to InfiniPath (and comparable
products), alternatives to MPI may become viable, and profitably
exploitable by someone with initiative. God knows, the NSF is abubble
with initiatives to invent alternatives to programming parallel
machines, parallel SoCs, and the multiverse of multicores that are about
to befall us. Let me suggest this as another among their little forays.
Unlike some who can't abide the notion of parallel programming without
using MPI, I'm intrigued by the prospects of newer alternatives which
probably *can* be explored by slapping a few shmgets/shmputs onto
equivalent examples of a 1) serial C/Fortran code, 2) CC-SMP code, and
3) MPI code to see how the implementations compare. I'd love to get a
feel for the effort needed to A) compose such programs from scratch and
B) evolve a dusty deck serial program to 2 and 3. I suspect there's
money in them thar hills.
After we've shown competitive performance potential, it seems like
adding a smart compiler to the mix would be a natural progression,
perhaps delving the data transparency I implied earlier.
This must have been done once upon a time in the days of T3E, and probably
before. Probably it was, but since everything in CS has to be
reinvented every decade anyway, maybe it's time to revisit the cost
model of non-cache-coherent shared-memory programming.
| Quote: |
But with the
improvements in latency made available by newer cluster interconnects
(the XD1 and InfiniPath), the use of clusters as non-coherent SMPs may
again be viable. It's an interesting possibility, anyway.
That's not an SMP. It is an interesting possibility, but if you
continue to use the wrong term for it, you aren't going to have an
interesting discussion. Trust me on this.
|
Fine. NCC-SMP it is. (Actually, I think I prefer Incoherent SMP, or
ISMP.) But I'll let *you* choose.
Randy
(Not speaking for Rice, and not really working for them either.)
| Quote: |
-- greg
(working for, not speaking for, PathScale.)
|
--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu |
|
Nick Maclaren
Guest
|
Posted:
Fri Jul 29, 2005 12:15 am Post subject:
Re: Cluster computing drawbacks |
|
|
In article <dcbf12$m3i$1@joe.rice.edu>, Randy <joe@burgershack.com> wrote:
| Quote: |
A good example lies with data mining. Non cache-coherent SMPs are
trivial to program for such tasks. Also Monte Carlo sims. Also any
other embarrassingly parallel task (at which clusters shine). It's only
the tasks at which clusters suck that it'll be hard to program
noncoherent SMPs. Effectively, that should put them at parity with
clusters. Except that you *can* program them with less explicit memory
movement primitives, or at least without handshakes.
In the case of MPICH2's asynchronous memory movement primitives,
noncoherent SMPs may well outshine all comers. And they scale just as
well as clusters...
|
Yes. But that sort of code is a right b*gger to debug.
| Quote: | Speeding up parallel programs on CC-SMPs is *entirely* about managing
cache line locality (and access interference).
|
Er, not quite. But I agree that is a very high proportion of the task,
and especially the nastiest bits.
| Quote: | Yup. But scaling to hundreds of processes is not the only or even the
primary measure of success in HPC. Doing more science per unit of time
is. If the tradeoff between scalability, system cost, programmer time,
dusty deck reuse, and shortening the programmer learning curve changes
as system architectures evolve, then the prepared mind is going to RUN
the hell away from MPI. ASAP. IMHO.
|
True. And that is the main reason that so many users have run away
from OpenMP back to MPI :-)
| Quote: | I'm just saying that there are other fish to fry, and it's possible that
the time to explore alternative HPC programming models may be upon us.
|
Agreed. BSP. Dataflow. Something even more radical :-)
| Quote: | SMP has been a much abused term for a long time. NUMA vs CC-NUMA
illustrates a comparable historical hiccough, since most folks assume
NUMA to imply cache coherence, which it does not.
|
Symmetric Multi-Processing, anyone?
| Quote: | Better yet, let's reexamine the programming languages while we're at it.
C/C++/Fortran/HPF suck almost as much as MPI.
|
The parallel versions (including HPF) considerably more so. MPI is
actually quite a good standard, has relatively few ambiguities,
allows efficient and portable code, and enables practical debugging.
Yes, it is very low level. Sad.
| Quote: | Unlike some who can't abide the notion of parallel programming without
using MPI, I'm intrigued by the prospects of newer alternatives which
probably *can* be explored by slapping a few shmgets/shmputs onto
equivalent examples of a 1) serial C/Fortran code, 2) CC-SMP code, and
3) MPI code to see how the implementations compare. I'd love to get a
feel for the effort needed to A) compose such programs from scratch and
B) evolve a dusty deck serial program to 2 and 3. I suspect there's
money in them thar hills.
|
I have dabbled with that. Stick to MPI, my lad ....
It is dead easy to convert a clean but dusty deck to OpenMP. Oh,
you want it to run FASTER than the serial version? How very
unreasonable of you.
| Quote: | After we've shown competitive performance potential, it seems like
adding a smart compiler to the mix would be a natural progression,
perhaps delving the data transparency I implied earlier.
|
God help me, NO!!! This has been tried and failed more times than
I care to think. The first requirement is a language that is
designed for parallelisation - Fortran is dire, C++ is indescribably
worse, and the English language contains no curses foul enough to
describe how C interacts with this.
| Quote: | This must have been done once upon a time in the days of T3E, and probably
before. Probably it was, but since everything in CS has to be
reinvented every decade anyway, maybe it's time to revisit the cost
model of non-cache-coherent shared-memory programming.
|
That is getting back to sanity, in the sense that our world model is
now a stack of turtles rather than in something indescribably less
structured.
The opportunity that is being missed is incoherent SMP as a system
model - i.e. not as an application model. This would be a very
good basis for implementing a shared file cache, message passing
(MPI and SHMEM, if you must), efficient FIFOs between CPUs and so
on. It could even be used by consenting adults in private, but I
really don't want to have to explain to the average kiddy how to
use it.
Regards,
Nick Maclaren. |
|
Greg Lindahl
Guest
|
Posted:
Fri Jul 29, 2005 2:19 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <dcbf12$m3i$1@joe.rice.edu>, Randy <joe@burgershack.com> wrote:
| Quote: | Nobody said "the same data". Most HPC programs advance in lockstep,
which means that they all communicate at the same time.
But parallel processes often don't read the same data concurrently, and
they definitely don't write to the same data concurrently.
|
Randy,
This is a total non sequitur. It looked like you were asserting that
the Random Ring latency somehow involved this. Apparently it was a
total non sequitur, or you like repeating yourself.
| Quote: | And I suspect it's ignored by most
parallel benchmarks, to the advantage of clusters and detriment of SMPs.
|
Um, name one parallel benchmark that involves reading the same data
concurrently? I mean, I'm sure there is one, but most don't involve
that.
| Quote: | Non cache-coherent SMPs
|
aren't SMPs.
| Quote: | In the case of MPICH2's asynchronous memory movement primitives,
noncoherent SMPs may well outshine all comers. And they scale just as
well as clusters...
|
Given that there aren't any real noncoherent distributed memory
systems available on the market, I'm unsure how you can compute
scalability of them. In case you didn't notice, the T3E is dead,
and I refuse to count SCI or Quadrics or various RDMA approaches,
which are doomed to be slow.
| Quote: | Speeding up parallel programs on CC-SMPs is *entirely* about managing
cache line locality (and access interference).
|
Right. And when everyone wants to communicate at the same time (think:
lockstep finite difference with domain decomposition)... you're dead.
You can't say that only 1 in N processors is going to be talking at
once. And it's collective effects like this which make SMPs hard to
program at scale. Which you claim you know how to do. But apparently
you only know how to do it for a class of very well behaved programs.
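The lockstep case can be sketched in a few lines of Python. This is a single-process simulation, not MPI code: each inner list plays the role of one rank's subdomain in a 1D three-point stencil, and the boundary handling (fixed 0.0 at the ends) and the `talkers` counter are assumptions made for the illustration.

```python
def step(subdomains):
    """One lockstep update of a 1D 3-point stencil split across 'ranks'.

    Returns the new subdomains and how many ranks had to fetch
    halo cells from a neighbour during this step.
    """
    n = len(subdomains)
    talkers = 0
    new = []
    for r, u in enumerate(subdomains):
        # Halo exchange: read the neighbours' boundary cells (0.0 at the ends).
        left = subdomains[r - 1][-1] if r > 0 else 0.0
        right = subdomains[r + 1][0] if r < n - 1 else 0.0
        if n > 1:
            talkers += 1               # every rank has at least one neighbour
        padded = [left] + u + [right]
        new.append([(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
                    for i in range(1, len(padded) - 1)])
    return new, talkers


ranks = [[0.0] * 4 for _ in range(8)]
ranks[3][2] = 24.0                     # a single hot cell
for _ in range(3):
    ranks, talkers = step(ranks)
    print(talkers, "of", len(ranks), "ranks communicated this step")
```

Every rank needs its neighbours' boundary cells before it can advance, so all of them hit the interconnect in the same phase of every step; the average fraction of remote accesses says little about that peak.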
| Quote: | Yup. But scaling to hundreds of processes is not the only or even the
primary measure of success in HPC. Doing more science per unit of time
is.
|
Don't change the subject. If you want to run slow, then go invent a
new higher level language, preferably one based on Java.
| Quote: | then the prepared mind is going to RUN
the hell away from MPI. ASAP. IMHO.
|
Cue theme song from "Rebel Without A Clue"...
| Quote: | I believe it. Likewise, if they were using smget and smput (or MPICH2's
equivalents), they might likewise appreciate your system's potential for
doing that.
|
Gee, given that we haven't announced any performance numbers for that,
I think you're getting the cart before the horse. I do appreciate the
confidence you display.
| Quote: | I'm just saying that there are other fish to fry, and it's possible that
the time to explore alternative HPC programming models may be upon us.
|
Uh, duh. I'm attempting to argue against some of the facts you
present, and you keep on returning to the theme. Apparently the facts
don't matter, and you think I disagree with the theme when I'm mostly
ignoring it.
| Quote: | Non-Cache-Coherent Shared Memory Programming is a mouthful. NCC-SMP?
You're suggestion is as good as mine.
|
Grammmar flames considered harmful, but I wasn't suggesting, I was
simply repeating what other people have suggested. If we spent more
time adopting terms and less inventing, we would have less confusion.
| Quote: | When programming in C or fortran, that's true. Can you imagine a
convergence of language and compiler that ensures cache coherency using
data dependence analysis and warning messages?
|
Um, yeah, I imagine a language which a 1st year grad student would
realize would tend to throw up its hands and think everything isn't
local, because it's hard to follow pointers through an entire program.
No wonder UPC and Co-Array Fortran and Titanium don't do it that way.
Now Mentat can do this perfectly, but it's macro-dataflow and doesn't
do data decomposition well. As you can see, this is a pretty well
trodden area of CS; I named less than 10% of the examples you could
look at.
| Quote: | Better yet, let's reexamine the programming languages while we're at it.
C/C++/Fortran/HPF suck almost as much as MPI.
|
Be my guest, but please grind your axe against someone else.
| Quote: | As I tried to suggest at the top of my last reply, I'm not so much
trying to argue with you or split hairs as suggest that with the recent
drop in interconnect latency due to InfiniPath (and comparable
products), alternatives to MPI may become viable, and profitably
exploitable by someone with initiative.
|
Well, yes, that was a design goal. However, as you can see, I have
little quibbles with many of your details.
| Quote: | Unlike some who can't abide the notion of parallel programming without
using MPI,
|
Having fun beating that straw man? People like MPI because it works,
not because it's beautiful. You want to invent the next HPF, a
technology of the future.
| Quote: | This must have been done once upon a time in the days of T3E, and probably
before. Probably it was, but since everything in CS has to be
reinvented every decade anyway, maybe it's time to revisit the cost
model of non-cache-coherent shared-memory programming.
|
You mean, aside from the people already revisiting it right now? Proof
that people doing real work don't post to Usenet much.
-- greg |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Fri Jul 29, 2005 2:31 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <42e9f4bf$1@news.meer.net>,
lindahl@pbm.com (Greg Lindahl) writes:
|>
|> > Non cache-coherent SMPs
|>
|> aren't SMPs.
Well, they were until recently. Why have they stopped being SMPs?
|> >Non-Cache-Coherent Shared Memory Programming is a mouthful. NCC-SMP?
|> >Your suggestion is as good as mine.
|>
|> Grammmar flames considered harmful, but I wasn't suggesting, I was
|> simply repeating what other people have suggested. If we spent more
|> time adopting terms and less inventing, we would have less confusion.
True.
|> > Unlike some who can't abide the notion of parallel programming without
|> > using MPI,
|>
|> Having fun beating that straw man? People like MPI because it works,
|> not because it's beautiful. You want to invent the next HPF, a
|> technology of the future.
Shudder :-(
I have been thinking about a decent shared-memory language, but
pretty well every sane choice takes me further away from HPF ....
I don't like OpenMP, but it is at least not totally off-beam.
Regards,
Nick Maclaren. |
|
|
 |
Greg Lindahl
Guest
|
Posted:
Fri Jul 29, 2005 2:39 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <aW7Ge.34267$B52.14202@tornado.ohiordc.rr.com>,
Colonel Forbin <forbin@dev.nul> wrote:
| Quote: | The problem I see with this debate is that there is a "cluster mindset"
that has evolved where people are fixated on the cluster instead of the
problem.
|
No, people in HPC are fixated on programming models: message passing,
or shared memory.
The people with a cluster mindset are a subset of the message
passing people.
| Quote: | The same is true to a lesser extent of the "SMP mindset." The
"cluster mindset" tends to cause people to view "HPC problems" as the
subset of problems that execute well on MPI clusters.
|
Which would fail to explain why many SGI Origin/Altix machines are run
as MPI boxes. Hint: if your code is MPI already, or if you found that
MPI let you scale your code farther than OpenMP... or if you used a
programming infrastructure that quietly used MPI under the hood...
| Quote: | My personal opinion is that as the option of ccNUMA comes down in
price the disadvantages of current MPI clusters will become
more and more apparent, and that they will largely fall from favor.
|
Let's see, which one is falling more rapidly in price per
performance... hint: not ccNUMA. I hate to be the first to broach the
topic, but SGI has been slowly going bankrupt for a long time. One
reason is that clusters are "good enough" for many problems, and will
continue to be, especially with those turkeys in the ex-SGI building
#10 named PathScale who are improving the scalability of the existing
MPI codebase. It's the Revenge of the Killer Micros, which were good
enough; the Killer Cluster Interconnect is now good enough.
| Quote: | Thus, the optimal architecture for most problems
|
Thus starts many a sentence with conclusions irrelevant in the real
world. Or maybe I'm just bitchy because I had to fly back into
Oakland and take a van to get my car at SFO...
-- greg |
|
|
 |
Guest
|
Posted:
Fri Jul 29, 2005 4:15 pm Post subject:
Re: Cluster computing drawbacks |
|
|
Greg Lindahl wrote:
| Quote: | In article <aW7Ge.34267$B52.14202@tornado.ohiordc.rr.com>,
Colonel Forbin <forbin@dev.nul> wrote:
My personal opinion is that as the option of ccNUMA comes down in
price the disadvantages of current MPI clusters will become
more and more apparent, and that they will largely fall from favor.
Let's see, which one is falling more rapidly in price per
performance... hint: not ccNUMA. I hate to be the first to broach the
topic, but SGI has been slowly going bankrupt for a long time. One
reason is that clusters are "good enough" for many problems, and will
continue to be, especially with those turkeys in the ex-SGI building
#10 named PathScale who are improving the scalability of the existing
MPI codebase. It's the Revenge of the Killer Micros, which were good
enough; the Killer Cluster Interconnect is now good enough.
|
Maybe SGI is doomed due to their financial weakness, but at least on
the technical front they keep fighting. The new Bx2-based Altixen are
obviously better than the previous generation, and they keep the
scalability war against MPP machines interesting.
Your bigger neighbor's earnings release provides better support
for your claim:
During fiscal years 2002-2005 the share of >8-way servers in their
revenues shrank from 55% to 35%.
See slide 8:
http://www.sun.com/aboutsun/investor/earnings_releases/Q405_SLD.pdf
[O.T.]
Should Sun Microsystems be regarded as a Santa Clara company or as a
Mountain View company? |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Fri Jul 29, 2005 4:15 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <3kup9kF10436sU3@individual.net>,
Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> writes:
|>
|> > Um, name one parallel benchmark that involves reading the same data
|> > concurrently? I mean, I'm sure there is one, but most don't involve
|> > that.
|>
|> Umm, reasonably efficient multicast or broadcast performance is not
|> relevant to a substantial fraction of applications? Or, perhaps, some
|> of them are so dissatisfied with the performance of the primitives
|> provided that they work around that issue?
In general, it is used mainly or even solely during initialisation,
and is not a major bottleneck. A logarithmic fanout, coded in terms
of point-to-point operations, is quite good enough.
There may be exceptions, but I haven't seen any.
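To make the "logarithmic fanout" concrete, here is a minimal sketch of one common way to do it, a binomial-tree schedule built only from point-to-point sends. It is plain Python that simulates the schedule rather than calling any MPI library, and the name `fanout_schedule` and its parameters are made up for illustration, not taken from any real implementation.

```python
def fanout_schedule(nranks, root=0):
    """Return a list of rounds for a binomial-tree broadcast.

    Each round is a list of (sender, receiver) pairs that can run
    concurrently. Hypothetical helper, for illustration only.
    """
    if nranks < 1:
        raise ValueError("nranks must be at least 1")
    rounds = []
    have = 1  # how many ranks (numbered relative to root) hold the data
    while have < nranks:
        pairs = []
        for rel in range(have):
            peer = rel + have  # each holder sends to one new rank
            if peer < nranks:
                pairs.append(((rel + root) % nranks, (peer + root) % nranks))
        rounds.append(pairs)
        have *= 2  # the set of holders doubles every round
    return rounds


if __name__ == "__main__":
    for step, pairs in enumerate(fanout_schedule(8)):
        print("round", step, pairs)
```

With P ranks this takes ceil(log2(P)) rounds of point-to-point messages, so 8 ranks need 3 rounds, which is why it tends to be "good enough" when broadcast is only used during initialisation.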
Regards,
Nick Maclaren. |
|
|
 |
Jan Vorbrüggen
Guest
|
Posted:
Fri Jul 29, 2005 4:15 pm Post subject:
Re: Cluster computing drawbacks |
|
|
| Quote: | In general, it is used mainly or even solely during initialisation,
and is not a major bottleneck. A logarithmic fanout, coded in terms
of point-to-point operations, is quite good enough.
There may be exceptions, but I haven't seen any.
|
I have, that's why I persistently ask.
As an example, any image analysis (to not restrict this discussion to
face recognition 8-)) will usually perform some pre-processing (filtering)
of the image. This can of course be done in parallel, and often shows quite
fine-grained parallelism. The following analysis stages - there often are
multiple such stages that can be run in parallel - now all need read access
to (parts of) that pre-processed data. This stage usually is more coarse-
grained, so you might want to parallelize the stages as well. And, of
course, this is repeated for every image being analysed.
Jan |
|
|
 |
Jan Vorbrüggen
Guest
|
Posted:
Fri Jul 29, 2005 4:15 pm Post subject:
Re: Cluster computing drawbacks |
|
|
| Quote: | Um, name one parallel benchmark that involves reading the same data
concurrently? I mean, I'm sure there is one, but most don't involve
that.
|
Umm, reasonably efficient multicast or broadcast performance is not
relevant to a substantial fraction of applications? Or, perhaps, some
of them are so dissatisfied with the performance of the primitives
provided that they work around that issue?
Jan |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Fri Jul 29, 2005 4:15 pm Post subject:
Re: Cluster computing drawbacks |
|
|
In article <3kut7jF105qtpU1@individual.net>,
Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> writes:
|>
|> > In general, it is used mainly or even solely during initialisation,
|> > and is not a major bottleneck. A logarithmic fanout, coded in terms
|> > of point-to-point operations, is quite good enough.
|> >
|> > There may be exceptions, but I haven't seen any.
|>
|> I have, that's why I persistently ask.
|>
|> As an example, any image analysis (to not restrict this discussion to
|> face recognition 8-)) will usually perform some pre-processing (filtering)
|> of the image. This can of course be done in parallel, and often shows quite
|> fine-grained parallelism. The following analysis stages - there often are
|> multiple such stages that can be run in parallel - now all need read access
|> to (parts of) that pre-processed data. This stage usually is more coarse-
|> grained, so you might want to parallelize the stages as well. And, of
|> course, this is repeated for every image being analysed.
Hmm. The question is how often such systems use broadcast and how
often scatter, and why. I have looked at FFTs from the point of
view of a broadcast-based algorithm, and it didn't really pay.
I agree that, if you have an efficient broadcast and an inefficient
scatter, it pays to use broadcast. But, as an argument for an
optimised broadcast, that is a little circular :-)
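As a rough way to compare the two, the sketch below counts how many image rows would be moved if each worker received the whole image (broadcast) versus only a contiguous strip plus a few overlap rows (scatter). The strip-plus-halo decomposition is an assumption made for the example, not something stated in the thread, and `strip_bounds`, `rows_moved` and the `halo` parameter are hypothetical names.

```python
def strip_bounds(total_rows, nworkers, halo=0):
    """Return one (start, stop) row range per worker, halo rows included.

    Assumes a simple contiguous row-strip decomposition (illustrative only).
    """
    base, extra = divmod(total_rows, nworkers)
    bounds = []
    start = 0
    for w in range(nworkers):
        size = base + (1 if w < extra else 0)  # spread the remainder rows
        stop = start + size
        # Widen the strip by the halo, clipped to the image edges.
        bounds.append((max(0, start - halo), min(total_rows, stop + halo)))
        start = stop
    return bounds


def rows_moved(total_rows, nworkers, halo=0):
    """Return (scatter_rows, broadcast_rows) summed over all workers."""
    scatter = sum(stop - start
                  for start, stop in strip_bounds(total_rows, nworkers, halo))
    broadcast = total_rows * nworkers  # every worker gets every row
    return scatter, broadcast


if __name__ == "__main__":
    print(rows_moved(1024, 8, halo=2))
```

Under these assumptions scatter moves far less data when the overlap is small; as the stages need larger parts of the image the two counts converge, which is where an efficient broadcast would start to matter.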
Regards,
Nick Maclaren. |
|