Cluster computing drawbacks

Greg Lindahl
Posted: Fri Jul 29, 2005 10:25 pm    Post subject: Re: Cluster computing drawbacks

In article <3kup9kF10436sU3@individual.net>,
Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> wrote:

Quote:
Umm, reasonably efficient multicast or broadcast performance is not
relevant to a substantial fraction of applications?

Yes. But this is a matter of degree, and I know some people disagree
with me on that. However, that's totally irrelevant to whatever
Randy's point is.

Quote:
Or, perhaps, some
of them are so dissatisfied with the performance of the primitives
provided that they work around that issue?

That's certainly very common in MPI programs. MPICH has actually
dramatically improved their collectives performance in the last 5
years; most of the griping was from experience in the 1995-2000
timeframe.

As an aside, InfiniPath has really good collectives performance. How
did we do it? A fast point-to-point implementation. Raising the ocean
floats all boats.

-- greg
(working for, not speaking for, PathScale)
(for proof of this last point, we have Pallas results on the
pathscale.com website.)

Greg Lindahl
Posted: Fri Jul 29, 2005 11:07 pm    Post subject: Re: Cluster computing drawbacks

In article <1122640058.323152.56540@g47g2000cwa.googlegroups.com>,
<already5chosen@yahoo.com> wrote:

Quote:
Your bigger neighbor's earnings release provides better support
for your claim:

I think you're way overinterpreting.

-- greg

Randy
Posted: Fri Jul 29, 2005 11:13 pm    Post subject: Re: Cluster computing drawbacks

Greg Lindahl wrote:
Quote:
In article <dcbf12$m3i$1@joe.rice.edu>, Randy <joe@burgershack.com> wrote:

Nobody said "the same data". Most HPC programs advance in lockstep,
which means that they all communicate at the same time.

But parallel processes often don't read the same data concurrently, and
they definitely don't write to the same data concurrently.

Randy,

This is a total non sequitur. It looked like you were asserting that
the Random Ring latency somehow involved this. Apparently it was a
total non sequitur, or you like repeating yourself.

"which means that they all communicate at the same time" implies that
all processes send each other data concurrently, in lockstep. I'm
countering that assertion, asserting that many parallel applications do
not do this, although most HPC benchmarks probably do (like random
ring's paired synchronous handshakes).

Seems like a sequitur to me.

Quote:

And I suspect it's ignored by most
parallel benchmarks, to the advantage of clusters and detriment of SMPs.

Um, name one parallel benchmark that involves reading the same data
concurrently? I mean, I'm sure there is one, but most don't involve
that.

For example, anything that does a matrix multiply.

Quote:

Non cache-coherent SMPs

aren't SMPs.

You know, Greg, I suspect you piss off a lot of potential customers
unnecessarily by arguing for argument's sake, like this. SMP is an
imprecise term that by convention refers only to cache coherent shared
memory multiprocessor systems that share a snoopy bus. Persisting in
niggling my attempt to work around that term's acknowledged inadequacies
during a discussion in which you AND I appreciate the difference, as
well as appreciate the fact that no better term is available, IS REALLY
ANNOYING.

Quote:

In the case of MPICH2's asynchronous memory movement primitives,
noncoherent SMPs may well outshine all comers. And they scale just as
well as clusters...

Given that there aren't any real noncoherent distributed memory
systems available on the market, I'm unsure how you can compute
scalability of them. In case you didn't notice, the T3E is dead,
and I refuse to count SCI or Quadrics or various RDMA approaches,
which are doomed to be slow.

Not any more. Pathscale makes one. Cray makes several. Using MPI-2,
these low latency clusters make NCC-SMPs a viable possibility, as well
as a revision to the shared-nothing, painful as hell, assembly-level
explicit-message-passing-uber-alles programming model.

For some reason, you just don't want to acknowledge that users might use
InfiniPath in ways that differ from what your engineers intended. In my
experience, which largely falls *outside* traditional numeric HPC
applications, I think there's interesting potential there. For example,
I know several intelligence applications that could greatly benefit from
this.

But god knows, if I were a technical lead on one of those gov't projects
and I read any of this thread, I'd RUN not walk to one of Pathscale's
competitors (like Cray). Dealing with Pathscale and *you* would clearly
be more pain than gain.

Quote:

Speeding up parallel programs on CC-SMPs is *entirely* about managing
cache line locality (and access interference).

Right. And when everyone wants to communicate at the same time (think:
lockstep finite difference with domain decomposition)... you're dead.
You can't say that only 1 in N processors is going to be talking at
once. And it's collective effects like this which make SMPs hard to
program at scale. Which you claim you know how to do. But apparently
you only know how to do it for a class of very well behaved programs.

For loop splitting, concurrent access to data will bottleneck. This is
true of any architecture. Alleviating the bottleneck through hardware
is very desirable. But it's not everything. Software solutions are
often possible. Workpile or guided scheduling of loop decomposition are
two examples. Logarithmic decomposition of message broadcasts is
another. There are lots of others, demonstrating that hardware
optimization in message-passing isn't the only way to skin that cat. If
software solutions are made more convenient by a change to the
programming model, then software can step in where hardware fails to
tread. (Or TreadMarks, for example.)
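
A minimal sketch of the "logarithmic decomposition of message broadcasts" idea mentioned above, assuming a binomial-tree schedule (one common way to do it, not necessarily what any particular MPI library uses). The function names and the rank numbering are hypothetical and exist only for illustration. A naive broadcast needs n-1 sequential sends from the root; here every rank that already holds the message forwards it, so the number of rounds grows as ceil(log2(n)).

```python
def binomial_broadcast_schedule(n_ranks, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver).

    Hypothetical helper: ranks are numbered relative to the root, and in
    each round every rank that already holds the message sends it to the
    rank `have` positions further on, doubling the number of holders.
    """
    if n_ranks < 1:
        raise ValueError("need at least one rank")
    rounds = []
    have = 1  # number of (relative) ranks holding the message so far
    while have < n_ranks:
        pairs = []
        for rel in range(have):
            dest = rel + have
            if dest < n_ranks:
                pairs.append(((rel + root) % n_ranks, (dest + root) % n_ranks))
        rounds.append(pairs)
        have *= 2
    return rounds


def simulate_broadcast(n_ranks, root=0):
    """Replay the schedule and return (number of rounds, set of holders)."""
    holders = {root}
    schedule = binomial_broadcast_schedule(n_ranks, root)
    for pairs in schedule:
        # All sends within a round are independent, so they can overlap.
        senders = {s for s, _ in pairs}
        assert senders <= holders, "a rank sent before receiving"
        holders |= {r for _, r in pairs}
    return len(schedule), holders


if __name__ == "__main__":
    for n in (1, 2, 8, 13):
        n_rounds, holders = simulate_broadcast(n)
        print(n, "ranks:", n_rounds, "rounds, all reached:", len(holders) == n)
```

The sketch only counts rounds; it says nothing about link contention or message size, which is where the hardware side of the argument comes back in.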

Quote:

Yup. But scaling to hundreds of processes is not the only or even the
primary measure of success in HPC. Doing more science per unit of time
is.

Don't change the subject. If you want to run slow, then go invent a
new higher level language, preferably one based on Java.

Yeah, Greg. I want to run slow.

Quote:

then the prepared mind is going to RUN
the hell away from MPI. ASAP. IMHO.


Cue theme song from "Rebel Without A Clue"...

Greg, I don't know how else to say this. Pathscale is hurt every time
you open your mouth. You are uncivil. You resort to ad hominem attacks
when you should say nothing. If you want to emphasize the fact that
your "opponent" is a dumbass, be silent and let their idiocy resonate in
the ears of others. Every time you are a boor, you diminish your
company for having hired someone who is a very poor spokesman.

Remember, I'm the customer here, and I work for a nonprofit. I'm free
to suggest any nutty thing I want, so long as it's civil. You aren't,
and you aren't. That's an imprudent combination, and eventually it will
get you fired.

Quote:

When programming in C or fortran, that's true. Can you imagine a
convergence of language and compiler that ensures cache coherency using
data dependence analysis and warning messages?

Um, yeah, I imagine a language whose compiler, as a 1st year grad student
would realize, would tend to throw up its hands and assume nothing is
local, because it's hard to follow pointers through an entire program.
No wonder UPC and Co-Array Fortran and Titanium don't do it that way.
Now Mentat can do this perfectly, but it's macro-dataflow and doesn't
do data decomposition well. As you can see, this is a pretty well
trodden area of CS; I named less than 10% of the examples you could
look at.

Cluster interconnects are improving. Obsolete theory will be reexamined
as part of a new cost formula. Maybe infeasible implementations of old
are less infeasible now.

But I'm done fighting with you. This is going nowhere.

Quote:

This must have been done once upon time in the days of T3E, and probably
before. Probably it was, but since everything in CS has to be
reinvented every decade anyway, maybe it's time to revisit the cost
model of non-cache-coherent shared-memory programming.

You mean, aside from the people already revisiting it right now? Proof
that people doing real work don't post to Usenet much.

:-) You aren't especially good at reflection, are you Greg?

Randy

--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu

Greg Lindahl
Posted: Sat Jul 30, 2005 12:15 am    Post subject: Re: Cluster computing drawbacks

In article <dcdrk8$j7e$1@joe.rice.edu>, Randy <joe@burgershack.com> wrote:

Quote:
"which means that they all communicate at the same time" implies that
all processes send each other data concurrently, in lockstep.

Ah. When you said "same data", I was thinking you were talking about
the same data, i.e. a collective operation.

Quote:
I'm
countering that assertion, asserting that many parallel applications do
not do this, although most HPC benchmarks probably do (like random
ring's paired synchronous handshakes).

Many HPC applications do behave as I've said; it's not just a
benchmark special. For example, all spatially-decomposed stencil
algorithms work that way. That's an extremely large class of
applications. I also know of applications that don't behave like
this.
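
A minimal sketch of why a spatially-decomposed stencil code communicates in lockstep, assuming a 1-D grid, a 3-point averaging stencil, and "ranks" simulated as plain Python lists (no real MPI is involved; all names are hypothetical). Every rank needs its neighbours' edge values before it can compute the next step, so all ranks exchange halos in the same phase.

```python
def serial_step(u):
    """One 3-point averaging step on a 1-D grid with fixed end values."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def split(u, n_ranks):
    """Cut the grid into n_ranks contiguous, non-empty chunks."""
    size, extra = divmod(len(u), n_ranks)
    chunks, start = [], 0
    for r in range(n_ranks):
        stop = start + size + (1 if r < extra else 0)
        chunks.append(u[start:stop])
        start = stop
    return chunks


def decomposed_step(chunks):
    """One step: a communication phase for all ranks, then a compute phase."""
    n = len(chunks)
    # Communication phase: each rank needs one edge value per neighbour.
    # On a real cluster these would be messages, all in flight at once.
    left_halo = [chunks[r - 1][-1] if r > 0 else None for r in range(n)]
    right_halo = [chunks[r + 1][0] if r < n - 1 else None for r in range(n)]
    # Compute phase: purely local work on the chunk plus its halo cells.
    new_chunks = []
    for r, chunk in enumerate(chunks):
        new = chunk[:]
        for i in range(len(chunk)):
            left = chunk[i - 1] if i > 0 else left_halo[r]
            right = chunk[i + 1] if i < len(chunk) - 1 else right_halo[r]
            if left is None or right is None:
                continue  # global boundary cell keeps its fixed value
            new[i] = (left + chunk[i] + right) / 3.0
        new_chunks.append(new)
    return new_chunks


if __name__ == "__main__":
    grid = [0.0] * 10 + [100.0] + [0.0] * 10
    chunks = split(grid, 4)
    for _ in range(5):
        grid = serial_step(grid)
        chunks = decomposed_step(chunks)
    flat = [x for chunk in chunks for x in chunk]
    print("matches serial result:", flat == grid)
```

Nothing in the compute phase can start until the halo values for that step have arrived, which is the "everyone talks at the same time" pattern being argued about.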

Quote:
You know, Greg, I suspect you piss off a lot of potential customers
unnecessarily by arguing for argument's sake, like this.

Indeed, that's why we have marketing & sales people. But if I thought
I was arguing for argument's sake, I wouldn't bother to post. I give
you the benefit of the doubt; how about being more civil and giving me
that courtesy, too?

Quote:
SMP is an
imprecise term that by convention refers only to cache coherent shared
memory multiprocessor systems that share a snoopy bus.

Yes, we agree. Instead of confusing the issue, why not adopt one of
the suggested terms for these non-SMP systems you are talking about?
Inventing yet more terms is really annoying, but I'll be civil and
not use all-caps.

Quote:
Given that there aren't any real noncoherent distributed memory
systems available on the market, I'm unsure how you can compute
scalability of them. In case you didn't notice, the T3E is dead,
and I refuse to count SCI or Quadrics or various RDMA approaches,
which are doomed to be slow.

Not any more. Pathscale makes one.

We do not. Again, thanks for being nice, but that's not what we built.
You have *no* *idea* how our hardware works.

Quote:
For some reason, you just don't want to acknowledge that users might use
InfiniPath in ways that differ from what your engineers intended.

I am 100% in favor of people using InfiniPath in ways we never dreamed
of. And yes, we've thought about how to implement MPI-2, GAS, ARMCI,
UPC, CAF, SHMEM, and any of the 40+ distributed software shared memory
approaches out there on our hardware. That's just part of the
engineering process of doing an interconnect. But our interconnect
doesn't work like you think it does.

Quote:
In my
experience, which largely falls *outside* traditional numeric HPC
applications, I think there's interesting potential there.

I agree. Someday you'll notice that we're mostly agreeing.

Quote:
But god knows, if I were a technical lead on one of those gov't projects
and I read any of this thread, I'd RUN not walk to one of Pathscale's
competitors (like Cray).

And then you wonder why vendor employees often don't post on Usenet?
It's because of threads like this, in which you try to tell me how my
hardware works, and then claim it scares off customers when I try to
point out where you're wrong. If you prefer, I could fall silent.

Quote:
Dealing with Pathscale and *you* would clearly
be more pain than gain.

Then please add me to your killfile.

Quote:
Remember, I'm the customer here,

And the customer is always right. Even when he's wrong about how the
hardware in question works. Even when he's uncivil. Even when he
hasn't read the literature, and wants to propose what's been tried,
what's being tried, what's been shown to work or not work, and then
goes ballistic when pointed at the literature. And so on.

Yes, I'm now sorry I got involved with this thread to start with. It's
a good thing that Usenet is an irrelevant forum.

-- greg

Mark Hahn
Posted: Sat Jul 30, 2005 12:16 am    Post subject: Re: Cluster computing drawbacks

Quote:
"which means that they all communicate at the same time" implies that
all processes send each other data concurrently, in lockstep. I'm
countering that assertion, asserting that many parallel applications do
not do this, although most HPC benchmarks probably do (like random
ring's paired synchronous handshakes).

I run an HPC center and get to talk to a lot of disparate people about
how their parallel code works. I can't think of any of them who do not
have this sort of lockstep behavior. not really lockstep, but periods
of independent work interspersed with periods when all nodes are
exchanging info with other nodes. in some, this corresponds with timesteps
in a simulation. but another I'm thinking of is a really, really huge
diagonalization that has the same compute/communicate phases in spite of
not being a temporal simulation.

the only counterexamples I can think of are incredibly loosely-coupled,
embarrassingly parallel types. there's absolutely nothing wrong with that,
but as a policy, we encourage people to avoid gratuitous parallelization.
that is, leave EP as the big bunch of serial jobs that they *are*,
so we can schedule them better.


Quote:
Cue theme song from "Rebel Without A Clue"...

Greg, I don't know how else to say this. Pathscale is hurt every time
you open your mouth. You are uncivil. You resort to ad hominem attacks

oh, crap. Greg knows his stuff, and Pathscale is pretty much doing
everything right. NNTP is almost *defined* as a protocol for uncivil
discourse, since this thread demonstrates how easy it is to misunderstand.

your "comm at the same time" statements were easy to interpret as being
obviously wrong (I certainly read them as Greg did). obviously wrong
statements, especially those the author insists are right without
explanation, lead to personalization of the rhetoric. no surprise.

Quote:
Cluster interconnects are improving. Obsolete theory will be reexamined
as part of a new cost formula. Maybe infeasible implementations of old
are less infeasible now.

interconnect has made a jump. it's not at all clear that this is part
of a trend, or that it will continue. more importantly, it's extremely
unclear whether interconnect is improving faster than processors. worse,
the amount of memory+compute you put in one box is growing, effectively
diluting improvements to interconnect.

regards, mark hahn.

Nick Maclaren
Posted: Sat Jul 30, 2005 3:08 pm    Post subject: Re: Cluster computing drawbacks

In article <dcefb5$3pg$1@informer1.cis.mcmaster.ca>,
Mark Hahn <hahn@coffee.psychology.mcmaster.ca> wrote:
Quote:
"which means that they all communicate at the same time" implies that
all processes send each other data concurrently, in lockstep. I'm
countering that assertion, asserting that many parallel applications do
not do this, although most HPC benchmarks probably do (like random
ring's paired synchronous handshakes).

I run an HPC center and get to talk to a lot of disparate people about
how their parallel code works. I can't think of any of them who do not
have this sort of lockstep behavior. not really lockstep, but periods
of independent work interspersed with periods when all nodes are
exchanging info with other nodes. in some, this corresponds with timesteps
in a simulation. but another I'm thinking of is a really, really huge
diagonalization that has the same compute/communicate phases in spite of
not being a temporal simulation.

No, it is very common indeed. Pretty well any program that uses MPI
collectives will work that way - it is precisely how those are
specified, after all!

One of the reasons that even programs that don't need to do it may
do so is that it is one of the paradigms that is debuggable. It
may be theoretically more efficient to mix everything up, but few
people can get their minds round such designs.


Regards,
Nick Maclaren.

David Magda
Posted: Sat Jul 30, 2005 4:15 pm    Post subject: Re: Cluster computing drawbacks

Mark Hahn <hahn@coffee.psychology.mcmaster.ca> writes:

Quote:
the only counterexamples I can think of are incredibly
loosely-coupled, embarrassingly parallel types. there's absolutely
nothing wrong with that, but as a policy, we encourage people to
avoid gratuitous parallelization. that is, leave EP as the big
bunch of serial jobs that they *are*, so we can schedule them
better.

Just curious, but are there any popular methods used to program the
cluster? Mostly wondering about languages, libraries, and
frameworks. The majority of languages were probably designed with the
implicit assumption that there would mostly be one thread of
execution, and with the rise of SMT, SMP, multi-core chips, etc., it's
becoming more of a 'mainstream' issue.

I ran across the Erlang programming language [1] a little while ago
and your comments above piqued my curiosity.

[1] http://en.wikipedia.org/wiki/Erlang_programming_language

--
David Magda <dmagda at ee.ryerson.ca>
Because the innovator has for enemies all those who have done well under
the old conditions, and lukewarm defenders in those who may do well
under the new. -- Niccolo Machiavelli, _The Prince_, Chapter VI

Greg Lindahl
Posted: Sun Jul 31, 2005 12:15 am    Post subject: Re: Cluster computing drawbacks

In article <m2ll3ofhef.fsf@gandalf.local>,
David Magda <dmagda+trace050401@ee.ryerson.ca> wrote:

Quote:
Just curious, but are there any popular methods used to program the
cluster?

The most popular is a subroutine library for message-passing named
MPI, combined with the usual suspects (Fortran, C, C++). There are a
bunch of other libraries and toolkits used; a bunch of parallel
languages have been invented, but none are in wide use.

Quote:
and with the rise of SMT, SMP, multi-core chips, etc., it's
becoming more of a 'mainstream' issue.

The mainstream on SMP systems seems to be sticking with the usual
multi-threading. Scientific programming on SMPs tends to use OpenMP or
MPI.

-- greg

Nick Maclaren
Posted: Sun Jul 31, 2005 8:15 am    Post subject: Re: Cluster computing drawbacks

In article <42ec0565$1@news.meer.net>, Greg Lindahl <lindahl@pbm.com> wrote:
Quote:
In article <m2ll3ofhef.fsf@gandalf.local>,
David Magda <dmagda+trace050401@ee.ryerson.ca> wrote:

Just curious, but are there any popular methods used to program the
cluster?

The most popular is a subroutine library for message-passing named
MPI, combined with the usual suspects (Fortran, C, C++). There are a
bunch of other libraries and toolkits used; a bunch of parallel
languages have been invented, but none are in wide use.

There were a bunch of other parallel libraries in widespread use
(PVM, various SHMEMs etc.), but PVM is croaking its last and most
of the others haven't been used for distributed memory in years.

Quote:
and with the rise of SMT, SMP, multi-core chips, etc., it's
becoming more of a 'mainstream' issue.

The mainstream on SMP systems seems to be sticking with the usual
multi-threading. Scientific programming on SMPs tends to use OpenMP or
MPI.

There is some POSIX threads, but not much. OpenMP is the most
successful shared memory paradigm ever, by a long way, which shows
just how fast most of the others sank.

If the new multi-core chips are going to be useful or used for
speeding up applications, there will have to be a new parallel
programming paradigm, that is as easy to use as OpenMP and at
least as easy to debug and tune as MPI. But I don't see one on
the horizon. The HPC people will continue to use MPI, but I am
talking about general programming, here.

It really must not be forgotten that up to 4 cores will be taken
up by the operating system running separate processes or kernel
threads in parallel - for example, when editing, all events go
between the kernel, GUI driver (in whatever sense) and editor,
at least. At present, that needs a LOT of context switching;
with 4 cores, it doesn't.


Regards,
Nick Maclaren.

Nick Maclaren
Posted: Mon Aug 01, 2005 8:15 am    Post subject: Re: Cluster computing drawbacks

In article <dckdqp$ci6$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
Quote:

As long as the total cpu state of an editing session is greater than the
local state of each of those processes, the overhead of transferring
between cpus/cores might be greater than by process switching on a
single cpu/core.

Eh? Don't you mean COMMUNICATED state?

Quote:
Yes, it is perfectly possible that the graphics driver has more local
state than what needs to be communicated from the user process in order
to update the screen. I'm just afraid that it might not be so. :-(

I am certain that it is so, for most uses - when the 'GUI driver' is
something like the X Windowing System. Take a look at the size of
data that needs to be passed around when you move the mouse or press
a key. Similarly, text output is small (though not that small).


Regards,
Nick Maclaren.

Terje Mathisen
Posted: Mon Aug 01, 2005 8:15 am    Post subject: Re: Cluster computing drawbacks

Nick Maclaren wrote:
Quote:
It really must not be forgotten that up to 4 cores will be taken
up by the operating system running separate processes or kernel
threads in parallel - for example, when editing, all events go
between the kernel, GUI driver (in whatever sense) and editor,
at least. At present, that needs a LOT of context switching;
with 4 cores, it doesn't.

Ah!

Nick, I believe that might be a (popular?) misconception:

As long as the total cpu state of an editing session is greater than the
local state of each of those processes, the overhead of transferring
between cpus/cores might be greater than by process switching on a
single cpu/core.

Yes, it is perfectly possible that the graphics driver has more local
state than what needs to be communicated from the user process in order
to update the screen. I'm just afraid that it might not be so. :-(

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Peter Grandi
Posted: Mon Aug 01, 2005 11:20 pm    Post subject: Re: Cluster computing drawbacks

[ ... ]

Quote:
scheduler or both. Just Do It. Converting to use MPI
communication is harder, but still easier than converting
to use SMP communication.

Most people's experience is that it is EASIER than
converting a serial program to use SMP communication.
Seriously. Converting to use SMP is one of the foulest
tasks that you can imagine, and is

"This software is not thread-safe" :-) I made the same claim
on my Ph.D. defense. You can find books and books with
chapters and chapters devoted to _explain_ the possible
deadlocks on SMPs and then chapters and chapters devoted to
explain The Right Way to build several paradigms
(consumer-producer, etc)

nmm1> To be fair, about half of those also apply to message
nmm1> passing. What you don't get with message passing is
nmm1> IMPLICIT interaction; if you don't pass a message, the
nmm1> threads are independent. With shared memory, it is
nmm1> usually unclear when threads are interacting.
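
A minimal sketch of the point about explicit versus implicit interaction, assuming Python threads and the standard `queue.Queue` as a stand-in for message passing (the worker and function names are hypothetical). Python threads do share memory, so the isolation here is a convention of the code, not something the language enforces: each worker touches only its own data, and the single `put()` is the only place where threads interact.

```python
import queue
import threading


def worker(name, numbers, outbox):
    """Sum private data; the single put() is the only interaction."""
    total = 0
    for x in numbers:
        total += x  # touches only this thread's local state
    outbox.put((name, total))  # explicit, visible point of communication


def run_message_passing(work):
    """work maps a worker name to its private sequence of numbers."""
    outbox = queue.Queue()
    threads = [
        # list(nums) gives each worker its own copy, so nothing is shared.
        threading.Thread(target=worker, args=(name, list(nums), outbox))
        for name, nums in work.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All workers have finished, so the queue now holds every result.
    results = {}
    while not outbox.empty():
        name, total = outbox.get()
        results[name] = total
    return results


if __name__ == "__main__":
    print(run_message_passing({"a": [1, 2, 3], "b": range(101)}))
```

With a shared-memory design the same workers could update one common accumulator, and a reader would have to inspect every statement to find out where threads might interfere; here the interaction points can be found by searching for `put` and `get`.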

[ ... ]

Quote:
Nice to learn somebody else thinks the same :-)
nmm1> For me too :-)


But the outcome of this is basically that a sort of
''Actor model'' (whose definition apparently has changed over
the decades, until now some people think it means ''some
formal theory of concurrency'') is a good way to think about
parallelism, precisely because of the ''no global state''
difficulty that plagues multithreaded thinking.

However, in this whole discussion there is a little forgotten
area of application for clusters or SMPs, which is DBMSes. :-)

Nick Maclaren
Posted: Tue Aug 02, 2005 12:15 am    Post subject: Re: Cluster computing drawbacks

In article <yf31x5decxr.fsf@base.gp.example.com>,
Peter Grandi <pg_nh@0506.exp.sabi.co.UK> wrote:
Quote:

But the outcome of this is basically that a sort of
''Actor model'' (whose definition apparently has changed over
the decades, until now some people think it means ''some
formal theory of concurrency'') is a good way to think about
parallelism, precisely because of the ''no global state''
difficulty that plagues multithreaded thinking.

Well, yes, but surely you have seen me post about dataflow?

Quote:
However, in this whole discussion there is a little forgotten
area of application for clusters or SMPs, which is DBMSes. :-)

Why? In what way do they differ from other programs, as far as
their requirements for parallelism go?


Regards,
Nick Maclaren.

All times are GMT