| Author |
Message |
Greg Lindahl
Guest
|
Posted:
Mon Jan 24, 2005 3:09 am Post subject:
Re: clusters vs shared-memory |
|
|
In article <3NFId.43933$_56.9978@fe2.texas.rr.com>,
Randy Crawford <joe@burgershack.com> wrote:
| Quote: | Greg Lindahl wrote:
Randy,
In your extensive discussion of the forest, you've missed the tree I
was pointing at. It is not necessary to do a full comparison of the
merits and demerits of shared memory and MPI to discuss the positive
effects of locality for MPI. Nor is my pointing out a particular
situation invalidated because I didn't discuss every aspect of MPI.
Without belaboring pet nits, my point was:
1) Shared-memory architectures/programming are often a Good Thing.
|
Randy, your posting doesn't have anything to do with mine. If you want
to grind your axe about your favorite programming technique, please
pick a thread that I'm not in. And especially don't pick a thread in
which I am already complaining that you aren't paying attention to
what the thread is about.
| Quote: | Perhaps that wasn't the point of the thread,
|
Woah! A glimmering of understanding!
| Quote: | 2) If the thread's point was that the MPI latencies of distributed-
memory machines are comparable with shared-memory machines, then this is
interesting (because I think shared-memory offers some conveniences).
|
No, that wasn't the point, as I said in the posting which you were
quoting.
| Quote: | However, if the point of the thread was only that SGI has improved the
implementation of its MPI library on the Altix such that its latencies
rival those of direct memory accesses, then the discussion of OpenMP's
failings doesn't seem especially relevant, and I certainly *have*
overlooked the forest.
|
No, that isn't the case: I didn't say that, nobody else said that, and
I have no idea why you might think that.
| Quote: | Finally, your memory of the Origin's latencies is wrong. It's nowhere
near a factor of 10 from the shared memory latency (for shared stuff)
================
and the MPI latency.
When I was working for SGI (~1998), I recall the acknowledged ideal
latency for MPI being as little as 5 us, but was practically more like
10-15 us (across a Cray Link). As I recall, RAM latency was about 150
ns, which indeed is *not* a difference of a magnitude. It's two magnitudes.
|
FOR SHARED STUFF. What's the point of comparing apples and oranges?
Geez.
-- greg |
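For readers following the numbers, here is a minimal sketch of the arithmetic in the quoted figures (150 ns RAM latency, 5 us ideal and 10-15 us practical MPI latency). The function name is made up for illustration. Note that it compares MPI against local RAM latency, which is exactly the comparison Greg objects to; the thread gives no figure for a remote shared-memory access, so none is assumed here.

```python
import math

def orders_of_magnitude(slow_ns: float, fast_ns: float) -> float:
    """Return log10 of the latency ratio slow/fast."""
    return math.log10(slow_ns / fast_ns)

RAM_NS = 150.0  # RAM latency as recalled in the quoted post

# MPI latencies quoted in the post, in microseconds
for label, mpi_us in [("ideal", 5.0), ("practical, low", 10.0), ("practical, high", 15.0)]:
    mpi_ns = mpi_us * 1000.0
    ratio = mpi_ns / RAM_NS
    print(f"{label}: {ratio:.0f}x slower, "
          f"{orders_of_magnitude(mpi_ns, RAM_NS):.2f} orders of magnitude")
```

On these figures the gap is between about 1.5 and 2 orders of magnitude, depending on which MPI latency is used.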
|
|
 |
Nick Maclaren
Guest
|
Posted:
Mon Jan 24, 2005 1:50 pm Post subject:
Re: CAS and LL/SC |
|
|
In article <HqCdnV9U5Jy5mmncRVn-1Q@comcast.com>,
glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
| Quote: |
Oh, and I'd expect "tagging" to get more common in the future. We've
passed through a period when "all 32 bits" were used for addressing
but we are now entering a period when there is "no chance" that all
64 bits will be significant.
I'm not convinced we are about to enter such a period and fully expect
people to grumble about people having done stupid tricks in 10-15
years' time.
While that's possible, I don't think that it is likely. It would
certainly be likely if there were a resurgence of the models that
map the whole filing system into memory, but I don't see much sign
of that.
I do wonder. Even if Moore's law continues, so that memory
prices continue to decrease, is there a point where we really
don't need that much?
|
That's not the reason I don't see it happening. The discrepancy
between latency and bandwidth is continuing to grow, and is already
comparable to (older) disks when distributed memory is considered.
This makes it largely pointless to use a model that is appropriate
for low-latency access to small units of data.
Yes, I am one of the fair number of people who has been saying for
years that RAM is becoming the new disk - the transition has been
slower than we expected, but isn't showing signs of stopping.
Regards,
Nick Maclaren. |
|
|
 |
Andi Kleen
Guest
|
Posted:
Mon Jan 24, 2005 6:06 pm Post subject:
Re: CAS and LL/SC |
|
|
Anne & Lynn Wheeler <lynn@garlic.com> writes:
| Quote: | it is now almost 40 years later and i've now got a 4gbyte (although
linux claims that there is only 3.5gbytes) "personal computer". except
for the possible linux glitch, have effectively doubled the number of
|
It's a BIOS glitch, Linux is not to blame.
Directly below 4GB there is the PCI+AGP memory hole which is roughly
500MB. Your BIOS is not able to map the memory "around" the hole and
the memory sitting below the hole is lost. Linux cannot use it
because it can only use memory that the BIOS gives to it.
Due to PC compatibility the hole has to be below the 4GB(32bit)
boundary because there are lots of 32bit only registers addressing
it.
Some BIOS support memory hoisting to work around this, but it's
complicated and the cheaper boards tend to not bother implementing
it. In some cases (e.g. cheaper Intel chipsets) the chipset also
doesn't support memory above the 4GB boundary and the BIOS has no
choice.
One way to recover some of the memory would be to decrease the
size of the AGP aperture in the BIOS setup. That works if you
don't use anything AGP intensive like 3d graphics.
To bring it back on topic to comp.arch: the moral is to never add
silly address space limits to registers that cause such problems
later. If people hadn't designed 32bit only bridges the PCI
memory hole could be above 4GB.
-Andi |
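The arithmetic behind the missing memory can be sketched as follows. This is a simplified model, not a description of any particular BIOS: it assumes a single I/O hole ending exactly at 4 GiB, takes the "roughly 500MB" above as 512 MiB, and treats memory hoisting as a simple yes/no flag. The function and parameter names are made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def usable_ram(installed: int, hole: int, can_remap: bool) -> int:
    """RAM visible to the OS when an I/O hole sits just below 4 GiB.

    Simplified model: RAM whose addresses overlap the hole is lost
    unless the firmware can remap ("hoist") it above 4 GiB.
    """
    four_gib = 4 * GIB
    hole_start = four_gib - hole
    if installed <= hole_start:
        return installed          # RAM never reaches the hole
    if can_remap:
        return installed          # shadowed RAM reappears above 4 GiB
    overlap = min(installed, four_gib) - hole_start
    return installed - overlap    # the overlapped part is lost

print(usable_ram(4 * GIB, 512 * MIB, can_remap=False) / GIB)  # 3.5
print(usable_ram(4 * GIB, 512 * MIB, can_remap=True) / GIB)   # 4.0
```

Under this model a 4 GiB machine without hoisting reports 3.5 GiB, which matches the figure Linux shows.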
|
|
 |
Bernd Paysan
Guest
|
Posted:
Mon Jan 24, 2005 6:55 pm Post subject:
Re: CAS and LL/SC |
|
|
Andi Kleen wrote:
| Quote: | To bring it back on topic to comp.arch: the moral is to never add
silly address space limits to registers that cause such problems
later. If people hadn't designed 32bit only bridges the PCI
memory hole could be above 4GB.
|
Or at least virtualize DMA memory. With an Athlon64, the design idea is to
map all RAM above 4GB, and use the IOMMU so that PCI and AGP DMA can access
the RAM it needs to. Needs to be supported by the BIOS, though.
On the other hand, with all this virtualization talk (Vanderpool): Please
add an IO processor that takes messages from the host OS and talks to the
IO all by itself - including MMU tasks for DMA, and such stuff. And make it
a multithreaded processor.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
|
|
 |
Eric P.
Guest
|
Posted:
Mon Jan 24, 2005 7:45 pm Post subject:
Re: CAS and LL/SC |
|
|
Bernd Paysan wrote:
| Quote: |
On the other hand, with all this virtualization talk (Vanderpool): Please
add an IO processor that takes messages from the host OS and talks to the
IO all by itself - including MMU tasks for DMA, and such stuff. And make it
a multithreaded processor.
|
Why do you think that asymmetric multiprocessing is better compared
to adding another main processor with an SMP OS? In SMP once the
cpu is finished housekeeping it can be used for running apps.
An IO coprocessor cannot.
Eric |
|
|
 |
Bernd Paysan
Guest
|
Posted:
Mon Jan 24, 2005 8:45 pm Post subject:
Re: CAS and LL/SC |
|
|
Eric P. wrote:
| Quote: | Bernd Paysan wrote:
On the other hand, with all this virtualization talk (Vanderpool): Please
add an IO processor that takes messages from the host OS and talks to the
IO all by itself - including MMU tasks for DMA, and such stuff. And make
it a multithreaded processor.
Why do you think that asymmetric multiprocessing is better compared
to adding another main processor with an SMP OS? In SMP once the
cpu is finished housekeeping it can be used for running apps.
An IO coprocessor cannot.
|
An IO processor is a simplistic device that mostly moves data around. Apart
from that, it will only provide a small "abstraction layer" for accessing
IO. IO has a lot of properties that make it awful to access with the main
processor:
* It is slow
* It typically needs to be uncached
* It needs strict ordered accesses
and it often comes with dreadful design mistakes:
* Destructive reads (no speculation! No prefetch!)
* Polling
When I'm talking about an IO processor, I'm talking about logic that takes
on the order of 10k gates. Then the advantage over adding another main
processor is obvious:
* The IO processor costs cents or less
Furthermore, there are advantages from an OS design point of view:
* The IO processor can be used as a more generic interface to the IO - it can
translate commands to device-specific ones. That's what VMs need: A
"channel processor" where the real machine can use the real IO processor,
and the virtual machine can filter the channel commands and pass them along
to the real IO processor without much overhead. IO processor commands are
big chunks of semantically connected work, not single register transfers.
Compare that to the normal IO situation on the PCI/ISA bus of a PC, where
you get single register accesses and you ping-pong between the VM and the
OS.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
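The difference between single register transfers and "big chunks of semantically connected work" can be sketched with a toy model. Everything here is hypothetical: the `Hypervisor` class, the register names and the three-write device protocol are invented for illustration and describe no real device or VM monitor. The only point is the count of guest-to-host transitions per I/O operation.

```python
from dataclasses import dataclass, field

@dataclass
class Hypervisor:
    """Counts guest-to-host transitions (a stand-in for VM exits)."""
    transitions: int = 0
    log: list = field(default_factory=list)

    def trap_register_write(self, reg: str, value: int) -> None:
        # Every trapped register access bounces between the VM and the OS.
        self.transitions += 1
        self.log.append((reg, value))

    def submit_command(self, command: dict) -> None:
        # One channel-style command is filtered and forwarded once.
        self.transitions += 1
        self.log.append(("command", dict(command)))

def write_block_register_style(hv: Hypervisor, addr: int, length: int) -> None:
    # Hypothetical device: three register writes start one transfer.
    hv.trap_register_write("DMA_ADDR", addr)
    hv.trap_register_write("DMA_LEN", length)
    hv.trap_register_write("CMD", 1)

def write_block_command_style(hv: Hypervisor, addr: int, length: int) -> None:
    # The same work expressed as one self-describing command.
    hv.submit_command({"op": "write", "addr": addr, "len": length})

hv_regs, hv_cmds = Hypervisor(), Hypervisor()
write_block_register_style(hv_regs, 0x1000, 4096)
write_block_command_style(hv_cmds, 0x1000, 4096)
print(hv_regs.transitions, hv_cmds.transitions)  # 3 1
```

A real device with status polling would widen the gap, since each poll is another trapped register access.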
|
|
 |
Anne & Lynn Wheeler
Guest
|
Posted:
Mon Jan 24, 2005 10:57 pm Post subject:
Re: CAS and LL/SC |
|
|
Andi Kleen <freitag@alancoxonachip.com> writes:
| Quote: | It's a BIOS glitch, Linux is not to blame.
Directly below 4GB there is the PCI+AGP memory hole which is roughly
500MB. Your BIOS is not able to map the memory "around" the hole and
the memory sitting below the hole is lost. Linux cannot use it
because it can only use memory that the BIOS gives to it.
|
bios says
4gbyte memory
agp (defaulted) 128mb, only options 64mb, 128mb, & 256mb
primary graphics uses agp
i don't use 3d graphics, i changed bios agp from 128mb to 64mb and it
didn't make any difference, still shows 3.5gb (fedora fc3)
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/ |
|
|
 |
Stephen Fuld
Guest
|
Posted:
Mon Jan 24, 2005 11:56 pm Post subject:
Re: CAS and LL/SC |
|
|
"Andi Kleen" <freitag@alancoxonachip.com> wrote in message
news:m3brbfou1q.fsf@averell.firstfloor.org...
snip
| Quote: | To bring it back on topic to comp.arch: the moral is to never add
silly address space limits to registers that cause such problems
later.
|
Yes. Or, in a slightly different formulation, don't use memory mapped I/O
at all (at least not for general purpose processors where the stringent
requirements of some embedded systems don't apply).
| Quote: | If people hadn't designed 32bit only bridges the PCI
memory hole could be above 4GB.
|
Yup. And if all of the I/O addressing stuff was in the CPU and not on some
memory mapped thingie, the problem wouldn't have arisen.
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Tue Jan 25, 2005 12:52 am Post subject:
Re: CAS and LL/SC |
|
|
In article <YxbJd.20816$8u5.17772@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
| Quote: | "Andi Kleen" <freitag@alancoxonachip.com> wrote in message
news:m3brbfou1q.fsf@averell.firstfloor.org...
To bring it back on topic to comp.arch: the moral is to never add
silly address space limits to registers that cause such problems
later.
Yes. Or, in a slightly different formulation, don't use memory mapped I/O
at all (at least not for general purpose processors where the stringent
requirements of some embedded systems don't apply).
|
Amen.
Merging this with another thread (the request for specialised
coprocessors using an I/O interface), you end up with it being
regarded as a good thing to have a CPU core pretending to be an
I/O device pretending to be some memory attached to another CPU.
Stop the world - I want to get off.
As I have said before, traditional engineering is the wrong model;
Darwinian evolution of the sort that leads to the panda's 'thumb'
is a much better one .....
Regards,
Nick Maclaren. |
|
|
 |
David Kanter
Guest
|
Posted:
Tue Jan 25, 2005 4:15 am Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
| Quote: | In fairness, all the early RISC designs suffered from multiple
architectural blunders that could not be justified by the high
cost of silicon alone, this was by no means a Mips monopoly.
PA-RISC could not queue interrupts, SPARC went to market without
multiply instructions and no way to turn off virtual memory, etc..
|
What about POWER and Alpha? Did those avoid most of the major blunders
(obviously, they avoided the lack of a multiply)?
David Kanter |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Tue Jan 25, 2005 1:05 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
In article <1106608559.393100.303700@c13g2000cwb.googlegroups.com>,
David Kanter <dkanter@gmail.com> wrote:
| Quote: | In fairness, all the early RISC designs suffered from multiple
architectural blunders that could not be justified by the high
cost of silicon alone, this was by no means a Mips monopoly.
PA-RISC could not queue interrupts, SPARC went to market without
multiply instructions and no way to turn off virtual memory, etc..
What about POWER and Alpha? Did those avoid most of the major blunders
(obviously, they avoided the lack of a multiply)?
|
I haven't looked at POWER in detail, but Alpha got several things
badly wrong.
It originally had only full register loads and stores, which is
unsuitable for implementing C. That was fixed.
Its floating-point exception model is a thorough mess, which was made
worse by the compilers for some years, and still is by the lack of
(compiler) documentation.
Regards,
Nick Maclaren. |
|
|
 |
glen herrmannsfeldt
Guest
|
Posted:
Tue Jan 25, 2005 1:32 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
Nick Maclaren wrote:
(snip)
| Quote: | I haven't looked at POWER in detail, but Alpha got several things
badly wrong.
It originally had only full register loads and stores, which is
unsuitable for implementing C. That was fixed.
|
It had load/store and instructions for manipulating bytes
in a register, and is a RISC machine.
What happens when you add byte and halfword store?
It loads the fullword into a register, modifies the
byte or halfword, and writes it back again. With ECC
that is the only way it can be done. It will be just
as slow in either case, though it takes a few more
instructions in the original version.
I still remember multiply step and divide step where on
some RISC processors it took more than one instruction
to multiply and divide.
I also remember Fortran IV/66 with no character data type,
so if you wanted to manipulate characters you stored them
in fullwords with A1 format. That solves the byte addressing
problem. (I did use a compiler with LOGICAL*1, but you can't
use relational operators on LOGICAL variables.)
As far as I know Alpha with only fullword load still did
a better job with C character data than SPARC did, otherwise
I was always pretty happy with SPARC.
-- glen |
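The load/modify/store sequence described above can be sketched in a few lines. This is an illustration only: it assumes 32-bit words and little-endian byte numbering within a word, and the function name is made up. It is not a rendering of the actual Alpha instruction sequence.

```python
def store_byte_via_word(memory: list, byte_addr: int, value: int) -> None:
    """Emulate a byte store on a machine with only 32-bit loads and stores.

    Assumes little-endian byte lanes: byte 0 is the least significant.
    """
    word_index, lane = divmod(byte_addr, 4)
    shift = 8 * lane
    word = memory[word_index]                # load the full word
    word &= ~(0xFF << shift) & 0xFFFFFFFF    # clear the target byte lane
    word |= (value & 0xFF) << shift          # insert the new byte
    memory[word_index] = word                # store the full word back

mem = [0x11223344]
store_byte_via_word(mem, 1, 0xAB)
print(hex(mem[0]))  # 0x1122ab44
```

The other three bytes of the word are read and rewritten unchanged, which is harmless on a single thread.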
|
|
 |
Per Schröder
Guest
|
Posted:
Tue Jan 25, 2005 2:14 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
glen herrmannsfeldt wrote:
| Quote: | Nick Maclaren wrote:
I haven't looked at POWER in detail, but Alpha got several things
badly wrong.
It originally had only full register loads and stores, which is
unsuitable for implementing C. That was fixed.
It had load/store and instructions for manipulating bytes
in a register, and is a RISC machine.
What happens when you add byte and halfword store?
It loads the fullword into a register, modifies the
byte or halfword, and writes it back again. With ECC
that is the only way it can be done. It will be just
as slow in either case, though it takes a few more
instructions in the original version.
|
This leads to wrong results if you are using threads and you have two
adjacent bytes (in the same 32-bit word) that are protected by different
mutexes.
Of course, you could do *all* byte and 16-bit accesses by LL/SC sequences.
This would reinstate correct semantics but the cost would be prohibitive.
You could also declare that having adjacent bytes protected by different
mutexes is UNSUPPORTED (it's a bad idea anyway), but unfortunately a lot of
programmers wouldn't notice, and the net result would be that threaded
programs that worked fine on other architectures would randomly fail when
ported to Alpha.
/Per Schröder |
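The failure described here can be reproduced deterministically with a small simulation. It is a sketch under stated assumptions: 32-bit words, little-endian byte lanes, and a per-word write counter standing in for an LL/SC reservation. The class and function names are invented, and the interleaving is written out by hand rather than produced by real threads.

```python
def insert_byte(word: int, lane: int, value: int) -> int:
    """Return word with byte lane (0..3) replaced by value."""
    shift = 8 * lane
    return (word & ~(0xFF << shift) & 0xFFFFFFFF) | ((value & 0xFF) << shift)

class Memory:
    """Word-addressed memory; writes[] models losing an LL/SC reservation."""
    def __init__(self, words):
        self.words = list(words)
        self.writes = [0] * len(self.words)

    def load(self, idx):
        return self.words[idx]

    def store(self, idx, word):
        self.words[idx] = word
        self.writes[idx] += 1

    def load_linked(self, idx):
        return self.words[idx], self.writes[idx]

    def store_conditional(self, idx, word, token):
        if self.writes[idx] != token:
            return False                       # word was written since load_linked
        self.store(idx, word)
        return True

def racy_interleaving() -> int:
    """Thread A owns byte 0, thread B owns byte 1, each under its own mutex."""
    mem = Memory([0])
    a = mem.load(0)                            # A loads the word
    b = mem.load(0)                            # B loads the same word
    mem.store(0, insert_byte(a, 0, 0xAA))      # A writes back
    mem.store(0, insert_byte(b, 1, 0xBB))      # B writes a stale copy, erasing A's byte
    return mem.load(0)

def ll_sc_interleaving():
    """Same interleaving, but every byte store is an LL/SC retry loop."""
    mem = Memory([0])
    b, b_token = mem.load_linked(0)
    a, a_token = mem.load_linked(0)
    assert mem.store_conditional(0, insert_byte(a, 0, 0xAA), a_token)
    attempts = 1
    while not mem.store_conditional(0, insert_byte(b, 1, 0xBB), b_token):
        b, b_token = mem.load_linked(0)        # retry with a fresh copy
        attempts += 1
    return mem.load(0), attempts

print(hex(racy_interleaving()))                # 0xbb00: A's update is lost
word, attempts = ll_sc_interleaving()
print(hex(word), attempts)                     # 0xbbaa 2
```

The LL/SC version gives the right answer, at the price of a retry loop around every sub-word store, which is the cost objection raised above.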
|
|
 |
Jan Vorbrüggen
Guest
|
Posted:
Tue Jan 25, 2005 2:57 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
| Quote: | It originally had only full register loads and stores, which is
unsuitable for implementing C. That was fixed.
|
I believe the driving force in providing less-than-32-bit-read and
-writes was memory-mapped I/O, not implementing C (which was needed
from day one in any case). The workarounds in hard- and software for
the first systems were Not Pretty and prone to misuse and errors, so
the necessary instructions were added.
Jan |
|
|
 |
Jean-Pierre Panziera
Guest
|
Posted:
Tue Jan 25, 2005 4:47 pm Post subject:
Re: clusters vs shared-memory |
|
|
Sorry for the delay; I read newsgroups very irregularly.
When I talked about "an order of magnitude" I was not actually thinking
about SGI systems.
I was considering, say, a 4-way Opteron and a 2x2 Opteron interconnected
with Infiniband. The numbers I had in mind (from marketing brochures) do
show an order of magnitude: ~150 nanoseconds vs 1.5 microseconds. Maybe I
got the wrong numbers.
On SGI ccNUMA hardware, a full MPI ping-pong cycle requires 5 return
trips between the two nodes involved in the communication. So you have
2.5 trips per "ping" or "pong". A Shared-Memory access is one return
trip. And then you might have to consider software overhead.
Jean-Pierre Panziera
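The trip counting above can be written down as a small model. It is only a sketch of the reasoning in this post: 5 return trips per full ping-pong, so 2.5 per direction, against 1 return trip for a shared-memory access. The round-trip time and software overhead passed in below are placeholders, not measurements of any SGI system, and the function name is made up.

```python
TRIPS_PER_PING_PONG = 5   # from the post: a full ping-pong needs 5 return trips
TRIPS_PER_SHARED_ACCESS = 1

def mpi_one_way_latency_ns(round_trip_ns: float,
                           software_overhead_ns: float = 0.0) -> float:
    """Latency of one 'ping' or 'pong' under the trip-count model."""
    return (TRIPS_PER_PING_PONG / 2) * round_trip_ns + software_overhead_ns

def shared_access_latency_ns(round_trip_ns: float) -> float:
    """A shared-memory access costs a single return trip."""
    return TRIPS_PER_SHARED_ACCESS * round_trip_ns

# Placeholder inputs, chosen only to show the shape of the result.
rt = 100.0
for overhead in (0.0, 250.0):
    ratio = mpi_one_way_latency_ns(rt, overhead) / shared_access_latency_ns(rt)
    print(f"overhead {overhead:.0f} ns: MPI/shared = {ratio:.1f}x")
```

With no software overhead the model gives a fixed 2.5x ratio; any larger gap has to come from the software term.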
Greg Lindahl wrote:
| Quote: | In article <41EE2D36.6080907@sgi.com>,
Jean-Pierre Panziera <jpp.removeThis@sgi.com> wrote:
Simple HW fact:
Accessing data from another MPI thread takes microsecond(s),
this is an order of magnitude slower than Shared Memory.
This is not true on your hardware.
So the question remains: why is it that the MPI versions of programs
tend to be faster, on Altix, than shared memory versions? Don't trot
out just one counter-example...
The answer is that since MPI makes the programmer explicitly think
about locality, false sharing and other locality problems are
minimized in MPI programs.
The place MPI loses on Altix is irregular codes, where load balancing
is a major issue.
Regarding Scalable Shared-Memory applications, you could check what NASA
is doing. I think that's down the road from where you work.
Thanks, but I was aware of that work many years ago, long before I
moved to California. One counter-example does not a counter-argument
make.
-- greg
|
|
|