| Author |
Message |
Del Cecchi
Guest
|
Posted:
Wed Oct 05, 2005 12:15 am Post subject:
Re: How to build an infiniband cluster |
|
|
Thomas Womack wrote:
| Quote: | In article <3qg2e7Fesjk5U1@individual.net>,
Del Cecchi <cecchinospam@us.ibm.com> wrote:
Greg Lindahl wrote:
Meanwhile, the $900 number quoted for the PathScale InfiniPath HCA on
our website is a _list_ price. Our intention is for clusters built with
our HCA to have better price/performance than the competition. Remember
that our performance is as much as 15X faster* than the competition --
and the HCA is only a small part of the cost of a cluster.
-- greg
(employed by, not speaking for, PathScale.)
(* 128 byte MPI payload, streaming, vs Mellanox.)
Of course if one doesn't care about latency, there is 10g ethernet....
Although I don't know how much cheaper it is.
$900 is a _lot_ cheaper than Intel's 10G PXLA8591[SL]R cards, though I
suspect a lot of that saving comes from using copper rather than fibre
as the connection medium; an HP 24-port Infiniband 4x switch is $7000
if I believe froogle, I imagine the cables from the cards to the
switch are probably another few thousand dollars for a 24-node system. I
can't find pricing for 10Gbit Ethernet switches.
Ah. Intel has PXLA8591CX4 which is 10Gbit over 'CX4', and seems on some
dubious European tech sites to be advertised for EUR800 or so.
Tom
|
CX4 is XAUI, or 4 shielded pairs per direction, similar to IB 4X.
--
Del Cecchi
“This post is my own and doesn’t necessarily represent IBM’s positions,
strategies or opinions.” |
|
Greg Lindahl
Guest
|
Posted:
Wed Oct 05, 2005 12:15 am Post subject:
Re: How to build an infiniband cluster |
|
|
In article <1128457515.511056.6920@g43g2000cwa.googlegroups.com>,
<long_term_capital@yahoo.com> wrote:
| Quote: | I am now trying to learn more about Pathscale. I'm certainly not going
to use their compilers, so the OpenIB is crucial.
|
It is not necessary to use our compilers. We do provide non-OpenIB
drivers for some protocols such as MPI; the reason to use them is that
they're a lot faster than MPI+OpenIB.
| Quote: | My first reading is that Pathscale is not plug-and-play compatible
with the other IB vendors, so this is a huge negative.
|
Our OpenIB drivers are plug-and-play compatible with other IB vendors.
| Quote: | One thing I'm a bit confused about: does Pathscale work only with the
Opteron? not with Xeon?
|
Our InfiniPath HCA plugs directly into the Opteron/Athlon64
HyperTransport. This means that it doesn't work with Xeon.
| Quote: | The peak bandwidth numbers on the Pathscale web page are the same as
the Mellanox and others. The benchmarks show that they are better,
but I am really interested only in how my particular application
will perform, and the benchmarks are next to useless for that.
|
I'll half-agree with that. It's certainly true that the standard
"latency of 0-byte payloads and bandwidth of huge payloads" benchmark
is misleading: real apps use medium-sized messages, and we can show
you both microbenchmarks and real application benchmarks that
demonstrate our advantage at those sizes. I'll disagree with you if you want to
claim that ALL benchmarks are useless; real application benchmarks and
microbenchmarks can teach you quite a bit, especially if you have an
idea of what the characteristics of your particular application are.
BTW, I like the new HPC Challenge microbenchmarks; you can find
results here: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
We're about to submit 128- and 256-core results.
| Quote: | "Performance" means different things to different people. I am curious
what you mean when you say that Pathscale is 15 times faster than
the competition.
|
This claim is for a microbenchmark which is sending 128-byte MPI
messages, using 2+ cpus per node. Our performance is about 15X higher.
Now, this is irrelevant if your application is sending different size
messages, and of course this 15X speedup only applies to the
communication portion of your program, and not the compute portion.
But we've run into a real application (PETSc) where the problem the
customer wanted to run at large cpu counts was sending 128-byte
messages... and the end result was much better scaling of the whole
application.
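The caveat above, that the 15X applies only to the communication portion, can be put into numbers with Amdahl's law. This is a minimal sketch and not a PathScale benchmark: the function name and the sample communication fractions are assumptions made for illustration, and only the 15X figure comes from this thread.

```python
def overall_speedup(comm_fraction: float, comm_speedup: float = 15.0) -> float:
    """Whole-application speedup when only the communication part gets faster.

    comm_fraction is the share of run time spent in communication on the
    slower interconnect (hypothetical input; measure it for a real code).
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    compute_fraction = 1.0 - comm_fraction
    # Amdahl's law: compute time is unchanged, communication time shrinks.
    return 1.0 / (compute_fraction + comm_fraction / comm_speedup)


if __name__ == "__main__":
    # Illustrative fractions only; they are not measurements from the thread.
    for frac in (0.1, 0.5, 0.9):
        print(f"{frac:.0%} communication -> {overall_speedup(frac):.2f}x overall")
```

With half the run time spent in 128-byte messaging, this model gives a bit under 2x for the whole program; the full 15X would appear only if the run were entirely communication.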
-- greg
(employed by, not speaking for, PathScale.) |
|
Guest
|
Posted:
Wed Oct 05, 2005 12:15 am Post subject:
Re: How to build an infiniband cluster |
|
|
Thanks for clarifying the Mellanox pricing thing for me because I also
saw that number and was surprised. The $400-$600 number is more likely,
but that number is for the _dual_ port HCA.
I am now trying to learn more about Pathscale. I'm certainly not going
to use their compilers, so the OpenIB is crucial. My first reading is
that Pathscale is not plug-and-play compatible with the other IB
vendors, so this is a huge negative.
I want top performance, but I want to get it from commodity
components.
One thing I'm a bit confused about: does Pathscale work only with the
Opteron? not with Xeon?
The peak bandwidth numbers on the Pathscale web page are the same as
the Mellanox and others. The benchmarks show that they are better,
but I am really interested only in how my particular application
will perform, and the benchmarks are next to useless for that.
"Performance" means different things to different people. I am curious
what you mean when you say that Pathscale is 15 times faster than
the competition.
Thanks,
Ed |
|
Wes Felter
Guest
|
|
Guest
|
Posted:
Wed Oct 05, 2005 8:02 am Post subject:
Re: How to build an infiniband cluster |
|
|
Thank you very much. This was hugely helpful.
I spent some time searching the hp web site before, but I couldn't find
this.
Ed |
|
Guest
|
Posted:
Wed Oct 05, 2005 4:15 pm Post subject:
Re: How to build an infiniband cluster |
|
|
Greg,
So if I use your compilers with your hardware I can get the most out of
your hardware.
The natural question then is what happens if I use MPI +OpenIB? Is it
still going to be better than the competition?
I looked at the HPC challenge benchmarks page. You certainly have a
very nice number for the latency benchmark, but I was surprised that
the RandomRing bandwidth benchmark was lower than Voltaire's. But it is
difficult to know what exactly is the meaning of this number and
whether there is a simple comparison.
The benchmark description does not give sufficient explanation.
For example, clearly the TFLOP/s benchmarks need to be divided by the
number of processors to be somehow comparable.
Overall, I'd say this whole benchmark thing is like watching a horse
race. You can see how well the horses are running, but it doesn't mean
this is useful for your own horse...
In any case, you don't know what they've been feeding their horses and
it is too much work to go around asking. (This continues the horse
thread brought on by Del Cecchi).
Ed |
|
Thomas Womack
Guest
|
Posted:
Wed Oct 05, 2005 7:38 pm Post subject:
Re: How to build an infiniband cluster |
|
|
In article <3qg83rFeg2b9U1@individual.net>,
Del Cecchi <cecchinospam@us.ibm.com> wrote:
| Quote: | Ah. Intel has PXLA8591CX4 which is 10Gbit over 'CX4', and seems on some
dubious European tech sites to be advertised for EUR800 or so.
Tom
CX4 is XAUI, or 4 shielded pairs per direction, similar to IB 4X.
|
Thanks. Does this mean that the CX4 connector could be plugged into a
transceiver if Intel get the silicon lasers working and the
transceivers cheap? [my googles are suggesting that you can get
XAUI->fibre chips, but are not finding actual transceivers]
Gore.com offers both CX4 and IB4x cables, at (roughly) $65 + $6 per
metre, so your 24 nodes of connectivity is about $21000 for HCAs plus
$2000 for cables plus $7000 for switches; which looks intuitively
about the right price point, it's about the same cost per-node as the
second dual-core Opteron.
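The arithmetic behind those figures can be sketched as follows. The $900 HCA list price, the $7000 switch and the $65 + $6 per metre cable price are taken from this thread; the 3-metre cable length is an assumption chosen for illustration, since no length is given, and all names in the code are hypothetical.

```python
NODES = 24
HCA_LIST_PRICE = 900     # USD per HCA, list price quoted earlier in the thread
SWITCH_PRICE = 7000      # USD, one 24-port 4x switch
CABLE_BASE = 65          # USD per cable
CABLE_PER_METRE = 6      # USD per metre of cable
CABLE_LENGTH_M = 3       # ASSUMPTION: the thread does not state a cable length


def interconnect_cost(nodes: int = NODES, cable_length_m: float = CABLE_LENGTH_M) -> dict:
    """Rough interconnect cost for a single-switch cluster, in USD."""
    hcas = nodes * HCA_LIST_PRICE
    cables = nodes * (CABLE_BASE + CABLE_PER_METRE * cable_length_m)
    total = hcas + cables + SWITCH_PRICE
    return {
        "hcas": hcas,
        "cables": cables,
        "switch": SWITCH_PRICE,
        "total": total,
        "per_node": total / nodes,
    }


if __name__ == "__main__":
    for item, usd in interconnect_cost().items():
        print(f"{item:>8}: ${usd:,.0f}")
```

Under these assumptions the total comes to $30,592, or roughly $1,275 per node, in line with the rounded figures above.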
200Gflops peak, 15GB/sec bisection bandwidth at about 2us latency,
change from $100k; back to the minisupercomputer days even before
adjusting for inflation, but with a product that the Cray of the
minisupercomputer days would have killed for.
Tom |
|
Del Cecchi
Guest
|
Posted:
Thu Oct 06, 2005 12:15 am Post subject:
Re: How to build an infiniband cluster |
|
|
Thomas Womack wrote:
| Quote: | In article <3qg83rFeg2b9U1@individual.net>,
Del Cecchi <cecchinospam@us.ibm.com> wrote:
Ah. Intel has PXLA8591CX4 which is 10Gbit over 'CX4', and seems on some
dubious European tech sites to be advertised for EUR800 or so.
Tom
CX4 is XAUI, or 4 shielded pairs per direction, similar to IB 4X.
Thanks. Does this mean that the CX4 connector could be plugged into a
transceiver if Intel get the silicon lasers working and the
transceivers cheap? [my googles are suggesting that you can get
XAUI->fibre chips, but are not finding actual transceivers]
Gore.com offers both CX4 and IB4x cables, at (roughly) $65 + $6 per
metre, so your 24 nodes of connectivity is about $21000 for HCAs plus
$2000 for cables plus $7000 for switches; which looks intuitively
about the right price point, it's about the same cost per-node as the
second dual-core Opteron.
200Gflops peak, 15GB/sec bisection bandwidth at about 2us latency,
change from $100k; back to the minisupercomputer days even before
adjusting for inflation, but with a product that the Cray of the
minisupercomputer days would have killed for.
Tom
|
Look for XENPAK for example.
--
Del Cecchi
“This post is my own and doesn’t necessarily represent IBM’s positions,
strategies or opinions.” |
|
Greg Lindahl
Guest
|
Posted:
Thu Oct 06, 2005 12:15 am Post subject:
Re: How to build an infiniband cluster |
|
|
In article <1128528233.031261.91390@z14g2000cwz.googlegroups.com>,
<long_term_capital@yahoo.com> wrote:
| Quote: | So if I use your compilers with your hardware I can get the most out of
your hardware.
The natural question then is what happens if I use MPI +OpenIB? Is it
still going to be better than the competition?
|
Good question. The only benchmark number we have for OpenIB on our
hardware is SDP, Sockets Direct Protocol, not MVAPICH. For SDP, our
performance at small-to-medium message sizes is much better than
Mellanox-based HCAs. This uses the same RC messaging that MVAPICH
uses.
| Quote: | I looked at the HPC challenge benchmarks page. You certainly have a
very nice number for the latency benchmark, but I was surprised that
the RandomRing bandwidth benchmark was lower than Voltaire's.
|
Which result are you looking at? Our only published submission has a
0.265305 GB/s random ring bandwidth, while the only Voltaire
submission I see is 0.155755 GB/s. (BTW, Voltaire is now reselling our
HCAs, but this Voltaire result uses Mellanox-based HCAs.)
| Quote: | But it is difficult to know what exactly is the meaning of this
number and whether there is a simple comparison.
|
Unfortunately, reading the source is the best way to learn about these
benchmarks.
-- greg |
|
Guest
|
Posted:
Thu Oct 06, 2005 12:15 am Post subject:
Re: How to build an infiniband cluster |
|
|
My bad; I was looking at a different number (the PTRANS), which is
10.365 for Voltaire and 6.719 for Pathscale, but as I said, I don't
really know what the number means.
If it is some aggregate bandwidth then it would actually be in favor of
Pathscale, which is using a smaller number of processors. |
|
Del Cecchi
Guest
|
Posted:
Thu Oct 06, 2005 12:15 am Post subject:
Re: How to build an infiniband cluster |
|
|
Thomas Womack wrote:
| Quote: | In article <3qg83rFeg2b9U1@individual.net>,
Del Cecchi <cecchinospam@us.ibm.com> wrote:
Ah. Intel has PXLA8591CX4 which is 10Gbit over 'CX4', and seems on some
dubious European tech sites to be advertised for EUR800 or so.
Tom
CX4 is XAUI, or 4 shielded pairs per direction, similar to IB 4X.
Thanks. Does this mean that the CX4 connector could be plugged into a
transceiver if Intel get the silicon lasers working and the
transceivers cheap? [my googles are suggesting that you can get
XAUI->fibre chips, but are not finding actual transceivers]
|
You don't need silicon lasers. Regular GaAs or whatever lasers will
work fine. The normal methodology is that a chip takes the 4 channels
and makes a 10Gbit serial stream and drives the laser. The receiver
works the same, only backwards.
And optical is only required if you want to go more than 10-15 meters.
| Quote: |
Gore.com offers both CX4 and IB4x cables, at (roughly) $65 + $6 per
metre, so your 24 nodes of connectivity is about $21000 for HCAs plus
$2000 for cables plus $7000 for switches; which looks intuitively
about the right price point, it's about the same cost per-node as the
second dual-core Opteron.
|
Electrically CX4 and IB 4X are quite similar and the cables are also
similar.
| Quote: |
200Gflops peak, 15GB/sec bisection bandwidth at about 2us latency,
change from $100k; back to the minisupercomputer days even before
adjusting for inflation, but with a product that the Cray of the
minisupercomputer days would have killed for.
Tom
|
--
Del Cecchi
“This post is my own and doesn’t necessarily represent IBM’s positions,
strategies or opinions.” |
|
Greg Lindahl
Guest
|
Posted:
Thu Oct 06, 2005 5:23 am Post subject:
Re: How to build an infiniband cluster |
|
|
In article <1128556985.626910.234230@g49g2000cwa.googlegroups.com>,
<long_term_capital@yahoo.com> wrote:
| Quote: | My bad; I was looking at a different number (the PTRANS), which is
10.365 for Voltaire and 6.719 for Pathscale, but as I said, I don't
really know what the number means.
|
PTrans is one of the ones that scales up as you add more cpus. And the
PathScale number is with 1/4 as many cpus. Our 128 cpu number (not yet
published) is 11.75 GB/s.
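One rough way to compare the two published PTRANS figures is to normalise by CPU count. The sketch below assumes PTRANS grows linearly with the number of CPUs, which this thread does not claim (it says only that the number scales up), so treat the result as indicative at best. The two bandwidth figures and the 4:1 CPU ratio are from the thread; the function name is hypothetical.

```python
VOLTAIRE_PTRANS_GBS = 10.365   # published figure quoted above
PATHSCALE_PTRANS_GBS = 6.719   # published figure quoted above
CPU_RATIO = 4                  # the Voltaire run used 4x as many CPUs


def per_cpu_advantage(small_run_gbs: float, large_run_gbs: float, cpu_ratio: float) -> float:
    """Ratio of per-CPU PTRANS rates, ASSUMING linear scaling with CPU count."""
    if cpu_ratio <= 0:
        raise ValueError("cpu_ratio must be positive")
    # Scale the larger run down to the smaller run's CPU count, then compare.
    return small_run_gbs / (large_run_gbs / cpu_ratio)


if __name__ == "__main__":
    ratio = per_cpu_advantage(PATHSCALE_PTRANS_GBS, VOLTAIRE_PTRANS_GBS, CPU_RATIO)
    print(f"PathScale per-CPU PTRANS is about {ratio:.2f}x the Voltaire figure")
```

On that assumption the PathScale run moves roughly 2.6 times as much data per CPU as the Voltaire run.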
-- greg
(working for, not speaking for, PathScale.) |
|
Wes Felter
Guest
|
|
Del Cecchi
Guest
|
|
Thomas Womack
Guest
|
Posted:
Fri Oct 07, 2005 5:48 pm Post subject:
Re: How to build an infiniband cluster |
|
|
In article <2005100620570216807%wesley@felterorg>,
Wes Felter <wesley@felter.org> wrote:
They've got a nice-looking 9-port 12x switch in 1U for $8500, and a
PCI-X 64bit/133MHz dual 4x HCA for $1000, with Solaris drivers for
Sparc and x86. This means you need 4x->12x cables for everything, a
snip at $150 per two metres if you buy from Sun, but you've got only
one model of switch for all the levels of your interconnect tree.
Does anyone produce 12x HCAs? 30Gbit/sec in each direction is quite a
lot of bandwidth, it doesn't _quite_ saturate PCI-E 16x but I'd be
impressed at the chipset that could get close to peak.
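The comparison between 12x InfiniBand and PCI-E 16x can be checked with a short calculation. This sketch assumes single-data-rate InfiniBand and first-generation PCI Express, both of which signal at 2.5 Gbit/s per lane with 8b/10b encoding; it ignores packet and protocol overhead, and the function name is made up for illustration.

```python
LANE_RATE_GBPS = 2.5           # per-lane signalling rate, IB SDR and PCIe 1.x
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b: 8 data bits carried in every 10 line bits


def link_rates(lanes: int) -> tuple:
    """Return (raw, data) rates in Gbit/s per direction for a link of N lanes."""
    raw = lanes * LANE_RATE_GBPS
    return raw, raw * ENCODING_EFFICIENCY


if __name__ == "__main__":
    for name, lanes in (("InfiniBand 12x", 12), ("PCI-E 16x", 16)):
        raw, data = link_rates(lanes)
        print(f"{name}: {raw:.0f} Gbit/s raw, {data:.0f} Gbit/s data, per direction")
```

The 30 Gbit/s figure is the raw signalling rate; after encoding, a 12x link carries 24 Gbit/s of data per direction against 32 Gbit/s for the 16-lane slot, which is why it does not quite saturate it.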
[Looking at Mellanox's page, is Infiniband DDR a 'real' Infiniband
standard with interoperable implementations of which only Mellanox's
is available today, or Mellanox's own way of running the serdes twice
as fast over the same cabling infrastructure?]
Tom |
|
| Back to top |
|
 |
|
|
|
|