TLP-oriented design and HPC
Guest






Posted: Mon Dec 26, 2005 9:15 am    Post subject: TLP-oriented design and HPC

Would a TLP-oriented design be able to attract HPC customers away
from common ILP-oriented designs?

Obviously a design like Sun's Niagara (T1000 and T2000) would
not be appropriate; however, by avoiding the overhead of OoO
execution, heavily ported register files, complex multiple result
forwarding, etc. it should be possible to increase the proportion
of the die area used for functional units (thereby increasing the
potential throughput per chip) and decrease power consumption
(thereby further increasing potential compute density). It seems
that some of the problems that benefit from Tera-style
multithreading would be suitable for a system composed of chips
with many small and simple cores. However, it is not clear
whether (even with a shared on-chip level of caching)
communication overhead would overcome the benefits from more
FPUs. The success of Blue Gene hints that smaller, simpler cores
can be useful for HPC; however, Blue Gene uses paired value FPUs
to substantially increase per-core compute potential. (It might
be interesting to compare performance for various HPC workloads
between an in-order ILP-oriented design like current IPF systems
and an in-order TLP-oriented design like Piranha [an 8-core
scalar Alpha design].)

(While the HPC market is not especially large, it might be more
receptive to more exotic designs than the OLTP server market or
even the web/mail server market, allowing an initial product to
more easily break even. Obviously Sun feels its brand and good
benchmark results can sell a peculiar design to enough server
customers to make a profit without needing to appeal to the HPC
market.)

If HPC customers are appropriately viewed as a significant part
of the target market for a TLP-oriented design (say 20% of
sales), what would be the characteristics of such a design in
order to provide superior HPC performance per monetary unit and
per unit volume without excessively sacrificing performance for
the other 80% of the targeted market?

Limited two-wide issue might provide substantially improved
performance with minimal area increase (e.g., adding one FPR
read/write port would allow any non-dependent combination of one
GPR-using instruction [including FPR load and store] and one
FPR-using instruction; the dependency check logic could be shared
if each Icache had an additional predecode bit per aligned
instruction pair indicating if LIW-like operation is possible).
Even supporting somewhat more aggressive two-wide issue might not
cost much for a significant performance increase (30%??) (e.g.,
separating the LSU, ALU, and branch unit and providing one more
GPR read port and one more GPR write port could allow substantial
flexibility). (It might not even be especially expensive to
provide simultaneous issue of a load and a 64-bit store or
prefetch instruction [i.e., a memory operation that does not
require reading, assuming ECC is over 64-bit words]. Having a
separate stack cache might be worthwhile--not only increasing
potential issue width at modest cost but also increasing
effective associativity of the data cache.) Providing SMT would
make more general two-wide issue attractive by increasing
utilization.
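
A minimal sketch of the predecode-bit idea above, assuming a hypothetical decoded-instruction record (`Insn`, with `unit`, `dests`, and `srcs` fields) that is not part of the original post. The bit for an aligned pair is set only when one instruction goes to the GPR side (integer ops and all loads/stores, including FPR loads and stores) and the other to the FPR side, and the second neither reads nor overwrites a result of the first. A real design would also have to consider condition codes, exceptions, and the port limits discussed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insn:
    # Hypothetical decoded form. Registers are (file, number) pairs,
    # e.g. ("r", 2) for a GPR or ("f", 1) for an FPR.
    unit: str            # "gpr": integer op or any load/store; "fpr": FP arithmetic
    dests: tuple = ()    # registers written
    srcs: tuple = ()     # registers read

def pair_bit(first: Insn, second: Insn) -> bool:
    """Predecode bit for an aligned pair: True if both may issue together."""
    # Exactly one instruction must use each side of the machine.
    if {first.unit, second.unit} != {"gpr", "fpr"}:
        return False
    # Read-after-write: the second instruction needs a result of the first.
    if set(first.dests) & set(second.srcs):
        return False
    # Write-after-write: both instructions target the same register.
    if set(first.dests) & set(second.dests):
        return False
    # Write-after-read is allowed here, assuming operands are read
    # before results are written when both issue in the same cycle.
    return True

# FP load (address in r2) paired with an independent FP add: can pair.
fld = Insn("gpr", dests=(("f", 1),), srcs=(("r", 2),))
fadd = Insn("fpr", dests=(("f", 3),), srcs=(("f", 4), ("f", 5)))
# FP add that consumes the loaded value: cannot pair with the load.
fadd_dep = Insn("fpr", dests=(("f", 3),), srcs=(("f", 1), ("f", 5)))
# Two integer ops: both need the GPR side, so they cannot pair.
add = Insn("gpr", dests=(("r", 1),), srcs=(("r", 2), ("r", 3)))

print(pair_bit(fld, fadd), pair_bit(fld, fadd_dep), pair_bit(add, add))
```

Because the check depends only on the two instruction encodings, it can be computed once at Icache fill time and stored as a single bit, which is what lets the issue-time dependency logic be shared.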

Unfortunately, it becomes very easy to sacrifice area (and
therefore the number of cores) for increased per-core
performance. However, I am guessing that moderately aggressive
multiple issue (branch, ALU, load/store, load/store stack, and
FP/SIMD op; two-wide fetch) with SMT would provide greater than
50% throughput increase for less than 50% core size increase
(with a greater than 50% increase in design complexity :-\).
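
The area/throughput trade-off in this guess can be written down as simple arithmetic. The sketch below assumes (my simplification, not a claim from the post) that a fixed die area is spent on cores, so the core count scales as 1/core area, and that chip throughput is core count times per-core throughput; caches, interconnect, and memory bandwidth are ignored.

```python
def relative_chip_throughput(per_core_gain: float, core_area_growth: float) -> float:
    """Chip throughput relative to the baseline design.

    per_core_gain:    fractional per-core throughput increase (0.5 means +50%)
    core_area_growth: fractional core area increase (0.5 means +50%)
    Assumes a fixed core-area budget, so core count scales as 1 / core area.
    """
    return (1.0 + per_core_gain) / (1.0 + core_area_growth)

# Exactly +50% throughput for +50% area is break-even at the chip level.
print(relative_chip_throughput(0.5, 0.5))
# "Greater than 50%" for "less than 50%" comes out ahead, e.g. +60% for +40%.
print(relative_chip_throughput(0.6, 0.4))
```

Under these assumptions the wider core only pays off at the chip level when the ratio exceeds 1, which is why the guess is phrased as more than 50% gain for less than 50% area.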

Using 128-bit registers for FP and integer SIMD operations,
separating them into high and low banks, and sharing the SIMD
functional units between two cores (where low bank for core A
would be the high bank for core B) could allow one SIMD operation
from either core or one 64-bit wide operation from both cores to
issue simultaneously. Widening the data cache interface might be
a little expensive, but could provide single-cycle load/store GPR
pair operations as well as load/store SIMD register operations (i.e., it is
not only an HPC improvement). Providing 32-byte-wide loads and
stores between 'aligned' 64-bit register quads (or 'aligned'
128-bit register pairs) and a small fully associative write
buffer/prefetch cache might be a relatively inexpensive way to
increase potential cache<->register bandwidth; even some server
code could probably use this form of increased bandwidth.

Using FP/SIMD register space to store most of a small-context
thread's processor state might allow switching among such
small-context threads to be relatively rapid (perhaps 8 cycles?)
without greatly increasing core area (only adding a few registers
to hold additional state). Even 16-cycle fully-dead-time context
switches might be acceptable for many uses.

By avoiding register renaming and implementing SMT, some
interesting register file banking possibilities exist. E.g.,
even register quads can be somewhat easily read/written and one
could guarantee that instructions from different threads would
not have bank conflicts. (Of course, SMT makes code optimization
more complex. Result latencies can be halved with two-way SMT,
but cache associativity and size can also be effectively halved.)

(An interesting note on Piranha: the Icaches were snooped, and
coherence was provided as an artifact of using the same basic
design for the Icache and the Dcache. This might have a
significant positive effect on Java engines that use JIT
compilation.)


Paul A. Clayton
just a technophile with diarrhea of the keyboard
Guest






Posted: Tue Dec 27, 2005 1:15 am    Post subject: Re: TLP-oriented design and HPC

Anton Ertl wrote:
[snip]
Quote:
If you are thinking about general-purpose CPUs with TLP (like the Xbox
360 CPU or Sun's upcoming Rock), one problem these might have for HPC
is that many HPC applications require high memory bandwidth; TLP may
not necessarily buy anything for such applications, because the
available memory bandwidth is already used up by a single thread
(remember the Power4 systems for HPC where only one core was active
per chip?).

Here a TLP-oriented design could have a modest advantage.
Since a TLP-oriented design is likely to be more energy
efficient, more of its pin count can go to I/O and less to power
and ground. Also a TLP-oriented design will emphasize
memory bandwidth more than an ILP-oriented design (which
would tend to emphasize latency), so it might happen to be a
better fit for HPC.

(Energy efficiency can also improve cluster density.)

Obviously an embedded chip like that of the Xbox would
not support the necessary memory capacity. I am speculating
that Niagara2 might be a suitable system.

[snip]
Quote:
So, if TLP helps, it's for applications that are well parallelizable
(many HPC applications fit this) and don't require much memory
bandwidth per thread (I guess there are HPC applications in that
category, too). Combined, these restrictions might be too strong to
make general-purpose TLP CPUs a big seller in the HPC market, but we
will see.

We will see only if someone takes the risk to develop suitable
systems.

ISTM that an important question in this matter is whether the
FPUs can be kept busy with an instruction fetch bandwidth
that is suited to aggressively TLP-oriented designs (which
tend to err on the side of too many cores/contexts rather
than too much performance per core/context). (Obviously
the memory hierarchy is important as well, but I suspect
operation fetch bandwidth is more critical.)

Assuming a TLP-oriented system's cores have clock
frequency comparable to an ILP-oriented system, if the
TLP system had eight FPUs per chip and the ILP system
had two cores per chip each with two FPUs, then the
TLP system's FPUs would only need 75% of the
occupancy of the ILP system's FPUs to achieve 50%
better performance (assuming the problem is
computation limited rather than bandwidth limited).
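
The arithmetic behind the 75% figure, as a small sketch. It assumes equal clock frequency and a compute-limited problem, as stated above; the function name is just for illustration.

```python
def relative_fp_throughput(tlp_fpus: int, ilp_fpus: int, occupancy_ratio: float) -> float:
    """TLP chip FP throughput relative to the ILP chip.

    occupancy_ratio is TLP FPU occupancy divided by ILP FPU occupancy.
    Assumes equal clock frequency and a compute-limited workload.
    """
    return (tlp_fpus * occupancy_ratio) / ilp_fpus

ilp_fpus = 2 * 2   # two cores per chip, two FPUs per core
tlp_fpus = 8       # eight FPUs per chip

# At 75% of the ILP design's occupancy: 8 * 0.75 / 4 = 1.5, i.e. 50% better.
print(relative_fp_throughput(tlp_fpus, ilp_fpus, 0.75))
# Break-even occupancy ratio is ilp_fpus / tlp_fpus = 0.5.
print(relative_fp_throughput(tlp_fpus, ilp_fpus, 0.5))
```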

ISTM that SIMD operation support alone would not
be enough to gain sufficient occupancy; two-wide
instruction fetch seems likely to be necessary. So,
would a superscalar design still be area- and energy-
efficient enough to be attractive for a TLP-oriented
chip? I suspect it would be--if at least two-way SMT
was supported.


Paul A. Clayton
just a technophile
Anton Ertl
Guest





Posted: Tue Dec 27, 2005 1:15 am    Post subject: Re: TLP-oriented design and HPC

Dysthymicdolt@aol.com writes:
Quote:
Would a TLP-oriented design be able to attract HPC customers away
from common ILP-oriented designs?

Tera thought so.

If you are thinking about general-purpose CPUs with TLP (like the Xbox
360 CPU or Sun's upcoming Rock), one problem these might have for HPC
is that many HPC applications require high memory bandwidth; TLP may
not necessarily buy anything for such applications, because the
available memory bandwidth is already used up by a single thread
(remember the Power4 systems for HPC where only one core was active
per chip?).

Ok, without OoO execution, the single thread may leave the memory
system idle for short periods of time, and having another thread would
utilize this memory bandwidth, but I doubt that further threads help
much for such applications.

So, if TLP helps, it's for applications that are well parallelizable
(many HPC applications fit this) and don't require much memory
bandwidth per thread (I guess there are HPC applications in that
category, too). Combined, these restrictions might be too strong to
make general-purpose TLP CPUs a big seller in the HPC market, but we
will see.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
David Kanter
Guest





Posted: Wed Dec 28, 2005 1:15 am    Post subject: Re: TLP-oriented design and HPC

Iain McClatchie wrote:
Quote:
Paul> Would a TLP-oriented design be able to attract HPC customers away
Paul> from common ILP-oriented designs?

As Anton says, the problem is memory bandwidth.

The solution to the memory bandwidth problem is GPU-style memories:
572-bit, 8- or 16-channel GDDR memory systems, but using 4 bits or
maybe even 1 bit per DRAM rather than 32. This can get the basic
ASIC+DRAM+buffers cluster to 16-64 GB of DRAM and 40 GB/s
peak bandwidth with existing technology. (Mmmm... subject to the
availability of 4 or 1 bit wide GDDR3 parts, which might be as trivial
as a bond-out change for the memory manufacturers.)

We have to leave behind the concept of DIMM sockets and
post-production configurable memories. The unit we buy from AMD
(and later, Intel) will be like a graphics card, except that it will
likely have "fins" with DRAM on them, like DIMMs but without the
air-blockage and impedance mismatch of the DIMM socket.

The first question is, how many HPC systems really use configurable
memory? Do they really tend to add memory later, or do they just toss
the whole motherboard, keeping select parts? IIRC SGI actually solders
their memory in, because the DIMM sockets cause too many failures when
you have TB.

Quote:
Well, the HPC version will have fins. The consumer version with 1-2
GB will just have surface-mounted DRAMs.

I suspect HPC will adopt this way before consumers or server folks do.
ISTM the latter two markets upgrade a lot, whereas the former performs
more "cart the entire thing out" upgrades.

Quote:
Beyond that, memory will have to be very nonuniform, as the PU
clusters will be linked by high-speed interconnects. The Xbox 360
folks are showing the way with their 10.8 GB/s each direction link
on the GPU. An HPC ASIC in today's technology might manage
two of those.

HPC guys might be willing to deal with that, but I have a hard time
seeing desktop stuff becoming too non-uniform. We just don't have
enough smart programmers to satisfy the demand for programmers.

Quote:
Thread scheduling and dispatch are going to become more GPU-like,
with ganged instruction fetch. I can see datapaths which
execute 64 threads simultaneously on the same code path. An
"IF" branch on the code path causes the fetch system to fetch
and distribute both paths, and each of the 64 clusters grabs ops
from the appropriate stream depending on which way the local
branch went.

This sounds like predication but instead of getting more ILP, it gives
you TLP; can you explain how it differs?

To me it sounds like you are fetching instructions from both sides,
which reduces your miss penalty, but eats bandwidth like mad. I also
detect an implication that perhaps the system should be using a trace
processor or multiscalar (I'm familiar with the former, not the
latter).

Quote:
This kind and size of datapath is shipping today.
The challenge is to figure out how to marry it to HPC codes
without disabling changes to either.

Indeed...

DK
Eugene Miya
Guest





Posted: Wed Dec 28, 2005 1:15 am    Post subject: Re: TLP-oriented design and HPC

Just passing thru...

In article <1135582492.507874.298300@f14g2000cwb.googlegroups.com>,
<Dysthymicdolt@aol.com> wrote:
Quote:
Would a TLP-oriented design be able to attract HPC customers away
from common ILP-oriented designs?
....

As Anton noted: Tera.
It really depends on the performance improvement it buys the end user.

Quote:
(While the HPC market is not especially large; it might be more
receptive to more exotic designs than the OLTP server market or
....

It really depends on the performance improvement it buys the end user.

Quote:
If HPC customers are appropriately viewed as a significant part
of the target market for a TLP-oriented design (say 20% of
sales), what would be the characteristics of such a design in
order to provide superior HPC performance per monetary unit and
per unit volume without excessively sacrificing performance for
the other 80% of the targeted market?

While updating my biblio I came upon a very nice quote:

the nature of the high performance computing game is
really quite simple -
the market is very small and the programs run for a very long time.
Consequently, the folks working in this area are prepared to
expend an astonishing amount of effort to gain really quite minor improvements
(e.g. a couple of man-weeks to achieve a saving of 1% or less).
The net result is that a parallel programming model which
introduces ANY inefficiencies is doomed to being a commercial failure
(this was true in the 1980s and is still true today).

That was written for the Myrias SPS.

Quote:
Unfortunately, it becomes very easy to sacrifice area (and
therefore the number of cores) for increased per-core performance.

Once down this path:
Then this is not HPC.

Quote:
just a technophile with diarrhea of the keyboard

That's OK.
It's a complex topic.

--
Dr. Adrian Wrigley
Guest





Posted: Wed Dec 28, 2005 1:15 am    Post subject: Re: TLP-oriented design and HPC

On Tue, 27 Dec 2005 12:24:37 -0800, Iain McClatchie wrote:

....
Quote:
Thread scheduling and dispatch are going to become more GPU-like,
with ganged instruction fetch. I can see datapaths which
execute 64 threads simultaneously on the same code path. An
"IF" branch on the code path causes the fetch system to fetch
and distribute both paths, and each of the 64 clusters grabs ops
from the appropriate stream depending on which way the local
branch went. This kind and size of datapath is shipping today.
The challenge is to figure out how to marry it to HPC codes
without disabling changes to either.

Sounds interesting...
Can you give some examples/links please? How are they programmed?

Thanks.
--
Adrian
Iain McClatchie
Guest





Posted: Wed Dec 28, 2005 1:15 am    Post subject: Re: TLP-oriented design and HPC

Paul> Would a TLP-oriented design be able to attract HPC customers away
Paul> from common ILP-oriented designs?

As Anton says, the problem is memory bandwidth.

The solution to the memory bandwidth problem is GPU-style memories:
572-bit, 8- or 16-channel GDDR memory systems, but using 4 bits or
maybe even 1 bit per DRAM rather than 32. This can get the basic
ASIC+DRAM+buffers cluster to 16-64 GB of DRAM and 40 GB/s
peak bandwidth with existing technology. (Mmmm... subject to the
availability of 4 or 1 bit wide GDDR3 parts, which might be as trivial
as a bond-out change for the memory manufacturers.)
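
A back-of-the-envelope check of these figures, as a hedged sketch. Two things here are assumptions rather than statements from the post: the 2 GB baseline for a bus populated with 32-bit-wide parts (borrowed from the 1-2 GB consumer configuration mentioned in this post), and that narrower parts have the same per-device density. The bus width and peak bandwidth are taken at face value as written.

```python
def capacity_gb(baseline_gb: float, baseline_bits_per_dram: int, bits_per_dram: int) -> float:
    """Capacity when the same bus is populated with narrower DRAMs.

    Assumes every device has the same density, so capacity scales with
    the number of devices needed to fill the bus width.
    """
    return baseline_gb * baseline_bits_per_dram / bits_per_dram

def implied_transfer_rate(peak_bytes_per_s: float, bus_width_bits: int) -> float:
    """Transfers per second on each data pin needed to reach the peak bandwidth."""
    return peak_bytes_per_s * 8 / bus_width_bits

# Assumed 2 GB baseline with 32-bit-wide parts.
print(capacity_gb(2, 32, 4))   # 4 bits per DRAM: 8x the devices
print(capacity_gb(2, 32, 1))   # 1 bit per DRAM: 32x the devices

# Per-pin rate implied by 40 GB/s over the stated bus width.
print(implied_transfer_rate(40e9, 572) / 1e6, "MT/s per pin")
```

With the assumed 2 GB baseline, the 4-bit and 1-bit cases land on 16 GB and 64 GB, which matches the range quoted above.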

We have to leave behind the concept of DIMM sockets and
post-production configurable memories. The unit we buy from AMD
(and later, Intel) will be like a graphics card, except that it will
likely have "fins" with DRAM on them, like DIMMs but without the
air-blockage and impedance mismatch of the DIMM socket.

Well, the HPC version will have fins. The consumer version with 1-2
GB will just have surface-mounted DRAMs.

Beyond that, memory will have to be very nonuniform, as the PU
clusters will be linked by high-speed interconnects. The Xbox 360
folks are showing the way with their 10.8 GB/s each direction link
on the GPU. An HPC ASIC in today's technology might manage
two of those.

Thread scheduling and dispatch are going to become more GPU-like,
with ganged instruction fetch. I can see datapaths which
execute 64 threads simultaneously on the same code path. An
"IF" branch on the code path causes the fetch system to fetch
and distribute both paths, and each of the 64 clusters grabs ops
from the appropriate stream depending on which way the local
branch went. This kind and size of datapath is shipping today.
The challenge is to figure out how to marry it to HPC codes
without disabling changes to either.
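
A toy model of ganged fetch across lanes, to make the mechanism concrete. This is a sketch under my own assumptions, not a description of any shipping GPU: each "op" is a plain Python function, the lane count is arbitrary, and skipping a path that no lane needs is an optimization I added. The per-lane branch outcome acts as a mask, so each broadcast op only takes effect in the lanes on that side of the branch.

```python
def run_ganged_if(values, cond, then_ops, else_ops):
    """Run `if cond(x): then_ops else: else_ops` on all lanes in lockstep.

    Returns the per-lane results and the number of ops the shared
    fetch unit had to broadcast.
    """
    taken = [cond(v) for v in values]   # each lane resolves its branch locally
    out = list(values)
    fetched = 0
    # The fetch unit streams each path once; lanes pick ops from their own side.
    for side, ops in ((True, then_ops), (False, else_ops)):
        if side not in taken:
            continue                    # assumed optimization: no lane needs this path
        for op in ops:
            fetched += 1
            out = [op(x) if t == side else x for x, t in zip(out, taken)]
    return out, fetched

then_ops = [lambda x: x * 10]
else_ops = [lambda x: x + 1, lambda x: x + 1]
is_even = lambda x: x % 2 == 0

# Divergent lanes: both paths are fetched (3 ops), each lane applies only its own.
print(run_ganged_if([1, 2, 3, 4], is_even, then_ops, else_ops))
# Uniform lanes: only the taken path is fetched (1 op).
print(run_ganged_if([2, 4], is_even, then_ops, else_ops))
```

The cost shows up in `fetched`: divergent branches make every lane wait through both paths, so the scheme pays off mainly when the lanes usually agree.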
All times are GMT