What do you think of Sun's Niagara
Nick Maclaren
Posted: Mon Sep 26, 2005 2:40 pm    Post subject: Re: What do you think of Sun's Niagara

In article <dh8agm$2m2$1@gemini.csx.cam.ac.uk>,
nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> In article <dh70cc$11jg$1@ns.felk.cvut.cz>,
|> Milos Becvar <becvarm@fel.cvut.cz> wrote:
|> >
|> >There is relatively detailed description of Niagara architecture
|> >in March-April 2005 issue of IEEE Micro.
|>
|> Thanks for the reminder. It wasn't in when I last looked. I must
|> take another look.

May/June, actually.

Unfortunately, it didn't tell me much that I didn't know. The
figures would have been more interesting if they had had some
information on what they represented, though the ones on the
"cache enlargement" were OK.

I am interested in much more technical information, such as
exactly what the NT bit is associated with. For example,
consider code like the following:

p = (p != NULL ? *p : *def1);
p = (p != NULL ? *p : *def2);
p = (p != NULL ? *p : *def3);
p = (p != NULL ? *p : *def4);
p = (p != NULL ? *p : *def5);

Does this require the compiler to use different registers for
each p in order to get advantage from scouting?

What is the effect on the cache and TLB, and does the system
have a mechanism to prevent dynamic deadly embraces?


Regards,
Nick Maclaren.

David Kanter
Posted: Tue Sep 27, 2005 6:07 am    Post subject: Re: What do you think of Sun's Niagara

Quote:
There is relatively detailed description of Niagara architecture
in March-April 2005 issue of IEEE Micro.

There was some interesting stuff in the May-June issue by
Chaudhry et al. of Sun, under "High-Performance Throughput Computing"
as well. The "hardware scout" aimed at improving single
threaded performance seemed an intriguing alternative to
a lot of the complexity in other processors.

Hardware scout is just another term for OOO multithreading, and it's
not precisely new. There's a bunch of variants of this idea, starting
with full blown OOO fetch on a trace processor, dynamic multithreading
(Intel's Akkary), skip-ahead multithreading (Andy Glew proposed this
for the K10 and I think it's pretty much the same as hardware scout).

The problem with hardware scout IMHO is that you are wasting
computation to ensure future locality. You are much better off doing
the computation and keeping the results if they are valid, then merging
them in the instruction stream.

The OOO fetch sequencer idea is to start fetch,issue and execute at a
likely (re) entry point in the instruction stream ahead of time. For
example, the authors noted the destination of a return call as an ideal
place to start the fetch. You then execute along that path and are
able to prematurely trigger any loads that would miss in cache. Not
only that, but you also get the execution for "free", so you really are
able to take full advantage of everything you are doing, rather than
just the memory benefits. It is unclear to me whether the extra
work is worth the performance gain, but it is an interesting option.

David

Nick Maclaren
Posted: Tue Sep 27, 2005 8:15 am    Post subject: Re: What do you think of Sun's Niagara

In article <1127783244.283249.177240@g43g2000cwa.googlegroups.com>,
David Kanter <dkanter@gmail.com> wrote:
Quote:

Hardware scout is just another term for OOO multithreading, and it's
not precisely new. There's a bunch of variants of this idea, starting
with full blown OOO fetch on a trace processor, dynamic multithreading
(Intel's Akkary), skip-ahead multithreading (Andy Glew proposed this
for the K10 and I think it's pretty much the same as hardware scout).

No, it's not. You are correct that it has similarities, and that it's
not new - I was discussing it some 20 years back as a possible way of
tackling this problem, and it had apparently been thought of a long
time before.

Quote:
The problem with hardware scout IMHO is that you are wasting
computation to ensure future locality. You are much better off doing
the computation and keeping the results if they are valid, then merging
them in the instruction stream.

Er, no. Depending on the details of the design, there is a lot you
can do using hardware scouting to speed it up and cut costs. For
example, you could optimise all floating-point operations to use
an accuracy of 4 bits :-)


Regards,
Nick Maclaren.

David Kanter
Posted: Tue Sep 27, 2005 1:18 pm    Post subject: Re: What do you think of Sun's Niagara

Nick Maclaren wrote:
Quote:
In article <1127783244.283249.177240@g43g2000cwa.googlegroups.com>,
David Kanter <dkanter@gmail.com> wrote:

Hardware scout is just another term for OOO multithreading, and it's
not precisely new. There's a bunch of variants of this idea, starting
with full blown OOO fetch on a trace processor, dynamic multithreading
(Intel's Akkary), skip-ahead multithreading (Andy Glew proposed this
for the K10 and I think it's pretty much the same as hardware scout).

No, it's not. You are correct that it has similarities, and that it's
not new - I was discussing it some 20 years back as a possible way of
tackling this problem, and it had apparently been thought of a long
time before.

Allow me to rephrase and clarify. Hardware scout is Sun's version of
using multithreading to prefetch loads and warm up the caches. The
work done in 'scout' mode cannot be committed/retired, which strikes me
as...well, stupid. "OOO Multithreading" was a little vague and could
have referred to something like the EV8 or P4 or POWER5. What I really
meant was MT that is done OOO WRT a single fetch stream.

Sorry about that, I should have been much more clear.

Quote:
The problem with hardware scout IMHO is that you are wasting
computation to ensure future locality. You are much better off doing
the computation and keeping the results if they are valid, then merging
them in the instruction stream.

Er, no. Depending on the details of the design, there is a lot you
can do using hardware scouting to speed it up and cut costs. For
example, you could optimise all floating-point operations to use
an accuracy of 4 bits :-)

Heh.

Hardware Scout as far as Sun is pitching it exists solely to prefetch
and improve branch prediction. Both are noble causes, but quite
frankly, I think actually retiring instructions is better.

One idea which I happen to think is better would be a combination of
two ideas:
http://www.princeton.edu/~rblee/ELE572Papers/DynamicMultithreadingProc_akkary.pdf
http://citeseer.ist.psu.edu/cache/papers/cs/27176/http:zSzzSzwww.cs.wisc.eduzSz~paramzSzpaperszSzisca03.pdf/oberoi03parallelism.pdf

Now, I don't necessarily know that this is the best idea. In fact, it
likely isn't. But I think it's a hell of a lot better than Hardware
Scout (as Sun has portrayed it)...

David

Nick Maclaren
Posted: Tue Sep 27, 2005 2:07 pm    Post subject: Re: What do you think of Sun's Niagara

In article <1127809087.974146.60210@g49g2000cwa.googlegroups.com>,
"David Kanter" <dkanter@gmail.com> writes:
|>
|> Allow me to rephrase and clarify. Hardware scout is Sun's version of
|> using multithreading to prefetch loads and warm up the caches. The
|> work done in 'scout' mode cannot be committed/retired, which strikes me
|> as...well, stupid. "OOO Multithreading" was a little vague and could
|> have referred to something like the EV8 or P4 or POWER5. What I really
|> meant was MT that is done OOO WRT a single fetch stream.

Hmm. Maybe I am being stupid, but I haven't been able to convince
myself that either 'scout' mode can be committed/retired or that
it cannot. That is one of the aspects where I have tried to get
more detailed information and have so far failed.

See my posting about the effects on cache and TLB and the possibility
of dynamic deadlock.


Regards,
Nick Maclaren.

Stefan Monnier
Posted: Fri Sep 30, 2005 12:15 am    Post subject: Re: What do you think of Sun's Niagara

Quote:
Hardware Scout as far as Sun is pitching it exists solely to prefetch
and improve branch prediction. Both are noble causes, but quite
frankly, I think actually retiring instructions is better.

The problem with actually retiring the instructions is that in order to do
that, you need to keep track of all the pending conditions that need to
become true for the commit to be correct. If the scout thread is far ahead,
that can mean a lot of information. In practice that info can also be
compressed (when control flow joins, the previous branch condition becomes
irrelevant, for example), but it can be very costly/difficult for the
hardware to figure that out.

So committing doesn't come for free. And we don't know well how to do it.

In comparison, running the scout thread with full precision but without
committing might waste work, but it's a known factor and we know how to
do it.

It's far from obvious that retiring would be more efficient.


Stefan

Oliver S.
Posted: Fri Sep 30, 2005 5:31 am    Post subject: Re: What do you think of Sun's Niagara

Quote:
The problem with actually retiring the instructions is that in
order to do that, you need to keep track of all the pending
conditions that need to become true for the commit to be correct.

I suppose the hw-scout works quite simply. It just executes
one thread's instruction stream ahead in another thread without
doing any stores to memory. To ensure that this doesn't take processor
cycles away from any other thread, a scout thread has idle priority
(of course an SMT core wouldn't implement priorities the way an OS does).
It seems pretty easy to me to use unused threads on a core for this
purpose by copying another thread's state to them (an
interesting question is which thread is most eligible to be
scouted when three threads run on a four-thread core).
A simple memcpy loop looks ideal for scouting, and scouting would eliminate
the need for hardware or software prefetching (although this need is
mitigated to a large extent by SMT, which turns the latency
problem into a bandwidth problem). But it is unclear to me how a
scout thread could detect stop conditions.

Oliver S.
Posted: Sun Oct 02, 2005 12:15 am    Post subject: Re: What do you think of Sun's Niagara

Quote:
It just executes ahead the instruction-stream of one thread
in another thread without doing any stores to memory.

And of course it doesn't wait for any loads, but instead issues those
loads to the load queue to do some prefetching for the scouted thread
and carries on! Otherwise there would be no benefit, because the scout
thread would suffer the same load stalls as the copied thread. And
realising that gave me the idea of how the whole thing, including
the stop conditions, works! It's absolutely simple:
The core copies a thread's state to a scout thread and executes this
thread without waiting for memory loads or performing stores; when the
core sees a load to a register, it simply sets a flag for that register
to record that the register has unknown content. Whenever following
instructions operate on that register, the register's state
remains unknown. And the ultimate stop state is a branch
instruction that depends on the contents of an unknown register.
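
A minimal sketch of the scheme described above, for a made-up five-instruction register machine. Everything here (Insn, Scout, scout_run, the opcode set, the 1000-instruction budget) is hypothetical and is not taken from Sun's design. It only shows the bookkeeping: a load is issued as a prefetch when its address register is known, its destination is flagged unknown, the flag propagates through arithmetic, stores are dropped, and scouting stops at the first branch that depends on an unknown register. One further assumption: overwriting a flagged register with a result computed from known registers clears the flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_REGS = 8, MAX_PREFETCH = 16 };

/* Hypothetical toy ISA, just large enough to show the idea. */
typedef enum { OP_LOAD, OP_ADD, OP_STORE, OP_BNZ, OP_HALT } Opcode;

typedef struct {
    Opcode op;
    int rd;      /* destination register (LOAD, ADD) */
    int rs1;     /* address reg (LOAD, STORE), first source (ADD), tested reg (BNZ) */
    int rs2;     /* second source (ADD), value reg (STORE) */
    int target;  /* instruction index to jump to when a BNZ is taken */
} Insn;

typedef struct {
    uint64_t value[NUM_REGS];       /* only meaningful when !unknown[r] */
    bool     unknown[NUM_REGS];     /* the per-register "unknown content" flag */
    uint64_t prefetch[MAX_PREFETCH];/* addresses handed to the load queue */
    size_t   num_prefetch;
    size_t   stop_pc;               /* where scouting gave up */
} Scout;

/* Run ahead of a stalled thread, starting from a copy of its registers. */
void scout_run(Scout *s, const Insn *prog, size_t len, const uint64_t *regs)
{
    for (int r = 0; r < NUM_REGS; r++) {
        s->value[r] = regs[r];
        s->unknown[r] = false;
    }
    s->num_prefetch = 0;

    size_t pc = 0;
    size_t budget = 1000;  /* assumed cap so a scouted loop cannot run forever */
    while (pc < len && budget-- > 0) {
        const Insn *i = &prog[pc];
        switch (i->op) {
        case OP_LOAD:
            /* Issue the load as a prefetch if the address is known; never wait. */
            if (!s->unknown[i->rs1] && s->num_prefetch < MAX_PREFETCH)
                s->prefetch[s->num_prefetch++] = s->value[i->rs1];
            s->unknown[i->rd] = true;
            pc++;
            break;
        case OP_ADD:
            /* The flag propagates: unknown in, unknown out. */
            s->unknown[i->rd] = s->unknown[i->rs1] || s->unknown[i->rs2];
            if (!s->unknown[i->rd])
                s->value[i->rd] = s->value[i->rs1] + s->value[i->rs2];
            pc++;
            break;
        case OP_STORE:
            pc++;  /* the scout never writes memory */
            break;
        case OP_BNZ:
            if (s->unknown[i->rs1]) {  /* stop condition */
                s->stop_pc = pc;
                return;
            }
            pc = (s->value[i->rs1] != 0) ? (size_t)i->target : pc + 1;
            break;
        case OP_HALT:
            s->stop_pc = pc;
            return;
        }
    }
    s->stop_pc = pc;
}
```

In this model a load whose address register is itself flagged unknown (pointer chasing) produces no prefetch at all, which is where a scout of this simple kind stops being useful.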

Joe Seigh
Posted: Sun Oct 02, 2005 4:15 pm    Post subject: Re: What do you think of Sun's Niagara

Graeme Gill wrote:
Quote:
Milos Becvar wrote:

There is relatively detailed description of Niagara architecture
in March-April 2005 issue of IEEE Micro.

There was some interesting stuff in the May-June issue by
Chaudhry et al. of Sun, under "High-Performance Throughput Computing"
as well. The "hardware scout" aimed at improving single
threaded performance seemed an intriguing alternative to
a lot of the complexity in other processors.


There are three choices (possibly others) when trying to exploit multi-core
CPUs. In increasing order of implementation time, they are: do it all in
hardware (e.g. the hardware scout), exploit it in the OS kernel, or
exploit it in the application software.

The last option is too long term to be of any benefit from the point of
view of a hw vendor introducing a new architecture, so we're not likely
to ever see something like that. So if the real breakthroughs were to
occur at that level, we'll never know. Nobody is going to commit that
many resources for long enough to find out.

Since it looks like most of the advances in using lots of computing
resources are things like the distributed commodity approach of
Google, the best approach for hw vendors, if they want software
exploitation, is to make their multi-core CPUs look like a bunch of
loosely connected cheap processors with non-shared memory. It
may not be the best solution from a technical point of view but it's
probably the best solution from a market-driven point of view.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Iain McClatchie
Posted: Mon Oct 03, 2005 12:15 am    Post subject: Re: What do you think of Sun's Niagara

David,

Scout threads can be a really good idea. Consider:

1) An OoO machine can dissipate maybe 4x the power of an in-order machine
   for the same number of instructions at the same frequency.
2) Most events that cause power dissipation stem from cache misses.
   Branch mispredicts are a biggie: you have to flush the pipe and dump all
   the work you just did. Dcache misses cause pipe bubbles, which
   usually looks like executing an instruction two or more times.
   - In the Sun design, these events still happen, mostly just once, to
     the scout thread. Because the scout thread is more speculative than
     a normal OoO machine, there will be more of these events, and so
     more dissipation as a result. But the underlying cores are more
     efficient.
   - Since the architectural thread doesn't take cache misses and mispredicts
     for the most part, it burns just a small portion of the overall system
     power. It might cost *more* power to save the state to be committed
     from the scout threads than to just recompute it.
3) Having a simple in-order core that goes fast by getting consistently
   lucky is a good verification target.

David Kanter
Posted: Mon Oct 03, 2005 8:15 am    Post subject: Re: What do you think of Sun's Niagara

Iain McClatchie wrote:
Quote:
David,

Scout threads can be a really good idea. Consider:

1) An OoO machine can dissipate maybe 4x the power of an in-order machine
   for the same number of instructions at the same frequency.

Interesting, I was entirely unaware of a quantification of this. Of
course, it's hard to get an apples to apples comparison. I think
perhaps the closest thing would be to compare the US-III core to the
EV6/7 or R10K, but that's very much not an even comparison. Either
way, I think that 4x sounds reasonable.

Quote:
2) Most events that cause power dissipation stem from cache misses.
   Branch mispredicts are a biggie: you have to flush the pipe and dump all
   the work you just did. Dcache misses cause pipe bubbles, which
   usually looks like executing an instruction two or more times.

I don't see any reason why this should be true as a principle rather
than as a temporary sort of issue. IIRC, Intel and others are working
on 'sleep' transistors which will effectively turn off portions of the
chip while they are unused. Unfortunately, I don't know how fast this
can be done, but it seems like if you could get the sleep T's switching
fast enough, then you could simply take a 'nap' during your D$ miss.

Quote:
- In the Sun design, these events still happen, mostly just once, to
  the scout thread. Because the scout thread is more speculative than
  a normal OoO machine, there will be more of these events, and so
  more dissipation as a result. But the underlying cores are more
  efficient.

Sure, but how much of this is "false efficiency" i.e. doing the same
work twice? It seems to me that this is very analogous to the theory
of predication, in that you are essentially relying on
replication/extra work to provide correct results and avoid undesirable
events.

Of course, this depends on whether saving results or recomputing is
quickest/most efficient.

Quote:
- Since the architectural thread doesn't take cache misses and mispredicts
  for the most part, it burns just a small portion of the overall system
  power. It might cost *more* power to save the state to be committed
  from the scout threads than to just recompute it.

That is certainly a possibility, but I guess I'm not really sure of
that. Caches are awfully low-power. I see your point about
essentially partitioning the work into "hard/speculative" and
"easy/retirable", and to some degree that does make sense. I guess I'm
not sure what the marginal cost of saving that data/information is. My
inclination is to say that caching is almost always cheaper than
computation (see Terje's sig).

Here's another thought. Suppose I accept what you are saying (i.e.
partition the work using HW scout), wouldn't we be better off with an
asymmetric thread execution model? In particular, I am thinking it
would make sense to have the scout thread act like a narrow OOO, and
then have the retire thread act like a wide InO. That seems to be the
logical extension of what you are proposing...

Quote:
3) Having a simple in-order core that goes fast by getting consistently
lucky is a good verification target.

That is probably true.


I guess generally, my problem with hardware scout is that in this
particular incarnation, ISTM the baby was thrown out with the
bathwater. Branch prediction should be done since we are right 90+% of
the time, ditto for caching. What you should really be doing is using
HW scout in co-operation with JIT or PGO to focus in on the 10-20% of
cases where you think ugly things will happen. I have not yet seen or
read whether this is the case, or whether HW scout is applied in a more
shotgun-like approach.

DK

Colonel Forbin
Posted: Tue Oct 04, 2005 12:15 am    Post subject: Re: What do you think of Sun's Niagara

In article <1128322023.470611.297860@o13g2000cwo.googlegroups.com>,
David Kanter <dkanter@gmail.com> wrote:
Quote:

That is certainly a possibility, but I guess I'm not really sure of
that. Caches are awfully low-power. I see your point about
essentially partitioning the work into "hard/speculative" and
"easy/retirable", and to some degree that does make sense. I guess I'm
not sure what the marginal cost of saving that data/information is. My
inclination is to say that caching is almost always cheaper than
computation (see Terje's sig).

I wonder if it isn't time to give up all this fingerpointing between
the heavily cached, CPU-too-fast-for-memory crowd and the
huge-number-of-slow-stupid-CPUs (CM-1) crowd.

I propose a benchmark of computational entropy that simply measures
how much power is consumed in getting a particular job done in a given
amount of time, assuming that the job is computationally large and
preferably something of practical value.

The merit figure of a system should be what multiple of the theoretical
minimum computational entropy is consumed in accomplishing the task.

We can actually calculate and measure these things directly, so why
waste so much time bickering over implementation details?
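
One way to make that merit figure concrete, as a rough sketch. It assumes the "theoretical minimum" is the Landauer bound of kT ln 2 joules per erased bit, which is one reading of the proposal rather than something stated in the post. The function names and the bits_erased input (how many bit erasures the job fundamentally requires) are hypothetical.

```c
/* Boltzmann constant in joules per kelvin (exact in the 2019 SI). */
#define BOLTZMANN_J_PER_K 1.380649e-23
/* Natural log of 2, written out so no math library is needed. */
#define LN_2 0.6931471805599453

/* Energy in joules for a job that draws avg_power_watts for `seconds`. */
double job_energy_joules(double avg_power_watts, double seconds)
{
    return avg_power_watts * seconds;
}

/* Landauer bound: least energy needed to erase bits_erased bits
 * at absolute temperature temp_kelvin. */
double landauer_min_joules(double bits_erased, double temp_kelvin)
{
    return bits_erased * BOLTZMANN_J_PER_K * temp_kelvin * LN_2;
}

/* Merit figure: how many times the assumed theoretical minimum was spent.
 * Lower is better; 1.0 would be a machine running at the bound. */
double merit_multiple(double avg_power_watts, double seconds,
                      double bits_erased, double temp_kelvin)
{
    return job_energy_joules(avg_power_watts, seconds)
         / landauer_min_joules(bits_erased, temp_kelvin);
}
```

Note that the figure is built on energy (power multiplied by time), so two systems can only be compared on it for the same job, and it says nothing by itself about how long the job took.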

Oliver S.
Posted: Tue Oct 04, 2005 6:44 am    Post subject: Re: What do you think of Sun's Niagara

Quote:
Because the scout thread is more speculative than a normal OoO
machine, there will be more of these events, and so more dissipation
as a result.

I'm pretty sure scouting won't be speculative; I've outlined my
ideas in <433f2456$0$26199$9b4e6d93@newsread2.arcor-online.net>.

David Hopwood
Posted: Tue Oct 04, 2005 7:18 am    Post subject: Re: What do you think of Sun's Niagara

Colonel Forbin wrote:
Quote:
David Kanter <dkanter@gmail.com> wrote:

That is certainly a possibility, but I guess I'm not really sure of
that. Caches are awfully low-power. I see your point about
essentially partitioning the work into "hard/speculative" and
"easy/retirable", and to some degree that does make sense. I guess I'm
not sure what the marginal cost of saving that data/information is. My
inclination is to say that caching is almost always cheaper than
computation (see Terje's sig).

I wonder if it isn't time to give up all this fingerpointing between
the heavily cached cpu-too-fast-for-memory crowd and the huge number
of slow, stupid CPUs CM-1 crowd.

I propose a benchmark of computational entropy that simply measures
how much power is consumed in getting a particular job done in a given
amount of time, assuming that the job is computationally large and
preferably something of practical value.

You mean how much energy is consumed, presumably. But latency, at least,
is also important.

--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>

David Kanter
Posted: Tue Oct 04, 2005 8:15 am    Post subject: Re: What do you think of Sun's Niagara

Iain McClatchie wrote:
Quote:
Iain> 2) Most events that cause power dissipation stem from cache
Iain> misses. Branch mispredicts are a biggie: you have to
Iain> flush the pipe and dump all the work you just did.

David> I don't see any reason why this should be true as a
David> principle rather than as a temporary sort of issue. IIRC,
David> Intel and others are working on 'sleep' transistors which
David> will effectively turn off portions of the chip while they
David> are unused.

But these don't help, because you don't know that the work is wasted
until after it has already been done.

Absolutely. But, branches are predicted correctly about 95-99% of the
time (from what I can see so far), so I guess ultimately, I'm not
hugely worried.

Quote:
The "sleep transistors" are just another form of gated clock. A
physical design guy was telling me last Friday that ASICs nowadays
have *thousands* of gated clock domains, due to automated tools that
figure out which flops to turn off. By now, probably any CPU you
might buy is already taking naps during D$ misses.

Ok fair enough.

Quote:
Generally, the latency between determining that a pipe stage is not
needed to shutting down its clock is less than a cycle, unless it's
really big (like the first combining stage of a multiplier).

David> Caches are awfully low-power.

Oooh I'm not so sure about that. The SRAMs can be made low power if
they can run slowly. But you have to be careful about tag power (an
8-way set-associative cache checks 8 tags for each data access. Tags
are a lot bigger than 1/8 of an access word.).

I guess my point was that generally SRAMs are cooler than computational
logic, not cool in an absolute sense. Your point is taken.
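
The tag arithmetic behind that remark can be sketched as follows. This is a simplified model with assumed parameters: sizes are powers of two, all ways are probed in parallel on every access, and valid/state bits are ignored. The CacheGeom type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned addr_bits;    /* width of the address being looked up */
    uint64_t cache_bytes;  /* total data capacity */
    unsigned line_bytes;   /* bytes per cache line */
    unsigned ways;         /* set associativity */
} CacheGeom;

/* Integer log2 for power-of-two inputs. */
unsigned log2_u64(uint64_t x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Tag width = address bits minus set-index bits minus line-offset bits. */
unsigned tag_bits(const CacheGeom *g)
{
    uint64_t sets = g->cache_bytes / ((uint64_t)g->line_bytes * g->ways);
    return g->addr_bits - log2_u64(sets) - log2_u64(g->line_bytes);
}

/* Tag bits read and compared on one access when every way is probed. */
unsigned tag_bits_read_per_access(const CacheGeom *g)
{
    return g->ways * tag_bits(g);
}
```

For an assumed 32 KiB, 8-way cache with 64-byte lines and 40-bit addresses, this gives 64 sets, a 28-bit tag, and 224 tag bits touched per access against a 64-bit data word, so each tag is well over 1/8 of the word.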

Quote:
David> I am thinking it would make sense to have the scout thread act
David> like a narrow OOO, and then have the retire thread act like a
David> wide InO.

Me too:
http://groups.google.com/group/comp.arch/browse_thread/thread/50cf4ec00bcff65d/e90ec8cb530ca0f7?q=Iain+McClatchie+run-ahead&rnum=1&hl=en#e90ec8cb530ca0f7

David> Branch prediction should be done since we are right 90+% of
David> the time, ditto for caching.

Um, they didn't do branch prediction? Even one or two branches out?

To quote Kevin Krewell:
"Because the pipeline is short and there are multiple threads per core,
branch prediction becomes unnecessary and was also jettisoned."

http://www.mdronline.com/watch/watch_abstract.asp?Volname=Issue%20%23181&SID=1304&on=1&SourceID=00000377000000000000

To me, that is going overboard. If I can do something right 90% of the
time, why stop?

David
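
A back-of-envelope sketch of the two positions, under a deliberately crude model. It assumes a core that interleaves its ready threads strictly round-robin, so a given thread issues one instruction every `threads` cycles, and a branch whose outcome is known `resolve_cycles` after issue. None of the numbers or function names come from Sun or from the quoted article.

```c
/* Expected stall cycles per branch on a core that predicts branches:
 * only mispredicted branches pay the flush penalty. */
double stalls_with_predictor(double accuracy, unsigned flush_penalty_cycles)
{
    return (1.0 - accuracy) * flush_penalty_cycles;
}

/* Stall cycles one thread sees per branch with no predictor at all.
 * The thread's next issue slot arrives `threads` cycles after the branch,
 * so it only waits if the branch takes longer than that to resolve.
 * With threads == 1 this reduces to the classic resolve_cycles - 1 bubbles. */
unsigned stalls_without_predictor(unsigned resolve_cycles, unsigned threads)
{
    return resolve_cycles > threads ? resolve_cycles - threads : 0;
}
```

In this model a branch that resolves in three cycles costs nothing when four threads are interleaved, but costs two cycles every time when only one thread is running, which is the single-threaded case where a 90%-accurate predictor would still pay for itself.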

Page 2 of 4. All times are GMT.