Iain McClatchie
Posted: Tue Oct 04, 2005 8:15 am
Subject: Re: What do you think of Sun's Niagara
Iain> 2) Most events that cause power dissipation stem from cache
Iain> misses. Branch mispredicts are a biggie: you have to
Iain> flush the pipe and dump all the work you just did.
David> I don't see any reason why this should be true as a
David> principle rather than as a temporary sort of issue. IIRC,
David> Intel and others are working on 'sleep' transistors which
David> will effectively turn off portions of the chip while they
David> are unused.
But these don't help, because you don't know that the work is wasted
until after it has already been done.
The "sleep transistors" are just another form of gated clock. A
physical design guy was telling me last Friday that ASICs nowadays
have *thousands* of gated clock domains, due to automated tools that
figure out which flops to turn off. By now, probably any CPU you
might buy is already taking naps during D$ misses.
Generally, the latency between determining that a pipe stage is not
needed and shutting down its clock is less than a cycle, unless the
stage is really big (like the first combining stage of a multiplier).
David> Caches are awfully low-power.
Oooh, I'm not so sure about that. The SRAMs can be made low power if
they can run slowly. But you have to be careful about tag power: an
8-way set-associative cache checks 8 tags for each data access, and
a tag is a lot bigger than 1/8 of an access word.
David> I am thinking it would make sense to have the scout thread act
David> like a narrow OOO, and then have the retire thread act like a
David> wide InO.
Me too:
http://groups.google.com/group/comp.arch/browse_thread/thread/50cf4ec00bcff65d/e90ec8cb530ca0f7?q=Iain+McClatchie+run-ahead&rnum=1&hl=en#e90ec8cb530ca0f7
David> Branch prediction should be done since we are right 90+% of
David> the time, ditto for caching.
Um, they didn't do branch prediction? Even one or two branches out?

Nick Maclaren
Posted: Tue Oct 04, 2005 1:44 pm
In article <1128412050.660151.107230@z14g2000cwz.googlegroups.com>,
David Kanter <dkanter@gmail.com> wrote:
> Iain McClatchie wrote:
> > Iain> 2) Most events that cause power dissipation stem from cache
> > Iain> misses. Branch mispredicts are a biggie: you have to
> > Iain> flush the pipe and dump all the work you just did.
> > David> I don't see any reason why this should be true as a
> > David> principle rather than as a temporary sort of issue. IIRC,
> > David> Intel and others are working on 'sleep' transistors which
> > David> will effectively turn off portions of the chip while they
> > David> are unused.
> > But these don't help, because you don't know that the work is wasted
> > until after it has already been done.
> Absolutely. But, branches are predicted correctly about 95-99% of the
> time (from what I can see so far), so I guess ultimately, I'm not
> hugely worried.

As the objective is to extend look-ahead from one branch to 20+,
a 5% miss rate is serious.
Ken Hagan
Posted: Tue Oct 04, 2005 2:12 pm
> Colonel Forbin wrote:
> I propose a benchmark of computational entropy that simply measures
> how much power is consumed in getting a particular job done in a given
> amount of time, assuming that the job is computationally large and
> preferably something of practical value.

David Hopwood wrote:
> You mean how much energy is consumed, presumably. But latency, at least,
> is also important.
I think you read that too quickly. He wants to measure the
"power consumed in getting a particular job done in a given amount
of time".
If the time (latency) is fixed, then power and energy are equivalent
measures. (As an aside, how would the power vary with the latency?)
I find it interesting that this benchmark has two numerical values,
and you really need to know both to know the meaning of either.
Perhaps this is generally true and we should ignore any benchmark
that gives only a single figure of merit.
Joe Seigh
Posted: Tue Oct 04, 2005 4:15 pm
Peter Grandi wrote:
> On Sun, 25 Sep 2005 22:13:07 +0200, Milos Becvar
> <becvarm@fel.cvut.cz> said:
> becvarm> where Niagara can not compete.
> Sun started by targeting the engineering workstation market, but
> they have long since abandoned that (to x86, isn't that amazing)
> now most of what they care about seems to be largely the same
> market as IBM mainframes, with Oracle instead of DB2 and Java
> instead of Cobol/PL1.
Can Oracle benefit from having lots more but slightly slower
threads? What concurrency mechanism is Oracle using that would
work better here? Async I/O or something else?
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.

Peter Grandi
Posted: Tue Oct 04, 2005 4:15 pm
[ ... on Sun's upcoming Niagara ... ]
becvarm> The only question for me is the relatively small cache size
becvarm> of 16KB for L1I and 8KB for L1D. And shared L2 is
becvarm> "only" 3MB for 32 threads! They claim that temporal
becvarm> locality is poor in commercial workloads and latency
becvarm> would be hidden by multithreading, but it still sounds
becvarm> relatively low.
I guess that «commercial workloads» is code for ''this is purely
an Oracle and sometimes Java oriented server which is all we
care about''.
becvarm> If we compare Niagara with Intel Montecito, the
becvarm> 1.72-billion-transistor monster also described in that
becvarm> issue of IEEE Micro, Niagara wins in terms of clean and
becvarm> nice architecture.
A comparison with Opteron is even more interesting, as Sun also
sells Opteron for those markets where they haven't locked their
customers into the SPARC/Oracle/Java combination.
becvarm> For 60W you can have 32 threads on Niagara or 4 threads
becvarm> on Montecito for 100+W. On the other hand, the
becvarm> Montecito is also targeted at technical workloads
An aside here: but wait, wasn't the Itanic what Intel so
passionately «targeted at technical workloads»? :-)
becvarm> where Niagara can not compete.
Sun started by targeting the engineering workstation market, but
they have long since abandoned that (to x86, isn't that amazing?);
now most of what they care about seems to be largely the same
market as IBM mainframes, with Oracle instead of DB2 and Java
instead of COBOL and PL/I.
Continuation of the aside above: one of the more entertaining
ironies in a similar parallel history is that HP designed the
Itanic architecture with an obvious and large bias towards
engineering workstations or compute servers, as a next-generation
replacement for PA-RISC, yet Intel bought IA-64 as a ''business
server'' architecture. An even greater irony, I reckon, is that
it is SGI that is left putting Itanic to the use for which it was
so obviously targeted.
[ ... ]

David Hopwood
Posted: Tue Oct 04, 2005 4:15 pm
Ken Hagan wrote:
> > Colonel Forbin wrote:
> > I propose a benchmark of computational entropy that simply measures
> > how much power is consumed in getting a particular job done in a given
> > amount of time, assuming that the job is computationally large and
> > preferably something of practical value.
> David Hopwood wrote:
> > You mean how much energy is consumed, presumably. But latency, at least,
> > is also important.
> I think you read that too quickly. He wants to measure the
> "power consumed in getting a particular job done in a given amount
> of time".
> If the time (latency) is fixed, then power and energy are equivalent
> measures. (As an aside, how would the power vary with the latency?)
Oh, I see. Yes, I did read it too quickly. The benchmark result for a
given job would need to be expressed as a graph of time against power
(or time against energy).
--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>

Iain McClatchie
Posted: Tue Oct 04, 2005 9:53 pm
Kevin> "Because the pipeline is short and there are multiple
Kevin> threads per core, branch prediction becomes unnecessary
Kevin> and was also jettisoned."
Okay. I've done paper designs in that space before. When I
last looked at it, the instruction fetch had to know not to fetch
the instruction after the branch, or you lost way too much of
the pipeline's bandwidth. I did this by designing the instruction
set to mark the instruction *before* the branch. That way the
fetch logic had time to mux in a different thread target.
A gang of one- or two-way in-order cores with switch-on-branch is
neat because all the cores can use the same instruction cache,
which should improve cache hits if multiple threads are executing
the same code. The idea is that each thread makes one big fetch
at each branch, once the branch is resolved. When the thread gets
rescheduled, these instructions are fed into the pipe one by one.
With some attention to detail you can guarantee throughput and
even a latency bound, which in many cases (routers) may be all
that is needed.
But if you want to make throughput guarantees, you can't get any
benefit from statistical processes like caches and predictors.
The caches end up being used as buffers.

Peter Grandi
Posted: Tue Oct 04, 2005 9:55 pm
>>> On 4 Oct 2005 00:47:30 -0700, "David Kanter" <dkanter@gmail.com>
>>> said:
[ ... ]
dkanter> Absolutely. But, branches are predicted correctly
dkanter> about 95-99% of the time (from what I can see so far),
dkanter> so I guess ultimately, I'm not hugely worried.
Uhm, another one of my pet peeves :-). This statistic, while true,
is ridiculous, because it really sort of is about _backwards_
jumps, that is, about code with lots of loops. Predicting
_forward_ jumps is not as easy :-).
That is, the architecture strips away the information that this
block is a loop, and then ''branch prediction'' just expensively
and pointlessly figures it out again.
Problem is, it is the forward branches that are the killers, and
even insanities like trace scheduling are not that awesomely
effective, at least if one considers the costs.
BTW, one of my constant refrains in this newsgroup for the
past few centuries is a dislike of pitifully pathetic attempts
to simulate ''vectorial'' computational patterns with lower-level
code patterns and architectural features (for example,
simulating vector registers with cache lines). And note that I
dislike complex instructions, which are a different thing.
I am instead one of those who would like to see more use of the
''likely'' bit in a branch instruction, which could be set by a
compiler, as in some architectures. Also (and I wonder how
effective and easy to implement this could be), the ''likely'' bit
in an instruction could be replaced by a condition code or
register that had been set dynamically earlier.
Nick Maclaren
Posted: Tue Oct 04, 2005 10:02 pm
In article <yf3irwdusre.fsf@base.gp.example.com>,
pg_nh@0509.exp.sabi.co.UK (Peter Grandi) writes:
|> >>> On 4 Oct 2005 00:47:30 -0700, "David Kanter" <dkanter@gmail.com>
|> >>> said:
|>
|> dkanter> Absolutely. But, branches are predicted correctly
|> dkanter> about 95-99% of the time (from what I can see so far),
|> dkanter> so I guess ultimately, I'm not hugely worried.
|>
|> Uhm another one of my pet peeves :-). This statistic while true
|> is ridiculous, because it really sort of is about _backwards_
|> jumps, that is about code with lots of loops. Predicting
|> _forward_ jumps is not as easy :-).
There is a 'solution' to that. All forward jumps are required
to be unconditional, whereupon predicting them becomes trivial.
Pretty neat, eh?
Obviously, such an architecture is an ideal match for a language
based around the COME FROM statement!
Regards,
Nick Maclaren.

John F. Carr
Posted: Tue Oct 04, 2005 10:06 pm
In article <1128444825.945632.164620@f14g2000cwb.googlegroups.com>,
Iain McClatchie <iain-3@truecircuits.com> wrote:
> Kevin> "Because the pipeline is short and there are multiple
> Kevin> threads per core, branch prediction becomes unnecessary
> Kevin> and was also jettisoned."
> Okay. I've done paper designs in that space before. When I
> last looked at it, the instruction fetch had to know not to fetch
> the instruction after the branch, or you lost way too much of
> the pipeline's bandwidth. I did this by designing the instruction
> set to mark the instruction *before* the branch. That way the
> fetch logic had time to mux in a different thread target.
We used to call those "delayed branches".
I heard that early 68020s had a bug where the prefetcher would
keep fetching past an unconditional branch, possibly accessing
invalid memory if the branch was at the end of a page.
--
John Carr (jfc@mit.edu)

David Kanter
Posted: Tue Oct 04, 2005 10:30 pm
> As the objective is to extend look-ahead from one branch to 20+,
> a 5% miss rate is serious.
FWIW, I haven't measured a miss rate above 2% yet, but I have only
worked with 'workstation' benchmarks, not anything TPC-C like.
DK

Thomas Lindgren
Posted: Tue Oct 04, 2005 11:03 pm
pg_nh@0509.exp.sabi.co.UK (Peter Grandi) writes:
> I am one of those instead who would like to see more use of the
> ''likely'' bit in a branch instruction, which could be set by a
> compiler, as in some architectures. Also, and I wonder how
> effective and easy to implement this could be, a ''likely'' bit
> in an instruction could be replaced instead by a condition code
> or register that could have been set earlier dynamically.
Here's a paper looking at setting the prediction bit dynamically:
Mahlke and Natarajan. Compiler Synthesized Branch Prediction. Micro 1996.
http://www.eecs.umich.edu/~mahlke/papers/1996/mahlke_micro96.pdf
Best,
Thomas
--
Thomas Lindgren
"It's becoming popular? It must be in decline." -- Isaiah Berlin

Casper H.S. Dik
Posted: Wed Oct 05, 2005 12:02 am
David Hopwood <david.nospam.hopwood@blueyonder.co.uk> writes:
> Oh, I see. Yes, I did read it too quickly. The benchmark result for a
> given job would need to be expressed as a graph of time against power
> (or time against energy).
Which wouldn't be much different from the sometimes-quoted
foobars/Watt figures; i.e., you need one additional number for each benchmark.
Casper

Milos Becvar
Posted: Wed Oct 05, 2005 12:15 am
Nick Maclaren wrote:
> In article <dh8agm$2m2$1@gemini.csx.cam.ac.uk>,
> nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
> |> In article <dh70cc$11jg$1@ns.felk.cvut.cz>,
> |> Milos Becvar <becvarm@fel.cvut.cz> wrote:
> |>
> |> >There is relatively detailed description of Niagara architecture
> |> >in March-April 2005 issue of IEEE Micro.
> |>
> |> Thanks for the reminder. It wasn't in when I last looked. I must
> |> take another look.
> May/June, actually.

To be 100% accurate, there was an article about Niagara in March-April
and about HW Scouting in May-June.
> Unfortunately, it didn't tell me much that I didn't know. The
> figures would have been more interesting if they had had some
> information on what they represented, though the ones on the
> "cache enlargement" were OK.
> I am interested in much more technical information, such as
> exactly what the NT bit is associated with. For example,
> consider code like the following:
>     p = (p != NULL ? *p : *def1);
>     p = (p != NULL ? *p : *def2);
>     p = (p != NULL ? *p : *def3);
>     p = (p != NULL ? *p : *def4);
>     p = (p != NULL ? *p : *def5);
> Does this require the compiler to use different registers for
> each p in order to get advantage from scouting?
> What is the effect on the cache and TLB, and does the system
> have a mechanism to prevent dynamic deadly embraces?
This is an interesting issue. At ISCA04 I heard, in a paper
from the Univ. of Illinois, that a similar problem occurs on Itanium
when the "advanced load" instruction is used aggressively.
They called this phenomenon "wild loads", and TLB spoiling
is the real issue.
Regards,
Milos
Nick Maclaren
Posted: Wed Oct 05, 2005 12:15 am
In article <4342b6a4$0$571$b45e6eb0@senator-bedfellow.mit.edu>,
John F. Carr <jfc@mit.edu> wrote:
> I heard that early 68020s had a bug where the prefetcher would
> keep fetching past an unconditional branch, possibly accessing
> invalid memory if the branch was at the end of a page.
Bugs of that nature have been legion. There were related ones on
several of the System/370 range (and probably the 360/91), most
RISCs, and others. Where they don't involve a direct security
exposure, the 'solution' was, and often still is, to document their
clobbering as a requirement on the interrupt handler. Ugh.
Regards,
Nick Maclaren.