| Author |
Message |
Nick Maclaren
Guest
|
Posted:
Tue Aug 09, 2005 4:16 pm Post subject:
Re: internal call/ret stack |
|
|
In article <Pine.LNX.4.61.0508091601040.9761@tyr.diku.dk>,
"Peter \"Firefly\" Lund" <firefly@diku.dk> writes:
|> On Tue, 9 Aug 2005, Torben Ęgidius Mogensen wrote:
|>
|> > It has always struck me as a bad decision to heap-allocate the frames,
|>
|> But if you never return, why use the stack? ;)
|>
|> (For the uninitiated: SML/NJ uses the socalled continuation passing style
|> in which called "functions" take the "continuation" of the program as a
|> parameter. When they "return", they "call" the continuation. In this
|> scheme there ARE no returns, only invocations consisting of gotos with
|> parameter passing. Look, more generality! Isn't it clean and
|> well-designed?
Quite. To your satire, of course. No, to the literal question!
When debugging failures in large, complex programs, it is sometimes
really rather useful to answer the question "How on EARTH did I
get HERE?" without having to recompile the application.
Regards,
Nick Maclaren. |
|
| Back to top |
|
 |
Peter \"Firefly\" Lund
Guest
|
Posted:
Tue Aug 09, 2005 4:16 pm Post subject:
Re: internal call/ret stack |
|
|
On Tue, 9 Aug 2005, Nick Maclaren wrote:
| Quote: | When debugging failures in large, complex programs, it is sometimes
really rather useful to answer the question "How on EARTH did I
get HERE?" without having to recompile the application.
|
Yes, definitely. Even just a dump of the return stack (without
parameters) would help a lot. The continuation passing style used in
SML/NJ with old activation records that may or may not have been reclaimed
by the garbage collector and which nobody keeps track of anyway doesn't
seem to be useful for that.
An ability to attach watchpoints, tracepoints, and breakpoints to
functions without recompilation, preferably semiautomatic, would be very
useful. SUN seems to have released such a tool recently, according to
blog entries on planet.(freedesktop|gnome).org.
The IBM dprobes seem to be useful for the same thing but the people
working on it never seemed to understand that providing a pure mechanism
wasn't enough, you need to provide useful tools with real user interfaces,
too.
What has this got to do with SML/NJ? ;)
-Peter |
|
| Back to top |
|
 |
Tom Linden
Guest
|
Posted:
Tue Aug 09, 2005 4:16 pm Post subject:
Re: internal call/ret stack |
|
|
On Tue, 9 Aug 2005 16:07:41 +0200, Peter "Firefly" Lund <firefly@diku.dk>
wrote:
| Quote: | On Tue, 9 Aug 2005, Torben Ęgidius Mogensen wrote:
It has always struck me as a bad decision to heap-allocate the frames,
But if you never return, why use the stack? ;)
If the language supports lexical inheritance you need it. If it supports |
scoped error handling you need it, and it is very convenient for a number
of other things. Can't recall an implementation which used the heap for
frames, I guess that is why we always have called them stack frames,
because
we use both stack and frame pointers.
BTW, what is SML/NJ?
| Quote: |
(For the uninitiated: SML/NJ uses the socalled continuation passing style
in which called "functions" take the "continuation" of the program as a
parameter. When they "return", they "call" the continuation. In this
scheme there ARE no returns, only invocations consisting of gotos with
parameter passing. Look, more generality! Isn't it clean and
well-designed? Yes, I know that a CALL instruction is like an invocation
with an implicit continuation passed as a hidden parameter, but still...
I
think special-casing makes sense here, too.)
implementor has a lot of freedom in when data is deallocated, so you
can do things like escape analysis and region inference to get better
locality and earlier deallocation.
It is my impression that they are better at generating papers than they
are at improving performance.
Meanwhile, a simple hardware assist can get better results without the
enormous cognitive and educational load of escape analysis and what-not
inference. Which I happen to like a lot, of course. I just don't think
they are all that practical yet.
-Peter |
|
|
| Back to top |
|
 |
Peter \"Firefly\" Lund
Guest
|
Posted:
Tue Aug 09, 2005 4:16 pm Post subject:
Re: internal call/ret stack |
|
|
On Tue, 9 Aug 2005, Terje Mathisen wrote:
| Quote: | Siimilar tricks are common with compiled Standard ML code: heap
allocation is just a question of advancing a pointer except in the rare
case where a garbage collection is called for. Several functions that
call each other in just the right way can share the allocation and the
check for whether to garbage collect or not.
:-)
|
And it happens automatically, mind you :)
On the other hand, ML programs tend to allocate far too much memory for
their own good "because garbage collection is cheap". Well, it is, but...
A lot of it is due to compiler generated stuff rather than programmer
visible stuff.
SML/NJ allocates invocation frames on the heap, for example. It's all
nice and general and stuff (the old-timers here will love it) but it is
also a stupid thing to do because they /do/ behave in a very special way
that does not NEED full generality and which makes it easy to special-case
by using the CPU stack. Suddenly your locality becomes a lot better...
Anybody who equals general, clean, and well-designed from now on will get
a black mark in my book.
-Peter |
|
| Back to top |
|
 |
Peter \"Firefly\" Lund
Guest
|
Posted:
Tue Aug 09, 2005 11:13 pm Post subject:
Re: internal call/ret stack |
|
|
On Tue, 9 Aug 2005, Tom Linden wrote:
Let's take this part first:
| Quote: | BTW, what is SML/NJ?
|
http://www.smlnj.org/
It's an implementation of the (extremely nice) language Standard ML.
Standard ML is a standardization of the language ML,
http://citeseer.ist.psu.edu/context/6130/0
originally developed as a meta-language for proof searching tactics in an
automated proof generator, hence the name.
ML was one of the three things Robin Milner got the Turing award for in
1991:
http://www.fairdene.com/picalculus/robinmilner.html
http://www.acm.org/awards/turing_citations/milner.html
ML pioneered a fantastic type system and a wonderful exception model.
| Quote: | But if you never return, why use the stack? ;)
If the language supports lexical inheritance you need it.
|
Whether or not a physical stack is used or not is an implementation
detail.
| Quote: | If it supports
scoped error handling you need it,
|
I am not sure I follow here... If you are talking about the now pretty
much standard exception model, ML pioneered it and SML/NJ was one of the
early compilers (probably the first real compiler for the language --
Torben?) for it.
| Quote: | and it is very convenient for a number
of other things. Can't recall an implementation which used the heap for
frames, I guess that is why we always have called them stack frames, because
we use both stack and frame pointers.
|
And now you know why CS types call them "invocation records" ;)
As I wrote (in the part I just snipped), you can play even more tricks
than say using a linked list of heap allocated invocation records. If you
use continuation passing you don't even return... you just call (jump) to
the next continuation.
Andrew Appel, "Compiling with Continuations", Cambridge University
Press, 1992.
http://citeseer.ist.psu.edu/context/238/0
-Peter |
|
| Back to top |
|
 |
Guest
|
Posted:
Wed Aug 10, 2005 4:15 pm Post subject:
Re: internal call/ret stack |
|
|
Peter "Firefly" Lund wrote:
| Quote: | The 21064 (1992) had one; I don't remember earlier CPUs that had one.
That early?
|
The Harris RTX microprocessors had a 256 deep on-chip
return stack around 1986. Maybe not the earliest
microprocessor with on-chip stacks, but it was
deeper than 1.
Best Wishes |
|
| Back to top |
|
 |
Peter \"Firefly\" Lund
Guest
|
Posted:
Wed Aug 10, 2005 4:15 pm Post subject:
Re: internal call/ret stack |
|
|
On Wed, 10 Aug 2005 fox@ultratechnology.com wrote:
| Quote: | The Harris RTX microprocessors had a 256 deep on-chip
return stack around 1986. Maybe not the earliest
microprocessor with on-chip stacks, but it was
deeper than 1.
|
But was it an architectural feature or not? The stuff I am talking about
is NOT an architectural feature, it is a microarchitectural optimization
that is transparent to the code (except for the slowdown when you abuse
it).
-Peter |
|
| Back to top |
|
 |
Seongbae Park
Guest
|
Posted:
Wed Aug 10, 2005 9:02 pm Post subject:
Re: internal call/ret stack |
|
|
"Peter "Firefly" Lund" <firefly@diku.dk> wrote in message
| Quote: | Finding the instruction to return to quickly can be of immense value.
|
Dan Koren <dankoren@yahoo.com> wrote:
| Quote: | I don't doubt it can be useful.
But I am not convinced the value
would be "immense". What am I
missing?
|
The return instruction is an indirect branch,
hence the target address is not known to the early pipeline stages.
So unless the processor has a return address stack,
it may not be able to pre-fetch instructions.
i.e. the overhead is about the same as branch misprediction.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/" |
|
| Back to top |
|
 |
Seongbae Park
Guest
|
Posted:
Wed Aug 10, 2005 11:20 pm Post subject:
Re: internal call/ret stack |
|
|
Maynard Handley <name99@name99.org> wrote:
| Quote: | In article <dddq08$mmi$1@news1nwk.SFbay.Sun.COM>,
Seongbae Park <Seongbae.Park@Sun.COM> wrote:
"Peter "Firefly" Lund" <firefly@diku.dk> wrote in message
Finding the instruction to return to quickly can be of immense value.
Dan Koren <dankoren@yahoo.com> wrote:
I don't doubt it can be useful.
But I am not convinced the value
would be "immense". What am I
missing?
The return instruction is an indirect branch,
hence the target address is not known to the early pipeline stages.
So unless the processor has a return address stack,
it may not be able to pre-fetch instructions.
i.e. the overhead is about the same as branch misprediction.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
There is a bug on some versions of the 7450 that causes what is
essentially indirect branch prediction on that CPU to occasionally
branch wrong, and so on those versions of the chip it is disabled. The
net slowdown is sufficiently small (1 or 2%) that only fairly low-level
experts even know it exists; it's certainly not common knowledge among a
slashdot level of machead.
(Note that this bug affects prediction of indirect branches (ie vtable
calls), not prediction of call returns which are handled by a different
mechanism of on the 7400 series. So it's not directly comparable, but
it's the same sort of deal; the high end figure of 3% slowdown is
presumably for aggressively OO C++ code and is a higher effect than call
return.)
|
The question is how often do you do procedure returns
and what is the penalty of misprediction.
Since number of dynamic indirect calls in a program
tends to be significantly lower than number of dynamic procedure returns,
your example doesn't exactly answer the question
(though it's very interesting regardless).
Oh yeah...errata can be really scary.
| Quote: | So the bottom line is, yeah, it's nice, but I don't know that 1.5%
equals immense.
Maynard Handley
|
To be more clearer, I agree that it _can_ be of immense value but
I don't agree that it _is_ of immense value.
This depends a lot on workload, and some workload could be
affected by this significantly.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/" |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Thu Aug 11, 2005 12:15 am Post subject:
Re: internal call/ret stack |
|
|
"Peter "Firefly" Lund" <firefly@diku.dk> wrote in message
news:Pine.LNX.4.61.0508090124420.9343@tyr.diku.dk...
| Quote: | On Mon, 8 Aug 2005, Dan Koren wrote:
Think about it this way: if you're going to
restore an entire stack frame from memory,
one more register doesn't make that much of
a difference.
Finding the instruction to return to quickly can be of immense value.
^^^^^^^ |
------------|||||||
I don't doubt it can be useful.
But I am not convinced the value
would be "immense". What am I
missing?
Thx,
dk |
|
| Back to top |
|
 |
Peter \"Firefly\" Lund
Guest
|
Posted:
Thu Aug 11, 2005 12:15 am Post subject:
Re: internal call/ret stack |
|
|
On Wed, 10 Aug 2005, Dan Koren wrote:
| Quote: | But I am not convinced the value
would be "immense". What am I
missing?
|
Pipelining?
-Peter |
|
| Back to top |
|
 |
Maynard Handley
Guest
|
Posted:
Thu Aug 11, 2005 12:15 am Post subject:
Re: internal call/ret stack |
|
|
In article <dddq08$mmi$1@news1nwk.SFbay.Sun.COM>,
Seongbae Park <Seongbae.Park@Sun.COM> wrote:
| Quote: | "Peter "Firefly" Lund" <firefly@diku.dk> wrote in message
Finding the instruction to return to quickly can be of immense value.
Dan Koren <dankoren@yahoo.com> wrote:
I don't doubt it can be useful.
But I am not convinced the value
would be "immense". What am I
missing?
The return instruction is an indirect branch,
hence the target address is not known to the early pipeline stages.
So unless the processor has a return address stack,
it may not be able to pre-fetch instructions.
i.e. the overhead is about the same as branch misprediction.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
|
There is a bug on some versions of the 7450 that causes what is
essentially indirect branch prediction on that CPU to occasionally
branch wrong, and so on those versions of the chip it is disabled. The
net slowdown is sufficiently small (1 or 2%) that only fairly low-level
experts even know it exists; it's certainly not common knowledge among a
slashdot level of machead.
(Note that this bug affects prediction of indirect branches (ie vtable
calls), not prediction of call returns which are handled by a different
mechanism of on the 7400 series. So it's not directly comparable, but
it's the same sort of deal; the high end figure of 3% slowdown is
presumably for aggressively OO C++ code and is a higher effect than call
return.)
(7457 BTIC bug)
http://66.102.7.104/search?q=cache:NybYXFdiAb4J:www.mountall.com/ppc/free
scale/MPC7457CE.pdf+7457+BTIC+errata&hl=en&client=firefox-a#31
(BTW reading these errata documents for any CPU is pretty scary. You get
the feeling it's amazing anything at all works.)
So the bottom line is, yeah, it's nice, but I don't know that 1.5%
equals immense.
Maynard Handley |
|
| Back to top |
|
 |
Anton Ertl
Guest
|
Posted:
Mon Aug 15, 2005 12:56 pm Post subject:
Re: internal call/ret stack |
|
|
Maynard Handley <name99@name99.org> writes:
| Quote: | There is a bug on some versions of the 7450 that causes what is
essentially indirect branch prediction on that CPU to occasionally
branch wrong, and so on those versions of the chip it is disabled. The
net slowdown is sufficiently small (1 or 2%) that only fairly low-level
experts even know it exists; it's certainly not common knowledge among a
slashdot level of machead.
(Note that this bug affects prediction of indirect branches (ie vtable
calls), not prediction of call returns which are handled by a different
mechanism of on the 7400 series. So it's not directly comparable, but
it's the same sort of deal; the high end figure of 3% slowdown is
presumably for aggressively OO C++ code and is a higher effect than call
return.)
|
For every indirect call (e.g., C++ virtual function call), there is
also a return, so I would expect the effect of the return stack in OO
code to be at least as large as that of indirect branch prediction.
Reasons why it will usually be higher:
- An appropriately sized return stack has a higher prediction
accuracy.
- Even in aggressively OO code there will be direct calls, and thus
more returns than indirect calls.
Anyway, the kind of code where indirect branch prediction is really
exercised (and is much more important than the return stack) is
efficient virtual machine interpreters. Up to 13% of the instructions
executed by such interpreters are indirect branches
<http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz>,
and they can spend more than half of their time in branch
mispredictions.
To the best of my knowledge, the BTIC does not predict indirect
branches (don't ask me why not). The link above mentions the b and bc
instructions (direct branches), not bctr and bcctr (indirect
branches).
I have also performed measurements on the PPC7447A in my iBook
(comparing the run-times of
<http://www.complang.tuwien.ac.at/forth/threading-v1/pro-btb.c> and
<http://www.complang.tuwien.ac.at/forth/threading-v1/anti-btb.c>; use
gcc <3.1 for compiling them, later gccs tend to pessimize them and
destroy its value as a micro-benchmark), and they indicate that no
indirect branch prediction is happening. Of course, this could be
because the BTIC was disabled on that CPU, or it could be because the
BTIC never predicts indirect branches.
BTW, I am seeing curious results on the PPC970 for these
micro-benchmarks: The one designed for 50% BTB prediction accuracy
<http://www.complang.tuwien.ac.at/forth/threading/direct.c> performs
better than both pro-btb.c and anti-btb.c. My first guess was that
this is coming from a code alignment sensitivity of the PPC970, but I
tried inserting nops at the start of main(), and it did not make a
difference.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
| Back to top |
|
 |
|
|
|
|