interrupting for overflow and loop termination
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
Frode Vatvedt Fjeld
Guest
Posted: Tue Sep 13, 2005 12:15 am    Post subject: Re: interrupting for overflow and loop termination

Terje Mathisen <terje.mathisen@hda.hydro.com> writes:

Quote:
At the ADD site:

add eax,[esi]
jo local_handler
...
local_handler:
call global_handler
jmp return_to_normal_code

Yeah, it seems like this would use 9 bytes of code, vs just 1 for
INTO.

Right, provided the local_handler can be located within the
signed-8-bit range. If not, the INTO savings balloon to 14 bytes.
(Actually, for my particular architecture, the call will be either 3
or 6 bytes (i.e. a register-indirect call with 8 or 32 bits offset),
and perhaps even a segment override prefix, yielding the range 2+3+2=7
to 5+7+5=17 bytes vs. the 1 for INTO.)
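
For readers who think in C rather than x86 assembly, the explicit-branch variant quoted above (ADD, then JO to a local stub that calls a global handler) corresponds roughly to the sketch below. It assumes GCC or Clang, since `__builtin_add_overflow` is a compiler builtin rather than standard C. The names `checked_add`, `global_handler` and `overflow_count`, and the saturating policy in the handler, are made up for illustration and do not come from the thread.

```c
#include <stdint.h>

/* Hypothetical bookkeeping: how many times the slow path ran. */
static int overflow_count = 0;

/* Hypothetical global handler. The saturating policy is illustrative only. */
static int32_t global_handler(int32_t a, int32_t b)
{
    (void)a;
    overflow_count++;
    /* Signed overflow needs operands of the same sign, so b's sign decides. */
    return (b > 0) ? INT32_MAX : INT32_MIN;
}

/* The add site: an ADD followed by a conditional branch on overflow.
 * On x86, GCC and Clang commonly compile this test to ADD plus JO. */
static int32_t checked_add(int32_t a, int32_t b)
{
    int32_t sum;
    if (__builtin_add_overflow(a, b, &sum))
        return global_handler(a, b);   /* rarely taken path */
    return sum;                        /* common path */
}
```

Calling `checked_add(1, 2)` stays on the common path, while `checked_add(INT32_MAX, 1)` goes through the handler. INTO would replace the explicit branch with a single-byte instruction at each add site, which is the code-size saving being discussed.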

BTW: does anyone actually know something about the cost of the x86
INTO (when OF=0), especially relative to a conditional branch?

--
Frode Vatvedt Fjeld
Seongbae Park
Guest
Posted: Tue Sep 13, 2005 7:57 am    Post subject: Re: interrupting for overflow and loop termination

John Mashey <old_systems_guy@yahoo.com> wrote:
Quote:
David Hopwood wrote:
andrewspencers@yahoo.com wrote:
Terje Mathisen wrote:

A slightly different situation is where you have code that in practice
always handles integers that fit in a single word, but that can't be
statically guaranteed to do so, and the language specification says that
bignum arithmetic must be supported -- the obvious example being Smalltalk.
There were some attempts to support this in hardware (e.g. "Smalltalk on
a RISC"; also something on SPARC that I can't remember the details of),
but it turned out to be easier and faster for implementations of Smalltalk
and similar languages to use other tricks that don't require hardware support.

Yes.
1) There was Berkeley SOAR as noted, and SPARC included ADD/SUB Tagged,
which used the high-30 bits as integers, and the low 2 bits as tags; if
either low 2-bit field were non-zero, it trapped.

And taddcctv (Tagged-ADD-and-set-CC-with-Trap-on-oVerflow)
has been deprecated in SPARC v9,
meaning the opcode will continue to work as specified,
but no performance guarantee (i.e. it may be emulated entirely by software).
V9 specifically suggests replacing taddcctv
with taddcc (non-trap version, essentially just an addcc
with overflow bit from tag portion - hence it's not that expensive to implement)
followed by a branch-on-overflow-set (or by a trap-on-overflow-set).
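
As a rough software analogue of the suggested replacement (taddcc followed by a branch on overflow), the sketch below does a wrapping add, checks the low 2 tag bits of both inputs, and checks for signed overflow. It assumes 32-bit words with tag 0 meaning "small integer", as described in the quoted text. The name `tagged_add` and the boolean return convention are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_MASK 0x3u   /* low 2 bits hold the tag; tag 0 marks a small integer */

/*
 * Returns true and stores the sum when both operands carry tag 0 and the
 * 32-bit signed add does not overflow. Returns false otherwise, in which
 * case the caller takes the slow path (bignum arithmetic or type dispatch).
 */
static bool tagged_add(uint32_t a, uint32_t b, uint32_t *result)
{
    uint32_t sum = a + b;                        /* wraps, never traps */
    bool bad_tag = ((a | b) & TAG_MASK) != 0;    /* depends on the inputs only */
    /* Signed overflow: operands agree in sign, sum disagrees with them. */
    bool overflow = ((~(a ^ b) & (a ^ sum)) >> 31) != 0;

    if (bad_tag || overflow)
        return false;
    *result = sum;
    return true;
}
```

With integers stored shifted left by 2, adding 5 and 7 means `tagged_add(20, 28, &r)`, which yields `r == 48`, that is 12 in tagged form. The tag test needs only the two input operands, so it does not have to wait for the adder.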

Of course, having been declared "deprecated" in 1992,
this baggage still has to be carried forward -
it's not possible to reclaim the opcode space taken by these instructions yet.
I don't know if SPARC will ever be able to do so.
Probably not until there's a SPARC v10, if that ever happens.

....
Quote:
Anyway, it's pretty clear that relevant mechanisms were being discussed
~20 years ago, but nobody seems to have figured out features that
actually make implementation sense.

Probably because there's no such feature,
beside a fast general purpose user-level trap mechanism you mentioned
in the snipped part of your post.

Quote:
I'd be delighted to see a
well-informed proposal that had sensible hardware/software
implementations and really helped LISP/Smalltalk/ADA and hopefully
other languages...

I think LISP/Smalltalk/ADA market is just too small to justify
adding any significant change in the general purpose ISA,
unless this yet-to-be-invented mechanism is easy and cheap
to implement or it is for some other purpose which happened to
help them (like a fast user-level trap).

Having said that, it would be interesting to see any good proposal in this area.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
John Mashey
Guest
Posted: Tue Sep 13, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination

David Hopwood wrote:
Quote:
andrewspencers@yahoo.com wrote:
Terje Mathisen wrote:

A slightly different situation is where you have code that in practice
always handles integers that fit in a single word, but that can't be
statically guaranteed to do so, and the language specification says that
bignum arithmetic must be supported -- the obvious example being Smalltalk.
There were some attempts to support this in hardware (e.g. "Smalltalk on
a RISC"; also something on SPARC that I can't remember the details of),
but it turned out to be easier and faster for implementations of Smalltalk
and similar languages to use other tricks that don't require hardware support.

Yes.
1) There was Berkeley SOAR as noted, and SPARC included ADD/SUB Tagged,
which used the high-30 bits as integers, and the low 2 bits as tags; if
either low 2-bit field were non-zero, it trapped.

2) ~1988, while working on MIPS-II, I/we spent a lot of time talking
with Smalltalk & LISP friends, potential customers, etc, asking:
"Are there any modest extensions that would help you a lot, and would
be reasonable to implement?"

Short answer: NO.

Longer answer:
a) They said either give them a complete, tailored solution [which they
didn't expect], or just make the CPU run fast, but don't bother with
minor enhancements. Some said they knew about the SPARC feature, but
didn't use it.

b) Some said: they were all doing fairly portable versions, had learned
a lot of good tricks, and minor improvements that required major
structural changes just weren't worth it.

c) I spent some time with David Kay (of XEROX PARC/Smalltalk fame) on
this, i.e., were there features that would be substantially helpful?

The best general idea we could come up with was:
- A general low-overhead user-level trap mechanism (something I've
wished for many times for other reasons).
- Some kind of general mask/check mechanism that could generate such
traps on particular bit combinations, either:
- on completed address generation (maybe)
- on input value to ALU operation (maybe)
- on output value from ALU (bad)
- on output fetched by load instruction (bad)

But we were not able to generate a specific-enough proposal for
something that was sure to be really useful, and could be reasonable to
implement.

In that round we did add the TRAP instructions for MIPS-II, but they
were primarily for ADA, although could be used elsewhere.

I mention this, because as often happens, people throw around ideas for
features without having much concern for *serious* implementation
issues.

Implementors would not be happy designing a mechanism:
- that seems only useful for a few specific cases
- that is easily handled by normal user code
- that introduces an extra new trap type, that requires especially
efficient handling, because it's expected to be used essentially
in-line, and not as an error indicator.
- and that may well introduce gate delays in critical paths, even if
it's minimal hardware.

As I've posted many a time, traps are *notorious* for causing
implementation bugs in hardware or software, so people do their best
not to introduce new flavors of them unless strong evidence is provided
that they are needed or are really worth it for performance.

There is a great deal of pushback in introducing features that might
add gate delays in awkward places, of which two are:
a) Something only computable on the *output* of an ALU operation
b) The result of a load operation

In many implementations, such paths may be among the critical paths.
Sometimes, the need to get a trap indication from an ALU, FP ALU,
Load/store unit to the instruction fetch unit may create a long wire
that causes serious angst, or yelling in design meetings.

Integer Overflow is one of a), but it's simple enough not to be likely
to add gate delays. Nevertheless, it is something determinable only
late in the cycle, so many ISA designers have chosen not to have it be
trappable. HP PA, MIPS, and Alpha designers all did choose the
minimalist approach, in which:
- There is no OVFL flag in the Condition Code ... because there is no
CC.
- There is a reasonably complete set of ADD / SUB operations, each with
2 flavors: arithmetic/signed and logical/unsigned. The former always
cause traps on overflow, the latter never do. Compilers generate the
latter for C unsigned, and for synthesis of complex addressing
arithmetic. This assumed that you wanted to make the normal case fast,
at the expense of needing multi-instruction sequences to get explicit
tests for overflow without trapping.
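
A minimal sketch of what such a multi-instruction sequence computes, written in C rather than in any particular assembly language: a wrapping add in the style of the logical/unsigned flavor, followed by an explicit sign comparison. The function names are hypothetical, and the exact instruction sequence a real compiler would emit is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrapping add plus an explicit signed-overflow test.
 * Overflow occurred iff the sum's sign differs from the sign of BOTH operands. */
static bool add_signed_no_trap(int32_t a, int32_t b, int32_t *sum)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t us = ua + ub;                            /* wraps modulo 2^32 */
    bool overflow = (((ua ^ us) & (ub ^ us)) >> 31) != 0;

    /* Conversion back is implementation-defined in C, two's complement
     * on mainstream compilers. */
    *sum = (int32_t)us;
    return overflow;
}

/* Unsigned carry-out needs only a single compare after the add. */
static bool add_unsigned_carry(uint32_t a, uint32_t b, uint32_t *sum)
{
    *sum = a + b;
    return *sum < a;
}
```

The extra XOR, AND and shift after the add are the cost referred to above: the common trapping case is a single instruction, and the explicit non-trapping test pays a few more.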

Anyway, this does show that it is possible, across a wide range of
designs, to detect integer Overflow in a timely fashion. Likewise,
it's even less time-constrained (as SPARC does) to trap on bit-tests in
the input values. Finally, a lot of floating-point trap tests can be
done on the input, or use the MIPS trick of examining the inputs and
stalling if it cannot be sure the operation will complete without trap
[discussed here earlier].

On the other hand, the kind of features that I described above in the
Kay discussion are much tougher. To be really useful, you'd want to have
something like:
- a mask register that specified which bits of a value should be
checked
- a compare register
- a flag to say whether to trap if equal or not equal, i.e.,:
    if (flag) {
        if ((value & mask) == (compare & mask)) trap();
    } else {
        if ((value & mask) != (compare & mask)) trap();
    }

You could do this with (value XOR compare) & mask, but in any case, you
still need a comparator tree somewhere.
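
The equivalence of the two formulations can be sketched in C as follows. The function and parameter names are hypothetical, and the functions return the trap decision instead of actually trapping. The final comparison against zero in the XOR form is the part that corresponds to the comparator tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Direct transcription of the two-branch check. */
static bool should_trap(uint32_t value, uint32_t mask,
                        uint32_t compare, bool trap_if_equal)
{
    bool equal = (value & mask) == (compare & mask);
    return trap_if_equal ? equal : !equal;
}

/* XOR form: the selected bits match iff (value XOR compare) & mask is zero. */
static bool should_trap_xor(uint32_t value, uint32_t mask,
                            uint32_t compare, bool trap_if_equal)
{
    bool equal = ((value ^ compare) & mask) == 0;   /* zero detect over 32 bits */
    return equal == trap_if_equal;
}
```

For example, with `mask = 0x3` and `compare = 0x1`, a value ending in binary 01 counts as equal, so it traps when the flag says "trap if equal" and passes otherwise.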

And then, you'd actually probably want several sets of
mask/compare/flag regs, and you'd need variants of operations that
would enable the checking. Note, of course, that

It *might* be plausible to use this for input operands, although no one
would appreciate the extra read ports/bus loads, but at least the
checks could go on in parallel with the ALU operation. Nobody would be
very happy about doing this on the output of the ALU or load
instructions.

This is a LOT of mechanism, and so needs serious justification ... and
as for doing counter comparisons, no way.

The closest real designs get to this sort of thing, or to
lower-overhead loop control are:

a) Special counter registers that help speed up branches, found in some
general ISAs.

b) Zero Overhead Loop (Buffers) found in some DSPs.

Anyway, it's pretty clear that relevant mechanisms were being discussed
~20 years ago, but nobody seems to have figured out features that
actually make implementation sense. I'd be delighted to see a
well-informed proposal that had sensible hardware/software
implementations and really helped LISP/Smalltalk/ADA and hopefully
other languages...
Jan Vorbrüggen
Guest
Posted: Tue Sep 13, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination

Quote:
The best general idea we could come up with was:
- A general low-overhead user-level trap mechanism (something I've
wished for many times for other reasons).

Later, you mention the MIPS TRAP instruction(s)...is that along this line?
What about the VAX's CHMx with x=U (which passes an additional constant
parameter to the trap routine)? Couldn't one have a fast user-mode trap,
and then make sure that an address-alignment trap was handled that way?
(That would also go some way towards handling unaligned operands quicker.)

Jan
Guest
Posted: Tue Sep 13, 2005 1:31 pm    Post subject: Re: interrupting for overflow and loop termination

John Mashey wrote:
Quote:
The best general idea we could come up with was:
- A general low-overhead user-level trap mechanism (something I've
wished for many times for other reasons).
Isn't a user-level trap mechanism effectively available in the upcoming
Vanderpool/Pacifica processors due at the end of this year/beginning of
next year? (Though I don't know how low-overhead it'll be.)
Since the processors are (supposedly) fully virtualizable, user-level
and supervisor-level code no longer have different views of the
processor. I.e. in contemporary non-virtualizable systems, user-level
code knows that it's running at user-level, and is designed to use only
user-level processor features (and the kernel running at
supervisor-level kills it if it misbehaves), but in a fully
virtualizable system, user-level code can be designed to run at
supervisor-level and use all processor features (including traps)
because it doesn't know (or care) that it's actually running at
user-level, and the kernel virtualizes the supervisor-level processor
features which the user-level code uses.
Although virtual machine monitors such as VMware do this, the critical
difference is that VMware running on standard x86 processors must
verify/interpret supervisor-level code running at user-level, so it's
only useful for coarse-grained virtualization (i.e. an entire virtual
machine running a standard OS), whereas a kernel running on the new
virtualizable x86 processors doesn't have to verify/interpret
supervisor-level code running at user-level since the processor traps
on all sensitive instructions, so it's useful for fine-grained
virtualization--i.e. regular programs, which run at user-level, can be
coded to run at supervisor level, and the virtualization has zero
performance impact except to the extent that those programs actually
use supervisor-level processor features.
VMware obviously is going to be written to take advantage of the new
processors, but my point here is that the host OS kernel could be
designed to take advantage of it too, to let even regular user-level
programs running on the host OS use supervisor-level features--e.g.
traps.
Iain McClatchie
Guest
Posted: Tue Sep 13, 2005 2:10 pm    Post subject: Re: interrupting for overflow and loop termination

Mash> There is a great deal of pushback in introducing features that
Mash> might add gate delays in awkward places, of which two are:
Mash> a) Something only computable on the *output* of an ALU
Mash> operation
Mash> b) The result of a load operation

Mash> In many implementations, such paths may be among the critical
Mash> paths. Sometimes, the need to get a trap indication from an
Mash> ALU, FP ALU, Load/store unit to the instruction fetch unit
Mash> may create a long wire that causes serious angst, or yelling
Mash> in design meetings.

Hmm... a feature that hangs some logic on the output of the ALU or
load pipe, and causes a pipe flush and IF retarget if the logic
detects some condition.

I don't think this is a problem, Mash. We're already doing this
for integer overflow and various floating-point exceptions. Suppose
for a moment that the additional complexity of the feature added a
pipe stage to this recurrence... in an OoO core, who cares? GPR
writeback is unaffected, you just have more logic writing to the tag
bits in the reorder buffer.

It's not like we're going to see one or more exceptions per 1000
instructions... right?

Now what would be very unpopular with the CPU guys would be
instructions that monkey around with the dataflow inside the ALU.
I skimmed the description of the Sparc tagged adds, but they
sounded like just the kind of thing I'd want to kick out of the
hardware, because getting data through the ALU really is the
common case.

Heck, I'd like to get rid of sign extension on loads. In an earlier
proposal, I wanted to bolt an ALU (including shifter) onto the end of
the load pipe, so that the op after the load could be scheduled with
the load in one go. The trouble is that raw pointer chasing is just
too popular, and you don't want the load pipe latency dinking back
and forth between two values.

Side note: earlier in this thread people seemed to be having trouble
with the difference between jumps/branches and exceptions. On OoO
CPUs, there is one relevant distinction: predicted versus nonpredicted
control flow. For instance, it might be totally reasonable for the
processor to predict TLB faults on certain load instructions, and
avoid the double pipe flush by predicting the exception.

So... exceptions to get out of loops is not changing the problem
that the core faces.

Now, a separate issue is how that control flow is encoded. It is
definitely the case that instruction fetch engines are having a
great deal of difficulty with all these branches. Once predicted,
verifying the predictions is actually not too bad, which is why trace
caches are so enticing.
John Mashey
Guest
Posted: Tue Sep 13, 2005 4:15 pm    Post subject: Re: interrupting for overflow and loop termination

Jan Vorbrüggen wrote:
Quote:
The best general idea we could come up with was:
- A general low-overhead user-level trap mechanism (something I've
wished for many times for other reasons).

Later, you mention the MIPS TRAP instruction(s)...is that along this line?
What about the VAX's CHMx with x=U (which passes an additional constant
parameter to the trap routine)? Couldn't one have a fast user-mode trap,
and then make sure that an address-alignment trap was handled that way?
(That would also go some way towards handling unaligned operands quicker.)

No, both are still kernel traps. What I've long wished for, but didn't
have a design early enough in MIPS-land was something reminiscent of
Alpha PALcode, or the IBM PPC mechanism for handling unaligned data,
but workable in user-state.

NOTE: WARNING: this post is different (from my usual), because it
describes something for which I do *not* have a design, just some
thoughts from over the years. I have warned many times that one has to
work out *all* the details, because the devil lies in their
interactions. Note that we never implemented anything like this,
because none of us could come up with a mechanism that we were sure was
widely useful and that didn't have implementation issues, and it was
clear, later on, that we couldn't graft this on top of existing MIPS
very well. It might be, that had we had more time in the original
design, that we could have included an appropriate mechanism ... or it
might be that there was no sensible design.

I wished for something general enough to:
a) Fix alignment errors, i.e., one would like to be able to run a
binary with/without alignment checking. [Recall that MIPS could handle
alignment errors, but needed a recompile to use LWL/LWR, etc].
b) Be able to trap unimplemented instructions, i.e., like
floating-point operations on original MIPS R2000, before the FPU was
available, or for machines that didn't have one, rather than doing
coprocessor-unusable traps. Also, one might do not-yet-implemented
instructions, like sqrt (which was not there in MIPS-I, but added
later). One might consider doing integer mul/div this way, where some
designs had them, and some didn't.
c) Likewise, support for parts of IEEE FP that one didn't want to do in
hardware.
d) Tagged-trap support for LISP, Smalltalk, etc.
e) Other user-desired or more likely, language-system desired features

One can summarize these into several groups:
A) Managing binary compatibility across a family whose implemented
features vary. Note that a good mechanism would let you run binaries
with new instructions on old systems, given the right emulation code.
B) Handling cases where most of the time, simple hardware can do the
right thing, but in a small fraction of the cases, driven by data, one
needs to do something else, but it's not a fault in the usual sense,
i.e., an error from which immediate recovery is unlikely.

A) includes b) and c). B) includes d) and e). a) has characteristics
of both A) and B).


All of these are for needs where:
a) You want to execute code straightforwardly in the protection/address
context of the user program.
b) You want to keep overhead "low enough".
c) You want user-level programming flexibility

This would probably require:
a) A bunch of extra registers to record the location and nature of the
trapped instruction. The location piece is probably not too bad. The
"nature" piece wants to crack the instruction, its inputs, and outputs
into a useful form. [I.e., akin to the way MIPS utlbmiss TRAP sets up
registers with values in useful places to lessen the code path for TLB
refill].

This can be a lot: read about the PPC's "DSISR" register to help
alignment-code fixups.

One would prefer not to have to refetch and interpret the instruction
completely in common cases.

b) A few regular registers reserved for the use of the trap code.
These could be regular registers [the way that MIPS reserves two
registers that the kernel can trash whenever it wants], or they could
be extra new ones.

c) Probably, for reasonable code, one needs mechanisms to fetch input
operands and get outputs back to the right place(s). In typical RISCs,
input operands might be able to be presented in a pair of special
registers. Getting the output back to the right register may take some
work, including something like an indirect register specifier or an
S/360 EXecute instruction.

d) There is a lot of software-convention work to be done.

e) One has to decide what to do about asynchronous interrupts and
further exceptions, and the extent to which a user-level trap routine
has access to features beyond normal user code. Such routines
certainly cannot be allowed to block external interrupts arbitrarily.

In general, there are a lot of details and their interactions to get
right, and there is a lot of tension between "general-enough" and "more
extra hardware and complexity than is worth it."
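
To make requirements a) through c) a little more concrete, here is a purely illustrative C model of a user-level handler that fixes up an unaligned load. Nothing in it corresponds to a real ISA: `trap_info` stands in for the hypothetical extra registers that record the location and pre-decoded nature of the trapped instruction, `dest_reg` plays the role of the indirect register specifier, and the little-endian 4-byte load and fixed 4-byte instruction length are assumptions made only for this sketch.

```c
#include <stdint.h>

enum trap_kind { TRAP_UNALIGNED_LOAD, TRAP_TAG_CHECK };   /* hypothetical codes */

/* Hypothetical "extra registers" filled in by hardware when the trap is taken. */
struct trap_info {
    enum trap_kind kind;
    uintptr_t      pc;         /* location of the trapped instruction */
    const void    *address;    /* effective address of the memory access */
    unsigned       dest_reg;   /* pre-decoded destination register number */
    unsigned       size;       /* access size in bytes */
};

/* User-visible state the handler is allowed to touch. */
struct user_context {
    uint32_t  gpr[32];
    uintptr_t pc;
};

/* Emulates a 4-byte little-endian load from any alignment, writes the result
 * through the register specifier, and steps past the trapped instruction.
 * Returns 0 if handled, -1 if the trap has to be passed on. */
static int user_trap_handler(struct user_context *ctx, const struct trap_info *ti)
{
    if (ti->kind != TRAP_UNALIGNED_LOAD || ti->size != 4 || ti->dest_reg >= 32)
        return -1;

    const unsigned char *p = ti->address;
    ctx->gpr[ti->dest_reg] = (uint32_t)p[0]
                           | ((uint32_t)p[1] << 8)
                           | ((uint32_t)p[2] << 16)
                           | ((uint32_t)p[3] << 24);
    ctx->pc = ti->pc + 4;      /* assumes fixed 4-byte instructions */
    return 0;
}
```

Because the destination register and effective address arrive pre-decoded, the handler never has to refetch and interpret the instruction. The sketch leaves out items d) and e) entirely: software conventions, nested exceptions and asynchronous interrupts.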
JJ
Guest
Posted: Wed Sep 14, 2005 12:15 am    Post subject: Re: interrupting for overflow and loop termination

John Mashey wrote:
Quote:
David Hopwood wrote:
andrewspencers@yahoo.com wrote:
Terje Mathisen wrote:

A slightly different situation is where you have code that in practice
always handles integers that fit in a single word, but that can't be
statically guaranteed to do so, and the language specification says that
bignum arithmetic must be supported -- the obvious example being Smalltalk.
There were some attempts to support this in hardware (e.g. "Smalltalk on
a RISC"; also something on SPARC that I can't remember the details of),
but it turned out to be easier and faster for implementations of Smalltalk
and similar languages to use other tricks that don't require hardware support.


snipping

Quote:

Anyway, it's pretty clear that relevant mechanisms were being discussed
~20 years ago, but nobody seems to have figured out features that
actually make implementation sense. I'd be delighted to see a
well-informed proposal that had sensible hardware/software
implementations and really helped LISP/Smalltalk/ADA and hopefully
other languages...

I suspect that in current single-threaded processor designs, clocked to
the max, with the current cache model, such a proposal would be hard to
come by and to justify, especially when the memory wall forces such
extreme locality of reference and so many wait states.

A processor designed solely around communicating sequential processes
running on multiple MTAs can fairly well hide memory latency (well
known).

By sharing a high issue rate RLDRAM with say 200M-400M interleaved load
stores per sec driven by a nice hash box to destroy all locality of
reference from numerous PE requests, and to reduce bank collisions to
random chance, object support comes naturally. The hashing takes 32b
Object-MMU IDs and hashes with 32b linear index to the particular PA
size. Object IDs are generated by new[] using a PRNG. MMU IDs are
enumerated at boot time over Links. A 32MByte RLDRAM can appear to
store up to 1M single line objects, more typically <<100k objects of all
types and sizes. By trading space for rehashes, performance can be kept
good. Message object IDs are passed around through channels synchronized
by !,?. Besides occam support, ADA, Lisp, Smalltalk support comes to
mind all the time.

Object support in hardware to a very fine grain level (32 byte pages or
lines) with full protections of all object lines. It makes lists,
sparse arrays, hash tables a snap, all fit right on top of each other
all Mashed up as long as memory is <say 70% full. The MMU model can be
tested out in a compiler for its own object store but this test is only
single threaded.

For more performance the scheme can be replicated at lower <ns and
higher 50ns levels for raw flat memory throughput or volume. At the sub
ns level, it allows say 16 way interleaved N cycle concurrent SRAM
banks to appear to have performance of MMU issue box even with
relatively slow SRAMs (or maybe even 5ns DRAM). At the other end, the
SDRAM controller has little throughput but latency is only a few times
that of RLDRAM.

Take your pick: wait states from a few huge processors, or numerous hardware
threads from many simple processors. I'll take many threads anytime. In
this scheme it's the MMU that's really interesting; the PEs are just
little grunt boxes to generate enough memory requests to keep the MMU near
100%. Even the PE ISA doesn't matter much: a 486 RISC ISA would work as
well as anything else with the extra par support.

Anyway I will describe it at cpa2005 for anyone interested

johnjakson at usa dot ...
Seongbae Park
Guest
Posted: Wed Sep 14, 2005 2:51 am    Post subject: Re: interrupting for overflow and loop termination

John Mashey <old_systems_guy@yahoo.com> wrote:
....
Quote:
You hardware guys are all alike [in hating sign-extension on loads]
:-).

I haven't met a hardware guy who likes that, either.

Quote:
We seriously looked at various schemes found elsewhere, i.e., where one
loads zero-extended partial-word data, and then uses an explicit EXT to
sign-extend. We had enough data to prefer having both zero-extend and
sign-extend as operations, and if push had really come to shove, I
would have lived with an explicit EXT, although having done 68K
compiler work, and dealt with some of the funny optimization hassles
(i.e., can one get correct results without the EXT, sometimes?)
I certainly preferred to have the signed-load opcodes as first choice.
My second choice would have been 2-cycle load-signeds.

Well, if the sign-extend version takes more cycles than zero-extend
- I suppose your second choice meant such a case -
it creates the same funny optimization hassle
and such an optimization accompanies occasional bug reports that cry wolf
over the zero-extend load that correctly replaced sign-extend load
("It's a signed char in my code.
Why is the compiler using a zero-extend load ?
The compiler must be buggy!").

And since ISAs usually don't define exact cycle counts, nor do they require
two operations to take the same number of cycles or issue/execution/etc. resources,
implementations of ISAs that have both versions
tend to take an extra cycle for the sign-extend load.

Quote:
Third choice was the explicit EXT.
--

#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
John Mashey
Guest
Posted: Wed Sep 14, 2005 5:43 am    Post subject: Re: interrupting for overflow and loop termination

Iain McClatchie wrote:
Quote:
Mash> There is a great deal of pushback in introducing features that
Mash> might add gate delays in awkward places, of which two are:
Mash> a) Something only computable on the *output* of an ALU
Mash> operation
Mash> b) The result of a load operation

Mash> In many implementations, such paths may be among the critical
Mash> paths. Sometimes, the need to get a trap indication from an
Mash> ALU, FP ALU, Load/store unit to the instruction fetch unit
Mash> may create a long wire that causes serious angst, or yelling
Mash> in design meetings.

Hmm... a feature that hangs some logic on the output of the ALU or
load pipe, and causes a pipe flush and IF retarget if the logic
detects some condition.

I don't think this is a problem, Mash. We're already doing this
for integer overflow and various floating-point exceptions. Suppose
for a moment that the additional complexity of the feature added a
pipe stage to this recurrence... in an OoO core, who cares? GPR
writeback is unaffected, you just have more logic writing to the tag
bits in the reorder buffer.

Of course (i.e., it might not matter in an OoO), but you may have
missed the careful weasel-words "In many implementations". After all,
of the horde of distinct pipeline implementations that have ever
existed, only a tiny fraction are OoO...

For what it's worth, there was some argument about this (overflow in
R2000) in 1985, because it was literally the *only* integer exception
that needed to be detected after the ALU stage, and in time to inhibit
register writeback, and somebody was worried about a possible extra
delay for a while.

Quote:
Now what would be very unpopular with the CPU guys would be
instructions that monkey around with the dataflow inside the ALU.
I skimmed the description of the Sparc tagged adds, but they
sounded like just the kind of thing I'd want to kick out of the
hardware, because getting data through the ALU really is the
common case.
Again, I don't think the SPARC tagged ops are so bad, because they just
look at two bits each of the two inputs, so one can detect the trap
early.

Quote:

Heck, I'd like to get rid of sign extension on loads. In an earlier
proposal, I wanted to bolt an ALU (including shifter) onto the end of
the load pipe, so that the op after the load could be scheduled with
the load in one go. The trouble is that raw pointer chasing is just
too popular, and you don't want the load pipe latency dinking back
and forth between two values.

You hardware guys are all alike [in hating sign-extension on loads]
:-).
We seriously looked at various schemes found elsewhere, i.e., where one
loads zero-extended partial-word data, and then uses an explicit EXT to
sign-extend. We had enough data to prefer having both zero-extend and
sign-extend as operations, and if push had really come to shove, I
would have lived with an explicit EXT, although having done 68K
compiler work, and dealt with some of the funny optimization hassles
(i.e., can one get correct results without the EXT, sometimes?) I
certainly preferred to have the signed-load opcodes as first choice.
My second choice would have been 2-cycle load-signeds. Third choice
was the explicit EXT.
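
The three options can be illustrated in C for a byte load. This is only a sketch of the data transformations involved, not of any particular ISA's opcodes or timing. The names `load_u8`, `ext_s8`, `load_s8` and `low_nibble` are hypothetical.

```c
#include <stdint.h>

/* Zero-extending byte load: the partial-word load without sign handling. */
static uint32_t load_u8(const uint8_t *p)
{
    return *p;
}

/* Explicit EXT applied afterwards: copy bit 7 into bits 8..31. */
static int32_t ext_s8(uint32_t x)
{
    x &= 0xFFu;
    /* Flip the sign bit, then subtract its weight. The conversion to int32_t
     * is implementation-defined in C, two's complement on common compilers. */
    return (int32_t)((x ^ 0x80u) - 0x80u);
}

/* Signed-load opcode: both steps fused into a single operation. */
static int32_t load_s8(const uint8_t *p)
{
    return ext_s8(load_u8(p));
}

/* A case where the EXT is redundant: only the low bits are consumed,
 * so the zero-extended and sign-extended values give the same answer. */
static uint32_t low_nibble(uint32_t loaded)
{
    return loaded & 0x0Fu;
}
```

For the byte 0xF3, `load_u8` gives 243 and `load_s8` gives -13, yet `low_nibble` returns 3 for both. That is the kind of situation behind the optimization question of whether the EXT can sometimes be dropped.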
Jan Vorbrüggen
Guest
Posted: Wed Sep 14, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination

Quote:
Later, you mention the MIPS TRAP instruction(s)...is that along this line?
What about the VAX's CHMx with x=U?
No, both are still kernel traps.

Hmmmm...ISTR that CHMU was dispatched through a kernel-owned vector page,
but executed in the mode of its caller, i.e., user mode if called from
user mode. You do get an additional indirection through the process-
specific vector to get at the process-specific handler (compared to the
absolutely minimal implementation), but that seems a good tradeoff
compared to dragging along an additional word of process state at each
context switch, or doing the load lazily.

[excellent discussion of design issues snipped]

Quote:
a) A bunch of extra registers to record the location and nature of the
trapped instruction. The location piece is probably not too bad. The
"nature" piece wants to crack the instruction, its inputs, and outputs
into a useful form.

The latter would be nice, but not an absolute requirement, e.g., for UUOs.
Of course, it would make sense to make the information that is already
present in the pipeline (e.g., the aligned load address and the offset bits)
available if possible. However, IMO the emphasis is on reducing the
overhead of the trap substantially by staying in the caller's mode, and if that
means there is some case-specific logic in the handler, so be it.
Facilitating dispatch to the responsible subhandler by providing some of the
instruction bits suitable to that purpose in an easily accessible way
would be a big advantage, I suspect.
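
As a rough illustration of that kind of dispatch, the C sketch below indexes a table of subhandlers by the top bits of the trapped instruction word. Everything here is an assumption made for the example: the `trap_info` record, the 6-bit opcode field in bits 31..26, and the handler names are hypothetical, not a description of any existing ISA's trap interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record the hardware would hand to a user-mode trap handler. */
typedef struct {
    uint32_t insn;   /* the trapped instruction word */
    uint32_t addr;   /* effective address, if the instruction had one */
} trap_info;

typedef int (*subhandler)(const trap_info *);

enum { OPCODE_BITS = 6, OPCODE_COUNT = 1 << OPCODE_BITS };

/* One slot per opcode value; NULL means "no subhandler registered". */
static subhandler handler_table[OPCODE_COUNT];

/* Placeholder subhandlers; the return codes are arbitrary tags. */
int handle_unaligned_load(const trap_info *t) { (void)t; return 1; }
int handle_unimplemented(const trap_info *t)  { (void)t; return 2; }

void register_subhandler(unsigned opcode, subhandler h) {
    assert(opcode < OPCODE_COUNT);
    handler_table[opcode] = h;
}

/* Dispatch on the opcode field; returns -1 if nobody claims the trap. */
int dispatch_trap(const trap_info *t) {
    unsigned opcode = t->insn >> (32 - OPCODE_BITS);
    subhandler h = handler_table[opcode];
    return h ? h(t) : -1;
}
```

If the hardware exposed the opcode field directly, the shift in `dispatch_trap` would disappear and the dispatch would be a single indexed load and indirect jump, which is the saving being argued for.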

Quote:
b) A few regular registers reserved for the use of the trap code.
These could be regular registers [the way that MIPS reserves two
registers that the kernel can trash whenever it wants], or they could
be extra new ones.

Shadow registers. Can we make the assumption that such traps cannot be
nested? Otherwise, the shadow registers become a part of context-switch
state, which again could be saved/restored lazily but is a bother.

Quote:
c) Probably, for reasonable code, one needs mechanisms to fetch input
operands and get outputs back to the right place(s). In typical RISCs,
input operands might be able to be presented in a pair of special
registers.

Isn't that kind of stuff already done explicitly (i.e., without support
by the processor) in PAL routines? A regular instruction layout will
facilitate this substantially.

Quote:
e) One has to decide what to do about asynchronous interrupts and
further exceptions, and the extent to which a user-level trap routine
has access to features beyond normal user code. Such routines
certainly cannot be allowed to block external interrupts arbitrarily.

This one is a bother, and certainly hard to get right. PAL-like restrictions
might apply, at least to some degree. Likely need tool support to get this
right in the specification.

Quote:
In general, there are a lot of details and their interactions to get
right, and there is a lot of tension between "general-enough" and "more
extra hardware and complexity than is worth it."

Quite. But given the pain with the issues you so nicely outlined in the
snipped part, it appears to me it should be worth the effort. OTOH,
grafting this onto an existing ISA is probably HARD, and where are the new
ISA development projects 8-)?

Jan
Nick Maclaren
Guest





Posted: Wed Sep 14, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

In article <1126658582.506368.173210@g43g2000cwa.googlegroups.com>,
John Mashey <old_systems_guy@yahoo.com> wrote:
Quote:

For what it's worth, there was some argument about this (overflow in
R2000) in 1985, because it was literally the *only* integer exception
that needed to be detected after the ALU stage, and in time to inhibit
register writeback, and somebody was worried about a possible extra
delay for a while.

Why on earth was that? I.e. why should it need to inhibit register
writeback? MIPS is twos complement, and the only real advantage of
that is that it enables writeback and overflow flagging to be done
in either order.

If the architecture specified that writeback did not occur if overflow
occurred, then the designers weren't thinking about that aspect. It
isn't as if it wasn't an ancient problem, after all.


Regards,
Nick Maclaren.
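
One way to read the point about ordering, sketched in C: with two's complement wraparound, the overflow condition can be computed from the operands and the already-written sum, and an overwritten operand can be recovered by subtracting the other one. This is an illustration under the assumption that at least one source operand is still available after writeback; it is not a description of how the R2000 pipeline handles it.

```c
#include <assert.h>
#include <stdint.h>

/* What the ALU writes back: the sum modulo 2^32. */
uint32_t add_wrap(uint32_t a, uint32_t b) {
    return a + b;
}

/* Signed overflow occurred iff both operands differ in sign from the sum. */
int add_overflowed(uint32_t a, uint32_t b, uint32_t sum) {
    return (int)(((a ^ sum) & (b ^ sum)) >> 31);
}

/* If the destination overwrote operand a, wrapping subtraction restores it. */
uint32_t recover_operand(uint32_t sum, uint32_t b) {
    return sum - b;
}
```

So, in this model, flagging the overflow after the register has been written loses no information, as long as the handler can still see the other operand.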
John Mashey
Guest





Posted: Wed Sep 14, 2005 3:09 pm    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Seongbae Park wrote:
Quote:
John Mashey <old_systems_guy@yahoo.com> wrote:
...
You hardware guys are all alike [in hating sign-extension on loads]
:-).

Well, if the sign-extend version takes more cycles than zero-extend
- I suppose your second choice meant such a case -
it creates the same funny optimization hassle
and such an optimization accompanies occasional bug reports that cry wolf
over the zero-extend load that correctly replaced sign-extend load
("It's a signed char in my code.
Why is the compiler using a zero-extend load ?
The compiler must be buggy!").

Yes, but the complaints are much worse when people disassemble code and
see a bunch of EXTs that are clearly unnecessary, i.e., visible
instructions almost always get more attention/flak/whinging than slow
instructions, unfortunately. I spent some time tuning a 68K compiler
years ago at Convergent, and this kind of thing came up, and it wasn't
trivial to fix at the time, and get it right, at least in pcc.
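
A small C sketch of why some EXTs are "clearly unnecessary" while others are not. It assumes a 32-bit register model and uses hypothetical function names: when only the low byte of a sum is stored back, sign- and zero-extended inputs give the same stored byte, but a full-width sign test still needs the extension.

```c
#include <assert.h>
#include <stdint.h>

/* Explicit EXT on a zero-extended byte: copy bit 7 into bits 8..31. */
uint32_t ext8(uint32_t r) {
    assert(r <= 0xFFu);
    return (r ^ 0x80u) - 0x80u;
}

/* c = a + b on chars, keeping the EXT on both inputs. */
uint8_t add_store_with_ext(uint8_t a, uint8_t b) {
    return (uint8_t)(ext8(a) + ext8(b));
}

/* The same statement with both EXTs dropped. */
uint8_t add_store_without_ext(uint8_t a, uint8_t b) {
    return (uint8_t)((uint32_t)a + (uint32_t)b);
}

/* Counts input pairs for which dropping the EXT changes the stored byte. */
int count_store_mismatches(void) {
    int mismatches = 0;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            if (add_store_with_ext((uint8_t)a, (uint8_t)b) !=
                add_store_without_ext((uint8_t)a, (uint8_t)b))
                mismatches++;
    return mismatches;
}

/* A full-width sign test, by contrast, depends on the extension. */
int is_negative_with_ext(uint8_t a)    { return (int)(ext8(a) >> 31); }
int is_negative_without_ext(uint8_t a) { return (int)((uint32_t)a >> 31); }
```

The compiler has to prove which of the two situations it is in before dropping the EXT, which is the part that was not trivial to get right.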
John Mashey
Guest





Posted: Wed Sep 14, 2005 3:22 pm    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Seongbae Park wrote:

Quote:
I think LISP/Smalltalk/ADA market is just too small to justify
adding any significant change in the general purpose ISA,
unless this yet-to-be-invented mechanism is easy and cheap
to implement or it is for some other purpose which happened to
help them (like a fast user-level trap).

Well, that's why we never did it. We certainly couldn't justify
expensive features for that market, but we hoped to find modest useful
ones that might be general enough to have other uses as well. Maybe if
we could have afforded another 6 months to do the original MIPS-I ISA,
we might have thought of something reasonable, but after that, it was
probably too late. Nothing very complex would have fit in the R2000 in
any case, although I would have given up a few TLB entries had we
gotten a good solution here.
Scott A Crosby
Guest





Posted: Thu Sep 15, 2005 12:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

On 13 Sep 2005 08:33:17 -0700, "John Mashey" <old_systems_guy@yahoo.com> writes:

Quote:
I wished for something general enough to:

a) Fix alignment errors, i.e., one would like to be able to run a
binary with/without alignment checking. [Recall that MIPS could handle
alignment errors, but needed a recompile to use LWL/LWR, etc].

b) Be able to trap unimplemented instructions, i.e., like
floating-point operations on original MIPS R2000, before the FPU was
available, or for machines that didn't have one, rather than doing
coprocessor-unusable traps. Also, one might do not-yet-implemented
instructions, like sqrt (which was not there in MIPS-I, but added
later). One might consider doing integer mul/div this way, where some
designs had them, and some didn't.

c) Likewise, support for parts of IEEE FP that one didn't want to do in
hardware.

A) Managing binary compatibility across a family whose implemented
features vary. Note that a good mechanism would let you run binaries
with new instructions on old systems, given the right emulation code.
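
For item a), the work an alignment fix-up handler would have to do is roughly the following, shown as a C sketch over a simulated memory. The `machine` struct, the little-endian byte order, and the `fixups` counter are assumptions made for illustration; a real handler would also have to decode the trapped instruction and write the destination register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated memory plus a count of loads that needed fixing up. */
typedef struct {
    const uint8_t *mem;
    size_t size;
    int fixups;      /* stands in for the number of alignment traps taken */
} machine;

int is_word_aligned(size_t offset) {
    return (offset & 3u) == 0;
}

/* Assemble a little-endian 32-bit word one byte at a time. */
uint32_t load_word_bytewise_le(const uint8_t *mem, size_t offset) {
    return (uint32_t)mem[offset]
         | (uint32_t)mem[offset + 1] << 8
         | (uint32_t)mem[offset + 2] << 16
         | (uint32_t)mem[offset + 3] << 24;
}

/* A load that "traps" on misalignment and is completed by the fix-up path. */
uint32_t load_word(machine *m, size_t offset) {
    assert(offset + 4 <= m->size);
    if (!is_word_aligned(offset))
        m->fixups++;             /* in hardware: trap into the handler */
    return load_word_bytewise_le(m->mem, offset);
}
```

The same binary then runs either way: aligned loads never reach the handler, and misaligned ones cost a trap each instead of faulting.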

About a month ago, during a discussion on mul/div on SPARC, someone
here suggested what I thought was a cute technique for doing
this. What happens is when the CPU tries to run an illegal instruction
and traps, the kernel backpatches the executable to jump to an
appropriate emulation routine. The compiler is required to always
follow such a not-universally-implemented instruction with enough
no-ops so there's always room for the back-patch. However, if the
binary is targeted only for hardware with the instruction, the
compiler isn't required to generate the no-ops.

The ABI is such that all binaries are linked with an appropriate
emulation library for the kernel to backpatch jumps to point to. The
no-op space overhead might be reduced if the ISA included a special
save&jump instruction designed for this purpose.

On old hardware there's no loss in performance, and the kernel only
gets involved with one trap once for each instruction, not once for
each execution of an unsupported instruction. And on new hardware the
cost is a few extra no-ops. Software targeting new hardware only
doesn't even pay the no-op overhead.

Scott
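
The backpatching scheme can be modelled with a toy interpreter in C. The instruction set, the `cpu` struct, and the two-slot patch sequence (`OP_SETARG` followed by `OP_CALL_EMUL`) are all invented for this sketch; the point is only to show that the trap is taken once per instruction site, not once per execution, and that the patch needs the no-op slot the compiler left behind.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_ADDI, OP_MUL, OP_NOP, OP_SETARG, OP_CALL_EMUL, OP_HALT } opcode;
typedef struct { opcode op; int arg; } insn;

typedef struct {
    int acc;       /* accumulator */
    int scratch;   /* operand slot used by the patched sequence */
    int has_mul;   /* 1 on "new" hardware that implements OP_MUL */
    int traps;     /* illegal-instruction traps taken so far */
} cpu;

/* Emulation routine from the library every binary is linked against. */
static int emulate_mul(int a, int b) {
    assert(b >= 0);                    /* toy version: non-negative only */
    int r = 0;
    for (int i = 0; i < b; i++) r += a;
    return r;
}

/* Kernel side: rewrite MUL + NOP into SETARG + CALL_EMUL in place. */
static void kernel_backpatch(insn *code, size_t pc) {
    assert(code[pc + 1].op == OP_NOP); /* compiler must have left room */
    code[pc].op = OP_SETARG;           /* keeps the original operand */
    code[pc + 1].op = OP_CALL_EMUL;
}

void run(cpu *c, insn *code) {
    size_t pc = 0;
    for (;;) {
        insn in = code[pc];
        switch (in.op) {
        case OP_ADDI:      c->acc += in.arg; pc++; break;
        case OP_NOP:       pc++; break;
        case OP_SETARG:    c->scratch = in.arg; pc++; break;
        case OP_CALL_EMUL: c->acc = emulate_mul(c->acc, c->scratch); pc++; break;
        case OP_HALT:      return;
        case OP_MUL:
            if (c->has_mul) { c->acc *= in.arg; pc++; }
            else { c->traps++; kernel_backpatch(code, pc); } /* retry same pc */
            break;
        }
    }
}
```

Running the same program twice on a `cpu` with `has_mul` set to 0 takes one trap in total: the second run executes the patched sequence directly.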
 
All times are GMT