interrupting for overflow and loop termination
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
 FAQFAQ   MemberlistMemberlist     RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Google
 
Web castalk.com
interrupting for overflow and loop termination
Goto page Previous  1, 2, 3, 4, 5, 6, 7  Next
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture
Author Message
Terje Mathisen
Guest





Posted: Mon Sep 19, 2005 9:14 pm    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Andy Freeman wrote:

Quote:
Terje Mathisen wrote:

Eliot Miranda wrote:

highest dynamic frequency for integer math is likely to be for iteration.

Right, it is hard to imagine how it could be otherwise.


Most checks on iteration vars can be pulled out of the loop. In most
of the
others, it can be done every <large constant> iterations.

I.e. all of the JIT optimizations possible for a Java implementation to
catch overflows, can also be used to convert those variables to bigints.

By reserving the all-zero tag for integers you can even use regular
ADD/SUB operations, and MUL/DIV with just a shift fixup.

Is this how current implementations do it on x86?

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
John Mashey
Guest





Posted: Tue Sep 20, 2005 12:06 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Terje Mathisen wrote:
Quote:
glen herrmannsfeldt wrote:

This would be a _very_ special case, since it would require a loop with
only a single int variable, or at least you must be able to prove that
only this int can ever overflow. In that case you can of course skip the
overflow testing.

If your architecture can do this, then by all means use an inline
branch-and-link instead of the INTO style check!

But how much cost is there in not being able to retire the instruction
until the overflow status is known? Assuming the usual out of order,
but in order retire model.

Not much: Processing can easily continue for at least one or two cycles,
forwaring the preliminary result to the next user, and as long as the
vast majority of these ints won't actually overflow, the delayed
retirement does not correspond to any actual branch misses.

Then again, maybe all you need is a sticky overflow bit. You could
do some set of calculations and test once at the end for any overflow,
and clear the sticky overflow bit at that time.

Nice idea. If overflows are really rare, then it would be a win to redo
the entire loop to save testing on every operation.

People have sometimes used sticky overflow bits. To me, the hierarchy
is:
a) Precise exceptions: you know exactly where the exception occurred.
b) Sticky bits, you don't know exactly where, but get some bound.
c) No sticky-bits, explicit tests required everywhere, and hence, in
many cases, the test are omitted for speed and code size.

As a software guy, I am, of course, biased in favor of a). As a
software/hardware guy, I also realize that sometimes one may have to
compromise, but one should never be *looking* for reasons to compromise
on a) - at most, one might reluctantly admit that introducing some
imprecision is the lesser evil, grit one's teeth and bear it.

At MIPS, I (and the rest of the OS group) were among the loudest voices
in demanding precise exceptions everywhere, most of us having had to
deal with weird cases and hardware bugs and related software bugs too
many times in past lives. In the long run, this did turn out to help
chip verification as well.

To be more specific, we always wished for:

When a user-level instruction X caused an exception, and a trap to the
OS:

a) The Exception Program Counter (EPC) was set to point at the
instruction.
b) The CAUSE register was set to indicate the reason.
c) All instructions before X had been completed (or equivalent).
d) X itself had had no user-visible side-effects, i.e., there were no
partial stores, no register writebacks, no auto-increments, no shadow
registers to be distentangled, i.e., X had no effect except to cause
the trap.

We didn't quite get that, as:
a) and b): if X was in a Branch Delay slot, the EPC pointed at X-4, and
the BD-bit was set in the CAUSE register. Although this was soemtiems
a pin, and proved to be a source of designer complaint in some later
implementations, software people viewed it as tolerable, even thought
it sometimes meant messy debugger code and trickier emulation code (as
in floating poitn emulation done on systems lacking FPUs).

c) This wasn't true from the hardware viewpoint, but the CPU provided
this illusion to the programmer, so that was viewed as OK.
Specifically, consider the
sequence:

Y: (long-running FP operation, or integer multipley/divide)
....
X, causing trap.

At the time of the trap Y, might still be running in an independent
fucntional unit. However, this was all carefully defined so that:

Either the instruction was defined so that it could not ever cause an
exception (in MIPS, the divide doesn't raise a divide-by-zero; instead,
if the compiler sees variable1/varaible2, and can't be sure variable2
is non-zero, it generates code like:
DIV rs,rt
BEQZ rt,divide-by-zero
....

In the floating point cases, the insistence on precise exceptions ended
up encouraging the innovative "check the exponent fields quickly and
stall the CPU if there is any chance of an exception" patent of Tom
Riordan's that I've mentioned before.

In any of these cases, a reference to a register that was to ber
written by one fo these asynchronous units simply caused a stall.
Interrupt code woudl save away the regular integer registers first, by
which time any lignering MUL/DIV or FP ops would likely have completed.

Of course, a "fast-user-trap" feature would have to modify some of the
above, and I still wish we'd had time to think hard enough about that
early enough [2Q85].

Anyway, the message of all this is:

Start with the goal of keeping the exception model simple and precise.
Only back off reluctantly. I've posted many a time that dealing with
exceptions and their weird interactions has long been a source of
problems, as even experienced humans miss things. Although there are
some things about MIPS that I'd do differently if I could do it over,
THIS wasn't one of them.
Back to top
Stephen Fuld
Guest





Posted: Tue Sep 20, 2005 12:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

"John Mashey" <old_systems_guy@yahoo.com> wrote in message
news:1127156812.929231.226160@f14g2000cwb.googlegroups.com...

snip

Quote:
People have sometimes used sticky overflow bits. To me, the hierarchy
is:
a) Precise exceptions: you know exactly where the exception occurred.
b) Sticky bits, you don't know exactly where, but get some bound.
c) No sticky-bits, explicit tests required everywhere, and hence, in
many cases, the test are omitted for speed and code size.

Is it totally ridiculous to have a design that kept the instruction address
"with" the instruction as it is executed? Then, no matter in what order the
instructions were executed,or the state of other instructions or parts of
the CPU, the CPU could report the exact address of the instruction causing
the exception. It would certainly require an additional internal register
(to hold the instruction address) within each FU and additional space in the
renameing table, etc. but they could be written in parallel with the other
stuff that had to be written so it may not cause any additional time. The
only time these registers would be read is upon exeptions.

--
- Stephen Fuld
e-mail address disguised to prevent spam
Back to top
glen herrmannsfeldt
Guest





Posted: Tue Sep 20, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Stephen Fuld wrote:
(snip regarding overflow and possibly imprecise interrupts)

Quote:
Is it totally ridiculous to have a design that kept the instruction address
"with" the instruction as it is executed? Then, no matter in what order the
instructions were executed,or the state of other instructions or parts of
the CPU, the CPU could report the exact address of the instruction causing
the exception. It would certainly require an additional internal register
(to hold the instruction address) within each FU and additional space in the
renameing table, etc. but they could be written in parallel with the other
stuff that had to be written so it may not cause any additional time. The
only time these registers would be read is upon exeptions.

With large address space machines that is a lot of extra state to
hold, but it could be done.

The only machine I know in some detail with imprecise interrupts is
the 360/91. It was supposed to fit within the S/360 interrupt model
which only has a place for one address, which was the address of the
next instruction to be executed. When an interrupt happens all
partially executed instructions must be finished before the interrupt
is taken. This can result in more exceptions, resulting in the dreaded
multiple imprecise interrupt. There are indicators for which exceptions
occurred, but not how many of each or where.

-- glen
Back to top
John Mashey
Guest





Posted: Tue Sep 20, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Stephen Fuld wrote:
Quote:
"John Mashey" <old_systems_guy@yahoo.com> wrote in message
news:1127156812.929231.226160@f14g2000cwb.googlegroups.com...

snip

People have sometimes used sticky overflow bits. To me, the hierarchy
is:
a) Precise exceptions: you know exactly where the exception occurred.
b) Sticky bits, you don't know exactly where, but get some bound.
c) No sticky-bits, explicit tests required everywhere, and hence, in
many cases, the test are omitted for speed and code size.

Is it totally ridiculous to have a design that kept the instruction address
"with" the instruction as it is executed? Then, no matter in what order the
instructions were executed,or the state of other instructions or parts of
the CPU, the CPU could report the exact address of the instruction causing
the exception. It would certainly require an additional internal register
(to hold the instruction address) within each FU and additional space in the
renameing table, etc. but they could be written in parallel with the other
stuff that had to be written so it may not cause any additional time. The
only time these registers would be read is upon exeptions.

Of course not (totally ridiculous, that is), as PCs have to be kept
around (in some form or other) anyway, but implementation always
matters.

1) As a reminder, not all CPU designs are complex speculative OOO CPUs.
In fact, only a tiny fraction of distinct CPU designs are such. Some
extra stuff doesn't cost much in a big OOO CPU, but percentage-wise, it
may cost more in a simple pipeline or even a relatively simple 2-issue
in-order superscalar.

In some ways, an OOO design handles exceptions "easier" than the other
designs, in that most instructions are executed speculatively anyway,
and one needs all of the mechanisms for in-order graduation and
unwinding mispredicted code anyway. But consider tjhe (common) designs
with the following characteristics:
- in-order issue
- multiple functional units, with long latencies and potential
out-of-order completion

One may track the PC for each instruction, but suppose there are 4
independent FUs (like integer add, integer mul/divFP add/mul, FP
divide/sqrt), of which the latter 3 naturally have multi-cycle
operations, some very long, and of course, there may be queueing
effects on FUs.

Suppose your code looks like:

1: FP DIV f1 = f2/f3 some number of clocks, more than FP MUL
..... instructions not depending on f1
2: FP MUL f2 = f3*f4 note this clobbers f2 [typically 2-8 clocks]
..... instructions not depending on f2
2: INT DIV r1 = r2/r3 [probably ~64 clocks on 64-bit CPU]
..... instructions not depending on r1
4: INT ADD r2 = r3*r4 and this clobbers r2

Suppose one has an ISA in which every one of these can cause an
exception [on MIPS, INT DIV can't, but the rest can.]

THE GOOD CASE
If the ISA semantics follow the rules I described earlier
a) FP DIV and FP MUL stall until they are sure they don't cause an
exception. Then they run to completion.
b) The INT DIV stalls until it does a test-for-zero, and then it runs
as long as necessary. It's quite possible that 1, 2, and 3 are all
executing concurrently.

Then:
1) The behavior is identical on all implementations regardless of the
number and latency of functional units.
2) Likewise, if the OS has to redo the offending instruction and
continue execution, it can, relatively straightforwardly. Why would
would one want to do that? Isn't an exception The End?
Nope: in particular, the FPU might not implement all of IEEE 754, it
might implement the common cases, and trap on data-dependent, but
infrequent cases [as has been done in at least some implementations of
both MIPS and SPARC].

In any case, the IEEE standard recommends that users be allowed to
specify a trap handler for any of the standard exceptions, such that it
can compute a substitute result.

THE HARD CASE
Assume that 1-4 of the instructions above could raise an exception upon
completion, not by stalling. [As faras I can tell from a quick read,
HP PA RISC is somewhat like this.]

Here's are some exercises for the reader:

1) Enumerate the number and order of exceptions that might happen.


2) Write the IEEE trap handlers that handle traps from the FP
operations.
Note that if 1 traps, but late enough, one of its inputs is already
overwritten by the output of 2, which means the trap handler can't get
at the inputs.

2a) Design the logic to keep track of the dependencies above, i.e.,
requiring the tracking of any register used as an *input* by any
uncompleted operation, and stalling any operation that modifies such a
register (which of course, wrecks the very concurrency that one is
trying to get).

3) When a trap occurs, does the CPU complete any pending operations?
Suppose one of them causes an exception also? Design all of the state
provided to the kernel, and to the user trap handler.

4) In a family of CPUs, does the outcome of a program depend on the
relative latencies of various functional units? Discuss the
cirumstances under which this might be OK.

Note that the answer to 1 is: 0-4 exceptions, and in any order,
depending on the latencies, and also on the number of intervening
instructions.

MESSAGE: for some reason, many people want to propose complex
mechanisms. I say again: especially in this turf, simplicty really
pays off, and complexity is only accepted reluctantly, at least by
people who actually have to implement the hardware and software. One
can make complex setups work, but it's expensive.
Back to top
glen herrmannsfeldt
Guest





Posted: Wed Sep 21, 2005 12:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

John Mashey wrote:

(snip)

Quote:
Suppose your code looks like:

1: FP DIV f1 = f2/f3 some number of clocks, more than FP MUL
.... instructions not depending on f1
2: FP MUL f2 = f3*f4 note this clobbers f2 [typically 2-8 clocks]
.... instructions not depending on f2
2: INT DIV r1 = r2/r3 [probably ~64 clocks on 64-bit CPU]
.... instructions not depending on r1
4: INT ADD r2 = r3*r4 and this clobbers r2

Don't most processors use register renaming when a register is
overwritten like this? There has to be some way to keep track
of which value is going where.

Getting the right value into the real register for the interrupt
would be an extra challenge, though.

-- glen
Back to top
John Mashey
Guest





Posted: Wed Sep 21, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Peter Dickerson wrote:
Quote:
"glen herrmannsfeldt" <gah@ugcs.caltech.edu> wrote in message
news:yM2dnfGDOKa1HK3eRVn-hw@comcast.com...

Don't most processors use register renaming when a register is
overwritten like this? There has to be some way to keep track
of which value is going where.

Getting the right value into the real register for the interrupt
would be an extra challenge, though.

-- glen

I thought JM had explicitly included simple in-order pipelined processors.
Such microarchitectures don't normally rename.

Yes, especially since:
1) "Most" distinct CPU designs are in-order issue, whether superscalar
or not.
The fraction of OOO designs is minuscule, although of course (due to
X86) they account for a lot of $.

2) "Most" actual CPU chips are in-order, since embedded chips rarely
use OOO, and outsell PC / system chips.

Also, for this newsgroup, I'd guess that if someone actually has a
chance to participate in a CPU design, it is much more likely to be
in-order (in an FPGA, or an SoC) than an OOO chip, as the latter are
not done at very many places. The nubmer of people on the planet who
actually design OOO CPUs is a tiny fraction of the total who design
CPUs.
Back to top
Peter Dickerson
Guest





Posted: Wed Sep 21, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

"glen herrmannsfeldt" <gah@ugcs.caltech.edu> wrote in message
news:yM2dnfGDOKa1HK3eRVn-hw@comcast.com...
Quote:
John Mashey wrote:

(snip)

Suppose your code looks like:

1: FP DIV f1 = f2/f3 some number of clocks, more than FP MUL
.... instructions not depending on f1
2: FP MUL f2 = f3*f4 note this clobbers f2 [typically 2-8 clocks]
.... instructions not depending on f2
2: INT DIV r1 = r2/r3 [probably ~64 clocks on 64-bit CPU]
.... instructions not depending on r1
4: INT ADD r2 = r3*r4 and this clobbers r2

Don't most processors use register renaming when a register is
overwritten like this? There has to be some way to keep track
of which value is going where.

Getting the right value into the real register for the interrupt
would be an extra challenge, though.

-- glen

I thought JM had explicitly included simple in-order pipelined processors.
Such microarchitectures don't normally rename.

Peter
Back to top
Iain McClatchie
Guest





Posted: Wed Sep 28, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Mash> THE GOOD CASE
Mash> If the ISA semantics follow the rules I described earlier
Mash> a) FP DIV and FP MUL stall until they are sure they don't cause
an
Mash> exception. Then they run to completion.

Back when I worked on this stuff ('92-'94), it seemed that most
programs did not turn on any of the user-level exception triggers.
The only common exception was caused by denormalized inputs
requiring a trap to the kernel for emulation, since the hardware
(R8000 in this case) didn't do denorms. These traps caused quite
a lot of grief.

The R8000 had a 4-cycle multiply-add pipeline. I often wonder if
we would have experienced less grief with a 5-cycle multiple-add
pipe that could do input and output denorms without exceptions.
Performance would have been lower, register pressure would have
been higher... but every application run by folks that weren't
sure whether they could turn on flush-to-zero mode would have
gone faster anyway.

I never had exposure to data that would have told me if there
were more folks (more sales dollars, really) in the can-flush-
denorms-to-zero camp than in the don't-know and must-handle-
denorms camps. But my guess is that handling denorms in hardware
would have been the better choice. In hindsight, we probably had
the area to do it, too. (You need an extra output shifter, IIRC.)
Back to top
Jan Vorbrüggen
Guest





Posted: Wed Sep 28, 2005 8:15 am    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Quote:
The R8000 had a 4-cycle multiply-add pipeline. I often wonder if
we would have experienced less grief with a 5-cycle multiple-add
pipe that could do input and output denorms without exceptions.

Would a single cycle extension to the pipeline been enough? Was the
actual hardware (as built) already generating denorms, or does that
cause an exception as well?

Jan
Back to top
Bernd Paysan
Guest





Posted: Wed Sep 28, 2005 4:15 pm    Post subject: Re: interrupting for overflow and loop termination Reply with quote

Jan Vorbrüggen wrote:

Quote:
The R8000 had a 4-cycle multiply-add pipeline. I often wonder if
we would have experienced less grief with a 5-cycle multiple-add
pipe that could do input and output denorms without exceptions.

Would a single cycle extension to the pipeline been enough? Was the
actual hardware (as built) already generating denorms, or does that
cause an exception as well?

Handling denoms requires another barrel shift operation - you find out that
your result doesn't fit into the required range, so you shift it right by
the overflow exponent.

If your MAC pipeline is multiply (carry save adder network), sum, shift,
add, count leading zeros, shift (normalize), shift (denorms), it can take
up to seven or eight cycles.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Back to top
Eliot Miranda
Guest





Posted: Tue Nov 01, 2005 1:15 am    Post subject: tagged/boxed arith op frequencies [was interrupting for over Reply with quote

John Mashey wrote:
Quote:
Terje Mathisen wrote:

John Mashey wrote:


Eliot Miranda wrote:


John Mashey wrote:
Code sequences for polymorphic add/sub or compare would then look like

fetch operand one
fetch operand two
tagged add/sub
branch no overflow Ldone
code for non-tagged & overflow cases (method lookup)
...
Ldone:


Just out of curiosity, what's known about the dynamic frequencies of
the tweo cases?

[snip

So anyway, what do we actually *know* abotu the relative frequencies?

I finally got round to collecting some stats. These are all the
"primitive" arithmetic operations in the run of a Smalltalk program,
where "primitive" means arithmetic operations that are directly
implemented by the Smalltalk virtual machine. [The LargePositiveInteger
operations can be implemented in Smalltalk using SmallInteger and array
dereference, but performance is much worse than a primitive implementation]

The stats are extracted from a single run in which the system started
up, displayed a few windows, ran some computational benchmarks that are
representative exercises of the IDE such as searching for implementors
of a message, creating (but not displaying) an inspector on an object,
compiling a method, etc, and then shutdown. In starting up the system
the window manager's font names are queried and parsed, which accounts
for most of the single-precision float operations. No double-precision
ops occurred.

First comes the stats for the combined start-up, benchmark run and
shutdown. Second comes the data for just the benchmarks. Most
operation names should be self explanatory, but

quo: integer division rounding towards zero
// integer division rounding towards minus infinity
SmallInteger / is exact and results in a
Fraction (integral numerator/denominator)

The datatypes are
Float 32-bit IEEE floating-point (boxed)
SmallInteger 30-bit signed 2's complement integers
(tagged/unboxed/fixnum/immediate)
LargePositiveInteger arbitrary length (implicit) sign/magnitude integer
(> 30 bits)

Startup, Benchmarks, Shutdown # arith ops 7,655,837 # sends 47,852,103

datatype op # ops % arith ops
Float - 47245 0.617%
Float / 94870 1.239%
Float * 77042 1.006%
Float + 116041 1.516%
LargePositiveInteger // 20 <0.001%
LargePositiveInteger + 5060 0.066%
LargePositiveInteger bitAnd: 8 <0.001%
LargePositiveInteger bitShift: 12 <0.001%
LargePositiveInteger bitXor: 2118 0.028%
SmallInteger - 697772 9.114%
SmallInteger / 19863 0.259%
SmallInteger // 10902 0.142%
SmallInteger * 37233 0.486%
SmallInteger + 5721304 74.731%
SmallInteger bitAnd: 355382 4.642%
SmallInteger bitOr: 71064 0.928%
SmallInteger bitShift: 398173 5.201%
SmallInteger bitXor: 1160 0.015%
SmallInteger quo: 568 0.007%

datatype # ops % arith ops % total sends
Float 335198 4.378% 0.7%
LargePositiveInteger 7218 0.094% 0.015%
SmallInteger 7313421 95.527% 15.283%

Benchmarks # arith ops 2,909,298 # sends 19,741,035

datatype op # ops % arith ops
Float - 0 0.0%
Float / 204 0.007%
Float * 204 0.007%
Float + 0 0.0%
LargePositiveInteger // 10 <0.001%
LargePositiveInteger + 1202 0.041%
LargePositiveInteger bitAnd: 0 0.0%
LargePositiveInteger bitShift: 0 0.0%
LargePositiveInteger bitXor: 500 0.017%
SmallInteger - 145981 5.018%
SmallInteger / 3512 0.121%
SmallInteger // 2532 0.087%
SmallInteger * 8280 0.285%
SmallInteger + 2637848 90.67%
SmallInteger bitAnd: 53075 1.824%
SmallInteger bitOr: 9311 0.32%
SmallInteger bitShift: 45752 1.573%
SmallInteger bitXor: 683 0.023%
SmallInteger quo: 204 0.007%

datatype # ops % arith ops % total sends
Float 408 0.014% 0.002%
LargePositiveInteger 1712 0.059% 0.009%
SmallInteger 2907178 99.927% 14.727%

<disclaimer>These data are not necessarily representative of any
specific Smalltalk application beyond the VisualWorks IDE
itself.</disclaimer>

So about 15% of all sends are arithmetic primitives. In
integer-intensive code about 99.9% of integer operations are on tagged
integers, and about 90% are additions.
--
_______________,,,^..^,,,____________________________
Eliot Miranda Smalltalk - Scene not herd
Back to top
John Mashey
Guest





Posted: Tue Nov 01, 2005 1:15 am    Post subject: Re: tagged/boxed arith op frequencies [was interrupting for Reply with quote

Eliot Miranda wrote:

Quote:
I finally got round to collecting some stats. These are all the
"primitive" arithmetic operations in the run of a Smalltalk program,
where "primitive" means arithmetic operations that are directly
implemented by the Smalltalk virtual machine. [The LargePositiveInteger
operations can be implemented in Smalltalk using SmallInteger and array
dereference, but performance is much worse than a primitive implementation]

The stats are extracted from a single run in which the system started
up, displayed a few windows, ran some computational benchmarks that are
representative exercises of the IDE such as searching for implementors
of a message, creating (but not displaying) an inspector on an object,
compiling a method, etc, and then shutdown. In starting up the system
the window manager's font names are queried and parsed, which accounts
for most of the single-precision float operations. No double-precision
ops occurred.

First comes the stats for the combined start-up, benchmark run and
shutdown. Second comes the data for just the benchmarks. Most
operation names should be self explanatory, but

quo: integer division rounding towards zero
// integer division rounding towards minus infinity
SmallInteger / is exact and results in a
Fraction (integral numerator/denominator)

The datatypes are
Float 32-bit IEEE floating-point (boxed)
SmallInteger 30-bit signed 2's complement integers
(tagged/unboxed/fixnum/immediate)
LargePositiveInteger arbitrary length (implicit) sign/magnitude integer
(> 30 bits)

Startup, Benchmarks, Shutdown # arith ops 7,655,837 # sends 47,852,103

datatype op # ops % arith ops
Float - 47245 0.617%
Float / 94870 1.239%
Float * 77042 1.006%
Float + 116041 1.516%
LargePositiveInteger // 20 <0.001%
LargePositiveInteger + 5060 0.066%
LargePositiveInteger bitAnd: 8 <0.001%
LargePositiveInteger bitShift: 12 <0.001%
LargePositiveInteger bitXor: 2118 0.028%
SmallInteger - 697772 9.114%
SmallInteger / 19863 0.259%
SmallInteger // 10902 0.142%
SmallInteger * 37233 0.486%
SmallInteger + 5721304 74.731%
SmallInteger bitAnd: 355382 4.642%
SmallInteger bitOr: 71064 0.928%
SmallInteger bitShift: 398173 5.201%
SmallInteger bitXor: 1160 0.015%
SmallInteger quo: 568 0.007%

datatype # ops % arith ops % total sends
Float 335198 4.378% 0.7%
LargePositiveInteger 7218 0.094% 0.015%
SmallInteger 7313421 95.527% 15.283%

Benchmarks # arith ops 2,909,298 # sends 19,741,035

datatype op # ops % arith ops
Float - 0 0.0%
Float / 204 0.007%
Float * 204 0.007%
Float + 0 0.0%
LargePositiveInteger // 10 <0.001%
LargePositiveInteger + 1202 0.041%
LargePositiveInteger bitAnd: 0 0.0%
LargePositiveInteger bitShift: 0 0.0%
LargePositiveInteger bitXor: 500 0.017%
SmallInteger - 145981 5.018%
SmallInteger / 3512 0.121%
SmallInteger // 2532 0.087%
SmallInteger * 8280 0.285%
SmallInteger + 2637848 90.67%
SmallInteger bitAnd: 53075 1.824%
SmallInteger bitOr: 9311 0.32%
SmallInteger bitShift: 45752 1.573%
SmallInteger bitXor: 683 0.023%
SmallInteger quo: 204 0.007%

datatype # ops % arith ops % total sends
Float 408 0.014% 0.002%
LargePositiveInteger 1712 0.059% 0.009%
SmallInteger 2907178 99.927% 14.727%

disclaimer>These data are not necessarily representative of any
specific Smalltalk application beyond the VisualWorks IDE
itself.</disclaimer

So about 15% of all sends are arithmetic primitives. In
integer-intensive code about 99.9% of integer operations are on tagged
integers, and about 90% are additions.
--
_______________,,,^..^,,,____________________________
Eliot Miranda Smalltalk - Scene not herd

Thanks ... as usual, it's a lot better to measure than to guess...
Back to top
Terje Mathisen
Guest





Posted: Tue Nov 01, 2005 9:15 am    Post subject: Re: tagged/boxed arith op frequencies [was interrupting for Reply with quote

Eliot Miranda wrote:
Quote:
datatype # ops % arith ops % total sends
Float 408 0.014% 0.002%
LargePositiveInteger 1712 0.059% 0.009%
SmallInteger 2907178 99.927% 14.727%

disclaimer>These data are not necessarily representative of any
specific Smalltalk application beyond the VisualWorks IDE
itself.</disclaimer

So about 15% of all sends are arithmetic primitives. In
integer-intensive code about 99.9% of integer operations are on tagged
integers, and about 90% are additions.

Thanks!

There is nothing like a set of measurements when you've been assuming
for a while. :-)

So, according to those stats, being able to handle 30-bit integer adds
with the shortest possible inline code is a big win, even at quite
substantial overhead for LargeInteger operations.

The most important observation would seem to be that a SmallTalk
compiler needs to do the same kinds of optimization as a Java JIT, in
particular it should get rid of _all_ the overhead for all loop indices,
doing the checking and overflow handling outside the inner loop.

By reserving the all-zero tag for SmallInt, you can then compile most of
the code without ever even checking if overflows have happened:

for (i = 0; i < limit; i++) {
array[i] = ...
}

should compile into something like this:

.... Check that 'limit' is in range of SmallInt and array[]

; i stored in EDI, limit in ESI
next:
... ; calculation, result in EAX
mov array[EDI],EAX
add EDI,4 ; Increment tagged integer index
cmp EDI,ESI
jl next

I.e. identical code to what a C compiler would generate for the same
loop. The two-bit zero tag means that the tagged integer effectively
works as a scaled index, simplifying addressing further.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Terje Mathisen
Guest





Posted: Thu Nov 03, 2005 1:15 am    Post subject: Re: tagged/boxed arith op frequencies [was interrupting for Reply with quote

Eliot Miranda wrote:

Quote:
Terje Mathisen wrote:
I.e. identical code to what a C compiler would generate for the same
loop. The two-bit zero tag means that the tagged integer effectively
works as a scaled index, simplifying addressing further.

Right. And mov array[EDI-3],EAX would allow one to use non-zero tags

I would still prefer a zero tag for SmallInt since it is responsible for
such a large percentage of all operations:

This would seem to simplify a few other operations, since you don't have
to handle the issue of tag bits wrapping into the data itself.

OTOH, I guess you could fix this in most (all?) situations with
carefully chosen bias handling...

ADD -> sum = a + b - tag -> lea esi,[eax+ebx-tag]

SUB -> sum = a - b + tag -> lea esi,[eax+tag] / sub esi,ebx

Multiplication and division seems a lot harder though, you'd have to get
rid of the tags first, do the operation, rescale and replace the tag.

Is that the best way to handle it, or is ther some clever trick that I
didn't notice?

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture All times are GMT
Goto page Previous  1, 2, 3, 4, 5, 6, 7  Next
Page 6 of 7

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




VoIP Electronics Powered by phpBB