| Author |
Message |
Guest
|
Posted:
Tue Dec 28, 2004 2:31 am Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Dan Koren wrote:
| Quote: | Isn't this exactly what an OS is supposed to be? ;-)
|
long ago and far away, my first programming job (as undergraduate) was
to re-implement 1401 MPIO (unit record<->tape front end for 709) in 360
assembler to run on 360/30 (as opposed to running the 360/30 in 1401
emulation mode). I got to design and write my own task manager,
storage/buffer manager, interrupt handler, device drivers, etc. program
grew to about 2000 cards. basically a fairly simple monitor ... as
distinct from things normally called an operating system. except for the
crypto, what is on the chip is less complex than this long ago & far
away monitor (nearly 40 years ago). |
|
 |
Rupert Pigott
Guest
|
Posted:
Tue Dec 28, 2004 3:46 am Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Stephen Fuld wrote:
| Quote: |
"Rupert Pigott" <darkboong@try-removing-hotmail-this.com> wrote in message
news:41d04f54$0$15200$db0fefd9@news.zen.co.uk...
|
[SNIP]
| Quote: | Seen that FP hack more times than I care to mention and I've never
really liked it. I like code that clearly expresses what it means
too.. BUT the problem I have with the FP hack is that FP has very
different semantics from fixed point. :(
OK, so rise to the challenge and try to come up with some primitives that
better express what you want to do. Is it just an issue of wanting more
bits in an integer than the width of a register? Or is there something
else?
|
Depends on the app. There has been tons of research done on this
stuff. IME integers larger than the width of a reg and 64 bit FP
would be adequate for 99.9% of the apps out there. Some kinda
header file for working out what the "natural" sizes are would be
nice, but not essential for *most* applications. I'm not looking
for a system programming language here.
It would be interesting to see if the hardware changed after 20
years of such a language being dominant. Java doesn't really
count because it's saddled with the 32bit uber alles mentality.
Cheers,
Rupert
--
Threading sequential code through the eye of a parallel needle
makes little sense. ;) |
|
 |
Terje Mathisen
Guest
|
Posted:
Tue Dec 28, 2004 6:14 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Stephen Fuld wrote:
| Quote: | "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
Converting all the FLL/PLL operations to fp made the code much
simpler and easier to maintain, so now only the offsets are
calculated exactly, then immediately converted to fp.
Thanks, Terje. I am sensing something here, but it may not be right.
It seems that some of the uses of FP, including the one you pointed
out, really want something else, but FP is closer to what they want
than is integer, hence it is easier/faster to "coerce" FP into what
you need. If that is true, there may be an opportunity to try to
find a set of primitive operations that better suit these needs than
either integer or FP. So, two questions, is my sense right; that
there might be something better suited for these types of situations
than FP, and if so, what might these be?
|
Not really:
Control theory loops (more or less) always work with fp values; it is
just that the input values are exact numbers of ticks (each tick being
1/2^32 seconds).
Since the ntp kernel defines 128 ms as 'infinity', i.e. the point where
it gives up tweaking frequency/phase, and instead just yanks the clock
into an approximately correct value, the maximum number of significant
bits in an ntp offset value would be about 32-3 = 29 bits.
Working with ieee doubles gives 20+ extra bits which means that most
calculations will still be exact.
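The bit-count argument above can be checked with a small sketch. This is not taken from the ntp sources; the names `TICKS_PER_SECOND`, `MAX_OFFSET_TICKS` and `ticks_to_seconds` are made up for illustration, and the only assumption is that offsets are whole numbers of 1/2^32-second ticks bounded by roughly 128 ms.

```python
from fractions import Fraction

TICKS_PER_SECOND = 2 ** 32                            # one tick = 1/2^32 s
MAX_OFFSET_TICKS = (128 * TICKS_PER_SECOND) // 1000   # roughly 128 ms, in ticks

def ticks_to_seconds(ticks):
    """Convert an exact integer tick count to seconds as an IEEE double."""
    return ticks / TICKS_PER_SECOND

# Bits needed for the largest offset, and what is left of the 53-bit significand.
offset_bits = MAX_OFFSET_TICKS.bit_length()
spare_bits = 53 - offset_bits

# The conversion loses nothing: the double equals the exact rational value.
seconds = ticks_to_seconds(MAX_OFFSET_TICKS)
is_exact = Fraction(seconds) == Fraction(MAX_OFFSET_TICKS, TICKS_PER_SECOND)

print(offset_bits, spare_bits, is_exact)
```

The spare bits are what let a few multiplications by small gains stay exact before any rounding sets in.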
| Quote: | A related question is to
attempt to catalog these types of uses for FP to see if there is any
commonality among them.
One other thought. Perhaps this is related to Nick's oft repeated
comment about wanting some kind of FP primitives other than what is
available, but he hasn't specified what he wants in enough detail for
me to understand that.
Sorry, me neither. |
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
 |
Stephen Fuld
Guest
|
Posted:
Tue Dec 28, 2004 9:13 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cqrm7e$q4h$1@osl016lin.hda.hydro.com...
| Quote: | Stephen Fuld wrote:
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
Converting all the FLL/PLL operations to fp made the code much simpler
and easier to maintain, so now only the offsets are calculated exactly,
then immediately converted to fp.
Thanks, Terje. I am sensing something here, but it may not be right.
It seems that some of the uses of FP, including the one you pointed out,
really want something else, but FP is closer to what they want than is
integer, hence it is easier/faster to "coerce" FP into what you need. If
that is true, there may be an opportunity to try to find a set of
primitive operations that better suit these needs than either integer or
FP. So, two questions, is my sense right; that there might be something
better suited for these types of situations
than FP, and if so, what might these be?
Not really:
Control theory loops (more or less) always work with fp values; it is
just that the input values are exact numbers of ticks (each tick being
1/2^32 seconds).
Since the ntp kernel defines 128 ms as 'infinity', i.e. the point where
it gives up tweaking frequency/phase, and instead just yanks the clock
into an approximately correct value, the maximum number of significant
bits in an ntp offset value would be about 32-3 = 29 bits.
Working with ieee doubles gives 20+ extra bits which means that most
calculations will still be exact.
|
OK, so this seems like an artifact of the specification in that 29 bits
happens to be too long to assure adequate precision with 32 bit integers but
short enough to work well with 64 bit floats. Then it probably doesn't
generalize well. If you were assured of 64 bit integer instructions, would
they be preferable (probably faster than FP equivalents)? Are the other
places where FP is used instead of the "more natural" integer calculations
similar to this - just using the fact that the hardware provides a >32 bit
adder only in the guise of FP instructions?
| Quote: | A related question is to attempt to catalog these types of uses for FP to
see if there is any
commonality among them.
One other thought. Perhaps this is related to Nick's oft repeated
comment about wanting some kind of FP primitives other than what is
available, but he hasn't specified what he wants in enough detail for
me to understand that.
Sorry, me neither.
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
|
I'm not sure I get this. If you want to make bignum math easier, wouldn't
you want the help to be on the integer side? If you are willing to
sacrifice precision by using floating point, but want more significant
digits than 64 bit gives you, wouldn't you want 128 bit FP?
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|
 |
Guest
|
Posted:
Tue Dec 28, 2004 11:55 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Rupert Pigott <darkboong@try-removing-hotmail-this.com> writes:
| Quote: | Stephen Fuld wrote:
"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:cq6lq6$760$1@gemini.csx.cam.ac.uk...
I wonder how much die area one would save by doing this? Is it
really significant? How about some of the cores eliminating the
floating point and graphics instructions totally. These could
execute any kernel thread and the compiler/linker could mark
user programs that were OK to run on the "reduced instruction set"
cores. Of course the OS would have to handle the scheduling a
little more carefully, but not too big a deal. Many simple
Do the usual routine. Trap on unsupported instruction and instead of
emulating the instruction in software simply reschedule the process
to a core that can handle it. For extra points tag the binary in
some way so the OS knows what its requirements are next time. :)
|
It is a pity that the code for RSX was not more widely and better
known. RSX could handle CPUs not with differing ISA sets, but with
different and dynamically changing IO topologies. Each CPU had its own
IO bus, and they could connect to other buses via a bus switch, so part
of the fun was "what CPU do we have to be on to diddle the controller
that can do this IO...". All done long ago. Oh, and that could still
handle the nasty case of asynchronous FP-11 interrupts into kernel mode.
BTW, lots of this was dormant in VMS, and was, I'm told, resurrected
for Galaxy systems.
--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be. |
|
 |
Anne & Lynn Wheeler
Guest
|
Posted:
Wed Dec 29, 2004 1:06 am Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
prep@prep.synonet.com writes:
| Quote: | It is a pity that the code for RSX was not more widely and better
known. RSX could handle CPUs not with differing ISA sets, but with
different and dynamically changing IO topologies. Each CPU had its own
IO bus, and they could connect to other buses via a bus switch, so part
of the fun was "what CPU do we have to be on to diddle the controller
that can do this IO...". All done long ago. Oh, and that could still
handle the nasty case of asynchronous FP-11 interrupts into kernel mode.
BTW, lots of this was dormant in VMS, and was, I'm told, resurrected
for Galaxy systems.
|
standard 360 & 370 SMPs had shared memory ... but independent I/O
(channels). a characteristic of 360 & 370 multiprocessors was that they
could be separated into independent uniprocessors and still function
with their independent i/o interfaces.
common i/o was simulated by having device controllers with multiple
I/O (channel) attachments ... you configured the same controller at
the same address on the different I/O (channel) interfaces for the
different processors. basically the multi-interface device controllers
used the same technology for providing common device addressability in
smp ("tightly-coupled") configurations as well as availability and
common access in cluster ("loosely-coupled") configurations. i/o
driver had to be capable of recognizing situation where a device
controller was only available on a specific i/o interface for a
specific processor as well as possibly available on i/o interfaces for
all processors.
the exception was the 360/67 multiprocessor which had something called
the channel controller ... which configured both memory boxes and
channel interfaces. with the channel controller you could cleave a
multiprocessor configuration into uniprocessors ... allocating specific
memory boxes and channel interfaces to specific processors. in
multiprocessor configuration, the channel controller provided
configuration so that all processors could access all memory boxes (as
found in rest of 360 & 370 multiprocessor operation) as well as access
to all channel interfaces (not available in the other 360 & 370
multiprocessor configurations).
360/67 also supported both 24-bit and 32-bit addressing. it wasn't
until you got to the 3081 dyadic ... that you again found
configuration where all processors could access all i/o (channel)
interfaces (and supported both 24-bit and 31-bit ... not 32-bit ...
addressing). however, it wasn't possible to cleave a 3081 dyadic into
two independent operating uniprocessors.
slight drift, my wife did her stint in pok in charge of
loosely-coupled (aka cluster) architecture ...
http://www.garlic.com/~lynn/subtopic.html#shareddata
somewhat useful when we later started ha/cmp
http://www.garlic.com/~lynn/subtopic.html#hacmp
minor specific reference
http://www.garlic.com/~lynn/95.html#13
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/ |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 29, 2004 2:03 am Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Stephen Fuld wrote:
| Quote: | "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
Not really:
Control theory loops (more or less) always work with fp values; it is
just that the input values are exact numbers of ticks (each tick being
1/2^32 seconds).
Since the ntp kernel defines 128 ms as 'infinity', i.e. the point where
it gives up tweaking frequency/phase, and instead just yanks the clock
into an approximately correct value, the maximum number of significant
bits in an ntp offset value would be about 32-3 = 29 bits.
Working with ieee doubles gives 20+ extra bits which means that most
calculations will still be exact.
OK, so this seems like an artifact of the specification in that 29 bits
happens to be too long to assure adequate precision with 32 bit integers but
short enough to work well with 64 bit floats. Then it probably doesn't
generalize well. If you were assured of 64 bit integer instructions, would
they be preferable (probably faster than FP equivalents)? Are the other
places where FP is used instead of the "more natural" integer calculations
similar to this - just using the fact that the hardware provides a >32 bit
adder only in the guise of FP instructions?
|
No, I really believe fp is _better_, i.e. for something like exponential
decay, you need to multiply the previous value by 'p' and the current
value by '1-p' before adding them.
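A minimal sketch of that update in floating point follows. The function names `decay_step` and `smooth` and the `initial` parameter are illustrative only; nothing here is claimed to match the actual ntp filter code.

```python
def decay_step(previous, current, p):
    """One exponential-decay update: weight p on the old value, 1 - p on the new."""
    return p * previous + (1.0 - p) * current

def smooth(samples, p, initial=0.0):
    """Run decay_step over a sequence and return every intermediate value."""
    value = initial
    history = []
    for sample in samples:
        value = decay_step(value, sample, p)
        history.append(value)
    return history
```

With fp values any `p` between 0 and 1 can be used without rescaling, which is part of why this form is easy to maintain.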
Using 128-bit values for everything (64:64) would be sufficient to avoid
both 134-year epoch wraparounds and low end truncation, but simply using
64-bit fp math delivers more or less equivalent performance with much
less cpu overhead.
| Quote: | Sorry, me neither.
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
I'm not sure I get this. If you want to make bignum math easier, wouldn't
you want the help to be on the integer side? If you are willing to
sacrifice precision by using floating point, but want more significant
digits than 64 bit gives you, wouldn't you want 128 bit FP?
|
Sure, I'd love to have guaranteed fast 64x64->128 integer mul available
on all platforms, but the only one I'm familiar with that does this is
Alpha. It seems like many 64-bit cpus make integer mul a poor relative
of the fp hardware, in some cases (Itanium, the original Pentium, and
the Pentium4) integer mul is actually handled by the fp hardware!
If I cannot have fast integer ops, then a fast way to use the abundant
fp hardware to do extended precision would be nice.
Standard, fast (i.e. not more than 4 X slower than double) ieee 128-bit
fp math would be _very_ nice, but simply having both fused (Is that the
correct term for not doing any intermediate rounding?) and regular
MUL-ACC is very helpful.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
 |
Nick Maclaren
Guest
|
Posted:
Wed Dec 29, 2004 5:38 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
In article <cqrm7e$q4h$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
| Quote: | Stephen Fuld wrote:
A related question is to
attempt to catalog these types of uses for FP to see if there is any
commonality among them.
One other thought. Perhaps this is related to Nick's oft repeated
comment about wanting some kind of FP primitives other than what is
available, but he hasn't specified what he wants in enough detail for
me to understand that.
Sorry, me neither.
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
|
Eh? I have given examples often enough, but mostly said "Try coding
and see". Here are some examples:
Decompose and recompose a floating-point number into the exponent
and mantissa (sometimes available), providing correct error checking
(less often available).
Classify a floating-point number (sometimes available) in a way
convenient for a multi-way branch (rarely available). A potentially
useful form would be to classify two registers at once, so that the
problem cases could be removed in a single branch.
Split a floating-point number, with the mantissa divided for
subsequent exact multiplication (OK, Terje?)
Do a floating-point operation, returning the result and a suitable
indicator of the extra precision that was rounded off (this is not as
easy to specify as might appear).
Provide a table lookup function that uses N bits of the exponent
and M bits of the mantissa, suitable for starting Newton-Raphson
division and square root (and potentially other functions).
All are evil to emulate using standard operations.
Regards,
Nick Maclaren. |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 29, 2004 6:53 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Nick Maclaren wrote:
| Quote: | In article <cqrm7e$q4h$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
Eh? I have given examples often enough, but mostly said "Try coding
and see". Here are some examples:
Decompose and recompose a floating-point number into the exponent
and mantissa (sometimes available), providing correct error checking
(less often available).
|
Quite doable in a portable way, as long as double and t_int64 share the
same endianness. It won't be fast though, so hw support would indeed help.
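The portable approach can be sketched as below, reinterpreting the double as a 64-bit integer and pulling the fields apart with masks. The helper names `decompose` and `recompose` are hypothetical; packing and unpacking with the same explicit byte order stands in for the "same endianness" condition.

```python
import struct

FRACTION_MASK = (1 << 52) - 1

def decompose(x):
    """Split an IEEE 754 double into (sign, biased_exponent, fraction) bit fields."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF
    fraction = bits & FRACTION_MASK
    return sign, biased_exponent, fraction

def recompose(sign, biased_exponent, fraction):
    """Rebuild a double from its fields, rejecting out-of-range values."""
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not 0 <= biased_exponent <= 0x7FF:
        raise ValueError("biased exponent out of range")
    if not 0 <= fraction <= FRACTION_MASK:
        raise ValueError("fraction out of range")
    bits = (sign << 63) | (biased_exponent << 52) | fraction
    (x,) = struct.unpack("<d", struct.pack("<Q", bits))
    return x
```

The round trip through a byte buffer is the software analogue of the trip through memory that the thread complains about.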
| Quote: |
Classify a floating-point number (sometimes available) in a way
convenient for a multi-way branch (rarely available). A potentially
useful form would be to classify two registers at once, so that the
problem cases could be removed in a single branch.
|
Nice one! :-)
Two fp input regs, one integer output reg:
fclassify ireg,freg,freg
jmp fjmptable[ireg]
or more probably something like this, to make the normal case faster:
fclassify ireg,freg,freg
test ireg,SPECIAL_BITS
jnz special_case
.... regular code here, guaranteed not to cause exceptions
special_case:
jmp special_table[ireg]
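In software the same idea looks roughly like the sketch below: classify both operands, fold the two class codes into one index, and branch once. The class numbering, `classify_pair` and `dispatch` are all assumptions made for illustration; the hypothetical `fclassify` instruction would produce the combined code directly.

```python
import math
import sys

# Class codes; NORMAL is 0 so that "both operands normal" gives a combined code of 0.
NORMAL, ZERO, SUBNORMAL, INFINITE, NAN = range(5)
NUM_CLASSES = 5

def classify(x):
    """Return the class code of a single double."""
    if math.isnan(x):
        return NAN
    if math.isinf(x):
        return INFINITE
    if x == 0.0:
        return ZERO
    if abs(x) < sys.float_info.min:
        return SUBNORMAL
    return NORMAL

def classify_pair(a, b):
    """Fold two class codes into one index, as a two-register classify might."""
    return classify(a) * NUM_CLASSES + classify(b)

def dispatch(a, b, fast_path, special_case):
    """Single test for the common case; everything else goes to special_case."""
    code = classify_pair(a, b)
    if code == 0:
        return fast_path(a, b)
    return special_case(code, a, b)
```

The four separate tests in `classify` are exactly the cost a hardware classify would remove.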
| Quote: |
Split a floating-point number, with the mantissa divided for
subsequent exact multiplication (OK, Terje?)
|
Very nice, even though this is doable with integer masks. What would be
better would be a way to specify the bit (i.e. exponent value) to split
at. Currently the fastest way to do this is to add a suitably scaled
magic value, then subtract the same number to leave a truncated/rounded
result.
A final subtraction from the original value leaves the remainder.
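The add-and-subtract trick can be sketched as follows. The function name `split_at` and the particular magic constant (1.5 times a power of two) are one common choice, assumed here rather than taken from the post; the sketch relies on round-to-nearest IEEE doubles with no extended-precision intermediates.

```python
import math

def split_at(x, e):
    """Split x into (hi, lo) where hi is a multiple of 2**e and hi + lo == x.

    Assumes round-to-nearest doubles and |x| < 2**(51 + e).
    """
    magic = math.ldexp(1.5, 52 + e)   # ulp of (x + magic) is exactly 2**e
    hi = (x + magic) - magic          # x rounded to a multiple of 2**e
    lo = x - hi                       # remainder, at most 2**(e - 1) in size
    return hi, lo
```

For example, `split_at(3.14159, -10)` leaves `hi` as a multiple of 1/1024, with `lo` holding the low-order bits that were rounded away.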
| Quote: |
Do a floating-point operation, returning the result and a suitable
indicator of the extra precision that was rounded off (this is not as
easy to specify as might appear).
|
Not so easy to use either?
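For addition, the usual software stand-in for such a primitive is the error-free "two-sum" transformation, sketched below. This is a swapped-in, well-known technique rather than anything proposed in the thread, and it assumes round-to-nearest doubles and no overflow.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err equal to a + b exactly."""
    s = a + b
    b_virtual = s - a            # the part of b that made it into s
    a_virtual = s - b_virtual    # the part of a that made it into s
    err = (a - a_virtual) + (b - b_virtual)
    return s, err
```

It costs six dependent fp operations to recover what the adder already knew, which gives some idea of why a direct primitive is attractive.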
| Quote: |
Provide a table lookup function that uses N bits of the exponent
and M bits of the mantissa, suitable for starting Newton-Raphson
division and square root (and potentially other functions).
|
Another nice one, but quite hard to implement efficiently, since it
would require the fp registers to generate address unit values. It would
help a lot to simply have the first one you suggested, i.e. a fast split
into separate mantissa and exponent parts.
Two masks, shifts and a merge, and 'Bob's your uncle'. (Is that the
correct idiom?)
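A rough sketch of how a seed table plus Newton-Raphson gives a reciprocal is shown below. The table size (`SEED_BITS`), the midpoint seeds and the function name `reciprocal` are assumptions for illustration, and the exponent is handled separately with `math.frexp` and `math.ldexp` rather than feeding exponent bits into the table.

```python
import math

SEED_BITS = 4  # M: number of leading mantissa bits used as the table index (assumed)

# Seeds approximate 1/m for m in [0.5, 1), one entry per bin, taken at the bin midpoint.
SEED_TABLE = [1.0 / (0.5 + (i + 0.5) / (2 << SEED_BITS))
              for i in range(1 << SEED_BITS)]

def reciprocal(d, iterations=4):
    """Approximate 1/d for positive finite d by table seed plus Newton-Raphson.

    Inputs so small that 1/d overflows are not handled in this sketch.
    """
    if not (d > 0.0 and math.isfinite(d)):
        raise ValueError("sketch handles positive finite inputs only")
    m, e = math.frexp(d)                          # d == m * 2**e, 0.5 <= m < 1
    index = int((m - 0.5) * (2 << SEED_BITS))     # top SEED_BITS bits of the mantissa
    x = SEED_TABLE[index]
    for _ in range(iterations):
        x = x * (2.0 - m * x)                     # each step roughly squares the error
    return math.ldexp(x, -e)
```

The `frexp` call and the index computation are the "split into mantissa and exponent" step; with a fast hardware split only the table load would remain.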
| Quote: | All are evil to emulate using standard operations
|
Mostly because you're forced to go through memory, even on cpus where
the internal hw registers are merged!
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
 |
Stephen Fuld
Guest
|
Posted:
Wed Dec 29, 2004 8:30 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cqshnt$bkj$1@osl016lin.hda.hydro.com...
| Quote: | Stephen Fuld wrote:
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
Not really:
Control theory loops (more or less) always work with fp values; it is
just that the input values are exact numbers of ticks (each tick being
1/2^32 seconds).
Since the ntp kernel defines 128 ms as 'infinity', i.e. the point where
it gives up tweaking frequency/phase, and instead just yanks the clock
into an approximately correct value, the maximum number of significant
bits in an ntp offset value would be about 32-3 = 29 bits.
Working with ieee doubles gives 20+ extra bits which means that most
calculations will still be exact.
OK, so this seems like an artifact of the specification in that 29 bits
happens to be too long to assure adequate precision with 32 bit integers
but short enough to work well with 64 bit floats. Then it probably
doesn't generalize well. If you were assured of 64 bit integer
instructions, would they be preferable (probably faster than FP
equivalents)? Are the other places where FP is used instead of the "more
natural" integer calculations similar to this - just using the fact that
the hardware provides a >32 bit adder only in the guise of FP
instructions?
No, I really believe fp is _better_, i.e. for something like exponential
decay, you need to multiply the previous value by 'p' and the current
value by '1-p' before adding them.
|
OK, but if you can choose the value of p (and many situations work fine with
a range of values for p), then you can do better with integers by making p a
suitable negative power of 2 and using shifts - but you knew that :-)
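The shift-based variant can be sketched as below. The name `ewma_shift` is illustrative, and the sketch assumes the power-of-two weight 2^-k goes on the new sample, leaving 1 - 2^-k on the running value; values are plain integers (for instance tick counts).

```python
def ewma_shift(average, sample, k):
    """Integer update equal to average*(1 - 2**-k) + sample*2**-k, up to flooring.

    The right shift floors toward negative infinity, like an arithmetic shift.
    """
    return average + ((sample - average) >> k)
```

The price is that the weight is restricted to powers of two and the low bits of each step are dropped, which is the truncation issue raised elsewhere in the thread.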
| Quote: | Using 128-bit values for everything (64:64) would be sufficient to avoid
both 134-year epoch wraparounds and low end truncation, but simply using
64-bit fp math delivers more or less equivalent performance with much less
cpu overhead.
|
Sure. But for the specific case you mentioned (the NTP kernel), if you had
say X86 - 64, is floating point faster than 64 bit integers (which, by your
description) would provide enough precision, without the need for 64 X 64 =
128 bit multiply?
| Quote: | Sorry, me neither.
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
I'm not sure I get this. If you want to make bignum math easier,
wouldn't you want the help to be on the integer side? If you are willing
to sacrifice precision by using floating point, but want more significant
digits than 64 bit gives you, wouldn't you want 128 bit FP?
Sure, I'd love to have guaranteed fast 64x64->128 integer mul available on
all platforms, but the only one I'm familiar with that does this is Alpha.
It seems like many 64-bit cpus make integer mul a poor relative of the fp
hardware, in some cases (Itanium, the original Pentium, and the Pentium4)
integer mul is actually handled by the fp hardware!
If I cannot have fast integer ops, then a fast way to use the abundant fp
hardware to do extended precision would be nice.
|
So this is a legitimate comp.arch, ISA issue. From Mash's posts, I would
guess CPU designers wouldn't like this as it requires two register writes in
one instruction. But it provides substantial benefit in some, albeit
limited, circumstances. ISTM that the need for multiprecision multiply is
reduced in 64 bit versus 32 bit systems, as there would be fewer of them for
manipulating an X digit bignum. Is that right? Is this one of those things
where it would get more use if the dominant programming languages had good
native support?
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|
 |
Stephen Fuld
Guest
|
Posted:
Wed Dec 29, 2004 8:40 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cqucs3$dqg$1@osl016lin.hda.hydro.com...
| Quote: | Nick Maclaren wrote:
In article <cqrm7e$q4h$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
Eh? I have given examples often enough, but mostly said "Try coding
and see". Here are some examples:
|
snipped Nick's good examples and Terje's comments that I think improve the
suggestions
OK, so the obvious question is that since at least some of these seem like
they would be pretty easy to implement (as Nick said, some CPUs have them)
and at least some people consider them to be very useful, why aren't they
(the easy and more useful ones) more prevalent? Someone like Andy or Mash
(with real world mainstream CPU architecture experience) care to comment on
why they aren't more prevalent?
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 29, 2004 9:53 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Stephen Fuld wrote:
| Quote: | "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
Using 128-bit values for everything (64:64) would be sufficient to avoid
both 134-year epoch wraparounds and low end truncation, but simply using
64-bit fp math delivers more or less equivalent performance with much less
cpu overhead.
Sure. But for the specific case you mentioned (the NTP kernel), if you had
say X86 - 64, is floating point faster than 64 bit integers (which, by your
description) would provide enough precision, without the need for 64 X 64 =
128 bit multiply?
|
Working with 64:64 bit fixed point values would still require a fast
64x64->128 bit mul, if that's not available then fp is _much_ faster.
[snip]
| Quote: | If I cannot have fast integer ops, then a fast way to use the abundant fp
hardware to do extended precision would be nice.
So this is a legitimate comp.arch, ISA issue. From Mash's posts, I would
guess CPU designers wouldn't like this as it requires two register writes in
one instruction.
|
Right, which was why Alpha required two opcodes to retrieve each half of
the full result. Starting on a new isa, the hw would be free to optimize
by using the same 128-bit result to supply both halves at once, or you
could go the other way by having an opcode with two result registers
that gets split into two micro-ops by the decoder.
| Quote: | But it provides substantial benefit in some, albeit
limited, circumstances. ISTM that the need for multiprecision multiply is
reduced in 64 bit versus 32 bit systems, as there would be fewer of them for
manipulating an X digit bignum. Is that right? Is this one of those things
where it would get more use if the dominant programming languages had good
native support?
|
For bignum you always need size-doubling mul: If a 64-bit system doesn't
provide this, you've almost reduced it to the same speed as a 32-bit
system that does have 32x32->64 mul.
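To see why, here is a sketch of what a 64x64->128 multiply costs when only 32x32->64 products are available. Python integers are unbounded, so the masks below stand in for fixed-width registers; the function name `mul_64x64_to_128` is made up for illustration.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul_64x64_to_128(a, b):
    """Multiply two unsigned 64-bit values using four 32x32->64 partial products.

    Returns (hi, lo), the upper and lower 64 bits of the 128-bit result.
    """
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    ll = a_lo * b_lo          # contributes to bits 0..63
    lh = a_lo * b_hi          # contributes to bits 32..95
    hl = a_hi * b_lo          # contributes to bits 32..95
    hh = a_hi * b_hi          # contributes to bits 64..127

    # Sum the three pieces that land in bits 32..63, keeping the carry.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)

    lo = (ll & MASK32) | ((mid & MASK32) << 32)
    hi = (hh + (lh >> 32) + (hl >> 32) + (mid >> 32)) & MASK64
    return hi, lo
```

Four multiplies plus the carry handling replace what a size-doubling instruction would do in one step, which is the slowdown described above.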
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 29, 2004 10:47 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Stephen Fuld wrote:
| Quote: | "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cqucs3$dqg$1@osl016lin.hda.hydro.com...
Nick Maclaren wrote:
In article <cqrm7e$q4h$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
Eh? I have given examples often enough, but mostly said "Try coding
and see". Here are some examples:
snipped Nick's good examples and Terje's comments that I think improve the
suggestions
OK, so the obvious question is that since at least some of these seem like
they would be pretty easy to implement (as Nick said, some CPUs have them)
and at least some people consider them to be very useful, why aren't they
(the easy and more useful ones) more prevalent? Someone like Andy or Mash
(with real world mainstream CPU architecture experience) care to comment on
why they aren't more prevalent?
|
My guess:
They aren't part of the ieee standard, so language designers can't
depend on having them, so sw doesn't get written to use them, so
software traces show they aren't used.
I.e. your usual negative feedback loop. :-)
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
 |
Guest
|
Posted:
Thu Dec 30, 2004 5:33 am Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
Terje Mathisen wrote:
| Quote: | Nick Maclaren wrote:
In article <cqrm7e$q4h$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
(If I'm allowed to wish, I'd like all fp implementations to contain
sufficient primitives to make bignum math much simpler.)
Eh? I have given examples often enough, but mostly said "Try coding
and see". Here are some examples:
Decompose and recompose a floating-point number into the exponent
and mantissa (sometimes available), providing correct error checking
(less often available).
Quite doable in a portable way, as long as double and t_int64 share the
same endianness. It won't be fast though, so hw support would indeed help.
Classify a floating-point number (sometimes available) in a way
convenient for a multi-way branch (rarely available). A potentially
useful form would be to classify two registers at once, so that the
problem cases could be removed in a single branch.
Nice one! :-)
Two fp input regs, one integer output reg:
fclassify ireg,freg,freg
jmp fjmptable[ireg]
or more probably something like this, to make the normal case faster:
fclassify ireg,freg,freg
test ireg,SPECIAL_BITS
jnz special_case
... regular code here, guaranteed not to cause exceptions
special_case:
jmp special_table[ireg]
Split a floating-point number, with the mantissa divided for
subsequent exact multiplication (OK, Terje?)
Very nice, even though this is doable with integer masks. What would be
better would be a way to specify the bit (i.e. exponent value) to split
at. Currently the fastest way to do this is to add a suitably scaled
magic value, then subtract the same number to leave a truncated/rounded
result.
A final subtraction from the original value leaves the remainder.
Do a floating-point operation, returning the result and a suitable
indicator of the extra precision that was rounded off (this is not as
easy to specify as might appear).
Not so easy to use either?
Provide a table lookup function that uses N bits of the exponent
and M bits of the mantissa, suitable for starting Newton-Raphson
division and square root (and potentially other functions),
Another nice one, but quite hard to implement efficiently, since it
would require the fp registers to generate address unit values. It would
help a lot to simply have the first one you suggested, i.e. a fast split
into separate mantissa and exponent parts.
Two masks, shifts and a merge, and 'Bob's your uncle'. (Is that the
correct idiom?)
All are evil to emulate using standard operations
Mostly because you're forced to go through memory, even on cpus where
the internal hw registers are merged!
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
|
Terje, you appear to understand Nick a lot better than I do.
Can you enlighten me, please, which parts of his list can't be easily
done with SSE2 or, in case of single-precision FP, with Altivec? |
|
 |
Nick Maclaren
Guest
|
Posted:
Thu Dec 30, 2004 4:50 pm Post subject:
Re: Will multicore CPUs have identical cores? |
|
|
In article <cqucs3$dqg$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
[ A lot snipped. ]
| Quote: | All are evil to emulate using standard operations
Mostly because you're forced to go through memory, even on cpus where
the internal hw registers are merged!
|
Precisely. Which tends to do horrible things to the pipelines.
Regards,
Nick Maclaren. |
|