Will multicore CPUs have identical cores?
Terje Mathisen
Posted: Fri Dec 31, 2004 12:34 am    Post subject: Re: Will multicore CPUs have identical cores?

already5chosen@yahoo.com wrote:
Quote:
Terje, you appear to understand Nick a lot better than I do.
Can you enlighten me, please, which parts of his list can't be easily
done with SSE2 or, in case of single-precision FP, with Altivec?

Really simple actually: It all boils down to having fast (integer)
access to the bitpattern of fp values, and vice versa.

Any SIMD opcode set that supports both double precision and integer
operations on the same registers would work well.

The two final requirements are for table lookups to start iterative
function evaluation, and the ability to quickly branch on a test of a fp
number.

Both of these effectively require you to move the fp value to an integer
register.
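
A minimal sketch of what "integer access to the bit pattern" means, written in Python purely for illustration. The helper names (double_bits, bits_double, top_half, is_negative) are made up for this example and do not come from any post; on real hardware the point is that each of them should be a register-to-register move rather than a store followed by a load.

```python
import struct

def double_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_double(b: int) -> float:
    """Inverse of double_bits: reinterpret a 64-bit pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def top_half(x: float) -> int:
    """Upper 32 bits (sign, exponent, top 20 mantissa bits): a typical table index source."""
    return double_bits(x) >> 32

def is_negative(x: float) -> bool:
    """Branch test on an fp value done as an integer sign-bit test (true for -0.0 too)."""
    return (double_bits(x) >> 63) == 1
```

Both uses named above, a table index and a branch test, reduce to shifts and masks once the value is available as an integer.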

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Paul A. Clayton
Posted: Fri Dec 31, 2004 4:09 am

In article <1103576477.715631.125700@c13g2000cwb.googlegroups.com>,
"fepp" <jakob@virtutech.com> wrote:

Quote:
There was a research paper I read on combining Alpha EV5 and EV4 cores
on the same chip, roughly 5 EV4 took the same place as 1 EV5. Result:
for multithreaded workloads, IO-intense threads used EV4s efficiently,
while compute-hungry threads used EV5s. Think of it as Pentiums +
Niagaras on the same chip. For general-purpose computing, this is a
fairly good idea.

Also, power might be a good reason for heterogeneity -- use a simple
core when idle or doing little work, switch to more complex core only
when required.

R. Kumar et al.'s "Single-ISA Heterogeneous Multi-Core
Architectures: The Potential for Processor Power
Reduction" looked at a chip using EV4, EV5, EV6, and
EV8 (minus SMT) and dynamic thread migration to reduce
power consumption. (Based on Figure 1, which compares
the core sizes, I would guess you meant 4 EV5 cores
equal the area of an EV6. EV5 looks to be less than twice
the area of EV4--and EV8 [minus SMT] is even larger,
about 9x the size of EV6.)


Paul A. Clayton
just a technophile (and 'Dysthymicdolt' reachable at aol.com)

Terje Mathisen
Posted: Fri Dec 31, 2004 6:20 pm

Nick Maclaren wrote:

Quote:
In article <cqucs3$dqg$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

[ A lot snipped. ]


All are evil to emulate using standard operations

Mostly because you're forced to go through memory, even on cpus where
the internal hw registers are merged!


Precisely. Which tends to do horrible things to the pipelines.

One example is when you want to start a NR iteration with some kind of
lookup function:

Saving the double value, then loading the top half back into an integer
reg will/can suffer a really ugly stall:

Since the sizes of the store and the load don't match, you might have
to wait until the store has propagated all the way to ram before the
load can even start.

Going the opposite direction is even worse, since loading a wider item
than each of the two stores, must stall until both store operations have
retired. I.e. a full pipeline flush. :-(

Having 64-bit integer registers avoids this particular trap.

Terje

PS. No, you cannot use a single prec store to avoid the size mismatch
when going from fp to int, since that might cause an exponent overflow
and subsequent loss of all precision.
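
A small sketch of why the single-precision detour fails, in Python for illustration. The function names are assumptions made up for this example. Note that Python's struct module reports the exponent overflow as an OverflowError, where a hardware single-precision store would typically produce an infinity instead.

```python
import struct

def via_single(x: float) -> float:
    """Round-trip a double through a 4-byte single-precision store and load."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def single_store_overflows(x: float) -> bool:
    """True if x does not fit the single-precision exponent range."""
    try:
        struct.pack("<f", x)
    except OverflowError:
        return True
    return False
```

Even when the exponent fits, the round trip keeps only 24 significant bits, so the integer side never sees the low mantissa bits of the double.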
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld
Posted: Fri Dec 31, 2004 9:17 pm

"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cr38dc$9vm$1@osl016lin.hda.hydro.com...
Quote:
already5chosen@yahoo.com wrote:
Terje, you appear to understand Nick a lot better than I do.
Can you enlighten me, please, which parts of his list can't be easily
done with SSE2 or, in case of single-precision FP, with Altivec?

Really simple actually: It all boils down to having fast (integer) access
to the bitpattern of fp values, and vice versa.

Any SIMD opcode set that supports both double precision and integer
operations on the same registers would work well.

The two final requirements are for table lookups to start iterative
function evaluation, and the ability to quickly branch on a test of a fp
number.

Both of these effectively require you to move the fp value to an integer
register.

In thinking about this, how about an alternative solution? The idea is to
add what remains to be added to the FP unit to allow it to do the integer
operations that were needed on the FP registers. Since an integer ALU is
pretty small, and the FP unit has much of what it needs already (e.g. an
integer adder that is most of the bits of a full register wide), the silicon
cost might be pretty modest. You wouldn't need multiple ALUs or extremely
fast ones as you are just trying to overcome the need for the slow move to
integer registers. You would need additional op codes to specify the
instructions, but you gain some side benefit in that I am sure that smart
users would come up with "integer" uses for these extra registers and ALU
independent of their floating point capability. What integer capabilities
would you need to support the FP bit fiddling stuff that one wants to do?

--
- Stephen Fuld
e-mail address disguised to prevent spam

James Van Buskirk
Posted: Fri Dec 31, 2004 10:19 pm

"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cr38dc$9vm$1@osl016lin.hda.hydro.com...

Quote:
The two final requirements are for table lookups to start iterative
function evaluation, and the ability to quickly branch on a test of a fp
number.

Why do you need table lookups for iterative function evaluation?
It seems to me that if you're going to do that for SIMD sqrt
instructions you would have to have a table where you can look up 2
or 4 numbers simultaneously. Is that feasible? LUTs don't save you
more than one iteration unless they're really big. Here's an
example of something you could code in some SIMD ISAs at this point:

! File test_magic.f90
! Public domain 2002 James Van Buskirk
! Tests sqrt function for Z'5F375A86'

module mysqrt_mod
implicit none
contains
function my_sqrt(x)
real, intent(in) :: x
real my_sqrt
real temp
integer magic
data magic / Z'5F375A86' /

my_sqrt = transfer(magic-ishft(transfer(x,magic),-1),my_sqrt)
my_sqrt = my_sqrt*(1.5-(x*my_sqrt)*(0.5*my_sqrt))
my_sqrt = my_sqrt*(1.5-(x*my_sqrt)*(0.5*my_sqrt))
temp = x*my_sqrt
my_sqrt = temp*(1.5-temp*(0.5*my_sqrt))
end function my_sqrt
end module mysqrt_mod

program test_magic
use mysqrt_mod
implicit none
real x

write(*,'(a)',advance='no') ' Enter the value of x:> '
read(*,*) x
if(x < 0) then
write(*,'(a)') ' Sorry, square root of negative not allowed.'
stop
end if
write(*,*) sqrt(x), my_sqrt(x)
end program test_magic

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Terje Mathisen
Posted: Sun Jan 02, 2005 4:20 am

Stephen Fuld wrote:
Quote:
In thinking about this, how about an alternative solution? The idea is to
add what remains to be added to the FP unit to allow it to do the integer
operations that were needed on the FP registers. Since an integer ALU is
pretty small, and the FP unit has much of what it needs already (e.g. an
integer adder that is most of the bits of a full register wide), the silicon
cost might be pretty modest. You wouldn't need multiple ALUs or extremely
fast ones as you are just trying to overcome the need for the slow move to
integer registers. You would need additional op codes to specify the
instructions, but you gain some side benefit in that I am sure that smart
users would come up with "integer" uses for these extra registers and ALU
independent of their floating point capability. What integer capabilities
would you need to support the FP bit fiddling stuff that one wants to do?

Sometimes you'd need memory addressing, i.e. table lookups!

The various SIMD instruction sets often contain integer/logical
operations, but never memory addressing (?), except for the special case
of Altivec in-register nybble lookups.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen
Posted: Sun Jan 02, 2005 4:33 am

James Van Buskirk wrote:

Quote:
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cr38dc$9vm$1@osl016lin.hda.hydro.com...


The two final requirements are for table lookups to start iterative
function evaluation, and the ability to quickly branch on a test of a fp
number.

Why do you need table lookups for iterative function evaluation?
It seems to me that if you're going to do that for SIMD sqrt
instructions you would have to have a table where you can look up 2
or 4 numbers simultaneously. Is that feasible? LUTs don't save you

Parallel lookups mostly don't work, but do take a look at the people
using GPUs for vector/simd style fp programming:

Graphics chips do allow you to lookup multiple values (nearly) in
parallel. :-)

Quote:
more than one iteration unless they're really big. Here's an
example of something you could code in some SIMD ISAs at this point:

! File test_magic.f90
! Public domain 2002 James Van Buskirk
! Tests sqrt function for Z'5F375A86'

module mysqrt_mod
implicit none
contains
function my_sqrt(x)
real, intent(in) :: x
real my_sqrt
real temp
integer magic
data magic / Z'5F375A86' /

my_sqrt = transfer(magic-ishft(transfer(x,magic),-1),my_sqrt)

What does transfer() do ...(google)... OK, so F90 actually has an
intrinsic function that requires fast coercion of values between fp and
integer interpretation! Nice. :-)

Quote:
my_sqrt = my_sqrt*(1.5-(x*my_sqrt)*(0.5*my_sqrt))
my_sqrt = my_sqrt*(1.5-(x*my_sqrt)*(0.5*my_sqrt))
temp = x*my_sqrt
my_sqrt = temp*(1.5-temp*(0.5*my_sqrt))

Effectively calculating inverse square root, with a final multiplication
by x to get square root.
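
For readers who don't speak Fortran, here is a rough Python transliteration of the quoted my_sqrt, offered as a sketch rather than an exact port. Assumptions: the Fortran default real is taken to be a 32-bit single, the two transfer() calls are mimicked with struct, and the Newton-Raphson steps run in Python's double precision, so the result will not match the Fortran output bit for bit. The helper names and rel_err are made up here. The input must be non-negative and within single-precision range.

```python
import math
import struct

MAGIC = 0x5F375A86  # the constant from the quoted Fortran listing

def float_to_bits(x: float) -> int:
    """Bit pattern of x rounded to single precision (like transfer(x, magic))."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a single (like transfer(i, my_sqrt))."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def my_sqrt(x: float) -> float:
    # Seed: a crude estimate of 1/sqrt(x) from integer arithmetic on the bits.
    y = bits_to_float(MAGIC - (float_to_bits(x) >> 1))
    # Two Newton-Raphson steps for 1/sqrt(x).
    y = y * (1.5 - (x * y) * (0.5 * y))
    y = y * (1.5 - (x * y) * (0.5 * y))
    # Third step folded together with the final multiplication by x.
    temp = x * y
    return temp * (1.5 - temp * (0.5 * y))

def rel_err(x: float) -> float:
    """Relative difference between my_sqrt and the library sqrt."""
    return abs(my_sqrt(x) - math.sqrt(x)) / math.sqrt(x)
```

No table is involved: the whole seed comes from one shift and one subtract on the bit pattern, which is why this form maps onto SIMD integer units.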

Quote:
end function my_sqrt
end module mysqrt_mod

Thanks for the code!

The one time I really needed code like this, sqrt(1/x) was desired, so
no extra calculation by x. However, the results needed to be nearly full
double precision, and the code spent more than 50% of the total runtime
doing this, which meant that a fairly big lookup table that reduced the
number of iterations to just two was really worthwhile.
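
A sketch of how such a table-seeded sqrt(1/x) could look, in Python for illustration only. Everything specific here is an assumption, not the code described above: the table size (M = 12 mantissa bits plus the low exponent bit, 8192 entries), the names TABLE and rsqrt, and the restriction to positive, normal, finite doubles. The index is taken straight from the upper bits of the double, which is exactly the fp-to-integer move discussed earlier in the thread.

```python
import math
import struct

M = 12  # mantissa bits used in the table index (assumed size)

def _build_table():
    table = []
    for i in range(1 << (M + 1)):
        exp_lsb = i >> M  # lowest bit of the biased exponent
        # Midpoint of the mantissa bucket selected by the low M index bits.
        frac = ((i & ((1 << M) - 1)) + 0.5) / (1 << M)
        # The bias 1023 is odd, so a biased-odd exponent is an even true exponent.
        scale = 1.0 if exp_lsb == 1 else 2.0
        table.append(1.0 / math.sqrt((1.0 + frac) * scale))
    return table

TABLE = _build_table()

def rsqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive normal x: table seed plus two NR steps."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    e = ((bits >> 52) & 0x7FF) - 1023               # unbiased exponent
    idx = (bits >> (52 - M)) & ((1 << (M + 1)) - 1)  # exponent LSB + top M mantissa bits
    y = math.ldexp(TABLE[idx], -(e >> 1))            # halve the exponent (floor)
    for _ in range(2):                               # two Newton-Raphson iterations
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

With this table size the seed is good to roughly 14 bits, and each Newton-Raphson step about doubles that, so two steps land near full double precision; a smaller table would need a third step.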

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld
Posted: Sun Jan 02, 2005 7:55 am

"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cr7b87$rpj$1@osl016lin.hda.hydro.com...
Quote:
Stephen Fuld wrote:
In thinking about this, how about an alternative solution? The idea is
to add what remains to be added to the FP unit to allow it to do the
integer operations that were needed on the FP registers. Since an
integer ALU is pretty small, and the FP unit has much of what it needs
already (e.g. an integer adder that is most of the bits of a full
register wide), the silicon cost might be pretty modest. You wouldn't
need multiple ALUs or extremely fast ones as you are just trying to
overcome the need for the slow move to integer registers. You would need
additional op codes to specify the instructions, but you gain some side
benefit in that I am sure that smart users would come up with "integer"
uses for these extra registers and ALU independent of their floating
point capability. What integer capabilities would you need to support
the FP bit fiddling stuff that one wants to do?

Sometimes you'd need memory addressing, i.e. table lookups!

You already have memory access in the form of loads and stores to/from the
FP registers. About all you need for that (as far as I can tell), is the
ability to use an FP register as an index offset in a load instruction.
This is basically just doing an add to get the address to pass to the
load unit, which already can load to the FP registers. I guess you don't
really need the indexed stores and probably don't need partial word
operations if you are using it just for table lookups for seed values, so
you are talking one additional op code. Not too bad.

--
- Stephen Fuld
e-mail address disguised to prevent spam

Nick Maclaren
Posted: Sun Jan 02, 2005 4:13 pm

In article <cr7c13$vbc$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
Quote:

The two final requirements are for table lookups to start iterative
function evaluation, and the ability to quickly branch on a test of a fp
number.

Why do you need table lookups for iterative function evaluation?
It seems to me that if you're going to do that for SIMD sqrt
instructions you would have to have a table where you can look up 2
or 4 numbers simultaneously. Is that feasible? LUTs don't save you

Parallel lookups mostly don't work, but do take a look at the people
using GPUs for vector/simd style fp programming:

Er, parallel lookups DO work! They were (and are) standard practice
on vector systems. The reason that they aren't typically used (or
available) on modern microprocessors is that they are real bandwidth
hogs. Unless you can provide that bandwidth, they are a waste
of time.

As we so clearly agree, the point about improved facilities for lookup
using floating-point registers as input is to avoid pipeline glitches.
I have seen the cost of the pipeline problems dominate the cost of
actually doing the calculation.


Regards,
Nick Maclaren.

Nick Maclaren
Posted: Sun Jan 02, 2005 4:23 pm

In article <HiNBd.1208027$Gx4.813636@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
Quote:

You already have memory access in the form of loads and stores to/from the
FP registers. About all you need for that (as far as I can tell), is the
ability to use an FP register as an index offset in a load instruction.
This is basically just doing an add to get the address to pass to the
load unit, which already can load to the FP registers. I guess you don't
really need the indexed stores and probably don't need partial word
operations if you are using it just for table lookups for seed values, so
you are talking one additional op code. Not too bad.

A lot of the confusion here is between people who have implemented
high-quality multi-precision code or special functions (i.e. perfect
handling of boundary and exceptional cases, high accuracy and high
performance) and those who have not. As so often, the devil is in the
details - until you have done such a project for real, it can be very
hard to understand the issues and why most current designs are
inadequate.


Regards,
Nick Maclaren.

Terje Mathisen
Posted: Mon Jan 03, 2005 12:10 am

Nick Maclaren wrote:

Quote:
In article <HiNBd.1208027$Gx4.813636@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:

You already have memory access in the form of loads and stores to/from the
FP registers. About all you need for that (as far as I can tell), is the
ability to use an FP register as an index offset in a load instruction.
This is basically just doing an add to get the address to pass to the
load unit, which already can load to the FP registers. I guess you don't
really need the indexed stores and probably don't need partial word
operations if you are using it just for table lookups for seed values, so
you are talking one additional op code. Not too bad.

A lot of the confusion here is between people who have implemented
high-quality multi-precision code or special functions (i.e. perfect
handling of boundary and exceptional cases, high accuracy and high
performance) and those who have not. As so often, the devil is in the
details - until you have done such a project for real, it can be very
hard to understand the issues and why most current designs are
inadequate.

BTDT, several times. <BG>

Today when hiking around in Nordmarka (the forests covering the hills
north of Oslo), I suddenly realized how I could do exact but still
efficient range reduction on really big values input to trig functions,
i.e. something like sin(1e300). :-)

Please check that this is correct:

Use a 1000+ bit value for 1/pi, stored as an array of fp values
one_over_pi[], with something like 24 significant bits per entry.

Start by checking if range reduction is really needed, i.e. is abs(x) >
2*pi or so?

If it is, split it into two parts, each with 26/27 significant mantissa
bits.

Multiply the one_over_pi[] array by each of these halves, this will be
exact since all products will have maximum 51 significant bits.

The key idea is that only a very limited number of these multiplications
are needed: Terms that are too small to contribute can be snipped, and
so can those that can only generate factors that correspond to an
integral number of revolutions.
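
A sketch of the splitting step only, in Python for illustration. The names split_double, is_exact_product and ONE_OVER_PI_0 are made up for this example, and the single constant stands in for the first entry of a one_over_pi[] array (0x517CC1 is the leading 24-bit chunk of 1/pi as far as I can tell, but treat it as a placeholder). The full array and the term-pruning logic are not shown.

```python
import struct
from fractions import Fraction

# Stand-in for one_over_pi[0]: a chunk with at most 24 significant bits.
ONE_OVER_PI_0 = 0x517CC1 * 2.0 ** -24

def split_double(x: float):
    """Split x into hi + lo == x exactly; hi keeps 26 significant bits, lo at most 27."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    hi_bits = bits & ~((1 << 27) - 1)          # clear the low 27 mantissa bits
    hi = struct.unpack("<d", struct.pack("<Q", hi_bits))[0]
    lo = x - hi                                 # exact: only the cleared bits remain
    return hi, lo

def is_exact_product(a: float, b: float) -> bool:
    """Check with rational arithmetic whether a * b was computed without rounding."""
    return Fraction(a * b) == Fraction(a) * Fraction(b)
```

Because 27 + 24 = 51 is below the 53-bit significand, every partial product of a half with a table entry is exact, which is what lets the later summation be done without losing anything.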

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld
Posted: Mon Jan 03, 2005 1:14 am

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:cr8lji$2aa$1@gemini.csx.cam.ac.uk...
Quote:
In article <HiNBd.1208027$Gx4.813636@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:

You already have memory access in the form of loads and stores to/from the
FP registers. About all you need for that (as far as I can tell), is the
ability to use an FP register as an index offset in a load instruction.
This is basically just doing an add to get the address to pass to the
load unit, which already can load to the FP registers. I guess you don't
really need the indexed stores and probably don't need partial word
operations if you are using it just for table lookups for seed values, so
you are talking one additional op code. Not too bad.

A lot of the confusion here is between people who have implemented
high-quality multi-precision code or special functions (i.e. perfect
handling of boundary and exceptional cases, high accuracy and high
performance) and those who have not. As so often, the devil is in the
details - until you have done such a project for real, it can be very
hard to understand the issues and why most current designs are
inadequate.

I may be missing things totally, but I don't feel confused. Terje asked for
a capability in order to implement some code that he, and apparently many
others, feel is important in a more efficient way than can be currently done.
I proposed such an instruction. I may be totally missing something here,
but ISTM that I don't need to have implemented high precision special
functions, etc. in order to suggest a solution to the sub part of that
problem that Terje well specified. If I missed part of the specification,
fine, I am more than willing to admit that (and to try to correct it) if
someone shows me what I missed. But otherwise, I don't see the relevance of
your comment to my proposed additional instruction for faster use of seed
values.

I am not trying to be difficult - just trying to understand the need and
perhaps explore solutions for that need. So what am I missing?

--
- Stephen Fuld
e-mail address disguised to prevent spam

Bernd Paysan
Posted: Mon Jan 03, 2005 1:40 am

Terje Mathisen wrote:
Quote:
Today when hiking around in Nordmarka (the forests covering the hills
north of Oslo), I suddenly realized how I could do exact but still
efficient range reduction on really big values input to trig functions,
i.e. something like sin(1e300). :-)

Please check that this is correct:

Use a 1000+ bit value for 1/pi, stored as an array of fp values
one_over_pi[], with something like 24 significant bits per entry.

Start by checking if range reduction is really needed, i.e. is abs(x) >
2*pi or so?

If it is, split it into two parts, each with 26/27 significant mantissa
bits.

Multiply the one_over_pi[] array by each of these halves, this will be
exact since all products will have maximum 51 significant bits.

The key idea is that only a very limited number of these multiplications
are needed: Terms that are too small to contribute can be snipped, and
so can those that can only generate factors that correspond to an
integral number of revolutions.

Hm, once you exceed 2pi*2^(mantissa size), you get into aliasing mode. I.e.
each LSB change takes you more than 2pi away, i.e. you can take that
integer multiple of 2pi out of the equation. Thus you can reduce n*2^m to
n*x, with x=2^m mod 2pi (n and m integers). The table doesn't need to be
complete for all possible m, and you can decompose the number: a+b+c mod
2pi is identical to (a mod 2pi + b mod 2pi + c mod 2pi) mod 2pi.

So my algorithm would be: Split the number in 3 parts, each in the form
n*2^(32m), with n and m integer (all n therefore are 32 bit integers). Use
the ms as index into the table, and compute (n*x) mod 2pi for each part
(this doesn't need to be an extra-precise 2pi, but a precise modulo
operation - one that uses a multiply&accumulate without intermediate
rounding). Add the parts together, and do another mod 2pi on the (small)
result.
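
The splitting step above can be sketched as follows, in Python for illustration. The function name split_32 is made up, the input is assumed to be a positive finite double, and the table of 2^(32m) mod 2pi values and the precise modulo operation are not shown.

```python
import math
from fractions import Fraction

def split_32(x: float):
    """Split positive x into parts (n, m) with x == sum(n * 2**(32*m)), each n < 2**32."""
    mant, exp = math.frexp(x)            # x == mant * 2**exp, 0.5 <= mant < 1
    n = int(math.ldexp(mant, 53))        # 53-bit integer significand
    e = exp - 53                         # so x == n * 2**e
    shift = e % 32                       # align the exponent to a multiple of 32
    n <<= shift                          # n now has at most 53 + 31 = 84 bits
    e -= shift
    parts = []
    while n:
        chunk = n & 0xFFFFFFFF
        if chunk:                        # skip all-zero chunks
            parts.append((chunk, e // 32))
        n >>= 32
        e += 32
    return parts

def recombine(parts) -> Fraction:
    """Exact rational sum of the parts, for checking the split."""
    return sum((Fraction(n) * Fraction(2) ** (32 * m) for n, m in parts), Fraction(0))
```

An 84-bit aligned significand never needs more than three 32-bit chunks, which matches the "3 parts" in the description; each m would then index the table.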

You can calculate 2^m mod 2pi successively, without having to calculate a
high-precision pi at all. Use the identities sin(2x) = 2 sin(x) cos(x) and
cos(2x) = 1 - 2 sin²(x); atan2(sin(x), cos(x)) then gives you the required
mod 2pi value. Start with sincos(1.0).
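
A sketch of the doubling recurrence, in Python for illustration. The function name pow2_mod_2pi is made up. Two caveats that are assumptions of this sketch rather than part of the description above: atan2 returns the residue in (-pi, pi] rather than [0, 2pi), and in plain double precision the error grows with every doubling, so building a real table for large m would need extended-precision sin and cos values.

```python
import math

def pow2_mod_2pi(m: int) -> float:
    """Residue of 2**m modulo 2*pi, in (-pi, pi], via repeated angle doubling."""
    s, c = math.sin(1.0), math.cos(1.0)      # start with sincos(1.0)
    for _ in range(m):
        # sin(2x) = 2 sin(x) cos(x); cos(2x) = 1 - 2 sin(x)^2
        s, c = 2.0 * s * c, 1.0 - 2.0 * s * s
    return math.atan2(s, c)
```

For example, three doublings give the residue of 8, which is 8 - 2*pi.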

I don't care about exact reduction, since for me, a floating point number is
an interval (+- 1/2ULP), so sin(1e300) is correct when it delivers anything
between 1 and -1 - an interval arithmetic sin(1e300) should deliver [-1;1].
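
The interval argument can be checked numerically. A minimal sketch in Python (math.ulp needs Python 3.9 or later; the function name is made up): once one ULP of x is at least 2pi wide, the interval x +- 1/2 ULP covers a full revolution.

```python
import math

def spans_full_circle(x: float) -> bool:
    """True if the interval x +- 1/2 ULP is at least 2*pi wide."""
    return math.ulp(x) >= 2.0 * math.pi
```

For doubles this first happens at 2^55, where one ULP is 8.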

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Terje Mathisen
Posted: Mon Jan 03, 2005 4:37 pm

Bernd Paysan wrote:
[snipped discussion which seemed to agree that you could do the
reduction in a simplified manner?]

Quote:
I don't care about exact reduction, since for me, a floating point number is
an interval (+- 1/2ULP), so sin(1e300) is correct when it delivers anything
between 1 and -1 - an interval arithmetic sin(1e300) should deliver [-1;1].

I used to be firmly in that camp, but the programmer in me would still
like to do the 'right thing' even when faced with obviously out-of-range
inputs. :-)

Also, there are situations where you know that some huge value is exact,
not just a fp approximation, in which case exact range reduction does
make sense.

OTOH, always returning 0 (or 1.0 or NaN?) for sin(1e300) would at least
give the user/programmer a hint that something was suspicious. :-(

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Nick Maclaren
Posted: Mon Jan 03, 2005 5:02 pm

In article <cr9guk$a4e$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
Quote:

BTDT, several times. <BG>

I know :-)

Quote:
Today when hiking around in Nordmarka (the forests covering the hills
north of Oslo), I suddenly realized how I could do exact but still
efficient range reduction on really big values input to trig functions,
i.e. something like sin(1e300). :-)

Please check that this is correct:

Been there - done that. Yes, it works. No, it isn't very efficient,
but isn't at all bad. As far as I know, it is the only viable way to
do fully accurate range reduction - i.e. all of the others are far
slower.


Regards,
Nick Maclaren.

All times are GMT