RISC vs. CISC design principles
Paul Rubin
Posted: Thu Jan 13, 2005 10:26 pm    Post subject: Re: RISC vs. CISC design principles

"James Van Buskirk" <not_valid@comcast.net> writes:
Quote:
Is it still 3 IPC in Opteron, then? There are codes, such as DFT
codes, that really make you wish RISC had won and CISC had lost.
My 21164 could sustain 4 IPC 8 years ago and had over 2X the clock
of the nearest contemporary Pentium.

1) I think the x86 deals with that through special SSE2 instructions
for common operations like DFT.

2) Opteron has 16 registers, an improvement over 8.

Stephen Fuld
Posted: Thu Jan 13, 2005 10:34 pm    Post subject: Re: RISC vs. CISC design principles

"Paul A. Clayton" <carchreader@aol.comnomail> wrote in message
news:20050113105819.01270.00000034@mb-m23.aol.com...

snip

Quote:
I should not have used practicality. My concern was with comparing
design principles given a clean slate for modern tradeoffs and trying
to understand the reasoning behind the design choices that generated
CISCs and RISCs in their historical context.

With regard to both variable length instructions and unaligned storage
operations, you have to go back in history to the original RISC era. The
idea was that you gained so much by eliminating the off chip connection
delay in favor of everything within one chip that the single chip
requirement pretty much dominated everything else. Now look at the number
of transistors one could get on a single chip at that time. That dictated
eliminating a lot of features that might otherwise be desirable. So all
instructions being the same length saved a lot of transistors (and sped up
decoding) and that was a much bigger issue then than it is now. That is why
you now see multi-length instruction sets added to some RISC chips, and why
the pretty much full generality of X-86 costs so little that it can still be
quite fast.

With regard to unaligned memops, I think it is useful to divide them into
two cases. The first is where the entire operation is contained within one
cache line/page. These will be much more frequent and probably are easier
to make fast. The other is where a cache line or even a page boundary is
crossed; these are much less frequent and of course have to be done
correctly, but it is less important that they be fast. Note that as cache
lines get larger, the first case becomes more frequent. And since (to go
back to your question) the first RISC chips had no on-chip cache, all cases
had to go directly to memory, and unaligned accesses were much costlier
(both in terms of time and the then all important transistor count).
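
The two cases can be told apart with a little address arithmetic. The
following is a minimal C sketch, not something from the original post: the
function name crosses_boundary and the block sizes used in the usage note
(64-byte lines, 4096-byte pages) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if an access of `size` bytes starting at `addr` touches
 * more than one block of `block` bytes. `block` stands for a cache line
 * or page size and must be a power of two. (Hypothetical helper.) */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t block)
{
    assert(size > 0 && block > 0 && (block & (block - 1)) == 0);
    uintptr_t first = addr / block;              /* block holding first byte */
    uintptr_t last  = (addr + size - 1) / block; /* block holding last byte  */
    return first != last;
}
```

With an assumed 64-byte line, a 4-byte access at address 60 stays inside one
line (the easy case), while the same access at address 62 straddles two lines
(the rare case that only has to be correct). Raising `block` makes fewer
starting addresses straddle, which is the point about larger cache lines.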

I suspect if some architect demanded unaligned access support in a
hypothetical new chip or new version of an existing chip that doesn't now
have it, the hardware guys would grumble a lot and then do a good job of
making it fast in the common cases and correct in all cases. But I would
appreciate comments from people who know more about that aspect of things
than I do.

--
- Stephen Fuld
e-mail address disguised to prevent spam

Greg Lindahl
Posted: Thu Jan 13, 2005 10:43 pm    Post subject: Re: RISC vs. CISC design principles

In article <7xpt09qlye.fsf@ruckus.brouhaha.com>,
Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote:

Quote:
1) I think the x86 deals with that through special SSE2 instructions
for common operations like DFT.

This subthread was talking about Opteron. Opteron uses "double decode"
for SSE, so these instructions consume 2 slots. I wouldn't be
surprised if the various Pentium implementations did something
similar... if you only have 2 floating point units, how do you
expect them to implement an instruction that does 4 single-precision
floating point ops?

-- greg

Nick Maclaren
Posted: Thu Jan 13, 2005 10:49 pm    Post subject: Re: RISC vs. CISC design principles

In article <SRxFd.286$S11.23@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.fuld@PleaseRemove.att.net> wrote:
Quote:

Secondly, the paragraph that you snipped explains why all portable
programs (and most correct ones) use packing and unpacking primitives
when dealing with arbitrary (binary) input files.

But don't these primitives benefit from being able to handle unaligned data
efficiently?

Yes and no. Because of the endian and other problems I mentioned,
there is little point in accessing the data DIRECTLY - macros or
functions are always a better solution. And the difference in
efficiency between using (say) unaligned integer loads and loading
a character at a time is usually small.
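
As an illustration of the kind of unpacking primitive meant here, the
following C sketch reads a 32-bit value one character at a time. It is only
a sketch under assumptions: the names load_le32 and load_be32 are made up
for this example and do not come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value byte by byte. Works at any alignment
 * and gives the same result on any host byte order. (Hypothetical name.) */
static uint32_t load_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Same idea for a big-endian field. (Hypothetical name.) */
static uint32_t load_be32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}
```

Because the byte order is spelled out in the code rather than inherited from
the host, the same function serves on either kind of machine, and the field
may start at any offset in the input buffer.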


Regards,
Nick Maclaren.

Paul Rubin
Posted: Thu Jan 13, 2005 10:56 pm    Post subject: Re: RISC vs. CISC design principles

lindahl@pbm.com (Greg Lindahl) writes:
Quote:
This subthread was talking about Opteron. Opteron uses "double decode"
for SSE, so these instructions consume 2 slots. I wouldn't be
surprised if the various Pentium implementations did something
similar... if you only have 2 floating point units, how do you
expect them to implement an instruction that does 4 single-precision
floating point ops?

I thought they had a way of making a double precision unit do two
simultaneous single precision ops.

Stefan Monnier
Posted: Thu Jan 13, 2005 11:02 pm    Post subject: Re: RISC vs. CISC design principles

Quote:
I have always had the impression that for the original Californian RISC
designs, a major consideration was that all of a processor core should
fit on one chip. Grudgingly, (some of) the MMU was made optional or put

Agreed. I tend to think of the "typical RISC" as being designed using the
following idea:
"hey, if we move some of the complexity out of the CPU and into the
compiler, we can use those fancy mainframe techniques like pipelining
on mere microprocessors".

Which of course was only relevant at the time (because single-chip
transistor budgets were just big enough for that to work but small enough
that those techniques couldn't be used without the simplifications of
RISC).

I also felt that a significant element was the fact that this particular
design point had the advantage of being "academy-friendly": it was OT1H
very clean and OTOH doable without monstrous investment, and the icing on
the cake was that it even worked well.


Stefan

Terje Mathisen
Posted: Fri Jan 14, 2005 1:51 am    Post subject: Re: RISC vs. CISC design principles

Nick Maclaren wrote:

Quote:
In article <cs6184$a81$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
That's what the BSD code did when I looked at it.

Not too surprising, it is the obvious solution. :-)

Quote:
|> This will work as long as the input was valid, i.e. a terminating zero
|> was actually found.

But did those correctly diagnose the error if the input was NOT
valid? That is what I meant.

To get correct behaviour, you'll have to reload the last (aligned!)
word, using regular (trapping) operations.

This is actually similar to the way you can rewrite Java programs to
only require tests at buffer ends, split the code path, and then add a
known-to-trap load in the case where that should have happened. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Guest
Posted: Fri Jan 14, 2005 4:29 am    Post subject: Re: RISC vs. CISC design principles

"Is it still 3 IPC in Opteron, then? There are codes, such as DFT
codes, that really make you wish RISC had won and CISC had lost.
My 21164 could sustain 4 IPC 8 years ago and had over 2X the clock
of the nearest contemporary Pentium."

Yes, it is still 3 x86 instructions; but an 'average' x86 instruction
contains 1.4 RISC instructions of semantic content.

In terms of SSE, Opteron can decode 3 SSE 128-bit wide instructions per
cycle, which is equivalent to six RISC DP FP ops or six DP FP ops + six load
ops (depending upon reg-reg or reg-mem versions). These algorithms end up
pipeline (or memory) limited, not decode limited.

In terms of clock rate, I don't see any RISCs keeping up with the CISCs
(referring back to the cubic dollars argument).

"It also means having one-register or two-
register instructions, which is unpleasant in DFT codes because
the whole idea is to reuse as many results as possible, and when
the architecture says that each instruction overwrites a potentially
useful result, it makes former RISC assembly programmers unhappy."

Let me just say that I suspect that certain varieties of mov ops followed
by compute ops will get automagically converted into single 3-register
pipeline ops in the not too distant future. So if you stop thinking about
instructions only getting expanded into pipelineable ops, your RISC
unhappiness will subside.

Mitch

Andi Kleen
Posted: Fri Jan 14, 2005 5:09 am    Post subject: Re: RISC vs. CISC design principles

MitchAlsup@aol.com writes:
Quote:

In terms of clock rate, I don't see any RISCs keeping up with the CISCs
(referring back to the cubic dollars argument).

The PPC970? It hasn't quite reached the promised 3 GHz yet, but neither
has the Opteron. Opteron is shipping at 2.4 GHz; the PPC970 has been
shipping at 2.5 GHz for longer.

-Andi

D. J. Bernstein
Posted: Fri Jan 14, 2005 5:46 am    Post subject: Re: RISC vs. CISC design principles

James Van Buskirk wrote:
Quote:
two-register instructions, which is unpleasant

The RISC philosophy in a nutshell: ``We want x = y; x *= z to be
compressed into x = y * z, but we don't want x = mem[i]; y *= x to be
compressed into y *= mem[i].''

Quote:
My 21164 could sustain 4 IPC 8 years ago and had over 2X the clock
of the nearest contemporary Pentium.

I'm curious: How much did you pay for that 21164?

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago

Maynard Handley
Posted: Fri Jan 14, 2005 7:45 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip

In article <pan.2005.01.13.03.00.01.273325@areilly.bpc-users.org>,
Andrew Reilly <andrew-newspost@areilly.bpc-users.org> wrote:

Quote:
On Thu, 13 Jan 2005 02:46:34 +0000, Maynard Handley wrote:

In article <cs41hi$62t$1@gemini.csx.cam.ac.uk>,
nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>,
<MitchAlsup@aol.com> wrote:
"At current hardware budgets, the aligned memory access
requirement is probably the least useful of the RISC mechanisms."

In my career, I have never seen a significant use of it except to
cover up misdesigned interfaces - in particular, ones that have
failed to take the decision whether they are based on semi-abstract
types like integers and floating-point or on precisely specified
bit patterns.


You obviously have never programmed AltiVec, have you, Nick?

What's that got to do with anything? (I haven't programmed AltiVec,
per-se, myself. If a compiler has done it on my behalf, good on it. If a
compiler hasn't been able to do it on my behalf, then perhaps that says
something about the architecture of AltiVec.)

So let's review.
Nick says "unaligned memory access is not very useful".
I say that it sodding well is useful, it's a shame (though very
understandable) that AltiVec does not provide it, and that ten years of
experience working with modern codecs has shown me many many situations
where material is NOT "naturally" aligned.

Your response to that, parroted by Nick, is
"I've never actually used AltiVec (but you're wrong anyway), and by the
way modern codecs do a fine job of describing how the bit stream is
packed".
Excuse me for barfing at the sheer pointlessness of this reply, since
the packedness of the material in the bitstream has ZERO to do with the
issue of how well it is adapted to naturally aligned packing. Heck, even
the most clueless undergrad should know that the first stage in decoding
data (or the last stage in encoding data) consists of bit-parsing and
twiddling to handle the entropy coding, usually followed by a table
lookup. It's only at that point that you handle modelling (transforms,
motion comp and so on) which is where something like AltiVec is useful.

My whole point was that the specific nature of these codecs (for example
the way that H264 breaks the image up into variable sized blocks which
can be as small as 4x4) means that however you slice and dice the
problem (and you have complete control over the memory structures ---
these are all internal) you're going to spend a lot of your time wanting
to load vectors that are not aligned to a multiple of 16.

If Nick wants to say that unaligned memory access is not useful for his
little corner of the world, a corner that does not deal with
multi-media, that's fine. But Nick, as is his way, is very fond of
making grandiose claims for the entire freaking computer universe.

(More about audio algorithms below.)

Quote:
While I understand why AltiVec does not allow for unaligned accesses,
and accept that it may well have been and continue to be the correct
tradeoff, the fact is that it is a pain to deal with. And, Nick, please
don't give me any BS about how properly designed code would not require
this. If you've no experience with either AltiVec programming or modern
day audio and video compression algorithms, you're not in a position to
make this claim.

I would say that modern-day audio and video compression standards are a
good example of file (and communication) formats done *well*, by Nick's
standards, as they are universally (in my experience) defined in terms of
packed bit-strings, rather than fwrite(c-struct) /* and-hope-it-ports-ok,
later */, which was what Nick was complaining about (I believe).

At an audio *algorithm* level, rather than file format level, I've never
encountered anything that would enforce or encourage unaligned floating
point accesses, which is just as well, since most of the DSPs I code for
are still word-addressed.

So, for example, if one is dealing with, say, MPEG audio, one is faced
with the problem of computing the convolution at pretty much the last
stage of the algorithm, using an index that increments by one each
iteration --- meaning that 3 times out of 4 the data one wants to load
is not naturally aligned with AltiVec 16-byte wide (ie 4 fp wide)
registers.
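
The "3 times out of 4" figure follows from a simple count, sketched below in
C. This is only an illustration: the function name aligned_starts is made up
here, and it assumes the array itself begins on a vector boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many of the start indices 0..n-1 into an array of elements of
 * elem_bytes each land on a vec_bytes boundary, assuming the array itself
 * starts on such a boundary. (Hypothetical helper.) */
static size_t aligned_starts(size_t n, size_t elem_bytes, size_t vec_bytes)
{
    assert(elem_bytes > 0 && vec_bytes > 0 && vec_bytes % elem_bytes == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i * elem_bytes) % vec_bytes == 0)
            count++;
    }
    return count;
}
```

For 4-byte floats and 16-byte vectors, only every fourth start index is
aligned, so a window that slides by one element per iteration is unaligned
on the other three.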

Maynard

Andrew Reilly
Posted: Fri Jan 14, 2005 7:57 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip

On Fri, 14 Jan 2005 03:45:16 +0000, Maynard Handley wrote:

Quote:
So let's review.
Nick says "unaligned memory access is not very useful".

And the context was "RISC vs CISC", and (to me) the unalignedness was in
terms of individual words of whatever sort. Natural alignment of data
types. That's not a really big restriction, and I'll stick by saying that
it's no biggie.

Quote:
I say that it sodding well is useful, it's a shame (though very
understandable) that AltiVec does not provide it, and that ten years of
experience working with modern codecs has shown me many many situations
where material is NOT "naturally" aligned.

And it seems, now, that you've leapt in and said that because AltiVec
requires alignment not just on floating point boundaries but on entire
fixed-length vectors of them, that "unaligned access" (without further
restriction) is necessary. Of course. However understandable (your
words), that does seem to be a pretty crippling deficiency of AltiVec,
particularly for nearly all of the audio signal processing algorithms that
I can think of. How "RISC" is AltiVec if compilers can't use it to help
speed up existing algorithms and existing code? Is it RISC just because
it has a monumental alignment restriction?

All I can say to that argument is that it's a pretty daft extension to the
notion of "natural alignment", particularly if the object of the exercise
is to be able to compute existing numeric algorithms efficiently, rather
than just being able to claim the best peak flops numbers.

Quote:
Your response to that, parroted by Nick, is
"I've never actually used AltiVec (but you're wrong anyway), and by the
way modern codecs do a fine job of describing how the bit stream is
packed".
Excuse me for barfing at the sheer pointlessness of this reply, since
the packedness of the material in the bitstream has ZERO to do with the
issue of how well it is adapted to naturally aligned packing.

Try reading the thread again, after your barf. The issue being
responded-to was "unaligned" values occurring in popular (but perhaps
poorly or unfortunately specced) file and wire formats. The sentence
above makes no sense at all in the context of the discussion.

Quote:
Heck, even
the most clueless undergrad should know that the first stage in decoding
data (or the last stage in encoding data) consists of bit-parsing and
twiddling to handle the entropy coding, usually followed by a table
lookup.

Yup. Nicely defined, and access not susceptible to endianness or
alignment issues. Not like many disk file and network protocols, which
are pretty much defined as fwrite(some_C_struct, sizeof(*some_C_struct),
1, desc), on some specific computer system, to the eventual
annoyance of anyone using a system with different alignment/ endianness/
compiler struct padding/ compiler switches/ etc.
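
The contrast can be sketched in C. Everything named below (struct record,
RECORD_WIRE_SIZE, pack_record) is hypothetical and exists only for this
illustration; big-endian order is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record, used only for illustration. */
struct record {
    uint8_t  tag;
    uint32_t value;
};

/* Size of the record as defined on the wire: 1 byte of tag plus 4 bytes
 * of value, with no padding. */
enum { RECORD_WIRE_SIZE = 5 };

/* Serialize field by field in a fixed (big-endian) byte order, so the
 * output does not depend on host endianness or struct padding. */
static size_t pack_record(const struct record *r,
                          unsigned char out[RECORD_WIRE_SIZE])
{
    assert(r != NULL && out != NULL);
    out[0] = r->tag;
    out[1] = (unsigned char)(r->value >> 24);
    out[2] = (unsigned char)(r->value >> 16);
    out[3] = (unsigned char)(r->value >> 8);
    out[4] = (unsigned char)(r->value);
    return RECORD_WIRE_SIZE;
}
```

On many common ABIs sizeof(struct record) is 8 rather than 5, because the
compiler pads after tag to align value. Writing the struct out directly with
fwrite would therefore bake that padding, and the host byte order, into the
file format, whereas the packed form is the same everywhere.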

Quote:
It's only at that point that you handle modelling (transforms, motion
comp and so on) which is where something like AltiVec is useful.

You brought AltiVec up. Hadn't been mentioned before in the thread. We
*had* been discussing file and wire formats and alignment issues, though.

Quote:
My whole point was that the specific nature of these codecs (for example
the way that H264 breaks the image up into variable sized blocks which
can be as small as 4x4) means that however you slice and dice the
problem (and you have complete control over the memory structures ---
these are all internal) you're going to spend a lot of your time wanting
to load vectors that are not aligned to a multiple of 16.

Yup. That's how maths works. You don't, however, ever need to read any
of those individual floating point numbers from non-aligned addresses.

Quote:
If Nick wants to say that unaligned memory access is not useful for his
little corner of the world, a corner that does not deal with
multi-media, that's fine. But Nick, as is his way, is very fond of
making grandiose claims for the entire freaking computer universe.

(More about audio algorithms below.)

While I understand why AltiVec does not allow for unaligned accesses,
and accept that it may well have been and continue to be the correct
tradeoff, the fact is that it is a pain to deal with. And, Nick,
please don't give me any BS about how properly designed code would
not require this. If you've no experience with either AltiVec
programming or modern day audio and video compression algorithms,
you're not in a position to make this claim.

I would say that modern-day audio and video compression standards are a
good example of file (and communication) formats done *well*, by Nick's
standards, as they are universally (in my experience) defined in terms
of packed bit-strings, rather than fwrite(c-struct) /*
and-hope-it-ports-ok, later */, which was what Nick was complaining
about (I believe).

At an audio *algorithm* level, rather than file format level, I've
never encountered anything that would enforce or encourage unaligned
floating point accesses, which is just as well, since most of the DSPs
I code for are still word-addressed.

So, for example, if one is dealing with, say, MPEG audio, one is faced
with the problem of computing the convolution at pretty much the last
stage of the algorithm, using an index that increments by one each
iteration --- meaning that 3 times out of 4 the data one wants to load
is not naturally aligned with AltiVec 16-byte wide (ie 4 fp wide)
registers.

Well, that sucks. Doesn't AltiVec have permutation operations to at least
help with that sort of thing?
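
For readers unfamiliar with the technique being asked about, here is a
portable C sketch of how an unaligned vector load can be assembled from
aligned blocks plus a selection step. It is an emulation under assumptions,
not AltiVec code: the name load_unaligned and the LANES constant are made up
for this example. On real AltiVec hardware the usual idiom, as far as I
know, pairs two aligned vec_ld loads with a vec_perm whose control vector
comes from vec_lvsl.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* four floats per 16-byte vector */

/* Emulate loading LANES floats starting at element index `start` when only
 * whole blocks beginning at multiples of LANES may be loaded. `base` is
 * assumed to point at an aligned array; when `start` is not a multiple of
 * LANES, the block after the one holding `start` must exist too. */
static void load_unaligned(const float *base, size_t start, float out[LANES])
{
    assert(base != NULL && out != NULL);
    size_t block = start / LANES * LANES; /* aligned block holding `start` */
    size_t shift = start % LANES;         /* offset of `start` in that block */
    const float *lo = base + block;       /* first aligned block */
    const float *hi = lo + LANES;         /* following aligned block */
    for (size_t i = 0; i < LANES; i++) {
        size_t j = shift + i;
        /* Select each lane from whichever block contains it. */
        out[i] = (j < LANES) ? lo[j] : hi[j - LANES];
    }
}
```

The selection loop plays the role of the permute: the cost of an unaligned
access becomes a second aligned load plus one shuffle, which is why it is a
nuisance in an inner loop rather than an outright impossibility.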

Is there no scope for doing the loop-order inversion trick, so that the
words in your altivec vectors are successive bins, and the shifting-order
index is over blocks of bins? That tends to need more memory bandwidth
than the in-register accumulator approach, but maybe machines with AltiVec
have such bandwidth (in cache, anyway)?

I'd just note that AltiVec and its restrictions don't by any means define
the universe of multimedia and audio implementation strategies. Lots of
that still takes place on DSPs and other embedded processors that work
just fine one word at a time.

--
Andrew

James Van Buskirk
Posted: Fri Jan 14, 2005 7:57 am    Post subject: Re: RISC vs. CISC design principles

"D. J. Bernstein" <djb@cr.yp.to> wrote in message
news:slrncuen6e.1rmh.usenet@stoneport.math.uic.edu...

Quote:
I'm curious: How much did you pay for that 21164?

I hate to think about it, but OTOH I could have shelled out
that much and more and not gotten as much use, education, and
entertainment out of it. BTW, have you ever considered
changing djbfft to get better opcounts along the lines of

http://home.comcast.net/~kmbtib/ ?

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

James Van Buskirk
Posted: Fri Jan 14, 2005 7:57 am    Post subject: Re: RISC vs. CISC design principles

<MitchAlsup@aol.com> wrote in message
news:1105658951.043680.297100@c13g2000cwb.googlegroups.com...

Quote:
Yes, it is still 3 x86 instructions; but an 'average' x86 instruction
contains 1.4 RISC instructions of semantic content.

I would have thought that what an 'average' x86 instruction did
would depend on the kind of code being executed. The instruction

addt $f0, $f1, $f2

has the semantic content 'take what's in register $f0, add it
to what's in register $f1, and overwrite the register of my
choosing ($f2) with the result.' In a 2-register ISA, that
would involve storing the contents of $f0 in memory, adding
$f1 to $f0, copying the contents of $f0 to $f2, then reloading
$f0 from memory. Looks like about 4 x86 instructions to me,
rather inconsistent with a 1.4X semantic content advantage
for x86. Since DFT code is FP addition-heavy, this is the
most frequent instruction in this kind of algorithm.

Quote:
In terms of SSE, Opteron can decode 3 SSE 128-bit wide instructions per
cycle, which is equivalent to six RISC DP FP ops or six DP FP ops + six load
ops (depending upon reg-reg or reg-mem versions). These algorithms end up
pipeline (or memory) limited, not decode limited.

But there are two big problems with the above:
1) Opteron can't sustain issuance of 3 128-bit SSE2 instructions
per clock.
2) At any given stage of a DFT or convolution, there is no
guarantee that your data will be aligned (higher level languages
don't do so) or even contiguous, so the 128-bit wide versions
of the SSE2 instructions aren't in general useful. As outlined
earlier, many or even most SSE2 instructions will be wasted
saving data before it gets overwritten by 2-register operations
or spilling and filling due to the smaller register file.

Quote:
Let me just say that I suspect that certain varieties of mov ops followed
by compute ops will get automagically converted into single 3-register
pipeline ops in the not too distant future. So if you stop thinking about
instructions only getting expanded into pipelineable ops, your RISC
unhappiness will subside.

Dream on. Maybe x86 will be able, after over 2 decades and
causing many difficult to find mega-stalls, to emulate
3-register operations, kind of like the way it was able to
incorporate more physical registers by going OOO, but OOO
carries baggage with it and so, I would imagine, would
emulation of 3-register semantics.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Paul Rubin
Posted: Fri Jan 14, 2005 7:57 am    Post subject: Re: RISC vs. CISC design principles

"James Van Buskirk" <not_valid@comcast.net> writes:
Quote:
addt $f0, $f1, $f2

has the semantic content 'take what's in register $f0, add it
to what's in register $f1, and overwrite the register of my
choosing ($f2) with the result.' In a 2-register ISA, that
would involve storing the contents of $f0 in memory, adding
$f1 to $f0, copying the contents of $f0 to $f2, then reloading
$f0 from memory. Looks like about 4 x86 instructions to me,

You can add register to register.
mov f2, f0 # f2 = f0 (copy f0 to f2; destination first)
add f2, f1 # f2 += f1

The x86 OOO may be able to combine these two instructions into one
internal operation.

Page 3 of 8. All times are GMT.