RISC vs. CISC design principles
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
Eric P.
Guest





Posted: Thu Jan 20, 2005 1:05 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

Terje Mathisen wrote:
Quote:

Eric P. wrote:

Terje Mathisen wrote:
Hmmm... what else might be affected?

- Load-Store queue must do more complex overlap checks before
allowing read or write reordering

Not too much though: Currently it must take into consideration both base
and length of each operation; this extension could conservatively
extend this to be the aligned base, and the extended length.

Oops yes, I should have seen that. I kept thinking this required
arithmetic. If the largest operand is 8 bytes then round down to
a 16 byte boundary and check for overlap on the 16 byte blocks.
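
For illustration, a minimal sketch of that conservative check in C. It assumes a 16-byte block size, accesses of at least one byte, and no address wraparound; the names may_overlap, block_lo and block_hi are made up here and do not come from any real load-store queue design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed block size: 16 bytes, chosen because the largest operand is 8 bytes. */
#define BLOCK_MASK ((uint64_t)15)

/* First byte of the 16-byte block that contains addr. */
static uint64_t block_lo(uint64_t addr) {
    return addr & ~BLOCK_MASK;
}

/* Last byte of the 16-byte block that contains the final byte of the access. */
static uint64_t block_hi(uint64_t addr, unsigned len) {
    return (addr + len - 1) | BLOCK_MASK;
}

/* Conservative test: true if the two accesses touch a common 16-byte block.
   It may report an overlap for accesses that share no byte, but it never
   misses a real overlap. Assumes len >= 1 and no address wraparound. */
bool may_overlap(uint64_t a, unsigned alen, uint64_t b, unsigned blen) {
    return block_lo(a) <= block_hi(b, blen) && block_lo(b) <= block_hi(a, alen);
}
```

Two 8-byte accesses at addresses 0 and 8 are reported as overlapping even though they share no byte, which is the price of comparing blocks rather than exact ranges.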


Eric
Guest






Posted: Thu Jan 20, 2005 1:10 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

"I don't follow this. If a segment offset wraps around, it is the
same as linear address wrap around which is the same as any other
page straddle. I see only 2 pages and therefore 2 cache lines
being touched."

Intel: Pentium Pro Family Developer's Manual: Volume 3 Operating System
Writer's Manual Copyright 1996, page 3-9 order number 242692

Quote "If the granularity flag is clear, the segment size can range
from 1 Byte to 1 MByte in byte increments" Unquote

So, the hardware has to be capable of splitting a cache line with
address wrap. As I said, this is Yech material and this never happens
in actual practice, except for the test vector writers. All of the OSs
(that I know about) used the large granules, thereby restricting the
problem to 2 memory faults/address.

Mitch
Terje Mathisen
Guest





Posted: Thu Jan 20, 2005 2:23 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

Greg Lindahl wrote:

Quote:
In article <csldqe$nra$3@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:


The good thing is that the average L1 cache line size seems to increase
at about the same rate as the maximum/average load unit size, which
means that the percentage of misaligned loads that would straddle a
boundary stays about the same or goes down.


Is this a statement about your personal codes, or do you really think
this generalization applies to everyone? Lots of codes can't use SIMD
instructions, so their load unit size hasn't changed at all.

I was thinking of code that does some kind of buffer processing:

If integer registers, then the data size has increased from 16 via 32 to
64 bits while I've been programming, and from 32 to 64 while caches have
been in use.

On many of these algorithms, some kind of SIMD can be employed, and that
means a further step to 128 bit accesses.

The codes that are fixed at a given access unit size probably depend
much less on arbitrary alignment accesses, because they use arrays or
structs generated by some compiler.

If we're talking about packet/protocol processing, then SIMD can again
help in locating/extracting data items & boundaries.

Do you have an example of non-simd capable algorithms that also need
misaligned accesses?

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Terje Mathisen
Guest





Posted: Thu Jan 20, 2005 2:32 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

MitchAlsup@aol.com wrote:

Quote:
"I don't follow this. If a segment offset wraps around, it is the
same as linear address wrap around which is the same as any other
page straddle. I see only 2 pages and therefore 2 cache lines
being touched."

Intel: Pentium Pro Family Developer's Manual: Volume 3 Operating System
Writer's Manual Copyright 1996, page 3-9 order number 242692

Quote "If the granularity flag is clear, the segment size can range
from 1 Byte to 1 MByte in byte increments" Unquote

Yes, I think I wrote that segments could end at any given byte limit.
Quote:

So, the hardware has to be capable of splitting a cache line with
address wrap. As I said, this is Yech material and this never happens
in actual practice, except for the test vector writers. All of the OSs
(that I know about) used the large granules, thereby restricting the
problem to 2 memory faults/address.

There is still no wraparound!

Or at least not in the way I understand the word:

Wraparound is what I see if I do a 16-bit access at address 65535 in a
16-bit/64 KB segment: This has to return one byte from the end of the
segment, and the other from address zero.

However, this _only_ occurs because the segment ends at the end of the
addressing range.

It is identical to reading 16 or 32 bits from address -1 or -2 in a 4GB
segment/addressing range.
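
To make that definition of wraparound concrete, here is a small C model. It is only a sketch of the behaviour described above, assuming a full 64 KB segment and little-endian byte order; load16_wrap is a hypothetical name, and the code makes no claim about what any particular CPU does for such an access.

```c
#include <stdint.h>

/* Hypothetical model: seg points to a full 64 KB segment (65536 bytes).
   The offset arithmetic is done in 16 bits, so 65535 + 1 wraps to 0. */
uint16_t load16_wrap(const uint8_t *seg, uint16_t offset) {
    uint16_t next = (uint16_t)(offset + 1);
    /* Low byte from the requested offset, high byte from the next one. */
    return (uint16_t)(seg[offset] | (seg[next] << 8));
}
```

With offset 65535 the low byte comes from the last byte of the segment and the high byte from offset zero, which only happens because the segment ends exactly where the 16-bit offset range ends.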

Right?

Or are you actually trying to say that it is possible to define a
segment of (say) 33 bytes, so that when you do a 4-byte access from
address 32, you get back the last byte in this segment, plus three bytes
from offset zero?

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Greg Lindahl
Guest





Posted: Thu Jan 20, 2005 5:47 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

In article <csmj5a$frf$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

Quote:
Is this a statement about your personal codes, or do you really think
this generalization applies to everyone? Lots of codes can't use SIMD
instructions, so their load unit size hasn't changed at all.

I was thinking of code that does some kind of buffer processing:

Well, in SPECint2000, if I remember correctly only 1 benchmark gets
any SIMD boost, and that's for vectorizing 32-bit integers. So bzip2
and gzip aren't getting any SIMD, and that sounds like what you mean
by "buffer processing".

Quote:
If integer registers, then the data size has increased from 16 via 32 to
64 bits while I've been programming, and from 32 to 64 while caches have
been in use.

Indeed, that is the case. But that doesn't change the size of a
character or floating point values. So the breadth of your
generalization would be fairly limited.

Quote:
If we're talking about packet/protocol processing, then SIMD can again
help in locating/extracting data items & boundaries.

Can you show any examples?

Quote:
Do you have an example of non-simd capable algorithms that also need
misaligned accesses?

Processing protocol headers, e.g. IP, where network-order
multi-octet values typically end up on the wrong boundary? But that
wasn't what this sub-thread was talking about.

-- greg
Terje Mathisen
Guest





Posted: Thu Jan 20, 2005 7:59 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

Greg Lindahl wrote:
Quote:
In article <csmj5a$frf$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:


Is this a statement about your personal codes, or do you really think
this generalization applies to everyone? Lots of codes can't use SIMD
instructions, so their load unit size hasn't changed at all.

I was thinking of code that does some kind of buffer processing:

Well, in SPECint2000, if I remember correctly only 1 benchmark gets
any SIMD boost, and that's for vectorizing 32-bit integers. So bzip2
and gzip aren't getting any SIMD, and that sounds like what you mean
by "buffer processing".

OK, we're talking around each other here!

You're mostly interested in what compilers can or could conceivably do
with existing code, while I look at using these capabilities to enable
handwritten asm code to really outperform said compiler. :-)

Quote:
If integer registers, then the data size has increased from 16 via 32 to
64 bits while I've been programming, and from 32 to 64 while caches have
been in use.

Indeed, that is the case. But that doesn't change the size of a
character or floating point values. So the breadth of your
generalization would be fairly limited.

Indeed it doesn't change the underlying data items much, if at all.
However, as long as I have multiple such items in a row, SIMD is a
natural way to retrieve them.
Quote:

If we're talking about packet/protocol processing, then SIMD can again
help in locating/extracting data items & boundaries.

Can you show any examples?

Do you have an example of non-simd capable algorithms that also need
misaligned accesses?

Processing protocol headers, e.g. IP, where network-order
multi-octet values typically end up on the wrong boundary? But that
wasn't what this sub-thread was talking about.

OK, let's look at some kind of protocol processing:

Assume you can control network buffer allocations, so that the start of
a packet will end up at fixed 16-byte alignment, OK?

To process the first 32 bytes of such a header on a PPC, I would use
AltiVec to load these bytes into two 16-byte registers, then use the
in-register lookup opcode (permute?) to extract & align the items I need.

At this point I've got everything set up to either go back via a ram
buffer, or directly to integer registers, with no packing/unpacking,
endian-ness or other worries remaining.

In fact, I can set up the permute vectors either during initialization,
or as part of protocol negotiation, so that the actual processing
becomes branchless even if the protocol contains variable parts, like a
peer-negotiated endianness.
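
As a rough illustration of that idea, the following portable C sketch models the permute in software. Everything in it is an assumption made for the example: the 32-bit field at byte offset 14 (picked so that it straddles the two 16-byte registers), and the names permute32, build_index and extract_field. A real AltiVec version would use the hardware permute instead of the loop.

```c
#include <stdint.h>
#include <string.h>

/* Portable model of a two-source byte permute: out[i] = src[idx[i] & 31],
   where src is the 32-byte concatenation of two 16-byte registers. */
static void permute32(const uint8_t src[32], const uint8_t idx[16],
                      uint8_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = src[idx[i] & 0x1f];
}

/* Hypothetical header layout: a 32-bit field at byte offset 14, so it
   occupies bytes 14..17 and crosses the boundary between the registers. */
enum { FIELD_OFFSET = 14 };

/* Build the index vector once, e.g. during protocol negotiation.
   wire_big_endian != 0 means the field is big-endian on the wire.
   Either way the field lands in out[0..3] in little-endian order. */
void build_index(uint8_t idx[16], int wire_big_endian) {
    memset(idx, 0, 16);
    for (int i = 0; i < 4; ++i)
        idx[i] = (uint8_t)(FIELD_OFFSET + (wire_big_endian ? 3 - i : i));
}

/* Per-packet work: one permute and no branch on the endianness. */
uint32_t extract_field(const uint8_t header[32], const uint8_t idx[16]) {
    uint8_t out[16];
    permute32(header, idx, out);
    return (uint32_t)out[0] | (uint32_t)out[1] << 8 |
           (uint32_t)out[2] << 16 | (uint32_t)out[3] << 24;
}
```

The only branch on endianness is in build_index, which runs once; extract_field does the same work for every packet regardless of which byte order was negotiated.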

Another option is to use some part of the header to determine the
presence of optional parts of the packet, by using that initial part as
the lookup index to select the proper permute vector to load.
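
A sketch of that table-driven selection, again as a software model in C. The packet format is invented for the example: the low two bits of header byte 0 are assumed to give the number of optional 4-byte words that sit between byte 3 and an 8-byte body. BODY_INDEX, permute32 and extract_body are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Portable model of a two-source byte permute over a 32-byte buffer. */
static void permute32(const uint8_t src[32], const uint8_t idx[16],
                      uint8_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = src[idx[i] & 0x1f];
}

/* One precomputed index vector per possible option count (0..3).
   Only the first 8 entries matter; the rest default to zero. */
static const uint8_t BODY_INDEX[4][16] = {
    {  4,  5,  6,  7,  8,  9, 10, 11 },
    {  8,  9, 10, 11, 12, 13, 14, 15 },
    { 12, 13, 14, 15, 16, 17, 18, 19 },
    { 16, 17, 18, 19, 20, 21, 22, 23 },
};

/* The header field selects the index vector, so locating the body needs
   a table lookup and one permute rather than a chain of branches. */
void extract_body(const uint8_t header[32], uint8_t body[8]) {
    uint8_t out[16];
    permute32(header, BODY_INDEX[header[0] & 3], out);
    memcpy(body, out, 8);
}
```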

Doesn't this sound useful, even if you cannot do it from portable C?

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Greg Lindahl
Guest





Posted: Thu Jan 20, 2005 1:06 pm    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

In article <csnn2e$376$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

Quote:
You're mostly interested in what compilers can or could conceivably do
with existing code, while I look at using these capabilities to enable
handwritten asm code to really outperform said compiler. :-)

No, I'm interested in both. If you'd like to present a solution to
gzip or bzip2 which manages to use SIMD to great benefit, you can be
sure I'd look at teaching a compiler to do that trick...

Quote:
Assume you can control network buffer allocations, so that the start of
a packet will end up at fixed 16-byte alignment, OK?

To process the first 32 bytes of such a header on a PPC, I would use
AltiVec to load these bytes into two 16-byte registers, then use the
in-register lookup opcode (permute?) to extract & align the items I need.

Right, except you have some fields which overlap the two registers.
Not knowing the details of Altivec, I still doubt it has such an
instruction; without one, you have to do several shifts, masks, and an
and.

Quote:
Doesn't this sound useful, even if you cannot do it from portable C?

Lots of things can sound useful, but I have no idea from the sound if
they're fast.

But we're getting away from the main point I was aiming at making,
which concerned your generalization about the width of things growing.

-- greg
Terje Mathisen
Guest





Posted: Thu Jan 20, 2005 5:28 pm    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

Greg Lindahl wrote:

Quote:
In article <csnn2e$376$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:


You're mostly interested in what compilers can or could conceivably do
with existing code, while I look at using these capabilities to enable
handwritten asm code to really outperform said compiler. :-)


No, I'm interested in both. If you'd like to present a solution to
gzip or bzip2 which manages to use SIMD to great benefit, you can be
sure I'd look at teaching a compiler to do that trick...

OK, for zip I'd look at first doing the token extraction: This is a
multi-level table lookup problem, which starts by packing bytes together
in native order, then extracting a suitably long bitstring which can be
used as an index, right?

After first getting (say) 128 bits into a SIMD register, this register
can deliver a bunch of tokens with just a single shift right and mask
between them, OK?
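
A scaled-down sketch of that shift-and-mask loop, using a 64-bit integer in place of a 128-bit SIMD register since portable C has no 128-bit type. It assumes tokens are packed starting at the low-order bit and that the caller already knows each token's width; in a real decoder the width would come from the table lookup. The names bitbuf, bitbuf_load and bitbuf_take are made up for the example.

```c
#include <stdint.h>

typedef struct {
    uint64_t bits;   /* unread bits, next token at the low-order end */
    unsigned count;  /* number of valid bits remaining */
} bitbuf;

/* Pack 8 bytes together in little-endian order, independent of the host. */
bitbuf bitbuf_load(const uint8_t *p) {
    bitbuf b = { 0, 64 };
    for (int i = 0; i < 8; ++i)
        b.bits |= (uint64_t)p[i] << (8 * i);
    return b;
}

/* Take the next token of `width` bits: one mask, then one shift right.
   Assumes 1 <= width <= 32 and width <= b->count. */
uint32_t bitbuf_take(bitbuf *b, unsigned width) {
    uint32_t token = (uint32_t)(b->bits & ((UINT64_C(1) << width) - 1));
    b->bits >>= width;
    b->count -= width;
    return token;
}
```

Each returned token can be used directly as an index into the next-level table; the buffer only needs refilling when count runs low.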

Quote:
Assume you can control network buffer allocations, so that the start of
a packet will end up at fixed 16-byte alignment, OK?

To process the first 32 bytes of such a header on a PPC, I would use
AltiVec to load these bytes into two 16-byte registers, then use the
in-register lookup opcode (permute?) to extract & align the items I need.

Right, except you have some fields which overlap the two registers.

No problem!

Quote:
Not knowing the details of Altivec, I still doubt it has such an
instruction; without one, you have to do several shifts, masks, and an
and.

AFAIK Altivec does indeed allow you to treat the two input registers as
a single 32-byte buffer, from which you can perform arbitrary byte (or
even nybble?) extraction.

I.e. it replaces all the shifts, masks and merge operations, while
simultaneously handling any endianness issues transparently.
Quote:

Doesn't this sound useful, even if you cannot do it from portable C?

Lots of things can sound useful, but I have no idea from the sound if
they're fast.

I believe permute is single cycle issue and one or two-cycle latency,
with no need to worry about cache misses or any of that stuff. :-)

In fact, it might even be a good idea to stream data directly into such
registers, without polluting any expensive cache memory: It will only be
used once anyway.
Quote:

But we're getting away from the main point I was aiming at making,
which concerned your generalization about the width of things growing.

Oops, in that case we're in violent agreement: I was trying to say that
even with SIMD, on average the number of accesses per cache line is
increasing, which makes the more or less free misalignment handling
within such a cache line even more valuable.
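
One way to put a number on the straddling part of this argument is the small helper below. It rests on a simplifying assumption that does not come from the thread: start addresses are uniformly distributed over all byte offsets within a line. straddle_fraction is a hypothetical name.

```c
/* Fraction of byte addresses, taken uniformly at random, for which an access
   of `size` bytes crosses a boundary between `line`-byte cache lines.
   An access starting at offset o within a line crosses the boundary when
   o + size - 1 >= line, which holds for size - 1 of the `line` offsets.
   Assumes 1 <= size <= line. */
double straddle_fraction(unsigned size, unsigned line) {
    return (double)(size - 1) / (double)line;
}
```

Under that assumption, a 4-byte access with 32-byte lines straddles 3/32 of the time and a 16-byte access with 128-byte lines 15/128 of the time, so the fraction stays roughly constant when both sizes grow together. The line sizes here are illustrative, not measurements of any particular machine.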

Sorry if I wasn't clear enough. :-(

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
hobold
Guest





Posted: Thu Jan 20, 2005 7:42 pm    Post subject: AltiVec vector permute (was: Re: Unaligned accesses) Reply with quote

Terje Mathisen wrote:
Quote:
Greg Lindahl wrote:

In article <csnn2e$376$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
[...]
Not knowing the details of Altivec, I still doubt it has such an
instruction; without one, you have to do several shifts, masks, and an
and.

AFAIK Altivec does indeed allow you to treat the two input registers as
a single 32-byte buffer, from which you can perform arbitrary byte (or
even nybble?) extraction.

All AltiVec implementations so far have a full 32x16 byte crossbar. It
is accessed with a triadic operation ("vector permute") that takes a
'left' and a 'right' input vector of 16 bytes each, and an index vector
of 16 bytes. The two input vectors are conceptually concatenated to
form a 32 byte array and the index vector's bytes (the low five bits,
to be precise) specify which source byte is to be placed into the
corresponding result byte. I.e. Result[i] := Source[Index[i]];

All AltiVec implementations fully pipeline the crossbar (with an
effective latency between 1 and 3 cycles, depending on chip model (and
the surrounding instructions in case of the PPC970)) and use it in
parallel with the SIMD ALU. So sustainable throughput is one permute
per clock plus another vector ALU instruction plus a vector load/store
(not on the original MPC7400/7410 which is a dual issue machine).

Quote:
Doesn't this sound useful, even if you cannot do it from portable C?

Lots of things can sound useful, but I have no idea from the sound if
they're fast.

I believe permute is single cycle issue and one or two-cycle latency,
with no need to worry about cache misses or any of that stuff. :-)

Vector permute is the single most powerful instruction primitive I have
ever encountered. You can rearrange bytes across several vectors, or
you can do quick lookups in invariant 32 byte tables. Population count?
Look it up! Bit reverse? Look it up! Arbitrary bit permutations within
a byte? Look it up! Count leading zeros in a byte? Look it up!

Did I mention I like it? :-) Programmers with too much time on their
hands can do very interesting things with vector permute. The highest
performing AltiVec routines usually consist of 50% permutes and 50% ALU
operations. Permute is much more than just memory alignment support.

<advertisement> Try it once, the cost of entry has fallen to an all
time low of $500 including a full suite of developer tools!
</advertisement> :-)

Holger
Terje Mathisen
Guest





Posted: Thu Jan 20, 2005 8:15 pm    Post subject: Re: AltiVec vector permute Reply with quote

hobold wrote:
Quote:
Terje Mathisen wrote:
All AltiVec implementations so far have a full 32x16 byte crossbar. It
is accessed with a triadic operation ("vector permute") that takes a
'left' and a 'right' input vector of 16 bytes each, and an index vector
of 16 bytes. The two input vectors are conceptually concatenated to
form a 32 byte array and the index vector's bytes (the low five bits,
to be precise) specify which source byte is to be placed into the
corresponding result byte. I.e. Result[i] := Source[Index[i]];

All AltiVec implementations fully pipeline the crossbar (with an
effective latency between 1 and 3 cycles, depending on chip model (and
the surrounding instructions in case of the PPC970)) and use it in
parallel with the SIMD ALU. So sustainable throughput is one permute
per clock plus another vector ALU instruction plus a vector load/store
(not on the original MPC7400/7410 which is a dual issue machine).

[snip]
Quote:

I believe permute is single cycle issue and one or two-cycle latency,
with no need to worry about cache misses or any of that stuff. :-)

Vector permute is the single most powerful instruction primitive I have
ever encountered. You can rearrange bytes across several vectors, or
you can do quick lookups in invariant 32 byte tables. Population count?
Look it up! Bit reverse? Look it up! Arbitrary bit permutations within
a byte? Look it up! Count leading zeros in a byte? Look it up!

It would seem like the 5-bit (effectively 4 when working with powers of
two) index size would force you to split any bytes to be checked into
two nybbles, and then merge the results afterwards, but I'm sure there's
a way to use permute to do this as well, right?

Let's see...

Popcount() of a 64-bit item could be handled by first copying one half
of a register into the other half of another (using permute of course!),
masking to nybbles and merging, then looking up all 16 nybbles with a
single permute, before a horizontal add which AltiVec probably has?
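
The same nybble-lookup scheme can be modelled in scalar C, where each loop iteration stands in for one of the 16 byte lanes of a single permute and the running total stands in for the final summing step. This is a sketch of the technique rather than AltiVec code; NIBBLE_POPCOUNT and popcount64 are names chosen for the example.

```c
#include <stdint.h>

/* 16-entry table: number of set bits in each 4-bit value. */
static const uint8_t NIBBLE_POPCOUNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Population count of a 64-bit value via 16 nybble lookups. */
unsigned popcount64(uint64_t x) {
    unsigned total = 0;
    for (int i = 0; i < 16; ++i) {          /* one iteration per nybble */
        total += NIBBLE_POPCOUNT[x & 0xf];  /* the table lookup */
        x >>= 4;
    }
    return total;
}
```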

Bit reverse is very similar, except that you need a full 32-byte lookup
table and a flag bit on those nybbles which originated in the high half,
so that the result will be ready to merge.

Bit permute is the same as bit reverse, just another lookup table.
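
A scalar C sketch of the 32-entry table with the flag bit, shown for a single byte. Index bit 4 (0x10) is the flag marking a nybble that came from the high half, so each table entry already holds its reversed bits in the half of the byte where they belong and the two lookups can simply be ORed together. REV and reverse_bits8 are hypothetical names.

```c
#include <stdint.h>

static const uint8_t REV[32] = {
    /* 0..15: nybble came from the low half, reversed bits go to the high half */
    0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0,
    0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
    /* 16..31 (flag set): nybble came from the high half, result goes low */
    0x00, 0x08, 0x04, 0x0C, 0x02, 0x0A, 0x06, 0x0E,
    0x01, 0x09, 0x05, 0x0D, 0x03, 0x0B, 0x07, 0x0F
};

/* Reverse the bit order of one byte with two lookups and a merge. */
uint8_t reverse_bits8(uint8_t b) {
    uint8_t lo = REV[b & 0x0f];         /* low nybble, flag clear */
    uint8_t hi = REV[0x10 | (b >> 4)];  /* high nybble, flag set */
    return (uint8_t)(lo | hi);
}
```

Any other bit permutation that moves each nybble's bits to fixed positions within the byte works the same way with different table contents.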

Counting leading zeros would seem to be slightly harder, since you need
some way to merge two or more results. This can probably be handled with
a vector compare against zero, and a mask which is generated from the
first non-zero item and down.

Quote:
Did I mention I like it? :-) Programmers with too much time on their
hands can do very interesting things with vector permute. The highest
performing AltiVec routines usually consist of 50% permutes and 50% ALU
operations. Permute is much more than just memory alignment support.

So basically you're confirming what I've said, i.e. having a
programmable crossbar like this is much more powerful than just having
mis-alignment support. :-)

Quote:
<advertisement> Try it once, the cost of entry has fallen to an all
time low of $500 including a full suite of developer tools!
</advertisement> :-)

I've been drooling over this (as a quick comp.arch search will confirm)
since the first time I heard about the opcode.

I might resort to downloading the architecture/instruction set manual
and write some paper-only code. :-(

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Christian Bau
Guest





Posted: Fri Jan 21, 2005 3:44 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

In article <41eeffad@news.meer.net>, lindahl@pbm.com (Greg Lindahl)
wrote:

Quote:
In article <csmj5a$frf$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

Is this a statement about your personal codes, or do you really think
this generalization applies to everyone? Lots of codes can't use SIMD
instructions, so their load unit size hasn't changed at all.

I was thinking of code that does some kind of buffer processing:

Well, in SPECint2000, if I remember correctly only 1 benchmark gets
any SIMD boost, and that's for vectorizing 32-bit integers. So bzip2
and gzip aren't getting any SIMD, and that sounds like what you mean
by "buffer processing".

If integer registers, then the data size has increased from 16 via 32 to
64 bits while I've been programming, and from 32 to 64 while caches have
been in use.

Indeed, that is the case. But that doesn't change the size of a
character or floating point values. So the breadth of your
generalization would be fairly limited.

If we're talking about packet/protocol processing, then SIMD can again
help in locating/extracting data items & boundaries.

Can you show any examples?

On an Altivec processor, it would be quite trivial to read a data
structure sixteen bytes at a time, use vperm instructions to exchange
bytes and add or remove padding as required, and store the result in a
format that is compatible with C data structures for the compiler used.
Christian Bau
Guest





Posted: Fri Jan 21, 2005 3:53 am    Post subject: Re: Unaligned accesses (was Re: RISC vs. CISC design princip Reply with quote

In article <41ef6677$1@news.meer.net>, lindahl@pbm.com (Greg Lindahl)
wrote:

Quote:
Right, except you have some fields which overlap the two registers.
Not knowing the details of Altivec, I still doubt it has such an
instruction; without one, you have to do several shifts, masks, and an
and.

Definition of vperm in C:

typedef struct { unsigned char b[16]; } vec_byte16;

/* Each result byte is selected from the 32 bytes of src1 followed by src2. */
vec_byte16 vperm(vec_byte16 src1, vec_byte16 src2, vec_byte16 perm) {
    int i;
    vec_byte16 tmp;

    for (i = 0; i < 16; ++i) {
        unsigned char index = perm.b[i];
        /* Bit 4 picks the source vector; the low four bits pick the byte. */
        tmp.b[i] = (index & 0x10) ? src2.b[index & 0x0f]
                                  : src1.b[index & 0x0f];
    }
    return tmp;
}

So you are fine unless you need three input vectors to determine sixteen
bytes of the result; that would happen if the input format has lots of
padding.
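
For the three-input case, one possible workaround is to chain two permutes, sketched below with the same C model of vperm. The helper vperm3 and its sel parameter (an index 0..47 into the three sources laid end to end) are inventions for this example, and a real routine would precompute the two index vectors rather than build them on every call.

```c
typedef struct { unsigned char b[16]; } vec_byte16;

/* C model of the two-source permute, as defined above. */
static vec_byte16 vperm(vec_byte16 src1, vec_byte16 src2, vec_byte16 perm) {
    vec_byte16 tmp;
    for (int i = 0; i < 16; ++i) {
        unsigned char index = perm.b[i];
        tmp.b[i] = (index & 0x10) ? src2.b[index & 0x0f]
                                  : src1.b[index & 0x0f];
    }
    return tmp;
}

/* Gather 16 result bytes from three sources with two chained permutes.
   sel[i] is an index 0..47 into src1 followed by src2 followed by src3. */
vec_byte16 vperm3(vec_byte16 src1, vec_byte16 src2, vec_byte16 src3,
                  const unsigned char sel[16]) {
    vec_byte16 p1, p2;
    for (int i = 0; i < 16; ++i) {
        if (sel[i] < 32) {
            p1.b[i] = sel[i];            /* first pass picks from src1/src2 */
            p2.b[i] = (unsigned char)i;  /* second pass keeps that byte */
        } else {
            p1.b[i] = 0;                 /* don't care, overwritten below */
            p2.b[i] = (unsigned char)(0x10 | (sel[i] - 32)); /* from src3 */
        }
    }
    return vperm(vperm(src1, src2, p1), src3, p2);
}
```

The cost is one extra permute per sixteen result bytes, with the first result acting as the left input of the second.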
 
All times are GMT