How does this make you feel?
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
Steve
Guest
Posted: Tue Jun 28, 2005 8:15 am    Post subject: How does this make you feel?

Way back in 2000 or so, I did a little prototyping for an application I
was considering that led me to conclude there are certain CPU features
that might be worth adding to existing CPU designs (I'm thinking of
commodity architectures, mostly x86 as that is the kind of system
I've had access to in recent years) that would be potentially
interesting to use. At the outset I must admit that I have no real
technical knowledge of CPU or ASIC design, nor do I have much in the
way of assembler experience. I am influenced in this to a minor degree
by my exposure to the Amiga custom chipset and the thorough
documentation that was contained in the _Amiga Hardware Reference
Manual_, first edition. As I know nearly nothing about what I am going
to describe, and because I have made certain assumptions from (probably
incorrect) inferences, I expect that my discussion here will contain
many silly errors. Plus, this is written quickly as I am rather busy,
and unfortunately have little time to devote to matters related to
computer science. Finally, my expertise in programming is largely
confined to Linux and a language which shall go unnamed but which
starts with the letter 'C'.

To start, then, I have noticed that CPU registers are rather simple.
Too simple, it seems. It struck me one day to imagine the utility of a
register that could describe a range of memory, rather than merely a
specific memory location. It is well known that there are memory
access penalties in some architectures for addresses that do not fall
on word boundaries. With the exception of character pointers and
sixteen-bit integers, in usual applications most pointer accesses are
on so-called word boundaries. On x86, this is usually a thirty-two bit
size, but now we are seeing sixty-four bit machines hitting the retail
market. For the sake of this speculative discussion I am going to
entirely ignore odd addresses and their associated problems.

So, the vast majority of address decoding in the CPU occurs in
situations where the least significant two or three bits are zero. When
we are considering page tables in the MMU, such as with 4K pages on
x86, the resolution of an address falls even further with the least
significant twelve bits of a page table entry being all zero. This seems
to be a bit of a waste, as some will realise when they think of the
occasional programming and memory benefit to be had from packing bits
when space is at an extreme premium.
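
As a small illustration of the point about unused low-order bits, here is a minimal Python sketch. The function names are made up for this example; it only shows how many address bits are always zero for a given power-of-two alignment.

```python
def low_zero_bits(alignment: int) -> int:
    """Number of low-order address bits that are always zero for a
    power-of-two alignment (4-byte words, 4K pages, and so on)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return alignment.bit_length() - 1


def is_aligned(address: int, alignment: int) -> bool:
    """True when the low-order bits of the address are all zero."""
    return address & (alignment - 1) == 0


for name, alignment in (("32-bit word", 4), ("64-bit word", 8), ("4K page", 4096)):
    print(f"{name}: {low_zero_bits(alignment)} low bits are zero")
```

Those are the bits the rest of this post proposes to put to some other use.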

We all know that memcpy(3) is one of the most common idioms of
practical programming. Despite the efficiency of on-chip cache, etc.,
it seems a waste in even this simple instance to require a compiler to
include a relatively lengthy piece of code just to move one section of
memory to another. What if there were registers that could be made to
describe arbitrary ranges of memory, and which in some way were largely
integrated with the instruction set of the CPU?

Imagine that CPU registers are modified in such a way as to include a
mask that described their range:

On x86, you might have the Ax register tied to the 'Axm' mask. This
hypothetical Axm register would, in effect, describe an arbitrary
'page' size that modified the way the CPU interpreted addresses
contained in the Ax register. In effect, you would be chopping up the
Ax register into two pieces: one component of the register (probably the
most significant bits) would describe the address, and the remaining
bits would identify the address range, or span.

Assume for the sake of argument that the 32 bit hypothetical Axm
register is divided equally into two 16-bit partitions. First is the
page size, specified in bytes, also known as the address multiplier.
Second is the number of bits of the Ax register to be allocated to
range calculations. In the normal case, the Axm register might contain
all thirty-two bits zero, indicating that the register should behave as
normal, the way one would expect in today's common architectures;
giving backwards compatibility. But let's assume that a compiler or
assembler programmer has found a use for this new functionality and
wishes to do a memcpy(3) across a range of memory. The Axm register
mask might then be loaded with a value like 0x02000008. In this case,
the interpretation of the Ax register is modified as follows:

0x0200 equates to a 512-byte page size.
0x0008 equates to a range size of 8 bits.
Thus, the Ax register is divided:

Ax: |xxxxxxxxxxxxxxxxxxxxxxxx|xxxxxxxx|
    |        address         | range  |

The address portion of the Ax register is unconditionally multiplied by
its 'page' size, and then covers the range specified by the eight
remaining bits, again perhaps multiplied by the local page size.
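
To make the decoding concrete, here is a rough Python sketch of how such a register pair might be interpreted. Everything in it is an assumption layered on the description above: the function name `decode` is invented, the upper 16 bits of Axm are read literally as a page size in bytes (so 0x0200 is 512), the range field is taken to be multiplied by the page size, and an all-zero mask is treated as "plain address, no range".

```python
def decode(ax: int, axm: int):
    """Return (start, length) in bytes for the hypothetical Ax/Axm pair."""
    ax &= 0xFFFFFFFF
    axm &= 0xFFFFFFFF
    if axm == 0:
        # Backwards-compatible case: Ax is an ordinary address, no range.
        return ax, 0
    page_size = axm >> 16        # upper 16 bits: page size in bytes
    range_bits = axm & 0xFFFF    # lower 16 bits: width of the range field
    if page_size == 0 or range_bits > 32:
        # An invalid mask; real hardware would presumably trap here.
        raise ValueError("invalid Axm mask")
    address = ax >> range_bits                  # most significant bits
    span = ax & ((1 << range_bits) - 1)         # least significant bits
    return address * page_size, span * page_size


start, length = decode(0x00001204, 0x02000008)
print(hex(start), length)
```

With a 24-bit address field and a 512-byte multiplier the start can exceed 32 bits, which is one of the out-of-range cases that would need a trap.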

In an instruction like mmove, the destination register would, of
course, be any legal address, and would denote the starting address for
the copy. I imagine that this scheme would easily apply to comparison
instructions, and with the added complication in a hypothetical future
architecture allowing three instruction operands, to AND, OR, XOR, NOR,
shift and rotate functions, and so forth. I am not going to consider
the possible validity of this scheme applied to integer math
operations, save to say that bright chip designers might fashion a
schema that meshed with such functions, as well as with other less
common assembler instruction idioms.

Clearly, in this hypothetical arrangement there are Axm mask values
which would be invalid, and as well there would be Ax register values
which would fall outside the capacity of the address bus. Presumably,
extensions to the exception or trap mechanisms would allow for input
validation and sane address decoding.

I realise that the implementation of this kind of approach to address
coding and decoding is a bit of a devilish problem. Instruction sets,
their features, coding, and modalities are tightly integrated at a
particular level of abstraction. I understand, perhaps incorrectly,
that the approach to contemporary CPU design is something of a modular
affair. I imagine that the on-chip address decoder(s) are something of
a self-contained unit that is integrated with the cache-control logic
and the MMU, and the actual address bus. The adder is something of a
standalone functional unit and only ever sees data values that have
been acquired by a whole lot of address logic pre-processing. In terms
of an instruction like mmove, I suggest that the logic that is devoted
now to the actual job of transferring data from a register or EA to
another register or EA, _might_ be extended to cover iterations over a
range without an unreasonable expenditure of chip real-estate.
Similarly, for boolean algebra primitives, this supposition could hold
true. We now have dual-core CPUs as evidence that there is no real
shortage of on-chip real estate.

Considering a technology like hyperthreading, there is a big question
of implementation. There are two broad approaches to instruction set
enhancements. Recently with Intel we have seen MMX, SSE, and SSE2
instruction set additions that appear to be bolted on to the original
CPU architecture as if they were adding a co-processor on-chip. The
other option, and the one I favour in my wild fantasies, is a tightly
coupled extension of an existing instruction set. With something like
hyperthreading, an application that is copying memory simply blocks
until it is done -- resulting in something of an atomic operation --
while other 'threads' might continue to execute, potentially contending
for the memory bus and basic functional CPU components in a manner not
entirely unlike what was seen from the Amiga's blitter and copper.
However, I realise that chances are that the practical constraints of
CPU design, and the way that I assume VHDL is used, will suggest an
entirely new CPU product.

To go further into the wild blue yonder of ignorant speculation, I would
suggest that this kind of address and register scheme might even apply
to the general architecture of memory management as we know it. If
page tables were described similarly to registers as above, we could
have relatively arbitrary page sizes, which would be very interesting
to virtual memory systems. If practical.

This is all pretty much off the top of my head, and I am limited by my
lack of expertise in this area. I have not studied architectures other
than those offered in contemporary retail systems. I have no idea what
goes on in the world of big iron, nor what is being done with
special-purpose engines as may be found in high-end video display
processors. I wouldn't recognise a vector engine if it bit me in the
ass, and I couldn't even imagine what the PDP people created. This is
just a very sketchy outline of an approach to address decoding and
register architecture that I have not seen considered in anything I
have read to date.

I am very interested to hear what others think of this rough scheme,
and whatever else, to learn why there is apparently nothing at all
similar appearing in modern systems accessible to the layman.


Regards,

Steve

JJ
Guest
Posted: Tue Jun 28, 2005 2:11 pm    Post subject: Re: How does this make you feel?

Go back to school in some fashion, brick, web, book etc. There's lots of
free web reading if you look.

Use google groups search to search for previous "really great ideas"
but use your own terms.

For your project work learn FPGAs, Verilog or VHDL, plus compilers,
OSes etc and implement what you learn about comp arch in FPGA in a
small scale. It will probably look like a mips or dlx or microblaze. To
some extent you can upgrade those architectures with your own ideas.
Forget about hacking x86, it's already the most hacked up arch that ever
existed.

Should only take a few years out of your life, seriously.

johnjakson at usa dot com

Edward A. Feustel
Guest
Posted: Tue Jun 28, 2005 4:15 pm    Post subject: Re: How does this make you feel?

"Steve" <steve49152@yahoo.ca> wrote in message
news:1119928828.746326.94220@f14g2000cwb.googlegroups.com...
..
Quote:

To start, then, I have noticed that CPU registers are rather simple.
Too simple, it seems. It struck me one day to imagine the utility of a
register that could describe a range of memory, rather than merely a
specific memory location.

Some architectures have implemented this idea.

The "register" data structure is called a descriptor. Usually there is a
base and length.
A segment register on the ix86 is a base and the length is contained in the
page table, so the length is in terms of pages.
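
A base-and-length descriptor can be sketched in a few lines of Python. This is only an illustration: the class name `Descriptor` and its `translate` method are invented here and do not correspond to any particular architecture's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    base: int     # first address covered by the descriptor
    length: int   # number of bytes covered

    def translate(self, offset: int) -> int:
        """Turn an offset into an address, refusing anything outside
        the described range (hardware would raise a fault instead)."""
        if not 0 <= offset < self.length:
            raise IndexError("offset outside descriptor")
        return self.base + offset


buf = Descriptor(base=0x4000, length=0x100)
print(hex(buf.translate(0x10)))
```

The bounds check on every access is the part a bare C pointer does not carry with it.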

Unfortunately, this is not a basic type in C and UNIX does not deal with it
very well.
Ed




Steve
Guest
Posted: Wed Jun 29, 2005 6:02 am    Post subject: Re: How does this make you feel?

Edward A. Feustel wrote:
Quote:
"Steve" <steve49152@yahoo.ca> wrote in message
news:1119928828.746326.94220@f14g2000cwb.googlegroups.com...

To start, then, I have noticed that CPU registers are rather simple.
Too simple, it seems. It struck me one day to imagine the utility of a
register that could describe a range of memory, rather than merely a
specific memory location.
Some architectures have implemented this idea.

That's nice to know. Now if only I knew what those architectures
were...

Quote:
The "register" data structure is called a descriptor. Usually there is a
base and length.

Semantics.

Quote:
A segment register on the ix86 is a base and the length is contained in the
page table, so the length is in terms of pages.

That's not really the same thing, and anyways, who wants to use a
segmented memory architecture machine if it is like the x86 family.
When I moved from the 68000 to the 80386 (remembering vaguely the 80186
from high school) I was mortified to read about small memory model
targets, the segment registers, and the rest.

One of my points is that the utility of the kind of register/descriptor
I described (reinvented) seems obvious and it is shocking to think that
nobody in the hardware logic world thought it would be appropriate for
desktops or workstations (now I'm assuming that mips and sparc don't
have anything similar).

Quote:
Unfortunately, this is not a basic type in C and UNIX does not deal with it
very well.

That's UNIX's problem. Linux and FreeBSD are open source, as are any
number of other operating systems. The GNU tools can be hacked, and so
can the compilers and assemblers. I don't see this as a showstopper,
although if someone wants to implement variable sized pages in their
MMU, the VM of any UNIX-like kernel is going to have to be gutted and
written from scratch. And they'll have to do something about POSIX
compliance, too, but that should be easy compared to the rest of it.

Whether anything like it, or any other 'exotic' features, hit desktop
architectures within the next twenty years is a matter of economics
and rash speculation. Personally, I hope that Intel doesn't become the
chip borg and assimilate all potential competitors to their dominance
of the market.


Regards,

Steve

Tom Linden
Guest
Posted: Wed Jun 29, 2005 7:12 am    Post subject: Re: How does this make you feel?

On 28 Jun 2005 18:02:19 -0700, Steve <steve49152@yahoo.ca> wrote:

Quote:
One of my points is that the utility of the kind of register/descriptor
I described (reinvented) seems obvious and it is shocking to think that
nobody in the hardware logic world thought it would be appropriate for
desktops or workstations (now I'm assuming that mips and sparc don't
have anything similar).

While you may be shocked, a number of compilers and OSes have always employed
descriptors, sometimes also known as dope vectors. It is not necessary to
provide this in the hardware; indeed, PL/I has been using this since its
inception. You are led to this rediscovery after using brain-dead languages
like C, provided you have a modicum of insight. Look at the calling
mechanisms in VMS.

Grumble
Guest
Posted: Wed Jun 29, 2005 8:15 am    Post subject: Re: How does this make you feel?

Steve wrote:

Quote:
[...] if someone wants to implement variable sized pages in their MMU

e.g. the Itanium 2 supports 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M,
64M, 256M, 1G, and 4G pages.

http://www.intel.com/design/itanium2/manuals/251110.htm

Eric P.
Guest
Posted: Wed Jun 29, 2005 4:15 pm    Post subject: Re: How does this make you feel?

Steve wrote:
Quote:

Edward A. Feustel wrote:

Unfortunately, this is not a basic type in C and UNIX does not deal with it
very well.

That's UNIX's problem. Linux and FreeBSD are open source, as are any
number of other operating systems.

Well actually as the manufacturer, it would be your problem.
Either no one would port to your ISA, in which case you make no sales,
or they would ignore the hardware feature, in which case it is just
costly baggage like the x86 and x64 segmentation.

Quote:
although if someone wants to implement variable sized pages in their
MMU, the VM of any UNIX-like kernel is going to have to be gutted and
written from scratch.

This has been looked into. The basic idea is to include a valid bits
mask in the TLB address matcher to control how big a page size is.
The idea is to minimize the number of TLB miss loads. The x86 has
dual page size of 4KB and 4MB/2MB. The Alpha also had such a feature
for 8KB and 64KB pages, and I vaguely recall seeing a DEC patent on it.
I don't know if VMS ever used the feature.
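
A rough Python sketch of the masked match follows. The names `TlbEntry` and `lookup` are invented for illustration, and a real TLB compares all entries in parallel in hardware rather than looping; the point is only that each entry's page size decides which address bits take part in the comparison.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TlbEntry:
    virt_base: int   # virtual base address, aligned to page_size
    phys_base: int   # physical base address, aligned to page_size
    page_size: int   # power of two, for example 4 KB or 4 MB

    def match(self, vaddr: int) -> Optional[int]:
        offset_mask = self.page_size - 1
        # Only the bits above the page offset are "valid" for matching.
        if (vaddr & ~offset_mask) != self.virt_base:
            return None
        return self.phys_base | (vaddr & offset_mask)


def lookup(tlb: Iterable[TlbEntry], vaddr: int) -> Optional[int]:
    """Return the physical address, or None on a TLB miss."""
    for entry in tlb:
        paddr = entry.match(vaddr)
        if paddr is not None:
            return paddr
    return None


tlb = [
    TlbEntry(0x00400000, 0x10000000, 4 * 1024 * 1024),  # one 4 MB page
    TlbEntry(0x00001000, 0x00200000, 4 * 1024),         # one 4 KB page
]
print(hex(lookup(tlb, 0x00412345)), lookup(tlb, 0x3000))
```

One large entry here covers what would otherwise need 1024 separate 4 KB entries, which is where the saving in miss loads comes from.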

The problem for the operating system is that it needs to manage
these variable sized pages. Google or search citeseer for 'superpage'.
That means additional work for free pages management to split up
large pages and reaggregate them later (binary buddy probably).
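
For readers who have not met binary buddy, here is a minimal Python sketch of the split and reaggregate steps. The function names and the set-of-tuples free list are assumptions made for this example, not how any particular kernel does it. Blocks are assumed to be power-of-two sized and aligned to their own size.

```python
def buddy_of(addr: int, size: int) -> int:
    """Address of the buddy of a size-aligned block: flip the size bit."""
    return addr ^ size


def split(addr: int, size: int):
    """Split one block into its two half-sized buddies."""
    half = size // 2
    return (addr, half), (addr + half, half)


def merge(addr: int, size: int, free_blocks: set, max_size: int):
    """Free a block, coalescing with free buddies up to max_size.

    free_blocks is a set of (addr, size) tuples and is updated in place.
    Returns the (addr, size) of the block that ends up on the free list.
    """
    while size < max_size:
        buddy = buddy_of(addr, size)
        if (buddy, size) not in free_blocks:
            break
        free_blocks.remove((buddy, size))
        addr = min(addr, buddy)
        size *= 2
    free_blocks.add((addr, size))
    return addr, size


free = {(0x0000, 0x1000), (0x2000, 0x2000)}
print(merge(0x1000, 0x1000, free, max_size=0x400000))
```

Freeing the 4 KB block at 0x1000 first joins its buddy at 0x0 and then the free 8 KB block at 0x2000, leaving a single 16 KB block.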

But the biggest OS overhead is detecting when superpages are
possible and creating them on the fly. The cost of recognizing
a potential superpage and upgrading a set of individual pages
to a superpage, or splitting them back apart again apparently
eats up all the potential savings.

It seems to me the only way this would make practical sense is to
make page size a memory section creation option and add a linkage
option for EXE's to specify it. Once defined at memory section
create it would remain fixed. The OS would also need to adjust
page table management, and this makes the feature non-portable.

Eric

Jeremy Linton
Guest
Posted: Wed Jun 29, 2005 4:15 pm    Post subject: Re: How does this make you feel?

x86 also supports multiple page sizes (4k, 2M, and 4M). No one uses the
2M mode, but both Windows and Linux use mixed page sizes, 4k and 4M. The
4M page use is very restricted.



Grumble wrote:
Quote:
Steve wrote:

[...] if someone wants to implement variable sized pages in their MMU


e.g. the Itanium 2 supports 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M,
64M, 256M, 1G, and 4G pages.

http://www.intel.com/design/itanium2/manuals/251110.htm

Rick Jones
Guest
Posted: Thu Jun 30, 2005 12:15 am    Post subject: Re: How does this make you feel?

Grumble <devnull@kma.eu.org> wrote:
Quote:
Steve wrote:

[...] if someone wants to implement variable sized pages in their MMU

e.g. the Itanium 2 supports 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M,
64M, 256M, 1G, and 4G pages.

Which is remarkably similar to the variable page size support in
PA-RISC 2.0 :)

rick jones
--
Process shall set you free from the need for rational thought.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

John Mashey
Guest
Posted: Thu Jun 30, 2005 12:15 am    Post subject: Re: How does this make you feel?

Eric P. wrote:
Quote:
Steve wrote:


although if someone wants to implement variable sized pages in their
MMU, the VM of any UNIX-like kernel is going to have to be gutted and
written from scratch.

This has been looked into. The basic idea is to include a valid bits
mask in the TLB address matcher to control how big a page size is.
The idea is to minimize the number of TLB miss loads. The x86 has
dual page size or 4KB and 4MB/2MB. The Alpha also had such a feature
for 8KB and 64KB pages, and I vaguely recall seeing a DEC patent on it.
I don't know if VMS ever used the feature.

Well, actually, I think the first one to do this in a general way was
the MIPS R4000; maybe somebody else did it earlier, but if so, I don't
know it:

1) Patent 5,263,140, "Variable page size per entry translation
look-aside buffer", by Tom Riordan, Filed Jan 23, 1991, granted Nov 16,
1993.
This is actually a pretty readable patent.

This was for the general mechanism in the R4000, which first shipped in
a system in 1Q92, such that when each TLB entry was loaded, one could
use a page-size register to specify the size for that TLB entry,
independently of any other, without specially dedicated sections of TLB
[which had been done before.]

Page sizes were 4KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB.

2) Just for fun, I backtracked through old design documents and email.

R2000 (1986) and R3000 (1988) had 4KB pages.
R6000 (1989) had 16KB pages

R4000: we were working on this in 1988.

The architecture specs say:

May 1988: it was a 32-bit CPU, with 4K pages, except there were 2
BIGENTRIES that mapped 4MB each. It was still that way in September.

In late 88, early 89, we decided that the R4000 had to be 64-bit.

June 1989: 64/32-bit CPU, with 4KB/16KB pagesize to be selected at
RESET.
That was done, thinking that some of our OSs were using 4KB pages, but
we also had a version that worked on R6000, with 16KB pages.

Feb 1990: the scheme actually implemented, with page-size selected
whenever a TLB entry was written.

It took a few years for OSs to really take advantage of this, but they
have since the mid-1990s. I don't recall what all the DEC Alpha folks
did, but they certainly understood the issues. Put another way, there
were multiple groups of people who were thinking about widespread use
of multiple page sizes at least 15 years ago, and implementing it in
software at least 10+ years ago.

3) The *reason* for all this was the following argument that I & some
others kept pounding on until Tom came up with the clever
implementation technique that allowed the fast variable-size page
match. The argument:

a) Moore's Law was going to keep RAM memories increasing 4X / 3 years.
A straight-line on a log-scale chart is fairly easy to project several
generations ahead.

b) We knew by early-1989 that the R4000 was going to have >32-bit
addressing.

c) There was no way that we were going to be able to cost-effectively,
indefinitely grow the TLB 4X / 3 years, as that is expensive real
estate, and has to be fast.

d) Over time, with fixed page sizes, the TLB would naturally map a
smaller and smaller fraction of physical memory, and we knew some
programs would naturally grow to consume whatever memory they could.

e) Hence, we'd better be able to grow the page sizes, with enough OS
control to do what made sense. I never liked the restrictive version
with a few large entries of fixed size.
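
The arithmetic behind this argument is easy to sketch in Python. The numbers below are purely hypothetical (a 64-entry TLB and a 16 MB starting memory were picked for illustration, not taken from any real machine); only the 4X per 3 years growth rate comes from the argument above.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory a TLB can map at once."""
    return entries * page_size


def fraction_mapped(entries: int, page_size: int, memory_bytes: int) -> float:
    """Fraction of physical memory covered by the TLB, capped at 1.0."""
    return min(1.0, tlb_reach(entries, page_size) / memory_bytes)


ENTRIES = 64         # hypothetical TLB size, held constant over time
memory = 16 * MB     # hypothetical starting RAM size

for year in range(0, 13, 3):
    small = fraction_mapped(ENTRIES, 4 * KB, memory)
    large = fraction_mapped(ENTRIES, 1 * MB, memory)
    print(f"year {year:2d}: {memory // MB:5d} MB RAM, "
          f"4 KB pages map {small:.4%}, 1 MB pages map {large:.4%}")
    memory *= 4      # RAM grows 4X every 3 years
```

With the page size fixed, the mapped fraction drops by 4X each generation; letting the page size grow is what keeps the reach in step with memory.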

======
I'd have to second JJ's comments to Steve: go spend a few years
learning about this stuff. It's hard enough to do things well with a
lot of experience; with near-zero experience, the likelihood of
inventing useful features is ~0 at best.

Descriptors have a long and honorable history in software and sometimes
in hardware, and people have often had thoughtful reasons for doing
them, even if the design style has mostly fallen out of favor, with the
particular exception of the IBM AS/400's pointers that include
authority tags. But Steve's proposed feature isn't one of those.

Steve
Guest
Posted: Thu Jun 30, 2005 7:49 am    Post subject: Re: How does this make you feel?

Grumble wrote:
Quote:
Steve wrote:

[...] if someone wants to implement variable sized pages in their MMU

e.g. the Itanium 2 supports 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M,
64M, 256M, 1G, and 4G pages.

http://www.intel.com/design/itanium2/manuals/251110.htm

Ok, but the manual you cite here is not specific enough, nor indexed
well enough, for me to tell whether it supports those different page
sizes in the same page table. Anyhow, the scheme I was thinking about
would allow page sizes on power of two boundaries, and potentially
interleaved so that a set of pages in virtual memory might be seen in
the following order: 4k 1M 64K 256k 32k etc. -- and they would be
logically adjacent. This would require a little more trickery on the
part of the descriptor data structure, but is a logical extension.

I'm sorry I was not more elaborate with my initial description, but I
thought I would simplify it for the sake of the argument. However, now
that you have brought it up, I feel obligated to mention that the page
tables would necessarily have to be arranged with something like a
binary tree structure in a case like this.
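
A rough Python sketch of such a lookup is below. It uses a sorted list and binary search from the standard `bisect` module as a stand-in for the binary tree mentioned above, and the function names are invented. It also makes one loud assumption: pages are simply packed back to back, so a page is not necessarily aligned to its own size.

```python
import bisect

KB = 1024
MB = 1024 * KB


def build_layout(base: int, sizes):
    """Pack power-of-two pages back to back starting at base.

    Returns (starts, end): the start address of each page, and the
    first address past the last page.
    """
    starts = []
    addr = base
    for size in sizes:
        if size <= 0 or size & (size - 1):
            raise ValueError("page sizes must be powers of two")
        starts.append(addr)
        addr += size
    return starts, addr


def find_page(starts, end: int, vaddr: int):
    """Return (page index, offset within page), or None if unmapped."""
    if not starts or vaddr < starts[0] or vaddr >= end:
        return None
    index = bisect.bisect_right(starts, vaddr) - 1
    return index, vaddr - starts[index]


sizes = [4 * KB, 1 * MB, 64 * KB, 256 * KB, 32 * KB]
starts, end = build_layout(0, sizes)
print(find_page(starts, end, 4 * KB + 1 * MB + 10))
```

Every translation now costs a search over the page boundaries instead of a simple bit-field extraction, which is the extra trickery being paid for.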


Regards,

Steve

John Mashey
Guest
Posted: Thu Jun 30, 2005 8:15 am    Post subject: Re: How does this make you feel?

Steve wrote:

Quote:
True enough. But, I don't believe that I have necessarily invented
something new, but rather identified a new wrinkle on an old problem.
Inasmuch as what I have described collides with previous ideas, I have
thought that my particular approach to making general purpose registers
more complex, and then applying that complexity to enhance the
functionality of the general instruction set, makes for something of an
innovation. Expanding that to VM and page table management would be
icing on the cake, as it were.

Whether there is any real end-user utility to being able to issue an
XOR instruction that applies to a 1M range of VM; or whether being
able to add a large column of integers, perhaps in steps; or whether
being able to do large mmoves makes practical sense as a single
instruction, is something I cannot analyse with sufficient rigor at my
current level of knowledge. It _seems_ to me that the semantics of a
large fraction of assembler instructions could be expanded to take
advantage of a more complex address decoder, but I have not examined
this in detail. Yet. But I should.

Yes, but you really need to go study a bunch more, else you are wasting
your time. [This is not to discourage anyone from having new ideas,
it's simply an observation that there is a minimal level of
hardware+software knowledge needed before proposing ISA extensions is
more than a waste of time.]

It's not a question of colliding with previous ideas, it's that:
a) An amazing number of different things have been tried over the
years, which is why it's important to know the history. Things
somewhat like this were done 30-40 years ago, in extremely popular
computers ... and have since disappeared, for good reasons.

b) Of the various combinations of ISA x cache design x MMU design x
memory system x systems design x OS x languages, some work and some
don't.

c) The commercial systems of the 1950s and 1960s often supplied various
memory-to-memory variable-length operations. This lives on in the IBM
S/360 (circa 1964) and its descendants:

SS instructions: 2 memory addresses, and a length (1-256 bytes). These
are on arbitrary byte boundaries, and hence pretty useful for COBOL,
PL/I, etc. The original operations included:
NC    AND Character
CLC   Compare Logical Character
MVC   Move Character
OC    OR Character
XC    Exclusive OR Character

(There are a bunch more including Translate (and Test), and
decimal-string operators with 2 lengths.)
You can use an EXECUTE instruction, with such instructions as targets
to supply a dynamically-computed length field at run-time, i.e.:
      EX   R1,move         low-order byte of R1 is OR'd into a copy of "move"
      .....
move: MVC  0(0,R2),0(R3)   copies some number of bytes from 0(R3) to 0(R2)

d) S/370 (circa 1970) added some more, including (essentially) the
"bcopy" or "memcpy" instruction MVCL, which is very close to what
you've suggested, but actually works usefully:

MVCL: Move Long:
MVCL Ra, Rb: Ra and Rb each specify a register-pair, where the first
register gives a memory address, and the second gives a byte-count (up
to 16MB). This copies the data from 0(Rb) to 0(Ra), and if the second
length is shorter, it uses the high-order byte of [Rb+1] to pad. This
allows a nice zero-fill: just set [Rb+1] = 0.

This is carefully designed to allow interrupts to happen, because no
one is willing for interrupts to be blocked while 16MB of memory is
zeroed/copied. That requires updating all 4 registers (adding to the
addresses, subtracting from the lengths), so that the instruction can
be restarted correctly.

The manual description is a tightly-worded 2 solid pages to cover all
the cases that can happen.

This has no restrictions of alignment, and barely any of size (16MB),
and survives exceptions without weird extra state. These instructions
essentially use register-pairs as byte-string-descriptors, and are
relatively straightforward to use.

the compare version is:
CLCL Ra, Rb compares long strings
but they left the logical operators out [no NCL, OCL, XCL]

That gives "memcmp" directly.

e) The DEC VAX (circa 1978), provided a similar, albeit perhaps even
more baroque set of instructions, but certainly including direct
equivalents of MVCL (VAX MOVC) and CLCL (VAX CMPC).

f) Hence, the most successful mainframe ISA, and the most successful
minicomputer ISA both had features that essentially used address:length
descriptors to do long memory operations.

g) And then, these features (essentially) disappeared from new ISA
designs, including those for most microprocessors. While there are a
few memory-to-memory designs done in the last 10 years or so, I don't
think they are among the really popular designs. The closest popular
one would be X86's combination of REP + MOVS, but that's not the same
thing at all.
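
To make the restartable behaviour described under (d) concrete, here is a simplified Python model. It is a sketch under stated assumptions, not the real MVCL: the names `MoveRegs` and `move_long` are invented, the pad byte is kept in its own field rather than in the high-order byte of a length register, overlap checks are ignored, and `budget` stands in for "an interrupt arrives after this many bytes".

```python
from dataclasses import dataclass


@dataclass
class MoveRegs:
    dst: int        # destination address   (models Ra)
    dst_len: int    # destination byte count (models Ra+1)
    src: int        # source address         (models Rb)
    src_len: int    # source byte count      (models Rb+1)
    pad: int = 0    # pad byte used once the source runs out


def move_long(mem: bytearray, regs: MoveRegs, budget: int) -> bool:
    """Copy at most `budget` bytes, updating all registers as it goes.

    Returns True when the destination has been completely filled, so a
    caller can simply re-issue the call after an interruption.
    """
    while regs.dst_len > 0 and budget > 0:
        if regs.src_len > 0:
            mem[regs.dst] = mem[regs.src]
            regs.src += 1
            regs.src_len -= 1
        else:
            mem[regs.dst] = regs.pad   # source exhausted: pad the rest
        regs.dst += 1
        regs.dst_len -= 1
        budget -= 1
    return regs.dst_len == 0


mem = bytearray(b"abcdef" + bytes(10))
regs = MoveRegs(dst=8, dst_len=8, src=0, src_len=6, pad=ord("."))
while not move_long(mem, regs, budget=3):   # "interrupted" every 3 bytes
    print("interrupted with", regs)
print(bytes(mem[8:16]))
```

Because all the progress lives in the registers, the instruction can be restarted with no hidden state; setting the source length to zero turns the same operation into a fill.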

One might wonder why that happened...

+ Perhaps the later designers were just dumb.
- I've known lots of them, and I doubt it.

+ Perhaps the later designers were ignorant of the S/370 and VAX.
- I suppose that might possibly be true now, but it certainly wasn't
in the 1970s, 1980s, and early 1990s. Most serious microprocessor ISA
designers were quite familiar with these, particularly since some of the
later designers had implemented the earlier ones and were quite familiar
with them. I.e., the IBM 801 RISC folks certainly knew S/370, and the
DEC Alpha folks knew VAX. Certainly most people in this field had at
least studied these ISAs, or more likely, had used either or both of
these systems for many years.

+ Maybe C and UNIX distorted CPU design, especially with RISCs
- Possible, but as I've posted various times, various RISC CPU
designers definitely cared about non-C languages and non-UNIX operating
systems.

+ Maybe later designers' insistence on measuring performance impacts
versus implementation costs caused them to ignore potentially-wonderful
features whose only problem was that they needed a new OS and new
language to make use of them.
- Always possible. Personally, I'd be delighted to see a brand-new
ISA + OS + language combination that gave real breakthroughs. However,
the track record for such things has rarely been good, although I still
admire the thought behind the Burroughs B5000 ... but that was a long
time ago.

OK, so why?
- It is no accident that it takes 2 pages to describe MVCL.

- It is a common mistake to count instructions executed, rather than
cycles consumed. More than one complex design has had powerful
instructions that were outperformed by sequences of simpler ones: S/360
MVC was sometimes beaten by Load Multiple/Store Multiple sequences.

- I've posted numerous times about the care needed to do
memory->register or register->memory operations when the addresses can
cross cache-line or page boundaries. They're rife with special cases,
implementation bugs, and extra cost ... that designers are loath to
pay, because they cost space and sometimes gate delays, and in
practice, don't seem to yield proportionate extra performance. This is
not to say there might not be a role for these, just that they don't
seem to mesh very well with the main lines of CPU design.

- The implementation issues for memory-memory are even worse, at least
if added onto a typical CPU core. Both S/360 and VAX were designed for
microcoded implementations, and the extra cost might not be too bad,
although it was notable that some of the most cost-effective
implementations [360/44, DEC microVAX] didn't implement all of the
variable-length instructions.

- Most recent CPU ISAs are designed to allow cost-effective
implementations of pipelining and usually multiple-issue. As I've
noted before, complex memory-addressing is one of the most
problematical features to have in high-speed implementations. [My
usual example is the claim that the extra address modifiers added going
from MC 68010 -> 68020 were a mistake in this regard ... and they did
disappear in the later ColdFire derivatives.]

In most designs, it may be possible to pipeline the simpler operations,
but usually the complex multiple-memory operand things end up taking
over all the crucial machine resources, and stop the pipeline in its
tracks. They also tend to serialize operations to keep complexity
down. *Sometimes*, with enough work on the design, the hardware can
indeed do better if it knows an entire address+length in one fell swoop.
For example, in a uniprocessor with write-back caches, something like a
MVCL can avoid fetching a cache line that is about to be completely
overwritten.
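
A minimal sketch in C of the bookkeeping behind that write-back-cache point: given a destination address and length, count how many cache lines are wholly overwritten (so their old contents never need to be fetched) and how many are only partly overwritten. The 64-byte line size and the names `classify_lines` and `line_counts` are assumptions for illustration, not taken from any particular machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed line size in bytes, illustrative only */

struct line_counts {
    size_t full;      /* lines wholly overwritten: old contents not needed */
    size_t partial;   /* lines partly overwritten: old contents must be read */
};

/* Classify the cache lines touched by a store of len bytes at dst. */
struct line_counts classify_lines(uintptr_t dst, size_t len)
{
    struct line_counts c = {0, 0};
    if (len == 0)
        return c;

    uintptr_t end = dst + len;               /* one past the last byte */
    assert(end > dst);                       /* range must not wrap around */

    uintptr_t mask = ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t first_full = (dst + LINE_SIZE - 1) & mask;  /* round up */
    uintptr_t last_full = end & mask;                     /* round down */

    /* Every line from the one holding dst to the one holding end-1. */
    size_t total = (size_t)((end - 1) / LINE_SIZE - dst / LINE_SIZE) + 1;
    if (first_full < last_full)
        c.full = (size_t)((last_full - first_full) / LINE_SIZE);
    c.partial = total - c.full;
    return c;
}
```

A byte-at-a-time or word-at-a-time store stream only reveals this split gradually; an instruction that carries address and length together lets the hardware work it out before the first store.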

- And finally, in many designs, the path:
1) fetch register(s)
2) add displacement (or index or shifted index)
3) provide address to MMU and cache

is pretty important. I've heard fierce arguments over features that
might cost a gate delay or two in the first two steps in this path.
In some cases, the *only* addressing mode allowed is (register), i.e.,
no displacement or indexing. AMD29000 did that, for example. Some of
us (like MIPS) allowed only displacement(base). Others allow a few
more, and there is legitimate room for different approaches, but this
is the kind of stuff serious designers worry about.
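
The addressing modes named above can be written out in C to show how little work each one puts on the address path. This is only a sketch: the function names are made up, and the 16-bit signed displacement and 32-bit registers are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* (register): the register value is the address; no adder in the path. */
uint32_t ea_register(uint32_t base)
{
    return base;
}

/* displacement(base): one add of a sign-extended constant. */
uint32_t ea_displacement(uint32_t base, int16_t disp)
{
    return base + (uint32_t)(int32_t)disp;   /* sign-extend, then add */
}

/* base + (index << shift), where shift is a small constant encoded in
 * the instruction, so the hardware needs only a fixed, cheap shifter. */
uint32_t ea_shifted_index(uint32_t base, uint32_t index, unsigned shift)
{
    assert(shift <= 3);
    return base + (index << shift);
}
```

Each form is known at decode time and costs at most one add. None of them needs a second register fetch to find out how the first register should be interpreted.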

The *last* thing most such designers would want is a feature, such that
- In order to access just the address in the register (step 1)

- The CPU has to fetch another register (the "Axm")

- The CPU has to do a *variable* shift of the address register
depending on the value of the Axm just fetched. [A fixed shift is easy
and cheap, which is why some ISAs do shifted-index, say of 1, 2 or 3
bits.] A large arbitrary variable shift is not so cheap.

- And if that's not bad enough, the net effect is that *almost any*
instruction seems like it could turn into a multi-memory-operation, and
this is only discoverable in the address generation stage, not in
instruction decode. This is really bad news in aggressive pipeline
designs, such as speculative out-of-order ones, as it multiplies the
resources needed to track instructions, and complexifies the load/store
units. From seeing what went on with S/360/VAX designs, I think this
feature would make decent pipelining expensive in the way that most
irritates CPU designers, i.e., that there is a lot of complexity needed
to cope with rare cases.

- Aggressive current CPUs have deep load/store queues of address and
data that are "in flight", and need to make sure the right things
happen in all the cases. Multiple memory operations don't help this
any, and may make it a lot worse.

- Finally, it is unclear that this feature helps much, at least if
added to typical current designs. As far as I can tell, the address
starts on a power-of-2 boundary, which means it's not directly useable
for memcpy. With the possible exception of
writeback-cache-optimization, it's hard to see how this is much faster
than the straightforward instructions in current CPUs, where people
have worked very hard to optimize loads and stores and overlap them.

- Also, of course, the feature, as described, is non-interruptible.
MVCL and its siblings were interruptible for good reason...

So, like I said, study some more. A while ago, in the WIZ thread, I
posted suggestions for digital design knowledge desirable for software
people to participate meaningfully in this turf:

http://groups-beta.google.com/group/comp.arch/browse_frm/thread/a060bc84cdc66f60?scoring=d&q=wiz+mashey+&hl=en

and there are related discussions nearby in that thread.
Jan Vorbrüggen
Guest





Posted: Thu Jun 30, 2005 8:15 am    Post subject: Re: How does this make you feel?

Quote:
g) And then, these features (essentially) disappeared from new ISA
designs, including those for most microprocessors.

One of the few exceptions was the transputer. Because of the instructions
supporting channel communications - which for processor-internal channels
are just a special form of memcpy - it was natural to support this as a
separate instruction as well. For the second generation, a nice MOVE2D
instruction was added that allowed you to, for instance, extract a column
of a 2D matrix into a contiguous array, operate on it (e.g., perform an
FFT), and scatter it back into the 2D matrix.
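
The access pattern described here (pull a column out of a row-major 2D matrix, work on it as a contiguous array, put it back) looks like this in plain C. This only illustrates the data movement; the function names are made up and it does not attempt to reproduce the MOVE2D instruction's actual operands or semantics.

```c
#include <assert.h>
#include <stddef.h>

/* Copy column col of a rows x cols row-major matrix into out[0..rows-1]. */
void gather_column(const double *m, size_t rows, size_t cols,
                   size_t col, double *out)
{
    assert(col < cols);
    for (size_t r = 0; r < rows; r++)
        out[r] = m[r * cols + col];    /* stride of cols elements */
}

/* Write in[0..rows-1] back into column col of the matrix. */
void scatter_column(double *m, size_t rows, size_t cols,
                    size_t col, const double *in)
{
    assert(col < cols);
    for (size_t r = 0; r < rows; r++)
        m[r * cols + col] = in[r];
}
```

In software this is a strided loop on each side of the real work; a 2D move instruction folds the stride into the copy itself.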

Quote:
+ Maybe C and UNIX distorted CPU design, especially with RISCs
- Possible, but as I've posted various times, various RISC CPU
designers definitely cared about non-C languages and non-UNIX operating
systems.

I do think this did play some role - I'm convinced it put support for
descriptor-like data structures on a lower priority than it would
otherwise have had.

Quote:
+ Maybe later designers' insistence on measuring performance impacts
versus implementation costs caused them to ignore potentially-wonderful
features whose only problem was that they needed a new OS and new
language to make use of them.

Rather, I'd think that smart programmers showed they could use RISC
primitives to implement, say, a memcpy just as efficiently as microcode
could, except perhaps for some of the cache effects (see below) and, of
course, at the expense of quite complicated code to handle all possible
alignments etc.
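
To give a feel for the "quite complicated code to handle all possible alignments", here is a deliberately simplified sketch in C of a copy built from simple loads and stores. `copy_words` is a hypothetical name, not a library routine, and the 32-bit word size is an assumption. It takes the fast word path only when source and destination share the same alignment; production routines handle the mismatched case with shift-and-merge sequences rather than the byte fallback used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD sizeof(uint32_t)

/* Copy n bytes between non-overlapping buffers. */
void *copy_words(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (((uintptr_t)d % WORD) == ((uintptr_t)s % WORD)) {
        /* Head: single bytes until both pointers are word-aligned. */
        while (n > 0 && ((uintptr_t)d % WORD) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Body: one word per iteration. The fixed-size memcpy calls are
         * the portable C spelling of a word load and a word store. */
        while (n >= WORD) {
            uint32_t w;
            memcpy(&w, s, WORD);
            memcpy(d, &w, WORD);
            s += WORD;
            d += WORD;
            n -= WORD;
        }
    }
    /* Tail, or the whole copy when the two alignments differ. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Even this cut-down version has three phases and an alignment test; the full version multiplies the cases, which is the expense being referred to.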

Quote:
- It is no accident that it takes 2 pages to describe MVCL.

Indeed. And it increases the likelihood some implementation gets some
corner case wrong.

Quote:
*Sometimes*, with enough work on the design, the hardware can
indeed do better if it knows an entire address+length in one fell swoop.
For example, in a uniprocessor with write-back caches, something like a
MVCL can avoid fetching a cache line that is about to be completely
overwritten.

...which is why ISA designers added such things as WH64 (write hint 64
bytes - Alpha). That is of course much more generally useful, besides
being used by memcpy and friends.

Incidentally, the transputer's MOVE instruction supplies a nice lesson in
why such designs are difficult. It was, of course, interruptible, because
supporting interrupts efficiently was a design goal for the transputer.
In the case of an interrupt, the current state was saved by microcode into
defined memory locations and restored later (due to the design, there could
only be one level of interrupt). However, the first implementations got
the saved state wrong: when the instruction was resumed, it would re-read
the last location that had already been processed. "Normal" memory doesn't
care, but if you use this to read out a FIFO, you have a problem. Various
software workarounds needed to be developed for this oversight...
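
A toy model in C of why that saved-state error matters for a FIFO. Everything here is hypothetical (the names, the `budget` parameter standing in for "an interrupt arrives", the `buggy` flag); it is not a description of the real microcode, only of the failure mode: resuming one item too early is harmless for ordinary memory but loses data when each read pops an element.

```c
#include <assert.h>
#include <stddef.h>

/* A toy FIFO: every read pops an element, so reading twice loses data. */
struct fifo {
    const int *data;
    size_t pos;
};

static int fifo_read(struct fifo *f)
{
    return f->data[f->pos++];
}

/* Saved state of an interrupted move: how many items are already done. */
struct move_state {
    size_t done;
};

/* Move up to budget items from the FIFO into dst, then stop as if
 * interrupted. With buggy set, the saved count is one too small, so the
 * resumed move redoes the last item it had already processed. */
void move_step(struct fifo *f, int *dst, size_t total, size_t budget,
               struct move_state *st, int buggy)
{
    assert(st->done <= total);
    size_t moved = 0;
    while (st->done < total && moved < budget) {
        dst[st->done] = fifo_read(f);
        st->done++;
        moved++;
    }
    if (buggy && moved > 0 && st->done < total)
        st->done--;    /* wrong saved state */
}
```

Moving four items from a FIFO holding 1, 2, 3, 4, 5, 6 with an interruption after two items gives 1, 2, 3, 4 with the correct state. With the buggy state the second slot is overwritten by a fresh read and the result is 1, 3, 4, 5: the value 2 is gone.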

Jan
Steve
Guest





Posted: Thu Jun 30, 2005 8:15 am    Post subject: Re: How does this make you feel?

John Mashey wrote:
Quote:
Eric P. wrote:
[TLB extras]
1) Patent 5,263,140, "Variable page size per entry translation
look-aside buffer", by Tom Riordan, Filed Jan 23, 1991, granted Nov 16,
1993.
This is actually a pretty readable patent.

I've made a mental note to read this.

Quote:
This was for the general mechanism in the R4000, which first shipped in
a system in 1Q92, such that when each TLB entry was loaded, one could
use a page-size register to specify the size for that TLB entry,
independently of any other, without specially dedicated sections of TLB
[which had been done before.]

Page sizes were 4KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB.
[Mips Rx000 stuff snipped]
It took a few years for OSs to really take advantage of this, but they
have since the mid-1990s. I don't recall what all the DEC Alpha folks
did, but they certainly understood the issues. Put another way, there
were multiple groups of people who were thinking about widespread use
of multiple page sizes at least 15 years ago, and implementing it in
software at least 10+ years ago.

3) The *reason* for all this was the following argument that I & some
others kept pounding on until Tom came up with the clever
implementation technique that allowed the fast variable-size page
match. The argument:

a) Moore's Law was going to keep RAM memories increasing 4X / 3 years.
A straight-line on a log-scale chart is fairly easy to project several
generations ahead.

b) We knew by early-1989 that the R4000 was going to have >32-bit
addressing.

c) There was no way that we were going to be able to cost-effectively,
indefinitely grow the TLB 4X / 3 years, as that is expensive real
estate, and has to be fast.

d) Over time, with fixed page sizes, the TLB would naturally map a
smaller and smaller fraction of physical memory, and we knew some
programs would naturally grow to consume whatever memory they could.

e) Hence, we'd better be able to grow the page sizes, with enough OS
control to do what made sense. I never liked the restrictive version
with a few large entries of fixed size.
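
The arithmetic behind this argument is easy to sketch in C. The 64-entry TLB and the 16MB starting memory size used in the note below are made-up numbers for illustration, not R4000 figures; only the 4KB and 16MB page sizes and the 4X / 3 years growth rate come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of memory a TLB can map at once: entries times page size. */
uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}

/* Memory size after some number of 3-year generations of 4X growth. */
uint64_t ram_after(uint64_t start_bytes, unsigned generations)
{
    assert(2 * generations < 64);
    return start_bytes << (2 * generations);   /* 4X is a 2-bit shift */
}
```

With a hypothetical 64-entry TLB, `tlb_reach(64, 4096)` is 256KB, while `tlb_reach(64, 16 << 20)` is 1GB. Starting from a hypothetical 16MB of RAM, `ram_after(16 << 20, 3)` is also 1GB, so a fixed 4KB page size leaves the TLB covering a tiny fraction of memory after three generations, whereas larger pages let the reach keep up without adding entries.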

With 64bit addressing (which won't last forever) and the potential for
truly huge data sets for any number of real-world applications, the sky
is pretty much the limit, ain't it.

Quote:
I'd have to second JJ's comments to Steve: go spend a few years
learning about this stuff. It's hard enough to do things well with a
lot of experience; with near-zero experience, the likelihood of
inventing useful features is ~0 at best.

True enough. But, I don't believe that I have necessarily invented
something new, but rather identified a new wrinkle on an old problem.
Inasmuch as what I have described collides with previous ideas, I have
thought that my particular approach to making general purpose registers
more complex, and then applying that complexity to enhance the
functionality of the general instruction set, makes for something of an
innovation. Expanding that to VM and page table management would be
icing on the cake, as it were.

Whether there is any real end-user utility to being able to issue an
XOR instruction that applies to a 1M range of VM; or whether being
able to add a large column of integers, perhaps in steps; or whether
being able to do large memory moves makes practical sense as a single
instruction, is something I cannot analyse with sufficient rigor at my
current level of knowledge. It _seems_ to me that the semantics of a
large fraction of assembler instructions could be expanded to take
advantage of a more complex address decoder, but I have not examined
this in detail. Yet. But I should.

In another current thread some posters have supplied citations to a
number of relevant reference works that concern this field. I have
access to a university library that subscribes to the usual journals,
as well as containing a decent set of older print journals. I've
skimmed a few of the Bell and IBM journal articles from the seventies,
mostly for researching the history of language design, Unix, and such,
but I have not made a comprehensive study to date. I probably should,
and will, time permitting.

Quote:
Descriptors have a long and honorable history in software and sometimes
in hardware, and people have often had thoughtful reasons for doing
them, even if the design style has mostly fallen out of favor, with the
particular exception of the IBM AS/400's pointers that include
authority tags. But Steve's proposed feature isn't one of those.

No, and I note that descriptors play a large part of the SCSI
specification, although that is an entirely different ball of wax.


Regards,

Steve
Anne & Lynn Wheeler
Guest





Posted: Thu Jun 30, 2005 4:15 pm    Post subject: Re: How does this make you feel?

"John Mashey" <old_systems_guy@yahoo.com> writes:
Quote:
This has no restrictions of alignment, and barely any of size (16MB),
and survives exceptions without weird extra state. These instructions
essentially use register-pairs as byte-string-descriptors, and are
relatively straightforward to use.

the compare version is:
CLCL Ra, Rb compares long strings
but they left the logical operators out [no NCL, OCL, XCL]

That gives "memcmp" directly.

the 360 instructions would check start & end locations for fetch and
store violation before starting the instruction (fetch and store
protection specification were on 2k aligned boundaries), lengths in
these instructions were never more than 256 ... so worst case
fetch/store protection areas were the start and end. on 360/67 with
virtual memory support and 4k pages ... the start and end locations
were also pre-checked for available page (before starting
instruction). the worst case on 360/67 was 8 virtual pages:

1) execute instruction that crossed 4k boundary (2 pages)
2) SS instruction (target of the execute) that crossed 4k boundary
(2 more pages)
3) source location of the SS instruction that crossed page boundary
(2 more pages)
4) target location of the SS instruction that crossed page boundary
(2 more pages)
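
The count of 8 can be checked with a few lines of C. The helper name and the example addresses are made up; the sketch assumes 4k pages as stated above, a 4-byte execute instruction, a 6-byte SS instruction, and 256-byte operands.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Number of pages touched by len bytes starting at addr (len > 0). */
unsigned pages_touched(uint32_t addr, uint32_t len)
{
    assert(len > 0);
    uint32_t first = addr / PAGE_SIZE;
    uint32_t last = (addr + len - 1) / PAGE_SIZE;
    return (unsigned)(last - first + 1);
}
```

Placing each of the four items so that it straddles a 4k boundary, for example `pages_touched(0x0FFE, 4)`, `pages_touched(0x1FFC, 6)`, `pages_touched(0x2F80, 256)` and `pages_touched(0x3F80, 256)`, gives 2 pages each, for a total of 8. Because no operand is longer than 256 bytes, none can touch more than 2 pages, which is why checking only the start and end locations is sufficient.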

the interruptible "long" instructions (introduced with 370) were not
defined as having all required storage locations to be pre-checked;
they were defined as being able to check on a byte-by-byte basis and
causing an interrupt (with updated register values which allowed for
restarting the instruction). I was involved in shooting a microcode
bug on 370/125 (& 370/115) where the microcoders had incorrectly
checked starting and ending locations on long instructions before
starting (if something was wrong with the ending location, it would
interrupt before starting the instruction ... which was correct for
the 360 instructions but incorrect for the 370 "long" instructions).
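
A small C model of the restartable behaviour described for the "long" instructions. The struct and function names are hypothetical, and the `fault_at` parameter is just a stand-in for "this byte is not currently accessible"; the point is only that the registers always describe the work still to be done, so the same instruction can simply be issued again.

```c
#include <assert.h>
#include <stddef.h>

/* Register-style state: addresses and remaining length. */
struct long_regs {
    unsigned char *dst;
    const unsigned char *src;
    size_t len;
};

/* Copy byte by byte, checking each source byte just before it is used.
 * Returns 1 when the move is complete, 0 when it stops at fault_at.
 * Either way the registers have been updated to the remaining work. */
int long_move(struct long_regs *r, const unsigned char *fault_at)
{
    assert(r != NULL);
    while (r->len > 0) {
        if (r->src == fault_at)
            return 0;          /* interrupt: regs point at the failing byte */
        *r->dst++ = *r->src++;
        r->len--;
    }
    return 1;
}
```

With an 8-byte move that faults at the sixth source byte, the first call copies five bytes and leaves `len` at 3; once the fault is resolved, calling again with the same registers finishes the job. Pre-checking the ending location, as in the microcode bug described above, would instead interrupt before any of the first five bytes had been moved.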

in a recent discussion on this subject ... it has been brought to my
attention that more recent machines have fixed a "bug" in the
(original 360) translate SS instructions. translate instructions take
a 256 character "table" that is used for changing or testing the
source string. standard 360 involved checking the table starting
address and the table ending address (start+256). However, a
programmer who knew that they had a constrained set of characters in
the input stream was allowed to define "short" tables (less than 256
bytes). However, the original instruction implementations would check
worst case table ending address (start+256). the instruction bug fix
is that if the start of a table is within 256 bytes of a boundary, the
instruction is pre-executed, checking each byte in the input string
for possible values that would address table byte on the other side of
the boundary (aka the translate instructions took each input byte and
added its value to the table start address to index a one byte field).
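
In C terms, the translate operation and the pre-execution check look roughly like this. The function names are made up and `table_bytes_available` is a hypothetical parameter meaning "how many bytes from the table start lie on this side of the boundary"; this is a sketch of the idea, not of any machine's microcode.

```c
#include <assert.h>
#include <stddef.h>

/* Translate buf in place: each byte is replaced by table[byte]. */
void translate(unsigned char *buf, size_t len, const unsigned char *table)
{
    assert(table != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}

/* Pre-scan: the largest table index the input would actually use. */
size_t max_table_index(const unsigned char *buf, size_t len)
{
    size_t max = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] > max)
            max = buf[i];
    return max;
}

/* 1 if every table byte the input needs lies within the first
 * table_bytes_available bytes of the table, 0 otherwise. */
int table_access_ok(const unsigned char *buf, size_t len,
                    size_t table_bytes_available)
{
    return len == 0 || max_table_index(buf, len) < table_bytes_available;
}
```

With a 4-byte "short" table, input bytes 0, 3, 1 pass the check and translate normally, while an input containing the value 4 would reach past the table and is caught by the pre-scan. The original worst-case check corresponds to always requiring 256 available bytes regardless of the input.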

some recent postings
http://www.garlic.com/~lynn/2005j.html#36 A second look at memory access alignment
http://www.garlic.com/~lynn/2005j.html#38 virtual 360/67 support in cp67
http://www.garlic.com/~lynn/2005j.html#39 A second look at memory access alignment
http://www.garlic.com/~lynn/2005j.html#40 A second look at memory access alignment
http://www.garlic.com/~lynn/2005j.html#44 A second look at memory access alignment
http://www.garlic.com/~lynn/2005k.html#41 Title screen for HLA Adventure? Need help designing one

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
All times are GMT