| Author |
Message |
Grumble
Guest
|
Posted:
Thu Jun 30, 2005 4:15 pm Post subject:
Re: How does this make you feel? |
|
|
Steve wrote:
| Quote: | Grumble wrote:
Steve wrote:
[...] if someone wants to implement variable sized pages in their MMU
e.g. the Itanium 2 supports 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M,
64M, 256M, 1G, and 4G pages.
http://www.intel.com/design/itanium2/manuals/251110.htm
Ok, but the manual you cite here is not specific enough, nor indexed
well enough, for me to tell whether it supports those different page
sizes in the same page table.
|
A few IPF gurus read this group, maybe they'll jump in :-)
You might find more information in the system architecture guide:
http://www.intel.com/design/itanium/manuals/iiasdmanual.htm#system
--
Regards, Grumble |
|
|
 |
Grumble
Guest
|
Posted:
Thu Jun 30, 2005 4:15 pm Post subject:
Re: How does this make you feel? |
|
|
Rick Jones wrote:
| Quote: | Grumble wrote:
Steve wrote:
[...] if someone wants to implement variable sized pages in their MMU
e.g. the Itanium 2 supports 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M,
64M, 256M, 1G, and 4G pages.
Which is remarkably similar to the variable page size support in
PA-RISC 2.0 :)
|
Talk about a coincidence! ;-) |
|
|
 |
Eric P.
Guest
|
Posted:
Thu Jun 30, 2005 9:58 pm Post subject:
Re: How does this make you feel? |
|
|
John Mashey wrote:
| Quote: |
Eric P. wrote:
Steve wrote:
although if someone wants to implement variable sized pages in their
MMU, the VM of any UNIX-like kernel is going to have to be gutted and
written from scratch.
This has been looked into. The basic idea is to include a valid bits
mask in the TLB address matcher to control how big a page size is.
The idea is to minimize the number of TLB miss loads. The x86 has
dual page sizes of 4KB and 4MB/2MB. The Alpha also had such a feature
for 8KB and 64KB pages, and I vaguely recall seeing a DEC patent on it.
I don't know if VMS ever used the feature.
Well, actually, I think the first one to do this in a general way was
the MIPS R4000; maybe somebody else did it earlier, but if so, I don't
know it:
1) Patent 5,263,140, "Variable page size per entry translation
look-aside buffer", by Tom Riordan, Filed Jan 23, 1991, granted Nov 16,
1993.
This is actually a pretty readable patent.
|
I think the DEC patent I saw was 5,568,415 "Content addressable memory
having a pair of memory cells storing don't care states for address
translation" filed 1993, granted 1996. It would appear to apply
to just the idea of using a 2-bit value for each PTE virtual
address match bit to specify Invalid, DontCare, Match_0, Match_1.
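A toy model of that per-bit matching, to show how don't-care bits let one entry cover a larger page. This is only a sketch, not the circuit from either patent: the four state names come from the list above, while `make_entry`, `tcam_match`, and the 7-bit page number are made up for illustration.

```python
from enum import Enum

class Bit(Enum):
    INVALID = 0    # entry can never match
    DONT_CARE = 1  # address bit is ignored
    MATCH_0 = 2    # address bit must be 0
    MATCH_1 = 3    # address bit must be 1

def make_entry(vpn, vpn_bits, ignored_low_bits):
    """Build a per-bit pattern (LSB first) for a virtual page number.

    The low `ignored_low_bits` bits become DONT_CARE, so the entry covers
    2**ignored_low_bits contiguous base pages.
    """
    entry = []
    for i in range(vpn_bits):
        if i < ignored_low_bits:
            entry.append(Bit.DONT_CARE)
        elif (vpn >> i) & 1:
            entry.append(Bit.MATCH_1)
        else:
            entry.append(Bit.MATCH_0)
    return entry

def tcam_match(entry, vpn):
    """Return True if every bit of `vpn` satisfies the entry's pattern."""
    for i, state in enumerate(entry):
        bit = (vpn >> i) & 1
        if state is Bit.INVALID:
            return False
        if state is Bit.MATCH_0 and bit != 0:
            return False
        if state is Bit.MATCH_1 and bit != 1:
            return False
    return True

# One entry covering an aligned group of 8 base pages (e.g. 8 x 8 KB = 64 KB).
entry = make_entry(vpn=0b1011000, vpn_bits=7, ignored_low_bits=3)
hits = [v for v in range(128) if tcam_match(entry, v)]
```

Here `hits` holds the eight consecutive page numbers that share the upper four bits, so a single entry stands in for eight small-page entries.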
| Quote: | snip
Feb 1990: the scheme actually implemented, with page-size selected
whenever a TLB entry was written.
It took a few years for OSs to really take advantage of this, but they
have since the mid-1990s. I don't recall what all the DEC Alpha folks
did, but they certainly understood the issues. Put another way, there
were multiple groups of people who were thinking about widespread use
of multiple page sizes at least 15 years ago, and implementing it in
software at least 10+ years ago.
|
The docs I read gave me the impression that this was tough to do
transparently on the fly.
For example, most code pages are process sharable. To upgrade two
contiguous and properly aligned 8 KB pages to one 16 KB page across
all accessing processes requires allocating a 16 KB frame, copying the
content from the two 8 KB pages, patching all the process page tables
that referenced the small frames, shooting down the changed TLB entries,
and releasing the two small frames. And this has to work in an SMP
environment with those other processes possibly executing on a
different cpu while the page table is being restructured on the fly.
Like I said, I was under the impression this is tough.
So if they have accomplished this, it is impressive.
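The bookkeeping in those steps can be sketched roughly as follows. This is a single-threaded toy model, so it deliberately ignores the SMP race that makes the real thing hard; `promote`, the dictionaries standing in for page tables, and the sets standing in for per-CPU TLBs are all hypothetical, and it assumes every sharer maps the same two frames.

```python
SMALL = 8 * 1024  # base page size used in the example above

def promote(phys, page_tables, tlbs, va, new_frame, free_list):
    """Merge the 8 KB pages at `va` and `va + SMALL` into one 16 KB page.

    phys:        dict frame_id -> bytearray (frame contents)
    page_tables: list of dicts, va -> (frame_id, size), one per process
    tlbs:        list of sets of cached virtual addresses, one per CPU
    """
    if va % (2 * SMALL) != 0:
        raise ValueError("a 16 KB page must be 16 KB aligned")
    sharers = [pt for pt in page_tables if va in pt and va + SMALL in pt]
    if not sharers:
        raise ValueError("no process maps both small pages")
    lo_frame = sharers[0][va][0]
    hi_frame = sharers[0][va + SMALL][0]
    # Allocate the large frame and copy both small frames into it.
    phys[new_frame] = phys[lo_frame] + phys[hi_frame]
    # Patch every page table that referenced the small frames.
    for pt in sharers:
        del pt[va + SMALL]
        pt[va] = (new_frame, 2 * SMALL)
    # Shoot down stale translations on every CPU.
    for tlb in tlbs:
        tlb.discard(va)
        tlb.discard(va + SMALL)
    # Release the two small frames.
    for frame in (lo_frame, hi_frame):
        del phys[frame]
        free_list.append(frame)

phys = {1: bytearray(b"a" * SMALL), 2: bytearray(b"b" * SMALL)}
tables = [{0: (1, SMALL), SMALL: (2, SMALL)},
          {0: (1, SMALL), SMALL: (2, SMALL)}]
tlbs = [{0, SMALL}, {SMALL}]
free = []
promote(phys, tables, tlbs, 0, new_frame=9, free_list=free)
```

In a real kernel each of these loops would have to be made safe against the other CPUs touching the same pages, which is exactly the tough part.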
| Quote: | 3) The *reason* for all this was the following argument that I & some
others kept pounding on until Tom came up with the clever
implementation technique that allowed the fast variable-size page
match. The argument:
a) Moore's Law was going to keep RAM memories increasing 4X / 3 years.
A straight-line on a log-scale chart is fairly easy to project several
generations ahead.
b) We knew by early-1989 that the R4000 was going to have >32-bit
addressing.
c) There was no way that we were going to be able to cost-effectively,
indefinitely grow the TLB 4X / 3 years, as that is expensive real
estate, and has to be fast.
d) Over time, with fixed page sizes, the TLB would naturally map a
smaller and smaller fraction of physical memory, and we knew some
programs would naturally grow to consume whatever memory they could.
e) Hence, we'd better be able to grow the page sizes, with enough OS
control to do what made sense. I never liked the restrictive version
with a few large entries of fixed size.
|
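The arithmetic behind points (a) through (d) can be put in a few lines. The TLB size and starting memory size below are hypothetical round numbers, not figures for the R4000 or any other chip; only the 4X / 3 years growth rate comes from the argument above.

```python
def tlb_coverage(entries, page_bytes, ram_bytes):
    """Fraction of physical memory that a fully loaded TLB can map."""
    return min(1.0, entries * page_bytes / ram_bytes)

ENTRIES = 64           # hypothetical TLB size, held constant
PAGE = 4 * 1024        # fixed 4 KB pages
START_RAM = 16 * 1024**2   # hypothetical starting point: 16 MB

coverage = []
ram = START_RAM
for generation in range(5):    # one generation is about 3 years, RAM x4
    coverage.append(tlb_coverage(ENTRIES, PAGE, ram))
    ram *= 4

# If the page size is allowed to grow 4x per generation as well,
# the mapped fraction stays where it started.
scaled = tlb_coverage(ENTRIES, PAGE * 4**4, START_RAM * 4**4)
```

With fixed pages the mapped fraction drops by a factor of four each generation; letting the page size track memory size holds it constant.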
Ok, but this does not necessarily imply that the OS supports a _dynamic_
page size, selected and changeable on the fly. For example
a boot time selected page size would suffice, with an API
call for apps to query the current system page size.
Eric |
|
|
 |
Rick Jones
Guest
|
Posted:
Thu Jun 30, 2005 11:22 pm Post subject:
Re: How does this make you feel? |
|
|
Steve <steve49152@yahoo.ca> wrote:
| Quote: | Grumble wrote:
Steve wrote:
[...] if someone wants to implement variable sized pages in their MMU
e.g. the Itanium 2 supports 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M,
64M, 256M, 1G, and 4G pages.
http://www.intel.com/design/itanium2/manuals/251110.htm
Ok, but the manual you cite here is not specific enough, nor indexed
well enough, for me to tell whether it supports those different page
sizes in the same page table.
|
The PA-RISC 2.0 supported the different page sizes at the same time,
so I suspect that means the same page table. I'd be _VERY_ surprised
if Itanium didn't as well.
rick jones
--
oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH... |
|
|
 |
John Mashey
Guest
|
Posted:
Fri Jul 01, 2005 12:15 am Post subject:
Re: How does this make you feel? |
|
|
Eric P. wrote:
| Quote: | John Mashey wrote:
Eric P. wrote:
It took a few years for OSs to really take advantage of this, but they
have since the mid-1990s. I don't recall what all the DEC Alpha folks
did, but they certainly understood the issues. Put another way, there
were multiple groups of people who were thinking about widespread use
of multiple page sizes at least 15 years ago, and implementing it in
software at least 10+ years ago.
The docs I read gave me the impression that this was tough to do
transparently on the fly.
For example, most code pages are process sharable. To upgrade two
contiguous and properly aligned 8 KB pages to one 16 KB page across
all accessing processes requires allocating a 16 KB frame, copying the
content from the two 8 KB pages, patching all the process page tables
that referenced the small frames, shooting down the changed TLB entries,
and releasing the two small frames. And this has to work in an SMP
environment with those other processes possibly executing on a
different cpu while the page table is being restructured on the fly.
Like I said, I was under the impression this is tough.
So if they have accomplished this, it is impressive.
Ok, but this does not necessarily imply that the OS supports a _dynamic_
page size, selected and changeable on the fly. For example
a boot time selected page size would suffice, with an API
call for apps to query the current system page size.
|
Actually, that turns out not to be true...
Google: irix page dplace OR irix pagesize_data OR altix dplace
In practice, you don't have to do a lot of dynamic changes, but it
really helps to be able to select pagesizes at various times and
differently for different programs. http://www.ukaff.ac.uk/hints.shtml
is a web page offering advice to its users, for example. There is a
hierarchy of advice:
1) Sysadmin can set default page sizes for system.
2) Users can give compile-time options, so that page-size wishes are
expressed in the binary object file.
3) Users can set environment variables PAGESIZE_DATA and
PAGESIZE_STACK,
which allows them to easily experiment, even using somebody else's
binary.
4) The dplace(1) command can modify the pagesize settings before
executing a command, although its original reason for existence was to
handle ccNUMA memory placement.
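One way to picture that hierarchy is as a chain of overrides. The sketch below is not how IRIX resolves it internally: `resolve_page_size` is a hypothetical helper, and both the "later level wins" ordering and the kilobyte units are assumptions made for illustration. Only the `PAGESIZE_DATA` name comes from the list above.

```python
def resolve_page_size(system_default_kb, binary_hint_kb=None,
                      env=None, dplace_kb=None):
    """Pick a data page size (in KB) from the four levels of advice.

    Assumption: each later, more specific level overrides the earlier ones.
    """
    size = system_default_kb                 # 1) sysadmin default
    if binary_hint_kb is not None:           # 2) wish recorded in the binary
        size = binary_hint_kb
    if env and "PAGESIZE_DATA" in env:       # 3) environment variable
        size = int(env["PAGESIZE_DATA"])
    if dplace_kb is not None:                # 4) dplace-style override
        size = dplace_kb
    return size

default_only = resolve_page_size(16)
with_env = resolve_page_size(16, binary_hint_kb=64,
                             env={"PAGESIZE_DATA": "1024"})
```

The useful property is the one described above: a user can change level 3 and rerun somebody else's binary without touching levels 1 or 2.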
Code text-size is the easy one, because it's fixed, regardless of who
is sharing it, and in any case, for most programs that care about this
stuff, code size << data size. SGI Origins were at 1 Terabyte of
main memory by 2001, and I think individual Altix systems go up to
24TB. I know there are customer systems actually out there with 10TB
or more. Sometimes an individual program (parallelized across
processors) uses almost the entire memory for one problem.
It's quite easy for big fluid dynamics codes to use arbitrarily-large
amounts of data; it's perfectly reasonable that a 1-10TB data usage
could be incurred by a program whose code was 1-10MB, i.e., the data is
on the order of a million times bigger than the code.
Anyway, in practice, one doesn't need the full generality of having the
OS figure out dynamic combination/splitting of pages, but it is clear
that one size doesn't fit all, even on the same machine. Especially
with the environment variables, it doesn't take much work to try a few
combinations to see if it makes a difference.
It's been a long time, so I don't recall the details, or how much
experimentation (if any) there was with full-dynamic page-size
handling. Note that the ccNUMA versions of IRIX already did a lot of
dynamic page-migration, and that was already tricky enough. |
|
|
 |
Eric P.
Guest
|
Posted:
Sat Jul 02, 2005 10:46 pm Post subject:
Re: How does this make you feel? |
|
|
John Mashey wrote:
| Quote: |
Eric P. wrote:
Ok, but this does not necessarily imply that the OS supports a _dynamic_
page size, selected and changeable on the fly. For example
a boot time selected page size would suffice, with an API
call for apps to query the current system page size.
Actually, that turns out not to be true...
Google: irix page dplace OR irix pagesize_data OR altix dplace
In practice, you don't have to do a lot of dynamic changes, but it
really helps to be able to select pagesizes at various times and
differently for different programs. http://www.ukaff.ac.uk/hints.shtml
is a web page offering advice to its users, for example. There is a
hierarchy of advice:
1) Sysadmin can set default page sizes for system.
2) Users can give compile-time options, so that page-size wishes are
expressed in the binary object file.
3) Users can set environment variables PAGESIZE_DATA and
PAGESIZE_STACK,
which allows them to easily experiment, even using somebody else's
binary.
4) The dplace(1) command can modify the pagesize settings before
executing a command, although its original reason for existence was to
handle ccNUMA memory placement.
Code text-size is the easy one, because it's fixed, regardless of who
is sharing it, and in any case, for most programs that care about this
stuff, code size << data size. SGI Origins were at 1 Terabyte of
main memory by 2001, and I think individual Altix systems go up to
24TB. I know there are customer systems actually out there with 10TB
or more. Sometimes an individual program (parallelized across
processors) uses almost the entire memory for one problem.
It's quite easy for big fluid dynamics codes to use arbitrarily-large
amounts of data; it's perfectly reasonable that a 1-10TB data usage
could be incurred by a program whose code was 1-10MB, i.e., the data is
on the order of a million times bigger than the code.
Anyway, in practice, one doesn't need the full generality of having the
OS figure out dynamic combination/splitting of pages, but it is clear
that one size doesn't fit all, even on the same machine. Especially
with the environment variables, it doesn't take much work to try a few
combinations to see if it makes a difference.
It's been a long time, so I don't recall the details, or how much
experimentation (if any) there was with full-dynamic page-size
handling. Note that the ccNUMA versions of IRIX already did a lot of
dynamic page-migration, and that was already tricky enough.
|
Dropping the general split/merge features would seem to help.
It gives 95% of the functionality and eliminates nasty design issues.
But I'm not so sure it can be completely eliminated by decree.
- My intuition tells me that the free list would degenerate
into a set of mostly small 4 KB frames. Assuming a binary buddy
free frame manager, and applying a set of random allocate and free
requests, then after a relatively few iterations I suspect
there would be no large frames left in the pool. Because the
working sets are randomly ordered then trimming the WS list
would not likely produce significant frame coalescence.
This tells me that the OS will likely wind up dealing with
small pages anyway. Without page coalescence programs would
degenerate into small page sizes.
- If a large frame is desired but not available then the OS cannot
wait for one to show up because there is no guarantee that would ever
happen. So the OS must be prepared to assign a small page.
This means that one can associate a size hint with a virtual memory
region/section, but cannot expect it to be obeyed.
- The start and end of virtual memory regions are not necessarily
aligned and sized for convenient large page assignment. So the
OS must deal with fiddly bits at either end. This means a region
can consist of a variety of page sizes anyway.
- If a large page is outswapped, ideally to a single large swap block,
there is no guarantee that a large page will be available to
swap it back in. If there were only smaller pages available then
the OS has to split up the swap block.
- Outswap file space could also degenerate into a pool of small
free blocks and not necessarily be contiguously available for outswap
of a single large page. So the OS must be prepared to scatter a large
page to a set of small swap blocks or a deadlock might result.
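The "size hint that may not be obeyed" and "fiddly bits at either end" points can be sketched together. This is only an illustration: `carve` and its `large_available` callback are hypothetical, and the region bounds are assumed to be multiples of the smallest page size.

```python
KB = 1024

def carve(start, end, sizes, large_available=lambda size: True):
    """Cover [start, end) with pages, preferring the largest aligned size.

    `sizes` lists the supported page sizes; the smallest is assumed to be
    always obtainable, larger ones only if `large_available(size)` says so.
    """
    sizes = sorted(sizes, reverse=True)
    smallest = sizes[-1]
    if start % smallest or end % smallest:
        raise ValueError("region must be a whole number of small pages")
    pages = []
    addr = start
    while addr < end:
        for size in sizes:
            fits = addr % size == 0 and addr + size <= end
            if fits and (size == smallest or large_available(size)):
                pages.append((addr, size))
                addr += size
                break
    return pages

# A region that is neither aligned nor sized for 16 KB pages.
mixed = carve(12 * KB, 44 * KB, [16 * KB, 4 * KB])

# The same region when no large frame can be found: the hint is ignored.
small_only = carve(12 * KB, 44 * KB, [16 * KB, 4 * KB],
                   large_available=lambda size: False)
```

In `mixed` only the aligned middle of the region gets a large page; the ends fall back to small pages, so the region holds a variety of sizes either way.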
If only page splits can take place but no merges then over time
the system would degenerate into just small page allocations.
So it appears to me that such an OS cannot avoid at least minimal
split-merge support. It might be easier if merge could only happen
on inswap because then it could be dealt with as just a variation
on page fault clustering. Split could only occur on outswap.
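The free-list intuition is easy to try out with a small binary buddy allocator. The sketch below is a toy experiment under made-up conditions (fill memory with single frames, then free a random half), not a model of any real kernel's frame manager; `Buddy` and `simulate` are hypothetical names.

```python
import random

class Buddy:
    """Binary buddy allocator over 2**max_order base frames."""

    def __init__(self, max_order):
        self.max_order = max_order
        self.free = {o: set() for o in range(max_order + 1)}
        self.free[max_order].add(0)

    def alloc(self, order):
        # Find the smallest free block that is large enough.
        for o in range(order, self.max_order + 1):
            if self.free[o]:
                base = min(self.free[o])
                self.free[o].remove(base)
                # Split down, returning each upper half to its free list.
                while o > order:
                    o -= 1
                    self.free[o].add(base + (1 << o))
                return base
        return None

    def release(self, base, order):
        # Merge with the buddy for as long as the buddy is also free.
        while order < self.max_order:
            buddy = base ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            base = min(base, buddy)
            order += 1
        self.free[order].add(base)

    def free_frames(self):
        return sum(len(blocks) << o for o, blocks in self.free.items())

def simulate(max_order=10, seed=0):
    rng = random.Random(seed)
    pool = Buddy(max_order)
    held = []
    while True:                      # fill memory with single frames
        frame = pool.alloc(0)
        if frame is None:
            break
        held.append(frame)
    rng.shuffle(held)
    for frame in held[:len(held) // 2]:   # free a random half
        pool.release(frame, 0)
    return pool

pool = simulate()
histogram = {o: len(blocks) for o, blocks in pool.free.items() if blocks}
```

Printing `histogram` shows how the free half of memory is distributed over block sizes, which is a quick way to check whether random frees coalesce into large frames or not.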
Eric |
|
|
 |
Steve
Guest
|
Posted:
Sun Jul 03, 2005 12:15 am Post subject:
Re: How does this make you feel? |
|
|
John Mashey wrote:
| Quote: | Steve wrote:
True enough. But, I don't believe that I have necessarily invented
something new, but rather identified a new wrinkle on an old problem.
Inasmuch as what I have described collides with previous ideas, I have
thought that my particular approach to making general purpose registers
more complex, and then applying that complexity to enhance the
functionality of the general instruction set, makes for something of an
innovation. Expanding that to VM and page table management would be
icing on the cake, as it were.
Whether there is any real end-user utility to being able to issue an
XOR instruction that applies to a 1M range of VM; or whether being
able to add a large column of integers, perhaps in steps; or whether
being able to do large mmoves makes practical sense as a single
instruction, is something I cannot analyse with sufficient rigor at my
current level of knowledge. It _seems_ to me that the semantics of a
large fraction of assembler instructions could be expanded to take
advantage of a more complex address decoder, but I have not examined
this in detail. Yet. But I should.
Yes, but you really need to go study a bunch more, else you are wasting
your time. [This is not to discourage anyone from having new ideas,
it's simply an observation that there is a minimal level of
hardware+software knowledge needed before proposing ISA extensions is
more than a waste of time.]
|
Obviously I am not really familiar with all of the relevant issues in
this discussion as I don't think about this stuff every day. So bear
with me here as I address your comments as best I can.
| Quote: | It's not a question of colliding with previous ideas, it's that:
a) An amazing number of different things have been tried over the
years, which is why it's important to know the history. Things
somewhat like this were done 30-40 years ago, in extremely popular
computers ... and have since disappeared, for good reasons.
|
The computing world was different then. Resources were constrained in
ways that are no longer relevant now. Today, memory bandwidth and IO
bandwidth are the biggest bottlenecks for many applications (that do
not spend most of their time waiting on user input) that are not also
working with very large data sets. CPU designers then had constraints
that are now less important because of the Moore's law phenomenon.
But I certainly agree that it is useful to avoid making the same
mistakes people made previously.
| Quote: | b) Of the various combinations of ISA x cache design x MMU design x
memory system x systems design x OS x languages, some work and some
don't.
c) The commercial systems of the 1950s and 1960s often supplied various
memory-to-memory variable-length operations. This lives on in the IBM
S/360 (circa 1964) and its descendants:
|
.... which is now apparently z/OS. Can't say I know what they're doing
with the zSeries, but it's bound to be interesting.
| Quote: | SS instructions: 2 memory addresses, and a length (1-256-bytes). These
are on arbitrary byte boundaries, and hence pretty useful for COBOL,
PL/I, etc. The original operations included:
NC   AND Character
CLC  Compare Logical Character
MVC  Move Character
OC   OR Character
XC   EXclusive OR Character
(There are a bunch more including Translate (and Test), and
decimal-string operators with 2 lengths.)
You can use an EXECUTE instruction, with such instructions as targets
to supply a dynamically-computed length field at run-time, i.e.:
      EX   R1,move           low-order byte of R1 is OR'd into a copy of "move"
      ....
move: MVC  0(0,R2),0(R3)     copies some number of bytes from 0(R3) to 0(R2)
d) S/370 (circa 1970) added some more, including (essentially) the
"bcopy" or "memcpy" instruction MVCL, which is very close to what
you've suggested, but actually works usefully:
|
If by 'actually works usefully' you mean that it is
implemented in a functioning system, then I agree.
| Quote: | MVCL: Move Long:
MVCL Ra, Rb: Ra and Rb each specify a register-pair, where the first
register gives a memory address, and the second gives a byte-count (up
to 16MB). This copies the data from 0(Rb) to 0(Ra), and if the second
length is shorter, it uses the high-order byte of [Rb+1] to pad. This
allows a nice zero-fill: just set [Rb+1] = 0.
|
Ok
| Quote: | This is carefully designed to allow interrupts to happen, because no
one is willing for interrupts to be blocked while 16MB of memory is
zeroed/copied. That requires updating all 4 registers (adding to the
addresses, subtracting from the lengths), so that the instruction can
be restarted correctly.
|
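A simplified model of that restartable behaviour may help. It is a sketch only: the real instruction also sets a condition code and checks for destructive overlap, and it keeps the pad byte in the high-order byte of a length register rather than in a separate field. `mvcl_step`, the `regs` dictionary, and the `budget` parameter are hypothetical.

```python
def mvcl_step(mem, regs, budget):
    """Advance a modeled MVCL by at most `budget` bytes; return True when done.

    All progress is written back to `regs` (both addresses, both lengths),
    so the instruction can simply be re-issued after an interrupt and will
    pick up where it stopped.
    """
    while budget > 0 and regs["dst_len"] > 0:
        if regs["src_len"] > 0:
            mem[regs["dst"]] = mem[regs["src"]]
            regs["src"] += 1
            regs["src_len"] -= 1
        else:
            mem[regs["dst"]] = regs["pad"]   # source exhausted: pad the rest
        regs["dst"] += 1
        regs["dst_len"] -= 1
        budget -= 1
    return regs["dst_len"] == 0

mem = bytearray(b"hello" + bytes(11))        # 16 bytes of "memory"
regs = {"dst": 8, "dst_len": 8, "src": 0, "src_len": 5, "pad": 0x2E}
interrupts = 0
while not mvcl_step(mem, regs, budget=3):
    interrupts += 1      # an interrupt handler could run here
```

Because the only state is in the four registers, nothing extra has to be saved across the interrupt, which is the property the quoted description emphasizes.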
Fine; you are describing a memory to memory copy operation implemented
in a CPU that has one execution pathway. I understand that you cannot
usually afford to accumulate pending interrupt signals while waiting
for some long operation to complete, but hardware CPU threads would
seem to obviate this concern. So long as a hypothetical MVCL
instruction does not take over the CPU core to such an extent that it
blocks the normal execution of other hardware threads then servicing
interrupts is unaffected. And of course, nobody in their right mind
would even dream of doing a large copy in the top-half of their ISR.
I don't see that this is a necessarily valid concern today.
| Quote: | The manual description is a tightly-worded 2 solid pages to covert all
the cases that can happen.
This has no restrictions of alignment, and barely any of size (16MB),
and survives exceptions without weird extra state. These instructions
essentially use register-pairs as byte-string-descriptors, and are
relatively straightforward to use.
|
I suppose so. However, I don't suppose it is necessarily all that
hard to set up your registers with a two-instruction sequence
MOVE ea1, Axm
MOVE ea2, Ax
In practice, within at least the HLL equivalent of a
functional block, the succeeding references to the Ax register will
utilize the value of 'Axm'. I don't see that as being particularly
difficult, particularly when the code generation will usually be done
by the compiler.
| Quote: | the compare version is:
CLCL Ra, Rb compares long strings
but they left the logical operators out [no NCL, OCL, XCL]
That gives "memcmp" directly.
|
Ok, what you have described is basically indexed register memory access
with a range. But practically speaking, specifying a change in the
behavior of a register by way of a control register has different
semantic and architectural implications to the whole CPU. Direct
comparisons to existing instructions and addressing modes narrows the
scope of this discussion a little too much.
| Quote: | e) The DEC VAX (circa 1978), provided a similar, albeit perhaps even
more baroque set of instructions, but certainly including direct
equivalents of MVCL (VAX MOVC) and CLCL (VAX CMPC).
f) Hence, the most successful mainframe ISA, and the most successful
minicomputer ISA both had features that essentially used address:length
descriptors to do long memory operations.
g) And then, these features (essentially) disappeared from new ISA
designs, including those for most microprocessors. While there are a
few memory-to-memory designs done in the last 10 years or so, I don't
think they are among the really popular designs. The closest popular
one would be X86's combination of REP + MOVS, but that's not the same
thing at all.
One might wonder why that happened...
|
Economics? Among other things, popular CPUs have always needed to cost
much less than mainframe CPUs, otherwise the general public would never
have bought as many home and small business computers as they did.
Costs have fallen, however, and what you can buy today for $1000 would
have been less than a wet dream to a programmer of the 1960's.
| Quote: | + Perhaps the later designers were just dumb.
- I've known lots of them, and I doubt it.
|
There's been a *lot* of emphasis on clock-speed improvements in the
world of microcomputers. That's certainly not the only reason why
people may not have looked closely at the idea of widening their
registers in a non-intuitive way, but I suppose there are only so many
people doing CPU architecture design, and a limited number of
architectures; and so there were a limited number of research avenues
that could be explored at any given time. Market forces would have
affected this situation as well: companies were usually expected to
produce a saleable product and could probably only afford to allocate a
small fraction of their expertise to pure research projects that
wouldn't necessarily produce near-term payoffs.
| Quote: | + Perhaps the later designers were ignorant of the S/370 and VAX.
|
I couldn't say. I don't know what they teach these days in computer
science and engineering courses.
| Quote: | - I suppose that might possibly be true now, but it certainly wasn't
in the 1970s, 1980s, and early 1990s. Most serious microprocessor ISA
designers were quite familiar with these, particularly since some of the
later designers had implemented the earlier ones and were quite familiar
with them. I.e., the IBM 801 RISC folks certainly knew S/370, and the
DEC Alpha folks knew VAX. Certainly most people in this field had at
least studied these ISAs, or more likely, had used either or both of
these systems for many years.
+ Maybe C and UNIX distorted CPU design, especially with RISCs
- Possible, but as I've posted various times, various RISC CPU
designers definitely cared about non-C languages and non-UNIX operating
systems.
|
Most non-C languages would have been targeted to UNIX systems anyways,
but there's another factor. C is one of the few languages ever in
common use that revealed details of the underlying architecture to the
programmer during normal use. Whatever system you might be using to
write applications in, say, Ruby, you aren't going to need to worry about
the CPU architecture. Plus, existing ISAs (as you put them) work well
enough for most people and most applications, so there are relatively
few people who might have cause to complain, or to think about the
issues.
| Quote: | + Maybe later designers' insistence on measuring performance impacts
versus implementation costs caused them to ignore potentially-wonderful
features whose only problem was that they needed a new OS and new
language to make use of them.
- Always possible. Personally, I'd be delighted to see a brand-new
ISA + OS + language combination that gave real breakthroughs. However,
the track record for such things has rarely been good, although I still
admire the thought behind the Burroughs B5000 ... but that was a long
time ago.
OK, so why?
|
Good question.
| Quote: | - It is no accident that it takes 2 pages to describe MVCL.
- It is a common mistake to count instructions executed, rather than
cycles consumed. More than one complex design has had powerful
instructions that were outperformed by sequences of simpler ones: S/360
MVC was sometimes beaten by Load Multiple/Store Multiple sequences.
|
The approach I am advocating here does happen to have the property of
reducing code-segment memory accesses, and reducing cache use. Memory
bandwidth is an issue for non-IO bound tasks, and so reducing off-chip
memory accesses for code can only speed things up. Perhaps in the past
memory bus speeds were not so far out of sync with processor clock
speeds as they are today.
| Quote: | - I've posted numerous times about the care needed to do
memory->register or register->memory operations when the addresses can
cross cache-line or page boundaries. They're rife with special cases,
implementation bugs, and extra cost ... that designers are loath to
pay, because they cost space and sometimes gate delays, and in
practice, don't seem to yield proportionate extra performance. This is
not to say there might not be a role for these, just that they don't
seem to mesh very well with the main lines of CPU design.
|
You must know that you can get around non-aligned memory accesses by
expanding the offending instruction into a sequence of aligned
accesses. I would imagine that this might happen in microcode in some
designs, and is probably a really annoying thing to deal with. But
this is a general problem that all CPUs must handle unless they
expressly forbid non-aligned accesses.
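As a concrete picture of that expansion, here is a sketch of one unaligned 32-bit load built from two aligned ones. It assumes little-endian byte order and a 4-byte word, and the function names are made up; real hardware or microcode would do the same shift-and-merge in logic rather than in software.

```python
def load_aligned_word(mem, addr):
    """Read one naturally aligned little-endian 32-bit word."""
    if addr % 4 != 0:
        raise ValueError("address must be 4-byte aligned")
    return int.from_bytes(mem[addr:addr + 4], "little")

def load_unaligned_word(mem, addr):
    """Read a 32-bit word at any byte address using only aligned loads."""
    offset = addr % 4
    base = addr - offset
    lo = load_aligned_word(mem, base)
    if offset == 0:
        return lo                      # already aligned: one access
    hi = load_aligned_word(mem, base + 4)
    shift = 8 * offset
    # Low bytes come from the top of the first word, high bytes from the
    # bottom of the second word.
    return ((lo >> shift) | (hi << (32 - shift))) & 0xFFFFFFFF

mem = bytes(range(16))
values = [load_unaligned_word(mem, a) for a in range(12)]
```

The misaligned case costs two memory accesses plus the merge, and the second access may land in a different cache line or page, which is where the special cases mentioned in the quoted text come from.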
Back to memcpy... Non-aligned accesses, as well as non-aligned copy
are probably really annoying, but the problem must already be solved.
Every PCI or AGP videocard on the market probably has a general purpose
hardware engine for arbitrary copies. Hooking into such an engine for
compares or logical operations shouldn't be impossible. If such
devices are not commonly found on CPUs because their instruction sets
don't indicate them, that would seem to be an artifact of the 'main
lines of CPU design' to date.
| Quote: | - The implementation issues for memory-memory are even worse, at least
if added onto a typical CPU core. Both S/360 and VAX were designed for
microcoded implementations, and the extra cost might not be too bad,
although it was notable that some of the most cost-effective
implementations [360/44, DEC microVAX] didn't implement all of the
variable-length instructions.
|
I have never used DEC machines, so I cannot really comment here on
this.
| Quote: | - Most recent CPU ISAs are designed to allow cost-effective
implementations of pipelining and usually multiple-issue. As I've
noted before, complex memory-addressing is one of the most
problematical features to have in high-speed implementations. [My
usual example is the claim that the extra address modifiers added going
from MC 68010 -> 68020 were a mistake in this regard ... and they did
disappear in the later ColdFire derivatives.]
In most designs, it may be possible to pipeline the simpler operations,
but usually the complex multiple-memory operand things end up taking
over all the crucial machine resources, and stop the pipeline in its
tracks. They also tend to serialize operations to keep complexity
down. *Sometimes*, with enough work on the design, the hardware can
indeed do better if it knows an entire address+length in one fell swoop.
For example, in a uniprocessor with write-back caches, something like a
MVCL can avoid fetching a cache line that is about to be completely
overwritten.
|
Well, there's nothing stopping the programmer from informing the CPU
about memory access patterns, but it is rarely done.
And as far as an architecture that supports some number of hardware
ins. threads, the memory bus arbitration can be fairly complicated.
You might even imagine a small bus arbitration engine that could be
configured with different policies via microcode. That way, the OS
could configure the system to limit the burst length to a sane value
for each hardware thread, thus alleviating some concerns of
stalling. I suppose the actual instruction mix encountered in the real
world varies greatly, though, and I suppose it is difficult to
anticipate every possible ins. mix at the design phase.
As for the cache, well, another poster mentioned the WH64 ins. from the
Alpha. I think I've encountered a similar instruction when I read
about the AMD K6-2. If the compiler or the programmer can inform the
CPU when it doesn't need to invalidate a cache line (which might be the
case during a memcpy), or alternately inform the CPU that it *should*
prefetch something, then that is where those decisions are best made.
CPUs may be able to do decent branch prediction in some cases, but
anticipating memory access patterns is much harder, I suspect.
| Quote: | - And finally, in many designs, the path:
1) fetch register(s)
2) add displacement (or index or shifted index)
3) provide address to MMU and cache
is pretty important.
|
As in being a rather common idiom and should be as fast as possible.
| Quote: | I've heard fierce arguments over features that
might cost a gate delay or two in the first two steps in this path.
In some cases, the *only* addressing mode allowed is (register), i.e.,
no displacement or indexing. AMD29000 did that, for example. Some of
us (like MIPS) allowed only displacement(base). Others allow a few
more, and there is legitimate room for different approaches, but this
is the kind of stuff serious designers worry about.
The *last* thing most such designers would want is a feature, such that
|
Are you completely sure about that?
| Quote: | - In order to access just the address in the register (step 1)
- The CPU has to fetch another register (the "Axm")
- The CPU has to do a *variable* shift of the address register
depending on the value of the Axm just fetched. [A fixed shift is easy
and cheap, which is why some ISAs do shifted-index, say of 1, 2 or 3
bits.] A large arbitrary variable shift is not so cheap.
|
I suppose your complaint here is that the effective width of each
register is possibly doubled, and that there's more glue than you would
expect attached to each one. A crossbar attached to each register
and a serialized access process would add a lot of complexity and delay to
each register access, but would not increase the physical width of the
register. Without a crossbar, I guess the solution might be to shadow
the register and modifier to a more usable result on store.
| Quote: | - And if that's not bad enough, the net effect is that *almost any*
instruction seems like it could turn into a multi-memory-operation, and
this is only discoverable in the address generation stage, not in
instruction decode. This is really bad news in aggressive pipeline
designs, such as speculative out-of-order ones, as it multiplies the
resources needed to track instructions, and complexifies the load/store
units. From seeing what went on with S/360/VAX designs, I think this
feature would make decent pipelining expensive in the way that most
irritates CPU designers, i.e., that there is a lot of complexity needed
to cope with rare cases.
|
Um. I think the CPU would know, from the content of the register
descriptor, when the opcode arguments are resolved during instruction
decode. The set-up to handle a complex address should occur in-line
with this stage of the instruction execution, as would syntactic and
semantic validation. Yes, I guess this would irritate the guy who has
to design the ins. pipeline, but unless I completely miss the mark,
pipelining is a response to on-chip wavefront propagation speeds vs.
clock-cycle frequencies. In a design that relies more heavily on CPU
threading, pipelining could be much less important. Of course, I
don't really know what the timing numbers are like when you measure the
factors related to pipeline economics, and so I could be completely
full of crap here.
| Quote: | - Aggressive current CPUs have deep load/store queues of address and
data that are "in flight", and need to make sure the right things
happen in all the cases. Multiple memory operations don't help this
any, and may make it a lot worse.
|
Making a ranged instruction's operation atomic would simplify coherency
problems, i.e. you only stall another in-flight instruction if its
memory accesses collide with an instruction running in another context.
This raises the potential for deadlocks and delays in interrupt
handlers, but might not otherwise pose a serious risk...
| Quote: | - Finally, it is unclear that this feature helps much, at least if
added to typical current designs. As far as I can tell, the address
starts on a power-of-2 boundary, which means it's not directly useable
for memcpy. With the possible exception of
writeback-cache-optimization, it's hard to see how this is much faster
than the straightforward instructions in current CPUs, where people
have worked very hard to optimize loads and stores and overlap them.
|
memcpy is a special case. This point can be moot because the compiler
and the operating system's memory allocator will arrange things so that
data structures are usually aligned. Image processing tasks, say, are
not always helped by this, but that's life. But what if the CPU
borrows an engine from the co-processor world? The whole issue
becomes largely moot, I think.
| Quote: | - Also, of course, the feature, as described, is non-interruptible.
MVCL and its siblings were interruptible for good reason...
|
Dealt with above, hopefully with sufficient detail.
I'll certainly have a look at those discussions, but I thought I would
clarify my position first. Hopefully I have not completely
misunderstood your objections and have described a plausible scenario
for the kind of registers I suggest. Perhaps there is no way an
existing design could be easily adapted; that is not my concern. If it
is entirely impractical to design registers this way, and the
memory access paradigm implied is too unwieldy for a gain that is too
small, then I will find out when I learn more about the field. But it
does not seem out of the question with what I know.
Everything depends on the logic and architecture specifics that are
required to support complex addressing schemes for arbitrary opcodes.
Without sitting down and really hammering out the details of an
instruction set and the specifics of its addressing modes; and without
proposing a specific on-chip arrangement of resources, it is going to
be difficult to say anything conclusive about my hypothetical approach.
As a designer, you apparently have some serious objections off the top
of your head. I respect your expertise, but what you have written does
not seem to present any show-stoppers. Perhaps a more detailed study
would show otherwise, and perhaps I don't know enough about CPU design
to comment meaningfully. But I don't think I'm totally off the mark.
Regards,
Steve |
|
Andy Nelson
Guest
|
Posted:
Sun Jul 03, 2005 12:15 am Post subject:
Re: How does this make you feel? |
|
|
| Quote: | John Mashey wrote:
In practice, you don't have to do a lot of dynamic changes, but it
really helps to be able to select pagesizes at various times and
differently for different programs. http://www.ukaff.ac.uk/hints.shtml
|
heh. I'm the instigator of much of the experimentation on page size issues
that you quote on the ukaff web page, though one of the system folk there
(Richard West) actually wrote the documentation you cite. I've now moved
on mostly and am doing relatively little computing there. Most of my experience
with page stuff comes from using perfex and adjusting my code based on its
output. Several years ago, I made a brief response to a post of yours
(probably you've forgotten by now :) to the effect that perfex's authors
were truly Blessed By God, since they'd helped me speed up my code so much
by accounting for and fixing various page and cache effects.
I'm amused to see that the circle is coming around the other way and glad
to see that some of my results have made it onto the radar screens of folks
in the hw and os design end of things. Perhaps in future generations of
machines I'll get 'exactly what I want in a machine' (snort! I'm greedy
and will always want more :-)
I'll comment directly on your comments here, and some of Eric P's below.
From my experience both are true.
| Quote: | is a web page offering advice to its users, for example. There is a
hierarchy of advice:
1) Sysadmin can set default page sizes for system.
|
Or, more precisely, turn on/off the OS support for various page sizes,
and to set the initial distribution of pagesizes (i.e. 30% 64k,
30% 1m, 30% 16m etc etc) when the machine first boots up. The
distribution tends to get scrambled after that, and it doesn't seem
possible to get it back all the way without a full reboot.
| Quote: | 2) Users can give compile-time options, so that page-size wishes are
expressed in the binary object file.
|
As implemented on irix/mipspro, this sort of amounts to turning on/off
permission to use big pages or not, as I understand the options anyway.
Basically this:
f90 -bigp_on code.f
| Quote: | 3) Users can set environment variables PAGESIZE_DATA and
PAGESIZE_STACK,
which allows them to easily experiment, even using somebody else's
binary.
|
My experience is that these give hints to the OS about what page sizes
to use and that they are followed if it is 'easy' but not if it is hard,
especially if the page sizes are not immediately available. What works a
little better is using dplace:
| Quote: | 4) The dplace(1) command can modify the pagesize settings before
executing a command, although its original reason for existence was to
handle ccNUMA memory placement.
|
because it has an option to set a wait time for page coalescence.
It still doesn't work very well though because what happens is that
the executable sits idle, sometimes for 20-30 minutes depending on how
much memory you ask for and how long you tell it to wait for coalescence.
It also seems to help if you have a dummy program like this one, for
whatever memory size you want to grab:
      parameter(i_oneGB=<integer for 1GB array>)
      double precision array(i_oneGB)
      do i=1,i_oneGB
         array(i)=0d0
      enddo
      end
that you run under dplace several times before starting the real run.
It seems to prime the page distribution a lot better. You get
more after the 2nd/3rd iteration of such a run, but it still will not
recover all of the pages back to the initial boot time distribution.
I've also, for reasons unknown, watched dplace totally ignore my
requests as well, given me only whatever pages were available without
any wait. Both the wait and the wrong page size distribution are
major frustrations.
I wouldn't mind even the waits if it worked perfectly (or even 'better')
because getting my choice of page distribution can make a difference of
a factor of 3-4 or so in speed for my code. For a week long run that does
make a bit of a difference :-) It makes even more difference when you go
back to a time allocation committee and have to explain why your job
didn't finish in the time they allocated you.
BUT: during these waits, the cpu sits idle too. 0% usage for user and sys
etc. Once in a while it grabs a bit more memory, not always of my
desired page size, but most of the time it sits idle. I would like
to see it actually working for me and trying to grab and coalesce
pages as much as possible rather than sitting there doing nothing.
What makes this especially frustrating is that supposedly I have
sole access to this memory through a batch allocation. That means that
whatever messed up page size distribution was there from whatever
was running there before, should be completely discarded/discardable,
and sent back to some sort of pristine original/boot time state.
I think this has to do with other threads simmering in comp.arch right
now though, called 'garbage collection'. From what I read in those
threads, garbage collection is a rather large mess that even Oscar
the Grouch is well advised to stay out of. That said, from my
reading of comp.arch threads, it gets easier if you don't have
requirements about not using various amounts and kinds of scratch
resources.
Somehow, the OS ought to be told (or have the ability to be told)
that rather than running in some nook or cranny of cpu time or memory,
it can take a lot of cpu time if it wants, as well as use various bits
of my memory allocation as its temp space, as long as it produces
the results I want in the end. Right now, that doesn't seem to be
the case.
Eric P:
| Quote: | Dropping the general split/merge features would seem to help.
It gives 95% of the functionality and eliminates nasty design issues.
|
Not a hw guy enough to be sure about the exact definitions of
'general' split/merge stuff, as opposed to what is implemented
on machines I've run on, but anyway. I'm just commenting on what I
have seen. You folks may have to divine what that means in terms
of generality/specificity of the designs.
| Quote: | But I'm not so sure it can be completely eliminated by decree.
- My intuition tells me that the free list would degenerate
into a set of mostly small 4 KB frames. Assuming a binary buddy
|
Yup. That's what happens, though on irix/O3k's the smallest
page size is usually 16k. You usually get sort of 1/2 the memory
in 16/64k pages after a long uptime, with some larger sizes;
in the cases I've seen, 1m and 16m were turned on.
| Quote: | - If a large frame is desired but not available then the OS cannot
wait for one to show up because there is no guarantee that would ever
happen. So the OS must be prepared to assign a small page.
This means that one can associate a size hint with a virtual memory
region/section, but cannot expect it to be obeyed.
|
That 'hinting' stuff appears to be what is actually done with the
environment variables in irix. To a lesser extent also with the
dplace commands. It says to grab a large page if available, wait
for a while for one to be created if not, and fall back to smaller
pages after a while.
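The "grab it if available, wait a while, then fall back" behaviour can be sketched roughly as below. This is only a toy model in Python under my own assumptions: `pick_page_size`, `free_pages` and the `coalesce` callback are hypothetical names, not anything from irix or dplace.

```python
# Toy model of a page-size hint (all names are hypothetical, not IRIX APIs).
# Policy: use the wanted size if a page is free, otherwise let the OS try to
# coalesce up to max_wait times, then fall back to the largest smaller size.

def pick_page_size(free_pages, wanted, max_wait, coalesce):
    """Return the page size actually granted for one allocation.

    free_pages: dict mapping page size in bytes -> number of free pages
    wanted:     requested page size in bytes (the hint)
    max_wait:   how many coalescence attempts we are willing to wait for
    coalesce:   callable(free_pages, size) -> bool, tries to build one page
    """
    attempts = 0
    while True:
        if free_pages.get(wanted, 0) > 0:
            free_pages[wanted] -= 1
            return wanted
        if attempts >= max_wait or not coalesce(free_pages, wanted):
            break  # waited long enough, or nothing more can be merged
        attempts += 1
    # The hint could not be honoured: take the largest smaller free page.
    for size in sorted((s for s in free_pages if s < wanted), reverse=True):
        if free_pages[size] > 0:
            free_pages[size] -= 1
            return size
    raise MemoryError("no free pages of any usable size")
```

With a `coalesce` that never succeeds, a request for 16m on a node holding only 16k pages simply comes back as 16k, which is the "hint, not guarantee" behaviour described above.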
| Quote: | - The start and end of virtual memory regions are not necessarily
aligned and sized for convenient large page assignment. So the
OS must deal with fiddly bits at either end. This means a region
can consist of a variety of page sizes anyway.
|
I've sort of wondered about this a bit. Perhaps my comments are only
partly related to what you are talking about, but it's the only data
I have to share. Not being an OS/hw guy, I always guessed that
there had to be some alignment issues in order to coalesce pages in any
sensible way. On the other hand, I've seen cases where I had run
several dplace cycles (see above) on a cpu/memory allocation of
8cpus/4GB with results that indicated to me it wasn't taking
as much advantage of that as it could. What I mean is that I ran
with options to grab/make 16mb pages. It would until it ran out.
Then it would start coalescing smaller pages. Lots of 64k pages
would get coalesced into 1mb pages, but almost none would be
made into 16mb pages. At the end I could have 40-50 or so 16mb
pages per node (2GB in one node), 1000 or so 1mb pages per node
and the rest small. If there were no alignment/contiguity issues,
why wouldn't it group some of the 1000 1m pages? Even if there
were, why wouldn't it? There were certainly enough that I don't
think it would even be possible for none of them to be
properly aligned. More than that, through using an irix command
called 'numa_view' I could check the page allocations/size/distribution
of the running jobs. I would frequently see cases where it grabbed
(eg) 1024 consecutive 16k pages. Why not coalesce them into a single
big page?
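For what it's worth, the arithmetic of that last case can be written down as a small check. This is a sketch that assumes (my assumption, not something verified on irix) that a big page must be naturally aligned, i.e. start on a multiple of its own size, and that the run is contiguous in physical memory; `can_coalesce` is a made-up name.

```python
KB = 1024
MB = 1024 * KB

def can_coalesce(start_addr, small_size, count, big_size):
    """True if `count` consecutive small pages starting at start_addr
    exactly cover one naturally aligned page of big_size bytes."""
    covers_exactly = (count * small_size == big_size)
    naturally_aligned = (start_addr % big_size == 0)
    return covers_exactly and naturally_aligned

# 1024 consecutive 16k pages are exactly one 16m page worth of memory...
print(1024 * 16 * KB == 16 * MB)                   # True
# ...but only a run starting on a 16m boundary could be merged as is.
print(can_coalesce(0, 16 * KB, 1024, 16 * MB))        # True
print(can_coalesce(16 * KB, 16 * KB, 1024, 16 * MB))  # False
```

So a run of 1024 pages that is shifted by even one 16k page would not qualify under this assumption, and pages that look consecutive in the virtual address space need not be contiguous physically.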
| Quote: | If only page splits can take place but no merges then over time
the system would degenerate into just small page allocations.
|
I think the merge stuff is what dplace and some logic in the OS
do on irix. Other OS's I've worked with appear to have some
multiple or variable page size handling stuff, but in my (incomplete)
experience not as sophisticated--and as I just outlined above even
the irix version is not as sophisticated/perfect as it needs to be to
avoid user/sysadmin frustration.
Now for some wish list stuff that I as a user would like to see in
future OS's re page handling stuff. Some of this I wrote about
above, and all of it has an obvious bias of a guy who runs HPC jobs,
usually from within a batch handler, so be aware of that. It also
ignores all of your (Eric's) issues about swapping in/out and page
size change issues there.
-There has to be some way of incorporating page size handling stuff
with batch handling stuff. I know vaguely of 'cpusets' a bit,
both in irix somehow and more recently in linux. It seems to me to
be an important thing that if I am going to have exclusive access
to some set of cpus, then I ought to get exclusive access to the
memory associated with those cpus too, especially if I've also explicitly
asked for it in my batch request script. That includes exclusive control
over telling the OS what page sizes to use in that memory. It
seems there ought to be a way of telling the OS to forget about
all the page allocation and page size data structures (whatever
they may be) that it created for some previous job running on those
cpus/memory. Right now, at least for irix where most of my
experience is, that doesn't happen.
-I realize there may be issues about the OS owning some pages
of memory within my cpu/memory allocation, doing whatever it
is that OSs do with their memory. Some of this at least, is probably
that page/pagesize handling stuff, so this is partly a repeat
of the above. There needs to be some way of resetting the resettable
parts of this sort of OS data. There also needs to be some way of telling
the OS to get its grubby paws off of my memory, as much as possible,
migrating it to some other place on the machine (like a boot set of
cpus or something), or if it can't, then to squish it all into one
corner, so I can have as much free/contiguous space available as
possible so as to make/remake big pages from it.
-I'm not clear on exactly how this point works and could be wrong,
but here goes another item: If my job calls some set of library
routines, through some .so sort of mechanism, I suppose that
if someone else's routine later wants to load the same library,
the memory for that is already allocated out of my space. Ok
fine, but what about these issues:
-what if I want small pages and the other guy wants big pages?
It would be nice if the .so mechanism was subordinate to my
batch allocation.
-what if my job ends, and the memory remains allocated out
of what was my memory space because the other guy's job is
not done, and then some future job comes along and allocates
that set of cpus and memory, but wants some other sort of
page size distribution?
Regards,
Andy
--
Andy Nelson Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy Los Alamos, NM 87545 |
|
John Mashey
Guest
|
Posted:
Sun Jul 03, 2005 12:15 am Post subject:
Re: How does this make you feel? |
|
|
Andy Nelson wrote:
| Quote: | John Mashey wrote:
In practice, you don't have to do a lot of dynamic changes, but it
really helps to be able to select pagesizes at various times and
differently for different programs. http://www.ukaff.ac.uk/hints.shtml
heh. I'm the instigator of much of the experimentation on page size issues
that you quote on the ukaff web page, though one of the system folk there
(Richard West) actually wrote the documentation you cite. I've now moved
on mostly and am doing relatively little computing there. Most of my experience
with page stuff comes from using perfex and adjusting my code based on its
output. Several years ago, I made a brief response to a post of yours
(probably you've forgotten by now :) to the effect that perfex's authors
were truly Blessed By God, since they'd helped me speed up my code so much
by accounting for and fixing various page and cache effects.
I'm amused to see that the circle is coming around the other way and glad
to see that some of my results have made it onto the radar screens of folks
in the hw and os design end of things. Perhaps in future generations of
machines I'll get 'exactly what I want in a machine' (snort! I'm greedy
and will always want more :-)
|
Thanks for the note; it's nice to see real experience.
Just a few more comments from me, as this has now strayed far away from
the original topic.
0) Long ago, when we were pushing hard to get a decent range of page
sizes to have hardware support, we had a lot of guesses about what
could be done, and how automatic it could get. The one thing we knew
was that if the hardware couldn't do a good range of page sizes, the OS
would never have a chance to find out.
1) I note that you weren't worried much about page-outs. I would
observe that that's often true of people who do a lot of big HPC jobs,
and who are thus especially interested in large-page issues. No one in
their right mind routinely runs, for example, big CFD codes that are
paging a lot.
2) The real problem is that OS's aren't telepathic and precognitive :-)
There is a constant tension between:
a) Running in an inefficient state
AND
b) Spending resources to clean up the state to be able to run more
efficiently.
And this same tension has occurred numerous times, and it especially
occurs whenever one or more resources are over-subscribed, especially
in large chunks. Running a bunch of small tasks is pretty easy;
running a bunch of small tasks with 1 large one that consumes 50% of
memory is OK; running two tasks, each of which want 51% of memory can
get very ugly.
This shows up at least in CPU scheduling, cache-affinity scheduling,
timing and selection of page-outs, ccNUMA memory affinity scheduling,
and the multiple-page-size coalescence issues.
For instance, you have a large task whose code is paging (a). Maybe it
would be better to (b) flush all its code pages for a while and let
other tasks run, then allocate enough space for the entire code (and
use big pages), and swap the whole thing in, give it big quanta for a
while, then flush it, etc. On the other hand, if this program starts
doing much other I/O (and the other tasks aren't), this may not be such
a good idea.
3) Unfortunately, there are still problems in scheduling and memory
resource allocation that have been with us for at least 30-40 years, so
I suspect the variable-page-size issues (which have been around a
while, but have gotten especially noticed in the last 10 years) ...
will be with us a few years yet.
4) I suspect the answer will be, as it has for most others:
a) The OS does the best it can to do well automatically. Not being
telepathic, it will guess wrong sometimes: it will decide that it
should spend many cycles to rearrange the locations of a task's
memory...just before the task exits :-)
b) The OS needs to be able to accept a hierarchy of hints, and use them
when it can. |
|
Niels Jørgen Kruse
Guest
|
Posted:
Sun Jul 03, 2005 1:51 pm Post subject:
Re: How does this make you feel? |
|
|
Andy Nelson <andy@thermo.lanl.gov> wrote:
| Quote: | Or, more precisely, turn on/off the OS support for various page sizes,
and to set the initial distribution of pagesizes (i.e. 30% 64k,
30% 1m, 30% 16m etc etc) when the machine first boots up. The
distribution tends to get scrambled after that, and it doesn't seem
possible to get it back all the way without a full reboot.
|
Relocating small pages to make room for a big one is not an option? You
would of course pick a candidate big page with few small pages allocated
on it. Some small pages might be candidates for just unmapping, letting
the owner page them in again if they were really useful.
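A minimal sketch of that candidate selection, in Python and with made-up names (`pick_candidate`, `in_use`): scan each aligned big-page-sized group of small frames and pick the one with the fewest small frames still allocated, since those are the ones that would have to be relocated or unmapped.

```python
def pick_candidate(in_use, total_small_frames, smalls_per_big):
    """Pick the big frame that is cheapest to free up.

    in_use:             set of small-frame numbers currently allocated
    total_small_frames: number of small frames in the memory being scanned
    smalls_per_big:     how many small frames make up one big frame
    Returns (big_frame_index, [small frames to move]) or None.
    """
    best = None
    for big in range(total_small_frames // smalls_per_big):
        lo = big * smalls_per_big
        occupied = [f for f in range(lo, lo + smalls_per_big) if f in in_use]
        if best is None or len(occupied) < len(best[1]):
            best = (big, occupied)
    return best

# 12 small frames, 4 per big frame; the middle group has only one in use.
print(pick_candidate({0, 1, 2, 5, 9, 10, 11}, 12, 4))   # (1, [5])
```

A real OS would also have to weigh how hot each small page is, which this sketch ignores.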
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
Colonel Forbin
Guest
|
Posted:
Mon Jul 04, 2005 12:15 am Post subject:
Re: How does this make you feel? |
|
|
In article <1gz4d02.11xvw7z12ojw8uN%nospam@ab-katrinedal.dk>,
Niels Jørgen Kruse <nospam@ab-katrinedal.dk> wrote:
| Quote: |
Andy Nelson <andy@thermo.lanl.gov> wrote:
Or, more precisely, turn on/off the OS support for various page sizes,
and to set the initial distribution of pagesizes (i.e. 30% 64k,
30% 1m, 30% 16m etc etc) when the machine first boots up. The
distribution tends to get scrambled after that, and it doesn't seem
possible to get it back all the way without a full reboot.
Relocating small pages to make room for a big one is not an option? You
would of course pick a candidate big page with few small pages allocated
on it. Some small pages might be candidates for just unmapping, letting
the owner page them in again if they were really useful.
|
Assuming you can't organize efforts to force management to allocate
computing resources in a more productive fashion, don't waste huge
amounts of time waiting for storage to coalesce. Try to modify your
code to take maximum advantage of what you have up front while sending
artificial hints to the OS to give you what you want, and adapt as you
go. In most cases spending hours watching CPU utilization hover around
zero pegs the bozo meter unless you can demonstrate an application
speedup which more than offsets the wait.
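The trade-off is simple enough to put in numbers. The sketch below is Python with illustrative inputs only: it assumes the run time is known for the small-page case and that big pages give a fixed speedup factor, neither of which is measured data from this thread.

```python
def worth_waiting(run_hours_small_pages, speedup, wait_hours):
    """True if waiting for big pages shortens total wall-clock time."""
    with_wait = wait_hours + run_hours_small_pages / speedup
    return with_wait < run_hours_small_pages

def break_even_wait(run_hours_small_pages, speedup):
    """Longest wait (in hours) that still pays for itself."""
    return run_hours_small_pages * (1.0 - 1.0 / speedup)

# Assumed example: a 168-hour (one week) run that would go 3x faster.
print(worth_waiting(168, 3, 0.5))   # True: a half-hour wait is easily repaid
print(worth_waiting(1, 1.0, 0.5))   # False: no speedup, the wait is pure loss
```

For the assumed week-long run and a factor of 3, the break-even wait comes out at about 112 hours, so the idle CPU only "pegs the bozo meter" when the job is short or the speedup is small.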
One of the best lessons I learned from KR Hammond about DB2 was how to
lie to the optimizer.
You need to do a statistical analysis of what approach will result in
the computational products you need in the minimum time given the
resource constraints you have to work with.
I agree with Nick about the desirability of writing portable code, since
what you do will likely outlive you if it proves generally useful.
OTOH, you have to meet your mission targets. The notion of watching a
supercomputer do nothing while waiting for storage to optimize is likely
to earn you a pink slip with your last paycheck.
Thus, work with what you have. Don't adopt some sort of idealistic
programmatic purity. Try to isolate the storage optimization hacks and
document them so future workers can remove or adapt them without
materially affecting the accuracy or value of the core mission
objective.
LANL isn't "Lifestyles of the Rich and Atomic" any more. You are paid
to produce results in the immediate term, not to write the King James
version of the HPC Portable Bible.
In auto racing parlance, you need the "Wicker Bill," not the
"Aerodynamically Optimized Spoiler." |
|
Eric P.
Guest
|
Posted:
Mon Jul 04, 2005 12:15 am Post subject:
Re: How does this make you feel? |
|
|
Niels Jørgen Kruse wrote:
| Quote: |
Andy Nelson <andy@thermo.lanl.gov> wrote:
Or, more precisely, turn on/off the OS support for various page sizes,
and to set the initial distribution of pagesizes (i.e. 30% 64k,
30% 1m, 30% 16m etc etc) when the machine first boots up. The
distribution tends to get scrambled after that, and it doesn't seem
possible to get it back all the way without a full reboot.
Relocating small pages to make room for a big one is not an option? You
would of course pick a candidate big page with few small pages allocated
on it. Some small pages might be candidates for just unmapping, letting
the owner page them in again if they were really useful.
|
Yes, on further thought, it might not be that bad.
With a global working set list the OS would likely have a set of
reverse pointers from each frame back to its referencing page table
entries. So it would be relatively straightforward to force frames
to coalesce by (glossing over all sorts of SMP race conditions):
select a small free frame, calc its buddy frame number as a candidate,
move the candidate's contents to a new substitute frame, use the
reverse map to patch all the ptes referencing the old candidate to
the substitute, pop the candidate out of the global ws list and
push the substitute at the same spot, and coalesce the two small
free buddies. It would probably be similar to the page table
manipulations required for an outswap.
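Roughly, and still glossing over all the races, the steps might look like the following Python sketch. Everything here is a hypothetical model: frames are integers, `rmap` maps a frame to the (page table, index) slots that reference it, and the candidate buddy is assumed to be in use and present in the working set list.

```python
def buddy_of(frame, order=0):
    """Buddy of a block of 2**order frames starting at `frame`."""
    return frame ^ (1 << order)

def force_coalesce(free_frame, substitute, memory, rmap,
                   page_tables, ws_list, free_set):
    """Free the buddy of `free_frame` by moving it to `substitute`.

    Returns the first frame number of the merged two-frame block.
    """
    candidate = buddy_of(free_frame)
    memory[substitute] = memory[candidate]       # move the contents
    for (pt, idx) in rmap.pop(candidate, []):    # patch every referencing PTE
        page_tables[pt][idx] = substitute
        rmap.setdefault(substitute, []).append((pt, idx))
    pos = ws_list.index(candidate)               # keep the same list position
    ws_list[pos] = substitute
    free_set.discard(substitute)                 # substitute is now in use
    free_set.discard(free_frame)                 # now part of the bigger block
    return min(free_frame, candidate)
```

For example, with frame 4 free, frame 5 mapped by two page tables and frame 9 as the substitute, both PTEs end up pointing at 9 and frames 4 and 5 form one free block starting at 4.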
An OS with local working set lists would likely not have the
reverse map so it would need to add one. Maybe it could use
the working set list entries it already maintains for each process.
Eric |
|
Steve
Guest
|
Posted:
Tue Jul 05, 2005 6:55 am Post subject:
Re: How does this make you feel? |
|
|
Jan Vorbrüggen wrote:
| Quote: | g) And then, these features (essentially) disappeared from new ISA
designs, including those for most microprocessors.
One of the few exceptions was the transputer. Because of the instructions
supporting channel communications - which for processor-internal channels
are just a special form of memcpy - it was natural to support this as a
separate instruction as well. For the second generation, a nice MOVE2D
instruction was added that allowed you to, for instance, extract a column
of a 2D matrix into a contiguous array, operate on it (e.g., perform an
FFT), and scatter it back into the 2D matrix.
|
Nice. I seem to recall reading advertisements for the transputer in way
old Byte magazines.
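The column extract/scatter idea is easy to show in software. The Python sketch below only illustrates the strided-copy pattern on a row-major array; it is not the transputer's actual MOVE2D semantics, and the function names are invented.

```python
def gather_column(flat, n_rows, n_cols, col):
    """Copy one column of a row-major 2D matrix into a contiguous list."""
    return [flat[r * n_cols + col] for r in range(n_rows)]

def scatter_column(flat, n_rows, n_cols, col, values):
    """Write a contiguous list back into one column of the matrix."""
    for r in range(n_rows):
        flat[r * n_cols + col] = values[r]

matrix = [0, 1, 2,
          3, 4, 5]                       # 2 rows x 3 columns, row-major
column = gather_column(matrix, 2, 3, 1)  # [1, 4]
column = [2 * v for v in column]         # stand-in for the real work (FFT etc.)
scatter_column(matrix, 2, 3, 1, column)
print(matrix)                            # [0, 2, 2, 3, 8, 5]
```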
| Quote: | + Maybe C and UNIX distorted CPU design, especially with RISCs
- Possible, but as I've posted various times, various RISC CPU
designers definitely cared about non-C languages and non-UNIX operating
systems.
I do think this did play some role - I'm convinced it put support for
descriptor-like data structures on a lower priority than it would other-
wise have had.
|
C and UNIX originated a while ago. It's not too difficult to see that
their original designers never anticipated certain real-world problem
domains. POSIX .4 is but one example of a feature required of modern
systems that ends up being difficult to implement and use today.
| Quote: | + Maybe later designers' insistence on measuring performance impacts
versus implementation costs caused them to ignore potentially-wonderful
features whose only problem was that they needed a new OS and new
language to make use of them.
Rather, I'd think that smart programmers showed they could use RISC
primitives to implement, say, a memcpy just as efficiently as microcode
could, except perhaps for some of the cache effects (see below) and, of
course, at the expense of quite complicated code to handle all possible
alignments etc.
|
But this is what computers are for: making the job easier for the
user. If it is possible to offload work from the programmer in such a
common case, why not?
| Quote: | - It is no accident that it takes 2 pages to describe MVCL.
Indeed. And it increases the likelihood some implementation gets some
corner case wrong.
|
Laugh. Well there are always going to be people who don't fully read
the documentation as they should!
| Quote: | *Sometimes*, with enough work on the design, the hardware can
indeed do better if it knows an entire address+length in one fell swoop.
For example, in a uniprocessor with write-back caches, something like a
MVCL can avoid fetching a cache line that is about to be completely
overwritten.
...which is why ISA designers added such things as WH64 (write hint 64
bytes - Alpha). That is of course much more generally useful besides
being used by memcpy and friends.
|
Noted above.
| Quote: | Incidentally, the transputer's MOVE instruction supplies a nice lesson in
why such designs are difficult. It was, of course, interruptible, because
supporting interrupts efficiently was a design goal for the transputer.
In the case of an interrupt, the current state was saved by microcode into
defined memory locations and restored later (due to the design, there could
only be one level of interrupt). However, the first implementations got
the saved state wrong: when the instruction was resumed, it would re-read
the last location that had already been processed. "Normal" memory doesn't
care, but if you use this to read out a FIFO, you have a problem. Various
software workarounds needed to be developed for this oversight...
|
Microcode updates are much more common these days, and so bugs in the
CPU are not quite the problem they once were.
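To see why the saved-state bug only bites on things like FIFOs, here is a toy Python model. It is not the transputer's microcode, just an interruptible copy whose saved progress counter is either right or one too low; `Fifo` and `copy_from_fifo` are invented names.

```python
from collections import deque

class Fifo:
    """Device-like source: every read consumes the next item."""
    def __init__(self, items):
        self._q = deque(items)

    def read(self):
        return self._q.popleft()

def copy_from_fifo(fifo, count, interrupt_after, buggy):
    """Copy `count` items, interrupted once after `interrupt_after` items."""
    out = []
    done = 0
    while done < interrupt_after:
        out.append(fifo.read())
        done += 1
    # Interrupt: save progress. The buggy variant saves one item too few,
    # so the resumed copy re-reads the last location already processed.
    saved = done - 1 if buggy else done
    done = saved
    while done < count:
        item = fifo.read()
        if done < len(out):
            out[done] = item      # overwrites the item read before the interrupt
        else:
            out.append(item)
        done += 1
    return out

print(copy_from_fifo(Fifo("abcde"), 4, 2, buggy=False))  # ['a', 'b', 'c', 'd']
print(copy_from_fifo(Fifo("abcde"), 4, 2, buggy=True))   # ['a', 'c', 'd', 'e']
```

With ordinary memory the re-read would fetch the same value again and nobody would notice; with the FIFO the item 'b' is silently lost.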
This discussion has drifted a little off topic from the core of my
initial posting (which is natural for Usenet) and I feel that some of
you may have misread my intentions. Over the last couple of days I
have had some time to consider my position in light of your comments
and have decided that I obviously must make my case more explicit. So.
The idea that prompted my original post concerned a (possibly) new way
of constructing CPU registers. In my initial discussion, I suggested
that there might be a second pseudo-register that would modify the
behaviour of a conventionally conceived register. Another poster
suggested implicitly that I could be talking about using a second
register to modify another. This point is not quite moot. I am
advocating a change in the way registers are conceived from a CPU
design standpoint.
I want to view a register as something that has mutable semantics. I
want it to apply to a general instruction set on some hypothetical
architecture in a meaningful way. While it may make conventional
sense to modify the behavior of one register with another, as is seen
with various canonical addressing modes today, this is not in line
with the philosophy that I am thinking about. I want you to think
about complexifying the way a register is implemented on-chip; if this
thought experiment logically indicates certain changes to the design of
a CPU or its instruction set, that is part of the next step of
design. The specifics of its on-chip implementation are best left
unsaid at the moment.
For now, let's consider a hardware register construct and see what it
gives us in terms of its flexibility. I suggest that for the purpose
of this discussion we are not going to talk about modifying some legacy
system. This is a _de novo_ design that turns on the structure of its
registers.
So, we might suggest 'R1' as a 64-bit register that may take on
additional properties that affect its application with an instruction.
The on-chip logic that accompanies the register includes provisions to
change its behavior (according to my previous suggestion). First,
there is a way to partition its bits into an address and length
component: specified perhaps by a special instruction that sets its
parameters:
rcfg r1, 54:8
.... which gives us 54 bits of addressing and eight bits of length.
There has to be a third component, register 'chunk' size, that in
effect dictates the 'word' size of that particular register. In this
hypothetical design, different registers may have different partitions
and 'chunk' sizes. So, we might have a hypothetical register config
instruction like this:
rcfg r1, 54:8:19
.... which gives us a register configured to address 'chunks' of 2^19
bits (65536 bytes), with 54 bits of address space, and 8 bits of
'length' which allows in this case a span of 16M in 'steps' of 64k.
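
The arithmetic above can be checked with a small software model. This is
only a sketch under assumptions: the names RegConfig, pack and unpack are
invented for illustration, the address field is assumed to sit above the
length field, and the chunk size is treated as an attribute held beside
the register rather than inside its 64 bits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegConfig:
    """Hypothetical attributes attached to a register by 'rcfg'."""
    addr_bits: int   # width of the address field, e.g. 54
    len_bits: int    # width of the length field, e.g. 8
    chunk_log2: int  # log2 of the chunk size in bits, e.g. 19

    def __post_init__(self):
        # Assumption: both fields share one 64-bit register.
        if self.addr_bits + self.len_bits > 64:
            raise ValueError("address and length fields exceed 64 bits")
        # Assumption: a chunk is at least one byte (2^3 bits).
        if self.chunk_log2 < 3:
            raise ValueError("chunk smaller than one byte")

    @property
    def chunk_bytes(self):
        return (1 << self.chunk_log2) // 8

    @property
    def max_span_bytes(self):
        # 2^len_bits steps of one chunk each, matching the 16M figure.
        return (1 << self.len_bits) * self.chunk_bytes


def pack(cfg, address, length):
    """Pack (address, length) into one value; the field order is assumed."""
    if not 0 <= address < (1 << cfg.addr_bits):
        raise ValueError("address does not fit the address field")
    if not 0 <= length < (1 << cfg.len_bits):
        raise ValueError("length does not fit the length field")
    return (address << cfg.len_bits) | length


def unpack(cfg, value):
    """Split a packed value back into (address, length)."""
    length = value & ((1 << cfg.len_bits) - 1)
    address = (value >> cfg.len_bits) & ((1 << cfg.addr_bits) - 1)
    return address, length


r1 = RegConfig(addr_bits=54, len_bits=8, chunk_log2=19)  # rcfg r1, 54:8:19
print(r1.chunk_bytes)     # 65536
print(r1.max_span_bytes)  # 16777216
```

Whether a length field of n means n chunks or n+1 chunks is left open
here; the model only reproduces the 64k step and the 16M span.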
I prefer this approach as it illustrates the idea of a register that
has attributes, rather than a register that is simply modified by
another. Such a register would be configured by software to work
within an arbitrary lexical scope in most practical applications. From
a compiler perspective -- at least -- this indicates an entirely
different approach to register allocation and use. This is a different
way to view memory addressing.
On the hypothetical chip, there are a bunch of differences. Instead of
simply storing a number, the value of a register now has context that
is dependent upon its configuration. The 'chunk', or effective word
size is variable. The partition of the register is variable, subject
to interdependent semantic constraints which might mandate that an
'address' register (or a register used for addressing) must be capable
of addressing the entire virtual memory space. Further, the CPU should
be able to validate addresses and addressing modes during instruction
decode in order to raise an appropriate signal.
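
As a rough illustration of the kind of check meant here, the sketch below
models two validations in software. Everything in it is an assumption
made for the example: the names AddressingFault, check_config and
check_offset, the 64-bit virtual address width, and the reading of
"capable of addressing the entire space" as address bits plus the
chunk's byte-offset bits being at least the virtual address width.

```python
class AddressingFault(Exception):
    """Stand-in for the signal the CPU would raise at decode time."""


def check_config(addr_bits, chunk_log2, va_bits=64):
    """Verify that an address register can reach the whole address space.

    chunk_log2 is the log2 of the chunk size in bits, so a chunk spans
    2^(chunk_log2 - 3) bytes. Returns the number of reachable byte-address
    bits, or raises AddressingFault if that is less than va_bits.
    """
    reachable_bits = addr_bits + (chunk_log2 - 3)
    if reachable_bits < va_bits:
        raise AddressingFault(
            f"only {reachable_bits} of {va_bits} address bits reachable")
    return reachable_bits


def check_offset(length_chunks, chunk_log2, offset_bytes):
    """Verify that a byte offset stays inside the register's span."""
    span_bytes = length_chunks * (1 << (chunk_log2 - 3))
    if not 0 <= offset_bytes < span_bytes:
        raise AddressingFault(
            f"offset {offset_bytes} outside span of {span_bytes} bytes")
    return offset_bytes


# A 54-bit address field over 64k chunks covers more than 64 bits of bytes.
print(check_config(addr_bits=54, chunk_log2=19))  # 70
# Three 64k chunks give a 196608-byte span; the last valid offset is 196607.
print(check_offset(length_chunks=3, chunk_log2=19, offset_bytes=196607))
```

In hardware this would presumably be a comparison done alongside decode
rather than a function call; the model only shows which quantities are
being compared.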
It was mentioned earlier in this thread that in the past many things
have been tried, and it is implied that many innovations had proven
impractical. For this case, silicon implementation issues aside for
the moment, what is gained by complexifying registers?
o We would have a way of lexically scoping memory ranges and (dare I
say it) block size to a register in a flexible way.
o The semantics of *arbitrary* traditional instructions are
potentially made much richer, and in a way that reduces code size.
There is an entire discussion waiting in the wings to be had over how
traditional addressing modes might be applied to such registers:
register indexed, indirect, pre- and post-increment/decrement, etc.
o More functionality is taken on by the CPU directly, indicating a
shift