| Author |
Message |
John Mashey
Guest
|
Posted:
Mon Aug 15, 2005 8:15 am Post subject:
Re: PART 3. Why it seems difficult to make an OOO VAX compet |
|
|
Eric P. wrote:
| Quote: | "John Mashey" <old_systems_guy@yahoo.com> writes:
But the bottom line is: the VAX ISA was very difficult to keep
competitive. The obvious decoding complexity is always there, in one
form or another, but the more serious problem is execution complexity
that lessens effective ILP and is thus a continual drag on performance
with reasonable implementations.
In case anyone is still interested in this topic,
there are a bunch of papers by Bob Supnik at
http://simh.trailing-edge.com/papers.html
covering a variety of DEC design issues.
|
Great material; thanks for posting; Bob is doing a dandy job preserving
old stuff. In particular, if somebody actually wants to build things,
it is really useful to get insight about design processes and
tradeoffs.
The HPS postings were useful too.
| Quote: | The one labeled "VLSI VAX Micro-Architecture" is from 1988
(marked "For Internal Use Only, Semiconductor Engineering Group")
mentions at the end the ways a VAX might get lower CPI. It says
"However the VAX architecture is highly resistant to macro-level
parallelism:
- Variable length specifiers make parallel decoding of specifiers
difficult and expensive
- Interlocks within and between instructions make overlap of
specifiers with instruction execution difficult and expensive
Most (but not all) VAX architects feel that the costs of macro-level
parallelism outweighs the benefits; hence this approach is
not being actively pursued."
So it would seem that the designers felt at that time that decode
was a major impediment.
|
I actually hadn't read this before I posted, but obviously, I'd talked
to VAX implementors in the late 1980s, and what they complained about
sank in.
Anyway, thanks for posting. |
|
|
 |
Jan Vorbrüggen
Guest
|
Posted:
Mon Aug 15, 2005 1:34 pm Post subject:
Re: Code density and performance? |
|
|
| Quote: | But again the trouble is that rearranging existing code and
algorithms even with profile is a mere palliative compared to
designing program logic and data structure to improve locality.
|
I think they are orthogonal, or at least span a largish angle 8-).
What profiling does is optimize the code based on _actual_ usage,
and that is something at which people are notoriously bad. It will
also automatically give you the "outlining" of basic blocks imple-
menting exception code etc., which Park mentioned. Here, the main
reason is to improve branch prediction, of course.
Also, profiling tools will help in telling you where those non-
localities actually are, if they are good. Remember, diagnosis
is a required first step before therapy; otherwise, you will quite
likely end up with a dead patient...
Jan |
|
|
 |
Jan Vorbrüggen
Guest
|
Posted:
Mon Aug 15, 2005 1:40 pm Post subject:
Re: Code density and performance? |
|
|
| Quote: | pg_nh> A 4KiB superpage is almost as a cluster of 8 512B pages, but with
pg_nh> these differences (mainly):
pg_nh> * PLUS: a single mapping.
Oops, I forgot this one:
* PLUS: single IO instead of 8 IOs.
Both pluses seem fairly minor to me except in some older systems with
really high TLB or per-IO overheads (like lack of chaining).
|
I believe the issue here is more of the per-page overhead in I/O, i.e.,
PFN handling, working-set locking, TLB synchronization in a parallel
system, etc. VMS had always had page-fault clustering, in spite of the
per-page overhead being considerable.
The fact remains that even with the default (IIRC) 8 KB pages on the Alpha,
the VMS guys spent a lot of effort putting what they called "granularity
hints" into both system and, where possible, process address spaces to use
the processor's variable-sized superpages (IIRC, the Alpha had the most
flexible scheme, one that allows you to specify a power-of-two aligned
superpage in any TLB entry) to reduce TLB misses as much as possible. So TLB
misses definitely _are_ a problem for such systems...or you would need to
postulate that DEC^H^H^HCompaQ^H^H^H^H^H^HHP were not only incompetent
marketing- and sales-wise, but also in the technical area.
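
To put rough numbers on why superpages matter, here is a minimal sketch of TLB reach (entries times page size). The TLB size of 128 entries and the superpage multiples are assumptions picked for illustration, not figures for any particular Alpha; only the 8 KB base page comes from the post above.

```python
KIB = 1024

def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_bytes

def entries_needed(region_bytes, page_bytes):
    """TLB entries needed to map a contiguous, suitably aligned region."""
    return -(-region_bytes // page_bytes)  # ceiling division

ENTRIES = 128          # assumed TLB size, for illustration only
BASE_PAGE = 8 * KIB    # default Alpha page size mentioned above

for multiple in (1, 8, 64, 512):   # illustrative superpage sizes
    page = BASE_PAGE * multiple
    reach_kb = tlb_reach(ENTRIES, page) // KIB
    needed = entries_needed(4 * KIB * KIB, page)
    print(f"{page // KIB:5d} KB pages: reach {reach_kb} KB, "
          f"{needed} entries for a 4 MB region")
```

With base pages only, a 4 MB region needs far more entries than this assumed TLB holds, which is the kind of pressure the granularity hints are meant to relieve.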
Jan |
|
|
 |
Ketil Malde
Guest
|
Posted:
Mon Aug 15, 2005 1:49 pm Post subject:
Re: Code density and performance? |
|
|
pg_nh@0506.exp.sabi.co.UK (Peter Grandi) writes:
| Quote: | ketil> Occasionally a program will exhaust memory, and
ketil> cause other applications to be swapped. However,
ketil> when that program terminates and memory is released,
Oh another guy who does not believe that memory is
infinite... :-)
|
Sorry?
| Quote: | ketil> there is a ten to twenty second lag
My impression is that this is due more to inefficient Linux
paging and swapping than to any intrinsic requirement. More
below...
|
Quite possible.
| Quote: | ketil> when I switch to another application -- even if the
ketil> computer has been idle for a while. Seems to me the OS
ketil> could start to swap back in in the background?
Sure, but what? Here you are talking of an interactive app, and
knowing which one to swap back in would require the kernel to
anticipate what the user is going to do next.
|
Certainly the OS can't know which application is going to be paged in
next. What it does know, however, is that the hundreds of megabytes
just released from a terminated application are *not* going to be asked
for again.
So populating the memory from swap with *anything* when the computer
is otherwise IO-idle would be an improvement - at worst, you'd have to
page in something else instead. At best, the system would be ready
for you when you start using it again.
And - the normal workstation use is to have all interactive
applications' working set in RAM.
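
A toy sketch of the idea, under stated assumptions: pages are just labels, the "last out, first back in" guess is an arbitrary policy chosen for illustration, and none of the names (`ToyPager`, `idle_tick`) correspond to anything in the Linux kernel.

```python
from collections import deque

class ToyPager:
    """Toy model: a fixed number of RAM frames; pages are just labels."""

    def __init__(self, frames):
        self.frames = frames
        self.resident = set()
        self.swapped = deque()   # most recently evicted page at the right

    def free_frames(self):
        return self.frames - len(self.resident)

    def evict(self, page):
        self.resident.remove(page)
        self.swapped.append(page)

    def release(self, pages):
        """A process exits: its pages will never be asked for again."""
        self.resident -= pages
        self.swapped = deque(p for p in self.swapped if p not in pages)

    def idle_tick(self, budget):
        """While IO is idle, swap up to `budget` pages back into free frames."""
        brought_in = []
        while budget > 0 and self.free_frames() > 0 and self.swapped:
            page = self.swapped.pop()   # guess: last out, first back in
            self.resident.add(page)
            brought_in.append(page)
            budget -= 1
        return brought_in

pager = ToyPager(frames=4)
pager.resident = {"editor:1", "editor:2", "browser:1", "browser:2"}
pager.evict("editor:1")
pager.evict("browser:1")            # a memory hog pushes these out
pager.resident |= {"hog:1", "hog:2"}
pager.release({"hog:1", "hog:2"})   # the hog terminates
print(pager.idle_tick(budget=8))    # pages refilled without any user action
```

A wrong guess here costs only an idle-time read that is later displaced; the frames were empty anyway.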
| Quote: | However the argument can be that a wrong guess does not cost
much;
|
Right. Except in OS complexity, perhaps.
| Quote: | However there are some social problems that may explain why this
kind of thing has not happened yet in the Linux kernel:
* As a cheap way to gain a degree of control over the direction
of Linux kernel development several major corporations have
hired virtually all of the top kernel developers:
[..]
This means that the top kernel developers now have in effect
infinite RAM on their own PCs... ''scratch my itch'' :-).
|
And fast CPUs, etc. Are you suggesting Linux kernel developers should
simply be paid less? :-)
| Quote: | * Most of the research on VM has assumed that the goal is to
minimize the ''space time integral'', that is resource
consumption, and doing as you suggest does not do that.
|
Both CPU and IO are mostly idle on a (my) workstation, most of the
time, so there are "free" resources to spend if there is a
benefit. Also, IMO latency and not throughput is the biggest problem
with workstation performance.
| Quote: | But I recently found a hint of something ever more bizarre, that
splitting a swap area into several smaller contiguous swap areas
(that is, on the disc, not across discs) can improve things
significantly:
|
That *is* strange.
-k
--
If I haven't seen further, it is by standing in the footprints of giants |
|
|
 |
Torben Ægidius Mogensen
Guest
|
Posted:
Mon Aug 15, 2005 3:42 pm Post subject:
Re: Silly new instructions |
|
|
"Peter \"Firefly\" Lund" <firefly@diku.dk> writes:
| Quote: | On Sat, 13 Aug 2005, Dan Koren wrote:
One could also argue the case that having the SP
and the PC (almost) invisible and inaccessible to
(user mode) software could bring very considerable
benefits.
So you could load/store stuff relative to the SP (and possibly
relative to a base pointer) but you couldn't get the effective address
of that stuff?
That wouldn't play well with automatic arrays in C, for example, as
they are typically implemented.
|
And similarly for call-by-reference to stack-allocated objects and
Pascal-style non-local variables.
Moving the SP to an unnumbered register with limited access will
require a complete rethinking of the way addresses are handled. It
could work like this:
- Registers and memory add an extra tag bit that identifies whether
what you have is a pointer or integer.
- Pointers are always intervals of memory addresses, so they take up
two words/registers (aligned). The tag bit in the second word
indicates whether the interval stores code or data. The PC is an
interval plus offset.
- No pointer arithmetic, only pointer+offset.
- Load/store instructions take a pointer and offset and trap if the
offset takes the address outside the interval. Store to code
addresses is not allowed in user mode.
- All jumps are relative to an address (i.e., interval) stored in a
register-pair or memory (or current PC). It is checked that the
offset is within bounds and that the address is to code (when in
user mode).
- Instructions exist for joining adjacent intervals or splitting an
interval in two. These can be used for managing stacks, heaps etc.
You can only join intervals of the same type (code or data), and
when you split an interval, the two new intervals will be of the
same type as the original.
- When overwriting a register or memory-word that is tagged as an
address interval, both words in the interval must be overwritten.
I.e., you need to check the tags for the full double-word.
- Supervisor mode can set up intervals at "real" addresses and tag
them as desired.
Note that this doesn't give full protection, as it is possible to
split an interval and still load/store relative to the original. But
it means that you can give a user process access only to limited code
and data spaces.
Note that it is, AFAICS, possible to implement C with these
restrictions. Pointers just take up more space than usual (three
words: Two for an interval and one for an offset into this).
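
The interval rules above can be sketched in software as follows. This is only a model under stated assumptions: the names (`Interval`, `Trap`, `load`, `store`, `split`, `join`) are made up for illustration, memory is a plain list, and the per-word tag bits and the overwrite rule are not modelled.

```python
from dataclasses import dataclass

class Trap(Exception):
    """Raised where the hardware in this sketch would trap."""

@dataclass(frozen=True)
class Interval:
    start: int      # first address in the interval
    end: int        # one past the last address
    is_code: bool   # tag bit of the second word: code or data

    def address(self, offset):
        """pointer+offset, trapping if the result leaves the interval."""
        if not 0 <= offset < self.end - self.start:
            raise Trap(f"offset {offset} outside interval")
        return self.start + offset

def load(mem, ptr, offset):
    return mem[ptr.address(offset)]

def store(mem, ptr, offset, value, user_mode=True):
    if ptr.is_code and user_mode:
        raise Trap("store to code interval in user mode")
    mem[ptr.address(offset)] = value

def split(ptr, at):
    """Split at offset `at`; both halves keep the original type."""
    mid = ptr.address(at)
    return Interval(ptr.start, mid, ptr.is_code), Interval(mid, ptr.end, ptr.is_code)

def join(a, b):
    """Join two adjacent intervals of the same type."""
    if a.end != b.start or a.is_code != b.is_code:
        raise Trap("intervals not adjacent or of different type")
    return Interval(a.start, b.end, a.is_code)

mem = [0] * 64
stack = Interval(16, 32, is_code=False)
store(mem, stack, 3, 42)            # in bounds: writes address 19
frame, rest = split(stack, 8)       # carve a frame off the stack interval
print(load(mem, frame, 3), join(frame, rest) == stack)
```

A C pointer in this model would be an `Interval` plus a separate offset, which is where the three words come from.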
Torben |
|
|
 |
Seongbae Park
Guest
|
Posted:
Mon Aug 15, 2005 7:10 pm Post subject:
Re: Code density and performance? |
|
|
Peter Grandi <pg_nh@0506.exp.sabi.co.UK> wrote:
....
| Quote: | Seongbae.Park> If you want better locality, you need a
Seongbae.Park> combination of compile time and link time
Seongbae.Park> optimization [ ... ]
pg_nh> Sure, ideally, but these cost a lot more time, and
pg_nh> programmers can't be asked.
|
For example, Sun's Performance Analyzer
can produce a mapfile from performance measurement
to optimize for code locality
(and it's certainly not the only one that can do that sort of thing).
This mapfile can be used in your build environment
- it requires only a minimal impact in the build environment
since the mapfile doesn't need to be generated
every time you build.
| Quote: | Seongbae.Park> Time as in compilation/linking time but no effort
Seongbae.Park> for programmers especially just turning on
Seongbae.Park> certain optimization flags. e.g. link time
Seongbae.Park> optimizer can reorder at function/basic block
Seongbae.Park> granularity to gather functions and basic blocks
Seongbae.Park> that are close in the static call graph, and
Seongbae.Park> compilers can "outline" cold blocks (like
Seongbae.Park> exception handler or error handling path).
What in essence such optimizers do is to run 'lorder | tsort'
inside compilation units...
This is still just a bit more fine grained static clustering,
and in its simplicity 'lorder | tsort' does most of the same
things, and at the function granularity too if one enables
splitting.
|
It's the wrong tool for the job.
Splitting object files per function requires significant
changes to the build environment, and is totally unnecessary.
e.g. simply putting functions into different ELF sections
would give the same degree of freedom to appropriate tools
without impacting user environment.
If someone bothered to use lorder|tsort,
s/he could easily put just a bit more effort
and could potentially get *much* better outcome.
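
As a rough illustration of what a profile-driven ordering step does, here is a minimal sketch that places sampled functions first, hottest first, and leaves unsampled ones at the end. The function names and sample counts are made up, the output is a plain list rather than any real mapfile format, and real tools typically use call-graph information as well.

```python
def order_functions(samples, all_functions):
    """Hot functions first (descending sample count), unsampled ones last.

    Ties and cold functions keep their original relative order.
    """
    hot = [f for f in all_functions if samples.get(f, 0) > 0]
    cold = [f for f in all_functions if samples.get(f, 0) == 0]
    hot.sort(key=lambda f: -samples[f])   # stable sort preserves tie order
    return hot + cold

# Made-up profile: samples attributed to each function during a run.
profile = {"parse": 900, "lookup": 4500, "emit": 900}
functions = ["main", "parse", "report_error", "lookup", "emit", "usage"]

print(order_functions(profile, functions))
```

The ordering only has to be regenerated when the profile changes, which is why the build impact stays small.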
| Quote: | I think we have already had in past discussions of how nice it
would be to have 'else infrequently' hints in the language,
not that programmers would be bothered to use that. :-/
|
....and most programmers, even the most competent ones,
will suffer from the consequences of putting in wrong "hints".
But let's not go down that rathole again.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/" |
|
|
 |
John Ahlstrom
Guest
|
Posted:
Tue Aug 16, 2005 12:03 am Post subject:
Re: Silly new instructions |
|
|
John Savard wrote:
| Quote: | On Sat, 13 Aug 2005 12:53:48 -0700, John Ahlstrom
<ahlstromjk@comcast.net> wrote, in part:
I believe using a GPR as SP or PC (or both) was also patented
thus preventing PDP-11 knockoffs. A very considerable benefit.
It was the UNIBUS patents that did in DCC...
John Savard
|
What did in National Semiconductor?
They had built a knock-off uProcessor PDP-11,
advertised it heavily and then cancelled it just
before delivery, IIRC.
--
"If you can't drink their booze, take their money, and then vote
against them anyway, you don't belong in this game."
L O'Donnell, Jr |
|
|
 |
Tom Linden
Guest
|
Posted:
Tue Aug 16, 2005 12:15 am Post subject:
Re: Silly new instructions |
|
|
On Mon, 15 Aug 2005 17:11:45 -0400, Dan Koren <dankoren@yahoo.com> wrote:
| Quote: |
"John R. Levine" <johnl@iecc.com> wrote in message
news:ddqqjc$2sl$1@xuxa.iecc.com...
No, it was a 32-bit VAX knock-off.
Looked good on paper.
The 16032 was extremely buggy. Was
the 32032 any better?
Dunno. I was at National at the time,
but didn't work on/with it. Merely
watching from the sides ;-)
dk
|
We ported PL/I, Fortran, and Pascal, beginning with the 16032 and ending
with the 32032. It wasn't so much that it was buggy from a programming
POV; it was early silicon bugs that plagued us. The 32032 was far more
stable, but that was largely due to the TI fab that made them. In fact,
National had lost some of the tapes, so TI had to put in a lot of effort
to rebuild the data for the fabs. The instruction set was largely a VAX
knock-off. |
|
|
 |
Dan Koren
Guest
|
Posted:
Tue Aug 16, 2005 12:15 am Post subject:
Re: Silly new instructions |
|
|
"John R. Levine" <johnl@iecc.com> wrote in message
news:ddqqjc$2sl$1@xuxa.iecc.com...
| Quote: |
No, it was a 32-bit VAX knock-off.
Looked good on paper.
The 16032 was extremely buggy. Was
the 32032 any better?
|
Dunno. I was at National at the time,
but didn't work on/with it. Merely
watching from the sides ;-)
dk |
|
|
 |
Dan Koren
Guest
|
Posted:
Tue Aug 16, 2005 12:15 am Post subject:
Re: Silly new instructions |
|
|
"John Ahlstrom" <ahlstromjk@comcast.net> wrote in message
news:Ar-dnZ2dnZ31dPK3nZ2dnWt7nd6dnZ2dRVn-0J2dnZ0@comcast.com...
| Quote: | John Savard wrote:
On Sat, 13 Aug 2005 12:53:48 -0700, John Ahlstrom
<ahlstromjk@comcast.net> wrote, in part:
I believe using a GPR as SP or PC (or both) was also patented
thus preventing PDP-11 knockoffs. A very considerable benefit.
It was the UNIBUS patents that did in DCC...
What did in National Semiconductor?
|
Its management and brilliant marketing people.
I happen to know as I was there... 8-((
(of course not in marketing)
| Quote: | They had built a knock-off uProcessor PDP-11,
|
No, it was a 32-bit VAX knock-off. Looked good
on paper.
| Quote: | advertised it heavily and then cancelled it just before delivery, IIRC.
|
Incorrect. National's 32032 actually shipped,
and systems were built using it (e.g. Tolerant).
The problem was that it shipped late and never
reached sufficient numbers to become sustainable.
dk |
|
|
 |
John R. Levine
Guest
|
Posted:
Tue Aug 16, 2005 12:15 am Post subject:
Re: Silly new instructions |
|
|
| Quote: | No, it was a 32-bit VAX knock-off. Looked good
on paper.
|
The 16032 was extremely buggy. Was the 32032 any better?
R's,
John |
|
|
 |
Anne & Lynn Wheeler
Guest
|
Posted:
Tue Aug 16, 2005 12:15 am Post subject:
Re: Code density and performance? |
|
|
Seongbae Park <Seongbae.Park@Sun.COM> writes:
| Quote: | For example, Sun's Performance Analyzer can produce a mapfile from
performance measurement to optimize for code locality (and it's
certainly not the only one that can do that sort of thing). This
mapfile can be used in your build environment - it requires only a
minimal impact in the build environment since the mapfile doesn't
need to be generated every time you build.
|
note that vs/repack did that ... it was done at the science center
http://www.garlic.com/~lynn/subtopic.html#545tech
in the early 70s ... and was used as part of the analysis of
rewriting apl storage management (garbage collection) in the port
from apl\360, small (16kbytes to 32kbytes) real storage swapping to
cms\apl large virtual memory operation.
it was also used for analysis of some of the "large" 360 applications
in moving them from the os/360 real storage environment to os/vs2
virtual memory environment.
it was released as a product in March of 1976.
i was a little annoyed ... i had written some of the data collection
software for vs/repack ... but the real analysis work was done by
hatfield. it was priced software and the science center was on the
list of organizations ... that if employees produced priced software,
they got the equivalent of one month license for all copies sold the
first year.
in april '76, the science center was removed from the list of
organizations, where employees got the first month's license.
i got to release the resource manager in may of '76 ... which was the
first priced operating system code ... aka the unbundling announcement
of 6/23/69 started pricing for application software, but kernel
software was still free. the resource manager got chosen to be the
guinea pig for first priced kernel software (i got to spend a couple
months with business people working out kernel pricing policy)
.... lots of past posts on unbundling and resource manager being priced
http://www.garlic.com/~lynn/subtopic.html#unbundle
i even offered to give up my salary ... since the first year sales of
the resource manager were so good that the first month's license was
well over $1m.
in any case article by Hatfield from the early 70s:
D. Hatfield & J. Gerald, Program Restructuring for Virtual Memory, IBM
Systems Journal, v10n3, 1971
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/ |
|
|
 |
Joe Pfeiffer
Guest
|
Posted:
Tue Aug 16, 2005 8:15 am Post subject:
Re: Silly new instructions |
|
|
johnl@iecc.com (John R. Levine) writes:
| Quote: | No, it was a 32-bit VAX knock-off. Looked good
on paper.
The 16032 was extremely buggy. Was the 32032 any better?
|
It showed a lot of VAX-influenced thinking, but it certainly wasn't a
knock-off (at least, as I use the term -- same instruction set). I
also thought it made it to market.
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
skype: jjpfeifferjr |
|
|
 |
Joe Pfeiffer
Guest
|
Posted:
Tue Aug 16, 2005 8:15 am Post subject:
Re: Silly new instructions |
|
|
John Ahlstrom <ahlstromjk@comcast.net> writes:
| Quote: | John Savard wrote:
On Sat, 13 Aug 2005 12:53:48 -0700, John Ahlstrom
<ahlstromjk@comcast.net> wrote, in part:
I believe using a GPR as SP or PC (or both) was also patented
thus preventing PDP-11 knockoffs. A very considerable benefit.
It was the UNIBUS patents that did in DCC...
|
DCC, as in Digital Computer Controls? The DCC-116 I used in a
computer interfacing class in the mid-1970s was a Data General Nova
knockoff, not any kind of DEC. I was under the impression DG got rid
of them by buying them.
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
skype: jjpfeifferjr |
|
|
 |
Tim McCaffrey
Guest
|
Posted:
Tue Aug 16, 2005 3:57 pm Post subject:
Re: PART 3. Why it seems difficult to make an OOO VAX compet |
|
|
In article <1123511995.056652.53920@g43g2000cwa.googlegroups.com>,
old_systems_guy@yahoo.com says...
| Quote: |
Among the problems with comp.arch is that it fills up with opinions
that don't survive even minimal perusal of the literature...
1) One can argue about the PC, but if one reads the VAX-study
references I quoted, one finds [Emer & Clark] that Immediates (PC)+
were 2.4% of the specifiers, and Absolute @(PC)+ were 0.6%, or 3% of
the total. Personally, I didn't think that was worth the other
problems, and neither did many of the other RISC designers,(ARM being a
notable exception), nor X86 nor 68K, but it does help with code size.
|
While I don't have numbers handy, I can see why the ARM folks
have PC as a GPR: if you want to load a big constant (like an
address), you have to load the word from memory. Putting it
near the code and doing a PC relative load makes sense. I can't
prove it, but I bet this is painful for the caches, as the data
cache and code cache end up overlapping quite a bit (at least
for the XScale, the caches don't do any snooping, so they
probably don't thrash, but they may end up with a fair amount
of cruft in them).
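
A back-of-the-envelope way to see the overlap: lay out each function followed by its literal pool and count the cache lines that hold both instructions and literals. The 32-byte line size and the function sizes below are assumptions for illustration, and the model ignores alignment padding.

```python
LINE = 32   # assumed cache line size in bytes
WORD = 4    # ARM instructions and literal words are 4 bytes each

def shared_lines(functions):
    """functions: list of (n_instructions, n_literals), laid out back to back,
    with each function's literal pool placed directly after its code.

    Returns (code_lines, data_lines, lines_in_both) as cache line counts.
    """
    code, data = set(), set()
    addr = 0
    for n_insns, n_lits in functions:
        for _ in range(n_insns):
            code.add(addr // LINE)   # line touched by instruction fetch
            addr += WORD
        for _ in range(n_lits):
            data.add(addr // LINE)   # line touched by the PC-relative load
            addr += WORD
    return len(code), len(data), len(code & data)

# Made-up function sizes: (instructions, literal words)
print(shared_lines([(10, 2), (7, 1), (20, 3)]))
```

Every line counted in the third number would end up in both caches if both the code and its literals are touched, which is the "cruft" in question.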
- Tim |
|