Real-time determinism (Was: 16K pentium level one cache)
Ken Hagan
Posted: Wed Dec 01, 2004 9:22 pm    Post subject: Real-time determinism (Was: 16K pentium level one cache)

Terje Mathisen wrote:
Quote:
Bernd Paysan wrote:

Desktop CPUs follow the Linus Torvalds definition of "real time",
where a fast enough PC is by default "real time". Embedded systems
still follow a more scientific definition. And deterministic timing
is a prerequisite to that definition.

You can for instance lock all or part of the cache on some modern
cpus, to effectively gain back some of that determinism. OTOH all
cpus are 'real time', it is only that by designing for their worst-case
timing specs you give away one to three orders of magnitude in
processing power. :-(

If my program and data were strictly bounded in memory usage (and
any other resources that might matter), would this still be true?
For example, if I have a program (and let's assume it is the OS if
that makes people happy) that avoids virtual memory, how bad is my
worst case timing? (3 orders of magnitude sounds pessimistic.)

For another example, if that program spends some time in a tight
loop bashing a small amount of data, can I use the L1 or L2 cache
timings rather than main memory timings to work out the worst case
timing for that section? (Or do cache architectures make it "hard"
to guarantee that given areas of memory will "eventually" reside
in the cache.)

If not, what would have to change before I could do so and is that
a change that would be worth making? (Who'd lose? Who'd win?)

Terje Mathisen
Posted: Fri Dec 03, 2004 12:30 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

Ken Hagan wrote:

Quote:
Terje Mathisen wrote:

Bernd Paysan wrote:

Desktop CPUs follow the Linus Torvalds definition of "real time",
where a fast enough PC is by default "real time". Embedded systems
still follow a more scientific definition. And deterministic timing
is a prerequisite to that definition.

You can for instance lock all or part of the cache on some modern
cpus, to effectively gain back some of that determinism. OTOH all
cpus are 'real time', it is only that by designing for their worst-case
timing specs you give away one to three orders of magnitude in
processing power. :-(


If my program and data were strictly bounded in memory usage (and
any other resources that might matter), would this still be true?
For example, if I have a program (and let's assume it is the OS if
that makes people happy) that avoids virtual memory, how bad is my
worst case timing? (3 orders of magnitude sounds pessimistic.)

Not by much:

~2 cycles at 3+ GHz is less than a nanosecond, vs. maybe 250 ns for a
main memory access (at least in larger SMP machines)?
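
As a rough check on those figures, the per-access gap can be worked out directly. This is only a back-of-the-envelope sketch: 3.0 GHz stands in for "3+ GHz", and the variable names are made up for illustration.

```python
import math

# Figures taken from the post above; 3.0 GHz stands in for "3+ GHz".
clock_hz = 3.0e9
l1_hit_cycles = 2
main_memory_ns = 250.0

l1_hit_ns = l1_hit_cycles / clock_hz * 1e9   # about 0.67 ns
ratio = main_memory_ns / l1_hit_ns           # about 375x
orders_of_magnitude = math.log10(ratio)      # about 2.6

print(f"L1 hit: {l1_hit_ns:.2f} ns, ratio: {ratio:.0f}x, "
      f"log10: {orders_of_magnitude:.2f}")
```

On these numbers a single access that misses all the way to main memory costs between two and three orders of magnitude more than an L1 hit. That is a per-access ratio, not a statement about a whole program.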
Quote:

For another example, if that program spends some time in a tight
loop bashing a small amount of data, can I use the L1 or L2 cache
timings rather than main memory timings to work out the worst case
timing for that section? (Or do cache architectures make it "hard"
to guarantee that given areas of memory will "eventually" reside
in the cache.)

Your only guarantee is if you do lock data inside the cache, or if you
can also turn off all sources of interrupts, so that said inner loop can
be guaranteed to run to completion each time, without anything else
disturbing its cache accesses.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Ken Hagan
Posted: Fri Dec 03, 2004 3:42 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

Terje Mathisen wrote:
Quote:

~2 cycles at 3+ GHz is less than a nanosecond, vs. maybe 250 ns for a
main memory access (at least in larger SMP machines)?

Really? 250ns? I thought 386 machines had 60 or 70ns RAMs even back
then. What am I missing?

Quote:
Your only guarantee is if you do lock data inside the cache, or if you
can also turn off all sources of interrupts, so that said inner loop
can be guaranteed to run to completion each time, without anything
else disturbing its cache accesses.

Let's assume I can do that:- doesn't a cache's ability to retain data
depend on the data addresses? If so, that would make analysis, er,
quite hard.

Let's assume I can't do that:- how do real-time systems deliver the
kind of predictability that Bernd was talking about?

I suppose my next thought is, is such determinism useful? After all,
the power might go down or the CPU might overheat, so even if your
algorithm is timed to perfection, your *system* still has only a
statistical guarantee of performance.

Nick Maclaren
Posted: Fri Dec 03, 2004 4:11 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

In article <copfv8$pl8$1$8302bc10@news.demon.co.uk>, "Ken Hagan" <K.Hagan@thermoteknix.co.uk> writes:
|> Terje Mathisen wrote:
|> >
|> > ~2 cycles at 3+ GHz is less than a nanosecond, vs. maybe 250 ns for a
|> > main memory access (at least in larger SMP machines)?
|>
|> Really? 250ns? I thought 386 machines had 60 or 70ns RAMs even back
|> then. What am I missing?

The fact that the RAM cycle time is only part of the total time
needed to read some data!


Regards,
Nick Maclaren.

Ken Hagan
Posted: Fri Dec 03, 2004 6:33 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

Nick Maclaren wrote:
Quote:

The fact that the RAM cycle time is only part of the total time
needed to read some data!

Granted, but at (say) http://www.sysopt.com/articles/latency/
I read that PC100 SDRAM takes about 50ns for the first transfer.
That's the sort of figure I've been carrying around in my head,
hence my surprise at Terje's numbers.

Is that the relevant figure for a bog standard PC? If so, what
is the breakdown of (additional) costs on "large SMP machines"?

Jouni Osmala
Posted: Fri Dec 03, 2004 7:39 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

Quote:
Terje Mathisen wrote:

~2 cycles at 3+ GHz is less than a nanosecond, vs. maybe 250 ns for a
main memory access (at least in larger SMP machines)?


Really? 250ns? I thought 386 machines had 60 or 70ns RAMs even back
then. What am I missing?


Your only guarantee is if you do lock data inside the cache, or if you
can also turn off all sources of interrupts, so that said inner loop
can be guaranteed to run to completion each time, without anything
else disturbing its cache accesses.


Let's assume I can do that:- doesn't a cache's ability to retain data
depend on the data addresses? If so, that would make analysis, er,
quite hard.

First, on an embedded system with enough registers, make sure that you
touch no more distinct address ranges than the cache has ways. If you
localize your memory accesses, then even random accesses land close to
each other. Take, for instance, a 16 KB cache with 4 ways and a 32-byte
line size. If you ensure that ALL your memory requests fall within 4
ranges of 4 KB each, you guarantee that each cache line needs to be
loaded only ONCE, since nothing already loaded gets evicted. Also make
certain that, if you use an ENTIRE 4 KB range, it starts at a 32-byte
aligned address. Otherwise the LAST line of the range uses the same
cache location as the first one, and you could start thrashing once the
other 3 ways are in use. Of course, without enough registers you would
have to spend one of the ways on spilling registers. With an 8-way
16 KB cache you have 8 x 2 KB ranges, and so on.

Thinking about a direct-mapped cache, I realized that with random access
you need to do a memory-to-memory copy at the beginning to limit
worst-case cache misses. For instance, if you loop several times over
data structures whose total size is less than the size of your cache,
the best worst-case cache behavior comes from copying the data
structures into a single contiguous block, looping through that, and
then copying it back. Don't all embedded programmers do that already?
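
The alignment point can be checked with a small model of the 16 KB, 4-way, 32-byte-line cache described above. This is a sketch under a stated assumption: the set index is simply the line number modulo the number of sets. The range addresses are hypothetical, and `max_lines_per_set` is a helper invented for the illustration.

```python
from collections import Counter

CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 32
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 128 sets
WAY_BYTES = NUM_SETS * LINE_BYTES               # 4 KB covered by one way


def max_lines_per_set(ranges):
    """Return the largest number of distinct lines that map to any one set.

    `ranges` is a list of (start_address, length_in_bytes) pairs.
    Assumes set index = line number modulo NUM_SETS.
    """
    lines = set()
    for start, length in ranges:
        first = start // LINE_BYTES
        last = (start + length - 1) // LINE_BYTES
        lines.update(range(first, last + 1))
    per_set = Counter(line % NUM_SETS for line in lines)
    return max(per_set.values())


# Four line-aligned 4 KB ranges at arbitrary (hypothetical) addresses.
aligned = [(0x10000, WAY_BYTES), (0x23000, WAY_BYTES),
           (0x47000, WAY_BYTES), (0x9A000, WAY_BYTES)]
# Same ranges, but the first one starts 8 bytes past a line boundary,
# so it straddles 129 lines instead of 128.
misaligned = [(0x10008, WAY_BYTES)] + aligned[1:]

print(max_lines_per_set(aligned))      # compare against WAYS
print(max_lines_per_set(misaligned))   # compare against WAYS
```

With the aligned ranges no set is asked to hold more lines than there are ways, so each line is loaded once. With the misaligned range one set needs one line more than the cache can hold, and that set can start evicting.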

Quote:
Let's assume I can't do that:- how do real-time systems deliver the
kind of predictability that Bernd was talking about?

I suppose my next thought is, is such determinism useful? After all,
the power might go down or the CPU might overheat, so even if your
algorithm is timed to perfection, your *system* still has only a
statistical guarantee of performance.

Well, it all depends on how important it is that the machine containing
the embedded chip works properly. For instance, properly designed
systems don't overheat as long as the ambient temperature is within
spec. Also, in bigger systems, when the power is out it's probably out
for EVERY component. Now, if the failure of one component can make the
system inoperable and there are lots of components, you had better be
damn sure that each and every component works as it is supposed to,
because the probabilities multiply: if each part works with 99%
reliability and you have 70 components, the system fails about 50% of
the time. Also, if your system is medical equipment, or an aircraft, or
a radar, or a car, or ..., system failures kill people. In the best case
it's just a TV, and you have annoyed customers who don't buy your
products because one failed on them, and who give your products bad word
of mouth. IF the software is just one part of the equation, failures
kill, and it's expected to be used in large numbers, you'd better ensure
that it works with 99.99999999999% probability, and that any potential
reasons to fail are eliminated.
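
The 70-component figure is easy to verify. The sketch below assumes the components fail independently and that the system needs every one of them to work; the variable names are illustrative.

```python
# Assumes independent failures and a series system: the whole thing
# works only if every component works.
component_reliability = 0.99
component_count = 70

system_reliability = component_reliability ** component_count
print(f"{system_reliability:.3f}")   # roughly 0.495
```

So the system works a little under half the time, which matches the "about 50%" estimate above.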

Jouni Osmala
Helsinki University of Technology

Stephen Fuld
Posted: Fri Dec 03, 2004 9:34 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

"Ken Hagan" <K.Hagan@thermoteknix.co.uk> wrote in message
news:copfv8$pl8$1$8302bc10@news.demon.co.uk...

snip

Quote:
I suppose my next thought is, is such determinism useful?

For some applications, it is essential. As in literally "bet your life"
essential.

Quote:
After all,
the power might go down

Redundant power systems.

Quote:
or the CPU might overheat,

Detected by independent hardware and notification is given.

Quote:
so even if your
algorithm is timed to perfection, your *system* still has only a
statistical guarantee of performance.

Sure. And the whole thing might be hit by a giant meteor. But the goal is
to make failures *extremely* rare, and deterministic response times are one
tool to do that in some situations. See "Rate Monotonic Analysis".
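
For readers who have not met it, the simplest piece of rate monotonic analysis is the Liu and Layland utilization test, sketched below. It needs a worst-case execution time for every task, which is where deterministic timing comes in. The task set and function names are hypothetical.

```python
def rm_utilization_bound(n):
    """Liu and Layland bound for n periodic tasks with rate-monotonic
    priorities (shorter period = higher priority)."""
    return n * (2 ** (1.0 / n) - 1)


def passes_utilization_test(tasks):
    """`tasks` is a list of (worst_case_execution_time, period) pairs.

    Returns True if total utilization is within the bound. The test is
    sufficient but not necessary: a task set that fails it may still be
    schedulable and needs a more exact analysis.
    """
    utilization = sum(c / t for c, t in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


# Hypothetical task set, times in milliseconds.
tasks = [(1.0, 10.0), (2.0, 20.0), (6.0, 40.0)]
print(passes_utilization_test(tasks))
```

If the worst-case execution times fed into the test are only guesses, the guarantee that comes out is only a guess too.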

--
- Stephen Fuld
e-mail address disguised to prevent spam

Stefan Monnier
Posted: Fri Dec 03, 2004 9:41 pm    Post subject: Re: Real-time determinism

Quote:
Your only guarantee is if you do lock data inside the cache, or if you can
also turn off all sources of interrupts, so that said inner loop can be
guaranteed to run to completion each time, without anything else disturbing
its cache accesses.

And even so, you can get variations because of the state of the branch
predictor, etc...

As for how to deal with those things, other than using a processor without
caches, without branch prediction, without OOO: use worst-case analysis,
and to get better bounds on your worst-case analysis you can do things like
abstract interpretation, where you simulate the state of your cache(s) to try
to predict as many of the cache hits as possible.
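
To give a flavour of what simulating the cache state can look like, here is a minimal sketch of a "must" analysis for a single fully associative LRU set. It is a toy under stated assumptions (one set, LRU replacement, 4 lines), not how any particular tool works, and all names are made up.

```python
ASSOC = 4  # hypothetical: one fully associative LRU set with 4 lines


def access(state, block):
    """Update a must-cache state for an access to `block`.

    `state` maps block -> upper bound on its LRU age (0 = most recent).
    A block present in `state` is guaranteed to be in the cache.
    """
    old_age = state.get(block, ASSOC)   # ASSOC means "not known cached"
    new_state = {}
    for other, age in state.items():
        if other == block:
            continue
        # Blocks younger than the accessed one age by one step.
        new_age = age + 1 if age < old_age else age
        if new_age < ASSOC:             # otherwise it may have been evicted
            new_state[other] = new_age
    new_state[block] = 0
    return new_state


def join(a, b):
    """Merge states from two control-flow paths: keep only blocks
    guaranteed on both, with the larger (more pessimistic) age bound."""
    return {blk: max(a[blk], b[blk]) for blk in a.keys() & b.keys()}


def guaranteed_hit(state, block):
    return block in state


# Access A and B, then branch: one path touches C, the other touches D.
state = {}
for block in ["A", "B"]:
    state = access(state, block)
then_path = access(state, "C")
else_path = access(state, "D")
merged = join(then_path, else_path)

print(guaranteed_hit(merged, "A"))   # A survives on both paths
print(guaranteed_hit(merged, "C"))   # C is only loaded on one path
```

After the join, an access to A can be counted as a hit in the worst-case bound, while an access to C has to be treated as a possible miss.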


Stefan

Nick Maclaren
Posted: Fri Dec 03, 2004 10:16 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

In article <coppvq$fbi$1$8302bc10@news.demon.co.uk>,
Ken Hagan <K.Hagan@thermoteknix.co.uk> wrote:
Quote:
Nick Maclaren wrote:

The fact that the RAM cycle time is only part of the total time
needed to read some data!

Granted, but at (say) http://www.sysopt.com/articles/latency/
I read that PC100 SDRAM takes about 50ns for the first transfer.
That's the sort of figure I've been carrying around in my head,
hence my surprise at Terje's numbers.

Is that the relevant figure for a bog standard PC? If so, what
is the breakdown of (additional) costs on "large SMP machines"?

No.

On my Pentium II, it is taking about 100 ns to read a datum and
250 ns to write one. The data has to go through all of the cache
and other memory logic to get to the RAM. The reason that large
SMPs are slower is because the cache coherence logic is a LOT
more complicated and time-consuming.


Regards,
Nick Maclaren.

Bernd Paysan
Posted: Sat Dec 04, 2004 2:22 am    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

Nick Maclaren wrote:
Quote:
On my Pentium II, it is taking about 100 ns to read a datum and
250 ns to write one. The data has to go through all of the cache
and other memory logic to get to the RAM. The reason that large
SMPs are slower is because the cache coherence logic is a LOT
more complicated and time-consuming.

On my Athlon64, it takes about 75ns to read a datum from DRAM (not same
page), and I even had to write my own benchmark to find that value ;-).
The cache coherence protocol of the Opteron AFAIK would add about 25ns
to that, but since I don't have a dual or quad Opteron box at hand, I
can't check.

Someone asked how DSPs achieve both predictability and performance. The
answer is quite simple: by making things explicit. Most DSPs have
on-chip SRAM (except the early ones, which had it off-chip), which does
not work as cache, but as directly accessible fast SRAM. If you need to
access something slower (like DRAM), you'd write a copy routine that
reads in a chunk of data, and *then* you can rely on the timing of
DRAM, too. Few, if any, DSPs have branch predictors. Those with long
pipelines (TI C6x) have many delay slots. And "long pipeline" does mean
"more than 4 stages" here.

Writing programs for DSPs is a completely different programming style
from writing desktop applications.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Andrew Reilly
Posted: Sat Dec 04, 2004 4:52 am    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

On Fri, 03 Dec 2004 23:22:30 +0100, Bernd Paysan wrote:
Quote:
Someone asked how DSPs achieve both predictability and performance. The
answer is quite simple: by making things explicit. Most DSPs have
on-chip SRAM (except the early ones, which had it off-chip), which does
not work as cache, but as directly accessible fast SRAM. If you need to
access something slower (like DRAM), you'd write a copy routine that
reads in a chunk of data, and *then* you can rely on the timing of
DRAM, too.

DRAM timing isn't random, either. If things are explicit, then you know
what the timing will be. The company I work for has made a DSP-farm box
for many years with DRAM (EDO!) attached to each 20MHz 56002. The 'k2
only has 16 bit address registers, so access to the DRAM has to be through
paged address windows anyway, so we made that window match the DRAM
row/column structure. Software has to make one type of access to open a
row, but after that can do back-to-back column accesses within the row at
the same speed (0ws) as SRAM accesses. So knowing that constraint, you
naturally write code (and lay out data) to make as much use of each page
as possible. Fast *and* predictable. Just more effort.
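
The effect of laying out data to stay within an open row can be shown with a toy cost model. The row size and cycle counts below are invented for illustration and are not the figures for the 56002 box described above.

```python
ROW_WORDS = 256          # hypothetical words per DRAM row (page)
ROW_OPEN_CYCLES = 4      # hypothetical extra cost of opening a row
COLUMN_CYCLES = 1        # back-to-back column access within the open row


def access_cycles(addresses):
    """Total cycles for a sequence of word addresses, with one row open
    at a time."""
    cycles = 0
    open_row = None
    for addr in addresses:
        row = addr // ROW_WORDS
        if row != open_row:
            cycles += ROW_OPEN_CYCLES
            open_row = row
        cycles += COLUMN_CYCLES
    return cycles


# Walk two rows in order: only two row opens.
sequential = list(range(512))
# Touch the same 512 words, but ping-pong between the two rows.
alternating = [(i % 2) * ROW_WORDS + i // 2 for i in range(512)]

print(access_cycles(sequential), access_cycles(alternating))
```

Under this model the sequential walk costs 520 cycles and the ping-pong order costs 2560. Both totals are exactly predictable; only the layout decides which one you get.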

Quote:
Few, if any, DSPs have branch predictors. Those with long pipelines (TI
C6x) have many delay slots. And "long pipeline" does mean "more than 4
stages" here.

Makes for interesting code: the branch instruction itself has a five cycle
latency, and each of those cycles can issue eight instructions (it's
VLIW), so you have to fill about forty instruction slots to push it to
peak performance. The compiler is surprisingly good at doing that (for
DSP-style code, anyway.)

Quote:
Writing programs for DSPs is a completely different programming style
from writing desktop applications.

Probably, I've never really written any of the latter :-)

--
Andrew

David Wang
Posted: Sat Dec 04, 2004 4:36 pm    Post subject: Re: Real-time determinism

Ken Hagan <K.Hagan@thermoteknix.co.uk> wrote:
Quote:
Terje Mathisen wrote:

~2 cycles at 3+ GHz is less than a nanosecond, vs. maybe 250 ns for a
main memory access (at least in larger SMP machines)?

Really? 250ns? I thought 386 machines had 60 or 70ns RAMs even back
then. What am I missing?

DRAM devices store data in capacitors. Think of it as a tiny
analog value that needs to be "pumped up" to a digital value.
So DRAM access is usually a two step process. One, move the
tiny analog value to the sense amps (takes tRCD time). Two,
move the digital value from the sense amp in the DRAM devices
into the DRAM controller (takes tCAS time).

60 or 70ns in FPM DRAM devices is tRAC, which is tRCD + tCAS.
So that's the amount of time it takes for the DRAM controller
to begin to put the command on the command/address busses and
for the first chunks of the data to return from the DRAM devices.

You're missing the time it takes for the CPU to get the command
to the DRAM controller, then get the data from the DRAM
controller back into the CPU. That's most of the latency in
many systems.

For a big SMP box, there are a lot of levels of hierarchy between
a given CPU core and the DRAM controller, each level of hierarchy
adds to the latency.

For a small PC class box, 100ns is not a bad guess at unloaded
memory latency (unloaded meaning there aren't 8 different requests
already pending. If there are, your request will just have to
wait, and the queuing delay adds to your latency).

Modern DRAM devices take about 20ns for tRCD and about 20ns for tCAS.
AMD Opteron boxes can get to DRAM and back in something like
50~60ns, assuming the data is already in the sense amps, so no
tRCD component, just the time it takes to get an address to
the DRAM devices and for the DRAM devices to move some data
from the sense amps back into the memory controller. Intel 1P
PCs take roughly 80~90ns to do something similar, primarily
because the system controller is a separate piece of silicon,
and there's more time spent in moving command/data around.
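
The breakdown above can be put into numbers. This sketch uses the approximate figures from this post, takes the midpoint of "50~60ns", and makes one simplification that is an assumption rather than something stated here: that a closed row adds only tRCD, ignoring precharge and any queuing.

```python
# Approximate figures from the post above, in nanoseconds.
t_rcd = 20.0            # row into the sense amps
t_cas = 20.0            # sense amps to the memory controller
t_rac = t_rcd + t_cas   # DRAM-only time when the row is not yet open

# Open-row round trip on an Opteron (midpoint of "50~60ns").
opteron_open_row = 55.0
# Whatever is not tCAS is spent outside the DRAM device (an inference).
outside_dram = opteron_open_row - t_cas
# Simplification: a closed row only adds tRCD on top of the open-row case.
opteron_closed_row = opteron_open_row + t_rcd

print(t_rac, outside_dram, opteron_closed_row)
```

On these assumptions roughly 35 ns of the open-row latency is spent getting to and from the DRAM rather than inside it, and a closed-row access comes out near 75 ns. That is close to the different-page figure Bernd Paysan reported for his Athlon64 earlier in the thread, though the agreement may be coincidental.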



--
davewang202(at)yahoo(dot)com

Anton Ertl
Posted: Sat Dec 04, 2004 6:36 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

"Ken Hagan" <K.Hagan@thermoteknix.co.uk> writes:
Quote:
Granted, but at (say) http://www.sysopt.com/articles/latency/
I read that PC100 SDRAM takes about 50ns for the first transfer.
That's the sort of figure I've been carrying around in my head,
hence my surprise at Terje's numbers.

Is that the relevant figure for a bog standard PC?

No. There are big variations in memory latency between PCs. E.g.,
here are some data from <2004Jul24.204851@mips.complang.tuwien.ac.at>:

Mach.      1      2      3      4      5      6      7     8     9
....
16M    373.7  230.9  199.8  180.9  144.6  158.6  197.8  79.0  68.9

Machines 1-4 are single-processor Alphas (PCs in many respects), 5-9
are Intel- or AMD-based machines (7, 8 are dual-CPU machines). The
numbers are latency times in ns, for a different-page access. I think
Terje's number (250ns) is quite optimistic for a big SMP, but I have
not seen latency numbers for such machines posted lately.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Guest
Posted: Sat Dec 04, 2004 7:42 pm    Post subject: Re: Real-time determinism (Was: 16K pentium level one cache)

"Ken Hagan" <K.Hagan@thermoteknix.co.uk> writes:

Quote:
Terje Mathisen wrote:

~2 cycles at 3+ GHz is less than a nanosecond, vs. maybe 250 ns for
a main memory access (at least in larger SMP machines)?

Really? 250ns? I thought 386 machines had 60 or 70ns RAMs even back
then. What am I missing?

Re-read the spec sheet and take notice of what the 70ns applies to, and
remember that the number is with no load, or very close to it.

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.