| Author |
Message |
Sander Vesik
Guest
|
Posted:
Tue Nov 30, 2004 10:51 pm Post subject:
Re: 16K pentium level one cache |
|
|
karl_m@acm.org wrote:
| Quote: |
Anton Ertl wrote:
karl_m@acm.org writes:
Anton Ertl wrote:
karl_m@acm.org (karl malbrain) writes:
Does anyone know the organization of the Pentium level one cache?
8KB I + 8KB D write-back, IIRC 4-way set-associative.
It's probably 2-way set associative.
Do the 2 bits of set-associatiation come out of the 13 bits of
address?
Which 13 bits? Each way is addressed with bits 5-11 (bits 0-4 are
for
addressing within the line).
I'm interested in bank interleaving. Thanks, karl m
|
IIRC you can't (and wouln't want anyways) that on a P5
--
Sander
+++ Out of cheese error +++ |
|
| Back to top |
|
 |
Guest
|
Posted:
Wed Dec 01, 2004 4:21 am Post subject:
Re: 16K pentium level one cache |
|
|
Sander Vesik wrote:
| Quote: | karl_m@acm.org wrote:
Anton Ertl wrote:
karl_m@acm.org writes:
Anton Ertl wrote:
karl_m@acm.org (karl malbrain) writes:
Does anyone know the organization of the Pentium level one
cache?
8KB I + 8KB D write-back, IIRC 4-way set-associative.
It's probably 2-way set associative.
Do the 2 bits of set-associatiation come out of the 13 bits of
address?
Which 13 bits? Each way is addressed with bits 5-11 (bits 0-4
are
for
addressing within the line).
I'm interested in bank interleaving. Thanks, karl m
IIRC you can't (and wouln't want anyways) that on a P5
|
I'm trying to fit a 1K AES table into the L1 cache and access it in
randomly in constant time. I need to know the penalties for multiple
instruction access into the same bank and don't understand how the
hardware levels work together in this.
Thanks, karl m |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 01, 2004 1:06 pm Post subject:
Re: 16K pentium level one cache |
|
|
karl_m@acm.org wrote:
| Quote: | I'm trying to fit a 1K AES table into the L1 cache and access it in
randomly in constant time. I need to know the penalties for multiple
instruction access into the same bank and don't understand how the
hardware levels work together in this.
|
Karl, all modern cpus, and in this regard even the Pentium is somewhat
modern (:-(), make it effectively impossible to get totally
deterministic timing.
I am much more interested in what sort of min/average/median/max times
you are seeing!
The best you can do with a 1K table is to allocate it on a 1K address
boundary, and then just forget about it.
If you access it often enough, it will stay in L1 cache. If you access a
lot of other stuff, it might flush some of your table entries out of L1.
To get around this, you could even try to allocate the table as the firt
1K part of an aligned 4K block, and then place all other data items
outside that range, effectively giving away 25% of your memory space.
Two-way (or higher) associativity would allow you to place one more
table within the same (modulo 4K) address range, while still avoiding
conflicts.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
Bernd Paysan
Guest
|
Posted:
Wed Dec 01, 2004 2:47 pm Post subject:
Re: 16K pentium level one cache |
|
|
Terje Mathisen wrote:
| Quote: | Karl, all modern cpus, and in this regard even the Pentium is somewhat
modern (:-(), make it effectively impossible to get totally
deterministic timing.
|
AFAIK, quite a lot of DSPs still offer deterministic timing. The sources of
non-determinism, like caches and OOO execution go into some DSPs now, too,
but overall, a DSP is both a modern CPU, and deterministic.
Desktop CPUs follow the Linus Torvalds definition of "real time", where a
fast enough PC is by default "real time". Embedded systems still follow a
more scientific definition. And deterministic timing is a prerequisite to
that definition.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
|
| Back to top |
|
 |
Grumble
Guest
|
Posted:
Wed Dec 01, 2004 4:43 pm Post subject:
Re: 16K pentium level one cache |
|
|
karl_m@acm.org wrote:
| Quote: | Sorry. We're having trouble with Advanced Encryption Standard
implementations on the Pentium. We have a 1024 byte table that we need
to access in constant time without bank conflict stalls. What is the
size of the bank?
|
Since you are optimizing for the P5 micro-architecture, you could post
part of your code to comp.lang.asm.x86 and discuss it there.
--
Regards, Grumble |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 01, 2004 7:31 pm Post subject:
Re: 16K pentium level one cache |
|
|
Bernd Paysan wrote:
| Quote: | Terje Mathisen wrote:
Karl, all modern cpus, and in this regard even the Pentium is somewhat
modern (:-(), make it effectively impossible to get totally
deterministic timing.
AFAIK, quite a lot of DSPs still offer deterministic timing. The sources of
non-determinism, like caches and OOO execution go into some DSPs now, too,
but overall, a DSP is both a modern CPU, and deterministic.
|
Memory chips have somewhat variable access times as well, any kind of
shared bus is out, interrupt sources must be controlled etc, but you're
still more or less right, OK? :-)
| Quote: |
Desktop CPUs follow the Linus Torvalds definition of "real time", where a
fast enough PC is by default "real time". Embedded systems still follow a
more scientific definition. And deterministic timing is a prerequisite to
that definition.
|
You can for instance lock all or part of the cache on some modern cpus,
to effectively gain back some of that determinism. OTOH all cpus are
'real time', it is only that by designing for their worst-case timing
specs you give away one to three order of magnitude in processing power. :-(
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
karl malbrain
Guest
|
Posted:
Wed Dec 01, 2004 10:41 pm Post subject:
Re: 16K pentium level one cache |
|
|
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote in message news:<coju23$3oi$1@osl016lin.hda.hydro.com>...
| Quote: | karl_m@acm.org wrote:
I'm trying to fit a 1K AES table into the L1 cache and access it in
randomly in constant time. I need to know the penalties for multiple
instruction access into the same bank and don't understand how the
hardware levels work together in this.
Karl, all modern cpus, and in this regard even the Pentium is somewhat
modern (:-(), make it effectively impossible to get totally
deterministic timing.
|
Actually I just need to avoid DATA-DEPENDENT timings
| Quote: | I am much more interested in what sort of min/average/median/max times
you are seeing!
|
The range I'm working with is 4396375 to 4397755 cycles per 5K loops.
| Quote: | The best you can do with a 1K table is to allocate it on a 1K address
boundary, and then just forget about it.
|
I'm focusing on just the BANK STALLS from data dependent array
indexing.
| Quote: |
If you access it often enough, it will stay in L1 cache. If you access a
lot of other stuff, it might flush some of your table entries out of L1.
|
I'm not worried about cache flushes. They're random events in the AES
environment.
| Quote: | To get around this, you could even try to allocate the table as the firt
1K part of an aligned 4K block, and then place all other data items
outside that range, effectively giving away 25% of your memory space.
|
I'm not worried about giving away space. Just program organization
that avoids both thrashing and stalls.
| Quote: | Two-way (or higher) associativity would allow you to place one more
table within the same (modulo 4K) address range, while still avoiding
conflicts.
|
I'm not following this. Do you know what the bank size is on the
PII/III/IV caches and what the interrelation to the associtivity is?
karl m |
|
| Back to top |
|
 |
Jouni Osmala
Guest
|
Posted:
Thu Dec 02, 2004 4:20 am Post subject:
Re: 16K pentium level one cache |
|
|
| Quote: | Karl, all modern cpus, and in this regard even the Pentium is somewhat
modern (:-(), make it effectively impossible to get totally
deterministic timing.
AFAIK, quite a lot of DSPs still offer deterministic timing. The
sources of
non-determinism, like caches and OOO execution go into some DSPs now,
too,
but overall, a DSP is both a modern CPU, and deterministic.
Memory chips have somewhat variable access times as well, any kind of
shared bus is out, interrupt sources must be controlled etc, but you're
still more or less right, OK? :-)
Desktop CPUs follow the Linus Torvalds definition of "real time", where a
fast enough PC is by default "real time". Embedded systems still follow a
more scientific definition. And deterministic timing is a prerequisite to
that definition.
You can for instance lock all or part of the cache on some modern cpus,
to effectively gain back some of that determinism. OTOH all cpus are
'real time', it is only that by designing for their worst-case timing
specs you give away one to three order of magnitude in processing power.
:-(
Terje
|
Well if you disable interrupts when entering certain sections, and
design for worst case, your timings isn't that bad.
[Consider cache a deterministic device, for which you must take account
in your design.] Also branch branch predictor should be deterministic in
a loop or recursion after warming up when entering section.
Indeterminism comes from interrupts, which could be disabled for certain
SHORT period of time. And the states of caches between sections. And the
cache effects could be BETTER, than what you design for. For
determinism, if you ensure, that certain peaces of data DO NOT align in
cache or you have enough WAYS to deal with that data you have
deterministic device with the data. TLB effects maybe pronounced, but
its just a matter of ensuring that you maximum number of TLB misses
isn't too big for section of code, and having hardware pagewalker, so
you won't hit the caches badly.
On the other hand if you have lots of big sections where you disable
interrupts that couldn't be called a REAL TIME system .
Jouni Osmala |
|
| Back to top |
|
 |
Jan Vorbrüggen
Guest
|
Posted:
Thu Dec 02, 2004 12:45 pm Post subject:
Re: 16K pentium level one cache |
|
|
| Quote: | The range I'm working with is 4396375 to 4397755 cycles per 5K loops.
|
That is about 300 ppm variation - this seems perfectly deterministic to
me. I would expect that much variation from measurement error alone.
Jan |
|
| Back to top |
|
 |
Stephen Fuld
Guest
|
Posted:
Thu Dec 02, 2004 9:34 pm Post subject:
Re: 16K pentium level one cache |
|
|
"Bernd Paysan" <bernd.paysan@gmx.de> wrote in message
news:fjg082-n0l.ln1@miriam.mikron.de...
snip
| Quote: | Desktop CPUs follow the Linus Torvalds definition of "real time", where a
fast enough PC is by default "real time". Embedded systems still follow a
more scientific definition. And deterministic timing is a prerequisite to
that definition.
|
I would modify that to "some" Embedded systems. While there are many
embedded systems that do require such determinism, there are many that
don't. As an example, consider the embedded code for the system processor
of a two processor design disk drive. That is, the processor that does the
command processing, cache management, etc. but not the detailed device
handling. While there are timeouts for such things, they are large enough
that they are usually not relevant. That is, it is allowed to take several
milliseconds to interpret a SCSI command. But your product wouldn't be
competitive if it took that long. The performance of the product is a
competitive issue, not a deterministic "works or doesn't" issue. In these
cases, cache is generally worth while, as the performance advantages far
outweigh the possibility of occasional "long" operations. Lots of user
interface stuff in other embedded systems is like this. The interface would
be "sluggish", but it would still work. But having an overall faster
interface might be worth the "price" of occasional "slower" operations.
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|
| Back to top |
|
 |
Eric
Guest
|
Posted:
Thu Dec 02, 2004 9:51 pm Post subject:
Re: 16K pentium level one cache |
|
|
Jouni Osmala wrote:
| Quote: |
Well if you disable interrupts when entering certain sections, and
design for worst case, your timings isn't that bad.
snip
On the other hand if you have lots of big sections where you disable
interrupts that couldn't be called a REAL TIME system .
|
Not true. You appear to be confusing the term Real Time with
Event Driven. While there are lots of arguments as to what the
words "Real Time" mean, and terms like Soft and Hard Real time,
the best one I have heard for Hard Real Time is roughly:
"Must _provably_ accomplish certain tasks by certain times".
Lots of real time systems have no interrupts at all.
Polling loops and programmed IO are all they require.
In fact, in some articles I have read the authors contended
that systems with interrupts are NOT real time because of the
non determinism it introduces. (Not that I necessarily agree
with such sentiments, I just report it).
If you know what all the deadlines are, and you know how
long each code section takes, you can determine where to poll,
and prove that the machine will or will not work as designed.
The problem here is knowing how long a section of code takes.
If you use the worst case numbers then you might as well not
have spent those hundreds dollars buying a Pentium 4,
because you are effectively ignoring all the fancy bits,
and just bought a $5 embedded cpu instead.
Eric |
|
| Back to top |
|
 |
karl malbrain
Guest
|
Posted:
Thu Dec 02, 2004 11:36 pm Post subject:
Re: 16K pentium level one cache |
|
|
Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> wrote in message news:<317vhaF383gm9U1@individual.net>...
| Quote: | The range I'm working with is 4396375 to 4397755 cycles per 5K loops.
That is about 300 ppm variation - this seems perfectly deterministic to
me. I would expect that much variation from measurement error alone.
|
Actually, the variation reflects bank-conflict-stalls on the triple
issue Athlon. I can defeat it by inserting 2 or 3 NOPs between
cache-accesses to fence-off the possibility of stalling. Thanks, karl
m |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Fri Dec 03, 2004 12:05 pm Post subject:
Re: 16K pentium level one cache |
|
|
karl malbrain wrote:
| Quote: | Terje Mathisen <terje.mathisen@hda.hydro.com> wrote in message news:<coju23$3oi$1@osl016lin.hda.hydro.com>...
Karl, all modern cpus, and in this regard even the Pentium is somewhat
modern (:-(), make it effectively impossible to get totally
deterministic timing.
Actually I just need to avoid DATA-DEPENDENT timings
|
Ah!!!
You're trying to avoid all _repeatable_ forms of DATA-DEPENDENT timings,
since these could be used to determine the current key value, right?
Having really random variations would be OK, but anything that leaks
info is bad. :-(
(BTW, have you considered a (really) random length delay loop on the
exit of these function calls? Alternatively, time the current run, and
spin-loop until a minimum value is achieved. This would effectively
remove nearly all possibilities to do timing-related crypt-analysis!)
Re. your bank conflict problems: Simply rearrange your Pentium version
of the code to never do more than one L1 cache access/cycle. On the P5
this is quite easy to do, and you really shouldn't give away any
significant performance by doing it.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
David Wagner
Guest
|
Posted:
Fri Dec 03, 2004 6:23 pm Post subject:
Re: 16K pentium level one cache |
|
|
Terje Mathisen wrote:
| Quote: | (BTW, have you considered a (really) random length delay loop on the
exit of these function calls? Alternatively, time the current run, and
spin-loop until a minimum value is achieved. This would effectively
remove nearly all possibilities to do timing-related crypt-analysis!)
|
I'm skeptical. It seems unlikely to me that this would be enough.
You've just decreased the S/N ratio a bit, but an attacker can still
pick the signal out of the noise by just taking more observations and
averaging them. (The noise will start to average out, eventually
leaving only the signal.) |
|
| Back to top |
|
 |
David Wagner
Guest
|
Posted:
Fri Dec 03, 2004 7:01 pm Post subject:
Re: 16K pentium level one cache |
|
|
Terje Mathisen wrote:
| Quote: | A spin loop doing RDTSC would keep going until past a certain value,
reducing the maximum possible information left to be modulo the length
of said loop.
|
That's different from a random delay, which is what you proposed
previously. Slowing the cipher down to its worst-case execution time
should work to defend against timing attacks. However, for AES, this
is extremely expensive, as the worst-case execution time is much, much
higher than the average-case.
In other words, I remain skeptical that you can get security at a
reasonable performance overhead. |
|
| Back to top |
|
 |
|
|
|
|