| Author |
Message |
karl malbrain
Guest
|
Posted:
Wed Dec 01, 2004 1:29 am Post subject:
AES constant time table access |
|
|
For Reference: the 14 bit Pentium cache address breaks down:
13 12 11 10 09 08 07 06 05 04 03 02 01 00
|--- index value --------| |-bank-| |byte|
A one cycle conflict stall occurs when both the u and v pipes request
data from the same bank. Does this also hold true on the newer
Pentiums?
Thanks, karl m |
|
| Back to top |
|
 |
Grumble
Guest
|
|
| Back to top |
|
 |
Guest
|
Posted:
Wed Dec 01, 2004 9:57 pm Post subject:
Re: AES constant time table access |
|
|
Grumble wrote:
| Quote: | karl malbrain wrote:
For Reference: the 14 bit Pentium cache address breaks down:
14 bits? That would be the Pentium MMX, no?
|
Yes. Do you have the bit breakdowns for PII,III,IV?
(sorry, but I'm not approved for that newsgroup)
These manuals don't breakdown the LEVEL ONE CACHE bank bits. They do
say that the Pentium 2,3,4 processors each dispatch up to 3 micro-ops
per cycle, each of which could possibly encounter a bank conflict
stall, or perhaps two.
It seems to me that the assembly instruction stream should be able to
prevent bank stalls under the 2 or 3 dispatch limit.
Thanks, karl m |
|
| Back to top |
|
 |
Andreas Kaiser
Guest
|
Posted:
Sat Dec 04, 2004 2:22 am Post subject:
Re: AES constant time table access |
|
|
On 30 Nov 2004 12:29:46 -0800, karl_m@acm.org (karl malbrain) wrote:
| Quote: | A one cycle conflict stall occurs when both the u and v pipes request
data from the same bank. Does this also hold true on the newer
Pentiums?
|
PentiumMMX is similar. But Pentium-Pro/II/III/M/4 are all limited to
one load and one store per cycle, so while loads and stores may
interfere, there are no bank conflicts between loads.
But in case of the P4, I doubt that anyone outside the Intel design
team has enough knowledge of all the implementation aspects which may
affect deterministic vs. seemingly erratic behaviour of a program like
yours. It's just a bit too complex. |
|
| Back to top |
|
 |
Karl Malbrain
Guest
|
Posted:
Sat Dec 04, 2004 2:57 am Post subject:
Re: AES constant time table access |
|
|
Andreas Kaiser wrote:
| Quote: | On 30 Nov 2004 12:29:46 -0800, karl_m@acm.org (karl malbrain) wrote:
A one cycle conflict stall occurs when both the u and v pipes
request
data from the same bank. Does this also hold true on the newer
Pentiums?
PentiumMMX is similar. But Pentium-Pro/II/III/M/4 are all limited to
one load and one store per cycle, so while loads and stores may
interfere, there are no bank conflicts between loads.
|
Would this be true on the Athlon as well?
| Quote: | But in case of the P4, I doubt that anyone outside the Intel design
team has enough knowledge of all the implementation aspects which may
affect deterministic vs. seemingly erratic behaviour of a program
like
yours. It's just a bit too complex.
Do you know how the cache is organized on the P4? |
Thanks, karl m |
|
| Back to top |
|
 |
Andreas Kaiser
Guest
|
Posted:
Sat Dec 04, 2004 6:23 am Post subject:
Re: AES constant time table access |
|
|
On 3 Dec 2004 13:57:02 -0800, "Karl Malbrain" <karl_m@acm.org> wrote:
| Quote: | PentiumMMX is similar. But Pentium-Pro/II/III/M/4 are all limited to
one load and one store per cycle, so while loads and stores may
interfere, there are no bank conflicts between loads.
Would this be true on the Athlon as well?
|
Roughly comparable to P5. K7 and K8 have 8 banks of 8 bytes width.
There are 2 load/store cache ports operating in parallel. Strictly
in-order (on hits), so this part is fairly easy to manage. But the
back end of the second load/store queue (for stores and cache misses)
may interfere with the cache ports as well, when there are many
committed stores without enough idle cache ports to use. So in your
case, better avoid saturating the cache ports.
| Quote: | Do you know how the cache is organized on the P4?
|
First you have to distinguish between the latest generation (90nm
parts, Prescott, Noconah) and earlier (e.g. Northwood), as they are
significantly different.
I don't remember anything about banks. But all P4s are limited in
associativity. 4-way or 8-way on paper - however associativity ends at
A15/A19. So all ways in a set share the same value for A16/A20 and
above.
What makes things complicated is the amount of speculation. The P4
runs cache accesses out of order. When it later detects that a load
was done prior to a preceding store to the same address, it cleans up
the mess and replays at least the relevant instructions (could be
dozens of instructions). This likely is only one of the speculations
used in the P4 designs.
To add a few bits on processor complexity: don't expect the machines
to be exact, when erring on the safe side may save space and time,
statistically. Checks for load/store conflicts, self modifying code
etc, may show some mount of false hits. Maybe it slows down by 0.2%
due to false hits but speeds up by 2% due to less delays. |
|
| Back to top |
|
 |
karl malbrain
Guest
|
Posted:
Sun Dec 05, 2004 3:14 am Post subject:
Re: AES constant time table access |
|
|
Andreas Kaiser <A.Kaiser@gmx.net> wrote in message news:<ea12r05he90se4jb0chhoo3pmgtp5vllen@4ax.com>...
| Quote: | On 3 Dec 2004 13:57:02 -0800, "Karl Malbrain" <karl_m@acm.org> wrote:
PentiumMMX is similar. But Pentium-Pro/II/III/M/4 are all limited to
one load and one store per cycle, so while loads and stores may
interfere, there are no bank conflicts between loads.
Would this be true on the Athlon as well?
Roughly comparable to P5. K7 and K8 have 8 banks of 8 bytes width.
There are 2 load/store cache ports operating in parallel. Strictly
in-order (on hits), so this part is fairly easy to manage. But the
back end of the second load/store queue (for stores and cache misses)
may interfere with the cache ports as well, when there are many
committed stores without enough idle cache ports to use. So in your
case, better avoid saturating the cache ports.
|
On the K7/K8 the AES algorithm uses a maximum of two ports in the same
cpu cycle. Do you know how long the ports stay busy?
| Quote: | Do you know how the cache is organized on the P4?
First you have to distinguish between the latest generation (90nm
parts, Prescott, Noconah) and earlier (e.g. Northwood), as they are
significantly different.
I don't remember anything about banks. But all P4s are limited in
associativity. 4-way or 8-way on paper - however associativity ends at
A15/A19. So all ways in a set share the same value for A16/A20 and
above.
|
The 4K tables & ~128 byte data areas can be localized within 64K.
| Quote: | What makes things complicated is the amount of speculation. The P4
runs cache accesses out of order. When it later detects that a load
was done prior to a preceding store to the same address, it cleans up
the mess and replays at least the relevant instructions (could be
dozens of instructions). This likely is only one of the speculations
used in the P4 designs.
|
The AES code proceeds without branches from top to bottom. I don't
think speculation is a factor.
| Quote: | To add a few bits on processor complexity: don't expect the machines
to be exact, when erring on the safe side may save space and time,
statistically. Checks for load/store conflicts, self modifying code
etc, may show some mount of false hits. Maybe it slows down by 0.2%
due to false hits but speeds up by 2% due to less delays.
|
I've nailed down the problem on the Athlon to cache-bank-conflict
stalls which I removed by issuing NOPs. I haven't started
investigating emperically the Pentium, where I fear that thrashing the
cache over the working set comes into play.
Thanks, karl m |
|
| Back to top |
|
 |
Andreas Kaiser
Guest
|
Posted:
Sun Dec 05, 2004 4:47 am Post subject:
Re: AES constant time table access |
|
|
On 4 Dec 2004 14:14:58 -0800, karl_m@acm.org (karl malbrain) wrote:
| Quote: | On the K7/K8 the AES algorithm uses a maximum of two ports in the same
cpu cycle. Do you know how long the ports stay busy?
|
Once a store is committed, it needs a cache port for a single cycle.
How and when exactly this is done is one of the details you have to
ask AMDs chip designers for (but don't expect an answer).
For a detailed but somewhat speculative thus not necessarily correct
description of the K8 core, see
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html.
| Quote: | The 4K tables & ~128 byte data areas can be localized within 64K.
|
Since the L1 data cache is physically tagged, it is not safe to have a
64K data space in subsequent pages in linear address space. Unless you
run under plain real-mode DOS. You can get assciativity conflicts
between seemingly subsequent pages, because they need not be
subsequent in physical address space.
The operating system may or may not follow your intentions. OS page
coloring may help (if implemented), but probably not in a low memory
situation.
| Quote: | The AES code proceeds without branches from top to bottom. I don't
think speculation is a factor.
|
I wasn't talking about branch speculation. There's more. What I
described is speculation on load/store addresses being independant.
| Quote: | where I fear that thrashing the cache over the working set comes
into play.
|
And if it runs in hyperthreaded mode, you are completely lost anyway. |
|
| Back to top |
|
 |
karl malbrain
Guest
|
Posted:
Tue Dec 07, 2004 12:55 am Post subject:
Re: AES constant time table access |
|
|
"Andreas Kaiser" <A.Kaiser@gmx.net> wrote in message
news:bbh4r0d1sgoiinnfpb9sfac66diu9b4bhk@4ax.com...
| Quote: | On 4 Dec 2004 14:14:58 -0800, karl_m@acm.org (karl malbrain) wrote:
On the K7/K8 the AES algorithm uses a maximum of two ports in the same
cpu cycle. Do you know how long the ports stay busy?
Once a store is committed, it needs a cache port for a single cycle.
How and when exactly this is done is one of the details you have to
ask AMDs chip designers for (but don't expect an answer).
|
Sorry I haven't made this clearer. The core area of the AES algorithm use
only loads from constant a 4096 byte table, no stores. Only at the edge of
the MMX implementation is there a store of EAX/EBX/ECX/EDX back into the
cache. (BTW, does the 4 byte MMX register load avoid cache-bank-conflict
stalls?)
Sorry for not knowing what K8 means; I'll leave the Opteron out of this for
the time being.....
| Quote: | The 4K tables & ~128 byte data areas can be localized within 64K.
Since the L1 data cache is physically tagged, it is not safe to have a
64K data space in subsequent pages in linear address space. Unless you
run under plain real-mode DOS. You can get assciativity conflicts
between seemingly subsequent pages, because they need not be
subsequent in physical address space.
|
Yes, I can see this problem. What a mess.
| Quote: | The operating system may or may not follow your intentions. OS page
coloring may help (if implemented), but probably not in a low memory
situation.
The AES code proceeds without branches from top to bottom. I don't
think speculation is a factor.
I wasn't talking about branch speculation. There's more. What I
described is speculation on load/store addresses being independant.
|
.... into the cache. Again, a messy problem.
| Quote: | where I fear that thrashing the cache over the working set comes
into play.
And if it runs in hyperthreaded mode, you are completely lost anyway.
|
Arrgh. Perhaps the AES code is going to need to be run in the kernel with
the other "cpu" turned off. Is this possible?
Thanks, karl m |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Tue Dec 07, 2004 2:37 am Post subject:
Re: AES constant time table access |
|
|
karl malbrain wrote:
| Quote: | Arrgh. Perhaps the AES code is going to need to be run in the kernel with
the other "cpu" turned off. Is this possible?
|
One idea might be to make all memory accesses non-cached, or you could
waste a few registers and run a very slow opcode (i.e. DIV) in parallel
with each memory access, with a serializing opcode before and after.
This would make the code almost two orders of magnitude slower though.
A much more feasible method is to time the execution, using RDTSC before
and after, then subtract this from a 'benchmark' value, chosen to be (at
least nearly) worst case.
The difference is used to index into a unpredicted branch table
(possibly done using a synthetic RET without a preceeding CALL?), that
jumps into an array of NOPs at such an offset that the total time will
be constant.
Such a technique would survive easily in a hyperthreaded environment,
since the activities of an unrelated process would only mask any
surviving timing info.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
karl malbrain
Guest
|
Posted:
Tue Dec 07, 2004 2:53 am Post subject:
Re: AES constant time table access |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cp2jen$g4v$1@osl016lin.hda.hydro.com...
| Quote: | karl malbrain wrote:
Arrgh. Perhaps the AES code is going to need to be run in the kernel
with
the other "cpu" turned off. Is this possible?
One idea might be to make all memory accesses non-cached, or you could
waste a few registers and run a very slow opcode (i.e. DIV) in parallel
with each memory access, with a serializing opcode before and after.
This would make the code almost two orders of magnitude slower though.
|
And that's the current problem. We have versions of AES that a 256 byte
table, or even calculate the necessary table values from their GF-2
formulations, but they run 10x slower.
| Quote: | A much more feasible method is to time the execution, using RDTSC before
and after, then subtract this from a 'benchmark' value, chosen to be (at
least nearly) worst case.
|
Again, the worst case is 192 DRAM accesses in a function that currently
requires 500 cycles (after the 4K table is in L1 cache).
(...)
| Quote: | Such a technique would survive easily in a hyperthreaded environment,
since the activities of an unrelated process would only mask any
surviving timing info.
|
I tend to agree that the overall effect is randomization -- but these are
cryptographer folk who work with terms of 2^128 complexity factors. The
effect that causes the skepticism is the timing attack on an uninitialized
cache.
karl m |
|
| Back to top |
|
 |
Tom
Guest
|
Posted:
Tue Dec 07, 2004 4:49 am Post subject:
Re: AES constant time table access |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message news:cp2jen$g4v$1@osl016lin.hda.hydro.com...
| Quote: | karl malbrain wrote:
Arrgh. Perhaps the AES code is going to need to be run in the kernel with
the other "cpu" turned off. Is this possible?
One idea might be to make all memory accesses non-cached, or you could
waste a few registers and run a very slow opcode (i.e. DIV) in parallel
with each memory access, with a serializing opcode before and after.
This would make the code almost two orders of magnitude slower though.
A much more feasible method is to time the execution, using RDTSC before
and after, then subtract this from a 'benchmark' value, chosen to be (at
least nearly) worst case.
The difference is used to index into a unpredicted branch table
(possibly done using a synthetic RET without a preceeding CALL?), that
jumps into an array of NOPs at such an offset that the total time will
be constant.
|
Can't you just loop around RDTSC until you get the right answer?
| Quote: |
Such a technique would survive easily in a hyperthreaded environment,
since the activities of an unrelated process would only mask any
surviving timing info.
Terje
--
- <Terje.Mathisen@hda.hydro.com
"almost all programming can be viewed as an exercise in caching" |
|
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 08, 2004 1:24 am Post subject:
Re: AES constant time table access |
|
|
karl malbrain wrote:
| Quote: | I tend to agree that the overall effect is randomization -- but these are
cryptographer folk who work with terms of 2^128 complexity factors. The
effect that causes the skepticism is the timing attack on an uninitialized
cache.
|
That is the easiest part: Simply iterate over the 4K table before
starting, this would require just 64 load operations 64 bytes apart, and
make the timing constant
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
karl malbrain
Guest
|
Posted:
Wed Dec 08, 2004 1:31 am Post subject:
Re: AES constant time table access |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:cp53id$ucb$1@osl016lin.hda.hydro.com...
| Quote: | karl malbrain wrote:
I tend to agree that the overall effect is randomization -- but these
are
cryptographer folk who work with terms of 2^128 complexity factors. The
effect that causes the skepticism is the timing attack on an
uninitialized
cache.
That is the easiest part: Simply iterate over the 4K table before
starting, this would require just 64 load operations 64 bytes apart, and
make the timing constant
|
I don't see any problem with this -- it would be 128 loads 32 bytes apart to
support the pentium, and add 128 cycles to a 500 cycle function. karl m |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Dec 08, 2004 2:15 am Post subject:
Re: AES constant time table access |
|
|
Tom wrote:
| Quote: | "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message news:cp2jen$g4v$1@osl016lin.hda.hydro.com...
The difference is used to index into a unpredicted branch table
(possibly done using a synthetic RET without a preceeding CALL?), that
jumps into an array of NOPs at such an offset that the total time will
be constant.
Can't you just loop around RDTSC until you get the right answer?
|
That was my first suggestion, but since RDTSC is both long-latency (up
to 24 cycles) and non-serializing), it doesn not provide any way to get
cycle-accurate timing.
You can use it to get close to the desired value (i.e. within 30-50
cycles, then use the delay line code for the remander.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
|
|
|
|