P4: why 64K aliasing conflicts during RC4 cipher?
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
 FAQFAQ   MemberlistMemberlist     RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Google
 
Web castalk.com
P4: why 64K aliasing conflicts during RC4 cipher?

 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture
Author Message
Guest






Posted: Thu Dec 09, 2004 6:27 am    Post subject: P4: why 64K aliasing conflicts during RC4 cipher? Reply with quote

Playing with various implementations of the RC4 encryption I paid
attention to the very strange behavior of the P4 core: it encodes fast
when encoder's indices are kept in a byte array (with decent asm
implementation P4C 3.4GHz achieves about 530MB/s which probably makes
it the fastest RC4 encoding engine amongst currently available general
purpose processors) but becomes painfully slow when the indices are
hold in a 32b words.

For the purpose of investigations I simplified the problem down to the
bare bones (See the code below) and tested execution speed at various
widths (RC4_INT) and lengths (RC4_DATA_LEN) of the indices array. Here
are the results:

CPU Clocks per RC4_STEP as a function of the length and the width of
the indices array:
--------------------------------------------------------
RC4_DATA_LEN/RC4_INT uchar=8b short=16b long=32b
32 6.02 6.07 6.02
64 5.53 5.52 5.58
128 5.68 5.72 10.45
256 5.36 10.17 16.11
512 N/A 15.89 18.83
1024 N/A 18.76 20.14
--------------------------------------------------------

It's easy to see that the performance hit has nothing to do with x86
addressing modes, partial register accesses and so on - it's all about
the size of the indices array measured in bytes. The encoding speed
takes the hit when the size of the array exceeds 256 bytes. It
deteriorates quickly in 256B..1024B range then more slowly in the
1024B..4096B range. There is no point in measurements at longer arrays
because L1D cache misses will obscure the picture.

As the next step I fired up vTune but instead of the answers vTune
supplied more questions. The only major difference between quick and
slow variants was the frequency of the "64K Aliasing Conflicts"
event.
For RC4_INT= long=32b
--------------------------------------------------
RC4_DATA_LEN 64K Alias conflicts/Step
64 0
128 0.61
256 2.7
512 3.8
1024 4.45
--------------------------------------------------
IMO, these results make no sense at all! Why 64K Aliasing Conflict?
According to Intel's documentation 64K aliasing conflict is "the
data conflict condition applies to addresses having identical value in
bits 15:6". Since my whole data set is only 512 to 4096 bytes such
conflict should never happen - when two addresses have identical
value in bits 15:6 the also have identical values of bits 31:16.
What am I missing? Please, help.



Test routine: (email me for a complete source of the test program)

/*--------------*/

#define RC4_MSK ((RC4_DATA_LEN)-1)
#define RC4_STEP(n) \
{ \
unsigned long sx, sy; \
sx = indices[x+(n)]; \
y = (y + sx) & RC4_MSK; \
sy = indices[y]; \
indices[y] = sx; \
indices[x+(n)] = sy; \
}

void RunTest(RC4_INT *indices, int nPages)
{
int y = 0;
if (nPages) do
{
int x;
for (x = 0; x < RC4_DATA_LEN; x += 8)
{
RC4_STEP(0)
RC4_STEP(1)
RC4_STEP(2)
RC4_STEP(3)
RC4_STEP(4)
RC4_STEP(5)
RC4_STEP(6)
RC4_STEP(7)
}
} while (--nPages);
}
/*--------------*/
Back to top
Terje Mathisen
Guest





Posted: Thu Dec 09, 2004 1:59 pm    Post subject: Re: P4: why 64K aliasing conflicts during RC4 cipher? Reply with quote

already5chosen@yahoo.com wrote:
Quote:
CPU Clocks per RC4_STEP as a function of the length and the width of
the indices array:
--------------------------------------------------------
RC4_DATA_LEN/RC4_INT uchar=8b short=16b long=32b
32 6.02 6.07 6.02
64 5.53 5.52 5.58
128 5.68 5.72 10.45
256 5.36 10.17 16.11
512 N/A 15.89 18.83
1024 N/A 18.76 20.14
--------------------------------------------------------

It's easy to see that the performance hit has nothing to do with x86
addressing modes, partial register accesses and so on - it's all about
the size of the indices array measured in bytes. The encoding speed
takes the hit when the size of the array exceeds 256 bytes. It
deteriorates quickly in 256B..1024B range then more slowly in the
1024B..4096B range. There is no point in measurements at longer arrays
because L1D cache misses will obscure the picture.

As the next step I fired up vTune but instead of the answers vTune
supplied more questions. The only major difference between quick and
slow variants was the frequency of the "64K Aliasing Conflicts"
event.
For RC4_INT= long=32b
--------------------------------------------------
RC4_DATA_LEN 64K Alias conflicts/Step
64 0
128 0.61
256 2.7
512 3.8
1024 4.45
--------------------------------------------------
IMO, these results make no sense at all! Why 64K Aliasing Conflict?
According to Intel's documentation 64K aliasing conflict is "the
data conflict condition applies to addresses having identical value in
bits 15:6". Since my whole data set is only 512 to 4096 bytes such
conflict should never happen - when two addresses have identical
value in bits 15:6 the also have identical values of bits 31:16.
What am I missing? Please, help.

I would suggest that you look into some lower-level allocation routines,
to get physically contigous memory, or in this particular application,
make sure that the bottom of your array is aligned on a 4K boundary.

If you cross such a page boundary, the physical address of the next
block might be anywhere in memory, right?

If you've already tried this, then I'd have to look into possible
interactions with other (input/output) buffers used. :-(

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Guest






Posted: Fri Dec 10, 2004 4:02 am    Post subject: Re: P4: why 64K aliasing conflicts during RC4 cipher? Reply with quote

Quote:
If you cross such a page boundary, the physical address of the next
block might be anywhere in memory, right?

It's of course right that crossing page boundary leaves bits 31:16 of
the physical address anywhere but it can't cause 64K aliasing.
Remember, the alias is defined as the address that has identical value
in bits 15:6. Crossing page boundary can lead (with a small
probability) to the equality of bits 15:12 but bits 11:6 will remain
different.

Quote:
If you've already tried this,

I tried solely due to respect to your opinion. At 512 byte dataset
there is small but measurable difference of about 1.1 clock/step
between the best and the worst cases. At 1024 byte dataset the
difference is even smaller - about 0.25 clock/step. On the other sizes
there is no difference.


Quote:
then I'd have to look into possible interactions with other
(input/output) buffers used. :-(


My test routine doesn't access any other buffer. It's not a full cipher
- just a core running dry.
Back to top
Terje Mathisen
Guest





Posted: Fri Dec 10, 2004 4:06 am    Post subject: Re: P4: why 64K aliasing conflicts during RC4 cipher? Reply with quote

already5chosen@yahoo.com wrote:

Quote:
If you cross such a page boundary, the physical address of the next
block might be anywhere in memory, right?


It's of course right that crossing page boundary leaves bits 31:16 of
the physical address anywhere but it can't cause 64K aliasing.
Remember, the alias is defined as the address that has identical value
in bits 15:6. Crossing page boundary can lead (with a small
probability) to the equality of bits 15:12 but bits 11:6 will remain
different.


If you've already tried this,


I tried solely due to respect to your opinion. At 512 byte dataset
there is small but measurable difference of about 1.1 clock/step
between the best and the worst cases. At 1024 byte dataset the
difference is even smaller - about 0.25 clock/step. On the other sizes
there is no difference.

Thanks, this is interesting...
Quote:


then I'd have to look into possible interactions with other

(input/output) buffers used. :-(

My test routine doesn't access any other buffer. It's not a full cipher
- just a core running dry.

OK, would you please send me (my email is still valid after more than 10
years on Usenet. :-) a complete copy of your test program?

I'd love to try and figure out what's happening here!

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
John Dallman
Guest





Posted: Fri Dec 10, 2004 8:02 pm    Post subject: Re: P4: why 64K aliasing conflicts during RC4 cipher? Reply with quote

In article <1102555620.504911.309020@z14g2000cwz.googlegroups.com>,
already5chosen@yahoo.com () wrote:

Quote:
Playing with various implementations of the RC4 encryption I paid
attention to the very strange behavior of the P4 core: it encodes fast
when encoder's indices are kept in a byte array (with decent asm
implementation P4C 3.4GHz achieves about 530MB/s which probably makes
it the fastest RC4 encoding engine amongst currently available general
purpose processors) but becomes painfully slow when the indices are
hold in a 32b words.

Fascinating... I presume the 3.4GHz CPU, thus a Prescott, is the CPU on
which you're seeing this? Have you tried a Williamette (up to 2GHz) or a
Northwood (2.0 to 3.0GHz)? I'd expect very similar behaviour on them; if
there are differences, it might point to some quirk of the Prescott
caches.

Quote:
As the next step I fired up vTune but instead of the answers vTune
supplied more questions. The only major difference between quick and
slow variants was the frequency of the "64K Aliasing Conflicts"
event.

How are your flows of plaintext in and cipertext out arranged? Does this
vary with the table size? I'm wondering if the aliasing conflicts are
happening between the table and something else.

I've been unfortunate enough to have learned about 64K aliasing the hard
way, when one customer found a way to slow the product down by a factor of
about 100 by hitting it. Now, the performance hit for something that sets
off 64K aliasing is huge - 100+ clocks per instance - and if I understand
your table properly (I don't know anything about the internals of RC4)
you're only seeing about a factor of 3.5 slowdown. That seems to mean you
aren't actually getting aliases on a huge proportion of encodings. So I'm
wondering if your input or output buffers are involved.

---
John Dallman jgd@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"
Back to top
Guest






Posted: Sat Dec 11, 2004 10:20 pm    Post subject: Re: P4: why 64K aliasing conflicts during RC4 cipher? Reply with quote

John Dallman wrote:
Quote:
In article <1102555620.504911.309020@z14g2000cwz.googlegroups.com>,
already5chosen@yahoo.com () wrote:

Playing with various implementations of the RC4 encryption I paid
attention to the very strange behavior of the P4 core: it encodes
fast
when encoder's indices are kept in a byte array (with decent asm
implementation P4C 3.4GHz achieves about 530MB/s which probably
makes
it the fastest RC4 encoding engine amongst currently available
general
purpose processors) but becomes painfully slow when the indices are
hold in a 32b words.

Fascinating... I presume the 3.4GHz CPU, thus a Prescott, is the CPU
on
which you're seeing this?

No, it's P4C, i.e. Nortwood.
Prescott is slower. I measured 391MB/s on P4E/3.2GHz. Since the cipher
scales linearly with the clock frequency, the fastest Prescott at
3.6GHz will produce 440MB/s. That's not only much slower than the
fastest Northwood but also probably slower that Itanium2/1.6Ghz and
Athlon64/2.6Ghz.

Quote:
Have you tried a Williamette

Yes, I tried couple of versions on Williamette. The behavior was
similar to Northwood. Since I'm not interested in performance of
Williamette I didn't try to investigate any deeper.

Quote:
(up to 2GHz) or a
Northwood (2.0 to 3.0GHz)? I'd expect very similar behaviour on them;
if
there are differences, it might point to some quirk of the Prescott
caches.

As the next step I fired up vTune but instead of the answers vTune
supplied more questions. The only major difference between quick
and
slow variants was the frequency of the "64K Aliasing Conflicts"
event.

Actually Prescott shows smaller difference between various sizes of the
dataset but the trend remains the same.

Quote:

How are your flows of plaintext in and cipertext out arranged? Does
this
vary with the table size? I'm wondering if the aliasing conflicts are

happening between the table and something else.


For the purpose of investigation I composed the test that does "dry
run" - no input, no output, just shuffling of the indices. Look at the
code in the original post.
The only "something else" that presents is the code itself. However
according to the docs "64K aliasing" only applies to L1D.

Quote:
I've been unfortunate enough to have learned about 64K aliasing the
hard
way, when one customer found a way to slow the product down by a
factor of
about 100 by hitting it. Now, the performance hit for something that
sets
off 64K aliasing is huge - 100+ clocks per instance

100+ clocks impact doesn't sound right. That sort of the penalty can be
caused by other conflict - L2 set conflict. L2 set conflict is common
on Nortwood Celeron but is pretty rare on the full P4s.
The impact of 64K aliasing conflict varies from nothing to 20 clocks.

Quote:
- and if I understand
your table properly (I don't know anything about the internals of
RC4)
you're only seeing about a factor of 3.5 slowdown. That seems to mean
you
aren't actually getting aliases on a huge proportion of encodings. So
I'm
wondering if your input or output buffers are involved.


VTune suggests that the conflict does happen on a huge proportion of
encoding steps. However documentation says: "A memory reference causing
64K aliasing conflict can be counted more than once in this stat." So
it's possible you are right.

Quote:
---
John Dallman jgd@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"
Back to top
Eric Young
Guest





Posted: Thu Dec 16, 2004 7:09 pm    Post subject: Re: P4: why 64K aliasing conflicts during RC4 cipher? Reply with quote

already5chosen@yahoo.com wrote:

Quote:
No, it's P4C, i.e. Nortwood.
Prescott is slower. I measured 391MB/s on P4E/3.2GHz. Since the cipher
scales linearly with the clock frequency, the fastest Prescott at
3.6GHz will produce 440MB/s. That's not only much slower than the
fastest Northwood but also probably slower that Itanium2/1.6Ghz and
Athlon64/2.6Ghz.

As another data point,
on a pre-release 1.2ghz AMD64, I get 223e6 bytes/sec (4k arrays,
different input/output buffers, swapping for each encryption) in 64bit
mode and 202e6 in 32bit mode. That would scale on a 2.6ghz machine,
upto 483e6 in 64bit mode and 437e6 in 32bit mode.

So the prescott is basically the same as Athlon64/Opteron

eric (who does not actually have acces to production AMD64 hardware
right now :-(
Back to top
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture All times are GMT
Page 1 of 1

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




VoIP Electronics Powered by phpBB