Dual-Opteron performance variations
Anton Ertl
Guest





Posted: Mon Nov 07, 2005 2:49 pm    Post subject: Dual-Opteron performance variations

I recently ran a program twice on a Dual-Opteron-246 system, and got
pretty big performance variations:

  user time (s)      system time (s)       elapsed time
    run1      run2      run1     run2       run1      run2
   19.20     15.59      0.05     0.04    0:19.61   0:15.97  input5
  810.91    617.25     15.82    15.70   13:49.58  10:35.68  input6
45274.23  41021.24   3659.72  3550.51   13:44:00  12:25:49  input7
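
To put the variation in the table into relative terms, here is a small sketch that prints by how many percent run1 exceeded run2 in user time. The helper name `slowdown` is made up for illustration; the numbers are the user times from the table.

```shell
# Hypothetical helper: percentage by which run1 was slower than run2.
# $1 = run1 user seconds, $2 = run2 user seconds
slowdown() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

slowdown 19.20 15.59        # input5
slowdown 810.91 617.25      # input6
slowdown 45274.23 41021.24  # input7
```

By this measure the gap is roughly 23%, 31% and 10% for the three inputs.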

The OS used here is the Debian-distributed Linux-2.6.5-1.358 kernel.

The program is pretty memory-intensive, as shown by runs with papiex
on a Pentium 4 (unfortunately no perfctr support on the Dual-Opteron
box):

For input5:
555268765 PAPI_L1_DCM
372081033 PAPI_L2_DCM
101666592 PAPI_TLB_DM
8881064027 PAPI_TOT_INS
30219809 Real usecs
69263802002 Real cycles
30095260 Proc usecs
68978330764 Proc cycles

Yes, only 0.13 IPC (and yes, this is 30s on a 2.26GHz Pentium 4 vs. 15s-20s
on a 2GHz Opteron; I guess the integrated memory controller shines
here, especially compared to the i845E with a single channel of DDR266
RAM on the Pentium 4 box).
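
The IPC figure follows directly from the papiex counts above. A minimal sketch, assuming IPC is taken as PAPI_TOT_INS divided by the process cycle count (the helper name `ratio` is hypothetical):

```shell
# Hypothetical helper: print numerator / denominator with three decimals.
ratio() {
    LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.3f\n", n / d }'
}

# PAPI_TOT_INS / Proc cycles for input5 on the Pentium 4
ratio 8881064027 68978330764
```

Using "Real cycles" instead of "Proc cycles" changes the result only in the third decimal place.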

The theories I have to explain this are:

1) Different virtual-to-physical mappings result in different numbers
of conflict misses in the caches. However, for the L2 cache it seems
unlikely to cause that much difference in conflict misses, because the
Opteron L2 cache is 16-way set associative. For the L1 cache it seems
unlikely to cause that much time difference, because L1 cache misses
are cheap relative to the also very frequent L2 misses. Also, on the
Pentium 4 box the two runs produced very similar timings, and if it
was related to virtual memory, I would expect that the Pentium 4 is
also affected.

2) The NUMA-ness of Dual Opteron memory. Maybe on run1 there was not
much local memory available, or for some other reason non-local memory
was allocated, but on run2 local memory was available and used (input7
requires 1142MB, and the box has 1GB per CPU, so after input7 there
should be some free memory on both CPUs). If that is the reason, I am
surprised that the effect is this large; I would expect differences of
this magnitude from micro-benchmarks, not from application benchmarks.

Looking in the kernel configuration, it seems that it was set
up for NUMA use:

CONFIG_K8_NUMA=y
CONFIG_DISCONTIGMEM=y
CONFIG_NUMA=y
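
One way to check these options on another box is to grep the installed kernel's config file. This is only a sketch: the helper name `numa_options` is made up, and the `/boot/config-$(uname -r)` location is an assumption that holds on many Debian-style systems but not everywhere.

```shell
# Hypothetical helper: list the NUMA-related options from a kernel config file.
numa_options() {
    grep -E '^CONFIG_(K8_NUMA|DISCONTIGMEM|NUMA)=' "$1"
}

# Assumed location of the running kernel's config; adjust for your distribution.
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    numa_options "$cfg" || echo "no NUMA options set in $cfg"
else
    echo "no readable kernel config at $cfg"
fi
```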

If you want to play around with this stuff yourself, you can find the
input files and the i386 binary of burg at
<http://www.complang.tuwien.ac.at/anton/lazyburg/>. You can run it with

for i in 5 6 7; do
grep -v `head -$i burgmasks|tail -1` gforth9.burg|time ./burg >/dev/null
done

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Bernd Paysan
Guest





Posted: Wed Nov 09, 2005 4:01 am    Post subject: Re: Dual-Opteron performance variations

Anton Ertl wrote:

Quote:
I recently ran a program twice on a Dual-Opteron-246 system, and got
pretty big performance variations:

  user time (s)      system time (s)       elapsed time
    run1      run2      run1     run2       run1      run2
   19.20     15.59      0.05     0.04    0:19.61   0:15.97  input5
  810.91    617.25     15.82    15.70   13:49.58  10:35.68  input6
45274.23  41021.24   3659.72  3550.51   13:44:00  12:25:49  input7

The OS used here is the Debian-distributed Linux-2.6.5-1.358 kernel.

Tried it on the dual-Opteron-Server at work, same NUMA config, kernel 2.6.11
(SuSE 9.3), no big differences to see between run 1 and run 2 (only first
item).

I also let it run on a dual-core Athlon X2 3800+ (512k cache), and checked
how two runs in parallel affect each other.

A single run took 18.41user, the two runs in parallel took
19.84user/19.68user (speed set to 2GHz manually; otherwise the single run
would take ~25user, while the two parallel runs would still take ~20user.
It looks like cpufreqd won't clock up for a single thread while the
screensaver is running).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Anton Ertl
Guest





Posted: Sun Nov 13, 2005 5:15 pm    Post subject: Re: Dual-Opteron performance variations

Bernd Paysan <bernd.paysan@gmx.de> writes:
Quote:
Anton Ertl wrote:

I recently ran a program twice on a Dual-Opteron-246 system, and got
pretty big performance variations:

  user time (s)      system time (s)       elapsed time
    run1      run2      run1     run2       run1      run2
   19.20     15.59      0.05     0.04    0:19.61   0:15.97  input5
  810.91    617.25     15.82    15.70   13:49.58  10:35.68  input6
45274.23  41021.24   3659.72  3550.51   13:44:00  12:25:49  input7

The OS used here is the Debian-distributed Linux-2.6.5-1.358 kernel.

I have also let it run on a 2GHz Clawhammer (Athlon 64 with 1MB L2 and
Socket 754), i.e., something that's pretty close to a
single-Opteron-246 system, and multiple runs are pretty close to each
other (about 15.6s for input5 and 591s for input6), so the slower run1
is probably due to accessing memory on the other CPU, and for some
reason the NUMA code in the kernel had not worked very well for run1.

I got the suggestion to use numactl with a NUMA-enabled kernel to pin
the process and its memory to a CPU.
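
A rough sketch of what such pinning could look like. The wrapper `run_on_node` and its `DRY_RUN` switch are made up for illustration; `--cpunodebind` and `--membind` are the option names in current numactl releases (older releases spelled the first one `--cpubind`), and whether node 0 is the right choice depends on the machine.

```shell
# Hypothetical wrapper: run a command with its CPUs and memory bound to one NUMA node.
# $1 = node number, remaining arguments = command to run.
# With DRY_RUN=1 the numactl command line is printed instead of executed.
run_on_node() {
    node=$1
    shift
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo numactl --cpunodebind="$node" --membind="$node" "$@"
    else
        numactl --cpunodebind="$node" --membind="$node" "$@"
    fi
}

# Show what would be run for the burg binary on node 0.
DRY_RUN=1
run_on_node 0 ./burg
```

With `--membind`, allocations fail rather than spill over once the node is full, which matters for an input such as input7 that needs more memory than one node has.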

On the Clawhammer box I also have the perfctr patch, so I measured the
cache and TLB misses:

input5:
tsc 31200922771 #cycles
event 0x0041FF43 160106224 #data cache refill from system
event 0x00410041 273679176 #data cache misses
event 0x0041FF46 35203932 #L1 and L2 TLB misses
event 0x004100C0 8812143854 #retired instructions

input6:
tsc 1182124170472 #cycles
event 0x0041FF43 6425684338 #data cache refill from system
event 0x00410041 10045007885 #data cache misses
event 0x0041FF46 2678501333 #L1 and L2 TLB misses
event 0x004100C0 232212098041 #retired instructions

So, an L2 cache miss every 36 instructions (for input6); yes, highly
memory bound.
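
The instructions-per-miss figure can be reproduced from the counters above. This sketch takes "data cache refill from system" as the L2 miss count, which is the interpretation used in the post; the helper name `ins_per_miss` is hypothetical.

```shell
# Hypothetical helper: retired instructions per data cache refill from system.
# $1 = retired instructions, $2 = refills from system
ins_per_miss() {
    LC_ALL=C awk -v ins="$1" -v miss="$2" 'BEGIN { printf "%.1f\n", ins / miss }'
}

ins_per_miss 8812143854 160106224      # input5
ins_per_miss 232212098041 6425684338   # input6
```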

What is strange is that the Clawhammer apparently has fewer L1 data
cache misses (273M) for input5 than the Northwood has L2 cache misses
(372M). I guess I misunderstand something about the K8 performance
counters; e.g., I also get, from an input 5 run:

event 0x0041FF42 273845400 #DC refills from L2
event 0x00410041 273857999 #DC misses
event 0x0041FF44 434389631 #DC writebacks
event 0x0041FF43 160237773 #DC refills from system

Shouldn't the number of misses be the sum of the refills from L2 and
system, and be at least as large as the writebacks?
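
For reference, the arithmetic behind that question, using the four counts from the input5 run above (the variable names are only for illustration):

```shell
# Counter values copied from the input5 run above.
refills_l2=273845400
refills_sys=160237773
misses=273857999
writebacks=434389631

sum=$((refills_l2 + refills_sys))
echo "refills from L2 + system: $sum"
echo "sum - misses:             $((sum - misses))"
echo "sum - writebacks:         $((sum - writebacks))"
echo "refills from L2 - misses: $((refills_l2 - misses))"
```

The sum of the two refill counts exceeds the miss count by about 160M, whereas the refills from L2 alone differ from the miss count by only about 12.6k. This just restates the discrepancy; it does not explain it.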

Quote:
Tried it on the dual-Opteron-Server at work, same NUMA config, kernel 2.6.11
(SuSE 9.3), no big differences to see between run 1 and run 2 (only first
item).

I guess the NUMA support in the kernel has improved since 2.6.5.

Quote:
I also let it run on a dual-core Athlon X2 3800+ (512k cache), and checked
how two runs in parallel affect each other.

A single run took 18.41user,

Hmm, it looks like the larger L2 cache of the Clawhammer/Opteron helps
a little.

Quote:
the two runs in parallel took
19.84user/19.68user

So, despite all these cache misses, the memory controller is
apparently far from saturated. Cool.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
All times are GMT