| Author |
Message |
Randy Yates
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Seeking DSP/x86 Performance Comparisons |
|
|
Hi Folks,
Does anyone know where information on this topic may be
found? Specifically, I'm interested in comparing the TI 64x
family to the x86 family (Pentium IV, M, Xeon, etc.), especially
in terms of integer performance and multiply-accumulates.
--Randy |
|
|
 |
Vladimir Vassilevsky
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
Mark Borgerding wrote:
| Quote: |
It is good to develop in C using the floating point instead of hacking
integers in the assembler. The whole point in developing on PC is doing
it nice and easy. Unfortunately the C compilers for x86 can't really use
MMX and SSE. You have to do it in asm or you can use somebody else's
library
The newer gcc (and I think intel) C compilers allow you to use intrinsic
functions to operate on vectors of 4 floats at a time (or vectors of
8,16,32 bit integers for mmx)
The newest compilers even let you go one step further and use operators
like + - * ^ and so on rather than the intrinsics like _mm_add_ps
It is now quite possible to get simd-optimized code without doing a bit
of assembly.
|
The compiler by itself is not going to optimize your C code into SIMD.
Telling the compiler what to do at the low level is not much different from
using the assembler, because it is you who has to know the
hardware and who has to tell the compiler about the hardware.
Indeed, the latest ICC tries to use SSE by itself. However, it does a
really lousy job of it: no comparison with hand-written code.
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com |
|
|
 |
Vladimir Vassilevsky
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
Randy Yates wrote:
| Quote: | Hi Folks,
Does anyone know where information on this topic may be
found? Specifically, I'm interested in comparing the TI 64x
family to the x86 family (Pentium IV, M, Xeon, etc.), especially
in terms of integer performance and multiply-accumulates.
|
Randy,
Comparing an elephant and a whale is an interesting idea. I doubt if you
can find the same DSP benchmark for x86 and for a DSP since there is no
common ground for comparison.
Nevertheless a while ago I compared the speed of x86 and ADSP-21xx on
typical DSP operations (FIR/IIR filters). The code was hand optimized
for both CPUs. For the same clock rate, the P5 is about 3 times slower
than the DSP. The 486 is about 5 times slower.
It is worth mentioning that on P5+ the floating point is faster than
integer calculations. There are also things like MMX and SSE; however, you
have to use asm to make them efficient.
I would say for the same clock rate a general P5+ is 2-3 times slower
than a general DSP.
VLV |
|
|
 |
Andrew Reilly
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
On Fri, 16 Dec 2005 11:52:02 -0800, Randy Yates wrote:
| Quote: | Does anyone know where information on this topic may be
found? Specifically, I'm interested in comparing the TI 64x
family to the x86 family (Pentium IV, M, Xeon, etc.), especially
in terms of integer performance and multiply-accumulates.
|
There are a few comparison sites around, but not really much at the level
that you're looking at. For example:
http://www.eembc.org/benchmark/telecom.asp?APPL=TLC
Now, telecom applications are probably (haven't looked at the code for
these) dominated by 16-bit fixed-point operations, so the floating point
parts are kind of the wrong shape. There are some x86 parts there (old AMD
K6), and some fairly contemporary PowerPCs of the same sort of class (IBM
970FX aka G5), which show up in a few industrial embedded products.
Certainly the TI C64x parts seem to kill them here, although there seem to
be a very wide range of performances for the same part. Must need some
careful tuning to get good figures.
In my personal experience, straight C code on GCC -O3 -fast on my old
P3/500 is worth about 150 Motorola DSP MIPS: about the factor of three
that Vladimir mentioned. Dunno about more modern machines, but I would
expect them to be comparatively faster per clock, since they have more and
better pipelines, larger caches and faster off-chip busses.
Let us know what you find out when you run your own benchmarks.
Cheers,
--
Andrew |
|
|
 |
Vladimir Vassilevsky
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
Andrew Reilly wrote:
| Quote: | In my personal experience, straight C code on GCC -O3 -fast on my old
P3/500 is worth about 150 Motorola DSP MIPS: about the factor of three
that Vladimir mentioned. Dunno about more modern machines, but I would
expect them to be comparatively faster per clock, since they have more and
better pipelines, larger caches and faster off-chip busses.
|
Modern CPUs like P-3, P-4, etc. are not optimized for 16 bit and 8 bit
operations. The 32 bit int computation on P-4 runs almost three times
faster than 8 or 16 bit. I had difficulty believing that before I tried
the same code (int matrix multiply) with 8, 16 and 32 bit data.
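A minimal sketch of how such a width comparison might be set up. The original test code is not shown in this thread, so everything here is an assumption: a plain triple-loop multiply, a 32-bit accumulator for all three element widths, and the hypothetical names MAT_N, DEFINE_MATMUL and matmul_i8/i16/i32.

```c
#include <stddef.h>
#include <stdint.h>

#define MAT_N 64   /* hypothetical matrix dimension */

/* Generate the same row-major matrix multiply for a given element type T.
 * Products are always accumulated in 32 bits, so only the width of the
 * loads changes between the three variants. */
#define DEFINE_MATMUL(T, NAME)                                        \
    void NAME(const T *a, const T *b, int32_t *c)                     \
    {                                                                 \
        for (size_t i = 0; i < MAT_N; i++) {                          \
            for (size_t j = 0; j < MAT_N; j++) {                      \
                int32_t acc = 0;                                      \
                for (size_t k = 0; k < MAT_N; k++)                    \
                    acc += (int32_t)a[i * MAT_N + k] *                \
                           (int32_t)b[k * MAT_N + j];                 \
                c[i * MAT_N + j] = acc;                               \
            }                                                         \
        }                                                             \
    }

DEFINE_MATMUL(int8_t,  matmul_i8)
DEFINE_MATMUL(int16_t, matmul_i16)
DEFINE_MATMUL(int32_t, matmul_i32)
```

Each variant can then be timed over many repetitions (for example with clock() from time.h) on the same input values. The ratio between widths depends heavily on the CPU and the compiler flags, so treat any single result with caution.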
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com |
|
|
 |
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
Randy Yates wrote:
| Quote: | Hi Folks,
Does anyone know where information on this topic may be
found? Specifically, I'm interested in comparing the TI 64x
family to the x86 family (Pentium IV, M, Xeon, etc.), especially
in terms of integer performance and multiply-accumulates.
|
Try http://www.bdti.com/. They have very thorough comparisons of DSP
performance for a wide range of processors. |
|
|
 |
Vladimir Vassilevsky
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
Andrew Reilly wrote:
| Quote: | Modern CPUs like P-3, P-4, etc. are not optimized for 16 bit and 8 bit
operations. The 32 bit int computation on P-4 runs almost three times
faster than 8 or 16 bit. I had difficulty believing that before I tried
the same code (int matrix multiply) with 8, 16 and 32 bit data.
I should clarify: I was thinking of floating point C code on the P3 vs
equivalent fixed point code on the DSP. Integer does go a bit slower, but
I expect that quite a bit of that is the paucity of registers.
If you need to do short integer work on a PC, isn't MMX the way it's done?
|
It is good to develop in C using the floating point instead of hacking
integers in the assembler. The whole point in developing on PC is doing
it nice and easy. Unfortunately the C compilers for x86 can't really use
MMX and SSE. You have to do it in asm or you can use somebody else's
library like this:
http://www.intel.com/cd/software/products/asmo-na/eng/238685.htm
BTW, the DSP performance benchmarks for x86 can be found here:
http://cache-www.intel.com/cd/00/00/21/93/219360_wp_ipp_benchmark.pdf
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com |
|
|
 |
Mark Borgerding
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
Vladimir Vassilevsky wrote:
| Quote: | It is good to develop in C using the floating point instead of hacking
integers in the assembler. The whole point in developing on PC is doing
it nice and easy. Unfortunately the C compilers for x86 can't really use
MMX and SSE. You have to do it in asm or you can use somebody else's
library
|
The newer gcc (and I think intel) C compilers allow you to use intrinsic
functions to operate on vectors of 4 floats at a time (or vectors of
8-, 16-, or 32-bit integers for MMX).
The newest compilers even let you go one step further and use operators
like + - * ^ and so on rather than intrinsics like _mm_add_ps.
It is now quite possible to get SIMD-optimized code without doing a bit
of assembly.
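For illustration, a minimal sketch of both styles, assuming GCC or Clang. The function names madd_intrinsics and madd_operators and the v4f typedef are made up for this example; _mm_loadu_ps, _mm_mul_ps, _mm_add_ps and _mm_storeu_ps are standard SSE intrinsics from xmmintrin.h. A scalar fallback is included so the file still builds where SSE is not enabled.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* GCC/Clang vector extension: four packed floats in 16 bytes. */
typedef float v4f __attribute__((vector_size(16)));

/* out[i] = a[i] * b[i] + c[i] using intrinsics; n must be a multiple of 4. */
void madd_intrinsics(float *out, const float *a, const float *b,
                     const float *c, size_t n)
{
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads keep it simple */
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
#else
    for (size_t i = 0; i < n; i++)         /* scalar fallback without SSE */
        out[i] = a[i] * b[i] + c[i];
#endif
}

/* Same computation with ordinary operators on the vector type;
 * n must be a multiple of 4. */
void madd_operators(float *out, const float *a, const float *b,
                    const float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        v4f va, vb, vc, vr;
        memcpy(&va, a + i, sizeof va);     /* memcpy avoids alignment issues */
        memcpy(&vb, b + i, sizeof vb);
        memcpy(&vc, c + i, sizeof vc);
        vr = va * vb + vc;                 /* element-wise multiply and add */
        memcpy(out + i, &vr, sizeof vr);
    }
}
```

Both functions compute the same element-wise multiply-add; the second leaves instruction selection to the compiler, so how good the generated code is still depends on the compiler and its flags.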
-- Mark Borgerding |
|
|
 |
Randy Yates
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
| Quote: | If you need to do short integer work on a PC, isn't MMX the way it's done?
|
Yes, I meant to include MMX-based performance in the comparison.
--Randy |
|
|
 |
Andrew Reilly
Guest
|
Posted:
Sat Dec 17, 2005 1:16 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
On Fri, 16 Dec 2005 22:12:50 +0000, Vladimir Vassilevsky wrote:
| Quote: | Andrew Reilly wrote:
In my personal experience, straight C code on GCC -O3 -fast on my old
P3/500 is worth about 150 Motorola DSP MIPS: about the factor of three
that Vladimir mentioned. Dunno about more modern machines, but I would
expect them to be comparatively faster per clock, since they have more and
better pipelines, larger caches and faster off-chip busses.
Modern CPUs like P-3, P-4, etc. are not optimized for 16 bit and 8 bit
operations. The 32 bit int computation on P-4 runs almost three times
faster than 8 or 16 bit. I had difficulty believing that before I tried
the same code (int matrix multiply) with 8, 16 and 32 bit data.
|
I should clarify: I was thinking of floating point C code on the P3 vs
equivalent fixed point code on the DSP. Integer does go a bit slower, but
I expect that quite a bit of that is the paucity of registers.
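To make the two styles being compared concrete, here is a rough sketch of the same tap sum written both ways. The function names fir_float and fir_q15 are hypothetical, the fixed-point version assumes Q15 data (16-bit values representing the range -1 to just under 1), and the 64-bit accumulator is only a stand-in for the guard bits a DSP accumulator would provide.

```c
#include <stddef.h>
#include <stdint.h>

/* Floating-point tap sum: the form that is easy to write in C on a PC. */
float fir_float(const float *x, const float *h, size_t ntaps)
{
    float acc = 0.0f;
    for (size_t k = 0; k < ntaps; k++)
        acc += h[k] * x[k];
    return acc;
}

/* Q15 fixed-point equivalent: 16x16 -> 32-bit products, wide accumulation,
 * then rounding and saturation back to 16 bits. */
int16_t fir_q15(const int16_t *x, const int16_t *h, size_t ntaps)
{
    int64_t acc = 0;
    for (size_t k = 0; k < ntaps; k++)
        acc += (int32_t)h[k] * (int32_t)x[k];

    /* Round to nearest and rescale from Q30 to Q15. This relies on an
     * arithmetic right shift for negative values, which GCC and Clang
     * provide. */
    acc = (acc + (1 << 14)) >> 15;

    if (acc > INT16_MAX) acc = INT16_MAX;   /* saturate instead of wrapping */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

The fixed-point version needs the extra widening, rounding and saturation steps that a DSP typically gets from its MAC hardware, which is part of why the same algorithm can look quite different on the two kinds of machine.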
If you need to do short integer work on a PC, isn't MMX the way it's done?
--
Andrew |
|
|
 |
Steve Underwood
Guest
|
Posted:
Sat Dec 17, 2005 7:42 am Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
abariska@student.ethz.ch wrote:
| Quote: | Randy Yates wrote:
Hi Folks,
Does anyone know where information on this topic may be
found? Specifically, I'm interested in comparing the TI 64x
family to the x86 family (Pentium IV, M, Xeon, etc.), especially
in terms of integer performance and multiply-accumulates.
Try http://www.bdti.com/. They have very thorough comparisons of DSP
performance for a wide range of processors.
|
I don't really trust their comparisons. I haven't looked at the details
of their benchmark code, but the results make the desktop processors
look like really hot performers. While they are certainly faster than most
DSP chips, the extent of their lead in the BDTI benchmarks doesn't seem
realistic.
You really need to compare integer DSPs against comparable floating
point code on an x86 processor. The x86 devices really don't do integer
processing very well, but similar code can really shine in floating
point. Just be sure to use
int_variable = rint(floating_variable);
when integerising your results. If you don't, the performance penalty of
implementing C89 or C99 compliant conversion can be severe.
If you use MMX the integer performance of an x86 can appear somewhat
better, but MMX is very poorly designed for general DSP. I think it was
narrowly designed around some form of video processing. For example, major
DSP applications like adaptive filtering are terrible in MMX code. There
are strict MMX data alignment requirements on the Intel chips. If you
misalign there are big time penalties. For a simple FIR you can keep 4
sets of coeffs - 1 correctly aligned for each possible relationship
between the sample and coefficient vectors. For an adaptive FIR that
usually works out rather badly, as you need to keep updating 4
coefficient vectors.
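A scalar model of that four-copy layout, as a sketch only. It assumes 16-bit samples, four lanes per 64-bit MMX word, and a filter of the form y[n] = sum of h[k] * x[n + k]; the names NTAPS, LANES, PADDED, build_shifted, fir_aligned, fir_direct and adapt_tap are all hypothetical. No MMX instructions are used here; real code would replace the inner loop with packed multiply-adds over the aligned blocks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NTAPS  6    /* hypothetical filter length */
#define LANES  4    /* 16-bit lanes in one 64-bit MMX word */
/* Room for up to LANES-1 leading zeros, rounded up to whole words. */
#define PADDED ((NTAPS + 2 * (LANES - 1)) / LANES * LANES)

/* Build LANES copies of h in hs (LANES * PADDED entries).
 * Copy r holds r leading zeros, then the taps, then zero padding. */
void build_shifted(const int16_t *h, int16_t *hs)
{
    memset(hs, 0, sizeof(int16_t) * LANES * PADDED);
    for (size_t r = 0; r < LANES; r++)
        for (size_t k = 0; k < NTAPS; k++)
            hs[r * PADDED + r + k] = h[k];
}

/* Output n, reading x only in blocks that start at a multiple of LANES.
 * x must be readable up to index (n - n % LANES) + PADDED - 1. */
int32_t fir_aligned(const int16_t *x, size_t n, const int16_t *hs)
{
    size_t r = n % LANES;              /* misalignment of this output */
    const int16_t *blk = x + (n - r);  /* aligned block start */
    const int16_t *hr = hs + r * PADDED;
    int32_t acc = 0;
    for (size_t j = 0; j < PADDED; j++)
        acc += (int32_t)hr[j] * blk[j];
    return acc;
}

/* Straightforward reference: y[n] = sum over k of h[k] * x[n + k]. */
int32_t fir_direct(const int16_t *x, size_t n, const int16_t *h)
{
    int32_t acc = 0;
    for (size_t k = 0; k < NTAPS; k++)
        acc += (int32_t)h[k] * x[n + k];
    return acc;
}

/* Adaptive update of tap k: every shifted copy has to be rewritten. */
void adapt_tap(int16_t *hs, size_t k, int16_t value)
{
    for (size_t r = 0; r < LANES; r++)
        hs[r * PADDED + r + k] = value;
}
```

The adapt_tap loop is where the cost for adaptive filters shows up: each coefficient update turns into LANES stores instead of one.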
Compilers for the x86 can still be rather fragile in their optimisation.
Just this week I had a simple DSP loop slow down by a factor of 5 after
a small cosmetic change. A little more tweaking and it was back to its
original speed. AMD devices tend to be much easier to work with than
Intel ones. They tend to tolerate misalignments and scheduling issues in
a much more graceful way. Intel chips tend to need more code tweaking
for good performance.
Regards,
Steve |
|
|
 |
Guest
|
Posted:
Sun Dec 18, 2005 5:15 pm Post subject:
Re: Seeking DSP/x86 Performance Comparisons |
|
|
Steve Underwood wrote:
| Quote: | abariska@student.ethz.ch wrote:
Randy Yates wrote:
Hi Folks,
Does anyone know where information on this topic may be
found? Specifically, I'm interested in comparing the TI 64x
family to the x86 family (Pentium IV, M, Xeon, etc.), especially
in terms of integer performance and multiply-accumulates.
Try http://www.bdti.com/. They have very thorough comparisons of DSP
performance for a wide range of processors.
I don't really trust their comparisons. I haven't looked at the details
of their benchmark code, but the results make the desktop processors
look like really hot performers. While they are certainly faster than most
DSP chips, the extent of their lead in the BDTI benchmarks doesn't seem
realistic.
|
Not at all. Look, for example, at their floating-point processor
benchmark:
http://www.bdti.com/bdtimark/chip_float_scores.pdf
I'm surprised that a 1.4GHz PIII processor actually has an absolute
benchmark value lower than a 500 MHz TigerSHARC. I guess this shows
that the benchmark takes more into consideration than simple number
crunching power. If you read their article on benchmarking, this seems
to really be the case:
http://www.bdti.com/bdtimark/BDTImark2000.pdf
| Quote: |
You really need to compare integer DSPs against comparable floating
point code on an x86 processor. The x86 devices really don't do integer
processing very well, but similar code can really shine in floating
point. Just be sure to use
int_variable = rint(floating_variable);
when integerising your results. If you don't, the performance penalty of
implementing C89 or C99 compliant conversion can be severe.
If you use MMX the integer performance of an x86 can appear somewhat
better, but MMX is very poorly designed for general DSP. I think it was
narrowly designed around some form of video processing. For example, major
DSP applications like adaptive filtering are terrible in MMX code. There
are strict MMX data alignment requirements on the Intel chips. If you
misalign there are big time penalties. For a simple FIR you can keep 4
sets of coeffs - 1 correctly aligned for each possible relationship
between the sample and coefficient vectors. For an adaptive FIR that
usually works out rather badly, as you need to keep updating 4
coefficient vectors.
Compilers for the x86 can still be rather fragile in their optimisation.
Just this week I had a simple DSP loop slow down by a factor of 5 after
a small cosmetic change. A little more tweaking and it was back to its
original speed. AMD devices tend to be much easier to work with than
Intel ones. They tend to tolerate misalignments and scheduling issues in
a much more graceful way. Intel chips tend to need more code tweaking
for good performance.
|
Good info. Have you worked with PowerPCs as well?
Regards,
Andor |
|
|
 |
|
|
|
|