It is good to develop in C using floating point instead of hacking
integers in assembler. The whole point of developing on a PC is doing
it nice and easy. Unfortunately, the C compilers for x86 can't really use
MMX and SSE. You have to do it in asm, or you can use somebody else's
library.
The newer gcc (and I think Intel) C compilers allow you to use intrinsic
functions to operate on vectors of 4 floats at a time (or vectors of
8-, 16- or 32-bit integers for MMX).
The newest compilers even let you go one step further and use operators
like + - * ^ and so on rather than intrinsics like _mm_add_ps.
It is now quite possible to get SIMD-optimized code without writing a
bit of assembly.
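To make the two styles concrete, here is a minimal sketch of the same 4-floats-at-a-time add written both ways. The function names and the v4sf typedef are my own, chosen for illustration. The operator version relies on the GCC/Clang vector_size extension, and the intrinsic version is only compiled when the compiler defines __SSE__.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps */
#endif

/* GCC/Clang vector extension: four packed floats in one 16-byte value. */
typedef float v4sf __attribute__((vector_size(16)));

/* out[i] = a[i] + b[i], using the plain + operator on vector values. */
void add_floats_operator(float *out, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no scalar loop for a leftover tail */
    for (size_t i = 0; i < n; i += 4) {
        v4sf va, vb, vsum;
        memcpy(&va, a + i, sizeof va);  /* load without assuming alignment */
        memcpy(&vb, b + i, sizeof vb);
        vsum = va + vb;                 /* one packed add of four floats */
        memcpy(out + i, &vsum, sizeof vsum);
    }
}

#if defined(__SSE__)
/* The same loop written with explicit SSE intrinsics. */
void add_floats_intrinsic(float *out, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of four floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
#endif
```

Both versions should produce the same results. Which one generates better code depends on the compiler and its version, so it is worth checking the assembly output.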
Randy Yates wrote:
Hi Folks,
Does anyone know where information on this topic may be
found? Specifically, I'm interested in comparing the TI 64x
family to the x86 family (Pentium IV, M, Xeon, etc.), especially
in terms of integer performance and multiply-accumulates.
Andrew Reilly wrote:
In my personal experience, straight C code on GCC -O3 -fast on my old
P3/500 is worth about 150 Motorola DSP MIPS: about the factor of three
that Vladimir mentioned. Dunno about more modern machines, but I would
expect them to be comparatively faster per clock, since they have more and
better pipelines, larger caches and faster off-chip buses.
Modern CPUs like P-3, P-4, etc. are not optimized for 16-bit and 8-bit
operations. The 32-bit int computation on P-4 runs almost three times
faster than 8 or 16 bit. I had difficulty believing that before I tried
the same code (int matrix multiply) with 8-, 16- and 32-bit data.
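For anyone who wants to repeat that kind of experiment, a minimal sketch of "the same code at three data widths" might look like the following. The macro and function names are hypothetical, not the code that was actually measured. It only generates the three multiply routines. Timing them (for example with clock() around repeated calls on large matrices) is left out, and the results will depend heavily on the CPU and the compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generate an n x n integer matrix multiply (row-major, c = a * b) for
 * element type T, so the identical loop nest can be compared at 8, 16 and
 * 32 bits. Products are formed in int and narrowed back to T on each step. */
#define DEFINE_MATMUL(NAME, T)                                         \
    void NAME(T *c, const T *a, const T *b, size_t n)                  \
    {                                                                  \
        assert(c != a && c != b);  /* output must not alias inputs */  \
        for (size_t i = 0; i < n; i++)                                 \
            for (size_t j = 0; j < n; j++) {                           \
                T acc = 0;                                             \
                for (size_t k = 0; k < n; k++)                         \
                    acc = (T)(acc + a[i * n + k] * b[k * n + j]);      \
                c[i * n + j] = acc;                                    \
            }                                                          \
    }

DEFINE_MATMUL(matmul_i8, int8_t)
DEFINE_MATMUL(matmul_i16, int16_t)
DEFINE_MATMUL(matmul_i32, int32_t)
```

Keeping the accumulator in the narrow type is deliberate here, so that the three variants differ only in data width.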
I should clarify: I was thinking of floating point C code on the P3 vs
equivalent fixed point code on the DSP. Integer does go a bit slower, but
I expect that quite a bit of that is down to the paucity of registers.
If you need to do short integer work on a PC, isn't MMX the way it's done?
abariska@student.ethz.ch wrote:
Try http://www.bdti.com/. They have a very thorough comparison of DSP
performance for a wide range of processors.
I don't really trust their comparisons. I haven't looked at the details
of their benchmark code, but the results make the desktop processors
look like really hot performers. While they are certainly faster than
most DSP chips, the extent of their lead in the BDTI benchmarks doesn't
seem realistic.
You really need to compare integer DSPs against comparable floating
point code on an x86 processor. The x86 devices really don't do integer
processing very well, but similar code can really shine in floating
point. Just be sure to use
int_variable = rint(floating_variable);
when integerising your results. If you don't, the performance penalty of
implementing C89 or C99 compliant conversion can be severe.
If you use MMX the integer performance of an x86 can appear somewhat
better, but MMX is very poorly designed for general DSP. I think it was
narrowly designed around some form of video processing. For example, major
DSP applications like adaptive filtering are terrible in MMX code. There
are strict MMX data alignment requirements on the Intel chips. If you
misalign there are big time penalties. For a simple FIR you can keep 4
sets of coeffs - 1 correctly aligned for each possible relationship
between the sample and coefficient vectors. For an adaptive FIR that
usually works out rather badly, as you need to keep updating 4
coefficient vectors.
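The four-copies trick can be sketched in plain C as follows. This is only an illustration of the bookkeeping, not real MMX code: the inner loop over LANES stands in for one packed multiply-add, and NTAPS, the function names and the correlation-style indexing (y[p] = sum of h[k] * x[p + k]) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NTAPS  6   /* hypothetical filter length */
#define LANES  4   /* 16-bit lanes in one 64-bit MMX register */
/* Room for the taps delayed by up to LANES-1, rounded up to whole groups. */
#define PADDED (((NTAPS + LANES - 1) + LANES - 1) / LANES * LANES)

/* shifted[s] holds the coefficients delayed by s lanes, zero padded. */
void build_shifted_coeffs(int16_t shifted[LANES][PADDED], const int16_t h[NTAPS])
{
    for (size_t s = 0; s < LANES; s++) {
        memset(shifted[s], 0, PADDED * sizeof(int16_t));
        memcpy(&shifted[s][s], h, NTAPS * sizeof(int16_t));
    }
}

/* y[p] = sum over k of h[k] * x[p + k], reading x only in groups that start
 * on a multiple of LANES. The caller must make x readable up to index
 * (p - p % LANES) + PADDED - 1. */
int32_t fir_at(const int16_t *x, size_t p, int16_t shifted[LANES][PADDED])
{
    assert(x != NULL);
    size_t s = p % LANES;             /* misalignment of the sample window */
    const int16_t *xa = x + (p - s);  /* aligned start at or before p */
    const int16_t *c = shifted[s];    /* coefficient copy matching that offset */
    int32_t acc = 0;
    for (size_t j = 0; j < PADDED; j += LANES)  /* one packed multiply-add */
        for (size_t l = 0; l < LANES; l++)
            acc += (int32_t)c[j + l] * xa[j + l];
    return acc;
}

/* An adaptive filter has to apply every tap update to all LANES copies. */
void update_tap(int16_t shifted[LANES][PADDED], size_t k, int16_t delta)
{
    assert(k < NTAPS);
    for (size_t s = 0; s < LANES; s++)
        shifted[s][s + k] = (int16_t)(shifted[s][s + k] + delta);
}
```

update_tap() shows where the cost comes from: a fixed FIR builds the copies once, while an adaptive one pays for four writes per coefficient on every update.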
Compilers for the x86 can still be rather fragile in their optimisation.
Just this week I had a simple DSP loop slow down by a factor of 5 after
a small cosmetic change. A little more tweaking and it was back to its
original speed. AMD devices tend to be much easier to work with than
Intel ones. They tend to tolerate misalignments and scheduling issues in
a much more graceful way. Intel chips tend to need more code tweaking
for good performance.