| Author |
Message |
Guest
|
Posted:
Thu Nov 24, 2005 12:22 am Post subject:
Software Pipelining |
|
|
How can the following loop be pipelined?
int d = get_d();
int dmask = d >> 31;
for(int i = 0, j = 0; i < 128; )
{
d = d + C0 + (C1 & dmask);
i = i + C2;
j = j + (C3 & (~dmask));
dmask = d >> 31;
A[i + j] = C4;
}
C0, C1, C2, C3, C4 are constants. A is a unidimensional array.
All operations in this example excepting the shift have
latency/throughput of 1/1.
The shift has latency/throughput of 4/1.
You can choose any-issue architecture(please specify your
assumptions). |
|
| Back to top |
|
 |
Greg Lindahl
Guest
|
Posted:
Thu Nov 24, 2005 12:49 am Post subject:
Re: Software Pipelining |
|
|
| Quote: | How can the following loop be pipelined?
|
Via this easy method:
1) Receive homework question from professor
2) Post homework question to Usenet
3) Recieve answer
4) Turn in answer to professor
5) Profit!
-- greg |
|
| Back to top |
|
 |
Natarajan Krishnaswami
Guest
|
Posted:
Thu Nov 24, 2005 1:15 am Post subject:
Re: Software Pipelining |
|
|
On 2005-11-23, maytersft@yahoo.com <maytersft@yahoo.com> wrote:
| Quote: | How can the following loop be pipelined?
|
The easiest way to see this in general is via gcc. Compile and
generate listings both ways with
gcc -g -o foo.o -Wa,-ahls foo.c
and
gcc -g -o foo.o -Wa,-ahls -pipe foo.c
to see if any opportunities to exploit SW pipelining were present and
used.
By using a cross-compiler, you can see how this varies from
architecture to architecture. Note that in some cases it can be a
pessimization to perform SW pipelining, as dynamic microarchitectural
features like register renaming, OoO and SMT may do a better job of
extracting parallelism. Gcc is aware of this, and emits identical
code (refuses to pipeline) if it thinks that will be the case.
HTH,
N. |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Thu Nov 24, 2005 3:36 pm Post subject:
Re: Software Pipelining |
|
|
maytersft@yahoo.com wrote:
| Quote: | How can the following loop be pipelined?
int d = get_d();
int dmask = d >> 31;
for(int i = 0, j = 0; i < 128; )
{
d = d + C0 + (C1 & dmask);
i = i + C2;
j = j + (C3 & (~dmask));
dmask = d >> 31;
A[i + j] = C4;
}
C0, C1, C2, C3, C4 are constants. A is a unidimensional array.
All operations in this example excepting the shift have
latency/throughput of 1/1.
The shift has latency/throughput of 4/1.
You can choose any-issue architecture(please specify your
assumptions).
|
This is a trick assignment!
It quite obviously cannot be pipelined, due to the sequential dependencies.
Knowing the actual constant values, as well as the result of get_d()
_might_ make a small difference, but nothing you should worry about when
you turn in your homework!
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
kavi
Joined: 27 Jan 2005
Posts: 14
Location: Planet Earth
|
Posted:
Thu Nov 24, 2005 4:47 pm Post subject:
|
|
|
hi there
why is the shift operation so computationally intensive? For a typical RISC it doesn't make much sense.
Some VLIW compilers support software pipelining optimization, for example Trimaran (needs significant effort it get going with) and VEX (probably less effort).
-kavi _________________ -kavi a.k.a. "Uncle Noah" |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Nov 24, 2005 5:15 pm Post subject:
Re: Software Pipelining |
|
|
$ gcc --version
gcc (GCC) 3.4.4 (cygming special) (gdc 0.12, using dmd 0.125)
Copyright (C) 2004 Free Software Foundation, Inc.
'-ahls' is not supported as an argument. |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Nov 24, 2005 5:15 pm Post subject:
Re: Software Pipelining |
|
|
Neat!
I forgot to mention that the word length is 32 bits so,
although not standard C, d >> 31 actually has the semantics
we tend to assume. |
|
| Back to top |
|
 |
Ken Hagan
Guest
|
Posted:
Thu Nov 24, 2005 5:15 pm Post subject:
Re: Software Pipelining |
|
|
maytersft@yahoo.com wrote:
| Quote: | Neat!
I forgot to mention that the word length is 32 bits so,
although not standard C, d >> 31 actually has the semantics
we tend to assume.
|
OK, I'll have a more serious go. You say that all operations
have latency 1 except for the shift which is latency 4, but
you have nearly a dozen operations in each iteration with
only a single shift, the results of which are not needed until
the following iteration.
| Quote: | for(int i = 0, j = 0; i < 128; )
{
d = d + C0 + (C1 & dmask);
i = i + C2;
j = j + (C3 & (~dmask));
dmask = d >> 31;
A[i + j] = C4;
}
|
Why bother to pipeline anything? |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Thu Nov 24, 2005 5:15 pm Post subject:
Re: Software Pipelining |
|
|
maytersft@yahoo.com wrote:
| Quote: | How can the following loop be pipelined?
int d = get_d();
int dmask = d >> 31;
for(int i = 0, j = 0; i < 128; )
{
d = d + C0 + (C1 & dmask);
i = i + C2;
j = j + (C3 & (~dmask));
dmask = d >> 31;
A[i + j] = C4;
}
C0, C1, C2, C3, C4 are constants. A is a unidimensional array.
|
Since you've been getting lots of serious replies, I'll take another
look as well:
The most interesting optimization would seem to take most of the logic
out of the loop:
The dmask calculations simply propagate the sign of d into all bit
positions, effectively using it as a branchless replacement for the
following code:
int d = get_d();
int dm;
for(int i = C2, j = 0; i < 128+C2; i += C2 ) {
dm = d; // Remember previous value
d += (dm < 0)? C0 : C0 + C1;
j += (dm >= 0)? C3 : 0;
A[i + j] = C4;
}
We can split this into two parts, one for positive and one for negative
'd' values:
int i = 0, j = 0, ij = 0;
while (i < 128 ) {
while (d < 0) {
ij += C2;
i += C2;
d += C0+C1;
A[ij] = C4;
if (i >= 128) goto done;
}
while (d >= 0) {
ij += C2+C3;
i += C2;
d += C0;
j += C3;
A[ij] = C4;
if (i >= 128) goto done;
}
}
done:
The next step would be to determine on entry how many loop iterations to
run, and how many steps before the sign might/will change:
iterations = 128 / C2; // Static compiler calculation!
while (iterations) {
if (d < 0) {
while (iterations >= 4 && (-d >= (C0+C1)*4)) {
// Assume (C0+C1)*4 won't overflow!
A[ij+C2*1] = C4;
a[ij+C2*2] = C4;
a[ij+C2*3] = C4;
a[ij+C2*4] = C4;
ij += C2*4;
d += (C0+C1)*4;
iterations -= 4;
}
while (iterations && (d < 0)) {
A[ij+C2] = C4;
ij += C2;
d += C0+C1;
iterations--;
}
}
if (d >= 0) {
.... similar code for the positive case...
It should be quite obvious that the speed of shift instructions are
rather unimportant!
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
Ken Hagan
Guest
|
Posted:
Thu Nov 24, 2005 5:15 pm Post subject:
Re: Software Pipelining |
|
|
maytersft@yahoo.com wrote:
| Quote: | How can the following loop be pipelined?
int d = get_d();
int dmask = d >> 31;
for(int i = 0, j = 0; i < 128; )
{
d = d + C0 + (C1 & dmask);
i = i + C2;
j = j + (C3 & (~dmask));
dmask = d >> 31;
A[i + j] = C4;
}
C0, C1, C2, C3, C4 are constants. A is a unidimensional array.
All operations in this example excepting the shift have
latency/throughput of 1/1.
The shift has latency/throughput of 4/1.
You can choose any-issue architecture(please specify your
assumptions).
|
Since you haven't constrained the architecture, I'll assume the C
code is supposed to be portable. That being the case, "d>>31" is
undefined behaviour. (In practice, the shift count may be truncated
to 15, so we get the top bit of d propogated through the dmask, or
it may be taken seriously, in which case dmask=0.) That being the
case, I think we can replace the whole loop with...
memset(A, 0, sizeof(A)); |
|
| Back to top |
|
 |
John Whorfin
Guest
|
Posted:
Fri Nov 25, 2005 9:15 am Post subject:
Re: Software Pipelining |
|
|
Natarajan Krishnaswami:
Bravo. Well played sir. |
|
| Back to top |
|
 |
|
|
|
|