| Author |
Message |
Grumble
Guest
|
Posted:
Mon Aug 29, 2005 12:15 am Post subject:
Opteron's branch prediction |
|
|
Hello group,
AMD's Software Optimization Guide states:
"The 16384-entry global history bimodal counter (GHBC) table contains
2-bit saturating counters used to predict whether a conditional branch
is taken. The GHBC table is indexed using the outcome (taken or not
taken) of the last eight conditional branches and 4 bits of the branch
address. The GHBC table allows the processors to predict branch patterns
of up to eight branches."
I'm hopelessly confused.
8 bits (branch history) + 4 bits (branch address) = 12 bits
How can 12 bits provide an index into a 16384-entry table?
Where do the 2 missing bits come from? |
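To make the gap concrete, here is a minimal Python sketch of the arithmetic, together with a 2-bit saturating counter of the kind the guide mentions. The constant and function names are made up for illustration. Treating counter values 2 and 3 as "predict taken" is the usual textbook convention, assumed here rather than taken from the guide.

```python
TABLE_ENTRIES = 16384   # from the quoted guide
HISTORY_BITS = 8        # outcomes of the last eight conditional branches
ADDRESS_BITS = 4        # bits of the branch address

# A table with 2**n entries needs n index bits.
index_bits_needed = TABLE_ENTRIES.bit_length() - 1
index_bits_stated = HISTORY_BITS + ADDRESS_BITS
missing_bits = index_bits_needed - index_bits_stated


def update_counter(counter, taken):
    """Move a 2-bit counter (0..3) one step, saturating at both ends."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)


def predict_taken(counter):
    """Assumed convention: the upper half of the range predicts taken."""
    return counter >= 2


print(index_bits_needed, index_bits_stated, missing_bits)
```

So the stated 12 bits address only 4096 of the 16384 counters, which is exactly the factor of four being asked about.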
Bernd Paysan
Guest
|
Posted:
Mon Aug 29, 2005 12:15 am Post subject:
Re: Opteron's branch prediction |
|
|
Grumble wrote:
| Quote: | Hello group,
AMD's Software Optimization Guide states:
"The 16384-entry global history bimodal counter (GHBC) table contains
2-bit saturating counters used to predict whether a conditional branch
is taken. The GHBC table is indexed using the outcome (taken or not
taken) of the last eight conditional branches and 4 bits of the branch
address. The GHBC table allows the processors to predict branch patterns
of up to eight branches."
I'm hopelessly confused.
8 bits (branch history) + 4 bits (branch address) = 12 bits
How can 12 bits provide an index into a 16384-entry table?
Where do the 2 missing bits come from?
|
Look here:
http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
The four branch predictions indexed by the 12 bits refer to (up to) four
branches per 16-byte line.
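As a rough sketch of that reading (an assumption-laden model, not AMD's documented layout): 12 bits select one of 4096 rows, and each row holds four counters, one per branch slot in the 16-byte line. Which four address bits are used and how a branch is assigned to a slot are not stated in this thread, so `row_index`, `entry_index` and the bit positions below are hypothetical.

```python
ROWS = 1 << 12          # 8 history bits + 4 address bits
SLOTS_PER_ROW = 4       # up to four branches per 16-byte line (this reading)
TOTAL_COUNTERS = ROWS * SLOTS_PER_ROW


def row_index(history, branch_addr):
    """Combine 8 history bits with 4 address bits into a 12-bit row index.

    Assumption: the address bits are bits 4..7, i.e. the ones just above
    the offset inside a 16-byte line.
    """
    addr_bits = (branch_addr >> 4) & 0xF
    return ((history & 0xFF) << 4) | addr_bits


def entry_index(history, branch_addr, slot):
    """Flat index of one 2-bit counter; slot (0..3) picks the branch in the line."""
    return row_index(history, branch_addr) * SLOTS_PER_ROW + slot


print(TOTAL_COUNTERS)
```

Under these assumptions the slot number supplies the two bits that the history and address do not.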
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
Andi Kleen
Guest
|
Posted:
Mon Aug 29, 2005 3:14 pm Post subject:
Re: Opteron's branch prediction |
|
|
Bernd Paysan <bernd.paysan@gmx.de> writes:
| Quote: |
The four branch predictions indexed by the 12 bits refer to (up to) four
branches per 16-byte line.
|
Actually it can only predict up to 3 jumps per 16-byte line. gcc has
a special optimization pass that adds nops for this case. This makes
performance more stable, because you don't get big performance
variations from seemingly innocent changes in code alignment.
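The idea behind such padding can be sketched as follows. This is not gcc's actual algorithm, only a toy model: instructions are laid out back to back from address 0, a branch is attributed to the window containing its first byte (a simplification), and `pad_branches` is a hypothetical helper that pushes a fourth branch into the next 16-byte window.

```python
WINDOW = 16        # bytes per fetch window
MAX_BRANCHES = 3   # branches that can be predicted per window


def pad_branches(instrs):
    """instrs: list of (length_in_bytes, is_branch) in program order.

    Returns a list of (address, length, is_branch, pad_before), where
    pad_before is the number of filler bytes inserted before the instruction.
    """
    out = []
    addr = 0
    window = 0
    in_window = 0
    for length, is_branch in instrs:
        pad = 0
        if is_branch:
            w = addr // WINDOW
            if w != window:
                # First branch seen in a new window: restart the count.
                window, in_window = w, 0
            if in_window == MAX_BRANCHES:
                # A fourth branch would not fit: pad to the next window.
                pad = (window + 1) * WINDOW - addr
                addr += pad
                window, in_window = addr // WINDOW, 0
            in_window += 1
        out.append((addr, length, is_branch, pad))
        addr += length
    return out


# Hypothetical sequence: call rel32 (5 bytes), jcc rel8 (2), call rel32 (5), jcc rel8 (2)
layout = pad_branches([(5, True), (2, True), (5, True), (2, True)])
print(layout[-1])
```

In this example the fourth branch would start at offset 12, so four filler bytes move it to offset 16.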
-Andi |
Grumble
Guest
|
Posted:
Fri Sep 02, 2005 12:15 am Post subject:
Re: Opteron's branch prediction |
|
|
Bernd Paysan wrote:
| Quote: | Grumble wrote:
Hello group,
AMD's Software Optimization Guide states:
"The 16384-entry global history bimodal counter (GHBC) table contains
2-bit saturating counters used to predict whether a conditional branch
is taken. The GHBC table is indexed using the outcome (taken or not
taken) of the last eight conditional branches and 4 bits of the branch
address. The GHBC table allows the processors to predict branch patterns
of up to eight branches."
I'm hopelessly confused.
8 bits (branch history) + 4 bits (branch address) = 12 bits
How can 12 bits provide an index into a 16384-entry table?
Where do the 2 missing bits come from?
Look here:
http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
|
Thanks for the pointer!
| Quote: | The four branch predictions indexed by the 12 bits refer to (up to) four
branches per 16-byte line.
|
Yet... paragraph 6.1 states:
"The Opteron processors have the capability to cache branch prediction
history for a maximum of three near branches (CALL, JMP, conditional
branches, or returns) per 16-byte fetch window. A branch instruction
that crosses a 16-byte boundary is counted in the second 16-byte window.
Due to architectural restrictions, a branch that is split across a
16-byte boundary cannot dispatch with any other instructions when it is
predicted taken. Perform this alignment by rearranging code; it is not
beneficial to align branches using padding sequences.
The following branches are limited to three per 16-byte window:
jcc rel8
jcc rel32
jmp rel8
jmp rel32
jmp reg
jmp WORD PTR
jmp DWORD PTR
call rel16
call r/m16
call rel32
call r/m32
Coding more than three branches in the same 16-byte code window may lead
to conflicts in the branch target buffer. To avoid conflicts in the
branch target buffer, space out branches such that three
or fewer exist in a given 16-byte code window. For absolute optimal
performance, try to limit branches to one per 16-byte code window. Avoid
code sequences like the following:
ALIGN 16
label3:
    call label1   ; 1st branch in 16-byte code window
    jc   label3   ; 2nd branch in 16-byte code window
    call label2   ; 3rd branch in 16-byte code window
    jnz  label4   ; 4th branch in 16-byte code window
                  ; Cannot be predicted.
If there is a jump table that contains many frequently executed
branches, pad the table entries to 8 bytes each to assure that there are
never more than three branches per 16-byte block of code. Only branches
that have been taken at least once are entered into the dynamic branch
prediction, and therefore only those branches count toward the
three-branch limit." |
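A small Python sketch of the counting rules in that paragraph, under stated assumptions: branches are given as (address, length) pairs, a branch that straddles a boundary is counted in the second window (modeled here by using the window of its last byte), and the jump table is a run of 2-byte `jmp rel8` instructions. The helper names are hypothetical, and the model ignores the guide's point that only branches taken at least once count.

```python
from collections import Counter

WINDOW = 16   # bytes per fetch window
LIMIT = 3     # branches tracked per window


def window_of(addr, length):
    """Window a branch is counted in: the one holding its last byte."""
    return (addr + length - 1) // WINDOW


def crowded_windows(branches):
    """Map window index -> branch count, for windows over the limit."""
    counts = Counter(window_of(addr, length) for addr, length in branches)
    return {w: n for w, n in counts.items() if n > LIMIT}


def jump_table(entries, stride, jmp_len=2):
    """Addresses of a table of jumps, one every `stride` bytes from address 0."""
    return [(i * stride, jmp_len) for i in range(entries)]


print(crowded_windows(jump_table(8, 2)))   # packed entries
print(crowded_windows(jump_table(8, 8)))   # entries padded to 8 bytes
```

With 8-byte entries at most two jumps start in any 16-byte block, which is why the padding advice keeps the table under the three-branch limit.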