Opteron's branch prediction

Grumble
Posted: Mon Aug 29, 2005 12:15 am    Post subject: Opteron's branch prediction

Hello group,

AMD's Software Optimization Guide states:

"The 16384-entry global history bimodal counter (GHBC) table contains
2-bit saturating counters used to predict whether a conditional branch
is taken. The GHBC table is indexed using the outcome (taken or not
taken) of the last eight conditional branches and 4 bits of the branch
address. The GHBC table allows the processors to predict branch patterns
of up to eight branches."
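
For readers unfamiliar with the terms in the quoted paragraph, here is a minimal sketch of a 2-bit saturating counter and an 8-bit history register. The state encoding (0 and 1 predict not taken, 2 and 3 predict taken) is the usual textbook convention and is an assumption here; the guide does not spell it out, and the function names are made up for illustration.

```python
# Sketch of a 2-bit saturating counter and an 8-bit global history register.
# Assumption: states 0-1 predict "not taken", states 2-3 predict "taken".

HISTORY_BITS = 8  # "the last eight conditional branches"


def predict(counter: int) -> bool:
    """Predict taken when the counter is in one of its two upper states."""
    return counter >= 2


def update_counter(counter: int, taken: bool) -> int:
    """Step toward 3 on a taken branch, toward 0 otherwise, saturating."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


def update_history(history: int, taken: bool) -> int:
    """Shift the newest outcome into the low bit, keeping eight outcomes."""
    return ((history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)
```

Because the counter saturates, a single surprise outcome does not flip a strongly biased prediction: it takes two in a row.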

I'm hopelessly confused.

8 bits (branch history) + 4 bits (branch address) = 12 bits

How can 12 bits provide an index into a 16384-entry table?

Where do the 2 missing bits come from?

Bernd Paysan
Posted: Mon Aug 29, 2005 12:15 am    Post subject: Re: Opteron's branch prediction

Grumble wrote:

Quote:
How can 12 bits provide an index into a 16384-entry table?

Where do the 2 missing bits come from?

Look here:

http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

The four branch predictions indexed by the 12 bits refer to (up to) four
branches per 16-byte line.
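
The arithmetic behind this can be sketched as follows: 12 index bits select one of 4096 groups, and four counters per group give 4096 x 4 = 16384 entries, which accounts for the two missing bits. The bit layout, the `slot` parameter, and the function names below are assumptions for illustration only; neither the guide nor this thread says which address bits are used or how the hardware picks a counter within a group.

```python
# Sketch of how 12 index bits can address a 16384-entry counter table.
# Assumed layout (hypothetical): [ 8 history bits | 4 address bits | 2-bit slot ]

HISTORY_BITS = 8   # outcomes of the last eight conditional branches
ADDRESS_BITS = 4   # "4 bits of the branch address"
SLOT_BITS = 2      # assumption: one of four counters per 16-byte line

GROUPS = 1 << (HISTORY_BITS + ADDRESS_BITS)  # groups selected by 12 bits
ENTRIES = GROUPS * (1 << SLOT_BITS)          # total 2-bit counters


def ghbc_group(history: int, address_bits: int) -> int:
    """Combine history and address bits into a 12-bit group number."""
    assert 0 <= history < (1 << HISTORY_BITS)
    assert 0 <= address_bits < (1 << ADDRESS_BITS)
    return (history << ADDRESS_BITS) | address_bits


def ghbc_index(history: int, address_bits: int, slot: int) -> int:
    """Extend the group number with a 2-bit slot to get a 14-bit index."""
    assert 0 <= slot < (1 << SLOT_BITS)
    return (ghbc_group(history, address_bits) << SLOT_BITS) | slot
```

With these constants, `GROUPS` is 4096 and `ENTRIES` is 16384, matching the table size quoted from the guide.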

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Andi Kleen
Posted: Mon Aug 29, 2005 3:14 pm    Post subject: Re: Opteron's branch prediction

Bernd Paysan <bernd.paysan@gmx.de> writes:
Quote:

The four branch predictions indexed by the 12 bits refer to (up to) four
branches per 16-byte line.

Actually it can only predict up to 3 jumps per 16-byte line. gcc has
a special optimization pass to add nops for this case. This makes
performance more stable because you don't get big performance
variations from seemingly innocent changes due to code alignment.

-Andi

Grumble
Posted: Fri Sep 02, 2005 12:15 am    Post subject: Re: Opteron's branch prediction

Bernd Paysan wrote:

Quote:
Grumble wrote:

How can 12 bits provide an index into a 16384-entry table?

Where do the 2 missing bits come from?

Look here:

http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

Thanks for the pointer!

Quote:
The four branch predictions indexed by the 12 bits refer to (up to) four
branches per 16-byte line.

Yet... paragraph 6.1 states:

"The Opteron processors have the capability to cache branch prediction
history for a maximum of three near branches (CALL, JMP, conditional
branches, or returns) per 16-byte fetch window. A branch instruction
that crosses a 16-byte boundary is counted in the second 16-byte window.
Due to architectural restrictions, a branch that is split across a
16-byte boundary cannot dispatch with any other instructions when it is
predicted taken. Perform this alignment by rearranging code; it is not
beneficial to align branches using padding sequences.

The following branches are limited to three per 16-byte window:

jcc rel8
jcc rel32
jmp rel8
jmp rel32
jmp reg
jmp WORD PTR
jmp DWORD PTR
call rel16
call r/m16
call rel32
call r/m32

Coding more than three branches in the same 16-byte code window may lead
to conflicts in the branch target buffer. To avoid conflicts in the
branch target buffer, space out branches such that three or fewer exist
in a given 16-byte code window. For absolute optimal performance, try to
limit branches to one per 16-byte code window. Avoid code sequences like
the following:

ALIGN 16
label3:
call label1 ; 1st branch in 16-byte code window
jc label3 ; 2nd branch in 16-byte code window
call label2 ; 3rd branch in 16-byte code window
jnz label4 ; 4th branch in 16-byte code window
; Cannot be predicted.

If there is a jump table that contains many frequently executed
branches, pad the table entries to 8 bytes each to assure that there are
never more than three branches per 16-byte block of code. Only branches
that have been taken at least once are entered into the dynamic branch
prediction, and therefore only those branches count toward the
three-branch limit."
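
The counting rule in the quoted section can be sketched as a small checker. It assumes the branch addresses and encoded lengths are already known (for example from a disassembly listing); the function names are hypothetical. It counts every branch, whereas the guide says only branches that have been taken at least once count toward the limit, so it is a conservative estimate.

```python
from collections import Counter

WINDOW = 16  # bytes per fetch window
LIMIT = 3    # branches per window, per the quoted section


def window_of(address: int, length: int) -> int:
    """Return the number of the 16-byte window a branch is counted in.

    A branch that crosses a 16-byte boundary is counted in the second
    window, i.e. the window that holds its last byte.
    """
    return (address + length - 1) // WINDOW


def crowded_windows(branches):
    """Map window start address -> branch count for windows over LIMIT.

    `branches` is an iterable of (address, length) pairs.
    """
    counts = Counter(window_of(addr, size) for addr, size in branches)
    return {w * WINDOW: n for w, n in counts.items() if n > LIMIT}


# The sequence from the guide, assuming 5-byte `call rel32` and 2-byte
# `jcc rel8` encodings, starting at a 16-byte-aligned address.
example = [(0, 5), (5, 2), (7, 5), (12, 2)]
```

Calling `crowded_windows(example)` flags the window at offset 0 with four branches. Jump-table entries padded to 8 bytes put at most two branches in any 16-byte block, so the same check reports nothing for them.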