Article: AMD's Mobile Strategy
By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), December 16, 2011 12:09 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton@gmail.com) on 12/15/11 wrote:
>
>Finding the start of instructions can be substantially
>simplified with predecode adding a continue/start bit for
>each byte. (Intel seems to manage even without this
>assistance.)
It's not *that* bad.
You can have an almost arbitrary amount of prefixes, but
you can do the "is this a prefix" check fairly cheaply and
in parallel for the whole line (or partial line, whatever
your I$ fetch width is). Then finding the "real" instruction
is just a matter of checking the result - a few bits.
At the same time you did the "is this a prefix" thing, you
do the "what does this prefix do" in parallel, and generate
a separate set of control bits for that - to be used by
the instruction decode.
And while the multi-byte instructions look complicated,
the first bytes can essentially be considered to be "prefix"
bytes too from an instruction decode standpoint - instead of
changing address size or register numbers or things like
that, it just modifies the instruction lookup itself.
So you can see pretty much all x86 instructions as being
a single byte - with the fairly simple prefixes.
Then, that byte does conditionally have extra stuff at the
end - usually a case of "modrm and/or immediate". But the
modrm stuff is actually very regular (and the immediate
is obviously not complicated). It's just that the regularity
is "odd" - but it's odd for humans, not so much for some
digital logic. It's the usual kind of "wires to move bits
around, with certain bits meaning certain things".
So scalar x86 decoding really isn't that hard. You would
definitely not need to bother with predecode for that. Just
look at the 8086 - it was 30k transistors total, and
while it did everything serially (ie one prefix byte a
cycle), you still have to realize how little that is. The
basic decoding really isn't hard.
And the patterns are all the same still. Sure, there are
new prefixes and the size of the constants has changed.
And yes, the REX bytes have register bit numbers in odd
new places, but again - that's the kind of thing that looks
complicated to a human, but spreading things out is just
"wires" to digital logic. So one of the control wires that
comes from the prefix decode logic ends up being used as the
high bit of a register number? Big deal. It's just an oddly
wired up line.
And the "instruction" byte isn't just a solid 8 bits either,
the same way the register bits aren't just three consecutive
bits any more. There's the mode bits, there's the bits that
come in from the prefix decode (ie the "multi-byte"
instructions), there's the modification bits that come from
the modrm byte etc. So instead of a simple 8-bit lookup,
the instruction lookup uses a few more bits. Not exactly
complex, just "details". The number of bits has grown
since the 8086, but there's a lot of the same structure.
And the new modes have changed the modrm bytes, but again,
you can see those as just having a perhaps 12-bit wide
modrm byte instead of the standard 8 bits - it's just that
the extra bits come from odd places (REX decode, CR0 mode
register etc) rather than from the "instruction byte"
itself.
So it's a lot of detail, and you would want to try
to share as much logic from older designs as possible rather
than try to create an all-new decoder every time. You have
all those tricks you learnt from last time.
You don't need to do it in one cycle either. Almost nobody
does instruction fetch/decode as a single cycle, and you
can do part of this "as you fetch" things from the L1 etc.
So doing a single x86 instruction is pretty easy. There are
dependencies, but most of them are simple, ie the number
of prefixes doesn't really change anything at all if you
are just willing to do the compares in parallel.
Doing multiple instructions in parallel is "more of the
same", except it's obviously quite a lot more.
I suspect the natural thing to do in some respects
is not actually to "decode X instructions", it's to some
degree much more natural to do "decode X bytes, and find
up to N instructions in those X bytes".
Because a fair amount of the work is about single bytes,
and for each byte doing "if this is a prefix, it means X,
if it's a modrm byte it means Y". Based on that output,
you can then use some fairly simple "scan this set of bits
to find where the instructions start", and by then you've
actually done most of the decoding too.
The fact that instructions aren't aligned means that in
order to get at least 16 bytes of instruction, a natural
layout would be two 16-byte aligned blocks. So you might
have a total of 32 bytes you are scanning at a time.
Or you might have some limited shifter at fetch time that
gets you "160 bits of instruction data, 4-byte aligned" or
whatever - the details really aren't all that important,
the point is just that you end up having some "block of
data" that you do parallel byte decoding on.
And no, I've only ever done decoding in software, but all
the things that were annoying about software decode, like
shifting and merging bits, and looking up tables etc are
all pretty simple for hw.
And I only did a single instruction at a time, obviously.
You could try to do parallel decode tricks with SSE, I
guess, but that sounds like seriously dubious value when
you then have to handle them serially in software anyway ;)
Linus