By: anon (spam.delete.delete@this.this.spam.com), May 11, 2017 10:28 am
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on May 10, 2017 9:34 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on May 9, 2017 6:53 am wrote:
> > Brett (ggtgp.delete@this.yahoo.com) on May 8, 2017 11:34 pm wrote:
> > > anon (spam.delete.delete@this.this.spam.com) on May 8, 2017 3:15 am wrote:
> > > > Brett (ggtgp.delete@this.yahoo.com) on May 8, 2017 2:38 am wrote:
> > > > > anon (spam.delete.delete@this.this.spam.com) on May 8, 2017 1:39 am wrote:
> > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 7, 2017 8:57 pm wrote:
> > > > > > > anon (spam.delete.delete@this.this.spam.com) on May 7, 2017 5:47 pm wrote:
> > > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 7, 2017 1:56 pm wrote:
> > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on May 6, 2017 4:41 pm wrote:
> > > > > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 6, 2017 3:16 pm wrote:
> > > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 21, 2017 1:19 pm wrote:
> > > > > > > > > > > > NoSpammer (no.delete@this.spam.com) on April 21, 2017 12:40 pm wrote:
> > > > > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 21, 2017 7:55 am wrote:
> > > > > > > > > > > > > > Like I said, if they have a plan on how to deal with the things
> > > > > > > > > > > > > > that usually kill VLIW it can work, if they don't it won't.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It doesn't matter if they have a plan. Where's the solution?
> > > > > > > > > > > >
> > > > > > > > > > > > Do you think a massively simplified version will put up numbers that'll impress anyone?
> > > > > > > > > > > > Yes, if they've never verified that their concept works they'll have a problem.
> > > > > > > > > > > >
> > > > > > > > > > > > > The alternative of course is you pretend all is great, go looking for naive investors,
> > > > > > > > > > > > > and while your are at it you go to RWT to try to make some hype and have several anonymous
> > > > > > > > > > > > > people talking favorably of it to maybe convince some non believers.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If there's no compiler and no emulator in any form of even
> > > > > > > > > > > > > something architecturally similar to what you are
> > > > > > > > > > > > > selling after so many years, it's not difficult to conclude
> > > > > > > > > > > > > the game is about striking luck with some patents.
> > > > > > > > > > > >
> > > > > > > > > > > > http://millcomputing.com/topic/simulation/#post-1547
> > > > > > > > > > > > They have something, which they have probably shown investors.
> > > > > > > > > > > > Doesn't mean they have to show everyone their cards yet.
> > > > > > > > > > > >
> > > > > > > > > > > > So they're trying to make money by patenting something that doesn't even work?
> > > > > > > > > > > >
> > > > > > > > > > > > This seems like the highest effort, lowest reward (none) scam if you are right.
> > > > > > > > > > >
> > > > > > > > > > > Everyone seems to assume that the Mill is in-order, as do I, but
> > > > > > > > > > > how hard is it to scale this up to an out of order design?
> > > > > > > > > >
> > > > > > > > > > Basically impossible.
> > > > > > > > >
> > > > > > > > > Answers below.
> > > > > > > > >
> > > > > > > > > > > First you have to save the belt, but I assume each ALU has its own port to its own part of the
> > > > > > > > > > > scratchpad.
> > > > > > > > > > That would waste a lot of space both in terms of unused
> > > > > > > > > > capacity and implementation. Also how would you write
> > > > > > > > > > into the scratchpad from anything that's not an ALU. Even just the ALUs would mean 8 ports on a Gold.
> > > > > > > > >
> > > > > > > > > Yes you would waste some space with eight scratchpads. But the front end (and compiler) would
> > > > > > > > > load balance so that one ALU would not get too overloaded and fill its scratchpad.
> > > > > > > > >
> > > > > > > >
> > > > > > > > So classic hand wave "the compiler will take care of it".
> > > > > > >
> > > > > > > The compiler would not do a great job outside of loops, but one
> > > > > > > would not rely on the compiler for more than indirect hints.
> > > > > > >
> > > > > > > > > > > The namers would be ALU based, with only occasional accesses to other ALU's.
> > > > > > > > > > So you want OoOE but with static scheduling? Or you want to do the full OoO dependency analysis
> > > > > > > > > > and "optimal" scheduling so it's identical to current OoOE machines except wider?
> > > > > > > > >
> > > > > > > > > For this thought exercise I am assuming eight belts, running groups of opcodes, and only
> > > > > > > > > needing to communicating inputs, and net outputs to the scratchpad, not belt positions.
> > > > > > > >
> > > > > > > > So you're just using the scratchpad as a terrible register file and ignoring the belt almost completely.
> > > > > > >
> > > > > > > No, you would only spill values that other ALU's need. If an ALU gets too much state you would switch
> > > > > > > the next opcode group to a different ALU and spill values so the other ALU can pick the values up.
> > > > > > >
> > > > > > > In real world code dependancy chains are short and this would never happen.
> > > > > > > A typical use of eight ALU's is the loop counter and index pointer would sit in one
> > > > > > > ALU, index pointers in other ALU's, and actual compute in maybe two other ALU's.
> > > > > > >
> > > > > > > As you go through the loop you know what ALU's have the values from previous loops and you keep
> > > > > > > re-assigning those opcode chains to where those values and index register pointer are.
> > > > > > >
> > > > > > > Basically no data would ever spill to the scratchpad, only results get written.
> > > > > > >
> > > > > > > Rename is only needed to roll back writes that were caused by a mispredicted branch.
> > > > > >
> > > > > > No, rename is also needed to reorder.
> > > > >
> > > > > Not if you reset the belt size to zero between dependency batches, and spill the old
> > > > > belt for mispredicted branch recovery. Note I am assuming eight belts. The case of a
> > > > > unified belt and no belt resets is not much different, just harder to understand.
> > > > >
> > > > > The point of OoO is largely that indexing and loads can run ahead of other opcodes.
> > > > >
> > > > > I believe you are trying to rename belt positions, there is no need as belt positions are unique
> > > > > down one branch path. Holes in a belt due to OoO are irrelevant, they will get filled in later.
> > > > >
> > > > > The only renaming is on result writes to the scratchpad.
> > > > >
> > > > > I apologize in advance if I am missing something.
> > > > >
> > > > > > > > > This can radically reduce rename needs. The majority of ALU
> > > > > > > > > outputs do not need to be renamed with this approach.
> > > > > > > > >
> > > > > > > > > You do need to wait for all inputs to be available before issuing the group to an ALU.
> > > > > > > >
> > > > > > > > You are also forcing all outputs into the scratchpad which is
> > > > > > > > a) a terrible idea, because now we're back to registers, except wider and more expensive and
> > > > > > > > b) the complete opposite of what they tried to do with the mill.
> > > > > > >
> > > > > > > Better explanation above.
> > > > > > >
> > > > > > > > > > > The downside is you need a larger scratchpad, but the scratchpad
> > > > > > > > > > > scales far better than a multiported register file.
> > > > > > > > > >
> > > > > > > > > > How many ports do you think a register file got?
> > > > > > > > >
> > > > > > > > > The Pentium had two ports to the register file, but that was 1990's.
> > > > > > > > >
> > > > > > > > > The gold Mill has eight ALU's and multiplexing that to one memory array
> > > > > > > > > gets difficult and takes more space than independent memory arrays.
> > > > > > > >
> > > > > > > > The multiplexing gets really awkward and now you're limited by the read ports on your
> > > > > > > > register file / scratchpad partitions. Pray that a value isn't needed more than once
> > > > > > > > per cycle or that the compiler will just magically make all problems go away.
> > > > > > >
> > > > > > > Better explanation above and below.
> > > > > > >
> > > > > > > > > You still need a multiplexer to exchange data between ALU's, but that is
> > > > > > > > > less in bandwidth needs than sending all the data needlessly like now.
> > > > > > > > >
> > > > > > > > > > > There is also memory renaming, sometimes called store-load forwarding, but that is
> > > > > > > > > > > an in-order term. If you are already renaming the scratchpad this works fine, a cache
> > > > > > > > > > > line store would access all eight renamers to grab the eight longs of the line.
> > > > > > > > > > >
> > > > > > > > > > > It takes a while to get your head around what is happening because
> > > > > > > > > > > it is not RISC, but I do not see any show stoppers so far.
> > > > > > > > > > Belt encoding completely kills it.
> > > > > > > > > > Instead of having to update what each register "means" only for each writing operation you have to do
> > > > > > > > > > it for every belt position, every cycle, with information reaching I don't know how many cycles into
> > > > > > > > > > the past to update the belt positions with the names of the operation results that were expected to finish
> > > > > > > > > > in that cycle. 32 wide rename with definitely >5 cycles of history having to be saved as well?
> > > > > > > > > > Yeah, good luck.
> > > > > > > > >
> > > > > > > > > I covered this above.
> > > > > > > >
> > > > > > > > You did not.
> > > > > > > > To resume execution after a mispredict belt positions must point to the correct data.
> > > > > > > > So either no branches can execute while any belt is in use or you must save the data.
> > > > > > > > And even just for reordering you need renaming. Or can anything that accesses the belt not
> > > > > > > > be reordered? So you're forcing everything through the scratchpad again, building the world's
> > > > > > > > worst load/store architecture, because now every dependency chain at best and every single
> > > > > > > > instruction at worst needs an extra instruction for every input and every output.
> > > > > > >
> > > > > > > Yes, you need another port on the scratchpad to spill the belt,
> > > > > > > to deal with mispredicted branches and reload old belt values.
> > > > > > >
> > > > > >
> > > > > > You are still ignoring the question.
> > > > > > How do you reorder?
> > > > >
> > > > > Belt positions are unique.
> > > >
> > > > They are not. The belt "moves". So the same position points to a different value depending
> > > > on the cycle. Therefore you can't change the order of execution unless you rename.
> > >
> > > Belt movement is a simple queue, exactly matching an ordinary ALU bypass queue.
> > >
> > > If you just generated a results for R8 and R4 and you add them generating R6, the ALU
> > > is not going to get those values from the renamed register file, instead bypass slot
> > > 1 and 2 will feed back to the ALU and the result is put in slot 0, then the bypasses
> > > advance leaving data in slots 3, 2 and 1. Oh look I just described a Mill belt.
> > >
> > > If you want to be pedantic then yes a trivial form of rename that fits entirely
> > > in the ALU needs to take place for OoO to take place in a Mill belt.
> > >
> > > On a function call you get a new belt, 20 instructions in you are writing your 18th result,
> > > and the load finished for the 10th instruction which was supposed to write the 9th result.
> > > Take a guess how hard it is to write both these results to the correct spots.
> > > Also how hard it is to pick up belt references like the
> > > 3rd value back from the result you are going to write.
> > >
> > > Once the data gets out of the bypasses the results will get written to the right scratchpad
> > > slots so no further tracking is needed. A direct addressed loop buffer for Mill slots.
> > >
> > > Yes a OoO Mill will need to save the belt so that you can run that tenth instruction back
> > > that was skipped waiting on a load. The save window is sized to match your OoO window.
> > >
> > > Now am I making sense?
> > >
> >
> > Not really, I'm going to show you an example.
> > Let's assume a private belt for each ALU like you said.
> >
> > Some simple code for a single ALU, what happens in the others doesn't concern us.
> > add r1, b1
> > sub r2, b5
> > div b1, b2
> >
> > Statically this code works.
> > add executes, drops the result into b1, sub executes and drops
> > the result into b1, the result of the add becomes b2.
> >
> > Dynamically it doesn't.
> >
> > Let's say r1 got stalled on a load or is coming from a different
> > ALU and got reordered or whatever and the add can't be executed.
> > If you reorder and execute the sub first then its second operand is not in b5, it's still in b4.
> > It's already the wrong result and it drops into b1.
> > The add executes and drops the result into b1, the sub result becomes b2.
> > Then div executes.
> > Instead of executing
> > add r1, b1
> > sub r2, b5
> > div b1, b2
> >
> > you executed
> > add r1, b1
> > sub r2, b4
> > div b2, b1
> >
> > So yes, you need rename, and no, it's not trivial.
>
> Yes you need rename tracking, I assumed that this had a different name, but it makes sense that
> this work is just dumped in the bucket with the rest of rename as it is so tightly bound.
>
> My simplified model of dumping each belt to a loop of 32 direct mapped registers does get rid of some of
> the difficult rename tasks. No finding rename register slots in a big shared pool. No tracking that you
> are the fifth write to r5, belt positions are only ever written once. Belt values spend a few cycles in
> the bypasses, then a few cycles in maybe a fast scratch, (the rest of the Belt relative accesses) then
> dumped into a simple looping pool. At first You know where the data is by the number of belt writes that
> have taken place since the value was generated. Then the index is frozen as a register location.
>
> Right now I am stumped on how you would know how many writes have taken place, you don't want ~8 counters
> counting. Earlier I had thought you tracked cycles since execution, which is easy, but wrong.
>
Now you're getting it.
> The bypasses are cycle based, the near temps might be fixed, but loop
> based on write count not cycle count, then the pool of 32 are fixed.
> More complexity here than I appreciated at first, even after
> getting rid of what I had thought was the hard parts.
>
The thing is those were not the hard parts. Save/restore for speculation and renaming every single bit of the predicate bytes that govern your loops are the fun parts. Assuming 8 ALUs with 32 registers each then you'll have a lot of fun saving that much potentially every cycle. Now for a single ALU you really don't need the full belt, but you do want a bit of OoO window so you'll end up with potentially hundreds of in-flight values in terms of adresses anyway. Naively saving all of them isn't possible. You obviously don't need intermediate results when all newer results that depend on it have already been calculated, unless you need it to go down another path after misspeculation, but you could force those values into the registers/scratchpad. Either way, a lot of complexity to figure out what needs to be saved and even after that reduction still a lot of values to be saved.
And then the predicates. As is customary for VLIW and vectorization in general the Mill throws predicates at loops. 8 predicates for the ALUs/pick operations and 1 for the loop branch per cycle. More fun.
> Now am I in the right ballpark?
>
Slowly getting there.
> > > > > > How do you recover from a mispredict.
> > > > >
> > > > > Roll back the belt, and any writes.
> > > > >
> > > >
> > > > And how do you do that, if you haven't saved?
> > > >
> > > > > > Any changes in execution order and values be not in the same positions on the belt.
> > > > >
> > > > > No, belt positions are assigned at decode. Trying to assign
> > > > > belt positions at execution time would not work.
> > > > >
> > >
> > > An aside, my 'simplification' of having to write results to scratchpad to transfer
> > > data to other ALU's is probably not needed. I do not like the 'push' model, an ALU
> > > could get data it does not want yet, a 'pull' model is better. But the scheduler will
> > > know what is best and can match up pushes and pulls that are latency sensitive.
> > >
> > > > Belt "positions" are assigned at compile time. Based on when the previous instructions finish.
> > > > The position of a result is implicit. Again, if the execution order
> > > > changes the result does not end up in the same position.
> > > >
> > > > > > > > > This should give you a better idea of the post RISC ideas I am looking at.
> > > > > > > >
> > > > > > > > You're just trying to turn the Mill into a RISC architecture again and it's simply a terrible idea.
> > > > > > > >
> > > > > > > > > > Using the spiller to do speculation should be doable, as mentioned before, but OoOE is impossible.
> > > > > > > > > >
> > > > > > > > > > > By the way I live in Austin, TX and that is one of my emails listed in the By field.
>
> anon (spam.delete.delete@this.this.spam.com) on May 9, 2017 6:53 am wrote:
> > Brett (ggtgp.delete@this.yahoo.com) on May 8, 2017 11:34 pm wrote:
> > > anon (spam.delete.delete@this.this.spam.com) on May 8, 2017 3:15 am wrote:
> > > > Brett (ggtgp.delete@this.yahoo.com) on May 8, 2017 2:38 am wrote:
> > > > > anon (spam.delete.delete@this.this.spam.com) on May 8, 2017 1:39 am wrote:
> > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 7, 2017 8:57 pm wrote:
> > > > > > > anon (spam.delete.delete@this.this.spam.com) on May 7, 2017 5:47 pm wrote:
> > > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 7, 2017 1:56 pm wrote:
> > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on May 6, 2017 4:41 pm wrote:
> > > > > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 6, 2017 3:16 pm wrote:
> > > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 21, 2017 1:19 pm wrote:
> > > > > > > > > > > > NoSpammer (no.delete@this.spam.com) on April 21, 2017 12:40 pm wrote:
> > > > > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 21, 2017 7:55 am wrote:
> > > > > > > > > > > > > > Like I said, if they have a plan on how to deal with the things
> > > > > > > > > > > > > > that usually kill VLIW it can work, if they don't it won't.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It doesn't matter if they have a plan. Where's the solution?
> > > > > > > > > > > >
> > > > > > > > > > > > Do you think a massively simplified version will put up numbers that'll impress anyone?
> > > > > > > > > > > > Yes, if they've never verified that their concept works they'll have a problem.
> > > > > > > > > > > >
> > > > > > > > > > > > > The alternative of course is you pretend all is great, go looking for naive investors,
> > > > > > > > > > > > > and while your are at it you go to RWT to try to make some hype and have several anonymous
> > > > > > > > > > > > > people talking favorably of it to maybe convince some non believers.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If there's no compiler and no emulator in any form of even
> > > > > > > > > > > > > something architecturally similar to what you are
> > > > > > > > > > > > > selling after so many years, it's not difficult to conclude
> > > > > > > > > > > > > the game is about striking luck with some patents.
> > > > > > > > > > > >
> > > > > > > > > > > > http://millcomputing.com/topic/simulation/#post-1547
> > > > > > > > > > > > They have something, which they have probably shown investors.
> > > > > > > > > > > > Doesn't mean they have to show everyone their cards yet.
> > > > > > > > > > > >
> > > > > > > > > > > > So they're trying to make money by patenting something that doesn't even work?
> > > > > > > > > > > >
> > > > > > > > > > > > This seems like the highest effort, lowest reward (none) scam if you are right.
> > > > > > > > > > >
> > > > > > > > > > > Everyone seems to assume that the Mill is in-order, as do I, but
> > > > > > > > > > > how hard is it to scale this up to an out of order design?
> > > > > > > > > >
> > > > > > > > > > Basically impossible.
> > > > > > > > >
> > > > > > > > > Answers below.
> > > > > > > > >
> > > > > > > > > > > First you have to save the belt, but I assume each ALU has its own port to its own part of the
> > > > > > > > > > > scratchpad.
> > > > > > > > > > That would waste a lot of space both in terms of unused
> > > > > > > > > > capacity and implementation. Also how would you write
> > > > > > > > > > into the scratchpad from anything that's not an ALU. Even just the ALUs would mean 8 ports on a Gold.
> > > > > > > > >
> > > > > > > > > Yes you would waste some space with eight scratchpads. But the front end (and compiler) would
> > > > > > > > > load balance so that one ALU would not get too overloaded and fill its scratchpad.
> > > > > > > > >
> > > > > > > >
> > > > > > > > So classic hand wave "the compiler will take care of it".
> > > > > > >
> > > > > > > The compiler would not do a great job outside of loops, but one
> > > > > > > would not rely on the compiler for more than indirect hints.
> > > > > > >
> > > > > > > > > > > The namers would be ALU based, with only occasional accesses to other ALU's.
> > > > > > > > > > So you want OoOE but with static scheduling? Or you want to do the full OoO dependency analysis
> > > > > > > > > > and "optimal" scheduling so it's identical to current OoOE machines except wider?
> > > > > > > > >
> > > > > > > > > For this thought exercise I am assuming eight belts, running groups of opcodes, and only
> > > > > > > > > needing to communicating inputs, and net outputs to the scratchpad, not belt positions.
> > > > > > > >
> > > > > > > > So you're just using the scratchpad as a terrible register file and ignoring the belt almost completely.
> > > > > > >
> > > > > > > No, you would only spill values that other ALU's need. If an ALU gets too much state you would switch
> > > > > > > the next opcode group to a different ALU and spill values so the other ALU can pick the values up.
> > > > > > >
> > > > > > > In real world code dependancy chains are short and this would never happen.
> > > > > > > A typical use of eight ALU's is the loop counter and index pointer would sit in one
> > > > > > > ALU, index pointers in other ALU's, and actual compute in maybe two other ALU's.
> > > > > > >
> > > > > > > As you go through the loop you know what ALU's have the values from previous loops and you keep
> > > > > > > re-assigning those opcode chains to where those values and index register pointer are.
> > > > > > >
> > > > > > > Basically no data would ever spill to the scratchpad, only results get written.
> > > > > > >
> > > > > > > Rename is only needed to roll back writes that were caused by a mispredicted branch.
> > > > > >
> > > > > > No, rename is also needed to reorder.
> > > > >
> > > > > Not if you reset the belt size to zero between dependency batches, and spill the old
> > > > > belt for mispredicted branch recovery. Note I am assuming eight belts. The case of a
> > > > > unified belt and no belt resets is not much different, just harder to understand.
> > > > >
> > > > > The point of OoO is largely that indexing and loads can run ahead of other opcodes.
> > > > >
> > > > > I believe you are trying to rename belt positions, there is no need as belt positions are unique
> > > > > down one branch path. Holes in a belt due to OoO are irrelevant, they will get filled in later.
> > > > >
> > > > > The only renaming is on result writes to the scratchpad.
> > > > >
> > > > > I apologize in advance if I am missing something.
> > > > >
> > > > > > > > > This can radically reduce rename needs. The majority of ALU
> > > > > > > > > outputs do not need to be renamed with this approach.
> > > > > > > > >
> > > > > > > > > You do need to wait for all inputs to be available before issuing the group to an ALU.
> > > > > > > >
> > > > > > > > You are also forcing all outputs into the scratchpad which is
> > > > > > > > a) a terrible idea, because now we're back to registers, except wider and more expensive and
> > > > > > > > b) the complete opposite of what they tried to do with the mill.
> > > > > > >
> > > > > > > Better explanation above.
> > > > > > >
> > > > > > > > > > > The downside is you need a larger scratchpad, but the scratchpad
> > > > > > > > > > > scales far better than a multiported register file.
> > > > > > > > > >
> > > > > > > > > > How many ports do you think a register file got?
> > > > > > > > >
> > > > > > > > > The Pentium had two ports to the register file, but that was 1990's.
> > > > > > > > >
> > > > > > > > > The gold Mill has eight ALU's and multiplexing that to one memory array
> > > > > > > > > gets difficult and takes more space than independent memory arrays.
> > > > > > > >
> > > > > > > > The multiplexing gets really awkward and now you're limited by the read ports on your
> > > > > > > > register file / scratchpad partitions. Pray that a value isn't needed more than once
> > > > > > > > per cycle or that the compiler will just magically make all problems go away.
> > > > > > >
> > > > > > > Better explanation above and below.
> > > > > > >
> > > > > > > > > You still need a multiplexer to exchange data between ALU's, but that is
> > > > > > > > > less in bandwidth needs than sending all the data needlessly like now.
> > > > > > > > >
> > > > > > > > > > > There is also memory renaming, sometimes called store-load forwarding, but that is
> > > > > > > > > > > an in-order term. If you are already renaming the scratchpad this works fine, a cache
> > > > > > > > > > > line store would access all eight renamers to grab the eight longs of the line.
> > > > > > > > > > >
> > > > > > > > > > > It takes a while to get your head around what is happening because
> > > > > > > > > > > it is not RISC, but I do not see any show stoppers so far.
> > > > > > > > > > Belt encoding completely kills it.
> > > > > > > > > > Instead of having to update what each register "means" only for each writing operation you have to do
> > > > > > > > > > it for every belt position, every cycle, with information reaching I don't know how many cycles into
> > > > > > > > > > the past to update the belt positions with the names of the operation results that were expected to finish
> > > > > > > > > > in that cycle. 32 wide rename with definitely >5 cycles of history having to be saved as well?
> > > > > > > > > > Yeah, good luck.
> > > > > > > > >
> > > > > > > > > I covered this above.
> > > > > > > >
> > > > > > > > You did not.
> > > > > > > > To resume execution after a mispredict belt positions must point to the correct data.
> > > > > > > > So either no branches can execute while any belt is in use or you must save the data.
> > > > > > > > And even just for reordering you need renaming. Or can anything that accesses the belt not
> > > > > > > > be reordered? So you're forcing everything through the scratchpad again, building the world's
> > > > > > > > worst load/store architecture, because now every dependency chain at best and every single
> > > > > > > > instruction at worst needs an extra instruction for every input and every output.
> > > > > > >
> > > > > > > Yes, you need another port on the scratchpad to spill the belt,
> > > > > > > to deal with mispredicted branches and reload old belt values.
> > > > > > >
> > > > > >
> > > > > > You are still ignoring the question.
> > > > > > How do you reorder?
> > > > >
> > > > > Belt positions are unique.
> > > >
> > > > They are not. The belt "moves". So the same position points to a different value depending
> > > > on the cycle. Therefore you can't change the order of execution unless you rename.
> > >
> > > Belt movement is a simple queue, exactly matching an ordinary ALU bypass queue.
> > >
> > > If you just generated a results for R8 and R4 and you add them generating R6, the ALU
> > > is not going to get those values from the renamed register file, instead bypass slot
> > > 1 and 2 will feed back to the ALU and the result is put in slot 0, then the bypasses
> > > advance leaving data in slots 3, 2 and 1. Oh look I just described a Mill belt.
> > >
> > > If you want to be pedantic then yes a trivial form of rename that fits entirely
> > > in the ALU needs to take place for OoO to take place in a Mill belt.
> > >
> > > On a function call you get a new belt, 20 instructions in you are writing your 18th result,
> > > and the load finished for the 10th instruction which was supposed to write the 9th result.
> > > Take a guess how hard it is to write both these results to the correct spots.
> > > Also how hard it is to pick up belt references like the
> > > 3rd value back from the result you are going to write.
> > >
> > > Once the data gets out of the bypasses the results will get written to the right scratchpad
> > > slots so no further tracking is needed. A direct addressed loop buffer for Mill slots.
> > >
> > > Yes a OoO Mill will need to save the belt so that you can run that tenth instruction back
> > > that was skipped waiting on a load. The save window is sized to match your OoO window.
> > >
> > > Now am I making sense?
> > >
> >
> > Not really, I'm going to show you an example.
> > Let's assume a private belt for each ALU like you said.
> >
> > Some simple code for a single ALU, what happens in the others doesn't concern us.
> > add r1, b1
> > sub r2, b5
> > div b1, b2
> >
> > Statically this code works.
> > add executes, drops the result into b1, sub executes and drops
> > the result into b1, the result of the add becomes b2.
> >
> > Dynamically it doesn't.
> >
> > Let's say r1 got stalled on a load or is coming from a different
> > ALU and got reordered or whatever and the add can't be executed.
> > If you reorder and execute the sub first then its second operand is not in b5, it's still in b4.
> > It's already the wrong result and it drops into b1.
> > The add executes and drops the result into b1, the sub result becomes b2.
> > Then div executes.
> > Instead of executing
> > add r1, b1
> > sub r2, b5
> > div b1, b2
> >
> > you executed
> > add r1, b1
> > sub r2, b4
> > div b2, b1
> >
> > So yes, you need rename, and no, it's not trivial.
>
> Yes you need rename tracking, I assumed that this had a different name, but it makes sense that
> this work is just dumped in the bucket with the rest of rename as it is so tightly bound.
>
> My simplified model of dumping each belt to a loop of 32 direct mapped registers does get rid of some of
> the difficult rename tasks. No finding rename register slots in a big shared pool. No tracking that you
> are the fifth write to r5, belt positions are only ever written once. Belt values spend a few cycles in
> the bypasses, then a few cycles in maybe a fast scratch, (the rest of the Belt relative accesses) then
> dumped into a simple looping pool. At first You know where the data is by the number of belt writes that
> have taken place since the value was generated. Then the index is frozen as a register location.
>
> Right now I am stumped on how you would know how many writes have taken place, you don't want ~8 counters
> counting. Earlier I had thought you tracked cycles since execution, which is easy, but wrong.
>
Now you're getting it.
> The bypasses are cycle based, the near temps might be fixed, but loop
> based on write count not cycle count, then the pool of 32 are fixed.
> More complexity here than I appreciated at first, even after
> getting rid of what I had thought was the hard parts.
>
The thing is those were not the hard parts. Save/restore for speculation and renaming every single bit of the predicate bytes that govern your loops are the fun parts. Assuming 8 ALUs with 32 registers each then you'll have a lot of fun saving that much potentially every cycle. Now for a single ALU you really don't need the full belt, but you do want a bit of OoO window so you'll end up with potentially hundreds of in-flight values in terms of adresses anyway. Naively saving all of them isn't possible. You obviously don't need intermediate results when all newer results that depend on it have already been calculated, unless you need it to go down another path after misspeculation, but you could force those values into the registers/scratchpad. Either way, a lot of complexity to figure out what needs to be saved and even after that reduction still a lot of values to be saved.
And then the predicates. As is customary for VLIW and vectorization in general the Mill throws predicates at loops. 8 predicates for the ALUs/pick operations and 1 for the loop branch per cycle. More fun.
> Now am I in the right ballpark?
>
Slowly getting there.
> > > > > > How do you recover from a mispredict.
> > > > >
> > > > > Roll back the belt, and any writes.
> > > > >
> > > >
> > > > And how do you do that, if you haven't saved?
> > > >
> > > > > > Any changes in execution order and values be not in the same positions on the belt.
> > > > >
> > > > > No, belt positions are assigned at decode. Trying to assign
> > > > > belt positions at execution time would not work.
> > > > >
> > >
> > > An aside, my 'simplification' of having to write results to scratchpad to transfer
> > > data to other ALU's is probably not needed. I do not like the 'push' model, an ALU
> > > could get data it does not want yet, a 'pull' model is better. But the scheduler will
> > > know what is best and can match up pushes and pulls that are latency sensitive.
> > >
> > > > Belt "positions" are assigned at compile time. Based on when the previous instructions finish.
> > > > The position of a result is implicit. Again, if the execution order
> > > > changes the result does not end up in the same position.
> > > >
> > > > > > > > > This should give you a better idea of the post RISC ideas I am looking at.
> > > > > > > >
> > > > > > > > You're just trying to turn the Mill into a RISC architecture again and it's simply a terrible idea.
> > > > > > > >
> > > > > > > > > > Using the spiller to do speculation should be doable, as mentioned before, but OoOE is impossible.
> > > > > > > > > >
> > > > > > > > > > > By the way I live in Austin, TX and that is one of my emails listed in the By field.
>