By: anon (spam.delete.delete@this.this.spam.com), May 12, 2017 10:06 am
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on May 11, 2017 10:12 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on May 11, 2017 2:46 pm wrote:
> > Brett (ggtgp.delete@this.yahoo.com) on May 11, 2017 10:31 am wrote:
> > > Brett (ggtgp.delete@this.yahoo.com) on May 10, 2017 9:34 pm wrote:
> > > > anon (spam.delete.delete@this.this.spam.com) on May 9, 2017 6:53 am wrote:
> > > > > Brett (ggtgp.delete@this.yahoo.com) on May 8, 2017 11:34 pm wrote:
> > > > > > anon (spam.delete.delete@this.this.spam.com) on May 8, 2017 3:15 am wrote:
> > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 8, 2017 2:38 am wrote:
> > > > > > > > anon (spam.delete.delete@this.this.spam.com) on May 8, 2017 1:39 am wrote:
> > > > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 7, 2017 8:57 pm wrote:
> > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on May 7, 2017 5:47 pm wrote:
> > > > > > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 7, 2017 1:56 pm wrote:
> > > > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on May 6, 2017 4:41 pm wrote:
> > > > > > > > > > > > > Brett (ggtgp.delete@this.yahoo.com) on May 6, 2017 3:16 pm wrote:
> > > > > > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 21, 2017 1:19 pm wrote:
> > > > > > > > > > > > > > > NoSpammer (no.delete@this.spam.com) on April 21, 2017 12:40 pm wrote:
> > > > > > > > > > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 21, 2017 7:55 am wrote:
> > > > > > > > > > > > > > > > > Like I said, if they have a plan on how to deal with the things
> > > > > > > > > > > > > > > > > that usually kill VLIW it can work, if they don't it won't.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It doesn't matter if they have a plan. Where's the solution?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Do you think a massively simplified version will put up numbers that'll impress anyone?
> > > > > > > > > > > > > > > Yes, if they've never verified that their concept works they'll have a problem.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The alternative of course is you pretend all is great, go looking for naive investors,
> > > > > > > > > > > > > > > > and while you are at it you go to RWT to try to make some hype and have several anonymous
> > > > > > > > > > > > > > > > people talking favorably of it to maybe convince some non-believers.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If there's no compiler and no emulator in any form of even
> > > > > > > > > > > > > > > > something architecturally similar to what you are
> > > > > > > > > > > > > > > > selling after so many years, it's not difficult to conclude
> > > > > > > > > > > > > > > > the game is about striking luck with some patents.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > http://millcomputing.com/topic/simulation/#post-1547
> > > > > > > > > > > > > > > They have something, which they have probably shown investors.
> > > > > > > > > > > > > > > Doesn't mean they have to show everyone their cards yet.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So they're trying to make money by patenting something that doesn't even work?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This seems like the highest effort, lowest reward (none) scam if you are right.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Everyone seems to assume that the Mill is in-order, as do I, but
> > > > > > > > > > > > > > how hard is it to scale this up to an out of order design?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Basically impossible.
> > > > > > > > > > > >
> > > > > > > > > > > > Answers below.
> > > > > > > > > > > >
> > > > > > > > > > > > > > First you have to save the belt, but I assume each ALU has its own port to its own part of the
> > > > > > > > > > > > > > scratchpad.
> > > > > > > > > > > > > That would waste a lot of space both in terms of unused
> > > > > > > > > > > > > capacity and implementation. Also how would you write
> > > > > > > > > > > > > into the scratchpad from anything that's not an ALU? Even just the ALUs would mean 8 ports on a Gold.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes you would waste some space with eight scratchpads. But the front end (and compiler) would
> > > > > > > > > > > > load balance so that one ALU would not get too overloaded and fill its scratchpad.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > So classic hand wave "the compiler will take care of it".
> > > > > > > > > >
> > > > > > > > > > The compiler would not do a great job outside of loops, but one
> > > > > > > > > > would not rely on the compiler for more than indirect hints.
> > > > > > > > > >
> > > > > > > > > > > > > > The namers would be ALU based, with only occasional accesses to other ALU's.
> > > > > > > > > > > > > So you want OoOE but with static scheduling? Or you want to do the full OoO dependency analysis
> > > > > > > > > > > > > and "optimal" scheduling so it's identical to current OoOE machines except wider?
> > > > > > > > > > > >
> > > > > > > > > > > > For this thought exercise I am assuming eight belts, running groups of opcodes, and only
> > > > > > > > > > > > needing to communicate inputs and net outputs to the scratchpad, not belt positions.
> > > > > > > > > > >
> > > > > > > > > > > So you're just using the scratchpad as a terrible register file and ignoring the belt almost completely.
> > > > > > > > > >
> > > > > > > > > > No, you would only spill values that other ALU's need. If an ALU gets too much state you would switch
> > > > > > > > > > the next opcode group to a different ALU and spill values so the other ALU can pick the values up.
> > > > > > > > > >
> > > > > > > > > > In real-world code dependency chains are short and this would never happen.
> > > > > > > > > > A typical use of eight ALU's is the loop counter and index pointer would sit in one
> > > > > > > > > > ALU, index pointers in other ALU's, and actual compute in maybe two other ALU's.
> > > > > > > > > >
> > > > > > > > > > As you go through the loop you know what ALU's have the values from previous loops and you keep
> > > > > > > > > > re-assigning those opcode chains to where those values and index register pointer are.
> > > > > > > > > >
> > > > > > > > > > Basically no data would ever spill to the scratchpad, only results get written.
> > > > > > > > > >
> > > > > > > > > > Rename is only needed to roll back writes that were caused by a mispredicted branch.
> > > > > > > > >
> > > > > > > > > No, rename is also needed to reorder.
> > > > > > > >
> > > > > > > > Not if you reset the belt size to zero between dependency batches, and spill the old
> > > > > > > > belt for mispredicted branch recovery. Note I am assuming eight belts. The case of a
> > > > > > > > unified belt and no belt resets is not much different, just harder to understand.
> > > > > > > >
> > > > > > > > The point of OoO is largely that indexing and loads can run ahead of other opcodes.
> > > > > > > >
> > > > > > > > I believe you are trying to rename belt positions, there is no need as belt positions are unique
> > > > > > > > down one branch path. Holes in a belt due to OoO are irrelevant, they will get filled in later.
> > > > > > > >
> > > > > > > > The only renaming is on result writes to the scratchpad.
> > > > > > > >
> > > > > > > > I apologize in advance if I am missing something.
> > > > > > > >
> > > > > > > > > > > > This can radically reduce rename needs. The majority of ALU
> > > > > > > > > > > > outputs do not need to be renamed with this approach.
> > > > > > > > > > > >
> > > > > > > > > > > > You do need to wait for all inputs to be available before issuing the group to an ALU.
> > > > > > > > > > >
> > > > > > > > > > > You are also forcing all outputs into the scratchpad which is
> > > > > > > > > > > a) a terrible idea, because now we're back to registers, except wider and more expensive and
> > > > > > > > > > > b) the complete opposite of what they tried to do with the mill.
> > > > > > > > > >
> > > > > > > > > > Better explanation above.
> > > > > > > > > >
> > > > > > > > > > > > > > The downside is you need a larger scratchpad, but the scratchpad
> > > > > > > > > > > > > > scales far better than a multiported register file.
> > > > > > > > > > > > >
> > > > > > > > > > > > > How many ports do you think a register file has?
> > > > > > > > > > > >
> > > > > > > > > > > > The Pentium had two ports to the register file, but that was 1990's.
> > > > > > > > > > > >
> > > > > > > > > > > > The Gold Mill has eight ALU's, and multiplexing that to one memory array
> > > > > > > > > > > > gets difficult and takes more space than independent memory arrays.
> > > > > > > > > > >
> > > > > > > > > > > The multiplexing gets really awkward and now you're limited by the read ports on your
> > > > > > > > > > > register file / scratchpad partitions. Pray that a value isn't needed more than once
> > > > > > > > > > > per cycle or that the compiler will just magically make all problems go away.
> > > > > > > > > >
> > > > > > > > > > Better explanation above and below.
> > > > > > > > > >
> > > > > > > > > > > > You still need a multiplexer to exchange data between ALU's, but that needs
> > > > > > > > > > > > less bandwidth than sending all the data needlessly as happens now.
> > > > > > > > > > > >
> > > > > > > > > > > > > > There is also memory renaming, sometimes called store-load forwarding, but that is
> > > > > > > > > > > > > > an in-order term. If you are already renaming the scratchpad this works fine, a cache
> > > > > > > > > > > > > > line store would access all eight renamers to grab the eight longs of the line.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It takes a while to get your head around what is happening because
> > > > > > > > > > > > > > it is not RISC, but I do not see any show stoppers so far.
> > > > > > > > > > > > > Belt encoding completely kills it.
> > > > > > > > > > > > > Instead of having to update what each register "means" only for each writing operation you have to do
> > > > > > > > > > > > > it for every belt position, every cycle, with information reaching I don't know how many cycles into
> > > > > > > > > > > > > the past to update the belt positions with the names of the operation results that were expected to finish
> > > > > > > > > > > > > in that cycle. 32 wide rename with definitely >5 cycles of history having to be saved as well?
> > > > > > > > > > > > > Yeah, good luck.
> > > > > > > > > > > >
> > > > > > > > > > > > I covered this above.
> > > > > > > > > > >
> > > > > > > > > > > You did not.
> > > > > > > > > > > To resume execution after a mispredict belt positions must point to the correct data.
> > > > > > > > > > > So either no branches can execute while any belt is in use or you must save the data.
> > > > > > > > > > > And even just for reordering you need renaming. Or can anything that accesses the belt not
> > > > > > > > > > > be reordered? So you're forcing everything through the scratchpad again, building the world's
> > > > > > > > > > > worst load/store architecture, because now every dependency chain at best and every single
> > > > > > > > > > > instruction at worst needs an extra instruction for every input and every output.
> > > > > > > > > >
> > > > > > > > > > Yes, you need another port on the scratchpad to spill the belt,
> > > > > > > > > > to deal with mispredicted branches and reload old belt values.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > You are still ignoring the question.
> > > > > > > > > How do you reorder?
> > > > > > > >
> > > > > > > > Belt positions are unique.
> > > > > > >
> > > > > > > They are not. The belt "moves". So the same position points to a different value depending
> > > > > > > on the cycle. Therefore you can't change the order of execution unless you rename.
> > > > > >
> > > > > > Belt movement is a simple queue, exactly matching an ordinary ALU bypass queue.
> > > > > >
> > > > > > If you just generated results for R8 and R4 and you add them generating R6, the ALU
> > > > > > is not going to get those values from the renamed register file, instead bypass slot
> > > > > > 1 and 2 will feed back to the ALU and the result is put in slot 0, then the bypasses
> > > > > > advance leaving data in slots 3, 2 and 1. Oh look I just described a Mill belt.
> > > > > >
> > > > > > If you want to be pedantic then yes a trivial form of rename that fits entirely
> > > > > > in the ALU needs to take place for OoO to take place in a Mill belt.
> > > > > >
> > > > > > On a function call you get a new belt, 20 instructions in you are writing your 18th result,
> > > > > > and the load finished for the 10th instruction which was supposed to write the 9th result.
> > > > > > Take a guess how hard it is to write both these results to the correct spots.
> > > > > > Also how hard it is to pick up belt references like the
> > > > > > 3rd value back from the result you are going to write.
> > > > > >
> > > > > > Once the data gets out of the bypasses the results will get written to the right scratchpad
> > > > > > slots so no further tracking is needed. A direct addressed loop buffer for Mill slots.
> > > > > >
> > > > > > Yes a OoO Mill will need to save the belt so that you can run that tenth instruction back
> > > > > > that was skipped waiting on a load. The save window is sized to match your OoO window.
> > > > > >
> > > > > > Now am I making sense?
> > > > > >
> > > > >
> > > > > Not really, I'm going to show you an example.
> > > > > Let's assume a private belt for each ALU like you said.
> > > > >
> > > > > Some simple code for a single ALU, what happens in the others doesn't concern us.
> > > > > add r1, b1
> > > > > sub r2, b5
> > > > > div b1, b2
> > > > >
> > > > > Statically this code works.
> > > > > add executes, drops the result into b1, sub executes and drops
> > > > > the result into b1, the result of the add becomes b2.
> > > > >
> > > > > Dynamically it doesn't.
> > > > >
> > > > > Let's say r1 got stalled on a load or is coming from a different
> > > > > ALU and got reordered or whatever and the add can't be executed.
> > > > > If you reorder and execute the sub first then its second operand is not in b5, it's still in b4.
> > > > > It's already the wrong result and it drops into b1.
> > > > > The add executes and drops the result into b1, the sub result becomes b2.
> > > > > Then div executes.
> > > > > Instead of executing
> > > > > add r1, b1
> > > > > sub r2, b5
> > > > > div b1, b2
> > > > >
> > > > > you executed
> > > > > add r1, b1
> > > > > sub r2, b4
> > > > > div b2, b1
> > > > >
> > > > > So yes, you need rename, and no, it's not trivial.
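For what it's worth, the hazard in that example is easy to check with a quick simulation. This is my own sketch, not anything from the Mill docs: the belt is modeled as a fixed-length queue where every result drops in at the front, and operand names are positions relative to the front at the moment the instruction executes.

```python
from collections import deque

# Belt modeled as a fixed-length queue: drop() pushes a new result in at
# the front (b0) and the oldest value falls off the far end.
class Belt:
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)

    def drop(self, value):
        self.slots.appendleft(value)

    def read(self, pos):
        return self.slots[pos]

def fresh_belt():
    # Preload so that positions 0..7 hold v0..v7, v0 being the newest drop.
    b = Belt()
    for i in reversed(range(8)):
        b.drop(f"v{i}")
    return b

# In-order: the add retires first, advancing the belt by one position
# before the sub reads its b5 operand.
b = fresh_belt()
b.drop("add_result")
in_order = b.read(5)   # reads v4

# Reordered: the add is stalled, the sub executes first, and the same
# static name b5 now refers to a different value.
b = fresh_belt()
reordered = b.read(5)  # reads v5

print(in_order, reordered)  # different values for the same operand name
```

The point is exactly the one made above: the static operand name encodes a position relative to a moving front, so any change in completion order silently changes what the name means.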
> > > >
> > > > Yes you need rename tracking, I assumed that this had a different name, but it makes sense that
> > > > this work is just dumped in the bucket with the rest of rename as it is so tightly bound.
> > > >
> > > > My simplified model of dumping each belt to a loop of 32 direct mapped registers does get rid of some of
> > > > the difficult rename tasks. No finding rename register slots in a big shared pool. No tracking that you
> > > > are the fifth write to r5, belt positions are only ever written once. Belt values spend a few cycles in
> > > > the bypasses, then a few cycles in maybe a fast scratch, (the rest of the Belt relative accesses) then
> > > > dumped into a simple looping pool. At first you know where the data is by the number of belt writes that
> > > > have taken place since the value was generated. Then the index is frozen as a register location.
> > > >
> > > > Right now I am stumped on how you would know how many writes have taken place, you don't want ~8 counters
> > > > counting. Earlier I had thought you tracked cycles since execution, which is easy, but wrong.
> > > >
> > > > The bypasses are cycle based; the near temps might be fixed, but loop
> > > > based on write count not cycle count; then the pool of 32 is fixed.
> > > > More complexity here than I appreciated at first, even after
> > > > getting rid of what I had thought were the hard parts.
> > >
> > > Actually the Mill uses write count based addressing, so everything falls out easy.
> >
> > Write count yes, easy no.
> >
> > >
> > > On a function call you get a new belt, you just need to know the current write count and the
> > > write count for the instruction at decode time, and you know whether the value is in the bypasses
> > > or the belt buffer or the scratchpad, and you know the offset in that buffer.
> > >
> >
> > 1. It's the hardware's job to figure out where it is. That's not encoded. How do you do that?
>
> It should be clear from my description that I would be assigning
> scratchpad/register slots on instruction decode.
For the last time: How?
Belt positions are based on when an instruction retires, not when it decodes. Are you changing that or do you just hope all instructions have the exact same latency?
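For concreteness, the write-count lookup described above would be something like the following sketch. This is entirely my own illustration; the structure names and depths are invented. Note that it only works if the write count of every result is already known when its consumer decodes, which is exactly what variable latency breaks.

```python
# Age of a value = number of results dropped since it was produced.
# The age alone selects the structure holding it and the offset within it.
BYPASS_DEPTH = 2   # assumed: results this young are still in the bypasses
BELT_DEPTH = 8     # assumed: architectural belt length

def locate(value_write_count, current_write_count):
    age = current_write_count - value_write_count
    if age < BYPASS_DEPTH:
        return ("bypass", age)
    if age < BYPASS_DEPTH + BELT_DEPTH:
        return ("belt_buffer", age - BYPASS_DEPTH)
    return ("scratchpad", age - BYPASS_DEPTH - BELT_DEPTH)

print(locate(41, 42))   # a one-write-old value, still in the bypasses
print(locate(30, 42))   # age 12, long since spilled to the scratchpad
```

The lookup itself is trivial; the unanswered question is where `value_write_count` comes from when the producing instruction's completion time is not known at decode.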
> Note that with a unified register file you could not do this, it would "waste" too many precious
> slots, forcing a larger file which would be too slow. Thus your confusion I think.
>
It still does.
> 32 scratchpad slots with three write ports, plus 8 belt slots, times
> 8 ALU's is 320 registers, way more than the 196 you are used to.
>
So you're just assigning a register to every result.
That is no different from what OoO hardware already does, except it wastes fewer registers and its ALUs can communicate through registers.
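For comparison, the machinery conventional OoO hardware uses for this is just a free list of physical registers plus a map table, roughly as sketched here (sizes and names are illustrative, not any real design):

```python
# Minimal sketch of conventional register rename: every write allocates a
# fresh physical register, and the map table always names the latest one.
free_list = list(range(16))   # pool of physical registers
rename_map = {}               # architectural name -> physical register

def rename_dest(arch_reg):
    # Allocate a fresh physical register for a new write.
    phys = free_list.pop(0)
    rename_map[arch_reg] = phys
    return phys

def read_src(arch_reg):
    # Consumers read whichever physical register is currently mapped.
    return rename_map[arch_reg]

p1 = rename_dest("r5")   # first write to r5
p2 = rename_dest("r5")   # second write gets a different physical register,
                         # so both can be in flight at once
print(p1, p2, read_src("r5"))
```

Assigning a scratchpad slot to every result at decode re-creates this table, just with a different name for the storage.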
> > 2. You need to track the latency and decode time of all instructions to figure out the name for your result.
>
> Nope, once you assign an ALU, late in decode, you know the name; then you just
> need tracking in the bypasses and belt, and once it hits the scratchpad you are done.
>
See above.
> > 3. Guess who "provides" new belt for each function? The hardware. You have to make sure
> > that anything in-flight from the calling function lands in a different buffer.
>
> No, it lands in the same buffer. Now each belt does start at zero, but you would use the top of the last
> belt as the base for that zero. So you do need an add to get the index into the scratchpad loop buffer.
>
> > 4. In case you haven't thought about it yet, a function call rearranges the results on the belt and
> > drops them onto the new belt, which means you have to either copy them all into your new buffer or and
> > another level of renaming. No, you can't take more than one cycle because it's needed for loops.
>
> You copy onto the new belt, same as any other opcode. I
> have not looked at what the Mill team says for this.
>
So you copy everything within a single cycle, but you only have 3 write ports. You do see how that could be a problem?
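The arithmetic behind that objection, using only the figures mentioned in this thread (an 8-value belt and 3 scratchpad write ports, both just the numbers quoted above):

```python
import math

belt_values = 8    # values a call could carry onto the new belt
write_ports = 3    # write ports assumed earlier in this thread

# Copying everything through the ports takes ceil(8 / 3) cycles,
# not the single cycle that tight loops around calls would need.
cycles = math.ceil(belt_values / write_ports)
print(cycles)
```

Either the copy is serialized over multiple cycles, or the port count grows to match the worst-case call, which is the cost being pointed out.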
> An example:
>
> Belt call with a copy of four values; the second two values have
> not been written and so those opcodes go in the OoO queue.
> You generate four values that hit the belt.
> You have eight more values backlogged to belt.
> Then 16 values hit the belt
>
> Assume the belt holds 8 values.
> Half the last 16 values hit the scratchpad, followed by a gap of eight values waiting to
> be written. Then four more values, a two slot gap, and the first two belt call values.
>
> Behind these values are the values from the calling function, which has
> at least two unfilled slots as two call copies are still queued.
>
> Now a branch mispredict fires, the last 24 values written or queued are aborted.
> The belt top is reset back to the four values and call arguments.
>
Try some actual code and write down what lands where.
Then try to figure out communication between the ALUs.
It won't work like this.
This all just leads to an OoOE architecture with slightly different addressing (which, depending on how you think it works, doesn't even work yet) that wastes a lot of registers and pays higher forwarding penalties between ALUs.
> > > I could build this using the tools I had in college in my one transistor hardware course,
> > > it would take me a few weeks, but I am a software guy who only dabbles in hardware.
> > >
> > > > Now am I in the right ballpark?
> > > >
> > > > > > > > > How do you recover from a mispredict.
> > > > > > > >
> > > > > > > > Roll back the belt, and any writes.
> > > > > > > >
> > > > > > >
> > > > > > > And how do you do that, if you haven't saved?
> > > > > > >
> > > > > > > > > Any changes in execution order and values be not in the same positions on the belt.
> > > > > > > >
> > > > > > > > No, belt positions are assigned at decode. Trying to assign
> > > > > > > > belt positions at execution time would not work.
> > > > > > > >
> > > > > >
> > > > > > An aside, my 'simplification' of having to write results to scratchpad to transfer
> > > > > > data to other ALU's is probably not needed. I do not like the 'push' model: an ALU
> > > > > > could get data it does not want yet; a 'pull' model is better. But the scheduler will
> > > > > > know what is best and can match up pushes and pulls that are latency sensitive.
> > > > > >
> > > > > > > Belt "positions" are assigned at compile time. Based on when the previous instructions finish.
> > > > > > > The position of a result is implicit. Again, if the execution order
> > > > > > > changes the result does not end up in the same position.
> > > > > > >
> > > > > > > > > > > > This should give you a better idea of the post RISC ideas I am looking at.
> > > > > > > > > > >
> > > > > > > > > > > You're just trying to turn the Mill into a RISC architecture again and it's simply a terrible idea.
> > > > > > > > > > >
> > > > > > > > > > > > > Using the spiller to do speculation should be doable, as mentioned before, but OoOE is impossible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > By the way I live in Austin, TX and that is one of my emails listed in the By field.
> > > > > >
> > > > > > If you just generated results for R8 and R4 and you add them, generating R6, the ALU
> > > > > > is not going to get those values from the renamed register file; instead bypass slots
> > > > > > 1 and 2 will feed back to the ALU and the result is put in slot 0, then the bypasses
> > > > > > advance, leaving data in slots 3, 2 and 1. Oh look, I just described a Mill belt.
> > > > > >
> > > > > > If you want to be pedantic then yes, a trivial form of rename that fits
> > > > > > entirely in the ALU needs to take place for OoO execution on a Mill belt.
> > > > > >
> > > > > > On a function call you get a new belt; 20 instructions in, you are writing your 18th result,
> > > > > > and the load finished for the 10th instruction, which was supposed to write the 9th result.
> > > > > > Take a guess how hard it is to write both these results to the correct spots.
> > > > > > Also how hard it is to pick up belt references like the
> > > > > > 3rd value back from the result you are going to write.
> > > > > >
> > > > > > Once the data gets out of the bypasses the results will get written to the right scratchpad
> > > > > > slots, so no further tracking is needed. A direct-addressed loop buffer for Mill slots.
> > > > > >
> > > > > > Yes, an OoO Mill will need to save the belt so that you can go back and run that tenth
> > > > > > instruction that was skipped waiting on a load. The save window is sized to match your OoO window.
> > > > > >
> > > > > > Now am I making sense?
> > > > > >
> > > > >
> > > > > Not really, I'm going to show you an example.
> > > > > Let's assume a private belt for each ALU like you said.
> > > > >
> > > > > Some simple code for a single ALU; what happens in the others doesn't concern us.
> > > > > add r1, b1
> > > > > sub r2, b5
> > > > > div b1, b2
> > > > >
> > > > > Statically this code works.
> > > > > add executes, drops the result into b1, sub executes and drops
> > > > > the result into b1, the result of the add becomes b2.
> > > > >
> > > > > Dynamically it doesn't.
> > > > >
> > > > > Let's say r1 got stalled on a load or is coming from a different
> > > > > ALU and got reordered or whatever and the add can't be executed.
> > > > > If you reorder and execute the sub first then its second operand is not in b5, it's still in b4.
> > > > > It's already the wrong result and it drops into b1.
> > > > > The add executes and drops the result into b1, the sub result becomes b2.
> > > > > Then div executes.
> > > > > Instead of executing
> > > > > add r1, b1
> > > > > sub r2, b5
> > > > > div b1, b2
> > > > >
> > > > > you executed
> > > > > add r1, b1
> > > > > sub r2, b4
> > > > > div b2, b1
> > > > >
> > > > > So yes, you need rename, and no, it's not trivial.
> > > >
> > > > Yes, you need rename tracking. I assumed that this had a different name, but it makes sense
> > > > that this work is just dumped in the bucket with the rest of rename, as it is so tightly bound.
> > > >
> > > > My simplified model of dumping each belt to a loop of 32 direct-mapped registers does get rid of some
> > > > of the difficult rename tasks. No finding rename register slots in a big shared pool. No tracking that
> > > > you are the fifth write to r5; belt positions are only ever written once. Belt values spend a few cycles
> > > > in the bypasses, then a few cycles in maybe a fast scratch (the rest of the belt-relative accesses), then
> > > > get dumped into a simple looping pool. At first you know where the data is by the number of belt writes
> > > > that have taken place since the value was generated. Then the index is frozen as a register location.
> > > >
> > > > Right now I am stumped on how you would know how many writes have taken place; you don't want
> > > > ~8 counters counting. Earlier I had thought you tracked cycles since execution, which is easy, but wrong.
> > > >
> > > > The bypasses are cycle based; the near temps might be fixed, but loop
> > > > based on write count, not cycle count; then the pool of 32 is fixed.
> > > > More complexity here than I appreciated at first, even after
> > > > getting rid of what I had thought were the hard parts.
> > >
> > > Actually the Mill uses write-count-based addressing, so everything falls out easily.
> >
> > Write count yes, easy no.
> >
> > >
> > > On a function call you get a new belt, you just need to know the current write count and the
> > > write count for the instruction at decode time, and you know whether the value is in the bypasses
> > > or the belt buffer or the scratchpad, and you know the offset in that buffer.
> > >
> >
> > 1. It's the hardware's job to figure out where it is. That's not encoded. How do you do that?
>
> It should be clear from my description that I would be assigning
> scratchpad/register slots on instruction decode.
For the last time: How?
Belt positions are based on when an instruction retires, not when it decodes. Are you changing that or do you just hope all instructions have the exact same latency?
> Note that with a unified register file you could not do this; it would "waste" too many precious
> slots, forcing a larger file, which would be too slow. Thus your confusion, I think.
>
It still does.
> 32 scratchpad slots with three write ports, plus 8 belt slots, times
> 8 ALUs is 320 registers, way more than the 196 you are used to.
>
So you're just assigning a register to every result.
That is no different from what OoO hardware already does, except that it wastes fewer registers and its ALUs can communicate through registers.
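For comparison, here is a minimal sketch of the rename step conventional OoO hardware performs at decode. This is a toy model and not any specific core: each result takes a fresh physical register from a free list, and sources are translated through the map table, so a value's name stays valid no matter what order instructions actually execute in.

```python
class Renamer:
    """Toy rename table: architectural register -> physical register."""

    def __init__(self, num_phys, initial_map):
        self.map = dict(initial_map)
        in_use = set(self.map.values())
        # physical registers not yet holding a live value
        self.free = [p for p in range(num_phys) if p not in in_use]

    def rename(self, dest, sources):
        """Rename one instruction at decode.
        Sources are translated through the current map *before* the
        destination gets its fresh physical register, so writing a
        register never disturbs readers of its old value."""
        phys_sources = [self.map[s] for s in sources]
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_sources


r = Renamer(8, {"r1": 0, "r2": 1})
print(r.rename("r1", ["r1", "r2"]))  # (2, [0, 1]): old r1 stays readable as p0
print(r.rename("r3", ["r1"]))        # (3, [2]): reads the renamed r1
```

Once renamed, the two instructions above could execute in either order with identical results, which is exactly the property positional belt addressing lacks.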
> > 2. You need to track the latency and decode time of all instructions to figure out the name for your result.
>
> Nope, once you assign an ALU, late in decode, you know the name; then you just need
> tracking in the bypasses and the belt, and once it hits the scratchpad you are done.
>
See above.
> > 3. Guess who "provides" new belt for each function? The hardware. You have to make sure
> > that anything in-flight from the calling function lands in a different buffer.
>
> No, it lands in the same buffer. Now each belt does start at zero, but you would use the top of the last
> belt as the base for that zero. So you do need an add to get the index into the scratchpad loop buffer.
>
> > 4. In case you haven't thought about it yet, a function call rearranges the results on the belt and
> > drops them onto the new belt, which means you have to either copy them all into your new buffer or add
> > another level of renaming. No, you can't take more than one cycle because it's needed for loops.
>
> You copy onto the new belt, same as any other opcode. I
> have not looked at what the Mill team says for this.
>
So you copy everything within a single cycle, but you only have 3 write ports. You do see how that could be a problem?
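Rough arithmetic for that port limit, assuming the three write ports proposed above and no other traffic competing for them: copying k live values into a buffer with p write ports needs at least ceil(k/p) cycles.

```python
import math

def copy_cycles(values, write_ports=3):
    """Minimum cycles to write `values` results through `write_ports`
    write ports, assuming nothing else competes for the ports."""
    return math.ceil(values / write_ports)

print(copy_cycles(4))  # 4 call arguments -> 2 cycles
print(copy_cycles(8))  # a full 8-entry belt -> 3 cycles, not 1
```

So even the best case for a call that rearranges a full belt blows past the one-cycle budget claimed for loops.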
> An example:
>
> A belt call copies four values; the second two values have
> not been written, and so those opcodes go in the OoO queue.
> You generate four values that hit the belt.
> You have eight more values backlogged to belt.
> Then 16 values hit the belt.
>
> Assume the belt holds 8 values.
> Half the last 16 values hit the scratchpad, followed by a gap of eight values waiting to
> be written. Then four more values, a two slot gap, and the first two belt call values.
>
> Behind these values are the values from the calling function, which has
> at least two unfilled slots as two call copies are still queued.
>
> Now a branch mispredict fires, the last 24 values written or queued are aborted.
> The belt top is reset back to the four values and call arguments.
>
Try some actual code and write down what lands where.
Then try to figure out communication between the ALUs.
It won't work like this.
This all just leads to an OoOE architecture with slightly different addressing (which, depending on how you think it works, doesn't even work yet), which wastes a lot of registers and pays higher forwarding penalties between ALUs.
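To make the quoted add/sub example concrete, here is a minimal Python sketch. It is an assumption-laden model, not the Mill's actual machinery: a single ALU, results dropping at the front of the belt as b0, and operands read by position at the moment an instruction executes. Reordering the sub ahead of a stalled add changes which values both operations read.

```python
from collections import deque

def run(program, belt_init, order):
    """Simulate a single-ALU belt: each op reads its operands by belt
    position at the moment it executes, then drops its result at the
    front of the belt (b0).  Returns the operand values each op read."""
    belt = deque(belt_init)
    reads = {}
    for i in order:
        name, fn, positions = program[i]
        args = [belt[p] for p in positions]
        reads[name] = args
        belt.appendleft(fn(*args))  # the drop shifts every position by one
    return reads

init = [0, 1, 2, 3, 4, 5]  # values sitting at b0..b5 before the sequence
prog = [
    ("add", lambda x: 100 + x, [1]),  # add r1, b1
    ("sub", lambda x: 200 - x, [5]),  # sub r2, b5
]

static  = run(prog, init, [0, 1])  # program order: sub's b5 is the old b4
dynamic = run(prog, init, [1, 0])  # sub reordered ahead of a stalled add
print("in order,  sub read", static["sub"])   # [4]
print("reordered, sub read", dynamic["sub"])  # [5] -- a different value
```

The reordered run also shifts what the add sees at b1, so without a rename stage that maps each static belt name back to the value it meant at decode, any change in execution order silently corrupts operands.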
> > > I could build this using the tools I had in college in my one transistor hardware course;
> > > it would take me a few weeks, but I am a software guy who only dabbles in hardware.
> > >
> > > > Now am I in the right ballpark?
> > > >
> > > > > > > > > How do you recover from a mispredict.
> > > > > > > >
> > > > > > > > Roll back the belt, and any writes.
> > > > > > > >
> > > > > > >
> > > > > > > And how do you do that, if you haven't saved?
> > > > > > >
> > > > > > > > > Any changes in execution order and values be not in the same positions on the belt.
> > > > > > > >
> > > > > > > > No, belt positions are assigned at decode. Trying to assign
> > > > > > > > belt positions at execution time would not work.
> > > > > > > >
> > > > > >
> > > > > > An aside: my 'simplification' of having to write results to the scratchpad to transfer
> > > > > > data to other ALUs is probably not needed. I do not like the 'push' model, as an ALU
> > > > > > could get data it does not want yet; a 'pull' model is better. But the scheduler will
> > > > > > know what is best and can match up pushes and pulls that are latency sensitive.
> > > > > >
> > > > > > > Belt "positions" are assigned at compile time. Based on when the previous instructions finish.
> > > > > > > The position of a result is implicit. Again, if the execution order
> > > > > > > changes the result does not end up in the same position.
> > > > > > >
> > > > > > > > > > > > This should give you a better idea of the post RISC ideas I am looking at.
> > > > > > > > > > >
> > > > > > > > > > > You're just trying to turn the Mill into a RISC architecture again and it's simply a terrible idea.
> > > > > > > > > > >
> > > > > > > > > > > > > Using the spiller to do speculation should be doable, as mentioned before, but OoOE is impossible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > By the way I live in Austin, TX and that is one of my emails listed in the By field.
>