By: anon (spam.delete.delete@this.this.spam.com), May 16, 2017 1:44 am
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on May 15, 2017 10:23 pm wrote:
> Heikki Kultala (heikki.kultala.delete@this.tut.fi) on May 15, 2017 12:24 pm wrote:
> > Brett (ggtgp.delete@this.yahoo.com) on May 14, 2017 7:42 pm wrote:
> >
> > > > For the last time: How?
> > > > Belt positions are based on when an instruction retires, not when it decodes. Are you
> > > > changing that or do you just hope all instructions have the exact same latency?
> > >
> > > The integer multiply instruction generally has variable latency depending on the size of the operands,
> > > between 4 and 16 cycles. Are you telling me a multiply will take a dump on a random belt position?
> >
> > is 80386 your general processor?
>
> Embedded CPUs, so roughly correct.
>
> I may be a bit out of date; a full multiply is large compared
> to a simple 5-stage pipeline, but small on a real CPU.
>
Do you honestly believe that your modified Gold "Mill" with 8x128b ALUs, which needs ~15W per core even without all the OoO bolted on, would be used as an embedded CPU?
Pipeline length is completely irrelevant for this.
> > Let's talk about modern CPUs instead.
> >
> > Both Zen and Skylake have a fixed 3-cycle latency for integer multiplication.
> >
> > > You are taking this Belt paradigm too far; like Santa Claus, that is not how the real world works.
> >
> > no, he's not.
> >
> > > The easy solution is to say a multiply result takes the next slot, but like OoO that
> > > slot will get filled with a result many cycles after subsequent slots are filled.
> >
> > no, that's a bad solution.
> >
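(A throwaway sketch of why, in Python; the ops and latencies are invented for illustration and nothing here comes from the Mill docs. Slots handed out in decode order get filled wildly out of order, so a consumer addressing the belt by position has no idea whether its slot holds a value yet.)

# Throwaway illustration (mine, not from any Mill material) of the "easy
# solution": each result gets the next belt slot in decode order, but a
# long-latency op fills its slot long after later slots are already full.

def fill_order(ops):
    """ops: list of (name, latency). One op decodes per cycle; the op
    decoded in cycle c gets slot c and fills it at cycle c + latency."""
    events = []
    for slot, (name, lat) in enumerate(ops):
        events.append((slot + lat, slot, name))
    for fill_cycle, slot, name in sorted(events):
        print(f"cycle {fill_cycle}: slot {slot} <- {name}")

# The mul is handed slot 1 at decode but fills it after slots 2 and 3:
fill_order([("add0", 1), ("mul", 10), ("add1", 1), ("add2", 1)])
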
> > The reasonable (but still not very good) solution for those instructions that
> > really have variable latency is to take the position of the lowest possible
> > latency and, in case of longer latency, stall all dependent instructions.
>
> For opcode offset size, that does work best; the downside is that you are giving grief
> to an OoO version that needs to schedule the result write to the belt. Even with a tiny
> latency of 4 cycles, that means you have to look at what the next 4 instructions are doing
> for belt writes before you can find out the slot the multiply result will use.
> This is truly awful, and is probably what Anon is complaining about but never mentioned.
>
I most definitely did, back when I still believed you were talking about modifying the Mill and not something made up that you think is the Mill. I mean, you want a different architecture, and there's nothing wrong with that, but you can't take something that lacks everything that makes the Mill work, modify it, and then claim it'll work because the Mill would have worked with those modifications. On the Mill the modifications wouldn't have worked, and what you're using as a basis doesn't work at all.
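Since we keep talking past each other, here is a toy model of the "assume the shortest latency, stall on overrun" scheme quoted above. It's a serial, one-issue-per-cycle sketch with made-up latencies and belt length, purely for illustration, not anything from the Mill documentation.

# A toy model (serial, one issue per cycle, invented latencies) of the
# "assume the shortest latency, stall on overrun" scheme quoted above.
# Belt positions are fixed by the static schedule; if an op runs longer
# than its assumed latency, the machine stalls until the result can land
# in the slot it was given at decode.

from collections import deque

ASSUMED = {"add": 1, "mul": 4}   # latencies the static schedule is built on
BELT_LEN = 8

def simulate(program, real_latency):
    belt = deque(maxlen=BELT_LEN)             # belt[0] is the newest drop
    cycle = 0
    for i, op in enumerate(program):
        planned_drop = cycle + ASSUMED[op]    # decided at compile/decode time
        actual_drop = cycle + real_latency[i]
        if actual_drop > planned_drop:
            stall = actual_drop - planned_drop
            print(f"op{i} ({op}) runs long -> stall {stall} cycle(s)")
            cycle += stall                    # dependents (everything, here) wait
        belt.appendleft(f"op{i}:{op}")        # result drops in its planned slot
        cycle += 1
    return cycle, list(belt)

# A mul that really takes 7 cycles instead of the assumed 4 costs 3 stall
# cycles, but the belt ends up laid out exactly as the compiler expected.
print(simulate(["add", "mul", "add"], [1, 7, 1]))

The belt layout is whatever the static schedule said it would be; overruns are paid for entirely in stall cycles. An OoO implementation gets none of that for free, because it still has to reproduce exactly this layout.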
> > > With any change in execution order, values will not be in the same positions on the belt.
> >
> > No, belt positions are assigned at decode. Trying to assign
> > belt positions at execution time would not work.
> >
>
> Belt "positions" are assigned at compile time, based on when the previous instructions finish.
> The position of a result is implicit. Again, if the execution order
> changes, the result does not end up in the same position.
>
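To make the "positions are implicit" point concrete, a trivial sketch (again my own illustration, nothing Mill-specific): belt operands are temporal offsets, so the same offset names a different value the moment results drop in a different order.

# Trivial illustration: belt operands are temporal offsets, not names.
# b0 = most recent result, b1 = second most recent, and so on, so what an
# offset refers to depends entirely on the order in which results dropped.

from collections import deque

def drop_results(order):
    """Drop results onto a belt in the given order; return what the
    offsets b0 and b1 refer to afterwards."""
    belt = deque(maxlen=8)
    for name in order:
        belt.appendleft(name)
    return belt[0], belt[1]

# The static schedule expects the mul to finish after the add:
print(drop_results(["add_result", "mul_result"]))   # b0 = mul, b1 = add
# Let execution order change and the same offsets name different values:
print(drop_results(["mul_result", "add_result"]))   # b0 = add, b1 = mul
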
> > It takes a while to get your head around what is happening because
> > it is not RISC, but I do not see any show stoppers so far.
> Belt encoding completely kills it.
> Instead of having to update what each register "means" only on each writing operation, you have to do
> it for every belt position, every cycle, with information reaching I don't know how many cycles into
> the past to update the belt positions with the names of the operation results that were expected to finish
> in that cycle. 32-wide rename, with definitely >5 cycles of history having to be saved as well?
> Yeah, good luck.
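Roughly what that bookkeeping difference looks like, as a toy model with assumed sizes and names: conventional rename touches one map entry per written register, while a renamed belt has every position change meaning on every drop, every cycle.

# Toy comparison of the bookkeeping (sizes and names are assumptions made
# up for illustration). Conventional rename updates one mapping per
# written register; a renamed belt shifts the meaning of every position
# on every drop, so the whole map changes every cycle.

def rename_register(rat, arch_reg, phys_reg):
    """Classic register rename: one table entry changes per write."""
    new_rat = dict(rat)
    new_rat[arch_reg] = phys_reg
    return new_rat

def rename_belt(belt_map, drops):
    """Belt rename: each drop pushes every existing position down by one,
    so every entry of the map is different afterwards."""
    return (drops + belt_map)[:len(belt_map)]

rat = {"r1": "p7", "r2": "p3"}
print(rename_register(rat, "r1", "p9"))        # only r1's mapping changed

belt_map = ["p3", "p7", "p2", "p5"]            # positions b0..b3 -> phys regs
print(rename_belt(belt_map, ["p9", "p11"]))    # two drops: every position moves
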
> Again, the different versions of Mill are not compatible, so an OoO version
> of Mill would pick the next cycle for all results regardless of latency.
>
> Does this sound reasonable, or have I gone ape shit?
>