Coarser-grained checkpointing/tracking

By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), August 28, 2022 9:56 am
Room: Moderated Discussions
avianes (vianes.arthur.delete@this.protonmail.com) on August 28, 2022 5:59 am wrote:
[snip]
> Pretty sure the 256-entry "Scheduler/Instruction Control" acts similar to a ROB.
>
> They claim to use a checkpointing system to do OoO, but all
> modern high-end OoO processors already use checkpointing.
> I believe the bottom line is that their micro-architecture relies much more on checkpointing.
> My guess is that instruction retire on Prodigy can only be performed on instruction marked
> by a checkpoint, which groups instructions into retire instruction groups.
> This should greatly simplify rollback but requires inserting
> checkpoints where they would usually not be required.

An ROB can track execution at coarser granularity than one instruction (or µop). A branch is a natural point to end an instruction group (I think POWER5 did this), but I think an ARM implementation ended groups on a store instruction.
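As a rough illustration of group-granular tracking, instructions can be packed into groups that end at a branch, with each group taking one tracking entry. This is only a sketch: the `form_groups` name, the five-instruction cap, and the instruction mix are assumptions for illustration, not a description of POWER5 or Prodigy.

```python
from typing import List


def form_groups(kinds: List[str], max_size: int = 5,
                terminator: str = "branch") -> List[List[str]]:
    """Pack an instruction stream into retire groups.

    A group closes when it sees the terminator kind or reaches max_size
    (both hypothetical parameters chosen for this sketch).
    """
    groups: List[List[str]] = []
    current: List[str] = []
    for kind in kinds:
        current.append(kind)
        if kind == terminator or len(current) == max_size:
            groups.append(current)
            current = []
    if current:  # trailing partial group
        groups.append(current)
    return groups


stream = ["alu", "load", "alu", "branch",
          "alu", "store", "alu", "alu", "alu",
          "load", "branch"]
groups = form_groups(stream)
# One tracking entry per group instead of one per instruction.
entries_used = len(groups)
```

Passing `terminator="store"` would model the store-terminated variant instead.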

> But anyway if you are doing OoO execution (with exception or interrupt) then you have to
> track all instructions between 2 "retire-points" (just like a ROB) no matter if retire can
> be done on any individual instruction or on instruction groups marked by a checkpoint.

An exception within a group could be handled by having a mode where each instruction used an entry, so the instructions within a group would not have to be tracked individually. (Interrupts are asynchronous so can enter at an arbitrary point.)
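One way to read this is: on a fault, discard the whole group, return to the group's starting checkpoint, and replay it with one instruction per entry so the faulting instruction can be identified precisely. The sketch below models that reading in software; the names (`run_group`, `add`, `div`) and the use of a Python exception to stand in for a hardware fault are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional

Regs = Dict[str, int]
Instr = Callable[[Regs], None]  # mutates regs; may raise ArithmeticError


def add(dst: str, a: str, b: str) -> Instr:
    return lambda r: r.__setitem__(dst, r[a] + r[b])


def div(dst: str, a: str, b: str) -> Instr:
    # Integer division raises ZeroDivisionError before writing dst.
    return lambda r: r.__setitem__(dst, r[a] // r[b])


def run_group(regs: Regs, group: List[Instr]) -> Optional[int]:
    """Run a group; return the index of the faulting instruction, or None."""
    checkpoint = dict(regs)          # state at the start of the group
    try:
        for instr in group:          # fast path: group tracked as one unit
            instr(regs)
        return None
    except ArithmeticError:
        regs.clear()                 # roll back to the group checkpoint
        regs.update(checkpoint)
    for index, instr in enumerate(group):   # slow path: one entry each
        try:
            instr(regs)
        except ArithmeticError:
            return index             # earlier instructions stay committed
    return None


regs = {"r1": 6, "r2": 0, "r3": 1}
group = [add("r3", "r1", "r3"), div("r4", "r1", "r2"), add("r1", "r1", "r1")]
fault_index = run_group(regs, group)
```

After the replay, the instruction before the fault has taken effect and nothing after it has, which is the precise-exception state.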

Each checkpoint might even be considered a micro-function-call in having preserved values and temporary values (explicitly declaring inputs and outputs may also be useful for hardware). A typical "dataless" ROB OoO implementation of a conventional ISA defers freeing a rename register until the architectural register has been overwritten (and the overwriting instruction has committed). While a compiler could theoretically reuse register names as if under extreme register pressure (i.e., as soon as the value is dead, rather than picking from a free list based on other criteria such as avoiding moves for a function call interface or simply preferring callee-saved registers to avoid the possibility of having to save the value), such seems unlikely to be common practice.
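The deferred freeing can be sketched with a toy renamer. This assumes in-order retirement and one destination per instruction; the `Renamer` class and its method names are made up for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class Renamer:
    """Toy RAT: a physical register is freed only when the instruction
    that overwrote its architectural name retires."""

    def __init__(self, arch_regs: List[str], phys_count: int) -> None:
        self.rat: Dict[str, int] = {n: i for i, n in enumerate(arch_regs)}
        self.free: Deque[int] = deque(range(len(arch_regs), phys_count))
        # Previous mapping of each in-flight destination, in program order.
        self.pending: Deque[int] = deque()

    def rename(self, dst: str, srcs: List[str]) -> Tuple[int, List[int]]:
        phys_srcs = [self.rat[s] for s in srcs]  # read sources first
        new = self.free.popleft()                # allocate a new destination
        self.pending.append(self.rat[dst])       # old mapping frees at retire
        self.rat[dst] = new
        return new, phys_srcs

    def retire_oldest(self) -> None:
        self.free.append(self.pending.popleft())


renamer = Renamer(["r0", "r1"], phys_count=4)
first = renamer.rename("r0", ["r0", "r1"])   # r0 = r0 op r1
second = renamer.rename("r0", ["r0"])        # r0 = op r0 (overwrites again)
free_before_retire = list(renamer.free)
renamer.retire_oldest()                      # first instruction retires
free_after_retire = list(renamer.free)
```

Here the physical register allocated by the first instruction stays allocated until the second one retires, even if its value has no further readers.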

With RAT-based renaming, moving tags/names is much cheaper (in general) than moving values. I am not certain how an ISA (and compiler) could exploit this nor whether such would expose too much of the implementation. (Ideally implementation details would be hidden behind an abstraction layer such that the translation from the software distribution format to the native format could be more readily and persistently cached compared to a µop cache; persistence of caching also reduces the criticality and total cost of translation, facilitating more complex transformations.)
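For the first point, a register move can be modeled as a table update plus a reference-count adjustment, with no value copied. This is a simplified sketch: the `TagRenamer` name is hypothetical, and adjusting counts at rename time ignores speculation (a real design would presumably defer the release until commit).

```python
from typing import Dict, List


class TagRenamer:
    """Toy RAT where a move copies a tag instead of a value."""

    def __init__(self, arch_regs: List[str]) -> None:
        self.rat: Dict[str, int] = {n: tag for tag, n in enumerate(arch_regs)}
        # How many architectural names currently point at each tag.
        self.refs: Dict[int, int] = {tag: 1 for tag in self.rat.values()}

    def move(self, dst: str, src: str) -> None:
        old_tag, new_tag = self.rat[dst], self.rat[src]
        self.refs[new_tag] += 1   # dst now shares the tag held by src
        self.refs[old_tag] -= 1   # dst no longer names its old tag
        self.rat[dst] = new_tag   # only the small tag moves, not the data


renamer = TagRenamer(["r0", "r1", "r2"])
renamer.move("r2", "r0")          # r2 = r0, handled entirely in the RAT
```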

Research into Explicit Data Graph Execution has presented some possibilities for greater efficiency, but I very much doubt this is a solved problem.

Since Tachyum started with VLIW, I doubt they would be inclined to develop an extremely out-of-order processor. Communication between threads also does not seem to have been given significant priority (at least I do not recall reading any claims about improvements in that area); for cloud workloads this might not be important (communication between threads and 'services' is expected to be expensive since such may reside on different machines). On the other hand, 'uncore' efficiency is probably significant even for cloud workloads.

The initial claims seemed extraordinary: not only the amount of improvement but the claim that increasing utilization by avoiding special-purpose hardware would greatly improve performance. I am a clumper, liking to repurpose functionality with small additions, but I recognize that specialization has value. As a clumper, I am also disappointed that the SIMD storage and functionality are specialized. While lane restrictions are effectively required to simplify data and control communication, I suspect a useful intermediate design exists between strict one operation broadcast to all lanes with each lane only communicating data to itself and unrestricted 16-wide superscalar execution.

While publicly communicating the hows and whys of a commercial processor design is not in fashion (even for startups that need the attention and explanation to encourage funding and early purchases, though startups are also more limited by labor availability), Tachyum seems to have been rather opaque. (While secrecy can encourage buzz and hide difficulties, it can also encourage skepticism because secrecy can also hide flaws.)

Making a product that can be evaluated by others also has significant value even if the product is a market failure. Even failing from technical mistakes can be illuminating beyond project management considerations; knowing what not to do and why is useful for increasing the rate of informative failures (and successes).