By: Peter (not.delete@this.likely.com), October 31, 2008 3:38 pm
Room: Moderated Discussions
>It fetches 16 bytes at a time, though... Does it fetch unaligned, beginning from
>the branch target? If so, then potentially close to 25% of fetches would be 2 icache accesses
I doubt it.
I would expect that the processor would do an aligned 16-byte fetch and then send the critical first word from this fetch directly from a line-fill buffer past the instruction cache into the fetch buffer.
I doubt that it is counting cache lines; more likely it is counting cache fetches - and there should be four 128-bit fetches in one cache line.
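To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes a 64-byte cache line (implied by four 128-bit fetches per line) and, for the unaligned case raised in the quote, that the branch target is equally likely to land on any byte offset in the line. The function names are made up for illustration and do not describe any real hardware interface.

```python
LINE_BYTES = 64    # assumed cache line size (four 128-bit fetches per line)
FETCH_BYTES = 16   # one 128-bit fetch


def aligned_fetch_base(addr: int) -> int:
    """Round an address down to the start of its aligned 16-byte fetch block."""
    return addr & ~(FETCH_BYTES - 1)


def fetches_per_line() -> int:
    """Number of aligned fetch blocks in one cache line."""
    return LINE_BYTES // FETCH_BYTES


def unaligned_crossing_fraction() -> float:
    """Fraction of byte offsets at which an unaligned 16-byte fetch would
    spill into the next cache line (uniform-offset assumption)."""
    crossing = sum(
        1 for offset in range(LINE_BYTES) if offset + FETCH_BYTES > LINE_BYTES
    )
    return crossing / LINE_BYTES


print(hex(aligned_fetch_base(0x1237)))   # 0x1230
print(fetches_per_line())                # 4
print(unaligned_crossing_fraction())     # 0.234375
```

Under those assumptions an unaligned fetch would touch two lines from 15 of the 64 possible offsets, which is where the "close to 25%" figure comes from. An aligned fetch never crosses a line, so that cost disappears.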
Performance monitors/event counters can be a real pain to implement, so I wouldn't be surprised if there is a difference between how Intel and AMD count icache fetches. At a very coarse level, do you count a fetch as one that is completed and put in the fetch queue, or do you count a fetch from the moment the ICache is addressed - even if the fetch isn't completed due to being in a branch shadow? At a fine-grained level, Core-2 has a 16-byte queue while K8 has a 24-byte queue, so the actual fetch strategy to avoid bubbles may be very different.
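A toy illustration of the coarse-level question, in Python. This is purely hypothetical: the `Fetch` record, the `squashed` flag, and the trace are invented for the example and do not reflect how either vendor actually implements its counters. It only shows that the same fetch stream gives two different totals depending on where the counter is attached.

```python
from dataclasses import dataclass


@dataclass
class Fetch:
    addr: int
    squashed: bool  # True if the fetch sat in a branch shadow and was discarded


def count_on_address(fetches):
    """Policy A: count every fetch as soon as the icache is addressed."""
    return len(fetches)


def count_on_completion(fetches):
    """Policy B: count only fetches that complete and reach the fetch queue."""
    return sum(1 for f in fetches if not f.squashed)


# Hypothetical trace: the third fetch is thrown away after a taken branch.
trace = [
    Fetch(0x1000, squashed=False),
    Fetch(0x1010, squashed=False),
    Fetch(0x1020, squashed=True),
    Fetch(0x2000, squashed=False),
]

print(count_on_address(trace))     # 4
print(count_on_completion(trace))  # 3
```

Both numbers are defensible as "icache fetches", which is why the two vendors' counts need not line up.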
Unless you've got a detailed description of how the performance monitors are implemented (unlikely), I suspect it will be hard to compare two different designs.
To give you an idea, when I put performance counters into a design I try to get them right in the majority of cases, but I'm not going to break my ass trying to get them super-accurate, because that would usually involve factoring in very timing-critical signals from elsewhere in the design.
Most other designers have the same mindset.