By: Patrick Chase (patrickjchase.delete@this.gmail.com), January 30, 2013 10:36 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on January 30, 2013 9:08 pm wrote:
> Unfortunately cycle-accurate system simulation (including a realistic memory controller [one
> not especially old presentation claimed that some "cycle-accurate" simulators use a single
> value for main memory latency] and I/O) is relatively slow. Of course, this would probably
> not be a big deal for evaluating the effect of a single parameter for four values.
Modern cycle-accurate sims for the sorts of CPUs we're discussing run at O(1 MHz) on a single core of a modern CPU. Such simulators unfortunately tend to be inherently single-threaded affairs. I happen to have characterized one earlier today (before I even saw your post) that simulates a 4-wide VLIW, and it came in at *exactly* 1 million simulated clocks/sec on an E5-2670. You can obviously do better in aggregate with multiple instances, though that's more than enough for the sort of evaluation Facebook did.
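To put that throughput in perspective, here's a quick back-of-envelope calculation (Python, with round numbers I'm picking purely for illustration, not from any particular project):

# Back-of-envelope: wall-clock cost of cycle-accurate simulation.
# All numbers below are illustrative assumptions, not measurements.
sim_rate_hz = 1_000_000          # simulated target clocks per wall-clock second
target_clock_hz = 1_000_000_000  # assume a 1 GHz target core
workload_target_seconds = 10     # assume 10 s of target execution per run

target_cycles = target_clock_hz * workload_target_seconds
wall_seconds = target_cycles / sim_rate_hz
print(f"{target_cycles:.2e} target cycles -> {wall_seconds / 3600:.1f} hours of simulation")
# ~2.8 hours per 10 s of target time; independent configurations run as
# separate processes, so a parameter sweep parallelizes trivially across hosts.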
Basically all such simulators accurately model the full cache/TLB hierarchy. They vary in quality when you get to main memory. The best ones provide a plugin interface that allows the SoC developer to implement their own model of the memory subsystem. As you point out, memory controller behaviors (queue depths, request reordering, when/if pages are kept open in unused banks, etc.) have a pretty significant impact.
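To make the "plugin" idea concrete, here's roughly the shape such an interface takes; the class and method names below are my own invention for illustration, not any real simulator's API:

# Hypothetical sketch of a memory-subsystem plugin for a cycle-accurate
# simulator; names and structure are illustrative only.
class MemoryModel:
    """Interface the simulator core calls into each simulated cycle."""
    def enqueue(self, cycle, address, is_write):
        raise NotImplementedError
    def tick(self, cycle):
        """Advance one cycle; return addresses whose data is now available."""
        raise NotImplementedError

class FixedLatencyModel(MemoryModel):
    """The naive 'single value for main memory latency' model."""
    def __init__(self, latency=200):
        self.latency = latency
        self.pending = []            # list of (completion_cycle, address)
    def enqueue(self, cycle, address, is_write):
        self.pending.append((cycle + self.latency, address))
    def tick(self, cycle):
        done = [a for (c, a) in self.pending if c <= cycle]
        self.pending = [(c, a) for (c, a) in self.pending if c > cycle]
        return done

# A realistic controller model would instead track per-bank open pages, queue
# occupancy, and request reordering -- exactly the behaviors that a single
# fixed latency can't capture.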
[snip]
> From the little I have read, early versions of eval boards not infrequently have performance
> bugs (or poor documentation of configuration which can have a similar effect).
Yep. That's one huge reason why the sim is a better bet.
> The ability to tweak parameters and repeat operation can certainly be helpful.
> However, repeatability can also have a disadvantage in not exposing "random" factors
> like page placement or relative timing of execution phases in multiprocessing.
True, though there are well-known ways around that. You can boot a full OS in the sim and run whatever "conditioning" workloads you want either before or during execution, in which case you can expose the sources of variation that you mention. I don't know that all that many people take the trouble, but I've certainly done so in my own work.
> I think ideally one would be able to use hardware with limited configurability to do some gross evaluations (particularly
> for software evaluation). Analytical models, functional simulation, and cycle-accurate simulation all have places
> (in my ignorant opinion), but being able to adjust cache sizes, issue queue sizes, etc. would seem to allow very
> fast exploration of certain factors. There might even be a place for FPGA-based evaluation methods (which might
> be facilitated by the availability of a "cloud" service [universities already have compute clusters and commercial
> compute services are available, but comparable FPGA-based services seem not to be well established--in theory,
> such could be similar to platform evaluation access to high-end machines]).
People who design SoCs use FPGAs all the time, mostly to enable software/firmware co-development (i.e. the firmware is developed in parallel with the HW, using an FPGA to emulate the SoC). There are a couple of issues that come into play when it comes to performance modelling, though:
1. Modern SoCs are often too big to fit into a single FPGA (even the $10000+ kind), so you end up having to either implement a subset of the SoC or partition across multiple FPGAs. If you do the former then you end up with nonrepresentative bus and memory loads. If you do the latter then you typically end up muxing multiple signals onto each of your limited set of FPGA I/Os via serdes, and then you end up running slow (or with compromised timing relationships).
2. A CPU core that would synthesize at 1 GHz in a modern SoC process/library might run at a couple hundred MHz at best in a single top-end FPGA (I'm not pulling these numbers out of thin air :-). At that point the timing relationships between the core, fabric, and memory subsystem end up completely whacked (there's a small worked example after this list).
3. The memory controller that's available "off the shelf" for the FPGA seldom behaves like the one you'll use for your SoC, which goes back to your point about fidelity of memory modelling.
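To put some purely illustrative numbers behind point 2 (these are assumptions for the sake of the arithmetic, not measurements of any particular part):

# Why FPGA prototype timing relationships get skewed: the core clock scales
# down by ~5x, but the DRAM attached to the FPGA keeps roughly the same
# absolute latency, so memory looks far "closer" (in core cycles) than it
# would on the real chip. Numbers below are assumed for illustration.
dram_latency_ns = 60

asic_core_hz = 1_000_000_000   # 1 GHz target silicon
fpga_core_hz = 200_000_000     # ~200 MHz FPGA prototype

asic_latency_cycles = dram_latency_ns * 1e-9 * asic_core_hz   # ~60 core cycles
fpga_latency_cycles = dram_latency_ns * 1e-9 * fpga_core_hz   # ~12 core cycles
print(f"ASIC: {asic_latency_cycles:.0f} cycles, FPGA: {fpga_latency_cycles:.0f} cycles")
# The prototype hides most of the memory latency the real chip would see, so
# conclusions about prefetching, MLP, and queue sizing don't transfer directly.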
> Anyway, thanks for the heads-up.
You're very welcome!
One other way to consider this question is from the perspective of a CPU architect. If you're, say, designing Haswell and you need to decide:
1. How many and what type of functional units to include
2. Whether that third ld/st unit is really a win in a uarch that decodes and retires 4 instructions per clock
3. How big to make your ROB and schedulers
4. How many outstanding loads and stores to support in the memory subsystem
5. How wide or fast the ring bus needs to be to service the CPU, GPU, and I/O loads
Then a very accurate simulator of both the core and its companion/supporting subsystems is basically the only game in town. Such a CPU is far too large and fast for any FPGA, and the variables I listed above are such fundamentally defining features of a microarchitecture that you can't use an existing CPU as a proxy. CPU architects live and die by their simulators (and by extension their choice of workloads to simulate), and if it's good enough for them then it's probably OK for cache sizing :-).
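For what it's worth, day-to-day use of such a simulator mostly looks like scripted sweeps over exactly those parameters. A minimal sketch of what that can look like; the simulator binary, its flags, and its output format here are entirely made up:

# Hypothetical design-space sweep driver; "uarch_sim" and its flags are
# invented for illustration -- real simulators have their own CLIs.
import itertools, subprocess

rob_sizes  = [128, 168, 192, 224]
ldst_units = [2, 3]
workloads  = ["gcc.trace", "mcf.trace", "server_mix.trace"]

for rob, lsu, wl in itertools.product(rob_sizes, ldst_units, workloads):
    cmd = ["uarch_sim", f"--rob={rob}", f"--ldst-units={lsu}", wl]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    # Assume the simulator prints a line like "IPC: 1.73"; parse and log it.
    ipc = next(l.split()[-1] for l in out.splitlines() if l.startswith("IPC:"))
    print(f"rob={rob} ldst={lsu} {wl}: IPC {ipc}")
# Each configuration is an independent process, so the sweep scales across as
# many host cores (or machines) as you care to throw at it.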
Best rgds,
Patrick