Shared State in Video Games
There was a rather interesting talk by Micah Best of Simon Frase University on the management of shared state within video games. Video games represent a multi billion dollar industry where performance and deadline constraints are paramount. Increasingly complex game architectures crafted for multicore processors are causing more bugs to creep into the game experience and ultimately result in schedule slippage. For multicore programming, shared state mutations is a vector for increasingly hard to reproduce bugs. In a typical game’s task execution graph, accessing global memory will serialize otherwise data independent tasks, as the compiler and system cannot ensure the two tasks are accessing input-dependent disjoint areas in the heap. Therefore it is up to the developer to explicitly indicate whether such accesses are implicitly dependent or not. Incorrect assessments result in concurrency errors, thus the cautious programmer will tend towards safety if in doubt and create extraneous dependencies. As a result games will be unnecessarily serialized on increasingly parallel hardware.
Figure 7 – Example game architecture and subsystems
The speaker presented the Cascade project’s framework for rethinking shared state in video game engines. The developer is still tasked with explicitly specifying control flow dependencies. However implicit heap based dynamic data dependencies are recognized at compile time and managed by a runtime system. After determining that C++ is too permissive for the task, project members created Cascade Data Management Language (CDML), a subset C++ forbidding pointers, among other restrictions. If the compiler determines that two tasks may access the same data structure then a dependency is created. However such potential accesses may never actually occur at runtime.
Therefore such overconstraints are refined through the runtime system. Before refinement occurs such tasks cannot be co-scheduled to run in parallel. If a specific variable input to a consumer task is part of its read constraint set and has not been modified then it can be compared to another task’s read set and if the two runtime values are distinct then there is no actual runtime dependency and the tasks can be scheduled to run simultaneously. Thus refinement determines the precise data constraints at runtime. Note there is no speculation performed at any point, for this is unlike software transactional memory, and all actual dependencies are respected at runtime by the system.
Figure 8 – Animation test for Cascade framework
The Cascade framework was evaluated on a 2-socket, 4-core Intel Xeon system (8 cores total) with a character animation library where characters are represented by a hierarchy of bones. Animations on a character model are blended together, but applying them concurrently may cause conflicts. Fine-grained parallel Cascade implementations were compared to a baseline “restricted” to very coarse grained parallelism at the level of entire character models. Preliminary results processing four character models with eight animations ranged from comparable runtime (parallel tasks are “partitioned” to cores thus increasing locality and decreasing scheduling overhead) to four times slower performance (unrestricted parallelism – “signatures”) than the baseline.
Figure 9 – Animation evaluation for Cascade framework
Such results are for the most part due to a surfeit of fine-grained parallelism. When it rains it pours and the significant scheduling overhead from coordinating so many miniscule tasks to run on eight cores vastly outweighs any potential benefit from running the tasks in parallel. The results motivate the need to match the granularity to the workload running on a given system.