By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), March 22, 2021 1:45 pm
Room: Moderated Discussions
Source code is unlikely to be desirable as the primary distribution format. Much software is distributed with an assumption that the end-user is not allowed to modify the program outside of specifically implemented mechanisms and is not interested in detailed debugging or bug-seeking. Security (by obscurity) and protecting trade secrets (including interoperability aspects like file formats as well as algorithms and practices of questionable legality) can motivate prohibiting disassembly (much less decompilation). Source code is also low in density and human-friendly rather than compiler-friendly.
Web Assembly, Java bytecode, Microsoft CLR, and similar intermediate representations are not primarily designed for install-time or even load-time compilation but place some importance on interpretation (simplicity and/or performance thereof). Just-in-time compilation is also assumed to be somewhat straightforward, at least for a first compilation. Density of representation is also considered important. These design choices are partially motivated by an assumption of low persistence relative to reuse (i.e., caching is generally not profitable). I speculate that Web Assembly targets low reuse due to many software sources and frequent modification; such seems common practice for the Web, where even frameworks (cf. standard libraries) seem to be fragmented not only by libraries with similar functionality but also by version and provider (URL vs. a theoretical singular URI).
Distributing a directly executable format avoids translation overhead but typically constrains localized optimization and the diversity of platforms supported. This is attractive to established and popular ISA and OS vendors since it reduces the availability of software for potential and existing alternatives. With long-term ISA and OS compatibility, end-users benefit from simplified management of software (such a lower-level format encourages this compatibility). With limited diversity (also encouraged by a lower-level format), software developers can more easily take on more responsibility for reliability and performance, since much of both is tied to the platform (controlling the intermediate translation layers also helps). Long-term interfaces are also more thoroughly exercised, and the maintainers have some motivation and resources to provide operational compatibility and marketable improvements. The thickness of a persistent interface layer not only extends the persistence over more functionality but also contains more interactions within a single layer (which is significant with leaky layer abstractions). Such formats also provide good density.
An intermediate format between source code and virtual machine language is possible. This format would remove human-meaningful names, perform some optimizations, and provide metadata for further optimization. The nature and diversity of expected end-targets would influence the selection of metadata and the closeness of the format to machine language. The software distribution format proposed for the Mill can assume substantial commonality of functionality (operations are directly on queue/belt entries, static scheduling, select/predication almost always preferred for short branches [falling out of static scheduling and wider execution], no implicit threading, etc.), and so more optimizations/processing can be done in advance.
The distribution format also has implications for responsibility for platform-specific fixes (from a hardware bug, miscommunication of the hardware specification, or a programming bug that is not universally exposed) and for reliability generally. Even software distributed in a directly executable format may include resource recommendations or unsupported configurations; a higher-level distribution format would seem to increase the incentives for caveats and for replacing best effort or guaranteed effect with good faith or reasonable effort. This implies a significant cultural change, perhaps comparable to or greater in significance than that associated with lease vs. own.
Availability of the software in a given format is also a consideration. The persistence failure of source code is a well-known issue and license management issues are fairly well-known, but when a necessary software component is not stored locally, a failure of remote storage can cause an unexpected failure of the software. (Even with local storage, licensing fine print can unexpectedly remove availability.) While some software vendors might prefer requiring relicensing on any platform change, users often have an expectation of ownership (use not limited by time or execution platform); if a portable software format is managed remotely, some efficiencies of caching translation work are possible but ownership-prevention also becomes easier. Back-up and restore may also become more complicated (as shown by the cases of source code persistence failure).
Software distribution can also exploit various opportunities for sharing. As with processor memory cache sharing, even when sharing would provide a reduction of retrieval work, replication can be more efficient overall.
Software formats can be viewed as levels of caching (as well as interface stack levels, and interface stack levels have some relation to pipeline stages). C language development environments have long cached object files under the assumption that modifications are often localized. Theoretically, a programming language and development environment could be developed that facilitated broader use of caching, but any caching should consider the cost of cache hits, the cost of cache management (storage, consistency/coherence, retention choices), and the cost of cache misses (at various levels in a multi-level cache). (If one wishes to be more theoretical and abstract, the software concept could be viewed as a caching level drawn from reality by market research and high-level development, and source code as a caching of the programmer effort translating the software concept.)
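As a rough illustration of the object-file caching described above, here is a minimal Python sketch of a content-addressed translation cache. Everything in it is an assumption made for illustration: `TranslationCache`, `fake_translate`, and the target string are hypothetical names, and the "translation" is a stand-in for a real compiler. It only shows how a localized modification leaves most cached work reusable, and how hit and miss counts could be tracked when weighing the costs mentioned above.

```python
import hashlib


class TranslationCache:
    """Hypothetical content-addressed cache of translated units."""

    def __init__(self, translate):
        self._translate = translate  # the expensive step, e.g. a compiler
        self._store = {}             # key -> translated artifact
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(source, target):
        # The key covers both the input and the target, so a change to
        # either one forces a retranslation (a simple consistency rule).
        data = (target + "\0" + source).encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def get(self, source, target):
        key = self._key(source, target)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._translate(source, target)
        return self._store[key]


def fake_translate(source, target):
    # Stand-in for real compilation; purely illustrative.
    return "[" + target + "] " + source.upper()


def build(cache, units, target):
    # "Build" every unit, reusing cached translations where possible.
    return [cache.get(unit, target) for unit in units]


cache = TranslationCache(fake_translate)
units = ["int f();", "int g();", "int h();"]

build(cache, units, "x86-64")   # cold cache: every unit is translated
units[1] = "long g();"          # a localized modification
build(cache, units, "x86-64")   # only the modified unit is retranslated

print(cache.hits, cache.misses)
```

The sketch never evicts anything and ignores storage cost, so it leaves out the retention and management choices that the paragraph above treats as part of the real cost of caching.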
Mutability and persistence of information are also considerations. Some information is considered highly mutable and is only cached locally in the processor; branch predictors and cache replacement information are common examples. Yet profile information is considered useful for software-managed optimization, implying either that earlier optimization lacked this information and so is generic, or that the information has limited temporal (e.g., processing one dataset) or spatial (e.g., user) persistence. Another mutability that has been explored is microarchitecture-specific and machine-specific optimization; a machine executable can be optimized for one microarchitecture and a re-optimization for the actual microarchitecture in use might provide other benefits. One could imagine different optimization goals also generating different end-formats; the relative value of different resources (time-to-solution [worst, average, good-enough fraction, variability, etc.], energy, power, memory bandwidth, etc.) can vary among users and over time.
Just as bug-reporting is a common up-stack information transmission, one could imagine performance and usage patterns being useful more broadly than a local system. As with bug-reporting, privacy issues exist. As with misfeature-reporting (e.g., observing user interface activity that hints at misunderstanding or confusion), logging and sending performance information could be more expensive than useful — and the utility likely correlates with software maturity.