By: Maynard Handley (name99.delete@this.name99.org), May 16, 2013 10:22 am
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on May 16, 2013 6:57 am wrote:
> Brendan (btrotter.delete@this.gmail.com) on May 16, 2013 12:29 am wrote:
>
> > Is it reasonable to expect competent developers to be able to handle that extra complexity when
> > it's beneficial? I guess this depends on how you define "competent". I'd say "it's definitely
> > reasonable" (it's not the 20th century anymore) but other people may have lower standards.
>
> See this paper http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
>
> Key quote from the conclusion: "non-trivial multi-threaded programs are incomprehensible
> to humans".
>
> And this experience with an expert team using best practices:
>
> "A part of the Ptolemy Project experiment was to see whether effective software engineering
> practices could be developed for an academic research setting. We developed a process that included
> a code maturity rating system (with four levels, red, yellow, green, and blue), design reviews, code
> reviews, nightly builds, regression tests, and automated code coverage metrics [43]. The portion
> of the kernel that ensured a consistent view of the program structure was written in early 2000,
> design reviewed to yellow, and code reviewed to green. The reviewers included concurrency experts,
> not just inexperienced graduate students (Christopher Hylands (now Brooks), Bart Kienhuis, John
> Reekie, and myself were all reviewers). We wrote regression tests that achieved 100 percent code
> coverage. The nightly build and regression tests ran on a two processor SMP machine, which
> exhibited different thread behavior than the development machines, which all had a single processor.
> The Ptolemy II system itself began to be widely used, and every use of the system exercised this
> code. No problems were observed until the code deadlocked on April 26, 2004, four years later.
> It is certainly true that our relatively rigorous software engineering practice identified and fixed
> many concurrency bugs. But the fact that a problem as serious as a deadlock that locked up the
> system could go undetected for four years despite this practice is alarming. How many more such
> problems remain? How long do we need test before we can be sure to have discovered all such
> problems? Regrettably, I have to conclude that testing may never reveal all the problems in nontrivial
> multithreaded code."
>
The actual situation is even worse than is suggested here. There are two separate issues:
- programming correctness
- programming efficiency
For parallelization to be worthwhile, even before we get to writing the code correctly, we have to have algorithms and data structures that parallelize well. In spite of what certain people on this board have claimed, we simply do not have these in many important cases, and it isn't because the entire freaking world is too lazy and stupid to do the job.
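To make that concrete, here is a toy C sketch (my own illustration, not taken from the paper or from any particular workload) of what an essentially sequential computation looks like: a loop-carried dependency that no number of cores, threads, or locks will remove.

#include <stdio.h>

int main(void)
{
    double x = 0.3;        /* arbitrary initial value */
    const double r = 3.7;  /* arbitrary parameter */

    /* x[i+1] = r * x[i] * (1 - x[i]): every iteration needs the result
       of the previous iteration, so the loop cannot be split across
       cores; the dependency is in the algorithm, not in the code. */
    for (long i = 0; i < 100000000L; i++)
        x = r * x * (1.0 - x);

    printf("final x = %f\n", x);
    return 0;
}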
If you look at something like Haswell, what Intel is trying to do there is make the correctness part of the problem easier (basically by allowing the use of many fewer locks without losing performance), and this is obviously a help; but it still doesn't solve the problem of algorithms/data structures which are ESSENTIALLY sequential.
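To be concrete about the correctness half, the pattern Haswell's TSX (RTM) is aimed at looks roughly like the C sketch below. This is my own minimal illustration, not Intel's reference code: counter and lock_taken are made-up names, and a real implementation would add a retry policy before falling back to the lock. Compile with -mrtm.

#include <immintrin.h>   /* _xbegin / _xend / _xabort */
#include <stdatomic.h>

static atomic_int lock_taken;   /* hypothetical fallback spinlock */
static long counter;            /* hypothetical shared data */

static void lock_fallback(void)
{
    while (atomic_exchange_explicit(&lock_taken, 1, memory_order_acquire))
        ;  /* spin until we own the lock */
}

static void unlock_fallback(void)
{
    atomic_store_explicit(&lock_taken, 0, memory_order_release);
}

void increment(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Read the lock word inside the transaction: if someone holds
           the real lock, abort so we never run alongside them. */
        if (atomic_load_explicit(&lock_taken, memory_order_relaxed))
            _xabort(0xff);
        counter++;              /* the "critical section" */
        _xend();
    } else {
        /* Transaction aborted (conflict, capacity, interrupt, ...):
           fall back to actually taking the lock. */
        lock_fallback();
        counter++;
        unlock_fallback();
    }
}

In the common uncontended case the transaction commits and the lock word is never written, so threads don't serialize on it; but nothing in this buys you anything if the algorithm itself has the kind of dependency shown above.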