Windows vs Unix/Linux culture

Richard Cownie on 8/23/09 wrote:
>Agreed. Actually the synchronization overhead can be as
>low as you feel like making it: you can have a small
>block of shared memory for synchronization, and do a
>user-space spinlock. That aspect isn't fundamentally
>different from lightweight threads.

Sure, if you go down the shared memory route then you end up writing almost exactly the same code as a threaded implementation with just a different "default" assumption (threads: data is shared by default, steps need to be taken to allocate thread-local storage[*]; processes: data is independent by default, steps need to be taken to allocate shared storage).

However, a different default assumption alone doesn't buy you much if you consider writing correct and safe threaded code to be too difficult, unless your perceptions are coloured by widespread and reckless use of global variables. (That's a no-no in our shop; we only have a few read-only status variables that are initialised during program startup.)

[*] In MSVC the easiest way to declare a variable thread-local is with a declspec:

__declspec(thread) int tls_i = 1;

Of course all the (non-static) local variables are thread-local as well.
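For compilers without __declspec(thread), C++11 and later offer a portable spelling, the thread_local keyword. The following is only a minimal sketch of the behaviour, assuming a C++11 compiler; bump_in_worker is a hypothetical helper invented for the illustration. It shows that a write made in one thread leaves every other thread's copy untouched.

```cpp
#include <cassert>
#include <thread>

// Each thread gets its own copy of tls_i, initialised to 1 for that thread.
thread_local int tls_i = 1;

// Hypothetical helper: starts a worker that modifies its own copy of tls_i
// and reports the value the worker ended up with.
int bump_in_worker() {
    int seen = 0;
    std::thread worker([&seen] {
        assert(tls_i == 1);   // a fresh thread starts from the initialiser
        tls_i += 41;          // touches only the worker's copy
        seen = tls_i;
    });
    worker.join();            // join makes the write to 'seen' visible here
    return seen;
}
```

After calling bump_in_worker() from the main thread, the main thread's own tls_i is still 1.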

>My pipe example just
>showed that even doing it in the dumbest most straightforward
>way possible isn't prohibitively slow.

For some cases, yes. Not all. As I said, there are cases where the performance penalty of using separate processes means the parallel implementation is not worth it. You're the one who tried to minimise the importance of those cases, despite them representing a large proportion of the cases we have encountered. I have repeatedly said that using separate processes represents the ideal case to be used when circumstances allow.

>Jason's claim that serialize/deserialize makes it slow
>also seems very dubious. Fundamentally, the serialize/
>deserialize needs to do a hashtable lookup for each pointer,
>and simply copy other data directly. Obviously the
>hashtable lookups will make it slower than merely copying,
>but not massively slower - maybe 10x, but nowhere near
>1000x. Possibly the claim arises because most serialize/
>deserialize code is only designed and optimized to match
>the bandwidths of disks or 100Mbit Ethernet.

Excuse me, but before calling my claims "dubious" please make sure you understand what I said.

I never said that serialisation/deserialisation was a thousand times slower than a straight copy. Most of our serialisation/deserialisation is a straight copy! The way we've implemented it, we don't even need to mess around with preserving pointers because nothing retains a pointer to anything it doesn't own.

What I did say was that it would need to be about a thousand times faster to compete with not having to copy it at all.

>The real weakness is in having to know in advance all
>the data structures that could be accessed, and to copy
>them all.

That's also a huge burden. You've stated many times how much effort is required to revisit the assumptions of a large codebase stretching back 15 years while at the same time advocating a sweeping change like this.

Using threading I can look at a tiny part of the code, analyse it, and thread it safely with minimal changes. I can do this bit-by-bit over time, finding the next best operation to parallelise and implementing it.
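To make the bit-by-bit approach concrete, here is a rough sketch, assuming C++11 std::thread and an operation that only reads its input. Both sum_squares and sum_squares_parallel are hypothetical names made up for the example, not taken from any real codebase. The serial routine is left exactly as it was; the only new code is a small wrapper that hands each thread a disjoint slice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical existing serial operation. It only reads [first, last),
// so concurrent calls on disjoint (or even overlapping) ranges are safe.
double sum_squares(const double* first, const double* last) {
    double total = 0.0;
    for (; first != last; ++first) total += *first * *first;
    return total;
}

// Threaded wrapper: one contiguous chunk per thread, and each thread
// writes only to its own slot in 'partial', so no locking is needed.
double sum_squares_parallel(const std::vector<double>& data, unsigned n_threads) {
    assert(n_threads > 0);
    std::vector<double> partial(n_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min<std::size_t>(data.size(), t * chunk);
        const std::size_t end = std::min<std::size_t>(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = sum_squares(data.data() + begin, data.data() + end);
        });
    }
    for (auto& w : workers) w.join();
    // Partial sums are combined in a fixed order, but rounding can still
    // differ slightly from the purely serial sum.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The rest of the program never needs to know the operation became parallel, which is what allows the next candidate to be picked off independently later.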

To do it your way I would have to predict all the ways I might want to slice and dice the data in future; otherwise I'd have to go through the whole process every time I wanted to parallelise a certain operation. That would be tricky, since different operations really want to partition the data in different ways.

>That undoubtedly puts a constraint on the
>kind of algorithms that will parallelize in this way.


>But I believe it will work out just
>fine for many of the problems I'm interested in - with
>run-times of many hours, shooting for computation
>granularity of 5mS or greater leaves room for a heck of
>a lot of granules.

I have never said that your problems aren't of the category that are amenable to multiprocessing. I have no idea, I've never seen them. What I've been responding to are your claims about the value of multithreading.

Some of our operations are sufficiently coarse-grained that parallel computation is worthwhile. Most are not. If we restricted ourselves to only the former category we would be sacrificing a significant performance advantage that justifies a higher price for our software than our competitors'.

>Also let's note that threading doesn't make you immune
>to the problem of compute-communication ratio: if your
>thread really does access 25MB of data (or at least,
>locations in enough cache lines to add up to 25MB of cache
>lines), then you don't actually get that data for free -
>it's going to take significant time to get into the
>L1 cache of the worker thread. It only looks great if
>there's 25MB of data that you *might* access, and in fact
>you only access a small fraction of it.

As I already said, modern processors are very good at hiding latency when you aren't bandwidth constrained.

Consider the case I mentioned: the processor is accessing shared data at a rate of no more than a few hundred MB/sec, and each access is surrounded by quite a few computations. There is also a high probability that the next work item will be very close to recent work items, enhancing the effectiveness of the caches. The data that is being shared is read-only for this operation, meaning the caches won't need to be flushed all the time due to dirty data. During processing, pretty much all of the shared memory will be accessed by at least one thread (although we can't predict by which, because that depends on the results of the computation), and probably several.

That means that although all 25-100 MB of data will ultimately be read by the CPU(s), the cost of doing so will be very effectively hidden in a way that physically copying it up-front into each memory space cannot. Furthermore, having one complete copy of the data in memory for each process not only wastes memory but decreases the effectiveness of the caches.

(In case you're wondering, this operation repeats itself many times with new shared data at each iteration.)
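The access pattern described above (one read-only block, with no way of knowing in advance which thread will touch which part) can be sketched roughly as follows. This is an illustration only, assuming C++11; count_above and its parameters are hypothetical stand-ins for the real operation. Every thread reads the same table through a const reference and claims work items from a shared atomic counter, so nothing is copied up-front.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Counts the entries of a shared, read-only table that exceed 'threshold'.
// Which thread processes which entry is decided at run time.
std::size_t count_above(const std::vector<int>& table, int threshold,
                        unsigned n_threads) {
    std::atomic<std::size_t> next{0};   // index of the next unclaimed item
    std::atomic<std::size_t> hits{0};   // combined result
    auto worker = [&] {
        std::size_t local = 0;          // per-thread tally, merged once at the end
        for (;;) {
            // Claim one item; real code would probably claim a batch at a time.
            const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= table.size()) break;
            if (table[i] > threshold) ++local;
        }
        hits.fetch_add(local, std::memory_order_relaxed);
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) pool.emplace_back(worker);
    worker();                           // the calling thread helps as well
    for (auto& th : pool) th.join();    // join orders the relaxed updates before the read
    assert(next.load() >= table.size());  // every item was claimed
    return hits.load();
}
```

Because the table is never written during the operation, no locking is needed around the reads; the only shared mutable state is the two counters.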

So the next step is to use shared memory instead, and, as I said, now you're doing exactly the same sort of analysis that would be required if you were using threading. The only difference is the default status of global variables, and minimising the use of global variables has been in our coding standard for a very long time.

BTW, I did forget one feature of C++ that does help with threading, and that's const, because if something isn't being modified during the execution of the parallel operation then things become much simpler. Our code has been const-correct for a long time because I'm a great believer in putting all your assumptions into the code so the compiler can check them for you. (It can also help the optimiser, in theory.)

Of course const can be defeated by const_cast (easy to search for), by standard C-style casts (very hard to search for, which is why they're prohibited under our coding guidelines), and by mutable. If you remove the constness of an object then it's up to you to ensure that access to the mutable portion is thread-safe if it's ever used by threaded code. (The only justification I can think of for doing so is to let the object cache a computation, so that the only difference to the outside world is how quickly it returns a result.)
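A minimal sketch of such a cache, assuming C++11 (std::mutex postdates the compilers of the time) and using a made-up class, Samples, purely for illustration. The cached value and the mutex that guards it are both declared mutable, so the const member function can fill in the cache while remaining safe to call from several threads.

```cpp
#include <cassert>
#include <cmath>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical class: logically const, but caches an expensive result.
class Samples {
public:
    explicit Samples(std::vector<double> values) : values_(std::move(values)) {}

    // Euclidean norm, computed on first use and cached afterwards.
    double norm() const {
        std::lock_guard<std::mutex> lock(mutex_);  // guards all mutable members
        if (!cached_) {
            double sum = 0.0;
            for (double v : values_) sum += v * v;
            norm_ = std::sqrt(sum);
            cached_ = true;
            ++computations_;
        }
        return norm_;
    }

    // How many times the norm was actually computed (for demonstration only).
    int computations() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return computations_;
    }

private:
    std::vector<double> values_;
    mutable std::mutex mutex_;
    mutable bool cached_ = false;
    mutable double norm_ = 0.0;
    mutable int computations_ = 0;
};

// Calls norm() on one const object from several threads at once.
double norm_from_threads(const Samples& s, unsigned n_threads) {
    assert(n_threads > 0);
    std::vector<double> results(n_threads, 0.0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t)
        pool.emplace_back([&s, &results, t] { results[t] = s.norm(); });
    for (auto& th : pool) th.join();
    return results[0];
}
```

Taking the lock on every call is the simplest correct choice; whether that cost is acceptable depends on how hot the accessor is.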
