Just get on with it will you!

Concurrency seems to be harder than it ought to be, even when you take that into account.
I was doing a code review a couple of days ago. The code in question had a main thread that launched a series of worker threads and a timer event that reported on their progress. The structure of the main thread (when simplified) looked like:
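            AutoResetEvent xx = new AutoResetEvent(false);
            AutoResetEvent yy = new AutoResetEvent(false);
            WaitHandle[] handles = new WaitHandle[] { xx, yy };
            ThreadPool.QueueUserWorkItem(DoWork, xx);
            ThreadPool.QueueUserWorkItem(DoWork2, yy);
            // Output state of workers using timed event
            WaitHandle.WaitAll(handles); // first wait: each worker has called Set once
            WaitHandle.WaitAll(handles); // second wait: each worker has called Set again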

The workers were structured as follows:
            AutoResetEvent eventToSignal = state as AutoResetEvent;
            // Do some processing
            eventToSignal.Set();
            // Do some more processing, then call eventToSignal.Set() again
What I realised, when I looked at it, is that I didn’t know at what point the auto reset event gets reset to a state where it can receive the next call to Set. If it hasn’t been reset, a subsequent call to Set is effectively lost, because the event is already signalled and signals are not counted. Reading the documentation on WaitAll didn’t give much of a clue, though the documentation on AutoResetEvent did hint at how WaitAll and AutoResetEvent interact: the events are only reset when a waiting thread is released, which for WaitAll means when every handle in the array is signalled. This is obvious when you think about it, as there could be multiple WaitAll calls waiting on arrays of events that contain the same handle. You wouldn’t want one of them to be picked arbitrarily as the WaitAll that resets the event, and it would be hard to have the two WaitAll calls coordinate after resetting it. Sure enough, changing one of the worker threads to sleep before calling Set for the first time causes the main thread to hang on the second WaitAll, because the other worker’s second call to Set is lost.
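The lost signal can be reproduced without any real threads. What follows is only a minimal sketch, not the reviewed code: it replays on a single thread the ordering in which the first worker makes both of its Set calls before the second worker has made any, and it uses a timeout on WaitAll so that it reports the problem rather than hanging. The names LostSignalDemo and SecondWaitAllTimesOut, and the timeout values, are my own inventions for illustration.

```csharp
using System;
using System.Threading;

public static class LostSignalDemo
{
    // Returns true when the first WaitAll succeeds and the second one times out,
    // i.e. when one of the Set calls has been lost.
    public static bool SecondWaitAllTimesOut()
    {
        using (var xx = new AutoResetEvent(false))
        using (var yy = new AutoResetEvent(false))
        {
            WaitHandle[] handles = { xx, yy };

            // Worker 1 runs to completion before worker 2 has signalled anything.
            xx.Set(); // xx becomes signalled
            xx.Set(); // xx is already signalled, so this call changes nothing

            // Worker 2 signals for the first time.
            yy.Set();

            // Both events are signalled: the wait is satisfied and both are reset.
            bool first = WaitHandle.WaitAll(handles, TimeSpan.FromSeconds(1));

            // Worker 2 signals for the second time.
            yy.Set();

            // xx was reset by the first wait and worker 1 has no Set calls left,
            // so this wait can never be satisfied. The original code would hang here.
            bool second = WaitHandle.WaitAll(handles, TimeSpan.FromMilliseconds(200));

            return first && !second;
        }
    }

    public static void Main()
    {
        Console.WriteLine("Second WaitAll timed out: " + SecondWaitAllTimesOut());
    }
}
```

Replaying the interleaving by hand like this makes the failure deterministic, which is the same idea as injecting a delay into one of the workers, only without relying on timing.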
This made me realise yet again how subtle some of this concurrency code can be. The example this was extracted from would work virtually all of the time, as the work section of the threads would usually take several minutes. The bug would only become apparent when the thread pool was short of worker threads, so that worker1 got time to run and finish before worker2 started, or when the system hit an extreme condition that slowed one of the threads down. This kind of thing would be quite hard to find using black box testing without injecting delays into the thread code. Maybe it is down to people correctly reasoning about code behaviour, but then the documentation should make the behaviour of some of the concurrency primitives much more obvious so that people’s mental model is more accurate.