I was thinking the other day about how hard it is to evaluate the impact of a bug fix. You have a bug report and have determined the fix for it – how do you then weigh the benefit of the fix against the instability that releasing it might cause? I think this is a very hard call.
I came up against this problem nearly a year ago. Microsoft were just about to release the 4.6 version of the .NET framework, and we were lucky enough to get a beta version to try. Several of us installed this beta onto our development machines and continued working as normal. One of the testers in the team noticed that a PDF rendering component we were using, and had been using for years, was no longer laying out graphs correctly: various elements were ending up in the wrong position. Mysteriously, it seemed to work on other people’s machines, and also seemed to work when we ran the application inside Visual Studio, but not when we ran it inside WinDbg. We also didn’t see any failures if we built as x86, and so we spent a while checking whether previous builds of the product had accidentally been x86. It was only when I was doing the washing up that night that I twigged that this is exactly what happens when you have a JIT bug. Launching inside Visual Studio turns off some of the JIT optimisations, whereas running inside WinDbg leaves those optimisations turned on – and the new 64-bit JIT in 4.6 doesn’t apply to x86 builds at all, which explained the third symptom.
The next morning I went in to work and verified this by setting the application config to use the legacy JIT, and sure enough the bug didn’t happen. It was then a case of gathering more data and isolating the method that was giving the wrong result; this turned out to be a point where the JIT made an optimised tail call. I therefore reported it on Connect. As is usual, I was then asked for a self-contained reproduction, which I supplied the next day. Time passed, and then the issue was marked as fixed, with the fix noted as available in a later release of 4.6.
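For anyone wanting to try the same diagnostic step: the switch is the `useLegacyJit` element that shipped with .NET 4.6, set in the application’s app.config. With it enabled, the runtime falls back to the previous 64-bit JIT compiler:

```xml
<configuration>
  <runtime>
    <!-- Fall back to the pre-4.6 64-bit JIT; if the bug disappears,
         the new JIT's code generation is the likely suspect. -->
    <useLegacyJit enabled="1" />
  </runtime>
</configuration>
```

If the misbehaviour vanishes under the legacy JIT but returns when the setting is removed, that is strong evidence of a code-generation bug rather than a bug in your own code.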
The question is: how do you gauge the impact of a bug like this and decide whether the fix goes out straight away, or whether you test it a lot more and release it later? In the advisory, Microsoft said that they had run lots of in-house code and hadn’t found a manifestation of the issue. However, that might not be quite good enough, as people can take such a bug and try to convert it into an exploit – see the comment on the thread in the CoreClr issue where the return address can be obtained – so if you get unlucky and there’s a widely available framework that allows an exploit to work, you probably do need to push out a fix. There is also the issue that the .NET framework is most often used to run code written in C#, which probably doesn’t contain many tail-call-optimisable method invocations, but it is also the runtime for languages like F#, where the design patterns lead to far more tail calls happening (as you recur down algebraic data types).
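To make the F# point concrete, here is a hedged sketch (the type and function names are hypothetical, invented for illustration) of the idiomatic shape that produces tail calls: pattern matching over an algebraic data type where the recursive call is the last thing the function does, which the F# compiler emits with a `tail.` prefix for the JIT to optimise.

```fsharp
// Hypothetical example: a cons list as an algebraic data type.
type IntList =
    | Nil
    | Cons of int * IntList

// The recursive call is in tail position – the last thing the
// function does – so it compiles to exactly the kind of optimised
// tail call the JIT bug miscompiled.
let rec sum acc xs =
    match xs with
    | Nil -> acc
    | Cons (head, tail) -> sum (acc + head) tail
```

Equivalent C# would typically be written as a loop, which is why C# code hits this code path far less often than F# code does.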
Pondering the issue, I find it really hard to see how you could get enough data to inform the decision. What’s the ratio of CLR runs of F# programs to C# programs? What percentage of C# programs would hit this issue? And of F# programs? What’s the chance of the bad compilation being turned into an exploit?
In the end I suspect it just comes down to a call by a product manager, who guesses the severity and then uses feedback to determine whether the call was right – obviously the more “important” the sender of the feedback, the more weight they are given. And that’s a shame.