This is a very interesting real life example of a classic concurrency issue and very neatly illustrates the abstract process of fault finding a race condition. What makes it particularly interesting is that rather than involving high performance multi-threaded code it is in an automated user interface test.
A test developer is in the process of working on an automated TestComplete test which is part of a suite that is run on a number of benchmark virtual machines. He discovers he has an interesting problem: his test fails quite regularly with an odd error on his machine but works on the test virtual machines. In an even more confusing development, when he debugs the test script in the test tool IDE the bug does not reproduce.
The error is "invalid operation" when the fault reproduces and happens in a routine that simply invokes the close action on all the forms contained in an MDI application. The routine is coded adequently enough: if a form is present the close action is invoked: else nothing is done and no error returned. It's a simple fire and forget function intended to get a simple job done prior to starting another test.
Browsing in the code before and after the point of the error reveals the close all forms code is actually run twice: once at the very end of a function which has been called as a sub-routine just before the same procedure attempts the same process.
The Detailed Diagnosis
Recording, running and debugging tests using TestComplete illustrates that tools like TestComplete do a very good job of making an inherently very complex problem appear very simple. There are features in the engine that mask the asynchronous nature of the communication of the simulated user actions from the TestComplete engine to the application under test. This is only a fiction, however and under the right circumstances this fiction can be exposed; this is just such an occasion and how this fiction was exposed is revealing.
What is happening is that the test was exceuting the function directing any open windows to close, and then shortly afterwards again running the same code.
In investigating the difference between the various scenarios is was found:
Around this point it became clear what had been happening and it's related to the way the OS apportions time slices and the time it takes for these windows to close.
- The test code being executed was indeed identical (always worth checking)
- The execution environment of the tests was identical in version
- The developer's machine had 2 CPUs and the virtual machines had 1 CPU among other less obvious differences
On a single core machine time slices are doled out to the many tasks on the machine in a pre-emptive manner in order to achieve multi-tasking. Of course this is going on with the processes of TestComplete and the application under test; on single core it's something like 5 ms of TestComplete then 5 ms of the application, plus change for vital tasks the operating system has to perform like passing messages between applications. With so much time (?! - really!) when an application effectively has the machine to itself it's easy to see how an action initiated by TestComplete can be entirely completed when the application under test gets its time slice. Although there is a race condition in the test script code, having a single core "forces a winner" which results in correct operation for a large proportion of the time.
On multiple cores the race condition can be exposed and if in fact highly likely to be exposed: TestComplete goes through all the open forms and directs them to close; the application under test, having another core to run on starts to do this immediately; TestComplete then repeats the order to the forms to close; some of the forms being in process of closing down, this is an error and is raised as such in the test script.
The cure should be fairly obvious: remove the superfluous code that closes all windows; as it turns out the problem code was due to a copy and paste when it should have been a cut and paste.
The Moral of the Story
There are several:
In this case is seemed natural to assume that what TestComplete presents as happening as being the exact case. Namely the script directs the application windows to close and they close without issue; in this view, what could be incorrect in performing the same logically flawless operation twice? If all the windows closed, then closing zero windows is wasteful but not incorrect.
- when coding, use source control to review what you've been doing and make sure all the code you've been moving around has moved to the right places and also removed from the original places.
- when debugging a problem related to concurrency there are some watchwords that developers are very familiar with:
- assume nothing about the sequence of events of concurrent processes, which can be done by considering
- what happens when everything happens in the order you expect
- what happens when nothing happens in the order you expect
Since we know the reality is not that simple and the window closing process was initiated and then it was attempted again on windows that were already closing it is easy to see how it is incorrect to tell a window to close twice. At best it might ignore the request, at worst it may be literally unable to respond to the request.
Hopefully this simple example has illustrated a vital issue in the testing of complex applications and the development of test scripts and in the future when a similar issue crops up the same checklist will help uncover the root cause quickly:
- Discover what parameters are varied between the situations
- Examine what is setting up the conditions to reproduce the fault
- Desk-check the code by working through it logically to confirm your understanding
- Don't panic: if you are presented with a working scenario and a broken scenario you are being handed the conditions for success on a plate, albeit a little obliquely and with some user assembly required.
来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/11379785/viewspace-690485/，如需转载，请注明出处，否则将追究法律责任。