Yet Another Issue
One of the features we added to the 1.4.4 release of CruiseControl.NET was the ability to hot-swap the DLLs. This means you can xcopy new DLLs in over top of the existing DLLs and CruiseControl.NET would play nice, i.e. just reload itself with the new DLLs.
Unfortunately, after this enhancement we started hearing of people having problems stopping the service. People would attempt to stop the service using the SCM (Service Control Manager), but it would fail: they would get an error message telling them a problem had happened, and the service would continue running. Not very good.
Anyway, this week-end I spent some time tracking down the issue. Actually, it was more like ten minutes here, another ten minutes a few hours later, and so on…
Normally, this would be a disruptive process, but today, it actually solved the problem!!!
Cross-Domain Calls
In order to get the hot-swapping working, we fire up a second AppDomain in CruiseControl.NET and load all the DLLs there. This works because the DLLs are shadow-copied (more details on this are available here). This is where the fun starts – in order to make this work, we needed to make some cross-domain calls. The following diagram shows the two AppDomains and the calls needed:
Basically, CCService starts up in the primary AppDomain. It starts up a second AppDomain and instantiates an instance of AppRunner in this domain. AppRunner then interacts with CruiseServer – which is one of the classes in the DLLs. CruiseServer then does all the real work.
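To make that start-up sequence a bit more concrete, here is a minimal sketch of creating a shadow-copying AppDomain and instantiating a class inside it. This is not the actual CruiseControl.NET code: the AppRunner below is a hypothetical stand-in, the domain name and the WhereAmI() method are made up for illustration, and it assumes the .NET Framework (multiple AppDomains are not available on .NET Core and later).

```csharp
using System;

// Hypothetical stand-in for the real AppRunner. It derives from
// MarshalByRefObject so the primary domain receives a proxy to it
// rather than a copy.
public class AppRunner : MarshalByRefObject
{
    // Illustrative method: reports the friendly name of the AppDomain
    // this instance is actually executing in.
    public string WhereAmI()
    {
        return AppDomain.CurrentDomain.FriendlyName;
    }
}

public static class Program
{
    public static void Main()
    {
        // Ask the runtime to shadow-copy assemblies loaded in the new
        // domain, so the original DLL files are not locked on disk.
        var setup = new AppDomainSetup
        {
            ApplicationBase = AppDomain.CurrentDomain.BaseDirectory,
            ShadowCopyFiles = "true" // note: a string, not a bool
        };

        // "RunnerDomain" is an arbitrary name chosen for this sketch.
        AppDomain runnerDomain = AppDomain.CreateDomain("RunnerDomain", null, setup);

        // Create AppRunner inside the second domain; the object we get
        // back in the primary domain is a transparent proxy.
        var runner = (AppRunner)runnerDomain.CreateInstanceAndUnwrap(
            typeof(AppRunner).Assembly.FullName,
            typeof(AppRunner).FullName);

        // This call crosses the AppDomain boundary via the proxy.
        Console.WriteLine(runner.WhereAmI());

        AppDomain.Unload(runnerDomain);
    }
}
```

Every call on runner here goes through the proxy, which is exactly the kind of cross-domain call the rest of this post is about.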
Now, this part all worked pretty well – the problem comes with the cross-domain calls between the two AppDomains.
Introducing MarshalByRefObject
In order to be called across domains by reference, a class must inherit from MarshalByRefObject. This class abstracts all the logic of making cross-domain calls, so we don’t need to worry about it.
However, to solve this problem, we do need to know a little bit about them. Basically, MarshalByRefObject generates a couple of proxies – one on each side:
When either side needs to talk with the other, they go via these proxies. These proxies then handle the communications between the two AppDomains (since this is handled by .NET, I won’t go into how this is done). What is important to know is that there is no direct reference between the two instances. Instead, CCService holds a reference to the AppRunner proxy, and vice versa for AppRunner to CCService.
What does this mean for our problem?
Garbage collection in .NET works by checking whether anything still references an instance. If something does, it checks whether that referencing instance is itself still active, and so on. If it follows all the references to an instance and finds no active instances, it garbage collects the instance.
Now, as I understand it, garbage collection only works within an AppDomain (or at least it functions that way). So, the garbage collection for the primary AppDomain knows about CCService, but not AppRunner, and vice versa for the secondary AppDomain. This causes an issue for garbage collection, since there are cross-domain objects.
But, there is no actual reference to the cross-domain object, instead it is only to the proxy instances. This means, the two AppDomains don’t know when to garbage collect these objects!
Leases to the Rescue
To get around this problem, each proxy has a “lease”. This is like a property lease – the two sides have agreed to keep the lease “active” for a certain period. After this period is up, either side can clean up.
Now, by default, a MarshalByRefObject instance has a lease period of (I think) five minutes. This means, after the proxies have been active for five minutes, they can be garbage collected (although this can happen later).
So, that’s the background; hopefully by now you’ve figured out what has happened. AppRunner inherits from MarshalByRefObject. When the cross-domain calls are required, .NET sets up the proxies automatically for us and then generates a five-minute lease. The initial calls work fine, but after a while (once the five minutes are up) garbage collection comes along, sees the lease has expired and so cleans up the proxies!
Sometime after this, someone decides to shut down CCService. CCService receives the call, and tries to pass it on to AppRunner – except the proxies in-between have been cleaned up and no longer exist! Poor .NET gets confused and just spits the dummy.
Now, during development and testing, we didn’t detect this. Why? Because when we did our testing, we’d fire up the service, make the various calls, and then shut down the service. And normally, this was all done before the proxies were cleaned up, so the issue never raised its head.
Over the week-end, because my testing was interspersed with gaps, I suddenly came across an error when I returned from a break and tried to shut down the service. Basically, the error was telling me a reference couldn’t be found – and that’s when it clicked – leases!
And Finally, a LifetimeService
So, to round off this post, there is a very simple solution – we just need to extend the leases. This is done by a LifetimeService in AppRunner – the actual method to do this is InitializeLifetimeService().
Now, we could just set a longer time-out period for the lease, but considering some CI servers run for a very, very, very long time between shut-downs, how would we know what it should be?
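For comparison, the longer time-out option might look something like the sketch below. The AppRunner here is again a hypothetical stand-in rather than the real class, and the twelve-hour and one-hour values are arbitrary numbers picked purely for illustration, which is really the whole problem with this approach.

```csharp
using System;
using System.Runtime.Remoting.Lifetime;

// Hypothetical stand-in for the real AppRunner.
public class AppRunner : MarshalByRefObject
{
    public override object InitializeLifetimeService()
    {
        // Start from the default lease supplied by the base class.
        var lease = (ILease)base.InitializeLifetimeService();

        // Lease times can only be changed while the lease is still in
        // its initial state.
        if (lease.CurrentState == LeaseState.Initial)
        {
            // Arbitrary illustrative values, not a recommendation.
            lease.InitialLeaseTime = TimeSpan.FromHours(12);
            lease.RenewOnCallTime = TimeSpan.FromHours(1);
        }

        return lease;
    }
}
```

Whatever numbers go in there, a server that sits idle for longer than the lease would still hit the same shutdown failure.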
Instead, there is an alternate approach – disable the LifetimeService altogether. Doing this has the effect of setting infinite leases, or leases that only expire when both sides shut down. To do this, we merely return null from InitializeLifetimeService(). Simple!
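As a minimal sketch (with a hypothetical AppRunner standing in for the real class, which of course has a lot more in it), the override is about as small as it gets:

```csharp
using System;

// Hypothetical stand-in for the real AppRunner.
public class AppRunner : MarshalByRefObject
{
    // Returning null tells the remoting infrastructure not to create a
    // lease for this object, so it stays reachable through its proxy
    // for as long as its AppDomain is alive.
    public override object InitializeLifetimeService()
    {
        return null;
    }
}
```

The trade-off is that the object is never released by lease expiry, which is fine for something like AppRunner that is meant to live as long as the service does.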
Why didn’t we do this earlier? Well, we do for other MarshalByRefObject-inherited classes, like the remote cruise server, etc. It was just forgotten in this case.
Anyway, problem now fixed. In the next release of CruiseControl.NET, we won’t have to worry about it.
Now, in future, we just have to remember to always check the lease on any MarshalByRefObject-inherited classes, but that’s a task for another day…