Automated Coder

Exploring the Code of CruiseControl.Net

Archive for the ‘.NET’ Category

Fun With Leases and Lifetimes

Posted by Craig Sutherland on 21 June, 2009

Yet Another Issue

One of the features we added to the 1.4.4 release of CruiseControl.NET was the ability to hot-swap the DLLs. This means you can xcopy new DLLs over the top of the existing DLLs and CruiseControl.NET will play nice, i.e. just reload itself with the new DLLs.

Unfortunately, after this enhancement we started hearing of people having problems with stopping the service. It appears that people would attempt to stop the service using the SCM, but it would fail. They would get an error message telling them a problem happened, and the service would continue running. Not very good :-(

Anyway, this week-end I spent some time tracking down the issue. Actually, it was more like ten minutes here, another ten minutes a few hours later, and so on…

Normally, this would be a disruptive process, but today, it actually solved the problem!!!

Cross-Domain Calls

In order to get the hot-swapping working, we fire up a second AppDomain in CruiseControl.NET and load all the DLLs there. This works because the DLLs are shadow-copied (more details on this are available here). This is where the fun starts – in order to make this work, we needed to make some cross-domain calls. The following diagram shows the two AppDomains and the calls needed:

AppDomains

Basically, CCService starts up in the primary AppDomain. It starts up a second AppDomain and instantiates an instance of AppRunner in this domain. AppRunner then interacts with CruiseServer – which is one of the classes in the DLLs. CruiseServer then does all the real work.
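
Here is a rough sketch of how that second AppDomain might be set up. It assumes the .NET Framework (AppDomain.CreateDomain isn't supported on .NET Core and later). The AppRunner below is a cut-down stand-in, and the Host class and the CCNetRunner domain name are made up for illustration, so the real CruiseControl.NET code will differ.

```csharp
using System;

// Hypothetical stand-in for the real AppRunner; only the base class matters here.
public class AppRunner : MarshalByRefObject
{
    public string Run()
    {
        // This executes in the secondary AppDomain, not in the caller's.
        return AppDomain.CurrentDomain.FriendlyName;
    }
}

// Hypothetical stand-in for the CCService side.
public static class Host
{
    public static string StartRunner(out AppDomain domain)
    {
        var setup = new AppDomainSetup
        {
            ApplicationBase = AppDomain.CurrentDomain.BaseDirectory,
            // Shadow copying leaves the original DLLs unlocked so they can be overwritten.
            ShadowCopyFiles = "true"
        };
        domain = AppDomain.CreateDomain("CCNetRunner", null, setup);

        // What comes back is a proxy; the real AppRunner lives in the new domain.
        var runner = (AppRunner)domain.CreateInstanceAndUnwrap(
            typeof(AppRunner).Assembly.FullName,
            typeof(AppRunner).FullName);

        // A cross-domain call, made through the proxy.
        return runner.Run();
    }

    public static void Main()
    {
        AppDomain domain;
        Console.WriteLine(StartRunner(out domain));
        AppDomain.Unload(domain);
    }
}
```

The call to Run() reports the name of the domain it executed in, which is a quick way to confirm the work really happens on the other side of the boundary.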

Now, this part all worked pretty well – the problem comes with the cross-domain calls between the two AppDomains.

Introducing MarshalByRefObject

In order to make cross-domain calls, the called class must inherit from MarshalByRefObject. This class abstracts all the logic of making cross-domain calls, so we don’t need to worry about them.

However, to solve this problem, we do need to know a little bit about them. Basically, MarshalByRefObject generates a couple of proxies – one on each side:

Cross-Domains

When either side needs to talk with the other, they go via these proxies. These proxies then handle the communications between the two AppDomains (since this is handled by .NET, I won’t go into how this is done). What is important to know is that there is no direct referencing between the two instances. Instead, CCService holds a reference to the AppRunner proxy, and vice-versa for AppRunner to CCService.

What does this mean for our problem?

The garbage collection in .NET works by following references between instances. Starting from the instances that are known to be active, it follows every reference they hold, then the references those instances hold, and so on. Any instance that can’t be reached this way has nothing active referring to it, so it gets garbage collected.

Now, as I understand it, garbage collection only works within an AppDomain (or at least it functions that way). So, the garbage collection for the primary AppDomain knows about CCService, but not AppRunner, and vice versa for the secondary AppDomain. This causes an issue for garbage collection, since there are cross-domain objects.

But, there is no actual reference to the cross-domain object, instead it is only to the proxy instances. This means, the two AppDomains don’t know when to garbage collect these objects!

Leases to the Rescue

To get around this problem, each proxy has a “lease”. This is like a property lease – the two sides have agreed to keep the lease “active” for a certain period. After this period is up, either side can clean up.

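Now, by default, a MarshalByRefObject instance has an initial lease period of five minutes, and a call made through the proxy only tops the remaining time up to a couple of minutes. This means that once the proxies have sat unused for long enough for the lease to run out, they can be garbage collected (although this can happen later).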

So, that’s the background; hopefully by now you’ve figured out what has happened. AppRunner inherits from MarshalByRefObject. When the cross-domain calls were required, .NET set up the proxies automatically for us, and then generated a five-minute lease. The initial calls worked fine, but after a while (once the five minutes were up) garbage collection came along, saw the lease had expired and so cleaned up the proxies!

Sometime after this, someone decides to shut down CCService. CCService receives the call, and tries to pass it on to AppRunner – except the proxies in-between have been cleaned up and no longer exist! Poor .NET gets confused and just spits the dummy :-(

Now, during development and testing, we didn’t detect this. Why? Because when we did our testing, we’d fire up the service, make the various calls, and then shut down the service. And normally, this was all done before the proxies were cleaned-up, so the issue never raised its head.

Over the week-end, because my testing was interspersed with breaks, I suddenly came across an error when I returned from one and tried to shut down the service. Basically, the error was telling me a reference couldn’t be found – and that’s when it clicked – leases!

And Finally, a LifetimeService

So, to round off this post, there is a very simple solution – we just need to extend the leases. The lease for AppRunner is set up by its lifetime service, and the method to override in AppRunner is InitializeLifetimeService().

Now, we could just set a longer time-out period for the lease, but considering some CI servers run for a very, very, very long time between shut-downs, how would we know what it should be?
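
For illustration, a longer lease could be requested along these lines. This is only a sketch for the .NET Framework: LongLeaseRunner is a made-up class, and the twelve-hour and thirty-minute figures are arbitrary, which is really the point, since any figure would be a guess.

```csharp
using System;
using System.Runtime.Remoting.Lifetime;

// Hypothetical variant of AppRunner that asks for a longer lease.
public class LongLeaseRunner : MarshalByRefObject
{
    public override object InitializeLifetimeService()
    {
        var lease = (ILease)base.InitializeLifetimeService();

        // Lease settings can only be changed while the lease is in its initial state.
        if (lease.CurrentState == LeaseState.Initial)
        {
            lease.InitialLeaseTime = TimeSpan.FromHours(12);   // arbitrary figure
            lease.RenewOnCallTime = TimeSpan.FromMinutes(30);  // arbitrary figure
        }
        return lease;
    }
}

public static class LeaseDemo
{
    public static void Main()
    {
        // Calling the override directly just shows the settings it would hand back.
        var lease = (ILease)new LongLeaseRunner().InitializeLifetimeService();
        Console.WriteLine(lease.InitialLeaseTime);
    }
}
```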

Instead, there is an alternate approach – disable the LifetimeService altogether. Doing this has the effect of setting infinite leases, or leases that only expire when both sides shut down. To do this, we merely override InitializeLifetimeService() to return null. Simple!
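
In code, that amounts to something like the following. AppRunner here is a bare stand-in rather than the real class, and NoLeaseDemo exists only so the sketch can be run.

```csharp
using System;

// Bare stand-in for AppRunner; the real class does a lot more.
public class AppRunner : MarshalByRefObject
{
    public override object InitializeLifetimeService()
    {
        // Returning null means no lease is created, so the object is never
        // disconnected for inactivity; it lasts as long as its AppDomain.
        return null;
    }
}

public static class NoLeaseDemo
{
    public static void Main()
    {
        Console.WriteLine(new AppRunner().InitializeLifetimeService() == null);
    }
}
```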

Why didn’t we do this earlier? Well, we do for other MarshalByRefObject-inherited classes, like the remote cruise server, etc. It was just forgotten in this case :-(

Anyway, problem now fixed. In the next release of CruiseControl.NET, we won’t have to worry about it.

Now, in future, we just have to remember to always check the lease on any MarshalByRefObject-inherited classes, but that’s a task for another day…

Posted in .NET, CruiseControl.Net

Time Zone Mayhem

Posted by Craig Sutherland on 17 June, 2009

We just resolved an interesting problem tonight with some unit tests in CruiseControl.NET. But before I get to the resolution, here is some background.

We have a number of developers around the world. For example, I live in New Zealand, which is UTC+12. Some of the other developers live in Europe, which is UTC+1 (although it is currently UTC+2 due to daylight savings). And just to confuse things, our build server – CCNetLive – is in Chicago, Illinois, UTC-5 currently.

One of the developers had written some unit tests that were checking some date/time comparisons in the underlying code. When he tested them on his machine, they worked beautifully – no problems whatsoever. Same for my machine, but CCNetLive was failing the test!

So, what was happening?

The code was parsing a date/time in a string and then checking to see if it was in a valid date/time range. The string included the time zone (2009-06-13 10:37:42 +0000), which was UTC+0.

Since the time zone was UTC+0, he generated a date using a UTC date/time kind – new DateTime(2009, 06, 13, 10, 00, 00, DateTimeKind.Utc). This date was the lower bound, while the upper bound was DateTime.Now.

Here is where things got interesting – he assumed, and I did also, that since the string contained the time zone that it would be a UTC date/time. Therefore, the UTC date/time would work nicely. And it did – on any machine that was ahead of UTC. On CCNetLive, which was behind UTC, it was failing!!!

Why?

DateTime.Parse generates a Local date/time whenever the incoming string contains a time zone (at least when no DateTimeStyles are passed in). It uses the time zone to work out the date/time and then moves it to the local date/time. So, on CCNetLive the string that was being parsed was being converted to 5:37:42am, instead of 10:37:42am. This was then being compared to 10:00:00am and naturally failing.

This is where it got us – the 10:00:00am date/time was UTC, and the 5:37:42am was Local. Shouldn’t the .NET date/times automatically convert to the same type in order to perform the comparison?

It turns out, they don’t. They just literally compare the date and time components, without taking into consideration the different date kinds.
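
A small sketch makes this easy to see. The class and method names are made up for the example. The first method parses the string from the test, while the other two build their dates by hand so the result doesn't depend on the machine's time zone.

```csharp
using System;
using System.Globalization;

public static class KindCompareDemo
{
    // Parses the string from the failing test.
    public static DateTime ParseSample()
    {
        return DateTime.Parse("2009-06-13 10:37:42 +0000", CultureInfo.InvariantCulture);
    }

    // Same clock reading, different kinds: the comparison ignores the kind.
    public static bool SameClockDifferentKind()
    {
        var utc = new DateTime(2009, 6, 13, 10, 0, 0, DateTimeKind.Utc);
        var local = new DateTime(2009, 6, 13, 10, 0, 0, DateTimeKind.Local);
        return utc == local;
    }

    // The comparison as a machine in UTC-5 would have seen it.
    public static bool BuildServerCheck()
    {
        var lowerBound = new DateTime(2009, 6, 13, 10, 0, 0, DateTimeKind.Utc);
        var parsedLocal = new DateTime(2009, 6, 13, 5, 37, 42, DateTimeKind.Local);
        return parsedLocal >= lowerBound;
    }

    public static void Main()
    {
        Console.WriteLine(ParseSample().Kind);        // Local
        Console.WriteLine(SameClockDifferentKind());  // True
        Console.WriteLine(BuildServerCheck());        // False
    }
}
```

The last two results are the same on every machine, because nothing in the comparison ever looks at the kind.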

So, I don’t know whether this is a bug or an oversight (or even deliberate), but it is worthwhile making sure you are comparing the same date/time kinds when doing a date/time comparison. Otherwise, like us, you’ll end up wondering why things aren’t working in different places around the world.

So what was the resolution to our issue? We converted the UTC date/time to a local date/time using .ToLocalTime(). After this, the test worked beautifully!
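
Roughly, the corrected check looks like this. It is a sketch rather than the actual test code: RangeCheck and IsInRange are made-up names, and it assumes the lower bound really is a UTC date/time.

```csharp
using System;
using System.Globalization;

public static class RangeCheck
{
    // Returns true when the parsed text falls between the lower bound and now.
    public static bool IsInRange(string text, DateTime lowerBoundUtc)
    {
        // The text carries a time zone, so Parse hands back a Local date/time.
        DateTime parsed = DateTime.Parse(text, CultureInfo.InvariantCulture);

        // Move the UTC bound into local time so both sides are the same kind.
        DateTime lowerBoundLocal = lowerBoundUtc.ToLocalTime();

        // DateTime.Now is already Local, so the upper bound needs no conversion.
        return parsed >= lowerBoundLocal && parsed <= DateTime.Now;
    }

    public static void Main()
    {
        var lowerBound = new DateTime(2009, 6, 13, 10, 0, 0, DateTimeKind.Utc);
        Console.WriteLine(IsInRange("2009-06-13 10:37:42 +0000", lowerBound));
    }
}
```

With both sides in local time, the check should pass whichever time zone the machine is in. Converting everything to UTC with ToUniversalTime() would be another way to get the same effect.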

Posted in .NET