Automated Coder

Exploring the Code of CruiseControl.Net

Archive for the ‘Project Ares’ Category

Reducing Strings: Transferring Data

Posted by Craig Sutherland on 29 November, 2009

A Multi-OS Challenge

CruiseControl.NET has gone through a few iterations of file transfer. Prior to version 1.4, the only files that could be transferred were the build log and other special files like the statistics or RSS feed. To handle these files there were special methods in the Remote API, and all they did was transfer the entire file across the network as a single string (serialised via .NET Remoting).

With CruiseControl.NET 1.4 we added the ability to transfer any file that was located in the artefacts folder of a project. This was a new method in the Remote API that allowed the remote client to ask for a file. If the file was there, it was transferred across the network. And this is where things got interesting!

Originally, this method generated a MarshalByRef-based object that would connect back to the server and transfer the data a block at a time. This ensured that only small blocks of data were transferred, thus reducing the memory usage on the server.

Unfortunately Windows 2008 did not like this approach! IPSec blocked the object from connecting back to the server, so no data was transferred (yes, it worked on every other OS we tried; there was just the one that stopped it). After several frustrating attempts to resolve the issue, we gave up and went back to transferring the entire file as a single byte[] buffer over the network. Once again memory usage went up :-(

So, for CruiseControl.NET 2.0 we are back to the question: how can we transfer large files across the network?

Let’s Chat

The problem with transferring a single large block of data across the network is it needs to be loaded into memory first, and then transferred across the network. Even .NET Remoting does not like transferring large blocks across the network, but it does it. Under the hood .NET Remoting takes the large block of data and breaks it into smaller blocks, which then get transferred across the network. This is what our original approach did, but IPSec did not like it!

So, how can we go back to the block-based approach and get past IPSec? The answer is to move from our nice chunky interface, where .NET Remoting handles the blocks, to a more chatty interface (yes, I know, normally not a good practice, but sometimes we don’t live in an ideal world!)

The current situation looks something like this:

image

With .NET Remoting handling the blocking within the red arrow. The new approach will look like this:

image

Yes, it will be a multi-step approach. The steps are:

  1. Open the file on the server
  2. Transfer each block of data
  3. Close the file on the server

This approach requires tighter coupling between the client and the server (sigh). When the client opens the file, a file identifier is returned. This file identifier is what the client then passes to the server for all subsequent operations.

So, why a file identifier? The identifier is used to identify which instance of the opened file is being used (the file is opened in read-mode so it can be opened multiple times). The server then maps the identifier to the stream instance (which is only held on the server). When each block is transferred, the stream is repositioned so the next block is ready. This means the client does not care where it is in the process; it only cares whether there is more data to fetch.

The other downside of this approach is that the file now needs to be closed on the server – otherwise the number of open streams will slowly increase (i.e. a memory leak!) So closing the file will clean up after the transfer has been completed.
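The three steps above (open, transfer blocks, close) can be sketched on the server side roughly as follows. This is a minimal illustration only: the class name `FileTransferSessions`, its method names and the use of a GUID as the file identifier are assumptions, not the actual Remote API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical server-side helper; names and signatures are illustrative only.
public class FileTransferSessions
{
    // Maps each file identifier to its open stream (held only on the server).
    private readonly Dictionary<string, Stream> openStreams = new Dictionary<string, Stream>();
    private readonly object sync = new object();

    // Step 1: open the file read-only and hand back an identifier.
    public string Open(string path)
    {
        var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        string id = Guid.NewGuid().ToString("N");
        lock (sync) { openStreams.Add(id, stream); }
        return id;
    }

    // Step 2: return the next block. Reading advances the stream position,
    // so the client never tracks offsets. An empty array means end of file.
    // An unknown identifier throws KeyNotFoundException.
    public byte[] ReadBlock(string id, int blockSize)
    {
        Stream stream;
        lock (sync) { stream = openStreams[id]; }
        var buffer = new byte[blockSize];
        int read = stream.Read(buffer, 0, blockSize);
        if (read == blockSize) return buffer;
        var partial = new byte[read];
        Array.Copy(buffer, partial, read);
        return partial;
    }

    // Step 3: close the stream so open streams do not accumulate.
    public void Close(string id)
    {
        Stream stream;
        lock (sync)
        {
            if (!openStreams.TryGetValue(id, out stream)) return;
            openStreams.Remove(id);
        }
        stream.Dispose();
    }
}
```

A real server would probably also want to expire identifiers that a client never closes, since a crashed client would otherwise leave its stream open.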

The other approach I looked at was for the client to pass in the starting position. But in the end I decided against this approach for two reasons:

  1. It would require the client to know what data it wanted (i.e. the position within the file). If it requested a position beyond the file length, then the server would need some error handling to not fail.
  2. It would require opening the file, positioning to the correct location and closing the file on every data fetch! I’m not sure on the performance of this, but it sounds slow.

Of course, if we run into issues with my current approach we can switch to the alternate and test it out, but for now, it’s the approach we are using :-)

Being Helpful

Now, rather than forcing the clients to implement their own versions of file transfer (and potentially duplicate their code), I’ve added a new method to CruiseServerClient in the Remote API. This method is called TransferFile() and it literally does that – it handles all the logic of transferring a file from the server into a local stream.

Using this method is as simple as the following code:

client.TransferFile(projectName, fileName, outputStream);

Where client is an instance of CruiseServerClient (this can be generated from a CruiseServerClientFactory).
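What a helper like TransferFile() has to do on the client side can be sketched as a simple loop. Everything here is an assumption made for illustration: the `IFileTransferServer` interface, its method names and the in-memory stand-in are hypothetical and are not the real CruiseServerClient API.

```csharp
using System;
using System.IO;

// Hypothetical remote contract; the real Remote API may look different.
public interface IFileTransferServer
{
    string OpenFile(string projectName, string fileName);
    byte[] ReadBlock(string fileId);   // an empty array signals end of file
    void CloseFile(string fileId);
}

public static class FileTransferHelper
{
    // Copies the remote file into the output stream one block at a time,
    // so only a single block is ever held in memory.
    public static long TransferFile(IFileTransferServer server, string projectName,
                                    string fileName, Stream output)
    {
        string fileId = server.OpenFile(projectName, fileName);
        long total = 0;
        try
        {
            byte[] block;
            while ((block = server.ReadBlock(fileId)).Length > 0)
            {
                output.Write(block, 0, block.Length);
                total += block.Length;
            }
        }
        finally
        {
            // Always close, even on failure, so the server can release the stream.
            server.CloseFile(fileId);
        }
        return total;
    }
}

// In-memory stand-in for a server, used only to exercise the sketch.
public class InMemoryServer : IFileTransferServer
{
    private readonly MemoryStream data;
    private readonly int blockSize;
    public bool Closed { get; private set; }

    public InMemoryServer(byte[] content, int blockSize)
    {
        data = new MemoryStream(content);
        this.blockSize = blockSize;
    }

    public string OpenFile(string projectName, string fileName) { return "file-1"; }

    public byte[] ReadBlock(string fileId)
    {
        var buffer = new byte[blockSize];
        int read = data.Read(buffer, 0, blockSize);
        Array.Resize(ref buffer, read);
        return buffer;
    }

    public void CloseFile(string fileId) { Closed = true; }
}
```

The try/finally is the important design point: because the protocol is now stateful, the client is responsible for closing the file even when the transfer fails part way through.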

I have gone through and removed the old file transfer mechanism (in both the web dashboard and CCTray) and changed to using this method. This means all file transfers will now use the new approach and use less memory. However in a future post I’ll do some tests to see how much (if any) memory is being saved.

Coming Up

Now that we have a file transfer mechanism, the next step is to look at handling the new build log format. In my next post (or posts) I’ll look at the modifications to retrieve the build data and the changes to the dashboard.

Stay tuned…

Posted in CruiseControl.Net, Project Ares

Reducing Strings: A Quick Look Back

Posted by Craig Sutherland on 24 November, 2009

Where Have We Come From?

Before I start modifying the remote interfaces to use the new stream-based data storage, I thought I’d take a quick look back at the memory performance of these interfaces. Again I’ll be using ANTS Memory Profiler 5 to look at how memory is being used.

Just to show the impact of strings in the system I’ll take a look at the original system before I added caching, and then after to see what impact this change had. These will then provide the baselines for looking at how the stream changes can reduce the memory footprint of the CruiseControl.NET server.

But before I get into the analysis, let’s review how data is sent from the server to a client (e.g. CCTray, the dashboard, etc.) The entire build log is stored in a single file on disk. The client makes a request to the server via .NET Remoting. The server then opens the file, reads it all into memory and returns the entire file back to the client as a string. The caching changes modify this procedure only by ensuring there is a single instance of each build log in memory – subsequent calls to retrieve the same log will use the cached log (note, this is cached in memory).

The Method

This is not as nice and tidy as testing the memory usage for a project. The server needs to be up and running, and the client then makes the call to the server. However, since we are not interested in the client memory usage (yet), we need to monitor the server, and we cannot directly tell when a memory usage scenario is beginning. To get around this limitation I wrote a small application (a stress client) that makes multiple calls to the server to fetch the latest build log for a project. The build log I am fetching is 6MB in size, and the application attempts to download this log multiple times concurrently to give the profiler plenty to analyse. With the memory available on my machine I was able to get four concurrent instances connecting at once; each instance fetched the same build log ten times.
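A stress client along these lines could be sketched as below. This is an assumption-laden illustration: it uses threads inside one process rather than separate application instances, and `fetchLog` is a hypothetical delegate standing in for the remote call that retrieves the build log.

```csharp
using System;
using System.Threading;

public static class StressClientSketch
{
    // Runs the given number of workers concurrently; each worker calls
    // fetchLog the given number of times. Returns the number of completed fetches.
    public static int Run(Func<string> fetchLog, int workers, int fetchesPerWorker)
    {
        int completed = 0;
        var threads = new Thread[workers];
        for (int i = 0; i < workers; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < fetchesPerWorker; j++)
                {
                    fetchLog();   // the result is discarded; only the server is being profiled
                    Interlocked.Increment(ref completed);
                }
            });
            threads[i].Start();
        }
        foreach (var thread in threads) thread.Join();
        return completed;
    }
}
```

With four workers and ten fetches each, this mirrors the load described above: `StressClientSketch.Run(fetch, 4, 10)`.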

With this in place, I started the profiler and started profiling the server (ccnet.exe). This was run without any start-up parameters and left running for the duration of the test. I waited 30 seconds after starting the application and after the test had finished to make sure the relevant data was collected. After the 30 seconds for start-up had passed I ran the stress client and waited for it to finish. 30 seconds after the client had finished I shut down the server. I performed two sets of tests. For the first set of tests I just ran the profiler and collected the memory stats. For the second set of tests I captured the memory usage every 5-10 seconds during the test (this was done manually).

These tests were run three times, with the following versions:

  1. This is the original CruiseControl.NET server before the caching was added
  2. This is CruiseControl.NET server with caching added
  3. This is CruiseControl.NET server with caching added, plus an extra modification to garbage collect after each build log fetch

The 1.5.0 codebase was used for all three versions. Only the get build log method was modified for each version.

The Data

These are the results from the original version (#1):

Run1

For the original version, just accessing the build log allocated 234MB of memory to the server – most of which is used by the large object heap.

Changing to the modified version of the server we get the following results:

Run2

The first thing to notice is that private bytes are half those of the original version – and the large object heap is very flat. In the original version a new string was added to the heap every time a client accessed the build log; in the new version there is only one copy, which is why the heap is much flatter. However, the application is still using a lot of memory – there is a total of 80MB assigned to all the heaps.

So, the final change is to force garbage collection:

Run3

Again, there is a drastic reduction in memory – this time there is only a maximum of 40MB in all heaps – and the total memory allocated has also been reduced. However, looking at the large object heap, there has been no change – so the question is what has changed and why?

Digging Deeper

The ANTS memory profiler has the ability to take a snapshot of what is currently being used by an application. This shows what objects have been allocated and what is associated with what. The only downside is that the snapshots have to be taken manually, but the graphs help to show where snapshots need to be taken. These snapshots can then be viewed in the application to see where the memory is being used and what is holding onto the memory.

For this analysis I only took snapshots of versions 2 and 3 to see exactly what was using memory. Version 2 showed the following information:

Memory-1

While version 3 showed the following:

Memory-2

This shows that string and byte[] have the most memory usage in both versions, but it is the amount of byte[] memory that has changed. The following shows why this is the case:

byte-data

The remoting infrastructure breaks the data being returned to the client into blocks of byte[] data (byte[4096] to be exact), and these blocks are what get sent over the wire. Each block is only 4,096 bytes in size, so they are not stored on the large object heap, but the large number of them (4,800 in the example for version 2) means they take up a large amount of space. So, in order to reduce memory usage we also need to look into this area. This is important to note, because in CruiseControl.NET 1.5.0 (and earlier versions) all data transfer is handled by sending a single block of data. We did try to do a block transfer during the development of 1.5.0, but IPSec in Win2K8 blocked it, so we had to reverse the changes. So we need to find a permanent solution to this issue that will work on all OS versions!
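As a rough worked figure using only the numbers above (4,800 blocks of 4,096 bytes each, ignoring per-object overhead), the space held by these buffers is:

```latex
\[
4{,}800 \times 4{,}096\ \text{bytes} = 19{,}660{,}800\ \text{bytes} = 18.75\ \text{MiB}
\]
```

Each individual block is far below the 85,000-byte size at which the .NET runtime places an allocation on the large object heap, which is why none of this shows up in the large object heap graph.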

In Summary

We had a quick look at memory consumption under 1.5.0 to see what could be done to reduce memory usage. However, we’ve pretty much run into a stumbling block: the build log is a single (huge) block of data. For CruiseControl.NET 2.0 we have already split the build log into smaller files, but we also need to look at what can be done to reduce memory usage while transferring over the wire.

So, in my next post I’ll take a look at an alternate transfer mechanism. Stay tuned…

Posted in CruiseControl.Net, Project Ares

Reducing Strings: Converting Tasks

Posted by Craig Sutherland on 23 November, 2009

A Quick Review: Task Classifications

A few posts ago (here), I classified the tasks into how they generate output. Broadly, the output can be split between the build log or directly generated to the file system. There is also a third type of output – external output. This is output that is generated by an external application and stored directly to the file system.

These three types all need to be modified to generate index entries, so the results can be accessed later on. Tasks that write to the build log (e.g. via strings in the old version) or those that write to streams (e.g. direct output to the file system) are easy enough – we already have methods on TaskContext to generate indexed streams. However, external applications that generate files directly need more consideration.

So, for this post I am going to convert two tasks, one that generates build log data and the other that manipulates external data. The tasks I will convert are:

  • ExecutableTask
  • NDependTask

This is also the order of complexity of the conversion :) I omitted a task from the internal generation, as those tasks often have additional (and complex) logic. I will hopefully look at them in a future post.

General Execution: ExecutableTask

One of the most basic tasks in the system is the executable task. This task is used for executing any external application, which also makes it one of the most versatile tasks. This task attempts to execute the application, nothing more, nothing less. Any output to standard out or standard error is captured by CruiseControl.NET and written to the build log.

In the current version of CruiseControl.NET (1.5 or earlier), generating the build output from the executable task is a two stage process. First, everything is written to a StringBuilder and then dumped into a string. In the second stage, this string is converted into XML.

For my first version of the conversion to streams, I did a similar process – except using streams. The process executor would write the standard out and standard error to a stream. Once this had finished, it would then go through and convert the results into XML. This is what I used to check the proof of concept was working.

However, this was still a two stage process – generate the results and format them. There is no reason why this is required – it should be possible to generate the results directly as XML – which is what I have done.

Most of the changes to handle this were done in ProcessExecutor. Previously I had modified the signature for Execute() to take in two streams – one for standard out and the other for standard error. These streams were then wrapped in StreamWriter instances and passed onto the runnable process. When the external application wrote to standard out or standard error, the output was sent to the corresponding StreamWriter, which sent the data to the stream.

My new change was to introduce a new class – XmlStreamWriter. This class wrapped a StreamWriter and overrode some of its methods. The most important method it overrides is WriteLine() – when this method is called it wraps the line in an XML block and ensures the data is valid (e.g. has no invalid XML characters). It also appends a line number and the date/time, just to make the output more meaningful. The other methods that I overrode were to start/end the XML document (since an XML document can have only one root) and to clean up when necessary.
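The idea of a writer that turns each line into an XML element can be sketched as follows. This is not the actual XmlStreamWriter: the class name, the `lines`/`line` element names and the `number`/`time` attributes are all assumptions, and the sketch derives from TextWriter and only handles WriteLine(string).

```csharp
using System;
using System.IO;
using System.Security;
using System.Text;
using System.Xml;

// Illustrative only; element and attribute names are assumptions.
public class XmlStreamWriterSketch : TextWriter
{
    private readonly TextWriter inner;
    private int lineNumber;

    public XmlStreamWriterSketch(TextWriter inner)
    {
        this.inner = inner;
        inner.Write("<lines>");   // an XML document can have only one root
    }

    public override Encoding Encoding { get { return inner.Encoding; } }

    // Wraps each line in an element with a line number and a timestamp.
    public override void WriteLine(string value)
    {
        // Drop characters that are not allowed in XML (e.g. most control characters).
        // Simplification: characters outside the Basic Multilingual Plane may be
        // dropped as well; a fuller version could use XmlConvert.IsXmlSurrogatePair.
        var clean = new StringBuilder();
        foreach (char c in value ?? string.Empty)
        {
            if (XmlConvert.IsXmlChar(c)) clean.Append(c);
        }
        lineNumber++;
        inner.Write("<line number=\"{0}\" time=\"{1:s}\">{2}</line>",
            lineNumber, DateTime.Now, SecurityElement.Escape(clean.ToString()));
    }

    // Closes the root element when the writer is disposed.
    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            inner.Write("</lines>");
            inner.Flush();
        }
        base.Dispose(disposing);
    }
}
```

Because the formatting happens as each line arrives, there is no second pass over the captured output, which is the point of dropping the two-stage process.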

Now, all ExecutableTask does is start two result streams and pass these through the levels to ProcessExecutor. It also tells the logic to wrap these streams in the new XmlStreamWriter class, which does the automatic conversion. And that’s it – nice and easy. The infrastructure does the rest of the work – generating the index entries, writing them to the log file, ensuring the output is XML, etc.

Analysing Code: NDependTask

The NDependTask calls an external application called NDepend. This application performs a static analysis on a codebase and produces a number of statistics about the code, including complexity, dependency cycles and interrelationships. The reason I chose this task is that it generates a number of files which need to be included in the build results. These files include both XML data and non-XML data (e.g. images). Additionally, since it is an external task it generates standard out and standard error output to be included.

The first part of the conversion is easy – this is almost identical to ExecutableTask – initialise the streams for standard out and standard error, pass them through the layers to where the application is actually executed and store the result for later.

It’s the second part that requires some additional changes. After the external application has finished, the task checks for any new files that were generated. These files are then imported into the results.

To handle this I added a new method to TaskContext – ImportResultFile(). This method has a similar method signature to CreateResultStream() but with two important differences – it has a filename argument and it does not return anything. This method simply generates a new index entry and moves the file to the results folder. Now external files can be included in the results index.

A Simple Path

Now we have a nice simple migration path for tasks to use the new index. For tasks that generate data they can use the CreateResultStream() method on TaskContext. This will generate a new stream for the data, plus the associated index entries. These streams can then be passed into ProcessExecutor for handling the stdout and stderr output.

For tasks that generate files via an external application, the ImportResultFile() method can be used. This moves the external file into the data folder and generates the associated index entries. The method also offers the choice to move or copy the file, so the calling task can tidy up if necessary.
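The import path can be sketched roughly like this. The class and member names (`ResultFileImporter`, `ImportedFileEntry`, the `copy` flag, the `.data` extension) are assumptions for illustration and do not reflect the real TaskContext signature.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical index entry for an imported file.
public class ImportedFileEntry
{
    public string ResultName;
    public string DataType;
    public string StoredFile;
}

// Hypothetical stand-in for the import side of TaskContext.
public class ResultFileImporter
{
    private readonly string resultsFolder;
    private readonly List<ImportedFileEntry> index = new List<ImportedFileEntry>();

    public ResultFileImporter(string resultsFolder)
    {
        this.resultsFolder = resultsFolder;
        Directory.CreateDirectory(resultsFolder);
    }

    public IList<ImportedFileEntry> Index { get { return index; } }

    // Moves (or copies) an externally generated file into the results folder
    // under a unique name and records an index entry. Returns nothing.
    public void ImportResultFile(string sourceFile, string resultName, string dataType, bool copy)
    {
        string target = Path.Combine(resultsFolder, Guid.NewGuid().ToString() + ".data");
        if (copy) File.Copy(sourceFile, target);
        else File.Move(sourceFile, target);
        index.Add(new ImportedFileEntry
        {
            ResultName = resultName,
            DataType = dataType,
            StoredFile = target
        });
    }
}
```

Copying leaves the original in place for a task that still needs it, while moving avoids having two copies of a potentially large file on disk.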

These two methods, either separately or together, handle most of the data generation for the current tasks/publishers. There are a few tasks/publishers that won’t like this approach, mainly because they need to manipulate files (i.e. read/write access), but as these tasks/publishers tend to be outside the norm I’m going to ignore them for the moment.

The next challenge is to look into accessing these new index entries and the associated files. So stay tuned for a review of how we currently use the build logs and how it will change…

Posted in CruiseControl.Net, Project Ares

Reducing Strings: Additional Task Details

Posted by Craig Sutherland on 23 November, 2009

An Extra Feature

In looking at reducing strings (and thus memory usage) I realise we also have another opportunity to enhance CruiseControl.NET. A while back CruiseControl.NET was enhanced to display current task status. This would display a list of all the tasks for a build, and the current progress through the tasks. However, this information is only ever stored in memory – as soon as the server restarts or another build is triggered (even if it is just a source control check) the information is lost!

With the new changes it is easily possible to persist this information – plus we can go one better and associate the task output with the task information :-) So in this post, I’ll look into how these changes work.

Currently

When the task outputs are added to the results they are added to an array (this is all done in the task context). These are accumulated per context – and when a context is completed the outputs are merged into the parent context. When the build is completed, all the outputs are written directly to the build log, one element per output.

Each output has a number of attributes – task type, result name, task identifier and data type. Some of these are used to identify the output, while other attributes identify the task that it is associated with.

The Changes

The changes fall into two basic categories: first, the task context now has the task-level attributes (task type and identifier); second, the outputs are now associated with a task element in the build log. To handle this, I have split TaskResultDetails into two classes: TaskResult and TaskOutput. TaskResult holds the task-level attributes, plus the outcome of the task, while TaskOutput holds the output details.

The task context has been expanded to allow the task type and identifier (currently the task name/description) to be set. At the same time, a new TaskResult is generated for the context (remembering that a context is being generated per task). So that the task status can be linked to the stored outcome, the context now requires an IIntegrationResult instance. The status will be set here (as it is currently) and the context will pull the status when the context is finalised.

The ImportResultFile() and CreateResultStream() methods have been modified to generate TaskOutput instances now. These instances get stored in the associated TaskResult for the context.

Finally, TaskResult now contains a list of child TaskResult instances. When a context is merged into a parent context the associated TaskResult is appended to the child instances of the parent. When the project-level context is finalised, the associated TaskResult is what is serialised.

Serialisation

Both TaskResult and TaskOutput have a WriteTo() method that takes in an XmlWriter. When these methods are called, they will output the properties of each instance to the writer, and then call the WriteTo() method on any child items (either TaskResult or TaskOutput). To ensure this is written to the build log, XmlIntegrationResultWriter is responsible for calling this method on the project-level context.

The following XML is generated:

<build date="2009-11-12 03:37:37" buildtime="00:00:00" buildcondition="ForceBuild">
  <result type="project" identifier="Test" outcome="Unknown">
    <result type="exec" identifier="exec #1" outcome="CompletedSuccess">
      <output file="D:\Open Source\CruiseControl.NET 2.0\project\console\bin\Debug\Test\Artifacts\233e1e69d-f0ab-441c-bc6f-030e52b61f58.data" name="stdout" data="data/xml" />
      <output file="D:\Open Source\CruiseControl.NET 2.0\project\console\bin\Debug\Test\Artifacts\23\2d1a5ede-b2c8-432e-bf87-af01a75ebd2d.data" name="stderr" data="data/xml" />
    </result>
    <result type="exec" identifier="exec #2" outcome="CompletedSuccess">
      <output file="D:\Open Source\CruiseControl.NET 2.0\project\console\bin\Debug\Test\Artifacts\23\7f9f9bfe-2d5d-43b1-be1f-a61303701b72.data" name="stdout" data="data/xml" />
      <output file="D:\Open Source\CruiseControl.NET 2.0\project\console\bin\Debug\Test\Artifacts\23\1b65cf49-4902-441f-848b-c2f22234d6b7.data" name="stderr" data="data/xml" />
    </result>
  </result>
</build>

Now it is easy to see which tasks ran, what the outcomes of the tasks were and what output they generated.
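The recursive WriteTo() serialisation that produces XML of this shape can be sketched as below. The element and attribute names are taken from the sample XML above, but the field names and class layout are assumptions rather than the real TaskResult and TaskOutput classes.

```csharp
using System.Collections.Generic;
using System.Xml;

// Simplified stand-in for TaskOutput: one output file of a task.
public class TaskOutput
{
    public string File;
    public string Name;
    public string DataType;

    public void WriteTo(XmlWriter writer)
    {
        writer.WriteStartElement("output");
        writer.WriteAttributeString("file", File);
        writer.WriteAttributeString("name", Name);
        writer.WriteAttributeString("data", DataType);
        writer.WriteEndElement();
    }
}

// Simplified stand-in for TaskResult: task-level attributes plus children.
public class TaskResult
{
    public string Type;
    public string Identifier;
    public string Outcome = "Unknown";
    public readonly List<TaskOutput> Outputs = new List<TaskOutput>();
    public readonly List<TaskResult> Children = new List<TaskResult>();

    // Writes this result, then its outputs, then recurses into child results.
    public void WriteTo(XmlWriter writer)
    {
        writer.WriteStartElement("result");
        writer.WriteAttributeString("type", Type);
        writer.WriteAttributeString("identifier", Identifier);
        writer.WriteAttributeString("outcome", Outcome);
        foreach (TaskOutput output in Outputs) output.WriteTo(writer);
        foreach (TaskResult child in Children) child.WriteTo(writer);
        writer.WriteEndElement();
    }
}
```

Calling WriteTo() on the project-level result is enough to emit the whole tree, because each result is responsible for writing its own children.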

 

In Conclusion

This (hopefully) concludes the changes to the build engine on the server for the stream-based output. In the next set of posts I will delve into using the new output.

Stay tuned…

Posted in CruiseControl.Net, Project Ares

Reducing Strings: XmlLogPublisher is Dead, Long Live XmlLogPublisher

Posted by Craig Sutherland on 21 November, 2009

Good Bye XmlLogPublisher

As part of the changes for implementing the task context I am removing XmlLogPublisher. This is to provide a standardised way of generating the log – while still exposing the required functionality to the other tasks that need it.

In the current version of CruiseControl.NET users need to add an xmllogger element to their project configuration (preferably in the publishers section, but there is no reason why they can’t put it in the tasks section). When this task is run it will generate the log file, which can then be used by subsequent tasks. However, to complicate things, some tasks generate their own version of the log file (normally in memory) – e.g. StatisticsPublisher and EmailPublisher. Plus if the user adds a publishers section and forgets to add the xmllogger element, then no log will be written!

Note: if the user does not define a publishers section, a publishers section will be generated by default that contains the xmllogger element.

So, to standardise and simplify things, I am moving the functionality from XmlLogPublisher into TaskContext. This functionality will be called automatically by Finalise() in TaskContext (this method is called when an integration finishes).

Currently XmlLogPublisher generates the folder for saving the log in, generates the name of the log file and then writes out the log (via XmlIntegrationResultWriter). I have moved these into TaskContext as separate methods (as GenerateLogFolder(), GenerateLogFilename() and WriteCurrentLog()). The Finalise() method calls all of these, but I have exposed them as public methods so they can be called from other locations.

Additionally, I made a small change to XmlIntegrationResultWriter. Because I want to write the current result entries out via the writer, I changed the constructor to take in an enumeration of TaskResultDetails. I also changed the WriteTaskResults() to output these results (previously it was outputting the data from the tasks). This required changing the statistics and e-mail publishers to use TaskContext instead of XmlIntegrationResultWriter. Unfortunately it also means these are both breaking at the moment, as they are expecting a valid log file to be generated, and it has now been changed to only have index entries! In a future post I will look at modifying these so they work again :)

Finally to round out this change, I added another new method to TaskContext. This method (GenerateResultsSnapshot()) will generate the current snapshot of all the results. This breaks the normal working of TaskContext, in that it allows a child context to access a property on the parent context. This means that property (the list of index entries) now needs to be locked before it can be accessed. I did consider using the reader/writer lock instead of the general monitor lock, but it sounds like the .NET 2.0 version is flawed, so I decided to avoid it.
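The locking described here, where a snapshot is copied under a plain monitor lock so that other contexts can read it safely, might look like the sketch below. `ResultIndex<T>` is a hypothetical stand-in for the list of index entries held by TaskContext; only the method name GenerateResultsSnapshot() comes from the post.

```csharp
using System.Collections.Generic;

// Hypothetical holder for index entries shared between contexts.
public class ResultIndex<T>
{
    private readonly List<T> entries = new List<T>();
    private readonly object sync = new object();

    public void Add(T entry)
    {
        lock (sync) { entries.Add(entry); }
    }

    // Returns a read-only copy taken under the lock, so callers can
    // enumerate it without racing against later additions.
    public IList<T> GenerateResultsSnapshot()
    {
        lock (sync) { return new List<T>(entries).AsReadOnly(); }
    }
}
```

Copying inside the lock keeps the critical section short and means the caller never holds the lock while writing the log.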

So, in saying farewell to XmlLogPublisher, I have moved most of its functionality to TaskContext. I’ve also expanded both TaskContext and XmlIntegrationResultWriter to handle writing the index entries to the log file.

Hello XmlLogPublisher

Having removed most of the functionality from XmlLogPublisher, I rebuilt it to use the new functionality from TaskContext. Why? So it is still possible to generate a build log part way through the build process. This is required for those tasks/publishers that want to do something with a physical build log file. For example, the FTP publisher might want to send the build log to a remote machine, the package publisher might want to include it in a package, etc. The new version allows this functionality.

However, I also expanded some other functionality to make this even easier. As part of the changes in removing the functionality, I moved the log directory configuration setting to Project. While the new XmlLogPublisher still has this setting, I wanted to expose the project-level setting to the publisher. Currently tasks/publishers work in isolation – they don’t know anything about the project that owns them (beyond what is passed about in the IIntegrationResult). So, I wanted to expose project configuration settings, but in an immutable way.

To do this, I have added a new class called ProjectConfiguration. This class exposes a number of the properties from Project in an immutable form. These are loaded from the project when the class is initialised and are read-only. I then added this new class as a property on TaskContext. Now, any task with an associated TaskContext can access these project configuration settings. At the moment there are only a few settings, but over time I will expand ProjectConfiguration to include most of the configuration settings (yes, it is a work in progress!)

So, now the new XmlLogPublisher can use its own log folder, the project’s log folder (if set) or the project’s artefact folder. At the same time, other tasks can now access project settings in a safe, immutable way.
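An immutable settings class of this kind, together with the log folder fallback, can be sketched as follows. The property names and the order of precedence (publisher setting first, then the project log folder, then the artefact folder) are assumptions; the post only says that all three are possible.

```csharp
// Hypothetical read-only view of a project's settings.
public sealed class ProjectConfigurationSketch
{
    private readonly string name;
    private readonly string logFolder;
    private readonly string artefactFolder;

    // Values are captured once at construction and never change afterwards.
    public ProjectConfigurationSketch(string name, string logFolder, string artefactFolder)
    {
        this.name = name;
        this.logFolder = logFolder;
        this.artefactFolder = artefactFolder;
    }

    public string Name { get { return name; } }
    public string LogFolder { get { return logFolder; } }
    public string ArtefactFolder { get { return artefactFolder; } }

    // Assumed precedence: the publisher's own folder, then the project's
    // log folder (if set), then the project's artefact folder.
    public string ResolveLogFolder(string publisherLogFolder)
    {
        if (!string.IsNullOrEmpty(publisherLogFolder)) return publisherLogFolder;
        if (!string.IsNullOrEmpty(logFolder)) return logFolder;
        return artefactFolder;
    }
}
```

Exposing only getters means a task can read the settings but cannot alter the project it belongs to.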

So, that’s the first set of changes – XmlLogPublisher has been changed to fit the new structure, and the index entries are now being written to the disk. In the next set of changes I’ll look at modifying some tasks/publishers to use the new indexing methods.

Stay tuned…

Posted in CruiseControl.Net, Project Ares

Reducing Streams: Stream Metadata and Indexing

Posted by Craig Sutherland on 19 November, 2009

Data about Data (and so it goes…)

In my last post (here) I reviewed all the tasks and publishers currently implemented. As part of the review there were a number of tasks/publishers identified that generated output that is not currently included in the build log. As part of the changes, we want to be able to handle these files with a standard process as well.

In general, the strategy is to let each task write directly to disk, instead of writing to memory (in a string). It is the task’s job to decide what to write and how. In an even earlier post (here) I introduced the concept of a TaskContext. This context is responsible for starting new streams and storing references to them. While this concept is still valid, it needs a couple of extensions.

My Initial Plan

In my initial plan I was going to let the task choose the file name. The task would call an open stream method on the context and tell the context the file name and a result type name. The context would then open a new file stream using the file name in a folder for the build. If the file name was already in use, the context would generate a unique name (normally by inserting a literal and number into the name). To handle multi-threading, each context required its own folder, and when a context was finished, the owner would then move all the files from that folder into its own folder, again ensuring uniqueness of the names.

This added some extra complexity, just to produce easily readable file names. It did not guarantee that a file name would match the expected task (i.e. there could be a nunit.xml that did not come from the NUnit task). The task type name was only stored in the index file, and the index file name was not unique (it would overwrite previous versions!)

Additionally, the context only handled generating new streams, it didn’t handle importing files generated externally (e.g. output from external applications). These would still exist outside the generation process, and they were not included in the index file.

Finally, to handle the dual-stream output from external tasks, I had put a stream merging process into place. This would literally take two (or more) streams and merge them together, plus it also converted the streams into an XML format (e.g. one element per line, with a line type per element). Nice maybe, but of questionable value since it adds additional processing overhead!

Changing the Process

In retrospect I made things harder by trying to be nice and by keeping the entire indexing outside of the current way of doing things (so that it would have a minimal impact on it). Now, I think that was a bad decision – these changes are going to be breaking anyway, so let’s try and do things properly!

First, unique file names: the name of the file needs to be unique, no matter which task generates it, whether there are duplicates of a task type, or multiple files of the same name being merged. The uniqueness needs to be enforced even when different threads generate files, and whether the files are generated internally or externally. Rather than inventing a new process for unique names I’ll use a tried and true method – GUIDs. When a new result stream is opened the physical name of the stream will be a GUID – no more, no less.

The index entries for each stream will contain the necessary data to identify what each file is for and the type of data contained. These metadata items are:

  • Task type: the type of task that generated the result. This will be the name of the task in the configuration (e.g. nunit, msbuild, nant, etc.) and it will be up to the calling task to pass in.
  • Result name: this is the internal name of the file. Examples would be stdout, stderr, test results, etc. Again, it is up to the task to pass in this name and it should be unique per task type (although not sure how to enforce this!)
  • Task identifier: this is an identifier for the actual instance of the task. Again, this will come from the calling task and could be something like the user-configured task name, an internally generated identifier or the date/time. Basically, it is used to group streams together within an actual instance of a task.
  • Task order: this is the order in which the result was generated. The context will generate this number and associate it with a result. When a child context is merged, the owning context is responsible for ensuring the order numbers are unique. As such this identifies the order in which tasks are completed.
  • Data type: this is the type of data the file contains. Ideally, this will be the mime type of the data, so it can be used in serving the files via the dashboard. Again, this will need to come from the calling task.
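
A rough sketch of what such an index entry could look like follows. It is written in Python rather than C# purely for brevity, and the names (IndexEntry and its fields) are hypothetical, not taken from the CC.NET codebase. The only points it tries to show are the five metadata items above and a GUID as the physical file name.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndexEntry:
    """Metadata describing one result file (all names are hypothetical)."""
    task_type: str        # name of the task in the configuration, e.g. "nunit"
    result_name: str      # internal name of the file, e.g. "stdout"
    task_identifier: str  # groups streams from one instance of a task
    task_order: int       # assigned by the context, not by the task
    data_type: str        # ideally a MIME type, e.g. "text/xml"
    # The physical file name is a GUID and nothing else.
    file_name: str = field(default_factory=lambda: str(uuid.uuid4()))


first = IndexEntry("nunit", "test results", "unit-tests", 1, "text/xml")
second = IndexEntry("nunit", "test results", "unit-tests", 2, "text/xml")
print(first.file_name != second.file_name)
```

Two entries with identical task type and result name still get different physical names, which is the whole reason for using GUIDs here.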

Creating Index Entries

Now that I know what I want to store, and who is responsible for populating streams, the question is how are these index entries and streams generated?

Based on the sources of data, there need to be two ways of generating the indexes. First, there are the internally generated entries. In this scenario the task will request a stream and pass in the required metadata. The context will generate the stream and at the same time generate an index entry. After this, it is the responsibility of the task to manage the stream (including closing it!)

The second scenario is an externally generated file. In this case the task will tell the context where the file is, plus the required metadata. The context will then import (move/copy?) the file into the build folder and add the index entry. The task does not need to do anything else.
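
The two scenarios could be sketched roughly as below. This is Python rather than C#, and the class and method names (TaskContext, open_result_stream, import_result_file) are assumptions made for illustration only. The sketch copies the external file; whether the real implementation should move or copy is still an open question.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


class TaskContext:
    """Hypothetical context that owns a build folder and an index."""

    def __init__(self, build_folder):
        self.build_folder = Path(build_folder)
        self.entries = []  # one dict per result file

    def _add_entry(self, task_type, result_name, data_type):
        file_name = str(uuid.uuid4())  # GUID as the physical name
        self.entries.append({
            "file": file_name,
            "task_type": task_type,
            "result_name": result_name,
            "data_type": data_type,
            "order": len(self.entries) + 1,
        })
        return self.build_folder / file_name

    def open_result_stream(self, task_type, result_name, data_type):
        # Internally generated: the caller writes to and closes the stream.
        return open(self._add_entry(task_type, result_name, data_type), "wb")

    def import_result_file(self, source, task_type, result_name, data_type):
        # Externally generated: copy the file in (a move would also work).
        target = self._add_entry(task_type, result_name, data_type)
        shutil.copyfile(source, target)
        return target


build_folder = tempfile.mkdtemp()
context = TaskContext(build_folder)

with context.open_result_stream("exec", "stdout", "text/plain") as stream:
    stream.write(b"hello from the task\n")

external = Path(tempfile.mkdtemp()) / "report.xml"
external.write_text("<report/>")
imported = context.import_result_file(external, "ndepend", "report", "text/xml")
```

In both cases the index entry is created by the context, so the task never has to know the physical file name.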

And Finally, Some Context

I’ve written a bit about the context and some of how it works, but I still haven’t covered how it is generated or how it writes the indexes.

An initial context will be generated when a project starts building. At the moment this will be associated with the Project, although later on I may move it (especially when we come to splitting the context from the configuration). As each task is run, a child context is started for the task. The task then runs within this context, generating its own indexes. If the task starts child tasks (e.g. parallel task, sequential task, etc.) it must generate a child context for each child task.

After a task is completed, the associated context is finalised (either by the project or the parent task). Now an important point – each task or project only works with one context, and no other task or project will change that context! This ensures that the context will not be messed up by two tasks trying to modify the context at the same time. There is only one exception to this rule – when a context is finalised. At this time the context is locked so the owning task cannot use it anymore and all the index entries are transferred to the parent context.
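
The finalisation rule can be sketched as follows. Again this is Python for brevity with hypothetical names, and it leaves out the locking a multi-threaded project would need around the parent's list of entries.

```python
class TaskContext:
    """Hypothetical context; entries are plain dicts for brevity."""

    def __init__(self, parent=None):
        self.parent = parent
        self.entries = []
        self.finalised = False

    def start_child(self):
        return TaskContext(parent=self)

    def add_entry(self, result_name):
        if self.finalised:
            raise RuntimeError("context has been finalised")
        entry = {"result_name": result_name, "order": len(self.entries) + 1}
        self.entries.append(entry)
        return entry

    def finalise(self):
        # Lock the context so the owning task can no longer use it.
        self.finalised = True
        if self.parent is None:
            return self.entries  # project-level context: the final index
        # Hand the entries to the parent, renumbering to keep order unique.
        for entry in self.entries:
            entry["order"] = len(self.parent.entries) + 1
            self.parent.entries.append(entry)
        return self.parent.entries


project = TaskContext()
first_task = project.start_child()
first_task.add_entry("stdout")
first_task.add_entry("stderr")
first_task.finalise()

second_task = project.start_child()
second_task.add_entry("test results")
second_task.finalise()

final_index = project.finalise()
print([(e["result_name"], e["order"]) for e in final_index])
# prints [('stdout', 1), ('stderr', 2), ('test results', 3)]
```

Each child numbers its own entries from 1, and the parent renumbers them as they are transferred, so the final index has a single unbroken sequence.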

If the context does not have a parent context (i.e. it is the project-level context), it will generate the final index. This index will then be included in the standard build log. This will require a slight shift in the way the build log is generated. Currently the build log is generated by the xmllogger task, which means it can be generated at different points in a build. This task will be removed from the available tasks, and the build log generated automatically at the end of the process.

At this point you may be wondering how publishers can access the build log. For example, the e-mail publisher may want to generate an e-mail containing the results of the build, the FTP publisher might want to send the log to a remote machine, etc. This will be handled in two ways:

  1. Each task context will have access to the current list of indexes. This will need to be accessed in a thread safe way, just in case the user has configured a multi-thread project. Either way, the task context will have access to the entire set of index entries from its parent and all ancestors, all the way to the project context. This means the task will only be able to access index entries from finalised task contexts.
  2. It will be possible to generate a build log at any point. These build logs will only be in-memory, but will contain all the current build details, plus the current list of index entries (see point 1. above). At the end of a build this will be the same file that is generated for the overall build log. The new build log structure will replace the build results that currently exist with the index entries.

Time To Work

This describes my plan on how I will implement things. Coming up next, some actual implementation details.

Posted in CruiseControl.Net, Project Ares

Reducing Strings: The Current Situation

Posted by Craig Sutherland on 18 November, 2009

What’s Currently Happening

Before I move onto converting the current tasks from using strings to streams, I thought it would be a good time to review the existing tasks and publishers. Looking at the tasks they can be divided into four categories:

  1. Those that don’t generate any output
  2. Those that generate output in the build log
  3. Those that generate external output
  4. Those that generate both

The following tasks do not generate any output:

  • ArtifactCleanUpTask
  • ConditionalPublisher
  • CruiseServerControlTask
  • EmailPublisher
  • ForceBuildPublisher
  • ModificationReaderTask
  • NullTask
  • ParallelTask
  • SequentialTask
  • SynchronisationTask

Since these tasks do not generate any output, we don’t have to worry about these tasks (yeah!!)

The second category are those tasks that generate output for the build log. The following tasks fall into this category:

  • DevenvTask
  • DupFinderTask
  • ExecutableTask
  • FinalBuilderTask
  • GendarmeTask
  • MsBuildTask
  • NAntTask
  • NCoverProfileTask
  • NUnitTask
  • PowerShellTask
  • RakeTask

These are the tasks that I have been primarily concerned with, as they currently generate strings. These strings are then concatenated into one massive string before being written into a file. Since the build log is an XML file, all the strings are XML’enised before they are concatenated. This means if the string is valid XML, it is concatenated directly, otherwise it is wrapped into a CDATA section.

These tasks can work in two general ways. First, they can generate output that is stored in memory as strings (e.g. output from stdout or stderr) – these strings are not persisted to disk at all. Second, they can generate output that is persisted to disk, which later gets read into memory and then appended into the log file.
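
As an illustration of the XML’enising rule, here is a minimal Python sketch. The function name xmlenise is made up, and the real CC.NET check may differ in detail (for example in how it treats XML fragments); this only shows the "valid XML passes through, anything else goes into CDATA" idea.

```python
import xml.etree.ElementTree as ET


def xmlenise(text):
    """Return text unchanged if it is well-formed XML, else wrap it in CDATA."""
    try:
        ET.fromstring(text)
        return text
    except ET.ParseError:
        # "]]>" would end the CDATA section early, so split it across two.
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


print(xmlenise("<results passed='3'/>"))
# prints <results passed='3'/>
print(xmlenise("Build FAILED: 2 < 3 & so on"))
# prints <![CDATA[Build FAILED: 2 < 3 & so on]]>
```

Note that the whole string has to be in memory (and parsed) just to decide how to append it, which is part of the memory cost described here.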

The third category is those tasks that generate individual files directly. These tasks are:

  • BuildPublisher
  • FtpTask
  • ModificationHistoryPublisher
  • ModificationWriterTask
  • PackagePublisher
  • RssPublisher
  • StatisticsPublisher
  • XmlLogPublisher

These tasks all generate output, but they are responsible for storing the files. They also require custom methods to retrieve the data (normally implemented in CruiseServer and/or Project). Plus, these files can be XML or non-XML (e.g. images, HTML, etc). And as a final complication, these files can be read/write – i.e. the task may re-open an existing version of the file, add some more output and save it again.

These tasks open a second area for improvement – providing a standardised way for all output to be stored. But I’ll write more about this soon.

The final category generates both data for the build log and individual data files. These tasks are:

  • HttpStatusTask
  • MergeFilesTask
  • NCoverReportTask
  • NDependTask

These tasks basically have both data for the build log and individual data files. The log data is all standard XML, while the non-log data can be anything!

The Strategy for Standardisation

Currently I am changing to use a TaskContext for generating output streams and indexing these streams. This approach can also be expanded to handling the external output files. There are two main differences between the two: external files can include non-XML data and they can also modify existing files. To simplify things, I decided to exclude the tasks that modify existing files. These tasks are:

  • ModificationHistoryPublisher
  • ModificationWriterTask
  • RssPublisher
  • StatisticsPublisher

The reason why is very simple – the tasks that edit existing files have a set location for the files. These files are managed outside of a build, i.e. they are not build-related. In contrast, the output from the other tasks is specific to the build it was generated in. Once the build has finished, it is never modified (at least in theory).

To handle the XML/non-XML data, I’ll add an extra attribute to the index – the data type. This will be set by the task that generates the index. The values will be the standard MIME types, although it will involve some work in setting these properly.

Executable Tasks

A second way that tasks can be categorised is whether they execute an external application or not. Most (but not all) of these tasks inherit from BaseExecutableTask. To actually execute the application they normally call TryToExecute(), which returns a ProcessResult. The current implementation of this class contains a string containing the standard output and a second string containing standard error. It is then up to the calling task what it does with the results. Currently the results are handled in three ways:

  1. The standard output and standard error strings are XML’enised (as above) and concatenated to the build log.
  2. The standard error only is XML’enised and concatenated to the build log.
  3. Both standard output and standard error are converted into XML lines (all XML reserved characters are converted) and then concatenated to the build log.

The following tasks fall into this category and inherit from BaseExecutableTask:

  • DupFinderTask
  • ExecutableTask
  • GendarmeTask
  • MsBuildTask
  • NAntTask
  • NCoverProfileTask
  • NCoverReportTask
  • NDependTask
  • PowerShellTask
  • RakeTask

Additionally NUnitTask calls an external application, but it does not inherit from BaseExecutableTask. This task should also be converted to inherit from BaseExecutableTask, so I will include this task in the list of changes.

The subtle difference between how the output is concatenated means my original approach needs to be revised. Now I’m planning on modifying TryToExecute() so it takes in two streams as arguments, and it is up to the calling task how these streams will be generated and how they will be processed after the task has executed.
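
The shape of that change can be sketched in Python (not C#, and not the real CC.NET signature; try_to_execute and its parameters are hypothetical). The caller opens the two streams, the runner only writes to them, and what comes back is a small result rather than two large strings.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def try_to_execute(command, stdout_stream, stderr_stream, timeout=60):
    """Run command, sending its output straight to the supplied streams."""
    completed = subprocess.run(
        command, stdout=stdout_stream, stderr=stderr_stream, timeout=timeout
    )
    # Only small details are returned; the output itself stays on disk.
    return {"exit_code": completed.returncode}


folder = Path(tempfile.mkdtemp())
script = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"

with open(folder / "stdout", "wb") as out, open(folder / "stderr", "wb") as err:
    result = try_to_execute([sys.executable, "-c", script], out, err)

print(result["exit_code"])
print((folder / "stdout").read_text().strip())
print((folder / "stderr").read_text().strip())
```

Because the calling task owns the streams, it can decide afterwards whether to keep both, keep only standard error, or convert them, which covers the three ways of handling results listed above.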

This leaves me with a general outline of how to proceed. Next I will take a look at the approach I’ve implemented and how it affects the existing tasks.

Posted in CruiseControl.Net, Project Ares

I’m Back: Holiday’s Over, Time to Write!

Posted by Craig Sutherland on 17 November, 2009

Back to Life, Back to Reality

I have returned from my annual visit to see the in-laws in China, and once again I have worked on some nice goodies to add to CruiseControl.NET (CC.NET).

This time I was looking at two specific areas:

  • Converting to streams for results
  • A Silverlight RIA

Stream-based Results (Project Ares)

The streams functionality is to try and resolve a long-time issue with running out of memory. Currently CC.NET performs all of its result processing in-memory as strings. This means when a task runs, it generates an in-memory version of the results (e.g. from stdout/stderr or from importing file results). These results are then appended to an ever-increasing copy of the log file, which is finally written out to disk. Now this is fine if 1) the results are small or 2) you have lots of memory, but this can cause problems if these conditions are not met! Additionally the same problem applies not only to generating the results, but retrieving the results!

The solution is simple in concept – instead of writing to memory, write directly to disk instead – hence changing to using streams instead of strings. However like any non-trivial modification, this has a number of far-reaching implications, so it was not as easy as just changing from generating strings to writing to streams.

However, the good news is I got it working, although it still needs a bit of polish to get it working nicely. And the other good news is I documented what I did along the way :-) So over the next few weeks I’ll be reviewing and publishing this documentation on my blog. These posts will be published under Project Ares.

Silverlight Client (Project Capricorn)

The other area I played with was writing a Silverlight 3.0 client for CruiseControl.NET. This was more of a fun project to see what was possible (although it didn’t help that I was offline and had to go by trial and error).

For this project I wanted a similar type of interface to the current dashboard. This allows people to view information at four levels: farm (all monitored servers), server, project and build. More challengingly it allows people to develop their own plug-ins (either via code or XSL-T) and use these within the UI.

As well as implementing a basic dashboard-like UI, I wanted to leverage some of the rich functionality that is available in Silverlight – things that are possible in HTML/CSS/JavaScript, but are challenging to do.

So, I have put together a very rough implementation of a Silverlight client, although it is very much at a prototype stage. This allows for the basics of the UI (layout, navigation, etc.) plus a plug-in infrastructure for adding new plug-ins. Unfortunately both need work to get up to release level.

So, once I have finished writing about the stream changes I’ll write up about the Silverlight client under Project Capricorn. Hopefully there will be some interest in it, so we can look at completing the project and including it in the official codebase (probably for CC.NET 2.0). Otherwise I’ll move it to the FastForward.NET project and work on it as I have time.

But Wait, There’s More!

Another area that I’ve been slowly working towards for a while now is the ability to make CC.NET distributed. CC.NET as it currently stands has some distributed elements, but it doesn’t really work as a distributed application. Some of the changes I’ve been working on (messaging, hot-upgrades, etc.) have been pieces of the distributed puzzle. The streaming work added a couple more pieces of the puzzle, plus showed a few more challenges to be resolved!

So I’ll be adding a few posts on enhancing CC.NET to be distributed – either some of the issues involved or some of the pieces that have been added. Hopefully by the time we (finally) get to the 2.0 release CC.NET will be able to work as a distributed application :-)

That’s All For Now

So that’s what I’m planning on writing up over the next few weeks. At the same time I’ll be adding the source to SourceForge (under the CCNet2 branch) and hopefully spending some time with the other devs on getting CC.NET ready for the “official” 1.5 release.

Stay tuned…

Posted in CruiseControl.Net, Project Ares, Project Capricorn

Reducing Strings 3: Some Preliminary Results

Posted by Craig Sutherland on 16 October, 2009

Introduction

This is the third post in converting CruiseControl.NET from using strings to streams. The other posts are:

In this post I’m going to take another slight detour from the planned changes. The reason – to see whether the changes are actually working!

The reason why is pretty simple – one of the other devs on the project asked for some proof that my changes would reduce memory usage. Now my rough tests and knowledge of the system said they would, but I thought I should do some proper tests to prove it. And here they are!

Methods

The aim of this set of tests is to see whether memory usage is reduced with the new stream-based processing. Since I’ve only converted one task at the moment, all my tests will use that task.

First I put together a small application that would write out “The Preventer of Information Services” to standard output 100,000 times. Using the same test, but writing to a file, generates a ~5 Mb file – not too small and not too big (previous tests on my machine showed I could handle files of up to ~50 Mb within CC.NET).

Next, I put together a simple ccnet.config that would run the application. I then ran this config using the –p switch to execute the project and stop (ccnet -p=Test). This allows me to quickly and easily run just that task without any confounding tasks.

Finally, I profiled the test using ANTS Memory Profiler 5.1 (see credits below).

I performed the same test with both the 1.5 codebase and the new 2.0 “experimental” codebase.

Results

Here are the results from running in 1.5:

image

This is looking at the private bytes – this is the amount of memory that has been allocated to the process.

image

And this is looking at the # bytes in all heaps (Gen 0, gen 1, gen 2 and large object).

There is a slow but steady increase in memory for around two thirds of the time, and then there is a massive jump in memory usage around the 2:20s mark. There’s also a jump in memory usage right at the end, but as the application is terminating it only appears briefly.

Overall execution time was around 3:12s. Total memory allocated was in excess of 128Mb, while the actual memory usage was fast catching up!

And now, the results from the 2.0 codebase:

image

image

There is a huge difference – in both graphs!

The private bytes only reached 32Mb – a quarter the amount of memory allocated for the 1.5 codebase. And the change is even more dramatic for the bytes in all heaps – I had to use the tooltip to find out how much memory was used – a mere 1.31 Mb at the most. Additionally, this high usage point was right at the beginning of the task execution, as opposed to a spike at the end. Finally, the memory usage is actually dropping at the end of the execution in the 2.0 codebase.

Finally, and this surprised me, the execution time was better for the 2.0 codebase (approximately 2:13s). However this might be because my machine is running at near physical memory capacity anyway, so the extra memory needs to come from the paging file.

Analysis

In the 1.5 codebase there are a number of memory jumps. I think these jumps are being caused by different steps in the process of persisting the results to the log file. The slow increase is due to the test task writing to standard output. The first big jump is when the task is completed and the data is being converted into an XML format. The final jump is when the data is actually being merged into the log file for writing to disk.

In contrast, the 2.0 codebase has a much lower memory usage level. The memory level is staying low because the output is being written directly to disk and then freed. The memory manager in .NET can then reuse the memory instead of needing to ask for more.

The second difference lies in the way the results are converted. In the 1.5 codebase everything is done in memory, thus requiring sufficient memory for both the original string and the new string, plus any memory for working space. In the 2.0 codebase the conversion is done on a line by line basis: the line is loaded from the original stream, converted and written to the new stream. Again, a much smaller memory footprint is needed.
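
A minimal sketch of the line by line conversion, in Python for brevity. The element name line and the type attribute are assumptions for illustration, not the actual format CC.NET writes.

```python
import io
from xml.sax.saxutils import escape


def convert_lines(source, target, line_type):
    """Copy source to target one line at a time, one XML element per line."""
    count = 0
    for line in source:  # only the current line is held in memory
        text = escape(line.rstrip("\r\n"))  # converts &, < and >
        target.write(f'<line type="{line_type}">{text}</line>\n')
        count += 1
    return count


source = io.StringIO("compiling a.cs\nwarning: x < y & z\n")
target = io.StringIO()
convert_lines(source, target, "stdout")
print(target.getvalue(), end="")
# prints:
# <line type="stdout">compiling a.cs</line>
# <line type="stdout">warning: x &lt; y &amp; z</line>
```

The in-memory streams are only there to keep the example self-contained; with file streams the memory needed is roughly one line plus its converted form, regardless of how large the output is.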

Finally, the spike at the end is eliminated as the data is already in the files, so there is no need to merge everything together (although in theory this could still be done.)

Summary

So, in conclusion, the stream changes have significantly reduced the memory footprint – at least for my simple test. Comparing the two codebases, the 2.0 codebase only requires a quarter of the total memory allocated, and internally less than 1% of the old heap allocations! For those machines like mine with only limited physical memory, the changes also result in faster performance!

Credits

I was able to perform these tests thanks to Red Gate. They provided a free software license of their .NET Developer’s Bundle to allow me to do some performance tuning, and I am very thankful to them for this (it turns out they use CruiseControl.NET, so hopefully this will help them too). Their website (http://www.red-gate.com/index.htm) contains the details on their products, plus a lot of useful information on how to use their tools.

For these tests I used the ANTS Memory Profiler; I haven’t tried the other tools in the bundle yet.

I really liked the way you could see what was happening in real time. It was very easy to see the memory jumps and tie them into what was happening in the application. They also have a whole heap of additional information that is available – like snapshots of how the memory is being used and what has changed. But for this post the graphs had all the information – in future posts I’ll delve into some of the other information available.

So if you are looking for a memory profiling tool, I definitely recommend taking a look at the ANTS Memory Profiler. It takes the guesswork out of what is happening in memory :-)

Posted in CruiseControl.Net, Project Ares

Reducing Strings 2: Getting to the Root

Posted by Craig Sutherland on 15 October, 2009

Continuation

In my last post on this subject (read it here) I added the concept of a task context. This is a context that the task runs within and stores all the output from a task. The next step is to start writing to this context.

One of the core concepts in the context is that it generates streams that can be used by the task. The task is then responsible for managing the stream, but the context is responsible for managing the references to the stream. So, the trick is to generate the streams from the context, pass them through to where they are needed in the tasks and then clean up when the task has finished with the streams.

And of course, that’s where it gets tricky!

The Current Situation

90% of the time, a task does not directly execute an external application. Instead it calls through a number of layers. This allows a number of cross-cutting functions to be built in, but at the same time it makes changes from strings to streams harder.

Here is the current way it works:

image

The calling task calls the TryToRun() method on BaseExecutableTask, which passes it onto ProcessExecutor and finally RunnableProcess. RunnableProcess internally creates two StringBuilder instances – one for standard output (StdOut) and one for standard error (StdErr). RunnableProcess also provides all the functionality necessary for getting the data from StdOut and StdErr and putting it into these two instances.

When the Run() method on RunnableProcess has finished, it generates a ProcessResult and stores the strings from the two StringBuilder instances into there. From here on, the StdOut and StdErr are stored as strings in memory.

So, the issue now becomes one of where should the streams be initialised? How many streams should be generated? And what do we want to store for future usage?

The New Situation

The main change I am making is moving the instantiation of StdOut and StdErr from within RunnableProcess to BaseExecutableTask. These will be instantiated as streams and passed through to RunnableProcess, where it is an easy enough change to write to the streams instead of a StringBuilder.

Additionally, I’m going to get BaseExecutableTask to generate two streams – one for StdOut and one for StdErr. These get passed down the chain. When the Execute() method on ProcessExecutor has completed, these two streams will be merged into one.

So, this shows how I am changing things:

image

On a side note, a ProcessResult will still be generated as this contains additional information needed for the tasks (e.g. exit codes, time-out details, etc.)

Based on this plan, most of the work is in the BaseExecutableTask, with minor changes to the other two classes. Additionally, I’m going to do some work around the merging of StdErr and StdOut in TaskContext, as this class is responsible for managing the references.

The Actual Changes

Most of the changes are straightforward, and reasonably boring (except when I made a mistake and had to debug it!)

RunnableProcess and ProcessExecutor were both modified to accept streams and use them (instead of the internal StringBuilders.) Reasonably simple change – just needed to remember to close the StreamWriters I was using.

BaseExecutableTask got a new overload of TryToRun() that uses the new streams functionality. This overload includes the task name and type (required for creating a result). Additionally I added a couple of new protected virtual methods to allow people to override some of the functionality: creating the result stream and merging the results (more on this below).

Finally TaskContext got a new method – MergeResultStreams(). This will literally merge two or more streams into a single stream. It also manages the references – the old references are removed and a new reference is added for the merged file. The merging is handled by a delegate, so individual tasks can define how the results are merged. The default merge is a binary merge – copy all the bytes from each of the streams into a single stream.
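
The idea behind MergeResultStreams() can be sketched like this. It is a Python approximation with made-up names and a plain list standing in for the context's references, not the actual C# method; deleting the source files after the merge is also an assumption.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def binary_merge(sources, target):
    """Default merge: copy the bytes of each source stream in turn."""
    for source in sources:
        shutil.copyfileobj(source, target)


def merge_result_streams(folder, references, names, merge=binary_merge):
    """Merge the named files into one and swap the references over."""
    merged_name = str(uuid.uuid4())
    sources = [open(folder / name, "rb") for name in names]
    try:
        with open(folder / merged_name, "wb") as target:
            merge(sources, target)
    finally:
        for source in sources:
            source.close()
    for name in names:
        references.remove(name)   # the old references are removed
        (folder / name).unlink()  # and so are the files (an assumption)
    references.append(merged_name)  # a new reference is added
    return merged_name


folder = Path(tempfile.mkdtemp())
(folder / "out").write_bytes(b"standard output\n")
(folder / "err").write_bytes(b"standard error\n")
references = ["out", "err"]

merged = merge_result_streams(folder, references, ["out", "err"])
print((folder / merged).read_bytes())
# prints b'standard output\nstandard error\n'
```

Passing the merge in as a function plays the role of the delegate: a task that wants XML output supplies its own merge, and everything else gets the binary copy.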

BaseExecutableTask defines a custom merge delegate. This merge delegate will merge all the results into an XML format – similar to how ExecutableTask does it currently. However this uses the streams to handle the merging and formatting, instead of directly manipulating strings in memory – so we shouldn’t have the out of memory exceptions :-)

And that’s all there is to it – at least for this phase. At this point the new code breaks a number of other tasks – those that rely on the data being in a single massive string. Plus there are a number of additional overrides that I have added temporarily to reduce the amount of breaking changes.

So, What’s Next?

Looking over my previous what’s next list, I realise I’ve skipped over a couple of points – all of this is included in the task changing. I have modified Project to both generate the task context and associate it with tasks, but I think a bit more work is needed there. Additionally I need to look at the container tasks (e.g. parallel task, sequential task, etc.) to see what needs to change to pass on child contexts.

So, in my next post I plan on covering the “plumbing” for the contexts, and then I’ll return to modifying tasks.

Stay tuned…

Posted in CruiseControl.Net, Project Ares