Data about Data (and so it goes…)
In my last post (here) I reviewed all the tasks and publishers currently implemented. The review identified a number of tasks and publishers that generate output which is not currently included in the build log. As part of the changes, we want to be able to handle these files with a standard process as well.
In general, the strategy is to let each task write directly to disk, instead of writing to memory (in a string). It is the task’s job to decide what to write and how. In an even earlier post (here) I introduced the concept of a TaskContext. This context is responsible for starting new streams and storing references to them. While this concept is still valid, it needs a couple of extensions.
My Initial Plan
In my initial plan I was going to let the task choose the file name. The task would call an open stream method on the context and tell the context the file name and a result type name. The context would then open a new file stream using the file name in a folder for the build. If the file name was already in use, the context would generate a unique name (normally by inserting a literal and number into the name). To handle multi-threading, each context required its own folder, and when a context was finished, the owner would then move all the files from that folder into its own folder, again ensuring uniqueness of the names.
This added some extra complexity, just to produce easily readable file names. It did not guarantee that a file name would match the expected task (i.e. there could be a nunit.xml that did not come from the NUnit task). The task type name was only stored in the index file, and the index file name was not unique (it would overwrite previous versions!).
Additionally, the context only handled generating new streams; it didn’t handle importing files generated externally (e.g. output from external applications). These would still exist outside the generation process, and they were not included in the index file.
Finally, to handle the dual-stream output from external tasks, I had put a stream merging process into place. This would literally take two (or more) streams and merge them together, plus it also converted the streams into an XML format (e.g. one element per line, with a line type per element). Nice maybe, but of questionable value since it adds additional processing overhead!
Changing the Process
In retrospect I made things harder by trying to be nice, and also by keeping the entire indexing process outside the current way of doing things (so that it would have minimal impact on existing behaviour). Now I think that was a bad decision: these changes are going to be breaking anyway, so let’s try and do things properly!
First, unique file names: the name of the file needs to be unique, no matter which task generates it, whether there are duplicates of a task type, or multiple files of the same name being merged. The uniqueness needs to be enforced even when different threads generate files, and whether the files are generated internally or externally. Rather than inventing a new process for unique names I’ll use a tried and true method – GUIDs. When a new result stream is opened the physical name of the stream will be a GUID – no more, no less.
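As a rough sketch of the GUID naming idea (written in Python purely for illustration; the function name `open_result_stream` and the folder layout are my own assumptions, not part of the actual implementation):

```python
import uuid
from pathlib import Path


def open_result_stream(build_folder: Path):
    """Open a new result file whose physical name is a GUID and nothing else.

    Hypothetical helper: the name and signature are illustrative only.
    """
    build_folder.mkdir(parents=True, exist_ok=True)
    file_name = str(uuid.uuid4())
    # Mode "x" raises FileExistsError if the file already exists, so even an
    # (extremely unlikely) collision cannot silently overwrite a result.
    stream = open(build_folder / file_name, "x", encoding="utf-8")
    return file_name, stream
```

Because the name carries no meaning of its own, it does not matter which task or thread asks for the stream, or how many files share the same logical name.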
The index entries for each stream will contain the necessary data to identify what each file is for and the type of data contained. These metadata items are:
- Task type: the type of task that generated the result. This will be the name of the task in the configuration (e.g. nunit, msbuild, nant, etc.) and it will be up to the calling task to pass in.
- Result name: this is the internal name of the file. Examples would be stdout, stderr, test results, etc. Again, it is up to the task to pass in this name and it should be unique per task type (although I’m not sure how to enforce this!)
- Task identifier: this is an identifier for the actual instance of the task. Again, this will come from the calling task and could be something like the user-configured task name, an internally generated identifier or the date/time. Basically, it is used to group streams together within an actual instance of a task.
- Task order: this is the order in which the result was generated. The context will generate this number and associate it with a result. When a child context is merged, the owning context is responsible for ensuring the order numbers are unique. As such, this identifies the order in which tasks are completed.
- Data type: this is the type of data the file contains. Ideally, this will be the MIME type of the data, so it can be used in serving the files via the dashboard. Again, this will need to come from the calling task.
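To make the list above concrete, an index entry could be modelled roughly like this. It is only a sketch in Python: the class name `IndexEntry` and all field names are assumptions, and the `file_name` field is an extra I have added so the entry can point at the GUID-named file.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index entry; field names are illustrative only."""
    file_name: str        # physical file name on disk (a GUID), assumed field
    task_type: str        # configuration name of the task, e.g. "nunit"
    result_name: str      # internal name of the result, e.g. "stdout"
    task_identifier: str  # groups the streams of one task instance
    task_order: int       # assigned by the context, not by the task
    data_type: str        # ideally the MIME type of the contents


entry = IndexEntry(
    file_name="3f2b8c1e-0000-4000-8000-000000000000",  # made-up example value
    task_type="nunit",
    result_name="test results",
    task_identifier="unit-tests",
    task_order=1,
    data_type="text/xml",
)
```

Everything except `task_order` (and the file name) is supplied by the calling task; the context fills in the rest.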
Creating Index Entries
Now that I know what I want to store, and who is responsible for populating streams, the question is how are these index entries and streams generated?
Based on the sources of data, there need to be two ways of generating the indexes. First, there are the internally generated entries. In this scenario the task will request a stream and pass in the required metadata. The context will generate the stream and at the same time generate an index entry. After this, it is the responsibility of the task to manage the stream (including closing it!)
The second scenario is an externally generated file. In this case the task will tell the context where the file is, plus the required metadata. The context will then import (move/copy?) the file into the build folder and add the index entry. The task does not need to do anything else.
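The two scenarios might look something like the sketch below. This is Python for illustration only; `TaskContext`, `open_stream` and `import_file` are hypothetical names, and the import step copies the file simply because the move/copy question is still open.

```python
import shutil
import uuid
from pathlib import Path


class TaskContext:
    """Hypothetical context owning a build folder and its index entries."""

    def __init__(self, build_folder: Path):
        self.build_folder = build_folder
        self.build_folder.mkdir(parents=True, exist_ok=True)
        self.entries = []

    def _add_entry(self, file_name, task_type, result_name, task_id, data_type):
        self.entries.append({
            "file": file_name,
            "task_type": task_type,
            "result_name": result_name,
            "task_id": task_id,
            "order": len(self.entries) + 1,  # generated by the context
            "data_type": data_type,
        })

    def open_stream(self, task_type, result_name, task_id, data_type):
        """Scenario 1: the task writes the data and must close the stream."""
        file_name = str(uuid.uuid4())
        stream = open(self.build_folder / file_name, "x", encoding="utf-8")
        self._add_entry(file_name, task_type, result_name, task_id, data_type)
        return stream

    def import_file(self, source: Path, task_type, result_name, task_id, data_type):
        """Scenario 2: an external tool already wrote the file."""
        file_name = str(uuid.uuid4())
        # Copying is an assumption here; a move would work the same way.
        shutil.copy2(source, self.build_folder / file_name)
        self._add_entry(file_name, task_type, result_name, task_id, data_type)
```

In both cases the index entry is created by the context, so a task can never produce a file that the index does not know about.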
And Finally, Some Context
I’ve written a bit about the context and some of how it works, but I still haven’t covered how it is generated or how it writes the indexes.
An initial context will be generated when a project starts building. At the moment this will be associated with the Project, although later on I may move it (especially when we come to splitting the context from the configuration). As each task is run, a child context is started for the task. The task then runs within this context, generating its own indexes. If the task starts child tasks (e.g. parallel task, sequential task, etc.) it must generate a child context for each child task.
After a task is completed, the associated context is finalised (either by the project or the parent task). Now an important point – each task or project only works with one context, and no other task or project will change that context! This ensures that the context will not be messed up by two tasks trying to modify the context at the same time. There is only one exception to this rule – when a context is finalised. At this time the context is locked so the owning task cannot use it anymore and all the index entries are transferred to the parent context.
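A minimal sketch of the finalise-and-merge step, again in Python with invented names (`start_child`, `add_entry`, `finalise`), and with index entries reduced to a simple (order, description) pair:

```python
import threading


class TaskContext:
    """Hypothetical context; only the finalisation behaviour is modelled."""

    def __init__(self, parent=None):
        self.parent = parent
        self.entries = []  # list of (order, description) tuples
        self.finalised = False
        self._lock = threading.Lock()

    def start_child(self):
        return TaskContext(parent=self)

    def add_entry(self, description):
        with self._lock:
            if self.finalised:
                raise RuntimeError("context has been finalised")
            self.entries.append((len(self.entries) + 1, description))

    def _absorb(self, child_entries):
        # The owning context renumbers the entries so that the order
        # numbers stay unique after the merge.
        with self._lock:
            for _, description in child_entries:
                self.entries.append((len(self.entries) + 1, description))

    def finalise(self):
        with self._lock:
            if self.finalised:
                return
            self.finalised = True  # the owning task can no longer add entries
            entries = list(self.entries)
        # The child's lock is released before touching the parent, so two
        # children finalising at the same time cannot deadlock.
        if self.parent is not None:
            self.parent._absorb(entries)
```

With two child contexts, the parent ends up with entries numbered in the order the children were finalised, not the order they were started.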
If the context does not have a parent context (i.e. it is the project-level context), it will generate the final index. This index will then be included in the standard build log. This will require a slight shift in the way the build log is generated. Currently the build log is generated by the xmllogger task, which means it can be generated at different points in a build. This task will be removed from the available tasks, and the build log generated automatically at the end of the process.
At this point you may be wondering how publishers can access the build log. For example, the e-mail publisher may want to generate an e-mail containing the results of the build, the FTP publisher might want to send the log to a remote machine, etc. This will be handled in two ways:
- Each task context will have access to the current list of indexes. This will need to be accessed in a thread-safe way, just in case the user has configured a multi-threaded project. Either way, the task context will have access to the entire set of index entries from its parent and all ancestors, all the way to the project context. This means the task will only be able to access index entries from finalised task contexts.
- It will be possible to generate a build log at any point. These build logs will only be in-memory, but will contain all the current build details, plus the current list of index entries (see the first point above). At the end of a build this will be the same file that is generated for the overall build log. The new build log structure will replace the build results that currently exist with the index entries.
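Both points can be sketched together. The Python below is illustrative only: the class and method names are assumptions, and the XML element and attribute names (`build`, `index`, `entry`, `taskType`, `resultName`) are made up, since the new build log structure has not been defined yet.

```python
import threading
import xml.etree.ElementTree as ET


class TaskContext:
    """Hypothetical context that can see the entries of its ancestors."""

    def __init__(self, parent=None):
        self.parent = parent
        self._entries = []
        self._lock = threading.Lock()

    def add_entry(self, task_type, result_name):
        with self._lock:
            self._entries.append({"taskType": task_type, "resultName": result_name})

    def snapshot(self):
        # Copy under the lock so readers never see a half-updated list.
        with self._lock:
            return list(self._entries)

    def visible_entries(self):
        """Entries from the parent and all ancestors, project level first."""
        chain = []
        context = self.parent
        while context is not None:
            chain.append(context)
            context = context.parent
        entries = []
        for context in reversed(chain):
            entries.extend(context.snapshot())
        return entries


def build_log_xml(context):
    """Build an in-memory log holding the index entries visible to a context."""
    root = ET.Element("build")
    index = ET.SubElement(root, "index")
    for entry in context.visible_entries():
        ET.SubElement(index, "entry", entry)
    return ET.tostring(root, encoding="unicode")
```

A publisher running in its own context would call `build_log_xml` and see only what has already been merged into its ancestors, which is exactly the set of results from finalised tasks.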
Time To Work
That describes how I plan to implement things. Coming up next, some actual implementation details.