Kitware Source Feature Article: July 2010

Packaging VTK For Astrophysical Data Analysis Using CPack

The formation of structure in the Universe, specifically the formation of the very first stars and galaxies, can be reproduced through computational simulation. To conduct these simulations I, along with several collaborators, run adaptive mesh refinement (AMR) simulations using the code Enzo, wherein we dynamically insert nested regions of higher resolution inside larger, coarser regions. To make the resulting data as easy to handle as possible, we have developed a Python package called ‘yt’ to analyze and visualize it from a physically-motivated perspective. This allows us to inspect data while abstracting away all of the underlying code objects. yt has been successfully deployed on many individually-operated Mac and Linux machines, as well as across the TeraGrid and even on Compute Node Linux-based Cray systems. It utilizes Python, NumPy, and HDF5, and we are currently developing a VTK frontend to its data structures.

However, the hardest problem we’ve had with yt has actually been that of packaging. Packaging as a whole, but specifically across heterogeneous systems, can be very challenging for scientific software. In fact, this has been the primary obstacle to further development of GUI systems within yt, and one of the reasons that yt strives to remain as free of dependencies as possible. Each additional dependency adds on an extra step that can cause problems during the installation process. This is why yt as a whole relies mostly on scripting and command-line usage, largely shunning GUIs. But while this provides the advantage of easily enabling reproducible analysis and reducing the dependencies of the package, it cuts off a large avenue for serendipitous discovery and immersive examination. To that end, we have begun working on an interface between yt and VTK, to outsource to VTK some of the tasks of interactive 3D visualization.

As the capabilities of yt expand, the potential barrier to utilizing those new capabilities should always be going down. If installing a new package is more trouble than it’s worth or if the functionality is not suitably compelling compared to the effort on the part of the user, then the user will either become frustrated with the process, or the feature will simply go unused. As we have moved forward on constructing an interface to VTK, we have kept this primarily in mind.

The VTK interface is based on the TVTK package from Enthought, which provides a simple manner of passing NumPy arrays as data to VTK objects. We utilize the vtkHierarchicalBoxDataSet to construct AMR datasets in memory, which we are then able to interact with and visualize using slicing widgets, marching cubes and isosurfaces. Furthermore, this interface can be used as an interactive environment in its own right, and as an interface to our homegrown AMR volume renderer, providing both camera orientation and isosurface information. We hope that over time our utilization of VTK’s capabilities will grow, and that more and more options will be made available to the user.
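To make the idea of a hierarchy of nested boxes concrete, here is a minimal pure-Python sketch of the kind of level-by-level layout an AMR dataset carries. It makes no VTK or TVTK calls and is not how yt builds its vtkHierarchicalBoxDataSet; the `Box` class, the `finest_box` helper, the refinement factor of 2, and the example extents are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Index3 = Tuple[int, int, int]


@dataclass
class Box:
    """One rectangular AMR grid patch (hypothetical helper, not a VTK class)."""
    level: int   # refinement level, 0 = coarsest
    lo: Index3   # inclusive lower cell index, in this level's own resolution
    hi: Index3   # inclusive upper cell index, in this level's own resolution

    def contains(self, cell: Index3) -> bool:
        return all(l <= c <= h for l, c, h in zip(self.lo, cell, self.hi))


def finest_box(boxes: List[Box], cell0: Index3, refine_by: int = 2) -> Optional[Box]:
    """Return the most refined box covering a cell given in level-0 indices."""
    best = None
    for box in boxes:
        scale = refine_by ** box.level
        # A level-0 cell spans `scale` cells per axis at this level;
        # test the first of them against the box extent.
        cell = tuple(c * scale for c in cell0)
        if box.contains(cell) and (best is None or box.level > best.level):
            best = box
    return best


# Assumed example: an 8^3 root grid with two nested refined regions.
hierarchy = [
    Box(level=0, lo=(0, 0, 0), hi=(7, 7, 7)),
    Box(level=1, lo=(4, 4, 4), hi=(11, 11, 11)),     # level-0 cells 2..5
    Box(level=2, lo=(12, 12, 12), hi=(19, 19, 19)),  # level-0 cells 3..4
]

for probe in [(0, 0, 0), (2, 2, 2), (3, 3, 3)]:
    print(probe, "-> level", finest_box(hierarchy, probe).level)
```

A visualization tool working on such a hierarchy has to answer exactly this question, which patch holds the best available data at a given location, before it can slice or contour the volume.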

On Linux systems, difficulty in installation of yt is typically limited to a few packages -- the majority of the software stack necessary to run yt installs out of the box. yt ships with an installation script that constructs a full sandbox, building a Python interpreter, the necessary IO libraries, the Python packages, and finally yt. The main issues are typically a mismatch in the current compiler module that's loaded (for instance, having PGI loaded while trying to compile with GCC), the lack of an underlying GUI toolkit such as wx or Qt, and occasional contamination of the Python sandbox.
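Two of these failure modes can be checked before a build starts. The sketch below is not part of the yt installation script; it assumes that compilers are managed through Environment Modules, which records loaded modules in the `LOADEDMODULES` variable, and that module names follow a `family/version` pattern (names such as `PrgEnv-pgi` would need extra handling). The function names and the candidate toolkit module names are hypothetical.

```python
import importlib.util
import os


def loaded_compiler_modules(environ, known=("gcc", "pgi", "intel")):
    """List compiler families named in LOADEDMODULES (colon-separated)."""
    modules = environ.get("LOADEDMODULES", "").split(":")
    families = [m.split("/")[0].lower() for m in modules if m]
    return [f for f in families if f in known]


def find_gui_toolkits(candidates=("wx", "PyQt4")):
    """Return the candidate toolkit modules that can be located for import."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


compilers = loaded_compiler_modules(os.environ)
if compilers and compilers != ["gcc"]:
    print("Warning: non-GCC compiler module(s) loaded:", ", ".join(compilers))

toolkits = find_gui_toolkits()
print("GUI toolkits found:", ", ".join(toolkits) if toolkits else "none")
```

Both functions take their inputs as arguments, so they can be exercised with a fake environment rather than the real one.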

Mac OS X, though, is an exceptionally different beast, and with Snow Leopard it only became more difficult. Because OS X installations are harder than Linux installations, we base most of our feature set on what we believe will install seamlessly on both platforms. The installation script that we provide for OS X is distinct from the installation script provided for Linux, as we tend to rely more heavily on binary distributions of software components on OS X than on Linux. Prior to 10.6, the official MacPython distributions were sufficiently general that they could be installed everywhere with roughly equal levels of support. However, with the advent of 10.6 and the deprecation of Carbon, the yt installation script for OS X had to be bifurcated into a 10.5 script and a 10.6 script, and we had to provide our own Python distribution. After much struggling with install paths and compiler locations, I managed to get a relocatable NumPy + Matplotlib + HDF5 + Python 2.6 stack based exclusively on the OS X 10.6 SDK.

However, some users have reported difficulty in installing the necessary components on OS X 10.6. VTK in particular has a number of moving parts. Its ability to adapt to many different needs, together with its wrapping schemes, calls for a finely articulated configuration system, which is handled by CMake. Because we provide our own binary distributions of Python, we need to ensure that the VTK installation is linked against that particular version of Python. The easier we make this process for the user, the more likely it is that we will reach a critical mass of people interested in the VTK capabilities of the toolkit, and the more likely that component will “make the cut” from a provided feature to a utilized feature. As developers who are also full-time working scientists, we have to protect our own productivity rather than chase corner cases of installation for other users, and we need to make sure that our time is well-spent on development that benefits the most people possible, ourselves included.
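One way to pin a VTK build to a specific interpreter is to ask that interpreter where its headers and library live and hand the answers to CMake. The sketch below is an illustration rather than the procedure used for yt. It assumes the cache variables used by VTK 5-era builds and CMake's FindPythonInterp and FindPythonLibs modules (`VTK_WRAP_PYTHON`, `BUILD_SHARED_LIBS`, `PYTHON_EXECUTABLE`, `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARY`); the function name and the source path are hypothetical.

```python
import os
import sys
import sysconfig


def vtk_cmake_args(source_dir):
    """Build a cmake command line that points VTK at the running interpreter."""
    args = [
        "cmake",
        "-DVTK_WRAP_PYTHON:BOOL=ON",
        "-DBUILD_SHARED_LIBS:BOOL=ON",
        "-DPYTHON_EXECUTABLE:FILEPATH=" + sys.executable,
        "-DPYTHON_INCLUDE_DIR:PATH=" + sysconfig.get_paths()["include"],
    ]
    lib_dir = sysconfig.get_config_var("LIBDIR")
    lib_name = sysconfig.get_config_var("LDLIBRARY")
    if lib_dir and lib_name:
        # Assumption: joining LIBDIR and LDLIBRARY gives the library path.
        # This holds for typical non-framework Unix builds; verify the
        # result for framework builds on OS X before relying on it.
        args.append("-DPYTHON_LIBRARY:FILEPATH=" + os.path.join(lib_dir, lib_name))
    args.append(source_dir)
    return args


# Run this with the interpreter that VTK should be linked against.
print(" ".join(vtk_cmake_args("/path/to/VTK")))
```

Because the paths come from the interpreter that runs the script, invoking it with the sandboxed Python avoids accidentally picking up the system Python.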

Just by chance, around the time the OS X 10.6 packaging issues were being dealt with, I happened to listen to the FLOSS Weekly episode (#111) featuring Bill Hoffman discussing the capabilities of CMake as a whole, and specifically touching on some of its lesser known features such as CPack and CTest. CPack specifically was new to me; from the underlying CMake files, it can construct binary installation packages suitable for a variety of different operating systems and packaging methods. I was the most interested in the ability of CPack to take the binary packages that I built and linked against our hand-rolled Python build on OS X and provide a binary package that users could download and install, or that we could provide with our installation script. This is the ideal solution, as it not only lowers the barrier to entry, but it ensures control over the software stack.

An Enzo simulation of cosmological structure formation as displayed by yt using VTK’s hierarchical box data set.

With only one minor modification, the entire process worked perfectly. The VTKCPack.cmake file had to be modified to enable installation if VTK_WRAP_PYTHON is enabled, rather than only if VTK_WRAP_TCL is enabled. Following this, executing "cpack -G PackageMaker" generated a .mpkg file, which is the binary installation format on OS X for distributions of applications and libraries. However, the .mpkg file doesn't include the Python package. Python’s own distutils was able to handle this aspect by executing "python setup.py bdist_mpkg", which created a second .mpkg file that installed the Python wrappers and Python-VTK interface libraries. These two packages have been tested, and they work seamlessly across OS X installations of yt.
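The two packaging steps can be scripted so that they are always run together. The following is a minimal sketch, not the script we ship. The directory paths are placeholders, and the assumption that `setup.py` lives in a Wrapping/Python directory of the build tree should be checked against your own build. The distutils command is spelled `bdist_mpkg`, the name registered by the separately installed bdist_mpkg extension, which must be available to the Python being used. The function names are hypothetical.

```python
import subprocess


def packaging_steps(vtk_build_dir, python_wrapping_dir, python="python"):
    """Return (working directory, argv) pairs for the two packaging commands."""
    return [
        # Binary package of the VTK libraries, built from the CMake build tree.
        (vtk_build_dir, ["cpack", "-G", "PackageMaker"]),
        # Second package holding the Python wrappers.
        (python_wrapping_dir, [python, "setup.py", "bdist_mpkg"]),
    ]


def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for cwd, argv in steps:
        print("(in %s) %s" % (cwd, " ".join(argv)))
        if not dry_run:
            subprocess.check_call(argv, cwd=cwd)


steps = packaging_steps("/path/to/VTK-build",
                        "/path/to/VTK-build/Wrapping/Python")
run_steps(steps)  # dry run: only prints the commands
```

Passing `dry_run=False` executes the commands and stops at the first failure, since `subprocess.check_call` raises on a non-zero exit status.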

At this point, we have essentially crested the steepest part of the climb for using VTK as an interface to our data; by using the CPack-created binary distributions, users of our software are able to use a 3D, interactive environment for exploring their data. The problem with adding interfaces like this to yt has always been in distribution, never in the capability of VTK itself. And the distribution problem has been solved for a while -- we just didn't know it! We’re excited to push development further along this track, and to provide access to these methods both to ourselves, the developers, and to the users of our analysis package.

Acknowledgements
MJT is supported by NASA ATFP NNX08AH26G at the University of California, San Diego. He would also like to acknowledge the efforts and assistance of the VTK developers, Prabhu Ramachandran and the other members of the MayaVi and TVTK development groups, and the yt developers.

References
[1] http://lca.ucsd.edu/projects/enzo
[2] http://yt.enzotools.org/
[3] http://twit.tv/floss111

Matt Turk is a postdoctoral researcher at the University of California, San Diego, working on supercomputer simulations of the formation of the first stars and galaxies in the Universe.