Wednesday, 22 December 2010

ParaView, VTK files and endianness

Recently, I've been doing some work with ParaView, a rather nice open-source visualization application that's aimed at the display and analysis of large data sets. It has a wide range of visualization functionality, an interface that allows for interaction via its graphical user interface, or through scripting, and a distributed architecture. Executable versions are available for a variety of platforms, or you can download the source and compile it yourself. I'm currently running the Windows 64bit distribution on my laptop, and have also been building the source on HECToR, with a view to trying the application out on that machine, and using it for the analysis of some of the large data sets generated there.

ParaView uses the open-source Visualization Toolkit (VTK) for data processing and rendering; this consists of a core C++ class library and interpreted interface layers for Tcl/Tk, Java and Python. Much of VTK's functionality - specifically, its visualization techniques and modelling methods - is available from within ParaView, and it's also possible to extend the application to support other VTK classes by (for example) providing an XML description of the interface.

The first thing the user of any visualization system wants to do is to read their own data into it (of course, this will also be the last thing they ever do with the system, if it proves to be too recalcitrant). Doing this requires some understanding of the type of data that the system can process, and the translation of the different components of the user's data into those types. Generally speaking, it's usually most straightforward for the user to write their data to a file in a format which is supported by the system (the alternative approach - more useful when, for example, the the user has a lot of data files in a particular format - is to extend the system to support the reading of that format).

ParaView can read files in several formats, including so-called VTK Legacy files (so named because the format was introduced in an earlier version of VTK; since then, it's been supplemented by a more flexible XML-based format). Its structure is pretty straightforward, and well-documented. Here's the first part of an example Legacy file, which contains scalar data values located on a regular 3D mesh:

# vtk DataFile Version 2.0
3D example
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 61 61 101
ORIGIN 0.0 0.0 0.0
SPACING 0.05 0.05 0.05
POINT_DATA 375821
SCALARS volume_scalars float 1
LOOKUP_TABLE default

The coordinates of the nodes on this kind of mesh (which VTK calls a Structured Point dataset) are fully specified by providing values for the parameters given by the keywords DIMENSION, ORIGIN and SPACING on each axis; the other file keywords are described elsewhere, along with an account of the other types of mesh and the other data structures which VTK supports. More specifically here, the keyword ASCII specifies that the data values (which come next in the file) are written in this format. Selecting this option ensures that the entire file is human-readable and portable from one machine to another, but it might not be the best choice if the data file is very large (for the usual reasons of space and speed). Switching to BINARY makes the file more compact and allows it to be written and read more quickly.

Here's where I ran into a (slight) problem: although ParaView read the the ASCII version of my file and displayed it correctly, this broke when I switched to the binary version (more accurately, ParaView read the file in happily but the values which it found there were wrong). After some poking around in the documentation, I realised the problem was that VTK - and hence, ParaView - always expects the binary data in the Legacy file to be stored with the most significant byte first - also known as the big-endian representation. (To be strictly honest, I couldn't find this point specified unambiguously in the otherwise-excellent VTK file formats guide, but it was emphasised more explicitly elsewhere). However, my file had been output by a program in which the data was stored with the least significant byte first - i.e., in the so-called little-endian fashion. The representation used by the program to store the data depends on both the operating system and hardware architecture for which it's been compiled (for example, it's little-endian for Windows on x64 - which is what I'm currently using - but big-endian for Solaris on SPARC), and this representation is preserved when the data is output as a raw stream of bytes - as is done, for example, by the C fwrite() function.

Having realised the problem, the solution was clear: swap the order of the bytes around in each data value before writing it out. There are many ways to do this, but I found this document helpful. It proposes a framework for handling several data types which is intended to be portable across both little- and big-endian architectures (note also that the framework receives much attention here as alternatives which are possibly more robust, general and/or portable are suggested). Since my data is stored as floats, I only needed a small part of the framework; here's the C function I used:

float FloatSwap( float f )
{    
   union
   {
      float f;
      byte b[4];
   } dat1, dat2;

   dat1.f = f;
   dat2.b[0] = dat1.b[3];
   dat2.b[1] = dat1.b[2];
   dat2.b[2] = dat1.b[1];
   dat2.b[3] = dat1.b[0];
   return dat2.f;
}

and here's how I used it when writing out the data values:

float *data, val;
FILE *fp;

/* Generate the data values, open the file and write the header.  */
...

/* Output the data values.  */
index = 0;
for( k=0; k<nz; k++ )
   for( j=0; j<ny; j++ )
      for( i=0; i<nx; i++ ) {
         /* Swap the bytes in this data value, write it out.  */
         val = FloatSwap(data[index]);
         fwrite((void *)&val, sizeof(float), 1, fp);
         index++;
      }

fclose(fp);

Finally, here's the visualization which I created in ParaView. The dataset is the probability density function corresponding to the so-called 3dz2 atomic orbital, and it's visualized using volume rendering and an isosurface, which highlights its characteristic shape: two lobes and a torus (I once heard the profile of rather overweight chemistry teacher being compared unflatteringly to this).

4 comments:

  1. Hi Jeremy, Thanks for the post. I tried with my own code but I got some errors from ParaView. I have a VTK Legacy file with data type write in ASCII format, but I would like to write it in BINARY format because there some NaN values and this is the only way in which ParaView accept NaN values. Could you please send me an example code? Thanks in advance for your reply.

    ReplyDelete
  2. Hi Juan - thanks for the question. The VTK Legacy format has a header that's always written in ASCII (and contains a keyword whose value is ASCII or BINARY depending on the way the data values are written). For data in binary format, the code is pretty much what I gave in the extract above if you're writing the file on a little-endian machine; if it's a big-endian machine, you don't need the call to FloatSwap in the loop over data values.

    Hope this helps - please let me know if you need any more here.

    ReplyDelete
  3. Thanks for sharing your code. I was able to use your code to create binary endian data for double and int type. The binary dataset is able to load in paraview without any error. The axis dimensions are correct too. I am trying to load 4 variables, 1 for density and 3 components of velocity (ux, uy, uz). Only density is listed in pipeline, 3 u's are not listed. Also, the value of density is 1 all over but paraview is showing the range of value for density as -3.10e+231 to 1. what does this mean? do you have any suggestion what may be wrong here? I can share my dataset and code if needed. Please let me know. Thanks!

    ReplyDelete
    Replies
    1. Byte swap is same for both int and float type, but it's different for double. One should not use byte swap for float to change Endianess of double data.

      Delete

NAG moderates all replies and reserves the right to not publish posts that are deemed inappropriate.