Version 8 (modified by domlowe, 13 years ago)

Added pml data interface example to documentation

Read methods needed to integrate a new data format into the CSML API

The CSML API uses a unified 'data interface' (DI) class to read various data formats. The following diagram gives an overview of CSML tooling and shows how the data interface fits in with the CSML parser and the other components:

Overview of CSML tooling

The basic read methods that need implementing for a new data format are:

  • DI.openFile(self, fileName) --- opens the file
  • DI.setAxis(self, axisName) --- this 'sets' the axis you want to read (axis: e.g. latitude, time, pressure, depth, etc.)
  • DI.getDataForAxis(self) --- this returns the entire set of values for that axis
  • DI.setVariable(self, variableName) --- this 'sets' the variable you want to read (variable: e.g. Temperature, WindSpeed, etc.)
  • DI.getDataForVar(self) --- this returns the entire set of values for that variable
  • DI.getSubsetOfDataForVar(self, **kwargs) --- this returns a subset of values for that variable
  • DI.closeFile(self) --- closes the open file
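
The order in which these methods are typically called can be sketched with a toy interface. This is only an illustration: InMemoryDI and its FAKE_FILES dictionary are made-up names, not part of CSML, and the "file" is just a Python dict.

```python
class InMemoryDI(object):
    """Toy data interface backed by a dict instead of a real file (hypothetical)."""

    # Stand-in "files": file name -> axes and variables held in memory
    FAKE_FILES = {
        "example.xyz": {
            "axes": {"latitude": [0.0, 10.0, 20.0]},
            "variables": {"Temperature": [280.1, 285.4, 290.2]},
        }
    }

    def openFile(self, fileName):
        self.file = self.FAKE_FILES[fileName]

    def setAxis(self, axisName):
        # Remember which axis the next getDataForAxis() call refers to
        self.axisName = axisName

    def getDataForAxis(self):
        return self.file["axes"][self.axisName]

    def setVariable(self, variableName):
        # Remember which variable the next getDataForVar() call refers to
        self.variableName = variableName

    def getDataForVar(self):
        return self.file["variables"][self.variableName]

    def closeFile(self):
        self.file = None


# Typical call sequence: open, set + get an axis, set + get a variable, close
di = InMemoryDI()
di.openFile("example.xyz")
di.setAxis("latitude")
latitudes = di.getDataForAxis()
di.setVariable("Temperature")
temperatures = di.getDataForVar()
di.closeFile()
print(latitudes, temperatures)
```

The point is the 'set then get' pattern: setAxis and setVariable only record what is wanted, and the get methods do the actual reading.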

When the CSML API instantiates a DataInterface object (from now on, DI), what is actually returned is a data interface specific to the data format.

In the DataInterface class there is a bit of Python code that does something like this:

                if self.iface == 'nappy':
                        return NappyInterface()
                elif self.iface == 'cdunif':
                        return cdunifInterface()

So if you want to integrate your format, XYZFormat, the first thing to do is to create an XYZInterface class. The dispatch code can then become:

                if self.iface == 'nappy':
                        return NappyInterface()
                elif self.iface == 'cdunif':
                        return cdunifInterface()
                elif self.iface == 'XYZ':
                        return XYZInterface()
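
The same selection can also be written as a lookup table, which avoids growing the if/elif chain for every new format. This is a sketch of an alternative, not how the CSML code does it: the three classes below are empty stand-ins and INTERFACES and make_interface are hypothetical names.

```python
class NappyInterface(object):
    """Empty stand-in for the real NAPPY interface."""


class cdunifInterface(object):
    """Empty stand-in for the real cdunif interface."""


class XYZInterface(object):
    """Empty stand-in for the new format's interface."""


# Map each interface name to the class that implements it
INTERFACES = {
    'nappy': NappyInterface,
    'cdunif': cdunifInterface,
    'XYZ': XYZInterface,
}


def make_interface(iface):
    """Return a new interface object for the named format."""
    try:
        cls = INTERFACES[iface]
    except KeyError:
        raise ValueError("Unknown data interface: %r" % (iface,))
    return cls()


print(type(make_interface('XYZ')).__name__)
```

Adding a format then means adding one entry to the table.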

In Python, you should create a class that looks like this:

class XYZInterface(AbstractDI):
    #Data Interface for XYZ File format

    def __init__(self):
        #this might change when CSML is revamped
        self.extractPrefix = '_XYZextract_'

    def openFile(self, filename):
        #some code to open the file
        pass

    def setAxis(self,axis):
        #some code to set an axis to be queried, may not need to do much, depending on your format
        pass

    def getDataForAxis(self):
        #some code to return the values for an axis
        return data

    def setVariable(self,varname):
        #some code to set a variable to be queried, may not need to do much, depending on your format
        pass

    def getDataForVar(self):
        #some code to return all values for a variable
        return data

    def getSubsetOfDataForVar(self, **kwargs):
        #takes keyword args defining subset eg
        #subset=getSubsetOfDataForVar(latitude=(0.,10.0), longitude=(90, 100.0), ...)
        #and returns a subset of data for that variable
        return data

    def closeFile(self):
        #some code to close the file
        pass
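
To make the skeleton concrete, here is a minimal sketch that fills it in for an invented format. Everything specific is an assumption: 'XYZ' is taken to be a JSON file with 'axes' and 'variables' sections, and AbstractDI is a small stand-in for the real CSML base class. Subsetting is deliberately left unimplemented.

```python
import json
import os
import tempfile


class AbstractDI(object):
    """Stand-in for the CSML base class; only a generic closeFile is sketched."""

    def closeFile(self):
        self.file.close()


class XYZInterface(AbstractDI):
    """Data interface for a made-up 'XYZ' format: a JSON file of named lists."""

    def __init__(self):
        self.extractPrefix = '_XYZextract_'

    def openFile(self, filename):
        self.file = open(filename)
        self.contents = json.load(self.file)

    def setAxis(self, axis):
        self.axis = axis

    def getDataForAxis(self):
        return self.contents['axes'][self.axis]

    def setVariable(self, varname):
        self.varname = varname

    def getDataForVar(self):
        return self.contents['variables'][self.varname]

    def getSubsetOfDataForVar(self, **kwargs):
        raise NotImplementedError("subsetting is not sketched here")


# Write a small example file, then read it back through the interface
path = os.path.join(tempfile.mkdtemp(), 'example.xyz')
with open(path, 'w') as f:
    json.dump({'axes': {'time': [0, 6, 12]},
               'variables': {'WindSpeed': [3.5, 4.0, 2.5]}}, f)

di = XYZInterface()
di.openFile(path)
di.setAxis('time')
times = di.getDataForAxis()
di.setVariable('WindSpeed')
speeds = di.getDataForVar()
di.closeFile()
print(times, speeds)
```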

Example Data Interfaces

I think the best way to explain XYZInterface() is to show how the interface differs between the cdms/cdunif and NAPPY data interfaces. The details of each API aren't important to understand; it is the structure of the DataInterface that I am trying to illustrate.

So here are the methods for two different data interfaces, cdunifInterface() and NappyInterface(). First, the openFile method. This is pretty straightforward: we open the file and assign the open file to self.file.

CDMS: openFile

        def openFile(self, filename):

NAPPY: openFile

        def openFile(self, filename):

The setAxis method differs for the two interfaces. The cdunif method is straightforward and grabs an axis object directly from the file, whereas the NAPPY method stores the name of the axis in a variable called self.axisstub for reference later. To do this it has to get all the axes and then strip the units from them. This is a confusing detail, but the basic idea is to store 'something' that will give you a handle back to the axis. This something will be internal to the XYZInterface class.

CDMS: setAxis

    def setAxis(self,axis):

NAPPY: setAxis

        def __stripunits(self,listtostrip):
                #strips units of measure from list
                #eg ['Universal time (hours)', 'Altitude (km)', 'Latitude (degrees)', 'Longitude (degrees)']
                #becomes ['Universal time', 'Altitude', 'Latitude', 'Longitude']
                cleanlist = []
                for item in listtostrip:
                        if openbracket != -1:
                                #if brackets exist, strip units.
                return cleanlist

        def __getListOfAxes(self):
                return axes

        def setAxis(self,axis):
                axes = self.__getListOfAxes()
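
The unit-stripping idea can be sketched on its own, outside the class. The function name strip_units is hypothetical, and treating the handle as the position of the axis in the cleaned list is an assumption (it is suggested by the way self.axisstub is later used as an index).

```python
def strip_units(names):
    """Remove a trailing '(units)' part from each axis name."""
    clean = []
    for name in names:
        open_bracket = name.find('(')
        if open_bracket != -1:
            # Brackets exist, so keep only the text before them
            name = name[:open_bracket]
        clean.append(name.strip())
    return clean


axes = ['Universal time (hours)', 'Altitude (km)',
        'Latitude (degrees)', 'Longitude (degrees)']
clean_axes = strip_units(axes)

# Assumed form of the 'handle': the index of the requested axis
axis_stub = clean_axes.index('Latitude')
print(clean_axes, axis_stub)
```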

Now the 'handle' to the axis is used internally within getDataForAxis.

CDMS: getDataForAxis

        def getDataForAxis(self):
                data = self.axisobj.getValue()
                return data

NAPPY: getDataForAxis

        def getDataForAxis(self):
                #this is a Nappy thing - it needs to call the readData() method if it hasn't already done so  
                if self.file.X == None:
                #if more than one axis you need to specify which one you want (using self.axisstub)
                if type(self.file.X[1])==list:
                        data = self.file.X[self.axisstub]
                        data =self.file.X
                return data
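
The 'one axis or several' decision in the NAPPY method can be illustrated with plain lists. This is a sketch under assumptions: values_for_axis is a hypothetical helper, the axis data is assumed to be either a flat list (one axis) or a list of lists (several axes), and the handle is assumed to be an index.

```python
def values_for_axis(X, axis_index):
    """Return the values for one axis from flat or nested axis data."""
    if X and isinstance(X[0], list):
        # Several axes: pick the one the handle points at
        return X[axis_index]
    # Only one axis: the data is already the list of values
    return X


several_axes = [[0.0, 6.0, 12.0], [100.0, 200.0]]
single_axis = [0.0, 6.0, 12.0]

print(values_for_axis(several_axes, 1))
print(values_for_axis(single_axis, 0))
```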

setVariable and getDataForVar work in a similar way to the axis methods just shown.

CDMS: setVariable

    def setVariable(self,varname):

NAPPY: setVariable

        def setVariable(self,varname):

CDMS: getDataForVar

    def getDataForVar(self):
        data = self.varobj.getValue()
        return data

NAPPY: getDataForVar

        def getDataForVar(self):
                if self.file.V == None:
                    if type(self.file.V[1])==list:
                        data = self.file.V[self.varstub]
                    return data
                    data = self.file.X
                    return data

Getting subsets of data is something that may or may not be complicated. CDMS has subsetting built-in so we just call the CDMS methods.

CDMS: getSubsetOfDataForVar

    def getSubsetOfDataForVar(self, **kwargs):
        #takes keyword args defining subset eg
        #subset=getSubsetOfDataForVar(latitude=(0.,10.0), longitude=(90, 100.0))
        #data = subset.getValue()
        data = subset  #doesn't seem to matter which
        return data

NAPPY, on the other hand, has no built-in subsetting, so we need to do the subsetting within the NappyInterface class itself.

NAPPY: getSubsetOfDataForVar

    def getSubsetOfDataForVar(self, **kwargs):
        #Hmmm I haven't implemented this yet.
        #Either the subsetting needs to happen in NAPPY (which it doesn't)
        #Or I need to write some code here to handle the subsetting
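
As a rough idea of what such subsetting code might involve, here is a sketch for the simplest case: a one-dimensional variable with one axis. It is not the NappyInterface implementation (which does not exist yet); subset_by_range is a hypothetical helper and the bounds are assumed to be inclusive.

```python
def subset_by_range(axis_values, var_values, bounds):
    """Keep the variable values whose axis value lies within bounds."""
    low, high = bounds
    return [value for axis_value, value in zip(axis_values, var_values)
            if low <= axis_value <= high]


latitudes = [-10.0, 0.0, 5.0, 10.0, 20.0]
temperatures = [290.0, 295.0, 296.5, 294.0, 288.0]

# Same style of bounds as the keyword arguments, e.g. latitude=(0., 10.0)
subset = subset_by_range(latitudes, temperatures, (0., 10.0))
print(subset)
```

A real implementation would also have to handle several axes at once and map each keyword name to the right axis.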

And finally the closeFile method. Both cdunifInterface(AbstractDI) and NappyInterface(AbstractDI) inherit the generic closeFile method from AbstractDI().

AbstractDI: closeFile

        def closeFile(self):
                #closes file

There is a bit more to all this in terms of integrating it into the main trunk of the CSML code, but the first stage is to work on these methods for your file format.

Appendix 1: Data Interfaces for raw and image files

Staff at Plymouth Marine Laboratory have written DataInterfaces to read image files and raw binary files. The following code shows how they did it:

class RawFileInterface(AbstractDI):

   def __init__(self):
      self.extractType   = 'rawExtract'
      self.extractPrefix = '_rawextract_'
   def openFile(self, filename):
      self.file = open(filename, "rb")

   def closeFile(self):

   # Read the data from the raw file into a multidimensional Numeric array
   def readFile(self, **kwargs):
        # Determine the numeric type:
        if 'numericType' in kwargs:
                numericTypeCode = {
            except KeyError:
                raise TypeError("Invalid numericType: " + str(kwargs['numericType']))
            raise KeyError("numericType is mandatory.")
        # Read the file into a numpy array: = Numeric.fromstring(, numericTypeCode)
        # If necessary, swap the byte order:
        if 'endianness' in kwargs:
            if ((kwargs['endianness'] == 'little' and not Numeric.LittleEndian) or (kwargs['endianness'] == 'big' and Numeric.LittleEndian)):
        # Declare the shape of the array:
        dimensions = map(int,kwargs['dimensions'])
        dimensions.reverse() = tuple(dimensions)
        # If numericTransform or fillValue were provided, store them as
        # attributes.
        if 'numericTransform' in kwargs:
            self.numericTransform = NumericTransform.infixExpression(kwargs['numericTransform'])
        if 'fillValue' in kwargs:
            self.fillValue = kwargs['fillValue']
   # Return the fill value, if set, and transform if necessary:
   def getFillValue(self):
      # Both fillValue and numericTransform attributes may or may
      # not exist...
         return self.numericTransform.solve( n = float(self.fillValue) )
      except AttributeError:
            return self.fillValue
         except AttributeError:
            return None

   # This is just a special case of getSubsetOfDataForVar. To avoid
   # duplication of code, just subset the entire array. (getSubset.. is
   # optimised for this case)
   def getDataForVar(self):
      return self.getSubsetOfDataForVar(lower = (0,)*len(,
                                        upper =

   # Subset the n-dimensional file based on array indices. Accepts parameters
   # in the form of e.g.
   # getSubsetOfDataForVar( lower=(0,0), upper=(1,2) )
   # Note: The rank of the required subset, must exactly match the
   # rank of the original data: len(lower) == len(upper) == rank of file
   def getSubsetOfDataForVar(self, **kwargs):
      # Assume subset parameters are passed as: lower=(0,0) upper=(512,512)
      if 'lower' not in kwargs or 'upper' not in kwargs:
         # Have not specified any subset parameters that we recognise, so raise
         # an exception:
         raise NotImplementedError("Only supports subsetting with lower+upper array indices")
      elif not len(kwargs['lower']) == len(kwargs['upper']) == len(
         # Rank of subset is not the same as rank of full data array. so raise
         # an exception:
         raise NotImplementedError("Only supports subsets of same rank as full dataset")
      elif Numeric.sometrue(Numeric.greater(kwargs['upper'],
         # Requested upper bound of subset is beyond the size of the full
         # data array, so raise an exception
         raise IndexError("Subset out of range")
      elif Numeric.sometrue(Numeric.less( kwargs['upper'], Numeric.zeros(len(
         # Requested lower bound of subset is beyond the size of the full
         # data array, so raise an exception
         raise IndexError("Subset out of range")
      elif Numeric.sometrue(Numeric.less_equal(kwargs['upper'], kwargs['lower'])):
         # upper bound <= lower bound for at least one dimension, so raise an
         # exception
         raise IndexError("Upper bound less than lower bound")
      elif tuple(kwargs['lower']) == (0,)*len( and tuple(kwargs['upper']) ==
         # Special case of requested subset == entire data file.
         subset =
         # We are okay to subset.

         # I can't see any nice (and speedy) way of subsetting a Numeric
         # array of arbitrary rank without the use of eval. By doing it
         # this way, we can make use of the (possible) acceleration provided
         # by Numeric/NumPy.
         slices = []
         for i in range(len(
            lower = int(kwargs['lower'][i])
            upper = int(kwargs['upper'][i]) 
         subset = eval("["+','.join(slices)+"]")

      # Attempt to perform the numericTransform on the data array, if we get
      # AttributeError, it most likely means that numericTransform was not
      # specified, so return the data as-is
         return self.numericTransform.solve( n = subset )
      except AttributeError:
         return subset.copy()
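
The index-based subsetting above can also be expressed without eval by building slice objects. The sketch below uses plain nested lists so that it runs without Numeric or NumPy; shape_of, build_slices and apply_slices are hypothetical helpers, and the checks are a simplified version of the ones in the class. With NumPy arrays, a tuple of slice objects can be used directly as an index (array[slices]).

```python
def shape_of(data):
    """Shape of a non-empty, regular nested list, e.g. (3, 3)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


def build_slices(lower, upper, shape):
    """Validate the bounds and return one slice per dimension."""
    if not len(lower) == len(upper) == len(shape):
        raise NotImplementedError("Only supports subsets of same rank as full dataset")
    for lo, hi, size in zip(lower, upper, shape):
        if lo < 0 or hi > size:
            raise IndexError("Subset out of range")
        if hi <= lo:
            raise IndexError("Upper bound less than lower bound")
    return tuple(slice(int(lo), int(hi)) for lo, hi in zip(lower, upper))


def apply_slices(data, slices):
    """Apply one slice per nesting level of a nested list."""
    if not slices:
        return data
    return [apply_slices(row, slices[1:]) for row in data[slices[0]]]


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
slices = build_slices(lower=(0, 1), upper=(2, 3), shape=shape_of(grid))
subset = apply_slices(grid, slices)
print(subset)
```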

# Interface for reading data from image files. Requires PIL Image module.
class ImageFileInterface(RawFileInterface):
   def __init__(self):
      self.extractType   = 'imageExtract'
      self.extractPrefix = '_imageextract_'
   def image2array(self,im):
       #Adapted from code by Fredrik Lundh, 
        if im.mode not in ("L", "F"):
            raise ValueError("can only convert single-layer images")
        if im.mode == "L":
            a = Numeric.fromstring(im.tostring(), Numeric.UnsignedInt8)
            a = Numeric.fromstring(im.tostring(), Numeric.Float32)
        a.shape = im.size[1], im.size[0]
        return a

   def openFile(self, filename):
      self.file =

   def closeFile(self):
      self.file = None #...Image does not seem to have a close() method.

   def readFile(self, **kwargs):
      # Convert the image to a Numeric array
      #slower method: = Numeric.array(self.file.getdata())

      if 'numericTransform' in kwargs:
         # numericTransform was specified, so compile the expression:
         self.numericTransform = NumericTransform.infixExpression(kwargs['numericTransform'])
      if 'fillValue' in kwargs:
         self.fillValue = kwargs['fillValue']
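
For the raw-file case, the steps in RawFileInterface.readFile (choose a numeric type, honour the byte order, reverse the dimensions to get the array shape) can be sketched with the standard struct module instead of Numeric. This is an illustration only: decode_raw is a hypothetical function, the type codes are struct codes rather than the Numeric ones used above, and only two-dimensional data is handled.

```python
import struct


def decode_raw(raw_bytes, type_code, endianness, dimensions):
    """Decode raw binary data into a list of rows (2-D only in this sketch)."""
    byte_order = {'little': '<', 'big': '>'}[endianness]
    count = len(raw_bytes) // struct.calcsize(byte_order + type_code)
    flat = struct.unpack('%s%d%s' % (byte_order, count, type_code), raw_bytes)

    # As in readFile, the dimensions are reversed to give the array shape
    shape = tuple(reversed([int(d) for d in dimensions]))
    rows, cols = shape
    if rows * cols != count:
        raise ValueError("dimensions do not match the amount of data")
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# Six big-endian 16-bit integers, declared as 3 wide by 2 high
raw = struct.pack('>6h', 1, 2, 3, 4, 5, 6)
grid = decode_raw(raw, 'h', 'big', dimensions=(3, 2))
print(grid)
```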