wiki:csmlAPIQuestions

Version 6 (modified by astephen, 13 years ago)

--

The scope of the CSML API for Alpha was limited to providing access to a single CSML feature type (Grid Series) and performing a single operation (subsetting) on that feature.

The API now needs a thorough review to ensure it is structured to fulfil our long-term requirements, which involve multiple operations on multiple feature types.

We want to follow the 'processing affordance' pattern as per the Met Office Exeter Communique (pdf). A major part of the API rethink will involve deciding how best to implement this in practice.

Additionally, integration of the alpha CSML API with WCS and the Data Extractor has brought up plenty of other issues that we also need to take into account. Most of these (except minor bugs) are documented here:

  • Identification of axes - although CSML does not place specific requirements on axis names, applications need to know which are the longitude/latitude/level/time axes. This can usually be inferred from the CSML context/feature type, but should we explicitly have attributes in the CSML document such as isLatitude, isLongitude, isTime etc.? For example, in the current CSML API you need to be able to find out what the time axis is called (its id) so that you can handle it accordingly. Doing the standard CF-compliance checks is the best way to ensure you have the correct axis.
  • Calendaring - the frame of reference for times may depend on different calendars, e.g. 360_day, Gregorian etc. This information should be stored in the CSML, probably in a srsName attribute - is this always possible? What about when times are encoded as file extracts rather than inline? Currently (alpha) the API refers back to the original data to check the calendar attribute, but this is inefficient.
  • Path names - in the CSML, should we store relative paths or absolute paths? The 'delivered' CSML should probably be relative, but what about the stored CSML?
  • Domain - the domain complement and reference could be a hindrance when we want to do basic axis indexing. The CSML file doesn't tell us how an array is structured in terms of axis order. When we are defining an interface to the data it is a common requirement to query the axis order, and often we would like to be able to subset based just on a set of axis indices. In order for the DX to cope with very different types of Grid Series Feature (rather than just "tzyx" ordered data) it needs to be able to call feature.getAxisList() and then to associate a subset specifier with that domain.
  • Global attributes - applications may need to know 'global attributes', for presentation or otherwise. Without MOLES, we don't have access to these directly from CSML. Is there a case for storing some global attributes in the CSML?
  • CF compliance - to write CF-compliant NetCDF we may need to store additional attributes alongside 'variable' and 'axes' (i.e. attributes of the feature and domain). Which attributes are required for CF compliance? "units" is a significant example as CF says you can determine the axis identity from the "units" attribute - we need to be able to do that so we need to capture the units attribute. Will this model be viable when the CSML is not derived from underlying CF-NetCDF?
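As a rough illustration of the "units" point above, the following minimal Python sketch guesses an axis's identity from its CF units string. The function name identify_axis and the example axis names are hypothetical and not part of the CSML API; the unit spellings are the latitude/longitude forms accepted by the CF conventions, and a time axis is recognised by the "<unit> since <reference date>" form. Vertical (level) axes are deliberately not handled, because units alone are not always enough to identify them.

```python
# Hypothetical helper, not part of the CSML API: guess axis identity
# from a CF-style "units" attribute.

# Unit spellings that the CF conventions accept for latitude and longitude.
LATITUDE_UNITS = {
    "degrees_north", "degree_north", "degree_N",
    "degrees_N", "degreeN", "degreesN",
}
LONGITUDE_UNITS = {
    "degrees_east", "degree_east", "degree_E",
    "degrees_E", "degreeE", "degreesE",
}


def identify_axis(units):
    """Return 'latitude', 'longitude', 'time' or None for a units string."""
    units = units.strip()
    if units in LATITUDE_UNITS:
        return "latitude"
    if units in LONGITUDE_UNITS:
        return "longitude"
    # CF time units look like "days since 1970-01-01 00:00:00".
    if " since " in units:
        return "time"
    # Level axes need more than units (e.g. the "positive" attribute),
    # so this sketch leaves them unidentified.
    return None


# Example: axis ids (made up here) mapped to their units attributes.
axes = {"t": "days since 1970-01-01 00:00:00", "y": "degrees_north", "x": "degrees_east"}
identities = {axis_id: identify_axis(units) for axis_id, units in axes.items()}
```

A check like this only works if the units attribute is captured in the CSML in the first place, which is the open question raised above.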

  • Data in memory - applications may want to hold the data in memory rather than receiving a link to a NetCDF file.
  • Multiple feature selection - the Data Extractor allows selection of more than one variable (feature). We need to be able to write CSML containing multiple features.
  • Multiple NetCDF files - when the data selection is large we may want to deliver multiple NetCDF files.
  • Upside-down data - in Alpha we didn't use the mappingRule element to specify the orientation of the data, so some of it was the wrong way up. The orientation (e.g. +x-y+z+series) should be calculated when scanning and processed by the API.
  • The DX (and other) GUI needs to provide enough information about a variable so that the user can identify what it is. The name is not enough because you can have different cell_methods (CF-talk) on a variable (such as mean and standard deviation). But even cell_methods might not be definitive. We need to consider what information gets to users, and how.
  • Re-use of existing code where available. There are likely to be different interfaces required for different feature types. The Grid Series Feature and Grid Feature are both already well served by the cdms library (which is already Python, already uses Numeric/Masked Arrays and already provides an appropriate interface). In analysing what we can achieve for Beta we should consider where to plug into libraries like cdms and just use them as-is for now. This might free resource to work on supporting Features that do not have a useful library in place yet.
  • Reducing multiple calls to data files. On testing the Grid Series Feature API it was apparent that many files were being re-opened a number of times to fetch metadata or axis information. This repeated I/O should be reduced. One method might be to read in all coordinate variable values when scanning and store these inline. This shouldn't add much bulk because most datasets will only need to reference a few domains that are repeated for different variables (phenomena).
  • Typecode - a service sitting on top of CSML will get requests for data. It will have to estimate the cost (size, duration) of a request. It therefore needs to know the typecode of the data arrays. If they are short integers the output will be a lot smaller than an array full of doubles. It would be useful to put typecode into CSML.
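To make the typecode point concrete, here is a minimal Python sketch of how a service might estimate the uncompressed size of a request from the array shape and typecode. The function name estimate_bytes, the typecode labels and the example shape are assumptions made for illustration; CSML does not currently define how a typecode would be recorded.

```python
import struct

# Hypothetical typecode labels mapped to struct format characters.
# The "<" prefix selects standard sizes, independent of platform.
ITEM_FORMATS = {
    "int16": "<h",    # short integer, 2 bytes
    "int32": "<i",    # integer, 4 bytes
    "float32": "<f",  # single-precision float, 4 bytes
    "float64": "<d",  # double-precision float, 8 bytes
}


def estimate_bytes(shape, typecode):
    """Estimate the uncompressed size in bytes of an array subset."""
    n_values = 1
    for length in shape:
        n_values *= length
    return n_values * struct.calcsize(ITEM_FORMATS[typecode])


# Example: 12 time steps on a 73 x 96 grid (shape invented for illustration).
shape = (12, 73, 96)
size_as_shorts = estimate_bytes(shape, "int16")     # 168192 bytes
size_as_doubles = estimate_bytes(shape, "float64")  # 672768 bytes
```

The same subset is four times larger as doubles than as short integers, which is why the service needs the typecode before it can give a sensible cost estimate.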