Version 10 (modified by domlowe, 13 years ago)

Added in ticket links

The scope of the CSML API for Alpha was limited to providing access to a single CSML feature type (Grid Series) and performing a single operation (subsetting) on that feature.

The API now needs to undergo a thorough review to ensure it is structured to fulfil our long-term requirements, which involve multiple operations on multiple feature types.

We want to follow the 'processing affordance' pattern as per the Met Office Exeter Communique (pdf). A major part of the API rethink will involve deciding how best to implement this in practice.

Additionally, integration of the alpha CSML API with WCS and the Data Extractor has brought up plenty of other issues that we also need to take into account. Most of these (except minor bugs) are documented here:

  • Identification of axes - although CSML does not place specific requirements on axis names, applications need to know which are the longitude/latitude/level/time axes. This can usually be inferred from the CSML context/feature type, but should we explicitly have attributes in the CSML document such as isLatitude, isLongitude, isTime etc.? An example is that in the current CSML API you need to be able to find out what the time axis is called (its id) so that you can handle it accordingly. Doing the standard CF-compliance checks is the best way to ensure you have the correct axis.

This information is contained in CSML, but for NDG2 we need to explicitly provide support for certain coordinate reference systems and have a mechanism for exposing the information ('this axis is longitude') to the client. See ticket: 407
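As an illustration of the CF-style check mentioned above, the following is a minimal sketch that infers an axis identity from its "units" string alone. The function name identify_axis is hypothetical and not part of the CSML API; vertical (level) axes are deliberately left out because their units are not distinctive enough for a check this simple.

```python
import re

# Unit strings that the CF conventions accept for latitude and longitude.
LAT_UNITS = {"degrees_north", "degree_north", "degree_N",
             "degrees_N", "degreeN", "degreesN"}
LON_UNITS = {"degrees_east", "degree_east", "degree_E",
             "degrees_E", "degreeE", "degreesE"}
# CF time units have the form "<period> since <reference date>".
TIME_UNITS = re.compile(r"^\s*(seconds?|minutes?|hours?|days?)\s+since\s+",
                        re.IGNORECASE)


def identify_axis(units):
    """Return 'latitude', 'longitude', 'time', or None if undetermined.

    Hypothetical helper: it only inspects the units string of one axis.
    """
    if units in LAT_UNITS:
        return "latitude"
    if units in LON_UNITS:
        return "longitude"
    if TIME_UNITS.match(units):
        return "time"
    return None


print(identify_axis("degrees_east"))
print(identify_axis("days since 1970-01-01 00:00:00"))
```

A client could call such a helper once per axis and expose the result ('this axis is longitude') without relying on axis names.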

  • Calendaring - The frame of reference for times may depend on different calendars, e.g. 360_day, Gregorian etc. This information should be stored in the CSML, probably in an srsName attribute - is this always possible? What about when times are encoded as file extracts rather than inline? Currently (alpha) the API refers back to the original data to check the calendar attribute, but this is inefficient.

Calendar info should be stored in CSML. Also need to define which calendars we support. See ticket: 408
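To show why the calendar matters, here is a small sketch of date arithmetic in the 360_day calendar, where every month has exactly 30 days. The function name days_since_360 is hypothetical; dates are passed as plain (year, month, day) tuples because dates such as 30 February are valid in this calendar but cannot be represented by Python's datetime.

```python
def days_since_360(date, reference):
    """Days from reference to date, both (year, month, day) tuples,
    counted in the 360_day calendar (12 months of 30 days each)."""
    def ordinal(ymd):
        year, month, day = ymd
        return year * 360 + (month - 1) * 30 + (day - 1)

    return ordinal(date) - ordinal(reference)


# One "month" is always 30 days and one "year" always 360 days.
print(days_since_360((2000, 3, 1), (2000, 2, 1)))
print(days_since_360((2001, 1, 1), (2000, 1, 1)))
```

The same pair of dates gives 29 and 366 days in the Gregorian calendar, which is why the API cannot interpret time values without knowing the calendar.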

  • Path names - in the CSML, should we store relative paths or absolute paths? The 'delivered' CSML should probably be relative, but what about the stored CSML?

If in eXist, the paths will have to be absolute. If on disk, could use either - BADC decision. The paths should be relative at the point of delivery (e.g. when delivering a subsetted CSML doc to a user).
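Converting stored absolute paths to relative ones at the point of delivery could be as simple as the sketch below. The function name and the example paths are hypothetical; posixpath is used so that the result does not depend on the platform running the code.

```python
import posixpath


def relativise(paths, base_dir):
    """Rewrite absolute file paths relative to base_dir.

    Hypothetical helper: 'paths' stands for the file references
    held in a stored CSML document.
    """
    return [posixpath.relpath(path, base_dir) for path in paths]


stored = ["/data/example/2000/temp.nc", "/data/example/2001/temp.nc"]
print(relativise(stored, "/data/example"))
```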

  • domain - the domain complement and reference could be a hindrance when we want to do basic axis indexing. The CSML file doesn't tell us how an array is structured in terms of axis order. When we are defining an interface to the data it is a common requirement to query the axis order, and often we would like to be able to subset just based on a set of axis indices. In order for the DX to cope with very different types of Grid Series Feature (rather than just "tzyx" ordered data) it needs to be able to call feature.getAxisList() and then to associate a subset specifier with that domain.

This is to be considered as part of CSML v2
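As a sketch of how a client might use an axis list, the helper below turns a per-axis subset specifier into an index tuple that follows the array's own axis order. It assumes that something like the proposed feature.getAxisList() returns axis names in storage order; both that return format and the name build_index are assumptions for illustration.

```python
def build_index(axis_order, subset):
    """Build an index tuple in array order from a {axis name: slice} mapping.

    Axes that are not mentioned in 'subset' are selected in full.
    'axis_order' is assumed to come from a call such as getAxisList().
    """
    return tuple(subset.get(name, slice(None)) for name in axis_order)


# The same request works whatever order the data is stored in.
request = {"time": slice(0, 12), "longitude": slice(10, 20)}
print(build_index(["time", "latitude", "longitude"], request))
print(build_index(["longitude", "latitude", "time"], request))
```

The resulting tuple can be applied directly to any array type that accepts slice indexing, so the client no longer needs to assume "tzyx" ordering.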

  • Global attributes - applications may need to know 'global attributes', for presentation or otherwise. Without MOLES, we don't have access to these directly from CSML. Is there a case for storing some global attributes in the CSML?

No. It is too tied into the NetCDF data model to start adding global attributes to CSML. A connection to the MOLES document needs to be made.

  • CF compliance - to write CF-compliant NetCDF we may need to store additional attributes alongside 'variable' and 'axes' (i.e. attributes of the feature and domain). Which attributes are required for CF compliance? "units" is a significant example as CF says you can determine the axis identity from the "units" attribute - we need to be able to do that so we need to capture the units attribute. Will this model be viable when the CSML is not derived from underlying CF-NetCDF?

The crux of this issue is that CSML is a lossy format, and not a mirror of the underlying NetCDF data model. Action: come up with a specification for a minimally CF-compliant NetCDF document. We will then assess whether this is sufficient for use by client applications, and whether CSML needs to contain any additional attribute information. If additional attribute information is required, e.g. cell_methods, then we should see where this fits into the conceptual model, i.e. is it part of the phenomenon or where else might it go in CSML? See ticket: 409

  • Data in memory - applications may want to hold the data in memory rather than receiving a link to a NetCDF file.

Not part of original use case, but wouldn't be hard to implement when/if required. Leave for now.

  • Multiple feature selection - the Data Extractor allows selection of more than one variable (feature). Need to write CSML containing multiple features.

For alpha the writing of CSML and NetCDF was pretty ad hoc, so it needs redesigning anyway - the idea of selecting multiple features will be taken into account, although an implementation of this may be beyond the original use case.

  • Multiple NetCDF files - when the data selection is large we may want to deliver multiple NetCDF files. As above.
  • Multiple time steps per NetCDF file - ideally CSML should be able to scan and write multiple time steps per input and output NetCDF file without expecting a one-to-one mapping of timestep and file. Did we discuss this one? I think it's not a significant problem that needs fixing now.
  • Upside-down data - in Alpha we didn't use the mappingRule element to specify the orientation of the data, so some of it was the wrong way up. The orientation (e.g. +x-y+z+series) should be calculated when scanning and processed by the API.

Problem understood and fairly simple to solve. Andrew to come up with a clear explanation of the sequence Rule element. See ticket: 410
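Pending that explanation, the sketch below shows one way an orientation string such as "+x-y+z+series" could be parsed so that the API knows which axes to reverse. The interpretation of "-" as "stored in reverse order" and the function name parse_orientation are assumptions, not a statement of how the rule is defined in CSML.

```python
import re


def parse_orientation(rule):
    """Split an orientation string into (axis name, is_reversed) pairs.

    Assumption: a '-' sign marks an axis stored in reverse order.
    """
    pairs = re.findall(r"([+-])([A-Za-z_]+)", rule)
    return [(name, sign == "-") for sign, name in pairs]


print(parse_orientation("+x-y+z+series"))
```

With this information the API could flip the flagged axes once when reading, rather than leaving each client to notice that the data is the wrong way up.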

  • The DX (and other) GUI needs to provide enough information about a variable so that the user can identify what it is. The name is not enough because you can have different cell_methods (CF-talk) on a variable (such as mean and std deviation). But, even cell_methods might not be definitive. We need to consider what/how this stuff gets to users.

Comes under 'CF-compliance' above.

  • Re-use of existing code where available. There are likely to be different interfaces required for different feature types. The Grid Series Feature and Grid Feature are both already well served by the cdms library (which is already Python, already uses Numeric/Masked Arrays and already provides an appropriate interface). In analysing what we can achieve for Beta we should consider where to plug into libraries like cdms and just use them as-is for now. This might free resource to work on supporting Features that do not have a useful library in place yet. Dom to look at CDMS code to see what might be useful. See ticket: 411
  • Reducing multiple calls to data files. On testing the Grid Series Feature API it was apparent that many files were being re-opened a number of times to fetch metadata or axis information. This repeated I/O should be reduced. One method might be to read in all coordinate variable values when scanning and store these inline. This shouldn't add much bulk because most datasets will only need to reference a few domains that are repeated for different variables (phenomena).

This is a combination of issues - such as the calendar info, and the expectations of the type of CF-NetCDF that is produced.
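Independently of storing coordinate values inline, repeated reads can also be avoided inside the API by caching. The sketch below is only an illustration: _read_axis_from_file is a stand-in for the real file access, and the counter exists purely to show that the second request does not touch the file.

```python
from functools import lru_cache

read_count = {"n": 0}  # counts simulated file accesses


def _read_axis_from_file(path, axis_name):
    """Stand-in for an expensive read of one coordinate variable."""
    read_count["n"] += 1
    return (0.0, 1.0, 2.0)


@lru_cache(maxsize=None)
def get_axis_values(path, axis_name):
    """Return axis values, reading each (path, axis) pair at most once."""
    return _read_axis_from_file(path, axis_name)


get_axis_values("example.nc", "time")
get_axis_values("example.nc", "time")  # served from the cache
print(read_count["n"])
```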

  • Typecode - a service sitting on top of CSML will get requests for data. It will have to estimate the cost (size, duration) of a request. It therefore needs to know the typecode of the data arrays. If they are short integers the output will be a lot smaller than an array full of doubles. It would be useful to put typecode into CSML.

AbstractArray has a numericType attribute, which 'could' be used for this. Not urgent now.
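A rough size estimate of the kind described above could look like the following sketch. The type names and the ITEM_SIZES table are assumptions for illustration; the values that numericType actually takes in CSML are not specified here.

```python
from math import prod

# Bytes per element for a few illustrative type names (assumed, not from CSML).
ITEM_SIZES = {"int16": 2, "int32": 4, "float32": 4, "float64": 8}


def estimate_bytes(shape, numeric_type):
    """Estimate the uncompressed size in bytes of an array request."""
    return prod(shape) * ITEM_SIZES[numeric_type]


shape = (10, 180, 360)  # e.g. 10 time steps on a 1-degree grid
print(estimate_bytes(shape, "int16"))    # 1296000
print(estimate_bytes(shape, "float64"))  # 5184000
```

For the same request, short integers need a quarter of the space that doubles do, which is the difference a service would want to know before accepting the request.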

  • Subsetting - it would be nice to allow subsetting by index or by value. Also, need to allow subsetting across the Greenwich Meridian.

Subsetting by index would be nice, but probably not required yet. Dom to fix bug with subsetting across the Greenwich Meridian. See ticket: 412
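One possible approach to value-based subsetting across the Greenwich Meridian is to measure every longitude as an eastward offset from the western bound, modulo 360. This is a sketch of that idea only, not the fix tracked in the ticket; the function name select_longitudes is hypothetical.

```python
def select_longitudes(lons, west, east):
    """Return indices of the longitudes lying between west and east,
    ordered from west to east, allowing the range to wrap around 0/360."""
    # A request covering the whole globe must not collapse to a zero span.
    span = 360.0 if east - west >= 360 else (east - west) % 360
    picked = []
    for index, lon in enumerate(lons):
        offset = (lon - west) % 360  # eastward distance from the west bound
        if offset <= span:
            picked.append((offset, index))
    return [index for _, index in sorted(picked)]


lons = list(range(0, 360, 10))  # 0, 10, ..., 350
print(select_longitudes(lons, -20, 20))  # [34, 35, 0, 1, 2]
```

The returned indices are not contiguous when the range wraps, so the caller would need to read two slices and join them.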