
Version 11 (modified by ko23, 13 years ago)


Discussion about MOLES issues and priorities

History:

  • Initial Version, BNL, reporting a discussion between BNL and KON. Based on ndgmetadata1.2.5

Issues on the Table

  1. The CoatHanger
  2. Granules
  3. Stub-B Schema

The aim of this document is to discuss these issues and identify the particular tickets we need to raise towards solving them as soon as we can.

Stub-B

Work is underway to modularise MOLES so that components can be used both in the MOLES schema itself and in the stub-B schema. (No new ticket needed: this is ticket:287, but it does need more detail. What is involved?)

  • Noting that stub-B is a major interface for browse and travelling metadata
  • Noting also that we accept that very large deployment lists may occur, but we'll worry about that when it happens.

The CoatHanger

Issue: how do we import material into MOLES? It turns out we already have the dgMetadataDescriptionType:

which should occur in each MOLES record, although it doesn't form part of any of the major entities (Activity, Data Production Tool, Observation Station, Deployment, Data Entity). The descriptionSection would seem to be a useful addendum to each of these in their own right (possibly instead of making it part of the overall description, since we see this additional information as additional attribute(s) of the entities).

Agreed? Then: Ticket Needed: Make a descriptionSection part of each of the major entities, allowing a stub-B to include this information for each of the first order entities in a natural way.

(KDO - Not agreed. The description is here to help provide a minimum amount of information where the record is being interpreted outside an environment that acknowledges the metadata record types, by putting these standard fragments in a standard and easily accessible place. It also simplifies the schema by having a single place for this section that all metadata records should have.)

Issue: The Online Reference type

should evolve towards something that exploits xlink, so that we can indicate whether one expects to insert the linked object, point to the linked object, or render the remote object, and insert it ...

Issue: Ticket Needed: Provide a suggested mechanism of exploiting xlink to do this. (A proposal should be a schema fragment which includes a controlled vocabulary for the attributes of the xlink, recognising that we will be on the bleeding edge here and some future changes in our technology may be necessary).

(KDO - the mechanism I expected to use was to extend the online reference type into a choice between the existing simple reference and an "xlink type structure" (maybe a citation type structure as well?). However, as Bryan points out, another schema fragment and associated enumerations or vocabs are required.)

Issue: NumSim in particular. Here we expect to wait for an ISO19139 compliant version (ticket:284), which will have clear subcomponents targeted at the deployment and data production tools.
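
As a starting point for that proposal, here is a minimal Python sketch of how the three intents mentioned above (insert, point, render) might be carried on standard XLink attributes. The element name onlineReference, the intent names and the mapping onto xlink:show / xlink:actuate values are all assumptions made for illustration; only the XLink namespace and the attribute values themselves come from the XLink specification.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)

# Hypothetical controlled vocabulary: intent -> (xlink:show, xlink:actuate).
# The mapping is an assumption, not an agreed MOLES convention.
LINK_BEHAVIOUR = {
    "insert": ("embed", "onLoad"),    # insert the linked object in place
    "point": ("new", "onRequest"),    # just point to the linked object
    "render": ("other", "onLoad"),    # rendering needs project-specific markup
}


def make_online_reference(href, intent):
    """Build a hypothetical onlineReference element carrying XLink attributes."""
    if intent not in LINK_BEHAVIOUR:
        raise ValueError(f"unknown link intent: {intent!r}")
    show, actuate = LINK_BEHAVIOUR[intent]
    element = ET.Element("onlineReference")  # element name is illustrative only
    element.set(f"{{{XLINK_NS}}}type", "simple")
    element.set(f"{{{XLINK_NS}}}href", href)
    element.set(f"{{{XLINK_NS}}}show", show)
    element.set(f"{{{XLINK_NS}}}actuate", actuate)
    return element
```

Calling ET.tostring(make_online_reference(url, "insert"), encoding="unicode") would give a fragment that could be compared against whatever schema fragment the ticket eventually proposes.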

Granules

Here the issue is to come up with schema content that maximises the amount of information we can promote from CSML and provides content to clearly indicate to the browse user which granules are of interest.

There are two parts of the Data Entity which are of interest: overall information about the data entity, and the information we put in a data granule.

(KDO question - I've always assumed that there would be a level of summarisation in the parameters presented via the DE, but I'm getting the feeling that this might not be so...)

Starting with the data granule:

we see that there is a datamodel id and an instance uri.

Issue: BNL is confused. Should we expect the datamodel id to be the URI of the CSML document (e.g. equivalent in content to badc.nerc.ac.uk:CSML:blah), and the instance uri to be a service binding to that instance (e.g. http://badchost/dX?uri=badc.nerc.ac.uk:CSML:blah)?

(KDO - there's some confusion due to history here I think. Given what has been said in the past, my expectation was that the data granule ID was the key needed by the relevant services, so the instance was redundant for the NDG. However, it was intended to provide a hook for data that may be accessed outside the NDG SOA.)
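
To make BNL's reading concrete, the relationship between the two values could be sketched as below. This assumes the instance uri really is just a service endpoint plus a "uri" query parameter, as in the example above; the function names are hypothetical and nothing here is confirmed NDG behaviour.

```python
from urllib.parse import parse_qs, quote, urlsplit


def instance_uri(service_endpoint, datamodel_id):
    """Bind a datamodel id to a service endpoint (assumed URL pattern)."""
    # Keep ':' readable so the id appears as in the example in the text.
    return f"{service_endpoint}?uri={quote(datamodel_id, safe=':')}"


def datamodel_id_from_instance(uri):
    """Recover the datamodel id from an instance uri built as above."""
    return parse_qs(urlsplit(uri).query)["uri"][0]
```

If this reading is right, the instance uri is derivable from the id within the NDG, which is consistent with KDO's point that it is only really needed as a hook for access outside the NDG SOA.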

Note that the granulecoverage is the spatio-temporal bounding box; it doesn't cover the sort of averaging used (if any). More on that later.

All the interesting stuff is in the dgParameterSummary ...

Looking through this we can see the

  • IsOutput variable (boolean). BNL can't really see the point of this. KON did explain, but this needs revisiting. Decide: In or out?

In. At BODC, we are considering the following: if IsOutput is True, the parameter is visible in data discovery; if it is False, the parameter is invisible. (Siva, BODC)

(KDO - whoops, looks like something got lost here... The original intent was to differentiate between fixed parameters, eg, data taken at a constant height, and non-fixed, aka measured, parameters, such as the temperature at a particular time at the constant height. Siva's case, I expected to be dealt with by excluding the parameter from the DE parameter summary, and leaving it to be found at data browse time.)

  • The next thing is a choice of four items, only one of which should appear for any parameter: either the value, or the range of values, or an enumeration list of the value types, or a compound group. Yes/No?? If so, ticket needed: it needs to be a choice as to whether this thing exists, and it needs a name. Also another ticket: Roy to give us a few practical examples of how the parameter group is intended to work.

Yes, at BODC we are using the following strategy. Go for dgRangeDataParameter and check whether HighValue = LowValue, in which case we use dgValueDataParameter. We get the HighValue and LowValue by opening each Series data file (QXF file) and obtaining the min and max value for the required data channel. Once the limits for each Series have been obtained, the extremes may be determined to give the limits for the dataset. We cannot envisage using dgEnumerationParameter. (Siva, BODC)

I am concerned that it's not practical to obtain the High and Low values for a parameter when you are dealing with very (very) large datasets e.g. atmospheric model runs. Not practical in the sense that it would increase the processing time to generate CSML by many orders of magnitude. (Dominic)
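
For reference, the BODC strategy described above might look roughly like this in Python. It is only a sketch: it assumes the per-series (min, max) pairs have already been extracted from the data files, and the dictionary layout and the "Value" key are made up for illustration; only the dg* names, HighValue and LowValue come from the discussion.

```python
def dataset_limits(series_limits):
    """Combine per-series (low, high) pairs into dataset-wide limits."""
    series_limits = list(series_limits)
    if not series_limits:
        raise ValueError("at least one series is required")
    lows, highs = zip(*series_limits)
    return min(lows), max(highs)


def summarise_parameter(series_limits):
    """Choose between a value and a range summary for one parameter."""
    low, high = dataset_limits(series_limits)
    if low == high:
        # Degenerate range: record a single value instead.
        return {"type": "dgValueDataParameter", "Value": low}
    return {"type": "dgRangeDataParameter", "LowValue": low, "HighValue": high}
```

The combination step is cheap; the cost Dominic is worried about lies entirely in producing the per-series limits in the first place, which this sketch does not address.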

  • The other elements are rather obvious, but ...
    • Note that we would expect to use the dgStdParameterMeasured variable to encode both the phenomenon name and the cell bounds (so we get the averaging information here). Can we promote something useful from the CF cell methods? Ticket Needed
  • I suppose we imagine a granule as consisting of multiple phenomena with multiple feature types, but we would expect any one phenomenon in one granule to have one feature type (Andrew/Dominic??). In which case the feature type name and the feature type catalogue by which it is governed should also be encoded per parameter. However, one might argue that the assumption might be violated, and in any case, at this point the user might be pointed to the WFS level. It would certainly be simpler, and possibly more useful, to generate a list of feature types present in the granule (along with their FTC antecedents). Yes/No?? Ticket Needed?

I think that assumption (one phenomenon -- one feature type (for a given granule)) is correct. (Dominic)
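
Both options could be supported by something like the following sketch, which lists the feature types present in a granule and also reports any phenomenon that breaks the one phenomenon, one feature type assumption. The input format (a list of (phenomenon, feature type) pairs) and the function names are assumptions; how the pairs would be pulled out of CSML is not shown.

```python
from collections import defaultdict


def granule_feature_types(parameters):
    """Return the sorted list of distinct feature types in a granule."""
    return sorted({feature_type for _, feature_type in parameters})


def assumption_violations(parameters):
    """Return phenomena that appear with more than one feature type."""
    seen = defaultdict(set)
    for phenomenon, feature_type in parameters:
        seen[phenomenon].add(feature_type)
    return {p: sorted(fts) for p, fts in seen.items() if len(fts) > 1}
```

An empty result from assumption_violations would mean the per-parameter encoding is safe for that granule; otherwise the simpler granule-level list is the only option.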

Now we have this information at the granule level, how much of it should be summarised up at the data entity level by the MOLES creator? (Ticket: We would need tools to do this.)

The overall material includes the following data summary:

It is a moot question as to how much of this needs to be replicated from the granule content. Tickets needed on some of the following:

  • BNL would argue that the spatio-temporal coverage should be the *union* of the granule coverages (need a tool to produce this).
  • The parameter coverage is a bit more complicated, because now we think we could have, for example, temperature monthly means and temperature annual means in the granules. I think the only thing that makes sense is to aggregate the granule parameter summaries. In which case why bother? We can parse the granule content. Remove?
  • There ought, however, to be a consolidated list of feature types present ... as well. Add?
  • The other elements seem appropriate.
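
The tool for the union of granule coverages could start from something like the sketch below. It interprets "union" as the smallest bounding box and time span enclosing all granule coverages, which is an assumption, and it ignores boxes that cross the dateline. The Coverage class and its field names are hypothetical and do not mirror the MOLES schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Coverage:
    """Hypothetical spatio-temporal bounding box for one granule."""
    west: float
    south: float
    east: float
    north: float
    start: date
    end: date


def enclosing_coverage(coverages):
    """Return the smallest Coverage that encloses every granule coverage."""
    coverages = list(coverages)
    if not coverages:
        raise ValueError("at least one granule coverage is required")
    return Coverage(
        west=min(c.west for c in coverages),
        south=min(c.south for c in coverages),
        east=max(c.east for c in coverages),
        north=max(c.north for c in coverages),
        start=min(c.start for c in coverages),
        end=max(c.end for c in coverages),
    )
```

Note that an enclosing box can overstate the coverage when granules are widely separated in space or time, so the granule-level coverages would still be needed for precise browse decisions.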

(KDO - ok, at this point I'm going to talk about summarisation: I thought there was a need to actively summarise the data to aid understanding, with the data browse phase dealing with the real detail. Also, this summarisation could take into account the needs of those from other disciplines who may need to access the data.)

Now looking at the other two elements in the data entity which are relevant:

  • The dgDataSet Type should allow 'Mixed' (as, for example, both model and obs may be included in a dataset). One assumes these are effectively booleans? Ticket.
  • I don't really understand dbBasicData and dgDerivedData. In particular, the basic data context is really about listing the feature types, but we think we have that elsewhere, and we have in the dgDataSet information as to whether the data is simulated or an analysis. The only other option is that the data has been processed (derived) in some way, in which case there is utility in providing links to underlying datasets, but these ought to be DataEntities, not data granules ... assuming that the details of the derivation/processing are in the dpt, the links are all that are really needed. The choice of timeseries, integration etc. is redundant, as that information exists in the feature type and phenomenon information. Remove most of this section in the schema?

(KDO - earlier versions of the schema had a comment for dgDataObjectType along the lines of "why isn't this just a term from a vocab to identify the 'feature type'?", with the answer that some data entity types might have attributes only of interest to discovery, which would rarely be populated for other types and would just confuse things. Examples are: input data entities/granules for the "derived DEs"; and a notional dgImage which would have details about the camera used and pixel resolution. Hence, restricting the number of types to only those with such attributes (suggestions wanted), and having a list of CSML feature types involved, is probably a good way to go.)
