Ticket #582 (closed task: fixed)

Opened 13 years ago

Last modified 12 years ago

[DS] Access to metadata in CSML read methods

Reported by: mggr Owned by: mggr
Priority: blocker Milestone: BETA
Component: CSML Version:
Keywords: Cc:

Description

A number of the PML data files require some limited metadata in order to open the files (e.g. width & height for raw 8-bit images). There are a couple of ways to get around this:

  • Convert all PML formats to something else (e.g. netcdf) on the way out the door (possibly a lot of work, also processing and storage implications at our end)
  • Encode metadata in the filename (ick)
  • Pass pairs of files (1 data, 1 metadata) either via delivery service or by combining in an archive (requires explicit support for pairing or, at our end, creating the archives)
  • Allow access to the CSML metadata in the read method so it can access CSML.RSDAS_8bit_extract.width or whatever (requires a minor API change and makes the read methods dependent on the CSML)

My current (slight) preference is the API accessing the CSML metadata - seems like it'd be best to keep the information required to open the file available to all in the XML. It also might be useful for other file types.

I'm not sure what the impact on the CSML read method API would be - Dom: could you have a peek and see if it would be horrible?

If anyone has alternative suggestions, pop them in this ticket :)

Change History

comment:1 Changed 13 years ago by domlowe

  • Status changed from new to assigned

Looking at the options I agree that option 4 is probably the best. As you say it may be useful for other data formats to have access to the contents of the CSML document (whether they choose to use it or not). I've had a look at the code and the impact is minimal - it only needs a few minor changes.

I'm happy to do this if that's the way you want to proceed.

Can you also supply me with details of what you think a <RSDAS_8bit_extract> should look like as this will need adding to the schema/parser.

Note, one side effect of doing this is that you'll never be able to use the CSML scanner (as it uses the read methods to create the csml). But I don't think that's an issue for you.

comment:2 Changed 13 years ago by lawrence

Hopefully PML is phasing this stuff out now anyway? Is this a legacy-only issue? Please tell me you're not still creating these ...

comment:3 Changed 13 years ago by mggr

Two part reply ;)

Dom: yes, I think that's probably the nicest way to proceed if you're happy with it.

This probably belongs in a new ticket, but off the top of my head, we need:

  • width & height
  • bit depth (we have different depths - might be better to call it a RSDAS_raw_extract)
  • might need some scaling parameters (slope, intercept, scaling type) depending on whether we're outputting raw values or pseudo-physical values (probably the latter, so put them in)

I think all the other metadata belongs in the body of the CSML rather than the extract, or at least isn't needed to read the file.

We also have some data in PNG & GIF formats, so might need the scaling parameters there too. I guess that'd be another different ticket..


Part 2..

At the risk of disappointing Bryan, let's say it's a matter of internal debate and development ;) It's not something that'll change in the NDG timeframe though, so we do need to address this one way or another.

Some systems produce these raw files routinely and we have a very large legacy archive of them (so much so that our disks would explode if we inflated them all into netcdfs - we're addressing that one internally too).

Other systems produce image output (PNG/GIF) and we may want some additional metadata for those too (not 100% sure if it's needed at the read methods level, that depends on what we're supposed to be outputting).

And, yes, we do also produce some data in fully marked up formats ;)

comment:4 Changed 13 years ago by domlowe

  • Type changed from issue to task

Changing to a task.

comment:5 Changed 13 years ago by mggr

Regarding the slope & intercept values mentioned above, we note there's a numericTransform attribute in CSML's AbstractArrayDescriptor. Is this in use/implemented? If so, we could (given the right formatting) encode the transformation here instead of doing the transform in the read method.

I'm guessing we should do it in the read method, but worth asking ;)

comment:6 Changed 13 years ago by mhenning

For info, the kind of numeric transformation we will need to do is:

42 (42 * x + 42)

comment:7 Changed 13 years ago by mhenning

In addition,

AbstractArrayDescriptor's arraySize can provide width/height information needed for reading the raw file. Which leaves bit depth.

This could either be an additional element, or encoded in the numericType element (which would involve either assuming that short == 16bit, int == 32bit etc., or making this more explicit by adding additional types: int8, int16, int32 etc.).

Also, it may be desirable to add byte order and signedness, which, while unimportant for reading RSDAS raw files (all are unsigned big-endian), would make the read method more generic and capable of handling many formats of raw data.

Thoughts?
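
For illustration, a minimal Python sketch of what such a generic raw reader might look like, taking array dimensions, bit depth, signedness and byte order as parameters. The function name read_raw and its arguments are hypothetical and not part of the CSML API; it uses only the standard struct module and assumes the data are stored row by row.

```python
import struct

# Hypothetical mapping from (bit depth, signed?) to struct format characters.
_STRUCT_CODES = {
    (8, False): "B", (8, True): "b",
    (16, False): "H", (16, True): "h",
    (32, False): "I", (32, True): "i",
}


def read_raw(data, width, height, bits=8, signed=False, byte_order="big"):
    """Decode a flat byte string into a list of rows (illustrative only)."""
    code = _STRUCT_CODES[(bits, signed)]
    # ">" and "<" select big- and little-endian with standard sizes.
    prefix = ">" if byte_order == "big" else "<"
    count = width * height
    expected = count * bits // 8
    if len(data) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(data)))
    values = struct.unpack(prefix + str(count) + code, data)
    # Assumption: values are stored row by row.
    return [list(values[r * width:(r + 1) * width]) for r in range(height)]
```

The defaults (8 bits, unsigned, big-endian) match what is said above about the RSDAS raw files; anything else would have to come from the metadata.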

comment:8 Changed 13 years ago by awoolf

The correct thing to do here - if there's no existing FileExtract subclass for the data - is to define a new subclass with appropriate attributes. The attributes that are required for a netCDFFileExtract are filename, variable, etc. For a 'RawFile' format, they would probably be bit-depth, endianness, etc - the things you've identified. These rightly belong as attributes in an appropriate file format class within the CSML storage descriptor. For image file formats, necessary metadata is normally already in the file's bitstream, so shouldn't be necessary.

comment:9 Changed 12 years ago by domlowe

  • Status changed from assigned to new
  • Owner changed from domlowe to mggr

Reassigning to Mike to finalise definition of file extract subclass.

comment:10 Changed 12 years ago by mggr

  • Status changed from new to assigned

We're currently looking at something like:

<rawExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.8bit</filename>
   <numericTransform>10 ^ (0.0015 * n - 2)</numericTransform>
   <endianness>big-endian</endianness>
   <numericType>unsigned int</numericType>
   <bitDepth>8</bitDepth>
   <uom>mg m^-3</uom>
</rawExtract>

The new parts are:

  • <numericTransform>10 ^ (0.0015 * n - 2)</numericTransform>
    • In the CSML2 spec, this is a free text field. We're going to be interpreting it as above (fixed formats, one each for log and linear scaling, using "n" to represent the data value). We considered embedding MathML here, but that seems to be a NDG3 and beyond thing.
  • <numericType>unsigned int</numericType>
    • We're currently adding unsigned (or, implicitly, signed). This is a small expansion of the existing definition or, alternatively, we could create a separate element ("isSigned"?). Any preferences?
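
As a rough sketch of how a read method might interpret the numericTransform text, here is a small Python parser for two fixed formats. Only the log form "10 ^ (a * n + b)" appears verbatim in this ticket; the linear form "a * n + b" is an assumption, as is the function name parse_numeric_transform.

```python
import re

_SLOPE = r"([-+]?\d+(?:\.\d+)?)"
_OFFSET = r"(\d+(?:\.\d+)?)"

# Log scaling, e.g. "10 ^ (0.0015 * n - 2)".
_LOG = re.compile(
    r"^\s*10\s*\^\s*\(\s*" + _SLOPE + r"\s*\*\s*n\s*([-+])\s*" + _OFFSET + r"\s*\)\s*$"
)
# Linear scaling, e.g. "0.5 * n + 3" (assumed format, not given in the ticket).
_LINEAR = re.compile(
    r"^\s*" + _SLOPE + r"\s*\*\s*n\s*([-+])\s*" + _OFFSET + r"\s*$"
)


def parse_numeric_transform(text):
    """Return a function mapping a raw data value n to a scaled value."""
    for pattern, is_log in ((_LOG, True), (_LINEAR, False)):
        match = pattern.match(text)
        if match:
            slope = float(match.group(1))
            offset = float(match.group(3))
            if match.group(2) == "-":
                offset = -offset
            if is_log:
                return lambda n: 10.0 ** (slope * n + offset)
            return lambda n: slope * n + offset
    raise ValueError("unrecognised numericTransform: %r" % text)
```

Anything that does not match one of the fixed formats is rejected rather than evaluated, which avoids running free text through eval.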

comment:11 Changed 12 years ago by domlowe

It seems to me that the signedness (hmm..?) is a property of the data type, so should probably be an (optional) xml attribute of <numericType>: e.g:

<numericType unsigned="true">int</numericType>

Where leaving off the attribute:

<numericType>int</numericType>

implies this:

<numericType unsigned="false">int</numericType>

(i.e. signed).

comment:12 Changed 12 years ago by mggr

Following the NDG AG meeting on 18/April/2007, the decisions were:

  1. Keep numericTransform as described
  2. Make numericType more explicit, encoding size and signedness similarly to C99 (e.g. uint16, float32, etc)
  3. Drop the bitDepth element as it'll now be encoded in numericType

So an example fragment would now be:

<rawExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.8bit</filename>
   <numericTransform>10 ^ (0.0015 * n - 2)</numericTransform>
   <endianness>big-endian</endianness>
   <numericType>uint16</numericType>
   <uom>mg m^-3</uom>
</rawExtract>
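
For what it's worth, splitting a C99-style name back into its parts is not much code. A minimal Python sketch, assuming the set of names is limited to int/uint at 8 to 64 bits and float at 32 or 64 bits (the function name parse_numeric_type is hypothetical):

```python
import re

# Assumed vocabulary: int8..int64, uint8..uint64, float32, float64.
_TYPE_RE = re.compile(r"^(u?int|float)(8|16|32|64)$")


def parse_numeric_type(name):
    """Split a name such as 'uint16' into (kind, bits, signed)."""
    match = _TYPE_RE.match(name)
    if match is None:
        raise ValueError("unrecognised numericType: %r" % name)
    kind, bits = match.group(1), int(match.group(2))
    if kind == "float":
        if bits not in (32, 64):
            raise ValueError("unsupported float width: %d" % bits)
        return ("float", bits, True)
    # "uint" is unsigned, plain "int" is signed.
    return ("int", bits, kind == "int")
```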

Note we also need to do something like this for data "held" in png, gif, etc. In that instance, all we'd need is the numericTransform part, e.g.

<imageExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.png</filename>
   <numericTransform>10 ^ (0.0015 * n - 2)</numericTransform>
   <uom>mg m^-3</uom>
</imageExtract>

Possibly one might want to subclass this to PML_PNG, PML_GIF, etc if you want to avoid forcing people with unscaled image "data" to include a numeric transform. Probably not worth worrying about too much for NDG2 though.

If people are happy with this proposal, we can probably close this ticket and use #387 for tracking implementing the read methods.

comment:13 Changed 12 years ago by domlowe

Maybe it's the limitations of access grid but I interpreted point 2 as meaning we'd do something like this:

<numericType unsigned="true" bits="16">int</numericType>

Which would make writing code like this possible:

numbits=numericType.bits

...rather than having to do string manipulation on "uint16" to get the number of bits. I suppose we'd then have to decide whether 'bits' is a mandatory attribute...
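
To make the attribute-based variant concrete, a small Python sketch using the standard xml.etree.ElementTree module. The helper name numeric_type_from_element is hypothetical; it assumes a missing "unsigned" attribute means signed, and returns None for "bits" when that attribute is absent, since whether it is mandatory is still undecided.

```python
import xml.etree.ElementTree as ET


def numeric_type_from_element(elem):
    """Return (type name, bits or None, unsigned) from a <numericType> element."""
    # Assumption: leaving off "unsigned" implies unsigned="false" (i.e. signed).
    unsigned = elem.get("unsigned", "false") == "true"
    bits = elem.get("bits")
    return (elem.text.strip(), int(bits) if bits is not None else None, unsigned)


elem = ET.fromstring('<numericType unsigned="true" bits="16">int</numericType>')
name, numbits, unsigned = numeric_type_from_element(elem)
```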

On imageExtract: numericTransform isn't mandatory anyway in the parent ArrayDescriptor class so unless you were to specifically make it mandatory it wouldn't be in imageExtract. You might however want to subclass to PML_imageExtract and make numericTransform mandatory for PML images... (but probably not worth worrying about now as you say)

p.s. Note that we discovered spaces are invalid in units of measure in gml, so spaces need to be replaced by dots... i.e. mg.m-3

comment:14 Changed 12 years ago by mggr

Dom's modification looks fine to me. I thought we were talking about type names like ISO C99 over AG, but it could just be that it's difficult to hear exactly on a phone ;) The only advantage of the C99 types is that all the typing information is clearly defined (the string parsing part sucks, admittedly). We would have to either make the attributes mandatory or explicitly describe a clear assumption in the documentation - it can't really be left undefined.

re: the imageExtract class, we might as well make it imageExtract->NEODAAS_imageExtract while it's easy to do.

comment:15 Changed 12 years ago by domlowe

Sorry I must have dozed off at the first mention of the ISO word ;-) You're right, that makes sense. The string parsing shouldn't be the main consideration.

comment:16 Changed 12 years ago by mggr

  • Status changed from assigned to closed
  • Resolution set to fixed

Final decision: (probably ;) )

<rawExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.8bit</filename>
   <numericTransform>10 ^ (0.0015 * n - 2)</numericTransform>
   <endianness>big-endian</endianness>
   <numericType>uint16</numericType>
   <uom>mg.m^-3</uom>
</rawExtract>

numericType to be IS* C99 style (but missing the "_t" part..).

<NEODAAS_imageExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.png</filename>
   <numericTransform>10 ^ (0.0015 * n - 2)</numericTransform>
   <uom>mg.m^-3</uom>
</NEODAAS_imageExtract>

(subclass of an "imageExtract" without mandatory numericTransform element)

Track implementation in ticket #387.

comment:17 Changed 12 years ago by mhenning

RawFileExtract and ImageFileExtract should have a <fillValue> element.

<RawFileExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.8bit</filename>
   <numericTransform>10 ^ ( 0.0015 * n - 2 )</numericTransform>
   <endianness>big</endianness>
   <numericType>uint16</numericType>
   <fillValue>0</fillValue>
   <uom>mg.m^-3</uom>
</RawFileExtract>
<ImageFileExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.png</filename>
   <numericTransform>10 ^ ( 0.0015 * n - 2 )</numericTransform>
   <fillValue>0</fillValue>
   <uom>mg.m^-3</uom>
</ImageFileExtract>
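
A possible reading of how a read method could use the fill value, sketched in Python. This assumes that raw values equal to fillValue mark missing data and should be skipped rather than passed through the numeric transform, and that fillValue is an integer; neither is stated in this ticket. The helper names describe_extract and mask_fill are hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<RawFileExtract>
   <arraySize>512 512</arraySize>
   <filename>chlorophyll.8bit</filename>
   <numericTransform>10 ^ ( 0.0015 * n - 2 )</numericTransform>
   <endianness>big</endianness>
   <numericType>uint16</numericType>
   <fillValue>0</fillValue>
   <uom>mg.m^-3</uom>
</RawFileExtract>"""


def describe_extract(xml_text):
    """Pull the fields a read method would need out of an extract fragment."""
    root = ET.fromstring(xml_text)
    return {
        "shape": tuple(int(p) for p in root.findtext("arraySize").split()),
        "filename": root.findtext("filename"),
        "numericType": root.findtext("numericType"),
        # Assumption: fillValue is an integer in the raw (untransformed) domain.
        "fillValue": int(root.findtext("fillValue")),
    }


def mask_fill(values, fill_value, transform):
    """Apply transform to each value, leaving None where the fill value occurs."""
    return [None if v == fill_value else transform(v) for v in values]
```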