
Version 3 (modified by domlowe, 12 years ago)

First draft of csml scanning notes complete.

How to use the CSML Scanner

The CSML scanner will scan a collection of files (or just a single file) to create a CSML Dataset which contains a collection of CSML Features.

The scanner currently works with GridSeriesFeatures (e.g. model output). Support for other feature types is under development.

To install the scanner, follow the instructions for installing CSML, which start at section 6 here: InstallDiscoveryBrowse

When CSML is installed you should have a script called csmlscan, which will be somewhere like /usr/local/bin/csmlscan.

At the command line type:

csmlscan

You should get an exception:

No featuretype declared in config file (or no config file)

This shows csmlscan is working, so now we need to create a config file to tell it what to scan. We will then run csmlscan with the -c option to read this config file:

csmlscan -c yourconfigfile.cfg

And this is what a config file looks like...

Example contents of config file

[dataset]
dsID:UBXyeKx6

[features]
type: GridSeries
number: many

[files]
root: /badc/coapec/data/HadCM3_beowulf/64-bit/wholerun/annual/ocean/
mapping: onetomany
output: /home/dom/coapec/COAPEC_500YrRun_wholerun_annual_ocean.xml
printscreen:1

[spatialaxes]
spatialstorage:fileextract

[values]
valuestorage:fileextract

[time]
timedimension: t
timestorage:inline
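
The example above is an INI-style file with `key: value` options grouped into sections. As a minimal sketch for checking such a file before scanning, the snippet below parses the same text with Python's standard `configparser` module. It is an assumption that csmlscan reads the file in a compatible way; the helper name `read_config_text` is hypothetical and not part of CSML.

```python
import configparser

# The example config from this page, reproduced verbatim.
EXAMPLE_CONFIG = """\
[dataset]
dsID:UBXyeKx6

[features]
type: GridSeries
number: many

[files]
root: /badc/coapec/data/HadCM3_beowulf/64-bit/wholerun/annual/ocean/
mapping: onetomany
output: /home/dom/coapec/COAPEC_500YrRun_wholerun_annual_ocean.xml
printscreen:1

[spatialaxes]
spatialstorage:fileextract

[values]
valuestorage:fileextract

[time]
timedimension: t
timestorage:inline
"""


def read_config_text(text):
    """Parse INI-style text into a {section: {option: value}} dict.

    Hypothetical helper for inspection only; not part of csmlscan.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as 'dsID' case-sensitive
    parser.read_string(text)
    return {section: dict(parser.items(section)) for section in parser.sections()}


config = read_config_text(EXAMPLE_CONFIG)
for section, options in config.items():
    print(section, options)
```

All values come back as strings, so `printscreen` is `"1"` rather than the integer 1.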

Now to look at the sections of this file in more detail:

[dataset] section:

dsID: The dataset id you want to see in your final CSML document. This is the unique id used to reference the document locally. In NDG the dataset id forms the end part of the NDG URI, e.g. badc.nerc.ac.uk__NDG-A0__UBXyeKx6. This id should be unique within your domain, and can be opaque or otherwise. If you don't provide an id, the scanner will create a random id for you.
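
If you would rather choose an opaque id yourself, the sketch below shows one way to generate one and to see how it would sit at the end of an NDG URI. The 8-character alphanumeric form only mirrors the example id on this page; how the scanner generates its own random ids is not documented here. The function and parameter names (`make_dataset_id`, `ndg_uri`, `provider`, `scheme`) are hypothetical.

```python
import secrets
import string


def make_dataset_id(length=8):
    """Return a random alphanumeric id (assumed format, mirroring 'UBXyeKx6')."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def ndg_uri(provider, scheme, ds_id):
    """Join the three parts with double underscores, as in the URI shown above."""
    return "__".join([provider, scheme, ds_id])


print(ndg_uri("badc.nerc.ac.uk", "NDG-A0", make_dataset_id()))
```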

[features] section:

type: The type of feature, e.g. GridSeries, PointSeries, RaggedSection. See the CSML User Manual for more info on feature types.

number: How many GridSeries features there are. Should be 'one' or 'many'. In CSML, each phenomenon (e.g. Temperature, Precipitation) is a separate feature, so for model data this will usually be 'many'. Note: this option will probably be deprecated in the future, as it shouldn't have to go in a config file, but for now it's there.

[files] section:

root: The root level of your dataset directory (the data to be scanned). The scanner will also scan sub directories of this directory.

mapping: A keyword describing the mapping between the CSML features and the data files. Csmlscan has the concept of a representative file: e.g. in a typical model run with timesteps in separate files, one (netCDF) file is typical of all the others. The mappings available are:

onetomany: onetomany means one file per directory/subdirectory is the representative file

onetoseveral: Like onetomany, but a directory may contain several representative files, e.g. 5 files containing one (or more) feature(s), and then 5 files containing one (or more) different feature(s) etc. The scanner needs to examine the contents of each file to see if it is like another, so this mapping might be slow on large datasets.

onetoone: onetoone means each feature is self contained within any individual file

oneonly: oneonly means one file represents a feature spanning multiple directories. It assumes there are no files in the top-level directory, and many subdirectories at the next level containing files.

These patterns are quite common and hopefully cover most scanning cases. However, as there are other possible combinations of features and files, there is a framework within csmlscan for defining new mappings in the source code, with the actual details of the mappings defined here. Defining a new mapping is a development task though, not something that can be configured at the config file level.

Model data is typically a onetomany or onetoseveral mapping.
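
To make the idea of a representative file concrete, here is a minimal sketch of a onetomany-style selection: walk the dataset tree and pick one file per directory. This is an illustration only, not csmlscan's implementation. Choosing the alphabetically first file and filtering on a `.nc` extension are both assumptions, and `representative_files` is a hypothetical name.

```python
import os


def representative_files(root, extension=".nc"):
    """Return {directory: one representative file} for every directory
    under root that contains at least one matching file."""
    chosen = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        candidates = sorted(f for f in filenames if f.endswith(extension))
        if candidates:
            # Assumption: any file in the directory is typical of the rest,
            # so the alphabetically first one is as good as any other.
            chosen[dirpath] = os.path.join(dirpath, candidates[0])
    return chosen
```

Calling `representative_files` on the `root` directory from your config file gives a quick view of how many directories the scanner would have to consider.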

output: The name of the csml output file that will be created. Make sure you have the correct write permissions for the location of the file.

printscreen: Whether or not to print out the resulting csml at the command line. 1 for printing, 0 for no printing.

[spatialaxes] section:

spatialstorage: Whether to store the spatial values (e.g. Latitude, Longitude, Height) inline in the CSML file, or to use CSML array descriptors to reference the original data files. Can be 'inline' or 'fileextract'.

[values] section:

valuestorage: Whether to store the rangeset values (e.g. the Temperature measurements) in the CSML file or reference the original data files. Can be 'inline' or 'fileextract'.

[time] section:

timedimension: Exact name of the time variable in the data file(s), e.g. 'time1'. The time dimension is given special treatment when scanning, so this information helps identify it. If a value is not given, the scanner will attempt to work out which is the time dimension, but this may fail to identify it correctly.

timestorage: Whether to store the time values in the CSML file or reference the original data files. Can be 'inline' or 'fileextract'.
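
The three storage options accept the same two values, so a typo in any of them is easy to catch before scanning. The sketch below checks them against the values listed above. It assumes the config has already been loaded into a plain `{section: {option: value}}` dictionary; `check_storage_options` is a hypothetical helper, not part of csmlscan.

```python
ALLOWED_STORAGE = ("inline", "fileextract")

# Section name -> storage option name, as described on this page.
STORAGE_OPTIONS = {
    "spatialaxes": "spatialstorage",
    "values": "valuestorage",
    "time": "timestorage",
}


def check_storage_options(config):
    """Return a list of problems found in the storage options (empty if none)."""
    problems = []
    for section, option in STORAGE_OPTIONS.items():
        value = config.get(section, {}).get(option)
        if value not in ALLOWED_STORAGE:
            problems.append(
                f"[{section}] {option} is {value!r}; expected 'inline' or 'fileextract'"
            )
    return problems


example = {
    "spatialaxes": {"spatialstorage": "fileextract"},
    "values": {"valuestorage": "fileextract"},
    "time": {"timedimension": "t", "timestorage": "inline"},
}
print(check_storage_options(example))
```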

Scanning

Now save your config file as myconfigfile.cfg and run the scanner.

csmlscan -c myconfigfile.cfg

The output should be a csml document!

Problems?

Things to check are:

  1. The config file is correct and can be read by the scanner.
  2. The data sources you are trying to read are correctly mounted and you have read permissions.
  3. The config file again.
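
The first two checks can be partly automated. The following is a rough sketch, assuming the config file can be parsed with Python's standard `configparser`; it verifies that the file is readable, that a feature type is declared, that the `root` directory can be read, and that the directory for `output` is writable. `check_scan_setup` is a hypothetical helper and does not reproduce any checks csmlscan itself performs.

```python
import configparser
import os


def check_scan_setup(config_path):
    """Return a list of likely problems with a csmlscan config (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    try:
        if not parser.read(config_path):
            return [f"cannot read config file: {config_path}"]
    except configparser.Error as err:
        return [f"config file is malformed: {err}"]

    problems = []
    if not parser.has_option("features", "type"):
        problems.append("no 'type' option in the [features] section")

    root = parser.get("files", "root", fallback=None)
    if root is None or not os.path.isdir(root):
        problems.append(f"data root is missing or not a directory: {root}")
    elif not os.access(root, os.R_OK | os.X_OK):
        problems.append(f"no read permission on data root: {root}")

    output = parser.get("files", "output", fallback=None)
    if output is not None:
        out_dir = os.path.dirname(os.path.abspath(output))
        if not os.access(out_dir, os.W_OK):
            problems.append(f"cannot write to output directory: {out_dir}")
    return problems
```

For example, `check_scan_setup("myconfigfile.cfg")` returns an empty list when none of these problems is found.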

If it still doesn't work, then please report a bug, preferably with example data and config file.