Version 5 (modified by domlowe, 12 years ago)

Added csmlscan command argument notes

How to use the CSML Scanner

The CSML scanner will scan a collection of files (or just a single file) to create a CSML Dataset which contains a collection of CSML Features.

The scanner currently works with GridSeriesFeatures, e.g. model output; support for other feature types is under development.

To install the scanner, follow the instructions for installing CSML, which start at section 6 here: InstallDiscoveryBrowse

When CSML is installed you should have a script called csmlscan, which will be somewhere like /usr/local/bin/csmlscan.

At the command line type:

csmlscan -x

You should get the following output:

 Starting scan...


The following config options have been set:

 dataset:dsID = None

 features:type = GridSeries

 features:minaxes = 2

 files:root = /your/current/directory

 files.selection = []

 files:mapping = onetomany

 files:output = csmloutput.xml

 files:printscreen = 0

 time:timedimension = time

 time:timestorage = inline

 spatialaxes:spatialstorage = fileextract

 values:valuestorage = fileextract

 For more information on config options run the command "csmlscan --help"

 Abandoning scan: remove "-x" flag to scan for real.

This shows csmlscan is working. Note that the -x flag is useful for checking the arguments you have passed to csmlscan without actually running a real scan.

To see the available command line options run:

csmlscan --help

You will see that one of the available options is '-c', which is used to specify the location of a config file. So csmlscan can be run either directly with command line arguments or with the aid of a config file. In some ways a config file is better, as it gives you a record of the request.

In the following example we will use a config file, but command line arguments could be used instead.

So we need to create a config file to tell csmlscan what to scan. We will then run csmlscan with the -c option to read this config file:

csmlscan -c yourconfigfile.cfg

And this is what a config file looks like...

Example contents of config file

[dataset]
dsID:UBXyeKx6

[features]
type: GridSeries
number: many

[files]
root: /badc/coapec/data/HadCM3_beowulf/64-bit/wholerun/annual/ocean/
mapping: onetomany
output: /home/dom/coapec/COAPEC_500YrRun_wholerun_annual_ocean.xml
printscreen:1

[spatialaxes]
spatialstorage:fileextract

[values]
valuestorage:fileextract

[time]
timedimension: t
timestorage:inline
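
The example above mixes the spellings key:value and key: value. As a quick sanity check, an INI-style parser such as Python's standard configparser module reads both forms the same way. The sketch below is an independent check and not csmlscan's own parsing code, which may behave differently; the function name read_options is made up for illustration.

```python
import configparser

# The example config from this page, reproduced verbatim.
EXAMPLE_CONFIG = """\
[dataset]
dsID:UBXyeKx6

[features]
type: GridSeries
number: many

[files]
root: /badc/coapec/data/HadCM3_beowulf/64-bit/wholerun/annual/ocean/
mapping: onetomany
output: /home/dom/coapec/COAPEC_500YrRun_wholerun_annual_ocean.xml
printscreen:1

[spatialaxes]
spatialstorage:fileextract

[values]
valuestorage:fileextract

[time]
timedimension: t
timestorage:inline
"""


def read_options(text):
    """Return {section: {option: value}} for INI-style config text.

    Hypothetical helper: it shows how a generic INI parser sees the
    file, not how csmlscan itself reads it.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as dsID case-sensitive
    parser.read_string(text)
    return {section: dict(parser.items(section)) for section in parser.sections()}


if __name__ == "__main__":
    # Print the options in the same section:option = value style as "csmlscan -x".
    for section, options in read_options(EXAMPLE_CONFIG).items():
        for name, value in options.items():
            print(f"{section}:{name} = {value}")
```

All values come back as strings with surrounding whitespace stripped, so printscreen:1 and printscreen: 1 are equivalent to this parser.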

Now to look at the sections of this file in more detail:

[dataset] section:

dsID: The dataset id you want to see in your final CSML document. This is the unique id used to reference the document locally. In NDG the dataset id forms the end part of the NDG URI, e.g. badc.nerc.ac.uk__NDG-A0__UBXyeKx6. This id should be unique within your domain, and can be opaque or otherwise. If you don't provide an id, the scanner will create a random id for you.

[features] section:

type: The type of feature, e.g. GridSeries, PointSeries, RaggedSection. See the CSML User Manual for more information on feature types.

number: How many GridSeries features there are. Should be 'one' or 'many'. In CSML, each phenomenon (e.g. Temperature, Precipitation) is a separate feature, so for model data this will usually be 'many'. Note: this option will probably be deprecated in the future, as it should not need to go in a config file, but for now it is required.

[files] section:

root: The root level of your dataset directory (the data to be scanned). The scanner will also scan sub directories of this directory.

selection: As an alternative to root you may enter a list of one or more files separated by spaces. e.g:

[files]
selection: /home/myname/myfiles/file1.nc /home/myname/morefiles/file99.nc

These files may be in different directories or the same directory. Either way, the full path names must be given.
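
Because the selection list needs full path names separated by spaces, it can be convenient to generate the line rather than type it. The following is a minimal sketch; the helper name selection_line is hypothetical, and rejecting paths that contain spaces is an assumption based on the space-separated format (the page does not say how csmlscan treats such paths).

```python
import os


def selection_line(paths):
    """Build a 'selection:' config line from file paths, making each one absolute."""
    full_paths = [os.path.abspath(p) for p in paths]
    for path in full_paths:
        # Assumption: a space inside a path would be ambiguous in a
        # space-separated list, so refuse it rather than guess.
        if " " in path:
            raise ValueError(f"path contains a space: {path}")
    return "selection: " + " ".join(full_paths)


if __name__ == "__main__":
    print("[files]")
    print(selection_line(["/home/myname/myfiles/file1.nc",
                          "/home/myname/morefiles/file99.nc"]))
```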

mapping: A keyword describing the mapping between the CSML features and the data files. Csmlscan has the concept of a representative file: e.g. in a typical model run with timesteps in separate files, one (NetCDF) file is typical of all the others. The available mappings are:

onetomany: One file per directory/subdirectory is the representative file.

onetoseveral: Like onetomany, but a directory may contain several representative files, e.g. 5 files containing one (or more) feature(s), and then 5 files containing one (or more) different feature(s), etc. The scanner needs to examine the contents of each file to see whether it is like another, so this may be slow on large datasets.

onetoone: Each feature is self-contained within an individual file.

oneonly: One file represents a feature spanning multiple directories. This assumes there are no files in the top-level directory, and many subdirectories at the next level containing files.

These patterns are quite common and hopefully cover most scanning cases. However, as there are other possible combinations of features and files, there is a framework within csmlscan for defining new mappings in the source code, with the actual details of the mappings defined here. Defining a new mapping is a development task, though, and not something that can be configured at the config file level.

Model data is typically a onetomany or onetoseveral mapping.
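
To make the idea of a representative file concrete, the sketch below walks a directory tree and picks one file per directory, as in the onetomany description above. It is only an illustration of the concept: choosing the alphabetically first file is an assumption, the function name representative_files is hypothetical, and csmlscan's real selection logic lives in its source code and may differ.

```python
import os


def representative_files(root):
    """Return {directory: file name}, one entry per directory under root that holds files.

    Illustration only: the alphabetically first file stands in for
    whatever file a real scanner would treat as representative.
    """
    chosen = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        if filenames:
            chosen[dirpath] = sorted(filenames)[0]
    return chosen


if __name__ == "__main__":
    for directory, name in representative_files(os.getcwd()).items():
        print(directory, "->", name)
```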

output: The name of the CSML output file that will be created. Make sure you have write permission for the location of the file.

printscreen: Whether or not to print the resulting CSML at the command line: 1 to print, 0 not to print.

[spatialaxes] section:

spatialstorage: Whether to store the spatial axis values (e.g. Latitude, Longitude, Height) inline in the CSML file, or to use CSML arraydescriptors to reference the original data files. Can be 'inline' or 'fileextract'.

[values] section:

valuestorage: Whether to store the rangeset values (e.g. the Temperature measurements) in the CSML file or reference the original data files. Can be 'inline' or 'fileextract'.

[time] section:

timedimension: The exact name of the time variable in the data file(s), e.g. 'time1'. The time dimension is given special treatment when scanning, so this information helps identify it. If a value is not given, the scanner will attempt to work out which is the time dimension, but this may fail to identify it correctly.

timestorage: Whether to store the time values in the CSML file or reference the original data files. Can be 'inline' or 'fileextract'.

Scanning

Now save your config file as myconfigfile.cfg and run the scanner:

csmlscan -c myconfigfile.cfg

The output should be a CSML document!

Problems?

Things to check are:

  1. The config file is correct and can be read by the scanner.
  2. The data sources you are trying to read are correctly mounted and you have read permissions.
  3. The config file again.

If it still doesn't work, then please report a bug, preferably with example data and config file.
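
The first two checks in the list above can be partly automated before running a scan. The sketch below is a hedged example, not part of csmlscan: the function name preflight is hypothetical, it reads the config file with Python's standard configparser (which may not match csmlscan's own parser exactly), and it only tests the [files] options described on this page.

```python
import configparser
import os


def preflight(config_path):
    """Return a list of problems found; an empty list means none were detected."""
    parser = configparser.ConfigParser(interpolation=None)
    try:
        with open(config_path) as handle:
            parser.read_file(handle)
    except (OSError, UnicodeDecodeError, configparser.Error) as err:
        return [f"cannot read config file {config_path}: {err}"]

    problems = []

    # files:root must be an existing directory that can be listed.
    root = parser.get("files", "root", fallback=None)
    if root is not None:
        if not os.path.isdir(root):
            problems.append(f"files:root is not a directory: {root}")
        elif not os.access(root, os.R_OK | os.X_OK):
            problems.append(f"files:root is not readable: {root}")

    # files:selection is a space-separated list of full path names.
    for path in parser.get("files", "selection", fallback="").split():
        if not os.access(path, os.R_OK):
            problems.append(f"selected file is not readable: {path}")

    # The directory that will hold files:output must be writable.
    output = parser.get("files", "output", fallback="csmloutput.xml")
    out_dir = os.path.dirname(os.path.abspath(output))
    if not os.access(out_dir, os.W_OK):
        problems.append(f"cannot write to output directory: {out_dir}")

    return problems


if __name__ == "__main__":
    import sys
    for problem in preflight(sys.argv[1] if len(sys.argv) > 1 else "myconfigfile.cfg"):
        print("Problem:", problem)
```

An empty result does not guarantee that the scan will succeed; it only rules out the most common path and permission mistakes.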

Defaults

If arguments are not supplied, either in a config file or at the command line, then csmlscan has the following defaults:

    DATASETID=None
    MINAXES = 2
    PRINTSCREEN = 0
    MAPPING=None
    OUTPUTFILE='csmloutput.xml'
    TIMEDIMENSION='time'
    TIMESTORAGE='inline'
    VALUESTORAGE='fileextract'
    SPATIALSTORAGE='fileextract'
    FEATURETYPE='GridSeries'
    SELECTION=None
    ROOTDIRECTORY=os.getcwd()

Here os.getcwd() is the directory from which you are running csmlscan.
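
The way supplied options take precedence over these defaults can be sketched as a simple dictionary overlay. This is an illustration under stated assumptions, not csmlscan's actual code: the keys mirror the names listed above, while the functions default_options and effective_options are hypothetical.

```python
import os


def default_options():
    """Return the defaults listed on this page, evaluated at call time."""
    return {
        "DATASETID": None,
        "MINAXES": 2,
        "PRINTSCREEN": 0,
        "MAPPING": None,
        "OUTPUTFILE": "csmloutput.xml",
        "TIMEDIMENSION": "time",
        "TIMESTORAGE": "inline",
        "VALUESTORAGE": "fileextract",
        "SPATIALSTORAGE": "fileextract",
        "FEATURETYPE": "GridSeries",
        "SELECTION": None,
        "ROOTDIRECTORY": os.getcwd(),
    }


def effective_options(supplied):
    """Overlay supplied options on the defaults, rejecting unknown names."""
    options = default_options()
    unknown = sorted(set(supplied) - set(options))
    if unknown:
        raise KeyError("unknown option(s): " + ", ".join(unknown))
    options.update(supplied)
    return options


if __name__ == "__main__":
    # Only the time dimension is supplied; everything else falls back to a default.
    for name, value in effective_options({"TIMEDIMENSION": "t"}).items():
        print(f"{name} = {value!r}")
```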