Subversion URL: http://proj.badc.rl.ac.uk/svn/exarch/FCC2/trunk/README@41
Revision 41, 6.1 KB, checked in by mjuckes, 6 years ago

reworked and MIP tables added

BASIC File Compliance Checker.

This script is designed to scan a collection of files and test conformance to a set of rules.

The rules apply to the filename, to consistency of time ranges, to global attributes and variable attributes.

Groups of files (datasets) are defined by specifying a mapping from file properties to a dataset identifier. The script is designed to accommodate multiple such mappings (e.g. to a variable-level dataset containing several files for one variable, or to a publication dataset containing thousands of files from one simulation), but has so far only been tested with a single mapping.

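As an illustration of such a mapping, the sketch below groups file names into variable-level datasets by dropping a trailing time range from the name. The function names, the underscore-separated layout and the time range pattern are assumptions made for this example; they are not taken from the checker's code.

```python
import os
import re
from collections import defaultdict

# Hypothetical pattern for a trailing time range such as "200101-200512".
TIME_RANGE = re.compile(r"^\d+-\d+$")

def dataset_id(path):
    """Map a file to a dataset identifier: the file name minus its time range."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_")
    if parts and TIME_RANGE.match(parts[-1]):
        parts = parts[:-1]
    return ".".join(parts)

def group_by_dataset(paths):
    """Group file paths by the identifier returned by dataset_id()."""
    groups = defaultdict(list)
    for path in paths:
        groups[dataset_id(path)].append(path)
    return dict(groups)
```

A second mapping (e.g. to a publication dataset) would simply be another function in place of dataset_id().
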
The output is a log file which contains a line including "FAIL" for each failed test. The script can be made to raise an exception on completion if a test is failed [an option to raise an exception on the first test failure is not yet included].

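Because every failed test produces a line containing "FAIL", the log can be summarised with a few lines of Python. This is a minimal sketch; the function name is hypothetical and nothing is assumed about the log format beyond the presence of the string "FAIL".

```python
def failed_lines(log_path):
    """Return the lines of a checker log that contain the string "FAIL"."""
    with open(log_path) as fh:
        return [line.rstrip("\n") for line in fh if "FAIL" in line]
```

For example, len(failed_lines("log-file.txt")) gives the number of failed tests recorded in log-file.txt.
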
The present tests are based on the filename, global attributes of the netcdf file, some variable names and variable attributes.

Three directories of dummy files are included: examples_good, examples_bad and example_good_nc. The first, examples_good, should trigger no errors when the checker is run with the "--oic" (omit in-file checks) option (see below). examples_good and examples_bad contain empty files, so they only test the file name checking aspects of the code. example_good_nc contains a real netcdf file.

CONFIGURATION
-------------

The constants startyear, endyear and enddec in the configuration file "config_cordex_mip.txt" will, in general, vary from experiment to experiment and need to be set by the user. With the present code, this may require a separate configuration file for each experiment. "startyear" and "endyear" are the first and last years of the data collection being submitted (these affect the valid time ranges for daily and monthly files). "enddec" is the last day of December (i.e. "31" for a true calendar, "30" for a 360-day calendar). "enddec" is needed because the software currently checks the file names before looking in the files, and it needs to know the calendar being used in order to determine the valid end points for some time ranges.

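The role of "enddec" can be sketched as follows. Both function names are hypothetical, the calendar name "360_day" follows the CF conventions, and the YYYYMMDD form of the end date in daily file names is an assumption of this example rather than something read from the checker.

```python
def last_day_of_december(calendar):
    """Return the value to use for "enddec" for a given calendar name."""
    return 30 if calendar == "360_day" else 31

def expected_end_date(endyear, enddec):
    """Expected end of the time range in a daily file name, as YYYYMMDD."""
    return "%04d12%02d" % (endyear, enddec)
```

With endyear = 2005, a 360-day calendar gives an expected end point of 20051230 rather than 20051231.
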
The variable "ncdumpCmd" in qc_utils.py should be set to the full path of the ncdump command to be used.

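If the location of ncdump is not known, its full path can be looked up before editing qc_utils.py. The helper below is a sketch only: the function name is hypothetical, and shutil.which requires Python 3.3 or later, which may be newer than the Python version the checker itself targets.

```python
import shutil

def find_ncdump(name="ncdump"):
    """Return the full path of the ncdump executable, or None if it is not on PATH."""
    return shutil.which(name)
```

The returned string is the value that would be assigned to ncdumpCmd.
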
The following example configuration files are provided:
config_cordex.txt
config_cordex_mip.txt
config_cmip5_mip.txt

The first two are for use with CORDEX data, the last with CMIP5. Those with "_mip" in the name also check consistency of variable attributes with CMOR2 MIP tables.

The MIP tables included in the distribution were downloaded from PCMDI on June 6th, 2013. To get the latest tables run "git clone git://uv-cdat.llnl.gov/gitweb/cordex-cmor-tables.git" and "git clone git://uv-cdat.llnl.gov/cmip5-cmor-tables.git" for CORDEX and CMIP5 respectively and then copy the contents of the "Tables" subdirectory into the "mip" subdirectory of the relevant vocabularies directory.

WARNING: the June 6th tables have no support for "fx" fields, and data of this frequency will cause the script to fail.

CORDEX vocabularies can be updated from the DMI CORDEX site using:
wget -nH http://cordex.dmi.dk/joomla/images/CORDEX/RCMModelName.txt
wget -nH http://cordex.dmi.dk/joomla/images/CORDEX/GCMModelName.txt

TIME RANGE
----------

The most complex tests are on the time range. These are handled as regular expressions, with a different set of expressions for each frequency. Named groups in the regular expressions can be captured and subjected to constraints. E.g. time ranges should start at the beginning of the decade or at the first year of the simulation. The first year can either be specified in the user configuration file or, if simulations with multiple start years are to be scanned, a slightly weaker constraint can be applied: only one start year which is not at the start of a decade is allowed in each simulation.

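The approach can be sketched for monthly files as below. The pattern, the group names and the function name are hypothetical and are not the expressions used in the configuration files. Treating years ending in 1 as the start of a decade is also an assumption of this example.

```python
import re

# Hypothetical pattern for a monthly time range, e.g. "200101-201012".
MONTHLY = re.compile(
    r"^(?P<start>\d{4})(?P<m1>\d{2})-(?P<end>\d{4})(?P<m2>\d{2})$"
)

def check_monthly_range(text, startyear):
    """Check a monthly time range against constraints on its named groups."""
    m = MONTHLY.match(text)
    if m is None:
        return False
    start, end = int(m.group("start")), int(m.group("end"))
    # Constraint on the month groups: January to December.
    months_ok = m.group("m1") == "01" and m.group("m2") == "12"
    # Constraint on the start year: start of a decade (assumed to be a
    # year ending in 1) or the first year of the simulation.
    start_ok = start % 10 == 1 or start == startyear
    return months_ok and start_ok and end >= start
```

With startyear = 1989, "198901-199012" and "200101-201012" pass, while "199301-200012" fails the start-year constraint.
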
EXCEPTIONS
----------

The default behaviour is to catch exceptions, record a note in the log file and then raise the exception again. The final raising can be suppressed (--swallow-exceptions). It is also possible to raise an exception at the end of execution when a test has been failed (--exception-on-fail).

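The catch, log and re-raise pattern described above looks roughly like this. It is a generic sketch, not the checker's own code: the function name, the "swallow" argument and the log message text are hypothetical, and "log" is assumed to be a standard logging.Logger.

```python
def run_check(check, log, swallow=False):
    """Run check(); log any exception, then re-raise it unless swallow is True."""
    try:
        return check()
    except Exception as exc:
        # Record a note in the log before deciding whether to pass it on.
        log.error("EXCEPTION: %s", exc)
        if not swallow:
            raise
        return None
```

Calling run_check with swallow=True corresponds to the behaviour of recording the exception in the log without passing it on.
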
USAGE
-----

./fcc -d tmp -c config_cordex.txt [--ef] [-q] [--oic] [--nlp] [--append-log] [-l log-file.txt]

-b: base directory;
-d: directory :: -b and -d provide redundancy; if -l is not specified the argument for -d will be used to construct a log file name;
-c: configuration file specifying tests to be run;
--ef: raise an exception on completion if test(s) have been failed;
-q: suppress printing of script completion message to std-out;
--nlp: do not log passed tests;
--oic: omit "in-file" tests, i.e. tests looking at netcdf attributes etc.;
--append-log: append log to existing log file, if present;
-l <filename>: file to receive log messages;
--R, --repeatable: omit timing info and other housekeeping notes from log file so that log file is reproducible: useful in development;
--uc, --user-config: user configuration information, overriding configuration in configuration file;
--swallow-exceptions: caught exceptions are recorded in the output log, but are not passed on;
--exception-on-fail: raise an exception at end of execution if a test has been failed.

Example:
./run_qc.py -b ./ -d examples_bad/ -c config_cordex.txt --nlp --append-log

TEST
----

Run:
./fcc -d examples_bad/ -c config_cordex.txt --nlp --oic --R -l examples_bad_test.log
md5sum examples_bad_test.log

The output from md5sum should be:
ed4cadd219bc951b1dde3cbda7ff1e11  examples_bad_test.log

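On systems without md5sum the same comparison can be made from Python. This is a sketch using the standard hashlib module; the function name is hypothetical.

```python
import hashlib

def md5_of_file(path):
    """Return the hexadecimal MD5 digest of a file, as printed by md5sum."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so that large log files need not fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The test passes if md5_of_file("examples_bad_test.log") equals "ed4cadd219bc951b1dde3cbda7ff1e11".
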
The "--oic" flag is needed with the "examples_bad" and "examples_good" test directories, because these do not contain data files -- they only serve to test the filename parsing section of the code.

TO DO
-----

001: Add more vocabularies;
003: Structure logging: add option for multiple log files, e.g. file level, dataset level;
004: Structure test categorisation, e.g. multiple tests associated with checking time range: only need to report final success;
006: Add option to echo test failures to std-err;
007: Scope syntax for hierarchical datasets and tests associated with them -- e.g. a simulation/run/variable;
008: Clean up test configuration file;
009: Document internal namespace and how mapping from file properties (name, path, attributes) to internal namespace is controlled.