source: FCC/README @ 39

Subversion URL: http://proj.badc.rl.ac.uk/svn/exarch/FCC/README@1037
Revision 39, 4.7 KB checked in by mjuckes, 7 years ago

2 new vocabularies, minor changes to accommodate these

BASIC QC SCRIPT.
May 7th, 2013

This script scans a collection of files and tests conformance to a set of rules.

In the initial release, the rules apply to the filename and to consistency of time ranges.

Groups of files (datasets) are defined by specifying a mapping from file properties to a dataset identifier. The script is designed to accommodate multiple such mappings (e.g. to a variable-level dataset containing several files for one variable, or to a publication dataset containing thousands of files from one simulation), but so far it has only been tested with a single mapping.
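
The idea of mapping file properties to a dataset identifier can be sketched as follows. This is an illustration only, not the script's own code: the names dataset_id, group_files, VARIABLE_LEVEL and SIMULATION_LEVEL, the underscore-separated filename layout and the sample filenames are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: filename fields are separated by "_" and the last field is the
# time range. Each mapping is a tuple of the field positions that make up
# the dataset identifier.
VARIABLE_LEVEL = tuple(range(8))       # every field except the time range
SIMULATION_LEVEL = tuple(range(1, 7))  # also drops variable and frequency


def dataset_id(filename, fields):
    """Join the selected filename fields into a dataset identifier."""
    stem = filename[:-3] if filename.endswith(".nc") else filename
    parts = stem.split("_")
    return ".".join(parts[i] for i in fields)


def group_files(filenames, fields):
    """Group filenames into datasets keyed by their dataset identifier."""
    datasets = defaultdict(list)
    for name in filenames:
        datasets[dataset_id(name, fields)].append(name)
    return dict(datasets)


# Hypothetical filenames, invented for this example.
files = [
    "tas_EUR-44_GCM1_historical_r1i1p1_RCM1_v1_mon_195101-196012.nc",
    "tas_EUR-44_GCM1_historical_r1i1p1_RCM1_v1_mon_196101-197012.nc",
    "pr_EUR-44_GCM1_historical_r1i1p1_RCM1_v1_mon_195101-196012.nc",
]
by_variable = group_files(files, VARIABLE_LEVEL)
by_simulation = group_files(files, SIMULATION_LEVEL)
```

With these sample names the variable-level mapping yields two datasets (one for "tas", one for "pr"), while the simulation-level mapping places all three files in a single dataset.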

This release is at an early stage of development.

The output is a log file which contains a line including "FAIL" for each failed test. The script can be made to raise an exception on completion if any test has failed (an option to raise an exception on the first test failure is not yet included).
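
Because each failed test produces a log line containing "FAIL", a log can be summarised with a few lines of Python. The sketch below is not part of the script; the function name failed_lines and the sample log lines are invented for illustration, and the real log line layout may differ.

```python
def failed_lines(log_text):
    """Return the log lines that contain the marker "FAIL"."""
    return [line for line in log_text.splitlines() if "FAIL" in line]


# Hypothetical log content; only the "FAIL" marker is taken from the text above.
sample_log = "\n".join([
    "test-a: OK",
    "test-b: FAIL: unexpected time range",
    "test-c: OK",
])

failures = failed_lines(sample_log)
print("%d failed test(s)" % len(failures))
```

To apply this to a real log, read the file contents with open(...).read() and pass them to failed_lines.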

The present tests are based on the filename, directory path, global attributes of the netCDF file, and some variable names.

Two directories of dummy files are included: examples_good and examples_bad. The former should trigger no errors when the checker is run with the "--oic" (omit in-file checks) option (see below).

CONFIGURATION
-------------

The constants startyear, endyear and enddec in the configuration file "config_cordex.txt" will, in general, vary from experiment to experiment and need to be set by the user. With the present code, this may require a separate configuration file for each experiment. "startyear" and "endyear" are the first and last years of the data collection being submitted (these affect the valid time ranges for daily and monthly files). "enddec" is the last day of December (i.e. "31" for a true calendar, "30" for a 360-day calendar). "enddec" is needed because the software currently checks the file names before looking in the files, so it must be told which calendar is in use in order to know the valid end points for some time ranges.
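
As an illustration of why "enddec" matters, the sketch below builds the last valid daily time stamp of a collection from "endyear" and "enddec". The helper names and the YYYYMMDD stamp layout are assumptions made for this example; they are not taken from the script or from the configuration file.

```python
def last_daily_stamp(endyear, enddec):
    """Return the assumed YYYYMMDD stamp of the final day of the collection."""
    return "%04d12%02d" % (endyear, enddec)


def is_valid_final_day(stamp, endyear, enddec):
    """Check a daily end stamp against the calendar implied by enddec."""
    return stamp == last_daily_stamp(endyear, enddec)


# True calendar: December has 31 days.
print(last_daily_stamp(2005, 31))
# 360-day calendar: every month, including December, has 30 days.
print(last_daily_stamp(2005, 30))
```

A file name ending on 31 December would therefore be rejected for a 360-day calendar, which is why the calendar has to be known before the files themselves are opened.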

The variable "ncdumpCmd" in qc_utils.py should be set to the full path of the ncdump command to be used.

TIME RANGE
----------

The most complex tests are on the time range. These are handled as regular expressions, with a different set of expressions for each frequency. Named groups in the regular expressions can be captured and subjected to constraints, e.g. time ranges should start at the beginning of a decade or at the first year of the simulation. The first year can either be specified in the user configuration file or, if simulations with multiple start years are to be scanned, replaced by a slightly weaker constraint: only one start year which is not at the start of a decade is allowed in each simulation.
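
The combination of named groups and constraints can be sketched as below. This is a minimal illustration, not the expressions used by the script: the pattern MONTHLY_RANGE, the function check_monthly_range, the YYYYMM-YYYYMM layout and the decade_offset parameter (which year of the decade counts as its start) are all assumptions.

```python
import re

# Assumed layout of a monthly time range: YYYYMM-YYYYMM.
MONTHLY_RANGE = re.compile(
    r"^(?P<start_year>\d{4})(?P<start_month>0[1-9]|1[0-2])-"
    r"(?P<end_year>\d{4})(?P<end_month>0[1-9]|1[0-2])$"
)


def check_monthly_range(time_range, first_year, decade_offset=1):
    """Return True if the range parses and its start year is acceptable.

    Assumption: a decade starts in a year where year % 10 == decade_offset.
    The start year must be the start of a decade or equal to first_year.
    """
    match = MONTHLY_RANGE.match(time_range)
    if match is None:
        return False
    start_year = int(match.group("start_year"))
    starts_decade = start_year % 10 == decade_offset
    return starts_decade or start_year == first_year


print(check_monthly_range("195101-196012", first_year=1950))
print(check_monthly_range("195301-196012", first_year=1950))
```

The weaker constraint for simulations with several start years could be built on the same named groups, by collecting the captured start years per simulation and allowing at most one that is not at the start of a decade.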

USAGE
-----

run_qc.py -b ../ -d tmp -c config_cordex.txt [--ef] [-q] [--oic] [--nlp] [--append-log] [-l log-file.txt]

-b: base directory;
-d: directory (-b and -d provide redundancy); if -l is not specified, the argument for -d is used to construct a log file name;
-c: configuration file specifying tests to be run;
--ef: raise an exception on completion if any test has failed;
-q: suppress printing of script completion message to std-out;
--nlp: do not log passed tests;
--oic: omit "in-file" tests, i.e. tests that look at netCDF attributes etc.;
--append-log: append log to existing log file, if present;
-l <filename>: file to receive log messages;
--R, --repeatable: omit timing info and other housekeeping notes from the log file so that it is reproducible (useful in development);
--uc, --user-config: user configuration information, overriding settings in the configuration file.

Example:
./run_qc.py -b ./ -d examples_bad/ -c config_cordex.txt --nlp --append-log
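
When the checker is driven from another Python program, the command line can be assembled from the options listed above. The sketch below is only an illustration: build_command and run_checker are hypothetical helpers, and run_checker assumes that an exception raised under "--ef" ends the process with a non-zero exit status, as is usual for an uncaught Python exception.

```python
import subprocess


def build_command(base, directory, config, log_file=None, flags=()):
    """Assemble the run_qc.py argument list from the documented options."""
    cmd = ["./run_qc.py", "-b", base, "-d", directory, "-c", config]
    cmd.extend(flags)
    if log_file is not None:
        cmd.extend(["-l", log_file])
    return cmd


def run_checker(cmd):
    """Run the checker; return True if the process exited with status 0."""
    return subprocess.run(cmd).returncode == 0


cmd = build_command(
    "./", "examples_bad/", "config_cordex.txt",
    log_file="examples_bad_test.log",
    flags=("--nlp", "--oic", "--ef"),
)
print(" ".join(cmd))
```

Calling run_checker(cmd) from the directory containing run_qc.py would then report whether the run completed without the "--ef" exception.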

VOCABULARIES
------------

The RCMModelname vocabulary is evolving -- users should get a current copy from:
http://cordex.dmi.dk/joomla/images/CORDEX/RCMModelName.txt

TEST
----

Run:
./run_qc.py -b ./ -d examples_bad/ -c config_cordex.txt --nlp --oic --R -l examples_bad_test.log
md5sum examples_bad_test.log

The output from md5sum should be:
f93fbd3f4deee41ad35baf70724356b4  examples_bad_test.log

The "--oic" flag is needed with the "examples_bad" and "examples_good" test directories because the files they contain are not real data files; they serve only to test the filename-parsing section of the code.
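
Where md5sum is not available, the same comparison can be made with Python's standard hashlib module. This is a sketch only; the helper names md5_of_file and log_matches are invented for the example, and the expected digest is the one quoted above.

```python
import hashlib

# Digest quoted in the TEST section for examples_bad_test.log.
EXPECTED_MD5 = "f93fbd3f4deee41ad35baf70724356b4"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large log files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_matches(path, expected=EXPECTED_MD5):
    """Return True if the log file at path has the expected MD5 digest."""
    return md5_of_file(path) == expected
```

After running the checker as shown, log_matches("examples_bad_test.log") should return True if the log is identical to the reference output.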

TO DO
-----

001: Add more vocabularies;
003: Structure logging: add option for multiple log files, e.g. file level, dataset level;
004: Structure test categorisation, e.g. multiple tests associated with checking time range: only need to report final success;
006: Add option to echo test failures to std-err;
007: Scope syntax for hierarchical datasets and tests associated with them -- e.g. a simulation/run/variable;
008: Clean up test configuration file;
009: Document internal namespace and how mapping from file properties (name, path, attributes) to internal namespace is controlled.